AWS Incident from 2017

LePecheOriginel May 3, 2025

I read this interesting note from AWS about their S3 outage some time ago. They learnt multiple lessons:

1) Limit what operational tools can do and soften their impact - I think they already had ample abstraction in how they manage their services.

2) Partition the software to have a quicker recovery time - That is interesting: one of their findings is that the services took a long time to recover. Partitioning would help to reduce the blast radius (I suppose the operational tool makes that distinction), and smaller services are easier to restart.
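
To make lesson 1 a bit more concrete, here is a minimal sketch of what "limiting an operational tool" could look like. It is not AWS's actual tooling: `Subsystem`, `remove_capacity`, `min_required` and `max_per_step` are names I made up for illustration. The idea is only that the tool refuses a request that is too large for one step, or that would leave a subsystem below the capacity it needs to keep serving.

```python
from dataclasses import dataclass


class CapacityGuardError(Exception):
    """Raised when a removal request would violate a safety limit."""


@dataclass
class Subsystem:
    # All fields are hypothetical, chosen for illustration only.
    name: str
    active_servers: int
    min_required: int  # floor below which the subsystem cannot serve traffic


def remove_capacity(subsystem: Subsystem, requested: int, max_per_step: int = 2) -> int:
    """Remove servers from a subsystem, subject to two guardrails.

    Returns the number of servers still active after the removal.
    """
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")

    # Guardrail 1: soften the impact by capping how much one command can remove.
    if requested > max_per_step:
        raise CapacityGuardError(
            f"refusing to remove {requested} servers at once (limit {max_per_step})"
        )

    # Guardrail 2: never drop below the minimum the subsystem needs.
    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise CapacityGuardError(
            f"{subsystem.name} would drop to {remaining}, below minimum {subsystem.min_required}"
        )

    subsystem.active_servers = remaining
    return remaining


if __name__ == "__main__":
    index = Subsystem(name="index", active_servers=10, min_required=8)
    print(remove_capacity(index, 2))  # allowed: 10 -> 8
    try:
        remove_capacity(index, 1)  # rejected: would go below the floor of 8
    except CapacityGuardError as err:
        print("blocked:", err)
```

With this shape, a mistyped argument turns into a rejected command instead of an outage, which is how I read the "soften their impact" part.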

So, they are thinking about why something happened, how they can ensure it doesn't happen again, and how they would be able to recover more quickly from a similar event in the future. That is a comprehensive framework for how we should approach similar outages.