History's Major Downtimes: Lessons from the Biggest Outages

evrimsel@lemmy.world · 1 month ago

somebodysomewhere@lemmy.world · 1 month ago

The reason for that is isolation and redundancy, though. Most incidents/outages are the result of a change, and in the cases you mentioned they are mitigated by the fact that not all instances receive updates at the same time. Presumably, the error is noticed in one place and traffic is then served by the healthy instances.

By all accounts these are practices that significant service providers follow. In fact, as far as I know, AWS stages its updates region by region, rolling out to us-east-1 before the other regions so that it acts as a canary and warns of issues early.
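To make the staged-rollout idea concrete, here is a minimal sketch in Python. It is not how AWS or anyone else actually does it; `staged_rollout` and the `deploy`, `is_healthy` and `rollback` callbacks are hypothetical names I made up for illustration.

```python
from typing import Callable, List, Optional, Tuple


def staged_rollout(
    instances: List[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> Tuple[List[str], Optional[str]]:
    """Update instances one at a time (hypothetical helper).

    Stops at the first instance that fails its health check, rolls that
    instance back, and returns (successfully updated, failed instance or None).
    """
    updated: List[str] = []
    for instance in instances:
        deploy(instance)
        if not is_healthy(instance):
            # Undo the change here; later instances never received it.
            rollback(instance)
            return updated, instance
        updated.append(instance)
    return updated, None


# Usage: the first instance acts as the canary for a broken release.
versions = {"canary": "v1", "eu": "v1", "us": "v1"}
updated, failed = staged_rollout(
    ["canary", "eu", "us"],
    deploy=lambda name: versions.__setitem__(name, "v2"),
    is_healthy=lambda name: False,  # assume v2 is broken everywhere
    rollback=lambda name: versions.__setitem__(name, "v1"),
)
```

In this run the bad release only ever touches the canary, so the other instances keep serving the old version the whole time.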

With federated services, this is less of a conscious decision and tends to happen only because instance maintainers update on different schedules.

Blue-green deployments and failover are common mitigation strategies, and mature organizations actively employ them. In the fediverse and other distributed systems such as CDNs, by contrast, these patterns come built in as a consequence of the decentralized design.
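For anyone unfamiliar with blue-green, a rough sketch of the idea in Python follows. `BlueGreenRouter` and its methods are hypothetical names for illustration, not any real load balancer's API, and the health check is whatever function you pass in.

```python
from typing import Callable


class BlueGreenRouter:
    """Toy router with two environments; only one is live at a time."""

    def __init__(self, blue: str, green: str) -> None:
        self.envs = {"blue": blue, "green": green}
        self.live = "blue"

    def idle(self) -> str:
        """Name of the environment that is not serving traffic."""
        return "green" if self.live == "blue" else "blue"

    def switch(self, is_healthy: Callable[[str], bool]) -> bool:
        """Promote the idle environment only if it passes the health check."""
        candidate = self.idle()
        if is_healthy(self.envs[candidate]):
            self.live = candidate
            return True
        # Unhealthy candidate: traffic stays on the current live environment.
        return False

    def route(self) -> str:
        """Return the backend currently serving traffic."""
        return self.envs[self.live]


router = BlueGreenRouter(blue="app-v1", green="app-v2")
promoted = router.switch(lambda backend: backend == "app-v2")
```

The point is that the old environment stays around untouched, so a failed check (or a later rollback) is just a matter of not flipping, or flipping back.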