They finally posted the full details

Hussein Nasser

Video Summary

Amazon experienced a massive outage on October 19th, 2025 that lasted over 24 hours, primarily affecting the US East 1 region. The root cause was a race condition in a DNS automation tool, which left the A records for critical services like DynamoDB null. The failure cascaded: network load balancers became overloaded as health checks failed and EC2 instances were repeatedly removed from and re-added to their pools. Notably, it was the automation's cleanup process, which deletes stale DNS plans, that inadvertently removed the essential records.

The prolonged outage stemmed from a complex interplay of factors, beginning with a critical DNS automation error that left essential services unreachable. This domino effect led to overloaded network load balancers, which then struggled with health checks for newly provisioned EC2 instances. The situation was exacerbated by a cleanup process that removed valid, albeit older, DNS records. Ultimately, throttling APIs was the primary method to restore functionality, highlighting the critical importance of robust automation and locking mechanisms in large-scale systems.

Short Highlights

  • The October 19th, 2025 AWS US East 1 outage lasted over a day, with DNS being the primary failure point.
  • The outage began when A records for DynamoDB US East 1 became null due to a race condition in a DNS automation tool.
  • A subsequent surge of new EC2 instances, attempting to connect to DynamoDB, overloaded the network load balancer, causing its health checks to fail.
  • The automation's cleanup process deleted stale DNS plans, including the crucial DynamoDB record, resulting in null A records and an inability to resolve IPs.
  • Resolution relied on throttling APIs and took over 24 hours, underscoring the need for robust locking mechanisms in automation.

Key Details

Amazon US East 1 Outage: Initial Failures [00:00]

  • The outage on October 19th, 2025, stemmed from a DNS issue that persisted for over a day.
  • Unlike previous outages where authoritative name servers were down, this event was attributed to a flaw in a DNS automation tool.
  • At 11:48 p.m., attempts to create new EC2 instances, and calls to other services that depend on DynamoDB, began to fail.
  • DNS resolution failed because the A record for DynamoDB had become null, meaning no IP addresses were returned (a resolution check is sketched below).

"The main reason at least, but we didn't know what caused the DNS to go down."

Cascading Failures and Network Load Balancer Overload [03:25]

  • The initial DNS failure led to a "thundering herd" of retries and new requests (a retry-backoff sketch follows below).
  • This surge overloaded the network load balancer (NLB) as it attempted to manage newly created EC2 instances.
  • The NLB's health check process became overwhelmed, causing it to fail and subsequently remove EC2 instances from its pool.
  • This led to a continuous cycle of instances being removed and added back, further stressing the NLB.

"The network load balancer itself to get overloaded and the health check itself the process of health check to get overloaded with caused the health check itself to fail."

The Root Cause: DNS Automation Race Condition [05:33]

  • The core issue was a DNS automation process that updated A records for services like DynamoDB.
  • A DNS planner generated lists of IP addresses for network load balancers, and a DNS enactor updated the DNS entries based on these plans.
  • The enactor had a rule: only overwrite an old plan with a newer one.
  • A race condition occurred when two DNS enactors, working from different plan IDs, attempted to update the same record at nearly the same time (sketched below).

"This is a classic database problem, guys."

The Race Condition Unfolds [09:22]

  • One DNS enactor picked up plan 101 and began updating records, including the DynamoDB entry.
  • It checked the record, saw the plan in place was older (100), and intended to write 101, but then stalled.
  • Meanwhile, a second enactor picked up a newer plan, 102, and successfully updated many records, including DynamoDB, with 102.
  • When the first enactor finally woke up, it wrote its plan (101), overwriting the newer 102 (the interleaving is replayed below).

"The old guy, the DNS enactor, remember it read the 10 100 and then got frozen and then everything got updated with 102."

The "Cleaner" Process and Null Records [11:35]

  • The second enactor, which had successfully written plan 102, ran cleanup logic to delete stale plans (anything less than 102).
  • This cleaner process erroneously deleted the DynamoDB record that had been set to 101 by the first enactor.
  • This deletion process is how the A record for DynamoDB US East 1 became null, rendering the service unreachable.
  • The issue could have been prevented with a simple lock or conditional write on the record being updated (a sketch of such a guard follows below).

"The cleaner kicks in and says, 'Okay, let's all let's all delete anything less than 102.'"

EC2 Instance Creation and Network Load Balancer Re-overload [13:37]

  • Once the DynamoDB DNS issue was partially resolved, EC2 instances began to be created.
  • However, the process of assigning IPs and updating network load balancer configurations for these new instances heavily loaded the network manager.
  • This overload on the network manager led to the third major failure: network load balancer errors.
  • The NLB became busy assigning IPs, syncing configurations, and performing health checks for millions of EC2 instances.

"The network manager that's a responsible for network manager and that instance got overloaded which led to the third major failure network load balancer errors my friends right."

Health Check Failures and "Panic Mode" [17:09]

  • The overloaded network load balancers experienced slow DynamoDB queries because they were busy with configuration and health checks.
  • Health checks began to fail because the system was trying to check EC2 instances that were not yet fully networked or reachable.
  • The health check system itself became overloaded by the sheer volume of checks required for the new instances.
  • A feature like "panic mode" in systems like Envoy could prevent the NLB from removing instances en masse during such a crisis (sketched below).

"The network load balancer removes all of these bad unhealthy networks."

Resolution and Lessons Learned [19:53]

  • The cascading issues were primarily resolved by throttling APIs, a process that took nearly a day (a token-bucket throttle sketch follows below).
  • Kudos were given to Amazon engineers for their efforts in resolving a complex, multi-faceted outage.
  • The video emphasizes that while preventing outages is nearly impossible, being prepared for them is crucial.
  • Key takeaways include the importance of robust automation, proper locking mechanisms to prevent race conditions, and the need for visibility and readiness.

"What you can do is be ready when they happen."
