
US-EAST-1 is humanity’s weakest link…
Fireship
708,063 views • yesterday
Video Summary
A catastrophic cloud outage on October 21st, 2025, crippled over 2500 companies reliant on Amazon Web Services (AWS). The incident, stemming from a misconfigured DNS setting in the US East 1 region, caused widespread service disruptions, impacting everything from streaming services and gaming platforms to financial apps and even Amazon's own e-commerce site. The outage highlighted the profound dependency of modern digital infrastructure on a single provider and the cascading effects of even seemingly minor technical errors. It also spurred discussions about reducing reliance on centralized cloud providers and the potential impact of AI-generated code on system stability. The outage was so severe it was likened to regressing society by 50 years, and the speaker jokes that he nearly starved because food delivery apps and even Amazon.com were inaccessible.
Short Highlights
- Over 2500 companies were affected by a catastrophic AWS cloud outage on October 21st, 2025.
- The root cause was a misconfigured DNS setting in the US East 1 region affecting API endpoints, particularly Amazon DynamoDB.
- The outage caused cascading failures, impacting services like Netflix, Reddit, PlayStation, Fortnite, and even Amazon.com.
- Services experienced increased error rates and latencies, with some issues persisting for hours due to accumulated serverless job queues.
- The incident underscores the risks of over-reliance on a single cloud provider and raises questions about AI's role in code deployment.
Key Details
The Great AWS Cluster of 2025 [00:00]
- Numerous prominent applications and services, including Netflix, Reddit, PlayStation, Roblox, Fortnite, Robinhood, Coinbase, Venmo, Snapchat, and Disney, went down simultaneously.
- This widespread failure was attributed to the most catastrophic cloud outage in history, affecting over 2500 companies.
- The common thread among these affected apps is their reliance on Amazon Web Services (AWS), the largest cloud provider globally, with an estimated 350 massive data centers.
- The speaker humorously recounts nearly starving because the McDonald's and DoorDash apps were down and even Amazon.com itself was inaccessible.
- The outage's severity was such that the speaker's AI girlfriend, also running on AWS, was unavailable, and even educational apps like Duolingo were affected.
- With even The New York Times down, the speaker quips that a single misconfigured DNS setting in US East 1 regressed society by 50 years.
- The platform X was one of the few unaffected services, leading to commentary from Elon Musk and DHH.
The entire world goes to hell.
AWS Infrastructure and the US East 1 Region [01:18]
- AWS operates hundreds of data centers globally, powering the multi-trillion dollar internet economy.
- Even seemingly simple services like sending a Snapchat message rely on AWS infrastructure for processing.
- AWS data centers are organized into geographic regions, with US East 1 in Northern Virginia being one of the oldest and most critical.
- A cloud region comprises multiple data centers grouped into at least three availability zones, each with independent power, cooling, and networking for redundancy and fault tolerance (see the sketch below).
When you receive an unsolicited dick pic on Snapchat, it's Amazon, not Snapchat, that's using the electricity to process those pixels.
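
As a rough illustration of how those building blocks are exposed to developers, here is a minimal sketch that lists a region's availability zones with boto3. This is an assumption-laden example, not something shown in the video: it assumes Python, the boto3 SDK, and configured AWS credentials, and uses us-east-1 only because that is the region under discussion.

```python
# Minimal sketch: enumerating the availability zones that make up a region.
# Assumes boto3 is installed and AWS credentials are configured.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

zones = ec2.describe_availability_zones(
    Filters=[{"Name": "state", "Values": ["available"]}]
)["AvailabilityZones"]

for zone in zones:
    # Each zone is its own set of data centers with independent power,
    # cooling, and networking.
    print(zone["ZoneName"], zone["ZoneId"])
```

Spreading workloads across these zones is what buys fault tolerance, but as the outage showed, it does not protect against a region-wide problem like the US East 1 DNS failure.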
The DNS Resolution Failure and Cascading Effects [02:08]
- At 9:07 p.m. Eastern time, AWS reported increased error rates and latencies in US East 1 services.
- The root cause was identified as a subsystem failure related to DNS resolution for API endpoints, most notably affecting Amazon DynamoDB.
- DNS acts as the internet's phonebook; a failure means applications cannot locate their necessary databases or services.
- The breakdown in DNS lookup caused applications to fail, rendering them "vaporware" as they couldn't access their backend resources.
- Although AWS resolved the core issue within a couple of hours, a massive backlog of accumulated serverless work, such as Lambda function calls and Simple Queue Service (SQS) messages, caused continued problems for hours (both failure modes are sketched below).
Because in this case it was broken, Amazon would say, "Sorry, I can't find your database at this address."
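
To make the "phonebook" analogy concrete, here is a hedged sketch of what an application experiences when the DynamoDB endpoint's DNS entry cannot be resolved. It is not AWS's actual failure reproduced, just an illustration; it assumes Python with boto3/botocore, and "example-table" is a made-up table name.

```python
# Sketch of the failure mode described above: the DNS "phonebook" lookup for the
# DynamoDB endpoint fails, so every call that depends on it fails too.
import socket

import boto3
from botocore.exceptions import EndpointConnectionError

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

# Step 1: resolve the hostname. During the outage, this is roughly the step
# that broke, so no IP address came back.
try:
    print(ENDPOINT, "->", socket.gethostbyname(ENDPOINT))
except socket.gaierror as err:
    print("DNS lookup failed:", err)

# Step 2: any SDK call built on that lookup surfaces the failure to the app.
table = boto3.resource("dynamodb", region_name="us-east-1").Table("example-table")
try:
    table.get_item(Key={"id": "123"})
except EndpointConnectionError as err:
    print("Application sees:", err)
```

The lingering pain after the fix can be pictured as queues that kept filling while their consumers were failing. A small sketch of checking such a backlog on an SQS queue (the queue URL is hypothetical):

```python
# Sketch: inspecting how much work has piled up in an SQS queue. The summary's
# point is that even after DNS was restored, backlogs like this took hours to drain.
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-jobs"  # hypothetical

attrs = sqs.get_queue_attributes(
    QueueUrl=QUEUE_URL,
    AttributeNames=[
        "ApproximateNumberOfMessages",           # waiting to be processed
        "ApproximateNumberOfMessagesNotVisible", # currently in flight
    ],
)["Attributes"]

print("Backlogged messages:", attrs["ApproximateNumberOfMessages"])
print("In-flight messages:", attrs["ApproximateNumberOfMessagesNotVisible"])
```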
Over-reliance on Centralized Cloud Providers [03:07]
- The incident serves as a stark reminder of the risks associated with concentrating so much computing power with a single company.
- The speaker mentions another cloud provider, Supabase, experiencing over 10 days of downtime in the EU West 2 region, an issue also tied to AWS capacity constraints.
- The reliance on "Big Cloud" for essential services is highlighted as a critical vulnerability.
It's not Supabase's fault though. AWS just won't give them the capacity no matter how hard they beg.
The Role of AI and Tracer [03:30]
- The exact developer responsible for the outage remains unknown, but the speaker's theory is that faulty AI-generated code may have been deployed.
- The speaker introduces Tracer, a sponsor of the video, as a solution to prevent such issues by adding a layer of planning and verification for coding agents.
- Tracer allows users to define their goals, and it creates a detailed implementation plan broken into phases, which can then be passed to coding agents for code generation.
- After code generation, Tracer scans for issues, preventing "slop" from reaching production, and is particularly beneficial for large codebases.
This has been the Code Report. Thanks for watching and I will see you in the next one.