AWS US-EAST-1 Major Outage: Complete Timeline and Technical Anatomy of a 15-Hour Global Service Disruption

🌏 Read the Chinese version of this article



Incident Overview

On October 20, 2025, Amazon Web Services (AWS), the world’s largest
cloud provider, experienced one of its most severe service disruptions
in the US-EAST-1 region (Northern Virginia). Lasting 14 hours and 32
minutes, this global catastrophe not only shocked the cloud industry but
has also become a classic case study in cloud architecture design and
crisis management.

Key Metrics

| Metric | Data |
| --- | --- |
| Outage Duration | 14 hours 32 minutes (2025-10-19 23:48 PDT → 2025-10-20 14:20 PDT) |
| Affected Services | 142 AWS services |
| Official Updates | 21 times (average every 43 minutes) |
| Impacted Users | Millions (Snapchat, Roblox, Robinhood, McDonald’s, United Airlines, etc.) |
| Estimated Loss | Billions of dollars (Catchpoint CEO estimate) |
| Geographic Impact | Global (USA, UK, Europe, etc.) |
| Incident Reports | Over 6.5 million (Downdetector, covering 1,000+ services) |

Impact Scope

Key Affected Services:
  • Social Platforms: Snapchat, Signal
  • Gaming: Roblox, Fortnite
  • FinTech: Robinhood, Coinbase, Venmo
  • Airlines: United Airlines (website & app)
  • Retail: McDonald’s app, Amazon Ring doorbell cameras
  • AI Services: ChatGPT, Perplexity
  • Banking: Lloyds, Bank of Scotland, Halifax (UK)

AWS Internal Affected Services: EC2, DynamoDB,
Lambda, S3, RDS, ECS, CloudWatch, CloudFront, IAM, STS, API Gateway,
ELB, SQS, SNS, Step Functions, Redshift, Connect, Glue, Athena, Kinesis,
and 122 more services.


Complete Incident Timeline

Phase 1:
Discovery & Initial Investigation (00:11 – 02:01)

timeline
    title AWS US-EAST-1 Major Incident Timeline (2025-10-20)
    section Discovery Phase
        23:48 (10/19) : Incident actually begins (DNS issue triggered)
        00:11 : Investigation starts - elevated error rates in US-EAST-1
        00:51 : Multiple AWS services confirmed affected
        01:26 : DynamoDB endpoint significant error rates confirmed
    section Root Cause Identification
        02:01 : Identified "potential root cause" - DynamoDB DNS resolution
        02:22 : Initial mitigation applied, early recovery signs observed
        02:24 : DynamoDB DNS issue resolved (but services not recovered)
        02:27 : Most requests succeeding, processing queued requests
    section Cascade Failure Handling
        03:03 : Global services and US-EAST-1 dependencies recovered
        03:35 : DNS fully mitigated, but EC2 launches still failing
        04:08 : Continuing to address EC2 launch errors & Lambda delays
        05:48 : New EC2 instances launching successfully in some AZs
    section True Root Cause Confirmed
        07:29 : Multiple services network connectivity issues confirmed
        08:04 : Issue identified within "EC2 internal network"
        08:43 : 🎯 True root cause confirmed - NLB health check subsystem failure
    section Recovery Phase
        09:13 : Mitigation steps applied, connectivity recovery observed
        09:38 : NLB health checks recovered (critical turning point)
        10:03 : EC2 fix validated, preparing deployment
        11:22 : EC2 launch success rate improving, network issues decreasing
        12:15 : Most services continuing recovery, Lambda improving
    section Full Recovery
        13:03 : Lambda invocations fully recovered
        13:52 : EC2 throttles reduced to pre-event levels
        14:20 : ✅ All AWS services returned to normal operations
        14:48 : Connect handling new sessions normally, backlog processing
        14:53 : Complete Post-Event Summary published

Detailed Timeline
with AWS Official Updates

23:48 PDT (Oct
19) – Incident Actually Begins

According to the final report, the incident was triggered at this
time, but AWS was not yet aware.

00:11 PDT – Problem First
Detected

“We are investigating increased error rates and latencies for
multiple AWS services in the US-EAST-1 Region.”

Analysis: AWS detected the problem 23 minutes after
it began, indicating latency in its monitoring and detection systems.

00:51 PDT – Impact Scope
Confirmed

“We can confirm increased error rates and latencies for multiple AWS
Services in the US-EAST-1 Region. This issue may also be affecting Case
Creation through the AWS Support Center or the Support API.”

Analysis: Even the support system was affected,
revealing the severity.

01:26 PDT – Surface
Symptom Identified

“We can confirm significant error rates for requests made to the
DynamoDB endpoint in the US-EAST-1 Region. This issue also affects other
AWS Services in the US-EAST-1 Region as well.”

Analysis: DynamoDB identified as the main symptom,
but root cause not yet found.

02:01
PDT – Initial Root Cause Hypothesis (Later Proven
Incorrect)

“We have identified a potential root cause for error
rates for the DynamoDB APIs in the US-EAST-1 Region. Based on our
investigation, the issue appears to be related to DNS
resolution
of the DynamoDB API endpoint in US-EAST-1. We are
working on multiple parallel paths to accelerate recovery.”

Analysis: Use of “potential” indicates uncertainty.
Later confirmed DNS was only a trigger, not the true root cause.

02:24
PDT – DNS Issue Resolved (But Disaster Continues)

“After resolving the DynamoDB DNS issue at 2:24 AM, services began
recovering but we had a subsequent impairment in the
internal subsystem of EC2 that is responsible for launching EC2
instances due to its dependency on DynamoDB.”

Critical Finding: After the DNS fix, services did not
recover; instead, a cascade failure followed. This is key evidence
for understanding the entire incident.

08:04 PDT – Problem Scope
Narrowed

“We continue to investigate the root cause for the network
connectivity issues that are impacting AWS services. We have identified
that the issue originated from within the EC2 internal
network
.”

Analysis: It took roughly six hours from the initial DNS hypothesis to
narrow the problem down to the EC2 internal network.

08:43 PDT – True Root
Cause Confirmed

“We have narrowed down the source of the network connectivity issues
that impacted AWS Services. The root cause is an underlying
internal subsystem responsible for monitoring the health of our network
load balancers.”

Major Breakthrough: The true root cause had finally been found. The
update no longer says “potential” but states definitively that “The
root cause is”.

From 02:01 (DNS hypothesis) to 08:43 (NLB health check confirmation),
AWS engineers took 6 hours and 42 minutes to find the real cause.

09:38 PDT – Critical
Turning Point

“We recovered the Network Load Balancer health checks at 9:38
AM.”

Analysis: After the NLB health checks recovered,
services began large-scale recovery. At the time, this was taken as
proof that the NLB monitoring system was the true culprit.

14:20 PDT – Full
Recovery

“By 2:20 PM, all AWS services returned to normal operations.”

Summary: From actual start (23:48) to full recovery
(14:20), total of 14 hours 32 minutes.


Root Cause Technical Anatomy

Surface Trigger vs True Root
Cause

Surface Trigger:
DNS Resolution Issue

  • Time: 2025-10-19 23:48 PDT
  • Symptom: DynamoDB service endpoint DNS resolution failure
  • Fix Time: 02:24 PDT (about 2.5 hours later)

True
Root Cause: NLB Health Monitoring System Failure

  • Time: Confirmed at 08:43 PDT, likely started earlier
  • Nature: Internal subsystem unable to correctly monitor Network Load
    Balancer health status
  • Impact Duration: 02:24 → 09:38 (7.2 hours)
  • Full Recovery Time: 09:38 → 14:20 (4.7 hours)

Why Did AWS Initially
Misdiagnose?

Reason 1: Symptom vs Root Cause Confusion

Surface Phenomenon: DynamoDB DNS resolution failure
         ↓
Actual Mechanism: NLB Health Check system failure
         ↓
         Unable to correctly monitor DynamoDB endpoint health
         ↓
         Marked healthy DynamoDB endpoints as "unhealthy"
         ↓
         DNS resolution system returns errors (no "healthy" endpoints available)
         ↓
Appears to be a "DNS problem"

Analogy:
  • Symptom: the patient has a fever (DNS error)
  • Root Cause: immune system failure (NLB Health Monitoring)

Fixing the fever (DNS) doesn’t address the immune system problem, so the
condition continues to deteriorate.
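
To make this symptom-versus-cause confusion concrete, here is a minimal Python sketch (hypothetical names and structure, not AWS’s actual internals) of a resolver that only publishes endpoints the health monitor currently judges healthy. When the monitor misjudges every endpoint, resolution returns nothing, and the failure surfaces as a “DNS problem” even though the endpoints themselves are fine.

```python
# Hypothetical illustration only -- not AWS's actual implementation.
from typing import Dict, List


class HealthMonitor:
    """Tracks which backend endpoints the monitor currently judges healthy."""

    def __init__(self, verdicts: Dict[str, bool]):
        self.verdicts = verdicts  # endpoint IP -> judged healthy?

    def healthy_endpoints(self) -> List[str]:
        return [ip for ip, ok in self.verdicts.items() if ok]


def resolve(service: str, monitor: HealthMonitor) -> List[str]:
    """DNS-style resolution that only returns endpoints judged healthy."""
    ips = monitor.healthy_endpoints()
    if not ips:
        # Every endpoint was (mis)judged unhealthy, so resolution fails --
        # from the outside this looks like a "DNS problem".
        raise LookupError(f"no healthy endpoints for {service}")
    return ips


# The backends are actually fine, but the monitor has misjudged them all.
monitor = HealthMonitor({"10.0.0.1": False, "10.0.0.2": False, "10.0.0.3": False})
try:
    resolve("dynamodb.us-east-1.example.internal", monitor)
except LookupError as err:
    print(err)  # surface symptom: DNS failure; real problem: the monitor
```

The point of the sketch is simply that the resolver’s behavior is downstream of the monitor’s verdicts, which is why fixing “DNS” alone could not end the incident.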

Reason 2: Cascade Reaction Complexity

AWS infrastructure service interdependencies are extremely
complex:

graph TD
    A[NLB Health Monitoring<br/>System] -->|Monitoring Failure| B[DynamoDB Endpoints]
    B -->|Marked as Unhealthy| C[DNS Resolution Returns Errors]
    C -->|Depends on DynamoDB| D[EC2 Instance Launch]
    D -->|Cannot Start| E[Lambda Execution Environment]
    D -->|Cannot Start| F[CloudWatch Data Collection]
    D -->|Cannot Start| G[Other 142 Services]

    A -->|Also Affects| H[Other Services'<br/>NLB Health Checks]
    H -->|Network Connectivity Lost| I[Lambda/CloudWatch/<br/>API Gateway, etc.]

    style A fill:#ff6b6b
    style H fill:#ff6b6b
    style C fill:#ffa500
    style D fill:#ffa500

Reason 3: Progressive Investigation is Standard SRE
Practice

AWS used the standard “peeling the onion” failure troubleshooting
approach:

  1. Layer 1 (Surface Symptoms): Elevated error rates →
    Need to narrow scope
  2. Layer 2 (Service Layer): DynamoDB errors → Found
    symptom focus
  3. Layer 3 (Infrastructure Layer): DNS resolution
    issue → Found trigger
  4. Layer 4 (Internal System Layer): EC2 internal
    network → Narrowed scope
  5. Layer 5 (Control Plane Layer): NLB Health
    Monitoring → True Root Cause

This method, while time-consuming, ensures no possibilities are
missed.

Timeline
Evidence: Why DNS Wasn’t the True Root Cause

| Key Timestamp | Event | Explanation |
| --- | --- | --- |
| 02:24 | DNS issue resolved | AWS fixed DynamoDB DNS resolution |
| 02:24 – 09:38 | Disaster continues | 7.2 hours; if DNS were the root cause, services should have recovered here |
| 09:38 | NLB health checks recovered | Critical turning point; services begin large-scale recovery |
| 09:38 – 14:20 | Gradual recovery | 4.7 hours; processing backlog & removing throttles |

Conclusion: Nearly 12 more hours of outage after the
DNS fix were taken as proof that DNS was only a trigger and that the NLB
Health Monitoring System was the true underlying disease.


The Fatal Role
of NLB Health Monitoring System

What is NLB Health
Monitoring System?

Definition: The Network Load Balancer (NLB) Health
Monitoring System is an AWS internal control plane
subsystem
responsible for:

  1. Monitoring NLB Backend Target Health
    • Periodically sends health check requests to backend services
    • Determines which backends are “Healthy” vs “Unhealthy”
  2. Controlling Traffic Distribution
    • Routes traffic only to “Healthy” backends
    • Automatically removes “Unhealthy” backends from load balancing
      pool
  3. Affecting Service Discovery
    • DNS resolution system depends on health check results to return IP
      addresses
    • If all backends marked “Unhealthy,” DNS resolution fails
  4. Maintaining Internal Network Connectivity
    • EC2 instance launches depend on NLB-provided internal network
      connectivity
    • Lambda execution environments need NLB-provided network routing
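
Taken together, these responsibilities form a control loop: probe each target, record a verdict, and publish only the targets currently judged healthy to traffic routing and DNS. A minimal sketch of such a loop (simplified and hypothetical; not AWS’s actual design) might look like this:

```python
# Simplified health-check control loop (illustrative; not AWS internals).
from typing import Callable, Dict, List


def health_check_cycle(
    targets: List[str],
    probe: Callable[[str], bool],         # e.g. a TCP/HTTP probe of one target
    publish: Callable[[List[str]], None]  # push the healthy set to routing/DNS
) -> Dict[str, bool]:
    """Probe every target once and publish the set judged healthy."""
    verdicts = {target: probe(target) for target in targets}
    healthy = [target for target, ok in verdicts.items() if ok]
    # Traffic distribution and DNS answers are derived from this published
    # set, so a wrong verdict here propagates to everything downstream.
    publish(healthy)
    return verdicts


# Toy usage: a probe that wrongly fails every target empties the healthy set.
targets = ["10.0.0.1", "10.0.0.2"]
health_check_cycle(targets, probe=lambda t: False, publish=print)  # prints []
```

If the probe or the judgment logic is itself broken, `publish([])` is exactly the “all endpoints unhealthy” state described above.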

Why Did
This Subsystem Trigger a Global Catastrophe?

Problem
1: The Monitor Itself Needs Monitoring (Who Watches the
Watchers?)

Normal Architecture:
  NLB Health Monitoring → Monitors → DynamoDB/EC2/Lambda

Paradox in This Incident:
  NLB Health Monitoring itself failed
         ↓
  But no "upper-level monitoring system" detected this failure
         ↓
  It continued operating, but with incorrect judgments
         ↓
  Marked all "Healthy" services as "Unhealthy"
         ↓
  Triggered global disaster

Fundamental Problem: AWS lacked a “meta-monitoring”
mechanism capable of detecting, in a timely way, anomalies in the
monitoring system itself.

Problem
2: Insufficient Control Plane vs Data Plane Isolation

Ideal Architecture:

Control Plane: Handles management, monitoring, configuration
  - On failure, should only affect "new operations" (like new EC2 launches)
  - Should NOT affect "running services" (like running EC2 instances)

Data Plane: Handles actual traffic processing
  - Should continue operating when control plane fails

Reality in AWS Incident:

NLB Health Monitoring (Control Plane) failure
         ↓
Affected running EC2 instance network connectivity (Data Plane)
         ↓
Violated "plane isolation" principle

Result: Not only could new EC2 instances not launch,
but running services also lost network connectivity.

Problem
3: Single Point of Dependency & Lack of Fault
Tolerance

Architectural Weak Points:

  1. Global Services Depend on a Single Region (US-EAST-1)
    • IAM and STS (global authentication) primary endpoints
    • DynamoDB Global Tables coordination center
  2. NLB Health Monitoring System Lacks Redundancy
    • No backup monitoring system
    • A single subsystem failure triggers a cascade disaster
  3. Over-Coupled Service Dependencies
    • Why does EC2 launch need to depend on DynamoDB?
    • Why is Lambda affected by NLB health checks?
    • Lacks a “Degraded Mode” operation mechanism

Cascade Failure Mechanism
Explained

Dual Cascade Reaction Paths

Path 1:
DNS → DynamoDB → EC2 → Upper-Layer Services

graph LR
    A[NLB Health<br/>Monitoring Failure] --> B[DynamoDB Endpoints<br/>Misjudged as Unhealthy]
    B --> C[DNS Resolution<br/>Failure]
    C --> D[EC2 Instance<br/>Launch Failure]
    D --> E[Lambda Execution<br/>Environment Cannot Create]
    D --> F[ECS Containers<br/>Cannot Start]
    D --> G[Glue Jobs<br/>Cannot Execute]

    style A fill:#ff6b6b
    style B fill:#ffa500
    style C fill:#ffa500
    style D fill:#ffa500

Affected Services: ECS, Glue, RDS, Redshift, EMR,
SageMaker (all services depending on EC2)

Path
2: NLB Health Check → Network Connectivity → Global
Services

graph LR
    A[NLB Health<br/>Monitoring Failure] --> B[NLB Health Checks<br/>Comprehensively Failed]
    B --> C[Network Connectivity<br/>Lost]
    C --> D[Lambda<br/>Invocation Failure]
    C --> E[CloudWatch<br/>Cannot Collect Metrics]
    C --> F[API Gateway<br/>Cannot Route Requests]

    style A fill:#ff6b6b
    style B fill:#ff6b6b
    style C fill:#ff0000

Affected Services: Lambda, CloudWatch, API Gateway,
Step Functions, EventBridge
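
These two paths are easiest to reason about as a dependency graph: a failed component impacts every service that transitively depends on it. The toy sketch below (edges abbreviated from the diagrams above; not an exhaustive model) computes that blast radius with a breadth-first traversal.

```python
# Toy dependency model of the two cascade paths described above.
from collections import defaultdict, deque
from typing import Dict, List, Set

# "service X depends on Y": if Y fails, X is at risk.
DEPENDS_ON: Dict[str, List[str]] = {
    "DynamoDB endpoints": ["NLB health monitoring"],
    "DNS resolution": ["DynamoDB endpoints"],
    "EC2 launch": ["DNS resolution"],
    "Lambda": ["EC2 launch", "NLB health monitoring"],
    "ECS": ["EC2 launch"],
    "CloudWatch": ["NLB health monitoring"],
    "API Gateway": ["NLB health monitoring"],
}


def blast_radius(failed: str) -> Set[str]:
    """Return every service transitively impacted when `failed` goes down."""
    dependents = defaultdict(list)  # Y -> services that depend on Y
    for service, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents[dep].append(service)

    impacted: Set[str] = set()
    queue = deque([failed])
    while queue:
        for dependent in dependents[queue.popleft()]:
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted


print(blast_radius("NLB health monitoring"))
# Every node in the toy graph is impacted, mirroring how one subsystem
# dragged down well over a hundred services.
```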

Why Did Recovery Take So
Long?

Phase 1: DNS Fix (2.5
hours)

  • Relatively fast, as it was a surface-level issue
  • But fixing it didn’t resolve the problem

Phase 2:
Cascade Failure Handling (7.2 hours)

  • Needed to identify NLB Health Monitoring failure (took 6.5
    hours)
  • Fix NLB Health Check system (took 0.7 hours)

Phase 3:
Gradual Service Recovery (5.4 hours)

  • Why not immediately recover at full speed? AWS adopted a “throttling
    strategy”

Throttling Reasons:

  1. Avoid the Thundering Herd Problem
    • If all services restarted simultaneously, it would put enormous
      pressure on the infrastructure
    • Could trigger a “secondary collapse”
  2. Ensure Stability Over Speed
    • Gradually lift throttles (EC2 launches, Lambda invocations, SQS
      polling)
    • Confirm each phase is stable before proceeding to the next
  3. Process Backlog Tasks
    • CloudTrail, EventBridge, and Connect had massive backlogs
    • Needed time to clear gradually

Conclusion: AWS chose a “Slow is Smooth, Smooth is
Fast” recovery strategy, sacrificing speed for stability.
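
One way to picture “gradually lift throttles” is a simple ramp-up controller: raise the admitted rate only while error rates stay low, and cut it sharply at the first sign of trouble. The sketch below is illustrative only; the parameter values and function names are invented, not AWS’s recovery tooling.

```python
# Illustrative throttle ramp-up controller (invented parameters, not AWS tooling).

def next_rate_limit(
    current_limit: float,
    observed_error_rate: float,
    max_limit: float,
    step_up: float = 1.25,       # raise slowly while things look stable
    step_down: float = 0.5,      # back off hard at any sign of trouble
    error_threshold: float = 0.01,
) -> float:
    """Return the request-rate cap to use for the next control interval."""
    if observed_error_rate > error_threshold:
        # Stability over speed: shed load rather than risk a secondary collapse.
        return max(current_limit * step_down, 1.0)
    return min(current_limit * step_up, max_limit)


# Example: starting from a heavily throttled 100 req/s and ramping toward 10,000.
limit = 100.0
for error_rate in [0.0, 0.0, 0.05, 0.0, 0.0, 0.0]:
    limit = next_rate_limit(limit, error_rate, max_limit=10_000.0)
    print(round(limit))  # the cap climbs on stable intervals and halves after the spike
```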


List of 142 Affected
Services

Core Infrastructure Services

  • Compute: EC2, Lambda, ECS, EKS, Fargate, Batch
  • Storage: S3, EBS, EFS, FSx
  • Database: DynamoDB, RDS, Aurora, Redshift,
    ElastiCache, DocumentDB, Neptune
  • Networking: VPC, ELB (ALB/NLB/CLB), CloudFront,
    Route 53, API Gateway, Direct Connect, VPN

Developer Tools

  • CI/CD: CodeBuild, CodePipeline, CodeDeploy
  • Monitoring: CloudWatch, X-Ray, CloudTrail
  • Management: Systems Manager, CloudFormation,
    Config, OpsWorks

Application Services

  • Messaging: SQS, SNS, EventBridge, Kinesis, MQ
  • Workflow: Step Functions, SWF, Glue
  • AI/ML: SageMaker, Bedrock, Comprehend, Rekognition,
    Polly, Transcribe, Translate

Enterprise Services

  • Authentication: IAM, Cognito, STS, IAM Identity
    Center
  • Security: Secrets Manager, WAF, Security Hub,
    GuardDuty, Firewall Manager
  • Analytics: Athena, EMR, QuickSight, DataZone, Lake
    Formation

Customer Services

  • Communications: Connect, Chime, WorkMail, Pinpoint,
    SES
  • End User: WorkSpaces, AppStream 2.0, WorkSpaces
    Thin Client

The complete list of 142 affected services demonstrates the US-EAST-1
region’s importance to AWS’s global architecture.


Post-Mortem
Analysis: AWS Final Report’s Root Cause Correction

Real-Time
Judgment vs. Complete Investigation Report

During the incident, AWS engineers went through a process of
“gradually approaching the truth.” However, in the complete Post-Event
Summary (published October 23, 2025), AWS, based on in-depth
post-incident Root Cause Analysis (RCA), repositioned the role of each
component in this incident.

Judgment
During Real-Time Updates (Incident in Progress)

02:01 PDT – AWS first proposed a hypothesis:

“We have identified a potential root cause for error rates for the
DynamoDB APIs… The issue appears to be related to DNS resolution.”

Key word: Used “potential”, indicating the cause was not yet fully
confirmed.

08:43 PDT – AWS confirmed its judgment:

“We have narrowed down the source… The root cause is an underlying
internal subsystem responsible for monitoring the health of our network
load balancers.”

Key word: No longer “potential”, but an explicit “The root cause is”.

Conclusion
in Final Report (Post-Incident Deep Investigation)

In the complete Post-Event Summary, AWS provided a different
conclusion:

“The root cause of this issue was a latent race condition in the
DynamoDB DNS management system that resulted in an incorrect empty DNS
record for the service’s regional endpoint.”

Key difference: Root cause changed from “NLB Health
Monitoring System” to “DynamoDB DNS race condition”.

Why This Difference?

Reason
1: Different Goals Between Immediate Recovery and Root Cause
Analysis

Immediate Recovery Phase (incident in progress):
  • Goal: Restore service as quickly as possible
  • Focus: “What fix will restore service?”
  • Conclusion: Fix NLB Health Monitoring → service restored → therefore
    identified as the root cause

Post-Incident Investigation Phase (Post-Event Summary):
  • Goal: Find out “why it happened”
  • Focus: “If X hadn’t occurred, would the entire incident not have happened?”
  • Conclusion: Without the DNS race condition, NLB wouldn’t have started
    misidentifying → the DNS race condition is the starting point

Reason 2: Complexity of
Cause Chain

Both conclusions have validity, but observe different layers:

Real-Time Judgment Logic (bottom-up):

Observed phenomenon: All services failing
         ↓
Found recovery key: Fix NLB Health Monitoring
         ↓
Conclusion: NLB Health Monitoring is root cause

Post-Incident Analysis Logic (top-down):

Starting point: DynamoDB DNS race condition
  ↓
Triggered: DynamoDB endpoint unresolvable
  ↓
Caused: EC2 instance launch failures (dependent on DynamoDB)
  ↓
Caused: Network configuration propagation delays
  ↓
Caused: NLB health checks started misidentifying
  ↓
Amplified: Repeated remove/restore cycles
  ↓
Result: Service unstable for 14.5 hours

Reason 3:
Different Definitions of “Root Cause”

In incident analysis, “root cause” has two common definitions:

Definition A: Critical Path
  • “Removing which problem restores service fastest?”
  • Real-time judgment used this definition → NLB Health Monitoring

Definition B: First Cause
  • “If we could go back in time, preventing which problem would prevent
    the entire incident?”
  • Post-incident analysis used this definition → DynamoDB DNS race condition

Repositioning Component
Roles

Based on AWS’s final report, we need to re-understand each
component’s role:

| Component | Role in Real-Time Judgment | Role in Final Report | Technical Position |
| --- | --- | --- | --- |
| DynamoDB DNS race condition | Surface trigger | Root cause | First Cause |
| NLB Health Monitoring System | Root cause | Secondary impact and amplifier | Critical Path |
| EC2 instance launch failures | Secondary impact | Direct impact | Intermediate Effect |
| 142 service outages | Final result | Final result | Final Impact |

Complete Correct Cause Chain

According to AWS’s final report, the complete technical cause chain
should be:

【Starting Point】DynamoDB DNS race condition
        ↓
      Two independent DNS Enactors interacted with race condition
        ↓
      One Enactor applied old DNS plan
      Another Enactor simultaneously deleted the plan
        ↓
      DynamoDB regional endpoint DNS record became empty
        ↓
【Direct Impact】DynamoDB endpoint unresolvable
        ↓
      EC2 instance launch dependent on DynamoDB → Launch failures
        ↓
      Newly launched EC2 instances' network configuration propagation delayed
        ↓
【Secondary Impact】NLB health check subsystem affected
        ↓
      NLB performed health checks on network-delayed EC2 instances
        ↓
      Health checks failed (but instances were actually healthy)
        ↓
【Amplification Mechanism】Repeated remove/restore cycles
        ↓
      NLB removed "unhealthy" instances (deleted from DNS)
        ↓
      Next check found instances ready → Check succeeded
        ↓
      NLB restored instances (added back to DNS)
        ↓
      Continuously repeated this cycle, increasing system load
        ↓
      Health check system itself degraded performance
        ↓
      Triggered automatic AZ DNS failover
        ↓
【Final Result】Service unstable for 14.5 hours
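
The race at the top of this chain can be reproduced in miniature. In the deliberately simplified sketch below (hypothetical data structures, not the actual DynamoDB DNS automation), an applier working from a stale plan and a cleaner deleting that same plan interleave badly, and the record set ends up empty:

```python
# Deterministic miniature of the described race condition (illustrative only).
dns_records = {"dynamodb.us-east-1.example": ["10.0.0.1", "10.0.0.2"]}

old_plan = {"dynamodb.us-east-1.example": ["10.0.0.1", "10.0.0.2"]}  # stale plan
new_plan = {"dynamodb.us-east-1.example": ["10.0.1.1", "10.0.1.2"]}  # current plan


def enactor_apply(plan):
    """Enactor A: (re)applies whatever plan it was handed."""
    for name, ips in plan.items():
        dns_records[name] = list(ips)


def enactor_cleanup(plan):
    """Enactor B: removes addresses belonging to a plan it considers obsolete."""
    for name, old_ips in plan.items():
        dns_records[name] = [ip for ip in dns_records[name] if ip not in old_ips]


# Intended order: apply the new plan, then clean up the old one.
# Racy interleaving: a delayed Enactor re-applies the OLD plan after the new
# one is live, and the cleanup of the old plan then deletes exactly the
# addresses that were just written back.
enactor_apply(new_plan)
enactor_apply(old_plan)    # late, stale apply
enactor_cleanup(old_plan)  # cleanup removes the stale addresses

print(dns_records)  # {'dynamodb.us-east-1.example': []}  -> empty DNS record
```

Without the late stale apply, the cleanup would have left the new addresses untouched, which is why this is a race rather than a simple bug.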

Reinterpreting Time Evidence

Let’s re-examine key time points from the “complete cause chain”
perspective:

| Time (PDT) | Event | Original Interpretation | Correct Interpretation |
| --- | --- | --- | --- |
| 23:48 | Incident began | DNS issue triggered | ✅ DNS race condition occurred (root cause) |
| 02:24 | DNS issue resolved | Surface fix | ✅ Fixed the starting problem, but the chain reaction had already been initiated |
| 02:24 – 09:38 | Disaster continued (7.2 hours) | Proves DNS isn’t the root cause | Wrong inference: this is when the secondary impacts and amplification mechanisms were active |
| 09:38 | NLB health checks restored | True root cause fixed | ✅ Critical path fixed, but not the root cause |
| 14:20 | Fully recovered | All problems resolved | ✅ Backlog processing completed |

Key Insights:
  • The fact that service was not restored after the DNS fix does not prove
    that “DNS isn’t the root cause”
  • Rather, because the chain reaction had already started, the secondary
    problems it set in motion (NLB misidentification) still required separate
    handling even after the starting cause was removed
  • Analogy: a forest fire started by lightning; eliminating the lightning
    cannot extinguish the fire that is already burning

AWS Engineers’ Diagnostic
Evolution

Understanding this evolution process is very valuable for learning
incident investigation:

Phase 1 (00:11 – 02:01): Discovered surface symptoms
  • Observation: DynamoDB error rates increasing
  • Hypothesis: Possibly a DNS issue
  • Status: Symptom-level analysis

Phase 2 (02:01 – 02:24): Attempted to fix the hypothesized cause
  • Action: Fixed the DNS issue
  • Expectation: Service should recover
  • Result: ❌ Service did not recover

Phase 3 (02:24 – 08:43): Reinvestigated
  • Discovery: The problem was more complex than imagined
  • Deeper finding: EC2 internal network issues
  • Progress: Gradually narrowed the scope

Phase 4 (08:43): Found the critical path
  • Breakthrough: NLB Health Monitoring System failure
  • Judgment: This is the root cause (real-time judgment)
  • Reason: Fixing it restored service

Phase 5 (post-incident, 10/23): Complete RCA
  • Traced back: Why did NLB start misidentifying?
  • Discovery: The DNS race condition was the starting point
  • Correction: The DynamoDB DNS race condition is the root cause
    (final conclusion)

This Article’s Documentary
Value

Important Note: This article preserves the evolution
of AWS’s real-time judgment, which is very valuable learning
material.

Why Preserve the Real-Time Judgment?
  1. Authenticity: Reflects the true situation during the incident
  2. Educational Value: Demonstrates the complexity of investigating
     large-scale distributed system failures
  3. SRE Practice: Explains why immediate-recovery and post-incident RCA
     conclusions differ
  4. Historical Record: The evolution of AWS’s update content is itself
     important data

How to Read This Article Correctly:
  • “Root Cause Technical Anatomy” section: records the judgment at the time
    (real-time analysis)
  • “Post-Mortem Analysis” section (this section): supplements the final
    report’s conclusion (complete RCA)
  • Combined: a complete understanding from immediate recovery to deep analysis

Key Learning Points

Learning
Point 1: Immediate Recovery ≠ Root Cause Analysis

Immediate Recovery:
  • Goal: Restore service quickly
  • Method: Find the Critical Path
  • Problem: May misjudge the starting cause

Root Cause Analysis:
  • Goal: Prevent recurrence
  • Method: Find the First Cause
  • Requires: Calm, deep post-incident investigation

Learning Point 2:
Inertia of Chain Reactions

Chain reactions in complex systems have “inertia”:
  • Even after the starting cause is removed, the chain reaction may not stop
    automatically
  • Intervention is required at each critical node of the chain reaction
  • This is also why service was not restored after the DNS fix

Learning
Point 3: Both Types of “Root Cause” Are Important

From a practical perspective:
  • First Cause (DynamoDB DNS race condition): needs fixing to prevent
    recurrence
  • Critical Path (NLB Health Monitoring): needs strengthening to shorten
    recovery time

Both require improvement measures!

References

  • AWS Official Post-Event Summary:
    https://aws.amazon.com/message/101925/ (2025-10-23)
  • This article’s “Root Cause Technical Anatomy” section: records the
    evolution of the real-time judgment during the incident
  • Data Verification Date: 2025-10-24

Classic Scenario: When
the Watcher Fails

Philosophical
Question: Who Watches the Watchers?

The Latin proverb “Quis custodiet ipsos custodes?” found real-world
validation in this incident:

Traditional Monitoring Architecture:

Monitoring System → Monitors → Business System

Problem: When the monitoring system itself fails,
who discovers it?

Manifestation in AWS Incident:

NLB Health Monitoring System
  ↓ (Failed but undiscovered)
Continues operating, but with incorrect judgments
  ↓
Marks healthy services as abnormal
  ↓
Triggers global disaster
  ↓
AWS engineers spend 6.5 hours discovering the monitoring system itself has issues

Solution: Meta-Monitoring
Architecture

Improved Architecture:

Meta-Monitoring Layer (monitors the monitoring system)
      ↓
Monitoring System Layer (monitors business systems)
      ↓
Business System Layer (actual services)

Implementation:

  1. Independent Meta-Monitoring System

    Monitoring Target: Is NLB Health Monitoring System behavior normal?
    Detection Metrics:
      - Health Check request success rate
      - Reasonableness of judgment results (sudden mass marking as unhealthy)
      - System response time
  2. Anomaly Detection

    # Pseudocode example
    if (unhealthy_targets / total_targets) > 0.5:
        # Over 50% marked unhealthy is highly unreasonable
        alert("NLB Health Monitoring System may be failing")
        trigger_failover_to_backup_monitoring_system()
  3. Redundant Monitoring Systems

    • Primary monitoring system + backup monitoring system
    • Both use different technology stacks (avoid common mode
      failure)
    • Auto-switch when primary system abnormal
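
Item 3 above (redundant monitoring with auto-switch) could be wired to the plausibility check from item 2 roughly as follows; the class and function names are invented for illustration and do not describe any real AWS mechanism.

```python
# Hypothetical meta-monitoring failover sketch (invented names, not AWS internals).
from typing import Dict, Protocol


class Monitor(Protocol):
    def verdicts(self) -> Dict[str, bool]:
        """Return each target's current health verdict (target -> healthy?)."""
        ...


def plausible(verdicts: Dict[str, bool], max_unhealthy_ratio: float = 0.5) -> bool:
    """Meta-check: a monitor declaring most of the fleet unhealthy is suspect."""
    if not verdicts:
        return False
    unhealthy = sum(1 for ok in verdicts.values() if not ok)
    return unhealthy / len(verdicts) <= max_unhealthy_ratio


def effective_verdicts(primary: Monitor, backup: Monitor) -> Dict[str, bool]:
    """Trust the primary monitor unless its output fails the plausibility check."""
    primary_view = primary.verdicts()
    if plausible(primary_view):
        return primary_view
    # The watcher itself looks broken: fall back to the independently built
    # backup monitor (ideally running on a different technology stack).
    return backup.verdicts()
```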

Similar Historical Cases

Google Cloud
2019-06-02 Incident

  • Root Cause: Network configuration error causing
    traffic routing failure
  • Similarity: Control plane issues affecting data
    plane
  • Duration: 4.5 hours
  • Lesson: Google subsequently implemented “gradual
    rollout of configuration changes” mechanism

Azure 2018-09-04
Incident

  • Root Cause: DNS configuration error
  • Similarity: DNS issues triggering cascade
    failures
  • Duration: 24 hours
  • Lesson: Microsoft strengthened DNS system
    redundancy

Cloudflare 2019-07-02
Incident

  • Root Cause: WAF rule deployment error
  • Similarity: Monitoring/protection system became the
    attacker
  • Duration: 3.5 hours
  • Lesson: Cloudflare implemented stricter
    configuration validation

The AWS Incident’s Uniqueness:
  • Longest duration (nearly 15 hours)
  • Most affected services (142)
  • Deepest root cause (the monitoring system itself failed)
  • Most conservative recovery strategy (extensive use of throttling)


Key Findings & Technical
Insights

Finding 1:
US-EAST-1’s “Special Status” is Systemic Risk

Why is US-EAST-1 So Critical?

  1. Historical Reasons
    • AWS’s first region launched in 2006
    • Bears massive legacy architecture and technical debt
  2. Global Service Dependencies
    • IAM, STS global authentication primary endpoints
    • DynamoDB Global Tables coordination center
    • CloudFront configuration management center
  3. Customer Inertia
    • Earliest customers concentrated here
    • Migration costs extremely high (data, network latency,
      compliance)

Risk Analysis:

US-EAST-1 failure → Global services affected
                  → Other regions cannot operate fully independently
                  → Violates "regional independence" design principle

Finding
2: Control Plane & Data Plane Coupling is Fatal Flaw

Theory vs Reality:

| Design Principle | AWS Promise | Reality in This Incident |
| --- | --- | --- |
| Control plane failure doesn’t affect the data plane | ✅ Claims isolation | ❌ NLB health check failure affected running services |
| Regional independence | ✅ Each region operates independently | ❌ Other regions depend on US-EAST-1 global services |
| Degraded operation capability | ✅ Core services should be degradable | ❌ EC2 could not launch; no degradation mechanism |

Improvement Directions:
  • EC2 should have a “no-DynamoDB launch mode” (degraded but usable); a
    hedged sketch of this idea follows below
  • Global services should truly implement multi-region active-standby
  • Monitoring system failure shouldn’t affect established network connections
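
As a hedged sketch of the first direction, a “degraded but usable” launch path could try its control-plane dependency first and fall back to a cached, last-known-good configuration when that dependency is unavailable. All names and the cache path below are invented for illustration.

```python
# Illustrative degraded-mode fallback; not AWS's actual EC2 launch path.
import json
from pathlib import Path

CACHE_PATH = Path("/var/cache/launch-config.json")  # hypothetical local cache


def fetch_launch_config(instance_type: str) -> dict:
    """Stand-in for the control-plane call; here it simulates the outage."""
    raise TimeoutError("control-plane dependency (DynamoDB) unavailable")


def cached_launch_config(instance_type: str) -> dict:
    """Last-known-good configuration: stale, degraded, but launchable."""
    if CACHE_PATH.exists():
        return json.loads(CACHE_PATH.read_text())[instance_type]
    return {"instance_type": instance_type, "placement": "default", "stale": True}


def get_launch_config(instance_type: str) -> dict:
    """Prefer fresh control-plane data, but degrade instead of failing outright."""
    try:
        return fetch_launch_config(instance_type)
    except (TimeoutError, ConnectionError):
        return cached_launch_config(instance_type)


print(get_launch_config("m5.large"))  # falls back to the cached/default config
```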

Finding
3: Progressive Root Cause Analysis is Double-Edged Sword

Advantages:
  • Avoids hasty conclusions leading to wrong decisions
  • Layer-by-layer investigation ensures no issues are missed

Disadvantages:
  • Too time-consuming (6.5 hours to find the true root cause)
  • The initial misjudgment (DNS) may mislead customers

Best Practice:

Parallel Investigation + Transparent Communication

Simultaneously:
  1. Surface symptom mitigation (DNS fix)
  2. Deep root cause investigation (NLB Health Monitoring)
  3. External communication clearly distinguishes "symptom" vs "root cause"

Finding
4: Throttling Strategy Reflects Mature SRE Culture

AWS chose to sacrifice recovery speed for stability, demonstrating
“reliability over speed” core values:

Throttling Implementation:
  • EC2 instance launch throttling
  • Lambda invocation throttling
  • SQS polling rate limiting
  • Asynchronous operations throttling

Philosophy: Slow is Smooth, Smooth is Fast
  • Rapid recovery might trigger a secondary collapse (Thundering Herd)
  • Gradual recovery ensures mistakes are not repeated

Results:
  • No “secondary collapse”
  • Recovery process stable and controlled
  • But the total duration extended to nearly 15 hours


Series Article Preview

This article is the first in the AWS US-EAST-1 Major Incident
Series Analysis
, providing a complete record of the incident’s
timeline and technical root causes.

Want to know:
  • How did AWS handle this crisis? What communication strategies are worth
    learning?
  • How can enterprises learn from this incident to design fault-tolerant
    architectures and multi-cloud strategies?
  • What immediate actions should technology decision-makers take?
  • How can SRE teams replicate AWS’s excellent practices while avoiding
    similar mistakes?

Read the second article in the series: AWS
Architecture Resilience Deep Dive: Multi-AZ/Multi-Region Design
Principles from US-EAST-1 Outage


References

  1. AWS Official Sources
    • AWS Post-Event Summary: https://aws.amazon.com/message/101925/
      (2025-10-23)
  2. Media Coverage
    • CNN Business: “AWS global outage, Amazon, Snapchat, Roblox and
      Fortnite down” (2025-10-20)
    • NPR: “Outage at Amazon Web Services disrupts websites across the
      internet” (2025-10-20)
    • TechRadar: “Amazon fixes huge AWS outage that broke much of the
      internet” (2025-10-20)
    • Bloomberg: “AWS Outage: Amazon Cloud Restored; Hit Snapchat, Roblox,
      Robinhood” (2025-10-20)
  3. Technical Community Analysis
    • DEV Community: “The AWS Outage on Oct 20, 2025: What Broke, Why It
      Felt Global, and How Amazon Stabilized It”
    • DEV Community: “The Great AWS Outage of October 2025: When the
      Internet’s Backbone Buckled”
    • GeekWire: “AWS outage was not due to a cyberattack — but shows
      potential for ‘far worse’ damage”
  4. Expert Commentary
    • Catchpoint CEO: Estimated total loss in billions of dollars
    • Betsy Cooper (Cybersecurity Expert): “This highlights the fragility
      of depending on a few companies”

📌 About This Series

This is a classic case in cloud service history worth deep study by
all technology decision-makers, architects, and SRE engineers. We’ll
provide comprehensive analysis of the incident and practical insights
through two series articles.

Article 1 (this article): Complete incident record and technical anatomy
Article 2: AWS Architecture Resilience Deep Dive: Multi-AZ/Multi-Region
Design Principles from the US-EAST-1 Outage
