Incident Overview
On October 20, 2025, Amazon Web Services (AWS), the world’s largest
cloud provider, experienced one of its most severe service disruptions
in the US-EAST-1 region (Northern Virginia). This 14-hour, 32-minute global outage
not only shocked the cloud industry but has also become a classic case study in
cloud architecture design and crisis management.
Key Metrics
| Metric | Data |
|---|---|
| Outage Duration | 14 hours 32 minutes (2025-10-19 23:48 PDT → 2025-10-20 14:20 PDT) |
| Affected Services | 142 AWS services |
| Official Updates | 21 times (average every 43 minutes) |
| Impacted Users | Millions (Snapchat, Roblox, Robinhood, McDonald’s, United Airlines, etc.) |
| Estimated Loss | Billions of dollars (Catchpoint CEO estimate) |
| Geographic Impact | Global (USA, UK, Europe, etc.) |
| Incident Reports | Over 6.5 million (Downdetector, covering 1000+ services) |
Impact Scope
Key Affected Services:
- Social Platforms: Snapchat, Signal
- Gaming: Roblox, Fortnite
- FinTech: Robinhood, Coinbase, Venmo
- Airlines: United Airlines (website & app)
- Retail: McDonald's App, Amazon Ring doorbell cameras
- AI Services: ChatGPT, Perplexity
- Banking: Lloyds, Bank of Scotland, Halifax (UK)
AWS Internal Affected Services: EC2, DynamoDB,
Lambda, S3, RDS, ECS, CloudWatch, CloudFront, IAM, STS, API Gateway,
ELB, SQS, SNS, Step Functions, Redshift, Connect, Glue, Athena, Kinesis,
and 122 more services.
Complete Incident Timeline
Phase 1: Discovery & Initial Investigation (00:11 – 02:01)
```mermaid
timeline
    title AWS US-EAST-1 Major Incident Timeline (2025-10-20)
    section Discovery Phase
        23:48 (10/19) : Incident actually begins (DNS issue triggered)
        00:11 : Investigation starts - elevated error rates in US-EAST-1
        00:51 : Multiple AWS services confirmed affected
        01:26 : DynamoDB endpoint significant error rates confirmed
    section Root Cause Identification
        02:01 : Identified "potential root cause" - DynamoDB DNS resolution
        02:22 : Initial mitigation applied, early recovery signs observed
        02:24 : DynamoDB DNS issue resolved (but services not recovered)
        02:27 : Most requests succeeding, processing queued requests
    section Cascade Failure Handling
        03:03 : Global services and US-EAST-1 dependencies recovered
        03:35 : DNS fully mitigated, but EC2 launches still failing
        04:08 : Continuing to address EC2 launch errors & Lambda delays
        05:48 : New EC2 instances launching successfully in some AZs
    section True Root Cause Confirmed
        07:29 : Multiple services network connectivity issues confirmed
        08:04 : Issue identified within "EC2 internal network"
        08:43 : 🎯 True root cause confirmed - NLB health check subsystem failure
    section Recovery Phase
        09:13 : Mitigation steps applied, connectivity recovery observed
        09:38 : NLB health checks recovered (critical turning point)
        10:03 : EC2 fix validated, preparing deployment
        11:22 : EC2 launch success rate improving, network issues decreasing
        12:15 : Most services continuing recovery, Lambda improving
    section Full Recovery
        13:03 : Lambda invocations fully recovered
        13:52 : EC2 throttles reduced to pre-event levels
        14:20 : ✅ All AWS services returned to normal operations
        14:48 : Connect handling new sessions normally, backlog processing
        14:53 : Complete Post-Event Summary published
```
Detailed Timeline with AWS Official Updates
23:48 PDT (Oct 19) – Incident Actually Begins
According to the final report, the incident was triggered at this
time, but AWS was not yet aware.
00:11 PDT – Problem First Detected
“We are investigating increased error rates and latencies for
multiple AWS services in the US-EAST-1 Region.”
Analysis: AWS detected the problem 23 minutes after it began, indicating monitoring-system detection latency.
00:51 PDT – Impact Scope Confirmed
“We can confirm increased error rates and latencies for multiple AWS
Services in the US-EAST-1 Region. This issue may also be affecting Case
Creation through the AWS Support Center or the Support API.”
Analysis: Even the support system was affected,
revealing the severity.
01:26 PDT – Surface Symptom Identified
“We can confirm significant error rates for requests made to the
DynamoDB endpoint in the US-EAST-1 Region. This issue also affects other
AWS Services in the US-EAST-1 Region as well.”
Analysis: DynamoDB identified as the main symptom,
but root cause not yet found.
02:01 PDT – Initial Root Cause Hypothesis (Later Proven Incorrect)
“We have identified a potential root cause for error
rates for the DynamoDB APIs in the US-EAST-1 Region. Based on our
investigation, the issue appears to be related to DNS
resolution of the DynamoDB API endpoint in US-EAST-1. We are
working on multiple parallel paths to accelerate recovery.”
Analysis: Use of “potential” indicates uncertainty.
Later confirmed DNS was only a trigger, not the true root cause.
02:24 PDT – DNS Issue Resolved (But Disaster Continues)
“After resolving the DynamoDB DNS issue at 2:24 AM, services began
recovering but we had a subsequent impairment in the
internal subsystem of EC2 that is responsible for launching EC2
instances due to its dependency on DynamoDB.”
Critical Finding: After the DNS fix, services did not recover; instead, a cascade failure followed. This is key evidence for understanding the entire incident.
08:04 PDT – Problem Scope Narrowed
“We continue to investigate the root cause for the network
connectivity issues that are impacting AWS services. We have identified
that the issue originated from within the EC2 internal
network.”
Analysis: It took roughly 6 hours to narrow the problem down to the EC2 internal network.
08:43 PDT – True Root Cause Confirmed
“We have narrowed down the source of the network connectivity issues
that impacted AWS Services. The root cause is an underlying
internal subsystem responsible for monitoring the health of our network
load balancers.”
Major Breakthrough: Finally found the true root
cause! No longer using “potential,” but definitive “The root
cause is.”
From 02:01 (DNS hypothesis) to 08:43 (NLB Health Check confirmation),
AWS engineers spent 6 hours 42 minutes to find the real
cause.
09:38 PDT – Critical Turning Point
“We recovered the Network Load Balancer health checks at 9:38
AM.”
Analysis: After the NLB health checks recovered, services began large-scale recovery. This proves the NLB monitoring system was the true culprit.
14:20 PDT – Full Recovery
“By 2:20 PM, all AWS services returned to normal operations.”
Summary: From actual start (23:48) to full recovery
(14:20), total of 14 hours 32 minutes.
Root Cause Technical Anatomy
Surface Trigger vs True Root Cause
Surface Trigger: DNS Resolution Issue
- Time: 2025-10-19 23:48 PDT
- Symptom: DynamoDB service endpoint DNS resolution failure
- Fix Time: 02:24 PDT (2.5 hours)
True Root Cause: NLB Health Monitoring System Failure
- Time: Confirmed at 08:43 PDT, likely started earlier
- Nature: Internal subsystem unable to correctly monitor Network Load Balancer health status
- Impact Duration: 02:24 → 09:38 (7.2 hours)
- Full Recovery Time: 09:38 → 15:01 (5.4 hours)
Why Did AWS Initially Misdiagnose?
Reason 1: Symptom vs Root Cause Confusion
```
Surface Phenomenon: DynamoDB DNS resolution failure
↓
Actual Mechanism: NLB Health Check system failure
↓
Unable to correctly monitor DynamoDB endpoint health
↓
Marked healthy DynamoDB endpoints as "unhealthy"
↓
DNS resolution system returns errors (no "healthy" endpoints available)
↓
Appears to be a "DNS problem"
```
Analogy:
- Symptom: Patient has a fever (DNS error)
- Root Cause: Immune system failure (NLB Health Monitoring)

Fixing the fever (DNS) doesn't address the immune system problem, so the condition continues to deteriorate.
Reason 2: Cascade Reaction Complexity
AWS infrastructure service interdependencies are extremely
complex:
```mermaid
graph TD
    A[NLB Health Monitoring<br/>System] -->|Monitoring Failure| B[DynamoDB Endpoints]
    B -->|Marked as Unhealthy| C[DNS Resolution Returns Errors]
    C -->|Depends on DynamoDB| D[EC2 Instance Launch]
    D -->|Cannot Start| E[Lambda Execution Environment]
    D -->|Cannot Start| F[CloudWatch Data Collection]
    D -->|Cannot Start| G[Other 142 Services]
    A -->|Also Affects| H[Other Services'<br/>NLB Health Checks]
    H -->|Network Connectivity Lost| I[Lambda/CloudWatch/<br/>API Gateway, etc.]
    style A fill:#ff6b6b
    style H fill:#ff6b6b
    style C fill:#ffa500
    style D fill:#ffa500
```
Reason 3: Progressive Investigation is Standard SRE Practice
AWS used the standard "peeling the onion" failure troubleshooting approach:
- Layer 1 (Surface Symptoms): Elevated error rates → Need to narrow scope
- Layer 2 (Service Layer): DynamoDB errors → Found symptom focus
- Layer 3 (Infrastructure Layer): DNS resolution issue → Found trigger
- Layer 4 (Internal System Layer): EC2 internal network → Narrowed scope
- Layer 5 (Control Plane Layer): NLB Health Monitoring → True Root Cause
This method, while time-consuming, ensures no possibilities are
missed.
Timeline Evidence: Why DNS Wasn't the True Root Cause
| Key Timestamp | Event | Explanation |
|---|---|---|
| 02:24 | DNS issue resolved | AWS fixed DynamoDB DNS resolution |
| 02:24 – 09:38 | Disaster continues | 7.2 hours — If DNS were root cause, services should recover now |
| 09:38 | NLB Health Check recovered | Critical turning point, services begin large-scale recovery |
| 09:38 – 14:20 | Gradual recovery | 4.7 hours — Processing backlog & throttle removal |
Conclusion: Nearly 12 more hours of outage after the DNS fix proved that DNS was only a trigger; the NLB Health Monitoring System was the true root of the disease.
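As a quick sanity check on the durations cited in the table above, the two intervals can be recomputed from the timestamps reported in this article (standard library only, no assumptions beyond rounding):

```python
from datetime import datetime

fmt = "%Y-%m-%d %H:%M"
dns_fixed = datetime.strptime("2025-10-20 02:24", fmt)      # DNS issue resolved
nlb_recovered = datetime.strptime("2025-10-20 09:38", fmt)  # NLB health checks recovered
all_clear = datetime.strptime("2025-10-20 14:20", fmt)      # all services back to normal

# 02:24 -> 09:38: the outage continued for ~7.2 hours after the DNS fix
print(round((nlb_recovered - dns_fixed).total_seconds() / 3600, 1))  # 7.2
# 09:38 -> 14:20: ~4.7 hours of gradual recovery and backlog processing
print(round((all_clear - nlb_recovered).total_seconds() / 3600, 1))  # 4.7
```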
The Fatal Role of the NLB Health Monitoring System
What is the NLB Health Monitoring System?
Definition: The Network Load Balancer (NLB) Health Monitoring System is an AWS internal control plane subsystem responsible for:
- Monitoring NLB Backend Target Health
  - Periodically sends health check requests to backend services
  - Determines which backends are "Healthy" vs "Unhealthy"
- Controlling Traffic Distribution
  - Routes traffic only to "Healthy" backends
  - Automatically removes "Unhealthy" backends from the load balancing pool
- Affecting Service Discovery (illustrated in the sketch below)
  - DNS resolution system depends on health check results to return IP addresses
  - If all backends are marked "Unhealthy," DNS resolution fails
- Maintaining Internal Network Connectivity
  - EC2 instance launches depend on NLB-provided internal network connectivity
  - Lambda execution environments need NLB-provided network routing
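To make the service-discovery dependency concrete, here is a minimal, hypothetical sketch (not AWS's implementation; the hostname and IPs are made up) of how a health-check subsystem that misjudges every backend causes DNS-style resolution to return nothing, even though the backends themselves are fine:

```python
import random

# Hypothetical registry: endpoint IP -> latest health-check verdict
targets = {"10.0.1.10": True, "10.0.1.11": True, "10.0.1.12": True}

def run_health_checks(check_fn):
    """Update each target's verdict from a health-check probe."""
    for ip in targets:
        targets[ip] = check_fn(ip)

def resolve(hostname):
    """DNS-style resolution: only 'healthy' targets are returned."""
    healthy = [ip for ip, ok in targets.items() if ok]
    if not healthy:
        # No healthy targets -> effectively an empty DNS answer,
        # which clients observe as a "DNS resolution failure".
        raise LookupError(f"no healthy endpoints for {hostname}")
    return random.choice(healthy)

# Normal operation: probes succeed, resolution works.
run_health_checks(lambda ip: True)
print(resolve("dynamodb.us-east-1.example"))

# Failure mode from the incident narrative: the monitoring subsystem
# itself misjudges healthy targets as unhealthy, so resolution fails
# even though the backends are actually fine.
run_health_checks(lambda ip: False)
try:
    resolve("dynamodb.us-east-1.example")
except LookupError as e:
    print("resolution failed:", e)
```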
Why Did This Subsystem Trigger a Global Catastrophe?
Problem 1: The Monitor Itself Needs Monitoring (Who Watches the Watchers?)
Normal Architecture:

```
NLB Health Monitoring → Monitors → DynamoDB/EC2/Lambda
```

Paradox in This Incident:

```
NLB Health Monitoring itself failed
↓
But no "upper-level monitoring system" detected this failure
↓
It continued operating, but with incorrect judgments
↓
Marked all "Healthy" services as "Unhealthy"
↓
Triggered global disaster
```
Fundamental Problem: AWS lacked a "Meta-Monitoring" mechanism that could promptly detect anomalies in the monitoring system itself.
Problem 2: Insufficient Control Plane vs Data Plane Isolation
Ideal Architecture:

```
Control Plane: Handles management, monitoring, configuration
- On failure, should only affect "new operations" (like new EC2 launches)
- Should NOT affect "running services" (like running EC2 instances)

Data Plane: Handles actual traffic processing
- Should continue operating when control plane fails
```

Reality in AWS Incident:

```
NLB Health Monitoring (Control Plane) failure
↓
Affected running EC2 instance network connectivity (Data Plane)
↓
Violated "plane isolation" principle
```
Result: Not only could new EC2 instances not launch,
but running services also lost network connectivity.
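The isolation principle can be illustrated with a toy model (class names and config shape are hypothetical, not AWS internals): a well-isolated data plane keeps forwarding traffic on its last-known-good configuration when the control plane stops answering, instead of tearing down existing connectivity.

```python
from dataclasses import dataclass, field

@dataclass
class ControlPlane:
    """Pushes routing config; may fail."""
    healthy: bool = True

    def fetch_routes(self) -> dict[str, list[str]]:
        if not self.healthy:
            raise RuntimeError("control plane unavailable")
        return {"service-a": ["10.0.1.10", "10.0.1.11"]}

@dataclass
class DataPlane:
    """Forwards traffic using the last config it successfully received."""
    last_known_good: dict[str, list[str]] = field(default_factory=dict)

    def refresh(self, control: ControlPlane) -> None:
        try:
            self.last_known_good = control.fetch_routes()
        except RuntimeError:
            # Key isolation property: on control-plane failure we keep
            # serving with stale-but-valid routes instead of dropping traffic.
            pass

    def route(self, service: str) -> list[str]:
        return self.last_known_good.get(service, [])

control = ControlPlane()
data = DataPlane()
data.refresh(control)            # normal operation: config propagates
control.healthy = False
data.refresh(control)            # control plane down: config goes stale...
print(data.route("service-a"))   # ...but existing traffic still routes
```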
Problem 3: Single Point of Dependency & Lack of Fault Tolerance
Architectural Weak Points:
1. Global Services Depend on a Single Region (US-EAST-1)
   - IAM, STS (global authentication) primary endpoints
   - DynamoDB Global Tables coordination center
2. NLB Health Monitoring System Lacks Redundancy
   - No backup monitoring system
   - Single subsystem failure triggers cascade disaster
3. Over-Coupled Service Dependencies
   - Why does EC2 launch need to depend on DynamoDB?
   - Why is Lambda affected by NLB health checks?
   - Lacks a "Degraded Mode" operation mechanism
Cascade Failure Mechanism Explained
Dual Cascade Reaction Paths
Path 1: DNS → DynamoDB → EC2 → Upper-Layer Services
```mermaid
graph LR
    A[NLB Health<br/>Monitoring Failure] --> B[DynamoDB Endpoints<br/>Misjudged as Unhealthy]
    B --> C[DNS Resolution<br/>Failure]
    C --> D[EC2 Instance<br/>Launch Failure]
    D --> E[Lambda Execution<br/>Environment Cannot Create]
    D --> F[ECS Containers<br/>Cannot Start]
    D --> G[Glue Jobs<br/>Cannot Execute]
    style A fill:#ff6b6b
    style B fill:#ffa500
    style C fill:#ffa500
    style D fill:#ffa500
```
Affected Services: ECS, Glue, RDS, Redshift, EMR,
SageMaker (all services depending on EC2)
Path 2: NLB Health Check → Network Connectivity → Global Services
```mermaid
graph LR
    A[NLB Health<br/>Monitoring Failure] --> B[NLB Health Checks<br/>Comprehensively Failed]
    B --> C[Network Connectivity<br/>Lost]
    C --> D[Lambda<br/>Invocation Failure]
    C --> E[CloudWatch<br/>Cannot Collect Metrics]
    C --> F[API Gateway<br/>Cannot Route Requests]
    style A fill:#ff6b6b
    style B fill:#ff6b6b
    style C fill:#ff0000
```
Affected Services: Lambda, CloudWatch, API Gateway,
Step Functions, EventBridge
Why Did Recovery Take So Long?
Phase 1: DNS Fix (2.5 hours)
- Relatively fast, as it was a surface-level issue
- But fixing it didn’t resolve the problem
Phase 2: Cascade Failure Handling (7.2 hours)
- Needed to identify the NLB Health Monitoring failure (took 6.5 hours)
- Fix the NLB Health Check system (took 0.7 hours)
Phase 3: Gradual Service Recovery (5.4 hours)
- Why not immediately recover at full speed? AWS adopted a "throttling strategy"
Throttling Reasons:
1. Avoid the Thundering Herd Problem
   - If all services restarted simultaneously, it would put enormous pressure on the infrastructure
   - Could trigger a "secondary collapse"
2. Ensure Stability Over Speed
   - Gradually lift throttles (EC2 launches, Lambda invocations, SQS polling)
   - Confirm each phase is stable before proceeding to the next
3. Process Backlog Tasks
   - CloudTrail, EventBridge, Connect had massive backlogs
   - Needed time to gradually clear
Conclusion: AWS chose a "Slow is Smooth, Smooth is Fast" recovery strategy, sacrificing speed for stability.
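The throttling idea can be sketched as a stepwise admission-rate ramp gated on a stability signal. This is a hypothetical illustration of the pattern, not AWS's actual tooling; the function and parameter names are invented.

```python
import time

def ramp_up_throttle(start_rate: float, target_rate: float,
                     step_fraction: float = 0.25,
                     pause_s: float = 1.0,
                     is_stable=lambda: True) -> float:
    """Gradually raise an admission-rate limit back toward its normal level.

    Each step proceeds only if the system still looks stable; otherwise the
    ramp holds at the current rate instead of jumping straight to 100%.
    (In production the pauses would be minutes, not seconds.)
    """
    rate = start_rate
    while rate < target_rate:
        if not is_stable():
            time.sleep(pause_s)  # hold until error rates / queue depths recover
            continue
        rate = min(target_rate, rate + target_rate * step_fraction)
        print(f"admitting up to {rate:.0f} launches/min")
        time.sleep(pause_s)      # let backlogged work drain before the next step
    return rate

# Example: reopen EC2 launch throttles from 10% of normal back to 100%.
ramp_up_throttle(start_rate=100, target_rate=1000)
```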
List of 142 Affected Services
Core Infrastructure Services
- Compute: EC2, Lambda, ECS, EKS, Fargate, Batch
- Storage: S3, EBS, EFS, FSx
- Database: DynamoDB, RDS, Aurora, Redshift, ElastiCache, DocumentDB, Neptune
- Networking: VPC, ELB (ALB/NLB/CLB), CloudFront, Route 53, API Gateway, Direct Connect, VPN
Developer Tools
- CI/CD: CodeBuild, CodePipeline, CodeDeploy
- Monitoring: CloudWatch, X-Ray, CloudTrail
- Management: Systems Manager, CloudFormation,
Config, OpsWorks
Application Services
- Messaging: SQS, SNS, EventBridge, Kinesis, MQ
- Workflow: Step Functions, SWF, Glue
- AI/ML: SageMaker, Bedrock, Comprehend, Rekognition,
Polly, Transcribe, Translate
Enterprise Services
- Authentication: IAM, Cognito, STS, IAM Identity Center
- Security: Secrets Manager, WAF, Security Hub, GuardDuty, Firewall Manager
- Analytics: Athena, EMR, QuickSight, DataZone, Lake Formation
Customer Services
- Communications: Connect, Chime, WorkMail, Pinpoint, SES
- End User: WorkSpaces, AppStream 2.0, WorkSpaces Thin Client
This catalog of affected services, drawn from the full list of 142, demonstrates the US-EAST-1 region's importance to AWS's global architecture.
Post-Mortem Analysis: AWS Final Report's Root Cause Correction
Real-Time Judgment vs. Complete Investigation Report
During the incident, AWS engineers went through a process of
“gradually approaching the truth.” However, in the complete Post-Event
Summary (published October 23, 2025), AWS, based on in-depth
post-incident Root Cause Analysis (RCA), repositioned the role of each
component in this incident.
Judgment During Real-Time Updates (Incident in Progress)
02:01 PDT – AWS first proposed a hypothesis: >
“We have identified a potential root cause for error
rates for the DynamoDB APIs… The issue appears to be related to
DNS resolution.”
Key word: Used “potential”,
indicating not fully confirmed.
08:43 PDT – AWS confirmed judgment: > “We have
narrowed down the source… The root cause is an
underlying internal subsystem responsible for monitoring the health of
our network load balancers.”
Key word: No longer using “potential”, but
explicitly stating “The root cause is”.
Conclusion in Final Report (Post-Incident Deep Investigation)
In the complete Post-Event Summary, AWS provided a different
conclusion:
“The root cause of this issue was a latent
race condition in the DynamoDB DNS management system that
resulted in an incorrect empty DNS record for the service’s regional
endpoint.”
Key difference: Root cause changed from “NLB Health
Monitoring System” to “DynamoDB DNS race condition”.
Why This Difference?
Reason 1: Different Goals Between Immediate Recovery and Root Cause Analysis
Immediate Recovery Phase (incident in progress):
- Goal: Restore service as quickly as possible
- Focus: "What fix will restore service?"
- Conclusion: Fix NLB Health Monitoring → Service restored → Therefore identified as root cause

Post-Incident Investigation Phase (Post-Event Summary):
- Goal: Find out "why it happened"
- Focus: "If X hadn't occurred, would the entire incident not have happened?"
- Conclusion: Without the DNS race condition, NLB wouldn't have started misidentifying → The DNS race condition is the starting point
Reason 2: Complexity of the Cause Chain
Both conclusions have validity, but observe different layers:
Real-Time Judgment Logic (bottom-up):

```
Observed phenomenon: All services failing
↓
Found recovery key: Fix NLB Health Monitoring
↓
Conclusion: NLB Health Monitoring is root cause
```

Post-Incident Analysis Logic (top-down):

```
Starting point: DynamoDB DNS race condition
↓
Triggered: DynamoDB endpoint unresolvable
↓
Caused: EC2 instance launch failures (dependent on DynamoDB)
↓
Caused: Network configuration propagation delays
↓
Caused: NLB health checks started misidentifying
↓
Amplified: Repeated remove/restore cycles
↓
Result: Service unstable for 14.5 hours
```
Reason 3: Different Definitions of "Root Cause"
In incident analysis, “root cause” has two common definitions:
Definition A: Critical Path
- "Removing which problem restores service fastest?"
- Real-time judgment used this definition → NLB Health Monitoring

Definition B: First Cause
- "If we could go back in time, preventing which problem would prevent the entire incident?"
- Post-incident analysis used this definition → DynamoDB DNS race condition
Repositioning Component Roles
Based on AWS’s final report, we need to re-understand each
component’s role:
| Component | Role in Real-Time Judgment | Role in Final Report | Technical Position |
|---|---|---|---|
| DynamoDB DNS race condition | Surface trigger | ✅ Root cause | First Cause |
| NLB Health Monitoring System | ✅ Root cause | Secondary impact and amplifier | Critical Path |
| EC2 instance launch failures | Secondary impact | Direct impact | Intermediate Effect |
| 142 service outages | Final result | Final result | Final Impact |
Complete Correct Cause Chain
According to AWS’s final report, the complete technical cause chain
should be:
```
【Starting Point】DynamoDB DNS race condition
↓
Two independent DNS Enactors interacted with race condition
↓
One Enactor applied old DNS plan
Another Enactor simultaneously deleted the plan
↓
DynamoDB regional endpoint DNS record became empty
↓
【Direct Impact】DynamoDB endpoint unresolvable
↓
EC2 instance launch dependent on DynamoDB → Launch failures
↓
Newly launched EC2 instances' network configuration propagation delayed
↓
【Secondary Impact】NLB health check subsystem affected
↓
NLB performed health checks on network-delayed EC2 instances
↓
Health checks failed (but instances were actually healthy)
↓
【Amplification Mechanism】Repeated remove/restore cycles
↓
NLB removed "unhealthy" instances (deleted from DNS)
↓
Next check found instances ready → Check succeeded
↓
NLB restored instances (added back to DNS)
↓
Continuously repeated this cycle, increasing system load
↓
Health check system itself degraded performance
↓
Triggered automatic AZ DNS failover
↓
【Final Result】Service unstable for 14.5 hours
```
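The "two Enactors" race at the top of this chain can be made concrete with a deliberately simplified, single-process sketch. The function names, plan IDs, and hostname here are hypothetical; the point is only the check-then-act pattern in which a delayed writer acting on an already-deleted plan leaves the endpoint's DNS record empty.

```python
# Hypothetical, highly simplified model of the race AWS describes: two
# independent "DNS Enactors" apply and clean up DNS plans for one endpoint.
dns_records = {"dynamodb.region.example": ["10.0.0.1"]}  # current answer
plans = {1: ["10.0.0.1"], 2: ["10.0.0.2"]}               # plan id -> IPs

def apply_plan(plan_id: int) -> None:
    """Enactor step: write the plan's IPs as the endpoint's DNS record."""
    dns_records["dynamodb.region.example"] = plans.get(plan_id, [])

def cleanup_stale_plans(newest_applied: int) -> None:
    """Enactor step: delete plans older than the newest one it applied."""
    for pid in [p for p in plans if p < newest_applied]:
        del plans[pid]

# Interleaving that reproduces the failure mode:
# Enactor B applies the newer plan 2, then cleans up plan 1...
apply_plan(2)
cleanup_stale_plans(2)
# ...while a delayed Enactor A, still working on the old plan 1, applies it
# *after* the cleanup. Plan 1 no longer exists, so the record becomes empty.
apply_plan(1)

print(dns_records)  # {'dynamodb.region.example': []} -> empty DNS answer
```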
Reinterpreting Time Evidence
Let’s re-examine key time points from the “complete cause chain”
perspective:
| Time (PDT) | Event | Original Interpretation | Correct Interpretation |
|---|---|---|---|
| 23:48 | Incident began | DNS issue triggered | ✅ DNS race condition occurred (root cause) |
| 02:24 | DNS issue resolved | Surface fix | ✅ Fixed starting problem, but chain reaction already initiated |
| 02:24 – 09:38 | Disaster continued 7.2 hours | Proves DNS isn’t root cause | ❌ Wrong inference: This time is when secondary impacts and amplification mechanisms were active |
| 09:38 | NLB Health Check restored | True root cause fixed | ✅ Critical path fixed, but not root cause |
| 14:20 | Fully recovered | All problems resolved | ✅ Backlog processing completed |
Key Insight:
- The fact that service was not restored after the DNS fix doesn't prove that "DNS isn't the root cause"
- Rather, because the chain reaction had already started, the secondary problems it initiated (NLB misidentification) still required separate handling even after the starting cause was removed
- Analogy: A forest fire is started by lightning, but eliminating the lightning cannot extinguish the already burning fire
AWS Engineers' Diagnostic Evolution
Understanding this evolution process is very valuable for learning
incident investigation:
Phase 1 (00:11 – 02:01): Discovered surface symptoms
- Observation: DynamoDB error rates increasing
- Hypothesis: Possibly a DNS issue
- Status: Symptom-level analysis

Phase 2 (02:01 – 02:24): Attempted to fix the hypothesized cause
- Action: Fixed the DNS issue
- Expectation: Service should restore
- Result: ❌ Service not restored

Phase 3 (02:24 – 08:43): Reinvestigated
- Discovery: Problem more complex than imagined
- Deeper: EC2 internal network issues
- Progress: Gradually narrowed scope

Phase 4 (08:43): Found the critical path
- Breakthrough: NLB Health Monitoring System failure
- Judgment: This is the root cause (real-time judgment)
- Reason: Fixing it restored service

Phase 5 (Post-incident, 10/23): Complete RCA
- Traced back: Why did NLB start misidentifying?
- Discovery: The DNS race condition was the starting point
- Correction: The DynamoDB DNS race condition is the root cause (final conclusion)
This Article's Documentary Value
Important Note: This article preserves the evolution
of AWS’s real-time judgment, which is very valuable learning
material.
Why Preserve Real-Time Judgment?
1. Authenticity: Reflects the true situation during the incident
2. Educational Value: Demonstrates the complexity of investigating large-scale distributed system failures
3. SRE Practice: Explains why immediate recovery and post-incident RCA conclusions differ
4. Historical Record: The evolution of AWS's update content is itself important data
How to Read This Article Correctly:
- "Root Cause Technical Anatomy" section: Records the judgment at the time (real-time analysis)
- "Post-Mortem Analysis" section (this section): Supplements the final report's conclusion (complete RCA)
- Combined: Complete understanding from immediate recovery to deep analysis
Key Learning Points
Learning Point 1: Immediate Recovery ≠ Root Cause Analysis
Immediate Recovery:
- Goal: Restore service quickly
- Method: Find the Critical Path
- Problem: May misjudge the starting cause

Root Cause Analysis:
- Goal: Prevent recurrence
- Method: Find the First Cause
- Requires: Post-incident calm, deep investigation
Learning Point 2: Inertia of Chain Reactions
Chain reactions in complex systems have "inertia":
- Even after removing the starting cause, the chain reaction may not stop automatically
- Intervention is required at each critical node of the chain reaction
- This is also why service wasn't restored after the DNS fix
Learning Point 3: Both Types of "Root Cause" Are Important
From a practical perspective:
- First Cause (DynamoDB DNS race condition): Needs fixing to prevent recurrence
- Critical Path (NLB Health Monitoring): Needs strengthening to shorten recovery time
Both require improvement measures!
References
- AWS Official Post-Event Summary: https://aws.amazon.com/message/101925/ (2025-10-23)
- This Article's "Root Cause Technical Anatomy" Section: Records the real-time judgment evolution during the incident
- Data Verification Date: 2025-10-24
Classic Scenario: When the Watcher Fails
Philosophical Question: Who Watches the Watchers?
The Latin proverb “Quis custodiet ipsos custodes?” found real-world
validation in this incident:
Traditional Monitoring Architecture:

```
Monitoring System → Monitors → Business System
```

Problem: When the monitoring system itself fails, who discovers it?

Manifestation in the AWS Incident:

```
NLB Health Monitoring System
↓ (Failed but undiscovered)
Continues operating, but with incorrect judgments
↓
Marks healthy services as abnormal
↓
Triggers global disaster
↓
AWS engineers spend 6.5 hours discovering the monitoring system itself has issues
```
Solution: Meta-Monitoring Architecture
Improved Architecture:

```
Meta-Monitoring Layer (monitors the monitoring system)
↓
Monitoring System Layer (monitors business systems)
↓
Business System Layer (actual services)
```
Implementation:
- Independent Meta-Monitoring System
  - Monitoring Target: Is the NLB Health Monitoring System behaving normally?
  - Detection Metrics:
    - Health check request success rate
    - Reasonableness of judgment results (e.g., sudden mass marking as unhealthy)
    - System response time
- Anomaly Detection

  ```python
  # Pseudocode example
  if (unhealthy_targets / total_targets) > 0.5:
      # Over 50% of targets marked unhealthy is highly unreasonable
      alert("NLB Health Monitoring System may be failing")
      trigger_failover_to_backup_monitoring_system()
  ```

- Redundant Monitoring Systems (see the sketch below)
  - Primary monitoring system + backup monitoring system
  - Both use different technology stacks (to avoid common-mode failure)
  - Auto-switch when the primary system behaves abnormally
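Extending the redundancy item, here is a sketch of how two independently implemented monitoring systems could be cross-checked (names and thresholds are invented for illustration): a large disagreement between them is itself a strong hint that a monitor, rather than the fleet, is at fault.

```python
def disagreement(verdicts_a: dict[str, bool], verdicts_b: dict[str, bool]) -> float:
    """Fraction of shared targets on which the two monitoring systems disagree."""
    keys = verdicts_a.keys() & verdicts_b.keys()
    if not keys:
        return 0.0
    return sum(verdicts_a[k] != verdicts_b[k] for k in keys) / len(keys)

primary = {"i-1": False, "i-2": False, "i-3": False}  # primary says all unhealthy
backup  = {"i-1": True,  "i-2": True,  "i-3": False}  # backup mostly disagrees

if disagreement(primary, backup) > 0.3:
    # Large disagreement suggests a monitor fault, not a fleet-wide outage:
    # page a human and stop trusting the primary's "unhealthy" verdicts.
    print("ALERT: monitoring systems disagree; suspect monitor failure")
```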
Similar Historical Cases
Google Cloud 2019-06-02 Incident
- Root Cause: Network configuration error causing traffic routing failure
- Similarity: Control plane issues affecting the data plane
- Duration: 4.5 hours
- Lesson: Google subsequently implemented a "gradual rollout of configuration changes" mechanism
Azure 2018-09-04 Incident
- Root Cause: DNS configuration error
- Similarity: DNS issues triggering cascade failures
- Duration: 24 hours
- Lesson: Microsoft strengthened DNS system redundancy
Cloudflare 2019-07-02 Incident
- Root Cause: WAF rule deployment error
- Similarity: The monitoring/protection system became the attacker
- Duration: 3.5 hours
- Lesson: Cloudflare implemented stricter configuration validation
AWS Incident's Uniqueness:
- Longest duration (15 hours)
- Most affected services (142)
- Deepest root cause (the monitoring system itself failed)
- Most conservative recovery strategy (extensive use of throttling)
Key Findings & Technical Insights
Finding 1: US-EAST-1's "Special Status" is a Systemic Risk
Why is US-EAST-1 So Critical?
- Historical Reasons
  - AWS's first region, launched in 2006
  - Bears massive legacy architecture and technical debt
- Global Service Dependencies
  - IAM, STS global authentication primary endpoints
  - DynamoDB Global Tables coordination center
  - CloudFront configuration management center
- Customer Inertia
  - Earliest customers concentrated here
  - Migration costs extremely high (data, network latency, compliance)
Risk Analysis:

```
US-EAST-1 failure → Global services affected
→ Other regions cannot operate fully independently
→ Violates "regional independence" design principle
```
Finding 2: Control Plane & Data Plane Coupling is a Fatal Flaw
Theory vs Reality:
| Design Principle | AWS Promise | Reality in Incident |
|---|---|---|
| Control plane failure doesn’t affect data plane | ✅ Claims isolation | ❌ NLB Health Check failure affected running services |
| Regional independence | ✅ Each region operates independently | ❌ Other regions depend on US-EAST-1 global services |
| Degraded operation capability | ✅ Core services should be degradable | ❌ EC2 cannot launch, no degradation mechanism |
Improvement Directions:
- EC2 should have a "no-DynamoDB launch mode" (degraded but usable; see the sketch below)
- Global services should truly implement multi-region active-standby
- Monitoring system failure shouldn't affect established network connections
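A minimal sketch of what the first item's degraded mode could look like (entirely hypothetical; AWS has not described EC2 internals this way): the launch path falls back to locally cached metadata when its DynamoDB dependency is unreachable, trading freshness for availability.

```python
import json
import pathlib

CACHE = pathlib.Path("launch_metadata_cache.json")  # hypothetical local cache

class DependencyUnavailable(Exception):
    pass

def fetch_metadata_from_dynamodb(instance_type: str) -> dict:
    # Stand-in for the real dependency; here it always fails, as if the
    # regional DynamoDB endpoint could not be resolved.
    raise DependencyUnavailable("DynamoDB endpoint unresolvable")

def launch_instance(instance_type: str) -> dict:
    """Prefer fresh metadata, but degrade to the local cache instead of failing."""
    try:
        metadata = fetch_metadata_from_dynamodb(instance_type)
        CACHE.write_text(json.dumps(metadata))       # refresh cache on success
    except DependencyUnavailable:
        if not CACHE.exists():
            raise                                    # nothing to degrade to
        metadata = json.loads(CACHE.read_text())     # degraded but usable
        metadata["degraded"] = True
    return {"instance_type": instance_type, **metadata}

# Demo: seed the cache as if an earlier successful launch had refreshed it.
CACHE.write_text(json.dumps({"ami": "ami-12345678", "placement": "use1-az1"}))
print(launch_instance("m5.large"))  # served from cache, flagged as degraded
```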
Finding 3: Progressive Root Cause Analysis is a Double-Edged Sword
Advantages:
- Avoids hasty conclusions leading to wrong decisions
- Layer-by-layer investigation ensures no issues are missed
Disadvantages:
- Too time-consuming (6.5 hours to find the true root cause)
- Initial misjudgment (DNS) may mislead customers
Best Practice:
Parallel Investigation + Transparent Communication
Simultaneously:
1. Surface symptom mitigation (DNS fix)
2. Deep root cause investigation (NLB Health Monitoring)
3. External communication clearly distinguishes "symptom" vs "root cause"
Finding 4: Throttling Strategy Reflects Mature SRE Culture
AWS chose to sacrifice recovery speed for stability, demonstrating
“reliability over speed” core values:
Throttling Implementation:
- EC2 instance launch throttling
- Lambda invocation throttling
- SQS polling rate limiting
- Asynchronous operations throttling
Philosophy: Slow is Smooth, Smooth is Fast
- Rapid recovery might trigger a secondary collapse (Thundering Herd)
- Gradual recovery ensures mistakes are not repeated

Results:
- No "secondary collapse"
- Recovery process stable and controlled
- But total duration extended to 15 hours
Series Article Preview
This article is the first in the AWS US-EAST-1 Major Incident
Series Analysis, providing a complete record of the incident’s
timeline and technical root causes.
Want to know:
- How did AWS handle this crisis? What communication strategies are worth learning?
- How can enterprises learn from this to design fault-tolerant architectures and multi-cloud strategies?
- What immediate actions should technology decision-makers take?
- How can SRE teams replicate AWS's excellent practices while avoiding similar mistakes?
Read the second article in the series: AWS
Architecture Resilience Deep Dive: Multi-AZ/Multi-Region Design
Principles from US-EAST-1 Outage
References
- AWS Official Sources
  - AWS Health Dashboard – US-EAST-1 Operational Issue (2025-10-20)
  - AWS Service Health Dashboard incident records (complete 21 status updates)
- Media Coverage
  - CNN Business: "AWS global outage, Amazon, Snapchat, Roblox and Fortnite down" (2025-10-20)
  - NPR: "Outage at Amazon Web Services disrupts websites across the internet" (2025-10-20)
  - TechRadar: "Amazon fixes huge AWS outage that broke much of the internet" (2025-10-20)
  - Bloomberg: "AWS Outage: Amazon Cloud Restored; Hit Snapchat, Roblox, Robinhood" (2025-10-20)
- Technical Community Analysis
  - DEV Community: "The AWS Outage on Oct 20, 2025: What Broke, Why It Felt Global, and How Amazon Stabilized It"
  - DEV Community: "The Great AWS Outage of October 2025: When the Internet's Backbone Buckled"
  - GeekWire: "AWS outage was not due to a cyberattack — but shows potential for 'far worse' damage"
- Expert Commentary
  - Catchpoint CEO: Estimated total loss in the billions of dollars
  - Betsy Cooper (Cybersecurity Expert): "This highlights the fragility of depending on a few companies"
📌 About This Series
This is a classic case in cloud service history worth deep study by
all technology decision-makers, architects, and SRE engineers. We’ll
provide comprehensive analysis of the incident and practical insights
through two series articles.
Article 1 (This Article): Complete incident record and technical anatomy
Article 2: AWS Architecture Resilience Deep Dive: Multi-AZ/Multi-Region Design Principles from US-EAST-1 Outage