Incident Overview
On October 20, 2025, Amazon Web Services (AWS), the world’s largest
cloud provider, experienced one of its most severe service disruptions
in the US-EAST-1 region (Northern Virginia). This 14-hour, 32-minute global outage
not only shocked the cloud industry but has also become a classic case study in
cloud architecture design and crisis management.
Key Metrics
| Metric | Data |
|---|---|
| Outage Duration | 14 hours 32 minutes (2025-10-19 23:48 PDT → 2025-10-20 14:20 PDT) |
| Affected Services | 142 AWS services |
| Official Updates | 21 times (average every 43 minutes) |
| Impacted Users | Millions (Snapchat, Roblox, Robinhood, McDonald’s, United Airlines, etc.) |
| Estimated Loss | Billions of dollars (Catchpoint CEO estimate) |
| Geographic Impact | Global (USA, UK, Europe, etc.) |
| Incident Reports | Over 6.5 million (Downdetector, covering 1000+ services) |
Impact Scope
Key Affected Services:
- Social Platforms: Snapchat, Signal
- Gaming: Roblox, Fortnite
- FinTech: Robinhood, Coinbase, Venmo
- Airlines: United Airlines (website & app)
- Retail: McDonald's App, Amazon Ring doorbell cameras
- AI Services: ChatGPT, Perplexity
- Banking: Lloyds, Bank of Scotland, Halifax (UK)
AWS Internal Affected Services: EC2, DynamoDB,
Lambda, S3, RDS, ECS, CloudWatch, CloudFront, IAM, STS, API Gateway,
ELB, SQS, SNS, Step Functions, Redshift, Connect, Glue, Athena, Kinesis,
and 122 more services.
Complete Incident Timeline
Phase 1: Discovery & Initial Investigation (00:11 – 02:01)
```mermaid
timeline
    title AWS US-EAST-1 Major Incident Timeline (2025-10-20)
    section Discovery Phase
        23:48 (10/19) : Incident actually begins (DNS issue triggered)
        00:11 : Investigation starts - elevated error rates in US-EAST-1
        00:51 : Multiple AWS services confirmed affected
        01:26 : DynamoDB endpoint significant error rates confirmed
    section Root Cause Identification
        02:01 : Identified "potential root cause" - DynamoDB DNS resolution
        02:22 : Initial mitigation applied, early recovery signs observed
        02:24 : DynamoDB DNS issue resolved (but services not recovered)
        02:27 : Most requests succeeding, processing queued requests
    section Cascade Failure Handling
        03:03 : Global services and US-EAST-1 dependencies recovered
        03:35 : DNS fully mitigated, but EC2 launches still failing
        04:08 : Continuing to address EC2 launch errors & Lambda delays
        05:48 : New EC2 instances launching successfully in some AZs
    section True Root Cause Confirmed
        07:29 : Multiple services network connectivity issues confirmed
        08:04 : Issue identified within "EC2 internal network"
        08:43 : 🎯 True root cause confirmed - NLB health check subsystem failure
    section Recovery Phase
        09:13 : Mitigation steps applied, connectivity recovery observed
        09:38 : NLB health checks recovered (critical turning point)
        10:03 : EC2 fix validated, preparing deployment
        11:22 : EC2 launch success rate improving, network issues decreasing
        12:15 : Most services continuing recovery, Lambda improving
    section Full Recovery
        13:03 : Lambda invocations fully recovered
        13:52 : EC2 throttles reduced to pre-event levels
        14:20 : ✅ All AWS services returned to normal operations
        14:48 : Connect handling new sessions normally, backlog processing
        14:53 : Complete Post-Event Summary published
```
Detailed Timeline with AWS Official Updates
23:48 PDT (Oct 19) – Incident Actually Begins
According to the final report, the incident was triggered at this
time, but AWS was not yet aware.
00:11 PDT – Problem First Detected
“We are investigating increased error rates and latencies for
multiple AWS services in the US-EAST-1 Region.”
Analysis: AWS detected the problem 23 minutes after it began, indicating monitoring-system detection latency.
00:51 PDT – Impact Scope Confirmed
“We can confirm increased error rates and latencies for multiple AWS
Services in the US-EAST-1 Region. This issue may also be affecting Case
Creation through the AWS Support Center or the Support API.”
Analysis: Even the support system was affected,
revealing the severity.
01:26 PDT – Surface Symptom Identified
“We can confirm significant error rates for requests made to the
DynamoDB endpoint in the US-EAST-1 Region. This issue also affects other
AWS Services in the US-EAST-1 Region as well.”
Analysis: DynamoDB identified as the main symptom,
but root cause not yet found.
02:01 PDT – Initial Root Cause Hypothesis (Later Proven Incorrect)
“We have identified a potential root cause for error
rates for the DynamoDB APIs in the US-EAST-1 Region. Based on our
investigation, the issue appears to be related to DNS
resolution of the DynamoDB API endpoint in US-EAST-1. We are
working on multiple parallel paths to accelerate recovery.”
Analysis: Use of “potential” indicates uncertainty.
Later confirmed DNS was only a trigger, not the true root cause.
02:24 PDT – DNS Issue Resolved (But Disaster Continues)
“After resolving the DynamoDB DNS issue at 2:24 AM, services began
recovering but we had a subsequent impairment in the
internal subsystem of EC2 that is responsible for launching EC2
instances due to its dependency on DynamoDB.”
Critical Finding: After the DNS fix, services did not recover; instead, a cascade failure followed. This is key evidence for understanding the entire incident.
08:04 PDT – Problem Scope Narrowed
“We continue to investigate the root cause for the network
connectivity issues that are impacting AWS services. We have identified
that the issue originated from within the EC2 internal
network.”
Analysis: It took roughly 6 hours to narrow the problem down to the EC2 internal network.
08:43 PDT – True Root Cause Confirmed
“We have narrowed down the source of the network connectivity issues
that impacted AWS Services. The root cause is an underlying
internal subsystem responsible for monitoring the health of our network
load balancers.”
Major Breakthrough: Finally found the true root
cause! No longer using “potential,” but definitive “The root
cause is.”
From 02:01 (DNS hypothesis) to 08:43 (NLB Health Check confirmation),
AWS engineers spent 6 hours 42 minutes to find the real
cause.
09:38 PDT – Critical Turning Point
“We recovered the Network Load Balancer health checks at 9:38
AM.”
Analysis: After the NLB health checks recovered, services began large-scale recovery. This proves the NLB monitoring system was the true culprit.
14:20 PDT – Full Recovery
“By 2:20 PM, all AWS services returned to normal operations.”
Summary: From actual start (23:48) to full recovery
(14:20), total of 14 hours 32 minutes.
Root Cause Technical Anatomy
Surface Trigger vs True Root Cause
Surface Trigger: DNS Resolution Issue
- Time: 2025-10-19 23:48 PDT
- Symptom: DynamoDB service endpoint DNS resolution failure
- Fix Time: 02:24 PDT (2.5 hours)
True Root Cause: NLB Health Monitoring System Failure
- Time: Confirmed at 08:43 PDT, likely started earlier
- Nature: Internal subsystem unable to correctly monitor Network Load Balancer health status
- Impact Duration: 02:24 → 09:38 (7.2 hours)
- Full Recovery Time: 09:38 → 15:01 (5.4 hours)
Why Did AWS Initially Misdiagnose?
Reason 1: Symptom vs Root Cause Confusion
```
Surface Phenomenon: DynamoDB DNS resolution failure
↓
Actual Mechanism: NLB Health Check system failure
↓
Unable to correctly monitor DynamoDB endpoint health
↓
Marked healthy DynamoDB endpoints as "unhealthy"
↓
DNS resolution system returns errors (no "healthy" endpoints available)
↓
Appears to be a "DNS problem"
```
Analogy:
- Symptom: Patient has a fever (DNS error)
- Root Cause: Immune system failure (NLB Health Monitoring)

Fixing the fever (DNS) doesn't address the immune system problem, so the condition continues to deteriorate.
Reason 2: Cascade Reaction Complexity
AWS infrastructure service interdependencies are extremely
complex:
```mermaid
graph TD
    A[NLB Health Monitoring<br/>System] -->|Monitoring Failure| B[DynamoDB Endpoints]
    B -->|Marked as Unhealthy| C[DNS Resolution Returns Errors]
    C -->|Depends on DynamoDB| D[EC2 Instance Launch]
    D -->|Cannot Start| E[Lambda Execution Environment]
    D -->|Cannot Start| F[CloudWatch Data Collection]
    D -->|Cannot Start| G[Other 142 Services]
    A -->|Also Affects| H[Other Services'<br/>NLB Health Checks]
    H -->|Network Connectivity Lost| I[Lambda/CloudWatch/<br/>API Gateway, etc.]
    style A fill:#ff6b6b
    style H fill:#ff6b6b
    style C fill:#ffa500
    style D fill:#ffa500
```
Reason 3: Progressive Investigation is Standard SRE Practice
AWS used the standard "peeling the onion" failure troubleshooting approach:
- Layer 1 (Surface Symptoms): Elevated error rates → Need to narrow scope
- Layer 2 (Service Layer): DynamoDB errors → Found symptom focus
- Layer 3 (Infrastructure Layer): DNS resolution issue → Found trigger
- Layer 4 (Internal System Layer): EC2 internal network → Narrowed scope
- Layer 5 (Control Plane Layer): NLB Health Monitoring → True Root Cause
This method, while time-consuming, ensures no possibilities are
missed.
Timeline Evidence: Why DNS Wasn't the True Root Cause
| Key Timestamp | Event | Explanation |
|---|---|---|
| 02:24 | DNS issue resolved | AWS fixed DynamoDB DNS resolution |
| 02:24 – 09:38 | Disaster continues | 7.2 hours — If DNS were root cause, services should recover now |
| 09:38 | NLB Health Check recovered | Critical turning point, services begin large-scale recovery |
| 09:38 – 14:20 | Gradual recovery | 4.7 hours — Processing backlog & throttle removal |
Conclusion: Nearly 12 more hours of outage after the DNS fix proved that DNS was only a trigger; the NLB Health Monitoring System was the true root of the disease.
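As a quick sanity check on the durations cited in the table above, the two intervals can be recomputed from the timestamps reported in this article (standard library only, no assumptions beyond rounding):

```python
from datetime import datetime

fmt = "%Y-%m-%d %H:%M"
dns_fixed = datetime.strptime("2025-10-20 02:24", fmt)      # DNS issue resolved
nlb_recovered = datetime.strptime("2025-10-20 09:38", fmt)  # NLB health checks recovered
all_clear = datetime.strptime("2025-10-20 14:20", fmt)      # all services back to normal

# 02:24 -> 09:38: the outage continued for ~7.2 hours after the DNS fix
print(round((nlb_recovered - dns_fixed).total_seconds() / 3600, 1))  # 7.2
# 09:38 -> 14:20: ~4.7 hours of gradual recovery and backlog processing
print(round((all_clear - nlb_recovered).total_seconds() / 3600, 1))  # 4.7
```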
The Fatal Role of the NLB Health Monitoring System
What is the NLB Health Monitoring System?
Definition: The Network Load Balancer (NLB) Health Monitoring System is an AWS internal control plane subsystem responsible for:
- Monitoring NLB Backend Target Health
  - Periodically sends health check requests to backend services
  - Determines which backends are "Healthy" vs "Unhealthy"
- Controlling Traffic Distribution
  - Routes traffic only to "Healthy" backends
  - Automatically removes "Unhealthy" backends from the load balancing pool
- Affecting Service Discovery (illustrated in the sketch below)
  - DNS resolution system depends on health check results to return IP addresses
  - If all backends are marked "Unhealthy," DNS resolution fails
- Maintaining Internal Network Connectivity
  - EC2 instance launches depend on NLB-provided internal network connectivity
  - Lambda execution environments need NLB-provided network routing
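To make the service-discovery dependency concrete, here is a minimal, hypothetical sketch (not AWS's implementation; the hostname and IPs are made up) of how a health-check subsystem that misjudges every backend causes DNS-style resolution to return nothing, even though the backends themselves are fine:

```python
import random

# Hypothetical registry: endpoint IP -> latest health-check verdict
targets = {"10.0.1.10": True, "10.0.1.11": True, "10.0.1.12": True}

def run_health_checks(check_fn):
    """Update each target's verdict from a health-check probe."""
    for ip in targets:
        targets[ip] = check_fn(ip)

def resolve(hostname):
    """DNS-style resolution: only 'healthy' targets are returned."""
    healthy = [ip for ip, ok in targets.items() if ok]
    if not healthy:
        # No healthy targets -> effectively an empty DNS answer,
        # which clients observe as a "DNS resolution failure".
        raise LookupError(f"no healthy endpoints for {hostname}")
    return random.choice(healthy)

# Normal operation: probes succeed, resolution works.
run_health_checks(lambda ip: True)
print(resolve("dynamodb.us-east-1.example"))

# Failure mode from the incident narrative: the monitoring subsystem
# itself misjudges healthy targets as unhealthy, so resolution fails
# even though the backends are actually fine.
run_health_checks(lambda ip: False)
try:
    resolve("dynamodb.us-east-1.example")
except LookupError as e:
    print("resolution failed:", e)
```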
Why Did This Subsystem Trigger a Global Catastrophe?
Problem 1: The Monitor Itself Needs Monitoring (Who Watches the Watchers?)
Normal Architecture:

```
NLB Health Monitoring → Monitors → DynamoDB/EC2/Lambda
```

Paradox in This Incident:

```
NLB Health Monitoring itself failed
↓
But no "upper-level monitoring system" detected this failure
↓
It continued operating, but with incorrect judgments
↓
Marked all "Healthy" services as "Unhealthy"
↓
Triggered global disaster
```
Fundamental Problem: AWS lacked a "Meta-Monitoring" mechanism that could promptly detect anomalies in the monitoring system itself.
Problem 2: Insufficient Control Plane vs Data Plane Isolation
Ideal Architecture:

```
Control Plane: Handles management, monitoring, configuration
- On failure, should only affect "new operations" (like new EC2 launches)
- Should NOT affect "running services" (like running EC2 instances)

Data Plane: Handles actual traffic processing
- Should continue operating when control plane fails
```

Reality in AWS Incident:

```
NLB Health Monitoring (Control Plane) failure
↓
Affected running EC2 instance network connectivity (Data Plane)
↓
Violated "plane isolation" principle
```
Result: Not only could new EC2 instances not launch,
but running services also lost network connectivity.
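The isolation principle can be illustrated with a toy model (class names and config shape are hypothetical, not AWS internals): a well-isolated data plane keeps forwarding traffic on its last-known-good configuration when the control plane stops answering, instead of tearing down existing connectivity.

```python
from dataclasses import dataclass, field

@dataclass
class ControlPlane:
    """Pushes routing config; may fail."""
    healthy: bool = True

    def fetch_routes(self) -> dict[str, list[str]]:
        if not self.healthy:
            raise RuntimeError("control plane unavailable")
        return {"service-a": ["10.0.1.10", "10.0.1.11"]}

@dataclass
class DataPlane:
    """Forwards traffic using the last config it successfully received."""
    last_known_good: dict[str, list[str]] = field(default_factory=dict)

    def refresh(self, control: ControlPlane) -> None:
        try:
            self.last_known_good = control.fetch_routes()
        except RuntimeError:
            # Key isolation property: on control-plane failure we keep
            # serving with stale-but-valid routes instead of dropping traffic.
            pass

    def route(self, service: str) -> list[str]:
        return self.last_known_good.get(service, [])

control = ControlPlane()
data = DataPlane()
data.refresh(control)            # normal operation: config propagates
control.healthy = False
data.refresh(control)            # control plane down: config goes stale...
print(data.route("service-a"))   # ...but existing traffic still routes
```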
Problem 3: Single Point of Dependency & Lack of Fault Tolerance
Architectural Weak Points:
1. Global Services Depend on a Single Region (US-EAST-1)
   - IAM, STS (global authentication) primary endpoints
   - DynamoDB Global Tables coordination center
2. NLB Health Monitoring System Lacks Redundancy
   - No backup monitoring system
   - Single subsystem failure triggers cascade disaster
3. Over-Coupled Service Dependencies
   - Why does EC2 launch need to depend on DynamoDB?
   - Why is Lambda affected by NLB health checks?
   - Lacks a "Degraded Mode" operation mechanism
Cascade Failure Mechanism Explained
Dual Cascade Reaction Paths
Path 1: DNS → DynamoDB → EC2 → Upper-Layer Services
```mermaid
graph LR
    A[NLB Health<br/>Monitoring Failure] --> B[DynamoDB Endpoints<br/>Misjudged as Unhealthy]
    B --> C[DNS Resolution<br/>Failure]
    C --> D[EC2 Instance<br/>Launch Failure]
    D --> E[Lambda Execution<br/>Environment Cannot Create]
    D --> F[ECS Containers<br/>Cannot Start]
    D --> G[Glue Jobs<br/>Cannot Execute]
    style A fill:#ff6b6b
    style B fill:#ffa500
    style C fill:#ffa500
    style D fill:#ffa500
```
Affected Services: ECS, Glue, RDS, Redshift, EMR,
SageMaker (all services depending on EC2)
Path 2: NLB Health Check → Network Connectivity → Global Services
```mermaid
graph LR
    A[NLB Health<br/>Monitoring Failure] --> B[NLB Health Checks<br/>Comprehensively Failed]
    B --> C[Network Connectivity<br/>Lost]
    C --> D[Lambda<br/>Invocation Failure]
    C --> E[CloudWatch<br/>Cannot Collect Metrics]
    C --> F[API Gateway<br/>Cannot Route Requests]
    style A fill:#ff6b6b
    style B fill:#ff6b6b
    style C fill:#ff0000
```
Affected Services: Lambda, CloudWatch, API Gateway,
Step Functions, EventBridge
Why Did Recovery Take So Long?
Phase 1: DNS Fix (2.5 hours)
- Relatively fast, as it was a surface-level issue
- But fixing it didn’t resolve the problem
Phase 2: Cascade Failure Handling (7.2 hours)
- Needed to identify the NLB Health Monitoring failure (took 6.5 hours)
- Fix the NLB Health Check system (took 0.7 hours)
Phase 3: Gradual Service Recovery (5.4 hours)
- Why not immediately recover at full speed? AWS adopted a "throttling strategy"
Throttling Reasons:
1. Avoid the Thundering Herd Problem
   - If all services restarted simultaneously, it would put enormous pressure on the infrastructure
   - Could trigger a "secondary collapse"
2. Ensure Stability Over Speed
   - Gradually lift throttles (EC2 launches, Lambda invocations, SQS polling)
   - Confirm each phase is stable before proceeding to the next
3. Process Backlog Tasks
   - CloudTrail, EventBridge, Connect had massive backlogs
   - Needed time to gradually clear
Conclusion: AWS chose a "Slow is Smooth, Smooth is Fast" recovery strategy, sacrificing speed for stability.
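The throttling idea can be sketched as a stepwise admission-rate ramp gated on a stability signal. This is a hypothetical illustration of the pattern, not AWS's actual tooling; the function and parameter names are invented.

```python
import time

def ramp_up_throttle(start_rate: float, target_rate: float,
                     step_fraction: float = 0.25,
                     pause_s: float = 1.0,
                     is_stable=lambda: True) -> float:
    """Gradually raise an admission-rate limit back toward its normal level.

    Each step proceeds only if the system still looks stable; otherwise the
    ramp holds at the current rate instead of jumping straight to 100%.
    (In production the pauses would be minutes, not seconds.)
    """
    rate = start_rate
    while rate < target_rate:
        if not is_stable():
            time.sleep(pause_s)  # hold until error rates / queue depths recover
            continue
        rate = min(target_rate, rate + target_rate * step_fraction)
        print(f"admitting up to {rate:.0f} launches/min")
        time.sleep(pause_s)      # let backlogged work drain before the next step
    return rate

# Example: reopen EC2 launch throttles from 10% of normal back to 100%.
ramp_up_throttle(start_rate=100, target_rate=1000)
```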
List of 142 Affected Services
Core Infrastructure Services
- Compute: EC2, Lambda, ECS, EKS, Fargate, Batch
- Storage: S3, EBS, EFS, FSx
- Database: DynamoDB, RDS, Aurora, Redshift, ElastiCache, DocumentDB, Neptune
- Networking: VPC, ELB (ALB/NLB/CLB), CloudFront, Route 53, API Gateway, Direct Connect, VPN
Developer Tools
- CI/CD: CodeBuild, CodePipeline, CodeDeploy
- Monitoring: CloudWatch, X-Ray, CloudTrail
- Management: Systems Manager, CloudFormation,
Config, OpsWorks
Application Services
- Messaging: SQS, SNS, EventBridge, Kinesis, MQ
- Workflow: Step Functions, SWF, Glue
- AI/ML: SageMaker, Bedrock, Comprehend, Rekognition,
Polly, Transcribe, Translate
Enterprise Services
- Authentication: IAM, Cognito, STS, IAM Identity Center
- Security: Secrets Manager, WAF, Security Hub, GuardDuty, Firewall Manager
- Analytics: Athena, EMR, QuickSight, DataZone, Lake Formation
Customer Services
- Communications: Connect, Chime, WorkMail, Pinpoint, SES
- End User: WorkSpaces, AppStream 2.0, WorkSpaces Thin Client
This catalog of affected services, drawn from the full list of 142, demonstrates the US-EAST-1 region's importance to AWS's global architecture.
Post-Mortem Analysis: AWS Final Report's Root Cause Correction
Real-Time Judgment vs. Complete Investigation Report
During the incident, AWS engineers went through a process of
“gradually approaching the truth.” However, in the complete Post-Event
Summary (published October 23, 2025), AWS, based on in-depth
post-incident Root Cause Analysis (RCA), repositioned the role of each
component in this incident.
Judgment During Real-Time Updates (Incident in Progress)
02:01 PDT – AWS first proposed a hypothesis: >
“We have identified a potential root cause for error
rates for the DynamoDB APIs… The issue appears to be related to
DNS resolution.”
Key word: Used “potential”,
indicating not fully confirmed.
08:43 PDT – AWS confirmed judgment: > “We have
narrowed down the source… The root cause is an
underlying internal subsystem responsible for monitoring the health of
our network load balancers.”
Key word: No longer using “potential”, but
explicitly stating “The root cause is”.
Conclusion in Final Report (Post-Incident Deep Investigation)
In the complete Post-Event Summary, AWS provided a different
conclusion:
“The root cause of this issue was a latent
race condition in the DynamoDB DNS management system that
resulted in an incorrect empty DNS record for the service’s regional
endpoint.”
Key difference: Root cause changed from “NLB Health
Monitoring System” to “DynamoDB DNS race condition”.
Why This Difference?
Reason 1: Different Goals Between Immediate Recovery and Root Cause Analysis
Immediate Recovery Phase (incident in progress):
- Goal: Restore service as quickly as possible
- Focus: "What fix will restore service?"
- Conclusion: Fix NLB Health Monitoring → Service restored → Therefore identified as root cause

Post-Incident Investigation Phase (Post-Event Summary):
- Goal: Find out "why it happened"
- Focus: "If X hadn't occurred, would the entire incident not have happened?"
- Conclusion: Without the DNS race condition, NLB wouldn't have started misidentifying → The DNS race condition is the starting point
Reason 2: Complexity of the Cause Chain
Both conclusions have validity, but observe different layers:
Real-Time Judgment Logic (bottom-up):

```
Observed phenomenon: All services failing
↓
Found recovery key: Fix NLB Health Monitoring
↓
Conclusion: NLB Health Monitoring is root cause
```

Post-Incident Analysis Logic (top-down):

```
Starting point: DynamoDB DNS race condition
↓
Triggered: DynamoDB endpoint unresolvable
↓
Caused: EC2 instance launch failures (dependent on DynamoDB)
↓
Caused: Network configuration propagation delays
↓
Caused: NLB health checks started misidentifying
↓
Amplified: Repeated remove/restore cycles
↓
Result: Service unstable for 14.5 hours
```
Reason 3: Different Definitions of "Root Cause"
In incident analysis, “root cause” has two common definitions:
Definition A: Critical Path
- "Removing which problem restores service fastest?"
- Real-time judgment used this definition → NLB Health Monitoring

Definition B: First Cause
- "If we could go back in time, preventing which problem would prevent the entire incident?"
- Post-incident analysis used this definition → DynamoDB DNS race condition
Repositioning Component Roles
Based on AWS’s final report, we need to re-understand each
component’s role:
| Component | Role in Real-Time Judgment | Role in Final Report | Technical Position |
|---|---|---|---|
| DynamoDB DNS race condition | Surface trigger | ✅ Root cause | First Cause |
| NLB Health Monitoring System | ✅ Root cause | Secondary impact and amplifier | Critical Path |
| EC2 instance launch failures | Secondary impact | Direct impact | Intermediate Effect |
| 142 service outages | Final result | Final result | Final Impact |
Complete Correct Cause Chain
According to AWS’s final report, the complete technical cause chain
should be:
```
【Starting Point】DynamoDB DNS race condition
↓
Two independent DNS Enactors interacted with race condition
↓
One Enactor applied old DNS plan
Another Enactor simultaneously deleted the plan
↓
DynamoDB regional endpoint DNS record became empty
↓
【Direct Impact】DynamoDB endpoint unresolvable
↓
EC2 instance launch dependent on DynamoDB → Launch failures
↓
Newly launched EC2 instances' network configuration propagation delayed
↓
【Secondary Impact】NLB health check subsystem affected
↓
NLB performed health checks on network-delayed EC2 instances
↓
Health checks failed (but instances were actually healthy)
↓
【Amplification Mechanism】Repeated remove/restore cycles
↓
NLB removed "unhealthy" instances (deleted from DNS)
↓
Next check found instances ready → Check succeeded
↓
NLB restored instances (added back to DNS)
↓
Continuously repeated this cycle, increasing system load
↓
Health check system itself degraded performance
↓
Triggered automatic AZ DNS failover
↓
【Final Result】Service unstable for 14.5 hours
```
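The "two Enactors" race at the top of this chain can be made concrete with a deliberately simplified, single-process sketch. The function names, plan IDs, and hostname here are hypothetical; the point is only the check-then-act pattern in which a delayed writer acting on an already-deleted plan leaves the endpoint's DNS record empty.

```python
# Hypothetical, highly simplified model of the race AWS describes: two
# independent "DNS Enactors" apply and clean up DNS plans for one endpoint.
dns_records = {"dynamodb.region.example": ["10.0.0.1"]}  # current answer
plans = {1: ["10.0.0.1"], 2: ["10.0.0.2"]}               # plan id -> IPs

def apply_plan(plan_id: int) -> None:
    """Enactor step: write the plan's IPs as the endpoint's DNS record."""
    dns_records["dynamodb.region.example"] = plans.get(plan_id, [])

def cleanup_stale_plans(newest_applied: int) -> None:
    """Enactor step: delete plans older than the newest one it applied."""
    for pid in [p for p in plans if p < newest_applied]:
        del plans[pid]

# Interleaving that reproduces the failure mode:
# Enactor B applies the newer plan 2, then cleans up plan 1...
apply_plan(2)
cleanup_stale_plans(2)
# ...while a delayed Enactor A, still working on the old plan 1, applies it
# *after* the cleanup. Plan 1 no longer exists, so the record becomes empty.
apply_plan(1)

print(dns_records)  # {'dynamodb.region.example': []} -> empty DNS answer
```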
Reinterpreting Time Evidence
Let’s re-examine key time points from the “complete cause chain”
perspective:
| Time (PDT) | Event | Original Interpretation | Correct Interpretation |
|---|---|---|---|
| 23:48 | Incident began | DNS issue triggered | ✅ DNS race condition occurred (root cause) |
| 02:24 | DNS issue resolved | Surface fix | ✅ Fixed starting problem, but chain reaction already initiated |
| 02:24 – 09:38 | Disaster continued 7.2 hours | Proves DNS isn’t root cause | ❌ Wrong inference: This time is when secondary impacts and amplification mechanisms were active |
| 09:38 | NLB Health Check restored | True root cause fixed | ✅ Critical path fixed, but not root cause |
| 14:20 | Fully recovered | All problems resolved | ✅ Backlog processing completed |
Key Insight:
- The fact that service was not restored after the DNS fix doesn't prove that "DNS isn't the root cause"
- Rather, because the chain reaction had already started, the secondary problems it initiated (NLB misidentification) still required separate handling even after the starting cause was removed
- Analogy: A forest fire is started by lightning, but eliminating the lightning cannot extinguish the already burning fire
AWS Engineers' Diagnostic Evolution
Understanding this evolution process is very valuable for learning
incident investigation:
Phase 1 (00:11 – 02:01): Discovered surface symptoms
- Observation: DynamoDB error rates increasing
- Hypothesis: Possibly a DNS issue
- Status: Symptom-level analysis

Phase 2 (02:01 – 02:24): Attempted to fix the hypothesized cause
- Action: Fixed the DNS issue
- Expectation: Service should restore
- Result: ❌ Service not restored

Phase 3 (02:24 – 08:43): Reinvestigated
- Discovery: Problem more complex than imagined
- Deeper: EC2 internal network issues
- Progress: Gradually narrowed scope

Phase 4 (08:43): Found the critical path
- Breakthrough: NLB Health Monitoring System failure
- Judgment: This is the root cause (real-time judgment)
- Reason: Fixing it restored service

Phase 5 (Post-incident, 10/23): Complete RCA
- Traced back: Why did NLB start misidentifying?
- Discovery: The DNS race condition was the starting point
- Correction: The DynamoDB DNS race condition is the root cause (final conclusion)
This Article's Documentary Value
Important Note: This article preserves the evolution
of AWS’s real-time judgment, which is very valuable learning
material.
Why Preserve Real-Time Judgment?
1. Authenticity: Reflects the true situation during the incident
2. Educational Value: Demonstrates the complexity of investigating large-scale distributed system failures
3. SRE Practice: Explains why immediate recovery and post-incident RCA conclusions differ
4. Historical Record: The evolution of AWS's update content is itself important data
How to Read This Article Correctly:
- "Root Cause Technical Anatomy" section: Records the judgment at the time (real-time analysis)
- "Post-Mortem Analysis" section (this section): Supplements the final report's conclusion (complete RCA)
- Combined: Complete understanding from immediate recovery to deep analysis
Key Learning Points
Learning Point 1: Immediate Recovery ≠ Root Cause Analysis
Immediate Recovery:
- Goal: Restore service quickly
- Method: Find the Critical Path
- Problem: May misjudge the starting cause

Root Cause Analysis:
- Goal: Prevent recurrence
- Method: Find the First Cause
- Requires: Post-incident calm, deep investigation
Learning Point 2: Inertia of Chain Reactions
Chain reactions in complex systems have "inertia":
- Even after removing the starting cause, the chain reaction may not stop automatically
- Intervention is required at each critical node of the chain reaction
- This is also why service wasn't restored after the DNS fix
Learning Point 3: Both Types of "Root Cause" Are Important
From a practical perspective:
- First Cause (DynamoDB DNS race condition): Needs fixing to prevent recurrence
- Critical Path (NLB Health Monitoring): Needs strengthening to shorten recovery time
Both require improvement measures!
References
- AWS Official Post-Event Summary: https://aws.amazon.com/message/101925/ (2025-10-23)
- This Article's "Root Cause Technical Anatomy" Section: Records the real-time judgment evolution during the incident
- Data Verification Date: 2025-10-24
Classic Scenario: When the Watcher Fails
Philosophical Question: Who Watches the Watchers?
The Latin proverb “Quis custodiet ipsos custodes?” found real-world
validation in this incident:
Traditional Monitoring Architecture:

```
Monitoring System → Monitors → Business System
```

Problem: When the monitoring system itself fails, who discovers it?

Manifestation in the AWS Incident:

```
NLB Health Monitoring System
↓ (Failed but undiscovered)
Continues operating, but with incorrect judgments
↓
Marks healthy services as abnormal
↓
Triggers global disaster
↓
AWS engineers spend 6.5 hours discovering the monitoring system itself has issues
```
Solution: Meta-Monitoring Architecture
Improved Architecture:

```
Meta-Monitoring Layer (monitors the monitoring system)
↓
Monitoring System Layer (monitors business systems)
↓
Business System Layer (actual services)
```
Implementation:
- Independent Meta-Monitoring System
  - Monitoring Target: Is the NLB Health Monitoring System behaving normally?
  - Detection Metrics:
    - Health check request success rate
    - Reasonableness of judgment results (e.g., sudden mass marking as unhealthy)
    - System response time
- Anomaly Detection

  ```python
  # Pseudocode example
  if (unhealthy_targets / total_targets) > 0.5:
      # Over 50% of targets marked unhealthy is highly unreasonable
      alert("NLB Health Monitoring System may be failing")
      trigger_failover_to_backup_monitoring_system()
  ```

- Redundant Monitoring Systems (see the sketch below)
  - Primary monitoring system + backup monitoring system
  - Both use different technology stacks (to avoid common-mode failure)
  - Auto-switch when the primary system behaves abnormally
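Extending the redundancy item, here is a sketch of how two independently implemented monitoring systems could be cross-checked (names and thresholds are invented for illustration): a large disagreement between them is itself a strong hint that a monitor, rather than the fleet, is at fault.

```python
def disagreement(verdicts_a: dict[str, bool], verdicts_b: dict[str, bool]) -> float:
    """Fraction of shared targets on which the two monitoring systems disagree."""
    keys = verdicts_a.keys() & verdicts_b.keys()
    if not keys:
        return 0.0
    return sum(verdicts_a[k] != verdicts_b[k] for k in keys) / len(keys)

primary = {"i-1": False, "i-2": False, "i-3": False}  # primary says all unhealthy
backup  = {"i-1": True,  "i-2": True,  "i-3": False}  # backup mostly disagrees

if disagreement(primary, backup) > 0.3:
    # Large disagreement suggests a monitor fault, not a fleet-wide outage:
    # page a human and stop trusting the primary's "unhealthy" verdicts.
    print("ALERT: monitoring systems disagree; suspect monitor failure")
```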
Similar Historical Cases
Google Cloud 2019-06-02 Incident
- Root Cause: Network configuration error causing traffic routing failure
- Similarity: Control plane issues affecting the data plane
- Duration: 4.5 hours
- Lesson: Google subsequently implemented a "gradual rollout of configuration changes" mechanism
Azure 2018-09-04 Incident
- Root Cause: DNS configuration error
- Similarity: DNS issues triggering cascade failures
- Duration: 24 hours
- Lesson: Microsoft strengthened DNS system redundancy
Cloudflare 2019-07-02 Incident
- Root Cause: WAF rule deployment error
- Similarity: The monitoring/protection system became the attacker
- Duration: 3.5 hours
- Lesson: Cloudflare implemented stricter configuration validation
AWS Incident's Uniqueness:
- Longest duration (15 hours)
- Most affected services (142)
- Deepest root cause (the monitoring system itself failed)
- Most conservative recovery strategy (extensive use of throttling)
Key Findings & Technical Insights
Finding 1: US-EAST-1's "Special Status" is a Systemic Risk
Why is US-EAST-1 So Critical?
- Historical Reasons
  - AWS's first region, launched in 2006
  - Bears massive legacy architecture and technical debt
- Global Service Dependencies
  - IAM, STS global authentication primary endpoints
  - DynamoDB Global Tables coordination center
  - CloudFront configuration management center
- Customer Inertia
  - Earliest customers concentrated here
  - Migration costs extremely high (data, network latency, compliance)
Risk Analysis:

```
US-EAST-1 failure → Global services affected
→ Other regions cannot operate fully independently
→ Violates "regional independence" design principle
```
Finding 2: Control Plane & Data Plane Coupling is a Fatal Flaw
Theory vs Reality:
| Design Principle | AWS Promise | Reality in Incident |
|---|---|---|
| Control plane failure doesn’t affect data plane | ✅ Claims isolation | ❌ NLB Health Check failure affected running services |
| Regional independence | ✅ Each region operates independently | ❌ Other regions depend on US-EAST-1 global services |
| Degraded operation capability | ✅ Core services should be degradable | ❌ EC2 cannot launch, no degradation mechanism |
Improvement Directions:
- EC2 should have a "no-DynamoDB launch mode" (degraded but usable; see the sketch below)
- Global services should truly implement multi-region active-standby
- Monitoring system failure shouldn't affect established network connections
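A minimal sketch of what the first item's degraded mode could look like (entirely hypothetical; AWS has not described EC2 internals this way): the launch path falls back to locally cached metadata when its DynamoDB dependency is unreachable, trading freshness for availability.

```python
import json
import pathlib

CACHE = pathlib.Path("launch_metadata_cache.json")  # hypothetical local cache

class DependencyUnavailable(Exception):
    pass

def fetch_metadata_from_dynamodb(instance_type: str) -> dict:
    # Stand-in for the real dependency; here it always fails, as if the
    # regional DynamoDB endpoint could not be resolved.
    raise DependencyUnavailable("DynamoDB endpoint unresolvable")

def launch_instance(instance_type: str) -> dict:
    """Prefer fresh metadata, but degrade to the local cache instead of failing."""
    try:
        metadata = fetch_metadata_from_dynamodb(instance_type)
        CACHE.write_text(json.dumps(metadata))       # refresh cache on success
    except DependencyUnavailable:
        if not CACHE.exists():
            raise                                    # nothing to degrade to
        metadata = json.loads(CACHE.read_text())     # degraded but usable
        metadata["degraded"] = True
    return {"instance_type": instance_type, **metadata}

# Demo: seed the cache as if an earlier successful launch had refreshed it.
CACHE.write_text(json.dumps({"ami": "ami-12345678", "placement": "use1-az1"}))
print(launch_instance("m5.large"))  # served from cache, flagged as degraded
```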
Finding 3: Progressive Root Cause Analysis is a Double-Edged Sword
Advantages:
- Avoids hasty conclusions leading to wrong decisions
- Layer-by-layer investigation ensures no issues are missed
Disadvantages:
- Too time-consuming (6.5 hours to find the true root cause)
- Initial misjudgment (DNS) may mislead customers
Best Practice:
Parallel Investigation + Transparent Communication
Simultaneously:
1. Surface symptom mitigation (DNS fix)
2. Deep root cause investigation (NLB Health Monitoring)
3. External communication clearly distinguishes "symptom" vs "root cause"
Finding 4: Throttling Strategy Reflects Mature SRE Culture
AWS chose to sacrifice recovery speed for stability, demonstrating
“reliability over speed” core values:
Throttling Implementation:
- EC2 instance launch throttling
- Lambda invocation throttling
- SQS polling rate limiting
- Asynchronous operations throttling
Philosophy: Slow is Smooth, Smooth is Fast
- Rapid recovery might trigger a secondary collapse (Thundering Herd)
- Gradual recovery ensures mistakes are not repeated

Results:
- No "secondary collapse"
- Recovery process stable and controlled
- But total duration extended to 15 hours
Series Article Preview
This article is the first in the AWS US-EAST-1 Major Incident
Series Analysis, providing a complete record of the incident’s
timeline and technical root causes.
Want to know:
- How did AWS handle this crisis? What communication strategies are worth learning?
- How can enterprises learn from this to design fault-tolerant architectures and multi-cloud strategies?
- What immediate actions should technology decision-makers take?
- How can SRE teams replicate AWS's excellent practices while avoiding similar mistakes?
Read the second article in the series: AWS
Architecture Resilience Deep Dive: Multi-AZ/Multi-Region Design
Principles from US-EAST-1 Outage
References
- AWS Official Sources
  - AWS Health Dashboard – US-EAST-1 Operational Issue (2025-10-20)
  - AWS Service Health Dashboard incident records (complete 21 status updates)
- Media Coverage
  - CNN Business: "AWS global outage, Amazon, Snapchat, Roblox and Fortnite down" (2025-10-20)
  - NPR: "Outage at Amazon Web Services disrupts websites across the internet" (2025-10-20)
  - TechRadar: "Amazon fixes huge AWS outage that broke much of the internet" (2025-10-20)
  - Bloomberg: "AWS Outage: Amazon Cloud Restored; Hit Snapchat, Roblox, Robinhood" (2025-10-20)
- Technical Community Analysis
  - DEV Community: "The AWS Outage on Oct 20, 2025: What Broke, Why It Felt Global, and How Amazon Stabilized It"
  - DEV Community: "The Great AWS Outage of October 2025: When the Internet's Backbone Buckled"
  - GeekWire: "AWS outage was not due to a cyberattack — but shows potential for 'far worse' damage"
- Expert Commentary
  - Catchpoint CEO: Estimated total loss in the billions of dollars
  - Betsy Cooper (Cybersecurity Expert): "This highlights the fragility of depending on a few companies"
📌 About This Series
This is a classic case in cloud service history worth deep study by
all technology decision-makers, architects, and SRE engineers. We’ll
provide comprehensive analysis of the incident and practical insights
through two series articles.
Article 1 (This Article): Complete incident record and technical anatomy
Article 2: AWS Architecture Resilience Deep Dive: Multi-AZ/Multi-Region Design Principles from US-EAST-1 Outage