This article analyzes key considerations in cloud architecture design based on the AWS US-EAST-1 service interruption on October 20, 2025, providing Multi-AZ and Multi-Region design principles, health check frameworks, monitoring strategies, and cost-benefit analysis.
💡 Incident Background: AWS US-EAST-1 Major Incident Complete Timeline: 15-Hour Global Service Disruption and Technical Analysis
📚 Incident Sources:
– The Register: Major AWS outage across US-East region
– AWS Health Dashboard Official Report
– Root Cause: DynamoDB DNS resolution failure + NLB health check system degradation
🏗️ Architecture Reference Standards:
– AWS Well-Architected Framework – Reliability Pillar
– REL10-BP01: Deploy workload to multiple locations
🎯 Cost-Conscious Approach: This article provides pragmatic, cost-effective recommendations suitable for businesses of all sizes, emphasizing that 2 AZs are sufficient for most enterprises.
Key Architecture Considerations from the Incident
1. Region Dependency Assessment
Observation: Many enterprises deploy all resources in a single Availability Zone within a single Region, leaving no independent failure domain.
Design Consideration: Misunderstanding probabilistic availability as deterministic availability. When an AZ experiences catastrophic failure (such as power system issues), all resources become unavailable simultaneously.
Recommended Approach: High-availability architecture must be based on at least two independent failure domains. Multi-AZ is the baseline, not an option.
2. Monitoring System Independence
Observation: CloudWatch and the applications it monitors run in the same Region/AZ, creating an observer-observable coupling problem.
Design Consideration: When the Regional control plane fails, not only do applications become unavailable, but the monitoring system goes down with them, leaving you “blind.”
Recommended Approach: Monitoring systems must maintain complete independence from monitored systems, including:
– Independent network paths
– Independent authentication systems
– Independent data storage
– Cross-Region or multi-cloud monitoring backup
3. Health Check Mechanism Design
Observation: Many systems only implement basic HTTP 200 checks without distinguishing between startup, readiness, and liveness states.
Design Consideration: When backend services fail, load balancers cannot make accurate routing decisions, causing cascading failures.
Recommended Approach: Implement three-tier health checks:
– Startup Probe: Container initialization completion check
– Readiness Probe: Service ready to accept traffic check (including dependency availability)
– Liveness Probe: Service process health check
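The sketch below shows one way to expose these three probes, using Flask; the `check_database()` dependency check is a hypothetical placeholder you would replace with your own cheap, short-timeout checks.

```python
# Minimal sketch of three-tier health endpoints (Flask).
from flask import Flask, jsonify

app = Flask(__name__)
STARTUP_COMPLETE = {"done": False}  # flip to True after migrations / cache warm-up

def check_database() -> bool:
    # Hypothetical placeholder: issue a cheap query (e.g., SELECT 1) with a short timeout.
    return True

@app.route("/healthz/startup")
def startup():
    # Startup probe: has one-time initialization finished?
    ok = STARTUP_COMPLETE["done"]
    return jsonify(status="ok" if ok else "starting"), (200 if ok else 503)

@app.route("/healthz/ready")
def ready():
    # Readiness probe: can this instance accept traffic right now, dependencies included?
    ok = STARTUP_COMPLETE["done"] and check_database()
    return jsonify(status="ready" if ok else "not-ready"), (200 if ok else 503)

@app.route("/healthz/live")
def live():
    # Liveness probe: is the process itself healthy? Keep it cheap and dependency-free,
    # so a downstream outage does not trigger unnecessary restarts.
    return jsonify(status="alive"), 200
```

Point the load balancer at the readiness endpoint; wiring it to liveness alone is what turns a dependency hiccup into a cascading failure.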
4. Disaster Recovery Validation
Observation: Many enterprises have DR plans on paper but never conduct actual failover drills.
Design Consideration: RTO (Recovery Time Objective) and RPO (Recovery Point Objective) in plans are theoretical values; actual recovery time could be 5-10 times longer.
Recommended Approach: Monthly or quarterly disaster recovery drills must include:
– Complete Region failover procedures
– Data consistency verification
– Performance testing after recovery
– Rollback plan verification
5. Cost-Benefit Balance
Observation: To reduce costs, some enterprises choose single-AZ deployment or minimal redundancy configurations.
Design Consideration: Lack of quantitative risk assessment model to evaluate “cost savings vs. outage loss” trade-off.
Recommended Approach: Establish architecture ROI model to quantify redundancy costs and outage risks, making decisions based on business impact rather than pure cost reduction.
Enterprise Architecture Health Check Framework
I recommend conducting architecture health checks using a five-level assessment framework, covering everything from basic availability to cost-effectiveness:
Level 1: AZ Distribution Check (Baseline)
Assessment Criteria:
– Are critical services deployed across at least two AZs?
– Can each AZ independently handle 50%+ traffic?
– Are AZ-level failover mechanisms automated?
Cost-Conscious Checklist:
| Component | Health Standard | Risk Level | Pragmatic Cost Advice |
|---|---|---|---|
| Compute | At least 2 AZs | 🔴 High | 2 AZs sufficient for small businesses |
| Database | Multi-AZ enabled | 🔴 High | Multi-AZ costs +100%, evaluate necessity |
| Load Balancer | Auto cross-AZ | 🟡 Medium | Low cost impact, recommended |
| Storage | S3 or EBS snapshots | 🟡 Medium | S3 free, EBS snapshots low-cost |
| Network | Subnets across AZs | 🟢 Low | Minimal additional cost |
Common Issues:
– Database Primary-Standby in the same AZ
– Load Balancer configured for single AZ only
– EBS volumes without cross-AZ snapshot strategy
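As a starting point, the sketch below uses boto3 to flag two of these issues automatically: RDS instances without Multi-AZ, and Auto Scaling groups confined to a single AZ. The region name is an assumption, and pagination is omitted for brevity.

```python
# Minimal sketch: audit Multi-AZ coverage with boto3 (region is an assumption).
import boto3

REGION = "us-east-1"  # adjust to your deployment region

def audit_rds_multi_az():
    rds = boto3.client("rds", region_name=REGION)
    for db in rds.describe_db_instances()["DBInstances"]:
        if not db["MultiAZ"]:
            print(f"[HIGH] RDS instance {db['DBInstanceIdentifier']} is single-AZ")

def audit_asg_az_spread():
    asg = boto3.client("autoscaling", region_name=REGION)
    for group in asg.describe_auto_scaling_groups()["AutoScalingGroups"]:
        azs = group["AvailabilityZones"]
        if len(azs) < 2:
            print(f"[HIGH] ASG {group['AutoScalingGroupName']} spans only {azs}")

if __name__ == "__main__":
    audit_rds_multi_az()
    audit_asg_az_spread()
```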
Architecture Decision (Pragmatic Approach):
Multi-AZ is not just “resource distribution” but requires consideration of:
– Network topology: Avoid single NAT Gateway or Internet Gateway as bottleneck
– Data replication: Choose synchronous vs. asynchronous replication based on RPO requirements
– Failover strategy: Active-Active vs. Active-Passive, depends on application statefulness
– 🎯 Cost Reality: 2 AZs provide 80% of the benefit at 35% additional cost; 3 AZs add another 30% cost for only 20% additional uptime improvement
Level 2: Region-Level Disaster Recovery (Advanced)
Assessment Criteria:
– Is there a standby Region available?
– Is Region failover tested quarterly?
– Does data replication support cross-Region?
Four Architecture Patterns:
1. Backup & Restore (Lowest cost, Highest RTO)
– Suitable for: Non-critical systems, acceptable RTO 4-12 hours
– Architecture characteristics: Periodic snapshots to S3, restore on demand
– Cost ratio: 1x
– RTO/RPO: 4-12 hours / 1-24 hours
2. Pilot Light (Moderate cost, Moderate RTO)
– Suitable for: Medium-criticality systems, RTO within 1 hour
– Architecture characteristics: Minimal infrastructure running in the standby Region, scaled up on demand
– Cost ratio: 1.5-2x
– RTO/RPO: 10-60 minutes / 5-60 minutes
3. Warm Standby (Higher cost, Lower RTO)
– Suitable for: High-criticality systems, RTO within 10 minutes
– Architecture characteristics: Reduced capacity running in the standby Region, scaled up during failover
– Cost ratio: 2-3x
– RTO/RPO: 5-15 minutes / 1-5 minutes
4. Active-Active (Highest cost, Lowest RTO)
– Suitable for: Mission-critical systems, near-zero RTO
– Architecture characteristics: Full capacity running in both Regions, traffic distributed via Global Accelerator or Route 53
– Cost ratio: 2.5-4x
– RTO/RPO: < 1 minute / Near-zero
Selection Decision Matrix:
– Financial trading systems: Active-Active required
– E-commerce platforms: Warm Standby minimum
– Internal management systems: Pilot Light acceptable
– Archive systems: Backup & Restore sufficient
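Purely as an illustration of this matrix, the snippet below maps a target RTO to one of the four patterns, using the rough thresholds listed above; the thresholds are this article's illustrative figures, not official AWS guidance.

```python
# Illustrative sketch: map a target RTO (minutes) to a DR pattern,
# using the rough thresholds from the four patterns above.
def choose_dr_pattern(rto_minutes: float) -> str:
    if rto_minutes < 1:
        return "Active-Active"       # near-zero RTO, 2.5-4x cost
    if rto_minutes <= 15:
        return "Warm Standby"        # 5-15 min RTO, 2-3x cost
    if rto_minutes <= 60:
        return "Pilot Light"         # 10-60 min RTO, 1.5-2x cost
    return "Backup & Restore"        # 4-12 hour RTO, 1x cost

# Example: the decision matrix above
for system, rto in [("Financial trading", 0.5), ("E-commerce", 10),
                    ("Internal management", 45), ("Archive", 480)]:
    print(f"{system}: {choose_dr_pattern(rto)}")
```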
Level 3: Monitoring Independence (Critical)
Assessment Criteria:
– Are monitoring systems deployed independently from application systems?
– Are there out-of-band monitoring channels?
– Can you detect failures when cloud provider control plane is down?
Three-Layer Monitoring Strategy:
1. In-Region Monitoring (First Layer)
– CloudWatch Metrics & Alarms
– Application Performance Monitoring (APM)
– Log aggregation and analysis
2. Cross-Region Monitoring (Second Layer)
– Monitoring infrastructure running in an independent Region
– Active synthetic testing from different geolocations
– Cross-Region log backup
3. Out-of-Cloud Monitoring (Third Layer)
– Third-party monitoring services (Datadog, New Relic, Pingdom)
– On-premises monitoring backup
– Multi-cloud monitoring (AWS + Azure/GCP)
Recommended Approach: Monitoring system availability must be at least one order of magnitude higher than that of the application it watches. If the application SLA is 99.9%, the monitoring system must reach 99.99% or better.
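A minimal sketch of a second-layer check, assuming a hypothetical endpoint URL: probe the application from a different Region and record the result in that Region's CloudWatch, so detection never depends on the impaired primary Region.

```python
# Minimal sketch of a cross-Region synthetic check. Endpoint and namespace are assumptions.
import time
import boto3
import requests

ENDPOINT = "https://app.example.com/healthz/ready"   # hypothetical endpoint
MONITOR_REGION = "us-west-2"                          # independent from us-east-1

def run_synthetic_check():
    start = time.monotonic()
    try:
        ok = requests.get(ENDPOINT, timeout=5).status_code == 200
    except requests.RequestException:
        ok = False
    latency_ms = (time.monotonic() - start) * 1000

    # Publish the result to CloudWatch in the monitoring Region, not the primary Region.
    cloudwatch = boto3.client("cloudwatch", region_name=MONITOR_REGION)
    cloudwatch.put_metric_data(
        Namespace="Synthetic/CrossRegion",
        MetricData=[
            {"MetricName": "EndpointUp", "Value": 1.0 if ok else 0.0, "Unit": "Count"},
            {"MetricName": "LatencyMs", "Value": latency_ms, "Unit": "Milliseconds"},
        ],
    )

if __name__ == "__main__":
    run_synthetic_check()
```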
Level 4: Failure Drill Regularization (Validation)
Assessment Criteria:
– Are disaster recovery drills conducted at least quarterly?
– Do drill scenarios cover various failure types?
– Are drill results documented and optimization measures implemented?
Drill Scenario Design:
– Single-AZ complete failure: Simulate power or network complete outage
– Region control plane degradation: Simulate API call timeout or throttling
– Database Primary failure: Simulate RDS failover process
– Network partition: Simulate cross-AZ communication interruption
– Cascading failure: Simulate dependency service unavailability triggering chain reaction
GameDay Practice:
– Schedule chaos engineering exercises quarterly
– Involve all relevant teams (Dev, Ops, Security, Business)
– Set clear success criteria and KPIs
– Conduct post-mortem after drills to continuously optimize
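For the "single-AZ complete failure" scenario, a deliberately blunt GameDay sketch is shown below: it stops every drill-tagged instance in one AZ. The tag filter and target AZ are assumptions; run it only against a test environment, or use the managed AWS Fault Injection Service instead.

```python
# Sketch of a GameDay drill: simulate a single-AZ failure by stopping all drill-tagged
# instances in one AZ. Tag key/value and the target AZ are assumptions.
import boto3

REGION, TARGET_AZ = "us-east-1", "us-east-1a"
ec2 = boto3.client("ec2", region_name=REGION)

def simulate_az_failure(dry_run: bool = True):
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "availability-zone", "Values": [TARGET_AZ]},
            {"Name": "tag:gameday", "Values": ["true"]},         # hypothetical drill tag
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instance_ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    print(f"Would stop {len(instance_ids)} instances in {TARGET_AZ}: {instance_ids}")
    if instance_ids and not dry_run:
        ec2.stop_instances(InstanceIds=instance_ids)

if __name__ == "__main__":
    simulate_az_failure(dry_run=True)  # flip to False during the actual drill
```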
Level 5: Cost-Benefit Balance (Strategic)
Assessment Criteria:
– Is there a quantitative model for architecture redundancy costs?
– Is outage risk quantified into monetary losses?
– Are architecture decisions data-driven?
ROI Calculation Framework:
| Architecture Pattern | Monthly Cost Increase | RTO | Expected Annual Outage Loss | Net Benefit |
|---|---|---|---|---|
| Single-AZ | Baseline (1x) | 4-24 hours | $500,000 | -$500,000 |
| Multi-AZ | +30-50% | 10-60 min | $100,000 | +$250,000 |
| Multi-Region (Pilot) | +50-100% | 5-30 min | $50,000 | +$200,000 |
| Multi-Region (Active) | +150-300% | < 1 min | $5,000 | -$150,000 |
Decision Logic:
– Calculate Expected Annual Loss = Outage Probability × Single Outage Loss
– Compare Architecture Investment vs. Risk Reduction Benefit
– Consider intangible costs: Brand reputation, customer trust, regulatory penalties
Real-World Example (Public Data):
– 2017 AWS S3 outage caused estimated industry loss of $150 million (source: Cyence)
– 2021 Facebook outage resulted in $100 million revenue loss (source: Facebook earnings report)
– This demonstrates: For high-traffic enterprises, Multi-Region architecture investment can pay back in one incident
Multi-AZ Design Principles (Conceptual Framework)
Architecture Thinking: Treating Each AZ as Independent Datacenter
Multi-AZ is not simply “deploying two instances,” but requires thinking from failure domain isolation perspective:
Core Principles:
1. Stateless Application Tier: Compute nodes can fail and restart at any time
2. Stateful Data Tier: Data replication must ensure consistency
3. Network Tier: Avoid single points (NAT Gateway, VPN, etc.)
4. Deployment Tier: Use blue-green or canary deployments to reduce blast radius
⚠️ Cost Reality Check: 3 AZs Are NOT Always Necessary
⏰ Pricing Disclaimer: The following cost estimates are based on 2025 AWS pricing. For current rates, please refer to AWS Pricing Calculator.
Important: Most cloud architecture guides commonly recommend “3 AZ” configurations, but this is significantly more expensive for most businesses:
- 3 AZs vs 2 AZs Cost: roughly +30-50% additional infrastructure spend
- Reality: Only financial/healthcare sectors with extreme availability needs can justify this
- Pragmatic Recommendation:
- Small/Medium Businesses (Revenue < $10M): 2 AZs + good backup strategy
- Large Enterprises (Revenue $10M-$50M): 2 AZs + Cross-Region backup
- Financial/Medical (Revenue > $50M): 3 AZs (if budget allows)
Cost-Benefit Analysis:
– Single AZ → 2 AZs: cost +35%, downtime reduced by ~80% ✅ Excellent ROI
– 2 AZs → 3 AZs: cost +30%, downtime reduced by a further ~50% ⚠️ Lower ROI
Bottom Line: 95% of businesses should choose 2 AZs, not 3.
Critical Decision Points
Decision 1: Active-Active or Active-Passive?
– Active-Active: Both AZs simultaneously serve traffic, higher resource utilization but complex state synchronization
– Active-Passive: One AZ serves traffic while the other stays on hot standby; lower cost, but failover must be automated
Decision 2: Synchronous or Asynchronous Replication?
– Synchronous: Ensures zero data loss but higher latency, suitable for financial transactions
– Asynchronous: Lower latency but potential data loss, suitable for non-critical data
Decision 3: Automated or Manual Failover?
– Automated: Faster recovery but risk of flapping and unnecessary failovers
– Manual: More controlled but slower, dependent on on-call engineer availability
Real-World Architecture Pattern (No Code)
E-commerce Platform Multi-AZ Architecture:
– Frontend: ALB distributes traffic to EC2/ECS in both AZs
– Application: Auto Scaling Group spans two AZs, each can handle 100% traffic
– Cache: ElastiCache Multi-AZ with automatic failover
– Database: RDS Multi-AZ synchronous replication
– Storage: S3 automatic cross-AZ replication
– Queue: SQS naturally supports Multi-AZ
Key Architectural Decisions:
– Session state stored in ElastiCache (not locally), so any instance can handle any request
– Database connection pool configured for automatic failover, transparent to the application layer
– Static assets via CloudFront, independent from AZ failures
– Asynchronous tasks via SQS, ensuring at-least-once delivery
Real-World Enterprise Architecture Practices
Netflix: Active-Active Multi-Region Architecture
– Scale: 1000+ microservices, global multi-region deployment
– Architecture Highlights: Active-Active architecture, can evacuate entire region in 40 minutes
– Chaos Engineering: Proactive failure injection (Chaos Monkey)
– Outcome: Any single component failure doesn’t affect user streaming
– Reference: Netflix Multi-Region Architecture (QCon)
Airbnb: Pragmatic Multi-AZ Architecture
– Scale: 200 EC2 instances, global service
– Architecture Highlights: Multi-AZ deployment, Amazon RDS automatic replication
– Migration Achievement: Completed entire database migration with only 15 minutes downtime
– Cost-Effectiveness: Balances availability and cost
– Reference: Airbnb AWS Case Study (Medium)
Spotify: Kubernetes Microservices Architecture
– Scale: 300+ microservices, 500M+ active users
– Architecture Highlights: AWS + GCP hybrid cloud, multi-region deployment
– Fault Tolerance: Kubernetes automatic failover and scaling
– Global Optimization: Routes users to nearest server by location
– Reference: Spotify Cloud Infrastructure
Multi-Region Architecture Patterns (Strategic Planning)
When Should You Consider Multi-Region?
Multi-Region is not for everyone. Consider when:
– Regulatory requirements: Data sovereignty, GDPR, etc.
– Global user base: Reduce cross-continent latency
– Extreme high availability: 99.99%+ SLA requirements
– Disaster recovery requirements: Need to survive Region-level catastrophic failures
Four Progressive Architecture Patterns
(Already detailed in Level 2 section above, omitted here)
Real Cross-Region Architecture Decisions
Decision 1: Data Consistency Strategy
– Strong consistency: Synchronous cross-Region replication, highest latency
– Eventual consistency: Asynchronous replication, suitable for most scenarios
– Conflict resolution: Use timestamp, version vector, or last-write-wins
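As a toy illustration of the last-write-wins option, the snippet below merges two Region-local copies of a record by timestamp; the field names are made up, and in practice monotonic version numbers or version vectors are safer than wall clocks, which can skew between Regions.

```python
# Toy sketch of last-write-wins conflict resolution between two Regions.
# Each record carries an update timestamp; on conflict, the newer write wins.
from datetime import datetime, timezone

def merge_last_write_wins(record_a: dict, record_b: dict) -> dict:
    return record_a if record_a["updated_at"] >= record_b["updated_at"] else record_b

us_copy = {"cart_id": "c-1", "items": 3, "updated_at": datetime(2025, 10, 20, 8, 0, tzinfo=timezone.utc)}
eu_copy = {"cart_id": "c-1", "items": 5, "updated_at": datetime(2025, 10, 20, 8, 2, tzinfo=timezone.utc)}

print(merge_last_write_wins(us_copy, eu_copy))  # the EU write is newer, so it wins
```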
Decision 2: Traffic Routing Strategy
– Geolocation-based: Route 53 based on user location
– Latency-based: Route to lowest latency Region
– Failover-based: Switch to standby Region only when Primary fails
– Weighted: Gradually shift traffic for canary testing
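A minimal boto3 sketch of the failover-based option: a primary record guarded by a health check and a secondary record pointing at the standby Region. The hosted zone ID, health check ID, and IP addresses are placeholders.

```python
# Sketch: failover-based routing with Route 53 (primary Region + standby Region).
# HOSTED_ZONE_ID, HEALTH_CHECK_ID, and the record values are placeholders.
import boto3

route53 = boto3.client("route53")
HOSTED_ZONE_ID = "Z0000000000000EXAMPLE"
HEALTH_CHECK_ID = "00000000-0000-0000-0000-000000000000"

def change(set_id, failover_role, value, health_check_id=None):
    rrset = {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": failover_role,      # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": value}],
    }
    if health_check_id:
        rrset["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": rrset}

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": [
        change("primary-us-east-1", "PRIMARY", "203.0.113.10", HEALTH_CHECK_ID),
        change("secondary-us-west-2", "SECONDARY", "198.51.100.10"),
    ]},
)
```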
Decision 3: Data Residency Compliance
– EU users’ data must stay in EU Regions (GDPR)
– China users’ data must stay in China Regions (Cybersecurity Law)
– Healthcare data requires special compliance certifications (HIPAA)
Monitoring System Decoupling Strategy
Observer-Observable Coupling Problem
This incident revealed a critical architectural flaw: when CloudWatch and the applications it monitors share fate in the same Region, you lose visibility at exactly the moment you need it most.
Analogy: Like a ship’s captain relying only on the ship’s own instruments—when the ship sinks, all instruments fail, leaving you with no way to assess the situation.
Three-Layer Independent Monitoring Architecture
Layer 1: In-Cloud Real-Time Monitoring
– Primary monitoring: CloudWatch in application Region
– Advantages: Low latency, tight integration
– Disadvantages: Shares fate with application
Layer 2: Cross-Region Backup Monitoring
– Independent Region running monitoring stack
– Active synthetics testing application endpoints
– Can detect failures even when primary Region control plane is down
– Recommended implementation: Second Region runs lightweight monitoring instance
Layer 3: Out-of-Cloud Fallback Monitoring
– Third-party SaaS monitoring services
– On-premises monitoring infrastructure
– Multi-cloud backup (e.g., monitoring AWS from Azure/GCP)
– Advantages: Complete independence from AWS availability
Monitoring Independence Checklist
- [ ] Is alert notification channel independent from primary cloud provider? (e.g., use third-party PagerDuty, OpsGenie)
- [ ] Are monitoring data and metrics stored in separate Region/cloud?
- [ ] Can synthetic testing run from external networks?
- [ ] Does on-call engineer have out-of-band access methods (non-AWS console)?
- [ ] Is incident response playbook accessible when cloud console is down?
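One concrete way to keep the alert channel independent, sketched below with the PagerDuty Events API v2 (the routing key is a placeholder): the page travels over plain HTTPS from your cross-Region or out-of-cloud monitor, so it does not depend on the impaired provider.

```python
# Sketch: send an out-of-band alert through the PagerDuty Events API v2, so paging
# does not depend on the impaired cloud provider. ROUTING_KEY is a placeholder.
import requests

ROUTING_KEY = "YOUR_PAGERDUTY_INTEGRATION_KEY"  # placeholder

def page_oncall(summary: str, severity: str = "critical"):
    response = requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": ROUTING_KEY,
            "event_action": "trigger",
            "payload": {
                "summary": summary,
                "source": "out-of-cloud-monitor",
                "severity": severity,
            },
        },
        timeout=10,
    )
    response.raise_for_status()

if __name__ == "__main__":
    page_oncall("Synthetic checks failing for us-east-1 endpoints")
```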
Architecture Health Check Action Plan
For Enterprises Currently Single-AZ
Immediate Actions (Within 1 month):
1. Assess which services are critical and cannot tolerate 4+ hour downtime
2. Prioritize Multi-AZ transformation for critical services
3. Establish basic cross-Region backup (at least Backup & Restore pattern)
Medium-term Optimization (Within 3-6 months):
1. Gradually migrate non-critical services to Multi-AZ
2. Implement automated health checks and failover mechanisms
3. Conduct first disaster recovery drill
Long-term Planning (Within 1 year):
1. Evaluate Multi-Region architecture necessity
2. Establish quarterly GameDay culture
3. Build cost-benefit model for continuous optimization
For Enterprises Already Multi-AZ
Validation Checklist:
1. Is current Multi-AZ truly active? Or just “standby for show”?
2. Has actual AZ failover ever been performed?
3. Can monitoring systems detect failures when single AZ is down?
4. Is RTO/RPO theoretical value or tested result?
Advanced Optimization:
1. From reactive to proactive: Implement Chaos Engineering
2. From manual to automated: Automate all runbooks
3. From single-cloud to multi-cloud: Evaluate AWS + Azure/GCP hybrid architecture
Cost-Benefit Analysis: Is Multi-Region Worth It?
ROI Calculation Model
Basic Formula:
Net Annual Benefit = (Outage Probability Reduction × Single Outage Loss) - Additional Architecture Cost
Example 1: Medium E-commerce Company
– Single-AZ deployment cost: $20,000/month
– Multi-AZ deployment cost: $30,000/month (increase $10,000)
– Multi-Region deployment cost: $50,000/month (increase $30,000)
– Estimated single outage loss: $500,000 (revenue loss + customer churn + brand damage)
– Single-AZ annual outage probability: 0.5%
– Multi-AZ annual outage probability: 0.1%
– Multi-Region annual outage probability: 0.01%
Calculation:
– Single-AZ expected annual loss: $500,000 × 0.5% = $2,500
– Multi-AZ expected annual loss: $500,000 × 0.1% = $500
– Multi-Region expected annual loss: $500,000 × 0.01% = $50
– Multi-AZ net benefit: ($2,500 – $500) – $10,000 × 12 = -$118,000 (worth it only if a single outage loss exceeds roughly $30M at these probabilities)
– Multi-Region net benefit: ($2,500 – $50) – $30,000 × 12 = -$357,550 (not recommended at this scale)
Conclusion: This medium company should choose Multi-AZ, not Multi-Region.
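The basic formula from the ROI Calculation Model, written out below as a small calculator with the Example 1 figures, makes it easy to rerun the numbers with your own outage-loss and probability estimates.

```python
# Net Annual Benefit = (Outage Probability Reduction x Single Outage Loss) - Additional Architecture Cost
# Inputs are the illustrative Example 1 figures from this section.
def net_annual_benefit(outage_loss, p_baseline, p_new, extra_monthly_cost):
    risk_reduction = outage_loss * (p_baseline - p_new)
    return risk_reduction - extra_monthly_cost * 12

OUTAGE_LOSS = 500_000
print("Multi-AZ:     ", net_annual_benefit(OUTAGE_LOSS, 0.005, 0.001, 10_000))   # ≈ -118,000
print("Multi-Region: ", net_annual_benefit(OUTAGE_LOSS, 0.005, 0.0001, 30_000))  # ≈ -357,550
```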
Example 2: Large Financial Services Company
– Single outage loss: $10,000,000 (regulatory penalties + transaction interruption + reputation damage)
– Multi-AZ additional cost: $50,000/month
– Multi-Region additional cost: $200,000/month
Calculation:
– Multi-AZ net benefit: ($10M × 0.4%) – $50K × 12 = $40,000 – $600,000 = -$560,000 (negative on direct ROI alone)
– Multi-Region net benefit: ($10M × 0.49%) – $200K × 12 = $49,000 – $2,400,000 = -$2,351,000 (also negative on direct ROI alone)
But considering:
– Regulatory penalties could be much higher (GDPR up to 4% annual revenue)
– Brand reputation damage is hard to quantify
– Customer trust once lost is difficult to rebuild
Conclusion: Once regulatory penalties, reputational damage, and the difficulty of rebuilding customer trust are factored in, large financial companies should implement a Multi-Region Active-Active architecture.
Conclusion: From “Cost Center” to “Risk Management”
This AWS incident teaches us: Cloud architecture design is not just a technical issue, but a business risk management issue.
Key Mindset Shifts:
1. From “chasing 100% availability” to “accepting failure and designing for resilience”
2. From “minimizing costs” to “optimizing risk-return ratio”
3. From “one-time design” to “continuous validation”
4. From “technical team responsibility” to “business stakeholder consensus”
Recommended Action:
– Book a 2-hour architecture review meeting this week
– Use the “Five-Level Health Check Framework” from this article to assess your current architecture
– Quantify your outage risks with actual numbers
– Make architecture decisions based on data, not intuition
Remember: Multi-AZ and Multi-Region are not silver bullets, but tools. The key is choosing the right architecture pattern based on your business characteristics, risk tolerance, and budget constraints.