In modern software development, deployment strategies directly impact system availability and user experience. Traditional maintenance windows can no longer meet the demands of 24/7 services. Blue-Green Deployment and Canary Deployment provide zero-downtime solutions, but how do you actually implement them? This article dives into technical details with complete implementation guidance.
Why Zero-Downtime Deployments?
Business Requirements:
- Global services have no "off-peak hours"
- A 99.9% SLA allows only about 43 minutes of downtime per month
- Competitors won't wait for your maintenance window

Technical Challenges:
- How do you update code without interrupting service?
- How do you handle database schema changes?
- How do you roll back quickly when issues occur?
Blue-Green Deployment
Architecture and Principles
Blue-green deployment achieves zero downtime by maintaining two identical production environments:
                 Load Balancer
                       │
       ┌───────────────┼───────────────┐
       │               │               │
  Blue (v1.0)     Green (v1.1)     Database
  Currently       New Version      (Shared)
  Active          Standby
Core Concepts:
1. Blue Environment: the version currently serving traffic
2. Green Environment: the new version, deployed here and ready after testing
3. Switching Mechanism: the load balancer instantly redirects traffic from Blue to Green
4. Rollback Capability: switch back to Blue immediately if problems occur
Kubernetes Implementation
1. Switching via Service Label Selector
# blue-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
      - name: myapp
        image: myapp:v1.0
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 10
# green-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
      - name: myapp
        image: myapp:v1.1
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
# service.yaml (the key to switching)
apiVersion: v1
kind: Service
metadata:
  name: myapp-service
spec:
  selector:
    app: myapp
    version: blue  # Change to green to complete the switch
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
  type: LoadBalancer
Deployment and Switching Workflow:
# 1. Deploy Green environment (new version)
kubectl apply -f green-deployment.yaml
# 2. Wait for Green readiness
kubectl rollout status deployment/myapp-green
# 3. Verify Green environment health
kubectl get pods -l version=green
kubectl logs -l version=green --tail=50
# 4. Run smoke tests (background the port-forward so the curls can run)
kubectl port-forward deployment/myapp-green 8080:8080 &
curl http://localhost:8080/health
curl http://localhost:8080/api/test
# 5. Switch traffic to Green (critical step)
kubectl patch service myapp-service -p '{"spec":{"selector":{"version":"green"}}}'
# 6. Monitor metrics after switch (5-10 minutes)
kubectl top pods -l version=green
# Observe error rate, latency, CPU/Memory
# 7. Delete Blue environment after confirmation
kubectl delete deployment myapp-blue
# Rollback if needed: execute immediately
kubectl patch service myapp-service -p '{"spec":{"selector":{"version":"blue"}}}'
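For teams that script this cutover, here is a minimal sketch using the official `kubernetes` Python client. It patches the Service selector exactly as the `kubectl patch` commands above do; the service name, namespace, and labels mirror the manifests in this section and should be adjusted to your environment:

```python
from kubernetes import client, config

def switch_traffic(target_version: str, service: str = "myapp-service",
                   namespace: str = "default") -> None:
    """Point the Service at the blue or green Deployment by patching its selector."""
    config.load_kube_config()  # use load_incluster_config() when running in a pod
    v1 = client.CoreV1Api()
    patch = {"spec": {"selector": {"app": "myapp", "version": target_version}}}
    v1.patch_namespaced_service(name=service, namespace=namespace, body=patch)

# Cut over to Green; call with "blue" to roll back
switch_traffic("green")
```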
Nginx Implementation
Using Nginx as Load Balancer for Blue-Green Switching:
# /etc/nginx/conf.d/myapp.conf
upstream backend {
    # Blue-green switching: comment/uncomment to switch
    # Blue environment (current production)
    server 10.0.1.10:8080;
    server 10.0.1.11:8080;
    server 10.0.1.12:8080;

    # Green environment (new version, ready to switch)
    # server 10.0.2.10:8080;
    # server 10.0.2.11:8080;
    # server 10.0.2.12:8080;
}

server {
    listen 80;
    server_name myapp.example.com;

    location / {
        proxy_pass http://backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # Retry the next upstream on errors (passive failure handling)
        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_connect_timeout 5s;
        proxy_send_timeout 10s;
        proxy_read_timeout 10s;
    }

    location /health {
        access_log off;
        proxy_pass http://backend;
    }
}
Switching Script:
#!/bin/bash
# blue-green-switch.sh
set -e

NGINX_CONF="/etc/nginx/conf.d/myapp.conf"
BACKUP_CONF="/tmp/myapp.conf.backup"

# Back up the current config
cp "$NGINX_CONF" "$BACKUP_CONF"

# Check Green environment health
for server in 10.0.2.10 10.0.2.11 10.0.2.12; do
    if ! curl -f -s "http://$server:8080/health" > /dev/null; then
        echo "❌ Green server $server is not healthy"
        exit 1
    fi
done
echo "✅ All Green servers are healthy"

# Switch to Green (comment out Blue, enable Green)
# Patterns tolerate leading indentation in the config file
sed -i '/10\.0\.1\./s/server/# server/' "$NGINX_CONF"
sed -i '/10\.0\.2\./s/# //' "$NGINX_CONF"

# Test the Nginx config
if nginx -t; then
    echo "✅ Nginx configuration is valid"
    # Reload Nginx (zero downtime)
    nginx -s reload
    echo "✅ Switched to Green environment"
else
    echo "❌ Nginx configuration test failed, rolling back"
    cp "$BACKUP_CONF" "$NGINX_CONF"
    exit 1
fi

# Monitor for 5 minutes
echo "Monitoring for 5 minutes..."
sleep 300
# Clean up the Blue environment manually once everything looks normal
AWS CodeDeploy Implementation
AppSpec File (Blue-Green Deployment):
# appspec.yml
version: 0.0
os: linux
files:
  - source: /
    destination: /var/www/myapp
hooks:
  BeforeInstall:
    - location: scripts/before_install.sh
      timeout: 300
      runas: root
  AfterInstall:
    - location: scripts/after_install.sh
      timeout: 300
      runas: root
  ApplicationStart:
    - location: scripts/start_application.sh
      timeout: 300
      runas: root
  ValidateService:
    - location: scripts/validate_service.sh
      timeout: 300
      runas: root
Validation Script:
#!/bin/bash
# scripts/validate_service.sh

# Wait for the application to start
sleep 10

# Health check
HEALTH_CHECK_URL="http://localhost:8080/health"
MAX_ATTEMPTS=30

for i in $(seq 1 $MAX_ATTEMPTS); do
    if curl -f -s "$HEALTH_CHECK_URL" | grep -q "healthy"; then
        echo "✅ Application is healthy"
        exit 0
    fi
    echo "⏳ Waiting for application to be healthy (attempt $i/$MAX_ATTEMPTS)"
    sleep 10
done

echo "❌ Application failed health check"
exit 1
Practical Considerations for Blue-Green Deployment
1. Database Schema Changes
The biggest challenge with blue-green deployment is the shared database: schema changes must remain backward compatible so that both Blue and Green can run against the same schema:
-- ❌ Wrong: Direct column drop (will break Blue environment)
ALTER TABLE users DROP COLUMN legacy_field;
-- ✅ Correct: Phased migration
-- Phase 1: Add new column (both Blue/Green work)
ALTER TABLE users ADD COLUMN new_field VARCHAR(255);
-- Phase 2: Data migration
UPDATE users SET new_field = CONCAT(first_name, ' ', last_name);
-- Phase 3: Switch application to Green (using new_field)
-- Phase 4: Drop old column in next deployment
ALTER TABLE users DROP COLUMN legacy_field;
2. Session Handling
# Use Redis for centralized session storage
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  SESSION_STORE: "redis"
  REDIS_HOST: "redis-cluster.default.svc.cluster.local"
  REDIS_PORT: "6379"
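On the application side, any centralized store works as long as every instance reads the same backend. Here is a minimal hand-rolled sketch with redis-py; the host matches the ConfigMap above, while the key naming and TTL are illustrative assumptions:

```python
import json
import uuid

import redis

# Host/port match the ConfigMap above
r = redis.Redis(host="redis-cluster.default.svc.cluster.local", port=6379)
SESSION_TTL_SECONDS = 3600

def create_session(data: dict) -> str:
    """Store session data centrally so Blue and Green instances share it."""
    session_id = uuid.uuid4().hex
    r.setex(f"session:{session_id}", SESSION_TTL_SECONDS, json.dumps(data))
    return session_id

def load_session(session_id: str):
    """Return the session dict, or None if it expired or never existed."""
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```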
3. Cache Invalidation Strategy
# Clear stale cache entries during the blue-green switch
import redis

def invalidate_cache_on_deployment(old_version="v1.0"):
    r = redis.Redis(host='redis-host', port=6379)
    # DEL does not expand glob patterns; iterate matching keys with SCAN
    for key in r.scan_iter(match=f'cache:{old_version}:*'):
        r.delete(key)
    print("Cache invalidated for old version")
4. Monitoring Metrics
# Metrics to monitor after switching
# - HTTP error rates (4xx, 5xx)
# - Response times (p50, p95, p99)
# - Database connections
# - CPU/Memory usage
# Prometheus query examples
rate(http_requests_total{status=~"5.."}[5m]) # 5xx error rate
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) # p95 latency
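To act on these queries programmatically rather than eyeballing dashboards, the Prometheus HTTP API can be queried directly. A small sketch follows; the Prometheus URL is an assumed in-cluster address, and the metric names mirror the queries above:

```python
import requests

PROMETHEUS_URL = "http://prometheus:9090/api/v1/query"  # assumed address

def query_scalar(promql: str) -> float:
    """Run an instant PromQL query and return the first sample value."""
    resp = requests.get(PROMETHEUS_URL, params={"query": promql}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return float(results[0]["value"][1]) if results else 0.0

error_rate = query_scalar('sum(rate(http_requests_total{status=~"5.."}[5m]))')
p95_latency = query_scalar(
    'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))')
print(f"5xx rate: {error_rate:.4f}/s, p95 latency: {p95_latency:.3f}s")
```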
Canary Deployment
Principles and Traffic Control
Canary deployment reduces risk by gradually shifting an increasing share of traffic to the new version:
           Load Balancer (traffic split)
                       │
       ┌───────────────┼───────────────┐
       │               │               │
  Stable (95%)    Canary (5%)      Database
     v1.0            v1.1          (Shared)

       ↓ Gradually increase Canary traffic while metrics stay healthy
  Stable (50%)    Canary (50%)
       ↓ Eventually switch over completely
  Stable (0%)     Canary (100%)
Kubernetes + Istio Implementation
1. Istio VirtualService for Traffic Distribution:
# canary-virtualservice.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp-vs
spec:
  hosts:
  - myapp.example.com
  http:
  - match:
    - headers:
        x-canary:
          exact: "true"
    route:
    - destination:
        host: myapp-service
        subset: canary
  - route:
    - destination:
        host: myapp-service
        subset: stable
      weight: 95
    - destination:
        host: myapp-service
        subset: canary
      weight: 5  # Initial 5% of traffic to Canary
# canary-destinationrule.yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp-dr
spec:
  host: myapp-service
  subsets:
  - name: stable
    labels:
      version: stable
  - name: canary
    labels:
      version: canary
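A useful side effect of the header match in the VirtualService above: testers can opt into the canary regardless of the weighted split, e.g. with `curl -H "x-canary: true" http://myapp.example.com/`, which makes it easy to verify the new version before widening traffic.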
2. Deployment Workflow and Traffic Adjustment:
# 1. Deploy the Canary version (5% traffic)
kubectl apply -f canary-deployment.yaml
kubectl apply -f canary-virtualservice.yaml

# 2. Monitor Canary metrics (30 minutes)
istioctl dashboard prometheus
# Observe: error rate, latency, resource usage

# 3. Gradually increase traffic (if metrics are healthy)
# 5% → 10% → 25% → 50% → 100%
kubectl patch virtualservice myapp-vs --type json \
  -p '[{"op":"replace","path":"/spec/http/1/route/0/weight","value":90},
       {"op":"replace","path":"/spec/http/1/route/1/weight","value":10}]'
# Observe for 15-30 minutes after each adjustment

# 4. Fully switch to Canary
kubectl patch virtualservice myapp-vs --type json \
  -p '[{"op":"replace","path":"/spec/http/1/route/0/weight","value":0},
       {"op":"replace","path":"/spec/http/1/route/1/weight","value":100}]'

# 5. Clean up the old version
kubectl delete deployment myapp-stable

# If a rollback is needed, immediately switch traffic back to stable
kubectl patch virtualservice myapp-vs --type json \
  -p '[{"op":"replace","path":"/spec/http/1/route/0/weight","value":100},
       {"op":"replace","path":"/spec/http/1/route/1/weight","value":0}]'
Nginx Implementation (Weighted Round Robin)
upstream backend {
    # Stable version: weight 19 of 20 → 95% of traffic
    server 10.0.1.10:8080 weight=19;
    # Canary version: weight 1 of 20 → 5% of traffic
    server 10.0.2.10:8080 weight=1;

    # For session affinity, add ip_hash here in the upstream block
    # (or use sticky sessions); note that affinity skews the weighted split
    # ip_hash;
}

server {
    listen 80;
    server_name myapp.example.com;

    location / {
        proxy_pass http://backend;
    }
}
AWS CodeDeploy Canary Deployment
Deployment Configuration (time-based canary traffic routing applies to the Lambda and ECS compute platforms):
{
  "deploymentConfigName": "CodeDeployDefault.ECSCanary10Percent5Minutes",
  "computePlatform": "ECS",
  "trafficRoutingConfig": {
    "type": "TimeBasedCanary",
    "timeBasedCanary": {
      "canaryPercentage": 10,
      "canaryInterval": 5
    }
  }
}
Custom Deployment Configuration:
# Create a custom Canary config: shift 20% of traffic first,
# then the remaining 80% after 10 minutes
aws deploy create-deployment-config \
  --deployment-config-name Custom.Canary20Percent10Minutes \
  --compute-platform ECS \
  --traffic-routing-config '{
    "type": "TimeBasedCanary",
    "timeBasedCanary": {
      "canaryPercentage": 20,
      "canaryInterval": 10
    }
  }'
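The same config can also be created programmatically. A minimal sketch with boto3; the call and fields mirror the CLI command above, and standard AWS credentials are assumed:

```python
import boto3

# Sketch: create the same custom canary config via the CodeDeploy API
codedeploy = boto3.client("codedeploy")

codedeploy.create_deployment_config(
    deploymentConfigName="Custom.Canary20Percent10Minutes",
    computePlatform="ECS",
    trafficRoutingConfig={
        "type": "TimeBasedCanary",
        "timeBasedCanary": {
            # Shift 20% first, then the remaining 80% after 10 minutes
            "canaryPercentage": 20,
            "canaryInterval": 10,
        },
    },
)
```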
Automated Rollback Configuration
Kubernetes + Prometheus + Flagger Automation:
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99  # Auto rollback if success rate < 99%
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500  # Auto rollback if p99 latency > 500ms
      interval: 1m
    webhooks:
    - name: load-test
      url: http://loadtester.default/
      timeout: 5s
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 http://myapp-canary:8080/"
Monitoring and Alerting Script:
#!/bin/bash
# canary-monitor.sh

CANARY_ERROR_RATE=$(curl -s 'http://prometheus:9090/api/v1/query?query=rate(http_requests_total{version="canary",status=~"5.."}[5m])' | jq -r '.data.result[0].value[1]')
THRESHOLD=0.01  # 1% error rate

if (( $(echo "$CANARY_ERROR_RATE > $THRESHOLD" | bc -l) )); then
    echo "❌ Canary error rate too high: $CANARY_ERROR_RATE"
    echo "Rolling back to stable..."

    # Auto rollback
    kubectl patch virtualservice myapp-vs --type json \
      -p '[{"op":"replace","path":"/spec/http/1/route/0/weight","value":100},
           {"op":"replace","path":"/spec/http/1/route/1/weight","value":0}]'

    # Send an alert
    curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
      -H 'Content-type: application/json' \
      -d '{"text":"⚠️ Canary deployment auto-rollback triggered"}'

    exit 1
fi

echo "✅ Canary metrics are healthy: error_rate=$CANARY_ERROR_RATE"
Blue-Green vs Canary: How to Choose?
| Factor | Blue-Green Deployment | Canary Deployment |
|---|---|---|
| Risk Control | One-time switch, higher risk | Gradual rollout, lowest risk |
| Infrastructure Cost | Double resources (100% + 100%) | Lower resources (100% + 5-20%) |
| Rollback Speed | Instant (switch LB) | Fast (drop canary weight to 0) |
| Testing Complexity | Can fully test Green environment | Low Canary traffic, hard to test edge cases |
| Monitoring Requirements | Focused monitoring after switch | Continuous Canary metrics monitoring |
| Use Cases | Major version upgrades, database migrations | Daily feature iterations, A/B testing |
Recommended Strategy:
- Daily deployments: Canary (5% → 10% → 25% → 50% → 100%)
- Major version upgrades: Blue-Green (full testing, then a one-time switch)
- Emergency fixes: Blue-Green (quick switch, quick rollback)
- Experimental features: Canary + Feature Flags (test with specific users first)
Practical Checklist
Before Deployment:
- [ ] Database schema changes are backward compatible
- [ ] Sessions use centralized storage (Redis/Memcached)
- [ ] Static assets are versioned (to avoid cache issues)
- [ ] Health check endpoints are working properly
- [ ] Monitoring and alerting are configured

During Deployment:
- [ ] New version passes smoke tests
- [ ] Monitoring dashboard is open
- [ ] Team is on standby (able to roll back immediately)
- [ ] Deployment time and traffic distribution are recorded

After Deployment:
- [ ] Observe error rate, latency, and resource usage
- [ ] Check logs for anomalies
- [ ] Verify critical business flows
- [ ] Clean up old version resources
Summary
Blue-green and canary deployments are the two pillars of zero-downtime deployment:
Blue-Green Deployment:
- Suitable for scenarios requiring fast switching and rollback
- Implemented via Kubernetes Services, Nginx, or AWS CodeDeploy
- Key factors: database compatibility and health checks

Canary Deployment:
- Suitable for gradually validating a new version in production
- Implemented via Istio, Nginx weighting, or Flagger automation
- Key factors: monitoring metrics and automated rollback mechanisms

The two can also be combined: use a canary to validate the new version on low traffic first, then complete the remaining switch quickly, blue-green style.

Next Steps:
- Build a CI/CD pipeline that integrates these deployment strategies
- Set up Prometheus + Grafana monitoring
- Implement automated rollback mechanisms
- Regularly practice deployment and rollback procedures
Related Articles
- Choosing AWS Container Services: Kubernetes vs Amazon ECS Complete Comparison Guide
- AWS, Azure, and GCP Cloud Certifications Complete Comparison Guide (2025 Latest)
- AWS to GCP Architecture Migration Complete Guide: Service Mapping, Migration Strategy & Implementation
- AWS ALB Multi-Instance Deployment Strategy: A Double-Edged Sword for Large Project Architectures