Blue-Green and Canary Deployments: Modern System Deployment Strategies

In modern software development, deployment strategies directly impact system availability and user experience. Traditional maintenance windows can no longer meet the demands of 24/7 services. Blue-Green Deployment and Canary Deployment provide zero-downtime solutions, but how do you actually implement them? This article dives into technical details with complete implementation guidance.

Why Zero-Downtime Deployments?

Business Requirements:
- Global services have no "off-peak hours"
- A 99.9% SLA allows only 43 minutes of downtime per month
- Competitors won't wait for your maintenance window
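
A quick downtime-budget calculation makes that SLA figure concrete (a minimal Python sketch; the extra availability targets are only for comparison):

# Downtime budget for a given availability target over a 30-day month.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

for availability in (0.999, 0.9995, 0.9999):
    budget = MINUTES_PER_MONTH * (1 - availability)
    print(f"{availability:.2%} SLA -> {budget:.1f} minutes of downtime per month")

# 99.90% SLA -> 43.2 minutes of downtime per month
# 99.95% SLA -> 21.6 minutes of downtime per month
# 99.99% SLA -> 4.3 minutes of downtime per month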

Technical Challenges:
- How to update code without service interruption?
- How to handle database schema changes?
- How to roll back quickly when issues occur?

Blue-Green Deployment

Architecture and Principles

Blue-green deployment achieves zero downtime by maintaining two identical production environments:

                   Load Balancer
                        │
        ┌───────────────┼───────────────┐
        │               │               │
    Blue (v1.0)    Green (v1.1)    Database
    Currently      New Version     (Shared)
    Active         Standby

Core Concepts:
1. Blue Environment: the version currently serving production traffic
2. Green Environment: the new version, deployed here and ready after testing
3. Switching Mechanism: the Load Balancer instantly redirects traffic from Blue to Green
4. Rollback Capability: immediately switch back to Blue if problems occur

Kubernetes Implementation

1. Switching via Service Label Selector

# blue-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
      - name: myapp
        image: myapp:v1.0
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 10
# green-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
      - name: myapp
        image: myapp:v1.1
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
# service.yaml (key to switching)
apiVersion: v1
kind: Service
metadata:
  name: myapp-service
spec:
  selector:
    app: myapp
    version: blue  # Change to green to complete deployment
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
  type: LoadBalancer

Deployment and Switching Workflow:

# 1. Deploy Green environment (new version)
kubectl apply -f green-deployment.yaml

# 2. Wait for Green readiness
kubectl rollout status deployment/myapp-green

# 3. Verify Green environment health
kubectl get pods -l version=green
kubectl logs -l version=green --tail=50

# 4. Run smoke tests
kubectl port-forward deployment/myapp-green 8080:8080
curl http://localhost:8080/health
curl http://localhost:8080/api/test

# 5. Switch traffic to Green (critical step)
kubectl patch service myapp-service -p '{"spec":{"selector":{"version":"green"}}}'

# 6. Monitor metrics after switch (5-10 minutes)
kubectl top pods -l version=green
# Observe error rate, latency, CPU/Memory

# 7. Delete Blue environment after confirmation
kubectl delete deployment myapp-blue

# Rollback if needed: execute immediately
kubectl patch service myapp-service -p '{"spec":{"selector":{"version":"blue"}}}'

Nginx Implementation

Using Nginx as Load Balancer for Blue-Green Switching:

# /etc/nginx/conf.d/myapp.conf

upstream backend {
    # Blue-Green switching: comment/uncomment to switch
    # Blue environment (current production)
    server 10.0.1.10:8080;
    server 10.0.1.11:8080;
    server 10.0.1.12:8080;

    # Green environment (new version, ready to switch)
    # server 10.0.2.10:8080;
    # server 10.0.2.11:8080;
    # server 10.0.2.12:8080;
}

server {
    listen 80;
    server_name myapp.example.com;

    location / {
        proxy_pass http://backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # Passive failover: retry the next upstream on errors and timeouts
        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_connect_timeout 5s;
        proxy_send_timeout 10s;
        proxy_read_timeout 10s;
    }

    location /health {
        access_log off;
        proxy_pass http://backend;
    }
}

Switching Script:

#!/bin/bash
# blue-green-switch.sh

set -e

NGINX_CONF="/etc/nginx/conf.d/myapp.conf"
BACKUP_CONF="/tmp/myapp.conf.backup"

# Backup current config
cp $NGINX_CONF $BACKUP_CONF

# Check Green environment health
for server in 10.0.2.10 10.0.2.11 10.0.2.12; do
    if ! curl -f -s http://$server:8080/health > /dev/null; then
        echo "❌ Green server $server is not healthy"
        exit 1
    fi
done

echo "✅ All Green servers are healthy"

# Switch to Green (comment out Blue servers, enable Green servers)
sed -i '/10\.0\.1\./ s/^\([[:space:]]*\)server/\1# server/' $NGINX_CONF
sed -i '/10\.0\.2\./ s/^\([[:space:]]*\)# server/\1server/' $NGINX_CONF

# Test Nginx config
if nginx -t; then
    echo "✅ Nginx configuration is valid"
    # Reload Nginx (zero downtime)
    nginx -s reload
    echo "✅ Switched to Green environment"
else
    echo "❌ Nginx configuration test failed, rolling back"
    cp $BACKUP_CONF $NGINX_CONF
    exit 1
fi

# Monitor for 5 minutes
echo "Monitoring for 5 minutes..."
sleep 300

# Manual cleanup of Blue environment if everything is normal

AWS CodeDeploy Implementation

AppSpec File (Blue-Green Deployment):

# appspec.yml
version: 0.0
os: linux
files:
  - source: /
    destination: /var/www/myapp
hooks:
  BeforeInstall:
    - location: scripts/before_install.sh
      timeout: 300
      runas: root
  AfterInstall:
    - location: scripts/after_install.sh
      timeout: 300
      runas: root
  ApplicationStart:
    - location: scripts/start_application.sh
      timeout: 300
      runas: root
  ValidateService:
    - location: scripts/validate_service.sh
      timeout: 300
      runas: root

Validation Script:

#!/bin/bash
# scripts/validate_service.sh

# Wait for application startup
sleep 10

# Health check
HEALTH_CHECK_URL="http://localhost:8080/health"
MAX_ATTEMPTS=30

for i in $(seq 1 $MAX_ATTEMPTS); do
    if curl -f -s $HEALTH_CHECK_URL | grep -q "healthy"; then
        echo "✅ Application is healthy"
        exit 0
    fi
    echo "⏳ Waiting for application to be healthy (attempt $i/$MAX_ATTEMPTS)"
    sleep 10
done

echo "❌ Application failed health check"
exit 1

Practical Considerations for Blue-Green Deployment

1. Database Schema Changes

The biggest challenge in blue-green deployment is the shared database: schema changes must be backward compatible so that both Blue and Green can run against the same schema:

-- ❌ Wrong: Direct column drop (will break Blue environment)
ALTER TABLE users DROP COLUMN legacy_field;

-- ✅ Correct: Phased migration
-- Phase 1: Add new column (both Blue/Green work)
ALTER TABLE users ADD COLUMN new_field VARCHAR(255);

-- Phase 2: Data migration
UPDATE users SET new_field = CONCAT(first_name, ' ', last_name);

-- Phase 3: Switch application to Green (using new_field)

-- Phase 4: Drop old column in next deployment
ALTER TABLE users DROP COLUMN legacy_field;

2. Session Handling

# Use Redis centralized session storage
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  SESSION_STORE: "redis"
  REDIS_HOST: "redis-cluster.default.svc.cluster.local"
  REDIS_PORT: "6379"
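
The application then reads these values and keeps session state in Redis instead of in process memory, so switching from Blue to Green does not log users out. A minimal sketch (hypothetical helper functions, using the redis-py client and the environment variables from the ConfigMap above):

# Minimal sketch of Redis-backed sessions (hypothetical helpers).
# REDIS_HOST / REDIS_PORT come from the ConfigMap above via environment variables.
import json
import os
import uuid

import redis

r = redis.Redis(
    host=os.environ.get("REDIS_HOST", "localhost"),
    port=int(os.environ.get("REDIS_PORT", "6379")),
)

SESSION_TTL_SECONDS = 1800  # 30-minute sessions

def create_session(user_id: str) -> str:
    """Create a session in Redis and return its ID (set this as a cookie)."""
    session_id = str(uuid.uuid4())
    r.setex(f"session:{session_id}", SESSION_TTL_SECONDS, json.dumps({"user_id": user_id}))
    return session_id

def load_session(session_id: str):
    """Load a session by ID; works no matter which pod (Blue or Green) serves the request."""
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None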

3. Cache Invalidation Strategy

# Clear old-version cache keys during the blue-green switch
import redis

def invalidate_cache_on_deployment(old_version: str = "v1.0"):
    r = redis.Redis(host='redis-host', port=6379)
    # DEL does not accept glob patterns, so iterate with SCAN and delete matches.
    # Using the version number as a cache-key prefix keeps old/new entries separate.
    for key in r.scan_iter(match=f'cache:{old_version}:*'):
        r.delete(key)
    print(f"Cache invalidated for old version {old_version}")

4. Monitoring Metrics

# Metrics to monitor after switching
# - HTTP error rates (4xx, 5xx)
# - Response times (p50, p95, p99)
# - Database connections
# - CPU/Memory usage

# Prometheus query examples
rate(http_requests_total{status=~"5.."}[5m])  # 5xx error rate
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))  # p95 latency

Canary Deployment

Principles and Traffic Control

Canary deployment reduces risk by gradually increasing traffic percentage:

                Load Balancer (Traffic Distribution)
                        │
        ┌───────────────┼───────────────┐
        │               │               │
    Stable (95%)    Canary (5%)     Database
    v1.0            v1.1            (Shared)

    ↓ Gradually increase Canary traffic if metrics are healthy

    Stable (50%)    Canary (50%)

    ↓ Eventually fully switch

    Stable (0%)     Canary (100%)

Kubernetes + Istio Implementation

1. Istio VirtualService for Traffic Distribution:

# canary-virtualservice.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp-vs
spec:
  hosts:
  - myapp.example.com
  http:
  - match:
    - headers:
        x-canary:
          exact: "true"
    route:
    - destination:
        host: myapp-service
        subset: canary
  - route:
    - destination:
        host: myapp-service
        subset: stable
      weight: 95
    - destination:
        host: myapp-service
        subset: canary
      weight: 5  # Initial 5% traffic to Canary
# canary-destinationrule.yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp-dr
spec:
  host: myapp-service
  subsets:
  - name: stable
    labels:
      version: stable
  - name: canary
    labels:
      version: canary

2. Deployment Workflow and Traffic Adjustment:

# 1. Deploy Canary version (5% traffic)
kubectl apply -f canary-deployment.yaml
kubectl apply -f canary-virtualservice.yaml

# 2. Monitor Canary metrics (30 minutes)
istioctl dashboard prometheus
# Observe: error rate, latency, resource usage

# 3. Gradually increase traffic (if metrics are healthy)
# 5% → 10% → 25% → 50% → 100%
kubectl patch virtualservice myapp-vs --type json \
  -p '[{"op":"replace","path":"/spec/http/1/route/0/weight","value":90},
       {"op":"replace","path":"/spec/http/1/route/1/weight","value":10}]'

# Observe 15-30 minutes after each adjustment

# 4. Fully switch to Canary
kubectl patch virtualservice myapp-vs --type json \
  -p '[{"op":"replace","path":"/spec/http/1/route/0/weight","value":0},
       {"op":"replace","path":"/spec/http/1/route/1/weight","value":100}]'

# 5. Clean up old version
kubectl delete deployment myapp-stable

# Rollback if needed: immediately switch traffic back to stable
kubectl patch virtualservice myapp-vs --type json \
  -p '[{"op":"replace","path":"/spec/http/1/route/0/weight","value":100},
       {"op":"replace","path":"/spec/http/1/route/1/weight","value":0}]'

Nginx Implementation (Weighted Round Robin)

upstream backend {
    # Stable version (weight 19 of 20 = 95% of traffic)
    server 10.0.1.10:8080 weight=19;

    # Canary version (weight 1 of 20 = 5% of traffic)
    server 10.0.2.10:8080 weight=1;
}

server {
    listen 80;
    server_name myapp.example.com;

    location / {
        proxy_pass http://backend;

        # For session affinity (if needed), add ip_hash or a sticky-session
        # module to the upstream block above; it is not valid inside location.
    }
}

AWS CodeDeploy Canary Deployment

Deployment Configuration:

{
  "deploymentConfigName": "CodeDeployDefault.Canary10Percent5Minutes",
  "computePlatform": "Server",
  "trafficRoutingConfig": {
    "type": "TimeBasedCanary",
    "timeBasedCanary": {
      "canaryPercentage": 10,
      "canaryInterval": 5
    }
  }
}

Custom Deployment Configuration:

# Create a custom Canary config: shift 20% of traffic first,
# then the remaining 80% after 10 minutes
aws deploy create-deployment-config \
  --deployment-config-name Custom.Canary20Percent10Minutes \
  --compute-platform Server \
  --traffic-routing-config '
{
  "type": "TimeBasedCanary",
  "timeBasedCanary": {
    "canaryPercentage": 20,
    "canaryInterval": 10
  }
}'

Automated Rollback Configuration

Kubernetes + Prometheus + Flagger Automation:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99  # Auto rollback if success rate < 99%
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500  # Auto rollback if p99 latency > 500ms
      interval: 1m
    webhooks:
    - name: load-test
      url: http://loadtester.default/
      timeout: 5s
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 http://myapp-canary:8080/"

Monitoring and Alerting Script:

#!/bin/bash
# canary-monitor.sh

# 5xx error ratio for the canary over the last 5 minutes
CANARY_ERROR_RATE=$(curl -s -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{version="canary",status=~"5.."}[5m])) / sum(rate(http_requests_total{version="canary"}[5m]))' \
  | jq -r '.data.result[0].value[1]')

# Treat "no data" as zero so the numeric comparison below does not break
if [ -z "$CANARY_ERROR_RATE" ] || [ "$CANARY_ERROR_RATE" = "null" ]; then
    CANARY_ERROR_RATE=0
fi

THRESHOLD=0.01  # 1% error ratio

if (( $(echo "$CANARY_ERROR_RATE > $THRESHOLD" | bc -l) )); then
    echo "❌ Canary error rate too high: $CANARY_ERROR_RATE"
    echo "Rolling back to stable..."

    # Auto rollback
    kubectl patch virtualservice myapp-vs --type json \
      -p '[{"op":"replace","path":"/spec/http/1/route/0/weight","value":100},
           {"op":"replace","path":"/spec/http/1/route/1/weight","value":0}]'

    # Send alert
    curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
      -H 'Content-type: application/json' \
      -d '{"text":"⚠️ Canary deployment auto-rollback triggered"}'

    exit 1
fi

echo "✅ Canary metrics are healthy: error_rate=$CANARY_ERROR_RATE"

Blue-Green vs Canary: How to Choose?

| Factor                  | Blue-Green Deployment                        | Canary Deployment                            |
|-------------------------|----------------------------------------------|----------------------------------------------|
| Risk Control            | One-time switch, higher risk                 | Gradual rollout, lowest risk                 |
| Infrastructure Cost     | Double resources (100% + 100%)               | Lower resources (100% + 5-20%)               |
| Rollback Speed          | Instant (switch LB)                          | Requires gradual traffic reduction           |
| Testing Complexity      | Can fully test Green environment             | Low Canary traffic, hard to test edge cases  |
| Monitoring Requirements | Focused monitoring after switch              | Continuous Canary metrics monitoring         |
| Use Cases               | Major version upgrades, database migrations  | Daily feature iterations, A/B testing        |

Recommended Strategy:

Daily deployments: Canary (5% → 10% → 25% → 50% → 100%)
Major version upgrades: Blue-Green (full testing then one-time switch)
Emergency fixes: Blue-Green (quick switch, quick rollback)
Experimental features: Canary + Feature Flag (test with specific users first)
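
For the last point, a feature flag lets you expose the experimental code path to specific users independently of the traffic weights. A minimal sketch (hypothetical allowlist and percentage, not tied to any particular flag service):

# Minimal feature-flag sketch: allowlisted users plus a small, stable cohort.
import hashlib

CANARY_USERS = {"alice@example.com", "qa-team@example.com"}  # assumed allowlist
CANARY_PERCENT = 5  # expose to 5% of the remaining users

def is_feature_enabled(user_id: str) -> bool:
    if user_id in CANARY_USERS:
        return True
    # Stable hash -> bucket 0-99, so a given user is consistently in or out.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < CANARY_PERCENT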

Practical Checklist

Before Deployment:
- [ ] Database schema changes are backward compatible
- [ ] Sessions use centralized storage (Redis/Memcached)
- [ ] Static assets are versioned (to avoid cache issues)
- [ ] Health check endpoints work properly (a minimal sketch follows this list)
- [ ] Monitoring and alerting are configured
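
Since every strategy in this article leans on the /health endpoint, here is a minimal sketch of one (standard library only; the "healthy" payload matches what validate_service.sh greps for):

# Minimal /health endpoint sketch using only the standard library.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # In a real service, also check DB/cache connectivity here.
            body = json.dumps({"status": "healthy", "version": "v1.1"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, format, *args):
        pass  # keep health-check noise out of the access log

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()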

During Deployment:
- [ ] New version passes smoke tests
- [ ] Monitoring dashboard open
- [ ] Team on standby (can roll back immediately)
- [ ] Deployment time and traffic distribution recorded

After Deployment:
- [ ] Observe error rate, latency, resource usage
- [ ] Check logs for anomalies
- [ ] Verify critical business flows
- [ ] Clean up old version resources

Summary

Blue-green and canary deployments are the two pillars of zero-downtime deployment:

Blue-Green Deployment:
- Suitable for scenarios requiring fast switching and rollback
- Implemented via Kubernetes Service, Nginx, or AWS CodeDeploy
- Key factors: database compatibility and health checks

Canary Deployment:
- Suitable for gradual new version validation
- Implemented via Istio, Nginx weighting, or Flagger automation
- Key factors: monitoring metrics and automated rollback mechanisms

The two can be combined: use Canary to validate the new version with a small share of traffic first, then use Blue-Green to complete the remaining switch quickly.

Next Steps:
- Build a CI/CD pipeline integrating these deployment strategies
- Set up Prometheus + Grafana monitoring
- Implement automated rollback mechanisms
- Regularly practice deployment and rollback procedures
