Blue-Green and Canary Deployments: Modern Deployment Strategies Explained

In modern software development, the deployment strategy has a direct impact on system availability and user experience. The traditional maintenance-window approach of taking the service down to deploy can no longer meet the demands of 24/7 services. Blue-green deployment and canary deployment both enable zero-downtime releases, but how do you actually run them? This article goes deep into the technical details and provides a complete implementation guide.

Why Zero-Downtime Deployment?

Business requirements:
- A global service has no true "off-peak window" in which to deploy
- An SLA of 99.9% availability leaves a monthly downtime budget of only about 43 minutes
- Competitors will not wait for your maintenance to finish

Technical challenges:
- How do you update code without interrupting service?
- How do you handle database schema changes?
- How do you roll back quickly when something goes wrong?

Blue-Green Deployment

Principle and Architecture

Blue-green deployment achieves zero downtime by maintaining two identical production environments:

                   Load Balancer
                        │
        ┌───────────────┼───────────────┐
        │               │               │
    Blue (v1.0)    Green (v1.1)    Database
    Currently      New Version     (Shared)
    Active         Standby

Core concepts:
1. Blue environment: the version currently serving traffic
2. Green environment: the new version is deployed here and stands by once tested
3. Switch mechanism: the load balancer cuts traffic over from Blue to Green in one step
4. Rollback capability: if a problem appears, switch straight back to Blue

Kubernetes Implementation Example

1. Switching with a Service Label Selector

# blue-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
      - name: myapp
        image: myapp:v1.0
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 10
# green-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
      - name: myapp
        image: myapp:v1.1
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
# service.yaml (the key to the switch)
apiVersion: v1
kind: Service
metadata:
  name: myapp-service
spec:
  selector:
    app: myapp
    version: blue  # change this to green to complete the cutover
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
  type: LoadBalancer

Deployment and cutover flow:

# 1. Deploy the Green environment (new version)
kubectl apply -f green-deployment.yaml

# 2. Wait for the Green environment to become ready
kubectl rollout status deployment/myapp-green

# 3. Verify the health of the Green environment
kubectl get pods -l version=green
kubectl logs -l version=green --tail=50

# 4. Run smoke tests
kubectl port-forward deployment/myapp-green 8080:8080
curl http://localhost:8080/health
curl http://localhost:8080/api/test

# 5. Switch traffic to Green (the critical step)
kubectl patch service myapp-service -p '{"spec":{"selector":{"version":"green"}}}'

# 6. Monitor post-cutover metrics (for 5-10 minutes)
kubectl top pods -l version=green
# Watch error rate, latency, CPU/memory

# 7. Once everything checks out, delete the Blue environment
kubectl delete deployment myapp-blue

# To roll back, run immediately:
kubectl patch service myapp-service -p '{"spec":{"selector":{"version":"blue"}}}'
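
One extra check worth keeping at hand (not part of the flow above): reading the Service's selector back confirms which color is live at any given moment. A small sketch, assuming the service.yaml layout shown earlier:

# Show which version the Service currently routes to
kubectl get service myapp-service -o jsonpath='{.spec.selector.version}'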

Nginx Implementation Example

Using Nginx as the load balancer for the blue-green switch:

# /etc/nginx/conf.d/myapp.conf

upstream backend {
    # Blue-green switch: comment/uncomment the blocks below to switch
    # Blue environment (current production)
    server 10.0.1.10:8080;
    server 10.0.1.11:8080;
    server 10.0.1.12:8080;

    # Green environment (new version, awaiting cutover)
    # server 10.0.2.10:8080;
    # server 10.0.2.11:8080;
    # server 10.0.2.12:8080;
}

server {
    listen 80;
    server_name myapp.example.com;

    location / {
        proxy_pass http://backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # Passive failover on upstream errors
        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_connect_timeout 5s;
        proxy_send_timeout 10s;
        proxy_read_timeout 10s;
    }

    location /health {
        access_log off;
        proxy_pass http://backend;
    }
}

Switch script:

#!/bin/bash
# blue-green-switch.sh

set -e

NGINX_CONF="/etc/nginx/conf.d/myapp.conf"
BACKUP_CONF="/tmp/myapp.conf.backup"

# Back up the current configuration
cp "$NGINX_CONF" "$BACKUP_CONF"

# Check the health of every Green server before switching
for server in 10.0.2.10 10.0.2.11 10.0.2.12; do
    if ! curl -f -s http://$server:8080/health > /dev/null; then
        echo "❌ Green server $server is not healthy"
        exit 1
    fi
done

echo "✅ All Green servers are healthy"

# Switch to Green (comment out Blue, enable Green);
# the uncomment pattern preserves the leading indentation
sed -i '/10.0.1/s/^/# /' "$NGINX_CONF"
sed -i '/10.0.2/s/^\( *\)# /\1/' "$NGINX_CONF"

# Test the Nginx configuration
if nginx -t; then
    echo "✅ Nginx configuration is valid"
    # Reload Nginx (no downtime)
    nginx -s reload
    echo "✅ Switched to Green environment"
else
    echo "❌ Nginx configuration test failed, rolling back"
    cp "$BACKUP_CONF" "$NGINX_CONF"
    exit 1
fi

# Monitor for 5 minutes
echo "Monitoring for 5 minutes..."
sleep 300

# If everything looks good, the Blue environment can be cleaned up manually
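
A matching rollback script is worth having ready before the switch. A minimal sketch, assuming the same paths as the switch script above:

#!/bin/bash
# blue-green-rollback.sh (hypothetical companion to blue-green-switch.sh)

set -e

NGINX_CONF="/etc/nginx/conf.d/myapp.conf"
BACKUP_CONF="/tmp/myapp.conf.backup"

# Restore the configuration that was backed up before the switch
cp "$BACKUP_CONF" "$NGINX_CONF"

# Validate, then reload without downtime
nginx -t
nginx -s reload
echo "✅ Rolled back to Blue environment"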

AWS CodeDeploy Implementation Example

AppSpec file (blue-green deployment):

# appspec.yml
version: 0.0
os: linux
files:
  - source: /
    destination: /var/www/myapp
hooks:
  BeforeInstall:
    - location: scripts/before_install.sh
      timeout: 300
      runas: root
  AfterInstall:
    - location: scripts/after_install.sh
      timeout: 300
      runas: root
  ApplicationStart:
    - location: scripts/start_application.sh
      timeout: 300
      runas: root
  ValidateService:
    - location: scripts/validate_service.sh
      timeout: 300
      runas: root

Validation script:

#!/bin/bash
# scripts/validate_service.sh

# Wait for the application to start
sleep 10

# Health check
HEALTH_CHECK_URL="http://localhost:8080/health"
MAX_ATTEMPTS=30

for i in $(seq 1 $MAX_ATTEMPTS); do
    if curl -f -s $HEALTH_CHECK_URL | grep -q "healthy"; then
        echo "✅ Application is healthy"
        exit 0
    fi
    echo "⏳ Waiting for application to be healthy (attempt $i/$MAX_ATTEMPTS)"
    sleep 10
done

echo "❌ Application failed health check"
exit 1
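
The AppSpec also references install and start hooks whose contents depend on your stack. As an illustration only, a minimal start_application.sh might look like the following (the systemd unit name "myapp" is an assumption):

#!/bin/bash
# scripts/start_application.sh (hypothetical sketch; adapt to your stack)

# Restart the application under systemd
systemctl restart myapp

# Fail the hook right away if the service did not come up
systemctl is-active --quiet myapp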

Practical Considerations for Blue-Green Deployment

1. Database Schema Changes

The biggest challenge in blue-green deployment is the shared database. Schema changes must be backward compatible:

-- ❌ Wrong: dropping the column outright (the Blue environment will start failing)
ALTER TABLE users DROP COLUMN legacy_field;

-- ✅ Right: migrate in phases
-- Phase 1: add the new column (both Blue and Green keep working)
ALTER TABLE users ADD COLUMN new_field VARCHAR(255);

-- Phase 2: migrate the data
UPDATE users SET new_field = CONCAT(first_name, ' ', last_name);

-- Phase 3: switch the application to Green (which uses new_field)

-- Phase 4: drop the old column only in the next deployment
ALTER TABLE users DROP COLUMN legacy_field;

2. Session Handling

# Centralized session storage with Redis
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  SESSION_STORE: "redis"
  REDIS_HOST: "redis-cluster.default.svc.cluster.local"
  REDIS_PORT: "6379"
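
To wire this ConfigMap into both colors, one option is kubectl set env, shown here against the Deployments defined earlier (a sketch; adjust the names to your setup):

# Inject the ConfigMap as environment variables into both Deployments
kubectl set env deployment/myapp-blue --from=configmap/app-config
kubectl set env deployment/myapp-green --from=configmap/app-config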

3. Cache Invalidation Strategy

# Clear stale cache entries when switching blue/green
import redis

def invalidate_cache_on_deployment():
    r = redis.Redis(host='redis-host', port=6379)
    # Redis DEL does not expand glob patterns, so scan for keys carrying
    # the old version's prefix (the version number is the cache-key prefix)
    for key in r.scan_iter(match='cache:v1.0:*'):
        r.delete(key)
    print("Cache invalidated for old version")

4. Monitoring Metrics

# Metrics that must be monitored after the cutover
# - HTTP error rates (4xx, 5xx)
# - Response times (p50, p95, p99)
# - Database connection count
# - CPU/memory usage

# Example Prometheus queries
rate(http_requests_total{status=~"5.."}[5m])  # 5xx error rate
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))  # p95 latency

Canary Deployment

Principle and Traffic Control

Canary deployment reduces risk by gradually increasing the share of traffic sent to the new version:

                Load Balancer (traffic split)
                        │
        ┌───────────────┼───────────────┐
        │               │               │
    Stable (95%)    Canary (5%)     Database
    v1.0            v1.1            (Shared)

    ↓ Increase Canary traffic step by step once metrics look healthy

    Stable (50%)    Canary (50%)

    ↓ Final full cutover

    Stable (0%)     Canary (100%)

Kubernetes + Istio Implementation Example

1. Traffic splitting with an Istio VirtualService:

# canary-virtualservice.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp-vs
spec:
  hosts:
  - myapp.example.com
  http:
  - match:
    - headers:
        x-canary:
          exact: "true"
    route:
    - destination:
        host: myapp-service
        subset: canary
  - route:
    - destination:
        host: myapp-service
        subset: stable
      weight: 95
    - destination:
        host: myapp-service
        subset: canary
      weight: 5  # 5% of traffic to Canary initially
# canary-destinationrule.yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp-dr
spec:
  host: myapp-service
  subsets:
  - name: stable
    labels:
      version: stable
  - name: canary
    labels:
      version: canary
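
The header-match route in the VirtualService above also gives testers a side door: any request carrying x-canary: true is sent to the canary subset regardless of the weights. For example:

# Force a request onto the canary subset via the header-match route
curl -H "x-canary: true" http://myapp.example.com/api/test

# Regular requests follow the 95/5 weighted split
curl http://myapp.example.com/api/test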

2. Deployment flow and traffic adjustment:

# 1. Deploy the Canary version (5% of traffic)
kubectl apply -f canary-deployment.yaml
kubectl apply -f canary-virtualservice.yaml

# 2. Monitor Canary metrics (for about 30 minutes)
istioctl dashboard prometheus
# Watch: error rate, latency, resource usage

# 3. Gradually increase traffic (if the metrics stay healthy)
# 5% → 10% → 25% → 50% → 100%
kubectl patch virtualservice myapp-vs --type json \
  -p '[{"op":"replace","path":"/spec/http/1/route/0/weight","value":90},
       {"op":"replace","path":"/spec/http/1/route/1/weight","value":10}]'

# Observe for 15-30 minutes after each adjustment

# 4. Complete the cutover to Canary
kubectl patch virtualservice myapp-vs --type json \
  -p '[{"op":"replace","path":"/spec/http/1/route/0/weight","value":0},
       {"op":"replace","path":"/spec/http/1/route/1/weight","value":100}]'

# 5. Clean up the old version
kubectl delete deployment myapp-stable

# To roll back: shift traffic back to stable immediately
kubectl patch virtualservice myapp-vs --type json \
  -p '[{"op":"replace","path":"/spec/http/1/route/0/weight","value":100},
       {"op":"replace","path":"/spec/http/1/route/1/weight","value":0}]'

Nginx Implementation Example (Weighted Round-Robin)

upstream backend {
    # Stable version (weight 19 of 20 ≈ 95% of traffic)
    server 10.0.1.10:8080 weight=19;

    # Canary version (weight 1 of 20 = 5% of traffic)
    server 10.0.2.10:8080 weight=1;
}

server {
    listen 80;
    server_name myapp.example.com;

    location / {
        proxy_pass http://backend;

        # Session affinity, if needed: ip_hash or sticky sessions
        # (note: the ip_hash directive goes in the upstream block, not here)
    }
}
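
Ramping the canary up with plain Nginx means editing the weights and reloading. For instance, moving from a 95/5 split (19:1) to 75/25 (3:1), assuming the config path used earlier:

# Change the stable weight from 19 to 3, validate, and reload with no downtime
sed -i 's/weight=19/weight=3/' /etc/nginx/conf.d/myapp.conf
nginx -t && nginx -s reload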

AWS CodeDeploy Canary Deployment

Deployment configuration file (time-based traffic shifting applies to the Lambda and ECS compute platforms):

{
  "deploymentConfigName": "CodeDeployDefault.ECSCanary10Percent5Minutes",
  "computePlatform": "ECS",
  "trafficRoutingConfig": {
    "type": "TimeBasedCanary",
    "timeBasedCanary": {
      "canaryPercentage": 10,
      "canaryInterval": 5
    }
  }
}
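
With TimeBasedCanary, 10% of traffic shifts first and the remaining 90% follows after the 5-minute bake interval. The built-in configurations can be inspected from the CLI:

# Inspect a predefined canary deployment configuration
aws deploy get-deployment-config \
  --deployment-config-name CodeDeployDefault.ECSCanary10Percent5Minutes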

Custom deployment configuration:

# Create a custom canary configuration: shift 20% of traffic first,
# then the remaining 80% after 10 minutes
aws deploy create-deployment-config \
  --deployment-config-name Custom.Canary20Percent10Minutes \
  --compute-platform ECS \
  --traffic-routing-config '
{
  "type": "TimeBasedCanary",
  "timeBasedCanary": {
    "canaryPercentage": 20,
    "canaryInterval": 10
  }
}'

Automatic Rollback Configuration

Automation with Kubernetes + Prometheus + Flagger:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99  # auto-rollback if the success rate drops below 99%
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500  # auto-rollback if p99 latency exceeds 500 ms
      interval: 1m
    webhooks:
    - name: load-test
      url: http://loadtester.default/
      timeout: 5s
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 http://myapp-canary:8080/"

Monitoring and alerting script:

#!/bin/bash
# canary-monitor.sh

# Query the canary's 5xx rate from Prometheus (default to 0 if no data yet)
CANARY_ERROR_RATE=$(curl -s 'http://prometheus:9090/api/v1/query?query=rate(http_requests_total{version="canary",status=~"5.."}[5m])' | jq -r '.data.result[0].value[1] // "0"')

THRESHOLD=0.01  # 1% error rate

if (( $(echo "$CANARY_ERROR_RATE > $THRESHOLD" | bc -l) )); then
    echo "❌ Canary error rate too high: $CANARY_ERROR_RATE"
    echo "Rolling back to stable..."

    # Automatic rollback
    kubectl patch virtualservice myapp-vs --type json \
      -p '[{"op":"replace","path":"/spec/http/1/route/0/weight","value":100},
           {"op":"replace","path":"/spec/http/1/route/1/weight","value":0}]'

    # Send an alert
    curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
      -d '{"text":"⚠️ Canary deployment auto-rollback triggered"}'

    exit 1
fi

echo "✅ Canary metrics are healthy: error_rate=$CANARY_ERROR_RATE"

Blue-Green vs Canary: How to Choose?

| Factor | Blue-Green | Canary |
|---|---|---|
| Risk control | One-shot cutover; higher risk | Gradual ramp-up; lowest risk |
| Infrastructure cost | Double the resources (100% + 100%) | Lower (100% + 5-20%) |
| Rollback speed | Instant (flip the load balancer) | Traffic must be stepped back down |
| Testing complexity | The Green environment can be tested in full | Canary traffic is small; edge cases are hard to exercise |
| Monitoring demands | Focused monitoring right after the cutover | Continuous monitoring of canary metrics |
| Typical scenarios | Major version upgrades, database migrations | Day-to-day feature iteration, A/B testing |

Recommended strategy:

- Day-to-day deployments: Canary (5% → 10% → 25% → 50% → 100%)
- Major version upgrades: Blue-Green (one cutover after full testing)
- Emergency fixes: Blue-Green (fast switch, fast rollback)
- Experimental features: Canary + feature flags (selected users test first)

Practical Checklist

Before deployment:
- [ ] Database schema changes are backward compatible
- [ ] Sessions use centralized storage (Redis/Memcached)
- [ ] Static assets are versioned (to avoid caching issues)
- [ ] Health-check endpoints are working
- [ ] Monitoring and alerting are configured

During deployment:
- [ ] The new version has passed smoke tests
- [ ] Monitoring dashboards are open
- [ ] Team members are on standby (ready to roll back immediately)
- [ ] Deployment time and traffic split are recorded

After deployment:
- [ ] Watch error rates, latency, and resource usage
- [ ] Check the logs for anomalies
- [ ] Verify critical business flows
- [ ] Clean up old-version resources

Summary

Blue-green and canary deployment are the two pillars of zero-downtime releases:

Blue-green deployment:
- Suited to scenarios that demand fast cutover and fast rollback
- Implemented via Kubernetes Services, Nginx, or AWS CodeDeploy
- The keys are database compatibility and health checks

Canary deployment:
- Suited to validating a new version gradually
- Implemented via Istio, Nginx weights, or Flagger automation
- The keys are monitoring metrics and an automatic rollback mechanism

The two approaches also combine well: validate with a small canary slice first, then use blue-green to complete the remaining cutover in one fast switch once you are confident.

Next steps:
- Integrate these deployment strategies into your CI/CD pipeline
- Set up Prometheus + Grafana monitoring
- Implement an automated rollback mechanism
- Rehearse the deployment and rollback procedures regularly
