In modern software development, the deployment strategy directly affects system availability and user experience. The traditional maintenance-window approach, taking the service offline to deploy, can no longer meet the needs of 24/7 services. Blue-Green Deployment and Canary Deployment offer zero-downtime alternatives, but how do you actually run them? This article goes into the technical details and provides a complete implementation guide.
Why Do We Need Zero-Downtime Deployment?
Business requirements:
- A globally distributed service has no real "off-peak window"
- An SLA commitment of 99.9% availability leaves a downtime budget of only about 43 minutes per month
- Competitors will not wait for your maintenance to finish

Technical challenges:
- How do you update code without interrupting the service?
- How do you handle database schema changes?
- How do you roll back quickly when something goes wrong?
Blue-Green Deployment
Principles and Architecture
Blue-green deployment achieves zero downtime by maintaining two identical production environments:
                Load Balancer
                      │
      ┌───────────────┼───────────────┐
      │               │               │
 Blue (v1.0)     Green (v1.1)      Database
 Currently       New Version       (Shared)
 Active          Standby
Core concepts:
1. Blue environment: the version currently serving production traffic
2. Green environment: the new version is deployed here and stands by once testing passes
3. Switch mechanism: the load balancer shifts traffic from Blue to Green in an instant
4. Rollback capability: if a problem appears, switch straight back to Blue
Kubernetes Implementation Example
1. Switching with a Service Label Selector
# blue-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
      - name: myapp
        image: myapp:v1.0
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 10
# green-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
      - name: myapp
        image: myapp:v1.1
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
# service.yaml (the key to the switch)
apiVersion: v1
kind: Service
metadata:
  name: myapp-service
spec:
  selector:
    app: myapp
    version: blue   # change this to green to complete the cut-over
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
  type: LoadBalancer
Deployment and switch procedure:

# 1. Deploy the Green environment (new version)
kubectl apply -f green-deployment.yaml

# 2. Wait for the Green environment to become ready
kubectl rollout status deployment/myapp-green

# 3. Verify the health of the Green environment
kubectl get pods -l version=green
kubectl logs -l version=green --tail=50

# 4. Run smoke tests
kubectl port-forward deployment/myapp-green 8080:8080
curl http://localhost:8080/health
curl http://localhost:8080/api/test

# 5. Switch traffic to Green (the critical step)
kubectl patch service myapp-service -p '{"spec":{"selector":{"version":"green"}}}'

# 6. Monitor metrics after the switch (5-10 minutes)
kubectl top pods -l version=green
# Watch error rate, latency, CPU/memory

# 7. Once everything checks out, delete the Blue environment
kubectl delete deployment myapp-blue

# To roll back: run this immediately
kubectl patch service myapp-service -p '{"spec":{"selector":{"version":"blue"}}}'
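Whether switching forward or rolling back, it helps to confirm exactly which version the Service selector currently points at and which pods actually sit behind it; a quick check using the resources defined above:

# Show which version the Service selector currently targets
kubectl get service myapp-service -o jsonpath='{.spec.selector.version}{"\n"}'

# Confirm the endpoints behind the Service belong to the expected Deployment
kubectl get endpoints myapp-service
kubectl get pods -l app=myapp,version=green -o wide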
Nginx Implementation Example
Using Nginx as the load balancer to perform the blue-green switch:
# /etc/nginx/conf.d/myapp.conf
upstream backend {
    # Blue-green switch: comment/uncomment the relevant block to switch
    # Blue environment (current production)
    server 10.0.1.10:8080;
    server 10.0.1.11:8080;
    server 10.0.1.12:8080;

    # Green environment (new version, waiting to be switched in)
    # server 10.0.2.10:8080;
    # server 10.0.2.11:8080;
    # server 10.0.2.12:8080;
}

server {
    listen 80;
    server_name myapp.example.com;

    location / {
        proxy_pass http://backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # Fail over to the next upstream on errors/timeouts
        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_connect_timeout 5s;
        proxy_send_timeout 10s;
        proxy_read_timeout 10s;
    }

    location /health {
        access_log off;
        proxy_pass http://backend;
    }
}
Switch script:

#!/bin/bash
# blue-green-switch.sh
set -e

NGINX_CONF="/etc/nginx/conf.d/myapp.conf"
BACKUP_CONF="/tmp/myapp.conf.backup"

# Back up the current configuration
cp "$NGINX_CONF" "$BACKUP_CONF"

# Check the health of the Green environment
for server in 10.0.2.10 10.0.2.11 10.0.2.12; do
    if ! curl -f -s "http://$server:8080/health" > /dev/null; then
        echo "❌ Green server $server is not healthy"
        exit 1
    fi
done
echo "✅ All Green servers are healthy"

# Switch to Green (comment out Blue, enable Green); tolerant of leading indentation
sed -i '/10\.0\.1\./ s/^\(\s*\)server/\1# server/' "$NGINX_CONF"
sed -i '/10\.0\.2\./ s/^\(\s*\)# server/\1server/' "$NGINX_CONF"

# Test the Nginx configuration
if nginx -t; then
    echo "✅ Nginx configuration is valid"
    # Reload Nginx (no downtime)
    nginx -s reload
    echo "✅ Switched to Green environment"
else
    echo "❌ Nginx configuration test failed, rolling back"
    cp "$BACKUP_CONF" "$NGINX_CONF"
    exit 1
fi

# Monitor for 5 minutes
echo "Monitoring for 5 minutes..."
sleep 300

# If everything looks good, the Blue environment can be cleaned up manually
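If problems show up after the switch, rolling back is just a matter of restoring the backed-up configuration and reloading; a minimal sketch, reusing the backup path from the script above:

#!/bin/bash
# blue-green-rollback.sh - restore the previous (Blue) Nginx configuration
set -e

NGINX_CONF="/etc/nginx/conf.d/myapp.conf"
BACKUP_CONF="/tmp/myapp.conf.backup"

# Restore the configuration captured before the switch
cp "$BACKUP_CONF" "$NGINX_CONF"

# Validate and reload with no downtime
nginx -t
nginx -s reload
echo "✅ Rolled back to Blue environment"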
AWS CodeDeploy Implementation Example
AppSpec file (blue-green deployment):
# appspec.yml
version: 0.0
os: linux
files:
  - source: /
    destination: /var/www/myapp
hooks:
  BeforeInstall:
    - location: scripts/before_install.sh
      timeout: 300
      runas: root
  AfterInstall:
    - location: scripts/after_install.sh
      timeout: 300
      runas: root
  ApplicationStart:
    - location: scripts/start_application.sh
      timeout: 300
      runas: root
  ValidateService:
    - location: scripts/validate_service.sh
      timeout: 300
      runas: root
Validation script:

#!/bin/bash
# scripts/validate_service.sh

# Wait for the application to start
sleep 10

# Health check
HEALTH_CHECK_URL="http://localhost:8080/health"
MAX_ATTEMPTS=30

for i in $(seq 1 $MAX_ATTEMPTS); do
    if curl -f -s "$HEALTH_CHECK_URL" | grep -q "healthy"; then
        echo "✅ Application is healthy"
        exit 0
    fi
    echo "⏳ Waiting for application to be healthy (attempt $i/$MAX_ATTEMPTS)"
    sleep 10
done

echo "❌ Application failed health check"
exit 1
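The appspec and hook scripts ship inside the application bundle; the blue-green deployment itself is started from the console or the CLI. A hedged sketch of triggering one, assuming an application named myapp, a deployment group myapp-bluegreen already configured for blue/green, and a revision uploaded to S3 (all names here are placeholders):

# Trigger a blue-green deployment from a revision stored in S3
aws deploy create-deployment \
  --application-name myapp \
  --deployment-group-name myapp-bluegreen \
  --deployment-config-name CodeDeployDefault.AllAtOnce \
  --s3-location bucket=my-deploy-bucket,key=myapp-v1.1.zip,bundleType=zip

# Check progress (substitute the deploymentId returned above)
aws deploy get-deployment --deployment-id d-EXAMPLE123 --query 'deploymentInfo.status'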
Practical Considerations for Blue-Green Deployment
1. Database Schema Changes
The biggest challenge in blue-green deployment is the shared database. Schema changes must remain backward compatible:

-- ❌ Wrong: drop a column directly (the Blue environment will start failing)
ALTER TABLE users DROP COLUMN legacy_field;

-- ✅ Right: migrate in phases
-- Phase 1: add the new column (both Blue and Green keep working)
ALTER TABLE users ADD COLUMN new_field VARCHAR(255);

-- Phase 2: migrate the data
UPDATE users SET new_field = CONCAT(first_name, ' ', last_name);

-- Phase 3: switch the application to Green (which uses new_field)

-- Phase 4: drop the old column only in a later deployment
ALTER TABLE users DROP COLUMN legacy_field;
2. Session Handling

# Use Redis as a centralized session store
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  SESSION_STORE: "redis"
  REDIS_HOST: "redis-cluster.default.svc.cluster.local"
  REDIS_PORT: "6379"
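With sessions in a central store, users stay logged in across the Blue to Green switch. A quick sanity check that session data really lives in Redis rather than in pod memory, assuming a Redis Deployment reachable as deploy/redis and a session: key prefix (both are assumptions about your setup):

# List a few session keys before the switch, then repeat after switching traffic
kubectl exec deploy/redis -- redis-cli --scan --pattern 'session:*' | head -n 5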
3. Cache Invalidation Strategy

# Clear stale cache entries when switching blue/green
import redis

def invalidate_cache_on_deployment():
    r = redis.Redis(host='redis-host', port=6379)
    # The version number is used as the cache key prefix.
    # DEL does not accept wildcards, so scan for matching keys and delete them.
    for key in r.scan_iter('cache:v1.0:*'):
        r.delete(key)
    print("Cache invalidated for old version")
4. Monitoring Metrics

# Metrics that must be watched after the switch:
# - HTTP error rate (4xx, 5xx)
# - Response time (p50, p95, p99)
# - Database connection count
# - CPU/memory usage

# Example Prometheus queries
rate(http_requests_total{status=~"5.."}[5m])                               # 5xx error rate
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))   # p95 latency
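The same queries can also be run from the command line against the Prometheus HTTP API as part of a switch checklist; a small sketch, assuming Prometheus is reachable at prometheus:9090 and that request metrics carry a version="green" label (both are assumptions about your setup):

# 5xx rate for the Green version via the Prometheus HTTP API
QUERY='sum(rate(http_requests_total{version="green",status=~"5.."}[5m]))'
curl -s -G 'http://prometheus:9090/api/v1/query' --data-urlencode "query=$QUERY" \
  | jq -r '.data.result[0].value[1] // "0"'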
Canary Deployment
Principles and Traffic Control
Canary deployment reduces risk by gradually increasing the share of traffic sent to the new version:

              Load Balancer (traffic split)
                      │
      ┌───────────────┼───────────────┐
      │               │               │
 Stable (95%)     Canary (5%)      Database
 v1.0             v1.1             (Shared)

      ↓ once metrics look healthy, gradually increase Canary traffic
 Stable (50%)     Canary (50%)

      ↓ finally switch over completely
 Stable (0%)      Canary (100%)
Kubernetes + Istio Implementation Example
1. Istio VirtualService for traffic splitting:

# canary-virtualservice.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp-vs
spec:
  hosts:
  - myapp.example.com
  http:
  - match:
    - headers:
        x-canary:
          exact: "true"
    route:
    - destination:
        host: myapp-service
        subset: canary
  - route:
    - destination:
        host: myapp-service
        subset: stable
      weight: 95
    - destination:
        host: myapp-service
        subset: canary
      weight: 5   # 5% of traffic goes to the Canary initially
# canary-destinationrule.yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp-dr
spec:
  host: myapp-service
  subsets:
  - name: stable
    labels:
      version: stable
  - name: canary
    labels:
      version: canary
2. Deployment flow and traffic adjustment:

# 1. Deploy the Canary version (5% of traffic)
kubectl apply -f canary-deployment.yaml
kubectl apply -f canary-virtualservice.yaml

# 2. Monitor Canary metrics (30 minutes)
istioctl dashboard prometheus
# Watch: error rate, latency, resource usage

# 3. Gradually increase traffic (if the metrics stay healthy)
# 5% → 10% → 25% → 50% → 100%
kubectl patch virtualservice myapp-vs --type json \
  -p '[{"op":"replace","path":"/spec/http/1/route/0/weight","value":90},
       {"op":"replace","path":"/spec/http/1/route/1/weight","value":10}]'
# Observe for 15-30 minutes after each adjustment

# 4. Switch over to the Canary completely
kubectl patch virtualservice myapp-vs --type json \
  -p '[{"op":"replace","path":"/spec/http/1/route/0/weight","value":0},
       {"op":"replace","path":"/spec/http/1/route/1/weight","value":100}]'

# 5. Clean up the old version
kubectl delete deployment myapp-stable

# To roll back: immediately shift traffic back to stable
kubectl patch virtualservice myapp-vs --type json \
  -p '[{"op":"replace","path":"/spec/http/1/route/0/weight","value":100},
       {"op":"replace","path":"/spec/http/1/route/1/weight","value":0}]'
Nginx Implementation Example (Weighted Round-Robin)

upstream backend {
    # Stable version (weight 19, roughly 95% of traffic)
    server 10.0.1.10:8080 weight=19;
    # Canary version (weight 1, roughly 5% of traffic)
    server 10.0.2.10:8080 weight=1;
}

server {
    listen 80;
    server_name myapp.example.com;

    location / {
        proxy_pass http://backend;
        # To keep session affinity (if needed), add ip_hash or a sticky
        # directive inside the upstream block above
    }
}
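Ramping the Canary up with plain Nginx then just means editing the weights and reloading. A rough sketch that bumps the Canary from roughly 5% to 25% (weights 15:5), assuming the upstream above lives in /etc/nginx/conf.d/myapp.conf:

#!/bin/bash
# Shift the weighted split from 19:1 (~5% canary) to 15:5 (~25% canary)
CONF="/etc/nginx/conf.d/myapp.conf"

sed -i 's/10\.0\.1\.10:8080 weight=19;/10.0.1.10:8080 weight=15;/' "$CONF"
sed -i 's/10\.0\.2\.10:8080 weight=1;/10.0.2.10:8080 weight=5;/' "$CONF"

# Validate and reload with no downtime
nginx -t && nginx -s reload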
AWS CodeDeploy Canary Deployment
Deployment configuration file:
{
  "deploymentConfigName": "CodeDeployDefault.Canary10Percent5Minutes",
  "computePlatform": "Server",
  "trafficRoutingConfig": {
    "type": "TimeBasedCanary",
    "timeBasedCanary": {
      "canaryPercentage": 10,
      "canaryInterval": 5
    }
  }
}
Custom deployment configuration:

# Create a custom Canary config: shift 20% of traffic first,
# then the remainder after 10 minutes
aws deploy create-deployment-config \
  --deployment-config-name Custom.Canary20Percent10Minutes \
  --compute-platform Server \
  --traffic-routing-config '
  {
    "type": "TimeBasedCanary",
    "timeBasedCanary": {
      "canaryPercentage": 20,
      "canaryInterval": 10
    }
  }'
Automated Rollback
Automating it with Kubernetes + Prometheus + Flagger:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99      # roll back automatically if the success rate drops below 99%
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500     # roll back automatically if p99 latency exceeds 500ms
      interval: 1m
    webhooks:
    - name: load-test
      url: http://loadtester.default/
      timeout: 5s
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 http://myapp-canary:8080/"
Monitoring and alerting script:

#!/bin/bash
# canary-monitor.sh

# Query the Canary 5xx rate from Prometheus (URL-encode the query to avoid curl globbing)
CANARY_ERROR_RATE=$(curl -s -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{version="canary",status=~"5.."}[5m]))' \
  | jq -r '.data.result[0].value[1] // "0"')
THRESHOLD=0.01   # 1% error rate

if (( $(echo "$CANARY_ERROR_RATE > $THRESHOLD" | bc -l) )); then
    echo "❌ Canary error rate too high: $CANARY_ERROR_RATE"
    echo "Rolling back to stable..."

    # Automatic rollback
    kubectl patch virtualservice myapp-vs --type json \
      -p '[{"op":"replace","path":"/spec/http/1/route/0/weight","value":100},
           {"op":"replace","path":"/spec/http/1/route/1/weight","value":0}]'

    # Send an alert
    curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
      -d '{"text":"⚠️ Canary deployment auto-rollback triggered"}'

    exit 1
fi

echo "✅ Canary metrics are healthy: error_rate=$CANARY_ERROR_RATE"
Blue-Green vs. Canary: How to Choose?
| Consideration | Blue-Green Deployment | Canary Deployment |
|---|---|---|
| Risk control | One-shot switch, higher risk | Gradual ramp-up, lowest risk |
| Infrastructure cost | Double the resources (100% + 100%) | Lower overhead (100% + 5-20%) |
| Rollback speed | Instant (flip the load balancer) | Traffic must be stepped back down |
| Testing complexity | The Green environment can be tested in full | Canary traffic is small, so edge cases are hard to cover |
| Monitoring demands | Focused monitoring right after the switch | Continuous monitoring of Canary metrics |
| Typical scenarios | Major version upgrades, database migrations | Day-to-day feature iteration, A/B testing |
Recommended strategy:
- Routine deployments: Canary (5% → 10% → 25% → 50% → 100%)
- Major version upgrades: Blue-Green (one switch after full testing)
- Emergency fixes: Blue-Green (fast switch, fast rollback)
- Experimental features: Canary + feature flags (expose to specific users first)
Practical Checklist
Before deployment:
- [ ] Database schema changes are backward compatible
- [ ] Sessions use a centralized store (Redis/Memcached)
- [ ] Static assets are versioned (to avoid caching issues)
- [ ] Health-check endpoints are working
- [ ] Monitoring and alerting are configured

During deployment:
- [ ] The new version passes smoke tests
- [ ] Monitoring dashboards are open
- [ ] Team members are on standby, ready to roll back immediately
- [ ] Deployment time and traffic split are recorded

After deployment:
- [ ] Watch error rate, latency, and resource usage
- [ ] Check the logs for anomalies
- [ ] Verify key business flows
- [ ] Clean up old-version resources
Summary
Blue-green and canary deployments are the two pillars of zero-downtime delivery:
Blue-Green Deployment:
- Suits scenarios that need a fast switch and fast rollback
- Implemented with Kubernetes Services, Nginx, or AWS CodeDeploy
- The keys are database compatibility and health checks

Canary Deployment:
- Suits scenarios where the new version is validated gradually
- Implemented with Istio, Nginx weighting, or automated with Flagger
- The keys are monitoring metrics and an automatic rollback mechanism

The two can be combined: validate with a small Canary slice first, then finish the remaining cut-over quickly with a Blue-Green switch once it checks out.
Next steps:
- Build a CI/CD pipeline that integrates these deployment strategies
- Set up Prometheus + Grafana monitoring
- Implement automated rollback
- Rehearse deployment and rollback procedures regularly