A Complete Guide to Quartz Cluster Mode: Preventing Duplicate Job Execution Across Multiple ECS Containers


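In a modern microservice architecture, an application usually runs as several container instances behind a load balancer and scales automatically. Once Quartz Scheduler is used for job scheduling, a key question follows: how do you make sure the same scheduled job is not executed by more than one container?

This article explains how Quartz cluster mode works and walks through a complete configuration for AWS ECS, so that job scheduling stays reliable and consistent.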

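Problem Scenario: The Multi-Container Challenge

Suppose your application is deployed on AWS ECS, with Auto Scaling adjusting the number of containers:

  • Normal hours: 3 container instances
  • Peak hours: scales out to 10 container instances
  • Scheduled job: a data synchronization job runs every day at 2:00 a.m.

Problem: without the right configuration, every container that is running at 2:00 a.m. (up to 10 of them) executes the data synchronization job at the same time, which leads to:

  • ❌ Data processed more than once
  • ❌ Database lock contention
  • ❌ Wasted system resources
  • ❌ Inconsistent data

Solution: Quartz cluster mode uses database locks so that each firing of a job is executed by only one container.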

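How Quartz Cluster Mode Works

1. Database locking (the QRTZ_LOCKS table)

Quartz coordinates container instances with pessimistic row locks in the database, so that a given trigger firing is claimed by only one instance.

QRTZ_LOCKS table structure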

CREATE TABLE QRTZ_LOCKS (
  SCHED_NAME VARCHAR(120) NOT NULL,
  LOCK_NAME VARCHAR(40) NOT NULL,
  PRIMARY KEY (SCHED_NAME, LOCK_NAME)
);

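-- Lock rows used by Quartz 2.x
-- TRIGGER_ACCESS: serializes reads and updates of triggers and jobs
-- STATE_ACCESS: serializes updates of scheduler state (cluster check-ins and recovery)
-- (Quartz 1.x also used JOB_ACCESS, CALENDAR_ACCESS and MISFIRE_ACCESS)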

Lock acquisition flow

-- Container A claims the trigger (PostgreSQL; a simplified illustration of what Quartz does internally)
BEGIN;

-- Take the row lock. Quartz issues a plain SELECT ... FOR UPDATE (no NOWAIT),
-- so a competing instance blocks here until the lock is released.
SELECT * FROM QRTZ_LOCKS 
WHERE SCHED_NAME = 'MyECSScheduler' AND LOCK_NAME = 'TRIGGER_ACCESS'
FOR UPDATE;

-- Claim the trigger: WAITING -> ACQUIRED
UPDATE QRTZ_TRIGGERS 
SET TRIGGER_STATE = 'ACQUIRED'
WHERE SCHED_NAME = 'MyECSScheduler'
  AND TRIGGER_NAME = 'dataSync' 
  AND TRIGGER_GROUP = 'DEFAULT'
  AND TRIGGER_STATE = 'WAITING';

-- Record which instance claimed it (time columns are epoch milliseconds)
INSERT INTO QRTZ_FIRED_TRIGGERS (
  SCHED_NAME, ENTRY_ID, TRIGGER_NAME, TRIGGER_GROUP,
  INSTANCE_NAME, FIRED_TIME, SCHED_TIME, PRIORITY, STATE
) VALUES (
  'MyECSScheduler', 'unique_entry_id', 'dataSync', 'DEFAULT',
  'ecs-task-12345-abc', 1760896800150, 1760896800000, 5, 'ACQUIRED'
);

COMMIT;  -- releases the row lock

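What the other containers do

-- Container B tries to take the same lock at the same moment
BEGIN;

SELECT * FROM QRTZ_LOCKS 
WHERE SCHED_NAME = 'MyECSScheduler' AND LOCK_NAME = 'TRIGGER_ACCESS'
FOR UPDATE;
-- ⏳ Blocks until Container A commits, then obtains the lock

-- By now the trigger is no longer in the WAITING state,
-- so Container B finds nothing to claim and simply moves on
COMMIT;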
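
The arbitration described above can be tried out without a database. The following is a minimal in-memory sketch, not Quartz code: a `ReentrantLock` stands in for the row lock on `TRIGGER_ACCESS`, and a string field stands in for `QRTZ_TRIGGERS.TRIGGER_STATE`. The class and method names (`TriggerAcquisitionDemo`, `tryAcquire`, `race`) are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

/** In-memory stand-in for the TRIGGER_ACCESS row lock plus one trigger's state column. */
class TriggerAcquisitionDemo {
    // Plays the role of SELECT ... FOR UPDATE: callers block until the lock is free.
    private final ReentrantLock triggerAccess = new ReentrantLock();
    // Plays the role of QRTZ_TRIGGERS.TRIGGER_STATE for a single trigger.
    private String triggerState = "WAITING";

    /** Returns true only for the caller that moves the trigger from WAITING to ACQUIRED. */
    boolean tryAcquire() {
        triggerAccess.lock();              // blocks while another "container" holds the lock
        try {
            if (!"WAITING".equals(triggerState)) {
                return false;              // this firing was already claimed
            }
            triggerState = "ACQUIRED";
            return true;
        } finally {
            triggerAccess.unlock();        // corresponds to COMMIT releasing the row lock
        }
    }

    /** Starts the given number of "containers" at once and returns those that claimed the trigger. */
    static List<String> race(int containers) {
        TriggerAcquisitionDemo store = new TriggerAcquisitionDemo();
        List<String> winners = Collections.synchronizedList(new ArrayList<>());
        CountDownLatch start = new CountDownLatch(1);
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < containers; i++) {
            String name = "container-" + i;
            Thread t = new Thread(() -> {
                try {
                    start.await();         // line everyone up so they compete at the same moment
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                if (store.tryAcquire()) {
                    winners.add(name);
                }
            });
            threads.add(t);
            t.start();
        }
        start.countDown();
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return winners;
    }

    public static void main(String[] args) {
        System.out.println("Trigger acquired by: " + race(10));
    }
}
```

Whichever thread wins varies from run to run, but the returned list always has exactly one entry: the lock serializes the competitors, and the state check makes every later arrival back off.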

2. Heartbeats and failover

Each container instance periodically updates its row in the QRTZ_SCHEDULER_STATE table to announce that it is still alive.

QRTZ_SCHEDULER_STATE table

CREATE TABLE QRTZ_SCHEDULER_STATE (
  SCHED_NAME VARCHAR(120) NOT NULL,
  INSTANCE_NAME VARCHAR(200) NOT NULL,
  LAST_CHECKIN_TIME BIGINT NOT NULL,
  CHECKIN_INTERVAL BIGINT NOT NULL,
  PRIMARY KEY (SCHED_NAME, INSTANCE_NAME)
);

Heartbeat example

-- List the container instances and when each last checked in
SELECT 
  instance_name,
  TO_TIMESTAMP(last_checkin_time / 1000) AS last_checkin,
  checkin_interval,
  EXTRACT(EPOCH FROM (NOW() - TO_TIMESTAMP(last_checkin_time / 1000))) * 1000 AS ms_since_checkin
FROM QRTZ_SCHEDULER_STATE
WHERE sched_name = 'MyECSScheduler'
ORDER BY last_checkin_time DESC;

-- Sample output
/*
instance_name              | last_checkin            | checkin_interval | ms_since_checkin
---------------------------|-------------------------|------------------|------------------
ecs-task-12345-abc-001     | 2025-10-20 10:30:45     | 20000            | 5234
ecs-task-67890-def-002     | 2025-10-20 10:30:47     | 20000            | 3125
ecs-task-11111-ghi-003     | 2025-10-20 10:30:43     | 20000            | 7456
ecs-task-22222-jkl-004     | 2025-10-20 10:29:10     | 20000            | 100234  ⚠️ likely failed
*/

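Failover

Quartz detects failed container instances automatically:

  • Detection threshold: an instance is treated as failed once its last_checkin_time is older than roughly one checkin_interval plus a fixed grace period of about 7.5 seconds (longer if the detecting instance itself checked in late)
  • Automatic takeover: a healthy instance releases the triggers that the failed instance had claimed
  • Recovery: jobs that request recovery are re-fired, so a single crashed container does not stall scheduling

-- Return the failed instance's claimed triggers to a runnable state (simplified)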
UPDATE QRTZ_TRIGGERS
SET TRIGGER_STATE = 'WAITING'
WHERE TRIGGER_STATE = 'ACQUIRED'
  AND EXISTS (
    SELECT 1 FROM QRTZ_FIRED_TRIGGERS ft
    WHERE ft.trigger_name = QRTZ_TRIGGERS.trigger_name
      AND ft.instance_name = 'ecs-task-22222-jkl-004'  -- 失效實例
  );
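
The detection deadline can be worked through as plain arithmetic. The sketch below is modelled on my reading of the Quartz 2.x `JobStoreSupport` source, where the deadline is the peer's last check-in plus the larger of the peer's check-in interval and the detecting node's own check-in delay, plus a fixed 7500 ms. Treat that formula and the constant as assumptions to verify against the Quartz version you run; the class and method names here are hypothetical.

```java
/** Sketch of the cluster failure-detection deadline (assumed formula, see the note above). */
class FailoverThreshold {
    // Fixed grace period added on top of the check-in interval (assumed: 7500 ms).
    static final long GRACE_MS = 7_500L;

    /**
     * @param peerLastCheckin LAST_CHECKIN_TIME of the instance being judged (epoch ms)
     * @param peerInterval    CHECKIN_INTERVAL of that instance (ms)
     * @param ownLastCheckin  when the judging instance itself last checked in (epoch ms)
     * @param now             current time (epoch ms)
     * @return the instant after which the peer is considered failed (epoch ms)
     */
    static long failedIfAfter(long peerLastCheckin, long peerInterval, long ownLastCheckin, long now) {
        long ownDelay = now - ownLastCheckin;   // a late judge gives its peers more slack
        return peerLastCheckin + Math.max(peerInterval, ownDelay) + GRACE_MS;
    }

    /** True when the deadline for the peer has already passed. */
    static boolean isFailed(long peerLastCheckin, long peerInterval, long ownLastCheckin, long now) {
        return failedIfAfter(peerLastCheckin, peerInterval, ownLastCheckin, now) < now;
    }

    public static void main(String[] args) {
        long interval = 20_000L;                // clusterCheckinInterval used in this article
        long now = 100_234L;                    // the peer last checked in at t = 0
        long ownLastCheckin = now - 1_000L;     // the judging instance checked in 1 s ago
        System.out.println("deadline = " + failedIfAfter(0L, interval, ownLastCheckin, now));
        System.out.println("failed   = " + isFailed(0L, interval, ownLastCheckin, now));
    }
}
```

Under these assumptions, a 20000 ms interval gives a detection window of about 27.5 seconds after the last check-in, which is why the instance that has been silent for 100234 ms in the sample output counts as failed.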

Complete Configuration for ECS

1. Spring Boot application configuration

application.properties

# ========================================
# Data source (RDS PostgreSQL)
# ========================================
spring.datasource.url=jdbc:postgresql://${RDS_ENDPOINT}:5432/quartz_db
spring.datasource.username=quartz_user
spring.datasource.password=${DB_PASSWORD}
spring.datasource.driver-class-name=org.postgresql.Driver

# ========================================
# HikariCP connection pool
# ========================================
spring.datasource.hikari.maximum-pool-size=10
spring.datasource.hikari.minimum-idle=5
spring.datasource.hikari.connection-timeout=30000
spring.datasource.hikari.idle-timeout=600000
spring.datasource.hikari.max-lifetime=1800000
spring.datasource.hikari.pool-name=QuartzHikariPool

# ========================================
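# Quartz core settings
# ========================================
spring.quartz.job-store-type=jdbc
# The schema is managed by Flyway/Liquibase. Keep this comment on its own line:
# .properties files do not support trailing comments, so "never  # ..." would become the value.
spring.quartz.jdbc.initialize-schema=never

# ========================================
# Cluster mode (the critical part)
# ========================================
# Enable cluster mode
spring.quartz.properties.org.quartz.jobStore.isClustered=true

# Check-in (heartbeat) interval in milliseconds; the documented default is 15000
spring.quartz.properties.org.quartz.jobStore.clusterCheckinInterval=20000

# Generate a unique instance ID automatically
spring.quartz.properties.org.quartz.scheduler.instanceId=AUTO

# Every container must use the same scheduler name
spring.quartz.properties.org.quartz.scheduler.instanceName=MyECSScheduler

# ========================================
# JobStore
# ========================================
# With job-store-type=jdbc, Spring supplies its own DataSource-backed JobStore.
# Forcing JobStoreTX here can fail on recent Spring versions with "DataSource name not set",
# so either omit this property or name the Spring class explicitly.
spring.quartz.properties.org.quartz.jobStore.class=org.springframework.scheduling.quartz.LocalDataSourceJobStore
spring.quartz.properties.org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
spring.quartz.properties.org.quartz.jobStore.tablePrefix=QRTZ_
spring.quartz.properties.org.quartz.jobStore.useProperties=false

# Optimization: acquire triggers in batches.
# The Quartz documentation requires acquireTriggersWithinLock=true when the batch size is
# greater than 1 on a JDBC JobStore.
spring.quartz.properties.org.quartz.jobStore.acquireTriggersWithinLock=true
spring.quartz.properties.org.quartz.scheduler.batchTriggerAcquisitionMaxCount=10
# Triggers may be fired up to this many milliseconds early; keep it small if exact timing matters
spring.quartz.properties.org.quartz.scheduler.batchTriggerAcquisitionFireAheadTimeWindow=5000

# ========================================
# Thread pool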
# ========================================
spring.quartz.properties.org.quartz.threadPool.class=org.quartz.simpl.SimpleThreadPool
spring.quartz.properties.org.quartz.threadPool.threadCount=5
spring.quartz.properties.org.quartz.threadPool.threadPriority=5
spring.quartz.properties.org.quartz.threadPool.threadsInheritContextClassLoaderOfInitializingThread=true

Java configuration class (optional)

import java.util.Properties;
import javax.sql.DataSource;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.quartz.SchedulerFactoryBean;

// Defining this bean replaces Spring Boot's auto-configured scheduler,
// so use either this class or the spring.quartz.* properties above, not both.
@Configuration
public class QuartzClusterConfig {

    @Bean
    public SchedulerFactoryBean schedulerFactoryBean(DataSource dataSource) {
        SchedulerFactoryBean factory = new SchedulerFactoryBean();
        // Setting the DataSource makes Spring plug in its own JDBC JobStore
        // (LocalDataSourceJobStore), so org.quartz.jobStore.class is not set below.
        factory.setDataSource(dataSource);
        factory.setOverwriteExistingJobs(true);
        factory.setAutoStartup(true);
        // Let running jobs finish when the container is stopped
        factory.setWaitForJobsToCompleteOnShutdown(true);
        factory.setQuartzProperties(quartzProperties());

        return factory;
    }

    private Properties quartzProperties() {
        Properties props = new Properties();

        // Cluster settings
        props.put("org.quartz.scheduler.instanceId", "AUTO");
        props.put("org.quartz.scheduler.instanceName", "MyECSScheduler");
        props.put("org.quartz.jobStore.isClustered", "true");
        props.put("org.quartz.jobStore.clusterCheckinInterval", "20000");

        // JobStore settings
        props.put("org.quartz.jobStore.driverDelegateClass",
                  "org.quartz.impl.jdbcjobstore.PostgreSQLDelegate");
        props.put("org.quartz.jobStore.tablePrefix", "QRTZ_");

        return props;
    }
}

2. ECS Task Definition

{
  "family": "quartz-scheduler-cluster",
  "taskRoleArn": "arn:aws:iam::123456789012:role/ecsTaskRole",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "containerDefinitions": [
    {
      "name": "quartz-app",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/quartz-app:latest",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {
          "name": "SPRING_PROFILES_ACTIVE",
          "value": "production"
        },
        {
          "name": "RDS_ENDPOINT",
          "value": "quartz-db.cluster-abc123.us-east-1.rds.amazonaws.com"
        },
        {
          "name": "JAVA_OPTS",
          "value": "-Xms512m -Xmx768m -Duser.timezone=Asia/Taipei"
        }
      ],
      "secrets": [
        {
          "name": "DB_PASSWORD",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:quartz-db-password-abc123"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/quartz-scheduler",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "quartz"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/actuator/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      }
    }
  ]
}

3. ECS Service configuration (Auto Scaling)

{
  "serviceName": "quartz-scheduler-service",
  "cluster": "production-cluster",
  "taskDefinition": "quartz-scheduler-cluster:5",
  "desiredCount": 3,
  "launchType": "FARGATE",
  "networkConfiguration": {
    "awsvpcConfiguration": {
      "subnets": [
        "subnet-abc123",
        "subnet-def456"
      ],
      "securityGroups": ["sg-quartz-app"],
      "assignPublicIp": "DISABLED"
    }
  },
  "loadBalancers": [
    {
      "targetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/quartz-tg/abc123",
      "containerName": "quartz-app",
      "containerPort": 8080
    }
  ],
  "deploymentConfiguration": {
    "maximumPercent": 200,
    "minimumHealthyPercent": 100
  }
}

Job Execution Flow in Detail

Timeline

Time: 2025-10-20 02:00:00 Asia/Taipei (scheduled fire time)

Containers: A, B and C are all running, using the batch acquisition settings from this article

─────────────────────────────────────────────────────────

T0 (shortly before 02:00:00)
  The trigger is about to become due (NEXT_FIRE_TIME = 02:00:00)
  QRTZ_TRIGGERS.TRIGGER_STATE = 'WAITING'
  
  ↓

T1
  Each container's scheduler thread looks for due triggers
  and first requests the TRIGGER_ACCESS lock
  
  Container A: SELECT * FROM QRTZ_LOCKS ... FOR UPDATE
              ✅ obtains the lock
              
  Container B: SELECT * FROM QRTZ_LOCKS ... FOR UPDATE
              ⏳ blocked, waiting for Container A
              
  Container C: SELECT * FROM QRTZ_LOCKS ... FOR UPDATE
              ⏳ blocked, waiting for Container A
  
  ↓

T2
  Container A claims the trigger while holding the lock
  
  UPDATE QRTZ_TRIGGERS 
  SET TRIGGER_STATE='ACQUIRED'
  WHERE TRIGGER_NAME='dataSync' AND TRIGGER_STATE='WAITING'
  
  INSERT INTO QRTZ_FIRED_TRIGGERS (
    instance_name='ecs-task-12345-abc',
    sched_time=1760896800000,
    state='ACQUIRED'
  )
  
  COMMIT; -- releases the lock
  
  ↓

T3
  Containers B and C obtain the lock in turn
  
  Their search for WAITING triggers no longer returns 'dataSync'
  (TRIGGER_STATE = 'ACQUIRED')
  
  ⏭️ Nothing to run (the firing belongs to Container A)
  
  ↓

T4 (02:00:00)
  Container A fires the trigger and starts the job
  
  UPDATE QRTZ_FIRED_TRIGGERS SET STATE='EXECUTING'
  
  UPDATE QRTZ_TRIGGERS 
  SET TRIGGER_STATE='WAITING',  -- 'BLOCKED' for @DisallowConcurrentExecution jobs
      PREV_FIRE_TIME=1760896800000,
      NEXT_FIRE_TIME=1760983200000  -- 02:00 tomorrow
  WHERE TRIGGER_NAME='dataSync'
  
  ↓

T5 (02:00:05) - assuming the job takes 5 seconds
  Container A finishes the job
  
  DELETE FROM QRTZ_FIRED_TRIGGERS 
  WHERE TRIGGER_NAME='dataSync' AND instance_name='ecs-task-12345-abc'
  
  A trigger that was BLOCKED returns to WAITING
  
  ↓

Result: ✅ The job ran exactly once
        ✅ Container A executed it
        ✅ Containers B and C skipped it automatically
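
Quartz stores fire times such as NEXT_FIRE_TIME and FIRED_TIME as epoch milliseconds, so it helps to be able to convert a wall-clock time in the scheduler's time zone into that form when reading the tables. A minimal sketch using `java.time`, assuming the `Asia/Taipei` zone that the task definition sets via `-Duser.timezone` (the `FireTimeEpoch` class is a hypothetical helper):

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;

/** Converts a local wall-clock time into the epoch-millisecond form used by the QRTZ_ tables. */
class FireTimeEpoch {
    static long epochMillis(int year, int month, int day, int hour, int minute, String zone) {
        return ZonedDateTime.of(year, month, day, hour, minute, 0, 0, ZoneId.of(zone))
                .toInstant()
                .toEpochMilli();
    }

    public static void main(String[] args) {
        // 2025-10-20 02:00 in Taipei is 2025-10-19 18:00 UTC
        System.out.println(epochMillis(2025, 10, 20, 2, 0, "Asia/Taipei")); // 1760896800000
        // The next daily firing is exactly 86,400,000 ms later (Taipei has no DST)
        System.out.println(epochMillis(2025, 10, 21, 2, 0, "Asia/Taipei")); // 1760983200000
    }
}
```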

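Example log output (illustrative and simplified, not verbatim Quartz messages; in practice Containers B and C usually just find no due trigger)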

// Container A (ecs-task-12345-abc)
2025-10-20 02:00:00.050 INFO  [QuartzScheduler] Batch acquisition: 1 trigger(s)
2025-10-20 02:00:00.100 INFO  [JobRunShell] Acquiring trigger: DEFAULT.dataSync
2025-10-20 02:00:00.120 INFO  [JobRunShell] Acquired trigger: DEFAULT.dataSync
2025-10-20 02:00:00.150 INFO  [JobRunShell] Executing job: DataSyncJob
2025-10-20 02:00:05.000 INFO  [JobRunShell] Job executed successfully

// Container B (ecs-task-67890-def)
2025-10-20 02:00:00.050 INFO  [QuartzScheduler] Batch acquisition: 1 trigger(s)
2025-10-20 02:00:00.100 WARN  [JobRunShell] Unable to acquire trigger: DEFAULT.dataSync
2025-10-20 02:00:00.105 INFO  [QuartzScheduler] Trigger already acquired by another instance

// Container C (ecs-task-11111-ghi)
2025-10-20 02:00:00.050 INFO  [QuartzScheduler] Batch acquisition: 1 trigger(s)
2025-10-20 02:00:00.102 WARN  [JobRunShell] Unable to acquire trigger: DEFAULT.dataSync
2025-10-20 02:00:00.107 INFO  [QuartzScheduler] Trigger already acquired by another instance

Monitoring and Verification

1. Database-level monitoring

Checking cluster state

-- List the container instances and their check-in status
SELECT 
  instance_name,
  TO_TIMESTAMP(last_checkin_time / 1000) AT TIME ZONE 'Asia/Taipei' AS last_checkin,
  checkin_interval / 1000 AS checkin_interval_sec,
  CASE 
    WHEN (EXTRACT(EPOCH FROM NOW()) * 1000 - last_checkin_time) > (checkin_interval * 3)
    THEN '⚠️ possibly failed'
    ELSE '✅ healthy'
  END AS status
FROM QRTZ_SCHEDULER_STATE
WHERE sched_name = 'MyECSScheduler'
ORDER BY last_checkin_time DESC;

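Inspecting in-flight executions

-- QRTZ_FIRED_TRIGGERS only holds triggers that are currently acquired or executing;
-- Quartz deletes each row when the job completes. For a real execution history,
-- record runs yourself (for example from a JobListener).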
SELECT 
  ft.trigger_name,
  ft.job_name,
  ft.instance_name,
  TO_TIMESTAMP(ft.fired_time / 1000) AT TIME ZONE 'Asia/Taipei' AS executed_at,
  ft.state
FROM QRTZ_FIRED_TRIGGERS ft
WHERE ft.sched_name = 'MyECSScheduler'
ORDER BY ft.fired_time DESC
LIMIT 10;

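Detecting duplicate execution

-- Find firings currently held by more than one instance (should never happen).
-- Grouping by sched_time (the scheduled fire time, Quartz 2.2+) also catches duplicates
-- whose fired_time differs by a few milliseconds.
SELECT 
  trigger_name,
  TO_TIMESTAMP(sched_time / 1000) AT TIME ZONE 'Asia/Taipei' AS scheduled_at,
  COUNT(DISTINCT instance_name) AS execution_count,
  STRING_AGG(instance_name, ', ') AS instances
FROM QRTZ_FIRED_TRIGGERS
WHERE sched_name = 'MyECSScheduler'
  AND fired_time > (EXTRACT(EPOCH FROM NOW() - INTERVAL '1 hour') * 1000)
GROUP BY trigger_name, sched_time
HAVING COUNT(DISTINCT instance_name) > 1;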

2. Application-level monitoring

Spring Boot Actuator integration

@Component
public class QuartzClusterHealthIndicator implements HealthIndicator {
    
    @Autowired
    private Scheduler scheduler;
    
    @Autowired
    private JdbcTemplate jdbcTemplate;
    
    @Override
    public Health health() {
        try {
            String instanceId = scheduler.getSchedulerInstanceId();
            boolean isClustered = scheduler.getMetaData().isJobStoreClustered();
            
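            // Count the instances in the cluster that checked in recently
            Integer activeInstances = jdbcTemplate.queryForObject(
                "SELECT COUNT(*) FROM QRTZ_SCHEDULER_STATE " +
                "WHERE sched_name = ? AND (? - last_checkin_time) < ?",
                Integer.class,
                scheduler.getSchedulerName(),  // avoids hard-coding the scheduler name
                System.currentTimeMillis(),
                60000  // a check-in within the last 60 seconds counts as active
            );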
            
            return Health.up()
                .withDetail("instanceId", instanceId)
                .withDetail("clustered", isClustered)
                .withDetail("activeInstances", activeInstances)
                .withDetail("currentlyExecutingJobs", 
                           scheduler.getCurrentlyExecutingJobs().size())
                .build();
                
        } catch (Exception e) {
            return Health.down()
                .withDetail("error", e.getMessage())
                .build();
        }
    }
}

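Custom metrics

import io.micrometer.core.instrument.FunctionCounter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.springframework.stereotype.Component;

@Component
public class QuartzMetricsCollector {

    // Meters are registered once. Micrometer samples the functions on every scrape,
    // so no @Scheduled polling is needed. (Re-registering a gauge with a boxed int
    // does not update it, and incrementing a counter by a running total over-counts.)
    public QuartzMetricsCollector(Scheduler scheduler, MeterRegistry meterRegistry)
            throws SchedulerException {
        String instanceId = scheduler.getSchedulerInstanceId();

        // Number of jobs currently executing on this instance
        Gauge.builder("quartz.jobs.executing", scheduler, s -> {
                try {
                    return s.getCurrentlyExecutingJobs().size();
                } catch (SchedulerException e) {
                    return Double.NaN;
                }
            })
            .tag("instance", instanceId)
            .register(meterRegistry);

        // Total jobs executed by this instance since its scheduler started
        FunctionCounter.builder("quartz.jobs.executed.total", scheduler, s -> {
                try {
                    return s.getMetaData().getNumberOfJobsExecuted();
                } catch (SchedulerException e) {
                    return Double.NaN;
                }
            })
            .tag("instance", instanceId)
            .register(meterRegistry);
    }
}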

3. CloudWatch integration

@Component
public class QuartzCloudWatchMetrics {
    
    @Autowired
    private AmazonCloudWatch cloudWatch;
    
    @Autowired
    private Scheduler scheduler;
    
    @Scheduled(fixedRate = 60000) // every minute
    public void publishToCloudWatch() throws SchedulerException {
        String instanceId = scheduler.getSchedulerInstanceId();
        
        // Publish a custom metric to CloudWatch
        PutMetricDataRequest request = new PutMetricDataRequest()
            .withNamespace("QuartzScheduler/Cluster")
            .withMetricData(
                new MetricDatum()
                    .withMetricName("ExecutingJobs")
                    .withValue((double) scheduler.getCurrentlyExecutingJobs().size())
                    .withUnit(StandardUnit.Count)
                    .withTimestamp(new Date())
                    .withDimensions(
                        new Dimension()
                            .withName("InstanceId")
                            .withValue(instanceId)
                    )
            );
        
        cloudWatch.putMetricData(request);
    }
}

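Performance Tuning

1. Batch trigger acquisition

Reduce database round trips by acquiring several due triggers in one pass:

# Acquire up to 10 triggers per pass
org.quartz.scheduler.batchTriggerAcquisitionMaxCount=10

# Required by the Quartz documentation when the batch size is greater than 1 on a JDBC JobStore
org.quartz.jobStore.acquireTriggersWithinLock=true

# Allow triggers to be acquired and fired up to 5 seconds ahead of schedule
org.quartz.scheduler.batchTriggerAcquisitionFireAheadTimeWindow=5000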

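2. Connection pool tuning

# HikariCP connection pool
# Size it from the number of containers and the database connection limit
# Rule of thumb: maximum-pool-size ≤ (DB max_connections / container count)

# Example: RDS allows 100 connections and there are 10 containers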
spring.datasource.hikari.maximum-pool-size=8
spring.datasource.hikari.minimum-idle=4
spring.datasource.hikari.connection-timeout=20000
spring.datasource.hikari.validation-timeout=5000
spring.datasource.hikari.leak-detection-threshold=60000
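
The sizing rule can be written down as a small calculation. This is a sketch under stated assumptions: `reservedConnections` is a hypothetical headroom parameter (connections set aside for migrations, admin sessions and monitoring) that the rule of thumb above does not mention, and the `PoolSizing` class is illustrative only.

```java
/** Back-of-the-envelope HikariCP sizing for a fleet of containers sharing one database. */
class PoolSizing {
    /**
     * @param dbMaxConnections    the database's max_connections
     * @param reservedConnections connections kept free for other clients (assumed headroom)
     * @param maxContainers       the largest number of containers that can run at once
     * @return the largest maximum-pool-size each container can safely use
     */
    static int maxPoolSize(int dbMaxConnections, int reservedConnections, int maxContainers) {
        if (maxContainers <= 0) {
            throw new IllegalArgumentException("maxContainers must be positive");
        }
        int usable = dbMaxConnections - reservedConnections;
        if (usable < maxContainers) {
            throw new IllegalArgumentException("not enough connections for one per container");
        }
        return usable / maxContainers;   // integer division rounds down, which is the safe direction
    }

    public static void main(String[] args) {
        // 100 connections, 20 held back, 10 containers
        System.out.println(maxPoolSize(100, 20, 10)); // 8
        // During a rolling deployment with maximumPercent=200, up to 20 tasks may run at once
        System.out.println(maxPoolSize(100, 20, 20)); // 4
    }
}
```

The second call is worth noting: with the deployment configuration shown earlier (maximumPercent of 200), old and new tasks overlap during a deployment, so the container count used for sizing should be the deployment-time peak rather than the steady-state count.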

3. Database indexes

-- Make sure the key indexes exist (the official SQL scripts should already create them)
CREATE INDEX IF NOT EXISTS idx_qrtz_t_nft_st 
ON QRTZ_TRIGGERS(SCHED_NAME, TRIGGER_STATE, NEXT_FIRE_TIME);

CREATE INDEX IF NOT EXISTS idx_qrtz_t_nft_st_misfire 
ON QRTZ_TRIGGERS(SCHED_NAME, MISFIRE_INSTR, NEXT_FIRE_TIME, TRIGGER_STATE);

CREATE INDEX IF NOT EXISTS idx_qrtz_ft_trig_inst_name 
ON QRTZ_FIRED_TRIGGERS(SCHED_NAME, INSTANCE_NAME);

-- Refresh planner statistics periodically (PostgreSQL)
ANALYZE QRTZ_TRIGGERS;
ANALYZE QRTZ_FIRED_TRIGGERS;
ANALYZE QRTZ_SCHEDULER_STATE;

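4. Cleaning up stale data

-- QRTZ_FIRED_TRIGGERS is not a history table: Quartz removes each row when the job completes.
-- This statement only clears rows left behind abnormally (run it from a cron job),
-- and assumes no job legitimately runs for longer than the cutoff.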
DELETE FROM QRTZ_FIRED_TRIGGERS 
WHERE sched_name = 'MyECSScheduler'
  AND fired_time < (EXTRACT(EPOCH FROM NOW() - INTERVAL '7 days') * 1000);

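-- Remove lock rows for scheduler names that have no recent check-in (abnormal cases only;
-- never run this against a scheduler name that is still in use)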
DELETE FROM QRTZ_LOCKS
WHERE sched_name NOT IN (
  SELECT DISTINCT sched_name FROM QRTZ_SCHEDULER_STATE
  WHERE (EXTRACT(EPOCH FROM NOW()) * 1000 - last_checkin_time) < 300000
);

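Troubleshooting

Problem 1: A job runs more than once

Symptom: the same job is executed by several containers at the same time

Possible causes:

  • Container clocks are out of sync
  • isClustered is set to false
  • Containers use different instanceName values

Fixes: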

# 1. Check the container clock
#    (compound commands must be wrapped in sh -c; ntpq has to be installed in the image)
aws ecs execute-command \
  --cluster production-cluster \
  --task task-id-12345 \
  --container quartz-app \
  --interactive \
  --command "sh -c 'date && ntpq -p'"

# 2. Verify the configuration
aws ecs execute-command \
  --cluster production-cluster \
  --task task-id-12345 \
  --container quartz-app \
  --interactive \
  --command "sh -c 'grep isClustered /app/application.properties'"

-- 3. Check that every instance registered under the same scheduler name
SELECT DISTINCT sched_name 
FROM QRTZ_SCHEDULER_STATE;
-- Expect exactly one row: MyECSScheduler

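Problem 2: A job does not run

Symptom: the scheduled time passes but the job never executes

Possible causes:

  • No container can obtain the database lock
  • The connection pool is exhausted
  • The trigger is stuck in a bad state (BLOCKED or ERROR)

Diagnosis:

-- 1. Check the trigger state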
SELECT 
  trigger_name,
  trigger_state,
  TO_TIMESTAMP(next_fire_time / 1000) AT TIME ZONE 'Asia/Taipei' AS next_fire,
  TO_TIMESTAMP(prev_fire_time / 1000) AT TIME ZONE 'Asia/Taipei' AS prev_fire
FROM QRTZ_TRIGGERS
WHERE sched_name = 'MyECSScheduler'
  AND trigger_name = 'dataSync';

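-- 2. Check for stuck locks
-- The rows in QRTZ_LOCKS always exist, so listing them only confirms they are present
SELECT * FROM QRTZ_LOCKS 
WHERE sched_name = 'MyECSScheduler';
-- To see sessions actually waiting on a lock, look at pg_stat_activity
SELECT pid, state, wait_event_type, wait_event, query_start, query
FROM pg_stat_activity
WHERE datname = 'quartz_db' AND wait_event_type = 'Lock';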

-- 3. Check the number of database connections
SELECT 
  count(*) AS active_connections,
  max_conn.setting AS max_connections
FROM pg_stat_activity,
     (SELECT setting FROM pg_settings WHERE name = 'max_connections') max_conn
WHERE datname = 'quartz_db'
GROUP BY max_conn.setting;

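Fix:

-- If a trigger is stuck in BLOCKED, reset it manually.
-- First confirm that no instance is still running the job: BLOCKED is the normal state
-- while a @DisallowConcurrentExecution job is executing.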
UPDATE QRTZ_TRIGGERS
SET TRIGGER_STATE = 'WAITING'
WHERE sched_name = 'MyECSScheduler'
  AND trigger_state = 'BLOCKED';

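Problem 3: Jobs are not recovered after a container fails

Symptom: after a container crashes, the other containers do not take over its jobs

Possible causes:

  • clusterCheckinInterval is too long, so failure detection is slow
  • The job does not request recovery, so an interrupted run is not re-fired and the job only runs again at its next scheduled time

Fixes:

# Shorten the check-in interval (documented default: 15000 ms)
org.quartz.jobStore.clusterCheckinInterval=10000

# A failed instance is detected roughly one check-in interval plus a grace period
# of about 7.5 seconds after its last check-in: about 10000 + 7500 = 17.5 seconds here

-- Remove a failed instance's state row manually (emergencies only; Quartz normally
-- deletes this row itself once it has recovered the instance's triggers)
DELETE FROM QRTZ_SCHEDULER_STATE
WHERE instance_name = 'ecs-task-failed-instance'
  AND sched_name = 'MyECSScheduler';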

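Best Practices Summary

1. Configuration checklist

Setting         | Required value           | Notes
----------------|--------------------------|---------------------------------------------
isClustered     | true                     | Enables cluster mode
instanceId      | AUTO                     | Generates a unique ID per instance
instanceName    | Same on every container  | For example MyECSScheduler
job-store-type  | jdbc                     | Persists scheduler state in the database
dataSource      | Shared RDS               | All containers connect to the same database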

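2. Job class design principles

import org.quartz.DisallowConcurrentExecution;
import org.quartz.Job;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;
import org.quartz.PersistJobDataAfterExecution;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// ✅ Recommended: idempotent, with thorough exception handling
@DisallowConcurrentExecution   // prevents concurrent runs of the same job
@PersistJobDataAfterExecution  // persists JobDataMap changes after execution
public class DataSyncJob implements Job {
    
    private static final Logger log = LoggerFactory.getLogger(DataSyncJob.class);
    
    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        String instanceId = "unknown";
        
        try {
            // getSchedulerInstanceId() throws SchedulerException, so it belongs inside the try
            instanceId = context.getScheduler().getSchedulerInstanceId();
            log.info("[{}] Starting DataSyncJob", instanceId);
            
            // Idempotency check
            if (isAlreadyProcessed(context)) {
                log.warn("[{}] Job already processed, skipping", instanceId);
                return;
            }
            
            // Run the business logic
            performDataSync();
            
            // Mark this run as processed
            markAsProcessed(context);
            
            log.info("[{}] DataSyncJob completed successfully", instanceId);
            
        } catch (Exception e) {
            log.error("[{}] DataSyncJob failed", instanceId, e);
            
            // refireImmediately = false: do not retry right away,
            // to avoid cascading failures across the cluster
            throw new JobExecutionException(e, false);
        }
    }
    
    private boolean isAlreadyProcessed(JobExecutionContext context) {
        // Implement the idempotency check here
        return false;
    }
    
    private void performDataSync() {
        // Business logic
    }
    
    private void markAsProcessed(JobExecutionContext context) {
        // Record that this run was processed
    }
}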
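
The idempotency check is left as a stub in the class above. One way to fill it in is to derive a key from the job name and the scheduled fire time (available in Quartz through `JobExecutionContext.getScheduledFireTime()`) and to claim that key atomically before doing any work. The sketch below shows the idea with an in-memory map so that it runs on its own. The `IdempotencyGuard` class and its methods are hypothetical, and in a real cluster the map would have to be replaced by something shared, such as a database table with a unique constraint on the key, because each container has its own heap.

```java
import java.time.Instant;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** In-memory illustration of claiming a job run exactly once per scheduled fire time. */
class IdempotencyGuard {
    // Key -> time the run was claimed. A shared store is required across containers.
    private final ConcurrentMap<String, Instant> processed = new ConcurrentHashMap<>();

    /** Builds a key that is identical on every container for the same scheduled firing. */
    static String keyFor(String jobName, long scheduledFireTimeMillis) {
        return jobName + "@" + scheduledFireTimeMillis;
    }

    /** Atomically claims the key; returns true only for the first caller. */
    boolean claim(String key) {
        return processed.putIfAbsent(key, Instant.now()) == null;
    }

    public static void main(String[] args) {
        IdempotencyGuard guard = new IdempotencyGuard();
        String key = keyFor("dataSync", 1760896800000L);
        System.out.println(guard.claim(key)); // true  (first claim: do the work)
        System.out.println(guard.claim(key)); // false (duplicate: skip)
    }
}
```

Using the scheduled fire time rather than the actual start time matters: two containers that both try to run the 02:00 firing would start a few milliseconds apart, but they share the same scheduled time and therefore the same key.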

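3. Maintenance automation

-- Schedule a periodic cleanup (PostgreSQL pg_cron extension).
-- Quartz deletes QRTZ_FIRED_TRIGGERS rows itself when jobs complete,
-- so this only removes rows left behind abnormally.
SELECT cron.schedule(
  'quartz-cleanup',
  '0 3 * * *',  -- every day at 03:00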
  $$
  DELETE FROM QRTZ_FIRED_TRIGGERS 
  WHERE fired_time < (EXTRACT(EPOCH FROM NOW() - INTERVAL '7 days') * 1000);
  $$
);

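4. Security recommendations

  • Least-privilege database user: grant only SELECT, INSERT, UPDATE and DELETE on the Quartz tables
  • Use Secrets Manager: manage the database password in AWS Secrets Manager
  • Enable SSL: encrypt RDS connections with SSL/TLS
  • Network isolation: keep the Quartz containers and RDS in private subnets

# Enable SSL (sslmode=require encrypts the connection but does not verify the server
# certificate; use sslmode=verify-full with the RDS CA bundle if you need verification)
spring.datasource.url=jdbc:postgresql://quartz-db.abc.rds.amazonaws.com:5432/quartz?ssl=true&sslmode=require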

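Conclusion

Quartz cluster mode combines database locking with heartbeat-based failure detection to handle job scheduling across multiple containers. On AWS ECS, the key settings are:

  • ✅ Enable cluster mode (isClustered=true)
  • ✅ Generate instance IDs automatically (instanceId=AUTO)
  • ✅ Use the same scheduler name on every container
  • ✅ Share a single RDS database
  • ✅ Size the connection pool and tune performance appropriately

With the configuration, monitoring and best practices described in this article, you can build a highly available, scalable job scheduling system that avoids duplicate execution and stays reliable and consistent however many containers Auto Scaling is running.

Key benefits:

  • 🔒 Prevents duplicate job execution
  • 🔄 Automatic failover
  • 📈 Supports horizontal scaling
  • 🛡️ Production-grade reliability

Test with 2-3 containers in a development environment first, confirm that cluster mode behaves as expected, and only then deploy to production and enable Auto Scaling.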
