Quartz Cluster Mode: A Complete Guide to Duplicate-Free Job Scheduling in Multi-Container ECS Environments
In modern microservice architectures, an application typically runs behind a load balancer across multiple container instances, with auto scaling adjusting the count. When Quartz Scheduler handles job scheduling, a critical question follows: how do you make sure the same scheduled job is not executed by several containers at once?
This article walks through how Quartz cluster mode works under the hood and provides a complete configuration guide for AWS ECS, so that scheduling stays reliable and consistent.
Problem Scenario: The Multi-Container Challenge
Suppose your application is deployed on AWS ECS, with Auto Scaling adjusting the container count dynamically:
- Normal hours: 3 container instances
- Peak hours: scales out to 10 container instances
- Scheduled job: a data-sync task runs every day at 02:00
The problem: without proper configuration, all 10 containers execute the data-sync job simultaneously at 02:00, causing:
- ❌ Duplicate data processing
- ❌ Database lock contention
- ❌ Wasted system resources
- ❌ Data inconsistency
The solution: Quartz cluster mode uses a database lock to guarantee that only one container executes a given job at any moment.
Core Mechanisms of Quartz Cluster Mode
1. The Database Lock (QRTZ_LOCKS Table)
Quartz coordinates multiple container instances with a pessimistic database row lock, preventing duplicate job execution.
QRTZ_LOCKS table schema
CREATE TABLE QRTZ_LOCKS (
SCHED_NAME VARCHAR(120) NOT NULL,
LOCK_NAME VARCHAR(40) NOT NULL,
PRIMARY KEY (SCHED_NAME, LOCK_NAME)
);
-- Default lock rows
-- TRIGGER_ACCESS: serializes reads and updates of triggers
-- STATE_ACCESS: serializes updates of scheduler state
-- JOB_ACCESS: serializes reads and updates of jobs (present in older Quartz versions)
How the lock is acquired
-- Container A tries to take the lock (PostgreSQL example)
-- (simplified illustration: Quartz's default StdRowLockSemaphore issues a plain
--  blocking SELECT ... FOR UPDATE; NOWAIT is shown here to make the contention explicit)
BEGIN TRANSACTION;
-- Take an exclusive row lock
SELECT * FROM QRTZ_LOCKS
WHERE SCHED_NAME = 'MyScheduler' AND LOCK_NAME = 'TRIGGER_ACCESS'
FOR UPDATE NOWAIT;
-- On success, mark the trigger as acquired by this instance
UPDATE QRTZ_TRIGGERS
SET TRIGGER_STATE = 'ACQUIRED'
WHERE TRIGGER_NAME = 'dataSync'
AND TRIGGER_GROUP = 'DEFAULT'
AND TRIGGER_STATE = 'WAITING';
-- Record the firing
INSERT INTO QRTZ_FIRED_TRIGGERS (
SCHED_NAME, ENTRY_ID, TRIGGER_NAME, TRIGGER_GROUP,
INSTANCE_NAME, FIRED_TIME, STATE
) VALUES (
'MyScheduler', 'unique_entry_id', 'dataSync', 'DEFAULT',
'ecs-task-12345-abc', CURRENT_TIMESTAMP, 'EXECUTING'
);
COMMIT;
How the other containers behave
-- Container B tries to take the same lock at the same moment
BEGIN TRANSACTION;
SELECT * FROM QRTZ_LOCKS
WHERE SCHED_NAME = 'MyScheduler' AND LOCK_NAME = 'TRIGGER_ACCESS'
FOR UPDATE NOWAIT;
-- ❌ Fails: the database reports "lock not available"
-- (with Quartz's real blocking FOR UPDATE, Container B would simply wait, then find
--  the trigger already ACQUIRED and skip it)
ROLLBACK;
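The winner-takes-all behaviour can be sketched outside the database with plain threads. In the sketch below (a hypothetical `TriggerLockRace` class; real Quartz uses a database row lock, and its default semaphore blocks rather than failing fast), `ReentrantLock.tryLock()` stands in for `FOR UPDATE NOWAIT`:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Sketch: N "containers" race for one TRIGGER_ACCESS lock; only the winner fires the trigger.
public class TriggerLockRace {

    static int race(int containers) {
        ReentrantLock triggerAccess = new ReentrantLock(); // stand-in for the QRTZ_LOCKS row
        AtomicInteger fired = new AtomicInteger();
        CountDownLatch attempted = new CountDownLatch(containers);
        CountDownLatch done = new CountDownLatch(containers);
        for (int i = 0; i < containers; i++) {
            new Thread(() -> {
                boolean won = triggerAccess.tryLock(); // ~ SELECT ... FOR UPDATE NOWAIT
                attempted.countDown();
                try {
                    if (won) {
                        fired.incrementAndGet(); // mark the trigger ACQUIRED, run the job
                        attempted.await();       // hold the lock until every container has tried
                    }
                    // losers skip: the trigger is already owned by the winner
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    if (won) triggerAccess.unlock(); // ~ COMMIT releases the row lock
                    done.countDown();
                }
            }).start();
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return fired.get();
    }

    public static void main(String[] args) {
        System.out.println("fired " + race(10) + " time(s) across 10 containers"); // fired 1 time(s)
    }
}
```

Whatever the container count, the fired counter always ends at 1 — exactly the invariant the QRTZ_LOCKS row provides.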
2. Heartbeats and Failover
Every container instance periodically updates the QRTZ_SCHEDULER_STATE table to announce that it is still alive.
QRTZ_SCHEDULER_STATE table
CREATE TABLE QRTZ_SCHEDULER_STATE (
SCHED_NAME VARCHAR(120) NOT NULL,
INSTANCE_NAME VARCHAR(200) NOT NULL,
LAST_CHECKIN_TIME BIGINT NOT NULL,
CHECKIN_INTERVAL BIGINT NOT NULL,
PRIMARY KEY (SCHED_NAME, INSTANCE_NAME)
);
Heartbeat query example
-- List all live container instances
SELECT
instance_name,
TO_TIMESTAMP(last_checkin_time / 1000) AS last_checkin,
checkin_interval,
EXTRACT(EPOCH FROM (NOW() - TO_TIMESTAMP(last_checkin_time / 1000))) * 1000 AS ms_since_checkin
FROM QRTZ_SCHEDULER_STATE
WHERE sched_name = 'MyECSScheduler'
ORDER BY last_checkin_time DESC;
-- Sample output
/*
instance_name | last_checkin | checkin_interval | ms_since_checkin
---------------------------|-------------------------|------------------|------------------
ecs-task-12345-abc-001 | 2025-10-20 10:30:45 | 20000 | 5234
ecs-task-67890-def-002 | 2025-10-20 10:30:47 | 20000 | 3125
ecs-task-11111-ghi-003 | 2025-10-20 10:30:43 | 20000 | 7456
ecs-task-22222-jkl-004     | 2025-10-20 10:29:10     | 20000            | 100234 ⚠️ likely dead
*/
Failover mechanism
Quartz automatically detects failed container instances:
- Detection threshold: an instance is considered failed when its `last_checkin_time` is older than expected — roughly `checkin_interval` plus a built-in grace margin (about 7.5 seconds in Quartz 2.x)
- Automatic takeover: a healthy instance takes over the failed instance's in-flight triggers
- Resumed execution: jobs do not stop just because a single container crashed
-- Reset the failed instance's triggers so they can fire again
UPDATE QRTZ_TRIGGERS
SET TRIGGER_STATE = 'WAITING'
WHERE TRIGGER_STATE = 'ACQUIRED'
AND EXISTS (
SELECT 1 FROM QRTZ_FIRED_TRIGGERS ft
WHERE ft.trigger_name = QRTZ_TRIGGERS.trigger_name
AND ft.instance_name = 'ecs-task-22222-jkl-004' -- the failed instance
);
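The detection rule reduces to a small predicate. The sketch below is illustrative only — the class and method names are hypothetical, and the 7.5-second grace margin approximates Quartz 2.x's ClusterManager behaviour rather than reproducing its exact arithmetic:

```java
// Sketch: decide whether a scheduler instance should be treated as failed,
// based on its last check-in and its advertised check-in interval.
public class FailureDetector {
    static final long GRACE_MS = 7_500; // margin added on top of the interval (approximate)

    static boolean isLikelyDead(long lastCheckinMs, long checkinIntervalMs, long nowMs) {
        // Failed once the instance has missed its expected check-in by more than the grace margin
        return nowMs > lastCheckinMs + checkinIntervalMs + GRACE_MS;
    }

    public static void main(String[] args) {
        long now = 1_000_000L;
        // Healthy: checked in 5 s ago with a 20 s interval
        System.out.println(isLikelyDead(now - 5_000, 20_000, now));   // false
        // Dead: last check-in 100 s ago with a 20 s interval (like instance 004 above)
        System.out.println(isLikelyDead(now - 100_000, 20_000, now)); // true
    }
}
```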
ECS Environment: Complete Configuration
1. Spring Boot Application Configuration
application.properties
# ========================================
# Data source (RDS PostgreSQL)
# ========================================
spring.datasource.url=jdbc:postgresql://${RDS_ENDPOINT}:5432/quartz_db
spring.datasource.username=quartz_user
spring.datasource.password=${DB_PASSWORD}
spring.datasource.driver-class-name=org.postgresql.Driver
# ========================================
# HikariCP connection pool
# ========================================
spring.datasource.hikari.maximum-pool-size=10
spring.datasource.hikari.minimum-idle=5
spring.datasource.hikari.connection-timeout=30000
spring.datasource.hikari.idle-timeout=600000
spring.datasource.hikari.max-lifetime=1800000
spring.datasource.hikari.pool-name=QuartzHikariPool
# ========================================
# Quartz core
# ========================================
spring.quartz.job-store-type=jdbc
# Schema is managed by Flyway/Liquibase (keep comments on their own line:
# an inline comment after a value becomes part of the value in .properties)
spring.quartz.jdbc.initialize-schema=never
# ========================================
# Cluster mode (the critical part!)
# ========================================
# Enable cluster mode
spring.quartz.properties.org.quartz.jobStore.isClustered=true
# Check-in interval in milliseconds (Quartz default: 7500)
spring.quartz.properties.org.quartz.jobStore.clusterCheckinInterval=20000
# Auto-generate a unique instance ID
spring.quartz.properties.org.quartz.scheduler.instanceId=AUTO
# Every container must use the same scheduler name
spring.quartz.properties.org.quartz.scheduler.instanceName=MyECSScheduler
# ========================================
# JobStore
# ========================================
spring.quartz.properties.org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
spring.quartz.properties.org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
spring.quartz.properties.org.quartz.jobStore.tablePrefix=QRTZ_
spring.quartz.properties.org.quartz.jobStore.useProperties=false
# Optimization: acquire triggers in batches
spring.quartz.properties.org.quartz.scheduler.batchTriggerAcquisitionMaxCount=10
spring.quartz.properties.org.quartz.scheduler.batchTriggerAcquisitionFireAheadTimeWindow=5000
# Required when batchTriggerAcquisitionMaxCount > 1 on a clustered JDBC JobStore
spring.quartz.properties.org.quartz.jobStore.acquireTriggersWithinLock=true
# ========================================
# Thread pool
# ========================================
spring.quartz.properties.org.quartz.threadPool.class=org.quartz.simpl.SimpleThreadPool
spring.quartz.properties.org.quartz.threadPool.threadCount=5
spring.quartz.properties.org.quartz.threadPool.threadPriority=5
spring.quartz.properties.org.quartz.threadPool.threadsInheritContextClassLoaderOfInitializingThread=true
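Because one wrong property silently degrades the cluster into N independent schedulers, a cheap fail-fast check at startup is worthwhile. The helper below is a hypothetical sketch using plain `java.util.Properties`; the class and method names are not from Quartz or Spring:

```java
import java.util.Properties;

// Sketch: fail fast at startup if the cluster-critical Quartz properties are wrong.
public class QuartzConfigCheck {

    static void assertClusterSafe(Properties p) {
        require(p, "org.quartz.jobStore.isClustered", "true");
        require(p, "org.quartz.scheduler.instanceId", "AUTO");
        String name = p.getProperty("org.quartz.scheduler.instanceName");
        if (name == null || name.isEmpty()) {
            throw new IllegalStateException(
                "instanceName must be set (and identical on every container)");
        }
    }

    private static void require(Properties p, String key, String expected) {
        if (!expected.equals(p.getProperty(key))) {
            throw new IllegalStateException(
                key + " must be '" + expected + "' but was '" + p.getProperty(key) + "'");
        }
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        p.setProperty("org.quartz.jobStore.isClustered", "true");
        p.setProperty("org.quartz.scheduler.instanceId", "AUTO");
        p.setProperty("org.quartz.scheduler.instanceName", "MyECSScheduler");
        assertClusterSafe(p); // passes; flip any value to see it fail fast
        System.out.println("cluster config OK");
    }
}
```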
Java configuration class (optional)
@Configuration
public class QuartzClusterConfig {
@Bean
public SchedulerFactoryBean schedulerFactoryBean(DataSource dataSource) {
SchedulerFactoryBean factory = new SchedulerFactoryBean();
factory.setDataSource(dataSource);
factory.setOverwriteExistingJobs(true);
factory.setAutoStartup(true);
factory.setWaitForJobsToCompleteOnShutdown(true);
factory.setQuartzProperties(quartzProperties());
return factory;
}
private Properties quartzProperties() {
Properties props = new Properties();
// Cluster settings
props.put("org.quartz.scheduler.instanceId", "AUTO");
props.put("org.quartz.scheduler.instanceName", "MyECSScheduler");
props.put("org.quartz.jobStore.isClustered", "true");
props.put("org.quartz.jobStore.clusterCheckinInterval", "20000");
// JobStore settings
props.put("org.quartz.jobStore.class", "org.quartz.impl.jdbcjobstore.JobStoreTX");
props.put("org.quartz.jobStore.driverDelegateClass",
"org.quartz.impl.jdbcjobstore.PostgreSQLDelegate");
props.put("org.quartz.jobStore.tablePrefix", "QRTZ_");
return props;
}
}
2. ECS Task Definition
{
"family": "quartz-scheduler-cluster",
"taskRoleArn": "arn:aws:iam::123456789012:role/ecsTaskRole",
"executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "512",
"memory": "1024",
"containerDefinitions": [
{
"name": "quartz-app",
"image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/quartz-app:latest",
"essential": true,
"portMappings": [
{
"containerPort": 8080,
"protocol": "tcp"
}
],
"environment": [
{
"name": "SPRING_PROFILES_ACTIVE",
"value": "production"
},
{
"name": "RDS_ENDPOINT",
"value": "quartz-db.cluster-abc123.us-east-1.rds.amazonaws.com"
},
{
"name": "JAVA_OPTS",
"value": "-Xms512m -Xmx768m -Duser.timezone=Asia/Taipei"
}
],
"secrets": [
{
"name": "DB_PASSWORD",
"valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:quartz-db-password-abc123"
}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/quartz-scheduler",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "quartz"
}
},
"healthCheck": {
"command": ["CMD-SHELL", "curl -f http://localhost:8080/actuator/health || exit 1"],
"interval": 30,
"timeout": 5,
"retries": 3,
"startPeriod": 60
}
}
]
}
3. ECS Service (Auto Scaling)
{
"serviceName": "quartz-scheduler-service",
"cluster": "production-cluster",
"taskDefinition": "quartz-scheduler-cluster:5",
"desiredCount": 3,
"launchType": "FARGATE",
"networkConfiguration": {
"awsvpcConfiguration": {
"subnets": [
"subnet-abc123",
"subnet-def456"
],
"securityGroups": ["sg-quartz-app"],
"assignPublicIp": "DISABLED"
}
},
"loadBalancers": [
{
"targetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/quartz-tg/abc123",
"containerName": "quartz-app",
"containerPort": 8080
}
],
"deploymentConfiguration": {
"maximumPercent": 200,
"minimumHealthyPercent": 100
}
}
Job Execution Flow in Detail
Timeline
Time: 2025-10-20 02:00:00 (trigger fire time)
Containers running: A, B, C
─────────────────────────────────────────────────────────
T0 (02:00:00.000)
The trigger reaches its fire time (next_fire_time = 02:00:00)
QRTZ_TRIGGERS.TRIGGER_STATE = 'WAITING'
↓
T1 (02:00:00.050)
Every container scans the database for due triggers
Container A: SELECT * FROM QRTZ_TRIGGERS WHERE next_fire_time <= NOW()
Container B: SELECT * FROM QRTZ_TRIGGERS WHERE next_fire_time <= NOW()
Container C: SELECT * FROM QRTZ_TRIGGERS WHERE next_fire_time <= NOW()
Result: all three containers see that 'dataSync' is due
↓
T2 (02:00:00.100)
All three containers try to take the TRIGGER_ACCESS lock
Container A: SELECT * FROM QRTZ_LOCKS ... FOR UPDATE NOWAIT
✅ Lock acquired
Container B: SELECT * FROM QRTZ_LOCKS ... FOR UPDATE NOWAIT
❌ Fails (ERROR: could not obtain lock)
Container C: SELECT * FROM QRTZ_LOCKS ... FOR UPDATE NOWAIT
❌ Fails (ERROR: could not obtain lock)
↓
T3 (02:00:00.150)
Container A marks the trigger as acquired and starts the job
UPDATE QRTZ_TRIGGERS
SET TRIGGER_STATE='ACQUIRED'
WHERE TRIGGER_NAME='dataSync'
INSERT INTO QRTZ_FIRED_TRIGGERS (
instance_name='ecs-task-12345-abc',
fired_time=1729382400150,
state='EXECUTING'
)
COMMIT; -- releases the lock
↓
T4 (02:00:00.200)
Containers B and C check the trigger state
SELECT TRIGGER_STATE FROM QRTZ_TRIGGERS WHERE TRIGGER_NAME='dataSync'
Result: TRIGGER_STATE = 'ACQUIRED'
⏭️ Skip (the trigger is already owned by Container A)
↓
T5 (02:00:05.000) - assuming the job takes 5 seconds
Container A finishes the job
UPDATE QRTZ_TRIGGERS
SET TRIGGER_STATE='WAITING',
PREV_FIRE_TIME=1729382400000,
NEXT_FIRE_TIME=1729468800000 -- tomorrow 02:00
WHERE TRIGGER_NAME='dataSync'
DELETE FROM QRTZ_FIRED_TRIGGERS
WHERE TRIGGER_NAME='dataSync' AND instance_name='ecs-task-12345-abc'
↓
Result: ✅ The job ran exactly once
✅ Container A executed it
✅ Containers B and C skipped it automatically
Sample execution logs
// Container A (ecs-task-12345-abc)
2025-10-20 02:00:00.050 INFO [QuartzScheduler] Batch acquisition: 1 trigger(s)
2025-10-20 02:00:00.100 INFO [JobRunShell] Acquiring trigger: DEFAULT.dataSync
2025-10-20 02:00:00.120 INFO [JobRunShell] Acquired trigger: DEFAULT.dataSync
2025-10-20 02:00:00.150 INFO [JobRunShell] Executing job: DataSyncJob
2025-10-20 02:00:05.000 INFO [JobRunShell] Job executed successfully
// Container B (ecs-task-67890-def)
2025-10-20 02:00:00.050 INFO [QuartzScheduler] Batch acquisition: 1 trigger(s)
2025-10-20 02:00:00.100 WARN [JobRunShell] Unable to acquire trigger: DEFAULT.dataSync
2025-10-20 02:00:00.105 INFO [QuartzScheduler] Trigger already acquired by another instance
// Container C (ecs-task-11111-ghi)
2025-10-20 02:00:00.050 INFO [QuartzScheduler] Batch acquisition: 1 trigger(s)
2025-10-20 02:00:00.102 WARN [JobRunShell] Unable to acquire trigger: DEFAULT.dataSync
2025-10-20 02:00:00.107 INFO [QuartzScheduler] Trigger already acquired by another instance
Monitoring and Verification
1. Database-Level Monitoring
Check cluster status
-- List all live container instances
SELECT
instance_name,
TO_TIMESTAMP(last_checkin_time / 1000) AT TIME ZONE 'Asia/Taipei' AS last_checkin,
checkin_interval / 1000 AS checkin_interval_sec,
CASE
WHEN (EXTRACT(EPOCH FROM NOW()) * 1000 - last_checkin_time) > (checkin_interval * 3)
THEN '⚠️ possibly dead'
ELSE '✅ healthy'
END AS status
FROM QRTZ_SCHEDULER_STATE
WHERE sched_name = 'MyECSScheduler'
ORDER BY last_checkin_time DESC;
Trace job execution history
-- The 10 most recent job executions
SELECT
ft.trigger_name,
ft.job_name,
ft.instance_name,
TO_TIMESTAMP(ft.fired_time / 1000) AT TIME ZONE 'Asia/Taipei' AS executed_at,
ft.state
FROM QRTZ_FIRED_TRIGGERS ft
WHERE ft.sched_name = 'MyECSScheduler'
ORDER BY ft.fired_time DESC
LIMIT 10;
Detect duplicate executions
-- Find jobs fired at the same time by more than one instance (an anomaly)
SELECT
trigger_name,
TO_TIMESTAMP(fired_time / 1000) AT TIME ZONE 'Asia/Taipei' AS fired_at,
COUNT(DISTINCT instance_name) AS execution_count,
STRING_AGG(instance_name, ', ') AS instances
FROM QRTZ_FIRED_TRIGGERS
WHERE sched_name = 'MyECSScheduler'
AND fired_time > (EXTRACT(EPOCH FROM NOW() - INTERVAL '1 hour') * 1000)
GROUP BY trigger_name, fired_time
HAVING COUNT(DISTINCT instance_name) > 1;
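The same duplicate check can be run in application code once the fired-trigger rows are in hand. A minimal sketch, with a hypothetical `FiredRecord` type standing in for rows of QRTZ_FIRED_TRIGGERS:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch: flag (trigger, fired_time) pairs that were executed by more than one instance.
public class DuplicateFireCheck {
    record FiredRecord(String triggerName, long firedTime, String instanceName) { }

    static Map<String, Set<String>> findDuplicates(List<FiredRecord> rows) {
        return rows.stream()
                .collect(Collectors.groupingBy(
                        r -> r.triggerName() + "@" + r.firedTime(),
                        Collectors.mapping(FiredRecord::instanceName, Collectors.toSet())))
                .entrySet().stream()
                .filter(e -> e.getValue().size() > 1) // more than one instance → duplicate execution
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        List<FiredRecord> rows = List.of(
                new FiredRecord("dataSync", 1729382400150L, "ecs-task-12345-abc"),
                new FiredRecord("dataSync", 1729382400150L, "ecs-task-67890-def"), // duplicate!
                new FiredRecord("reportJob", 1729382460000L, "ecs-task-12345-abc"));
        System.out.println(findDuplicates(rows).keySet()); // [dataSync@1729382400150]
    }
}
```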
2. Application-Level Monitoring
Spring Boot Actuator integration
@Component
public class QuartzClusterHealthIndicator implements HealthIndicator {
@Autowired
private Scheduler scheduler;
@Autowired
private JdbcTemplate jdbcTemplate;
@Override
public Health health() {
try {
String instanceId = scheduler.getSchedulerInstanceId();
boolean isClustered = scheduler.getMetaData().isJobStoreClustered();
// Count the cluster's live instances
Integer activeInstances = jdbcTemplate.queryForObject(
"SELECT COUNT(*) FROM QRTZ_SCHEDULER_STATE " +
"WHERE sched_name = ? AND (? - last_checkin_time) < ?",
Integer.class,
"MyECSScheduler",
System.currentTimeMillis(),
60000 // a heartbeat within the last 60 s counts as alive
);
return Health.up()
.withDetail("instanceId", instanceId)
.withDetail("clustered", isClustered)
.withDetail("activeInstances", activeInstances)
.withDetail("currentlyExecutingJobs",
scheduler.getCurrentlyExecutingJobs().size())
.build();
} catch (Exception e) {
return Health.down()
.withDetail("error", e.getMessage())
.build();
}
}
}
Custom monitoring metrics
@Component
public class QuartzMetricsCollector {
private static final Logger log = LoggerFactory.getLogger(QuartzMetricsCollector.class);
@Autowired
private Scheduler scheduler;
@Autowired
private MeterRegistry meterRegistry;
// Micrometer gauges hold only a weak reference, so keep a strong, mutable holder here
private final AtomicInteger executingJobs = new AtomicInteger();
private int lastExecutedTotal = 0;
@PostConstruct
public void registerGauge() {
meterRegistry.gauge("quartz.jobs.executing", executingJobs);
}
@Scheduled(fixedRate = 30000) // every 30 seconds
public void collectMetrics() throws SchedulerException {
String instanceId = scheduler.getSchedulerInstanceId();
// Update the currently-executing gauge
executingJobs.set(scheduler.getCurrentlyExecutingJobs().size());
// Counters must receive the delta, not the running total, or they double-count
SchedulerMetaData metaData = scheduler.getMetaData();
int total = metaData.getNumberOfJobsExecuted();
meterRegistry.counter("quartz.jobs.executed.total", "instance", instanceId)
.increment(total - lastExecutedTotal);
lastExecutedTotal = total;
log.info("Quartz Metrics - Instance: {}, Executing: {}, Total Executed: {}",
instanceId, executingJobs.get(), total);
}
}
3. CloudWatch Integration
@Component
public class QuartzCloudWatchMetrics {
@Autowired
private AmazonCloudWatch cloudWatch;
@Autowired
private Scheduler scheduler;
@Scheduled(fixedRate = 60000) // every minute
public void publishToCloudWatch() throws SchedulerException {
String instanceId = scheduler.getSchedulerInstanceId();
// Publish a custom metric to CloudWatch
PutMetricDataRequest request = new PutMetricDataRequest()
.withNamespace("QuartzScheduler/Cluster")
.withMetricData(
new MetricDatum()
.withMetricName("ExecutingJobs")
.withValue((double) scheduler.getCurrentlyExecutingJobs().size())
.withUnit(StandardUnit.Count)
.withTimestamp(new Date())
.withDimensions(
new Dimension()
.withName("InstanceId")
.withValue(instanceId)
)
);
cloudWatch.putMetricData(request);
}
}
Performance Tuning
1. Batch Trigger Acquisition
Fetch several due triggers in a single query to reduce database round trips:
# Acquire up to 10 triggers at once
org.quartz.scheduler.batchTriggerAcquisitionMaxCount=10
# Pick up triggers that fire within the next 5 seconds
org.quartz.scheduler.batchTriggerAcquisitionFireAheadTimeWindow=5000
# In a cluster, a batch size > 1 also requires acquiring triggers inside the lock
org.quartz.jobStore.acquireTriggersWithinLock=true
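What the fire-ahead window means is easy to state: a trigger is eligible for the current acquisition batch when its fire time falls within `now + window`. An illustrative one-line predicate (hypothetical class, not Quartz's internal code):

```java
// Sketch: batch-acquisition eligibility with a fire-ahead time window.
public class FireAheadWindow {
    // A trigger may be picked up early if it fires within `windowMs` from now
    static boolean eligible(long nextFireTimeMs, long nowMs, long windowMs) {
        return nextFireTimeMs <= nowMs + windowMs;
    }

    public static void main(String[] args) {
        long now = 100_000L;
        System.out.println(eligible(104_000, now, 5_000)); // true: fires in 4 s, inside the 5 s window
        System.out.println(eligible(110_000, now, 5_000)); // false: fires in 10 s, outside the window
    }
}
```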
2. Connection Pool Tuning
# HikariCP settings
# Size the pool from the container count and the database connection limit
# Rule of thumb: maximum-pool-size ≤ (DB max_connections / container count)
# Assuming RDS max_connections = 100 and 10 containers
spring.datasource.hikari.maximum-pool-size=8
spring.datasource.hikari.minimum-idle=4
spring.datasource.hikari.connection-timeout=20000
spring.datasource.hikari.validation-timeout=5000
spring.datasource.hikari.leak-detection-threshold=60000
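The sizing rule in the comments can be made explicit. The helper below is a sketch; the 20% headroom is an arbitrary safety factor for migrations, monitoring and ad-hoc connections, not a HikariCP recommendation:

```java
// Sketch: derive a per-container HikariCP pool size from the DB connection budget.
public class PoolSizer {
    // Leave headroom (e.g. 20%) so the app pools never consume the whole DB budget
    static int maxPoolSize(int dbMaxConnections, int containerCount, double headroom) {
        int raw = (int) Math.floor((dbMaxConnections / (double) containerCount) * (1.0 - headroom));
        return Math.max(1, raw); // never size a pool below 1
    }

    public static void main(String[] args) {
        // RDS max_connections = 100, 10 containers, 20% headroom → 8 connections per container
        System.out.println(maxPoolSize(100, 10, 0.20)); // 8
    }
}
```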
3. Database Index Tuning
-- Make sure the critical indexes exist (the official SQL scripts should already include them)
CREATE INDEX IF NOT EXISTS idx_qrtz_t_nft_st
ON QRTZ_TRIGGERS(SCHED_NAME, TRIGGER_STATE, NEXT_FIRE_TIME);
CREATE INDEX IF NOT EXISTS idx_qrtz_t_nft_st_misfire
ON QRTZ_TRIGGERS(SCHED_NAME, MISFIRE_INSTR, NEXT_FIRE_TIME, TRIGGER_STATE);
CREATE INDEX IF NOT EXISTS idx_qrtz_ft_trig_inst_name
ON QRTZ_FIRED_TRIGGERS(SCHED_NAME, INSTANCE_NAME);
-- Periodically refresh table statistics (PostgreSQL)
ANALYZE QRTZ_TRIGGERS;
ANALYZE QRTZ_FIRED_TRIGGERS;
ANALYZE QRTZ_SCHEDULER_STATE;
4. Purging Historical Data
-- Periodically purge old fired-trigger records (e.g. via a cron job)
DELETE FROM QRTZ_FIRED_TRIGGERS
WHERE sched_name = 'MyECSScheduler'
AND fired_time < (EXTRACT(EPOCH FROM NOW() - INTERVAL '7 days') * 1000);
-- Clean up orphaned lock rows (abnormal cases only; safest with all schedulers stopped)
DELETE FROM QRTZ_LOCKS
WHERE sched_name NOT IN (
SELECT DISTINCT sched_name FROM QRTZ_SCHEDULER_STATE
WHERE (EXTRACT(EPOCH FROM NOW()) * 1000 - last_checkin_time) < 300000
);
Troubleshooting
Problem 1: Duplicate job execution
Symptom: the same job runs on multiple containers at the same time
Possible causes:
- Container system clocks are out of sync
- `isClustered` is set to `false`
- Containers use different `instanceName` values
Fixes:
# 1. Check container time synchronization
aws ecs execute-command \
  --cluster production-cluster \
  --task task-id-12345 \
  --container quartz-app \
  --interactive \
  --command "date && ntpq -p"
# 2. Verify the configuration
aws ecs execute-command \
  --cluster production-cluster \
  --task task-id-12345 \
  --container quartz-app \
  --interactive \
  --command "cat /app/application.properties | grep isClustered"
-- 3. Check that the scheduler name is consistent in the database
SELECT DISTINCT sched_name
FROM QRTZ_SCHEDULER_STATE;
-- There should be exactly one result: MyECSScheduler
Problem 2: Jobs do not fire
Symptom: the scheduled time passes but the job never runs
Possible causes:
- No container can obtain the database lock
- The connection pool is exhausted
- The trigger is in a bad state (stuck in BLOCKED or ERROR)
Diagnostic steps:
-- 1. Check the trigger state
SELECT
trigger_name,
trigger_state,
TO_TIMESTAMP(next_fire_time / 1000) AT TIME ZONE 'Asia/Taipei' AS next_fire,
TO_TIMESTAMP(prev_fire_time / 1000) AT TIME ZONE 'Asia/Taipei' AS prev_fire
FROM QRTZ_TRIGGERS
WHERE sched_name = 'MyECSScheduler'
AND trigger_name = 'dataSync';
-- 2. Check for stuck locks
SELECT * FROM QRTZ_LOCKS
WHERE sched_name = 'MyECSScheduler';
-- 3. Check database connection usage
SELECT
count(*) AS active_connections,
max_conn.setting AS max_connections
FROM pg_stat_activity,
(SELECT setting FROM pg_settings WHERE name = 'max_connections') max_conn
WHERE datname = 'quartz_db'
GROUP BY max_conn.setting;
Fix:
-- If the trigger is stuck in BLOCKED, reset it manually
UPDATE QRTZ_TRIGGERS
SET TRIGGER_STATE = 'WAITING'
WHERE sched_name = 'MyECSScheduler'
AND trigger_state = 'BLOCKED';
Problem 3: Jobs are not recovered after a container dies
Symptom: a container crashes and its jobs are never taken over by the other containers
Possible causes:
- `clusterCheckinInterval` is set too high
- The failure-detection threshold is too generous
Fixes:
# Shorten the check-in interval (Quartz default: 7500 ms)
org.quartz.jobStore.clusterCheckinInterval=10000
# Quartz flags an instance as failed once it misses a check-in by a grace margin
# (roughly 7.5 s in Quartz 2.x), so a 10 s interval detects failure well within a minute
-- Remove the failed instance's row manually (emergency only)
DELETE FROM QRTZ_SCHEDULER_STATE
WHERE instance_name = 'ecs-task-failed-instance'
AND sched_name = 'MyECSScheduler';
Best-Practice Summary
1. Configuration Checklist
| Setting | Required value | Notes |
|---|---|---|
| `isClustered` | `true` | Enables cluster mode |
| `instanceId` | `AUTO` | Auto-generates a unique ID |
| `instanceName` | Identical on all containers | e.g. `MyECSScheduler` |
| `job-store-type` | `jdbc` | Persists state in the database |
| `dataSource` | Shared RDS | All containers connect to the same database |
2. Job Class Design Principles
// ✅ Good: stateless, idempotent, with solid error handling
@DisallowConcurrentExecution // prevent concurrent runs of the same JobDetail
@PersistJobDataAfterExecution // persist the JobDataMap after each run
public class DataSyncJob implements Job {
private static final Logger log = LoggerFactory.getLogger(DataSyncJob.class);
@Override
public void execute(JobExecutionContext context) throws JobExecutionException {
String instanceId = context.getScheduler().getSchedulerInstanceId();
log.info("[{}] Starting DataSyncJob", instanceId);
try {
// Idempotency check
if (isAlreadyProcessed(context)) {
log.warn("[{}] Job already processed, skipping", instanceId);
return;
}
// Run the business logic
performDataSync();
// Mark as processed
markAsProcessed(context);
log.info("[{}] DataSyncJob completed successfully", instanceId);
} catch (Exception e) {
log.error("[{}] DataSyncJob failed", instanceId, e);
// refireImmediately = false: do not retry immediately,
// which avoids cascading failures across the cluster
throw new JobExecutionException(e, false);
}
}
private boolean isAlreadyProcessed(JobExecutionContext context) {
// Implement the idempotency check here
return false;
}
private void performDataSync() {
// Business logic
}
private void markAsProcessed(JobExecutionContext context) {
// Persist the processed marker
}
}
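The `isAlreadyProcessed` / `markAsProcessed` stubs are where idempotency actually lives. A common shape is an execution key per logical run; the sketch below (hypothetical class, not part of Quartz) uses an in-memory set where production code would use a database table with a UNIQUE constraint, so the guarantee holds across containers rather than within one JVM:

```java
import java.time.LocalDate;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: idempotency via an execution key — one logical run per (job, business date).
public class IdempotencyGuard {
    // In production: a DB table with a UNIQUE(job_name, business_date) constraint
    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    // Returns true exactly once per key: the caller that gets `true` runs the job
    boolean tryClaim(String jobName, LocalDate businessDate) {
        return processed.add(jobName + ":" + businessDate);
    }

    public static void main(String[] args) {
        IdempotencyGuard guard = new IdempotencyGuard();
        LocalDate today = LocalDate.of(2025, 10, 20);
        System.out.println(guard.tryClaim("dataSync", today)); // true  → run the sync
        System.out.println(guard.tryClaim("dataSync", today)); // false → already processed, skip
    }
}
```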
3. Automated Maintenance
-- Create a recurring cleanup job (PostgreSQL pg_cron extension)
SELECT cron.schedule(
'quartz-cleanup',
'0 3 * * *', -- every day at 03:00
$$
DELETE FROM QRTZ_FIRED_TRIGGERS
WHERE fired_time < (EXTRACT(EPOCH FROM NOW() - INTERVAL '7 days') * 1000);
$$
);
4. Security Recommendations
- Minimize database privileges: grant the Quartz user only SELECT, INSERT, UPDATE and DELETE on the Quartz tables
- Use Secrets Manager: manage the database password through AWS Secrets Manager
- Enable SSL: encrypt RDS connections with SSL/TLS
- Network isolation: keep the Quartz containers and RDS on private subnets
# Enable SSL on the JDBC connection
spring.datasource.url=jdbc:postgresql://quartz-db.abc.rds.amazonaws.com:5432/quartz?ssl=true&sslmode=require
Conclusion
Quartz cluster mode, with its database lock and heartbeat detection, cleanly solves the scheduling challenges of multi-container environments. On AWS ECS you only need to get the key settings right:
- ✅ Cluster mode enabled (`isClustered=true`)
- ✅ Auto-generated instance IDs (`instanceId=AUTO`)
- ✅ The same scheduler name on every container
- ✅ A single shared RDS database
- ✅ Sensible connection-pool and performance tuning
With the configuration, monitoring and best practices covered here, you can build a highly available, horizontally scalable, duplicate-free enterprise scheduling system that stays reliable no matter how Auto Scaling changes the container count.
Key benefits:
- 🔒 No duplicate job execution
- 🔄 Automatic failover
- 📈 Horizontal scalability
- 🛡️ Production-grade reliability
Start by validating cluster mode with 2-3 containers in a development environment; once you have confirmed it behaves correctly, roll it out to production and enable Auto Scaling.