Quartz Clustering Complete Guide: Preventing Duplicate Task Execution in ECS Multi-Container Environments
In modern microservice architectures, applications typically run across multiple container instances for load balancing and auto-scaling. When using Quartz Scheduler for task scheduling, a critical question arises: How do we ensure the same scheduled task doesn’t execute multiple times across different containers?
This article provides an in-depth exploration of Quartz clustering mechanisms and offers a complete configuration guide for AWS ECS environments, ensuring reliable and consistent task scheduling.
The Challenge: Multi-Container Environment Scenarios
Imagine your application is deployed on AWS ECS with Auto Scaling dynamically adjusting container counts:
- Normal periods: 3 container instances running
- Peak hours: Auto-scaled to 10 container instances
- Scheduled task: Daily data synchronization at 2:00 AM
The Problem: Without proper configuration, at 2:00 AM all 10 containers would simultaneously execute the data sync task, causing:
- ❌ Duplicate data processing
- ❌ Database lock conflicts
- ❌ System resource waste
- ❌ Data inconsistency
The Solution: Quartz clustering mode uses database locking mechanisms to ensure only one container executes a specific task at any given time.
Quartz Clustering Core Mechanisms
1. Database Locking Mechanism (QRTZ_LOCKS Table)
Quartz uses pessimistic database locking to coordinate multiple container instances, preventing duplicate task execution.
QRTZ_LOCKS Table Structure
CREATE TABLE QRTZ_LOCKS (
SCHED_NAME VARCHAR(120) NOT NULL,
LOCK_NAME VARCHAR(40) NOT NULL,
PRIMARY KEY (SCHED_NAME, LOCK_NAME)
);
-- Lock rows used by Quartz 2.x
-- TRIGGER_ACCESS: Serializes trigger acquisition, firing, and job/trigger updates
-- STATE_ACCESS: Serializes scheduler state updates (cluster check-in and recovery)
Lock Acquisition Flow
-- Container A acquires the lock (PostgreSQL example, simplified)
BEGIN;
-- SELECT ... FOR UPDATE takes an exclusive row lock. Quartz issues it without NOWAIT,
-- so competing containers block here until the holder commits
SELECT * FROM QRTZ_LOCKS
WHERE SCHED_NAME = 'MyECSScheduler' AND LOCK_NAME = 'TRIGGER_ACCESS'
FOR UPDATE;
-- With the lock held, claim the trigger by moving it from WAITING to ACQUIRED
UPDATE QRTZ_TRIGGERS
SET TRIGGER_STATE = 'ACQUIRED'
WHERE SCHED_NAME = 'MyECSScheduler'
AND TRIGGER_NAME = 'dataSync'
AND TRIGGER_GROUP = 'DEFAULT'
AND TRIGGER_STATE = 'WAITING';
-- Record which instance claimed the trigger (Quartz stores times as epoch milliseconds)
INSERT INTO QRTZ_FIRED_TRIGGERS (
SCHED_NAME, ENTRY_ID, TRIGGER_NAME, TRIGGER_GROUP,
INSTANCE_NAME, FIRED_TIME, SCHED_TIME, PRIORITY, STATE
) VALUES (
'MyECSScheduler', 'unique_entry_id', 'dataSync', 'DEFAULT',
'ecs-task-12345-abc',
(EXTRACT(EPOCH FROM NOW()) * 1000)::BIGINT,
(EXTRACT(EPOCH FROM NOW()) * 1000)::BIGINT,
5, 'ACQUIRED'
);
COMMIT;
Other Container Handling
-- Container B requests the same lock at the same moment
BEGIN;
SELECT * FROM QRTZ_LOCKS
WHERE SCHED_NAME = 'MyECSScheduler' AND LOCK_NAME = 'TRIGGER_ACCESS'
FOR UPDATE;
-- ⏳ Blocks until Container A commits, then obtains the lock
-- The trigger is no longer in WAITING state, so Container B finds nothing to claim
COMMIT;
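The coordination described above can be imitated in plain Java with no database involved. This is a minimal sketch, not Quartz's implementation: a `ReentrantLock` stands in for the `TRIGGER_ACCESS` row lock, and the class name `ClusterLockSketch` and its methods are hypothetical. Every simulated container takes the lock in turn, but only the first one still finds the trigger in `WAITING` state.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

final class ClusterLockSketch {
    // Stand-in for the TRIGGER_ACCESS row in QRTZ_LOCKS (assumption: in-memory only)
    private final ReentrantLock triggerAccess = new ReentrantLock();
    // Stand-in for QRTZ_TRIGGERS.TRIGGER_STATE of a single trigger; guarded by the lock
    private String triggerState = "WAITING";
    // Names of the containers that actually ran the job
    private final List<String> executedBy = new CopyOnWriteArrayList<>();

    /** Returns true if this container claimed the trigger and ran the job. */
    boolean tryFire(String container) {
        triggerAccess.lock();              // plays the role of SELECT ... FOR UPDATE
        try {
            if (!"WAITING".equals(triggerState)) {
                return false;              // another container already claimed it
            }
            triggerState = "ACQUIRED";     // claim the trigger while holding the lock
        } finally {
            triggerAccess.unlock();        // plays the role of COMMIT
        }
        executedBy.add(container);         // the job itself runs outside the lock
        return true;
    }

    /** Lets the given number of simulated containers race for the same trigger. */
    static List<String> race(int containers) {
        ClusterLockSketch cluster = new ClusterLockSketch();
        ExecutorService pool = Executors.newFixedThreadPool(containers);
        for (int i = 0; i < containers; i++) {
            String name = "container-" + i;
            pool.submit(() -> cluster.tryFire(name));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return cluster.executedBy;
    }

    public static void main(String[] args) {
        System.out.println("Executed by: " + race(10));
    }
}
```

Which container wins varies from run to run, but the list always holds exactly one name.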
2. Heartbeat Detection and Failover
Each container instance periodically updates the QRTZ_SCHEDULER_STATE table to announce its alive status.
QRTZ_SCHEDULER_STATE Table
CREATE TABLE QRTZ_SCHEDULER_STATE (
SCHED_NAME VARCHAR(120) NOT NULL,
INSTANCE_NAME VARCHAR(200) NOT NULL,
LAST_CHECKIN_TIME BIGINT NOT NULL,
CHECKIN_INTERVAL BIGINT NOT NULL,
PRIMARY KEY (SCHED_NAME, INSTANCE_NAME)
);
Heartbeat Example
-- Query all active container instances
SELECT
instance_name,
TO_TIMESTAMP(last_checkin_time / 1000) AS last_checkin,
checkin_interval,
EXTRACT(EPOCH FROM (NOW() - TO_TIMESTAMP(last_checkin_time / 1000))) * 1000 AS ms_since_checkin
FROM QRTZ_SCHEDULER_STATE
WHERE sched_name = 'MyECSScheduler'
ORDER BY last_checkin_time DESC;
-- Sample output
/*
instance_name | last_checkin | checkin_interval | ms_since_checkin
---------------------------|-------------------------|------------------|------------------
ecs-task-12345-abc-001 | 2025-10-20 10:30:45 | 20000 | 5234
ecs-task-67890-def-002 | 2025-10-20 10:30:47 | 20000 | 3125
ecs-task-11111-ghi-003 | 2025-10-20 10:30:43 | 20000 | 7456
ecs-task-22222-jkl-004 | 2025-10-20 10:29:10 | 20000 | 100234 ⚠️ Likely failed
*/
Failover Mechanism
Quartz automatically detects failed container instances:
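- Detection Threshold: An instance is considered failed once its last_checkin_time is older than roughly its checkin_interval plus a grace period of a few seconds (7.5 seconds in the Quartz 2.x source), rather than a multiple of the interval
- Automatic Takeover: Healthy container instances take over the failed instance's tasks
- Resume Execution: Ensures tasks don't stop because a single container crashed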
-- Mark failed instance's tasks as "executable" state
UPDATE QRTZ_TRIGGERS
SET TRIGGER_STATE = 'WAITING'
WHERE TRIGGER_STATE = 'ACQUIRED'
AND EXISTS (
SELECT 1 FROM QRTZ_FIRED_TRIGGERS ft
WHERE ft.trigger_name = QRTZ_TRIGGERS.trigger_name
AND ft.instance_name = 'ecs-task-22222-jkl-004' -- Failed instance
);
Complete ECS Environment Configuration
1. Spring Boot Application Configuration
application.properties
# ========================================
# DataSource Configuration (RDS PostgreSQL)
# ========================================
spring.datasource.url=jdbc:postgresql://${RDS_ENDPOINT}:5432/quartz_db
spring.datasource.username=quartz_user
spring.datasource.password=${DB_PASSWORD}
spring.datasource.driver-class-name=org.postgresql.Driver
# ========================================
# HikariCP Connection Pool Configuration
# ========================================
spring.datasource.hikari.maximum-pool-size=10
spring.datasource.hikari.minimum-idle=5
spring.datasource.hikari.connection-timeout=30000
spring.datasource.hikari.idle-timeout=600000
spring.datasource.hikari.max-lifetime=1800000
spring.datasource.hikari.pool-name=QuartzHikariPool
# ========================================
# Quartz Core Configuration
# ========================================
spring.quartz.job-store-type=jdbc
# Use Flyway/Liquibase for schema management
# (.properties files do not support trailing comments, so the comment sits on its own line)
spring.quartz.jdbc.initialize-schema=never
# ========================================
# Clustering Configuration (CRITICAL!)
# ========================================
# Enable clustering mode
spring.quartz.properties.org.quartz.jobStore.isClustered=true
# Heartbeat interval (milliseconds), default 15000
spring.quartz.properties.org.quartz.jobStore.clusterCheckinInterval=20000
# Automatically generate unique instance ID
spring.quartz.properties.org.quartz.scheduler.instanceId=AUTO
# All containers must use the same scheduler name
spring.quartz.properties.org.quartz.scheduler.instanceName=MyECSScheduler
# ========================================
# JobStore Configuration
# ========================================
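# With job-store-type=jdbc, Spring wires its DataSource through LocalDataSourceJobStore.
# Setting JobStoreTX here can fail with "DataSource name not set" on recent Spring Boot versions.
spring.quartz.properties.org.quartz.jobStore.class=org.springframework.scheduling.quartz.LocalDataSourceJobStore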
spring.quartz.properties.org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
spring.quartz.properties.org.quartz.jobStore.tablePrefix=QRTZ_
spring.quartz.properties.org.quartz.jobStore.useProperties=false
# Optimization: Batch Trigger acquisition
spring.quartz.properties.org.quartz.scheduler.batchTriggerAcquisitionMaxCount=10
spring.quartz.properties.org.quartz.scheduler.batchTriggerAcquisitionFireAheadTimeWindow=5000
# ========================================
# ThreadPool Configuration
# ========================================
spring.quartz.properties.org.quartz.threadPool.class=org.quartz.simpl.SimpleThreadPool
spring.quartz.properties.org.quartz.threadPool.threadCount=5
spring.quartz.properties.org.quartz.threadPool.threadPriority=5
spring.quartz.properties.org.quartz.threadPool.threadsInheritContextClassLoaderOfInitializingThread=true
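With `instanceId=AUTO`, each container derives its own ID at startup. Quartz's default generator (`SimpleInstanceIdGenerator`) is understood to concatenate the local host name with the current time in milliseconds. The sketch below reproduces that idea under this assumption. `InstanceIdSketch` and `generate` are hypothetical names, not part of Quartz.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

final class InstanceIdSketch {
    /** Builds an instance ID by appending a millisecond timestamp to the host name. */
    static String generate(String hostName, long nowMillis) {
        return hostName + nowMillis;
    }

    public static void main(String[] args) throws UnknownHostException {
        // On ECS each task has its own host name, so IDs differ between containers
        String host = InetAddress.getLocalHost().getHostName();
        System.out.println(generate(host, System.currentTimeMillis()));
    }
}
```

Because the timestamp is part of the ID, a restarted container registers as a new instance in QRTZ_SCHEDULER_STATE rather than reusing its previous row.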
Java Configuration Class (Optional)
import java.util.Properties;
import javax.sql.DataSource;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.quartz.SchedulerFactoryBean;

// Note: declaring this bean replaces Spring Boot's auto-configured scheduler,
// so the spring.quartz.* properties are no longer applied. Use one approach or the other.
@Configuration
public class QuartzClusterConfig {

    @Bean
    public SchedulerFactoryBean schedulerFactoryBean(DataSource dataSource) {
        SchedulerFactoryBean factory = new SchedulerFactoryBean();
        factory.setDataSource(dataSource);
        factory.setOverwriteExistingJobs(true);
        factory.setAutoStartup(true);
        factory.setWaitForJobsToCompleteOnShutdown(true);
        factory.setQuartzProperties(quartzProperties());
        return factory;
    }

    private Properties quartzProperties() {
        Properties props = new Properties();
        // Clustering configuration
        props.put("org.quartz.scheduler.instanceId", "AUTO");
        props.put("org.quartz.scheduler.instanceName", "MyECSScheduler");
        props.put("org.quartz.jobStore.isClustered", "true");
        props.put("org.quartz.jobStore.clusterCheckinInterval", "20000");
        // JobStore configuration. jobStore.class is left unset on purpose:
        // setDataSource() makes Spring use its LocalDataSourceJobStore.
        props.put("org.quartz.jobStore.driverDelegateClass",
                "org.quartz.impl.jdbcjobstore.PostgreSQLDelegate");
        props.put("org.quartz.jobStore.tablePrefix", "QRTZ_");
        return props;
    }
}
2. ECS Task Definition Configuration
{
"family": "quartz-scheduler-cluster",
"taskRoleArn": "arn:aws:iam::123456789012:role/ecsTaskRole",
"executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "512",
"memory": "1024",
"containerDefinitions": [
{
"name": "quartz-app",
"image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/quartz-app:latest",
"essential": true,
"portMappings": [
{
"containerPort": 8080,
"protocol": "tcp"
}
],
"environment": [
{
"name": "SPRING_PROFILES_ACTIVE",
"value": "production"
},
{
"name": "RDS_ENDPOINT",
"value": "quartz-db.cluster-abc123.us-east-1.rds.amazonaws.com"
},
{
"name": "JAVA_OPTS",
"value": "-Xms512m -Xmx768m -Duser.timezone=UTC"
}
],
"secrets": [
{
"name": "DB_PASSWORD",
"valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:quartz-db-password-abc123"
}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/quartz-scheduler",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "quartz"
}
},
"healthCheck": {
"command": ["CMD-SHELL", "curl -f http://localhost:8080/actuator/health || exit 1"],
"interval": 30,
"timeout": 5,
"retries": 3,
"startPeriod": 60
}
}
]
}
3. ECS Service Configuration (Auto Scaling)
{
"serviceName": "quartz-scheduler-service",
"cluster": "production-cluster",
"taskDefinition": "quartz-scheduler-cluster:5",
"desiredCount": 3,
"launchType": "FARGATE",
"networkConfiguration": {
"awsvpcConfiguration": {
"subnets": [
"subnet-abc123",
"subnet-def456"
],
"securityGroups": ["sg-quartz-app"],
"assignPublicIp": "DISABLED"
}
},
"loadBalancers": [
{
"targetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/quartz-tg/abc123",
"containerName": "quartz-app",
"containerPort": 8080
}
],
"deploymentConfiguration": {
"maximumPercent": 200,
"minimumHealthyPercent": 100
}
}
Detailed Task Execution Flow
Timeline Flowchart
Time: 2025-10-20 02:00:00 (Task trigger time)
Container Status: Container A, B, C simultaneously running
─────────────────────────────────────────────────────────
T0 (02:00:00.000)
Trigger reaches trigger time (next_fire_time = 02:00:00)
QRTZ_TRIGGERS.TRIGGER_STATE = 'WAITING'
↓
T1 (02:00:00.050)
All containers simultaneously scan database for pending tasks
Container A: SELECT * FROM QRTZ_TRIGGERS WHERE next_fire_time <= NOW()
Container B: SELECT * FROM QRTZ_TRIGGERS WHERE next_fire_time <= NOW()
Container C: SELECT * FROM QRTZ_TRIGGERS WHERE next_fire_time <= NOW()
Result: All three containers discover 'dataSync' task needs execution
↓
T2 (02:00:00.100)
Three containers simultaneously request the TRIGGER_ACCESS lock
Container A: SELECT * FROM QRTZ_LOCKS ... FOR UPDATE
✅ Lock acquired
Container B: SELECT * FROM QRTZ_LOCKS ... FOR UPDATE
⏳ Blocked (waits for Container A to commit)
Container C: SELECT * FROM QRTZ_LOCKS ... FOR UPDATE
⏳ Blocked (waits for Container A to commit)
↓
T3 (02:00:00.150)
Container A claims the trigger and begins execution
UPDATE QRTZ_TRIGGERS
SET TRIGGER_STATE='ACQUIRED'
WHERE TRIGGER_NAME='dataSync' AND TRIGGER_STATE='WAITING'
INSERT INTO QRTZ_FIRED_TRIGGERS (
instance_name='ecs-task-12345-abc',
fired_time=1760925600150,
state='ACQUIRED'
)
COMMIT; -- Release lock
↓
T4 (02:00:00.200)
Container B, C obtain the lock in turn and re-check the trigger
SELECT TRIGGER_STATE FROM QRTZ_TRIGGERS WHERE TRIGGER_NAME='dataSync'
Result: TRIGGER_STATE = 'ACQUIRED' (no longer 'WAITING')
⏭️ Skip execution (trigger already claimed by Container A)
↓
T5 (02:00:05.000) - Assuming task execution takes 5 seconds
Container A completes task execution
UPDATE QRTZ_TRIGGERS
SET TRIGGER_STATE='WAITING',
PREV_FIRE_TIME=1760925600000, -- 2025-10-20 02:00:00 UTC in epoch milliseconds
NEXT_FIRE_TIME=1761012000000 -- Tomorrow 02:00
WHERE TRIGGER_NAME='dataSync'
DELETE FROM QRTZ_FIRED_TRIGGERS
WHERE TRIGGER_NAME='dataSync' AND instance_name='ecs-task-12345-abc'
↓
Result: ✅ Task executed only once
✅ Container A handled execution
✅ Container B, C automatically skipped
Illustrative Execution Log Example
// Container A (ecs-task-12345-abc)
2025-10-20 02:00:00.050 INFO [QuartzScheduler] Batch acquisition: 1 trigger(s)
2025-10-20 02:00:00.100 INFO [JobRunShell] Acquiring trigger: DEFAULT.dataSync
2025-10-20 02:00:00.120 INFO [JobRunShell] Acquired trigger: DEFAULT.dataSync
2025-10-20 02:00:00.150 INFO [JobRunShell] Executing job: DataSyncJob
2025-10-20 02:00:05.000 INFO [JobRunShell] Job executed successfully
// Container B (ecs-task-67890-def)
2025-10-20 02:00:00.050 INFO [QuartzScheduler] Batch acquisition: 1 trigger(s)
2025-10-20 02:00:00.100 WARN [JobRunShell] Unable to acquire trigger: DEFAULT.dataSync
2025-10-20 02:00:00.105 INFO [QuartzScheduler] Trigger already acquired by another instance
// Container C (ecs-task-11111-ghi)
2025-10-20 02:00:00.050 INFO [QuartzScheduler] Batch acquisition: 1 trigger(s)
2025-10-20 02:00:00.102 WARN [JobRunShell] Unable to acquire trigger: DEFAULT.dataSync
2025-10-20 02:00:00.107 INFO [QuartzScheduler] Trigger already acquired by another instance
Monitoring and Verification
1. Database-Level Monitoring
Check Cluster Status
-- View all active container instances
SELECT
instance_name,
TO_TIMESTAMP(last_checkin_time / 1000) AS last_checkin,
checkin_interval / 1000 AS checkin_interval_sec,
CASE
WHEN (EXTRACT(EPOCH FROM NOW()) * 1000 - last_checkin_time) > (checkin_interval * 3)
THEN '⚠️ Possibly Failed'
ELSE '✅ Healthy'
END AS status
FROM QRTZ_SCHEDULER_STATE
WHERE sched_name = 'MyECSScheduler'
ORDER BY last_checkin_time DESC;
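The status rule in the query above (flag an instance when its last check-in is older than three check-in intervals) can be expressed as a small Java helper. This is a sketch of the query's own heuristic, not of Quartz's internal failure detection. `CheckinStatusSketch` and `possiblyFailed` are hypothetical names, and the "current time" value is made up for the demonstration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CheckinStatusSketch {
    /** True when the last check-in is older than 3x the check-in interval. */
    static boolean possiblyFailed(long nowMs, long lastCheckinMs, long checkinIntervalMs) {
        return (nowMs - lastCheckinMs) > checkinIntervalMs * 3;
    }

    public static void main(String[] args) {
        long now = 1_760_956_250_000L;   // hypothetical current time in epoch ms
        long interval = 20_000L;         // clusterCheckinInterval from the configuration
        // Milliseconds since last check-in, taken from the sample output earlier
        Map<String, Long> msSinceCheckin = new LinkedHashMap<>();
        msSinceCheckin.put("ecs-task-12345-abc-001", 5_234L);
        msSinceCheckin.put("ecs-task-22222-jkl-004", 100_234L);
        msSinceCheckin.forEach((name, age) ->
            System.out.println(name + " -> "
                + (possiblyFailed(now, now - age, interval) ? "possibly failed" : "healthy")));
    }
}
```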
Track Task Execution History
-- View up to 10 currently fired (in-flight) triggers
-- Quartz deletes each row when the job finishes, so this table is not a long-term history
SELECT
ft.trigger_name,
ft.job_name,
ft.instance_name,
TO_TIMESTAMP(ft.fired_time / 1000) AS executed_at,
ft.state
FROM QRTZ_FIRED_TRIGGERS ft
WHERE ft.sched_name = 'MyECSScheduler'
ORDER BY ft.fired_time DESC
LIMIT 10;
Detect Duplicate Task Execution
-- Find triggers fired by multiple instances at the same time (anomaly)
-- Only in-flight rows are visible, so run this while the jobs are still executing
SELECT
trigger_name,
TO_TIMESTAMP(fired_time / 1000) AS fired_at,
COUNT(DISTINCT instance_name) AS execution_count,
STRING_AGG(instance_name, ', ') AS instances
FROM QRTZ_FIRED_TRIGGERS
WHERE sched_name = 'MyECSScheduler'
AND fired_time > (EXTRACT(EPOCH FROM NOW() - INTERVAL '1 hour') * 1000)
GROUP BY trigger_name, fired_time
HAVING COUNT(DISTINCT instance_name) > 1;
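The same grouping logic can be applied in application code, for example to rows already fetched from QRTZ_FIRED_TRIGGERS or to entries in a custom audit log. The following is a minimal sketch with hypothetical names (`DuplicateFireSketch`, `Fired`, `duplicates`). It groups rows by trigger name and fire time and keeps only the groups fired by more than one instance.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class DuplicateFireSketch {
    /** One fired-trigger row, reduced to the columns the query uses. */
    static final class Fired {
        final String triggerName;
        final long firedTime;
        final String instanceName;

        Fired(String triggerName, long firedTime, String instanceName) {
            this.triggerName = triggerName;
            this.firedTime = firedTime;
            this.instanceName = instanceName;
        }
    }

    /** Maps "trigger@firedTime" to the instances that fired it, keeping only duplicates. */
    static Map<String, Set<String>> duplicates(List<Fired> rows) {
        Map<String, Set<String>> byKey = new TreeMap<>();
        for (Fired row : rows) {
            byKey.computeIfAbsent(row.triggerName + "@" + row.firedTime, k -> new TreeSet<>())
                 .add(row.instanceName);
        }
        // Equivalent of HAVING COUNT(DISTINCT instance_name) > 1
        byKey.values().removeIf(instances -> instances.size() < 2);
        return byKey;
    }

    public static void main(String[] args) {
        List<Fired> rows = List.of(
            new Fired("dataSync", 1000L, "ecs-task-a"),
            new Fired("dataSync", 1000L, "ecs-task-b"),
            new Fired("report", 2000L, "ecs-task-a"));
        System.out.println(duplicates(rows));
    }
}
```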
2. Application-Level Monitoring
Spring Boot Actuator Integration
@Component
public class QuartzClusterHealthIndicator implements HealthIndicator {
@Autowired
private Scheduler scheduler;
@Autowired
private JdbcTemplate jdbcTemplate;
@Override
public Health health() {
try {
String instanceId = scheduler.getSchedulerInstanceId();
boolean isClustered = scheduler.getMetaData().isJobStoreClustered();
// Check number of active instances in cluster
Integer activeInstances = jdbcTemplate.queryForObject(
"SELECT COUNT(*) FROM QRTZ_SCHEDULER_STATE " +
"WHERE sched_name = ? AND (? - last_checkin_time) < ?",
Integer.class,
"MyECSScheduler",
System.currentTimeMillis(),
60000 // Heartbeat within 60 seconds considered active
);
return Health.up()
.withDetail("instanceId", instanceId)
.withDetail("clustered", isClustered)
.withDetail("activeInstances", activeInstances)
.withDetail("currentlyExecutingJobs",
scheduler.getCurrentlyExecutingJobs().size())
.build();
} catch (Exception e) {
return Health.down()
.withDetail("error", e.getMessage())
.build();
}
}
}
Custom Monitoring Metrics
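import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tags;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class QuartzMetricsCollector {
    private static final Logger log = LoggerFactory.getLogger(QuartzMetricsCollector.class);
    private final Scheduler scheduler;
    // Gauges only hold a weak reference to the observed number, so keep strong references here
    private final AtomicInteger executingJobs;
    private final AtomicLong executedTotal;

    public QuartzMetricsCollector(Scheduler scheduler, MeterRegistry meterRegistry)
            throws SchedulerException {
        this.scheduler = scheduler;
        Tags tags = Tags.of("instance", scheduler.getSchedulerInstanceId());
        // Register each gauge once; later updates only change the held values
        this.executingJobs = meterRegistry.gauge("quartz.jobs.executing", tags, new AtomicInteger());
        this.executedTotal = meterRegistry.gauge("quartz.jobs.executed.total", tags, new AtomicLong());
    }

    @Scheduled(fixedRate = 30000) // Every 30 seconds
    public void collectMetrics() throws SchedulerException {
        executingJobs.set(scheduler.getCurrentlyExecutingJobs().size());
        // getNumberOfJobsExecuted() is already cumulative for this instance, so it is
        // published as-is instead of being added to a counter on every run
        executedTotal.set(scheduler.getMetaData().getNumberOfJobsExecuted());
        log.info("Quartz Metrics - Instance: {}, Executing: {}, Total Executed: {}",
                scheduler.getSchedulerInstanceId(), executingJobs.get(), executedTotal.get());
    }
}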
3. CloudWatch Integration
@Component
public class QuartzCloudWatchMetrics {
@Autowired
private AmazonCloudWatch cloudWatch;
@Autowired
private Scheduler scheduler;
@Scheduled(fixedRate = 60000) // Every minute
public void publishToCloudWatch() throws SchedulerException {
String instanceId = scheduler.getSchedulerInstanceId();
// Send custom metric to CloudWatch
PutMetricDataRequest request = new PutMetricDataRequest()
.withNamespace("QuartzScheduler/Cluster")
.withMetricData(
new MetricDatum()
.withMetricName("ExecutingJobs")
.withValue((double) scheduler.getCurrentlyExecutingJobs().size())
.withUnit(StandardUnit.Count)
.withTimestamp(new Date())
.withDimensions(
new Dimension()
.withName("InstanceId")
.withValue(instanceId)
)
);
cloudWatch.putMetricData(request);
}
}
Performance Optimization Strategies
1. Batch Trigger Acquisition
Reduce database query frequency by acquiring multiple pending triggers at once:
# Acquire up to 10 Triggers at once
org.quartz.scheduler.batchTriggerAcquisitionMaxCount=10
# Acquire tasks 5 seconds before trigger time
org.quartz.scheduler.batchTriggerAcquisitionFireAheadTimeWindow=5000
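To make the two settings concrete, the sketch below selects triggers the way the text describes: anything due within the fire-ahead window is eligible, and at most `maxCount` are taken per pass, earliest first. It is a simplification under stated assumptions (real Quartz also takes priority and misfire handling into account), and `BatchAcquisitionSketch` and `acquire` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class BatchAcquisitionSketch {
    /** Picks up to maxCount fire times that are due by (now + timeWindow), earliest first. */
    static List<Long> acquire(List<Long> nextFireTimes, long nowMs, int maxCount, long timeWindowMs) {
        List<Long> due = new ArrayList<>();
        for (long fireTime : nextFireTimes) {
            if (fireTime <= nowMs + timeWindowMs) {
                due.add(fireTime);       // due now or within the fire-ahead window
            }
        }
        due.sort(Comparator.naturalOrder());
        return new ArrayList<>(due.subList(0, Math.min(maxCount, due.size())));
    }

    public static void main(String[] args) {
        // Fire times in ms relative to "now" = 0, window = 5000 ms, batch size = 2
        System.out.println(acquire(List.of(1000L, 7000L, 4000L, 2000L), 0L, 2, 5000L));
    }
}
```

A larger batch size means fewer database round trips, at the cost of one container holding more triggers that other containers could otherwise have picked up.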
2. Connection Pool Optimization
# HikariCP Connection Pool Configuration
# Adjust based on container count and database max connections
# Formula: maximum-pool-size ≤ (DB max_connections / container count)
# Assuming RDS max connections 100, container count 10
spring.datasource.hikari.maximum-pool-size=8
spring.datasource.hikari.minimum-idle=4
spring.datasource.hikari.connection-timeout=20000
spring.datasource.hikari.validation-timeout=5000
spring.datasource.hikari.leak-detection-threshold=60000
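The sizing formula in the comments can be worked through as code. This sketch assumes a hypothetical `reservedConnections` parameter for connections kept free for migrations, monitoring, and admin sessions. That parameter is not part of the formula above, but it is one way to arrive at 8 rather than the upper bound of 10.

```java
final class PoolSizeSketch {
    /** Upper bound for maximum-pool-size per container, after setting aside reserved connections. */
    static int maxPoolSize(int dbMaxConnections, int containerCount, int reservedConnections) {
        if (containerCount <= 0) {
            throw new IllegalArgumentException("containerCount must be positive");
        }
        return Math.max(1, (dbMaxConnections - reservedConnections) / containerCount);
    }

    public static void main(String[] args) {
        // 100 connections, 10 containers, 20 held back (assumed headroom)
        System.out.println(maxPoolSize(100, 10, 20));
        // Without headroom the formula gives the hard upper bound
        System.out.println(maxPoolSize(100, 10, 0));
    }
}
```

Size for the peak container count that Auto Scaling can reach, not for the normal count, since every new container opens its own pool.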
3. Database Index Optimization
-- Ensure critical indexes exist (official SQL scripts should include them)
CREATE INDEX IF NOT EXISTS idx_qrtz_t_nft_st
ON QRTZ_TRIGGERS(SCHED_NAME, TRIGGER_STATE, NEXT_FIRE_TIME);
CREATE INDEX IF NOT EXISTS idx_qrtz_t_nft_st_misfire
ON QRTZ_TRIGGERS(SCHED_NAME, MISFIRE_INSTR, NEXT_FIRE_TIME, TRIGGER_STATE);
CREATE INDEX IF NOT EXISTS idx_qrtz_ft_trig_inst_name
ON QRTZ_FIRED_TRIGGERS(SCHED_NAME, INSTANCE_NAME);
-- Regularly analyze tables to update statistics (PostgreSQL)
ANALYZE QRTZ_TRIGGERS;
ANALYZE QRTZ_FIRED_TRIGGERS;
ANALYZE QRTZ_SCHEDULER_STATE;
4. Clean Historical Data
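-- Periodically clean stale fired records (via Cron Job)
-- Quartz deletes these rows itself when a job completes; this only removes leftovers older than 7 days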
DELETE FROM QRTZ_FIRED_TRIGGERS
WHERE sched_name = 'MyECSScheduler'
AND fired_time < (EXTRACT(EPOCH FROM NOW() - INTERVAL '7 days') * 1000);
-- Clean orphaned lock records (anomaly cases)
DELETE FROM QRTZ_LOCKS
WHERE sched_name NOT IN (
SELECT DISTINCT sched_name FROM QRTZ_SCHEDULER_STATE
WHERE (EXTRACT(EPOCH FROM NOW()) * 1000 - last_checkin_time) < 300000
);
Troubleshooting Common Issues
Issue 1: Duplicate Task Execution
Symptom: Same task executed by multiple containers at same time
Possible Causes:
- Container system times not synchronized
- isClustered set to false
- Different containers using different instanceName
Solutions:
# 1. Check container time synchronization
# (wrapped in sh -c so that && is interpreted by a shell inside the container)
aws ecs execute-command \
  --cluster production-cluster \
  --task task-id-12345 \
  --container quartz-app \
  --interactive \
  --command "sh -c 'date && ntpq -p'"
# 2. Verify configuration
aws ecs execute-command \
  --cluster production-cluster \
  --task task-id-12345 \
  --container quartz-app \
  --interactive \
  --command "grep isClustered /app/application.properties"
-- 3. Check database instance names consistency
SELECT DISTINCT sched_name
FROM QRTZ_SCHEDULER_STATE;
-- Should return only one result: MyECSScheduler
Issue 2: Tasks Not Executing
Symptom: Scheduled time arrived but task didn’t execute
Possible Causes:
- All containers unable to acquire database lock
- Database connection pool exhausted
- Trigger state stuck (BLOCKED or ERROR)
Diagnostic Steps:
-- 1. Check Trigger status
SELECT
trigger_name,
trigger_state,
TO_TIMESTAMP(next_fire_time / 1000) AS next_fire,
TO_TIMESTAMP(prev_fire_time / 1000) AS prev_fire
FROM QRTZ_TRIGGERS
WHERE sched_name = 'MyECSScheduler'
AND trigger_name = 'dataSync';
-- 2. Check for stuck locks
SELECT * FROM QRTZ_LOCKS
WHERE sched_name = 'MyECSScheduler';
-- 3. Check database connection count
SELECT
count(*) AS active_connections,
max_conn.setting AS max_connections
FROM pg_stat_activity,
(SELECT setting FROM pg_settings WHERE name = 'max_connections') max_conn
WHERE datname = 'quartz_db'
GROUP BY max_conn.setting;
Solutions:
-- If Trigger stuck in BLOCKED state, manually reset
UPDATE QRTZ_TRIGGERS
SET TRIGGER_STATE = 'WAITING'
WHERE sched_name = 'MyECSScheduler'
AND trigger_state = 'BLOCKED';
Issue 3: Tasks Not Recovering After Container Failure
Symptom: After container crashes, its assigned tasks aren’t taken over by other containers
Possible Causes:
- clusterCheckinInterval set too long
- Failure detection grace period has not elapsed yet
Solutions:
# Shorten heartbeat interval (default 15000 milliseconds)
org.quartz.jobStore.clusterCheckinInterval=10000
# Quartz treats an instance as failed roughly one check-in interval plus a
# 7.5-second grace period after its last check-in: 10000 + 7500 = about 17.5 seconds
-- Manually mark failed instance (emergency case)
DELETE FROM QRTZ_SCHEDULER_STATE
WHERE instance_name = 'ecs-task-failed-instance'
AND sched_name = 'MyECSScheduler';
Best Practices Summary
1. Configuration Checklist
| Configuration Item | Required Value | Description |
|---|---|---|
| isClustered | true | Enable clustering mode |
| instanceId | AUTO | Auto-generate unique ID |
| instanceName | Same across all containers | E.g., MyECSScheduler |
| job-store-type | jdbc | Use database persistence |
| dataSource | Shared RDS | All containers connect to same database |
2. Job Class Design Principles
// ✅ Correct: Stateless, idempotent, robust exception handling
@DisallowConcurrentExecution // Prevent concurrent execution of same Job
@PersistJobDataAfterExecution // Persist JobDataMap after execution
public class DataSyncJob implements Job {
private static final Logger log = LoggerFactory.getLogger(DataSyncJob.class);
@Override
public void execute(JobExecutionContext context) throws JobExecutionException {
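// getSchedulerInstanceId() throws the checked SchedulerException,
// so it must be called inside the try block to compile
String instanceId = "unknown";
try {
instanceId = context.getScheduler().getSchedulerInstanceId();
log.info("[{}] Starting DataSyncJob", instanceId);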
// Idempotency check
if (isAlreadyProcessed(context)) {
log.warn("[{}] Job already processed, skipping", instanceId);
return;
}
// Execute business logic
performDataSync();
// Mark as processed
markAsProcessed(context);
log.info("[{}] DataSyncJob completed successfully", instanceId);
} catch (Exception e) {
log.error("[{}] DataSyncJob failed", instanceId, e);
// refireImmediately = false, don't retry immediately
// Avoid cascade failures in cluster
throw new JobExecutionException(e, false);
}
}
private boolean isAlreadyProcessed(JobExecutionContext context) {
// Implement idempotency check logic
return false;
}
private void performDataSync() {
// Business logic
}
private void markAsProcessed(JobExecutionContext context) {
// Mark processing status
}
}
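The `isAlreadyProcessed` and `markAsProcessed` placeholders above are left to the reader. One common approach is to claim a unique key per scheduled run, so that a second attempt for the same run is rejected. The sketch below shows that idea with an in-memory set standing in for a database table with a unique constraint. `IdempotencySketch` and `claim` are hypothetical names, and in a real job the key could be built from the job name and `context.getScheduledFireTime()`.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class IdempotencySketch {
    // In-memory stand-in for a table with a unique constraint on the run key.
    // Across containers this must live in shared storage, for example the RDS database.
    private final Set<String> processedRuns = ConcurrentHashMap.newKeySet();

    /** Returns true only for the first caller that claims the given scheduled run. */
    boolean claim(String jobName, long scheduledFireTimeMs) {
        // add() is atomic: exactly one caller per key sees true
        return processedRuns.add(jobName + ":" + scheduledFireTimeMs);
    }

    public static void main(String[] args) {
        IdempotencySketch runs = new IdempotencySketch();
        long scheduled = 1_760_925_600_000L; // 2025-10-20 02:00:00 UTC
        System.out.println("first attempt claimed: " + runs.claim("dataSync", scheduled));
        System.out.println("second attempt claimed: " + runs.claim("dataSync", scheduled));
    }
}
```

Claiming the key before doing the work and checking it in a single atomic step avoids the gap between a separate "check" and "mark" call.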
3. Maintenance Automation
-- Create periodic cleanup task (PostgreSQL cron extension)
SELECT cron.schedule(
'quartz-cleanup',
'0 3 * * *', -- Daily at 3:00 AM
$$
DELETE FROM QRTZ_FIRED_TRIGGERS
WHERE fired_time < (EXTRACT(EPOCH FROM NOW() - INTERVAL '7 days') * 1000);
$$
);
4. Security Recommendations
- Minimize Database User Permissions: Grant only SELECT, INSERT, UPDATE, DELETE on Quartz tables
- Use Secrets Manager: Manage database passwords via AWS Secrets Manager
- Enable SSL Connections: Use SSL/TLS encryption for RDS connections
- Network Isolation: Quartz containers and RDS communicate within private subnets
# Enable SSL connection
spring.datasource.url=jdbc:postgresql://quartz-db.abc.rds.amazonaws.com:5432/quartz?ssl=true&sslmode=require
Conclusion
Quartz clustering mode addresses task scheduling in multi-container environments through database locking and heartbeat detection. In AWS ECS environments, make sure these key configurations are in place:
- ✅ Enable clustering mode (isClustered=true)
- ✅ Auto-generate instance ID (instanceId=AUTO)
- ✅ All containers use same Scheduler name
- ✅ Share same RDS database
- ✅ Proper connection pool and performance optimization
Through the configuration methods, monitoring strategies, and best practices introduced in this article, you can build a highly available, scalable, duplicate-execution-proof enterprise-grade task scheduling system. Regardless of container count changes (Auto Scaling), task scheduling reliability and consistency are maintained.
Key Advantages:
- 🔒 Prevent duplicate task execution
- 🔄 Automatic failover
- 📈 Support horizontal scaling
- 🛡️ Production-grade reliability
It’s recommended to first test and validate with 2-3 containers in development environments, confirm clustering mode operates correctly, then deploy to production and enable Auto Scaling.