Quartz Clustering Complete Guide: Preventing Duplicate Task Execution in ECS Multi-Container Environments

In modern microservice architectures, applications typically run across multiple container instances for load balancing and auto-scaling. When using Quartz Scheduler for task scheduling, a critical question arises: How do we ensure the same scheduled task doesn’t execute multiple times across different containers?

This article provides an in-depth exploration of Quartz clustering mechanisms and offers a complete configuration guide for AWS ECS environments, ensuring reliable and consistent task scheduling.

The Challenge: Multi-Container Environment Scenarios

Imagine your application is deployed on AWS ECS with Auto Scaling dynamically adjusting container counts:

  • Normal periods: 3 container instances running
  • Peak hours: Auto-scaled to 10 container instances
  • Scheduled task: Daily data synchronization at 2:00 AM

The Problem: Without proper configuration, at 2:00 AM all 10 containers would simultaneously execute the data sync task, causing:

  • ❌ Duplicate data processing
  • ❌ Database lock conflicts
  • ❌ System resource waste
  • ❌ Data inconsistency

The Solution: Quartz clustering mode uses database locking mechanisms to ensure only one container executes a specific task at any given time.

Quartz Clustering Core Mechanisms

1. Database Locking Mechanism (QRTZ_LOCKS Table)

Quartz uses pessimistic database locking to coordinate multiple container instances, preventing duplicate task execution.

QRTZ_LOCKS Table Structure

CREATE TABLE QRTZ_LOCKS (
  SCHED_NAME VARCHAR(120) NOT NULL,
  LOCK_NAME VARCHAR(40) NOT NULL,
  PRIMARY KEY (SCHED_NAME, LOCK_NAME)
);

-- System default lock types
-- TRIGGER_ACCESS: Controls Trigger read and update operations
-- STATE_ACCESS: Controls Scheduler state updates
-- JOB_ACCESS: Controls Job read and update operations

Lock Acquisition Flow

-- Container A attempts to acquire lock (PostgreSQL example)
BEGIN TRANSACTION;

-- Use SELECT FOR UPDATE to acquire exclusive lock
SELECT * FROM QRTZ_LOCKS 
WHERE SCHED_NAME = 'MyScheduler' AND LOCK_NAME = 'TRIGGER_ACCESS'
FOR UPDATE NOWAIT;

-- If successful, update Trigger state to "EXECUTING"
UPDATE QRTZ_TRIGGERS 
SET TRIGGER_STATE = 'EXECUTING', 
    FIRED_TIME = CURRENT_TIMESTAMP
WHERE TRIGGER_NAME = 'dataSync' 
  AND TRIGGER_GROUP = 'DEFAULT'
  AND TRIGGER_STATE = 'WAITING';

-- Record execution information
INSERT INTO QRTZ_FIRED_TRIGGERS (
  SCHED_NAME, ENTRY_ID, TRIGGER_NAME, TRIGGER_GROUP,
  INSTANCE_NAME, FIRED_TIME, STATE
) VALUES (
  'MyScheduler', 'unique_entry_id', 'dataSync', 'DEFAULT',
  'ecs-task-12345-abc', CURRENT_TIMESTAMP, 'EXECUTING'
);

COMMIT;

Other Container Handling

-- Container B simultaneously attempts to acquire lock
BEGIN TRANSACTION;

SELECT * FROM QRTZ_LOCKS 
WHERE SCHED_NAME = 'MyScheduler' AND LOCK_NAME = 'TRIGGER_ACCESS'
FOR UPDATE NOWAIT;
-- ❌ Fails! Database returns error: lock not available

-- Container B will wait or skip, depending on configuration
ROLLBACK;

2. Heartbeat Detection and Failover

Each container instance periodically updates the QRTZ_SCHEDULER_STATE table to signal that it is still alive.

QRTZ_SCHEDULER_STATE Table

CREATE TABLE QRTZ_SCHEDULER_STATE (
  SCHED_NAME VARCHAR(120) NOT NULL,
  INSTANCE_NAME VARCHAR(200) NOT NULL,
  LAST_CHECKIN_TIME BIGINT NOT NULL,
  CHECKIN_INTERVAL BIGINT NOT NULL,
  PRIMARY KEY (SCHED_NAME, INSTANCE_NAME)
);

Heartbeat Example

-- Query all active container instances
SELECT 
  instance_name,
  TO_TIMESTAMP(last_checkin_time / 1000) AS last_checkin,
  checkin_interval,
  EXTRACT(EPOCH FROM (NOW() - TO_TIMESTAMP(last_checkin_time / 1000))) * 1000 AS ms_since_checkin
FROM QRTZ_SCHEDULER_STATE
WHERE sched_name = 'MyECSScheduler'
ORDER BY last_checkin_time DESC;

-- Sample output
/*
instance_name              | last_checkin            | checkin_interval | ms_since_checkin
---------------------------|-------------------------|------------------|------------------
ecs-task-12345-abc-001     | 2025-10-20 10:30:45     | 20000            | 5234
ecs-task-67890-def-002     | 2025-10-20 10:30:47     | 20000            | 3125
ecs-task-11111-ghi-003     | 2025-10-20 10:30:43     | 20000            | 7456
ecs-task-22222-jkl-004     | 2025-10-20 10:29:10     | 20000            | 100234  ⚠️ Likely failed
*/

Failover Mechanism

Quartz automatically detects failed container instances:

  • Detection Threshold: last_checkin_time exceeds checkin_interval × multiplier (default 7x)
  • Automatic Takeover: Healthy container instances take over failed instance’s tasks
  • Resume Execution: Ensures tasks don’t stop due to single container crashes

-- Mark failed instance's tasks as "executable" state
UPDATE QRTZ_TRIGGERS
SET TRIGGER_STATE = 'WAITING'
WHERE TRIGGER_STATE = 'ACQUIRED'
  AND EXISTS (
    SELECT 1 FROM QRTZ_FIRED_TRIGGERS ft
    WHERE ft.trigger_name = QRTZ_TRIGGERS.trigger_name
      AND ft.instance_name = 'ecs-task-22222-jkl-004'  -- Failed instance
  );

Complete ECS Environment Configuration

1. Spring Boot Application Configuration

application.properties

# ========================================
# DataSource Configuration (RDS PostgreSQL)
# ========================================
spring.datasource.url=jdbc:postgresql://${RDS_ENDPOINT}:5432/quartz_db
spring.datasource.username=quartz_user
spring.datasource.password=${DB_PASSWORD}
spring.datasource.driver-class-name=org.postgresql.Driver

# ========================================
# HikariCP Connection Pool Configuration
# ========================================
spring.datasource.hikari.maximum-pool-size=10
spring.datasource.hikari.minimum-idle=5
spring.datasource.hikari.connection-timeout=30000
spring.datasource.hikari.idle-timeout=600000
spring.datasource.hikari.max-lifetime=1800000
spring.datasource.hikari.pool-name=QuartzHikariPool

# ========================================
# Quartz Core Configuration
# ========================================
spring.quartz.job-store-type=jdbc
# Use Flyway/Liquibase to manage the Quartz schema instead of auto-initialization
spring.quartz.jdbc.initialize-schema=never

# ========================================
# Clustering Configuration (CRITICAL!)
# ========================================
# Enable clustering mode
spring.quartz.properties.org.quartz.jobStore.isClustered=true

# Heartbeat interval (milliseconds), default 15000
spring.quartz.properties.org.quartz.jobStore.clusterCheckinInterval=20000

# Automatically generate unique instance ID
spring.quartz.properties.org.quartz.scheduler.instanceId=AUTO

# All containers must use the same scheduler name
spring.quartz.properties.org.quartz.scheduler.instanceName=MyECSScheduler

# ========================================
# JobStore Configuration
# ========================================
spring.quartz.properties.org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
spring.quartz.properties.org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
spring.quartz.properties.org.quartz.jobStore.tablePrefix=QRTZ_
spring.quartz.properties.org.quartz.jobStore.useProperties=false

# Optimization: Batch Trigger acquisition
spring.quartz.properties.org.quartz.scheduler.batchTriggerAcquisitionMaxCount=10
spring.quartz.properties.org.quartz.scheduler.batchTriggerAcquisitionFireAheadTimeWindow=5000

# ========================================
# ThreadPool Configuration
# ========================================
spring.quartz.properties.org.quartz.threadPool.class=org.quartz.simpl.SimpleThreadPool
spring.quartz.properties.org.quartz.threadPool.threadCount=5
spring.quartz.properties.org.quartz.threadPool.threadPriority=5
spring.quartz.properties.org.quartz.threadPool.threadsInheritContextClassLoaderOfInitializingThread=true

Java Configuration Class (Optional)

@Configuration
public class QuartzClusterConfig {
    
    @Value("${spring.datasource.url}")
    private String datasourceUrl;
    
    @Bean
    public SchedulerFactoryBean schedulerFactoryBean(DataSource dataSource) {
        SchedulerFactoryBean factory = new SchedulerFactoryBean();
        factory.setDataSource(dataSource);
        factory.setOverwriteExistingJobs(true);
        factory.setAutoStartup(true);
        factory.setWaitForJobsToCompleteOnShutdown(true);
        factory.setQuartzProperties(quartzProperties());
        
        return factory;
    }
    
    private Properties quartzProperties() {
        Properties props = new Properties();
        
        // Clustering configuration
        props.put("org.quartz.scheduler.instanceId", "AUTO");
        props.put("org.quartz.scheduler.instanceName", "MyECSScheduler");
        props.put("org.quartz.jobStore.isClustered", "true");
        props.put("org.quartz.jobStore.clusterCheckinInterval", "20000");
        
        // JobStore configuration
        props.put("org.quartz.jobStore.class", "org.quartz.impl.jdbcjobstore.JobStoreTX");
        props.put("org.quartz.jobStore.driverDelegateClass", 
                  "org.quartz.impl.jdbcjobstore.PostgreSQLDelegate");
        props.put("org.quartz.jobStore.tablePrefix", "QRTZ_");
        
        return props;
    }
}
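
The daily 2:00 AM data sync job referenced throughout this article also has to be registered with the scheduler. Below is a minimal sketch of how the job and its cron trigger could be declared as Spring beans; the names DataSyncJob and dataSync match the examples used in this article, but the rest is an illustrative assumption, not the only way to register jobs. With Spring Boot's Quartz auto-configuration (the spring.quartz.* properties), JobDetail and Trigger beans are picked up automatically; if you use the custom SchedulerFactoryBean shown above instead, set them on the factory explicitly.

import org.quartz.CronScheduleBuilder;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class DataSyncJobConfig {

    // Durable JobDetail so the job definition is kept in the shared JDBC store
    @Bean
    public JobDetail dataSyncJobDetail() {
        return JobBuilder.newJob(DataSyncJob.class)
                .withIdentity("dataSync", "DEFAULT")
                .storeDurably()
                .build();
    }

    // Cron trigger: fire daily at 02:00
    @Bean
    public Trigger dataSyncTrigger(JobDetail dataSyncJobDetail) {
        return TriggerBuilder.newTrigger()
                .forJob(dataSyncJobDetail)
                .withIdentity("dataSync", "DEFAULT")
                // If a fire time was missed, run once immediately, then continue the schedule
                .withSchedule(CronScheduleBuilder.cronSchedule("0 0 2 * * ?")
                        .withMisfireHandlingInstructionFireAndProceed())
                .build();
    }
}

Every container loads the same bean definitions, so each scheduler instance sees the same trigger; the shared JDBC store and the TRIGGER_ACCESS lock described earlier are what keep the 02:00 firing to a single container.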

2. ECS Task Definition Configuration

{
  "family": "quartz-scheduler-cluster",
  "taskRoleArn": "arn:aws:iam::123456789012:role/ecsTaskRole",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "containerDefinitions": [
    {
      "name": "quartz-app",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/quartz-app:latest",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {
          "name": "SPRING_PROFILES_ACTIVE",
          "value": "production"
        },
        {
          "name": "RDS_ENDPOINT",
          "value": "quartz-db.cluster-abc123.us-east-1.rds.amazonaws.com"
        },
        {
          "name": "JAVA_OPTS",
          "value": "-Xms512m -Xmx768m -Duser.timezone=UTC"
        }
      ],
      "secrets": [
        {
          "name": "DB_PASSWORD",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:quartz-db-password-abc123"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/quartz-scheduler",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "quartz"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/actuator/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      }
    }
  ]
}

3. ECS Service Configuration (Auto Scaling)

{
  "serviceName": "quartz-scheduler-service",
  "cluster": "production-cluster",
  "taskDefinition": "quartz-scheduler-cluster:5",
  "desiredCount": 3,
  "launchType": "FARGATE",
  "networkConfiguration": {
    "awsvpcConfiguration": {
      "subnets": [
        "subnet-abc123",
        "subnet-def456"
      ],
      "securityGroups": ["sg-quartz-app"],
      "assignPublicIp": "DISABLED"
    }
  },
  "loadBalancers": [
    {
      "targetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/quartz-tg/abc123",
      "containerName": "quartz-app",
      "containerPort": 8080
    }
  ],
  "deploymentConfiguration": {
    "maximumPercent": 200,
    "minimumHealthyPercent": 100
  }
}

Detailed Task Execution Flow

Timeline Flowchart

Time: 2025-10-20 02:00:00 (Task trigger time)

Container Status: Container A, B, C simultaneously running

─────────────────────────────────────────────────────────

T0 (02:00:00.000)
  Trigger reaches trigger time (next_fire_time = 02:00:00)
  QRTZ_TRIGGERS.TRIGGER_STATE = 'WAITING'
  
  ↓

T1 (02:00:00.050)
  All containers simultaneously scan database for pending tasks
  
  Container A: SELECT * FROM QRTZ_TRIGGERS WHERE next_fire_time <= NOW()
  Container B: SELECT * FROM QRTZ_TRIGGERS WHERE next_fire_time <= NOW()
  Container C: SELECT * FROM QRTZ_TRIGGERS WHERE next_fire_time <= NOW()
  
  Result: All three containers discover 'dataSync' task needs execution
  
  ↓

T2 (02:00:00.100)
  Three containers simultaneously attempt to acquire TRIGGER_ACCESS lock
  
  Container A: SELECT * FROM QRTZ_LOCKS ... FOR UPDATE NOWAIT
              ✅ Lock acquired successfully
              
  Container B: SELECT * FROM QRTZ_LOCKS ... FOR UPDATE NOWAIT
              ❌ Failed (ERROR: could not obtain lock)
              
  Container C: SELECT * FROM QRTZ_LOCKS ... FOR UPDATE NOWAIT
              ❌ Failed (ERROR: could not obtain lock)
  
  ↓

T3 (02:00:00.150)
  Container A updates task state and begins execution
  
  UPDATE QRTZ_TRIGGERS 
  SET TRIGGER_STATE='EXECUTING', FIRED_TIME=1729382400150
  WHERE TRIGGER_NAME='dataSync'
  
  INSERT INTO QRTZ_FIRED_TRIGGERS (
    instance_name='ecs-task-12345-abc',
    fired_time=1729382400150,
    state='EXECUTING'
  )
  
  COMMIT; -- Release lock
  
  ↓

T4 (02:00:00.200)
  Container B, C check task status
  
  SELECT TRIGGER_STATE FROM QRTZ_TRIGGERS WHERE TRIGGER_NAME='dataSync'
  Result: TRIGGER_STATE = 'EXECUTING'
  
  ⏭️ Skip execution (task already executed by Container A)
  
  ↓

T5 (02:00:05.000) - Assuming task execution takes 5 seconds
  Container A completes task execution
  
  UPDATE QRTZ_TRIGGERS 
  SET TRIGGER_STATE='WAITING', 
      PREV_FIRE_TIME=1729382400000,
      NEXT_FIRE_TIME=1729468800000  -- Tomorrow 02:00
  WHERE TRIGGER_NAME='dataSync'
  
  DELETE FROM QRTZ_FIRED_TRIGGERS 
  WHERE TRIGGER_NAME='dataSync' AND instance_name='ecs-task-12345-abc'
  
  ↓

Result: ✅ Task executed only once
       ✅ Container A handled execution
       ✅ Container B, C automatically skipped

Example Execution Logs

// Container A (ecs-task-12345-abc)
2025-10-20 02:00:00.050 INFO  [QuartzScheduler] Batch acquisition: 1 trigger(s)
2025-10-20 02:00:00.100 INFO  [JobRunShell] Acquiring trigger: DEFAULT.dataSync
2025-10-20 02:00:00.120 INFO  [JobRunShell] Acquired trigger: DEFAULT.dataSync
2025-10-20 02:00:00.150 INFO  [JobRunShell] Executing job: DataSyncJob
2025-10-20 02:00:05.000 INFO  [JobRunShell] Job executed successfully

// Container B (ecs-task-67890-def)
2025-10-20 02:00:00.050 INFO  [QuartzScheduler] Batch acquisition: 1 trigger(s)
2025-10-20 02:00:00.100 WARN  [JobRunShell] Unable to acquire trigger: DEFAULT.dataSync
2025-10-20 02:00:00.105 INFO  [QuartzScheduler] Trigger already acquired by another instance

// Container C (ecs-task-11111-ghi)
2025-10-20 02:00:00.050 INFO  [QuartzScheduler] Batch acquisition: 1 trigger(s)
2025-10-20 02:00:00.102 WARN  [JobRunShell] Unable to acquire trigger: DEFAULT.dataSync
2025-10-20 02:00:00.107 INFO  [QuartzScheduler] Trigger already acquired by another instance

Monitoring and Verification

1. Database-Level Monitoring

Check Cluster Status

-- View all active container instances
SELECT 
  instance_name,
  TO_TIMESTAMP(last_checkin_time / 1000) AS last_checkin,
  checkin_interval / 1000 AS checkin_interval_sec,
  CASE 
    WHEN (EXTRACT(EPOCH FROM NOW()) * 1000 - last_checkin_time) > (checkin_interval * 3)
    THEN '⚠️ Possibly Failed'
    ELSE '✅ Healthy'
  END AS status
FROM QRTZ_SCHEDULER_STATE
WHERE sched_name = 'MyECSScheduler'
ORDER BY last_checkin_time DESC;

Track Task Execution History

-- View most recent 10 task execution records
SELECT 
  ft.trigger_name,
  ft.job_name,
  ft.instance_name,
  TO_TIMESTAMP(ft.fired_time / 1000) AS executed_at,
  ft.state
FROM QRTZ_FIRED_TRIGGERS ft
WHERE ft.sched_name = 'MyECSScheduler'
ORDER BY ft.fired_time DESC
LIMIT 10;

Detect Duplicate Task Execution

-- Find tasks executed by multiple instances at same time (anomaly)
SELECT 
  trigger_name,
  TO_TIMESTAMP(fired_time / 1000) AS fired_at,
  COUNT(DISTINCT instance_name) AS execution_count,
  STRING_AGG(instance_name, ', ') AS instances
FROM QRTZ_FIRED_TRIGGERS
WHERE sched_name = 'MyECSScheduler'
  AND fired_time > (EXTRACT(EPOCH FROM NOW() - INTERVAL '1 hour') * 1000)
GROUP BY trigger_name, fired_time
HAVING COUNT(DISTINCT instance_name) > 1;

2. Application-Level Monitoring

Spring Boot Actuator Integration

@Component
public class QuartzClusterHealthIndicator implements HealthIndicator {
    
    @Autowired
    private Scheduler scheduler;
    
    @Autowired
    private JdbcTemplate jdbcTemplate;
    
    @Override
    public Health health() {
        try {
            String instanceId = scheduler.getSchedulerInstanceId();
            boolean isClustered = scheduler.getMetaData().isJobStoreClustered();
            
            // Check number of active instances in cluster
            Integer activeInstances = jdbcTemplate.queryForObject(
                "SELECT COUNT(*) FROM QRTZ_SCHEDULER_STATE " +
                "WHERE sched_name = ? AND (? - last_checkin_time) < ?",
                Integer.class,
                "MyECSScheduler",
                System.currentTimeMillis(),
                60000  // Heartbeat within 60 seconds considered active
            );
            
            return Health.up()
                .withDetail("instanceId", instanceId)
                .withDetail("clustered", isClustered)
                .withDetail("activeInstances", activeInstances)
                .withDetail("currentlyExecutingJobs", 
                           scheduler.getCurrentlyExecutingJobs().size())
                .build();
                
        } catch (Exception e) {
            return Health.down()
                .withDetail("error", e.getMessage())
                .build();
        }
    }
}

Custom Monitoring Metrics

@Component
public class QuartzMetricsCollector {
    
    private static final Logger log = LoggerFactory.getLogger(QuartzMetricsCollector.class);
    
    @Autowired
    private Scheduler scheduler;
    
    @Autowired
    private MeterRegistry meterRegistry;
    
    // Gauges must be backed by an object the registry can keep reading
    private final AtomicInteger executingJobs = new AtomicInteger();
    private boolean gaugeRegistered = false;
    private int lastExecutedTotal = 0;
    
    @Scheduled(fixedRate = 30000) // Every 30 seconds
    public void collectMetrics() throws SchedulerException {
        String instanceId = scheduler.getSchedulerInstanceId();
        
        // Register the gauge once; afterwards it keeps reading the AtomicInteger
        if (!gaugeRegistered) {
            meterRegistry.gauge("quartz.jobs.executing",
                                Tags.of("instance", instanceId),
                                executingJobs);
            gaugeRegistered = true;
        }
        
        // Record currently executing job count
        executingJobs.set(scheduler.getCurrentlyExecutingJobs().size());
        
        // getNumberOfJobsExecuted() is cumulative, so publish only the delta
        SchedulerMetaData metaData = scheduler.getMetaData();
        int executedTotal = metaData.getNumberOfJobsExecuted();
        meterRegistry.counter("quartz.jobs.executed.total",
                              Tags.of("instance", instanceId))
                     .increment(executedTotal - lastExecutedTotal);
        lastExecutedTotal = executedTotal;
        
        log.info("Quartz Metrics - Instance: {}, Executing: {}, Total Executed: {}",
                 instanceId, executingJobs.get(), executedTotal);
    }
}

3. CloudWatch Integration

@Component
public class QuartzCloudWatchMetrics {
    
    @Autowired
    private AmazonCloudWatch cloudWatch;
    
    @Autowired
    private Scheduler scheduler;
    
    @Scheduled(fixedRate = 60000) // Every minute
    public void publishToCloudWatch() throws SchedulerException {
        String instanceId = scheduler.getSchedulerInstanceId();
        
        // Send custom metric to CloudWatch
        PutMetricDataRequest request = new PutMetricDataRequest()
            .withNamespace("QuartzScheduler/Cluster")
            .withMetricData(
                new MetricDatum()
                    .withMetricName("ExecutingJobs")
                    .withValue((double) scheduler.getCurrentlyExecutingJobs().size())
                    .withUnit(StandardUnit.Count)
                    .withTimestamp(new Date())
                    .withDimensions(
                        new Dimension()
                            .withName("InstanceId")
                            .withValue(instanceId)
                    )
            );
        
        cloudWatch.putMetricData(request);
    }
}

Performance Optimization Strategies

1. Batch Trigger Acquisition

Reduce database query frequency by acquiring multiple pending triggers at once:

# Acquire up to 10 Triggers at once
org.quartz.scheduler.batchTriggerAcquisitionMaxCount=10

# Acquire tasks 5 seconds before trigger time
org.quartz.scheduler.batchTriggerAcquisitionFireAheadTimeWindow=5000

2. Connection Pool Optimization

# HikariCP Connection Pool Configuration
# Adjust based on container count and database max connections
# Formula: maximum-pool-size ≤ (DB max_connections / container count)

# Assuming RDS max connections 100, container count 10
spring.datasource.hikari.maximum-pool-size=8
spring.datasource.hikari.minimum-idle=4
spring.datasource.hikari.connection-timeout=20000
spring.datasource.hikari.validation-timeout=5000
spring.datasource.hikari.leak-detection-threshold=60000

3. Database Index Optimization

-- Ensure critical indexes exist (official SQL scripts should include them)
CREATE INDEX IF NOT EXISTS idx_qrtz_t_nft_st 
ON QRTZ_TRIGGERS(SCHED_NAME, TRIGGER_STATE, NEXT_FIRE_TIME);

CREATE INDEX IF NOT EXISTS idx_qrtz_t_nft_st_misfire 
ON QRTZ_TRIGGERS(SCHED_NAME, MISFIRE_INSTR, NEXT_FIRE_TIME, TRIGGER_STATE);

CREATE INDEX IF NOT EXISTS idx_qrtz_ft_trig_inst_name 
ON QRTZ_FIRED_TRIGGERS(SCHED_NAME, INSTANCE_NAME);

-- Regularly analyze tables to update statistics (PostgreSQL)
ANALYZE QRTZ_TRIGGERS;
ANALYZE QRTZ_FIRED_TRIGGERS;
ANALYZE QRTZ_SCHEDULER_STATE;

4. Clean Historical Data

-- Periodically clean fired records (via Cron Job)
DELETE FROM QRTZ_FIRED_TRIGGERS 
WHERE sched_name = 'MyECSScheduler'
  AND fired_time < (EXTRACT(EPOCH FROM NOW() - INTERVAL '7 days') * 1000);

-- Clean orphaned lock records (anomaly cases)
DELETE FROM QRTZ_LOCKS
WHERE sched_name NOT IN (
  SELECT DISTINCT sched_name FROM QRTZ_SCHEDULER_STATE
  WHERE (EXTRACT(EPOCH FROM NOW()) * 1000 - last_checkin_time) < 300000
);

Troubleshooting Common Issues

Issue 1: Duplicate Task Execution

Symptom: Same task executed by multiple containers at same time

Possible Causes:

  • Container system times not synchronized
  • isClustered set to false
  • Different containers using different instanceName

Solutions:

# 1. Check container time synchronization
aws ecs execute-command \
  --cluster production-cluster \
  --task task-id-12345 \
  --container quartz-app \
  --interactive \
  --command "date && ntpq -p"

# 2. Verify configuration
aws ecs execute-command \
  --cluster production-cluster \
  --task task-id-12345 \
  --container quartz-app \
  --interactive \
  --command "cat /app/application.properties | grep isClustered"

-- 3. Check that all containers use the same scheduler name (instanceName)
SELECT DISTINCT sched_name 
FROM QRTZ_SCHEDULER_STATE;
-- Should return only one row: MyECSScheduler

Issue 2: Tasks Not Executing

Symptom: Scheduled time arrived but task didn’t execute

Possible Causes:

  • All containers unable to acquire database lock
  • Database connection pool exhausted
  • Trigger state stuck (BLOCKED or ERROR)

Diagnostic Steps:

-- 1. Check Trigger status
SELECT 
  trigger_name,
  trigger_state,
  TO_TIMESTAMP(next_fire_time / 1000) AS next_fire,
  TO_TIMESTAMP(prev_fire_time / 1000) AS prev_fire
FROM QRTZ_TRIGGERS
WHERE sched_name = 'MyECSScheduler'
  AND trigger_name = 'dataSync';

-- 2. Check for stuck locks
SELECT * FROM QRTZ_LOCKS 
WHERE sched_name = 'MyECSScheduler';

-- 3. Check database connection count
SELECT 
  count(*) AS active_connections,
  max_conn.setting AS max_connections
FROM pg_stat_activity,
     (SELECT setting FROM pg_settings WHERE name = 'max_connections') max_conn
WHERE datname = 'quartz_db'
GROUP BY max_conn.setting;

Solutions:

-- If Trigger stuck in BLOCKED state, manually reset
UPDATE QRTZ_TRIGGERS
SET TRIGGER_STATE = 'WAITING'
WHERE sched_name = 'MyECSScheduler'
  AND trigger_state = 'BLOCKED';

Issue 3: Tasks Not Recovering After Container Failure

Symptom: After container crashes, its assigned tasks aren’t taken over by other containers

Possible Causes:

  • clusterCheckinInterval set too long
  • Failure detection multiplier too high

Solutions:

# Shorten heartbeat interval (default 15000 milliseconds)
org.quartz.jobStore.clusterCheckinInterval=10000

# Quartz internal default failure detection is 7x checkin interval
# 10000 × 7 = 70 seconds to detect failure

-- Manually remove the failed instance's state record (emergency case)
DELETE FROM QRTZ_SCHEDULER_STATE
WHERE instance_name = 'ecs-task-failed-instance'
  AND sched_name = 'MyECSScheduler';

Best Practices Summary

1. Configuration Checklist

Configuration Item | Required Value             | Description
-------------------|----------------------------|---------------------------------------------
isClustered        | true                       | Enable clustering mode
instanceId         | AUTO                       | Auto-generate unique instance ID
instanceName       | Same across all containers | E.g., MyECSScheduler
job-store-type     | jdbc                       | Use database persistence
dataSource         | Shared RDS                 | All containers connect to the same database
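
To catch a missed checklist item early, it can help to verify the configuration at startup. The following sketch is illustrative (it is not part of the configuration above): it fails fast if the job store is not clustered and logs the instance ID each container is running under.

import org.quartz.Scheduler;
import org.quartz.SchedulerMetaData;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.ApplicationArguments;
import org.springframework.boot.ApplicationRunner;
import org.springframework.stereotype.Component;

@Component
public class QuartzClusterStartupCheck implements ApplicationRunner {

    private static final Logger log = LoggerFactory.getLogger(QuartzClusterStartupCheck.class);

    @Autowired
    private Scheduler scheduler;

    @Override
    public void run(ApplicationArguments args) throws Exception {
        SchedulerMetaData meta = scheduler.getMetaData();

        // Fail fast if the checklist is not satisfied
        if (!meta.isJobStoreClustered() || !meta.isJobStoreSupportsPersistence()) {
            throw new IllegalStateException(
                "Quartz must run with a clustered JDBC job store in this environment");
        }

        log.info("Quartz clustering OK - scheduler: {}, instanceId: {}",
                 meta.getSchedulerName(), meta.getSchedulerInstanceId());
    }
}

Each container logs its own AUTO-generated instance ID, which also makes it easy to correlate application logs with the QRTZ_SCHEDULER_STATE rows shown earlier.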

2. Job Class Design Principles

// ✅ Correct: Stateless, idempotent, robust exception handling
@DisallowConcurrentExecution  // Prevent concurrent execution of same Job
@PersistJobDataAfterExecution  // Persist JobDataMap after execution
public class DataSyncJob implements Job {
    
    private static final Logger log = LoggerFactory.getLogger(DataSyncJob.class);
    
    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        String instanceId = context.getScheduler().getSchedulerInstanceId();
        log.info("[{}] Starting DataSyncJob", instanceId);
        
        try {
            // Idempotency check
            if (isAlreadyProcessed(context)) {
                log.warn("[{}] Job already processed, skipping", instanceId);
                return;
            }
            
            // Execute business logic
            performDataSync();
            
            // Mark as processed
            markAsProcessed(context);
            
            log.info("[{}] DataSyncJob completed successfully", instanceId);
            
        } catch (Exception e) {
            log.error("[{}] DataSyncJob failed", instanceId, e);
            
            // refireImmediately = false, don't retry immediately
            // Avoid cascade failures in cluster
            throw new JobExecutionException(e, false);
        }
    }
    
    private boolean isAlreadyProcessed(JobExecutionContext context) {
        // Implement idempotency check logic
        return false;
    }
    
    private void performDataSync() {
        // Business logic
    }
    
    private void markAsProcessed(JobExecutionContext context) {
        // Mark processing status
    }
}
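
The isAlreadyProcessed and markAsProcessed stubs are left empty above. One possible way to fill them is to record each logical execution keyed by the scheduled fire time; the sketch below assumes a hypothetical job_execution_log table and is only one of several viable approaches (a unique business key or an upsert on the target data would work just as well).

import java.sql.Timestamp;

import org.quartz.JobExecutionContext;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Component;

// Hypothetical helper the stubs above could delegate to.
// Assumes a table: job_execution_log(job_name VARCHAR, scheduled_time TIMESTAMP,
//                                    PRIMARY KEY (job_name, scheduled_time))
@Component
public class JobExecutionTracker {

    @Autowired
    private JdbcTemplate jdbcTemplate;

    // True if some instance already recorded this scheduled fire time
    public boolean isAlreadyProcessed(JobExecutionContext context) {
        Integer count = jdbcTemplate.queryForObject(
            "SELECT COUNT(*) FROM job_execution_log WHERE job_name = ? AND scheduled_time = ?",
            Integer.class,
            context.getJobDetail().getKey().getName(),
            new Timestamp(context.getScheduledFireTime().getTime()));
        return count != null && count > 0;
    }

    // Insert a marker row; the primary key turns a concurrent duplicate into an error
    // instead of a second execution record
    public void markAsProcessed(JobExecutionContext context) {
        jdbcTemplate.update(
            "INSERT INTO job_execution_log (job_name, scheduled_time) VALUES (?, ?)",
            context.getJobDetail().getKey().getName(),
            new Timestamp(context.getScheduledFireTime().getTime()));
    }
}

With the Spring Boot Quartz starter, job instances are created through a Spring-aware job factory, so this tracker can be @Autowired into DataSyncJob and called from the two stub methods.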

3. Maintenance Automation

-- Create periodic cleanup task (PostgreSQL cron extension)
SELECT cron.schedule(
  'quartz-cleanup',
  '0 3 * * *',  -- Daily at 3:00 AM
  $$
  DELETE FROM QRTZ_FIRED_TRIGGERS 
  WHERE fired_time < (EXTRACT(EPOCH FROM NOW() - INTERVAL '7 days') * 1000);
  $$
);
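
If the pg_cron extension is not available on your RDS instance, the same cleanup can be driven from the application. A minimal sketch, assuming the scheduler name and retention window used elsewhere in this article:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class QuartzHistoryCleanup {

    private static final Logger log = LoggerFactory.getLogger(QuartzHistoryCleanup.class);

    @Autowired
    private JdbcTemplate jdbcTemplate;

    // Daily at 03:00: delete fired-trigger records older than 7 days.
    // The DELETE is idempotent, so it is harmless that every container runs it.
    @Scheduled(cron = "0 0 3 * * *")
    public void purgeOldFiredTriggers() {
        long cutoff = System.currentTimeMillis() - 7L * 24 * 60 * 60 * 1000;
        int deleted = jdbcTemplate.update(
            "DELETE FROM QRTZ_FIRED_TRIGGERS WHERE sched_name = ? AND fired_time < ?",
            "MyECSScheduler", cutoff);
        log.info("Purged {} old QRTZ_FIRED_TRIGGERS rows", deleted);
    }
}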

4. Security Recommendations

  • Minimize Database User Permissions: Grant only SELECT, INSERT, UPDATE, DELETE on Quartz tables
  • Use Secrets Manager: Manage database passwords via AWS Secrets Manager
  • Enable SSL Connections: Use SSL/TLS encryption for RDS connections
  • Network Isolation: Quartz containers and RDS communicate within private subnets

# Enable SSL connection
spring.datasource.url=jdbc:postgresql://quartz-db.abc.rds.amazonaws.com:5432/quartz?ssl=true&sslmode=require

Conclusion

Quartz clustering mode solves the task scheduling challenges of multi-container environments through database locking and heartbeat detection. In AWS ECS environments, ensure these key configurations:

  • ✅ Enable clustering mode (isClustered=true)
  • ✅ Auto-generate instance ID (instanceId=AUTO)
  • ✅ All containers use same Scheduler name
  • ✅ Share same RDS database
  • ✅ Proper connection pool and performance optimization

Through the configuration methods, monitoring strategies, and best practices introduced in this article, you can build a highly available, scalable, enterprise-grade task scheduling system that is protected against duplicate execution. Even as the container count changes with Auto Scaling, task scheduling remains reliable and consistent.

Key Advantages:

  • 🔒 Prevent duplicate task execution
  • 🔄 Automatic failover
  • 📈 Support horizontal scaling
  • 🛡️ Production-grade reliability

It’s recommended to first test with 2-3 containers in a development environment, confirm that clustering mode operates correctly, and only then deploy to production and enable Auto Scaling.
