
Runbook: Celery Queue Issues

Last Updated: 2026-02-21
Severity: High
Estimated TTR: 45 minutes
Owner: DevOps Team

Symptoms

  • Background tasks not executing
  • Rota generation stuck in "pending" state
  • Emails not sending
  • Scheduled tasks not running

Detection

  • Alert: HighCeleryQueueDepth (queue depth > 100)
  • Dashboard: Celery Flower monitoring
  • Query: Check queue depth directly
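
For the direct query, a minimal sketch using redis-py (assumptions: the broker is Redis at the default URL, and pending tasks wait on the default "celery" queue; adjust REDIS_URL for your deployment):

    import redis

    # Assumption: default broker URL; in production read this from settings/env
    REDIS_URL = "redis://localhost:6379/0"

    r = redis.Redis.from_url(REDIS_URL)
    depth = r.llen("celery")  # pending tasks sit in a Redis list named "celery"
    print(f"celery queue depth: {depth}")
    if depth > 100:  # same threshold as the HighCeleryQueueDepth alert
        print("WARNING: queue depth above alert threshold")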

Diagnosis Steps

  1. Check Celery worker status:

    # Check if workers are running
    docker compose ps celery_worker
    
    # Check worker stats
    docker compose exec web celery -A rota inspect active
    
    # Check registered tasks
    docker compose exec web celery -A rota inspect registered
    
    Expected: Workers running and responding
    If different: Workers may be down or stuck

  2. Check queue depth:

    # Check Redis queue length
    docker compose exec redis redis-cli LLEN celery
    
    # Or with Celery
    docker compose exec web celery -A rota inspect active_queues
    
    Expected: Queue depth < 100
    If different: Tasks may be backing up

  3. Check Celery logs:

    # Worker logs
    docker compose logs celery_worker -f --tail=100
    
    # Beat logs (scheduled tasks are in the same container)
    docker compose logs celery_worker -f --tail=100 | grep beat
    
    Expected: Normal task execution logs
    If different: Look for errors or stuck tasks

  4. Verify broker connectivity:

    # Test Redis connection
    docker compose exec redis redis-cli ping
    
    # Check Redis stats
    docker compose exec redis redis-cli INFO stats
    
    Expected: Redis responding and healthy
    If different: Broker may be down
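
Steps 1, 2, and 4 can also be run as one programmatic check; a sketch, assuming the Celery app object is importable as in the re-queue example further down:

    from tasks.celery_app import app

    # ping() returns a list of {worker_name: {'ok': 'pong'}} replies,
    # or an empty list if no worker responds within the timeout
    replies = app.control.ping(timeout=5)
    if not replies:
        print("No workers responding -- down or stuck")
    else:
        for reply in replies:
            print(f"Worker OK: {reply}")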

Root Causes

Cause                     Likelihood  How to Confirm
Worker crashed            High        Worker process not running
Queue full / stuck        Medium      Queue depth not decreasing
Task serialization error  Medium      Error logs mentioning pickle/serialization
Worker prefetch issue     Low         Tasks queued but not starting
Broker connection issue   Medium      Redis connection failures

Resolution Steps

For Worker Crashed

  1. Restart Celery worker:

    docker compose restart celery_worker
    

  2. Verify worker started:

    docker compose exec web celery -A rota inspect active
    

  3. Check for crash reason:

    docker compose logs celery_worker --tail=100
    

  4. Verify: Tasks processing normally

For Queue Full/Stuck

  1. Check stuck tasks:

    # Use Flower to inspect tasks
    # Or check Redis directly (SCAN avoids blocking Redis, unlike KEYS)
    docker compose exec redis redis-cli --scan --pattern 'celery-task-meta-*'
    

  2. Purge stuck queue (caution: this discards all queued tasks):

    # Purge all pending tasks from the default queue (destructive)
    docker compose exec web celery -A rota purge -Q celery

    # Or restart workers to return prefetched (reserved) tasks to the queue
    docker compose restart celery_worker
    

  3. Adjust worker concurrency if needed (prefetch settings can also help; see the sketch after this list):

    # In docker-compose.yml, adjust the celery_worker command:
    # celery -A rota worker --concurrency=4
    

  4. Verify: Queue depth decreasing
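
If the "Worker prefetch issue" root cause is suspected (tasks reserved by a worker but never starting), a sketch of the relevant settings, assuming the app object from tasks.celery_app; the values are starting points, not tuned recommendations:

    from tasks.celery_app import app

    app.conf.update(
        worker_prefetch_multiplier=1,  # reserve one task per worker process at a time
        task_acks_late=True,           # unfinished tasks are redelivered if a worker dies
    )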

For Task Serialization Errors

  1. Check task definitions (a global serializer setting is sketched after this list):

    # Ensure task arguments are JSON-serializable
    from tasks.celery_app import app

    @app.task
    def my_task(clinician_id, start_date, end_date):
        # Pass IDs and primitives, not model instances
        pass
    

  2. Clear stuck tasks from queue:

    docker compose exec redis redis-cli DEL celery
    

  3. Re-queue tasks with proper serialization:

    # Re-submit task with correct arguments
    from tasks.celery_app import app
    app.send_task('module.task_name', args=[clinician_id, start_date, end_date])
    

  4. Verify: Tasks executing without errors
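
To prevent this class of failure globally, a sketch of serializer settings (again assuming the app object from tasks.celery_app):

    from tasks.celery_app import app

    app.conf.update(
        task_serializer="json",    # serialize task arguments as JSON
        result_serializer="json",  # serialize results as JSON
        accept_content=["json"],   # reject non-JSON payloads such as pickle
    )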

For Broker Connection Issues

  1. Restart Redis:

    docker compose restart redis
    

  2. Check Redis memory:

    docker compose exec redis redis-cli INFO memory
    
    If maxmemory is reached, consider clearing old task results:
    docker compose exec redis sh -c 'redis-cli --scan --pattern "celery-task-meta-*" | xargs redis-cli DEL'
    

  3. Verify: Workers can connect to broker
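
To make workers more tolerant of brief broker outages, a sketch of broker settings (assumptions: Celery 5.3+ for broker_connection_retry_on_startup, and the Redis transport, where visibility_timeout applies):

    from tasks.celery_app import app

    app.conf.update(
        broker_connection_retry_on_startup=True,  # keep retrying the broker at worker boot
        broker_transport_options={
            "visibility_timeout": 3600,  # seconds before an unacked task is redelivered
        },
    )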

Verification

After applying the fix, verify:

  - [ ] Workers processing tasks
  - [ ] Queue depth decreasing to normal
  - [ ] No error logs in Celery output
  - [ ] Background tasks completing successfully

Prevention

  • Set up monitoring for worker health
  • Configure automatic worker restart on failure
  • Set up alerts for queue depth
  • Implement task result expiration (result_expires; see the sketch after this list)
  • Regular Celery health checks
  • Document task dependencies and requirements
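
A sketch of the result_expires setting mentioned above (assumptions: the app object from tasks.celery_app and a Redis result backend, where the expiry becomes a key TTL):

    from tasks.celery_app import app

    app.conf.update(
        result_expires=3600,  # seconds; celery-task-meta-* keys are dropped after this
    )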

Escalation

  • If unresolved after 1 hour, escalate to: DevOps Lead
  • On-call contact: See on-call roster
  • Related runbooks: calculation_failures.md, performance_degradation.md
  • Related services: Redis, Celery