# Runbook: Celery Queue Issues

- **Last Updated:** 2026-02-21
- **Severity:** High
- **Estimated TTR:** 45 minutes
- **Owner:** DevOps Team

## Symptoms
- Background tasks not executing
- Rota generation stuck in "pending" state
- Emails not sending
- Scheduled tasks not running

## Detection
- Alert: `HighCeleryQueueDepth` (queue depth > 100)
- Dashboard: Celery Flower monitoring
- Query: check queue depth directly

## Diagnosis Steps

1. Check Celery worker status.

    Expected: workers running and responding. If different: workers may be down or stuck.
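
    A possible check, assuming the Celery app module is named `app` and the workers run under a systemd unit named `celery` (both names are assumptions; adjust to your deployment):

    ```bash
    # Ask every worker to reply; a worker that does not answer is down or hung
    celery -A app inspect ping

    # If the workers are systemd-managed:
    systemctl status celery
    ```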

2. Check queue depth.

    Expected: queue depth < 100. If different: tasks may be backing up.
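
    One way to read the depth directly, assuming the default Redis broker on localhost and the default queue name `celery`:

    ```bash
    # With the Redis broker, each queue is a Redis list named after the queue
    redis-cli llen celery
    ```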

3. Check Celery logs.

    Expected: normal task-execution logs. If different: look for errors or stuck tasks.
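
    Assuming systemd-managed workers (otherwise substitute your log file path):

    ```bash
    # Recent worker output; look for tracebacks or repeated retries
    journalctl -u celery --since "15 minutes ago"
    ```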

4. Verify broker connectivity.

    Expected: Redis responding and healthy. If different: the broker may be down.
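
    A quick liveness check, assuming Redis runs on localhost with the default port:

    ```bash
    # Should print PONG
    redis-cli ping
    ```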

## Root Causes
| Cause | Likelihood | How to Confirm |
|---|---|---|
| Worker crashed | High | Worker process not running |
| Queue full / stuck | Medium | Queue depth not decreasing |
| Task serialization error | Medium | Error logs mentioning pickle/serialization |
| Worker prefetch issue | Low | Tasks queued but not starting |
| Broker connection issue | Medium | Redis connection failures |

## Resolution Steps

### For Worker Crashed

1. Restart the Celery worker.
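
    Assuming a systemd unit named `celery` (adjust to your service manager):

    ```bash
    sudo systemctl restart celery
    ```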

2. Verify the worker started.
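
    Assuming the app module is named `app`:

    ```bash
    # All workers should reply with "pong"
    celery -A app inspect ping
    ```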

3. Check for the crash reason.
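
    Two common culprits are unhandled exceptions and OOM kills; a sketch, assuming systemd-managed workers:

    ```bash
    # Errors in the worker's recent output
    journalctl -u celery -n 200 | grep -iE "error|traceback|killed"

    # Kernel log evidence of the OOM killer
    dmesg | grep -i "out of memory"
    ```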

4. Verify: tasks are processing normally.

### For Queue Full/Stuck

1. Check for stuck tasks.
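
    A way to see what the workers are holding, assuming the app module is named `app`:

    ```bash
    # Tasks currently executing
    celery -A app inspect active

    # Tasks fetched by workers but not yet started
    celery -A app inspect reserved
    ```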

2. Purge the stuck queue (caution: this loses the queued tasks).
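
    Assuming the app module is named `app`; `-f` skips the confirmation prompt:

    ```bash
    # Deletes every waiting message from the default queue
    celery -A app purge -f
    ```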

3. Adjust worker concurrency if needed.
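
    A sketch of raising concurrency when tasks arrive faster than they drain; the app name and worker count are assumptions, and in production this flag belongs in your service unit or deployment config:

    ```bash
    # Run the worker with more processes than the default (CPU count)
    celery -A app worker --concurrency=8 --loglevel=info
    ```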

4. Verify: queue depth is decreasing.
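
    One way to watch the drain, assuming the default Redis broker and queue name:

    ```bash
    # Re-check the depth every 5 seconds; the number should trend down
    watch -n 5 'redis-cli llen celery'
    ```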

### For Task Serialization Errors

1. Check the task definitions.
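
    Serialization errors usually mean a task argument is not JSON-serializable, or the serializer settings disagree between producer and worker. One way to locate the relevant settings, assuming the code lives under `app/` (the path is an assumption):

    ```bash
    # Find where serializers are configured
    grep -rn "task_serializer\|result_serializer\|accept_content" app/
    ```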

2. Clear the stuck tasks from the queue.
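
    Assuming the app module is named `app`; this removes the undeliverable messages along with everything else queued:

    ```bash
    celery -A app purge -f
    ```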

3. Re-queue the tasks with proper serialization.
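
    A sketch of re-triggering an affected task from the command line; `tasks.generate_rota` is a hypothetical task name:

    ```bash
    # Invoke the task by name with JSON-serializable arguments
    celery -A app call tasks.generate_rota
    ```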

4. Verify: tasks are executing without errors.

### For Broker Connection Issues

1. Restart Redis.
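
    Assuming a systemd unit; the unit is named `redis` on some distributions and `redis-server` on others:

    ```bash
    sudo systemctl restart redis
    ```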

2. Check Redis memory. If `maxmemory` is reached, consider clearing old task results.
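
    A sketch of both checks; `celery-task-meta-*` is Celery's default key pattern for a Redis result backend:

    ```bash
    # Compare used memory against the configured limit
    redis-cli info memory | grep -E "used_memory_human|maxmemory_human"

    # Delete old task results (safe only for results you no longer need)
    redis-cli --scan --pattern 'celery-task-meta-*' | xargs -r redis-cli del
    ```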

3. Verify: workers can connect to the broker.
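
    Workers log the broker URL when they (re)connect; a heuristic check, assuming systemd-managed workers:

    ```bash
    # A successful reconnection shows a "Connected to redis://..." line
    journalctl -u celery -n 50 | grep -i "connected to"
    ```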

## Verification

After applying the fix, verify:

- [ ] Workers processing tasks
- [ ] Queue depth decreasing to normal
- [ ] No error logs in Celery output
- [ ] Background tasks completing successfully

## Prevention
- Set up monitoring for worker health
- Configure automatic worker restart on failure
- Set up alerts for queue depth
- Implement task result expiration (`result_expires`)
- Run regular Celery health checks
- Document task dependencies and requirements

## Escalation
- If unresolved after 1 hour, escalate to: DevOps Lead
- On-call contact: See on-call roster

## Related Issues
- Related runbooks: `calculation_failures.md`, `performance_degradation.md`
- Related services: Redis, Celery