# Runbook: Celery Queue Issues

- **Last Updated:** 2026-02-21
- **Severity:** High
- **Estimated TTR:** 45 minutes
- **Owner:** DevOps Team

## Symptoms
- Background tasks not executing
- Rota generation stuck in "pending" state
- Emails not sending
- Scheduled tasks not running

## Detection
- Alert: `HighCeleryQueueDepth` (queue depth > 100)
- Dashboard: Celery Flower monitoring
- Query: check queue depth directly

## Diagnosis Steps

1. Check Celery worker status.

    Expected: workers running and responding. If different: workers may be down or stuck.
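
    A possible check, assuming the Celery app module is named `app` and the workers run under a systemd unit named `celery` (both names are assumptions; adjust to your deployment):

    ```bash
    # Ask every worker to reply; a worker that does not answer is down or hung
    celery -A app inspect ping

    # If the workers are systemd-managed:
    systemctl status celery
    ```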

2. Check queue depth.

    Expected: queue depth < 100. If different: tasks may be backing up.
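
    One way to read the depth directly, assuming the default Redis broker on localhost and the default queue name `celery`:

    ```bash
    # With the Redis broker, each queue is a Redis list named after the queue
    redis-cli llen celery
    ```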

3. Check Celery logs.

    Expected: normal task-execution logs. If different: look for errors or stuck tasks.
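
    Assuming systemd-managed workers (otherwise substitute your log file path):

    ```bash
    # Recent worker output; look for tracebacks or repeated retries
    journalctl -u celery --since "15 minutes ago"
    ```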

4. Verify broker connectivity.

    Expected: Redis responding and healthy. If different: the broker may be down.
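
    A quick liveness check, assuming Redis runs on localhost with the default port:

    ```bash
    # Should print PONG
    redis-cli ping
    ```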

## Root Causes
| Cause | Likelihood | How to Confirm |
|---|---|---|
| Worker crashed | High | Worker process not running |
| Queue full / stuck | Medium | Queue depth not decreasing |
| Task serialization error | Medium | Error logs mentioning pickle/serialization |
| Worker prefetch issue | Low | Tasks queued but not starting |
| Broker connection issue | Medium | Redis connection failures |

## Resolution Steps

### For Worker Crashed

1. Restart the Celery worker.
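
    Assuming a systemd unit named `celery` (adjust to your service manager):

    ```bash
    sudo systemctl restart celery
    ```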

2. Verify the worker started.
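
    Assuming the app module is named `app`:

    ```bash
    # All workers should reply with "pong"
    celery -A app inspect ping
    ```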

3. Check for the crash reason.
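
    Two common culprits are unhandled exceptions and OOM kills; a sketch, assuming systemd-managed workers:

    ```bash
    # Errors in the worker's recent output
    journalctl -u celery -n 200 | grep -iE "error|traceback|killed"

    # Kernel log evidence of the OOM killer
    dmesg | grep -i "out of memory"
    ```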

4. Verify: tasks are processing normally.

### For Queue Full/Stuck

1. Check for stuck tasks.
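
    A way to see what the workers are holding, assuming the app module is named `app`:

    ```bash
    # Tasks currently executing
    celery -A app inspect active

    # Tasks fetched by workers but not yet started
    celery -A app inspect reserved
    ```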

2. Purge the stuck queue (caution: this loses the queued tasks).
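
    Assuming the app module is named `app`; `-f` skips the confirmation prompt:

    ```bash
    # Deletes every waiting message from the default queue
    celery -A app purge -f
    ```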

3. Adjust worker concurrency if needed.
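
    A sketch of raising concurrency when tasks arrive faster than they drain; the app name and worker count are assumptions, and in production this flag belongs in your service unit or deployment config:

    ```bash
    # Run the worker with more processes than the default (CPU count)
    celery -A app worker --concurrency=8 --loglevel=info
    ```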

4. Verify: queue depth is decreasing.
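
    One way to watch the drain, assuming the default Redis broker and queue name:

    ```bash
    # Re-check the depth every 5 seconds; the number should trend down
    watch -n 5 'redis-cli llen celery'
    ```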

### For Task Serialization Errors

1. Check the task definitions.
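
    Serialization errors usually mean a task argument is not JSON-serializable, or the serializer settings disagree between producer and worker. One way to locate the relevant settings, assuming the code lives under `app/` (the path is an assumption):

    ```bash
    # Find where serializers are configured
    grep -rn "task_serializer\|result_serializer\|accept_content" app/
    ```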

2. Clear the stuck tasks from the queue.
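
    Assuming the app module is named `app`; this removes the undeliverable messages along with everything else queued:

    ```bash
    celery -A app purge -f
    ```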

3. Re-queue the tasks with proper serialization.
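
    A sketch of re-triggering an affected task from the command line; `tasks.generate_rota` is a hypothetical task name:

    ```bash
    # Invoke the task by name with JSON-serializable arguments
    celery -A app call tasks.generate_rota
    ```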

4. Verify: tasks are executing without errors.

### For Broker Connection Issues

1. Restart Redis.
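
    Assuming a systemd unit; the unit is named `redis` on some distributions and `redis-server` on others:

    ```bash
    sudo systemctl restart redis
    ```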

2. Check Redis memory. If `maxmemory` is reached, consider clearing old task results.
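
    A sketch of both checks; `celery-task-meta-*` is Celery's default key pattern for a Redis result backend:

    ```bash
    # Compare used memory against the configured limit
    redis-cli info memory | grep -E "used_memory_human|maxmemory_human"

    # Delete old task results (safe only for results you no longer need)
    redis-cli --scan --pattern 'celery-task-meta-*' | xargs -r redis-cli del
    ```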

3. Verify: workers can connect to the broker.
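
    Workers log the broker URL when they (re)connect; a heuristic check, assuming systemd-managed workers:

    ```bash
    # A successful reconnection shows a "Connected to redis://..." line
    journalctl -u celery -n 50 | grep -i "connected to"
    ```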

## Verification

After applying the fix, verify:

- [ ] Workers processing tasks
- [ ] Queue depth decreasing to normal
- [ ] No error logs in Celery output
- [ ] Background tasks completing successfully

## Prevention
- Set up monitoring for worker health
- Configure automatic worker restart on failure
- Set up alerts for queue depth
- Implement task result expiration (`result_expires`)
- Run regular Celery health checks
- Document task dependencies and requirements

## Escalation
- If unresolved after 1 hour, escalate to: DevOps Lead
- On-call contact: See on-call roster

## Related Issues
- Related runbooks: `calculation_failures.md`, `performance_degradation.md`
- Related services: Redis, Celery