# Runbook: Memory Issues
**Last Updated:** 2026-02-21
**Severity:** Critical
**Estimated TTR:** 30 minutes
**Owner:** DevOps Team
## Symptoms
- OOM (Out of Memory) errors
- High memory usage on server
- Processes getting killed
- Server becomes unresponsive
## Detection
- Alert: `HighMemoryUsage` (> 500MB for calculations)
- Dashboard: Server Metrics Dashboard
- Query: check system memory usage
## Diagnosis Steps
1. Check current memory usage:

   ```bash
   # Overall container resource usage
   docker stats --no-stream

   # Per-container memory breakdown
   docker compose ps
   docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"

   # Django/Gunicorn memory
   docker stats rota-cc-web-1 --no-stream

   # Celery worker memory
   docker stats rota-cc-celery_worker-1 --no-stream
   ```

   Expected: total memory < 80%, individual processes < 500MB.
   If different: memory leak or excessive memory usage.

2. Check for memory leaks by sampling memory repeatedly (e.g. `watch -n 60 'docker stats --no-stream'`):

   Expected: stable memory usage over time.
   If different: memory leak detected.

3. Check Django debug toolbar or profiling:

   Expected: memory usage is reasonable for the operation.
   If different: identify and optimize memory-intensive operations.

4. Check for connection pool exhaustion (on PostgreSQL: `SELECT count(*) FROM pg_stat_activity;`):

   Expected: < 100 connections.
   If different: the connection pool may be exhausted.
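The Django profiling check above can be sketched with Python's built-in `tracemalloc` module. The list comprehension below is a stand-in for whatever view or task you suspect, not code from this project:

```python
import tracemalloc

tracemalloc.start()

# Stand-in for a memory-intensive operation (e.g. building a large report)
rows = [{"id": i, "payload": "x" * 512} for i in range(10_000)]

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:5]:
    print(stat)  # per-line totals: file:line, size, allocation count

tracemalloc.stop()
```

The top entries point at the lines responsible for the most live allocations, which is usually enough to locate the memory-intensive operation.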
## Root Causes
| Cause | Likelihood | How to Confirm |
|---|---|---|
| Large result sets not paginated | High | Loading 100+ clinicians at once |
| Memory leak in long-running process | Medium | Memory increases over time |
| Connection pool too large | Medium | Many DB connections |
| Caching too much data | Low | Cache size grows unbounded |
| Celery task holding references | Medium | Tasks not releasing memory |
## Resolution Steps
### For Large Result Sets
1. Implement pagination so views load bounded pages rather than entire result sets.

2. Use `iterator()` for large querysets (e.g. `Model.objects.all().iterator(chunk_size=2000)`) so Django streams rows instead of caching the whole result in memory.

3. Verify: memory usage reduced.
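The pagination step above, as a minimal sketch in plain Python; in a Django view you would use `django.core.paginator.Paginator`, which applies the same slicing to a queryset. The `paginate` function and the `clinicians` list are illustrative, not project code:

```python
def paginate(items, page, per_page=50):
    """Return one bounded page of `items` (1-indexed) instead of the full list."""
    start = (page - 1) * per_page
    return items[start:start + per_page]

# Stand-in for a large queryset result
clinicians = [f"clinician-{i}" for i in range(250)]

print(len(paginate(clinicians, page=1)))  # 50: only one page is held in memory
```

Requesting a page past the end simply returns an empty list, so callers do not need a separate bounds check.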
### For Memory Leaks
1. Restart the affected services to reclaim memory immediately (e.g. `docker compose restart web celery_worker`).

2. Identify the leak source with a memory profiler such as Python's built-in `tracemalloc`.

3. Fix the leak; the usual culprits are circular references and unbounded caches.

4. Verify: memory usage stable over time.
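The leak-identification step can be done by comparing `tracemalloc` snapshots taken before and after the suspect workload. The growing `cache` list below simulates a leak:

```python
import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

# Simulated leak: a module-level list that grows without bound
cache = []
for _ in range(1_000):
    cache.append("x" * 1024)

leaked = tracemalloc.take_snapshot()
for stat in leaked.compare_to(baseline, "lineno")[:5]:
    print(stat)  # the largest positive size_diff points at the leak site
```

Run the comparison around one request or task cycle: allocations that should have been freed but show a positive `size_diff` are your leak candidates.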
### For Connection Pool Issues
1. Reduce the connection pool size (in Django, lower `CONN_MAX_AGE` in the `DATABASES` setting).

2. Close connections explicitly in long-running code (e.g. call `django.db.close_old_connections()` between batches).

3. Verify: connection count reduced.
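A minimal sketch of step 1, assuming Django settings with a PostgreSQL backend; the database name and the 60-second value are placeholders, not this project's actual configuration:

```python
# Django settings sketch. CONN_MAX_AGE is the number of seconds an idle
# connection is kept for reuse: 0 closes it after every request, None keeps
# it forever. A small positive value reuses connections without letting
# the pool grow unbounded.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "rota",           # placeholder
        "CONN_MAX_AGE": 60,       # seconds; keep this low under memory pressure
    }
}
```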
### For Celery Memory Issues
1. Configure task result expiration (Celery's `result_expires` setting).

2. Restart workers with limited concurrency, e.g. `celery -A <app> worker --concurrency=2 --max-tasks-per-child=100`.

3. Verify: worker memory usage reduced.
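The same limits can be set in configuration; a sketch assuming Django-namespaced Celery settings (i.e. `app.config_from_object("django.conf:settings", namespace="CELERY")`). The values are starting points, not this project's actual settings:

```python
# Celery configuration sketch (Django-style CELERY_ prefix assumed).
CELERY_RESULT_EXPIRES = 3600                   # seconds; purge task results after 1 hour
CELERY_WORKER_MAX_TASKS_PER_CHILD = 100        # recycle worker processes to release leaked memory
CELERY_WORKER_MAX_MEMORY_PER_CHILD = 500_000   # KiB; restart a child that exceeds ~500MB
```

`worker_max_tasks_per_child` and `worker_max_memory_per_child` both work by replacing worker processes, which returns any leaked memory to the OS even when the leak itself is not yet fixed.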
## Verification
After applying a fix, verify:

- [ ] Total memory usage < 80%
- [ ] No OOM errors in logs
- [ ] Memory usage stable over time
- [ ] Application responsive
## Prevention
- Set up memory monitoring and alerts
- Regular memory profiling in CI/CD
- Use pagination for all large datasets
- Implement connection pooling limits
- Regular service restarts (daily/weekly)
- Load testing to identify memory limits
## Escalation
- If OOM is causing downtime, escalate immediately to: DevOps Lead
- If unresolved after 1 hour, escalate to: Senior DevOps Engineer
- On-call contact: See on-call roster
## Related Issues

- Related runbooks: `performance_degradation.md`, `celery_queue_issues.md`
- Related alerts: `HighMemoryUsage`, `OomKiller`