# Runbook: Memory Issues
**Last Updated:** 2026-02-21
**Severity:** Critical
**Estimated TTR:** 30 minutes
**Owner:** DevOps Team
## Symptoms
- OOM (Out of Memory) errors
- High memory usage on server
- Processes getting killed
- Server becomes unresponsive
## Detection
- Alert: `HighMemoryUsage` (> 500MB for calculations)
- Dashboard: Server Metrics Dashboard
- Query: check system memory usage
## Diagnosis Steps
1. Check current memory usage:

   ```bash
   # Overall container resource usage
   docker stats --no-stream

   # Per-container memory breakdown
   docker compose ps
   docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"

   # Django/Gunicorn memory
   docker stats rota-cc-web-1 --no-stream

   # Celery worker memory
   docker stats rota-cc-celery_worker-1 --no-stream
   ```

   Expected: total memory < 80%, individual processes < 500MB.
   If different: memory leak or excessive memory usage.

2. Check for memory leaks by sampling memory repeatedly (e.g. `watch -n 60 'docker stats --no-stream'`):

   Expected: stable memory usage over time.
   If different: memory leak detected.

3. Check Django debug toolbar or profiling:

   Expected: memory usage is reasonable for the operation.
   If different: identify and optimize memory-intensive operations.

4. Check for connection pool exhaustion (on PostgreSQL: `SELECT count(*) FROM pg_stat_activity;`):

   Expected: < 100 connections.
   If different: the connection pool may be exhausted.
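The Django profiling check above can be sketched with Python's built-in `tracemalloc` module. The list comprehension below is a stand-in for whatever view or task you suspect, not code from this project:

```python
import tracemalloc

tracemalloc.start()

# Stand-in for a memory-intensive operation (e.g. building a large report)
rows = [{"id": i, "payload": "x" * 512} for i in range(10_000)]

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:5]:
    print(stat)  # per-line totals: file:line, size, allocation count

tracemalloc.stop()
```

The top entries point at the lines responsible for the most live allocations, which is usually enough to locate the memory-intensive operation.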
## Root Causes
| Cause | Likelihood | How to Confirm |
|---|---|---|
| Large result sets not paginated | High | Loading 100+ clinicians at once |
| Memory leak in long-running process | Medium | Memory increases over time |
| Connection pool too large | Medium | Many DB connections |
| Caching too much data | Low | Cache size grows unbounded |
| Celery task holding references | Medium | Tasks not releasing memory |
## Resolution Steps
### For Large Result Sets
1. Implement pagination so views load bounded pages rather than entire result sets.

2. Use `iterator()` for large querysets (e.g. `Model.objects.all().iterator(chunk_size=2000)`) so Django streams rows instead of caching the whole result in memory.

3. Verify: memory usage reduced.
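The pagination step above, as a minimal sketch in plain Python; in a Django view you would use `django.core.paginator.Paginator`, which applies the same slicing to a queryset. The `paginate` function and the `clinicians` list are illustrative, not project code:

```python
def paginate(items, page, per_page=50):
    """Return one bounded page of `items` (1-indexed) instead of the full list."""
    start = (page - 1) * per_page
    return items[start:start + per_page]

# Stand-in for a large queryset result
clinicians = [f"clinician-{i}" for i in range(250)]

print(len(paginate(clinicians, page=1)))  # 50: only one page is held in memory
```

Requesting a page past the end simply returns an empty list, so callers do not need a separate bounds check.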
### For Memory Leaks
1. Restart the affected services to reclaim memory immediately (e.g. `docker compose restart web celery_worker`).

2. Identify the leak source with a memory profiler such as Python's built-in `tracemalloc`.

3. Fix the leak; the usual culprits are circular references and unbounded caches.

4. Verify: memory usage stable over time.
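The leak-identification step can be done by comparing `tracemalloc` snapshots taken before and after the suspect workload. The growing `cache` list below simulates a leak:

```python
import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

# Simulated leak: a module-level list that grows without bound
cache = []
for _ in range(1_000):
    cache.append("x" * 1024)

leaked = tracemalloc.take_snapshot()
for stat in leaked.compare_to(baseline, "lineno")[:5]:
    print(stat)  # the largest positive size_diff points at the leak site
```

Run the comparison around one request or task cycle: allocations that should have been freed but show a positive `size_diff` are your leak candidates.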
### For Connection Pool Issues
1. Reduce the connection pool size (in Django, lower `CONN_MAX_AGE` in the `DATABASES` setting).

2. Close connections explicitly in long-running code (e.g. call `django.db.close_old_connections()` between batches).

3. Verify: connection count reduced.
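A minimal sketch of step 1, assuming Django settings with a PostgreSQL backend; the database name and the 60-second value are placeholders, not this project's actual configuration:

```python
# Django settings sketch. CONN_MAX_AGE is the number of seconds an idle
# connection is kept for reuse: 0 closes it after every request, None keeps
# it forever. A small positive value reuses connections without letting
# the pool grow unbounded.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "rota",           # placeholder
        "CONN_MAX_AGE": 60,       # seconds; keep this low under memory pressure
    }
}
```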
### For Celery Memory Issues
1. Configure task result expiration (Celery's `result_expires` setting).

2. Restart workers with limited concurrency, e.g. `celery -A <app> worker --concurrency=2 --max-tasks-per-child=100`.

3. Verify: worker memory usage reduced.
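The same limits can be set in configuration; a sketch assuming Django-namespaced Celery settings (i.e. `app.config_from_object("django.conf:settings", namespace="CELERY")`). The values are starting points, not this project's actual settings:

```python
# Celery configuration sketch (Django-style CELERY_ prefix assumed).
CELERY_RESULT_EXPIRES = 3600                   # seconds; purge task results after 1 hour
CELERY_WORKER_MAX_TASKS_PER_CHILD = 100        # recycle worker processes to release leaked memory
CELERY_WORKER_MAX_MEMORY_PER_CHILD = 500_000   # KiB; restart a child that exceeds ~500MB
```

`worker_max_tasks_per_child` and `worker_max_memory_per_child` both work by replacing worker processes, which returns any leaked memory to the OS even when the leak itself is not yet fixed.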
## Verification
After applying a fix, verify:

- [ ] Total memory usage < 80%
- [ ] No OOM errors in logs
- [ ] Memory usage stable over time
- [ ] Application responsive
## Prevention
- Set up memory monitoring and alerts
- Regular memory profiling in CI/CD
- Use pagination for all large datasets
- Implement connection pooling limits
- Regular service restarts (daily/weekly)
- Load testing to identify memory limits
## Escalation
- If OOM is causing downtime, escalate immediately to: DevOps Lead
- If unresolved after 1 hour, escalate to: Senior DevOps Engineer
- On-call contact: See on-call roster
## Related Issues

- Related runbooks: `performance_degradation.md`, `celery_queue_issues.md`
- Related alerts: `HighMemoryUsage`, `OomKiller`