Runbook: Memory Issues

Last Updated: 2026-02-21
Severity: Critical
Estimated TTR: 30 minutes
Owner: DevOps Team

Symptoms

  • OOM (Out of Memory) errors
  • High memory usage on server
  • Processes getting killed
  • Server becomes unresponsive

Detection

  • Alert: HighMemoryUsage (fires when a calculation process exceeds 500MB)
  • Dashboard: Server Metrics Dashboard
  • Query: Check system memory usage
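
A minimal shell version of that check, using the same docker tooling as the diagnosis steps below:

    # Host-level memory
    free -h
    
    # Per-container snapshot
    docker stats --no-stream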

Diagnosis Steps

  1. Check current memory usage:

    # Overall container resource usage
    docker stats --no-stream
    
    # List running containers, then show per-container memory
    docker compose ps
    docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"
    
    # Django/Gunicorn memory
    docker stats rota-cc-web-1 --no-stream
    
    # Celery worker memory
    docker stats rota-cc-celery_worker-1 --no-stream
    
    Expected: Total memory < 80%, individual processes < 500MB
    If different: Memory leak or excessive memory usage

  2. Check for memory leaks:

    # Sample memory every 60 seconds to see the trend (--no-stream alone
    # prints a single snapshot, so wrap it in watch)
    watch -n 60 "docker stats --no-stream --format '{{.Name}}: {{.MemUsage}}' | grep rota-cc"
    
    Expected: Stable memory usage between samples
    If different: Memory leak detected

  3. Profile memory-heavy code paths (Django Debug Toolbar, or memory_profiler as below):

    # Use memory_profiler to identify heavy functions
    from memory_profiler import profile
    
    @profile
    def calculate_for_many_clinicians(clinicians):
        # Calculation code
        pass
    
    Expected: Memory usage reasonable for the operation
    If different: Identify and optimize memory-intensive operations
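
    To get the per-line report, run the decorated code under the profiler
    (the script path below is a placeholder):
    
    # Prints line-by-line memory deltas for @profile-decorated functions
    # (replace path/to/script.py with the module under test)
    docker compose exec web python -m memory_profiler path/to/script.py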

  4. Check for connection pool exhaustion:

    # Check database connections
    docker compose exec postgres psql -U rota rota_cc -c "SELECT count(*) FROM pg_stat_activity;"
    
    Expected: < 100 connections
    If different: Connection pool may be exhausted
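
    If the count is high, a follow-up query (same connection details as
    above) shows where the connections come from:
    
    # Break connections down by state and client application
    docker compose exec postgres psql -U rota rota_cc -c \
      "SELECT state, application_name, count(*) FROM pg_stat_activity GROUP BY 1, 2 ORDER BY 3 DESC;"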

Root Causes

Cause                                 Likelihood   How to Confirm
Large result sets not paginated       High         Loading 100+ clinicians at once
Memory leak in long-running process   Medium       Memory increases over time
Connection pool too large             Medium       Many DB connections
Caching too much data                 Low          Cache size grows unbounded
Celery task holding references        Medium       Tasks not releasing memory
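
To confirm the Celery row, worker stats include per-process resource usage (look at rusage "maxrss"); a quick check, assuming the celery_worker service and rota app names used elsewhere in this runbook:

    docker compose exec celery_worker celery -A rota inspect stats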

Resolution Steps

For Large Result Sets

  1. Implement pagination:

    # Anti-pattern: list() forces the entire result set into memory
    # clinicians = list(Clinician.objects.all())
    
    # Instead, page through the queryset 25 rows at a time
    from django.core.paginator import Paginator
    
    paginator = Paginator(Clinician.objects.all(), 25)
    for page_number in paginator.page_range:
        process_clinicians(paginator.get_page(page_number))
    

  2. Use iterator() for large querysets:

    # iterator() streams rows in chunks instead of caching the whole
    # queryset in memory
    for clinician in Clinician.objects.iterator(chunk_size=500):
        process_clinician(clinician)
    

  3. Verify: Memory usage reduced

For Memory Leaks

  1. Restart affected services:

    # Restart web application
    docker compose restart web
    
    # Restart Celery workers
    docker compose restart celery_worker
    

  2. Identify leak source:

    # Use tracemalloc to track allocations
    import tracemalloc
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    
    # Run the suspected leaking code
    calculate_for_many_clinicians(clinicians)
    
    # Compare snapshots: the biggest growth points at the leak
    after = tracemalloc.take_snapshot()
    for stat in after.compare_to(before, 'lineno')[:10]:
        print(stat)
    

  3. Fix the leak (usually circular references or unbounded caches; see the bounded-cache sketch below):

    import gc
    
    # Explicitly drop large intermediate objects and force a collection
    result = calculate()
    process(result)
    del result
    gc.collect()
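
    For the unbounded-cache case, a minimal sketch of bounding an
    in-process cache with functools.lru_cache (clinician_rate is a
    hypothetical example function, not part of this codebase):
    
    import functools
    
    # A bare dict used as a cache never evicts, so it grows with every
    # distinct key. lru_cache evicts least-recently-used entries once
    # maxsize is reached, keeping memory flat.
    @functools.lru_cache(maxsize=1024)
    def clinician_rate(clinician_id):
        # Placeholder for an expensive per-clinician calculation
        return clinician_id * 0.75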
    

  4. Verify: Memory usage stable over time

For Connection Pool Issues

  1. Reduce connection pool size:

    # In settings.py
    # Stock Django has no MAX_CONNS option; it reuses a connection for up
    # to CONN_MAX_AGE seconds. Lower it so idle connections close sooner.
    # (MAX_CONNS belongs to pooling backends such as django-db-geventpool;
    # cap it there only if such a backend is installed.)
    DATABASES = {
        'default': {
            'CONN_MAX_AGE': 60,  # was 600; recycle idle connections faster
        }
    }
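
    If the project is on Django 5.1+ with the psycopg 3 driver, its
    built-in connection pool can be capped directly instead (a sketch,
    assuming that driver; not available on older versions):
    
    # settings.py (Django 5.1+, psycopg 3 backend only)
    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.postgresql',
            # ... credentials ...
            'CONN_MAX_AGE': 0,  # pooling requires persistent connections off
            'OPTIONS': {
                'pool': {'min_size': 2, 'max_size': 20},  # cap pool size
            },
        }
    }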
    

  2. Close connections explicitly:

    from django.db import connections
    
    def process_batch():
        # ... do work ...
        # Release this thread's database connections once the batch is done
        connections.close_all()
    

  3. Verify: Connection count reduced

For Celery Memory Issues

  1. Configure task result expiration:

    # In celery.py
    app.conf.update(
        result_expires=3600,  # Expire results after 1 hour
        task_acks_late=True,  # Ack after task completes
        worker_prefetch_multiplier=1,  # Reduce prefetch
    )
    

  2. Restart workers with limited concurrency:

    # In docker-compose.yml, set the celery_worker command to:
    #   celery -A rota worker --concurrency=2 --max-tasks-per-child=100
    # --max-tasks-per-child recycles workers so leaked memory is reclaimed.
    # Then apply:
    docker compose up -d celery_worker
    

  3. Verify: Worker memory usage reduced

Verification

After applying a fix, verify:

  - [ ] Total memory usage < 80%
  - [ ] No OOM errors in logs
  - [ ] Memory usage stable over time
  - [ ] Application responsive
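
A quick shell spot-check for the first two boxes (service names as used earlier in this runbook):

    # Per-container memory percentages
    docker stats --no-stream --format "table {{.Name}}\t{{.MemPerc}}"
    
    # OOM messages in recent application and kernel logs
    docker compose logs --since 1h | grep -iE "out of memory|oom" || echo "no OOM messages in app logs"
    dmesg | grep -i oom || echo "no kernel OOM events"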

Prevention

  • Set up memory monitoring and alerts
  • Regular memory profiling in CI/CD
  • Use pagination for all large datasets
  • Implement connection pooling limits
  • Regular service restarts (daily/weekly; see the cron sketch below)
  • Load testing to identify memory limits
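
For the scheduled-restart item, an illustrative cron entry (the schedule and project path are assumptions; adjust for this deployment):

    # Weekly off-peak restart of the Celery workers (Sunday 04:00;
    # the /opt/rota-cc path is assumed)
    0 4 * * 0  cd /opt/rota-cc && docker compose restart celery_worker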

Escalation

  • If OOM causing downtime, escalate immediately: DevOps Lead
  • If unresolved after 1 hour, escalate to: Senior DevOps Engineer
  • On-call contact: See on-call roster
  • Related runbooks: performance_degradation.md, celery_queue_issues.md
  • Related alerts: HighMemoryUsage, OomKiller