Runbook: Performance Degradation

Last Updated: 2026-02-21
Severity: Medium
Estimated TTR: 1 hour
Owner: Development Team

Symptoms

  • Rota generation takes longer than expected
  • API response times increased
  • Users report slow page loads
  • High CPU or memory usage on server

Detection

  • Alert: SlowCalculationPerformance (p95 > 5s)
  • Dashboard: Application Performance Dashboard
  • Query: Check APM tools (Datadog/NewRelic) for increased latency
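
A quick way to sanity-check the p95 figure outside the APM tooling is to compute it directly from a sample of request latencies. This is a minimal nearest-rank sketch; the sample values are made up, and the 5s threshold mirrors the SlowCalculationPerformance alert:

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (seconds)."""
    ordered = sorted(samples)
    # Nearest rank: smallest value with at least pct% of samples at or below it
    rank = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[rank]

# Illustrative latencies pulled from request logs (seconds)
latencies = [0.8, 1.2, 0.9, 6.1, 1.0, 1.1, 0.7, 5.4, 1.3, 0.9]
p95 = percentile(latencies, 95)
if p95 > 5.0:  # same threshold as the alert
    print(f"p95={p95:.1f}s exceeds the 5s threshold")
```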

Diagnosis Steps

  1. Check current response times:

    # Check recent request logs
    docker compose logs web -f --tail=100 | grep "calculation"
    
    # Use Django Debug Toolbar or APM to identify slow queries
    docker compose exec web python manage.py shell
    >>> from django.db import connection
    >>> from django.db import reset_queries
    >>> from calculations.date_utils import get_bank_holidays_in_year
    >>> reset_queries()
    >>> get_bank_holidays_in_year(2024)
    >>> len(connection.queries)
    
    Expected: < 20 queries per calculation, p95 < 5s
    If different: May have N+1 query issues or missing indexes

  2. Check server resource usage:

    # CPU and memory
    docker stats --no-stream
    
    # Database connections
    docker compose exec postgres psql -U rota rota_cc -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"
    
    Expected: CPU < 70%, Memory < 80%, < 100 DB connections
    If different: Resource exhaustion may be causing slowdowns

  3. Run performance benchmarks:

    # Run benchmarks to compare against baseline
    docker compose exec web pytest calculations/tests/benchmarks/ --benchmark-only --benchmark-compare=<baseline_file>
    
    Expected: Within 10% of baseline
    If different: Performance regression detected
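
pytest-benchmark can enforce this tolerance automatically via `--benchmark-compare-fail=mean:10%`; the check it performs amounts to the following (sketch, with illustrative timings):

```python
def is_regression(baseline_mean, current_mean, tolerance=0.10):
    """True if the current mean is more than `tolerance` slower than the baseline."""
    return current_mean > baseline_mean * (1 + tolerance)

# Illustrative numbers: a 2.0s baseline allows up to 2.2s before failing
assert is_regression(2.0, 2.5)       # 25% slower -> regression
assert not is_regression(2.0, 2.1)   # 5% slower -> within tolerance
```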

  4. Check for long-running queries:

    # PostgreSQL slow query log
    docker compose logs postgres -f --tail=100
    
    # Or check pg_stat_statements (note: on PostgreSQL 13+ the columns are
    # total_exec_time and mean_exec_time instead of total_time and mean_time)
    docker compose exec postgres psql -U rota rota_cc -c "SELECT query, calls, total_time, mean_time FROM pg_stat_statements ORDER BY mean_time DESC LIMIT 10;"
    
    Expected: No queries taking > 1s
    If different: May need query optimization or indexing

Root Causes

Cause | Likelihood | How to Confirm
--- | --- | ---
Large date range calculations | High | Check if users are requesting 3+ year ranges
High clinician count | High | Check if calculations cover 100+ clinicians
Missing database indexes | Medium | Check query execution plans
N+1 query problems | High | Review code for loops that issue per-row queries
Cache not working | Medium | Check cache hit rates

Resolution Steps

For Large Date Range Calculations

  1. Check current cache configuration:

    docker compose exec redis redis-cli INFO stats | grep keyspace
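
    `keyspace_hits` and `keyspace_misses` in the INFO stats section give the cache hit rate directly. A small helper to compute it from the raw output (a sketch; the sample string stands in for real redis-cli output):

```python
def cache_hit_rate(info_text):
    """Compute hit rate from `redis-cli INFO stats` output: hits / (hits + misses)."""
    stats = {}
    for line in info_text.splitlines():
        if ":" in line and not line.startswith("#"):
            key, _, value = line.partition(":")
            stats[key] = value.strip()
    hits = int(stats.get("keyspace_hits", 0))
    misses = int(stats.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else 0.0

# Illustrative INFO output fragment
sample = "keyspace_hits:9500\r\nkeyspace_misses:500\r\n"
rate = cache_hit_rate(sample)  # 0.95 -> cache is working well
```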
    

  2. Implement or optimize result caching:

    # Ensure calculations use caching
    from django.core.cache import cache

    def calculate_with_cache(clinician_id, start_date, end_date):
        cache_key = f"calc_{clinician_id}_{start_date}_{end_date}"
        result = cache.get(cache_key)
        if result is None:  # "if not result" would recompute falsy-but-valid results
            result = perform_calculation(clinician_id, start_date, end_date)
            cache.set(cache_key, result, timeout=3600)
        return result
    

  3. Verify: Response time improved to acceptable levels

For High Clinician Count

  1. Implement batch processing:

    # Process clinicians in fixed-size batches
    def batch_process_clinicians(clinicians, batch_size=25):
        clinicians = list(clinicians)
        for i in range(0, len(clinicians), batch_size):
            batch = clinicians[i:i + batch_size]
            process_batch(batch)
    

  2. Use Celery for async processing:

    # Offload large calculations to background tasks
    from tasks.celery_app import app
    
    @app.task
    def calculate_clinicians_async(clinician_ids, start_date, end_date):
        for clinician_id in clinician_ids:
            calculate_for_clinician(clinician_id, start_date, end_date)
    

  3. Verify: Large calculations complete within acceptable time

For Missing Database Indexes

  1. Identify slow queries and add indexes:

    # Create migration
    docker compose exec web python manage.py makemigrations --empty your_app
    
    # In the migration file (CREATE INDEX CONCURRENTLY cannot run inside a
    # transaction, so the migration must be marked non-atomic):
    from django.db import migrations

    class Migration(migrations.Migration):
        atomic = False
        dependencies = [('your_app', 'previous_migration')]

        operations = [
            migrations.RunSQL(
                "CREATE INDEX CONCURRENTLY idx_clinician_shift_dates ON config_shift(clinician_id, date);",
                reverse_sql="DROP INDEX CONCURRENTLY IF EXISTS idx_clinician_shift_dates;",
            ),
        ]
    

  2. Verify: Query times improved

For N+1 Query Problems

  1. Identify problematic code:

    # Look for patterns like:
    for clinician in clinicians:
        shifts = clinician.shift_set.all()  # N+1!
    

  2. Fix with select_related/prefetch_related:

    # Use prefetch_related for reverse FK
    clinicians = Clinician.objects.prefetch_related('shift_set').all()
    
    # Or use bulk queries
    shifts = Shift.objects.filter(clinician__in=clinicians)
    

  3. Verify: Reduced query count

Verification

After applying fix, verify:

  - [ ] p95 response time < 5s
  - [ ] No performance regression in benchmarks
  - [ ] CPU and memory usage normal
  - [ ] User complaints stopped

Prevention

  • Set up continuous benchmarking in CI/CD
  • Implement query review in code review process
  • Use Django Debug Toolbar in development
  • Regular performance testing before releases
  • Monitor and alert on query performance metrics

Escalation

  • If unresolved after 2 hours, escalate to: Tech Lead
  • On-call contact: See on-call roster
  • Related runbooks: calculation_failures.md, celery_queue_issues.md
  • Related alerts: SlowCalculationPerformance, HighDatabaseQueryDuration