Runbook: Calculation Failures

Last Updated: 2026-02-21
Severity: High
Estimated TTR: 30 minutes
Owner: Development Team

Symptoms

  • Rota generation fails or hangs
  • Calculation API returns 500 errors
  • Users report incorrect or missing shift calculations
  • Celery task failures in logs

Detection

  • Alert: HighCalculationErrorRate (error rate > 0.1/min)
  • Dashboard: Calculations Metrics Dashboard
  • Query: Check Sentry for calculation errors
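
The alert threshold is expressed as a rate, so it can help to see how raw error counts map onto it. The sketch below is only an illustration: the function name, the 10-minute window, and the sample timestamps are assumptions, and the real alert rule lives in the monitoring stack rather than in this code.

```python
from datetime import datetime, timedelta

THRESHOLD_PER_MIN = 0.1  # threshold quoted for HighCalculationErrorRate


def error_rate_per_minute(error_times, now, window_minutes=10):
    """Return errors per minute over the trailing window.

    The window size is an assumption; the alert's real window may differ.
    """
    window_start = now - timedelta(minutes=window_minutes)
    recent = [t for t in error_times if window_start < t <= now]
    return len(recent) / window_minutes


now = datetime(2026, 2, 21, 12, 0)
# Hypothetical sample: two errors inside the window, one well outside it.
errors = [now - timedelta(minutes=m) for m in (1, 4, 30)]
rate = error_rate_per_minute(errors, now)
alert_firing = rate > THRESHOLD_PER_MIN
print(f"{rate:.2f} errors/min, alert firing: {alert_firing}")
```

With a 10-minute window, a single error per window sits exactly at 0.1/min and does not trip a strict "greater than" comparison; two errors do.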

Diagnosis Steps

  1. Check recent calculation errors:

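    # Check Django logs from the last 15 minutes
    # (-f is omitted so the command exits instead of streaming)
    docker compose logs web --since=15m | grep ERROR
    
    # Check Celery worker logs from the last 15 minutes
    docker compose logs celery_worker --since=15m | grep ERROR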
    
    # Check Sentry for recent errors
    # Navigate to Sentry project and filter by "calculation"
    
    Expected: No new errors in the last 15 minutes
    If different: Note the error type and frequency for investigation

  2. Verify Celery workers are running:

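    # Check worker status (tasks currently executing)
    docker compose exec web celery -A rota inspect active
    
    # List the queues each worker consumes from (this does not report depth)
    docker compose exec web celery -A rota inspect active_queues
    
    # Queue depth, assuming Redis is the broker and the default "celery" queue is used
    docker compose exec redis redis-cli llen celery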
    
    Expected: Workers are active and responding
    If different: Workers may be down or overwhelmed

  3. Check database connectivity:

    # Test database connection
    docker compose exec web python manage.py dbshell
    
    Expected: Database shell connects successfully
    If different: Database may be down or unreachable over the network

  4. Verify cache status:

    # Check Redis/cache connectivity
    docker compose exec redis redis-cli ping
    
    Expected: Response "PONG"
    If different: Cache may be down, which affects calculations
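
The non-interactive checks above can be run in one pass. The following is a minimal sketch under stated assumptions, not a tested tool: the service names (`web`, `redis`, `postgres`) and the Celery app name `rota` are taken from commands elsewhere in this runbook, `celery inspect ping` stands in for `inspect active`, and `pg_isready` replaces the interactive `dbshell` check on the assumption that the postgres container ships it.

```python
import subprocess

# name -> (command, text expected in stdout; "" means "exit code only").
# -T disables TTY allocation so the commands work from scripts.
CHECKS = {
    "celery_workers": (
        ["docker", "compose", "exec", "-T", "web",
         "celery", "-A", "rota", "inspect", "ping"],
        "",
    ),
    "database": (
        ["docker", "compose", "exec", "-T", "postgres", "pg_isready"],
        "accepting connections",
    ),
    "cache": (
        ["docker", "compose", "exec", "-T", "redis", "redis-cli", "ping"],
        "PONG",
    ),
}


def run_checks(run=subprocess.run, checks=CHECKS):
    """Run each check and return {name: (ok, output)}.

    `run` is injectable so the function can be exercised without Docker.
    """
    results = {}
    for name, (cmd, expected) in checks.items():
        try:
            proc = run(cmd, capture_output=True, text=True, timeout=30)
            output = (proc.stdout or "").strip()
            ok = proc.returncode == 0 and expected in output
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Docker missing, command not found, or the check hung.
            output, ok = str(exc), False
        results[name] = (ok, output)
    return results
```

Calling `run_checks()` with no arguments executes the real commands; any entry whose first element is `False` points at the diagnosis step to revisit by hand.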

Root Causes

| Cause | Likelihood | How to Confirm |
|---|---|---|
| Invalid input data | High | Check logs for ValidationError exceptions |
| Database connection issues | Medium | Check database connectivity and query performance |
| Missing clinician configuration | High | Look for "No working term found" errors |
| Cache corruption | Low | Clear cache and retry calculations |
| Celery worker exhaustion | Medium | Check worker queue depth and restart workers |
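
Two of the causes above have a log signature named in this runbook, so a first triage pass over captured log lines can be automated. The sketch below is illustrative only: the sample lines are hypothetical, the real log format is not specified here, and the other causes have no fixed signature, so they fall into "Unclassified".

```python
import re
from collections import Counter

# Signatures come from the "How to Confirm" column. The more specific
# message is checked first so it wins if both appear on one line.
SIGNATURES = [
    ("Missing clinician configuration", re.compile(r"No working term found")),
    ("Invalid input data", re.compile(r"ValidationError")),
]


def classify(lines):
    """Count log lines per suspected root cause."""
    counts = Counter()
    for line in lines:
        for cause, pattern in SIGNATURES:
            if pattern.search(line):
                counts[cause] += 1
                break
        else:
            counts["Unclassified"] += 1
    return counts


# Hypothetical log lines, for illustration only.
sample = [
    "ERROR ValidationError: percentage must be between 0 and 100",
    "ERROR No working term found for clinician",
    "ERROR No working term found for clinician",
    "ERROR connection refused",
]
counts = classify(sample)
for cause, n in counts.most_common():
    print(f"{n:3d}  {cause}")
```

Feeding it the output of the log commands from the diagnosis steps gives a quick frequency ranking to compare against the likelihood column.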

Resolution Steps

For Invalid Input Data

  1. Identify the problematic data:

    # Check logs for specific validation errors
    docker compose logs web --tail=200 | grep "ValidationError"
    

  2. Fix the data through Django admin or API:

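    # Example: Fix missing working term (run inside `python manage.py shell`)
    from datetime import date
    # WorkingTermType is assumed to live in config.models; adjust if it is defined elsewhere
    from config.models import Clinician, WorkingTerm, WorkingTermType
    clinician = Clinician.objects.get(id=<id>)  # replace <id> with the affected clinician's ID
    WorkingTerm.objects.create(
        clinician=clinician,
        type=WorkingTermType.PARTNER.value,
        start_date=date(2024, 1, 1),
        percentage=100.0,
    )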
    

  3. Verify fix: Re-run calculation for affected clinician

For Database Connection Issues

  1. Check database status:

    # PostgreSQL
    docker compose ps postgres
    
    # Check connection count
    docker compose exec postgres psql -U rota rota_cc -c "SELECT count(*) FROM pg_stat_activity;"
    

  2. If connection pool exhausted:

    # Restart Django application
    docker compose restart web
    
    # Or restart Celery workers
    docker compose restart celery_worker
    

  3. Verify: Check application health endpoint
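
The connection count from `pg_stat_activity` only means something relative to the server's limit, which `SHOW max_connections;` reports (PostgreSQL's default is 100). A minimal sketch of that comparison follows; the function name and the 80% warning ratio are assumptions for illustration, not values from this runbook.

```python
def connection_usage(active, max_connections, warn_ratio=0.8):
    """Return (ratio, near_limit) for a connection count.

    warn_ratio is an illustrative threshold; tune it to the deployment.
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    ratio = active / max_connections
    return ratio, ratio >= warn_ratio


# Hypothetical reading: 92 sessions against a limit of 100.
ratio, near_limit = connection_usage(active=92, max_connections=100)
print(f"{ratio:.0%} of connections in use, near limit: {near_limit}")
```

A count that stays near the limit after restarting `web` and `celery_worker` suggests the connections are held by something else, which is worth noting before escalating.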

For Celery Worker Issues

  1. Restart Celery workers:

    docker compose restart celery_worker
    

  2. Clear stuck tasks:

    # Purge ALL waiting tasks from the "celery" queue, not only stuck ones (irreversible; use caution)
    docker compose exec web celery -A rota purge -Q celery
    

  3. Verify: Check worker is processing tasks
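
To confirm that workers are processing tasks from Python rather than the CLI, Celery's `app.control.inspect().active()` returns a mapping of worker name to its currently executing tasks, or `None` when no worker replies. The sketch below summarises such a reply. The helper name is an assumption, and the sample reply, including the task name, is hypothetical.

```python
def summarize_active(active):
    """Summarise the mapping returned by app.control.inspect().active().

    `active` is {worker_name: [task_dict, ...]} or None if nothing replied.
    """
    if not active:
        return {"workers": 0, "tasks": 0, "idle_workers": []}
    return {
        "workers": len(active),
        "tasks": sum(len(tasks) for tasks in active.values()),
        "idle_workers": sorted(
            name for name, tasks in active.items() if not tasks
        ),
    }


# Hypothetical reply from two workers, one busy and one idle.
reply = {
    "celery@worker1": [{"id": "a1", "name": "calculate_rota"}],
    "celery@worker2": [],
}
summary = summarize_active(reply)
print(summary)
```

Zero workers means nothing replied and the restart did not take; idle workers with an empty queue are normal, while idle workers with a growing queue are not.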

For Cache Issues

  1. Clear application cache:

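    # Deletes every key in the selected Redis database (db 0 by default), not only calculation cache entries
    docker compose exec redis redis-cli FLUSHDB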
    

  2. Restart cache-dependent services:

    docker compose restart celery_worker
    

  3. Verify: Re-run calculation and check results

Verification

After applying the fix, verify:

  - [ ] No new errors in logs for 15 minutes
  - [ ] Test calculation completes successfully
  - [ ] Sentry error rate returns to baseline
  - [ ] Celery queue depth returns to normal
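
The 15-minute quiet period restarts whenever a new error appears after the fix. A minimal sketch of that rule is below; the function name and the sample timestamps are assumptions, and error times would come from the logs or Sentry.

```python
from datetime import datetime, timedelta


def quiet_window_elapsed(fix_time, error_times, now, minutes=15):
    """True once `minutes` have passed since the fix or the last later error.

    Errors logged before the fix are ignored.
    """
    last_event = max([fix_time] + [t for t in error_times if t > fix_time])
    return now - last_event >= timedelta(minutes=minutes)


fix_time = datetime(2026, 2, 21, 12, 0)
# Hypothetical errors: one before the fix, one five minutes after it.
errors = [datetime(2026, 2, 21, 11, 50), datetime(2026, 2, 21, 12, 5)]

too_early = quiet_window_elapsed(fix_time, errors, datetime(2026, 2, 21, 12, 18))
long_enough = quiet_window_elapsed(fix_time, errors, datetime(2026, 2, 21, 12, 20))
print(too_early, long_enough)
```

Here the post-fix error at 12:05 resets the clock, so the first checklist item cannot be ticked before 12:20.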

Prevention

  • Add input validation at API boundaries
  • Implement circuit breakers for external dependencies
  • Set up monitoring for cache hit rates
  • Regular database maintenance (vacuum, analyze)
  • Implement rate limiting for calculation requests
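
For the cache hit rate item, Redis already exposes the raw counters as `keyspace_hits` and `keyspace_misses` in the output of `redis-cli INFO stats`. The sketch below derives a hit rate from that text. The helper name and sample values are assumptions, and the counters cover the whole Redis instance since its last restart or `CONFIG RESETSTAT`, not only calculation cache keys.

```python
def cache_hit_rate(info_text):
    """Compute hit rate from `redis-cli INFO stats` output.

    Returns None when no lookups have been recorded yet.
    """
    stats = {}
    for line in info_text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Stats".
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        stats[key] = value
    hits = int(stats.get("keyspace_hits", 0))
    misses = int(stats.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else None


# Hypothetical excerpt of INFO stats output (Redis uses CRLF line endings).
sample = "# Stats\r\nkeyspace_hits:900\r\nkeyspace_misses:100\r\n"
hit_rate = cache_hit_rate(sample)
print(f"hit rate: {hit_rate:.0%}")
```

Because the counters are cumulative, a monitoring job would normally sample them periodically and compute the rate from the difference between samples.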

Escalation

  • If unresolved after 1 hour, escalate to: Tech Lead
  • On-call contact: See on-call roster
  • Related runbooks: performance_degradation.md, data_inconsistency.md
  • Related alerts: HighCalculationErrorRate, SlowCalculationPerformance