Runbook: Calculation Failures

Last Updated: 2026-02-21
Severity: High
Estimated TTR: 30 minutes
Owner: Development Team

Symptoms

Rota generation fails or hangs
Calculation API returns 500 errors
Users report incorrect or missing shift calculations
Celery task failures in logs

Detection

Alert: HighCalculationErrorRate (error rate > 0.1/min)
Dashboard: Calculations Metrics Dashboard
Query: Check Sentry for calculation errors
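
To make the alert threshold concrete, here is a minimal sketch of the arithmetic behind a per-minute error rate. The 0.1/min threshold comes from the alert above; the 30-minute trailing window and the function names are assumptions for illustration, since the alert's actual evaluation window is not stated here.

```python
from datetime import datetime, timedelta

# Threshold taken from the HighCalculationErrorRate alert definition.
ALERT_THRESHOLD_PER_MIN = 0.1


def error_rate_per_minute(error_times, now, window_minutes=30):
    """Return errors per minute over the trailing window ending at `now`.

    The 30-minute default is an assumed window, not the alert's real one.
    """
    window_start = now - timedelta(minutes=window_minutes)
    recent = [t for t in error_times if window_start < t <= now]
    return len(recent) / window_minutes


def alert_would_fire(error_times, now, window_minutes=30):
    """True when the trailing error rate exceeds the alert threshold."""
    rate = error_rate_per_minute(error_times, now, window_minutes)
    return rate > ALERT_THRESHOLD_PER_MIN
```

With these assumptions, four errors in 30 minutes (about 0.13/min) would cross the threshold, while two errors (about 0.07/min) would not.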

Diagnosis Steps

Check recent calculation errors:

# Check Django logs
docker compose logs web -f --tail=100 | grep ERROR

# Check Celery worker logs
docker compose logs celery_worker -f --tail=100 | grep ERROR

# Check Sentry for recent errors
# Navigate to Sentry project and filter by "calculation"

Expected: No new errors in the last 15 minutes
If different: Note the error type and frequency for investigation
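
Noting "error type and frequency" can be done by hand, but a small tally helps when the logs are noisy. The sketch below counts exception names on lines containing `ERROR`. It assumes exception names follow the usual Python convention of ending in `Error` or `Exception`; the helper name `summarize_errors` is hypothetical and the real log format may differ.

```python
import re
from collections import Counter

# Matches Python-style exception names such as ValidationError or TimeoutException.
EXCEPTION_NAME = re.compile(r"\b([A-Z]\w*(?:Error|Exception))\b")


def summarize_errors(log_lines):
    """Count exception names on lines that contain the ERROR marker."""
    counts = Counter()
    for line in log_lines:
        if "ERROR" not in line:
            continue  # ignore INFO/DEBUG/WARNING lines
        names = EXCEPTION_NAME.findall(line)
        if names:
            counts.update(names)
        else:
            # ERROR line without a recognizable exception name
            counts["(unclassified)"] += 1
    return counts
```

Feed it lines captured from the log commands above; `counts.most_common()` then lists the most frequent error types first.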

Verify Celery workers are running:

# Check which tasks each worker is currently executing
docker compose exec web celery -A rota inspect active

# List the queues each worker consumes from (this does not report queue depth)
docker compose exec web celery -A rota inspect active_queues

Expected: Workers are active and responding
If different: Workers may be down or overwhelmed
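
The same check can be scripted. In Celery, `app.control.inspect().active()` returns a mapping of worker name to the tasks it is executing, or `None` when no worker replies before the timeout. The sketch below only interprets such a result; the helper name `summarize_workers` is hypothetical, and how you obtain the Celery `app` object depends on the project layout.

```python
def summarize_workers(active):
    """Summarize the mapping returned by Celery's inspect().active().

    `active` maps worker name -> list of currently executing tasks, or is
    None when no worker replied before the timeout.
    """
    if not active:
        # No replies at all: workers are down or unreachable.
        return {"responding": 0, "busy": [], "idle": [], "running_tasks": 0}
    busy = sorted(name for name, tasks in active.items() if tasks)
    idle = sorted(name for name, tasks in active.items() if not tasks)
    return {
        "responding": len(active),
        "busy": busy,
        "idle": idle,
        "running_tasks": sum(len(tasks) for tasks in active.values()),
    }
```

A result with `responding == 0` points at workers being down, whereas every worker appearing under `busy` suggests they may be overwhelmed.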

Check database connectivity:

```
# Test database connection
docker compose exec web python manage.py dbshell
```

Expected: Database shell connects successfully
If different: Database may be down or network issues

Verify cache status:

```
# Check Redis/cache connectivity
docker compose exec redis redis-cli ping
```

Expected: Response "PONG"
If different: Cache may be down affecting calculations

Root Causes

| Cause | Likelihood | How to Confirm |
|---|---|---|
| Invalid input data | High | Check logs for ValidationError exceptions |
| Database connection issues | Medium | Check database connectivity and query performance |
| Missing clinician configuration | High | Look for "No working term found" errors |
| Cache corruption | Low | Clear cache and retry calculations |
| Celery worker exhaustion | Medium | Check worker queue depth and restart workers |
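
The log-based rows of this table can be turned into a quick triage helper. The sketch below maps a log line to a probable cause. The `ValidationError` and "No working term found" patterns come from the table; the `OperationalError` pattern is an assumption (it is the exception Django typically raises on database connection failures), and the function name is hypothetical.

```python
# Ordered (pattern, cause) pairs; the first matching pattern wins.
CAUSE_PATTERNS = [
    ("No working term found", "Missing clinician configuration"),
    ("ValidationError", "Invalid input data"),
    # Assumed pattern: Django usually reports connection failures this way.
    ("OperationalError", "Database connection issues"),
]


def probable_cause(log_line):
    """Return the probable root cause for a log line, or None if unknown."""
    for pattern, cause in CAUSE_PATTERNS:
        if pattern in log_line:
            return cause
    return None
```

Cache corruption and worker exhaustion are not covered because the table confirms them through checks rather than log messages.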

Resolution Steps

For Invalid Input Data

Identify the problematic data:

# Check logs for specific validation errors
docker compose logs web --tail=200 | grep "ValidationError"

Fix the data through Django admin or API:

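# Example: Fix missing working term
# Run inside the Django shell: docker compose exec web python manage.py shell
from datetime import date

# WorkingTermType is assumed to be importable from config.models; adjust if it lives elsewhere
from config.models import Clinician, WorkingTerm, WorkingTermType

clinician = Clinician.objects.get(id=<id>)  # replace <id> with the affected clinician's ID
WorkingTerm.objects.create(
    clinician=clinician,
    type=WorkingTermType.PARTNER.value,
    start_date=date(2024, 1, 1),
    percentage=100.0,
)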

Verify fix: Re-run calculation for affected clinician

For Database Connection Issues

Check database status:

# PostgreSQL
docker compose ps postgres

# Check connection count
docker compose exec postgres psql -U rota rota_cc -c "SELECT count(*) FROM pg_stat_activity;"

If connection pool exhausted:

# Restart Django application
docker compose restart web

# Or restart Celery workers
docker compose restart celery_worker

Verify: Check application health endpoint
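
To judge whether the connection count from `pg_stat_activity` indicates exhaustion, compare it with the server's `max_connections`. The sketch below parses psql's default tabular output and applies a threshold. The default of 100 is PostgreSQL's stock `max_connections`; the 80% ratio and both function names are assumptions, so substitute the values configured for this deployment.

```python
def parse_count(psql_output):
    """Extract the integer from psql's default output for SELECT count(*)."""
    for line in psql_output.splitlines():
        stripped = line.strip()
        if stripped.isdigit():
            return int(stripped)
    raise ValueError("no count found in psql output")


def pool_nearly_exhausted(connection_count, max_connections=100, ratio=0.8):
    """True when connections reach the given share of max_connections.

    max_connections=100 is PostgreSQL's default; ratio=0.8 is an assumed
    warning level, not a documented limit.
    """
    return connection_count >= max_connections * ratio
```

If the check returns true, restarting the web or worker containers as described above releases their connections.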

For Celery Worker Issues

Restart Celery workers:
```
docker compose restart celery_worker
```

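Clear stuck tasks:

# Purge ALL waiting tasks from the "celery" queue (destructive: queued work is lost
# and must be re-submitted; tasks already picked up by a worker are not affected)
docker compose exec web celery -A rota purge -Q celery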

Verify: Check worker is processing tasks

For Cache Issues

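Clear application cache:

# FLUSHDB deletes every key in the selected Redis database (0 by default); if the
# Celery broker shares that database, queued tasks are removed too
docker compose exec redis redis-cli FLUSHDB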

Restart cache-dependent services:
```
docker compose restart celery_worker
```
Verify: Re-run calculation and check results

Verification

After applying fix, verify:

- [ ] No new errors in logs for 15 minutes
- [ ] Test calculation completes successfully
- [ ] Sentry error rate returns to baseline
- [ ] Celery queue depth returns to normal
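
The checklist can be expressed as a small gate that reports which items are still outstanding. This is only a sketch: the `Observation` fields, the baseline values, and the function name are hypothetical, and how each value is collected (logs, Sentry, broker) is left to the operator.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Observation:
    now: datetime
    last_error_at: Optional[datetime]  # None if no error has been seen
    test_calculation_ok: bool
    sentry_errors_per_min: float
    sentry_baseline_per_min: float
    queue_depth: int
    normal_queue_depth: int


def outstanding_checks(obs, quiet_minutes=15):
    """Return the checklist items that are not yet satisfied."""
    failed = []
    quiet = timedelta(minutes=quiet_minutes)
    if obs.last_error_at is not None and obs.now - obs.last_error_at < quiet:
        failed.append(f"no new errors for {quiet_minutes} minutes")
    if not obs.test_calculation_ok:
        failed.append("test calculation succeeds")
    if obs.sentry_errors_per_min > obs.sentry_baseline_per_min:
        failed.append("Sentry error rate at baseline")
    if obs.queue_depth > obs.normal_queue_depth:
        failed.append("queue depth back to normal")
    return failed
```

An empty list means every item on the checklist holds for the values supplied.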

Prevention

Add input validation at API boundaries
Implement circuit breakers for external dependencies
Set up monitoring for cache hit rates
Regular database maintenance (vacuum, analyze)
Implement rate limiting for calculation requests
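
As an illustration of the circuit-breaker item above, here is a minimal, single-threaded sketch of the pattern. It is not taken from this codebase: the class names, the failure threshold, and the reset timeout are all assumptions, and a production version would need thread safety and per-dependency tuning.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so the timeout can be tested
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Still open: fail fast without touching the dependency.
                raise CircuitOpenError("dependency temporarily disabled")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                # Open (or re-open) the circuit and restart the timeout.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping a dependency call as `breaker.call(fetch, arg)` makes repeated failures fail fast instead of tying up workers while the dependency is down.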

Escalation

If unresolved after 1 hour, escalate to: Tech Lead
On-call contact: See on-call roster

Related runbooks: performance_degradation.md, data_inconsistency.md
Related alerts: HighCalculationErrorRate, SlowCalculationPerformance