Runbook: Calculation Failures
- Last Updated: 2026-02-21
- Severity: High
- Estimated TTR: 30 minutes
- Owner: Development Team
Symptoms
- Rota generation fails or hangs
- Calculation API returns 500 errors
- Users report incorrect or missing shift calculations
- Celery task failures in logs
Detection
- Alert: `HighCalculationErrorRate` (error rate > 0.1/min)
- Dashboard: Calculations Metrics Dashboard
- Query: Check Sentry for calculation errors
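When confirming whether the alert condition still holds, it can help to recompute the rate by hand from error timestamps pulled out of Sentry or the logs. The sketch below is one way to do that. The 0.1/min threshold comes from the alert description above; the 15-minute window and the function names are assumptions for illustration, since the alert's real evaluation window is not stated here.

```python
from datetime import datetime, timedelta

# Threshold from the HighCalculationErrorRate alert description.
ALERT_THRESHOLD_PER_MIN = 0.1


def error_rate_per_minute(error_times, now, window_minutes=15):
    """Return errors per minute over the trailing window ending at `now`.

    `window_minutes` is an assumed value, not the alert's configured window.
    """
    window_start = now - timedelta(minutes=window_minutes)
    recent = [t for t in error_times if window_start < t <= now]
    return len(recent) / window_minutes


def alert_would_fire(error_times, now, window_minutes=15):
    """Return True if the trailing error rate exceeds the alert threshold."""
    rate = error_rate_per_minute(error_times, now, window_minutes)
    return rate > ALERT_THRESHOLD_PER_MIN
```

With these assumptions, two errors inside a 15-minute window give roughly 0.13 errors per minute, which is above the threshold, while a single error gives roughly 0.07 and is below it.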
Diagnosis Steps
- Check recent calculation errors:
  - Expected: No new errors in the last 15 minutes
  - If different: Note the error type and frequency for investigation
- Verify Celery workers are running:
  - Expected: Workers are active and responding
  - If different: Workers may be down or overwhelmed
- Check database connectivity:
  - Expected: Database shell connects successfully
  - If different: The database may be down, or there may be a network issue
- Verify cache status:
  - Expected: Response "PONG"
  - If different: The cache may be down, which would affect calculations
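The expected "PONG" reply suggests a Redis-style cache, although this runbook does not name the cache backend, so treat that as an assumption. Under that assumption, a minimal standard-library sketch of the cache check could look like the following. The default host and port are placeholders, and `cache_ping` is a hypothetical helper name.

```python
import socket


def cache_ping(host="127.0.0.1", port=6379, timeout=2.0):
    """Send an inline PING and return True if the reply starts with +PONG.

    Assumes a Redis-compatible server that accepts inline commands.
    Host and port defaults are placeholders, not values from this runbook.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")
            reply = conn.recv(64)
    except OSError:
        # Connection refused, timed out, or otherwise unreachable.
        return False
    return reply.startswith(b"+PONG")
```

A server that requires authentication answers with an error line instead of `+PONG`, so this sketch reports `False` in that case even though the cache is up.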
Root Causes
| Cause | Likelihood | How to Confirm |
|---|---|---|
| Invalid input data | High | Check logs for ValidationError exceptions |
| Database connection issues | Medium | Check database connectivity and query performance |
| Missing clinician configuration | High | Look for "No working term found" errors |
| Cache corruption | Low | Clear cache and retry calculations |
| Celery worker exhaustion | Medium | Check worker queue depth and restart workers |
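To narrow down which of the two high-likelihood causes is in play, the log signatures in the table can be tallied mechanically. The sketch below counts log lines that contain `ValidationError` or "No working term found", both taken from the table above. The function name and the idea of feeding it raw log lines are assumptions; the actual log format is not specified here.

```python
from collections import Counter

# Substrings taken from the "How to Confirm" column of the table above.
SIGNATURES = {
    "ValidationError": "Invalid input data",
    "No working term found": "Missing clinician configuration",
}


def tally_root_causes(log_lines):
    """Count log lines matching each known signature, keyed by likely cause."""
    counts = Counter()
    for line in log_lines:
        for needle, cause in SIGNATURES.items():
            if needle in line:
                counts[cause] += 1
    return counts
```

It could be fed an open log file directly, for example `tally_root_causes(open(path))` with whatever log path applies in the environment.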
Resolution Steps
For Invalid Input Data
- Identify the problematic data:
- Fix the data through Django admin or API:
- Verify fix: Re-run calculation for affected clinician
For Database Connection Issues
- Check database status:
- If the connection pool is exhausted:
- Verify: Check the application health endpoint
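The health endpoint check in the last step can be scripted so it is repeatable during an incident. The following is a minimal sketch using only the standard library. The endpoint URL is not given in this runbook, so it is passed in as a parameter, and treating HTTP 200 as "healthy" is an assumption.

```python
import urllib.error
import urllib.request


def health_ok(url, timeout=5.0):
    """Return True if the endpoint answers with HTTP 200.

    `url` is supplied by the caller; this runbook does not state the path.
    Non-2xx responses raise HTTPError inside urlopen and are reported as False.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

A call might look like `health_ok("http://localhost:8000/health/")`, where that URL is purely illustrative.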
For Celery Worker Issues
- Restart Celery workers:
- Clear stuck tasks:
- Verify: Check worker is processing tasks
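One way to confirm that workers are back is to ping them through Celery's remote-control API and compare the replies with the workers that should exist. In Celery, `app.control.inspect(timeout=...).ping()` returns a dict keyed by worker name, or `None` when nothing answers. The helper below only interprets that reply, so it runs without a broker; the helper name and the list of expected worker names are assumptions.

```python
def missing_workers(ping_reply, expected_workers):
    """Return the expected worker names that did not answer a ping, sorted.

    `ping_reply` is the value returned by Celery's inspect().ping(): a dict
    such as {"celery@host": {"ok": "pong"}}, or None if no worker replied.
    """
    replied = set()
    for name, answer in (ping_reply or {}).items():
        if isinstance(answer, dict) and answer.get("ok") == "pong":
            replied.add(name)
    return sorted(set(expected_workers) - replied)


# Usage sketch (import path and worker names are hypothetical):
#   from myproject.celery import app
#   reply = app.control.inspect(timeout=5).ping()
#   print(missing_workers(reply, ["celery@worker1", "celery@worker2"]))
```

An empty result means every expected worker responded; it does not by itself prove that tasks are being consumed, so queue depth is still worth checking.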
For Cache Issues
- Clear application cache:
- Restart cache-dependent services:
- Verify: Re-run calculation and check results
Verification
After applying the fix, verify:
- [ ] No new errors in logs for 15 minutes
- [ ] Test calculation completes successfully
- [ ] Sentry error rate returns to baseline
- [ ] Celery queue depth returns to normal
Prevention
- Add input validation at API boundaries
- Implement circuit breakers for external dependencies
- Set up monitoring for cache hit rates
- Run regular database maintenance (vacuum, analyze)
- Implement rate limiting for calculation requests
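As an illustration of the circuit-breaker item, the sketch below shows the general pattern: after a number of consecutive failures the breaker rejects calls for a cool-down period, then lets a single trial call through. It is a generic sketch and not the project's implementation; the class names, thresholds, and the injectable clock are all assumptions chosen for illustration.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self._clock = clock               # injectable for testing
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("dependency calls are suspended")
            # Cool-down elapsed: allow one trial call (half-open state).
            # One more failure will reopen the breaker immediately.
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        return result
```

A caller would wrap each dependency call, for example `breaker.call(fetch_from_dependency, arg)`, where `fetch_from_dependency` stands in for whatever external call needs protecting, and treat `CircuitOpenError` as a signal to fail fast or fall back.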
Escalation
- If unresolved after 1 hour, escalate to: Tech Lead
- On-call contact: See on-call roster
Related Issues
- Related runbooks: `performance_degradation.md`, `data_inconsistency.md`
- Related alerts: `HighCalculationErrorRate`, `SlowCalculationPerformance`