Runbook - Incident Response
Severity Levels
SEV1 - Outage
- Definition: total signaling outage, CDR loss imminent, or database unavailable.
- Ack target: 15 minutes.
- Communication: post in Telegram immediately and notify Massimo.
SEV2 - Degraded
- Definition: PDD above 5s, ASR below 90%, partial outage, or frontend/API unavailable with workaround.
- Ack target: 30 minutes.
- Communication: post in Telegram immediately.
SEV3 - Minor
- Definition: single-customer edge case, no active service impact, or noisy non-critical alert.
- Ack target: 4 hours.
- Communication: daily standup or ticket comment.
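The numeric SEV2 thresholds above (PDD above 5s, ASR below 90%) can be checked mechanically. The following is a minimal sketch, not part of the Akira tooling: `classify_sev` is a hypothetical helper, and it covers only the metric-driven SEV2 criteria, since the SEV1 and SEV3 definitions are not numeric.

```shell
# classify_sev is a hypothetical helper (assumption, not an existing tool).
# Usage: classify_sev <pdd_seconds> <asr_percent>
# Prints "SEV2" when PDD is above 5s or ASR is below 90%, otherwise "none".
classify_sev() {
  # awk handles the decimal comparison that shell integer tests cannot.
  awk -v pdd="$1" -v asr="$2" 'BEGIN {
    if (pdd + 0 > 5 || asr + 0 < 90) print "SEV2"; else print "none"
  }'
}
```

For example, `classify_sev 6.2 97` prints `SEV2` because PDD exceeds 5s even though ASR is healthy.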
Kickoff Procedure SEV1/SEV2
- Acknowledge the alert in the Telegram group Akira Staging Alerts with ACK <name>.
- Open an incident ticket through @AkiraOpsBot: /ticket "SEV1 - <short description>" severity:critical
- Start a timestamped incident log.
- Classify the likely subsystem: signaling, database, CDR pipeline, management app, certificate, or Vault.
- Pick the matching runbook:
  - Deploy regression: deploy.md#rollback
  - PostgreSQL failover: dr.md#postgresql-primary-failover
  - Certificate failure: cert-renewal.md
  - Vault sealed: vault-unseal.md
- Post an update every 15 minutes for SEV1 and every 30 minutes for SEV2.
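The timestamped incident log and the update cadence can be kept with two small shell helpers. This is a sketch under assumptions: `log_event`, `update_interval_min`, and the `INCIDENT_LOG` path are hypothetical names, not existing Akira tooling.

```shell
# INCIDENT_LOG is an assumed location; point it wherever incident notes are kept.
INCIDENT_LOG="${INCIDENT_LOG:-./incident.log}"

# log_event (hypothetical): append one entry prefixed with a UTC ISO 8601 timestamp.
log_event() {
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$INCIDENT_LOG"
}

# update_interval_min (hypothetical): minutes between status updates per severity.
update_interval_min() {
  case "$1" in
    SEV1) echo 15 ;;
    SEV2) echo 30 ;;
    *) return 1 ;;   # no fixed cadence is defined for other severities
  esac
}
```

A typical entry would be `log_event "ack by <name>, suspect signaling"`, which keeps the detection, ack, mitigation, and resolution timestamps in one file.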
Diagnostics
Global Snapshot
curl -fsS -I https://mgmt.akira-staging.asheep.it/healthz || true
curl -fsS -I https://grafana.akira-staging.asheep.it || true
ssh root@akira-mgmt-01-staging '
docker ps --format "{{.Names}} {{.Status}}"
docker compose -f /opt/akira/docker-compose.yml ps
'
SEV1 Outage
Check signaling and database first.
ssh root@akira-sip-01-staging '
systemctl status kamailio --no-pager
journalctl -u kamailio --since "15 min ago" --no-pager | tail -80
'
ssh root@akira-db-01-staging '
systemctl status postgresql --no-pager
sudo -u postgres psql -d akira -c "SELECT now();"
'
Mitigation order:
- Roll back if the outage follows a deploy.
- Restart only the failed service if the cause is obvious and isolated.
- Fail over PostgreSQL if primary recovery would exceed 5 minutes.
- Escalate before attempting destructive recovery.
SEV2 Degraded
Use this path for frontend/API unreachable, high PDD, ASR drop, or partial signaling failures.
ssh root@akira-mgmt-01-staging '
docker logs --tail 120 akira-backend
docker logs --tail 120 akira-frontend
'
ssh root@akira-sipp-01-staging '
cd /opt/akira/sipp
./run_scenario.sh smoke_e2e_single.xml smoke_target.csv
'
Expected pilot baseline from TASK-237: ASR at least 95% and PDD p95 below 2s. If live metrics breach the SEV2 thresholds (PDD above 5s or ASR below 90%), keep the incident at SEV2 until they have recovered and stayed stable for 30 minutes.
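The 30-minute recovery window can be tracked by counting how many consecutive recent samples sit inside the SEV2 thresholds. This is a minimal sketch: `healthy_streak` is a hypothetical helper, and the input format (one `<pdd_seconds> <asr_percent>` sample per line, oldest first) is an assumption about how the metrics are exported.

```shell
# healthy_streak (hypothetical): read samples from stdin, oldest first,
# one "<pdd_seconds> <asr_percent>" pair per line.
# Prints the number of consecutive trailing samples with PDD at most 5s
# and ASR at least 90%. Any breach resets the streak to zero.
healthy_streak() {
  awk '{
    if ($1 + 0 <= 5 && $2 + 0 >= 90) streak++; else streak = 0
  } END { print streak + 0 }'
}
```

With one sample per minute, a printed value of 30 or more would correspond to 30 stable minutes.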
SEV3 Minor
SEV3 work can stay in the ticket queue unless it worsens. Collect enough data for a task and avoid risky live changes.
Common Scenarios
Kamailio Down
ssh root@akira-sip-01-staging '
systemctl status kamailio --no-pager
kamailio -c -f /etc/kamailio/kamailio.cfg
journalctl -u kamailio --since "10 min ago" --no-pager | tail -120
'
Recover:
ssh root@akira-sip-01-staging '
systemctl restart kamailio
systemctl status kamailio --no-pager
'
Validate with the SIPp smoke test and confirm that the CDR count is increasing.
CDR Pipeline Stuck
ssh root@akira-mgmt-01-staging '
docker logs --tail 120 akira-cdr-worker
'
ssh root@akira-mgmt-01-staging '
nats stream info AKIRA_CDR
nats consumer info AKIRA_CDR cdr_worker
'
Caveat: replay messages only after confirming idempotency for the affected consumer and time window.
Frontend Unreachable
curl -vkI https://mgmt.akira-staging.asheep.it
ssh root@akira-mgmt-01-staging '
docker logs --tail 120 akira-frontend
systemctl status caddy --no-pager
journalctl -u caddy --since "15 min ago" --no-pager | tail -120
'
If Caddy certificate errors appear, switch to cert-renewal.md.
Escalation Matrix
| Elapsed Time | SEV1 Action | SEV2 Action |
|---|---|---|
| T+0 | Ack and open incident | Ack and open ticket |
| T+15 min | Page secondary and Massimo if not mitigated | Continue diagnosis |
| T+30 min | Page Francesco, prepare rollback or DR | Page secondary |
| T+60 min | Keep bridge active | Reclassify or schedule follow-up |
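The matrix can also be expressed as a lookup for use in reminders or bot replies. This is a sketch only: `escalation_action` is a hypothetical helper that mirrors the table rows and expects elapsed time as whole minutes.

```shell
# escalation_action (hypothetical): mirror the escalation matrix.
# Usage: escalation_action <SEV1|SEV2> <elapsed_minutes>
escalation_action() {
  sev="$1"; t="$2"
  case "$sev" in
    SEV1)
      if [ "$t" -ge 60 ]; then echo "Keep bridge active"
      elif [ "$t" -ge 30 ]; then echo "Page Francesco, prepare rollback or DR"
      elif [ "$t" -ge 15 ]; then echo "Page secondary and Massimo if not mitigated"
      else echo "Ack and open incident"; fi ;;
    SEV2)
      if [ "$t" -ge 60 ]; then echo "Reclassify or schedule follow-up"
      elif [ "$t" -ge 30 ]; then echo "Page secondary"
      elif [ "$t" -ge 15 ]; then echo "Continue diagnosis"
      else echo "Ack and open ticket"; fi ;;
    *)
      # SEV3 has no escalation row in the matrix.
      echo "unknown severity: $sev" >&2; return 1 ;;
  esac
}
```

For example, `escalation_action SEV1 20` prints the T+15 min action, because 20 minutes falls between the T+15 and T+30 rows.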
Resolution
An incident is resolved only after:
- User-facing symptom is gone.
- Smoke test passes.
- CDR ingestion is monotonically increasing again if the call path was affected.
- Metrics are stable for 30 minutes.
- Incident log has timestamps for detection, ack, mitigation, and resolution.
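The CDR ingestion criterion can be checked from a series of row counts sampled during the stability window. This is a sketch under assumptions: `cdr_count_growing` is a hypothetical helper, the counts are collected by whatever query the team already uses, and "monotonic" is interpreted here as never decreasing with a net increase over the window.

```shell
# cdr_count_growing (hypothetical): read one CDR count per line from stdin,
# oldest first. Exits 0 when there are at least two samples, no sample is
# lower than the previous one, and the last count is greater than the first.
cdr_count_growing() {
  awk 'NR == 1 { first = $1 + 0; prev = first; next }
       { cur = $1 + 0; if (cur < prev) bad = 1; prev = cur }
       END { exit !(NR >= 2 && !bad && prev > first) }'
}
```

For example, `printf '100\n104\n104\n110\n' | cdr_count_growing && echo ok` prints `ok`, while a series that dips or stays flat exits non-zero.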
Postmortem Template
# Postmortem - <incident_id> - <date>
## Summary
[One-line description.]
## Impact
- Duration: <X> min
- Customers affected: <N>
- Calls lost: <N>
- Revenue impact: <amount or unknown>
## Timeline
- T+0: <event>
- T+X: <event>
- T+resolved: <event>
## Root Cause
[5-whys analysis.]
## What Went Well
- ...
## What Went Poorly
- ...
## Action Items
- [ ] <owner> - <action> - <due date>
Caveats
- Restore service first, root cause second.
- Do not run destructive commands without naming the target host in the incident channel.
- Do not downgrade SEV1 or SEV2 until the system has stayed stable for 30 minutes.