Runbook - Incident Response

Severity Levels

SEV1 - Outage

Definition: total signaling outage, CDR loss imminent, or database unavailable.
Ack target: 15 minutes.
Communication: post to Telegram immediately and notify Massimo.

SEV2 - Degraded

Definition: PDD above 5s, ASR below 90%, partial outage, or frontend/API unavailable with workaround.
Ack target: 30 minutes.
Communication: post to Telegram immediately.

SEV3 - Minor

Definition: single-customer edge case, no active service impact, or noisy non-critical alert.
Ack target: 4 hours.
Communication: daily standup or ticket comment.
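
The metric thresholds and ack targets above can be checked mechanically. The following is a minimal bash sketch, not existing tooling: the function names breaches_sev2 and ack_target_minutes are hypothetical, and it assumes the caller supplies PDD in seconds and ASR in percent.

```shell
# Hypothetical helper: succeeds (exit 0) when the metrics breach the SEV2
# thresholds from this runbook (PDD above 5s or ASR below 90%).
# Usage: breaches_sev2 <pdd_seconds> <asr_percent>
breaches_sev2() {
  # awk performs the decimal comparison; exit 0 on breach, 1 otherwise.
  awk -v pdd="$1" -v asr="$2" 'BEGIN { exit !(pdd > 5 || asr < 90) }'
}

# Hypothetical helper: prints the ack target in minutes for a severity level.
# Usage: ack_target_minutes <SEV1|SEV2|SEV3>
ack_target_minutes() {
  case "$1" in
    SEV1) echo 15 ;;
    SEV2) echo 30 ;;
    SEV3) echo 240 ;;  # 4 hours
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, breaches_sev2 6.2 97 succeeds because PDD is above 5s, while breaches_sev2 1.5 97 fails because both metrics are within thresholds. SEV1 is defined by outage conditions rather than metrics, so it is deliberately not derived here.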

Kickoff Procedure SEV1/SEV2

Acknowledge the alert in the Telegram group Akira Staging Alerts by posting ACK <name>.

Open an incident ticket through @AkiraOpsBot:

/ticket "SEV1 - <short description>" severity:critical

Start a timestamped incident log.
Classify the likely subsystem: signaling, database, CDR pipeline, management app, certificate, or Vault.
Pick the matching runbook:
- Deploy regression: deploy.md#rollback
- PostgreSQL failover: dr.md#postgresql-primary-failover
- Certificate failure: cert-renewal.md
- Vault sealed: vault-unseal.md
Post an update every 15 minutes for SEV1 and every 30 minutes for SEV2.
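
One way to keep the timestamped incident log and the update cadence is sketched below in bash. This is an assumption-laden example: the function names log_event and update_interval_minutes, the INCIDENT_LOG variable, and the one-line-per-event log format are all hypothetical, not an established team convention.

```shell
# Hypothetical helper: append a UTC-timestamped entry to the incident log.
# INCIDENT_LOG is an assumed variable; it defaults to ./incident.log.
# Usage: log_event <free text>
log_event() {
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" \
    >> "${INCIDENT_LOG:-incident.log}"
}

# Hypothetical helper: prints the update cadence in minutes
# (SEV1 every 15 minutes, SEV2 every 30 minutes).
# Usage: update_interval_minutes <SEV1|SEV2>
update_interval_minutes() {
  case "$1" in
    SEV1) echo 15 ;;
    SEV2) echo 30 ;;
    *) echo "no update cadence defined for: $1" >&2; return 1 ;;
  esac
}
```

Typical use would be log_event "ack by <name>" right after acknowledging, then one entry per diagnostic step or mitigation, so the detection, ack, mitigation, and resolution timestamps are available at close.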

Diagnostics

Global Snapshot

curl -fsS -I https://mgmt.akira-staging.asheep.it/healthz || true
curl -fsS -I https://grafana.akira-staging.asheep.it || true

ssh root@akira-mgmt-01-staging '
  docker ps --format "{{.Names}} {{.Status}}"
  docker compose -f /opt/akira/docker-compose.yml ps
'

SEV1 Outage

Check signaling and database first.

ssh root@akira-sip-01-staging '
  systemctl status kamailio --no-pager
  journalctl -u kamailio --since "15 min ago" --no-pager | tail -80
'

ssh root@akira-db-01-staging '
  systemctl status postgresql --no-pager
  sudo -u postgres psql -d akira -c "SELECT now();"
'

Mitigation order:

Roll back if the outage follows a deploy.
Restart only the failed service if the cause is obvious and isolated.
Fail over PostgreSQL if primary recovery would exceed 5 minutes.
Escalate before attempting destructive recovery.
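
The mitigation order above can be read as a first-match decision. The bash sketch below illustrates that reading only; the function name choose_mitigation, its yes/no arguments, and the output labels are hypothetical, and the estimated recovery time is a judgment call supplied by the responder.

```shell
# Hypothetical helper: print the first mitigation that applies, in runbook order.
# Usage: choose_mitigation <followed_deploy yes|no> <isolated_cause yes|no> \
#                          <db_primary_down yes|no> <est_recovery_min>
choose_mitigation() {
  local followed_deploy=$1 isolated_cause=$2 db_primary_down=$3 est_recovery_min=$4
  if [ "$followed_deploy" = yes ]; then
    echo "rollback"                  # outage follows a deploy
  elif [ "$isolated_cause" = yes ]; then
    echo "restart-failed-service"    # cause is obvious and isolated
  elif [ "$db_primary_down" = yes ] && [ "$est_recovery_min" -gt 5 ]; then
    echo "postgresql-failover"       # primary recovery would exceed 5 minutes
  else
    echo "escalate"                  # nothing safe applies; escalate first
  fi
}
```

For instance, choose_mitigation no no yes 10 prints postgresql-failover, while the same call with an estimate of 3 minutes prints escalate.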

SEV2 Degraded

Use this path for frontend/API unreachable, high PDD, ASR drop, or partial signaling failures.

ssh root@akira-mgmt-01-staging '
  docker logs --tail 120 akira-backend
  docker logs --tail 120 akira-frontend
'

ssh root@akira-sipp-01-staging '
  cd /opt/akira/sipp
  ./run_scenario.sh smoke_e2e_single.xml smoke_target.csv
'

Expected pilot baseline from TASK-237: ASR at least 95% and PDD p95 below 2s. If live metrics breach the SEV2 thresholds (PDD above 5s or ASR below 90%), keep the incident at SEV2 until they have stayed within thresholds for 30 minutes.
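
The 30-minute recovery condition can be checked against sampled metrics. The bash sketch below assumes a hypothetical input format of one sample per line (epoch seconds, ASR percent, PDD seconds) and a hypothetical function name stable_for_30m; how the samples are exported from monitoring is left to the operator.

```shell
# Hypothetical helper: succeeds when every sample in the last 30 minutes is
# within the SEV2 thresholds (ASR at least 90%, PDD at most 5s) and at least
# one sample exists in that window.
# Usage: stable_for_30m <now_epoch> < samples.txt
# Input lines: <epoch_seconds> <asr_percent> <pdd_seconds>
stable_for_30m() {
  awk -v now="$1" '
    $1 >= now - 1800 && $1 <= now {
      seen++
      if ($2 < 90 || $3 > 5) bad++
    }
    END { exit !(seen > 0 && bad == 0) }
  '
}
```

Samples older than the window are ignored, so an earlier breach does not block recovery once 30 clean minutes have passed. An empty window is treated as not stable, which avoids closing an incident on missing data.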

SEV3 Minor

SEV3 work can stay in the ticket queue unless it worsens. Collect enough data for a task and avoid risky live changes.

Common Scenarios

Kamailio Down

ssh root@akira-sip-01-staging '
  systemctl status kamailio --no-pager
  kamailio -c -f /etc/kamailio/kamailio.cfg
  journalctl -u kamailio --since "10 min ago" --no-pager | tail -120
'

Recover:

ssh root@akira-sip-01-staging '
  systemctl restart kamailio
  systemctl status kamailio --no-pager
'

Validate with the SIPp smoke test and a CDR count check.

CDR Pipeline Stuck

ssh root@akira-mgmt-01-staging '
  docker logs --tail 120 akira-cdr-worker
'

ssh root@akira-mgmt-01-staging '
  nats stream info AKIRA_CDR
  nats consumer info AKIRA_CDR cdr_worker
'

Caveat: replay messages only after confirming idempotency for the affected consumer and time window.

Frontend Unreachable

curl -vkI https://mgmt.akira-staging.asheep.it

ssh root@akira-mgmt-01-staging '
  docker logs --tail 120 akira-frontend
  systemctl status caddy --no-pager
  journalctl -u caddy --since "15 min ago" --no-pager | tail -120
'

If Caddy certificate errors appear, switch to cert-renewal.md.

Escalation Matrix

Elapsed Time	SEV1 Action	SEV2 Action
T+0	Ack and open incident	Ack and open ticket
T+15 min	Page secondary and Massimo if not mitigated	Continue diagnosis
T+30 min	Page Francesco, prepare rollback or DR	Page secondary
T+60 min	Keep bridge active	Reclassify or schedule follow-up
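
The matrix can also be expressed as a lookup, for example to drive reminders. This bash sketch is illustrative only; the function name escalation_action is hypothetical and the action strings are copied from the table above.

```shell
# Hypothetical helper: print the escalation action for a severity and the
# minutes elapsed since T+0, following the matrix in this runbook.
# Usage: escalation_action <SEV1|SEV2> <elapsed_minutes>
escalation_action() {
  local sev=$1 elapsed=$2
  case "$sev" in
    SEV1)
      if   [ "$elapsed" -ge 60 ]; then echo "Keep bridge active"
      elif [ "$elapsed" -ge 30 ]; then echo "Page Francesco, prepare rollback or DR"
      elif [ "$elapsed" -ge 15 ]; then echo "Page secondary and Massimo if not mitigated"
      else echo "Ack and open incident"
      fi ;;
    SEV2)
      if   [ "$elapsed" -ge 60 ]; then echo "Reclassify or schedule follow-up"
      elif [ "$elapsed" -ge 30 ]; then echo "Page secondary"
      elif [ "$elapsed" -ge 15 ]; then echo "Continue diagnosis"
      else echo "Ack and open ticket"
      fi ;;
    *) echo "no escalation path for: $sev" >&2; return 1 ;;
  esac
}
```

Each row is treated as applying from its timestamp until the next row, which is an interpretation of the table rather than a stated rule.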

Resolution

An incident is resolved only after:

User-facing symptom is gone.
Smoke test passes.
CDR ingestion is increasing monotonically again if the call path was affected.
Metrics are stable for 30 minutes.
Incident log has timestamps for detection, ack, mitigation, and resolution.
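
The CDR ingestion criterion can be checked by sampling the CDR count a few times and confirming it keeps rising. The bash sketch below assumes the counts are collected separately (the query is not shown here) and reads "monotonic" as strictly increasing between samples; the function name cdr_count_increasing is hypothetical.

```shell
# Hypothetical helper: succeeds when each CDR count sample is strictly
# greater than the previous one. Needs at least two samples.
# Usage: cdr_count_increasing <count1> <count2> [<count3> ...]
cdr_count_increasing() {
  [ "$#" -ge 2 ] || return 1
  local prev=$1 count
  shift
  for count in "$@"; do
    # A flat or falling count means ingestion has not resumed.
    [ "$count" -gt "$prev" ] || return 1
    prev=$count
  done
}
```

For example, cdr_count_increasing 100 104 109 succeeds, while cdr_count_increasing 100 100 101 fails because the count stalled between the first two samples. During periods with no call traffic a flat count may be legitimate, so the samples should be taken while the SIPp smoke test or live traffic is running.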

Postmortem Template

# Postmortem - <incident_id> - <date>

## Summary

[One-line description.]

## Impact

- Duration: <X> min
- Customers affected: <N>
- Calls lost: <N>
- Revenue impact: <amount or unknown>

## Timeline

- T+0: <event>
- T+X: <event>
- T+resolved: <event>

## Root Cause

[5-whys analysis.]

## What Went Well

- ...

## What Went Poorly

- ...

## Action Items

- [ ] <owner> - <action> - <due date>

Caveats

Restore service first, root cause second.
Do not run destructive commands without naming the target host in the incident channel.
Do not downgrade SEV1 or SEV2 until the system has stayed stable for 30 minutes.
