Runbook - Incident Response

Severity Levels

SEV1 - Outage

  • Definition: total signaling outage, CDR loss imminent, or database unavailable.
  • Ack target: 15 minutes.
  • Communication: post in Telegram immediately and notify Massimo.

SEV2 - Degraded

  • Definition: PDD above 5s, ASR below 90%, partial outage, or frontend/API unavailable with workaround.
  • Ack target: 30 minutes.
  • Communication: post in Telegram immediately.

SEV3 - Minor

  • Definition: single-customer edge case, no active service impact, or noisy non-critical alert.
  • Ack target: 4 hours.
  • Communication: daily standup or ticket comment.
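
The ack targets above can be kept in one small lookup so that scripts and bots quote the same numbers. This is a minimal sketch; the function name ack_target_minutes is hypothetical and not part of any existing tooling referenced in this runbook.

```shell
# Hypothetical helper: map a severity label to its ack target in minutes.
# Values mirror the Severity Levels section (SEV3 = 4 hours = 240 minutes).
ack_target_minutes() {
  case "$1" in
    SEV1) echo 15 ;;
    SEV2) echo 30 ;;
    SEV3) echo 240 ;;
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `ack_target_minutes SEV2` prints 30, and an unknown label returns a non-zero exit status.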

Kickoff Procedure SEV1/SEV2

  1. Acknowledge the alert in the Telegram group Akira Staging Alerts by posting ACK <name>.

  2. Open an incident ticket through @AkiraOpsBot:

    /ticket "SEV1 - <short description>" severity:critical

  3. Start a timestamped incident log.

  4. Classify the likely subsystem: signaling, database, CDR pipeline, management app, certificate, or Vault.

  5. Pick the matching runbook:

  6. Post an update every 15 minutes for SEV1 and every 30 minutes for SEV2.
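
Steps 3 and 6 can be supported by two small shell helpers. This is a sketch under stated assumptions: the INCIDENT_LOG path and the function names log_event and update_interval_minutes are hypothetical, and the runbook does not prescribe a log format.

```shell
# Hypothetical helpers for steps 3 and 6; INCIDENT_LOG is an assumed path.
INCIDENT_LOG="${INCIDENT_LOG:-./incident.log}"

# Append one log entry prefixed with a UTC ISO-8601 timestamp.
log_event() {
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$INCIDENT_LOG"
}

# Print the number of minutes between status updates for a severity.
update_interval_minutes() {
  case "$1" in
    SEV1) echo 15 ;;
    SEV2) echo 30 ;;
    *) echo "no update cadence defined for: $1" >&2; return 1 ;;
  esac
}
```

A typical call would be `log_event "ACK by <name>, ticket opened"`. Using UTC keeps entries comparable across hosts in different time zones.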

Diagnostics

Global Snapshot

curl -fsS -I https://mgmt.akira-staging.asheep.it/healthz || true
curl -fsS -I https://grafana.akira-staging.asheep.it || true

ssh root@akira-mgmt-01-staging '
docker ps --format "{{.Names}} {{.Status}}"
docker compose -f /opt/akira/docker-compose.yml ps
'

SEV1 Outage

Check signaling and database first.

ssh root@akira-sip-01-staging '
systemctl status kamailio --no-pager
journalctl -u kamailio --since "15 min ago" --no-pager | tail -80
'

ssh root@akira-db-01-staging '
systemctl status postgresql --no-pager
sudo -u postgres psql -d akira -c "SELECT now();"
'

Mitigation order:

  1. Roll back if the outage follows a deploy.
  2. Restart only the failed service if the cause is obvious and isolated.
  3. Fail over PostgreSQL if primary recovery would exceed 5 minutes.
  4. Escalate before attempting destructive recovery.
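
The mitigation order above can be read as a first-match decision. The sketch below encodes that reading; the function name, its arguments, and the output labels are hypothetical. Treating "escalate" as the fallback when no earlier step applies is an assumption, not something the runbook states.

```shell
# Hypothetical decision helper mirroring the mitigation order above.
# Args: follows_deploy (yes/no), cause_isolated (yes/no),
#       est_recovery_min (integer estimate for primary PostgreSQL recovery;
#       pass 0 when the database is not involved).
choose_mitigation() {
  follows_deploy="$1"; cause_isolated="$2"; est_recovery_min="$3"
  if [ "$follows_deploy" = "yes" ]; then
    echo "rollback"
  elif [ "$cause_isolated" = "yes" ]; then
    echo "restart-failed-service"
  elif [ "$est_recovery_min" -gt 5 ]; then
    echo "failover-postgresql"
  else
    # Assumed fallback: nothing above applies, so escalate before
    # attempting any destructive recovery.
    echo "escalate"
  fi
}
```

For example, `choose_mitigation no no 10` prints failover-postgresql, because neither a deploy nor an isolated cause applies and the estimated recovery exceeds 5 minutes.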

SEV2 Degraded

Use this path when the frontend or API is unreachable, PDD is high, ASR has dropped, or signaling is partially failing.

ssh root@akira-mgmt-01-staging '
docker logs --tail 120 akira-backend
docker logs --tail 120 akira-frontend
'

ssh root@akira-sipp-01-staging '
cd /opt/akira/sipp
./run_scenario.sh smoke_e2e_single.xml smoke_target.csv
'

Expected pilot baseline from TASK-237: ASR at least 95% and PDD p95 below 2s. If live metrics breach the SEV2 thresholds (PDD above 5s or ASR below 90%), keep the incident at SEV2 until they have stayed within those thresholds for 30 minutes.
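
The two sets of numbers above (SEV2 thresholds and the TASK-237 pilot baseline) can be compared against live values with a pair of checks. This is a minimal sketch: the function names are hypothetical, and how PDD and ASR are read from monitoring is left out because the runbook does not specify it.

```shell
# Hypothetical check: exit 0 when either SEV2 threshold is breached
# (PDD above 5 s or ASR below 90%).
# Args: pdd_seconds (e.g. 5.4), asr_percent (e.g. 88.5).
sev2_breached() {
  awk -v pdd="$1" -v asr="$2" 'BEGIN { exit !((pdd + 0 > 5) || (asr + 0 < 90)) }'
}

# Hypothetical check: exit 0 when values meet the pilot baseline
# (PDD p95 below 2 s and ASR at least 95%).
within_pilot_baseline() {
  awk -v pdd="$1" -v asr="$2" 'BEGIN { exit !((pdd + 0 < 2) && (asr + 0 >= 95)) }'
}
```

For example, `sev2_breached 6.0 95 && echo "keep at SEV2"` prints the message because PDD exceeds 5 s. awk handles the comparison because plain shell arithmetic does not support decimals.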

SEV3 Minor

SEV3 work can stay in the ticket queue unless it worsens. Collect enough data to file a task and avoid risky live changes.

Common Scenarios

Kamailio Down

ssh root@akira-sip-01-staging '
systemctl status kamailio --no-pager
kamailio -c -f /etc/kamailio/kamailio.cfg
journalctl -u kamailio --since "10 min ago" --no-pager | tail -120
'

Recover:

ssh root@akira-sip-01-staging '
systemctl restart kamailio
systemctl status kamailio --no-pager
'

Validate with a SIPp smoke test and by checking that the CDR count is increasing again.

CDR Pipeline Stuck

ssh root@akira-mgmt-01-staging '
docker logs --tail 120 akira-cdr-worker
'

ssh root@akira-mgmt-01-staging '
nats stream info AKIRA_CDR
nats consumer info AKIRA_CDR cdr_worker
'

Caveat: replay messages only after confirming idempotency for the affected consumer and time window.

Frontend Unreachable

curl -vkI https://mgmt.akira-staging.asheep.it

ssh root@akira-mgmt-01-staging '
docker logs --tail 120 akira-frontend
systemctl status caddy --no-pager
journalctl -u caddy --since "15 min ago" --no-pager | tail -120
'

If Caddy certificate errors appear, switch to cert-renewal.md.

Escalation Matrix

Elapsed Time | SEV1 Action                                 | SEV2 Action
T+0          | Ack and open incident                       | Ack and open ticket
T+15 min     | Page secondary and Massimo if not mitigated | Continue diagnosis
T+30 min     | Page Francesco, prepare rollback or DR      | Page secondary
T+60 min     | Keep bridge active                          | Reclassify or schedule follow-up
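
The matrix above can be encoded as a lookup so that a reminder script prints the expected action for the elapsed time. This is a sketch only; the function name escalation_action is hypothetical, and each row is assumed to apply from its timestamp until the next row begins.

```shell
# Hypothetical lookup of the escalation matrix.
# Args: severity (SEV1|SEV2), whole minutes elapsed since detection.
escalation_action() {
  sev="$1"; elapsed="$2"
  case "$sev" in
    SEV1)
      if   [ "$elapsed" -ge 60 ]; then echo "Keep bridge active"
      elif [ "$elapsed" -ge 30 ]; then echo "Page Francesco, prepare rollback or DR"
      elif [ "$elapsed" -ge 15 ]; then echo "Page secondary and Massimo if not mitigated"
      else echo "Ack and open incident"
      fi ;;
    SEV2)
      if   [ "$elapsed" -ge 60 ]; then echo "Reclassify or schedule follow-up"
      elif [ "$elapsed" -ge 30 ]; then echo "Page secondary"
      elif [ "$elapsed" -ge 15 ]; then echo "Continue diagnosis"
      else echo "Ack and open ticket"
      fi ;;
    *) echo "no escalation path for: $sev" >&2; return 1 ;;
  esac
}
```

For example, `escalation_action SEV1 20` prints the T+15 min action for SEV1.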

Resolution

An incident is resolved only after:

  • The user-facing symptom is gone.
  • The smoke test passes.
  • CDR ingestion is increasing monotonically again, if the call path was affected.
  • Metrics are stable for 30 minutes.
  • Incident log has timestamps for detection, ack, mitigation, and resolution.
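
The CDR criterion above can be checked against a handful of count samples taken a few minutes apart. This is a minimal sketch; the function name is hypothetical and the query that produces the counts is deliberately left out, since this runbook does not name the CDR table. "Monotonic" is assumed here to mean that counts never decrease and the last sample is higher than the first.

```shell
# Hypothetical check: exit 0 when the sampled CDR counts never decrease
# and the final sample is higher than the first (ingestion is advancing).
# Args: two or more integer counts, oldest first.
cdr_counts_monotonic() {
  [ "$#" -ge 2 ] || return 1
  first="$1"; prev="$1"; shift
  for n in "$@"; do
    [ "$n" -ge "$prev" ] || return 1
    prev="$n"
  done
  [ "$prev" -gt "$first" ]
}
```

For example, `cdr_counts_monotonic 100 104 104 110` succeeds, while a flat series such as `100 100 100` fails because ingestion has not advanced.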

Postmortem Template

# Postmortem - <incident_id> - <date>

## Summary

[One-line description.]

## Impact

- Duration: <X> min
- Customers affected: <N>
- Calls lost: <N>
- Revenue impact: <amount or unknown>

## Timeline

- T+0: <event>
- T+X: <event>
- T+resolved: <event>

## Root Cause

[5-whys analysis.]

## What Went Well

- ...

## What Went Poorly

- ...

## Action Items

- [ ] <owner> - <action> - <due date>

Caveats

  • Restore service first, root cause second.
  • Do not run destructive commands without naming the target host in the incident channel.
  • Do not downgrade SEV1 or SEV2 until the system has stayed stable for 30 minutes.
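
The 30-minute stability rule can be enforced with a small guard before anyone downgrades or resolves an incident. This is a sketch under assumptions: the function name is hypothetical, and the time of the last threshold breach is assumed to be available as Unix epoch seconds, for example from the incident log.

```shell
# Hypothetical guard: exit 0 once 30 minutes (1800 s) have passed since
# the last threshold breach.
# Args: last_breach_epoch, optional now_epoch (defaults to current time).
stable_for_30_min() {
  last_breach_epoch="$1"
  now_epoch="${2:-$(date +%s)}"
  [ $(( now_epoch - last_breach_epoch )) -ge 1800 ]
}
```

For example, `stable_for_30_min "$last_breach" || echo "too early to downgrade"` prints the warning until the window has elapsed, assuming last_breach holds the epoch time of the last breach.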