Runbook - On-Call Rotation

Setup

  • Rotation: weekly, 7 days.
  • Primary: Massimo during pilot.
  • Secondary: Francesco.
  • Handoff: Mondays at 10:00 Europe/Rome, in the Telegram standup.
  • Primary ack target: 15 minutes for SEV1, 30 minutes for SEV2.

Prerequisites

  • Telegram access to Akira Staging Alerts.
  • @AkiraOpsBot admin access.
  • Grafana access at https://grafana.akira-staging.asheep.it.
  • SSH key and Tailscale access verified at shift start.
  • Ansible vault password file available when performing maintenance.

Shift Start Checklist

Verify connectivity first:

tailscale status | head
ssh root@akira-mgmt-01-staging 'hostname && uptime'
curl -fsS -I https://mgmt.akira-staging.asheep.it/healthz

Then work through the reviews:

  • Review incidents from the last 7 days.
  • Review open NOC tickets.
  • Review deploy calendar.
  • Confirm Telegram alerts are not muted.
  • Confirm no stale acknowledged alert is still firing.
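
The three connectivity commands at the top of this checklist can be wrapped so that each one prints a clear pass or fail line. This is a minimal sketch, not an official script: the function names run_check and shift_start_checks, the labels, and the 10-second timeouts are assumptions, while the hosts and URL are the ones listed above. It also assumes that tailscale status exits non-zero when the node is not connected.

```shell
# Run one check quietly, print OK or FAIL with its label,
# and return non-zero when the check fails.
run_check() {
  label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    printf 'OK    %s\n' "$label"
  else
    printf 'FAIL  %s\n' "$label"
    return 1
  fi
}

# Run all shift-start connectivity checks; return non-zero if any failed.
# Timeouts are illustrative assumptions, not values from the runbook.
shift_start_checks() {
  failures=0
  run_check 'tailscale' tailscale status \
    || failures=$((failures + 1))
  run_check 'ssh mgmt' ssh -o ConnectTimeout=10 \
    root@akira-mgmt-01-staging 'hostname && uptime' \
    || failures=$((failures + 1))
  run_check 'mgmt healthz' curl -fsS -I --max-time 10 \
    https://mgmt.akira-staging.asheep.it/healthz \
    || failures=$((failures + 1))
  [ "$failures" -eq 0 ]
}
```

Calling shift_start_checks at shift start prints one line per check and returns non-zero if anything failed, so it can gate the rest of the checklist.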

Duties

  • Acknowledge alerts within the ack target (15 minutes for SEV1, 30 minutes for SEV2).
  • Classify SEV level using incident-response.md.
  • Keep incident updates on schedule.
  • Run smoke tests after mitigation.
  • Convert post-incident action items into tracked tasks.
  • Hand off unresolved incidents with owner, current state, and next action.

Tools

  • Telegram alerts: @AkiraOpsBot and group Akira Staging Alerts.
  • Grafana: https://grafana.akira-staging.asheep.it.
  • Alertmanager: https://alerts.akira-staging.asheep.it.
  • AgentCore via @AkiraOpsBot for quick operational queries.
  • Runbooks: README.md.

Common AgentCore Queries

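quanti CDR ingested oggi?                     (how many CDRs were ingested today?)
mostra ASR ultime 24h                         (show ASR for the last 24 hours)
mostra PDD p95 ultime 2h                      (show p95 PDD for the last 2 hours)
chi sta consumando piu traffico oggi?         (who is consuming the most traffic today?)
qual e il margin Acme SRL questa settimana?   (what is the Acme SRL margin this week?)
ci sono alert critici aperti?                 (are there any open critical alerts?)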

Caveat: treat AgentCore answers as supporting context only. Before any destructive action, verify the facts directly in Grafana, the database, or on the host.
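
As one way to cross-check the open critical alerts query directly, the Alertmanager v2 HTTP API can be asked for active alerts. This is a sketch under several assumptions: that the Alertmanager URL listed under Tools is reachable without extra authentication, that alerts carry a severity label with the value critical, and that jq is installed. The function names alerts_endpoint and open_alert_count are hypothetical.

```shell
# Base URL taken from the Tools section; override via the environment if needed.
ALERTMANAGER_URL="${ALERTMANAGER_URL:-https://alerts.akira-staging.asheep.it}"

# Build the v2 alerts endpoint for a base URL (a trailing slash is tolerated).
alerts_endpoint() {
  printf '%s/api/v2/alerts' "${1%/}"
}

# Print the number of active, unsilenced, uninhibited alerts whose
# severity label (assumed label name) equals the given value.
open_alert_count() {
  curl -fsS -G "$(alerts_endpoint "$ALERTMANAGER_URL")" \
    --data-urlencode 'active=true' \
    --data-urlencode 'silenced=false' \
    --data-urlencode 'inhibited=false' \
    --data-urlencode "filter=severity=\"$1\"" \
    | jq 'length'
}
```

Running open_alert_count critical prints a single number; a result that disagrees with the AgentCore answer is a reason to look at Alertmanager and Grafana before acting.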

Escalation

Situation      T+0           T+15 min            T+30 min
SEV1 outage    Primary owns  Escalate secondary  Escalate Francesco
SEV2 degraded  Primary owns  Continue diagnosis  Escalate secondary
SEV3 minor     Ticket owner  No page             Review next business day

If the primary does not acknowledge a SEV1 within 15 minutes, the secondary takes ownership and notes the takeover in the incident channel.
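
The escalation table can be expressed as a small lookup for use in scripts or bot commands. This is an illustrative sketch only: the function name escalation_action is hypothetical, and it assumes each column applies from its stated minute onward (T+15 min covers minutes 15 to 29, T+30 min covers 30 and later). The returned strings are copied from the table.

```shell
# Print the escalation action for a severity (SEV1, SEV2, SEV3)
# and the whole minutes elapsed since the alert fired.
escalation_action() {
  sev="$1"
  elapsed="$2"
  case "$sev" in
    SEV1)
      if [ "$elapsed" -ge 30 ]; then echo 'Escalate Francesco'
      elif [ "$elapsed" -ge 15 ]; then echo 'Escalate secondary'
      else echo 'Primary owns'
      fi ;;
    SEV2)
      if [ "$elapsed" -ge 30 ]; then echo 'Escalate secondary'
      elif [ "$elapsed" -ge 15 ]; then echo 'Continue diagnosis'
      else echo 'Primary owns'
      fi ;;
    SEV3)
      if [ "$elapsed" -ge 30 ]; then echo 'Review next business day'
      elif [ "$elapsed" -ge 15 ]; then echo 'No page'
      else echo 'Ticket owner'
      fi ;;
    *)
      echo "unknown severity: $sev" >&2
      return 2 ;;
  esac
}
```

For example, escalation_action SEV1 20 prints the T+15 min action for a SEV1, and an unrecognized severity returns status 2 with a message on stderr.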

Handoff

At shift end, send:

On-call handoff <date>
- Open incidents:
- Risky alerts:
- Pending deploys:
- Customer-impacting tickets:
- Next actions:

The next on-call must acknowledge the handoff message before the previous primary is considered released.
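
To avoid retyping the handoff message, the template can be generated with the date already filled in. This is a minimal sketch: the function name handoff_template is hypothetical, and defaulting the date to today in Europe/Rome (the handoff time zone from Setup) is an assumption.

```shell
# Print the handoff template. The optional first argument is the date
# (YYYY-MM-DD); when omitted, today's date in Europe/Rome is used.
handoff_template() {
  handoff_date="${1:-$(TZ=Europe/Rome date +%F)}"
  cat <<EOF
On-call handoff ${handoff_date}
- Open incidents:
- Risky alerts:
- Pending deploys:
- Customer-impacting tickets:
- Next actions:
EOF
}
```

Running handoff_template with no arguments prints the block ready to paste into Telegram; passing a date such as 2025-01-06 overrides the default.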

Validation Commands

Use these for a fast health check during handoff:

ssh root@akira-mgmt-01-staging '
docker compose -f /opt/akira/docker-compose.yml ps
'

ssh root@akira-db-01-staging '
sudo -u postgres psql -d akira -c "
SELECT count(*) AS cdr_last_hour
FROM cdr
WHERE answered_at > NOW() - INTERVAL '\''1 hour'\'';
"
'

ssh root@akira-sipp-01-staging '
cd /opt/akira/sipp
./run_scenario.sh smoke_e2e_single.xml smoke_target.csv
'

After-Action

  • SEV1 and SEV2 require a postmortem within 48 hours.
  • Action items must be tracked as tasks.
  • Repeated SEV3 alerts should become a single cleanup task instead of being acknowledged manually each time.

Caveats

  • Do not deploy while handing off unless both engineers explicitly agree who owns rollback.
  • Do not rely on memory for hostnames during a page. Copy commands from the matching runbook.
  • Do not suppress alerts without a ticket or task explaining the reason.
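
The rule about not suppressing alerts without a ticket can be enforced with a thin wrapper around the silencing command. This sketch assumes silences are created with amtool, the Alertmanager command-line client, which this runbook does not confirm; the function name silence_with_ticket and the comment format are hypothetical, and the URL is the Alertmanager address from Tools.

```shell
# Create a silence only when a ticket reference is supplied.
# Usage: silence_with_ticket <ticket-ref> <duration> <matcher>...
silence_with_ticket() {
  ticket="${1:-}"
  duration="${2:-}"
  if [ -z "$ticket" ] || [ -z "$duration" ] || [ "$#" -lt 3 ]; then
    echo 'usage: silence_with_ticket <ticket-ref> <duration> <matcher>...' >&2
    return 2
  fi
  shift 2
  # The remaining arguments are label matchers, e.g. alertname=SomeAlert.
  amtool \
    --alertmanager.url="${ALERTMANAGER_URL:-https://alerts.akira-staging.asheep.it}" \
    silence add \
    --author="${USER:-on-call}" \
    --duration="$duration" \
    --comment="Ticket ${ticket}" \
    "$@"
}
```

The wrapper refuses to run when the ticket reference is empty, so every silence it creates carries a comment that points back to the reason.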