Runbook - On-Call Rotation
Setup
- Rotation: weekly, 7 days.
- Primary: Massimo during pilot.
- Secondary: Francesco.
- Handoff: Monday 10:00 Europe/Rome in Telegram standup.
- Primary ack target: 15 minutes for SEV1, 30 minutes for SEV2.
Prerequisites
- Telegram access to Akira Staging Alerts.
- @AkiraOpsBot admin access.
- Grafana access at https://grafana.akira-staging.asheep.it.
- SSH key and Tailscale access verified at shift start.
- Ansible vault password file available when performing maintenance.
Shift Start Checklist
tailscale status | head
ssh root@akira-mgmt-01-staging 'hostname && uptime'
curl -fsS -I https://mgmt.akira-staging.asheep.it/healthz
- Review incidents from the last 7 days.
- Review open NOC tickets.
- Review deploy calendar.
- Confirm Telegram alerts are not muted.
- Confirm no stale acknowledged alert is still firing.
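The three commands at the top of this checklist can be wrapped so that each one reports a clear pass or fail. This is a minimal sketch, not an existing script in this repo: `run_check` and `shift_start_checks` are hypothetical helper names, and the hostname and URL are copied from the checklist.

```shell
# Hypothetical helper: run one check quietly, print OK/FAIL with a label,
# and return non-zero when the check fails.
run_check() {
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $label"
  else
    echo "FAIL $label"
    return 1
  fi
}

# Run every shift-start connectivity check; returns 1 if any check failed.
shift_start_checks() {
  local failed=0
  run_check "tailscale" tailscale status || failed=1
  run_check "ssh mgmt" ssh root@akira-mgmt-01-staging 'hostname && uptime' || failed=1
  run_check "mgmt healthz" curl -fsS -I https://mgmt.akira-staging.asheep.it/healthz || failed=1
  return "$failed"
}
```

Call `shift_start_checks` once at shift start. A `FAIL` line only says the command exited non-zero, so rerun that command by hand to see its output.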
Duties
- Acknowledge alerts within the ack target for their severity.
- Classify SEV level using incident-response.md.
- Keep incident updates on schedule.
- Run smoke tests after mitigation.
- Convert post-incident action items into tracked tasks.
- Hand off unresolved incidents with owner, current state, and next action.
Tools
- Telegram alerts: @AkiraOpsBot and group Akira Staging Alerts.
- Grafana: https://grafana.akira-staging.asheep.it.
- Alertmanager: https://alerts.akira-staging.asheep.it.
- AgentCore via @AkiraOpsBot for quick operational queries.
- Runbooks: README.md.
Common AgentCore Queries
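quanti CDR ingested oggi? (How many CDRs were ingested today?)
mostra ASR ultime 24h (Show ASR for the last 24 hours)
mostra PDD p95 ultime 2h (Show p95 PDD for the last 2 hours)
chi sta consumando piu traffico oggi? (Who is consuming the most traffic today?)
qual e il margin Acme SRL questa settimana? (What is the margin for Acme SRL this week?)
ci sono alert critici aperti? (Are there any critical alerts open?)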
Caveat: Treat AgentCore answers as supporting context, not as a source of truth. Verify directly in Grafana, the database, or on the host before taking any destructive action.
Escalation
| Situation | T+0 | T+15 min | T+30 min |
|---|---|---|---|
| SEV1 outage | Primary owns | Escalate secondary | Escalate Francesco |
| SEV2 degraded | Primary owns | Continue diagnosis | Escalate secondary |
| SEV3 minor | Ticket owner | No page | Review next business day |
If the primary does not acknowledge a SEV1 within 15 minutes, the secondary takes ownership and notes the takeover in the incident channel.
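The table above can also be expressed as a lookup, which is handy in a chat bot or a pager script. This is a sketch only: `escalation_step` is a hypothetical helper, and it assumes the thresholds are inclusive, so exactly 15 minutes counts as T+15.

```shell
# Hypothetical helper: print the escalation step from the table above
# for a severity (SEV1|SEV2|SEV3) and whole minutes elapsed since the alert fired.
# Assumption: thresholds are inclusive (15 means T+15 has been reached).
escalation_step() {
  local sev="$1"
  local minutes="$2"
  case "$sev" in
    SEV1)
      if [ "$minutes" -ge 30 ]; then echo "Escalate Francesco"
      elif [ "$minutes" -ge 15 ]; then echo "Escalate secondary"
      else echo "Primary owns"
      fi ;;
    SEV2)
      if [ "$minutes" -ge 30 ]; then echo "Escalate secondary"
      elif [ "$minutes" -ge 15 ]; then echo "Continue diagnosis"
      else echo "Primary owns"
      fi ;;
    SEV3)
      if [ "$minutes" -ge 30 ]; then echo "Review next business day"
      elif [ "$minutes" -ge 15 ]; then echo "No page"
      else echo "Ticket owner"
      fi ;;
    *)
      echo "unknown severity: $sev" >&2
      return 2 ;;
  esac
}
```

For example, `escalation_step SEV1 20` prints `Escalate secondary`. An unknown severity returns exit code 2 rather than guessing a step.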
Handoff
At shift end, send:
On-call handoff <date>
- Open incidents:
- Risky alerts:
- Pending deploys:
- Customer-impacting tickets:
- Next actions:
The next on-call must acknowledge before the previous primary is considered released.
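To avoid retyping the handoff message each week, the template above can be printed by a small function and pasted into Telegram. This is a sketch under assumptions: `handoff_template` is a hypothetical helper, and the `YYYY-MM-DD` date format is a choice made here, since the template only says `<date>`.

```shell
# Hypothetical helper: print the handoff template for a given date.
# With no argument it uses today's date in Europe/Rome (assumed format: YYYY-MM-DD).
handoff_template() {
  local day="${1:-$(TZ=Europe/Rome date +%Y-%m-%d)}"
  cat <<EOF
On-call handoff $day
- Open incidents:
- Risky alerts:
- Pending deploys:
- Customer-impacting tickets:
- Next actions:
EOF
}
```

Run `handoff_template` for today or `handoff_template 2024-01-01` for a specific date, then fill in each bullet before sending.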
Validation Commands
Use these for a fast health check during handoff:
ssh root@akira-mgmt-01-staging '
docker compose -f /opt/akira/docker-compose.yml ps
'
ssh root@akira-db-01-staging '
sudo -u postgres psql -d akira -c "
SELECT count(*) AS cdr_last_hour
FROM cdr
WHERE answered_at > NOW() - INTERVAL '\''1 hour'\'';
"
'
ssh root@akira-sipp-01-staging '
cd /opt/akira/sipp
./run_scenario.sh smoke_e2e_single.xml smoke_target.csv
'
After-Action
- SEV1 and SEV2 require a postmortem within 48 hours.
- Action items must be tracked as tasks.
- Repeated SEV3 alerts should become one cleanup task rather than repeated manual ack.
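One way to spot SEV3 alerts that deserve a cleanup task is to count how often each alert name fired during the week. This is a sketch under assumptions: `repeated_alerts` is a hypothetical helper, the input is a plain list of alert names (one per line, no spaces) that you export yourself, and the default threshold of 3 is arbitrary.

```shell
# Hypothetical helper: read alert names on stdin (one per line, no spaces)
# and print "<count> <name>" for each name seen at least $1 times (default 3).
repeated_alerts() {
  local threshold="${1:-3}"
  sort | uniq -c | awk -v min="$threshold" '$1 >= min { print $1, $2 }'
}
```

For example, `repeated_alerts 5 < sev3_alerts.txt` lists names that fired five or more times, where `sev3_alerts.txt` is a hypothetical export file.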
Caveats
- Do not deploy while handing off unless both engineers explicitly agree who owns rollback.
- Do not rely on memory for hostnames during a page. Copy commands from the matching runbook.
- Do not suppress alerts without a ticket or task explaining the reason.