On-Call Rotation - Akira Pilot

Schedule

  • Pilot Phase 1: Massimo only, 24/7, no rotation.
  • Pilot Phase 2: two-engineer weekly rotation, business hours plus best-effort weekend coverage.
  • GA: three-engineer rotation with 24/7, PagerDuty-style coverage.
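The weekly rotation in Phase 2 and GA can be sketched as a simple lookup. This is a minimal illustration, not the scheduling tool actually in use: the function name, the engineer names, and the idea of a fixed rotation start date (assumed here to be a Monday) are all hypothetical.

```python
from datetime import date


def on_call_for(day: date, engineers: list[str], rotation_start: date) -> str:
    """Return who is on call on `day`, assuming one-week shifts.

    `engineers` is the ordered rotation (two names in Phase 2, three at GA).
    `rotation_start` is a hypothetical anchor date for the first shift.
    """
    if not engineers:
        raise ValueError("rotation needs at least one engineer")
    if day < rotation_start:
        raise ValueError("day is before the rotation start")
    # Whole weeks elapsed since the first shift began.
    weeks_elapsed = (day - rotation_start).days // 7
    return engineers[weeks_elapsed % len(engineers)]
```

For example, with a rotation of `["engineer-a", "engineer-b"]` starting on a Monday, the first seven days map to `engineer-a` and the following seven to `engineer-b`.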

Tools

  • Pager: Telegram bot @akira_oncall.
  • Dashboard: Grafana at http://grafana.akira.local.
  • Status: internal /status page.
  • Comms: Telegram channel akira-oncall.
  • Runbooks: this directory, starting from README.md.

On-call duties

  • Acknowledge alerts within 15 minutes.
  • Triage and respond following incident-response.md.
  • Escalate according to the escalation matrix in README.md.
  • Update the status page for customer-affecting incidents.
  • Log incidents in docs/incidents/YYYY-MM-DD-summary.md.
  • Convert post-incident action items into tracked tasks.
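The incident log naming convention above could be generated by a small helper. This sketch assumes that "summary" in docs/incidents/YYYY-MM-DD-summary.md stands for a short slug describing the incident; if the file name is meant to end in the literal word "summary", the slug step does not apply. The function name is hypothetical.

```python
import re
from datetime import date
from pathlib import PurePosixPath

INCIDENT_DIR = PurePosixPath("docs/incidents")


def incident_log_path(day: date, summary: str) -> PurePosixPath:
    """Build docs/incidents/YYYY-MM-DD-<slug>.md from a date and a short title."""
    # Lowercase, then collapse anything that is not a-z or 0-9 into hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    if not slug:
        raise ValueError("summary must contain at least one letter or digit")
    return INCIDENT_DIR / f"{day.isoformat()}-{slug}.md"
```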

Handoff

At shift start:

  1. Review incidents from the last 7 days.
  2. Review open NOC tickets.
  3. Check deploy calendar and upcoming maintenance windows.
  4. Verify alerting and Telegram reachability.
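One way to check Telegram reachability at shift start is to call the Bot API `getMe` method, which returns a JSON object with `"ok": true` when the token is valid and the API is reachable. The sketch below is an assumption about how such a check might look, not the pilot's actual tooling: the function name and the injectable `fetch` parameter are hypothetical, and the token must be supplied by whoever operates the pager bot. It confirms only that the bot API answers, not that alerts are delivered end to end.

```python
import json
from urllib.request import urlopen


def telegram_bot_reachable(token: str, fetch=urlopen, timeout: float = 10.0) -> bool:
    """Return True if the Telegram Bot API answers getMe with ok=true.

    `fetch` defaults to urllib's urlopen and is injectable for offline testing.
    """
    url = f"https://api.telegram.org/bot{token}/getMe"
    try:
        with fetch(url, timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError):
        # Network failures (URLError is an OSError) or a non-JSON body.
        return False
    return isinstance(payload, dict) and bool(payload.get("ok"))
```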

At shift end:

  1. Send a summary to the next on-call engineer.
  2. Include open incidents, risky alerts, pending deploys, and customer-impacting tickets.
  3. Confirm the next on-call engineer has acknowledged the handoff.
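The shift-end summary can be kept consistent by rendering it from a fixed structure. The following is a minimal sketch: the class name, field names, and message layout are illustrative assumptions, and only the four categories come from the handoff steps above.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffSummary:
    """Items to pass to the next on-call engineer at shift end."""
    open_incidents: list[str] = field(default_factory=list)
    risky_alerts: list[str] = field(default_factory=list)
    pending_deploys: list[str] = field(default_factory=list)
    customer_impacting_tickets: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Return a plain-text message suitable for pasting into a chat."""
        sections = [
            ("Open incidents", self.open_incidents),
            ("Risky alerts", self.risky_alerts),
            ("Pending deploys", self.pending_deploys),
            ("Customer-impacting tickets", self.customer_impacting_tickets),
        ]
        lines = ["On-call handoff summary"]
        for title, items in sections:
            lines.append(f"{title}:")
            if items:
                lines.extend(f"  - {item}" for item in items)
            else:
                # Say "none" explicitly so an empty section is not mistaken for an omission.
                lines.append("  - none")
        return "\n".join(lines)
```

Listing empty sections as "none" makes it clear to the incoming engineer that each category was checked.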

Escalation

Use the escalation matrix in README.md. For a SEV1, page the secondary on-call engineer and Massimo if mitigation is not in progress after 15 minutes.
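The SEV1 timing rule can be written as a small check. This sketch assumes the 15-minute clock starts when the incident is opened, which the runbook does not state explicitly; the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta

SEV1_ESCALATION_DELAY = timedelta(minutes=15)


def should_escalate_sev1(opened_at: datetime, now: datetime,
                         mitigation_in_progress: bool) -> bool:
    """Return True when secondary on-call and Massimo should be paged."""
    if mitigation_in_progress:
        # Mitigation already under way: the rule does not trigger.
        return False
    # Assumption: the delay is measured from the time the incident was opened.
    return now - opened_at >= SEV1_ESCALATION_DELAY
```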