Akira - Operational Runbooks
This directory is the canonical index for Akira operational runbooks.
Quick Reference
| Situation | Runbook |
|---|---|
| Deploy new staging or production version | deploy.md |
| PostgreSQL primary down | dr.md#postgresql-primary-failover |
| App stack recovery | dr.md#app-stack-full-recovery |
| Full region disaster | dr.md#full-region-disaster |
| Frontend or API unreachable | incident-response.md#sev2-degraded |
| Incident kickoff | incident-response.md#kickoff-procedure-sev1sev2 |
| Set up on-call shift | oncall.md |
| HTTPS certificate expired or renewal failed | cert-renewal.md |
| Vault primary sealed | vault-unseal.md |
Conventions
- RTO target: maximum target recovery time after a failure.
- RPO target: maximum target data-loss window after a failure.
- Escalation: who to notify after the stated elapsed time.
- Prereq: credentials and tools required before touching production or staging.
- Caveat: known side effects and operations to avoid.
Tools Required
- Akira SSH key: ~/.ssh/akira_ed25519
- Tailscale up and connected to the Akira tailnet.
- Ansible vault password file: ~/.akira-vault-pass.txt
- Hetzner Cloud Console access: https://console.hetzner.cloud/
- 1Password vault "Akira Staging" for break-glass secrets and Vault unseal material.
- Telegram @AkiraOpsBot admin access for alert ack and fast state queries.
- Local repository at /home/devcomm/akira on the VPS or ~/work/akira on an operator laptop.
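
The file-based prerequisites in this list can be checked before starting a runbook. The following is a minimal sketch, not part of the Akira tooling: the function name `missing_prereqs` and the `require_tailscale` flag are hypothetical, and the Tailscale check only confirms that a `tailscale` binary is on PATH, not that the host is connected to the Akira tailnet.

```python
from pathlib import Path
import shutil

# Paths from the Tools Required list, relative to the operator's home directory.
REQUIRED_FILES = {
    "SSH key": ".ssh/akira_ed25519",
    "Ansible vault password file": ".akira-vault-pass.txt",
}


def missing_prereqs(home: Path, require_tailscale: bool = True) -> list[str]:
    """Return a human-readable list of prerequisites that are missing.

    Hypothetical helper: it only checks that files exist and that a
    tailscale binary is installed. It does not validate key contents,
    tailnet connectivity, or console/1Password/Telegram access.
    """
    missing = []
    for label, relative_path in REQUIRED_FILES.items():
        candidate = home / relative_path
        if not candidate.is_file():
            missing.append(f"{label}: {candidate}")
    if require_tailscale and shutil.which("tailscale") is None:
        missing.append("tailscale binary not found on PATH")
    return missing
```

Calling `missing_prereqs(Path.home())` returns an empty list when both files and the binary are present; any returned entry names what to fix before touching production or staging.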
Pilot Baselines
These runbooks reference the current pilot validation targets:
- TASK-236: single SIP smoke path validates SIPp to Kamailio to RTPengine to FreeSWITCH to CDR.
- TASK-237: pilot load target is 10 cps (calls per second), 75 s average call duration, 900 concurrent-call cap, ASR (answer-seizure ratio) at least 95%, PDD (post-dial delay) p95 under 2 s.
- TASK-238: PostgreSQL failover target is RTO under 5 minutes and RPO under 30 seconds.
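
The TASK-237 and TASK-238 numbers can be sanity-checked with simple arithmetic. The sketch below assumes steady-state load, where Little's law gives expected concurrency as arrival rate times average duration. Whether the 900 cap was actually derived this way is an assumption, and all function names are hypothetical.

```python
# Targets copied from the Pilot Baselines list above.
PILOT_CPS = 10                 # calls per second (TASK-237)
PILOT_AVG_DURATION_S = 75      # average call duration in seconds (TASK-237)
PILOT_CONCURRENT_CAP = 900     # concurrent call cap (TASK-237)
RTO_TARGET_S = 5 * 60          # failover RTO target in seconds (TASK-238)
RPO_TARGET_S = 30              # failover RPO target in seconds (TASK-238)


def steady_state_concurrency(cps: float, avg_duration_s: float) -> float:
    """Little's law: expected concurrent calls = arrival rate * mean duration."""
    return cps * avg_duration_s


def cap_headroom(cps: float, avg_duration_s: float, cap: int) -> float:
    """Concurrent calls left under the cap at steady state (negative if over)."""
    return cap - steady_state_concurrency(cps, avg_duration_s)


def failover_within_target(rto_s: float, rpo_s: float) -> bool:
    """True if a measured failover beat both targets (strictly under each)."""
    return rto_s < RTO_TARGET_S and rpo_s < RPO_TARGET_S


print(steady_state_concurrency(PILOT_CPS, PILOT_AVG_DURATION_S))            # 750
print(cap_headroom(PILOT_CPS, PILOT_AVG_DURATION_S, PILOT_CONCURRENT_CAP))  # 150
```

At the pilot target the expected load is 750 concurrent calls, which leaves 150 calls of headroom under the 900 cap. A drill that measured, for example, a 240 s recovery and 12 s of data loss would pass `failover_within_target`; those drill figures are illustrative only.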
Deploy And Release
- deploy.md - operational staging and production deploy sequence, smoke checks, rollback, tag-only redeploy.
- deploy-procedure.md - older production deploy notes.
- deploy-staging-akira.md - staging bootstrap from scratch, VM triggers, known gotchas, and smoke matrix.
- deploy-signaling.md - signaling layer playbook, debug tags, and rollback notes.
- git-hygiene.md - branch and runner hygiene.
- supply-chain-security.md - SBOM and supply-chain checks.
Incident, Support, And On-Call
- incident-response.md - SEV triage, mitigation, resolution, and postmortem template.
- oncall.md - pilot on-call rotation, handoff, escalation, and AgentCore queries.
- oncall-rotation.md - older on-call schedule notes.
- customer-support.md - NOC versus billing ticket flow.
- postmortem-template.md - standalone post-incident template.
- alertmanager-setup.md - Alertmanager setup.
- background-jobs.md - worker registry and operations.
- observability-scrape-targets.md - adding Prometheus scrape targets.
- postgres-exporter-monitoring.md - Postgres exporter monitoring.
- homer-sip-trace.md - Homer/heplify-server SIP trace lookup topology, schema checks, and UTC gotchas.
Security And Access
- cert-renewal.md - Caddy certificate renewal and manual recovery.
- vault-unseal.md - Vault auto-unseal and manual recovery.
- vault-auto-unseal.md - detailed auto-unseal setup.
- vault-rotation.md - rolling Vault secret rotation.
- secret-rotation.md - manifest-based rotation helper.
- rbac-management.md - RBAC permission management.
- fail2ban-operations.md - Fail2Ban operations.
- fraud-detection-test.md - fraud detection test guide.
- break-glass-tailscale-down.md - break-glass access when Tailscale is down.
Disaster Recovery And Backups
- dr.md - PostgreSQL failover, app stack recovery, and region disaster procedures.
- backup-pg-nightly.md - nightly PostgreSQL backup.
- backup-cross-region.md - offsite Storage Box mirror.
- dr-restore-procedure.md - general DR restore procedure.
- dr-restore-pg-timescale.md - PostgreSQL and TimescaleDB restore.
- dr-restore-drill.md - DR restore drill.
- _drill-log.md - DR drill log.
Infrastructure Operations
- capacity-scaling.md - capacity scaling and trigger thresholds.
- nats-cluster-migration.md - NATS migration from pilot to GA.
- postgres-perf-tuning.md - Postgres performance tuning.
- keepalived-setup.md - Keepalived secret requirements.
- recording-storage-setup.md - Hetzner Storage Box setup for recordings.