Runbook - Disaster Recovery
Scope
This runbook covers pilot disaster recovery for PostgreSQL failover, management app stack recovery, and full-region disaster handling.
Pilot targets from TASK-238:
- PostgreSQL failover RTO target: under 5 minutes.
- PostgreSQL failover RPO target: under 30 seconds.
- CDR ingestion resume target: under 60 seconds after backend switch.
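The targets above can be written down once so that measured values are always compared against the same numbers. This is a minimal sketch: the variable names and the within_target helper are hypothetical and are not part of TASK-238 or of any existing Akira tooling.

```shell
# Pilot targets from TASK-238, in seconds (variable names are illustrative).
PG_FAILOVER_RTO_MAX=300   # under 5 minutes
PG_FAILOVER_RPO_MAX=30    # under 30 seconds
CDR_RESUME_MAX=60         # under 60 seconds after backend switch

# within_target LABEL MEASURED_SECONDS LIMIT_SECONDS
# Prints PASS or FAIL; returns 0 only when the measured value is strictly under the limit.
within_target() {
  if [ "$2" -lt "$3" ]; then
    echo "PASS $1: ${2}s < ${3}s"
  else
    echo "FAIL $1: ${2}s >= ${3}s"
    return 1
  fi
}

# Example: within_target "failover RTO" 212 "$PG_FAILOVER_RTO_MAX"
```

Only measured values should be passed in; the helper compares, it does not estimate.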
Prerequisites
- Tailscale connected.
- SSH access to akira-db-01-staging, akira-db-02-staging, akira-mgmt-01-staging, and akira-sipp-01-staging.
- ~/.akira-vault-pass.txt present.
- Confirm the incident channel has an owner before starting a destructive action.
PostgreSQL Primary Failover
Symptoms
- Alert PostgresPrimaryDown.
- Backend or CDR worker restart loop with database connection refused.
- Grafana database connection panels drop to zero.
- CDR ingestion stops increasing.
Diagnostics
ssh root@akira-db-01-staging '
systemctl status postgresql --no-pager
'
ssh root@akira-db-02-staging '
sudo -u postgres psql -c "SELECT pg_is_in_recovery();"
sudo -u postgres psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
'
ssh root@akira-mgmt-01-staging '
docker ps --filter "name=akira-backend" --format "{{.Names}} {{.Status}}"
docker logs --tail 80 akira-backend
'
Continue only if both conditions hold: db-02 is healthy enough to promote, and db-01 is either unrecoverable or would take longer than the RTO target to recover.
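Part of judging whether db-02 is healthy enough is reading the replication lag as a number. The sketch below assumes the lag is fetched in seconds with EXTRACT(EPOCH ...) rather than as an interval; the lag_verdict helper and its 30-second default (taken from the RPO target) are illustrative. Note that now() - pg_last_xact_replay_timestamp() keeps growing while the primary is down or idle, so read the value alongside the time the alert fired.

```shell
# lag_verdict LAG_SECONDS [RPO_SECONDS]
# LAG_SECONDS is the (possibly fractional) output of, for example:
#   sudo -u postgres psql -Atc \
#     "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp());"
# An empty value means the query returned NULL (nothing replayed yet).
lag_verdict() {
  lag_whole=${1%%.*}   # drop the fractional part for integer comparison
  rpo=${2:-30}
  case "$lag_whole" in
    ''|*[!0-9]*) echo "unknown"; return 2 ;;
  esac
  if [ "$lag_whole" -lt "$rpo" ]; then
    echo "within-rpo"
  else
    echo "exceeds-rpo"
    return 1
  fi
}
```

An unknown or exceeds-rpo result is not a reason to abort by itself, but it should be recorded in the incident channel before promoting.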
Recovery Commands
Record the start time:
DR_START=$(date +%s)
date -u -Iseconds
Promote the replica:
ssh root@akira-db-02-staging '
sudo -u postgres pg_ctl promote -D /var/lib/postgresql/16/main
sleep 2
sudo -u postgres psql -c "SELECT pg_is_in_recovery();"
'
Expected: pg_is_in_recovery returns f (false), meaning db-02 has left recovery and accepts writes.
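Promotion can take longer than a fixed two-second sleep. One option is to poll until the replica reports that it has left recovery. This is a sketch only: retry_until and is_promoted are hypothetical helpers, and the attempt count and delay are arbitrary.

```shell
# retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry_until() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    n=$((n + 1))
    # Sleep only when another attempt will follow.
    [ "$n" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Succeeds once db-02 answers "f" to pg_is_in_recovery().
is_promoted() {
  [ "$(ssh root@akira-db-02-staging \
      'sudo -u postgres psql -Atc "SELECT pg_is_in_recovery();"')" = "f" ]
}

# Example: retry_until 15 2 is_promoted || echo "db-02 still in recovery"
```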
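Switch backend and CDR worker to db-02. If the containers receive DATABASE_URL through Compose (env_file or variable interpolation) rather than reading /opt/akira/.env themselves, docker compose restart does not apply the changed value; in that case run docker compose up -d for the same two services instead of restart: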
ssh root@akira-mgmt-01-staging '
sed -i "s|@akira-db-01-staging|@akira-db-02-staging|g" /opt/akira/.env
grep DATABASE_URL /opt/akira/.env
docker compose -f /opt/akira/docker-compose.yml restart backend cdr-worker
sleep 5
docker ps --filter "name=akira-backend" --filter "name=akira-cdr-worker" \
--format "{{.Names}}: {{.Status}}"
'
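The grep in the step above shows the new DATABASE_URL but leaves the comparison to the operator. A small parser can make the check explicit. This sketch assumes the URL has the shape scheme://user:pass@host:port/dbname; db_url_host is a hypothetical helper, and IPv6 literals are not handled.

```shell
# db_url_host URL
# Prints the host part of a URL shaped like scheme://user:pass@host:port/dbname.
db_url_host() {
  rest=${1#*://}       # strip the scheme
  rest=${rest##*@}     # strip credentials (everything up to the last @)
  rest=${rest%%/*}     # strip the path
  rest=${rest%%\?*}    # strip a query string when there was no path
  echo "${rest%%:*}"   # strip the port
}

# Example (run from the operator workstation):
#   url=$(ssh root@akira-mgmt-01-staging \
#     'grep "^DATABASE_URL=" /opt/akira/.env | cut -d= -f2-')
#   [ "$(db_url_host "$url")" = "akira-db-02-staging" ] \
#     || echo "backend still points elsewhere"
```

Because sed -i edits the file in place, taking a copy of /opt/akira/.env before the switch keeps a rollback path; the copy's name and location are left to the operator.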
Validation
ssh root@akira-db-02-staging '
sudo -u postgres psql -d akira -c "
SELECT count(*) AS cdr_count, max(answered_at) AS latest_cdr
FROM cdr;
"
'
ssh root@akira-sipp-01-staging '
cd /opt/akira/sipp
./run_scenario.sh smoke_e2e_single.xml smoke_target.csv
'
ssh root@akira-db-02-staging '
sudo -u postgres psql -d akira -c "
SELECT count(*) AS cdr_last_5m
FROM cdr
WHERE answered_at > NOW() - INTERVAL '\''5 minutes'\'';
"
'
echo "RTO seconds: $(($(date +%s) - DR_START))"
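The validation above prints counts but does not compare them. A possible way to turn it into a pass or fail result is to capture the CDR count before and after the SIPp smoke scenario. The cdr_count and cdr_growth helpers are hypothetical names, not existing scripts.

```shell
# cdr_count: prints the current number of rows in cdr on db-02.
cdr_count() {
  ssh root@akira-db-02-staging \
    'sudo -u postgres psql -d akira -Atc "SELECT count(*) FROM cdr;"'
}

# cdr_growth BEFORE AFTER
# Returns 0 only when the count increased between the two samples.
cdr_growth() {
  delta=$(( $2 - $1 ))
  if [ "$delta" -gt 0 ]; then
    echo "CDR count grew by $delta"
  else
    echo "CDR count did not grow (delta $delta)"
    return 1
  fi
}

# Example:
#   before=$(cdr_count)
#   ... run the SIPp smoke scenario ...
#   after=$(cdr_count)
#   cdr_growth "$before" "$after"
```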
Rebuild Old Primary As Replica
Caveat: this deletes the old db-01 data directory. Run only after confirming db-02 is the accepted primary.
ssh root@akira-db-01-staging '
systemctl stop postgresql
rm -rf /var/lib/postgresql/16/main/*
sudo -u postgres pg_basebackup \
-h akira-db-02-staging \
-U replicator \
-D /var/lib/postgresql/16/main \
-X stream \
-P -R
systemctl start postgresql
sleep 3
sudo -u postgres psql -c "SELECT pg_is_in_recovery();"
'
Expected: pg_is_in_recovery returns t (true) on db-01, meaning it is running as a standby.
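A true result only shows that db-01 is in recovery mode, not that it is streaming from db-02. The new primary's pg_stat_replication view can confirm the link. The sketch below assumes psql output in unaligned, tuples-only form (client_addr|state per row); has_streaming_replica is a hypothetical helper.

```shell
# has_streaming_replica
# Reads "client_addr|state" rows on stdin; succeeds if any row is streaming.
has_streaming_replica() {
  awk -F'|' '$2 == "streaming" { found = 1 } END { exit found ? 0 : 1 }'
}

# Example:
#   ssh root@akira-db-02-staging \
#     'sudo -u postgres psql -Atc "SELECT client_addr, state FROM pg_stat_replication;"' \
#     | has_streaming_replica || echo "db-01 is not streaming from db-02"
```

A state of catchup right after the base backup is normal; re-check until it reaches streaming.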
Escalation
| Elapsed Time | Action |
|---|---|
| T+10 min without recovery | Escalate to Massimo by Telegram DM. |
| T+20 min without recovery | Escalate to Francesco. |
| T+30 min without recovery | Stop experiments and prepare backup restore. |
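The table can be mirrored in a small helper so the next escalation step is derived from elapsed time rather than memory. This is a sketch: escalation_step is a hypothetical name, and the message for the first ten minutes is an assumption, since the table does not define one.

```shell
# escalation_step ELAPSED_MINUTES
# Prints the action from the escalation table for the given elapsed time.
escalation_step() {
  if [ "$1" -ge 30 ]; then
    echo "Stop experiments and prepare backup restore."
  elif [ "$1" -ge 20 ]; then
    echo "Escalate to Francesco."
  elif [ "$1" -ge 10 ]; then
    echo "Escalate to Massimo by Telegram DM."
  else
    # Not in the table; wording is illustrative.
    echo "Continue recovery; no escalation yet."
  fi
}

# Example, reusing DR_START from the recovery step:
#   escalation_step $(( ($(date +%s) - DR_START) / 60 ))
```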
App Stack Full Recovery
Use this when akira-mgmt-01-staging is lost or Docker services are corrupted.
RTO target: 30 minutes from host availability.
App Diagnostics
ssh root@akira-mgmt-01-staging '
uptime
docker ps
docker compose -f /opt/akira/docker-compose.yml ps
journalctl -u docker --since "30 min ago" --no-pager | tail -80
'
App Recovery Commands
If the host is reachable, redeploy management:
cd /home/devcomm/akira/infra
ansible-playbook -i inventory/staging.yml playbooks/deploy_management.yml \
--vault-password-file ~/.akira-vault-pass.txt
If the host has been replaced from scratch, run bootstrap for management after inventory points to the new host:
cd /home/devcomm/akira/infra
ansible-playbook -i inventory/staging.yml playbooks/bootstrap_all.yml \
--vault-password-file ~/.akira-vault-pass.txt \
--limit management
App Validation
curl -fsS -I https://mgmt.akira-staging.asheep.it/healthz
curl -fsS -I https://grafana.akira-staging.asheep.it
ssh root@akira-mgmt-01-staging '
docker ps --format "{{.Names}} {{.Status}}"
'
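The docker ps listing above still has to be read by eye. A possible filter that prints only the containers needing attention is sketched below. It assumes the "{{.Names}} {{.Status}}" format used in this runbook; containers_not_ok is a hypothetical helper.

```shell
# containers_not_ok
# Reads "name status..." lines on stdin and prints the names of containers
# whose status does not start with "Up" or that report "(unhealthy)".
# Returns 1 when at least one such container is found.
containers_not_ok() {
  awk 'NF && ($2 != "Up" || /\(unhealthy\)/) { print $1; bad = 1 }
       END { exit bad ? 1 : 0 }'
}

# Example:
#   ssh root@akira-mgmt-01-staging \
#     'docker ps -a --format "{{.Names}} {{.Status}}"' | containers_not_ok
```

The example uses docker ps -a because exited containers do not appear in plain docker ps output.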
Escalation: after 15 minutes without the management stack running, page the secondary on-call. After 30 minutes, escalate to Massimo.
Full-Region Disaster
Full-region disaster is out of pilot automation scope. Use this section to stabilize the incident and avoid partial recovery.
RTO target: best effort during pilot. A formal multi-region target belongs to the production wave.
- Declare SEV1 in the incident channel.
- Freeze deploys and destructive maintenance.
- Confirm whether Hetzner region, tailnet, DNS, or Akira hosts are affected.
- Preserve the latest available backup and WAL archive.
- Start restore planning from dr-restore-procedure.md and dr-restore-pg-timescale.md.
- Recover the database first, then stateful dependencies, then management, then signaling.
Validation is the same as a deploy smoke test, plus a check that the CDR count keeps growing.
Escalation: page Massimo immediately, page Francesco at T+15 minutes, and keep updates every 15 minutes until a recovery path is chosen.
Caveats
- Never promote a replica before checking whether split brain is possible.
- Do not delete old primary data until the new primary is validated.
- Do not invent RTO or RPO metrics. Record actual timestamps during the event.
- While a healthy replica exists, treat restore from backup as the last resort.
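For the split-brain caveat, one possible pre-promotion check is to probe the old primary with pg_isready and interpret its exit status (0 accepting, 1 rejecting, 2 no response, 3 no attempt made). This is a sketch: old_primary_state and its labels are hypothetical, and a no-response result can also mean a network partition, so it does not rule out split brain on its own.

```shell
# old_primary_state PG_ISREADY_EXIT_STATUS
# Maps the pg_isready exit status for db-01 to a promotion hint.
old_primary_state() {
  case "$1" in
    0) echo "unsafe: db-01 accepts connections, split brain is possible" ;;
    1) echo "wait: db-01 is starting up or shutting down, re-check" ;;
    2) echo "likely-down: no response, confirm the service is stopped" ;;
    *) echo "unknown: the check did not run, fix it before deciding" ;;
  esac
}

# Example (from a host that can reach db-01 on the PostgreSQL port):
#   pg_isready -h akira-db-01-staging -t 5; old_primary_state $?
```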