Runbook - Disaster Recovery

Scope

This runbook covers pilot disaster recovery for PostgreSQL failover, management app stack recovery, and full-region disaster handling.

Pilot targets from TASK-238:

  • PostgreSQL failover RTO target: under 5 minutes.
  • PostgreSQL failover RPO target: under 30 seconds.
  • CDR ingestion resume target: under 60 seconds after backend switch.
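
The targets above are easier to record consistently if each measured value is compared against its threshold in the same way. The following is a minimal sketch; check_target is a hypothetical helper, not part of the Akira tooling, and it assumes whole-second measurements.

```shell
# check_target NAME MEASURED_SECONDS TARGET_SECONDS
# Hypothetical helper: prints one line and returns 0 only when the
# measured value is strictly under the target.
check_target() {
  name=$1 measured=$2 target=$3
  if [ "$measured" -lt "$target" ]; then
    echo "$name: ${measured}s (target < ${target}s) OK"
  else
    echo "$name: ${measured}s (target < ${target}s) MISSED"
    return 1
  fi
}
```

For example, at the end of a failover drill: check_target "PostgreSQL failover RTO" "$(($(date +%s) - DR_START))" 300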

Prerequisites

  • Tailscale connected.
  • SSH access to akira-db-01-staging, akira-db-02-staging, akira-mgmt-01-staging, and akira-sipp-01-staging.
  • ~/.akira-vault-pass.txt present.
  • Confirm the incident channel has an owner before starting any destructive action.

PostgreSQL Primary Failover

Symptoms

  • Alert PostgresPrimaryDown.
  • Backend or CDR worker stuck in a restart loop with database "connection refused" errors.
  • Grafana database connection panels drop to zero.
  • CDR count stops increasing.

Diagnostics

ssh root@akira-db-01-staging '
systemctl status postgresql --no-pager
'

ssh root@akira-db-02-staging '
sudo -u postgres psql -c "SELECT pg_is_in_recovery();"
sudo -u postgres psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
'

ssh root@akira-mgmt-01-staging '
docker ps --filter "name=akira-backend" --format "{{.Names}} {{.Status}}"
docker logs --tail 80 akira-backend
'

Continue only if both conditions hold: db-02 is healthy enough to promote, and db-01 is either unrecoverable or cannot be recovered within the RTO target.
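
One way to make "healthy enough to promote" concrete is to compare the replica's replay lag with the 30-second RPO target. The sketch below is an assumption about how that check could look; lag_within_rpo is a hypothetical helper. Note that once the primary is down the lag keeps growing with wall-clock time, so the value is most meaningful when sampled close to the outage.

```shell
# lag_within_rpo LAG_SECONDS [RPO_SECONDS]
# Hypothetical helper: returns 0 when the replay lag is under the RPO
# target (default 30 s). An empty lag (no replay timestamp) fails.
lag_within_rpo() {
  [ -n "$1" ] || return 1
  awk -v lag="$1" -v rpo="${2:-30}" 'BEGIN { exit !(lag + 0 < rpo + 0) }'
}

# Assumed usage; EXTRACT(EPOCH ...) converts the interval to seconds:
#   lag=$(ssh root@akira-db-02-staging \
#     'sudo -u postgres psql -Atc "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp());"')
#   lag_within_rpo "$lag" && echo "db-02 lag is within the RPO target"
```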

Recovery Commands

Record the start time:

DR_START=$(date +%s)
date -u -Iseconds

Promote the replica:

ssh root@akira-db-02-staging '
sudo -u postgres pg_ctl promote -D /var/lib/postgresql/16/main
sleep 2
sudo -u postgres psql -c "SELECT pg_is_in_recovery();"
'

Expected: pg_is_in_recovery is false.
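
A fixed sleep may be too short if promotion takes longer than expected. As a sketch, the check can be polled until the expected value appears; wait_until_equals is a hypothetical helper, and the 30-second timeout in the usage note is an arbitrary assumption.

```shell
# wait_until_equals EXPECTED TIMEOUT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND about once per second until its
# output equals EXPECTED, or gives up after TIMEOUT_SECONDS attempts.
wait_until_equals() {
  expected=$1 timeout=$2
  shift 2
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if [ "$("$@" 2>/dev/null)" = "$expected" ]; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}
```

With psql -At, a boolean prints as t or f, so the promotion check could be run as: wait_until_equals f 30 ssh root@akira-db-02-staging 'sudo -u postgres psql -Atc "SELECT pg_is_in_recovery();"'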

Switch backend and CDR worker to db-02:

ssh root@akira-mgmt-01-staging '
sed -i "s|@akira-db-01-staging|@akira-db-02-staging|g" /opt/akira/.env
grep DATABASE_URL /opt/akira/.env
docker compose -f /opt/akira/docker-compose.yml restart backend cdr-worker
sleep 5
docker ps --filter "name=akira-backend" --filter "name=akira-cdr-worker" \
--format "{{.Names}}: {{.Status}}"
'
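
Before editing /opt/akira/.env in place, the substitution can be rehearsed on a throwaway file. The sketch below keeps a backup copy so the switch can be reverted; switch_db_host is a hypothetical helper and the DATABASE_URL value is a made-up example, not the real connection string.

```shell
# switch_db_host ENV_FILE OLD_HOST NEW_HOST
# Hypothetical helper: keeps ENV_FILE.bak, then rewrites "@OLD_HOST"
# to "@NEW_HOST" in ENV_FILE.
switch_db_host() {
  env_file=$1 old_host=$2 new_host=$3
  cp -p "$env_file" "$env_file.bak" || return 1
  sed "s|@${old_host}|@${new_host}|g" "$env_file.bak" > "$env_file"
}

# Dry run on a temporary file; the URL below is an invented sample value.
tmp_env=$(mktemp)
echo 'DATABASE_URL=postgresql://akira:example@akira-db-01-staging:5432/akira' > "$tmp_env"
switch_db_host "$tmp_env" akira-db-01-staging akira-db-02-staging
cat "$tmp_env"
```

Matching on the leading "@" limits the substitution to the host part of a connection URL, which is the same choice the sed command above makes.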

Validation

ssh root@akira-db-02-staging '
sudo -u postgres psql -d akira -c "
SELECT count(*) AS cdr_count, max(answered_at) AS latest_cdr
FROM cdr;
"
'

ssh root@akira-sipp-01-staging '
cd /opt/akira/sipp
./run_scenario.sh smoke_e2e_single.xml smoke_target.csv
'

ssh root@akira-db-02-staging '
sudo -u postgres psql -d akira -c "
SELECT count(*) AS cdr_last_5m
FROM cdr
WHERE answered_at > NOW() - INTERVAL '\''5 minutes'\'';
"
'

echo "RTO seconds: $(($(date +%s) - DR_START))"

Rebuild Old Primary As Replica

Caveat: this deletes the old db-01 data directory. Run only after confirming db-02 is the accepted primary.

ssh root@akira-db-01-staging '
systemctl stop postgresql
rm -rf /var/lib/postgresql/16/main/*
sudo -u postgres pg_basebackup \
-h akira-db-02-staging \
-U replicator \
-D /var/lib/postgresql/16/main \
-X stream \
-P -R
systemctl start postgresql
sleep 3
sudo -u postgres psql -c "SELECT pg_is_in_recovery();"
'

Expected: db-01 returns true for pg_is_in_recovery.
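
Because the rebuild wipes a data directory, a typed confirmation of the target host can reduce the chance of running it against the wrong node. This is only a sketch; confirm_host is a hypothetical helper and is not part of the existing procedure.

```shell
# confirm_host TARGET_HOST
# Hypothetical guard: succeeds only when the operator types the exact
# host name on standard input.
confirm_host() {
  target=$1
  printf 'Type "%s" to confirm wiping its data directory: ' "$target" >&2
  read -r answer || return 1
  [ "$answer" = "$target" ]
}
```

It could gate the rebuild like this: confirm_host akira-db-01-staging && echo "proceed with rebuild"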

Escalation

Elapsed Time                 Action
T+10 min without recovery    Escalate to Massimo by Telegram DM.
T+20 min without recovery    Escalate to Francesco.
T+30 min without recovery    Stop experiments and prepare backup restore.
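
The escalation thresholds can be derived from the DR_START timestamp recorded earlier. A minimal sketch follows; escalation_action is a hypothetical helper, and the "Continue recovery." message for the first ten minutes is an assumption added for illustration.

```shell
# escalation_action ELAPSED_MINUTES
# Hypothetical helper: maps elapsed minutes without recovery to the
# action listed in the escalation table.
escalation_action() {
  minutes=$1
  if [ "$minutes" -ge 30 ]; then
    echo "Stop experiments and prepare backup restore."
  elif [ "$minutes" -ge 20 ]; then
    echo "Escalate to Francesco."
  elif [ "$minutes" -ge 10 ]; then
    echo "Escalate to Massimo by Telegram DM."
  else
    echo "Continue recovery."  # assumed default, not from the table
  fi
}
```

Usage during an incident: escalation_action "$(( ($(date +%s) - DR_START) / 60 ))"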

App Stack Full Recovery

Use this when akira-mgmt-01-staging is lost or Docker services are corrupted.

RTO target: 30 minutes from host availability.

App Diagnostics

ssh root@akira-mgmt-01-staging '
uptime
docker ps
docker compose -f /opt/akira/docker-compose.yml ps
journalctl -u docker --since "30 min ago" --no-pager | tail -80
'

App Recovery Commands

If the host is reachable, redeploy management:

cd /home/devcomm/akira/infra
ansible-playbook -i inventory/staging.yml playbooks/deploy_management.yml \
--vault-password-file ~/.akira-vault-pass.txt

If the host has been replaced from scratch, run bootstrap for management after inventory points to the new host:

cd /home/devcomm/akira/infra
ansible-playbook -i inventory/staging.yml playbooks/bootstrap_all.yml \
--vault-password-file ~/.akira-vault-pass.txt \
--limit management

App Validation

curl -fsS -I https://mgmt.akira-staging.asheep.it/healthz
curl -fsS -I https://grafana.akira-staging.asheep.it

ssh root@akira-mgmt-01-staging '
docker ps --format "{{.Names}} {{.Status}}"
'

Escalation: after 15 minutes without the management stack running, page the secondary on-call. After 30 minutes, escalate to Massimo.

Full-Region Disaster

Full-region disaster is out of pilot automation scope. Use this section to stabilize the incident and avoid partial recovery.

RTO target: best effort during pilot. A formal multi-region target belongs to the production wave.

  1. Declare SEV1 in the incident channel.
  2. Freeze deploys and destructive maintenance.
  3. Confirm whether Hetzner region, tailnet, DNS, or Akira hosts are affected.
  4. Preserve the latest available backup and WAL archive.
  5. Start restore planning from dr-restore-procedure.md and dr-restore-pg-timescale.md.
  6. Recover the database first, then stateful dependencies, then management, then signaling.
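
The recovery order in step 6 can be enforced by a small runner that stops at the first failing step, so later tiers are never started on top of a broken dependency. This is a sketch only; run_in_order is hypothetical, and each step name passed to it is assumed to be a command or shell function you define yourself.

```shell
# run_in_order STEP [STEP...]
# Hypothetical runner: executes each step in the given order and stops
# at the first failure, leaving later steps untouched.
run_in_order() {
  for step in "$@"; do
    echo "==> $step"
    if ! "$step"; then
      echo "FAILED: $step (later steps not started)" >&2
      return 1
    fi
  done
}

# Assumed usage, with placeholder function names:
#   run_in_order recover_database recover_stateful recover_management recover_signaling
```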

Validation: run the deploy smoke test and confirm that the CDR count keeps growing monotonically.
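
Monotonic CDR growth can be checked by sampling the count several times and comparing the samples. In the sketch below, cdr_growing is a hypothetical helper; how often and how many times to sample is left as an assumption for the operator.

```shell
# cdr_growing COUNT [COUNT...]
# Hypothetical helper: returns 0 when the samples never decrease and
# the last sample is larger than the first.
cdr_growing() {
  prev=""
  first=""
  for count in "$@"; do
    [ -n "$first" ] || first=$count
    if [ -n "$prev" ] && [ "$count" -lt "$prev" ]; then
      return 1
    fi
    prev=$count
  done
  [ -n "$prev" ] && [ "$prev" -gt "$first" ]
}
```

Each sample could come from the count(*) query used in the failover validation, taken a few seconds apart while test calls are running.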

Escalation: page Massimo immediately, page Francesco at T+15 minutes, and keep updates every 15 minutes until a recovery path is chosen.

Caveats

  • Never promote a replica before checking whether split brain is possible.
  • Do not delete old primary data until the new primary is validated.
  • Do not invent RTO or RPO metrics. Record actual timestamps during the event.
  • Restore from backup only as a last resort; while a healthy replica exists, promote it instead.
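
The split-brain caveat can be checked by comparing the pg_is_in_recovery result from both nodes. The following sketch classifies the pair; classify_roles is a hypothetical helper, the message wording is illustrative, and any value other than t or f is treated as "node unreachable".

```shell
# classify_roles DB01_IN_RECOVERY DB02_IN_RECOVERY
# Hypothetical helper: arguments are the psql -At output of
# SELECT pg_is_in_recovery() on db-01 and db-02 ("t" or "f").
classify_roles() {
  case "$1/$2" in
    f/f) echo "SPLIT BRAIN RISK: both nodes accept writes" ;;
    f/t) echo "db-01 primary, db-02 replica" ;;
    t/f) echo "db-02 primary, db-01 replica" ;;
    t/t) echo "no primary: both nodes in recovery" ;;
    */f) echo "db-02 primary, db-01 state unknown: fence db-01 before trusting this" ;;
    */t) echo "db-02 replica, db-01 state unknown" ;;
    *)   echo "state unknown" ;;
  esac
}
```

An unreachable db-01 is not proof that it has stopped accepting writes from other clients, which is why the sketch reports that case separately.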