PostgreSQL failover DR drill

Scope

Manual staging drill for TASK-238: promote akira-db-02-staging, switch the management backend to the promoted primary, validate CDR ingestion, rebuild the old primary as a replica, then fail back to the inventory primary akira-db-01-staging.

This drill is destructive. Run it from the operator Mac with Tailscale connected and SSH access to the staging hosts. The code runner is code-only and does not execute staging drills.

Targets:

RTO: under 300 seconds for service recovery on promoted db-02.
RPO: under 30 seconds, measured as replica replay lag immediately before stopping db-01.
CDR loss: under 5 rows against the 1 cps drill workload.
Backend CDR ingestion resumes within 60 seconds after backend switch.

Prerequisites

Tailscale is up on the operator Mac and can resolve Akira Magic DNS names.
SSH as root works for:
- akira-db-01-staging
- akira-db-02-staging
- akira-mgmt-01-staging
- akira-sipp-01-staging
The operator has confirmed this is staging, not production.
infra/scripts/dr-pg-failover-rto.sh is available from the repo root.
SIPp scenario files exist under /opt/akira/sipp on akira-sipp-01-staging.
Incident/channel owner is aware before the destructive stop.

Pre-Flight

ssh root@akira-db-01-staging '
  sudo -u postgres psql -Atq -c "
    SELECT pg_is_in_recovery()::text || '\''|'\'' || current_setting('\''wal_level'\'');
  "
'

Expected: false|replica.

ssh root@akira-db-02-staging '
  sudo -u postgres psql -Atq -c "SELECT pg_is_in_recovery();"
  sudo -u postgres psql -Atq -c "
    SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())::int, 0);
  "
'

Expected: recovery is t or true, replication lag is under 5 seconds.

ssh root@akira-mgmt-01-staging '
  grep -E "^(DATABASE_URL|DATABASE_REPLICA_URL|ALEMBIC_DATABASE_URL)=" /opt/akira/.env
'

Expected: write database URL points at akira-db-01-staging or 100.64.1.11.

Automated Drill

From repo root on the operator Mac:

DR_CONFIRM_STAGING=YES DR_EXECUTE=1 infra/scripts/dr-pg-failover-rto.sh

Default behavior includes the recommended post-drill failback:

Snapshot CDR count on db-01.
Start SIPp background load at 1 cps for 10 minutes.
Measure replay lag on db-02 immediately before the primary stop.
Stop PostgreSQL on db-01.
Promote db-02.
Switch /opt/akira/.env on mgmt-01 from db-01 to db-02.
Restart backend and cdr-worker.
Validate CDR ingestion on db-02.
Re-clone db-01 as replica from db-02.
Fail back to db-01 to keep inventory and reality aligned.
Re-clone db-02 as replica from db-01.
Write DR-DRILL-{date}.md under ~/Documents/Claude/Projects/Akira.

To intentionally leave db-02 as primary after step 9, set DR_FAILBACK=0. That mode is for explicit operator choice only; the standard staging drill should keep DR_FAILBACK=1.

Manual Steps

Use these steps if the script cannot be used. Keep a local timeline with date -u -Iseconds after each step.

Start background workload:

ssh root@akira-sipp-01-staging '
  cd /opt/akira/sipp
  nohup sipp -sf scenarios/smoke_e2e_single.xml \
    -inf smoke_target.csv \
    -r 1 \
    -m 600 \
    -mi 100.64.0.51 \
    100.64.0.21:5060 \
    > /tmp/sipp_drill_background.log 2>&1 &
  echo "$!"
'

Stop db-01 and promote db-02:

ssh root@akira-db-01-staging 'systemctl stop postgresql'

ssh root@akira-db-02-staging '
  sudo -u postgres pg_ctl promote -D /var/lib/postgresql/16/main
  sleep 2
  sudo -u postgres psql -Atq -c "SELECT pg_is_in_recovery();"
'

Switch backend:

ssh root@akira-mgmt-01-staging '
  cp /opt/akira/.env /opt/akira/.env.pre-task-238-$(date -u +%Y%m%d%H%M%S)
  sed -i \
    -e "s|akira-db-01-staging|akira-db-02-staging|g" \
    -e "s|100.64.1.11|100.64.1.12|g" \
    /opt/akira/.env
  docker compose -f /opt/akira/docker-compose.yml restart backend cdr-worker
'

Validate CDR ingestion:

ssh root@akira-db-02-staging '
  sudo -u postgres psql -d akira -c "
    SELECT count(*) AS cdr_count, max(answered_at) AS latest_cdr
    FROM cdr;
  "
'

Re-clone db-01 as replica:

ssh root@akira-db-01-staging '
  systemctl stop postgresql || true
  find /var/lib/postgresql/16/main -mindepth 1 -maxdepth 1 -exec rm -rf {} +
  sudo -u postgres pg_basebackup \
    -h akira-db-02-staging \
    -U replicator \
    -D /var/lib/postgresql/16/main \
    -X stream \
    -P -R
  systemctl start postgresql
  sleep 3
  sudo -u postgres psql -Atq -c "SELECT pg_is_in_recovery();"
'

Recommended failback to original topology:

ssh root@akira-db-02-staging 'systemctl stop postgresql'

ssh root@akira-db-01-staging '
  sudo -u postgres pg_ctl promote -D /var/lib/postgresql/16/main
  sleep 2
  sudo -u postgres psql -Atq -c "SELECT pg_is_in_recovery();"
'

ssh root@akira-mgmt-01-staging '
  cp /opt/akira/.env /opt/akira/.env.pre-task-238-failback-$(date -u +%Y%m%d%H%M%S)
  sed -i \
    -e "s|akira-db-02-staging|akira-db-01-staging|g" \
    -e "s|100.64.1.12|100.64.1.11|g" \
    /opt/akira/.env
  docker compose -f /opt/akira/docker-compose.yml restart backend cdr-worker
'

ssh root@akira-db-02-staging '
  systemctl stop postgresql || true
  find /var/lib/postgresql/16/main -mindepth 1 -maxdepth 1 -exec rm -rf {} +
  sudo -u postgres pg_basebackup \
    -h akira-db-01-staging \
    -U replicator \
    -D /var/lib/postgresql/16/main \
    -X stream \
    -P -R
  systemctl start postgresql
'

Final Validation

ssh root@akira-db-01-staging '
  sudo -u postgres psql -Atq -c "SELECT pg_is_in_recovery();"
'

ssh root@akira-db-02-staging '
  sudo -u postgres psql -Atq -c "SELECT pg_is_in_recovery();"
  sudo -u postgres psql -Atq -c "
    SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())::int, 0);
  "
'

ssh root@akira-mgmt-01-staging '
  grep -E "^(DATABASE_URL|DATABASE_REPLICA_URL|ALEMBIC_DATABASE_URL)=" /opt/akira/.env
  docker compose -f /opt/akira/docker-compose.yml ps backend cdr-worker
'

Expected final state:

akira-db-01-staging is primary: pg_is_in_recovery=false.
akira-db-02-staging is replica: pg_is_in_recovery=true.
db-02 replication lag is under 5 seconds.
Management write database URL points to db-01.
Backend and CDR worker are running.
Inventory still matches reality: pg_role: primary on db-01 and pg_role: replica on db-02.

Report

Use tests/dr/expected_outcome.md as the required report format. Store the completed report as:

~/Documents/Claude/Projects/Akira/DR-DRILL-{date}.md

Rollback Notes

If promotion fails, stop the drill and prepare backup/WAL restore using dr-restore-pg-timescale.md.
If backend does not resume after the env switch, inspect docker logs akira-backend and docker logs akira-cdr-worker.
If a base backup fails due to disk pressure, confirm the target host and retry after cleaning only the target PGDATA.

Scope​

Prerequisites​

Pre-Flight​

Automated Drill​

Manual Steps​

Final Validation​

Report​

Rollback Notes​