PostgreSQL failover DR drill

Scope

Manual staging drill for TASK-238: promote akira-db-02-staging, switch the management backend to the promoted primary, validate CDR ingestion, rebuild the old primary as a replica, then fail back to the inventory primary akira-db-01-staging.

This drill is destructive. Run it from the operator Mac with Tailscale connected and SSH access to the staging hosts. The code runner is code-only and does not execute staging drills.

Targets:

  • RTO: under 300 seconds for service recovery on promoted db-02.
  • RPO: under 30 seconds, measured as replica replay lag immediately before stopping db-01.
  • CDR loss: under 5 rows against the 1 cps drill workload.
  • Backend CDR ingestion resumes within 60 seconds after backend switch.
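Once the drill has produced its measurements, the four targets above can be checked mechanically. The following is a minimal sketch and is not taken from dr-pg-failover-rto.sh; the function names _check and check_targets and the argument order are assumptions made for illustration.

```shell
# _check LABEL VALUE OP LIMIT
# Hypothetical helper: prints one result line and fails if the comparison fails.
_check() {
  if [ "$2" "$3" "$4" ]; then
    echo "$1 ok: $2"
  else
    echo "$1 MISSED: $2 (target $3 $4)"
    return 1
  fi
}

# check_targets RTO_S RPO_S CDR_LOSS_ROWS INGEST_RESUME_S
# Compares measured integers against the drill targets:
# RTO < 300 s, RPO < 30 s, CDR loss < 5 rows, ingestion resume <= 60 s.
# Returns non-zero if any target is missed.
check_targets() {
  rc=0
  _check "RTO_s" "$1" -lt 300 || rc=1
  _check "RPO_s" "$2" -lt 30 || rc=1
  _check "CDR_loss_rows" "$3" -lt 5 || rc=1
  _check "ingest_resume_s" "$4" -le 60 || rc=1
  return "$rc"
}
```

For example, check_targets 120 3 0 45 prints four "ok" lines and returns 0, while any missed target makes the function return 1.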

Prerequisites

  • Tailscale is up on the operator Mac and can resolve the Akira MagicDNS names.
  • SSH as root works for:
    • akira-db-01-staging
    • akira-db-02-staging
    • akira-mgmt-01-staging
    • akira-sipp-01-staging
  • The operator has confirmed this is staging, not production.
  • infra/scripts/dr-pg-failover-rto.sh is available from the repo root.
  • SIPp scenario files exist under /opt/akira/sipp on akira-sipp-01-staging.
  • The incident/channel owner has been notified before the destructive stop.

Pre-Flight

ssh root@akira-db-01-staging '
sudo -u postgres psql -Atq -c "
SELECT pg_is_in_recovery()::text || '\''|'\'' || current_setting('\''wal_level'\'');
"
'

Expected: false|replica.

ssh root@akira-db-02-staging '
sudo -u postgres psql -Atq -c "SELECT pg_is_in_recovery();"
sudo -u postgres psql -Atq -c "
SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())::int, 0);
"
'

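Expected: the first query returns t (db-02 is in recovery) and the second returns a replay lag under 5 seconds. The lag query measures the time since the last replayed transaction, so it can read high on an idle primary even when the replica is fully caught up.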

ssh root@akira-mgmt-01-staging '
grep -E "^(DATABASE_URL|DATABASE_REPLICA_URL|ALEMBIC_DATABASE_URL)=" /opt/akira/.env
'

Expected: write database URL points at akira-db-01-staging or 100.64.1.11.
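To avoid reading the host out of the grep output by eye, the host part of one variable can be extracted directly. This is a sketch under two assumptions that the runbook does not confirm: the URLs have the shape scheme://user:password@host:port/dbname, and the password contains no @ character. The function name db_host_of is hypothetical.

```shell
# db_host_of VAR FILE
# Hypothetical helper: prints the host (name or IP) of the URL assigned to VAR
# in an env file. Assumes scheme://user:password@host:port/dbname and a
# password without "@".
db_host_of() {
  sed -n "s|^$1=[^@]*@\([^:/?]*\).*|\1|p" "$2"
}
```

On akira-mgmt-01-staging, db_host_of DATABASE_URL /opt/akira/.env should then print akira-db-01-staging or 100.64.1.11 before the drill, assuming DATABASE_URL is the write URL.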

Automated Drill

From repo root on the operator Mac:

DR_CONFIRM_STAGING=YES DR_EXECUTE=1 infra/scripts/dr-pg-failover-rto.sh

Default behavior includes the recommended post-drill failback:

  1. Snapshot CDR count on db-01.
  2. Start SIPp background load at 1 cps for 10 minutes.
  3. Measure replay lag on db-02 immediately before the primary stop.
  4. Stop PostgreSQL on db-01.
  5. Promote db-02.
  6. Switch /opt/akira/.env on mgmt-01 from db-01 to db-02.
  7. Restart backend and cdr-worker.
  8. Validate CDR ingestion on db-02.
  9. Re-clone db-01 as replica from db-02.
  10. Fail back to db-01 to keep inventory and reality aligned.
  11. Re-clone db-02 as replica from db-01.
  12. Write DR-DRILL-{date}.md under ~/Documents/Claude/Projects/Akira.

To intentionally leave db-02 as primary after step 9, set DR_FAILBACK=0. That mode is for explicit operator choice only; the standard staging drill should keep DR_FAILBACK=1.

Manual Steps

Use these steps if the script cannot be used. Keep a local timeline with date -u -Iseconds after each step.
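The timeline can be kept in a file so that RTO is computed rather than estimated afterwards. The helpers below are a sketch; the names mark and elapsed, the DR_TIMELINE variable, and the log format are assumptions, not part of the drill script. Timestamps use an explicit format string, which both BSD and GNU date accept. Step names must be single words.

```shell
# Hypothetical timeline helpers for a manual drill.
DR_TIMELINE="${DR_TIMELINE:-./dr-timeline.log}"

# mark STEP: append "<epoch> <UTC timestamp> <step>" to the timeline file.
mark() {
  printf '%s %s %s\n' "$(date -u +%s)" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" >> "$DR_TIMELINE"
}

# elapsed FROM TO: print the seconds between the first FROM mark and the
# first TO mark; fails if either mark is missing.
elapsed() {
  awk -v a="$1" -v b="$2" '
    $3 == a && !sa { sa = 1; ta = $1 }
    $3 == b && !sb { sb = 1; tb = $1 }
    END { if (sa && sb) print tb - ta; else exit 1 }
  ' "$DR_TIMELINE"
}
```

For example, run mark db01_stopped right after stopping db-01 and mark backend_ok once ingestion is confirmed; elapsed db01_stopped backend_ok then gives the recovery time in seconds.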

Start background workload:

ssh root@akira-sipp-01-staging '
cd /opt/akira/sipp
nohup sipp -sf scenarios/smoke_e2e_single.xml \
-inf smoke_target.csv \
-r 1 \
-m 600 \
-mi 100.64.0.51 \
100.64.0.21:5060 \
> /tmp/sipp_drill_background.log 2>&1 &
echo "$!"
'

Stop db-01 and promote db-02:

ssh root@akira-db-01-staging 'systemctl stop postgresql'

ssh root@akira-db-02-staging '
sudo -u postgres pg_ctl promote -D /var/lib/postgresql/16/main
sleep 2
sudo -u postgres psql -Atq -c "SELECT pg_is_in_recovery();"
'

Expected: f, meaning db-02 has left recovery and accepts writes.
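The fixed sleep 2 may be too short if promotion takes longer than expected. A polling loop is one alternative; the sketch below is generic and the name wait_until is hypothetical. It re-runs a command once per second until the output matches or a timeout expires.

```shell
# wait_until EXPECTED TIMEOUT_S CMD [ARGS...]
# Hypothetical helper: re-runs CMD once per second until its output equals
# EXPECTED (returns 0) or TIMEOUT_S seconds have been waited (returns 1).
wait_until() {
  expected=$1
  timeout=$2
  shift 2
  waited=0
  while :; do
    [ "$("$@" 2>/dev/null)" = "$expected" ] && return 0
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 1
    waited=$((waited + 1))
  done
}

# Usage from the operator Mac: wait up to 30 s for db-02 to leave recovery.
# wait_until f 30 ssh root@akira-db-02-staging \
#   'sudo -u postgres psql -Atq -c "SELECT pg_is_in_recovery();"'
```

The 30-second timeout in the usage comment is an illustrative value, not a figure from the drill targets.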

Switch backend:

ssh root@akira-mgmt-01-staging '
cp /opt/akira/.env /opt/akira/.env.pre-task-238-$(date -u +%Y%m%d%H%M%S)
sed -i \
-e "s|akira-db-01-staging|akira-db-02-staging|g" \
-e "s|100\.64\.1\.11|100.64.1.12|g" \
/opt/akira/.env
docker compose -f /opt/akira/docker-compose.yml restart backend cdr-worker
'
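Because the substitution is global, it rewrites every occurrence of the db-01 name or address in the file, including any in DATABASE_REPLICA_URL. Previewing the change before applying it makes that visible. The sketch below runs the same substitutions without -i and diffs the result against the file; the function name preview_switch is hypothetical.

```shell
# preview_switch ENV_FILE
# Hypothetical helper: prints a diff of what the db-01 -> db-02 substitution
# would change in ENV_FILE, without modifying the file.
preview_switch() {
  sed \
    -e 's|akira-db-01-staging|akira-db-02-staging|g' \
    -e 's|100\.64\.1\.11|100.64.1.12|g' \
    "$1" | diff "$1" - || true
}
```

On akira-mgmt-01-staging, preview_switch /opt/akira/.env lists the lines that would change, with the new values prefixed by ">". Empty output means the file contains no reference to db-01.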

Validate CDR ingestion:

ssh root@akira-db-02-staging '
sudo -u postgres psql -d akira -c "
SELECT count(*) AS cdr_count, max(answered_at) AS latest_cdr
FROM cdr;
"
'
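The CDR-loss target compares the rows the workload should have produced with the rows that actually landed. The sketch below assumes that every call placed by SIPp yields exactly one row in cdr, which the runbook does not state; the function name cdr_loss is hypothetical.

```shell
# cdr_loss COUNT_BEFORE COUNT_AFTER CALLS_PLACED
# Hypothetical helper: prints how many placed calls did not produce a CDR row.
# Assumes one CDR row per call; the result is clamped at 0.
cdr_loss() {
  landed=$(( $2 - $1 ))
  lost=$(( $3 - landed ))
  if [ "$lost" -lt 0 ]; then lost=0; fi
  echo "$lost"
}
```

COUNT_BEFORE is the snapshot taken on db-01 before the workload starts, COUNT_AFTER is the cdr_count returned by the query above once the workload has finished, and CALLS_PLACED is the number of calls SIPp reports (at most 600 with -m 600).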

Re-clone db-01 as replica:

ssh root@akira-db-01-staging '
systemctl stop postgresql || true
find /var/lib/postgresql/16/main -mindepth 1 -maxdepth 1 -exec rm -rf {} +
sudo -u postgres pg_basebackup \
-h akira-db-02-staging \
-U replicator \
-D /var/lib/postgresql/16/main \
-X stream \
-P -R
systemctl start postgresql
sleep 3
sudo -u postgres psql -Atq -c "SELECT pg_is_in_recovery();"
'

Expected: t, meaning db-01 is running as a replica.

Recommended failback to original topology:

ssh root@akira-db-02-staging 'systemctl stop postgresql'

ssh root@akira-db-01-staging '
sudo -u postgres pg_ctl promote -D /var/lib/postgresql/16/main
sleep 2
sudo -u postgres psql -Atq -c "SELECT pg_is_in_recovery();"
'

Expected: f, meaning db-01 is primary again.

ssh root@akira-mgmt-01-staging '
cp /opt/akira/.env /opt/akira/.env.pre-task-238-failback-$(date -u +%Y%m%d%H%M%S)
sed -i \
-e "s|akira-db-02-staging|akira-db-01-staging|g" \
-e "s|100\.64\.1\.12|100.64.1.11|g" \
/opt/akira/.env
docker compose -f /opt/akira/docker-compose.yml restart backend cdr-worker
'

ssh root@akira-db-02-staging '
systemctl stop postgresql || true
find /var/lib/postgresql/16/main -mindepth 1 -maxdepth 1 -exec rm -rf {} +
sudo -u postgres pg_basebackup \
-h akira-db-01-staging \
-U replicator \
-D /var/lib/postgresql/16/main \
-X stream \
-P -R
systemctl start postgresql
'

Final Validation

ssh root@akira-db-01-staging '
sudo -u postgres psql -Atq -c "SELECT pg_is_in_recovery();"
'

ssh root@akira-db-02-staging '
sudo -u postgres psql -Atq -c "SELECT pg_is_in_recovery();"
sudo -u postgres psql -Atq -c "
SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())::int, 0);
"
'

ssh root@akira-mgmt-01-staging '
grep -E "^(DATABASE_URL|DATABASE_REPLICA_URL|ALEMBIC_DATABASE_URL)=" /opt/akira/.env
docker compose -f /opt/akira/docker-compose.yml ps backend cdr-worker
'

Expected final state:

  • akira-db-01-staging is primary: pg_is_in_recovery=false.
  • akira-db-02-staging is replica: pg_is_in_recovery=true.
  • db-02 replication lag is under 5 seconds.
  • Management write database URL points to db-01.
  • Backend and CDR worker are running.
  • Inventory still matches reality: pg_role: primary on db-01 and pg_role: replica on db-02.
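The database part of the expected final state can be checked from the three values the validation queries return. This is a sketch; the function name final_state_ok is hypothetical, and it takes the raw psql -Atq output (f or t) plus the lag in seconds.

```shell
# final_state_ok DB01_RECOVERY DB02_RECOVERY DB02_LAG_S
# Hypothetical helper: verifies db-01 is primary (f), db-02 is a replica (t),
# and db-02 replay lag is under 5 seconds. Prints the first failure found.
final_state_ok() {
  [ "$1" = "f" ] || { echo "db-01 is not primary"; return 1; }
  [ "$2" = "t" ] || { echo "db-02 is not a replica"; return 1; }
  [ "$3" -lt 5 ] || { echo "db-02 lag is $3 s"; return 1; }
  echo "topology ok"
}
```

The management URL, container status, and inventory checks still need to be confirmed separately.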

Report

Use tests/dr/expected_outcome.md as the required report format. Store the completed report as:

~/Documents/Claude/Projects/Akira/DR-DRILL-{date}.md

Rollback Notes

  • If promotion fails, stop the drill and prepare backup/WAL restore using dr-restore-pg-timescale.md.
  • If the backend does not resume after the env switch, inspect docker logs akira-backend and docker logs akira-cdr-worker.
  • If a base backup fails due to disk pressure, confirm the target host and retry after cleaning only the target PGDATA.