PostgreSQL failover DR drill
Scope
Manual staging drill for TASK-238: promote akira-db-02-staging, switch the
management backend to the promoted primary, validate CDR ingestion, rebuild the
old primary as a replica, then fail back to the inventory primary
akira-db-01-staging.
This drill is destructive. Run it from the operator Mac with Tailscale connected and SSH access to the staging hosts. The code runner is code-only and does not execute staging drills.
Targets:
- RTO: under 300 seconds for service recovery on promoted db-02.
- RPO: under 30 seconds, measured as replica replay lag immediately before stopping db-01.
- CDR loss: under 5 rows against the 1 cps drill workload.
- Backend CDR ingestion resumes within 60 seconds after backend switch.
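The targets above can be turned into a pass/fail check once the drill has produced its measurements. The following is a minimal sketch, not part of the drill script: the function name check_drill_targets and its argument order are assumptions made for illustration, and all four inputs are expected to be whole numbers.

```shell
# Hypothetical helper: compare measured drill values against the targets above.
# Arguments (integers): rto_seconds rpo_seconds cdr_rows_lost resume_seconds
check_drill_targets() {
  local rto="$1" rpo="$2" lost="$3" resume="$4" failed=0
  # RTO: under 300 seconds for service recovery on promoted db-02.
  [ "$rto" -lt 300 ]   || { echo "FAIL RTO: ${rto}s (target < 300s)"; failed=1; }
  # RPO: under 30 seconds of replay lag before stopping db-01.
  [ "$rpo" -lt 30 ]    || { echo "FAIL RPO: ${rpo}s (target < 30s)"; failed=1; }
  # CDR loss: under 5 rows.
  [ "$lost" -lt 5 ]    || { echo "FAIL CDR loss: ${lost} rows (target < 5)"; failed=1; }
  # Ingestion resumes within 60 seconds of the backend switch.
  [ "$resume" -le 60 ] || { echo "FAIL ingestion resume: ${resume}s (target <= 60s)"; failed=1; }
  [ "$failed" -eq 0 ] && echo "PASS"
  return "$failed"
}

# Example: check_drill_targets 120 4 0 30
```

The function prints one line per missed target and returns non-zero if any target is missed, so it can gate the report step.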
Prerequisites
- Tailscale is up on the operator Mac and can resolve Akira MagicDNS names.
- SSH as root works for:
  - akira-db-01-staging
  - akira-db-02-staging
  - akira-mgmt-01-staging
  - akira-sipp-01-staging
- The operator has confirmed this is staging, not production.
- infra/scripts/dr-pg-failover-rto.sh is available from the repo root.
- SIPp scenario files exist under /opt/akira/sipp on akira-sipp-01-staging.
- Incident/channel owner is aware before the destructive stop.
Pre-Flight
ssh root@akira-db-01-staging '
sudo -u postgres psql -Atq -c "
SELECT pg_is_in_recovery()::text || '\''|'\'' || current_setting('\''wal_level'\'');
"
'
Expected: false|replica.
ssh root@akira-db-02-staging '
sudo -u postgres psql -Atq -c "SELECT pg_is_in_recovery();"
sudo -u postgres psql -Atq -c "
SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())::int, 0);
"
'
Expected: recovery is t or true, replication lag is under 5 seconds.
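The two expected results above can be checked mechanically rather than by eye. This is a sketch under assumptions: the function names are hypothetical, and the inputs are the raw strings printed by the psql -Atq commands shown above.

```shell
# Hypothetical helpers that turn the expected pre-flight output into exit codes.

# $1: output of the db-01 query; the runbook expects exactly "false|replica".
check_primary_preflight() {
  [ "$1" = "false|replica" ]
}

# $1: pg_is_in_recovery() output on db-02 ("t" with psql -At, or "true")
# $2: replay lag in whole seconds; the runbook expects under 5
check_replica_preflight() {
  case "$1" in
    t|true) ;;
    *) return 1 ;;
  esac
  [ "$2" -lt 5 ]
}

# Example:
#   check_primary_preflight "false|replica" && echo "db-01 ok"
#   check_replica_preflight t 2 && echo "db-02 ok"
```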
ssh root@akira-mgmt-01-staging '
grep -E "^(DATABASE_URL|DATABASE_REPLICA_URL|ALEMBIC_DATABASE_URL)=" /opt/akira/.env
'
Expected: write database URL points at akira-db-01-staging or 100.64.1.11.
Automated Drill
From repo root on the operator Mac:
DR_CONFIRM_STAGING=YES DR_EXECUTE=1 infra/scripts/dr-pg-failover-rto.sh
Default behavior includes the recommended post-drill failback:
1. Snapshot CDR count on db-01.
2. Start SIPp background load at 1 cps for 10 minutes.
3. Measure replay lag on db-02 immediately before the primary stop.
4. Stop PostgreSQL on db-01.
5. Promote db-02.
6. Switch /opt/akira/.env on mgmt-01 from db-01 to db-02.
7. Restart backend and cdr-worker.
8. Validate CDR ingestion on db-02.
9. Re-clone db-01 as replica from db-02.
10. Fail back to db-01 to keep inventory and reality aligned.
11. Re-clone db-02 as replica from db-01.
12. Write DR-DRILL-{date}.md under ~/Documents/Claude/Projects/Akira.
To intentionally leave db-02 as primary after step 9, set DR_FAILBACK=0.
That mode is for explicit operator choice only; the standard staging drill
should keep DR_FAILBACK=1.
Manual Steps
Use these steps if the script cannot be used. Keep a local timeline with
date -u -Iseconds after each step.
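One way to keep that timeline is a small helper that appends a timestamped line per step and can later compute the gap between two steps, which is what the RTO figure needs. This is a sketch only: TIMELINE_FILE, mark_step, and elapsed_between are hypothetical names, and the log format is an assumption. It uses an explicit date format string rather than -Iseconds so that it behaves the same with BSD and GNU date.

```shell
# Hypothetical timeline helper. TIMELINE_FILE is an assumed variable name.
TIMELINE_FILE="${TIMELINE_FILE:-./dr-timeline.log}"

# Append "<UTC timestamp> <epoch seconds> <step label>" to the timeline.
mark_step() {
  printf '%s %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ %s')" "$*" >> "$TIMELINE_FILE"
}

# Print the seconds between the first line labelled $1 and the first labelled $2.
# Exits non-zero if either label is missing.
elapsed_between() {
  awk -v a="$1" -v b="$2" '
    { label = $0; sub(/^[^ ]+ [^ ]+ /, "", label) }   # drop the two time fields
    label == a && start == "" { start = $2 }
    label == b && stop == ""  { stop = $2 }
    END { if (start != "" && stop != "") print stop - start; else exit 1 }
  ' "$TIMELINE_FILE"
}

# Example:
#   mark_step "db-01 stopped"
#   mark_step "backend restarted"
#   elapsed_between "db-01 stopped" "backend restarted"
```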
Start background workload:
ssh root@akira-sipp-01-staging '
cd /opt/akira/sipp
nohup sipp -sf scenarios/smoke_e2e_single.xml \
-inf smoke_target.csv \
-r 1 \
-m 600 \
-mi 100.64.0.51 \
100.64.0.21:5060 \
> /tmp/sipp_drill_background.log 2>&1 &
echo "$!"
'
Stop db-01 and promote db-02:
ssh root@akira-db-01-staging 'systemctl stop postgresql'
ssh root@akira-db-02-staging '
sudo -u postgres pg_ctl promote -D /var/lib/postgresql/16/main
sleep 2
sudo -u postgres psql -Atq -c "SELECT pg_is_in_recovery();"
'
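The fixed sleep 2 above assumes promotion completes within two seconds. A polling loop is a possible alternative when that assumption is in doubt. The sketch below is hypothetical (wait_until_primary is not part of the drill script); it runs whatever command it is given once per second until that command prints f, which is what psql -At prints for pg_is_in_recovery() on a primary.

```shell
# Hypothetical helper: poll until the given command prints "f" or the poll
# budget runs out. The count is in polls spaced one second apart, so the
# real elapsed time also includes however long each command takes.
# $1: maximum number of polls; remaining arguments: the command to run
wait_until_primary() {
  local timeout="$1" waited=0
  shift
  while [ "$waited" -lt "$timeout" ]; do
    if [ "$("$@" 2>/dev/null)" = "f" ]; then
      echo "primary after ${waited}s"
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "still in recovery after ${timeout}s" >&2
  return 1
}

# Example:
#   wait_until_primary 30 ssh root@akira-db-02-staging \
#     'sudo -u postgres psql -Atq -c "SELECT pg_is_in_recovery();"'
```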
Switch backend:
ssh root@akira-mgmt-01-staging '
cp /opt/akira/.env /opt/akira/.env.pre-task-238-$(date -u +%Y%m%d%H%M%S)
sed -i \
-e "s|akira-db-01-staging|akira-db-02-staging|g" \
-e "s|100.64.1.11|100.64.1.12|g" \
/opt/akira/.env
docker compose -f /opt/akira/docker-compose.yml restart backend cdr-worker
'
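Before editing the live file, the substitution can be previewed against a local copy. The sketch below is an illustration, not the drill script's own logic: preview_env_switch is a hypothetical name, and it tightens the IP pattern by escaping the dots and refusing to match when another digit follows (so 100.64.1.110 is left alone). The unescaped pattern used above would treat each dot as "any character".

```shell
# Hypothetical dry run: print the rewritten env file to stdout without
# modifying it.
# $1: path to an env file (for example a local copy of /opt/akira/.env)
preview_env_switch() {
  sed \
    -e 's|akira-db-01-staging|akira-db-02-staging|g' \
    -e 's|100\.64\.1\.11\([^0-9]\)|100.64.1.12\1|g' \
    -e 's|100\.64\.1\.11$|100.64.1.12|' \
    "$1"
}

# Example: compare the preview with the original before applying sed -i.
#   preview_env_switch ./env.copy | diff ./env.copy -
```

Because the substitution is global, it rewrites every variable that names db-01, not only the write URL. The preview makes it visible whether DATABASE_REPLICA_URL or any other line would change as well.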
Validate CDR ingestion:
ssh root@akira-db-02-staging '
sudo -u postgres psql -d akira -c "
SELECT count(*) AS cdr_count, max(answered_at) AS latest_cdr
FROM cdr;
"
'
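A single count does not show that ingestion has resumed; two counts taken some seconds apart do. The sketch below works on the numbers returned by the query above. Both function names are hypothetical, and the loss estimate assumes one CDR row per call placed by SIPp, which may not hold for every scenario file.

```shell
# Hypothetical helpers operating on cdr_count values from the query above.

# Succeeds if the row count grew between two samples.
# $1: earlier count, $2: later count
cdr_ingestion_resumed() {
  [ "$2" -gt "$1" ]
}

# Estimate rows lost as calls placed minus rows gained, floored at zero.
# Assumption: each call produces exactly one CDR row.
# $1: count before the drill, $2: count after the drill, $3: calls placed
cdr_rows_lost() {
  local gained lost
  gained=$(( $2 - $1 ))
  lost=$(( $3 - gained ))
  if [ "$lost" -lt 0 ]; then lost=0; fi
  echo "$lost"
}

# Example: 600 calls placed, count moved from 1000 to 1597.
#   cdr_rows_lost 1000 1597 600
```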
Re-clone db-01 as replica:
ssh root@akira-db-01-staging '
systemctl stop postgresql || true
find /var/lib/postgresql/16/main -mindepth 1 -maxdepth 1 -exec rm -rf {} +
sudo -u postgres pg_basebackup \
-h akira-db-02-staging \
-U replicator \
-D /var/lib/postgresql/16/main \
-X stream \
-P -R
systemctl start postgresql
sleep 3
sudo -u postgres psql -Atq -c "SELECT pg_is_in_recovery();"
'
Recommended failback to original topology:
ssh root@akira-db-02-staging 'systemctl stop postgresql'
ssh root@akira-db-01-staging '
sudo -u postgres pg_ctl promote -D /var/lib/postgresql/16/main
sleep 2
sudo -u postgres psql -Atq -c "SELECT pg_is_in_recovery();"
'
ssh root@akira-mgmt-01-staging '
cp /opt/akira/.env /opt/akira/.env.pre-task-238-failback-$(date -u +%Y%m%d%H%M%S)
sed -i \
-e "s|akira-db-02-staging|akira-db-01-staging|g" \
-e "s|100.64.1.12|100.64.1.11|g" \
/opt/akira/.env
docker compose -f /opt/akira/docker-compose.yml restart backend cdr-worker
'
ssh root@akira-db-02-staging '
systemctl stop postgresql || true
find /var/lib/postgresql/16/main -mindepth 1 -maxdepth 1 -exec rm -rf {} +
sudo -u postgres pg_basebackup \
-h akira-db-01-staging \
-U replicator \
-D /var/lib/postgresql/16/main \
-X stream \
-P -R
systemctl start postgresql
'
Final Validation
ssh root@akira-db-01-staging '
sudo -u postgres psql -Atq -c "SELECT pg_is_in_recovery();"
'
ssh root@akira-db-02-staging '
sudo -u postgres psql -Atq -c "SELECT pg_is_in_recovery();"
sudo -u postgres psql -Atq -c "
SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())::int, 0);
"
'
ssh root@akira-mgmt-01-staging '
grep -E "^(DATABASE_URL|DATABASE_REPLICA_URL|ALEMBIC_DATABASE_URL)=" /opt/akira/.env
docker compose -f /opt/akira/docker-compose.yml ps backend cdr-worker
'
Expected final state:
- akira-db-01-staging is primary: pg_is_in_recovery=false.
- akira-db-02-staging is replica: pg_is_in_recovery=true.
- db-02 replication lag is under 5 seconds.
- Management write database URL points to db-01.
- Backend and CDR worker are running.
- Inventory still matches reality: pg_role: primary on db-01 and pg_role: replica on db-02.
Report
Use tests/dr/expected_outcome.md as the required report format. Store the completed report as:
~/Documents/Claude/Projects/Akira/DR-DRILL-{date}.md
Rollback Notes
- If promotion fails, stop the drill and prepare backup/WAL restore using dr-restore-pg-timescale.md.
- If backend does not resume after the env switch, inspect docker logs akira-backend and docker logs akira-cdr-worker.
- If a base backup fails due to disk pressure, confirm the target host and retry after cleaning only the target PGDATA.