DR Restore Procedure
Use this runbook only for a real PostgreSQL/TimescaleDB disaster. Monthly
restore drills are automated by /usr/local/bin/dr-restore-test.sh on
akira-mgmt-01-staging.
Scope
- Restore target: new isolated VM first, never the damaged DB in place.
- Backup source: latest valid pg_dump custom-format file from Storage Box /backups/pg/akira-*.dump.
- RPO target: 24h with nightly dump.
- RTO target: 15 minutes for the validated restore path.
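As an illustration of the "latest valid dump" and 24h RPO items above, the following is a minimal sketch of how the newest dump could be chosen from a directory listing and compared against the RPO target. It assumes the filename pattern akira-YYYYMMDDTHHMMSSZ.dump with a UTC timestamp; the function names are hypothetical and not part of the existing tooling. It only checks the age encoded in the name, so the archive still has to be verified with pg_restore --list.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed filename pattern: akira-YYYYMMDDTHHMMSSZ.dump (UTC timestamp).
DUMP_RE = re.compile(r"^akira-(\d{8}T\d{6})Z\.dump$")
RPO_TARGET = timedelta(hours=24)


def dump_time(name):
    """Return the UTC timestamp encoded in a dump filename, or None."""
    match = DUMP_RE.match(name)
    if match is None:
        return None
    try:
        parsed = datetime.strptime(match.group(1), "%Y%m%dT%H%M%S")
    except ValueError:
        # Digits matched the pattern but do not form a real date.
        return None
    return parsed.replace(tzinfo=timezone.utc)


def latest_dump(names):
    """Pick the newest dump by filename timestamp, ignoring other files."""
    dated = [(dump_time(n), n) for n in names]
    dated = [(t, n) for t, n in dated if t is not None]
    if not dated:
        raise ValueError("no dump files match the expected pattern")
    return max(dated)[1]


def within_rpo(name, now):
    """True if the dump is no older than the 24h RPO target at time `now`."""
    taken = dump_time(name)
    if taken is None:
        raise ValueError(f"unexpected dump name: {name}")
    return now - taken <= RPO_TARGET
```

A dump that falls outside the RPO window can still be the best available source; the result is meant for the incident notes, not as a reason to abort the restore.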
1. Freeze Writes
Stop application workers that can write to PostgreSQL:
ssh root@akira-mgmt-01-staging \
  'docker compose -f /opt/akira/docker-compose.yml stop backend agentcore-bridge'
Keep NATS JetStream online so CDR backlog is retained.
2. Provision Restore VM
Create a fresh Ubuntu 24.04 VM in fsn1 with PostgreSQL 16 and TimescaleDB 2:
hcloud server create \
  --name akira-db-restore-$(date -u +%Y%m%d%H%M) \
  --type cx23 \
  --image ubuntu-24.04 \
  --location fsn1 \
  --ssh-key akira-staging-admin
Install PostgreSQL/TimescaleDB using the same package sequence as
infra/scripts/dr-test-cloud-init.yml.
3. Fetch Dump
On the restore VM:
sftp -i /root/.ssh/storagebox_ed25519 "${STORAGEBOX_USER}@${STORAGEBOX_HOST}" \
  <<< "get /backups/pg/akira-YYYYMMDDTHHMMSSZ.dump /tmp/restore.dump"
Verify the archive is readable:
pg_restore --list /tmp/restore.dump | head
4. Restore and Verify
Restore into a new database:
sudo -u postgres createdb akira_restore
sudo -u postgres pg_restore \
  --dbname=akira_restore \
  --verbose \
  --no-owner \
  --no-privileges \
  /tmp/restore.dump
sudo -u postgres python3 /usr/local/bin/dr-sanity-check.py akira_restore
The JSON must show:
- tables_count >= 70
- enums_count >= 20
- hypertable_cdr_exists = true
- fn_resolve_agent_fee_exists = true
If any check fails, do not switch traffic to the restored database.
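The go/no-go decision above can be made mechanical. The sketch below assumes the sanity script prints a flat JSON object whose keys match the four check names exactly; that output shape and the helper name failed_checks are assumptions, not confirmed behaviour of dr-sanity-check.py.

```python
import json

# Thresholds taken from the required checks in this runbook.
MIN_TABLES = 70
MIN_ENUMS = 20

# Each check maps an (assumed) JSON key to a predicate on its value.
CHECKS = {
    "tables_count": lambda v: isinstance(v, int) and v >= MIN_TABLES,
    "enums_count": lambda v: isinstance(v, int) and v >= MIN_ENUMS,
    "hypertable_cdr_exists": lambda v: v is True,
    "fn_resolve_agent_fee_exists": lambda v: v is True,
}


def failed_checks(report):
    """Return the names of failing checks; an empty list means all pass.

    A missing key counts as a failure, so a truncated report cannot pass.
    """
    return [name for name, ok in CHECKS.items() if not ok(report.get(name))]


def parse_report(text):
    """Parse the sanity script's JSON output (assumed to be one object)."""
    report = json.loads(text)
    if not isinstance(report, dict):
        raise ValueError("sanity report is not a JSON object")
    return report
```

For example, failed_checks(parse_report(output)) returning a non-empty list means the restored database must not receive traffic.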
5. Promote Restore
Point application configuration at the restored DB host, then restart writers:
ssh root@akira-mgmt-01-staging \
  'docker compose -f /opt/akira/docker-compose.yml up -d backend agentcore-bridge'
For a permanent replacement, update the inventory, private DNS, and Tailscale references to point at the restored VM. Preserve the failed VM for forensic analysis until the incident owner approves deletion.
6. Communication
Record the incident timeline, selected dump name, sanity JSON, RTO/RPO observed, and any data-loss window in the incident notes. Notify operations before re-enabling traffic.
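To make the observed RTO/RPO figures in the incident notes reproducible, they can be derived from four timestamps. The sketch below is one possible convention, not the team's established definition: it assumes the data-loss window runs from the dump time to the last write before the freeze, and that RTO is measured from the start of the restore path to the moment writers are back. All parameter names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Targets from the Scope section of this runbook.
RPO_TARGET = timedelta(hours=24)
RTO_TARGET = timedelta(minutes=15)


def observed_windows(dump_taken_at, last_write_at,
                     recovery_started_at, service_restored_at):
    """Compute observed data-loss and recovery windows for the incident notes.

    All arguments are timezone-aware datetimes. The choice of which events
    bound each window is an assumption; adjust it to the team's definition.
    """
    data_loss = last_write_at - dump_taken_at
    downtime = service_restored_at - recovery_started_at
    return {
        "data_loss_window": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= RPO_TARGET,
        "rto_met": downtime <= RTO_TARGET,
    }
```

This figure ignores any CDR backlog retained in NATS JetStream, so it is an upper bound on the data-loss window.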