DR Restore Procedure

Use this runbook only for a real PostgreSQL/TimescaleDB disaster. Monthly restore drills are automated by /usr/local/bin/dr-restore-test.sh on akira-mgmt-01-staging.

Scope

Restore target: a new, isolated VM first; never restore over the damaged DB in place.
Backup source: latest valid pg_dump custom-format file from Storage Box /backups/pg/akira-*.dump.
RPO target: 24 hours, set by the nightly dump schedule.
RTO target: 15 minutes for the validated restore path.

1. Freeze Writes

Stop application workers that can write to PostgreSQL:

ssh root@akira-mgmt-01-staging \
  'docker compose -f /opt/akira/docker-compose.yml stop backend agentcore-bridge'
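Before moving on, it can help to confirm that neither writer is still running. The following is a minimal sketch, assuming the list of running services comes from `docker compose ps --services --status running`; the helper name `writers_stopped` is hypothetical and not part of the existing tooling.

```shell
# Hypothetical helper: reads running service names on stdin, one per line,
# and succeeds only if neither writer service is among them.
writers_stopped() {
  ! grep -qxE 'backend|agentcore-bridge'
}

# Example usage on akira-mgmt-01-staging (shown as a comment, not executed here):
#   docker compose -f /opt/akira/docker-compose.yml ps --services --status running \
#     | writers_stopped && echo "writes frozen"
```

An empty list also passes this check, so confirm separately that the `docker compose` command itself succeeded.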

Keep NATS JetStream online so CDR backlog is retained.

2. Provision Restore VM

Create a fresh Ubuntu 24.04 VM in fsn1 with PostgreSQL 16 and TimescaleDB 2:

hcloud server create \
  --name akira-db-restore-$(date -u +%Y%m%d%H%M) \
  --type cx23 \
  --image ubuntu-24.04 \
  --location fsn1 \
  --ssh-key akira-staging-admin

Install PostgreSQL/TimescaleDB using the same package sequence as infra/scripts/dr-test-cloud-init.yml.

3. Fetch Dump

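On the restore VM, with STORAGEBOX_USER and STORAGEBOX_HOST set and YYYYMMDDTHHMMSSZ replaced by the timestamp of the selected dump: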

sftp -i /root/.ssh/storagebox_ed25519 "${STORAGEBOX_USER}@${STORAGEBOX_HOST}" \
  <<< "get /backups/pg/akira-YYYYMMDDTHHMMSSZ.dump /tmp/restore.dump"
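The timestamp in the command above is a placeholder. One way to choose the newest dump is sketched below, assuming a plain listing of bare file names from /backups/pg is available on stdin (how that listing is produced is left open); the helper name `latest_dump` is hypothetical.

```shell
# Hypothetical helper: reads file names on stdin and prints the newest
# akira-YYYYMMDDTHHMMSSZ.dump. Fixed-width UTC timestamps sort
# chronologically under a plain byte-wise sort, so no date parsing is needed.
latest_dump() {
  grep -E '^akira-[0-9]{8}T[0-9]{6}Z\.dump$' | LC_ALL=C sort | tail -n 1
}
```

Names that do not match the pattern are ignored, and the function prints nothing when no dump is listed. The newest dump is not necessarily a valid one, so the readability check in this step still applies.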

Verify the archive is readable:

pg_restore --list /tmp/restore.dump | head
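As a cheap pre-check before listing the archive, note that pg_dump custom-format files begin with the ASCII bytes `PGDMP`. The sketch below only catches an empty file or the wrong format (for example a plain SQL dump); it does not replace `pg_restore --list`. The helper name `is_pg_custom_dump` is hypothetical.

```shell
# Hypothetical helper: succeeds if the file is non-empty and starts with
# the custom-format magic string "PGDMP".
is_pg_custom_dump() {
  [ -s "$1" ] && [ "$(head -c 5 "$1")" = "PGDMP" ]
}

# Example usage (shown as a comment, not executed here):
#   is_pg_custom_dump /tmp/restore.dump || echo "not a custom-format dump"
```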

4. Restore and Verify

Restore into a new database:

sudo -u postgres createdb akira_restore
sudo -u postgres pg_restore \
  --dbname=akira_restore \
  --verbose \
  --no-owner \
  --no-privileges \
  /tmp/restore.dump
sudo -u postgres python3 /usr/local/bin/dr-sanity-check.py akira_restore

The JSON printed by dr-sanity-check.py must show:

tables_count >= 70
enums_count >= 20
hypertable_cdr_exists = true
fn_resolve_agent_fee_exists = true

If any check fails, do not switch traffic to the restored database.
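The thresholds above can be turned into a scripted gate. This is a sketch under a strong assumption: that dr-sanity-check.py prints a single flat JSON object with exactly these key names. The real output format is not shown in this runbook, and the helper names `json_value` and `sanity_ok` are hypothetical. Where `jq` is installed, it would be a more robust way to parse the output than the `sed` expression used here.

```shell
# Hypothetical helper: print the raw value of a top-level key from a flat
# JSON object (no nesting, no commas inside values).
json_value() {  # usage: json_value KEY JSON
  printf '%s' "$2" | tr -d ' \n\t' | sed -n "s/.*\"$1\":\([^,}]*\).*/\1/p"
}

# Hypothetical helper: return 0 only if every documented check passes.
sanity_ok() {  # usage: sanity_ok JSON
  local json="$1"
  [ "$(json_value tables_count "$json")" -ge 70 ] 2>/dev/null || return 1
  [ "$(json_value enums_count "$json")" -ge 20 ] 2>/dev/null || return 1
  [ "$(json_value hypertable_cdr_exists "$json")" = "true" ] || return 1
  [ "$(json_value fn_resolve_agent_fee_exists "$json")" = "true" ] || return 1
}

# Example usage (shown as a comment, not executed here):
#   out=$(sudo -u postgres python3 /usr/local/bin/dr-sanity-check.py akira_restore)
#   sanity_ok "$out" || echo "sanity check failed: do not promote"
```

A missing or non-numeric count is treated as a failure, which keeps the gate conservative.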

5. Promote Restore

Point application configuration at the restored DB host, then restart writers:

ssh root@akira-mgmt-01-staging \
  'docker compose -f /opt/akira/docker-compose.yml up -d backend agentcore-bridge'

For a permanent replacement, update the inventory, private DNS, and Tailscale references to point at the restored VM. Preserve the failed VM for forensic analysis until the incident owner approves deletion.

6. Communication

Record the incident timeline, the selected dump name, the sanity-check JSON, the observed RTO and RPO, and any data-loss window in the incident notes. Notify operations before re-enabling traffic in step 5.
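The observed figures can be derived from three timestamps. The sketch below assumes GNU `date` (as on Ubuntu) and takes observed RPO as the gap between the selected dump and the start of the incident, and observed RTO as the gap between the start of the incident and service restoration; the team's own definitions may differ. The helper name `minutes_between` and all timestamps are illustrative only.

```shell
# Hypothetical helper: whole minutes between two UTC timestamps (GNU date).
minutes_between() {  # usage: minutes_between START END
  local start end
  start=$(date -u -d "$1" +%s) || return 1
  end=$(date -u -d "$2" +%s) || return 1
  echo $(( (end - start) / 60 ))
}

# Illustrative values only; substitute the real incident timestamps (UTC).
dump_time='2024-05-01 02:00:00'         # when the selected dump was taken
incident_start='2024-05-01 09:30:00'    # when the failure began
service_restored='2024-05-01 09:42:00'  # when writers were back online

echo "RPO observed (min): $(minutes_between "$dump_time" "$incident_start")"
echo "RTO observed (min): $(minutes_between "$incident_start" "$service_restored")"
```

With these sample values the script reports 450 and 12 minutes, which would sit inside the 24h RPO and 15-minute RTO targets.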
