Runbook — Deploy Akira staging from scratch

Audience: ops / SRE Akira
Updated: 2026-05-20
Coverage: Trigger #1 (3 VMs) + Trigger #2 (9 VMs) + Storage Box BX11, about EUR 85/month for staging

Prerequisites

  • Hetzner Cloud project akira-staging with quota for at least 20 servers and 40 IPs.
  • Tailscale reusable auth key in vault as vault_tailscale_authkey.
  • SSH key ~/.ssh/akira_ed25519 registered in Hetzner.
  • Vault password file ~/.akira-vault-pass.txt, mode 0600.
  • Storage Box BX11 fsn1 with subaccount u<NNN>-sub1 and SSH key access.
  • SignalWire PAT in vault as vault_signalwire_token; required for the package-based FreeSWITCH install (see G3).

Inventory Groups

Group                 Hosts
trigger1              bastion-01, mgmt-01, db-01
trigger2              sip-01/02, rtp-01/02, fs-01, fs-vas-01, sipp-01, cache-01, db-02
kamailio_nodes        sip-01, sip-02
sip_nodes             Alias of kamailio_nodes
rtpengine_nodes       rtp-01, rtp-02
rtp_nodes             Alias of rtpengine_nodes
freeswitch_nodes      fs-01
freeswitch_vas_nodes  fs-vas-01
fs_nodes              Combined FreeSWITCH alias: fs-01, fs-vas-01
cache_nodes           cache-01
db_nodes              db-01, db-02
db_nodes_secondary    db-02
sipp_nodes            sipp-01
monitoring_nodes      mgmt-01 (Loki, Prometheus, Grafana, Alertmanager)

Deploy Sequence

Trigger #1 provisions bastion, management, and primary database:

ansible-playbook -i infra/inventory/staging.yml infra/playbooks/provision_staging_phase1.yml \
  --vault-password-file ~/.akira-vault-pass.txt
ansible-playbook -i infra/inventory/staging.yml infra/playbooks/bootstrap_all.yml \
  --vault-password-file ~/.akira-vault-pass.txt --limit trigger1
ansible-playbook -i infra/inventory/staging.yml infra/playbooks/deploy_stateful.yml \
  --vault-password-file ~/.akira-vault-pass.txt --limit akira-db-01-staging
ansible-playbook -i infra/inventory/staging.yml infra/playbooks/deploy_management.yml \
  --vault-password-file ~/.akira-vault-pass.txt

Trigger #2 provisions signaling, media, cache, and secondary database:

ansible-playbook -i infra/inventory/staging.yml infra/playbooks/provision_staging_phase2.yml \
  --vault-password-file ~/.akira-vault-pass.txt
ansible-playbook -i infra/inventory/staging.yml infra/playbooks/bootstrap_all.yml \
  --vault-password-file ~/.akira-vault-pass.txt --limit trigger2
ansible-playbook -i infra/inventory/staging.yml infra/playbooks/deploy_stateful.yml \
  --vault-password-file ~/.akira-vault-pass.txt \
  --limit akira-cache-01-staging,akira-db-02-staging
ansible-playbook -i infra/inventory/staging.yml infra/playbooks/deploy_signaling.yml \
  --vault-password-file ~/.akira-vault-pass.txt
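
Both trigger sequences repeat the same inventory and vault flags, so a small wrapper can reduce copy-paste mistakes. The following is a minimal sketch, not part of the repository: the function name akira_play and the DRY_RUN variable are hypothetical, and the paths are the ones used in the commands above.

```shell
# Hypothetical wrapper around the ansible-playbook calls in this runbook.
INVENTORY="infra/inventory/staging.yml"
VAULT_PASS="${HOME}/.akira-vault-pass.txt"

akira_play() {
  playbook="infra/playbooks/$1"
  shift
  # Rebuild the argument list as the full ansible-playbook command line;
  # any extra arguments (for example --limit) are passed through unchanged.
  set -- ansible-playbook -i "$INVENTORY" "$playbook" \
    --vault-password-file "$VAULT_PASS" "$@"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Dry run: print the command instead of executing it.
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

For example, DRY_RUN=1 akira_play bootstrap_all.yml --limit trigger2 prints the command that would run. Chaining calls with && stops the sequence at the first playbook that fails.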

Known Gotchas

G1. Tailscale fact ansible_tailscale_ipv4 undefined

Symptom: rtpengine_control_allow_sources evaluates to []; Kamailio cannot talk to RTPengine NG.
Permanent fix: TASK-104 imports _tasks/set_tailscale_facts.yml in pre_tasks for deploy playbooks.
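
Before a deploy, it can help to confirm that each host actually reports a Tailscale address, since an empty value is what leaves the allow list empty. The sketch below is an illustration only and is not what TASK-104 implements: is_tailscale_ipv4 is a hypothetical helper that checks whether a string is an IPv4 address inside 100.64.0.0/10, the range Tailscale assigns from.

```shell
# Hypothetical helper: succeed when $1 is an IPv4 address in 100.64.0.0/10.
is_tailscale_ipv4() {
  case "$1" in
    ''|*[!0-9.]*|.*|*.) return 1 ;;   # empty, stray characters, edge dots
  esac
  old_ifs=$IFS
  IFS=.
  set -- $1                            # split the address into octets
  IFS=$old_ifs
  [ "$#" -eq 4 ] || return 1
  for octet in "$@"; do
    [ -n "$octet" ] && [ "$octet" -le 255 ] || return 1
  done
  # 100.64.0.0/10 covers 100.64.0.0 through 100.127.255.255.
  [ "$1" -eq 100 ] && [ "$2" -ge 64 ] && [ "$2" -le 127 ]
}
```

On a host, is_tailscale_ipv4 "$(tailscale ip -4)" fails when Tailscale is down or has no address, which is the condition that produces the symptom above.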

G2. Hetzner multicast blocked, VRRP split-brain

Symptom: both sip-01 and sip-02 elect themselves MASTER.
Permanent fix: TASK-107 configures keepalived unicast peers and Hetzner Floating IP API failover.
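
Split-brain can be confirmed by comparing the last VRRP state each node logged. The helpers below are a sketch under one assumption: keepalived writes lines containing "Entering MASTER STATE" or "Entering BACKUP STATE" to the journal, and the exact wording can vary between keepalived versions. The names last_vrrp_state and is_split_brain are hypothetical.

```shell
# Read keepalived journal text on stdin; print the last VRRP state entered.
# Assumes log lines of the form "... Entering MASTER STATE".
last_vrrp_state() {
  sed -n 's/.*Entering \([A-Z]*\) STATE.*/\1/p' | tail -n 1
}

# Succeed when both nodes report MASTER, which indicates split-brain.
is_split_brain() {
  [ "$1" = "MASTER" ] && [ "$2" = "MASTER" ]
}
```

Feed it the output of journalctl -u keepalived --no-pager from sip-01 and sip-02, then pass the two results to is_split_brain.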

G3. SignalWire FreeSWITCH packages are PAT-gated

Symptom: apt install freeswitch returns HTTP 401.
Workaround: use the TASK-103 build-from-source path or wait until vault_signalwire_token is valid.

G4. wait_for does not support UDP readiness

Symptom: "Wait for RTPengine NG up" fails even when RTPengine listens.
Permanent fix: TASK-105 adds _tasks/wait_for_udp.yml with a socat probe.
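
A UDP readiness probe has to send a request and wait for a reply, because an open UDP port does not accept connections. The sketch below shows one way to do this with socat against the RTPengine NG control port. It is not necessarily what _tasks/wait_for_udp.yml does. The function names are hypothetical, and the default port 22222 is taken from the smoke matrix in this runbook.

```shell
# Build an NG-protocol ping: "<cookie> <bencoded dictionary>".
ng_ping_payload() {
  printf '%s d7:command4:pinge' "$1"
}

# Succeed when the reply echoes the cookie and carries a "pong" result.
ng_is_pong() {
  [ "$2" = "$1 d6:result4:ponge" ]
}

# Send the ping over UDP and check the reply.
# $1 = host, $2 = port (defaults to 22222). Requires socat on the caller.
ng_udp_probe() {
  cookie="probe$$"
  reply="$(ng_ping_payload "$cookie" | socat -t 2 - "UDP4:$1:${2:-22222}" 2>/dev/null)"
  ng_is_pong "$cookie" "$reply"
}
```

ng_udp_probe returns non-zero on timeout or on any reply other than the expected pong, so it can be wrapped in a retry loop.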

G5. fail2ban Kamailio templates missing

Symptom: fail2ban fails while rendering kamailio-auth or kamailio-scan jails.
Permanent fix: TASK-101 adds jail and filter templates, including SIPVicious detection regexes.

G6. deploy_signaling.yml missing vars_files

Symptom: vault-backed variables are undefined during signaling deploy.
Permanent fix: TASK-102 adds vars_files to all plays that need infra/group_vars/all/main.yml and infra/group_vars/all/vault.yml.

G7. Inventory group naming mismatch

Symptom: playbooks refer to trigger2, sip_nodes, or rtp_nodes while inventory only has canonical groups.
Permanent fix: TASK-100 adds compatibility aliases and documents canonical groups.

G8. Storage Box subaccount form has no SSH key picker

Symptom: Hetzner panel creates the subaccount with password but no expected SSH key selector.
Workaround: create it with a temporary ASCII password, log in over SFTP, upload the public key, then switch to key-only. For quick staging recovery, use the main user temporarily; TD-065 keeps dedicated sub-user cleanup in Phase 4.

G9. uv.lock rebase conflict

Symptom: rebase of a task branch on master fails with CONFLICT (content): Merge conflict in uv.lock.

Automatic fix: after TASK-145, the Toolbox runner invokes scripts/resolve-uvlock-conflict.sh when uv.lock is the only conflicted file. No operator intervention is required.

Manual fix if the runner is unavailable:

git checkout --theirs uv.lock && git add uv.lock && git rebase --continue && uv lock && \
git add uv.lock && { git diff --cached --quiet || git commit -m "chore(deps): regenerate uv.lock"; } && \
git push origin "$(git branch --show-current)" --force-with-lease

Root cause: uv.lock is generated and sorted by uv lock; pyproject.toml is the source of truth. Do not resolve lockfile conflict markers by hand.
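
The "only conflicted file" condition can be checked from the list of unmerged paths. This is a sketch of that check only; how scripts/resolve-uvlock-conflict.sh implements it is not shown in this runbook, and only_uvlock_conflicted is a hypothetical name.

```shell
# Read conflicted paths (one per line) on stdin; succeed only when
# uv.lock is the single entry.
only_uvlock_conflicted() {
  paths="$(cat)"
  [ "$paths" = "uv.lock" ]
}
```

During a stopped rebase, git diff --name-only --diff-filter=U | only_uvlock_conflicted succeeds when it is safe to regenerate the lockfile, and fails when other files still need a manual merge.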

Smoke Matrix

Helper:

sshck() { ssh -i ~/.ssh/akira_ed25519 root@${1}.tail5f9c92.ts.net "${@:2}"; }

Kamailio HA:

for h in akira-sip-01-staging akira-sip-02-staging; do
  sshck "$h" 'systemctl is-active kamailio keepalived; kamcmd core.uptime'
  sshck "$h" 'kamcmd rtpengine.ping'
done

RTPengine:

for h in akira-rtp-01-staging akira-rtp-02-staging; do
  sshck "$h" 'systemctl is-active rtpengine; ss -ulnp | grep 22222'
done

FreeSWITCH, cache, database, observability:

sshck akira-fs-01-staging 'systemctl is-active freeswitch fs-esl-gateway'
sshck akira-fs-01-staging 'fs_cli -x "module_exists mod_g729"'
sshck akira-fs-01-staging "ss -tlnp | grep ':8021' | grep '127.0.0.1'"
sshck akira-cache-01-staging 'systemctl is-active redis-server; redis-cli ping'
sshck akira-db-02-staging 'sudo -u postgres psql -tc "SELECT pg_is_in_recovery();"'
sshck akira-mgmt-01-staging 'systemctl is-active prometheus grafana-server alertmanager loki'
sshck akira-mgmt-01-staging 'curl -fsS http://127.0.0.1:9090/-/ready'
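
Run one by one, the checks above make it easy to miss a failure in the scroll-back. A small tally wrapper, sketched below, records each result and keeps going. The names check and summary are hypothetical, and command output is discarded here, so rerun a failing command directly to see its details.

```shell
PASS=0
FAIL=0

# Run one labelled check; record the result without aborting the run.
check() {
  label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    PASS=$((PASS + 1))
    echo "PASS  $label"
  else
    FAIL=$((FAIL + 1))
    echo "FAIL  $label"
  fi
}

# Print the totals; exit status is non-zero when any check failed.
summary() {
  echo "passed=$PASS failed=$FAIL"
  [ "$FAIL" -eq 0 ]
}
```

For example: check "redis ping" sshck akira-cache-01-staging 'redis-cli ping', followed by summary at the end of the run.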

Rollback Procedure

Per-host rollback, example fs-01:

sshck akira-fs-01-staging 'systemctl stop freeswitch fs-esl-gateway'
sshck akira-fs-01-staging 'rm -rf /etc/freeswitch'
ansible-playbook -i infra/inventory/staging.yml infra/playbooks/deploy_signaling.yml \
  --vault-password-file ~/.akira-vault-pass.txt \
  --limit akira-fs-01-staging --tags freeswitch

Trigger #2 nuke, keeping Trigger #1 intact:

for h in sip-01 sip-02 rtp-01 rtp-02 fs-01 fs-vas-01 sipp-01 cache-01 db-02; do
  hcloud server delete "akira-${h}-staging"
done
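
Because the delete loop is destructive, a dry-run guard is a cheap safety net. The sketch below lists the same nine servers and deletes them only when CONFIRM=yes is set. The function names and the CONFIRM variable are hypothetical.

```shell
# Print the Trigger #2 server names, one per line.
trigger2_servers() {
  for h in sip-01 sip-02 rtp-01 rtp-02 fs-01 fs-vas-01 sipp-01 cache-01 db-02; do
    echo "akira-${h}-staging"
  done
}

# Delete the servers only when CONFIRM=yes; otherwise list what would go.
nuke_trigger2() {
  trigger2_servers | while read -r name; do
    if [ "${CONFIRM:-no}" = "yes" ]; then
      # stdin is redirected so hcloud cannot consume the server list.
      hcloud server delete "$name" </dev/null
    else
      echo "would delete $name"
    fi
  done
}
```

Run nuke_trigger2 first to review the list, then CONFIRM=yes nuke_trigger2 to delete. Trigger #1 hosts (bastion-01, mgmt-01, db-01) never appear in the list.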

Cost Summary

Item                         EUR/month
Trigger #1, 3 VMs                22.00
Trigger #2, 9 VMs                60.00
Storage Box BX11 fsn1, 1 TB       3.20
Total                            85.20
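
The total can be rechecked whenever a line item changes. A minimal sketch, using the three figures from the table; monthly_total is a hypothetical name.

```shell
# Sum monthly line items (EUR), one amount per line on stdin.
monthly_total() {
  awk '{ total += $1 } END { printf "%.2f\n", total }'
}

printf '%s\n' 22.00 60.00 3.20 | monthly_total   # prints 85.20
```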

Troubleshooting

Problem                      Diagnostic command                                               Fix
MagicDNS does not resolve    tailscale status | grep <host>                                   Run sudo tailscale up and check the auth key
SSH fails after provision    hcloud server list and tailscale status                          Re-run bootstrap_all.yml --limit <host>
kamcmd rtpengine.ping fails  ansible -m debug -a "var=ansible_tailscale_ipv4" kamailio_nodes  Verify TASK-104 fact import
VRRP split-brain             journalctl -u keepalived -n 100 --no-pager                       Verify TASK-107 unicast peers
FreeSWITCH install HTTP 401  apt-cache policy freeswitch                                      Fix vault_signalwire_token or use TASK-103
RTPengine UDP wait fails     ss -ulnp | grep 22222                                            Verify TASK-105 UDP probe
pg_basebackup fails          psql -c "SHOW wal_keep_size;" on db-01                           ALTER SYSTEM SET wal_keep_size='1GB'; then SELECT pg_reload_conf();
Hetzner quota exceeded       hcloud server list | wc -l                                       Request quota increase
Storage Box auth fails       ssh -p 23 -v u<NNN>@<box>                                        Re-upload public key
Grafana unavailable          systemctl status grafana-server                                  Check local port 127.0.0.1:3001

References

  • ADR-0001 stack Python FastAPI Next.js
  • ADR-0006 testing strategy
  • ADR-0007 CDR pipeline
  • ADR-0008 Kamailio HA
  • ADR-0010 Kamailio CDR emit
  • ADR-0011 FreeSWITCH ESL bridge
  • ADR-0012 backup strategy for PostgreSQL and Timescale
  • ADR-0013 observability stack
  • ADR-0014 Alertmanager routing and notification topology
  • ADR-0015 background job framework
  • docs/runbooks/deploy-signaling.md
  • docs/runbooks/keepalived-setup.md
  • CLAUDE.md