Runbook — Deploy Akira staging from scratch
Audience: ops / SRE Akira
Updated: 2026-05-20
Coverage: Trigger #1 (3 VMs) + Trigger #2 (9 VMs) + Storage Box BX11 = ~EUR 85/month for staging
Prerequisites
- Hetzner Cloud project akira-staging with quota for at least 20 servers and 40 IPs.
- Tailscale reusable auth key in vault as vault_tailscale_authkey.
- SSH key ~/.ssh/akira_ed25519 registered in Hetzner.
- Vault password file ~/.akira-vault-pass.txt, mode 0600.
- Storage Box BX11 fsn1 with subaccount u<NNN>-sub1 and SSH key access.
- SignalWire PAT in vault as vault_signalwire_token; blocking for package-based FreeSWITCH.
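The file-mode prerequisite is easy to get wrong after copying secrets between machines. A minimal pre-flight sketch follows; check_file_mode is a hypothetical helper, not something shipped in the Akira repo, and it assumes GNU stat.

```shell
# Pre-flight sketch: confirm a secrets file exists and has the expected mode.
# check_file_mode is a hypothetical helper name, not part of the Akira repo.
check_file_mode() {
  local path="$1" want="$2" have
  [ -f "$path" ] || { echo "MISSING $path"; return 1; }
  # GNU stat syntax; on macOS/BSD the equivalent is: stat -f '%Lp' "$path"
  have="$(stat -c '%a' "$path")" || return 1
  if [ "$have" = "$want" ]; then
    echo "OK $path mode $have"
  else
    echo "BAD $path mode $have (want $want)"
    return 1
  fi
}
```

Typical use before a deploy would be check_file_mode ~/.akira-vault-pass.txt 600.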
Inventory Groups
| Group | Hosts |
|---|---|
| trigger1 | bastion-01, mgmt-01, db-01 |
| trigger2 | sip-01/02, rtp-01/02, fs-01, fs-vas-01, sipp-01, cache-01, db-02 |
| kamailio_nodes | sip-01, sip-02 |
| sip_nodes | Alias of kamailio_nodes |
| rtpengine_nodes | rtp-01, rtp-02 |
| rtp_nodes | Alias of rtpengine_nodes |
| freeswitch_nodes | fs-01 |
| freeswitch_vas_nodes | fs-vas-01 |
| fs_nodes | Combined FreeSWITCH alias: fs-01, fs-vas-01 |
| cache_nodes | cache-01 |
| db_nodes | db-01, db-02 |
| db_nodes_secondary | db-02 |
| sipp_nodes | sipp-01 |
| monitoring_nodes | mgmt-01 for Loki, Prometheus, Grafana, Alertmanager |
Deploy Sequence
Trigger #1 provisions bastion, management, and primary database:
ansible-playbook -i infra/inventory/staging.yml infra/playbooks/provision_staging_phase1.yml \
--vault-password-file ~/.akira-vault-pass.txt
ansible-playbook -i infra/inventory/staging.yml infra/playbooks/bootstrap_all.yml \
--vault-password-file ~/.akira-vault-pass.txt --limit trigger1
ansible-playbook -i infra/inventory/staging.yml infra/playbooks/deploy_stateful.yml \
--vault-password-file ~/.akira-vault-pass.txt --limit akira-db-01-staging
ansible-playbook -i infra/inventory/staging.yml infra/playbooks/deploy_management.yml \
--vault-password-file ~/.akira-vault-pass.txt
Trigger #2 provisions signaling, media, cache, and secondary database:
ansible-playbook -i infra/inventory/staging.yml infra/playbooks/provision_staging_phase2.yml \
--vault-password-file ~/.akira-vault-pass.txt
ansible-playbook -i infra/inventory/staging.yml infra/playbooks/bootstrap_all.yml \
--vault-password-file ~/.akira-vault-pass.txt --limit trigger2
ansible-playbook -i infra/inventory/staging.yml infra/playbooks/deploy_stateful.yml \
--vault-password-file ~/.akira-vault-pass.txt \
--limit akira-cache-01-staging,akira-db-02-staging
ansible-playbook -i infra/inventory/staging.yml infra/playbooks/deploy_signaling.yml \
--vault-password-file ~/.akira-vault-pass.txt
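The sequences above depend on order: a later playbook should not run if an earlier one failed. One way to enforce that is a small fail-fast wrapper. This is a sketch only; run_playbook, deploy_trigger1, DRY_RUN and VAULT_PASS are hypothetical names, while the inventory and playbook paths are the ones used in this runbook.

```shell
# Sketch of a fail-fast wrapper for the Trigger #1 sequence.
# run_playbook, deploy_trigger1, DRY_RUN and VAULT_PASS are hypothetical names.
INVENTORY="infra/inventory/staging.yml"
VAULT_PASS="${VAULT_PASS:-$HOME/.akira-vault-pass.txt}"

run_playbook() {
  playbook="$1"; shift
  # Rebuild the positional parameters as the full ansible-playbook command.
  set -- ansible-playbook -i "$INVENTORY" "infra/playbooks/${playbook}" \
    --vault-password-file "$VAULT_PASS" "$@"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"   # print the command instead of executing it
  else
    "$@"
  fi
}

deploy_trigger1() {
  # The && chain stops at the first playbook that exits non-zero.
  run_playbook provision_staging_phase1.yml &&
  run_playbook bootstrap_all.yml --limit trigger1 &&
  run_playbook deploy_stateful.yml --limit akira-db-01-staging &&
  run_playbook deploy_management.yml
}
```

Running DRY_RUN=1 deploy_trigger1 prints the four commands without touching any host, which is a cheap way to review the order before a real run. A Trigger #2 function would follow the same pattern.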
Known Gotchas
G1. Tailscale fact ansible_tailscale_ipv4 undefined
Symptom: rtpengine_control_allow_sources evaluates to []; Kamailio
cannot talk to RTPengine NG.
Permanent fix: TASK-104 imports _tasks/set_tailscale_facts.yml in
pre_tasks for deploy playbooks.
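Before suspecting the playbook, it can help to confirm that each node actually reports a Tailscale address. The sketch below validates that a string falls inside 100.64.0.0/10, the CGNAT range Tailscale assigns IPv4 addresses from. is_tailscale_ipv4 is a hypothetical helper and is not what TASK-104 implements.

```shell
# Returns 0 when the argument is an IPv4 address inside 100.64.0.0/10
# (100.64.0.0 to 100.127.255.255), the range Tailscale assigns from.
# is_tailscale_ipv4 is a hypothetical helper name.
is_tailscale_ipv4() {
  local ip="$1" a b c d rest
  case "$ip" in
    ''|*[!0-9.]*) return 1 ;;   # empty, or contains non-numeric characters
  esac
  IFS=. read -r a b c d rest <<EOF
$ip
EOF
  # Exactly four non-empty octets are required.
  [ -n "$a" ] && [ -n "$b" ] && [ -n "$c" ] && [ -n "$d" ] && [ -z "$rest" ] || return 1
  [ "$a" -eq 100 ] && [ "$b" -ge 64 ] && [ "$b" -le 127 ] &&
    [ "$c" -le 255 ] && [ "$d" -le 255 ]
}
```

With the sshck helper from the Smoke Matrix, a node can be checked with is_tailscale_ipv4 "$(sshck akira-rtp-01-staging 'tailscale ip -4')".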
G2. Hetzner multicast blocked, VRRP split-brain
Symptom: both sip-01 and sip-02 elect themselves MASTER.
Permanent fix: TASK-107 configures keepalived unicast peers and Hetzner
Floating IP API failover.
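To confirm the symptom from logs, compare the last VRRP transition on each node. The sketch below assumes keepalived's usual "Entering <STATE> STATE" log wording, which may differ between versions; last_vrrp_state and is_split_brain are hypothetical helper names.

```shell
# Print the most recent VRRP state found in keepalived log lines on stdin.
# Assumes the usual "Entering MASTER|BACKUP|FAULT STATE" log wording.
last_vrrp_state() {
  grep -oE 'Entering (MASTER|BACKUP|FAULT) STATE' | tail -n 1 | awk '{ print $2 }'
}

# Succeeds when two or more of the given states are MASTER.
is_split_brain() {
  local masters=0 state
  for state in "$@"; do
    [ "$state" = "MASTER" ] && masters=$((masters + 1))
  done
  [ "$masters" -ge 2 ]
}
```

Feeding each node's journalctl -u keepalived --no-pager output through last_vrrp_state and passing both results to is_split_brain gives a yes/no answer.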
G3. SignalWire FreeSWITCH packages are PAT-gated
Symptom: apt install freeswitch returns HTTP 401.
Workaround: use the TASK-103 build-from-source path or wait until
vault_signalwire_token is valid.
G4. wait_for does not support UDP readiness
Symptom: "Wait for RTPengine NG up" fails even when RTPengine listens.
Permanent fix: TASK-105 adds _tasks/wait_for_udp.yml with a socat
probe.
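A socat-based UDP probe could look roughly like the sketch below. It is not the content of _tasks/wait_for_udp.yml; the ng_* function names are hypothetical. It relies on the RTPengine NG control protocol, where each message is a cookie, a space, and a bencoded dictionary, and a ping is answered with a pong carrying the same cookie.

```shell
# Sketch of a UDP readiness probe for the RTPengine NG port.
# ng_* names are hypothetical; this is not _tasks/wait_for_udp.yml.

# Build the ping message: "<cookie> " followed by bencoded {"command": "ping"}.
ng_ping_payload() { printf '%s d7:command4:pinge' "$1"; }

# A healthy daemon answers with the same cookie and {"result": "pong"}.
ng_is_pong() { [ "$2" = "$1 d6:result4:ponge" ]; }

# Send one ping and succeed only on a matching pong. Requires socat.
ng_probe() {
  local host="$1" port="${2:-22222}" cookie="probe$$" reply
  # -t 2: keep the socket open for up to 2 seconds after stdin reaches EOF.
  reply="$(ng_ping_payload "$cookie" | socat -t 2 - "UDP:${host}:${port}")" || return 1
  ng_is_pong "$cookie" "$reply"
}
```

Unlike a TCP connect check, this only passes when the daemon actually answers, so a firewall that silently drops UDP is reported as a failure rather than a false success.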
G5. fail2ban Kamailio templates missing
Symptom: fail2ban fails while rendering kamailio-auth or kamailio-scan
jails.
Permanent fix: TASK-101 adds jail and filter templates, including
SIPVicious detection regexes.
G6. deploy_signaling.yml missing vars_files
Symptom: vault-backed variables are undefined during signaling deploy.
Permanent fix: TASK-102 adds vars_files to all plays that need
infra/group_vars/all/main.yml and infra/group_vars/all/vault.yml.
G7. Inventory group naming mismatch
Symptom: playbooks refer to trigger2, sip_nodes, or rtp_nodes while
inventory only has canonical groups.
Permanent fix: TASK-100 adds compatibility aliases and documents canonical
groups.
G8. Storage Box subaccount form has no SSH key picker
Symptom: Hetzner panel creates the subaccount with password but no expected
SSH key selector.
Workaround: create it with a temporary ASCII password, log in over SFTP,
upload the public key, then switch to key-only. For quick staging recovery,
use the main user temporarily; TD-065 keeps dedicated sub-user cleanup in Phase 4.
G9. uv.lock rebase conflict
Symptom: rebase of a task branch on master fails with
CONFLICT (content): Merge conflict in uv.lock.
Automatic fix: after TASK-145, the Toolbox runner invokes
scripts/resolve-uvlock-conflict.sh when uv.lock is the only conflicted file.
No operator intervention is required.
Manual fix if the runner is unavailable:
git checkout --theirs uv.lock && git add uv.lock && git rebase --continue && uv lock && \
git add uv.lock && git commit -m "chore(deps): regenerate uv.lock" && \
git push origin "$(git branch --show-current)" --force-with-lease
Root cause: uv.lock is generated and sorted by uv lock; pyproject.toml
is the source of truth. Do not resolve lockfile conflict markers by hand.
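The "only conflicted file" condition is worth checking before applying the manual fix, since regenerating the lockfile does nothing for conflicts in other files. A minimal guard is sketched below; only_uvlock_conflicted is a hypothetical name and is not taken from scripts/resolve-uvlock-conflict.sh.

```shell
# Succeeds only when stdin lists exactly one conflicted path and it is uv.lock.
# only_uvlock_conflicted is a hypothetical helper name.
only_uvlock_conflicted() {
  local files
  # Drop blank lines and duplicates, then compare against the single expected path.
  files="$(grep -v '^$' | sort -u)"
  [ "$files" = "uv.lock" ]
}
```

During a stopped rebase, git diff --name-only --diff-filter=U | only_uvlock_conflicted returns success only when it is safe to proceed with the lockfile regeneration.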
Smoke Matrix
Helper:
sshck() { ssh -i ~/.ssh/akira_ed25519 "root@${1}.tail5f9c92.ts.net" "${@:2}"; }
Kamailio HA:
for h in akira-sip-01-staging akira-sip-02-staging; do
  sshck "$h" 'systemctl is-active kamailio keepalived; kamcmd core.uptime'
  sshck "$h" 'kamcmd rtpengine.ping'
done
RTPengine:
for h in akira-rtp-01-staging akira-rtp-02-staging; do
  sshck "$h" 'systemctl is-active rtpengine; ss -ulnp | grep 22222'
done
FreeSWITCH, cache, database, observability:
sshck akira-fs-01-staging 'systemctl is-active freeswitch fs-esl-gateway'
sshck akira-fs-01-staging 'fs_cli -x "module_exists mod_g729"'
sshck akira-fs-01-staging "ss -tlnp | grep ':8021' | grep '127.0.0.1'"
sshck akira-cache-01-staging 'systemctl is-active redis-server; redis-cli ping'
sshck akira-db-02-staging 'sudo -u postgres psql -tc "SELECT pg_is_in_recovery();"'
sshck akira-mgmt-01-staging 'systemctl is-active prometheus grafana-server alertmanager loki'
sshck akira-mgmt-01-staging 'curl -fsS http://127.0.0.1:9090/-/ready'
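Run one by one, the checks above make it easy to miss a single failure in the scrollback. A small aggregator can label each check and count failures. This is a sketch; smoke and SMOKE_FAILS are hypothetical names.

```shell
# Sketch: run a check, print PASS/FAIL with a label, and count failures.
# smoke and SMOKE_FAILS are hypothetical names.
SMOKE_FAILS=0
smoke() {
  local label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS ${label}"
  else
    echo "FAIL ${label}"
    SMOKE_FAILS=$((SMOKE_FAILS + 1))
  fi
}
```

For example, smoke "redis ping" sshck akira-cache-01-staging 'redis-cli ping' wraps one of the checks above; after the last check, a non-zero SMOKE_FAILS means the matrix did not pass.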
Rollback Procedure
Per-host rollback, example fs-01:
sshck akira-fs-01-staging 'systemctl stop freeswitch fs-esl-gateway'
sshck akira-fs-01-staging 'rm -rf /etc/freeswitch'
ansible-playbook -i infra/inventory/staging.yml infra/playbooks/deploy_signaling.yml \
--vault-password-file ~/.akira-vault-pass.txt \
--limit akira-fs-01-staging --tags freeswitch
Trigger #2 nuke, keeping Trigger #1 intact:
for h in sip-01 sip-02 rtp-01 rtp-02 fs-01 fs-vas-01 sipp-01 cache-01 db-02; do
  hcloud server delete "akira-${h}-staging"
done
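Because a typo in the host list could delete a Trigger #1 server, a guarded variant may be safer. The sketch below refuses any short name outside the Trigger #2 list and defaults to a dry run; is_trigger2_host, nuke_host and DRY_RUN are hypothetical names.

```shell
# Sketch: refuse to delete anything that is not a Trigger #2 host.
# is_trigger2_host, nuke_host and DRY_RUN are hypothetical names.
TRIGGER2_HOSTS="sip-01 sip-02 rtp-01 rtp-02 fs-01 fs-vas-01 sipp-01 cache-01 db-02"

is_trigger2_host() {
  # Pad with spaces so only whole-name matches count.
  case " ${TRIGGER2_HOSTS} " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

nuke_host() {
  is_trigger2_host "$1" || { echo "refusing: $1 is not a Trigger #2 host" >&2; return 1; }
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "hcloud server delete akira-$1-staging"   # default: print only
  else
    hcloud server delete "akira-$1-staging"
  fi
}
```

Deletion only happens when DRY_RUN=0 is set explicitly, so the default invocation is always safe to run.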
Cost Summary
| Item | EUR/month |
|---|---|
| Trigger #1, 3 VM | 22.00 |
| Trigger #2, 9 VM | 60.00 |
| Storage Box BX11 fsn1 1 TB | 3.20 |
| Total | 85.20 |
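The total can be re-derived from the line items whenever a price changes. The sketch below sums the three figures from the table; monthly_total is a hypothetical helper name.

```shell
# Sum one amount per input line and print the total with two decimals.
# monthly_total is a hypothetical helper name.
monthly_total() { awk '{ sum += $1 } END { printf "%.2f\n", sum }'; }

# Line items from the table: Trigger #1, Trigger #2, Storage Box BX11.
printf '%s\n' 22.00 60.00 3.20 | monthly_total   # prints 85.20
```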
Troubleshooting
| Problem | Diagnostic command | Fix |
|---|---|---|
| Magic DNS does not resolve | tailscale status \| grep <host> | Run sudo tailscale up and check auth key |
| SSH fails after provision | hcloud server list and tailscale status | Re-run bootstrap_all.yml --limit <host> |
| kamcmd rtpengine.ping fails | ansible -m debug -a "var=ansible_tailscale_ipv4" kamailio_nodes | Verify TASK-104 fact import |
| VRRP split-brain | journalctl -u keepalived -n 100 --no-pager | Verify TASK-107 unicast peers |
| FreeSWITCH install HTTP 401 | apt-cache policy freeswitch | Fix vault_signalwire_token or use TASK-103 |
| RTPengine UDP wait fails | ss -ulnp \| grep 22222 | Verify TASK-105 UDP probe |
| pg_basebackup fails | psql -c "SHOW wal_keep_size;" on db-01 | ALTER SYSTEM SET wal_keep_size='1GB' |
| Hetzner quota exceeded | hcloud server list \| wc -l | Request quota increase |
| Storage Box auth fails | ssh -p 23 -v u<NNN>@<box> | Re-upload public key |
| Grafana unavailable | systemctl status grafana-server | Check local port 127.0.0.1:3001 |
References
- ADR-0001 stack Python FastAPI Next.js
- ADR-0006 testing strategy
- ADR-0007 CDR pipeline
- ADR-0008 Kamailio HA
- ADR-0010 Kamailio CDR emit
- ADR-0011 FreeSWITCH ESL bridge
- ADR-0012 backup strategy for PostgreSQL and Timescale
- ADR-0013 observability stack
- ADR-0014 Alertmanager routing and notification topology
- ADR-0015 background job framework
- docs/runbooks/deploy-signaling.md
- docs/runbooks/keepalived-setup.md
- CLAUDE.md