Runbook - Deploy Akira
Targets
- Staging RTO target during deploy rollback: 15 minutes.
- Production RTO target during deploy rollback: 30 minutes after production is enabled.
- Pilot quality gate: TASK-236 smoke green and TASK-237 load baseline not regressed.
Prerequisites
- Tailscale connected.
- SSH key loaded for Akira hosts.
- ~/.akira-vault-pass.txt present.
- No active SEV1 or SEV2 incident in Telegram.
- Latest backup is less than 24 hours old.
- CI on master is green.
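The file-based prerequisites above can be checked mechanically. The following is a minimal sketch, assuming GNU find; the helper names and the backup path in the usage comment are hypothetical, since this runbook does not state where backups are stored.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers (not part of the repository).

# Succeeds if the vault password file exists and is non-empty.
vault_pass_ok() {
  [ -s "$1" ]
}

# Succeeds if FILE was modified less than MAX_HOURS hours ago (default 24).
backup_is_fresh() {
  local file="$1" max_hours="${2:-24}"
  [ -f "$file" ] || return 1
  # find prints the path only when the mtime falls inside the window.
  [ -n "$(find "$file" -mmin "-$((max_hours * 60))" -print)" ]
}

# Example use (the backup path is an assumption):
#   vault_pass_ok "$HOME/.akira-vault-pass.txt" || echo "vault password file missing" >&2
#   backup_is_fresh /var/backups/akira/latest.dump 24 || echo "backup older than 24h" >&2
```

Tailscale, SSH key, incident, and CI status still need a manual look; the sketch covers only the two checks that reduce to a file test.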
Pre-Flight Checklist
- Pull the repository.
- Verify current branch and commit.
- Verify Ansible vault decrypts.
- Verify staging inventory parses.
- Confirm no active incident.
cd /home/devcomm/akira
git pull origin master
git status --short --branch
git log --oneline -5
ansible-vault view infra/group_vars/all/vault.yml \
--vault-password-file ~/.akira-vault-pass.txt | grep -c "_"
ansible-inventory -i infra/inventory/staging.yml --list >/tmp/akira-inventory.json
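Because the inventory dump is redirected to a file, a parse problem is easy to miss: depending on Ansible configuration, an unparsable inventory source may only produce warnings and a nearly empty JSON document. A minimal follow-up check, assuming the JSON layout in which groups list their members under a "hosts" key; inventory_ok is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sanity-check the dump written by ansible-inventory --list.
inventory_ok() {
  local file="$1"
  if [ ! -s "$file" ]; then
    echo "inventory dump is missing or empty: $file" >&2
    return 1
  fi
  # Assumption: a group that contains hosts is rendered with a "hosts" key.
  if ! grep -q '"hosts"' "$file"; then
    echo "no hosts found in inventory dump: $file" >&2
    return 1
  fi
}

# Example use:
#   inventory_ok /tmp/akira-inventory.json || echo "stop: fix the inventory first" >&2
```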
Staging Deploy
Full Stack
Use the full bootstrap when host state may be stale or after broad infrastructure changes.
cd /home/devcomm/akira/infra
ansible-playbook -i inventory/staging.yml playbooks/bootstrap_all.yml \
--vault-password-file ~/.akira-vault-pass.txt
Expected duration: about 45 minutes cold, about 15 minutes incremental.
Per Layer
Use layered deploys for ordinary changes.
| Layer | Playbook | Duration |
|---|---|---|
| Stateful | playbooks/deploy_stateful.yml | 10 min |
| Signaling | playbooks/deploy_signaling.yml | 15 min |
| Management | playbooks/deploy_management.yml | 15 min |
| Traffic tools | playbooks/deploy_traffic_tools.yml | 5 min |
Layer contents:
- Stateful: PostgreSQL, Redis, Vault transit, NATS.
- Signaling: Kamailio, RTPengine, FreeSWITCH.
- Management: Caddy, backend, frontend, observability.
cd /home/devcomm/akira/infra
ansible-playbook -i inventory/staging.yml playbooks/deploy_stateful.yml \
--vault-password-file ~/.akira-vault-pass.txt
ansible-playbook -i inventory/staging.yml playbooks/deploy_signaling.yml \
--vault-password-file ~/.akira-vault-pass.txt
ansible-playbook -i inventory/staging.yml playbooks/deploy_management.yml \
--vault-password-file ~/.akira-vault-pass.txt
ansible-playbook -i inventory/staging.yml playbooks/deploy_traffic_tools.yml \
--vault-password-file ~/.akira-vault-pass.txt
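When all four layers are deployed in one sitting, a later playbook should not start after an earlier one has failed. The sequence above can be sketched as a loop that stops at the first error. The function name and the DRY_RUN switch are hypothetical additions for illustration; the playbook names come from the table above.

```shell
#!/usr/bin/env bash
# Sketch: run the layer playbooks in the documented order, stopping on failure.
# Run from /home/devcomm/akira/infra so the relative playbook paths resolve.

LAYERS=(stateful signaling management traffic_tools)

deploy_layers() {
  local inventory="$1" layer
  local -a cmd
  for layer in "${LAYERS[@]}"; do
    cmd=(ansible-playbook -i "$inventory" "playbooks/deploy_${layer}.yml"
         --vault-password-file "$HOME/.akira-vault-pass.txt")
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Hypothetical dry-run mode: print the command instead of running it.
      printf '%s\n' "${cmd[*]}"
    else
      "${cmd[@]}" || { echo "layer ${layer} failed; stopping" >&2; return 1; }
    fi
  done
}

# Example use:
#   DRY_RUN=1 deploy_layers inventory/staging.yml   # preview only
#   deploy_layers inventory/staging.yml             # real run
```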
Tag-Only Redeploy
Use tags when a small operational fix should avoid touching unrelated tiers.
cd /home/devcomm/akira/infra
ansible-playbook -i inventory/staging.yml playbooks/deploy_management.yml \
--vault-password-file ~/.akira-vault-pass.txt \
--tags caddy
ansible-playbook -i inventory/staging.yml playbooks/deploy_management.yml \
--vault-password-file ~/.akira-vault-pass.txt \
--tags alertmanager
ansible-playbook -i inventory/staging.yml playbooks/deploy_stateful.yml \
--vault-password-file ~/.akira-vault-pass.txt \
--tags vault
ansible-playbook -i inventory/staging.yml playbooks/deploy_management.yml \
--vault-password-file ~/.akira-vault-pass.txt \
--tags workers \
--limit management
Caveat: run tag-only deploys only when the tag is already maintained by the role. If Ansible reports skipped dependencies or undefined variables, stop and use the layer playbook.
Operational note for management: do not run the repository's standard build
or deploy script directly on akira-mgmt-01-staging; that host is the runtime
target and does not provide the GitHub/buildx build environment. Apply worker
systemd changes from the control host with the Ansible command above; image
updates still follow the existing git-archive/scp or CI image flow.
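The tag-only caveat can be checked before a run: ansible-playbook --list-tags prints the tags a playbook defines without executing any task. The sketch below parses that output, assuming the "TASK TAGS: [a, b]" line layout; has_tag is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ansible-playbook --list-tags` output on stdin and
# succeed only if TAG appears as an exact entry in a "TASK TAGS: [...]" line.
has_tag() {
  local tag="$1"
  sed -n 's/.*TASK TAGS: \[\(.*\)\].*/\1/p' \
    | tr ',' '\n' \
    | sed 's/^ *//; s/ *$//' \
    | grep -Fxq -- "$tag"
}

# Example use from /home/devcomm/akira/infra:
#   ansible-playbook -i inventory/staging.yml playbooks/deploy_management.yml \
#     --vault-password-file ~/.akira-vault-pass.txt --list-tags | has_tag caddy \
#     || echo "tag not defined; use the layer playbook" >&2
```

A tag that is listed can still skip dependencies at run time, so this check narrows the risk rather than removing it.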
Smoke Post-Deploy
Run these checks before declaring the deploy complete.
curl -fsS -I https://mgmt.akira-staging.asheep.it/healthz
curl -fsS -I https://grafana.akira-staging.asheep.it
curl -fsS -I https://alerts.akira-staging.asheep.it
ssh root@akira-sipp-01-staging '
cd /opt/akira/sipp
./run_scenario.sh smoke_e2e_single.xml smoke_target.csv
'
ssh root@akira-db-01-staging '
sudo -u postgres psql -d akira -c "
SELECT count(*) AS cdr_last_5m
FROM cdr
WHERE answered_at > NOW() - INTERVAL '\''5 minutes'\'';
"
'
Expected result:
- Each HTTPS endpoint returns a non-error status (curl -f exits non-zero on HTTP 400 and above).
- TASK-236 single-call path succeeds.
- CDR count for the last 5 minutes is greater than zero after smoke.
- No new SEV1 or SEV2 alert fires within 10 minutes.
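The curl checks fail immediately if a service is still starting. One option, assuming a short warm-up after deploy is acceptable, is a small retry wrapper. retry is a hypothetical helper, and the attempt count and delay in the usage comment are illustrative values, not targets from this runbook.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to ATTEMPTS times, sleeping DELAY
# seconds between tries. Returns 0 on the first success, 1 if every try fails.
retry() {
  local attempts="$1" delay="$2" n=1
  shift 2
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after ${n} attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example use:
#   retry 5 10 curl -fsS -I https://mgmt.akira-staging.asheep.it/healthz
```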
Pilot Load Regression Check
Run the full TASK-237 load profile only before pilot gates or risky signaling changes.
cd /home/devcomm/akira
bash infra/load-test/run-pilot-profile.sh --duration 300
Expected result follows the TASK-237 target: ASR at least 95%, PDD p95 under 2s, and no RTPengine port exhaustion.
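The two numeric thresholds can be turned into a pass/fail check. This is a sketch only: the output format of run-pilot-profile.sh is not described here, so the values are passed in by hand, and pilot_gate is a hypothetical name. RTPengine port exhaustion is not covered and still needs a separate look.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when ASR (percent) is at least 95 and
# PDD p95 (seconds) is under 2. awk handles the decimal comparison.
pilot_gate() {
  local asr_pct="$1" pdd_p95_s="$2"
  awk -v asr="$asr_pct" -v pdd="$pdd_p95_s" \
    'BEGIN { exit !(asr + 0 >= 95 && pdd + 0 < 2) }'
}

# Example use, with numbers read manually from the load-test report:
#   pilot_gate 96.4 1.7 && echo "gate passed" || echo "gate failed"
```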
Rollback
Roll back if smoke fails, CDR ingestion stops, Sentry or logs show a critical spike, or customer-impacting behavior appears within 10 minutes.
- Identify the previous stable commit.
- Revert the bad commit or check out the previous stable commit for an emergency redeploy.
- Redeploy only the affected layer.
- Re-run smoke and monitor for 30 minutes.
cd /home/devcomm/akira
git log --oneline -10
git revert <bad_commit>
cd /home/devcomm/akira/infra
ansible-playbook -i inventory/staging.yml playbooks/<affected-playbook>.yml \
--vault-password-file ~/.akira-vault-pass.txt
Validation after rollback:
ssh root@akira-mgmt-01-staging 'docker ps --format "{{.Names}} {{.Status}}"'
ssh root@akira-db-01-staging '
sudo -u postgres psql -d akira -c "
SELECT max(answered_at) FROM cdr;
"
'
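Reading max(answered_at) by eye is error-prone under time pressure. A hedged sketch of an automated freshness check follows. It assumes answered_at is a timestamptz column and uses a variant of the query that returns whole epoch seconds; the helper name and the 300-second window are illustrative, not values taken from this runbook.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if LAST_EPOCH is at most MAX_AGE_S seconds old.
# NOW defaults to the current time; passing it explicitly helps with testing.
cdr_is_recent() {
  local last_epoch="$1" max_age_s="$2" now="${3:-$(date +%s)}"
  [ -n "$last_epoch" ] || return 1   # empty result: the table has no rows
  [ $((now - last_epoch)) -le "$max_age_s" ]
}

# Example use (query variant is an assumption):
#   last="$(ssh root@akira-db-01-staging \
#     "sudo -u postgres psql -d akira -At -c \
#      'SELECT floor(extract(epoch FROM max(answered_at)))::bigint FROM cdr'")"
#   cdr_is_recent "$last" 300 || echo "no CDR in the last 5 minutes" >&2
```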
Production Deploy
Production promotion is not enabled in the pilot. Until the production environment is ready, use this section only as the expected future sequence:
- Confirm staging ran the same commit successfully.
- Confirm a production backup exists.
- Notify on-call primary and secondary.
- Deploy stateful, signaling, then management.
- Run HTTPS, SIP, CDR, and observability smoke.
Escalation: if a production deploy is blocked or degraded for 15 minutes, page the secondary on-call. If customer-facing impact continues for 30 minutes, escalate to Massimo and Francesco.
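The escalation thresholds can be expressed as a small decision function, for example to drive a reminder during a deploy. This is an illustrative sketch: escalation_step and its output strings are hypothetical, and only the 15-minute and 30-minute thresholds come from the text above.

```shell
#!/usr/bin/env bash
# Hypothetical helper.
#   $1 = minutes the deploy has been blocked or degraded
#   $2 = 1 if customer-facing impact is ongoing, otherwise 0
escalation_step() {
  local minutes="$1" customer_impact="${2:-0}"
  if [ "$minutes" -ge 30 ] && [ "$customer_impact" = "1" ]; then
    echo "escalate to Massimo and Francesco"
  elif [ "$minutes" -ge 15 ]; then
    echo "page secondary on-call"
  else
    echo "no escalation yet"
  fi
}

# Example use:
#   escalation_step 20 0
```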
Caveats
- Do not deploy during an active SEV1 or SEV2 unless the deploy is the approved mitigation.
- Do not run destructive DR commands from this deploy runbook. Use dr.md.
- Do not run production commands from staging hostnames.
- Avoid Friday deploys unless the change is an emergency fix.
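The Friday caveat is simple to enforce in a wrapper script. A minimal sketch, assuming GNU date for the optional date argument; is_friday is a hypothetical helper, and an emergency override would still be a human decision.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the given date (default: today) is a Friday.
# date +%u prints the ISO weekday number, where Monday is 1 and Friday is 5.
is_friday() {
  local day
  if [ -n "${1:-}" ]; then
    day="$(date -d "$1" +%u)" || return 2   # GNU date syntax
  else
    day="$(date +%u)"
  fi
  [ "$day" = "5" ]
}

# Example use:
#   if is_friday; then echo "Friday: deploy only for emergency fixes" >&2; fi
```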