Runbook - Deploy Akira

Targets

  • Staging RTO target during deploy rollback: 15 minutes.
  • Production RTO target during deploy rollback: 30 minutes (applies once production is enabled).
  • Pilot quality gate: TASK-236 smoke green and TASK-237 load baseline not regressed.

Prerequisites

  • Tailscale connected.
  • SSH key loaded for Akira hosts.
  • ~/.akira-vault-pass.txt present.
  • No active SEV1 or SEV2 incident in Telegram.
  • Latest backup is less than 24 hours old.
  • CI on master is green.
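
The 24-hour backup rule can be checked mechanically instead of by eye. The following is a minimal bash sketch, not part of the existing tooling: the function name backup_is_fresh and the BACKUP_FILE path in the usage comment are assumptions, since this runbook does not say where backups are stored.

```shell
# Sketch: decide whether a backup is recent enough to allow a deploy.
# Usage: backup_is_fresh MTIME_EPOCH NOW_EPOCH [MAX_AGE_SECONDS]
# Returns 0 when the backup is younger than the limit (default: 24 hours).
backup_is_fresh() {
  local mtime=$1 now=$2 max_age=${3:-86400}
  local age=$(( now - mtime ))
  # A negative age means clock skew or a bad timestamp: treat it as not fresh.
  [ "$age" -ge 0 ] && [ "$age" -lt "$max_age" ]
}

# Example with GNU stat (BACKUP_FILE is a hypothetical path):
#   backup_is_fresh "$(stat -c %Y "$BACKUP_FILE")" "$(date +%s)" \
#     || echo "latest backup is older than 24 hours; do not deploy"
```

Passing timestamps rather than a file path keeps the check independent of where the backup actually lives.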

Pre-Flight Checklist

  • Pull the repository.
  • Verify current branch and commit.
  • Verify Ansible vault decrypts.
  • Verify staging inventory parses.
  • Confirm no active incident.

cd /home/devcomm/akira
git pull origin master
git status --short --branch
git log --oneline -5

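# Decrypt check that does not echo secrets: prints only the number of lines
# containing "_". A failed decrypt prints an error and a count of 0.
ansible-vault view infra/group_vars/all/vault.yml \
--vault-password-file ~/.akira-vault-pass.txt | grep -c "_"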

ansible-inventory -i infra/inventory/staging.yml --list >/tmp/akira-inventory.json
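
The "verify current branch and commit" step can be turned into a pass/fail check on the output of git status --short --branch. This is a hedged bash sketch: the function name branch_is_clean is hypothetical, and it assumes a deploy should start only from a clean master that is in sync with its upstream.

```shell
# Sketch: read `git status --short --branch` on stdin and succeed only when
# the checkout is on the expected branch, in sync with upstream, and clean.
branch_is_clean() {
  local expected=${1:-master} header rest
  IFS= read -r header || return 1
  rest=$(cat)   # any remaining lines are modified or untracked files
  case $header in
    "## $expected"|"## $expected..."*) ;;
    *) echo "unexpected branch: $header" >&2; return 1 ;;
  esac
  case $header in
    # git appends e.g. "[ahead 1]", "[behind 2]" or "[gone]" when out of sync
    *" ["*) echo "branch not in sync: $header" >&2; return 1 ;;
  esac
  if [ -n "$rest" ]; then
    echo "working tree has local changes" >&2
    return 1
  fi
}

# Usage:
#   git status --short --branch | branch_is_clean master
```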

Staging Deploy

Full Stack

Use the full bootstrap when host state may be stale or after broad infrastructure changes.

cd /home/devcomm/akira/infra
ansible-playbook -i inventory/staging.yml playbooks/bootstrap_all.yml \
--vault-password-file ~/.akira-vault-pass.txt

Expected duration: about 45 minutes cold, about 15 minutes incremental.

Per Layer

Use layered deploys for ordinary changes.

Layer          Playbook                            Duration
Stateful       playbooks/deploy_stateful.yml       10 min
Signaling      playbooks/deploy_signaling.yml      15 min
Management     playbooks/deploy_management.yml     15 min
Traffic tools  playbooks/deploy_traffic_tools.yml  5 min

Layer contents:

  • Stateful: PostgreSQL, Redis, Vault transit, NATS.
  • Signaling: Kamailio, RTPengine, FreeSWITCH.
  • Management: Caddy, backend, frontend, observability.

cd /home/devcomm/akira/infra
ansible-playbook -i inventory/staging.yml playbooks/deploy_stateful.yml \
--vault-password-file ~/.akira-vault-pass.txt
ansible-playbook -i inventory/staging.yml playbooks/deploy_signaling.yml \
--vault-password-file ~/.akira-vault-pass.txt
ansible-playbook -i inventory/staging.yml playbooks/deploy_management.yml \
--vault-password-file ~/.akira-vault-pass.txt
ansible-playbook -i inventory/staging.yml playbooks/deploy_traffic_tools.yml \
--vault-password-file ~/.akira-vault-pass.txt
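
When several layers are deployed in one session, a small wrapper can run them in order and stop at the first failure, so that a broken stateful deploy is not followed by a signaling deploy. This is a sketch under assumptions: the function deploy_layers and the variables ANSIBLE_CMD, INVENTORY and VAULT_PASS_FILE are hypothetical names introduced here, not existing tooling.

```shell
# Sketch: run the layer playbooks in the order given, stopping at the first
# failure. ANSIBLE_CMD is a hypothetical override that allows a dry run,
# e.g. ANSIBLE_CMD=echo prints the arguments instead of deploying.
deploy_layers() {
  local runner=${ANSIBLE_CMD:-ansible-playbook}
  local inventory=${INVENTORY:-inventory/staging.yml}
  local vault_pass=${VAULT_PASS_FILE:-$HOME/.akira-vault-pass.txt}
  local layer
  for layer in "$@"; do
    echo "==> deploying layer: $layer" >&2
    "$runner" -i "$inventory" "playbooks/deploy_${layer}.yml" \
      --vault-password-file "$vault_pass" || {
      echo "layer $layer failed; stopping" >&2
      return 1
    }
  done
}

# Usage, from /home/devcomm/akira/infra:
#   deploy_layers stateful signaling management traffic_tools
```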

Tag-Only Redeploy

Use tags when a small operational fix should avoid touching unrelated tiers.

cd /home/devcomm/akira/infra
ansible-playbook -i inventory/staging.yml playbooks/deploy_management.yml \
--vault-password-file ~/.akira-vault-pass.txt \
--tags caddy

ansible-playbook -i inventory/staging.yml playbooks/deploy_management.yml \
--vault-password-file ~/.akira-vault-pass.txt \
--tags alertmanager

ansible-playbook -i inventory/staging.yml playbooks/deploy_stateful.yml \
--vault-password-file ~/.akira-vault-pass.txt \
--tags vault

ansible-playbook -i inventory/staging.yml playbooks/deploy_management.yml \
--vault-password-file ~/.akira-vault-pass.txt \
--tags workers \
--limit management

Caveat: run tag-only deploys only when the tag is already maintained by the role. If Ansible reports skipped dependencies or undefined variables, stop and use the layer playbook.
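
One way to confirm that a tag exists before relying on it is ansible-playbook's --list-tags option, which lists the tags a playbook defines without running any tasks. The helper below is a sketch: playbook_has_tag is a hypothetical name, and the parsing assumes the usual "TASK TAGS: [a, b, c]" line in that output, which may differ between Ansible versions.

```shell
# Sketch: read `ansible-playbook ... --list-tags` output on stdin and succeed
# only if the requested tag appears in a "TASK TAGS: [...]" line.
playbook_has_tag() {
  local wanted=$1
  sed -n 's/.*TASK TAGS: \[\(.*\)\].*/\1/p' \
    | tr ',' '\n' \
    | sed 's/^ *//; s/ *$//' \
    | grep -Fxq -- "$wanted"
}

# Usage, from /home/devcomm/akira/infra:
#   ansible-playbook -i inventory/staging.yml playbooks/deploy_management.yml \
#     --vault-password-file ~/.akira-vault-pass.txt --list-tags \
#     | playbook_has_tag caddy || echo "tag not defined; use the layer playbook"
```

The match is on the whole tag name, so a prefix such as cad does not count as caddy.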

Operational note for management: do not run the repository's standard build or deploy script directly on akira-mgmt-01-staging. That host is the runtime target and does not provide the GitHub/buildx build environment. Apply worker systemd changes from the control host with the --tags workers command above; image updates still follow the existing git-archive/scp or CI image flow.

Smoke Post-Deploy

Run these checks before declaring the deploy complete.

curl -fsS -I https://mgmt.akira-staging.asheep.it/healthz
curl -fsS -I https://grafana.akira-staging.asheep.it
curl -fsS -I https://alerts.akira-staging.asheep.it

ssh root@akira-sipp-01-staging '
cd /opt/akira/sipp
./run_scenario.sh smoke_e2e_single.xml smoke_target.csv
'

ssh root@akira-db-01-staging '
sudo -u postgres psql -d akira -c "
SELECT count(*) AS cdr_last_5m
FROM cdr
WHERE answered_at > NOW() - INTERVAL '\''5 minutes'\'';
"
'

Expected result:

  • Each HTTPS endpoint returns a non-error status (curl -f exits 0; it fails on HTTP 400 and above).
  • TASK-236 single-call path succeeds.
  • CDR count for the last 5 minutes is greater than zero after the smoke call.
  • No new SEV1 or SEV2 alert fires within 10 minutes.
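
The three HTTPS checks above can be wrapped so that every endpoint is tried and all failures are reported together, rather than stopping at the first one. This is a sketch, not the project's smoke tooling: check_endpoints and the CURL_CMD override are hypothetical names, and the 10-second --max-time is an assumed timeout.

```shell
# Sketch: HEAD-request each URL with the same curl flags as the manual checks
# and report every failure. CURL_CMD is a hypothetical override for dry runs.
check_endpoints() {
  local curl_cmd=${CURL_CMD:-curl} url failed=0
  for url in "$@"; do
    if "$curl_cmd" -fsS -I --max-time 10 "$url" >/dev/null; then
      echo "ok   $url"
    else
      echo "FAIL $url"
      failed=$((failed + 1))
    fi
  done
  # Non-zero exit status when at least one endpoint failed.
  [ "$failed" -eq 0 ]
}

# Usage:
#   check_endpoints \
#     https://mgmt.akira-staging.asheep.it/healthz \
#     https://grafana.akira-staging.asheep.it \
#     https://alerts.akira-staging.asheep.it
```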

Pilot Load Regression Check

Run the full TASK-237 load profile only before pilot gates or risky signaling changes.

cd /home/devcomm/akira
bash infra/load-test/run-pilot-profile.sh --duration 300

Expected result follows the TASK-237 target: ASR (answer-seizure ratio) of at least 95%, PDD (post-dial delay) p95 under 2 s, and no RTPengine port exhaustion.
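
The two numeric thresholds can be expressed as a small gate function. This sketch assumes the measured ASR and PDD p95 are read manually from the load-test report, because the output format of run-pilot-profile.sh is not described here. The function name pilot_gate is hypothetical. It covers only the absolute targets, not RTPengine port exhaustion or a comparison with an earlier baseline.

```shell
# Sketch: check measured values against the TASK-237 numeric targets.
# Usage: pilot_gate ASR_PERCENT PDD_P95_SECONDS
# Returns 0 when ASR >= 95 and PDD p95 < 2 seconds.
pilot_gate() {
  awk -v asr="$1" -v pdd="$2" 'BEGIN {
    ok = (asr + 0 >= 95) && (pdd + 0 < 2)
    exit (ok ? 0 : 1)
  }'
}

# Usage:
#   pilot_gate 96.4 1.7 && echo "numeric targets met" || echo "targets missed"
```

awk is used because the shell's built-in arithmetic handles integers only.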

Rollback

Roll back if smoke fails, CDR ingestion stops, Sentry or logs show a critical spike, or customer-impacting behavior appears within 10 minutes of the deploy.

  1. Identify the previous stable commit.
  2. Revert the bad commit or check out the previous stable commit for an emergency redeploy.
  3. Redeploy only the affected layer.
  4. Re-run smoke and monitor for 30 minutes.

cd /home/devcomm/akira
git log --oneline -10
git revert <bad_commit>

cd /home/devcomm/akira/infra
ansible-playbook -i inventory/staging.yml playbooks/<affected-playbook>.yml \
--vault-password-file ~/.akira-vault-pass.txt

Validation after rollback:

ssh root@akira-mgmt-01-staging 'docker ps --format "{{.Names}} {{.Status}}"'
ssh root@akira-db-01-staging '
sudo -u postgres psql -d akira -c "
SELECT max(answered_at) FROM cdr;
"
'
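
The docker ps output from the validation step can be scanned for containers that are listed but not healthy. This is a sketch: containers_all_up is a hypothetical helper, and it assumes the usual status strings such as "Up 5 minutes (healthy)" or "Restarting (1) 5 seconds ago".

```shell
# Sketch: read `docker ps --format "{{.Names}} {{.Status}}"` output on stdin
# and report containers that are not in a plain or healthy "Up" state.
containers_all_up() {
  local name status bad=0
  while read -r name status; do
    [ -z "$name" ] && continue
    case $status in
      *unhealthy*) echo "unhealthy: $name ($status)"; bad=1 ;;
      Up*) ;;   # running, with or without a passing health check
      *) echo "not running: $name ($status)"; bad=1 ;;
    esac
  done
  [ "$bad" -eq 0 ]
}

# Usage:
#   ssh root@akira-mgmt-01-staging 'docker ps --format "{{.Names}} {{.Status}}"' \
#     | containers_all_up
```

docker ps lists only running containers by default, so a container that has exited does not appear at all. Compare the listed names against the expected set separately.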

Production Deploy

Production promotion is not enabled in the pilot. Until the production environment is ready, use this section only as the expected future sequence:

  1. Confirm staging ran the same commit successfully.
  2. Confirm a production backup exists.
  3. Notify on-call primary and secondary.
  4. Deploy stateful, signaling, then management.
  5. Run HTTPS, SIP, CDR, and observability smoke.

Escalation: if a production deploy is blocked or degraded for 15 minutes, page the secondary on-call. If customer-facing impact continues for 30 minutes, escalate to Massimo and Francesco.
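
The escalation thresholds can be written as a small decision helper for an incident timer. This is only a sketch of the rule as stated: escalation_action is a hypothetical name, and it assumes "for 15 minutes" and "for 30 minutes" mean at least that many minutes elapsed.

```shell
# Sketch: map elapsed time and customer impact to the escalation step.
# Usage: escalation_action MINUTES_ELAPSED CUSTOMER_IMPACT
# CUSTOMER_IMPACT is 1 when customer-facing impact is ongoing, else 0.
escalation_action() {
  local minutes=$1 customer_impact=${2:-0}
  if [ "$customer_impact" -eq 1 ] && [ "$minutes" -ge 30 ]; then
    echo "escalate to Massimo and Francesco"
  elif [ "$minutes" -ge 15 ]; then
    echo "page secondary on-call"
  else
    echo "continue with primary on-call"
  fi
}

# Usage:
#   escalation_action 20 0
```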

Caveats

  • Do not deploy during an active SEV1 or SEV2 unless the deploy is the approved mitigation.
  • Do not run destructive DR commands from this deploy runbook. Use dr.md.
  • Do not run production commands from staging hostnames.
  • Avoid Friday deploys unless the change is an emergency fix.