Capacity Planning - Akira pilot to GA
This baseline defines when Akira should scale capacity on the path from staging through pilot to GA. All values are planning targets; they remain unvalidated until SIPp load testing and production telemetry confirm them.
Sizing matrix
| Phase | Purpose / customers | VM footprint | Cost/month | Capacity target |
|---|---|---|---|---|
| Staging (Trigger #2 closed) | dev/test | 12 VM mix cx23/cx33/cx43 | EUR 82 | 50 cps, 500 concurrent |
| Pilot Phase 1 (single client) | 1-3 customers | same as staging | EUR 82 | 100 cps, 1000 concurrent |
| Pilot Phase 2 (5-10 customers) | grow + SLA | upgrade mgmt+db | EUR 150 | 250 cps, 2500 concurrent |
| GA Phase 1 (20-50 customers) | production | upgrade signaling+media | EUR 400 | 500 cps, 5000 concurrent |
| GA Phase 2 (100+ customers) | scale-out | multi-region nbg1+fsn1 | EUR 1000+ | 1000+ cps, 10000+ concurrent |
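As a rough illustration of how the matrix can be read, the sketch below looks up the smallest phase whose capacity target covers an observed load. The `Phase` class and `smallest_phase_for` helper are hypothetical names introduced here, and the open-ended "1000+" row is treated as a plain number, which is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Phase:
    name: str
    max_cps: int         # calls per second the phase is planned for
    max_concurrent: int  # concurrent calls the phase is planned for
    cost_eur: int        # planned monthly cost from the matrix


# Values copied from the sizing matrix. The "+" on GA Phase 2 is dropped,
# so that row is a lower bound rather than a hard ceiling (assumption).
PHASES = [
    Phase("Staging", 50, 500, 82),
    Phase("Pilot Phase 1", 100, 1000, 82),
    Phase("Pilot Phase 2", 250, 2500, 150),
    Phase("GA Phase 1", 500, 5000, 400),
    Phase("GA Phase 2", 1000, 10000, 1000),
]


def smallest_phase_for(cps: float, concurrent: float) -> Optional[Phase]:
    """Return the first phase whose targets cover both load figures."""
    for phase in PHASES:
        if cps <= phase.max_cps and concurrent <= phase.max_concurrent:
            return phase
    return None  # beyond the matrix: needs a new sizing exercise
```

Both figures must fit: a load of 80 cps with 1200 concurrent calls lands in Pilot Phase 2 because the concurrency target, not the CPS target, is the binding one.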
Component CPU/RAM ceiling
Plan a scale action only when utilization stays above the threshold for the full alert window, not in response to one-off spikes.
| Component | CPU 70% threshold action | RAM 80% threshold action |
|---|---|---|
| Kamailio sip-01/02 | Scale up cx33 to cx43 (4 to 8 vCPU) | Investigate htable size and OOM tuning |
| RTPengine rtp-01/02 | Scale up and add rtp-03 node | Investigate active session count |
| FreeSWITCH fs-01/vas-01 | Scale up cx43 to cpx41 | Check OOM events and transcoding load |
| Postgres db-01 | Investigate slow queries, indexes, and replica routing | Check pg_buffercache and tune shared_buffers |
| Redis cache-01 | Investigate unusual CPU or command mix | Tune maxmemory policy and key TTL profile |
| NATS | Tune stream and consumer count | Move JetStream storage tier from memory to file when needed |
| Backend mgmt-01 | Add second management node for HA | Tune asyncpg pool and worker concurrency |
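The "sustained, not one-off" rule for these thresholds can be sketched as a check over recent samples. This is a minimal illustration, assuming one utilization sample per minute; `sustained_breach` is a hypothetical helper and not part of the Akira tooling.

```python
from typing import Sequence


def sustained_breach(samples: Sequence[float], threshold: float, window: int) -> bool:
    """True when each of the last `window` samples is above `threshold`."""
    if window <= 0 or len(samples) < window:
        return False  # not enough data to call it sustained
    return all(value > threshold for value in samples[-window:])


# A single spike inside an otherwise quiet window does not qualify.
spiky = [40.0] * 14 + [95.0]
# Fifteen consecutive samples above 70% does.
busy = [75.0] * 15
```

With a window of 15, `sustained_breach(spiky, 70, 15)` is false and `sustained_breach(busy, 70, 15)` is true.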
Capacity warning alerts
Capacity warnings are versioned in
infra/roles/prometheus/files/rules/capacity.yml and installed with the
Prometheus role.
| Alert | Trigger | Initial action |
|---|---|---|
| NodeCpuHighSustained | Node CPU above 70% for 15 minutes | Check top processes and decide scale-up vs load redistribution |
| NodeMemoryHigh | Node memory above 80% for 10 minutes | Check service RSS, OOM risk, and cache pressure |
| NodeDiskHigh | Root disk above 80% for 5 minutes | Check logs, Prometheus/Loki/Timescale growth, backups |
| KamailioCpsHigh | INVITE CPS above 80 for 5 minutes | Compare to pilot target and SIPp results, plan signaling scale |
| RTPengineSessionsHigh | Active RTP sessions above 800 for 5 minutes | Prepare rtp-03 or node resize before 1000-session ceiling |
| PostgresConnectionsHigh | Connections above 80% of max for 5 minutes | Check pgbouncer pools and backend connection churn |
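The CPS and RTP session triggers look like 80% of a planned ceiling (80 of 100 cps, 800 of 1000 sessions). That reading is an inference from the tables, not something the rules file is confirmed to encode. Under that assumption, the arithmetic can be sketched with two hypothetical helpers:

```python
def warning_level(ceiling: float, percent: int = 80) -> float:
    """Warning trigger as a percentage of the planned ceiling."""
    return ceiling * percent / 100


def headroom_percent(current: float, ceiling: float) -> float:
    """Share of the ceiling still unused, in percent (negative if exceeded)."""
    if ceiling <= 0:
        raise ValueError("ceiling must be positive")
    return (ceiling - current) * 100 / ceiling
```

For example, `warning_level(1000)` gives 800.0, and a node sitting at 800 active sessions against a 1000-session ceiling has `headroom_percent(800, 1000)` of 20.0.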
Grafana dashboard
The capacity dashboard is provisioned as
infra/roles/grafana/files/dashboards/akira-capacity-sizing.json.
Core panels:
- CPU, RAM, and disk saturation by node with 70/80% threshold coloring.
- Kamailio CPS 24h trend with pilot target 50 cps and breakpoint 80 cps.
- RTPengine active sessions with 500 target and 1000 breakpoint.
- Postgres connection saturation, slow-query proxy, and buffer hit ratio.
- NATS message rate and consumer lag placeholders for post-deploy telemetry.
- TimescaleDB/Postgres disk growth estimate in GB/day.
- Static cost projection matrix for current capacity phase.
Runbook: docs/runbooks/capacity-scaling.md.
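The disk growth panel's GB/day figure can be turned into a rough time-to-threshold estimate. The sketch below assumes linear growth and one size sample per day, and reuses the 80% root-disk trigger as the default limit. Both function names are hypothetical, and the dashboard's actual query may compute growth differently.

```python
from typing import Optional, Sequence


def growth_gb_per_day(daily_sizes_gb: Sequence[float]) -> float:
    """Average daily growth between the first and last daily sample."""
    if len(daily_sizes_gb) < 2:
        raise ValueError("need at least two daily samples")
    return (daily_sizes_gb[-1] - daily_sizes_gb[0]) / (len(daily_sizes_gb) - 1)


def days_until_threshold(
    used_gb: float,
    capacity_gb: float,
    growth: float,
    threshold_pct: int = 80,
) -> Optional[float]:
    """Days until usage reaches the threshold, assuming linear growth."""
    limit_gb = capacity_gb * threshold_pct / 100
    if used_gb >= limit_gb:
        return 0.0   # already at or past the threshold
    if growth <= 0:
        return None  # flat or shrinking: no crossing predicted
    return (limit_gb - used_gb) / growth
```

With four daily samples of 40, 41, 42, and 43 GB on an 80 GB disk, growth is 1.0 GB/day and the 80% mark (64 GB) is 21 days away.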
Weekly baseline report
scripts/capacity-baseline.sh writes
reports/capacity-YYYY-Www.md with:
- pg_stat_statements top 10 by total execution time.
- Timescale hypertable sizes.
- Kamailio CPS p95 over the last 7 days from Prometheus.
- Manual cost-trend notes and weekly recommendations.
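For readers unfamiliar with the p95 figure in the report, here is a minimal nearest-rank sketch over raw CPS samples. It is illustrative only: the report takes its value from Prometheus, whose quantile functions interpolate between samples, so results may differ slightly from this method.

```python
from typing import Sequence


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of the given samples."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # ceil(0.95 * n) in integer arithmetic, as a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

The p95 is a better planning signal than the weekly maximum because it ignores the busiest 5% of samples, which matches the "no one-off spikes" rule used for the thresholds.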
The Prometheus role can deploy a weekly systemd timer that runs every Sunday at
06:00 UTC. The timer is opt-in through prometheus_capacity_baseline_enabled.
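The report name and timer schedule can be sketched as follows. This assumes that `YYYY-Www` means the ISO 8601 week of the run date, which the source does not state explicitly. `report_path` and `next_sunday_run` are hypothetical helpers, not functions of the baseline script.

```python
from datetime import date, datetime, time, timedelta, timezone


def report_path(day: date) -> str:
    """Report file name for the ISO week containing `day` (assumed format)."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"reports/capacity-{iso_year}-W{iso_week:02d}.md"


def next_sunday_run(now: datetime) -> datetime:
    """Next Sunday 06:00 UTC strictly after `now` (timezone-aware)."""
    now_utc = now.astimezone(timezone.utc)
    days_ahead = (6 - now_utc.weekday()) % 7  # Monday=0 ... Sunday=6
    candidate = datetime.combine(
        now_utc.date() + timedelta(days=days_ahead),
        time(6, 0),
        tzinfo=timezone.utc,
    )
    if candidate <= now_utc:
        candidate += timedelta(days=7)  # today's slot has already passed
    return candidate
```

Note that ISO weeks end on Sunday, so under this assumption a Sunday run writes the report for the week that is just finishing.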
Validation notes
- Baseline numbers are projections until TASK-199 load testing produces SIPp evidence.
- CPU/RAM thresholds are planning thresholds, not emergency thresholds.
- pg_stat_statements must be present on Postgres. The baseline script attempts CREATE EXTENSION IF NOT EXISTS pg_stat_statements, but the DB still needs pg_stat_statements listed in shared_preload_libraries (a change that takes effect only after a server restart) before the extension can collect data.
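A pre-flight check for the preload requirement could look like the sketch below. It only parses the string that `SHOW shared_preload_libraries` returns; fetching that value is left to whatever database client is in use. `preload_enabled` and `TOP_STATEMENTS_SQL` are hypothetical names, and the query is an assumption about what a "top 10 by total execution time" report would run, not the baseline script's actual SQL. The `total_exec_time` column exists from PostgreSQL 13 onward; older versions call it `total_time`.

```python
# Assumed shape of the report query (PostgreSQL 13+ column name).
TOP_STATEMENTS_SQL = """
SELECT query, calls, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10
"""


def preload_enabled(shared_preload_libraries: str,
                    library: str = "pg_stat_statements") -> bool:
    """Check whether `library` appears in the comma-separated setting value."""
    names = [
        part.strip().strip("\"'")
        for part in shared_preload_libraries.split(",")
    ]
    return library in names
```

If the check fails, CREATE EXTENSION can still succeed while the view stays unusable, which is why the report should flag the missing preload rather than silently print an empty table.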