Capacity Planning - Akira pilot to GA

This baseline defines when Akira should scale capacity as it moves from staging through pilot to GA. The values are planning targets; they stay unvalidated until SIPp load testing and production telemetry confirm them.

Sizing matrix

| Phase | Target | VM types | Cost/month | Capacity target |
|---|---|---|---|---|
| Staging (Trigger #2 closed) | dev/test | 12 VM mix cx23/cx33/cx43 | EUR 82 | 50 cps, 500 concurrent |
| Pilot Phase 1 (single client) | 1-3 customers | same staging | EUR 82 | 100 cps, 1000 concurrent |
| Pilot Phase 2 (5-10 customers) | grow + SLA | upgrade mgmt+db | EUR 150 | 250 cps, 2500 concurrent |
| GA Phase 1 (20-50 customers) | production | upgrade signaling+media | EUR 400 | 500 cps, 5000 concurrent |
| GA Phase 2 (100+ customers) | scale-out | multi-region nbg1+fsn1 | EUR 1000+ | 1000+ cps, 10000+ concurrent |
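
The matrix can also be read as a lookup: given an expected load, which is the smallest phase whose capacity target covers it? The following is a minimal Python sketch of that reading, not part of the Akira tooling. The names `Phase`, `PHASES`, and `smallest_phase_for` are hypothetical, the figures are the planning targets from the matrix above, and the GA Phase 2 values are treated as lower bounds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Phase:
    name: str
    cost_eur_month: int  # planning figure; GA Phase 2 is "1000+" in the matrix
    cps: int             # calls per second target
    concurrent: int      # concurrent calls target


# Planning targets copied from the sizing matrix (not measured values).
PHASES = [
    Phase("Staging", 82, 50, 500),
    Phase("Pilot Phase 1", 82, 100, 1000),
    Phase("Pilot Phase 2", 150, 250, 2500),
    Phase("GA Phase 1", 400, 500, 5000),
    Phase("GA Phase 2", 1000, 1000, 10000),
]


def smallest_phase_for(cps: float, concurrent: float) -> Optional[Phase]:
    """Return the first phase whose targets cover both demands, or None."""
    for phase in PHASES:
        if cps <= phase.cps and concurrent <= phase.concurrent:
            return phase
    return None  # demand exceeds every listed target
```

For example, `smallest_phase_for(120, 900)` selects Pilot Phase 2, because the 100 cps target of Pilot Phase 1 is already exceeded even though the concurrency still fits.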

Component CPU/RAM ceiling

Plan scale actions when a metric stays above its threshold for the full alert window, not on one-off spikes.
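
The difference between a sustained breach and a spike can be illustrated with a small check over recent samples. This is a hedged sketch only: `sustained_breach` is a hypothetical helper, and it assumes evenly spaced samples so that a window of N samples stands in for the alert duration.

```python
from typing import Sequence


def sustained_breach(samples: Sequence[float], threshold: float, window: int) -> bool:
    """True only if each of the last `window` samples exceeds `threshold`."""
    if window <= 0 or len(samples) < window:
        return False  # not enough history to call it sustained
    return all(value > threshold for value in samples[-window:])
```

With a 70% CPU threshold and a three-sample window, `[60, 95, 60, 62]` does not qualify (a single spike), while `[60, 72, 75, 71]` does.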

| Component | CPU 70% threshold action | RAM 80% threshold action |
|---|---|---|
| Kamailio sip-01/02 | Scale up cx33 to cx43 (4 to 8 vCPU) | Investigate htable size and OOM tuning |
| RTPengine rtp-01/02 | Scale up and add rtp-03 node | Investigate active session count |
| FreeSWITCH fs-01/vas-01 | Scale up cx43 to cpx41 | Check OOM events and transcoding load |
| Postgres db-01 | Investigate slow queries, indexes, and replica routing | Check pg_buffercache and tune shared_buffers |
| Redis cache-01 | Investigate unusual CPU or command mix | Tune maxmemory policy and key TTL profile |
| NATS | Tune stream and consumer count | Move JetStream storage tier from memory to file when needed |
| Backend mgmt-01 | Add second management node for HA | Tune asyncpg pool and worker concurrency |

Capacity warning alerts

Capacity warnings are versioned in infra/roles/prometheus/files/rules/capacity.yml and installed with the Prometheus role.

| Alert | Trigger | Initial action |
|---|---|---|
| NodeCpuHighSustained | Node CPU above 70% for 15 minutes | Check top processes and decide scale-up vs load redistribution |
| NodeMemoryHigh | Node memory above 80% for 10 minutes | Check service RSS, OOM risk, and cache pressure |
| NodeDiskHigh | Root disk above 80% for 5 minutes | Check logs, Prometheus/Loki/Timescale growth, backups |
| KamailioCpsHigh | INVITE CPS above 80 for 5 minutes | Compare to pilot target and SIPp results, plan signaling scale |
| RTPengineSessionsHigh | Active RTP sessions above 800 for 5 minutes | Prepare rtp-03 or node resize before 1000-session ceiling |
| PostgresConnectionsHigh | Connections above 80% of max for 5 minutes | Check pgbouncer pools and backend connection churn |

Grafana dashboard

The capacity dashboard is provisioned as infra/roles/grafana/files/dashboards/akira-capacity-sizing.json.

Core panels:

  • CPU, RAM, and disk saturation by node with 70/80% threshold coloring.
  • Kamailio CPS 24h trend with pilot target 50 cps and breakpoint 80 cps.
  • RTPengine active sessions with 500 target and 1000 breakpoint.
  • Postgres connection saturation, slow-query proxy, and buffer hit ratio.
  • NATS message rate and consumer lag placeholders for post-deploy telemetry.
  • TimescaleDB/Postgres disk growth estimate in GB/day.
  • Static cost projection matrix for current capacity phase.
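
The disk growth panel boils down to a rate between two points in time. As a rough illustration of that arithmetic, and not the dashboard's actual query, the sketch below estimates GB/day from the first and last sample of a series. `growth_gb_per_day` is a hypothetical name, and it assumes 1 GB = 1024^3 bytes.

```python
from typing import Sequence, Tuple

BYTES_PER_GB = 1024 ** 3   # assumption: binary gigabytes
SECONDS_PER_DAY = 86_400


def growth_gb_per_day(samples: Sequence[Tuple[float, float]]) -> float:
    """Estimate growth from (unix_seconds, used_bytes) samples, first vs last."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, b0), (t1, b1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    elapsed_days = (t1 - t0) / SECONDS_PER_DAY
    return (b1 - b0) / BYTES_PER_GB / elapsed_days
```

A first-versus-last estimate ignores what happens in between, so a compaction or retention job inside the window can skew it. A regression over all samples would be more robust if the series is noisy.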

Runbook: docs/runbooks/capacity-scaling.md.

Weekly baseline report

scripts/capacity-baseline.sh writes reports/capacity-YYYY-Www.md with:

  • pg_stat_statements top 10 by total execution time.
  • Timescale hypertable sizes.
  • Kamailio CPS p95 over the last 7 days from Prometheus.
  • Manual cost-trend notes and weekly recommendations.
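
The report name follows a year-and-week pattern. Assuming `YYYY-Www` means the ISO 8601 week date (the source does not say which week numbering the script uses), the path for a given day could be derived as in this sketch. `report_path` is a hypothetical helper.

```python
from datetime import date


def report_path(day: date) -> str:
    """Build reports/capacity-YYYY-Www.md using ISO 8601 year and week."""
    iso_year, iso_week, _weekday = day.isocalendar()
    return f"reports/capacity-{iso_year}-W{iso_week:02d}.md"
```

Under ISO numbering the year in the file name is the ISO year, which can differ from the calendar year in early January. Sunday 2021-01-03, for instance, falls in week 53 of 2020.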

The Prometheus role can deploy a systemd timer that runs the script every Sunday at 06:00 UTC. The timer is opt-in through prometheus_capacity_baseline_enabled.

Validation notes

  • Baseline numbers are projections until TASK-199 load testing produces SIPp evidence.
  • CPU/RAM thresholds are planning thresholds, not emergency thresholds.
  • pg_stat_statements must be available on Postgres. The baseline script attempts CREATE EXTENSION IF NOT EXISTS pg_stat_statements, but the extension collects no data until pg_stat_statements is listed in shared_preload_libraries, which takes effect only after a Postgres restart.
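
A pre-flight check for the last point could inspect the value returned by `SHOW shared_preload_libraries;`, which is a comma-separated list of library names. The sketch below parses such a string. `preload_enabled` is a hypothetical helper, and fetching the value from the database is left out.

```python
def preload_enabled(shared_preload_libraries: str,
                    library: str = "pg_stat_statements") -> bool:
    """Check whether `library` appears in a shared_preload_libraries value."""
    names = [
        name.strip().strip("'\"")  # tolerate spaces and optional quoting
        for name in shared_preload_libraries.split(",")
    ]
    return library in names
```

If this returns False, the weekly report's pg_stat_statements section would be empty even though CREATE EXTENSION succeeded.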