ADR-0011 - FreeSWITCH ESL bridge: centralized TCP gateway pattern
- Status: Accepted (2026-05-15)
- Deciders: Massimo Bagnoli, Claude (Architect+TLC synthesis)
- Implementation tasks: TASK-82+ FreeSWITCH signaling, after Trigger #2
- Supersedes: none
- Superseded by: none
Context
Akira will use FreeSWITCH for G.729 to G.711 transcoding, VAS, IVR, audio prompts, controlled transfers and runtime channel operations. Operational control goes through the standard Event Socket Library (ESL): an asynchronous TCP protocol for sending commands and receiving channel events.
There are two main patterns: one ESL connection per worker, or a centralized gateway with a shared connection and internal APIs. The trade-offs concern lifecycle, latency, isolation, security and event dispatch.
Initial constraints:
- FreeSWITCH runs on fs-01.
- ESL control must not expose the FreeSWITCH password to multiple workers.
- FreeSWITCH events must be published to NATS for asynchronous consumption.
- The path must remain simple to operate in Phase 2 and compatible with ADR-0008.
Decision
We adopt the ESL centralized gateway pattern.
The fs-esl-gateway service runs on the same host as FreeSWITCH, maintains a
persistent ESL connection to 127.0.0.1:8021 and offers an internal REST API
to Akira consumers. Events received from ESL are normalized and published
to the NATS subject fs.events.
Architecture
FreeSWITCH (fs-01)
  |
  +-- ESL TCP 127.0.0.1:8021
      auth: vault_freeswitch_esl_password
        |
        +-- fs-esl-gateway (apps/fs-esl-gateway/)
              |
              +-- Python 3.12 asyncio process
              +-- single persistent ESL connection
              +-- reconnect with exponential backoff
              +-- REST API exposed on tailscale0 only
              +-- JWT auth shared with backend API
              +-- event normalizer and publisher
              +-- nats-py publish to subject fs.events
              +-- Prometheus metrics and health endpoint
Gateway API contract
Initial HTTP endpoints:
POST /api/v1/fs/originate
POST /api/v1/fs/hangup/{call_uuid}
GET /api/v1/fs/channels
POST /api/v1/fs/play/{call_uuid}
POST /api/v1/fs/transfer/{call_uuid}
GET /health
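One plausible way to translate these endpoints into ESL command lines is sketched below. The mapping to originate, uuid_kill, show channels, uuid_broadcast and uuid_transfer is an assumption for illustration, as are the parameter names dial_string, destination and file_path; the ADR does not fix them. The sketch also assumes call UUIDs use the standard 36-character form.

```python
import re

# Assumption: call UUIDs are standard 8-4-4-4-12 hex UUIDs.
_UUID_RE = re.compile(r"[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")
_ARG_RE = re.compile(r"[^\s]+")


def _uuid(value: str) -> str:
    if not _UUID_RE.fullmatch(value):
        raise ValueError(f"invalid call UUID: {value!r}")
    return value


def _arg(value: str) -> str:
    # Reject whitespace and newlines so a caller cannot append a second ESL command.
    if not _ARG_RE.fullmatch(value):
        raise ValueError(f"invalid ESL argument: {value!r}")
    return value


def esl_command(operation: str, **params: str) -> bytes:
    """Build the ESL command frame for one gateway operation (hypothetical mapping)."""
    if operation == "originate":
        line = f"bgapi originate {_arg(params['dial_string'])} {_arg(params['destination'])}"
    elif operation == "hangup":
        line = f"api uuid_kill {_uuid(params['call_uuid'])}"
    elif operation == "channels":
        line = "api show channels as json"
    elif operation == "play":
        line = f"api uuid_broadcast {_uuid(params['call_uuid'])} {_arg(params['file_path'])} aleg"
    elif operation == "transfer":
        line = f"api uuid_transfer {_uuid(params['call_uuid'])} {_arg(params['destination'])}"
    else:
        raise ValueError(f"unknown operation: {operation!r}")
    # ESL commands are terminated by a blank line.
    return line.encode("utf-8") + b"\n\n"
```

Validating arguments at the HTTP boundary matters because ESL is line-oriented: an unchecked newline in a path parameter would let a caller inject an arbitrary second command.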
ESL authentication and bind policy
The ESL password is stored in Ansible vault as
vault_freeswitch_esl_password. FreeSWITCH event_socket.conf.xml binds only
to loopback because the gateway runs on the same host as FreeSWITCH.
<configuration name="event_socket.conf" description="Socket Client">
  <settings>
    <param name="listen-ip" value="127.0.0.1"/>
    <param name="listen-port" value="8021"/>
    <param name="password" value="{{ vault_freeswitch_esl_password }}"/>
    <param name="apply-inbound-acl" value="loopback.auto"/>
  </settings>
</configuration>
This keeps the ESL control plane off public interfaces and off the Tailscale interface. Tailscale is used only for the gateway HTTP surface.
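For orientation, the inbound ESL handshake against this configuration can be sketched with asyncio streams. FreeSWITCH greets a new connection with Content-Type: auth/request, the client answers with an auth line, and a Reply-Text starting with +OK signals success. The function names are hypothetical and the ESL library chosen in TASK-82+ will most likely handle this internally.

```python
import asyncio


def parse_headers(block: bytes) -> dict[str, str]:
    """Parse one ESL header block ("Name: value" lines) into a dict."""
    headers: dict[str, str] = {}
    for line in block.decode("utf-8").splitlines():
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip()] = value.strip()
    return headers


def auth_command(password: str) -> bytes:
    """Build the ESL auth frame, refusing passwords that would break framing."""
    if "\n" in password or "\r" in password:
        raise ValueError("password must not contain line breaks")
    return f"auth {password}\n\n".encode("utf-8")


def auth_accepted(reply: dict[str, str]) -> bool:
    return reply.get("Reply-Text", "").startswith("+OK")


async def authenticate(reader: asyncio.StreamReader, writer, password: str) -> None:
    """Run the auth exchange on an already open ESL connection."""
    greeting = parse_headers(await reader.readuntil(b"\n\n"))
    if greeting.get("Content-Type") != "auth/request":
        raise ConnectionError("unexpected ESL greeting")
    writer.write(auth_command(password))
    await writer.drain()
    reply = parse_headers(await reader.readuntil(b"\n\n"))
    if not auth_accepted(reply):
        raise PermissionError("ESL authentication rejected")
```

In the gateway, reader and writer would come from asyncio.open_connection("127.0.0.1", 8021), with the password injected from the vault-rendered environment rather than hard-coded.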
Gateway runtime requirements
- Python 3.12 with a uv environment.
- Candidate ESL dependency: ESL-python or pyesl-async, selected in TASK-82+ after a compatibility spike against the FreeSWITCH package version.
- Runtime dependencies: fastapi, nats-py, structlog.
- Service manager: fs-esl-gateway.service with Restart=always.
- HTTP bind: Tailscale interface only, never 0.0.0.0 on a public NIC.
- HTTP auth: JWT shared with backend API, using JWT_SECRET from vault.
Event stream
The gateway subscribes to FreeSWITCH events required by signaling and VAS:
channel create/destroy, answer, hangup, bridge/unbridge, DTMF, ringback and play
prompt completion. Events are published to NATS subject fs.events.
The event payload must include at least event_type, call_uuid, timestamp,
source_host and the raw FreeSWITCH fields needed for troubleshooting. Schema
versioning is additive-only until a future subject such as fs.events.v2.
Backpressure policy
Command requests are synchronous from caller to gateway and asynchronous from gateway to FreeSWITCH only where ESL semantics require it. The gateway may throttle or reject commands if FreeSWITCH is unhealthy or command latency grows past threshold.
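The throttle-or-reject rule could be implemented as a small admission gate consulted before each command, as sketched below. The class name, the 50ms default threshold, the 100-sample window and the nearest-rank p95 are all assumptions; the ADR leaves the actual threshold undefined.

```python
from collections import deque


class CommandGate:
    """Decide whether a new command may be forwarded to FreeSWITCH (hypothetical)."""

    def __init__(self, latency_threshold_ms: float = 50.0, window: int = 100):
        self.connected = False
        self.latency_threshold_ms = latency_threshold_ms
        # Only the most recent `window` latency samples are kept.
        self._latencies: deque[float] = deque(maxlen=window)

    def record_latency(self, ms: float) -> None:
        self._latencies.append(ms)

    def p95_ms(self) -> float:
        """Nearest-rank 95th percentile over the recent window (0.0 when empty)."""
        if not self._latencies:
            return 0.0
        ordered = sorted(self._latencies)
        rank = -(-95 * len(ordered) // 100)  # ceil(0.95 * n) using integer math
        return ordered[rank - 1]

    def admit(self) -> tuple[bool, str]:
        """Return (allowed, reason); the HTTP layer could map a refusal to 503."""
        if not self.connected:
            return False, "esl_disconnected"
        if self.p95_ms() > self.latency_threshold_ms:
            return False, "latency_above_threshold"
        return True, "ok"
```

Since rejected commands produce no new latency samples, a real gate would also need a way to recover, for example by expiring old samples or letting a probe command through periodically.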
For event publish lag, the default policy is drop oldest with warning log. Blocking ESL event consumption to preserve every non-CDR event would risk stale control state. Critical downstream accounting remains covered by the CDR pipeline in ADR-0007 and ADR-0010.
Consequences
Positive
- Single connection lifecycle: one persistent ESL connection instead of N worker-managed connections.
- Credential minimization: vault_freeswitch_esl_password is needed only by FreeSWITCH config and fs-esl-gateway.
- Central event dispatch: one subscriber normalizes FreeSWITCH events and publishes them to NATS without races between worker sockets.
- Backpressure point: the gateway can reject or throttle commands when FreeSWITCH is degraded.
- Stable API boundary: consumers depend on a small HTTP contract instead of ESL protocol details.
- Reconnect control: exponential backoff and readiness live in one process.
Negative
- Single point of failure: if the gateway is down, Akira loses runtime control over FreeSWITCH. Mitigation: systemd restart, health check and Alertmanager coverage from TASK-43.
- Extra hop latency: HTTP to gateway to ESL TCP adds an estimated 5-10ms p95. Acceptable for VAS, prompt playback and call control in Phase 2.
- Throughput ceiling: above 200 CPS production load, HA or sharded gateway topology must be measured.
- Custom service: another Python service needs packaging, observability, deploy and on-call runbook.
- Event loss policy: consumers must treat fs.events as operational state, not as the billing source of truth.
Alternatives considered
A1 - Per-worker ESL connections
Each Python worker opens a dedicated ESL socket directly to FreeSWITCH.
Pro: minimal latency, natural parallelism and no intermediate service.
Con: many sockets on FreeSWITCH, duplicated ESL password, reconnect logic repeated in every worker, thread/process safety questions in FastAPI workers and ambiguous async event ownership.
Verdict: rejected. It spreads a hot control-plane concern across too many processes for the initial team and phase.
A2 - fs_cli wrapper subprocess
Python calls fs_cli -p '<password>' -x '<command>' through subprocess for
each command.
Pro: no custom network client and useful for manual operator debugging.
Con: fork overhead per command, expected latency around 50-100ms, the password can appear in process listings if not handled carefully, and there is no persistent async event stream.
Verdict: rejected for application control. Acceptable only as an interactive/debug tool.
A3 - Direct ESL library inside backend endpoints
The backend exposes FreeSWITCH endpoints and imports a shared module such as
apps/backend/src/akira_backend/integrations/freeswitch.py.
Pro: no extra service and one fewer deployable artifact.
Con: connection ownership is unclear across FastAPI workers, reconnect and event subscription lifetime become coupled to web process lifecycle, and ESL credentials are loaded into the general backend runtime.
Verdict: rejected. The ESL control plane deserves its own lifecycle and failure domain.
Implementation status
- Decision: separate service apps/fs-esl-gateway/ is canonical.
- TASK-93 implementation: deviated. It implemented the ESL bridge inside apps/backend/src/akira_backend/services/esl_bridge.py because the task acceptance criteria contradicted this ADR. That backend-side implementation is not the accepted production pattern.
- Refactor: TASK-99 is scheduled to migrate the ESL bridge from apps/backend/ to the separate apps/fs-esl-gateway/ service and close TD-061.
Implementation references
- TD-061: tracks the TASK-93 backend-side ESL bridge deviation.
- TASK-99: refactors the implementation to the canonical fs-esl-gateway service.
- ADR-0008: signaling HA context for Kamailio and FreeSWITCH placement.
- ADR-0010: adjacent bridge pattern for Kamailio CDR emission.
- TASK-25: NATS consumer pattern reused downstream from fs.events.
- TASK-43: alerting for gateway health and ESL connection failures.
- TASK-82+: FreeSWITCH install, ESL config and fs-esl-gateway implementation.
- apps/fs-esl-gateway/: service directory to scaffold in the implementation task.
- infra/roles/freeswitch/templates/event_socket.conf.xml.j2: ESL config.
- vault_freeswitch_esl_password: Ansible vault variable for ESL auth.
Monitoring & success metrics
- fs_esl_commands_total: increments for each command accepted by the gateway.
- fs_esl_errors_total: command and connection errors by type.
- fs_esl_command_latency_ms: p95 under 50ms in staging.
- fs_esl_connection_uptime_seconds: uptime ratio above 99.9% per quarter.
- fs_esl_event_publish_lag_ms: p95 under 100ms from ESL receive to NATS.
- fs_esl_reconnects_total: alert on sustained reconnect loops.
- Resident memory below 512MB in normal load.
- Alert on unhealthy gateway or fs_esl_gateway_connection_failures > 0.
Open questions
- HA gateway topology: decide in Phase 3 production planning.
- Exact fs.events schema: define in TASK-82+ after capturing samples.
- gRPC API: revisit only if REST p95 exceeds 100ms or streaming is required.
- Command authorization matrix: gateway-level permissions or service JWT claims.