
ADR-0011 - FreeSWITCH ESL bridge: centralized TCP gateway pattern

  • Status: Accepted (2026-05-15)
  • Deciders: Massimo Bagnoli, Claude (Architect+TLC synthesis)
  • Implementation tasks: TASK-82+ FreeSWITCH signaling, after Trigger #2
  • Supersedes: none
  • Superseded by: none

Context

Akira will use FreeSWITCH for G.729 to G.711 transcoding, VAS, IVR, audio prompts, controlled transfers and runtime channel operations. Operational control goes through the standard Event Socket Library (ESL): an asynchronous TCP protocol for sending commands and receiving channel events.

There are two main patterns: one ESL connection per worker, or a centralized gateway with a shared connection and internal APIs. The trade-offs concern lifecycle, latency, isolation, security and event dispatch.

Initial constraints:

  • FreeSWITCH runs on fs-01.
  • ESL control must not expose the FreeSWITCH password to multiple workers.
  • FreeSWITCH events must be published to NATS for asynchronous consumption.
  • The path must remain simple to operate in Phase 2 and compatible with ADR-0008.

Decision

We adopt the centralized ESL gateway pattern.

The fs-esl-gateway service runs on the same host as FreeSWITCH, maintains a persistent ESL connection to 127.0.0.1:8021 and offers an internal REST API to Akira consumers. Events received from ESL are normalized and published to the NATS subject fs.events.

Architecture

FreeSWITCH (fs-01)
|
+-- ESL TCP 127.0.0.1:8021
    auth: vault_freeswitch_esl_password
    |
    +-- fs-esl-gateway (apps/fs-esl-gateway/)
        |
        +-- Python 3.12 asyncio process
        +-- single persistent ESL connection
        +-- reconnect with exponential backoff
        +-- REST API exposed on tailscale0 only
        +-- JWT auth shared with backend API
        +-- event normalizer and publisher
        +-- nats-py publish to subject fs.events
        +-- Prometheus metrics and health endpoint

Gateway API contract

Initial HTTP endpoints:

  • POST /api/v1/fs/originate
  • POST /api/v1/fs/hangup/{call_uuid}
  • GET /api/v1/fs/channels
  • POST /api/v1/fs/play/{call_uuid}
  • POST /api/v1/fs/transfer/{call_uuid}
  • GET /health
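
This ADR does not fix which FreeSWITCH commands back each endpoint. One plausible mapping, assuming the standard mod_commands API (uuid_kill, uuid_transfer, uuid_broadcast, show channels, originate), is sketched below. The helper names and the validation rules are hypothetical; the point is that path and body parameters should be validated before they are interpolated into an ESL command line.

```python
import re

# Assumption: call UUIDs use the canonical 36-character form. Deployments that
# set custom origination_uuid values would need a looser check.
_UUID_RE = re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def _check_uuid(call_uuid: str) -> str:
    if not _UUID_RE.fullmatch(call_uuid):
        raise ValueError("invalid call_uuid")
    return call_uuid


def _check_arg(value: str) -> str:
    # Whitespace or newlines would let a caller smuggle extra arguments or a
    # second ESL command, so this sketch rejects them outright.
    if not value or any(ch.isspace() for ch in value):
        raise ValueError("invalid argument")
    return value


# Each function returns one ESL command line; the ESL client is expected to
# append the "\n\n" terminator when writing it to the socket.

def originate_command(endpoint: str, extension: str) -> str:
    # POST /api/v1/fs/originate (bgapi so the gateway does not block on call setup)
    return f"bgapi originate {_check_arg(endpoint)} {_check_arg(extension)}"


def hangup_command(call_uuid: str, cause: str = "NORMAL_CLEARING") -> str:
    # POST /api/v1/fs/hangup/{call_uuid}
    return f"api uuid_kill {_check_uuid(call_uuid)} {_check_arg(cause)}"


def channels_command() -> str:
    # GET /api/v1/fs/channels
    return "api show channels as json"


def play_command(call_uuid: str, path: str) -> str:
    # POST /api/v1/fs/play/{call_uuid}
    return f"api uuid_broadcast {_check_uuid(call_uuid)} {_check_arg(path)} aleg"


def transfer_command(call_uuid: str, destination: str) -> str:
    # POST /api/v1/fs/transfer/{call_uuid}
    return f"api uuid_transfer {_check_uuid(call_uuid)} {_check_arg(destination)}"
```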

ESL authentication and bind policy

The ESL password is stored in Ansible vault as vault_freeswitch_esl_password. FreeSWITCH event_socket.conf.xml binds only to loopback because the gateway runs on the same host as FreeSWITCH.

<configuration name="event_socket.conf" description="Socket Client">
  <settings>
    <param name="listen-ip" value="127.0.0.1"/>
    <param name="listen-port" value="8021"/>
    <param name="password" value="{{ vault_freeswitch_esl_password }}"/>
    <param name="apply-inbound-acl" value="loopback.auto"/>
  </settings>
</configuration>

This keeps the ESL control plane off public interfaces and off the Tailscale interface. Tailscale is used only for the gateway HTTP surface.
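
For orientation, the inbound ESL handshake over that loopback socket is small: FreeSWITCH sends an auth/request frame, the client answers with "auth <password>", and FreeSWITCH replies with a command/reply frame whose Reply-Text starts with +OK on success. A minimal asyncio sketch of the frame reader and the handshake follows; the function and exception names are hypothetical, and the final ESL library choice is still open in TASK-82+.

```python
import asyncio


class EslAuthError(Exception):
    """Raised when FreeSWITCH rejects the ESL password."""


async def read_frame(reader: asyncio.StreamReader) -> tuple[dict[str, str], bytes]:
    """Read one ESL frame: 'Name: value' header lines, a blank line, then an
    optional body whose size is given by the Content-Length header."""
    headers: dict[str, str] = {}
    while True:
        raw = await reader.readline()
        if not raw:
            raise ConnectionError("ESL socket closed")
        line = raw.decode().rstrip("\r\n")
        if not line:
            if headers:
                break      # blank line terminates the header block
            continue       # tolerate stray blank lines between frames
        name, _, value = line.partition(": ")
        headers[name] = value
    length = int(headers.get("Content-Length", "0"))
    body = await reader.readexactly(length) if length else b""
    return headers, body


async def authenticate(reader: asyncio.StreamReader, writer, password: str) -> None:
    """Perform the inbound ESL auth exchange on an already-open connection."""
    headers, _ = await read_frame(reader)
    if headers.get("Content-Type") != "auth/request":
        raise ConnectionError("unexpected ESL greeting")
    writer.write(f"auth {password}\n\n".encode())
    await writer.drain()
    headers, _ = await read_frame(reader)
    if not headers.get("Reply-Text", "").startswith("+OK"):
        raise EslAuthError(headers.get("Reply-Text", "no reply text"))
```

In the gateway, the reader and writer would come from asyncio.open_connection("127.0.0.1", 8021), with the password injected from vault_freeswitch_esl_password at deploy time.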

Gateway runtime requirements

  • Python 3.12 with a uv environment.
  • Candidate ESL dependency: ESL-python or pyesl-async, selected in TASK-82+ after a compatibility spike against the FreeSWITCH package version.
  • Runtime dependencies: fastapi, nats-py, structlog.
  • Service manager: fs-esl-gateway.service with Restart=always.
  • HTTP bind: Tailscale interface only, never 0.0.0.0 on a public NIC.
  • HTTP auth: JWT shared with backend API, using JWT_SECRET from vault.

Event stream

The gateway subscribes to FreeSWITCH events required by signaling and VAS: channel create/destroy, answer, hangup, bridge/unbridge, DTMF, ringback and play prompt completion. Events are published to NATS subject fs.events.

The event payload must include at least event_type, call_uuid, timestamp, source_host and the raw FreeSWITCH fields needed for troubleshooting. Schema versioning is additive-only until a future subject such as fs.events.v2.
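
A possible shape for the normalized payload is sketched below, pending the exact schema from TASK-82+. It assumes plain-format ESL events, where header values are URL-encoded and Event-Date-Timestamp carries microseconds since the Unix epoch; the FsEvent class and the function names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timedelta, timezone
from urllib.parse import unquote

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


@dataclass(frozen=True)
class FsEvent:
    event_type: str        # from Event-Name, e.g. CHANNEL_ANSWER
    call_uuid: str         # from Unique-ID (empty for non-channel events)
    timestamp: str         # ISO 8601 UTC, from Event-Date-Timestamp
    source_host: str       # from FreeSWITCH-Hostname
    raw: dict[str, str]    # decoded FreeSWITCH headers kept for troubleshooting


def normalize(headers: dict[str, str]) -> FsEvent:
    """Map raw ESL event headers to the minimal fs.events payload."""
    decoded = {name: unquote(value) for name, value in headers.items()}
    micros = int(decoded["Event-Date-Timestamp"])
    # timedelta arithmetic keeps microsecond precision (no float rounding).
    timestamp = (_EPOCH + timedelta(microseconds=micros)).isoformat()
    return FsEvent(
        event_type=decoded["Event-Name"],
        call_uuid=decoded.get("Unique-ID", ""),
        timestamp=timestamp,
        source_host=decoded.get("FreeSWITCH-Hostname", ""),
        raw=decoded,
    )


def to_nats_payload(event: FsEvent) -> bytes:
    """Serialize the event as compact JSON bytes for publishing to fs.events."""
    return json.dumps(asdict(event), separators=(",", ":")).encode()
```

Because the schema is additive-only, consumers should ignore unknown keys rather than validating against a closed field set.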

Backpressure policy

Command requests are synchronous from caller to gateway and asynchronous from gateway to FreeSWITCH only where ESL semantics require it. The gateway may throttle or reject commands if FreeSWITCH is unhealthy or command latency grows past threshold.

For event publish lag, the default policy is drop oldest with warning log. Blocking ESL event consumption to preserve every non-CDR event would risk stale control state. Critical downstream accounting remains covered by the CDR pipeline in ADR-0007 and ADR-0010.

Consequences

Positive

  • Single connection lifecycle: one persistent ESL connection instead of N worker-managed connections.
  • Credential minimization: vault_freeswitch_esl_password is needed only by FreeSWITCH config and fs-esl-gateway.
  • Central event dispatch: one subscriber normalizes FreeSWITCH events and publishes them to NATS without races between worker sockets.
  • Backpressure point: the gateway can reject or throttle commands when FreeSWITCH is degraded.
  • Stable API boundary: consumers depend on a small HTTP contract instead of ESL protocol details.
  • Reconnect control: exponential backoff and readiness live in one process.

Negative

  • Single point of failure: if the gateway is down, Akira loses runtime control over FreeSWITCH. Mitigation: systemd restart, health check and Alertmanager coverage from TASK-43.
  • Extra hop latency: HTTP to gateway to ESL TCP adds an estimated 5-10ms p95. Acceptable for VAS, prompt playback and call control in Phase 2.
  • Throughput ceiling: above 200 CPS production load, HA or sharded gateway topology must be measured.
  • Custom service: another Python service needs packaging, observability, deploy and on-call runbook.
  • Event loss policy: consumers must treat fs.events as operational state, not as the billing source of truth.

Alternatives considered

A1 - Per-worker ESL connections

Each Python worker opens a dedicated ESL socket directly to FreeSWITCH.

Pro: minimal latency, natural parallelism and no intermediate service.

Con: many sockets on FreeSWITCH, duplicated ESL password, reconnect logic repeated in every worker, thread/process safety questions in FastAPI workers and ambiguous async event ownership.

Verdict: rejected. It spreads a hot control-plane concern across too many processes for the initial team and phase.

A2 - fs_cli wrapper subprocess

Python calls fs_cli -p '<password>' -x '<command>' through subprocess for each command.

Pro: no custom network client and useful for manual operator debugging.

Con: fork overhead per command, expected latency around 50-100ms, the password can appear in process listings if not handled carefully, and there is no persistent async event stream.

Verdict: rejected for application control. Acceptable only as an interactive/debug tool.

A3 - Direct ESL library inside backend endpoints

The backend exposes FreeSWITCH endpoints and imports a shared module such as apps/backend/src/akira_backend/integrations/freeswitch.py.

Pro: no extra service and one fewer deployable artifact.

Con: connection ownership is unclear across FastAPI workers, reconnect and event subscription lifetime become coupled to web process lifecycle, and ESL credentials are loaded into the general backend runtime.

Verdict: rejected. The ESL control plane deserves its own lifecycle and failure domain.

Implementation status

  • Decision: separate service apps/fs-esl-gateway/ is canonical.
  • TASK-93 implementation: deviated. It implemented the ESL bridge inside apps/backend/src/akira_backend/services/esl_bridge.py because the task acceptance criteria contradicted this ADR. That backend-side implementation is not the accepted production pattern.
  • Refactor: TASK-99 is scheduled to migrate the ESL bridge from apps/backend/ to the separate apps/fs-esl-gateway/ service and close TD-061.

Implementation references

  • TD-061: tracks the TASK-93 backend-side ESL bridge deviation.
  • TASK-99: refactors the implementation to the canonical fs-esl-gateway service.
  • ADR-0008: signaling HA context for Kamailio and FreeSWITCH placement.
  • ADR-0010: adjacent bridge pattern for Kamailio CDR emission.
  • TASK-25: NATS consumer pattern reused downstream from fs.events.
  • TASK-43: alerting for gateway health and ESL connection failures.
  • TASK-82+: FreeSWITCH install, ESL config and fs-esl-gateway implementation.
  • apps/fs-esl-gateway/: service directory to scaffold in the implementation task.
  • infra/roles/freeswitch/templates/event_socket.conf.xml.j2: ESL config.
  • vault_freeswitch_esl_password: Ansible vault variable for ESL auth.

Monitoring & success metrics

  • fs_esl_commands_total: increments for each command accepted by the gateway.
  • fs_esl_errors_total: command and connection errors by type.
  • fs_esl_command_latency_ms: p95 under 50ms in staging.
  • fs_esl_connection_uptime_seconds: uptime ratio above 99.9% per quarter.
  • fs_esl_event_publish_lag_ms: p95 under 100ms from ESL receive to NATS.
  • fs_esl_reconnects_total: alert on sustained reconnect loops.
  • Resident memory below 512MB in normal load.
  • Alert on unhealthy gateway or fs_esl_gateway_connection_failures > 0.

Open questions

  • HA gateway topology: decide in Phase 3 production planning.
  • Exact fs.events schema: define in TASK-82+ after capturing samples.
  • gRPC API: revisit only if REST p95 exceeds 100ms or streaming is required.
  • Command authorization matrix: gateway-level permissions or service JWT claims.