
ADR-0015 - Background job framework: NATS-native systemd workers

  • Status: Accepted (2026-05-20)
  • Deciders: Massimo Bagnoli, Claude
  • Implementation tasks: TASK-127, TASK-128
  • Supersedes: ADR-0001 job queue guidance for Akira background workers
  • Superseded by: none

Context

Akira already has several asynchronous background processes:

  • kam-cdr-bridge: file-tail bridge from Kamailio CDR JSONL to NATS.
  • fs-esl-gateway: FreeSWITCH ESL bridge to HTTP/NATS.
  • recordings-upload-sftp: periodic recording upload via systemd timer.
  • notification-dispatcher: NATS subscriber for outbound notifications.
  • audit-retention-worker: daily retention cleanup via systemd timer.

Future work is expected to add agent-fee rule resolution, balance-recompute jobs, dispatcher synchronization and export workers.

The architectural question is whether these workers should be unified under a Python job framework such as Celery, Dramatiq, RQ or APScheduler, or should keep the current Akira pattern: independent systemd-managed processes that use NATS JetStream and asyncio where messaging is needed.

Decision

Adopt the Akira-native pattern: one explicit systemd service or timer per worker, NATS JetStream for durable async messaging, and Python asyncio for long-running subscribers and bridges.

Do not introduce Celery, Dramatiq, RQ or APScheduler as a unified background job framework for the current Akira worker set.

The shared behavior needed by NATS subscribers and publishers will be centralized in packages/akira_workers/ through a reusable NATSWorker base class in TASK-128. Existing workers are not refactored by this ADR.
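As an illustration of the long-running asyncio worker shape, here is a minimal sketch of a process that exits cleanly when systemd stops it. The names run_worker and main and the poll interval are hypothetical and are not taken from TASK-128; the loop body is a placeholder for real message handling.

```python
import asyncio
import signal


async def run_worker(stop: asyncio.Event, poll_interval: float = 0.1) -> int:
    """Loop until `stop` is set; return the number of iterations performed."""
    iterations = 0
    while not stop.is_set():
        iterations += 1  # placeholder for real work, e.g. handling one message
        try:
            # Sleep between iterations, but wake immediately on a stop request.
            await asyncio.wait_for(stop.wait(), timeout=poll_interval)
        except asyncio.TimeoutError:
            pass
    return iterations


async def main() -> None:
    stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    # systemd sends SIGTERM on `systemctl stop` by default; SIGINT covers Ctrl-C.
    # add_signal_handler is Unix-only, which matches a systemd deployment.
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, stop.set)
    await run_worker(stop)

# A unit's entry point would call: asyncio.run(main())
```

Passing the stop event into run_worker rather than reading a global keeps the loop testable without sending real signals.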

Worker class hierarchy

TASK-128 defines three explicit worker classes instead of a single implicit durability flag:

Workload | Loss on disconnect OK? | Class | Use case
--- | --- | --- | ---
<1k msg/sec | yes | NATSWorker | ephemeral alerts, simple low-value events
<10k msg/sec | no | NATSJetStreamPushWorker | notification dispatcher, fs-esl-gateway events
>10k msg/sec or batch processing | no | NATSJetStreamPullWorker | production CDR pipeline, bulk rating, fraud batches

NATSWorker is core NATS pub/sub and does not ack/nak. NATSJetStreamPushWorker uses a durable push consumer with explicit ack/nak. NATSJetStreamPullWorker uses durable pull fetches for batch acknowledgement and backpressure.
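The explicit ack/nak behavior of the two JetStream classes can be sketched as follows. This is an assumption-laden illustration, not the TASK-128 design: AckableMessage is a hypothetical protocol that only assumes a message exposes data plus awaitable ack() and nak() methods, and process_with_ack and process_batch are invented helper names.

```python
import asyncio
from typing import Awaitable, Callable, Protocol


class AckableMessage(Protocol):
    """Hypothetical minimal message shape: payload plus async ack/nak."""

    data: bytes

    async def ack(self) -> None: ...

    async def nak(self) -> None: ...


Handler = Callable[[bytes], Awaitable[None]]


async def process_with_ack(msg: AckableMessage, handler: Handler) -> bool:
    """Push-style handling: ack on success, nak on failure for redelivery."""
    try:
        await handler(msg.data)
    except Exception:
        await msg.nak()
        return False
    await msg.ack()
    return True


async def process_batch(msgs: list, handler: Handler) -> tuple[int, int]:
    """Pull-style handling: work through one fetched batch, count acks and naks."""
    acked = 0
    for msg in msgs:
        if await process_with_ack(msg, handler):
            acked += 1
    return acked, len(msgs) - acked
```

A core NATSWorker would skip both helpers and call the handler directly, since core NATS pub/sub has no acknowledgement step.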

Worker mapping:

Worker | Pattern
--- | ---
notification-dispatcher | NATSJetStreamPushWorker
fs-esl-gateway event publisher/consumer side | NATSJetStreamPushWorker
alertmanager webhook | NATSWorker core if loss is acceptable
CDR pipeline | NATSJetStreamPullWorker before Phase 4 production cutover
audit retention cleanup | systemd timer, not NATS worker
balance recompute nightly | arq + Redis if API submit/result is needed
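The class selection criteria can be read as a small decision rule. The sketch below is one possible encoding, with a hypothetical function name select_worker_class. Two boundary choices are assumptions, because the table does not specify them: a loss-tolerant workload at or above 1k msg/sec falls back to the push worker, and exactly 10k msg/sec is treated as push rather than pull.

```python
from enum import Enum


class WorkerClass(Enum):
    CORE = "NATSWorker"
    PUSH = "NATSJetStreamPushWorker"
    PULL = "NATSJetStreamPullWorker"


def select_worker_class(
    msgs_per_sec: int, loss_ok: bool, batch: bool = False
) -> WorkerClass:
    """Pick a worker class from throughput, loss tolerance and batch needs."""
    # Batch processing or very high throughput needs pull fetches and backpressure.
    if batch or msgs_per_sec > 10_000:
        return WorkerClass.PULL
    # Core NATS only when losing messages on disconnect is acceptable.
    if loss_ok and msgs_per_sec < 1_000:
        return WorkerClass.CORE
    # Everything else gets a durable push consumer with explicit ack/nak.
    return WorkerClass.PUSH
```

For example, select_worker_class(500, loss_ok=True) selects the core NATSWorker, while the same rate with loss_ok=False selects NATSJetStreamPushWorker.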

Rationale

  1. NATS JetStream is already the Akira async bus. ADR-0007 and TASK-25 establish NATS for durable event flow, and current subjects already cover CDR, FreeSWITCH events, notifications and alerts. Adding a job framework with Redis as a second broker would duplicate responsibility.

  2. systemd is the canonical Akira operations primitive. Workers already map naturally to service units or timers with journald logging, Restart=always, dependency ordering and Ansible deployment.

  3. The worker shapes are heterogeneous. File-tail bridges, NATS subscribers, timers and protocol gateways do not all fit the "submit task, run later, return result" model that Celery-style frameworks optimize for.

  4. Visibility already comes from the existing stack: journald to Loki, Prometheus metrics, NATS JetStream durable consumers, acknowledgements and redelivery. Introducing Flower or another framework-specific dashboard would split operational visibility.

  5. The current scale is 5-10 workers. A small number of explicit processes is simpler to reason about than a generic queue cluster, beat scheduler and framework-specific deployment model.

Consequences

Positive

  • One async broker remains canonical: NATS JetStream.
  • Each worker has a direct runtime identity: one systemd unit or timer and one Python entry point.
  • Logs follow the existing journald -> Promtail -> Loki pipeline.
  • Metrics follow the existing Prometheus pattern.
  • Deployment remains Ansible plus systemd, without a Celery or Redis queue cluster to operate.
  • Failure domains stay narrow: a broken worker does not take down a shared generic worker pool.

Negative

  • Some boilerplate is repeated across workers: NATS connection setup, reconnect handling, acknowledgement behavior, stop handling and metrics.
  • There is no out-of-the-box "submit job from API and poll result" framework contract.
  • Retry and dead-letter behavior can drift if each worker implements its own policy.
  • Scaling many similar jobs will require discipline in naming, metrics, systemd units and NATS subject conventions.

Mitigations

  • TASK-128 introduces packages/akira_workers/ with a shared NATSWorker base class for common subscriber lifecycle behavior.
  • Workers that need API-submitted long jobs should publish explicit NATS commands such as akira.jobs.export.create, persist status in Postgres and expose polling through the API.
  • Retry and DLQ policy should be standardized in the shared worker package before adding many more queue-backed workers.
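The submit-and-poll flow for API-submitted long jobs could look roughly like the sketch below. Everything except the subject akira.jobs.export.create is hypothetical: JobStore is an in-memory stand-in for the Postgres status table, publish is any callable that sends bytes to a subject, and the status names queued, running and done are illustrative only.

```python
import json
import uuid
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class JobStore:
    """In-memory stand-in for a Postgres job status table (hypothetical)."""

    rows: dict = field(default_factory=dict)

    def create(self, job_id: str) -> None:
        self.rows[job_id] = {"status": "queued", "result": None}

    def update(self, job_id: str, status: str, result: Optional[dict] = None) -> None:
        self.rows[job_id] = {"status": status, "result": result}

    def get(self, job_id: str) -> Optional[dict]:
        return self.rows.get(job_id)


def submit_export(
    store: JobStore, publish: Callable[[str, bytes], None], params: dict
) -> str:
    """API side: persist the job first, then publish the explicit command."""
    job_id = str(uuid.uuid4())
    store.create(job_id)  # a poll right after submit already finds the row
    payload = json.dumps({"job_id": job_id, "params": params}).encode()
    publish("akira.jobs.export.create", payload)
    return job_id


def handle_export(store: JobStore, payload: bytes) -> None:
    """Worker side: record progress so the API can answer polling requests."""
    command = json.loads(payload)
    job_id = command["job_id"]
    store.update(job_id, "running")
    # Placeholder for the real export work.
    store.update(job_id, "done", result={"rows": 0})
```

The API polling endpoint would then only read store.get(job_id), which keeps the result contract in the database rather than in a framework-specific result backend.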

Alternatives considered

A1 - Celery with Redis broker

Celery is mature and has strong retry controls, routing features and Flower monitoring.

Rejected. It would introduce a second broker beside NATS, add Celery beat or another scheduler, and make Akira operate a generic worker cluster for a set of processes that are mostly not Celery-shaped tasks.

A2 - Dramatiq with Redis

Dramatiq is lighter than Celery and easier to operate for simple task queues.

Rejected. It still duplicates the broker layer and does not materially help file-tail bridges, protocol gateways or systemd timer jobs.

A3 - RQ

RQ is small and simple for Redis-backed Python jobs.

Rejected. It is not async-native, scheduling needs extra components, and it duplicates NATS while covering only the narrow queued-task subset.

A4 - APScheduler inside FastAPI

APScheduler would avoid a separate worker deployment for periodic jobs.

Rejected. In-process scheduling couples operational jobs to the API process, creates missed-run risk on backend restarts and makes horizontal API scaling ambiguous.

A5 - Status quo: custom units without shared worker package

Keeping only custom systemd units avoids new dependencies.

Rejected as the full target because it leaves repeated lifecycle code and inconsistent retry behavior. Accepted only as the deployment pattern, with TASK-128 adding shared worker primitives.

Implementation references

  • ADR-0007: NATS JetStream as durable CDR/event bus.
  • ADR-0010: Kamailio CDR file-tail sidecar to NATS.
  • ADR-0011: FreeSWITCH ESL gateway as a separate service.
  • TASK-97: recordings-upload-sftp systemd timer pattern.
  • TASK-124: notification-dispatcher NATS subscriber.
  • TASK-126: audit-retention-worker systemd timer.
  • TASK-128: shared packages/akira_workers/ and NATSWorker base class.

Monitoring and operations

  • Each worker should expose or emit worker-specific Prometheus metrics where practical: processed count, failures, retry count and last successful run.
  • Each worker should log structured events with worker name, correlation id where available and NATS subject or job id where applicable.
  • systemd units should use explicit service names, Restart=always for long-running workers and timers for periodic jobs.
  • NATS consumers should use durable names and explicit acknowledgement for work that must not be lost.
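The structured logging guideline above can be sketched with the standard library alone. This is a minimal illustration, assuming JSON lines on stderr (which journald captures for a systemd service); the JsonFormatter and log_event names and the exact field set are hypothetical, not an existing Akira convention.

```python
import json
import logging
from typing import Optional


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with optional worker fields."""

    FIELDS = ("worker", "correlation_id", "subject", "job_id")

    def format(self, record: logging.LogRecord) -> str:
        event = {"level": record.levelname, "message": record.getMessage()}
        for name in self.FIELDS:
            value = getattr(record, name, None)
            if value is not None:  # omit fields that do not apply to this event
                event[name] = value
        return json.dumps(event)


def log_event(
    logger: logging.Logger,
    worker: str,
    message: str,
    correlation_id: Optional[str] = None,
    subject: Optional[str] = None,
    job_id: Optional[str] = None,
) -> None:
    """Log one event with the worker name and whichever ids are available."""
    logger.info(
        message,
        extra={
            "worker": worker,
            "correlation_id": correlation_id,
            "subject": subject,
            "job_id": job_id,
        },
    )
```

A worker would attach JsonFormatter to a logging.StreamHandler once at startup and then call log_event from its message handler.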

Open questions

  • Confirm whether the old ADR-0001 arq guidance is fully retired for all UI/admin jobs or remains available for a narrow API-only use case.
  • Define the default retry/backoff policy in TASK-128.
  • Define the default NATS DLQ convention for workers that exhaust redelivery.
  • Revisit a generic job framework if Akira grows beyond roughly 50 workers or gains complex task DAG requirements.