ADR-0015 - Background job framework: NATS-native systemd workers
- Status: Accepted (2026-05-20)
- Deciders: Massimo Bagnoli, Claude
- Implementation tasks: TASK-127, TASK-128
- Supersedes: ADR-0001 job queue guidance for Akira background workers
- Superseded by: none
Context
Akira already has several asynchronous background processes:
- kam-cdr-bridge: file-tail bridge from Kamailio CDR JSONL to NATS.
- fs-esl-gateway: FreeSWITCH ESL bridge to HTTP/NATS.
- recordings-upload-sftp: periodic recording upload via systemd timer.
- notification-dispatcher: NATS subscriber for outbound notifications.
- audit-retention-worker: daily retention cleanup via systemd timer.
Future work is expected to add agent-fee rule resolution, balance-recompute jobs, dispatcher synchronization and export workers.
The architectural question is whether these workers should be unified under a Python job framework such as Celery, Dramatiq, RQ or APScheduler, or should keep the current Akira pattern of independent systemd-managed processes that use NATS JetStream and asyncio where messaging is needed.
Decision
Adopt the Akira-native pattern: one explicit systemd service or timer per worker, NATS JetStream for durable async messaging, and Python asyncio for long-running subscribers and bridges.
Do not introduce Celery, Dramatiq, RQ or APScheduler as a unified background job framework for the current Akira worker set.
The shared behavior needed by NATS subscribers and publishers will be
centralized in packages/akira_workers/ through a reusable NATSWorker
base class in TASK-128. Existing workers are not refactored by this ADR.
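The shared lifecycle behavior described above could be sketched as follows. This is a minimal illustration, not the TASK-128 API: the class name `WorkerLifecycle`, its methods and the `install_signals` flag are all assumptions, and the NATS connection itself is left to the `start` and `stop` hooks so the skeleton runs without a broker.

```python
import asyncio
import signal
from abc import ABC, abstractmethod


class WorkerLifecycle(ABC):
    """Hypothetical lifecycle skeleton; not the actual TASK-128 NATSWorker API."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._stop = None  # created inside run() so it belongs to the running loop

    def request_stop(self) -> None:
        """Ask run() to exit; used as the signal handler callback."""
        if self._stop is not None:
            self._stop.set()

    @abstractmethod
    async def start(self) -> None:
        """Connect and subscribe (for example, open the NATS connection)."""

    @abstractmethod
    async def stop(self) -> None:
        """Drain subscriptions and close connections."""

    async def run(self, install_signals: bool = True) -> None:
        self._stop = asyncio.Event()
        if install_signals:
            loop = asyncio.get_running_loop()
            # systemd sends SIGTERM on `systemctl stop`; SIGINT covers Ctrl-C.
            for sig in (signal.SIGTERM, signal.SIGINT):
                loop.add_signal_handler(sig, self.request_stop)
        await self.start()
        try:
            await self._stop.wait()
        finally:
            # Always release resources, even if the wait is cancelled.
            await self.stop()


class EchoWorker(WorkerLifecycle):
    """Toy subclass that records lifecycle events and stops itself."""

    def __init__(self) -> None:
        super().__init__("echo")
        self.events = []

    async def start(self) -> None:
        self.events.append("started")
        # Stop on the next loop iteration so the example terminates by itself.
        asyncio.get_running_loop().call_soon(self.request_stop)

    async def stop(self) -> None:
        self.events.append("stopped")


def demo() -> list:
    worker = EchoWorker()
    asyncio.run(worker.run(install_signals=False))
    return worker.events
```

A real worker would run with `install_signals=True` under systemd, so that `systemctl stop` triggers a clean drain instead of a hard kill.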
Worker class hierarchy
TASK-128 defines three explicit worker classes instead of a single implicit durability flag:
| Workload | Loss on disconnect OK? | Class | Use case |
|---|---|---|---|
| <1k msg/sec | yes | NATSWorker | ephemeral alerts, simple low-value events |
| <10k msg/sec | no | NATSJetStreamPushWorker | notification dispatcher, fs-esl-gateway events |
| >10k msg/sec or batch processing | no | NATSJetStreamPullWorker | production CDR pipeline, bulk rating, fraud batches |
NATSWorker is core NATS pub/sub and does not ack/nak. NATSJetStreamPushWorker
uses a durable push consumer with explicit ack/nak. NATSJetStreamPullWorker
uses durable pull fetches for batch acknowledgement and backpressure.
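The selection rule in the table can be written down as a small function. This is only a sketch: the function name, the enum and the handling of cases the table leaves open (loss-tolerant traffic between 1k and 10k msg/sec, or exactly 10k msg/sec) are assumptions, resolved here in favor of the durable push class.

```python
from enum import Enum


class WorkerKind(Enum):
    CORE = "NATSWorker"
    PUSH = "NATSJetStreamPushWorker"
    PULL = "NATSJetStreamPullWorker"


def choose_worker_kind(msgs_per_sec: float, loss_ok: bool, batch: bool = False) -> WorkerKind:
    """Map a workload to one of the three worker classes (hypothetical helper)."""
    # High throughput or batch processing needs pull fetches for backpressure.
    if batch or msgs_per_sec > 10_000:
        return WorkerKind.PULL
    # Core pub/sub only when losing messages on disconnect is acceptable.
    if loss_ok and msgs_per_sec < 1_000:
        return WorkerKind.CORE
    # Assumption: everything else defaults to the durable push consumer.
    return WorkerKind.PUSH
```

For example, `choose_worker_kind(200, loss_ok=True)` selects the core `NATSWorker`, while the same rate with `loss_ok=False` selects `NATSJetStreamPushWorker`.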
Worker mapping:
| Worker | Pattern |
|---|---|
| notification-dispatcher | NATSJetStreamPushWorker |
| fs-esl-gateway event publisher/consumer side | NATSJetStreamPushWorker |
| alertmanager webhook | NATSWorker core if loss is acceptable |
| CDR pipeline | NATSJetStreamPullWorker before Phase 4 production cutover |
| audit retention cleanup | systemd timer, not NATS worker |
| balance recompute nightly | arq + Redis if API submit/result is needed |
Rationale
- NATS JetStream is already the Akira async bus. ADR-0007 and TASK-25 establish NATS for durable event flow, and current subjects already cover CDR, FreeSWITCH events, notifications and alerts. Adding a job framework with Redis as a second broker would duplicate responsibility.
- systemd is the canonical Akira operations primitive. Workers already map naturally to service units or timers with journald logging, Restart=always, dependency ordering and Ansible deployment.
- The worker shapes are heterogeneous. File-tail bridges, NATS subscribers, timers and protocol gateways do not all fit the "submit task, run later, return result" model that Celery-style frameworks optimize for.
- Visibility already comes from the existing stack: journald to Loki, Prometheus metrics, NATS JetStream durable consumers, acknowledgements and redelivery. Introducing Flower or another framework-specific dashboard would split operational visibility.
- The current scale is 5-10 workers. A small number of explicit processes is simpler to reason about than a generic queue cluster, beat scheduler and framework-specific deployment model.
Consequences
Positive
- One async broker remains canonical: NATS JetStream.
- Each worker has a direct runtime identity: one systemd unit or timer and one Python entry point.
- Logs follow the existing journald -> Promtail -> Loki pipeline.
- Metrics follow the existing Prometheus pattern.
- Deployment remains Ansible plus systemd, without a Celery or Redis queue cluster to operate.
- Failure domains stay narrow: a broken worker does not take down a shared generic worker pool.
Negative
- Some boilerplate is repeated across workers: NATS connection setup, reconnect handling, acknowledgement behavior, stop handling and metrics.
- There is no out-of-the-box "submit job from API and poll result" framework contract.
- Retry and dead-letter behavior can drift if each worker implements its own policy.
- Scaling many similar jobs will require discipline in naming, metrics, systemd units and NATS subject conventions.
Mitigations
- TASK-128 introduces packages/akira_workers/ with a shared NATSWorker base class for common subscriber lifecycle behavior.
- Workers that need API-submitted long jobs should publish explicit NATS commands such as akira.jobs.export.create, persist status in Postgres and expose polling through the API.
- Retry and DLQ policy should be standardized in the shared worker package before adding many more queue-backed workers.
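The "publish a command, persist status, poll through the API" mitigation could look roughly like this. The sketch is transport-agnostic: `JobStore` is an in-memory stand-in for the Postgres status table, `publish` is any callable standing in for a JetStream publish, and the payload layout and status values are assumptions. Only the subject name comes from this ADR.

```python
import json
import uuid
from dataclasses import dataclass, field


@dataclass
class JobStore:
    """In-memory stand-in for a Postgres job-status table (hypothetical)."""
    rows: dict = field(default_factory=dict)

    def create(self, job_id: str) -> None:
        self.rows[job_id] = "queued"

    def set_status(self, job_id: str, status: str) -> None:
        self.rows[job_id] = status

    def get(self, job_id: str):
        # Returns None for unknown jobs, which an API could map to HTTP 404.
        return self.rows.get(job_id)


def build_command(subject: str, params: dict) -> tuple:
    """Build (job_id, subject, payload) for an explicit job command."""
    job_id = str(uuid.uuid4())
    payload = json.dumps({"job_id": job_id, "params": params}).encode()
    return job_id, subject, payload


def submit_export(store: JobStore, publish, params: dict) -> str:
    """API-side submit: persist status first, then publish the command."""
    job_id, subject, payload = build_command("akira.jobs.export.create", params)
    # Persist before publishing so a poll can never miss a job that was sent.
    store.create(job_id)
    publish(subject, payload)
    return job_id
```

The API returns `job_id` to the caller, the export worker updates the row through `set_status`, and the polling endpoint reads it back with `get`.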
Alternatives considered
A1 - Celery with Redis broker
Celery is mature and has strong retry controls, routing features and Flower monitoring.
Rejected. It would introduce a second broker beside NATS, add Celery beat or another scheduler, and make Akira operate a generic worker cluster for a set of processes that mostly are not Celery-shaped tasks.
A2 - Dramatiq with Redis
Dramatiq is lighter than Celery and easier to operate for simple task queues.
Rejected. It still duplicates the broker layer and does not materially help file-tail bridges, protocol gateways or systemd timer jobs.
A3 - RQ
RQ is small and simple for Redis-backed Python jobs.
Rejected. It is not async-native, scheduling needs extra components, and it duplicates NATS while covering only the narrow queued-task subset.
A4 - APScheduler inside FastAPI
APScheduler would avoid a separate worker deployment for periodic jobs.
Rejected. In-process scheduling couples operational jobs to the API process, creates missed-run risk on backend restarts and makes horizontal API scaling ambiguous.
A5 - Status quo: custom units without shared worker package
Keeping only custom systemd units avoids new dependencies.
Rejected as the full target because it leaves repeated lifecycle code and inconsistent retry behavior. Accepted only as the deployment pattern, with TASK-128 adding shared worker primitives.
Implementation references
- ADR-0007: NATS JetStream as durable CDR/event bus.
- ADR-0010: Kamailio CDR file-tail sidecar to NATS.
- ADR-0011: FreeSWITCH ESL gateway as a separate service.
- TASK-97: recordings-upload-sftp systemd timer pattern.
- TASK-124: notification-dispatcher NATS subscriber.
- TASK-126: audit-retention-worker systemd timer.
- TASK-128: shared packages/akira_workers/ and NATSWorker base class.
Monitoring and operations
- Each worker should expose or emit worker-specific Prometheus metrics where practical: processed count, failures, retry count and last successful run.
- Each worker should log structured events with worker name, correlation id where available and NATS subject or job id where applicable.
- systemd units should use explicit service names, Restart=always for long-running workers and timers for periodic jobs.
- NATS consumers should use durable names and explicit acknowledgement for work that must not be lost.
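The structured-logging guideline above could be met with the standard library alone. The following is a sketch under assumptions: the field names (`worker`, `correlation_id`, `subject`, `job_id`), the `akira.<worker>` logger naming and the JSON layout are illustrative, not an agreed Akira log schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render log records as one JSON object per line (hypothetical schema)."""

    FIELDS = ("worker", "correlation_id", "subject", "job_id")

    def format(self, record: logging.LogRecord) -> str:
        entry = {"level": record.levelname, "message": record.getMessage()}
        for key in self.FIELDS:
            value = getattr(record, key, None)
            if value is not None:  # omit fields that do not apply to this event
                entry[key] = value
        return json.dumps(entry, sort_keys=True)


def make_logger(worker: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(f"akira.{worker}")
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()  # stderr is captured by journald under systemd
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_event(logger: logging.Logger, worker: str, message: str, **fields) -> None:
    """Log one event with the worker name and any optional context fields."""
    logger.info(message, extra={"worker": worker, **fields})
```

A call such as `log_event(make_logger("notification-dispatcher"), "notification-dispatcher", "delivered", subject="akira.jobs.export.create")` emits a single JSON line that Promtail can forward to Loki without extra parsing rules.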
Open questions
- Confirm whether the old ADR-0001 arq guidance is fully retired for all UI/admin jobs or remains available for a narrow API-only use case.
- Define the default retry/backoff policy in TASK-128.
- Define the default NATS DLQ convention for workers that exhaust redelivery.
- Revisit a generic job framework if Akira grows beyond roughly 50 workers or gains complex task DAG requirements.
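As a starting point for the retry and DLQ questions above, one possible shape for a shared policy object is sketched below. Every value here is a placeholder: the delivery limit, the delays and the `akira.dlq` subject prefix are assumptions for discussion, not decisions of this ADR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical shared retry/DLQ policy; all defaults are placeholders."""
    max_deliveries: int = 5
    base_delay_s: float = 1.0
    max_delay_s: float = 60.0
    dlq_prefix: str = "akira.dlq"

    def delay_for(self, delivery_count: int) -> float:
        """Exponential backoff: base * 2^(n-1), capped at max_delay_s."""
        return min(self.base_delay_s * 2 ** (delivery_count - 1), self.max_delay_s)

    def next_action(self, subject: str, delivery_count: int) -> tuple:
        """Decide what to do after a failed delivery attempt.

        Returns ("retry", delay_seconds) or ("dead_letter", dlq_subject).
        """
        if delivery_count >= self.max_deliveries:
            return ("dead_letter", f"{self.dlq_prefix}.{subject}")
        return ("retry", self.delay_for(delivery_count))
```

A worker would call `next_action` with the delivery count reported for the message, then either negatively acknowledge with the returned delay or republish to the returned dead-letter subject and acknowledge the original. Keeping the decision in one object is what prevents the per-worker policy drift listed under the negative consequences.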