ADR-0014 - Alertmanager routing topology and notification channels
- Status: Accepted (2026-05-20)
- Deciders: Massimo Bagnoli, Claude
- Implementation tasks: TASK-120
- Supersedes: ADR-0013 alert routing subsection
- Superseded by: none
Context
TASK-116, TASK-117, TASK-118 and TASK-119 define exporter coverage and alert rules for Kamailio, RTPengine, Postgres and Redis. Alert rules are not operationally useful until Alertmanager has a routing tree, receivers, grouping, de-duplication and inhibition policies.
Akira staging is operated by a small team, currently team-of-1, while production will require a broader escalation model. The notification topology must therefore be simple enough for staging but explicit enough to evolve into production paging without changing alert labels.
Decision
Adopt a severity-based Alertmanager routing tree with Telegram as the primary operational channel and email as the critical backup channel. Do not introduce PagerDuty in staging.
Routing tree
- severity="critical" routes to critical-multi: Telegram ops chat and email to noc@asheep.it.
- severity="warning" routes to telegram-ops: Telegram ops chat only, silenced through Alertmanager when noisy.
- severity="info" routes to log-only: local webhook sink intended to append informational alerts to /var/log/alertmanager-info.log.
- Alerts without a more specific child route use the default telegram-ops receiver.
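As an illustration, the routing tree above could be expressed roughly as follows. This is a minimal sketch and not the contents of the actual template: receiver definitions and grouping parameters are left out, and only the receiver names and severity values come from this ADR.

```yaml
# Sketch of the severity-based routing tree (receivers defined elsewhere).
route:
  # Default receiver for alerts that match no child route.
  receiver: telegram-ops
  routes:
    - matchers:
        - severity="critical"
      receiver: critical-multi   # Telegram ops chat + email
    - matchers:
        - severity="warning"
      receiver: telegram-ops     # Telegram ops chat only
    - matchers:
        - severity="info"
      receiver: log-only         # local webhook sink
```

Because child routes are evaluated in order and stop at the first match by default, an alert carrying one of the three severities lands on exactly one receiver; anything else falls back to telegram-ops.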
Telegram mute hours (client-side)
The single Akira staging operator should configure Telegram mute hours for
the @akira_ops_staging chat from 22:00 to 08:00 local time. Warning
alerts remain silent during that window and visible at wake-up. Critical
alerts still route through critical-multi, with Telegram plus email, and
are expected to bypass the mute through client-side priority notification
settings.
This matches the current ops-of-1 routine. In Phase 4 production, reassess this policy with on-call rotation and an explicit escalation provider (TD-072).
Grouping and de-duplication
- group_by: alertname, cluster, severity.
- group_wait: 30s.
- group_interval: 5m.
- repeat_interval: 4h.
This policy groups near-simultaneous related alerts, sends updates when a group changes and repeats unresolved notifications at a low enough cadence for staging operations.
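A minimal sketch of how these values might appear on the root route, assuming they are set once at the top level and inherited by the child routes (this ADR does not state whether any child route overrides them):

```yaml
# Grouping and timing parameters on the root route (sketch).
route:
  receiver: telegram-ops
  group_by: ['alertname', 'cluster', 'severity']
  group_wait: 30s        # wait before the first notification for a new group
  group_interval: 5m     # wait before notifying about changes to a group
  repeat_interval: 4h    # wait before re-sending an unchanged, still-firing group
```

With these settings, two alerts that share alertname, cluster and severity and fire within the same 30 seconds are delivered as a single notification.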
Notification channels
Telegram uses the shared bot token vault_telegram_bot_token. The chat id
is supplied through vault_alertmanager_telegram_chatid; staging should use
the private ops group @akira_ops_staging. The Ansible role writes the
numeric chat id to {{ alertmanager_config_dir }}/telegram_chat_id and the
Alertmanager configuration references it through chat_id_file, preserving
Alertmanager's integer type while keeping the template YAML parseable.
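The Telegram receiver described above might look roughly like this inside the Jinja2 template. It is a sketch under stated assumptions: the variable and file names are the ones given in this ADR, the surrounding template structure is assumed, and support for chat_id_file depends on the Alertmanager version in use.

```yaml
# Fragment of a Jinja2-templated Alertmanager config (sketch).
receivers:
  - name: telegram-ops
    telegram_configs:
      # Quoted so the unrendered template still parses as YAML.
      - bot_token: "{{ vault_telegram_bot_token }}"
        # The numeric chat id is read from a file written by the Ansible role,
        # so no quoted (string-typed) chat id appears in the YAML itself.
        chat_id_file: "{{ alertmanager_config_dir }}/telegram_chat_id"
```

Reading the id from a file sidesteps the conflict between quoting the Jinja2 expression, which would turn the id into a string, and leaving it unquoted, which would make the unrendered template invalid YAML.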
Email uses SMTP credentials from vault:
vault_alertmanager_smtp_host, vault_alertmanager_smtp_port,
vault_alertmanager_smtp_user and
vault_alertmanager_smtp_password. The sender and recipient are
noc@asheep.it.
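One way the email side could be wired is sketched below. It assumes the SMTP settings are placed in the global section and that host and port are joined into a single smarthost value; the real template may organise this differently.

```yaml
# SMTP settings and the email leg of critical-multi (sketch).
global:
  smtp_smarthost: "{{ vault_alertmanager_smtp_host }}:{{ vault_alertmanager_smtp_port }}"
  smtp_from: noc@asheep.it
  smtp_auth_username: "{{ vault_alertmanager_smtp_user }}"
  smtp_auth_password: "{{ vault_alertmanager_smtp_password }}"

receivers:
  - name: critical-multi
    # The Telegram entry for this receiver is omitted here for brevity.
    email_configs:
      - to: noc@asheep.it
```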
Inhibition
- RTPengineDown inhibits RTPengine.* alerts on the same instance.
- PostgresDown inhibits Postgres.* alerts on the same instance.
- KamailioDown inhibits Kamailio.* alerts on the same instance.
- HostDown inhibits host-scoped secondary alerts on the same host.
The service-level rules reduce alert storms caused by an exporter or service outage. The host-level rule preserves the ADR-0013 behavior for base infrastructure alerts.
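The service-level rules could be sketched as below, assuming the patterns above are matched against the alertname label. The HostDown rule is left out because this ADR does not spell out how host-scoped secondary alerts are selected.

```yaml
# Service-level inhibition rules (sketch; HostDown rule not shown).
inhibit_rules:
  - source_matchers:
      - alertname="RTPengineDown"
    target_matchers:
      - 'alertname=~"RTPengine.*"'
    equal: ['instance']
  - source_matchers:
      - alertname="PostgresDown"
    target_matchers:
      - 'alertname=~"Postgres.*"'
    equal: ['instance']
  - source_matchers:
      - alertname="KamailioDown"
    target_matchers:
      - 'alertname=~"Kamailio.*"'
    equal: ['instance']
```

Each source alert also matches its own target pattern; Alertmanager does not let an alert that matches both sides of a rule inhibit itself, so the Down alert is still delivered.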
Consequences
Positive
- Critical alerts have two independent notification paths.
- Warning alerts remain visible in the fastest operational channel without introducing paid paging.
- Informational alerts do not page or email the operator.
- Inhibition rules reduce correlated noise during service outages.
Negative
- Telegram remains a third-party dependency for fast notification.
- Email is a backup channel, not a true paging system.
- Staging still has no on-call rotation or SLA escalation automation.
- The log-only receiver requires a local webhook sink to persist info alerts to a file.
Neutral
- This ADR supersedes only the alert routing subsection of ADR-0013.
- Production escalation can add PagerDuty, SMS or on-call rotation later without changing existing alert severity labels.
Alternatives considered
A1 - Email only
Rejected. Email has higher operator latency and is too easy to miss for critical staging incidents.
A2 - PagerDuty in staging
Rejected. The operational value does not justify cost and setup complexity while Akira is still in staging.
A3 - Slack webhook
Rejected. Slack is not the primary operational channel for A.Sheep.
A4 - Complex escalation DAG
Rejected. A multi-level escalation graph is over-engineered for the current team size and can be introduced in a production follow-up.
Implementation references
- Alertmanager template: infra/roles/alertmanager/templates/alertmanager.yml.j2.
- Alertmanager role task writes Telegram chat id file before validating the rendered configuration.
- Vault variables documented in: docs/runbooks/alertmanager-setup.md.
- Smoke test script: tests/test_alertmanager_routing.sh.
Open questions
- Decide the production paging provider for Phase 4.