
ADR-0014 - Alertmanager routing topology and notification channels

  • Status: Accepted (2026-05-20)
  • Deciders: Massimo Bagnoli, Claude
  • Implementation tasks: TASK-120
  • Supersedes: ADR-0013 alert routing subsection
  • Superseded by: none

Context

TASK-116, TASK-117, TASK-118 and TASK-119 define exporter coverage and alert rules for Kamailio, RTPengine, Postgres and Redis. Alert rules are not operationally useful until Alertmanager has a routing tree, receivers, grouping, de-duplication and inhibition policies.

Akira staging is operated by a small team, currently a single operator, while production will require a broader escalation model. The notification topology must therefore be simple enough for staging but explicit enough to evolve into production paging without changing alert labels.

Decision

Adopt a severity-based Alertmanager routing tree with Telegram as the primary operational channel and email as the critical backup channel. Do not introduce PagerDuty in staging.

Routing tree

  • severity="critical" routes to critical-multi: Telegram ops chat and email to noc@asheep.it.
  • severity="warning" routes to telegram-ops: Telegram ops chat only; noisy alerts are silenced through Alertmanager.
  • severity="info" routes to log-only: local webhook sink intended to append informational alerts to /var/log/alertmanager-info.log.
  • Alerts without a more specific child route use the default telegram-ops receiver.
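The routing tree above could be expressed roughly as follows. This is a minimal sketch, not the contents of the actual template: only the receiver names and severity values come from this ADR, and it assumes the matchers syntax available in current Alertmanager releases.

```yaml
# Sketch of the route section only; receivers are defined elsewhere.
route:
  # Default receiver for alerts that match no child route.
  receiver: telegram-ops
  routes:
    # Child routes are evaluated in order; the first match wins
    # because `continue` defaults to false.
    - matchers:
        - 'severity="critical"'
      receiver: critical-multi
    - matchers:
        - 'severity="warning"'
      receiver: telegram-ops
    - matchers:
        - 'severity="info"'
      receiver: log-only
```

Assuming amtool is installed next to Alertmanager, a command such as `amtool config routes test --config.file=<rendered config> severity=critical` should report which receiver a given label set reaches.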

Telegram mute hours (client-side)

The single Akira staging operator should configure Telegram mute hours for the @akira_ops_staging chat from 22:00 to 08:00 local time. Warning alerts remain silent during that window and visible at wake-up. Critical alerts still route through critical-multi, with Telegram plus email, and are expected to bypass the mute through client-side priority notification settings.

This matches the current single-operator routine. In Phase 4 production, reassess this policy with on-call rotation and an explicit escalation provider (TD-072).

Grouping and de-duplication

  • group_by: alertname, cluster, severity.
  • group_wait: 30s.
  • group_interval: 5m.
  • repeat_interval: 4h.

This policy groups near-simultaneous related alerts, sends updates when a group changes and repeats unresolved notifications at a low enough cadence for staging operations.
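As a sketch, the values above would sit on the root route so that child routes inherit them. The comments paraphrase the Alertmanager semantics of each timer. Placing them at the root is an assumption, not something this ADR states.

```yaml
route:
  receiver: telegram-ops
  # Alerts sharing these label values are batched into one notification.
  group_by: ['alertname', 'cluster', 'severity']
  # Wait before the first notification for a new group, so that
  # near-simultaneous alerts arrive together.
  group_wait: 30s
  # Wait before notifying about new alerts added to a group that
  # has already been notified.
  group_interval: 5m
  # Wait before re-sending a notification for a group that is
  # still firing and unchanged.
  repeat_interval: 4h
```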

Notification channels

Telegram uses the shared bot token vault_telegram_bot_token. The chat id is supplied through vault_alertmanager_telegram_chatid; staging should use the private ops group @akira_ops_staging. The Ansible role writes the numeric chat id to {{ alertmanager_config_dir }}/telegram_chat_id and the Alertmanager configuration references it through chat_id_file, preserving Alertmanager's integer type while keeping the template YAML parseable.
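The chat id handling described above might look like the following pair of Ansible tasks. This is a hedged sketch: the task names, the owner, group and mode, the destination file name alertmanager.yml and the use of `amtool check-config` as validator are assumptions for illustration. The variables, the chat id path and the template name come from this ADR.

```yaml
# Hypothetical tasks; the real role may differ.
- name: Write Telegram chat id file for Alertmanager
  ansible.builtin.copy:
    content: "{{ vault_alertmanager_telegram_chatid }}"
    dest: "{{ alertmanager_config_dir }}/telegram_chat_id"
    owner: alertmanager   # assumed service account
    group: alertmanager   # assumed service group
    mode: "0600"
  no_log: true            # keep the vault value out of task output

- name: Render and validate Alertmanager configuration
  ansible.builtin.template:
    src: alertmanager.yml.j2
    dest: "{{ alertmanager_config_dir }}/alertmanager.yml"  # assumed file name
    # The rendered file replaces the destination only if validation passes.
    validate: amtool check-config %s
```

Running the copy task first keeps the order stated in the implementation references: the chat id file is written before the rendered configuration is validated.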

Email uses SMTP credentials from vault: vault_alertmanager_smtp_host, vault_alertmanager_smtp_port, vault_alertmanager_smtp_user and vault_alertmanager_smtp_password. The sender and recipient are noc@asheep.it.
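Put together, the three receivers could be sketched as below. The webhook URL and port are hypothetical placeholders, and putting the SMTP settings under `global` is one possible layout rather than the confirmed one. The vault variable names, the receiver names and the `chat_id_file` reference are taken from this ADR.

```yaml
global:
  # Quoted so the Jinja expressions keep the template parseable as YAML.
  smtp_smarthost: "{{ vault_alertmanager_smtp_host }}:{{ vault_alertmanager_smtp_port }}"
  smtp_from: noc@asheep.it
  smtp_auth_username: "{{ vault_alertmanager_smtp_user }}"
  smtp_auth_password: "{{ vault_alertmanager_smtp_password }}"

receivers:
  - name: telegram-ops
    telegram_configs:
      - bot_token: "{{ vault_telegram_bot_token }}"
        chat_id_file: "{{ alertmanager_config_dir }}/telegram_chat_id"

  - name: critical-multi
    # Two notification paths for critical alerts: Telegram and email.
    telegram_configs:
      - bot_token: "{{ vault_telegram_bot_token }}"
        chat_id_file: "{{ alertmanager_config_dir }}/telegram_chat_id"
    email_configs:
      - to: noc@asheep.it

  - name: log-only
    webhook_configs:
      # Hypothetical local sink that appends to /var/log/alertmanager-info.log.
      - url: http://127.0.0.1:9095/info
```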

Inhibition

  • RTPengineDown inhibits RTPengine.* alerts on the same instance.
  • PostgresDown inhibits Postgres.* alerts on the same instance.
  • KamailioDown inhibits Kamailio.* alerts on the same instance.
  • HostDown inhibits host-scoped secondary alerts on the same host.

The service-level rules reduce alert storms caused by an exporter or service outage. The host-level rule preserves the ADR-0013 behavior for base infrastructure alerts.
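These rules could be written as follows. This is a sketch under assumptions: the `instance` label for the service rules follows the wording above, while the `scope="host"` target matcher and the `host` label in the last rule are hypothetical stand-ins for however host-scoped alerts are actually labeled.

```yaml
inhibit_rules:
  - source_matchers:
      - 'alertname="RTPengineDown"'
    target_matchers:
      - 'alertname=~"RTPengine.*"'
    equal: ['instance']

  - source_matchers:
      - 'alertname="PostgresDown"'
    target_matchers:
      - 'alertname=~"Postgres.*"'
    equal: ['instance']

  - source_matchers:
      - 'alertname="KamailioDown"'
    target_matchers:
      - 'alertname=~"Kamailio.*"'
    equal: ['instance']

  # Hypothetical labels: adjust to the real host-scoped alert labels.
  - source_matchers:
      - 'alertname="HostDown"'
    target_matchers:
      - 'scope="host"'
    equal: ['host']
```

Each Down alert also matches its own target regex. Alertmanager does not let an alert that matches both sides of a rule inhibit itself, so the Down alert still notifies.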

Consequences

Positive

  • Critical alerts have two independent notification paths.
  • Warning alerts remain visible in the fastest operational channel without introducing paid paging.
  • Informational alerts do not page or email the operator.
  • Inhibition rules reduce correlated noise during service outages.

Negative

  • Telegram remains a third-party dependency for fast notification.
  • Email is a backup channel, not a true paging system.
  • Staging still has no on-call rotation or SLA escalation automation.
  • The log-only receiver requires a local webhook sink to persist info alerts to a file.

Neutral

  • This ADR supersedes only the alert routing subsection of ADR-0013.
  • Production escalation can add PagerDuty, SMS or on-call rotation later without changing existing alert severity labels.

Alternatives considered

A1 - Email only

Rejected. Email has higher operator latency and is too easy to miss for critical staging incidents.

A2 - PagerDuty in staging

Rejected. The operational value does not justify cost and setup complexity while Akira is still in staging.

A3 - Slack webhook

Rejected. Slack is not the primary operational channel for A.Sheep.

A4 - Complex escalation DAG

Rejected. A multi-level escalation graph is over-engineered for the current team size and can be introduced in a production follow-up.

Implementation references

  • Alertmanager template: infra/roles/alertmanager/templates/alertmanager.yml.j2.
  • The Alertmanager role task writes the Telegram chat id file before validating the rendered configuration.
  • Vault variables documented in: docs/runbooks/alertmanager-setup.md.
  • Smoke test script: tests/test_alertmanager_routing.sh.

Open questions

  • Decide the production paging provider for Phase 4.