NATS cluster migration (pilot to GA)

When to migrate

  • Pilot Phase 2 to GA transition.
  • Customer count greater than 10, where the single-node broker becomes a material SPOF.
  • SLA uptime commitment at or above 99.9%.
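
To put the 99.9% threshold in concrete terms, the sketch below converts an uptime percentage into an allowed-downtime budget. The function name downtime_budget_minutes and the 30-day month are illustrative assumptions, not part of this runbook.

```shell
# Allowed downtime, in minutes, for an uptime percentage over a number of days.
# Assumption: a 30-day month unless a second argument is given.
downtime_budget_minutes() {
  uptime_pct="$1"
  days="${2:-30}"
  LC_ALL=C awk -v u="$uptime_pct" -v d="$days" \
    'BEGIN { printf "%.1f\n", d * 24 * 60 * (100 - u) / 100 }'
}

downtime_budget_minutes 99.9 30   # prints 43.2
```

At 99.9% over 30 days the budget is roughly 43 minutes, which is the scale against which a single-node broker outage would be measured.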

Procedure

Plan a maintenance window of about 2 hours.

  1. Notify customers with at least 1 week of notice.
  2. Provision three dedicated NATS VMs for GA, or absorb akira-cache-01-staging only after explicitly accepting the migration risk.
  3. Replace placeholders in infra/inventory/ga.yml.
  4. Set vault_nats_cluster_password in the real vault.
  5. Back up the pilot stream with nats stream backup AKIRA_CDR <backup-dir>.
  6. Deploy the NATS role with nats_cluster_enabled=true.
  7. Verify cluster formation with nats server list.
  8. Raise AKIRA_CDR to 3 replicas with nats stream edit AKIRA_CDR --replicas 3 after all nodes are healthy.
  9. Test failover by stopping one NATS node and verifying publish/consume still works.
  10. Restart the stopped node and verify stream catch-up.
  11. Monitor JetStream health, disk, and consumer backlog for 24 hours.
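
The NATS CLI part of the procedure (steps 5, 7, 8 and 9) might be scripted roughly as follows. This is a dry-run sketch, not the project's tooling: the run helper, the NATS_URL and BACKUP_DIR defaults, and the TEST_SUBJECT value are all placeholders assumed for illustration. Stopping and restarting a node (steps 9 and 10) stays a manual action.

```shell
#!/bin/sh
# Dry-run sketch of procedure steps 5, 7, 8 and 9. Commands are printed, not
# executed, unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

# Assumptions (not from the runbook): connection URL, backup directory and
# test subject are placeholders to be replaced with real values.
NATS_URL="${NATS_URL:-nats://127.0.0.1:4222}"
BACKUP_DIR="${BACKUP_DIR:-./backup/AKIRA_CDR}"
TEST_SUBJECT="${TEST_SUBJECT:-akira.cdr.failover-test}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Step 5: back up the pilot stream into BACKUP_DIR.
backup_stream() { run nats --server "$NATS_URL" stream backup AKIRA_CDR "$BACKUP_DIR"; }

# Step 7: list the servers that joined the cluster.
verify_cluster() { run nats --server "$NATS_URL" server list; }

# Step 8: raise the replica count (the CLI may ask for confirmation).
scale_stream() { run nats --server "$NATS_URL" stream edit AKIRA_CDR --replicas 3; }

# Step 9: after stopping one node by hand, publish a test message.
failover_publish() { run nats --server "$NATS_URL" pub "$TEST_SUBJECT" "failover check"; }

backup_stream
verify_cluster
scale_stream
failover_publish
```

Reviewing the printed commands before setting DRY_RUN=0 keeps the maintenance window free of surprises.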

Rollback

  1. Stop the new GA NATS nodes.
  2. Decrease AKIRA_CDR replicas back to 1 if the stream metadata allows it.
  3. Restore the pilot stream backup if stream state is inconsistent.
  4. Point application NATS_URL back to the pilot node.
  5. Re-run the nats tag against the staging inventory with nats_cluster_enabled=false.
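
Rollback steps 2 and 3 could look like the dry-run sketch below. NATS_URL, BACKUP_DIR and the run helper are assumed placeholders. The restore argument form shown here matches recent NATS CLI releases as far as I know; older releases took the stream name first, so check nats stream restore --help on the installed version.

```shell
#!/bin/sh
# Dry-run sketch of rollback steps 2 and 3. Commands are printed, not
# executed, unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

# Assumptions (not from the runbook): placeholder URL and backup directory.
NATS_URL="${NATS_URL:-nats://127.0.0.1:4222}"
BACKUP_DIR="${BACKUP_DIR:-./backup/AKIRA_CDR}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Step 2: shrink the stream back to a single replica.
shrink_stream() { run nats --server "$NATS_URL" stream edit AKIRA_CDR --replicas 1; }

# Step 3: restore the pilot backup. A restore is generally refused while a
# stream with the same name still exists on the target.
restore_stream() { run nats --server "$NATS_URL" stream restore "$BACKUP_DIR"; }

shrink_stream
restore_stream
```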

Validation commands

ansible-playbook -i infra/inventory/ga.yml infra/playbooks/deploy_stateful.yml \
  --vault-password-file ~/.akira-vault-pass.txt --tags nats --limit nats_nodes

docker run --rm --network host natsio/nats-box:0.14.3 \
  nats --server "nats://akira:<password>@127.0.0.1:4222" server list

docker run --rm --network host natsio/nats-box:0.14.3 \
  nats --server "nats://akira:<password>@127.0.0.1:4222" stream info AKIRA_CDR
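
Since both nats-box checks share the same prefix, a small wrapper can cut down on retyping. The natsbox function name and the NATS_URL variable are assumptions for illustration. The sketch prints the commands by default and only runs Docker when DRY_RUN=0.

```shell
#!/bin/sh
# Dry-run by default: set DRY_RUN=0 to actually invoke Docker.
DRY_RUN="${DRY_RUN:-1}"

# Assumption: the operator sets NATS_URL once instead of pasting the password
# into every command.
NATS_URL="${NATS_URL:-nats://akira:<password>@127.0.0.1:4222}"

# Run a nats CLI subcommand inside the nats-box image used by this runbook.
natsbox() {
  set -- docker run --rm --network host natsio/nats-box:0.14.3 \
    nats --server "$NATS_URL" "$@"
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

natsbox server list
natsbox stream info AKIRA_CDR
```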

Cost delta

  • Single-node pilot: NATS on cache-01 shared, no additional VM cost.
  • 3-node GA cluster: 3 x cx23 dedicated NATS nodes, about 11 EUR/month at current Hetzner sizing.

Notes

  • Pilot remains single-node through infra/group_vars/staging/main.yml.
  • GA uses JetStream replicas 3 through infra/group_vars/ga/main.yml.
  • A 3-node cluster degraded to 2 nodes keeps quorum, but a stream configured with replicas 3 may block writes until the third peer recovers, depending on placement and acknowledgement state.
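
The quorum statement above follows from majority arithmetic: a Raft group of n peers needs floor(n / 2) + 1 of them to agree. A minimal sketch, with quorum and has_quorum as illustrative helper names:

```shell
# Majority quorum for a Raft group of n peers: floor(n / 2) + 1.
quorum() { echo $(( $1 / 2 + 1 )); }

# Succeeds when `alive` peers out of `total` still form a majority.
has_quorum() {
  total="$1"
  alive="$2"
  [ "$alive" -ge "$(quorum "$total")" ]
}

quorum 3                                        # prints 2
has_quorum 3 2 && echo "2 of 3: quorum held"
has_quorum 3 1 || echo "1 of 3: quorum lost"
```

With three nodes, losing one keeps the majority, while losing two does not.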