NATS cluster migration (pilot to GA)
When to migrate
- Pilot Phase 2 to GA transition.
- Customer count greater than 10, where the single-node broker becomes a material SPOF.
- SLA uptime commitment at or above 99.9%.
Procedure
Plan a maintenance window of about 2 hours.
- Notify customers with at least 1 week of notice.
- Provision three dedicated NATS VMs for GA, or absorb `akira-cache-01-staging` only after explicitly accepting the migration risk.
- Replace placeholders in `infra/inventory/ga.yml`.
- Set `vault_nats_cluster_password` in the real vault.
- Back up the pilot stream with `nats stream backup AKIRA_CDR`.
- Deploy the NATS role with `nats_cluster_enabled=true`.
- Verify cluster formation with `nats server list`.
- Upgrade `AKIRA_CDR` to `--replicas 3` after all nodes are healthy.
- Test failover by stopping one NATS node and verifying publish/consume still works.
- Restart the stopped node and verify stream catch-up.
- Monitor JetStream health, disk, and consumer backlog for 24 hours.
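The stream-level steps above can be sketched as a small dry-run script. This is a minimal sketch, not the project's tooling: the `BACKUP_DIR` path, the `run` helper, and the `migrate_stream` function name are assumptions introduced for illustration, and it presumes a `nats` CLI context that already points at the target servers. With `DRY_RUN=1` (the default here) it only prints the commands it would run. Note that `nats stream edit` may ask for confirmation when run for real.

```shell
#!/usr/bin/env bash
# Sketch of the stream-level migration steps; prints commands unless DRY_RUN=0.
# Assumptions (not from the runbook): the BACKUP_DIR path and the run helper.

STREAM="AKIRA_CDR"
BACKUP_DIR="${BACKUP_DIR:-./backups/AKIRA_CDR}"   # hypothetical location

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

migrate_stream() {
  # Before deploying the cluster: back up the pilot stream to a directory.
  run nats stream backup "$STREAM" "$BACKUP_DIR"
  # (Deploy the NATS role with nats_cluster_enabled=true at this point.)
  # After the deploy: confirm that all three servers are listed.
  run nats server list
  # Raise the replica count only once every node is healthy.
  run nats stream edit "$STREAM" --replicas 3
  # Confirm the new replica count and peer state.
  run nats stream info "$STREAM"
}

migrate_stream
```

Keeping the dry run as the default makes it possible to review the exact command sequence during the maintenance window before anything touches the stream.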
Rollback
- Stop the new GA NATS nodes.
- Decrease `AKIRA_CDR` replicas back to 1 if the stream metadata allows it.
- Restore the pilot stream backup if stream state is inconsistent.
- Point application `NATS_URL` back to the pilot node.
- Re-run the staging `nats` tag with `nats_cluster_enabled=false`.
Validation commands
ansible-playbook -i infra/inventory/ga.yml infra/playbooks/deploy_stateful.yml \
--vault-password-file ~/.akira-vault-pass.txt --tags nats --limit nats_nodes
docker run --rm --network host natsio/nats-box:0.14.3 \
nats --server "nats://akira:<password>@127.0.0.1:4222" server list
docker run --rm --network host natsio/nats-box:0.14.3 \
nats --server "nats://akira:<password>@127.0.0.1:4222" stream info AKIRA_CDR
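The two nats-box commands above differ only in the trailing subcommand, so they can be wrapped in one helper, and the replica count can be checked mechanically instead of by eye. The following is a hedged sketch: the `NATS_PASSWORD` variable and both function names are assumptions, and it relies on `nats stream info --json` reporting the replica count as `num_replicas` in the stream configuration.

```shell
#!/usr/bin/env bash
# Sketch: wrap the nats-box invocation and extract the stream's replica count.
# Assumptions (not from the runbook): the NATS_PASSWORD variable and both
# function names.

# Run a nats CLI subcommand through the same nats-box image used above.
nats_box() {
  docker run --rm --network host natsio/nats-box:0.14.3 \
    nats --server "nats://akira:${NATS_PASSWORD}@127.0.0.1:4222" "$@"
}

# Read `stream info --json` output on stdin and print the first num_replicas value.
replicas_from_json() {
  grep -o '"num_replicas"[[:space:]]*:[[:space:]]*[0-9][0-9]*' \
    | head -n 1 \
    | grep -o '[0-9][0-9]*$'
}

# Usage against a live cluster (needs Docker and a reachable server):
#   nats_box stream info AKIRA_CDR --json | replicas_from_json
```

The parser uses only `grep` and `head` so it does not add a dependency on a JSON tool to the validation host.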
Cost delta
- Single-node pilot: NATS on `cache-01` shared, no additional VM cost.
- 3-node GA cluster: 3 x cx23 dedicated NATS nodes, about 11 EUR/month at current Hetzner sizing.
Notes
- Pilot remains single-node through `infra/group_vars/staging/main.yml`.
- GA uses JetStream replicas 3 through `infra/group_vars/ga/main.yml`.
- A 2-node degraded cluster keeps quorum, but a stream configured with replicas 3 may block writes until the third peer recovers, depending on placement and acknowledgement state.
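The quorum remark in the last note rests on majority arithmetic: a group of N peers needs floor(N/2) + 1 of them reachable. A minimal sketch of that calculation follows; the helper names are illustrative and not part of the repository, and it models only the majority count, not placement or acknowledgement state.

```shell
#!/usr/bin/env bash
# Sketch of the majority arithmetic behind the quorum note.
# The helper names are illustrative, not from the runbook.

# Smallest number of peers that forms a majority of a group of size $1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Succeeds when $2 healthy peers out of $1 total still form a majority.
has_quorum() {
  [ "$2" -ge "$(quorum "$1")" ]
}

# Walk a 3-peer group down from fully healthy to a single survivor.
for healthy in 3 2 1; do
  if has_quorum 3 "$healthy"; then
    echo "3-peer group, $healthy healthy: quorum held"
  else
    echo "3-peer group, $healthy healthy: quorum lost"
  fi
done
```

For three peers the majority is two, which is why losing one node is survivable and losing two is not.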