NATS cluster migration (pilot to GA)

When to migrate

Pilot Phase 2 to GA transition.
Customer count greater than 10, where the single-node broker becomes a material SPOF.
SLA uptime commitment at or above 99.9%.
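
To put the 99.9% threshold in concrete terms, the sketch below converts an uptime target into a downtime budget. The function name and the per-mille input format are illustrative assumptions; whether planned maintenance counts against the SLA is a contractual question this runbook does not answer.

```shell
# Allowed downtime in whole minutes for an uptime target over a number of days.
# The target is passed in per-mille (999 = 99.9%) so the math stays in integers.
downtime_budget_minutes() {
  uptime_permille=$1
  days=$2
  total_minutes=$(( days * 24 * 60 ))
  # Integer division rounds down, so the result is a conservative budget.
  echo $(( total_minutes * (1000 - uptime_permille) / 1000 ))
}

downtime_budget_minutes 999 30   # prints 43
```

At 99.9%, a 30-day month leaves roughly 43 minutes of downtime, which is less than a single unplanned outage of the one-node broker could plausibly consume.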

Procedure

Plan a maintenance window of about 2 hours.

Notify customers with at least 1 week of notice.
Provision three dedicated NATS VMs for GA, or reuse akira-cache-01-staging as a cluster node only after explicitly accepting the migration risk.
Replace placeholders in infra/inventory/ga.yml.
Set vault_nats_cluster_password in the real vault.
Back up the pilot stream with nats stream backup AKIRA_CDR <backup-dir>.
Deploy the NATS role with nats_cluster_enabled=true.
Verify cluster formation with nats server list.
Scale AKIRA_CDR to 3 replicas with nats stream edit AKIRA_CDR --replicas 3 after all nodes are healthy.
Test failover by stopping one NATS node and verifying publish/consume still works.
Restart the stopped node and verify stream catch-up.
Monitor JetStream health, disk, and consumer backlog for 24 hours.
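
The backup, replica change, and verification steps above could be wrapped in small shell helpers along these lines. This is a sketch under assumptions: NATS_URL, BACKUP_DIR, and the function names are hypothetical and not taken from the repository, and --force is included on the assumption that the edit should skip the interactive confirmation.

```shell
STREAM="AKIRA_CDR"

# Back up the pilot stream into BACKUP_DIR (assumed to be an operator-chosen path).
backup_stream() {
  nats --server "${NATS_URL:?set NATS_URL}" stream backup "$STREAM" "${BACKUP_DIR:?set BACKUP_DIR}"
}

# Change the replica count; call with 3 once every node shows up in "nats server list".
scale_stream() {
  nats --server "${NATS_URL:?set NATS_URL}" stream edit "$STREAM" --replicas "$1" --force
}

# Print the configured replica count by reading num_replicas from the JSON stream info.
stream_replicas() {
  nats --server "${NATS_URL:?set NATS_URL}" stream info "$STREAM" --json \
    | sed -n 's/.*"num_replicas"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' \
    | head -n 1
}

# Example sequence (commented out so the helpers can be sourced safely):
# backup_stream && scale_stream 3 && [ "$(stream_replicas)" = "3" ]
```

Keeping the backup ahead of the replica change in one chained command means a failed backup stops the sequence before the stream is modified.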

Rollback

Stop the new GA NATS nodes.
Decrease AKIRA_CDR replicas back to 1 if the stream metadata allows it.
Restore the pilot stream backup if stream state is inconsistent.
Point application NATS_URL back to the pilot node.
Re-run the staging playbook with --tags nats and nats_cluster_enabled=false.
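
A possible shape for the rollback steps above, again as shell helpers. Several names are assumptions and need adjusting to the real environment: GA_URL and PILOT_URL, the staging inventory path infra/inventory/staging.yml, and the function names. The replica downgrade only works while the GA cluster is still reachable with quorum, and nats stream restore expects that the stream does not already exist on the target server.

```shell
# Overridable so the helper can be dry-run with a stub command.
ANSIBLE_PLAYBOOK="${ANSIBLE_PLAYBOOK:-ansible-playbook}"
# Assumed path; the runbook only names infra/inventory/ga.yml explicitly.
STAGING_INVENTORY="${STAGING_INVENTORY:-infra/inventory/staging.yml}"

# Drop AKIRA_CDR back to a single replica on the GA cluster.
downscale_stream() {
  nats --server "${GA_URL:?set GA_URL}" stream edit AKIRA_CDR --replicas 1 --force
}

# Restore the pilot backup from the directory given as the first argument.
restore_pilot_stream() {
  nats --server "${PILOT_URL:?set PILOT_URL}" stream restore "$1"
}

# Re-apply the single-node configuration to the pilot node.
redeploy_pilot_single_node() {
  "$ANSIBLE_PLAYBOOK" -i "$STAGING_INVENTORY" infra/playbooks/deploy_stateful.yml \
    --vault-password-file "$HOME/.akira-vault-pass.txt" \
    --tags nats -e nats_cluster_enabled=false
}
```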

Validation commands

ansible-playbook -i infra/inventory/ga.yml infra/playbooks/deploy_stateful.yml \
  --vault-password-file ~/.akira-vault-pass.txt --tags nats --limit nats_nodes

docker run --rm --network host natsio/nats-box:0.14.3 \
  nats --server "nats://akira:<password>@127.0.0.1:4222" server list

docker run --rm --network host natsio/nats-box:0.14.3 \
  nats --server "nats://akira:<password>@127.0.0.1:4222" stream info AKIRA_CDR
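
The two nats-box invocations above differ only in their final arguments, so a small wrapper can avoid retyping the server URL. This is a convenience sketch: the function name nats_box and the NATS_PASSWORD environment variable are assumptions, and a password containing URL-reserved characters would need percent-encoding first.

```shell
# Run any nats CLI subcommand through the nats-box image used above.
# Fails early with a message if NATS_PASSWORD is not set in the environment.
nats_box() {
  docker run --rm --network host natsio/nats-box:0.14.3 \
    nats --server "nats://akira:${NATS_PASSWORD:?export NATS_PASSWORD first}@127.0.0.1:4222" "$@"
}

# Usage, equivalent to the commands above:
# nats_box server list
# nats_box stream info AKIRA_CDR
```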

Cost delta

Single-node pilot: NATS shares cache-01, so there is no additional VM cost.
3-node GA cluster: 3 x cx23 dedicated NATS nodes, about 11 EUR/month at current Hetzner sizing.

Notes

Pilot remains single-node through infra/group_vars/staging/main.yml.
GA uses JetStream replicas 3 through infra/group_vars/ga/main.yml.
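With one of the three nodes down, the cluster keeps quorum (2 of 3), and a stream configured with replicas 3 keeps accepting writes, which is what the failover test in the procedure verifies. Publishes may stall briefly during leader election if the stopped node was the stream leader. If a second node fails, quorum is lost and writes block until a peer recovers.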
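
The quorum arithmetic behind this note can be checked with a few lines of shell. JetStream replicates streams with Raft, which needs a strict majority of peers; the function names here are illustrative only.

```shell
# Majority needed for a Raft group of N peers: floor(N / 2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum <replicas> <peers_up>: succeeds when enough peers are available.
has_quorum() {
  [ "$2" -ge "$(quorum "$1")" ]
}

quorum 3                                           # prints 2
has_quorum 3 2 && echo "one node down: quorum held"
has_quorum 3 1 || echo "two nodes down: quorum lost"
```

The same formula shows why a 2-node cluster adds no fault tolerance: its quorum is also 2, so losing either node blocks writes.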
