Passa al contenuto principale

Runbook - HTTPS Certificate Renewal

Normal Operation

Caddy manages HTTPS certificates automatically with Let's Encrypt. It renews certificates before expiry and stores runtime state on the management host.

RTO target for certificate recovery: 15 minutes after DNS and firewall are healthy.

Prerequisites

  • Tailscale and SSH access to akira-mgmt-01-staging.
  • Hetzner firewall access for management host.
  • DNS access for akira-staging.asheep.it records.

Diagnostics

curl -vkI https://mgmt.akira-staging.asheep.it
curl -vkI https://grafana.akira-staging.asheep.it
curl -vkI https://alerts.akira-staging.asheep.it

ssh root@akira-mgmt-01-staging '
systemctl status caddy --no-pager
journalctl -u caddy --since "2 hours ago" --no-pager \
| grep -iE "renew|certificate|acme|error" \
| tail -120
'

Check DNS:

dig mgmt.akira-staging.asheep.it +short
dig grafana.akira-staging.asheep.it +short
dig alerts.akira-staging.asheep.it +short

Expected: records return the management host public IP.

Manual Reload

Use reload when Caddy config changed or renewal state needs a retry.

ssh root@akira-mgmt-01-staging '
systemctl reload caddy
systemctl status caddy --no-pager
'

Alternative:

ssh root@akira-mgmt-01-staging '
caddy reload --config /etc/caddy/Caddyfile
'

Failure: Port 80 Blocked

Symptoms:

  • ACME HTTP-01 challenge fails.
  • Caddy logs mention challenge timeout or connection refused.

Recovery:

  1. Open Hetzner Cloud Console.
  2. Check firewall attached to akira-mgmt-01-staging.
  3. Ensure 80/tcp is allowed from 0.0.0.0/0 and ::/0.
  4. Reload Caddy.
  5. Re-run HTTPS checks.
ssh root@akira-mgmt-01-staging '
systemctl reload caddy
journalctl -u caddy --since "5 min ago" --no-pager | tail -120
'

Failure: DNS Wrong

Symptoms:

  • dig returns an old IP or no answer.
  • Caddy challenge reaches the wrong host.

Recovery:

  1. Fix the A and AAAA records for affected hostnames.
  2. Wait for DNS propagation.
  3. Confirm dig returns the expected public IP.
  4. Reload Caddy.

Validation:

curl -fsS -I http://mgmt.akira-staging.asheep.it
curl -fsS -I https://mgmt.akira-staging.asheep.it

Failure: Let's Encrypt Rate Limit

Symptoms:

  • Caddy logs mention ACME rate limit.
  • Repeated retries fail immediately.

Recovery:

  1. Stop repeated reload loops.
  2. Wait for the cooldown shown in Caddy logs.
  3. Use the Let's Encrypt staging endpoint only for debugging config changes.
  4. Return to the default issuer before final validation.

Caveat: repeated forced retries can extend recovery time. Fix firewall and DNS first, then retry once.

Escalation

Elapsed TimeAction
T+10 minEscalate if DNS and firewall are not confirmed.
T+15 minEscalate to Massimo if HTTPS is still unavailable.
T+30 minTreat as SEV2 or SEV1 depending on service impact.

Validation

curl -fsS -I https://mgmt.akira-staging.asheep.it/healthz
curl -fsS -I https://grafana.akira-staging.asheep.it
curl -fsS -I https://alerts.akira-staging.asheep.it

ssh root@akira-mgmt-01-staging '
journalctl -u caddy --since "10 min ago" --no-pager | tail -80
'

Expected: HTTPS endpoints respond and Caddy logs show no active ACME failure.