Runbook - HTTPS Certificate Renewal
Normal Operation
Caddy manages HTTPS certificates automatically with Let's Encrypt. It renews certificates before expiry and stores runtime state on the management host.
RTO target for certificate recovery: 15 minutes after DNS and firewall are healthy.
Prerequisites
- Tailscale and SSH access to
akira-mgmt-01-staging. - Hetzner firewall access for management host.
- DNS access for
akira-staging.asheep.itrecords.
Diagnostics
curl -vkI https://mgmt.akira-staging.asheep.it
curl -vkI https://grafana.akira-staging.asheep.it
curl -vkI https://alerts.akira-staging.asheep.it
ssh root@akira-mgmt-01-staging '
systemctl status caddy --no-pager
journalctl -u caddy --since "2 hours ago" --no-pager \
| grep -iE "renew|certificate|acme|error" \
| tail -120
'
Check DNS:
dig mgmt.akira-staging.asheep.it +short
dig grafana.akira-staging.asheep.it +short
dig alerts.akira-staging.asheep.it +short
Expected: records return the management host public IP.
Manual Reload
Use reload when Caddy config changed or renewal state needs a retry.
ssh root@akira-mgmt-01-staging '
systemctl reload caddy
systemctl status caddy --no-pager
'
Alternative:
ssh root@akira-mgmt-01-staging '
caddy reload --config /etc/caddy/Caddyfile
'
Failure: Port 80 Blocked
Symptoms:
- ACME HTTP-01 challenge fails.
- Caddy logs mention challenge timeout or connection refused.
Recovery:
- Open Hetzner Cloud Console.
- Check firewall attached to
akira-mgmt-01-staging. - Ensure
80/tcpis allowed from0.0.0.0/0and::/0. - Reload Caddy.
- Re-run HTTPS checks.
ssh root@akira-mgmt-01-staging '
systemctl reload caddy
journalctl -u caddy --since "5 min ago" --no-pager | tail -120
'
Failure: DNS Wrong
Symptoms:
digreturns an old IP or no answer.- Caddy challenge reaches the wrong host.
Recovery:
- Fix the
AandAAAArecords for affected hostnames. - Wait for DNS propagation.
- Confirm
digreturns the expected public IP. - Reload Caddy.
Validation:
curl -fsS -I http://mgmt.akira-staging.asheep.it
curl -fsS -I https://mgmt.akira-staging.asheep.it
Failure: Let's Encrypt Rate Limit
Symptoms:
- Caddy logs mention ACME rate limit.
- Repeated retries fail immediately.
Recovery:
- Stop repeated reload loops.
- Wait for the cooldown shown in Caddy logs.
- Use the Let's Encrypt staging endpoint only for debugging config changes.
- Return to the default issuer before final validation.
Caveat: repeated forced retries can extend recovery time. Fix firewall and DNS first, then retry once.
Escalation
| Elapsed Time | Action |
|---|---|
| T+10 min | Escalate if DNS and firewall are not confirmed. |
| T+15 min | Escalate to Massimo if HTTPS is still unavailable. |
| T+30 min | Treat as SEV2 or SEV1 depending on service impact. |
Validation
curl -fsS -I https://mgmt.akira-staging.asheep.it/healthz
curl -fsS -I https://grafana.akira-staging.asheep.it
curl -fsS -I https://alerts.akira-staging.asheep.it
ssh root@akira-mgmt-01-staging '
journalctl -u caddy --since "10 min ago" --no-pager | tail -80
'
Expected: HTTPS endpoints respond and Caddy logs show no active ACME failure.