Runbook - HTTPS Certificate Renewal

Normal Operation

Caddy manages HTTPS certificates automatically with Let's Encrypt. It renews certificates before expiry and stores runtime state on the management host.

RTO target for certificate recovery: 15 minutes after DNS and firewall are healthy.

Prerequisites

Tailscale and SSH access to akira-mgmt-01-staging.
Hetzner firewall access for management host.
DNS access for akira-staging.asheep.it records.

Diagnostics

curl -vkI https://mgmt.akira-staging.asheep.it
curl -vkI https://grafana.akira-staging.asheep.it
curl -vkI https://alerts.akira-staging.asheep.it

ssh root@akira-mgmt-01-staging '
  systemctl status caddy --no-pager
  journalctl -u caddy --since "2 hours ago" --no-pager \
    | grep -iE "renew|certificate|acme|error" \
    | tail -120
'

Check DNS:

dig mgmt.akira-staging.asheep.it +short
dig grafana.akira-staging.asheep.it +short
dig alerts.akira-staging.asheep.it +short

Expected: records return the management host public IP.

Manual Reload

Use reload when Caddy config changed or renewal state needs a retry.

ssh root@akira-mgmt-01-staging '
  systemctl reload caddy
  systemctl status caddy --no-pager
'

Alternative:

ssh root@akira-mgmt-01-staging '
  caddy reload --config /etc/caddy/Caddyfile
'

Failure: Port 80 Blocked

Symptoms:

ACME HTTP-01 challenge fails.
Caddy logs mention challenge timeout or connection refused.

Recovery:

Open Hetzner Cloud Console.
Check firewall attached to akira-mgmt-01-staging.
Ensure 80/tcp is allowed from 0.0.0.0/0 and ::/0.
Reload Caddy.
Re-run HTTPS checks.

ssh root@akira-mgmt-01-staging '
  systemctl reload caddy
  journalctl -u caddy --since "5 min ago" --no-pager | tail -120
'

Failure: DNS Wrong

Symptoms:

dig returns an old IP or no answer.
Caddy challenge reaches the wrong host.

Recovery:

Fix the A and AAAA records for affected hostnames.
Wait for DNS propagation.
Confirm dig returns the expected public IP.
Reload Caddy.

Validation:

curl -fsS -I http://mgmt.akira-staging.asheep.it
curl -fsS -I https://mgmt.akira-staging.asheep.it

Failure: Let's Encrypt Rate Limit

Symptoms:

Caddy logs mention ACME rate limit.
Repeated retries fail immediately.

Recovery:

Stop repeated reload loops.
Wait for the cooldown shown in Caddy logs.
Use the Let's Encrypt staging endpoint only for debugging config changes.
Return to the default issuer before final validation.

Caveat: repeated forced retries can extend recovery time. Fix firewall and DNS first, then retry once.

Escalation

Elapsed Time	Action
T+10 min	Escalate if DNS and firewall are not confirmed.
T+15 min	Escalate to Massimo if HTTPS is still unavailable.
T+30 min	Treat as SEV2 or SEV1 depending on service impact.

Validation

curl -fsS -I https://mgmt.akira-staging.asheep.it/healthz
curl -fsS -I https://grafana.akira-staging.asheep.it
curl -fsS -I https://alerts.akira-staging.asheep.it

ssh root@akira-mgmt-01-staging '
  journalctl -u caddy --since "10 min ago" --no-pager | tail -80
'

Expected: HTTPS endpoints respond and Caddy logs show no active ACME failure.

Normal Operation​

Prerequisites​

Diagnostics​

Manual Reload​

Failure: Port 80 Blocked​

Failure: DNS Wrong​

Failure: Let's Encrypt Rate Limit​

Escalation​

Validation​

Normal Operation

Prerequisites

Diagnostics

Manual Reload

Failure: Port 80 Blocked

Failure: DNS Wrong

Failure: Let's Encrypt Rate Limit

Escalation

Validation