RCA #traefik #cert-manager #kubernetes #gateway-api #acme #lets-encrypt #oomkill #incident-response

cert-manager ACME HTTP-01 Leak: 22,514 Stale HTTPRoutes OOMKilled Traefik

Diagnosed an 18-hour full ingress outage caused by cert-manager leaking 22,514 stale ACME solver HTTPRoutes after TLS certificates were deployed before DNS was configured, compounded by Gateway API blocking all 23 listeners when new listeners referenced non-existent TLS secrets.

· Gideon Warui

Root Cause Analysis — Ingress Layer Outage

Incident ID: INC-2026-03-23-001
Date of Incident: 2026-03-22 (onset) to 2026-03-23 (resolution)
Date of Report: 2026-03-23 (revised)
Severity: Critical (full ingress unavailability)
Status: Resolved
Prepared by: Gideon Warui, Platform/Infra Engineer


Table of Contents

  1. Executive Summary
  2. Impact Assessment
  3. Timeline
  4. Root Cause Analysis
  5. Contributing Factors
  6. Resolution
  7. Verification
  8. Corrective Actions
  9. Lessons Learned

Executive Summary

On the evening of 2026-03-22, four new service namespaces were deployed to the shared Kubernetes cluster. The deployments included TLS Certificate resources backed by a Let’s Encrypt ACME HTTP-01 issuer. The deploying engineer believed the corresponding DNS records had already been pointed at the cluster’s load balancer. They had not.

Because DNS was not configured, the ACME HTTP-01 challenges could never succeed — Let’s Encrypt could not reach the challenge token endpoint. cert-manager retried continuously, and each retry created a new HTTPRoute object (the ACME solver route) without removing the previous failed attempt. Over the next 18 hours, this leaked 22,514 stale HTTPRoutes into the cluster.

Traefik, the cluster’s ingress controller, loads its entire routing table into memory on startup and on each configuration change. With 22,514 routes to process, Traefik’s memory consumption spiked far beyond its configured limit, causing repeated OOMKills (exit code 137). The pod entered CrashLoopBackOff, and all HTTPS ingress became unavailable.

A secondary compounding failure worsened recovery: the same new deployments had added Gateway listeners referencing TLS secrets that did not yet exist. The Kubernetes Gateway API marks the entire Gateway resource as Programmed: False when any listener has an unresolvable configuration, which blocked traffic on all 23 listeners (including the 19 previously healthy ones), not just the four new ones.

Resolution required four actions: injecting temporary placeholder TLS secrets to break the Gateway deadlock; cleaning up the 22,514 stale HTTPRoutes; increasing Traefik’s memory limit from 512Mi to 1024Mi and adding an HPA; and deleting active ACME challenges and certificates to stop the retry loop.

Total effective downtime from first OOMKill to full route restoration was approximately 18–20 hours.


Impact Assessment

Services Affected

All services routed through the cluster’s shared Traefik Gateway were impacted across both production and sandbox environments.

Environment                         Impact
All production HTTPS endpoints      Fully unreachable from the internet
All sandbox HTTPS endpoints         Fully unreachable from the internet
Internal cluster traffic            Unaffected
Database and message queue layers   Unaffected
Background workers and async jobs   Unaffected
Kubernetes control plane            Unaffected

What Was Not Affected

  • All application pods continued running normally
  • Pod-to-pod and service-to-service communication was unaffected
  • The ArgoCD GitOps layer continued reconciling application state
  • Karpenter continued managing node provisioning (separate issue, resolved independently)

Timeline

All times are EAT (UTC+3).

Time                                    Event
2026-03-22 ~18:00                       Four new service namespaces deployed to the cluster via ArgoCD. Each included a Certificate resource targeting ACME HTTP-01 issuance, and new Gateway listeners referencing the expected TLS secrets. The deploying engineer believed DNS records for the new domains were already pointing to the cluster NLB. They were not.
2026-03-22 ~18:00 onwards               cert-manager begins ACME HTTP-01 challenge attempts for the new certificates. Challenges fail immediately: DNS is not pointing to the cluster, so Let’s Encrypt cannot reach the solver endpoint. cert-manager enters an exponential backoff retry loop. Each retry creates a new cm-acme-http-solver-* HTTPRoute without removing the previous one. HTTPRoute count begins accumulating.
2026-03-22 ~19:49                       First Traefik OOMKill. The accumulating HTTPRoutes inflate the in-memory routing table beyond the 512Mi limit. Exit code 137. Pod enters CrashLoopBackOff.
2026-03-22 ~20:00                       All HTTPS routes begin returning connection errors. cert-manager ACME challenges are now additionally stalled because Traefik is not routing: the HTTP-01 solver endpoint would be unreachable even if DNS were correct.
2026-03-22 ~20:00 to 2026-03-23 ~11:00  CrashLoopBackOff cycle continues. Each Traefik restart loads an increasingly large route table and is OOMKilled within seconds. HTTPRoute count continues growing as cert-manager retries. No alerting fires.
2026-03-23 ~11:00                       Incident identified during manual cluster health review.
2026-03-23 ~11:04                       Traefik confirmed in CrashLoopBackOff with 102 restarts. OOMKill confirmed via kubectl describe. Memory limit at the time: 512Mi.
2026-03-23 ~11:12                       Helm upgrade executed: Traefik memory limit raised to 512Mi (first attempt, insufficient). Gateway deadlock identified. Temporary TLS secrets injected. Gateway annotation patched to trigger reconciliation. All 23 listeners reach Programmed: True. Routes begin responding.
2026-03-23 ~11:15                       20 of 24 HTTPS endpoints confirmed healthy. Investigation closed prematurely: the ACME HTTPRoute leak had not yet been identified. cert-manager continues accumulating routes in the background.
2026-03-23 ~12:11                       Traefik OOMKills again. HTTPRoute count has now reached 22,514. New memory limit (512Mi) is also insufficient. Pod re-enters CrashLoopBackOff. All routes return 404.
2026-03-23 ~12:20                       Second investigation begins. ACME HTTPRoute accumulation identified as root cause.
2026-03-23 ~12:22                       Mass deletion of 22,514 stale cm-acme-http-solver-* HTTPRoutes begins (parallel batch deletion).
2026-03-23 ~12:27                       Traefik memory limit raised from 512Mi to 1024Mi. GOMEMLIMIT set to 900MiB. Helm revision 5 deployed.
2026-03-23 ~12:27                       HPA configured: 2–5 replicas, 70% CPU target, 5-minute scaleDown stabilization window.
2026-03-23 ~12:30                       Active ACME challenges, orders, and Certificate objects deleted from affected namespaces. Auto-sync disabled on the four affected ArgoCD applications to prevent cert-manager re-engaging before DNS is configured.
2026-03-23 ~12:37                       HTTPRoute count reaches zero stale routes. Traefik route table normalised to ~30 entries. CPU drops from 286% to 1%. All HTTPS routes confirmed responding.

Root Cause Analysis

Root Cause — ACME HTTP-01 Certificate Issuance Initiated Before DNS Was Configured

What happened:

Four new service namespaces (<client>-pay-production, <client>-pay-sandbox, <client>-rbac-production, <client>-rbac-sandbox) were deployed to the cluster on the evening of 2026-03-22.

Each namespace included:

  • A cert-manager Certificate resource targeting ACME HTTP-01 issuance via Let’s Encrypt
  • A Gateway listener referencing the expected TLS secret
  • An HTTPRoute for the application

The deployment was correct in structure — the sequence of creating a Certificate and a Gateway listener together is the standard pattern. The critical missing prerequisite was that the DNS A records for the new service domains had not yet been pointed to the cluster’s Network Load Balancer.

ACME HTTP-01 validation works by having Let’s Encrypt send an HTTP request to http://<domain>/.well-known/acme-challenge/<token>. cert-manager creates an HTTPRoute to serve this token via Traefik. For this to succeed, the domain’s DNS must already resolve to the cluster’s ingress IP. Without DNS, Let’s Encrypt’s request never reaches the cluster, and the challenge fails immediately.

cert-manager’s retry behaviour does not clean up failed challenge HTTPRoute objects. Each retry attempt creates a new cm-acme-http-solver-* HTTPRoute. Over 18 hours at exponential backoff intervals, this produced 22,514 stale HTTPRoutes across the four namespaces:

<client>-pay-production:    5,445 stale HTTPRoutes
<client>-pay-sandbox:       5,695 stale HTTPRoutes
<client>-rbac-production:   5,697 stale HTTPRoutes
<client>-rbac-sandbox:      5,697 stale HTTPRoutes
─────────────────────────────────────────────────
Total:                      22,534 stale HTTPRoutes

Traefik processes every HTTPRoute in the cluster on startup and on each configuration change event. With 22,534 routes in memory — each representing a parsed routing rule, hostname match, and backend reference — Traefik’s memory consumption exceeded its 512Mi limit during the reconciliation cycle and was terminated by the Linux OOM killer.

Evidence — Traefik OOMKill:

Last State:  Terminated
  Reason:    OOMKilled
  Exit Code: 137
  Started:   Mon, 23 Mar 2026 12:11:10 +0300
  Finished:  Mon, 23 Mar 2026 12:14:59 +0300

Limits:   cpu: 500m   memory: 512Mi
Requests: cpu: 100m   memory: 128Mi

Evidence — HTTPRoute count before cleanup:

kubectl get httproute -A --no-headers | grep "cm-acme-http-solver" | wc -l
# 22514
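
The cluster-wide count can be broken down per namespace with a small filter. This is a minimal sketch, assuming the default column layout of `kubectl get httproute -A --no-headers` (namespace first, route name second); the function name `count_solver_routes` is hypothetical and not part of the original investigation.

```shell
# count_solver_routes: reads `kubectl get httproute -A --no-headers` output
# on stdin (column 1 = namespace, column 2 = route name) and prints
# "<namespace> <count>" for routes whose name starts with the solver prefix.
count_solver_routes() {
  awk '$2 ~ /^cm-acme-http-solver-/ { n[$1]++ }
       END { for (ns in n) print ns, n[ns] }' | sort
}

# Usage against a live cluster (not run here):
#   kubectl get httproute -A --no-headers | count_solver_routes
```

Counting by namespace makes it easier to see which Certificate is driving the accumulation before any deletion starts.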

Evidence — cert-manager retry logs:

ERR Unable to load HTTPRoute backend: Cannot load HTTPBackendRef
    <client>-pay-production/<client>-pay-web: getting service: service "<client>-pay-web" not found
    http_route=<client>-pay-route namespace=<client>-pay-production

The backend service did not exist because the application container image had not yet been deployed — another prerequisite that was not in place at the time of the deployment.


Compounding Factor — Gateway API Listener Failure Blocks All Routes

When the new namespaces were deployed, their Gateway listeners referenced TLS secrets (<client>-pay-prod-tls-secret, <client>-rbac-prod-tls-secret, etc.) that cert-manager had not yet been able to create (because challenges were failing).

The Kubernetes Gateway API marks the entire Gateway resource Programmed: False when any listener references a secret that does not exist. This is not scoped to the failing listener — it affects all 23 listeners on the same Gateway, including the 19 that had valid, working configurations.

Conditions:
  Type:    Programmed
  Status:  False
  Reason:  InvalidCertificateRef
  Message: Error while retrieving certificate:
           getting secret: secret "<client>-pay-prod-tls-secret" not found

This created a self-reinforcing deadlock:

Missing TLS secrets → Gateway Programmed: False
      ↓
      All 23 listeners stop routing (including working ones)
      ↓
      ACME HTTP-01 solver pods unreachable
      (cert-manager's HTTPRoutes not being served)
      ↓
      Certificates cannot be issued
      (challenges fail even if DNS were correct)
      ↓
      TLS secrets never created
      ↓
      [DEADLOCK]: Gateway cannot program; cert-manager cannot issue

Contributing Factors

CF-1 — No DNS prerequisite check in the deployment process

The deployment procedure for a new HTTPS service did not include a step to verify that DNS records were pointing to the cluster load balancer before applying cert-manager Certificate resources. ACME HTTP-01 issuance has a hard dependency on DNS being correct at the time of issuance.

CF-2 — cert-manager does not garbage-collect failed challenge HTTPRoutes

cert-manager creates a new HTTPRoute for each HTTP-01 challenge attempt but does not remove routes from previous failed attempts. This is a known upstream behaviour. In a functioning cluster the backoff intervals are long enough that accumulation is not significant — but with a permanently-failing challenge (no DNS, no backend service), the retry frequency is high enough to produce thousands of objects over hours.
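
To put the retry frequency in perspective, a back-of-the-envelope average can be computed from the figures in this report (22,514 routes, roughly 18 hours, four namespaces). This is only a mean: the real creation rate was not constant, and the helper name `avg_per_minute` is hypothetical.

```shell
# avg_per_minute TOTAL HOURS [NAMESPACES]
# Prints the mean number of objects created per minute (per namespace when
# NAMESPACES is given), to one decimal place.
avg_per_minute() {
  awk -v total="$1" -v hours="$2" -v ns="${3:-1}" \
    'BEGIN { printf "%.1f\n", total / (hours * 60) / ns }'
}

avg_per_minute 22514 18      # cluster-wide average: prints 20.8
avg_per_minute 22514 18 4    # average per affected namespace: prints 5.2
```

An average of about five new routes per minute in each namespace works out to roughly one every 11 to 12 seconds, which is why the object count reached five figures within hours.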

CF-3 — No alerting on HTTPRoute accumulation or cert-manager retry storms

No alert existed to detect abnormal HTTPRoute counts or sustained ACME request rates without certificate issuance. The accumulation of 22,534 routes over 18 hours produced no notification.

CF-4 — No alerting on Traefik memory pressure or CrashLoopBackOff

Traefik was in CrashLoopBackOff for roughly 15 hours of the accumulation window (from the first OOMKill at ~19:49 until the manual review at ~11:00). No alert fired. The failure was discovered only during a manual cluster health check.
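
A manual health review of the kind that eventually caught this incident can be made quicker with a small filter over `kubectl get pods` output. The sketch below assumes the default column layout (NAME READY STATUS RESTARTS ...); the function name and the default threshold of 5 restarts are hypothetical choices, not values from the incident.

```shell
# crashlooping_pods [MIN_RESTARTS]
# Reads `kubectl get pods --no-headers` output on stdin and prints
# "<name> <status> <restarts>" for pods that are in CrashLoopBackOff or
# have restarted at least MIN_RESTARTS times (default 5).
crashlooping_pods() {
  awk -v min="${1:-5}" \
    '$3 == "CrashLoopBackOff" || $4 + 0 >= min { print $1, $3, $4 }'
}

# Usage against a live cluster (not run here):
#   kubectl get pods -n traefik --no-headers | crashlooping_pods 5
```

This is a stopgap for manual checks only; it does not replace an alert on pod restart state.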

CF-5 — Traefik memory limit not sized for route table growth

The Traefik memory limit (512Mi at the time of the second OOMKill) was provisioned for a stable route table of ~30 routes. It had no headroom for transient inflation caused by accumulated challenge routes or future route table growth. No HPA was configured to distribute load across multiple replicas.

CF-6 — Gateway API listener failure blast radius not understood

The failure behaviour of the Gateway API — where one invalid listener blocks all listeners on the same resource — was not documented and not widely understood. This extended the investigation time during the first resolution attempt and contributed to premature closure before the HTTPRoute accumulation was identified.


Resolution

I resolved the outage in two phases. Phase 1 addressed the Gateway deadlock. Phase 2 addressed the underlying HTTPRoute accumulation.


Phase 1 — Gateway Deadlock Resolution (2026-03-23 ~11:12)

Action 1.1 — Inject temporary TLS secrets

I created placeholder self-signed secrets under the exact names referenced by the failing Gateway listeners. cert-manager overwrites these automatically on successful certificate issuance.

openssl req -x509 -nodes -newkey rsa:2048 -days 1 \
  -keyout /tmp/tls.key -out /tmp/tls.crt \
  -subj "/CN=temp-bootstrap" 2>/dev/null

for ns_secret in \
  "<client>-pay-production/<client>-pay-prod-tls-secret" \
  "<client>-pay-sandbox/sandbox-tls-secret" \
  "<client>-rbac-production/<client>-rbac-prod-tls-secret" \
  "<client>-rbac-sandbox/sandbox-tls-secret"; do
  ns="${ns_secret%%/*}"
  secret="${ns_secret##*/}"
  kubectl create secret tls "$secret" -n "$ns" \
    --cert=/tmp/tls.crt --key=/tmp/tls.key \
    --dry-run=client -o yaml | kubectl apply -f -
done

Action 1.2 — Force Gateway reconciliation

kubectl annotate gateway traefik-gateway -n traefik \
  "kubectl.kubernetes.io/last-applied-restart=$(date -u +%s)" \
  --overwrite

All 23 listeners transitioned to Programmed: True within 8 seconds.


Phase 2 — ACME HTTPRoute Cleanup and Traefik Stabilisation (2026-03-23 ~12:20)

Action 2.1 — Delete 22,514 stale ACME HTTPRoutes

for ns in <client>-pay-production <client>-pay-sandbox <client>-rbac-production <client>-rbac-sandbox; do
  kubectl get httproute -n "$ns" --no-headers \
    | grep "cm-acme-http-solver" \
    | awk '{print $1}' \
    | xargs -P10 -n100 kubectl delete httproute -n "$ns"
done
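
Before running a mass deletion like this, it can help to preview the batches that would be issued. The following is a minimal sketch that prints the commands instead of executing them; `preview_batches` is a hypothetical helper, and the batch size of 100 simply mirrors the `-n100` used above.

```shell
# preview_batches NAMESPACE [BATCH_SIZE]
# Reads route names on stdin and prints the kubectl commands that would be
# run, one per batch, without executing anything.
preview_batches() {
  ns=$1
  size=${2:-100}
  xargs -n "$size" echo kubectl delete httproute -n "$ns"
}

# Usage against a live cluster (not run here):
#   kubectl get httproute -n "$ns" --no-headers \
#     | awk '$1 ~ /^cm-acme-http-solver-/ {print $1}' \
#     | preview_batches "$ns" | head -3
```

Matching on the `cm-acme-http-solver-` prefix at the start of the name, rather than anywhere in the line, reduces the chance of catching an application route by accident.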

Action 2.2 — Increase Traefik memory limit and add GOMEMLIMIT

helm upgrade traefik traefik/traefik \
  --version 38.0.2 \
  --namespace traefik \
  --reuse-values \
  --set "resources.limits.memory=1024Mi" \
  --set "resources.requests.memory=256Mi" \
  --set "env[0].name=GOMEMLIMIT" \
  --set "env[0].value=900MiB"

Setting          Before    After
Memory limit     512Mi     1024Mi
Memory request   128Mi     256Mi
GOMEMLIMIT       115MiB    900MiB
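
The gap between the container limit and GOMEMLIMIT is easy to compute. GOMEMLIMIT is a soft target for memory managed by the Go runtime, so it is usually set somewhat below the container limit; the right margin for Traefik is an assumption here, not something this incident measured. A minimal sketch, with a hypothetical helper name:

```shell
# gomemlimit_headroom LIMIT_MI GOMEMLIMIT_MIB
# Prints the gap between the container memory limit and GOMEMLIMIT,
# in MiB and as a whole-number percentage of the limit.
gomemlimit_headroom() {
  limit=$1
  soft=$2
  gap=$((limit - soft))
  echo "${gap}MiB $((gap * 100 / limit))%"
}

gomemlimit_headroom 1024 900   # prints: 124MiB 12%
```

With the values deployed here, about 12% of the container limit is left for memory the Go runtime does not account for.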

Action 2.3 — Configure Traefik HPA

helm upgrade traefik traefik/traefik \
  --version 38.0.2 \
  --namespace traefik \
  --reuse-values \
  --set "autoscaling.enabled=true" \
  --set "autoscaling.minReplicas=2" \
  --set "autoscaling.maxReplicas=5" \
  --set "autoscaling.metrics[0].type=Resource" \
  --set "autoscaling.metrics[0].resource.name=cpu" \
  --set "autoscaling.metrics[0].resource.target.type=Utilization" \
  --set "autoscaling.metrics[0].resource.target.averageUtilization=70" \
  --set "autoscaling.behavior.scaleDown.stabilizationWindowSeconds=300"

Action 2.4 — Stop cert-manager retry loop

I deleted active challenges, orders, and Certificate objects from the four affected namespaces and disabled auto-sync on the corresponding ArgoCD applications to prevent cert-manager from re-engaging until DNS and backend services are ready.

# Disable ArgoCD auto-sync on affected applications
for app in <client>-pay-production <client>-pay-sandbox <client>-rbac-production <client>-rbac-sandbox; do
  kubectl patch application "$app" -n argocd \
    --type=merge \
    -p '{"spec":{"syncPolicy":null}}'
done

# Delete all cert-manager resources in affected namespaces
for ns in <client>-pay-production <client>-pay-sandbox <client>-rbac-production <client>-rbac-sandbox; do
  kubectl delete certificate --all -n "$ns"
  kubectl delete certificaterequest --all -n "$ns"
  kubectl delete order --all -n "$ns"
  kubectl delete challenge --all -n "$ns"
done

Verification

HTTPRoute count after cleanup:

kubectl get httproute -A --no-headers | grep "cm-acme-http-solver" | wc -l
# 0

Traefik pod status:

NAME                      READY   STATUS    RESTARTS   AGE
traefik-749c566fd-zj2mb   1/1     Running   0          stable
traefik-749c566fd-x87wk   1/1     Running   0          stable  (HPA second replica)

Traefik memory usage:

NAME                      CPU(cores)   MEMORY(bytes)
traefik-749c566fd-zj2mb   8m           115Mi
traefik-749c566fd-x87wk   6m           52Mi
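
As a rough cross-check against the TraefikMemoryPressure threshold (70% of limit), the same output can be filtered for pods above a given percentage. This sketch assumes `kubectl top pod --no-headers` reports memory in Mi, as in the output above; `pods_over_threshold` is a hypothetical helper.

```shell
# pods_over_threshold LIMIT_MI PCT
# Reads `kubectl top pod --no-headers` output on stdin (NAME CPU MEMORY, with
# memory reported in Mi) and prints pods whose memory exceeds PCT% of LIMIT_MI.
pods_over_threshold() {
  awk -v limit="$1" -v pct="$2" '
    $3 ~ /Mi$/ {
      used = $3; sub(/Mi$/, "", used)
      if (used * 100 > limit * pct) print $1, $3
    }'
}

# Usage against a live cluster (not run here):
#   kubectl top pod -n traefik --no-headers | pods_over_threshold 1024 70
```

For the two pods listed above (115Mi and 52Mi against a 1024Mi limit) the filter prints nothing, which is the expected healthy result.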

HPA status:

NAME      REFERENCE            TARGETS       MINPODS   MAXPODS   REPLICAS
traefik   Deployment/traefik   cpu: 1%/70%   2         5         2

All 23 Gateway listeners:

All listeners: Programmed=True, ResolvedRefs=True

HTTPS endpoint sweep: All previously-active routes confirmed responding. The four <client>-pay/<client>-rbac endpoints remain unavailable — expected, as DNS and application images are not yet deployed. These are not incident-related.


Corrective Actions

CA-1 — DNS Verification Gate Before Certificate Deployment

Priority: Immediate Owner: Platform Engineering / DevOps

No HTTPS service should be deployed to the cluster until DNS records for its domain have been confirmed pointing to the cluster NLB. Pre-deployment DNS check:

CLUSTER_NLB="<cluster-fqdn>"
DOMAIN="pay.example.com"

RESOLVED=$(dig +short "$DOMAIN" | tail -1)
NLB_IP=$(dig +short "$CLUSTER_NLB" | tail -1)

if [ "$RESOLVED" != "$NLB_IP" ]; then
  echo "ERROR: $DOMAIN does not resolve to cluster NLB ($NLB_IP). Do not deploy certificate."
  exit 1
fi
echo "OK: DNS confirmed. Safe to deploy."
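
One limitation of the check above is that a load balancer name often resolves to several addresses, while `tail -1` compares only one address from each side. A stricter variant compares sets instead. This is a minimal sketch under that assumption; `all_ips_in_set` is a hypothetical helper and handles IPv4 lists only.

```shell
# all_ips_in_set DOMAIN_IPS NLB_IPS
# Succeeds only when DOMAIN_IPS is non-empty and every address in it also
# appears in NLB_IPS. Both arguments are newline-separated lists.
all_ips_in_set() {
  domain_ips=$1
  nlb_ips=$2
  [ -n "$domain_ips" ] || return 1
  for ip in $domain_ips; do
    # -x matches the whole line, -F treats the address as a literal string
    printf '%s\n' "$nlb_ips" | grep -qxF "$ip" || return 1
  done
  return 0
}

# Usage (not run here; needs dig and network access):
#   domain_ips=$(dig +short "$DOMAIN" A | grep -E '^[0-9.]+$')
#   nlb_ips=$(dig +short "$CLUSTER_NLB" A | grep -E '^[0-9.]+$')
#   all_ips_in_set "$domain_ips" "$nlb_ips" || { echo "ERROR: DNS mismatch"; exit 1; }
```

Treating an empty result as a failure matters here: a domain with no DNS record at all is exactly the case that triggered this incident.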

For services that genuinely require certificate issuance before DNS is finalised, switch to DNS-01 ACME validation (Route53 solver). DNS-01 validates domain ownership via a TXT record rather than HTTP routing, completely eliminating the dependency on ingress being functional.


CA-2 — Traefik HPA and Resource Sizing (Completed)

Priority: Immediate — COMPLETED Owner: Platform Engineering

Traefik is now configured with:

  • Memory limit: 1024Mi
  • GOMEMLIMIT: 900MiB (Go runtime soft cap — triggers GC before hard limit)
  • HPA: 2–5 replicas, 70% CPU target, 5-minute scaleDown stabilization

The 1Gi limit provides headroom for transient route table growth. The HPA ensures a single pod OOMKill does not cause full ingress unavailability — the second replica continues serving traffic during a crash.


CA-3 — Alerting for ACME HTTPRoute Accumulation and Traefik Saturation (Completed)

Priority: Immediate — COMPLETED Owner: Platform Engineering

The following VMRule alerts have been deployed to the observability namespace and are currently operational:

Alert                         Condition                                                                    Severity
AcmeRetryStorm                High ACME request rate with no certificates becoming Ready for 20+ minutes   critical
AcmeHttpRouteCountHigh        More than 200 HTTPRoutes in cluster for 5+ minutes                           warning
TraefikMemoryPressure         Traefik memory > 70% of limit for 3+ minutes                                 warning
TraefikHPAAtMaxReplicas       HPA pegged at maximum replicas for 10+ minutes                               warning
TraefikHighConfigReloadRate   Config reload rate > 0.5/sec for 5+ minutes                                  warning
TraefikCrashLoopBackOff       Traefik pod in CrashLoopBackOff for 5+ minutes                               critical

Had these alerts been in place, the expected notification sequence for this incident would have been:

~18:20  AcmeRetryStorm fires             (challenges failing with no Ready certs)
~19:30  AcmeHttpRouteCountHigh fires     (>200 routes accumulated)
~19:45  TraefikMemoryPressure fires      (>70% memory approaching limit)
~19:49  TraefikCrashLoopBackOff fires    (OOMKill, pod restarting)

The outage would have been detectable within 20 minutes of the triggering deployment rather than after 15+ hours.


CA-4 — Alertmanager Notification Channel

Priority: Immediate Owner: Platform Engineering

vmalert currently evaluates all alerts, and they fire correctly. The alertmanager receiver is pending configuration. Until a notification channel is wired (Mattermost incoming webhook), firing alerts are not delivered to any external channel.

Required: Configure alertmanager with a Mattermost webhook receiver. Alert routing: severity: critical → immediate notification; severity: warning → notification with 5-minute grouping interval.


CA-5 — Deployment Checklist for New HTTPS Services

Priority: Medium-term Owner: Platform Engineering / DevOps

New HTTPS Service Onboarding Checklist
──────────────────────────────────────
□ DNS A record for service domain points to cluster NLB
  Verify: dig +short <domain> (must match NLB IP)

□ Backend Deployment and Service exist in the target namespace
  Verify: kubectl get deploy,svc -n <namespace>

□ Container image built and pushed to registry
  Verify: Image present in ECR / registry

□ Gateway listener added with correct TLS secret name

□ cert-manager Certificate resource added (HTTP-01 or DNS-01)

□ If using HTTP-01: DNS verified before applying Certificate
  If using DNS-01: Route53 IRSA permissions verified

□ ArgoCD application sync-wave ordering reviewed
  (namespace/secrets before Gateway listener)

CA-6 — Document Gateway API Blast Radius in Runbook

Priority: Medium-term Owner: Platform Engineering

Add a runbook entry covering the Gateway API failure model:

  • A single listener with InvalidCertificateRef marks the entire Gateway as Programmed: False
  • This blocks ALL listeners, including those with valid configuration
  • Resolution path: create placeholder TLS secret → patch Gateway annotation to force reconciliation
  • Detection: kubectl get gateway -n traefik -o wide → check PROGRAMMED column
  • Detailed listener status: kubectl get gateway traefik-gateway -n traefik -o jsonpath='{.status.listeners}'
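
For the runbook, the listener status can be reduced to just the listeners that are not programmed. The sketch below assumes the status has first been flattened to one line per listener (name, a tab, then space-separated Type=Status pairs); both the helper name `unprogrammed_listeners` and the jsonpath template in the usage comment are assumptions and may need adjusting.

```shell
# unprogrammed_listeners: reads lines of the form
#   <listener-name><TAB><Type>=<Status> <Type>=<Status> ...
# on stdin and prints the names of listeners that do not report
# Programmed=True.
unprogrammed_listeners() {
  awk -F '\t' '{
    conds = " " $2 " "
    if (index(conds, " Programmed=True ") == 0) print $1
  }'
}

# Usage against a live cluster (not run here):
#   kubectl get gateway traefik-gateway -n traefik -o jsonpath='{range .status.listeners[*]}{.name}{"\t"}{range .conditions[*]}{.type}={.status}{" "}{end}{"\n"}{end}' \
#     | unprogrammed_listeners
```

An empty result means every listener reports Programmed=True; any name printed is a candidate for the placeholder-secret procedure described above.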

CA-7 — Re-enable <client>-pay and <client>-rbac When Ready

Priority: When DNS and images are available Owner: Platform Engineering / Application Team

The four affected ArgoCD applications have auto-sync disabled. When the application team confirms DNS records are pointing to the cluster NLB and container images are built and pushed to the registry:

for app in <client>-pay-production <client>-pay-sandbox <client>-rbac-production <client>-rbac-sandbox; do
  kubectl patch application "$app" -n argocd \
    --type=merge \
    -p '{"spec":{"syncPolicy":{"automated":{"prune":true,"selfHeal":true}}}}'
done

cert-manager will automatically re-initiate certificate issuance on sync. With DNS correctly configured, HTTP-01 challenges will succeed on the first attempt and TLS will be issued within 2–5 minutes.


Lessons Learned

Deployment prerequisites must be verified before applying, not assumed

The triggering action was a deployment made with an incorrect assumption — that DNS was already in place. Infrastructure deployments that depend on external configuration (DNS, external credentials, upstream services) need explicit verification steps, not assumptions. A 30-second DNS check before applying would have prevented this entire incident.

cert-manager’s HTTPRoute leak is a known failure mode for long-running failed challenges

cert-manager creates ACME solver HTTPRoutes but does not remove them on failure — only on success or explicit deletion. In clusters using Gateway API with HTTP-01 ACME, a permanently-failing challenge will accumulate an unbounded number of HTTPRoutes. The mitigation is either DNS-01 (preferred for new services) or a monitoring alert on HTTPRoute count.

Gateway API failure propagation is not intuitive

Engineers familiar with the Kubernetes Ingress model — where each resource is independent — do not expect a misconfiguration on one listener to affect all other listeners on the same Gateway. This behaviour must be documented explicitly. It affects incident response speed: the first responder must know to look at all listener configurations, not just the failing service.

Single-replica ingress is not resilient

With a single Traefik replica, any OOMKill causes 100% ingress unavailability while the pod restarts. A minimum of two replicas ensures one pod can absorb traffic during a crash of the other. The HPA now enforces minReplicas: 2.

Observability gaps extend every incident

Every phase of this incident was extended by the absence of alerting. A 15-hour undetected CrashLoopBackOff. An 18-hour undetected HTTPRoute accumulation. Alerting does not prevent incidents — it compresses the time between onset and response.


End of Report
