cert-manager ACME HTTP-01 Leak: 22,514 Stale HTTPRoutes OOMKilled Traefik
Diagnosed an 18-hour full ingress outage caused by cert-manager leaking 22,514 stale ACME solver HTTPRoutes after TLS certificates were deployed before DNS was configured, compounded by Gateway API blocking all 23 listeners when new listeners referenced non-existent TLS secrets.
Root Cause Analysis — Ingress Layer Outage
Incident ID: INC-2026-03-23-001
Date of Incident: 2026-03-22 (onset) to 2026-03-23 (resolution)
Date of Report: 2026-03-23 (revised)
Severity: Critical (full ingress unavailability)
Status: Resolved
Prepared by: Gideon Warui, Platform/Infra Engineer
Table of Contents
- Executive Summary
- Impact Assessment
- Timeline
- Root Cause Analysis
- Contributing Factors
- Resolution
- Verification
- Corrective Actions
- Lessons Learned
Executive Summary
On the evening of 2026-03-22, four new service namespaces were deployed to the shared Kubernetes cluster. The deployments included TLS Certificate resources backed by a Let’s Encrypt ACME HTTP-01 issuer. The deploying engineer believed the corresponding DNS records had already been pointed at the cluster’s load balancer. They had not.
Because DNS was not configured, the ACME HTTP-01 challenges could never succeed — Let’s Encrypt could not reach the challenge token endpoint. cert-manager retried continuously, and each retry created a new HTTPRoute object (the ACME solver route) without removing the previous failed attempt. Over the next 18 hours, this leaked 22,514 stale HTTPRoutes into the cluster.
Traefik, the cluster’s ingress controller, loads its entire routing table into memory on startup and on each configuration change. With 22,514 routes to process, Traefik’s memory consumption spiked far beyond its configured limit, causing repeated OOMKills (exit code 137). The pod entered CrashLoopBackOff, and all HTTPS ingress became unavailable.
A secondary compounding failure worsened recovery: the same new deployments had added Gateway listeners referencing TLS secrets that did not yet exist. The Kubernetes Gateway API marks the entire Gateway resource as Programmed: False when any listener has an unresolvable configuration. This blocked traffic on all 23 listeners, including the 19 previously healthy ones, not just the four new ones.
Resolution required four actions in sequence: cleaning up 22,514 stale HTTPRoutes; increasing Traefik’s memory limit from 512Mi to 1024Mi and adding an HPA; deleting active ACME challenges and certificates to stop the retry loop; and injecting temporary placeholder TLS secrets to break the Gateway deadlock.
Total effective downtime, from the first OOMKill (~19:49 on 2026-03-22) to full route restoration (~12:37 on 2026-03-23), was approximately 17 hours.
Impact Assessment
Services Affected
All services routed through the cluster’s shared Traefik Gateway were impacted across both production and sandbox environments.
| Environment | Impact |
|---|---|
| All production HTTPS endpoints | Fully unreachable from the internet |
| All sandbox HTTPS endpoints | Fully unreachable from the internet |
| Internal cluster traffic | Unaffected |
| Database and message queue layers | Unaffected |
| Background workers and async jobs | Unaffected |
| Kubernetes control plane | Unaffected |
What Was Not Affected
- All application pods continued running normally
- Pod-to-pod and service-to-service communication was unaffected
- The ArgoCD GitOps layer continued reconciling application state
- Karpenter continued managing node provisioning (separate issue, resolved independently)
Timeline
All times are EAT (UTC+3).
| Time | Event |
|---|---|
| 2026-03-22 ~18:00 | Four new service namespaces deployed to the cluster via ArgoCD. Each included a Certificate resource targeting ACME HTTP-01 issuance, and new Gateway listeners referencing the expected TLS secrets. The deploying engineer believed DNS records for the new domains were already pointing to the cluster NLB. They were not. |
| 2026-03-22 ~18:00 onwards | cert-manager begins ACME HTTP-01 challenge attempts for the new certificates. Challenges fail immediately — DNS not pointing to cluster, Let’s Encrypt cannot reach the solver endpoint. cert-manager enters exponential backoff retry loop. Each retry creates a new cm-acme-http-solver-* HTTPRoute without removing the previous one. HTTPRoute count begins accumulating. |
| 2026-03-22 ~19:49 | Traefik pod first OOMKill. The accumulating HTTPRoutes inflate the in-memory routing table beyond the 512Mi limit. Exit code 137. Pod enters CrashLoopBackOff. |
| 2026-03-22 ~20:00 | All HTTPS routes begin returning connection errors. cert-manager ACME challenges now additionally stalled because Traefik is not routing — the HTTP-01 solver endpoint is unreachable even if DNS were correct. |
| 2026-03-22 ~20:00 — 2026-03-23 ~11:00 | CrashLoopBackOff cycle continues. Each Traefik restart loads an increasingly large route table, OOMKills within seconds. HTTPRoute count continues growing as cert-manager retries. No alerting fires. |
| 2026-03-23 ~11:00 | Incident identified during manual cluster health review. |
| 2026-03-23 ~11:04 | Traefik confirmed in CrashLoopBackOff with 102 restarts. OOMKill confirmed via kubectl describe. Memory limit at the time: 512Mi. |
| 2026-03-23 ~11:12 | Helm upgrade executed: Traefik memory limit raised to 512Mi (first attempt, insufficient). Gateway deadlock identified. Temporary TLS secrets injected. Gateway annotation patched to trigger reconciliation. All 23 listeners reach Programmed: True. Routes begin responding. |
| 2026-03-23 ~11:15 | 20 of 24 HTTPS endpoints confirmed healthy. Investigation closed prematurely — the ACME HTTPRoute leak had not yet been identified. cert-manager continues accumulating routes in the background. |
| 2026-03-23 ~12:11 | Traefik OOMKills again. HTTPRoute count has now reached 22,514. New memory limit (512Mi) is also insufficient. Pod re-enters CrashLoopBackOff. All routes return 404. |
| 2026-03-23 ~12:20 | Second investigation begins. ACME HTTPRoute accumulation identified as root cause. |
| 2026-03-23 ~12:22 | Mass deletion of 22,514 stale cm-acme-http-solver-* HTTPRoutes begins (parallel batch deletion). |
| 2026-03-23 ~12:27 | Traefik memory limit raised from 512Mi to 1024Mi. GOMEMLIMIT set to 900MiB. Helm revision 5 deployed. |
| 2026-03-23 ~12:27 | HPA configured: 2–5 replicas, 70% CPU target, 5-minute scaleDown stabilization window. |
| 2026-03-23 ~12:30 | Active ACME challenges, orders, and Certificate objects deleted from affected namespaces. Auto-sync disabled on the four affected ArgoCD applications to prevent cert-manager re-engaging before DNS is configured. |
| 2026-03-23 ~12:37 | HTTPRoute count reaches zero stale routes. Traefik route table normalised to ~30 entries. CPU drops from 286% to 1%. All HTTPS routes confirmed responding. |
Root Cause Analysis
Root Cause — ACME HTTP-01 Certificate Issuance Initiated Before DNS Was Configured
What happened:
Four new service namespaces (<client>-pay-production, <client>-pay-sandbox, <client>-rbac-production, <client>-rbac-sandbox) were deployed to the cluster on the evening of 2026-03-22.
Each namespace included:
- A cert-manager Certificate resource targeting ACME HTTP-01 issuance via Let’s Encrypt
- A Gateway listener referencing the expected TLS secret
- An HTTPRoute for the application
The deployment was correct in structure — the sequence of creating a Certificate and a Gateway listener together is the standard pattern. The critical missing prerequisite was that the DNS A records for the new service domains had not yet been pointed to the cluster’s Network Load Balancer.
ACME HTTP-01 validation works by having Let’s Encrypt send an HTTP request to http://<domain>/.well-known/acme-challenge/<token>. cert-manager creates an HTTPRoute to serve this token via Traefik. For this to succeed, the domain’s DNS must already resolve to the cluster’s ingress IP. Without DNS, Let’s Encrypt’s request never reaches the cluster, and the challenge fails immediately.
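The request path described above can be approximated from outside the cluster before any Certificate is applied. The following is a minimal sketch, not part of the incident tooling: `acme_challenge_url` and `probe_http01_path` are hypothetical helper names, and the default token `preflight-probe` is a placeholder. The probe only shows whether port 80 for the domain reaches something that answers HTTP; it does not prove that a real challenge would validate.

```shell
# Build the URL that Let's Encrypt requests during HTTP-01 validation.
acme_challenge_url() {
  domain="$1"
  token="$2"
  printf 'http://%s/.well-known/acme-challenge/%s\n' "$domain" "$token"
}

# Probe the challenge path and print the HTTP status code.
# curl prints 000 when DNS resolution or the TCP connection fails.
probe_http01_path() {
  url=$(acme_challenge_url "$1" "${2:-preflight-probe}")
  curl --silent --output /dev/null --max-time 10 \
    --write-out '%{http_code}\n' "$url"
}
```

Running `probe_http01_path pay.example.com` from a machine outside the cluster and getting `000` would suggest that the domain does not yet reach any HTTP endpoint, which is the condition that made every challenge in this incident fail.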
cert-manager’s retry behaviour does not clean up failed challenge HTTPRoute objects. Each retry attempt creates a new cm-acme-http-solver-* HTTPRoute. Over 18 hours at exponential backoff intervals, this produced 22,514 stale HTTPRoutes across the four namespaces:
<client>-pay-production: 5,445 stale HTTPRoutes
<client>-pay-sandbox: 5,695 stale HTTPRoutes
<client>-rbac-production: 5,697 stale HTTPRoutes
<client>-rbac-sandbox: 5,697 stale HTTPRoutes
─────────────────────────────────────────────────
Total: 22,534 stale HTTPRoutes
Traefik processes every HTTPRoute in the cluster on startup and on each configuration change event. With 22,534 routes in memory — each representing a parsed routing rule, hostname match, and backend reference — Traefik’s memory consumption exceeded its 512Mi limit during the reconciliation cycle and was terminated by the Linux OOM killer.
Evidence — Traefik OOMKill:
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Mon, 23 Mar 2026 12:11:10 +0300
Finished: Mon, 23 Mar 2026 12:14:59 +0300
Limits: cpu: 500m memory: 512Mi
Requests: cpu: 100m memory: 128Mi
Evidence — HTTPRoute count before cleanup:
kubectl get httproute -A --no-headers | grep "cm-acme-http-solver" | wc -l
# 22514
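The same listing can be broken down per namespace to see where the solver routes are accumulating. This is a sketch under the assumption that the input is the default `kubectl get httproute -A --no-headers` output, where the first column is the namespace and the second is the route name; `count_solver_routes_by_ns` is a hypothetical helper name.

```shell
# Count cm-acme-http-solver-* HTTPRoutes per namespace.
# Reads `kubectl get httproute -A --no-headers` output on stdin:
# column 1 is the namespace, column 2 is the route name.
count_solver_routes_by_ns() {
  awk '$2 ~ /^cm-acme-http-solver-/ { n[$1]++; total++ }
       END {
         # Sort the per-namespace lines, then print the total last.
         for (ns in n) printf "%s %d\n", ns, n[ns] | "sort"
         close("sort")
         printf "TOTAL %d\n", total
       }'
}
```

Usage: `kubectl get httproute -A --no-headers | count_solver_routes_by_ns`.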
Evidence (Traefik log, backend service not found):
ERR Unable to load HTTPRoute backend: Cannot load HTTPBackendRef
<client>-pay-production/<client>-pay-web: getting service: service "<client>-pay-web" not found
http_route=<client>-pay-route namespace=<client>-pay-production
The backend service did not exist because the application container image had not yet been deployed — another prerequisite that was not in place at the time of the deployment.
Compounding Factor — Gateway API Listener Failure Blocks All Routes
When the new namespaces were deployed, their Gateway listeners referenced TLS secrets (<client>-pay-prod-tls-secret, <client>-rbac-prod-tls-secret, etc.) that cert-manager had not yet been able to create (because challenges were failing).
The Kubernetes Gateway API marks the entire Gateway resource Programmed: False when any listener references a secret that does not exist. This is not scoped to the failing listener — it affects all 23 listeners on the same Gateway, including the 19 that had valid, working configurations.
Conditions:
Type: Programmed
Status: False
Reason: InvalidCertificateRef
Message: Error while retrieving certificate:
getting secret: secret "<client>-pay-prod-tls-secret" not found
This created a self-reinforcing deadlock:
Missing TLS secrets → Gateway Programmed: False
↓
All 23 listeners stop routing (including working ones)
↓
ACME HTTP-01 solver pods unreachable
(cert-manager's HTTPRoutes not being served)
↓
Certificates cannot be issued
(challenges fail even if DNS were correct)
↓
TLS secrets never created
↓
[DEADLOCK] — Gateway cannot program; cert-manager cannot issue
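One way to catch the first link of this chain before it reaches the Gateway is to confirm that every TLS secret a new listener will reference already exists. The sketch below reuses the `namespace/secret` pair format from the remediation loop in this report; `missing_tls_secrets` is a hypothetical helper name and the pairs passed to it are supplied by the caller.

```shell
# Report every namespace/secret pair whose Secret does not exist.
# Returns non-zero if at least one is missing.
missing_tls_secrets() {
  rc=0
  for ns_secret in "$@"; do
    ns="${ns_secret%%/*}"        # text before the first slash
    secret="${ns_secret##*/}"    # text after the last slash
    if ! kubectl get secret "$secret" -n "$ns" >/dev/null 2>&1; then
      echo "MISSING $ns/$secret"
      rc=1
    fi
  done
  return "$rc"
}
```

Calling this with the secrets for the listeners about to be added, and refusing to apply the Gateway change on a non-zero exit, would keep a missing secret from affecting listeners that are already serving traffic.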
Contributing Factors
CF-1 — No DNS prerequisite check in the deployment process
The deployment procedure for a new HTTPS service did not include a step to verify that DNS records were pointing to the cluster load balancer before applying cert-manager Certificate resources. ACME HTTP-01 issuance has a hard dependency on DNS being correct at the time of issuance.
CF-2 — cert-manager does not garbage-collect failed challenge HTTPRoutes
cert-manager creates a new HTTPRoute for each HTTP-01 challenge attempt but does not remove routes from previous failed attempts. This is a known upstream behaviour. In a functioning cluster the backoff intervals are long enough that accumulation is not significant — but with a permanently-failing challenge (no DNS, no backend service), the retry frequency is high enough to produce thousands of objects over hours.
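The figures in this report give a rough sense of how fast the objects accumulated. The calculation below is only an average over the 18-hour window, assuming an even split across the four namespaces; the real retries were not evenly spaced, and `routes_per_minute` is a hypothetical helper name.

```shell
# Average solver-route creation rate implied by the report's figures.
# Arguments: total routes, elapsed hours, number of namespaces (default 1).
routes_per_minute() {
  total_routes="$1"
  hours="$2"
  namespaces="${3:-1}"
  awk -v r="$total_routes" -v h="$hours" -v n="$namespaces" \
    'BEGIN { printf "%.1f\n", r / (h * 60) / n }'
}

routes_per_minute 22514 18      # cluster-wide average: 20.8
routes_per_minute 22514 18 4    # per affected namespace: 5.2
```

An average of about five new routes per minute in each namespace, sustained for 18 hours, indicates that the retry interval never backed off far enough to limit the accumulation.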
CF-3 — No alerting on HTTPRoute accumulation or cert-manager retry storms
No alert existed to detect abnormal HTTPRoute counts or sustained ACME request rates without certificate issuance. The accumulation of 22,514 routes over 18 hours produced no notification.
CF-4 — No alerting on Traefik memory pressure or CrashLoopBackOff
Traefik had been in CrashLoopBackOff for the duration of the accumulation window. No alert fired. The failure was discovered only during a manual cluster health check.
CF-5 — Traefik memory limit not sized for route table growth
The Traefik memory limit (512Mi at the time of the second OOMKill) was provisioned for a stable route table of ~30 routes. It had no headroom for transient inflation caused by accumulated challenge routes or future route table growth. No HPA was configured to distribute load across multiple replicas.
CF-6 — Gateway API listener failure blast radius not understood
The failure behaviour of the Gateway API — where one invalid listener blocks all listeners on the same resource — was not documented and not widely understood. This extended the investigation time during the first resolution attempt and contributed to premature closure before the HTTPRoute accumulation was identified.
Resolution
I resolved the outage in two phases. Phase 1 addressed the Gateway deadlock. Phase 2 addressed the underlying HTTPRoute accumulation.
Phase 1 — Gateway Deadlock Resolution (2026-03-23 ~11:12)
Action 1.1 — Inject temporary TLS secrets
I created placeholder self-signed secrets under the exact names referenced by the failing Gateway listeners. cert-manager overwrites these automatically on successful certificate issuance.
openssl req -x509 -nodes -newkey rsa:2048 -days 1 \
-keyout /tmp/tls.key -out /tmp/tls.crt \
-subj "/CN=temp-bootstrap" 2>/dev/null
for ns_secret in \
"<client>-pay-production/<client>-pay-prod-tls-secret" \
"<client>-pay-sandbox/sandbox-tls-secret" \
"<client>-rbac-production/<client>-rbac-prod-tls-secret" \
"<client>-rbac-sandbox/sandbox-tls-secret"; do
ns="${ns_secret%%/*}"
secret="${ns_secret##*/}"
kubectl create secret tls "$secret" -n "$ns" \
--cert=/tmp/tls.crt --key=/tmp/tls.key \
--dry-run=client -o yaml | kubectl apply -f -
done
Action 1.2 — Force Gateway reconciliation
kubectl annotate gateway traefik-gateway -n traefik \
"kubectl.kubernetes.io/last-applied-restart=$(date -u +%s)" \
--overwrite
All 23 listeners transitioned to Programmed: True within 8 seconds.
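The transition can be confirmed by polling the Gateway's top-level Programmed condition rather than watching it by eye. This is a sketch: `wait_gateway_programmed` and the `POLL_INTERVAL` variable are hypothetical names, and the default of 30 attempts is an arbitrary choice.

```shell
# Poll a Gateway until its Programmed condition is True.
# Arguments: gateway name, namespace, max attempts (default 30).
wait_gateway_programmed() {
  gw="$1"
  ns="$2"
  tries="${3:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    status=$(kubectl get gateway "$gw" -n "$ns" \
      -o jsonpath='{.status.conditions[?(@.type=="Programmed")].status}')
    if [ "$status" = "True" ]; then
      echo "programmed"
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_INTERVAL:-2}"
  done
  echo "not programmed" >&2
  return 1
}
```

Usage: `wait_gateway_programmed traefik-gateway traefik`.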
Phase 2 — ACME HTTPRoute Cleanup and Traefik Stabilisation (2026-03-23 ~12:20)
Action 2.1 — Delete 22,514 stale ACME HTTPRoutes
for ns in <client>-pay-production <client>-pay-sandbox <client>-rbac-production <client>-rbac-sandbox; do
  # -r skips kubectl entirely when a namespace has no matching routes
  kubectl get httproute -n "$ns" --no-headers \
    | grep "cm-acme-http-solver" \
    | awk '{print $1}' \
    | xargs -r -P10 -n100 kubectl delete httproute -n "$ns"
done
Action 2.2 — Increase Traefik memory limit and add GOMEMLIMIT
helm upgrade traefik traefik/traefik \
--version 38.0.2 \
--namespace traefik \
--reuse-values \
--set "resources.limits.memory=1024Mi" \
--set "resources.requests.memory=256Mi" \
--set "env[0].name=GOMEMLIMIT" \
--set "env[0].value=900MiB"
| Setting | Before | After |
|---|---|---|
| Memory limit | 512Mi | 1024Mi |
| Memory request | 128Mi | 256Mi |
| GOMEMLIMIT | 115MiB | 900MiB |
Action 2.3 — Configure Traefik HPA
helm upgrade traefik traefik/traefik \
--version 38.0.2 \
--namespace traefik \
--reuse-values \
--set "autoscaling.enabled=true" \
--set "autoscaling.minReplicas=2" \
--set "autoscaling.maxReplicas=5" \
--set "autoscaling.metrics[0].type=Resource" \
--set "autoscaling.metrics[0].resource.name=cpu" \
--set "autoscaling.metrics[0].resource.target.type=Utilization" \
--set "autoscaling.metrics[0].resource.target.averageUtilization=70" \
--set "autoscaling.behavior.scaleDown.stabilizationWindowSeconds=300"
Action 2.4 — Stop cert-manager retry loop
I deleted active challenges, orders, and Certificate objects from the four affected namespaces and disabled auto-sync on the corresponding ArgoCD applications to prevent cert-manager from re-engaging until DNS and backend services are ready.
# Disable ArgoCD auto-sync on affected applications
for app in <client>-pay-production <client>-pay-sandbox <client>-rbac-production <client>-rbac-sandbox; do
kubectl patch application "$app" -n argocd \
--type=merge \
-p '{"spec":{"syncPolicy":null}}'
done
# Delete all cert-manager resources in affected namespaces
for ns in <client>-pay-production <client>-pay-sandbox <client>-rbac-production <client>-rbac-sandbox; do
kubectl delete certificate --all -n "$ns"
kubectl delete certificaterequest --all -n "$ns"
kubectl delete order --all -n "$ns"
kubectl delete challenge --all -n "$ns"
done
Verification
HTTPRoute count after cleanup:
kubectl get httproute -A --no-headers | grep "cm-acme-http-solver" | wc -l
# 0
Traefik pod status:
NAME READY STATUS RESTARTS AGE
traefik-749c566fd-zj2mb 1/1 Running 0 stable
traefik-749c566fd-x87wk 1/1 Running 0 stable (HPA second replica)
Traefik memory usage:
NAME CPU(cores) MEMORY(bytes)
traefik-749c566fd-zj2mb 8m 115Mi
traefik-749c566fd-x87wk 6m 52Mi
HPA status:
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS
traefik Deployment/traefik cpu: 1%/70% 2 5 2
All 23 Gateway listeners:
All listeners: Programmed=True, ResolvedRefs=True
HTTPS endpoint sweep: All previously-active routes confirmed responding. The four <client>-pay/<client>-rbac endpoints remain unavailable — expected, as DNS and application images are not yet deployed. These are not incident-related.
Corrective Actions
CA-1 — DNS Verification Gate Before Certificate Deployment
Priority: Immediate Owner: Platform Engineering / DevOps
No HTTPS service should be deployed to the cluster until DNS records for its domain have been confirmed pointing to the cluster NLB. Pre-deployment DNS check:
CLUSTER_NLB="<cluster-fqdn>"
# REVIEW: redacted — confirm
DOMAIN="pay.example.com"
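# Keep only IP lines: dig +short also prints CNAME targets, and an NLB returns several IPs.
RESOLVED=$(dig +short "$DOMAIN" A | grep -E '^[0-9.]+$')
NLB_IPS=$(dig +short "$CLUSTER_NLB" A | grep -E '^[0-9.]+$')
# Empty lookups must fail; otherwise two failed lookups would compare as equal.
if [ -z "$RESOLVED" ] || [ -z "$NLB_IPS" ]; then
  echo "ERROR: could not resolve $DOMAIN or $CLUSTER_NLB. Do not deploy certificate."
  exit 1
fi
# Every address the domain resolves to must belong to the NLB.
for ip in $RESOLVED; do
  if ! printf '%s\n' "$NLB_IPS" | grep -Fxq "$ip"; then
    echo "ERROR: $DOMAIN resolves to $ip, which is not a cluster NLB address. Do not deploy certificate."
    exit 1
  fi
done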
echo "OK: DNS confirmed. Safe to deploy."
For services that genuinely require certificate issuance before DNS is finalised, switch to DNS-01 ACME validation (Route53 solver). DNS-01 validates domain ownership via a TXT record rather than HTTP routing, completely eliminating the dependency on ingress being functional.
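A DNS-01 issuer for this purpose might look like the manifest printed by the sketch below. The helper name `render_dns01_issuer`, the issuer name, the email, and the region are all placeholders; the sketch also assumes that cert-manager obtains its Route53 credentials through IRSA, so no access keys appear in the manifest.

```shell
# Print a ClusterIssuer manifest that uses the Route53 DNS-01 solver.
# Arguments: issuer name, ACME account email, AWS region (all placeholders).
render_dns01_issuer() {
  name="$1"
  email="$2"
  region="$3"
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: ${name}
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ${email}
    privateKeySecretRef:
      name: ${name}-account-key
    solvers:
      - dns01:
          route53:
            region: ${region}
EOF
}
```

After review, the output can be piped to `kubectl apply -f -`, for example `render_dns01_issuer letsencrypt-dns01 platform@example.com eu-west-1 | kubectl apply -f -`.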
CA-2 — Traefik HPA and Resource Sizing (Completed)
Priority: Immediate — COMPLETED Owner: Platform Engineering
Traefik is now configured with:
- Memory limit: 1024Mi
- GOMEMLIMIT: 900MiB (Go runtime soft cap — triggers GC before hard limit)
- HPA: 2–5 replicas, 70% CPU target, 5-minute scaleDown stabilization
The 1Gi limit provides headroom for transient route table growth. The HPA ensures a single pod OOMKill does not cause full ingress unavailability — the second replica continues serving traffic during a crash.
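The gap between the Go soft cap and the container hard limit is what absorbs memory that the Go runtime cannot reclaim in time. A quick calculation with the values from this report, using a hypothetical helper named `gomemlimit_headroom`:

```shell
# Headroom between GOMEMLIMIT and the container memory limit, both in MiB.
gomemlimit_headroom() {
  awk -v g="$1" -v l="$2" \
    'BEGIN { printf "%d MiB (%.1f%%)\n", l - g, (l - g) / l * 100 }'
}

gomemlimit_headroom 900 1024    # prints: 124 MiB (12.1%)
```

Whether roughly 12% is enough headroom depends on the workload; treat it as a starting point to validate against observed memory usage, not a rule.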
CA-3 — Alerting for ACME HTTPRoute Accumulation and Traefik Saturation (Completed)
Priority: Immediate — COMPLETED Owner: Platform Engineering
The following VMRule alerts have been deployed to the observability namespace and are currently operational:
| Alert | Condition | Severity |
|---|---|---|
| AcmeRetryStorm | High ACME request rate with no certificates becoming Ready for 20+ minutes | critical |
| AcmeHttpRouteCountHigh | More than 200 HTTPRoutes in cluster for 5+ minutes | warning |
| TraefikMemoryPressure | Traefik memory > 70% of limit for 3+ minutes | warning |
| TraefikHPAAtMaxReplicas | HPA pegged at maximum replicas for 10+ minutes | warning |
| TraefikHighConfigReloadRate | Config reload rate > 0.5/sec for 5+ minutes | warning |
| TraefikCrashLoopBackOff | Traefik pod in CrashLoopBackOff for 5+ minutes | critical |
Had these alerts been in place, the expected notification sequence for this incident would have been:
~18:20 AcmeRetryStorm fires (challenges failing with no Ready certs)
~19:30 AcmeHttpRouteCountHigh fires (>200 routes accumulated)
~19:45 TraefikMemoryPressure fires (>70% memory approaching limit)
~19:54 TraefikCrashLoopBackOff fires (5 minutes after the first OOMKill at ~19:49)
The outage would have been detectable within 20 minutes of the triggering deployment rather than after 15+ hours.
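The deployed VMRule expressions are not reproduced in this report. As an ad-hoc approximation of the AcmeHttpRouteCountHigh condition, usable from a cron job or CI step, a count-and-compare check could be sketched as follows. `httproute_count_check` is a hypothetical helper name, and the default threshold of 200 mirrors the alert table above.

```shell
# Exit non-zero when the cluster-wide HTTPRoute count exceeds a threshold.
# Argument: threshold (default 200).
httproute_count_check() {
  threshold="${1:-200}"
  count=$(kubectl get httproute -A --no-headers 2>/dev/null | wc -l | tr -d ' ')
  echo "httproutes=$count threshold=$threshold"
  [ "$count" -le "$threshold" ]
}
```

Unlike the alert, this check has no "for 5+ minutes" hold period, so a single run reflects only the count at that moment.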
CA-4 — Alertmanager Notification Channel
Priority: Immediate Owner: Platform Engineering
vmalert currently evaluates all of these alerts and they fire correctly, but the alertmanager receiver is not yet configured. Until a notification channel is wired up (a Mattermost incoming webhook), firing alerts are not delivered to any external channel.
Required: Configure alertmanager with a Mattermost webhook receiver. Alert routing: severity: critical → immediate notification; severity: warning → notification with 5-minute grouping interval.
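Before the receiver is wired into alertmanager, the webhook itself can be smoke-tested with a plain HTTP POST. This sketch assumes a standard Mattermost incoming webhook that accepts a JSON body with a `text` field; `mattermost_payload`, `notify_mattermost`, and the `MATTERMOST_WEBHOOK_URL` variable are hypothetical names.

```shell
# Build the JSON body for a Mattermost incoming webhook.
# Escapes backslashes and double quotes; assumes single-line input.
mattermost_payload() {
  text=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"text":"%s"}\n' "$text"
}

# POST a message to the webhook URL held in MATTERMOST_WEBHOOK_URL.
notify_mattermost() {
  curl --silent --show-error --fail \
    -H 'Content-Type: application/json' \
    -d "$(mattermost_payload "$1")" \
    "$MATTERMOST_WEBHOOK_URL"
}
```

Usage: `notify_mattermost "TraefikCrashLoopBackOff test message"`. A successful delivery here confirms only that the channel works; the routing by severity still has to be configured in alertmanager.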
CA-5 — Deployment Checklist for New HTTPS Services
Priority: Medium-term Owner: Platform Engineering / DevOps
New HTTPS Service Onboarding Checklist
──────────────────────────────────────
□ DNS A record for service domain points to cluster NLB
Verify: dig +short <domain> (must match NLB IP)
□ Backend Deployment and Service exist in the target namespace
Verify: kubectl get deploy,svc -n <namespace>
□ Container image built and pushed to registry
Verify: Image present in ECR / registry
□ Gateway listener added with correct TLS secret name
□ cert-manager Certificate resource added (HTTP-01 or DNS-01)
□ If using HTTP-01: DNS verified before applying Certificate
If using DNS-01: Route53 IRSA permissions verified
□ ArgoCD application sync-wave ordering reviewed
(namespace/secrets before Gateway listener)
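The backend item on this checklist can be scripted. The sketch below checks that the Service exists and has at least one ready endpoint address; `preflight_backend` is a hypothetical helper name, and the check assumes the cluster still serves the core Endpoints API.

```shell
# Verify that a Service exists and has at least one ready endpoint address.
# Arguments: namespace, service name.
preflight_backend() {
  ns="$1"
  svc="$2"
  if ! kubectl get service "$svc" -n "$ns" >/dev/null 2>&1; then
    echo "FAIL service $ns/$svc not found"
    return 1
  fi
  ips=$(kubectl get endpoints "$svc" -n "$ns" \
    -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null)
  if [ -z "$ips" ]; then
    echo "FAIL service $ns/$svc has no ready endpoints"
    return 1
  fi
  echo "OK $ns/$svc"
}
```

Run against the backend named in the HTTPRoute, this would have flagged the "service not found" condition seen in the logs before the route was applied.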
CA-6 — Document Gateway API Blast Radius in Runbook
Priority: Medium-term Owner: Platform Engineering
Add a runbook entry covering the Gateway API failure model:
- A single listener with InvalidCertificateRef marks the entire Gateway as Programmed: False
- This blocks ALL listeners, including those with valid configuration
- Resolution path: create placeholder TLS secret → patch Gateway annotation to force reconciliation
- Detection: kubectl get gateway -n traefik -o wide → check PROGRAMMED column
- Detailed listener status: kubectl get gateway traefik-gateway -n traefik -o jsonpath='{.status.listeners}'
CA-7 — Re-enable <client>-pay and <client>-rbac When Ready
Priority: When DNS and images are available Owner: Platform Engineering / Application Team
The four affected ArgoCD applications have auto-sync disabled. When the application team confirms DNS records are pointing to the cluster NLB and container images are built and pushed to the registry:
for app in <client>-pay-production <client>-pay-sandbox <client>-rbac-production <client>-rbac-sandbox; do
kubectl patch application "$app" -n argocd \
--type=merge \
-p '{"spec":{"syncPolicy":{"automated":{"prune":true,"selfHeal":true}}}}'
done
cert-manager will automatically re-initiate certificate issuance on sync. With DNS correctly configured, HTTP-01 challenges should succeed on the first attempt, with certificates typically issued within 2–5 minutes.
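After re-enabling sync, issuance can be confirmed with a bounded wait rather than assumed. A minimal sketch: `wait_certificates_ready` and the `TIMEOUT` variable are hypothetical names, and the 300-second default simply matches the upper end of the 2–5 minute estimate.

```shell
# Wait for every Certificate in each given namespace to become Ready.
# Returns non-zero if any namespace times out or has no Certificates.
wait_certificates_ready() {
  timeout="${TIMEOUT:-300s}"
  rc=0
  for ns in "$@"; do
    if kubectl wait --for=condition=Ready certificate --all \
         -n "$ns" --timeout="$timeout" >/dev/null; then
      echo "READY $ns"
    else
      echo "NOT READY $ns"
      rc=1
    fi
  done
  return "$rc"
}
```

If a namespace reports NOT READY, pausing auto-sync again before investigating avoids a repeat of the retry accumulation described in this report.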
Lessons Learned
Deployment prerequisites must be verified before applying, not assumed
The triggering action was a deployment made with an incorrect assumption — that DNS was already in place. Infrastructure deployments that depend on external configuration (DNS, external credentials, upstream services) need explicit verification steps, not assumptions. A 30-second DNS check before applying would have prevented this entire incident.
cert-manager’s HTTPRoute leak is a known failure mode for long-running failed challenges
cert-manager creates ACME solver HTTPRoutes but does not remove them on failure — only on success or explicit deletion. In clusters using Gateway API with HTTP-01 ACME, a permanently-failing challenge will accumulate an unbounded number of HTTPRoutes. The mitigation is either DNS-01 (preferred for new services) or a monitoring alert on HTTPRoute count.
Gateway API failure propagation is not intuitive
Engineers familiar with the Kubernetes Ingress model — where each resource is independent — do not expect a misconfiguration on one listener to affect all other listeners on the same Gateway. This behaviour must be documented explicitly. It affects incident response speed: the first responder must know to look at all listener configurations, not just the failing service.
Single-replica ingress is not resilient
With a single Traefik replica, any OOMKill causes 100% ingress unavailability while the pod restarts. A minimum of two replicas ensures one pod can absorb traffic during a crash of the other. The HPA now enforces minReplicas: 2.
Observability gaps extend every incident
Every phase of this incident was extended by the absence of alerting. A 15-hour undetected CrashLoopBackOff. An 18-hour undetected HTTPRoute accumulation. Alerting does not prevent incidents — it compresses the time between onset and response.
End of Report