Root Cause Analysis — Ingress Layer Outage

Incident ID: INC-2026-03-23-001 Date of Incident: 2026-03-22 (onset) — 2026-03-23 (resolution) Date of Report: 2026-03-23 (revised) Severity: Critical — full ingress unavailability Status: Resolved Prepared by: Gideon Warui — Platform/Infra Engineer

Executive Summary
Impact Assessment
Timeline
Root Cause Analysis
Contributing Factors
Resolution
Verification
Corrective Actions
Lessons Learned

Executive Summary

On the evening of 2026-03-22, two new service namespaces were deployed to the shared Kubernetes cluster. The deployments included TLS Certificate resources backed by a Let’s Encrypt ACME HTTP-01 issuer. The deploying engineer believed the corresponding DNS records had already been pointed at the cluster’s load balancer. They had not.

Because DNS was not configured, the ACME HTTP-01 challenges could never succeed — Let’s Encrypt could not reach the challenge token endpoint. cert-manager retried continuously, and each retry created a new HTTPRoute object (the ACME solver route) without removing the previous failed attempt. Over the next 18 hours, this leaked 22,514 stale HTTPRoutes into the cluster.

Traefik, the cluster’s ingress controller, loads its entire routing table into memory on startup and on each configuration change. With 22,514 routes to process, Traefik’s memory consumption spiked far beyond its configured limit, causing repeated OOMKills (exit code 137). The pod entered CrashLoopBackOff, and all HTTPS ingress became unavailable.

A secondary compounding failure worsened recovery: the same new deployments had added Gateway listeners referencing TLS secrets that did not yet exist. The Kubernetes Gateway API marks the entire Gateway resource as Programmed: False when any listener has an unresolvable configuration — blocking traffic on all 23 previously-healthy listeners, not just the two new ones.

Resolution required four actions in sequence: cleaning up 22,514 stale HTTPRoutes; increasing Traefik’s memory limit from 512Mi to 1024Mi and adding an HPA; deleting active ACME challenges and certificates to stop the retry loop; and injecting temporary placeholder TLS secrets to break the Gateway deadlock.

Total effective downtime from first OOMKill to full route restoration was approximately 18–20 hours.

Impact Assessment

Services Affected

All services routed through the cluster’s shared Traefik Gateway were impacted across both production and sandbox environments.

Environment	Impact
All production HTTPS endpoints	Fully unreachable from the internet
All sandbox HTTPS endpoints	Fully unreachable from the internet
Internal cluster traffic	Unaffected
Database and message queue layers	Unaffected
Background workers and async jobs	Unaffected
Kubernetes control plane	Unaffected

What Was Not Affected

All application pods continued running normally
Pod-to-pod and service-to-service communication was unaffected
The ArgoCD GitOps layer continued reconciling application state
Karpenter continued managing node provisioning (separate issue, resolved independently)

Timeline

All times are EAT (UTC+3).

Time	Event
2026-03-22 ~18:00	Four new service namespaces deployed to the cluster via ArgoCD. Each included a `Certificate` resource targeting ACME HTTP-01 issuance, and new Gateway listeners referencing the expected TLS secrets. The deploying engineer believed DNS records for the new domains were already pointing to the cluster NLB. They were not.
2026-03-22 ~18:00 onwards	cert-manager begins ACME HTTP-01 challenge attempts for the new certificates. Challenges fail immediately — DNS not pointing to cluster, Let’s Encrypt cannot reach the solver endpoint. cert-manager enters exponential backoff retry loop. Each retry creates a new `cm-acme-http-solver-*` HTTPRoute without removing the previous one. HTTPRoute count begins accumulating.
2026-03-22 ~19:49	Traefik pod first OOMKill. The accumulating HTTPRoutes inflate the in-memory routing table beyond the 512Mi limit. Exit code 137. Pod enters CrashLoopBackOff.
2026-03-22 ~20:00	All HTTPS routes begin returning connection errors. cert-manager ACME challenges now additionally stalled because Traefik is not routing — the HTTP-01 solver endpoint is unreachable even if DNS were correct.
2026-03-22 ~20:00 — 2026-03-23 ~11:00	CrashLoopBackOff cycle continues. Each Traefik restart loads an increasingly large route table, OOMKills within seconds. HTTPRoute count continues growing as cert-manager retries. No alerting fires.
2026-03-23 ~11:00	Incident identified during manual cluster health review.
2026-03-23 ~11:04	Traefik confirmed in CrashLoopBackOff with 102 restarts. OOMKill confirmed via `kubectl describe`. Memory limit at the time: 512Mi.
2026-03-23 ~11:12	Helm upgrade executed: Traefik memory limit increased from 512Mi to 512Mi (first attempt — insufficient). Gateway deadlock identified. Temporary TLS secrets injected. Gateway annotation patched to trigger reconciliation. All 23 listeners reach `Programmed: True`. Routes begin responding.
2026-03-23 ~11:15	20 of 24 HTTPS endpoints confirmed healthy. Investigation closed prematurely — the ACME HTTPRoute leak had not yet been identified. cert-manager continues accumulating routes in the background.
2026-03-23 ~12:11	Traefik OOMKills again. HTTPRoute count has now reached 22,514. New memory limit (512Mi) is also insufficient. Pod re-enters CrashLoopBackOff. All routes return 404.
2026-03-23 ~12:20	Second investigation begins. ACME HTTPRoute accumulation identified as root cause.
2026-03-23 ~12:22	Mass deletion of 22,514 stale `cm-acme-http-solver-*` HTTPRoutes begins (parallel batch deletion).
2026-03-23 ~12:27	Traefik memory limit raised from 512Mi to 1024Mi. GOMEMLIMIT set to 900MiB. Helm revision 5 deployed.
2026-03-23 ~12:27	HPA configured: 2–5 replicas, 70% CPU target, 5-minute scaleDown stabilization window.
2026-03-23 ~12:30	Active ACME challenges, orders, and Certificate objects deleted from affected namespaces. Auto-sync disabled on the four affected ArgoCD applications to prevent cert-manager re-engaging before DNS is configured.
2026-03-23 ~12:37	HTTPRoute count reaches zero stale routes. Traefik route table normalised to ~30 entries. CPU drops from 286% to 1%. All HTTPS routes confirmed responding.

Root Cause Analysis

Root Cause — ACME HTTP-01 Certificate Issuance Initiated Before DNS Was Configured

What happened:

Four new service namespaces (<client>-pay-production, <client>-pay-sandbox, <client>-rbac-production, <client>-rbac-sandbox) were deployed to the cluster on the evening of 2026-03-22.

Each namespace included:

A cert-manager Certificate resource targeting ACME HTTP-01 issuance via Let’s Encrypt
A Gateway listener referencing the expected TLS secret
An HTTPRoute for the application

The deployment was correct in structure — the sequence of creating a Certificate and a Gateway listener together is the standard pattern. The critical missing prerequisite was that the DNS A records for the new service domains had not yet been pointed to the cluster’s Network Load Balancer.

ACME HTTP-01 validation works by having Let’s Encrypt send an HTTP request to http://<domain>/.well-known/acme-challenge/<token>. cert-manager creates an HTTPRoute to serve this token via Traefik. For this to succeed, the domain’s DNS must already resolve to the cluster’s ingress IP. Without DNS, Let’s Encrypt’s request never reaches the cluster, and the challenge fails immediately.

cert-manager’s retry behaviour does not clean up failed challenge HTTPRoute objects. Each retry attempt creates a new cm-acme-http-solver-* HTTPRoute. Over 18 hours at exponential backoff intervals, this produced 22,514 stale HTTPRoutes across the four namespaces:

<client>-pay-production:    5,445 stale HTTPRoutes
<client>-pay-sandbox:       5,695 stale HTTPRoutes
<client>-rbac-production:   5,697 stale HTTPRoutes
<client>-rbac-sandbox:      5,697 stale HTTPRoutes
─────────────────────────────────────────────────
Total:                      22,534 stale HTTPRoutes

Traefik processes every HTTPRoute in the cluster on startup and on each configuration change event. With 22,534 routes in memory — each representing a parsed routing rule, hostname match, and backend reference — Traefik’s memory consumption exceeded its 512Mi limit during the reconciliation cycle and was terminated by the Linux OOM killer.

Evidence — Traefik OOMKill:

Last State:  Terminated
  Reason:    OOMKilled
  Exit Code: 137
  Started:   Mon, 23 Mar 2026 12:11:10 +0300
  Finished:  Mon, 23 Mar 2026 12:14:59 +0300

Limits:   cpu: 500m   memory: 512Mi
Requests: cpu: 100m   memory: 128Mi

Evidence — HTTPRoute count before cleanup:

kubectl get httproute -A --no-headers | grep "cm-acme-http-solver" | wc -l
# 22514

Evidence — cert-manager retry logs:

ERR Unable to load HTTPRoute backend: Cannot load HTTPBackendRef
    <client>-pay-production/<client>-pay-web: getting service: service "<client>-pay-web" not found
    http_route=<client>-pay-route namespace=<client>-pay-production

The backend service did not exist because the application container image had not yet been deployed — another prerequisite that was not in place at the time of the deployment.

Compounding Factor — Gateway API Listener Failure Blocks All Routes

When the new namespaces were deployed, their Gateway listeners referenced TLS secrets (<client>-pay-prod-tls-secret, <client>-rbac-prod-tls-secret, etc.) that cert-manager had not yet been able to create (because challenges were failing).

The Kubernetes Gateway API marks the entire Gateway resource Programmed: False when any listener references a secret that does not exist. This is not scoped to the failing listener — it affects all 23 listeners on the same Gateway, including the 19 that had valid, working configurations.

Conditions:
  Type:    Programmed
  Status:  False
  Reason:  InvalidCertificateRef
  Message: Error while retrieving certificate:
           getting secret: secret "<client>-pay-prod-tls-secret" not found

This created a self-reinforcing deadlock:

Missing TLS secrets → Gateway Programmed: False
                    ↓
      All 23 listeners stop routing (including working ones)
                    ↓
      ACME HTTP-01 solver pods unreachable
      (cert-manager's HTTPRoutes not being served)
                    ↓
      Certificates cannot be issued
      (challenges fail even if DNS were correct)
                    ↓
      TLS secrets never created
                    ↓
      [DEADLOCK] — Gateway cannot program; cert-manager cannot issue

Contributing Factors

CF-1 — No DNS prerequisite check in the deployment process

The deployment procedure for a new HTTPS service did not include a step to verify that DNS records were pointing to the cluster load balancer before applying cert-manager Certificate resources. ACME HTTP-01 issuance has a hard dependency on DNS being correct at the time of issuance.

CF-2 — cert-manager does not garbage-collect failed challenge HTTPRoutes

cert-manager creates a new HTTPRoute for each HTTP-01 challenge attempt but does not remove routes from previous failed attempts. This is a known upstream behaviour. In a functioning cluster the backoff intervals are long enough that accumulation is not significant — but with a permanently-failing challenge (no DNS, no backend service), the retry frequency is high enough to produce thousands of objects over hours.

CF-3 — No alerting on HTTPRoute accumulation or cert-manager retry storms

No alert existed to detect abnormal HTTPRoute counts or sustained ACME request rates without certificate issuance. The accumulation of 22,534 routes over 18 hours produced no notification.

CF-4 — No alerting on Traefik memory pressure or CrashLoopBackOff

Traefik had been in CrashLoopBackOff for the duration of the accumulation window. No alert fired. The failure was discovered only during a manual cluster health check.

CF-5 — Traefik memory limit not sized for route table growth

The Traefik memory limit (512Mi at the time of the second OOMKill) was provisioned for a stable route table of ~30 routes. It had no headroom for transient inflation caused by accumulated challenge routes or future route table growth. No HPA was configured to distribute load across multiple replicas.

CF-6 — Gateway API listener failure blast radius not understood

The failure behaviour of the Gateway API — where one invalid listener blocks all listeners on the same resource — was not documented and not widely understood. This extended the investigation time during the first resolution attempt and contributed to premature closure before the HTTPRoute accumulation was identified.

Resolution

I resolved the outage in two phases. Phase 1 addressed the Gateway deadlock. Phase 2 addressed the underlying HTTPRoute accumulation.

Phase 1 — Gateway Deadlock Resolution (2026-03-23 ~11:12)

Action 1.1 — Inject temporary TLS secrets

I created placeholder self-signed secrets under the exact names referenced by the failing Gateway listeners. cert-manager overwrites these automatically on successful certificate issuance.

openssl req -x509 -nodes -newkey rsa:2048 -days 1 \
  -keyout /tmp/tls.key -out /tmp/tls.crt \
  -subj "/CN=temp-bootstrap" 2>/dev/null

for ns_secret in \
  "<client>-pay-production/<client>-pay-prod-tls-secret" \
  "<client>-pay-sandbox/sandbox-tls-secret" \
  "<client>-rbac-production/<client>-rbac-prod-tls-secret" \
  "<client>-rbac-sandbox/sandbox-tls-secret"; do
  ns="${ns_secret%%/*}"
  secret="${ns_secret##*/}"
  kubectl create secret tls "$secret" -n "$ns" \
    --cert=/tmp/tls.crt --key=/tmp/tls.key \
    --dry-run=client -o yaml | kubectl apply -f -
done

Action 1.2 — Force Gateway reconciliation

kubectl annotate gateway traefik-gateway -n traefik \
  "kubectl.kubernetes.io/last-applied-restart=$(date -u +%s)" \
  --overwrite

All 23 listeners transitioned to Programmed: True within 8 seconds.

Phase 2 — ACME HTTPRoute Cleanup and Traefik Stabilisation (2026-03-23 ~12:20)

Action 2.1 — Delete 22,514 stale ACME HTTPRoutes

for ns in <client>-pay-production <client>-pay-sandbox <client>-rbac-production <client>-rbac-sandbox; do
  kubectl get httproute -n "$ns" --no-headers \
    | grep "cm-acme-http-solver" \
    | awk '{print $1}' \
    | xargs -P10 -n100 kubectl delete httproute -n "$ns"
done

Action 2.2 — Increase Traefik memory limit and add GOMEMLIMIT

helm upgrade traefik traefik/traefik \
  --version 38.0.2 \
  --namespace traefik \
  --reuse-values \
  --set "resources.limits.memory=1024Mi" \
  --set "resources.requests.memory=256Mi" \
  --set "env[0].name=GOMEMLIMIT" \
  --set "env[0].value=900MiB"

	Before	After
Memory limit	512Mi	1024Mi
Memory request	128Mi	256Mi
GOMEMLIMIT	115MiB	900MiB

Action 2.3 — Configure Traefik HPA

helm upgrade traefik traefik/traefik \
  --version 38.0.2 \
  --namespace traefik \
  --reuse-values \
  --set "autoscaling.enabled=true" \
  --set "autoscaling.minReplicas=2" \
  --set "autoscaling.maxReplicas=5" \
  --set "autoscaling.metrics[0].type=Resource" \
  --set "autoscaling.metrics[0].resource.name=cpu" \
  --set "autoscaling.metrics[0].resource.target.type=Utilization" \
  --set "autoscaling.metrics[0].resource.target.averageUtilization=70" \
  --set "autoscaling.behavior.scaleDown.stabilizationWindowSeconds=300"

Action 2.4 — Stop cert-manager retry loop

I deleted active challenges, orders, and Certificate objects from the four affected namespaces and disabled auto-sync on the corresponding ArgoCD applications to prevent cert-manager from re-engaging until DNS and backend services are ready.

# Disable ArgoCD auto-sync on affected applications
for app in <client>-pay-production <client>-pay-sandbox <client>-rbac-production <client>-rbac-sandbox; do
  kubectl patch application "$app" -n argocd \
    --type=merge \
    -p '{"spec":{"syncPolicy":null}}'
done

# Delete all cert-manager resources in affected namespaces
for ns in <client>-pay-production <client>-pay-sandbox <client>-rbac-production <client>-rbac-sandbox; do
  kubectl delete certificate --all -n "$ns"
  kubectl delete certificaterequest --all -n "$ns"
  kubectl delete order --all -n "$ns"
  kubectl delete challenge --all -n "$ns"
done

Verification

HTTPRoute count after cleanup:

kubectl get httproute -A --no-headers | grep "cm-acme-http-solver" | wc -l
# 0

Traefik pod status:

NAME                      READY   STATUS    RESTARTS   AGE
traefik-749c566fd-zj2mb   1/1     Running   0          stable
traefik-749c566fd-x87wk   1/1     Running   0          stable  (HPA second replica)

Traefik memory usage:

NAME                      CPU(cores)   MEMORY(bytes)
traefik-749c566fd-zj2mb   8m           115Mi
traefik-749c566fd-x87wk   6m           52Mi

HPA status:

NAME      REFERENCE            TARGETS    MINPODS   MAXPODS   REPLICAS
traefik   Deployment/traefik   cpu: 1%/70%   2      5         2

All 25 Gateway listeners:

All listeners: Programmed=True, ResolvedRefs=True

HTTPS endpoint sweep: All previously-active routes confirmed responding. The four <client>-pay/<client>-rbac endpoints remain unavailable — expected, as DNS and application images are not yet deployed. These are not incident-related.

Corrective Actions

CA-1 — DNS Verification Gate Before Certificate Deployment

Priority: Immediate Owner: Platform Engineering / DevOps

No HTTPS service should be deployed to the cluster until DNS records for its domain have been confirmed pointing to the cluster NLB. Pre-deployment DNS check:

CLUSTER_NLB="<cluster-fqdn>"
# REVIEW: redacted — confirm
DOMAIN="pay.example.com"

RESOLVED=$(dig +short "$DOMAIN" | tail -1)
NLB_IP=$(dig +short "$CLUSTER_NLB" | tail -1)

if [ "$RESOLVED" != "$NLB_IP" ]; then
  echo "ERROR: $DOMAIN does not resolve to cluster NLB ($NLB_IP). Do not deploy certificate."
  exit 1
fi
echo "OK: DNS confirmed. Safe to deploy."

For services that genuinely require certificate issuance before DNS is finalised, switch to DNS-01 ACME validation (Route53 solver). DNS-01 validates domain ownership via a TXT record rather than HTTP routing, completely eliminating the dependency on ingress being functional.

CA-2 — Traefik HPA and Resource Sizing (Completed)

Priority: Immediate — COMPLETED Owner: Platform Engineering

Traefik is now configured with:

Memory limit: 1024Mi
GOMEMLIMIT: 900MiB (Go runtime soft cap — triggers GC before hard limit)
HPA: 2–5 replicas, 70% CPU target, 5-minute scaleDown stabilization

The 1Gi limit provides headroom for transient route table growth. The HPA ensures a single pod OOMKill does not cause full ingress unavailability — the second replica continues serving traffic during a crash.

CA-3 — Alerting for ACME HTTPRoute Accumulation and Traefik Saturation (Completed)

Priority: Immediate — COMPLETED Owner: Platform Engineering

The following VMRule alerts have been deployed to the observability namespace and are currently operational:

Alert	Condition	Severity
`AcmeRetryStorm`	High ACME request rate with no certificates becoming Ready for 20+ minutes	critical
`AcmeHttpRouteCountHigh`	More than 200 HTTPRoutes in cluster for 5+ minutes	warning
`TraefikMemoryPressure`	Traefik memory > 70% of limit for 3+ minutes	warning
`TraefikHPAAtMaxReplicas`	HPA pegged at maximum replicas for 10+ minutes	warning
`TraefikHighConfigReloadRate`	Config reload rate > 0.5/sec for 5+ minutes	warning
`TraefikCrashLoopBackOff`	Traefik pod in CrashLoopBackOff for 5+ minutes	critical

Had these alerts been in place, the expected notification sequence for this incident would have been:

~18:20  AcmeRetryStorm fires             (challenges failing with no Ready certs)
~19:30  AcmeHttpRouteCountHigh fires     (>200 routes accumulated)
~19:45  TraefikMemoryPressure fires      (>70% memory approaching limit)
~19:49  TraefikCrashLoopBackOff fires    (OOMKill, pod restarting)

The outage would have been detectable within 20 minutes of the triggering deployment rather than after 15+ hours.

CA-4 — Alertmanager Notification Channel

Priority: Immediate Owner: Platform Engineering

All alerts are currently evaluated and firing correctly by vmalert. The alertmanager receiver is pending configuration. Until a notification channel is wired (Mattermost incoming webhook), firing alerts are not delivered to any external channel.

Required: Configure alertmanager with a Mattermost webhook receiver. Alert routing: severity: critical → immediate notification; severity: warning → notification with 5-minute grouping interval.

CA-5 — Deployment Checklist for New HTTPS Services

Priority: Medium-term Owner: Platform Engineering / DevOps

New HTTPS Service Onboarding Checklist
──────────────────────────────────────
□ DNS A record for service domain points to cluster NLB
  Verify: dig +short <domain> (must match NLB IP)

□ Backend Deployment and Service exist in the target namespace
  Verify: kubectl get deploy,svc -n <namespace>

□ Container image built and pushed to registry
  Verify: Image present in ECR / registry

□ Gateway listener added with correct TLS secret name

□ cert-manager Certificate resource added (HTTP-01 or DNS-01)

□ If using HTTP-01: DNS verified before applying Certificate
  If using DNS-01: Route53 IRSA permissions verified

□ ArgoCD application sync-wave ordering reviewed
  (namespace/secrets before Gateway listener)

CA-6 — Document Gateway API Blast Radius in Runbook

Priority: Medium-term Owner: Platform Engineering

Add a runbook entry covering the Gateway API failure model:

A single listener with InvalidCertificateRef marks the entire Gateway as Programmed: False
This blocks ALL listeners, including those with valid configuration
Resolution path: create placeholder TLS secret → patch Gateway annotation to force reconciliation
Detection: kubectl get gateway -n traefik -o wide → check PROGRAMMED column
Detailed listener status: kubectl get gateway traefik-gateway -n traefik -o jsonpath='{.status.listeners}'

CA-7 — Re-enable `<client>-pay` and `<client>-rbac` When Ready

Priority: When DNS and images are available Owner: Platform Engineering / Application Team

The four affected ArgoCD applications have auto-sync disabled. When the application team confirms DNS records are pointing to the cluster NLB and container images are built and pushed to the registry:

for app in <client>-pay-production <client>-pay-sandbox <client>-rbac-production <client>-rbac-sandbox; do
  kubectl patch application "$app" -n argocd \
    --type=merge \
    -p '{"spec":{"syncPolicy":{"automated":{"prune":true,"selfHeal":true}}}}'
done

cert-manager will automatically re-initiate certificate issuance on sync. With DNS correctly configured, HTTP-01 challenges will succeed on the first attempt and TLS will be issued within 2–5 minutes.

Lessons Learned

Deployment prerequisites must be verified before applying, not assumed

The triggering action was a deployment made with an incorrect assumption — that DNS was already in place. Infrastructure deployments that depend on external configuration (DNS, external credentials, upstream services) need explicit verification steps, not assumptions. A 30-second DNS check before applying would have prevented this entire incident.

cert-manager’s HTTPRoute leak is a known failure mode for long-running failed challenges

cert-manager creates ACME solver HTTPRoutes but does not remove them on failure — only on success or explicit deletion. In clusters using Gateway API with HTTP-01 ACME, a permanently-failing challenge will accumulate an unbounded number of HTTPRoutes. The mitigation is either DNS-01 (preferred for new services) or a monitoring alert on HTTPRoute count.

Gateway API failure propagation is not intuitive

Engineers familiar with the Kubernetes Ingress model — where each resource is independent — do not expect a misconfiguration on one listener to affect all other listeners on the same Gateway. This behaviour must be documented explicitly. It affects incident response speed: the first responder must know to look at all listener configurations, not just the failing service.

One pod ingress is not resilient

With a single Traefik replica, any OOMKill causes 100% ingress unavailability while the pod restarts. A minimum of two replicas ensures one pod can absorb traffic during a crash of the other. The HPA now enforces minReplicas: 2.

Observability gaps extend every incident

Every phase of this incident was extended by the absence of alerting. A 15-hour undetected CrashLoopBackOff. An 18-hour undetected HTTPRoute accumulation. Alerting does not prevent incidents — it compresses the time between onset and response.

End of Report

cert-manager ACME HTTP-01 Leak: 22,514 Stale HTTPRoutes OOMKilled Traefik

Root Cause Analysis — Ingress Layer Outage

Table of Contents

Executive Summary

Impact Assessment

Services Affected

What Was Not Affected

Timeline

Root Cause Analysis

Root Cause — ACME HTTP-01 Certificate Issuance Initiated Before DNS Was Configured

Compounding Factor — Gateway API Listener Failure Blocks All Routes

Contributing Factors

CF-1 — No DNS prerequisite check in the deployment process

CF-2 — cert-manager does not garbage-collect failed challenge HTTPRoutes

CF-3 — No alerting on HTTPRoute accumulation or cert-manager retry storms

CF-4 — No alerting on Traefik memory pressure or CrashLoopBackOff

CF-5 — Traefik memory limit not sized for route table growth

CF-6 — Gateway API listener failure blast radius not understood

Resolution

Phase 1 — Gateway Deadlock Resolution (2026-03-23 ~11:12)

Phase 2 — ACME HTTPRoute Cleanup and Traefik Stabilisation (2026-03-23 ~12:20)

Verification

Corrective Actions

CA-1 — DNS Verification Gate Before Certificate Deployment

CA-2 — Traefik HPA and Resource Sizing (Completed)

CA-3 — Alerting for ACME HTTPRoute Accumulation and Traefik Saturation (Completed)

CA-4 — Alertmanager Notification Channel

CA-5 — Deployment Checklist for New HTTPS Services

CA-6 — Document Gateway API Blast Radius in Runbook

CA-7 — Re-enable `<client>-pay` and `<client>-rbac` When Ready

Lessons Learned

Deployment prerequisites must be verified before applying, not assumed

cert-manager’s HTTPRoute leak is a known failure mode for long-running failed challenges

Gateway API failure propagation is not intuitive

One pod ingress is not resilient

Observability gaps extend every incident

Root Cause Analysis — Ingress Layer Outage

Table of Contents

Executive Summary

Impact Assessment

Services Affected

What Was Not Affected

Timeline

Root Cause Analysis

Root Cause — ACME HTTP-01 Certificate Issuance Initiated Before DNS Was Configured

Compounding Factor — Gateway API Listener Failure Blocks All Routes

Contributing Factors

CF-1 — No DNS prerequisite check in the deployment process

CF-2 — cert-manager does not garbage-collect failed challenge HTTPRoutes

CF-3 — No alerting on HTTPRoute accumulation or cert-manager retry storms

CF-4 — No alerting on Traefik memory pressure or CrashLoopBackOff

CF-5 — Traefik memory limit not sized for route table growth

CF-6 — Gateway API listener failure blast radius not understood

Resolution

Phase 1 — Gateway Deadlock Resolution (2026-03-23 ~11:12)

Phase 2 — ACME HTTPRoute Cleanup and Traefik Stabilisation (2026-03-23 ~12:20)

Verification

Corrective Actions

CA-1 — DNS Verification Gate Before Certificate Deployment

CA-2 — Traefik HPA and Resource Sizing (Completed)

CA-3 — Alerting for ACME HTTPRoute Accumulation and Traefik Saturation (Completed)

CA-4 — Alertmanager Notification Channel

CA-5 — Deployment Checklist for New HTTPS Services

CA-6 — Document Gateway API Blast Radius in Runbook

CA-7 — Re-enable <client>-pay and <client>-rbac When Ready

Lessons Learned

Deployment prerequisites must be verified before applying, not assumed

cert-manager’s HTTPRoute leak is a known failure mode for long-running failed challenges

Gateway API failure propagation is not intuitive

One pod ingress is not resilient

Observability gaps extend every incident

CA-7 — Re-enable `<client>-pay` and `<client>-rbac` When Ready