Diagnosing and Fixing an OOMKilled Traefik Ingress Controller on EKS
Traced a Traefik CrashLoopBackOff that took down the entire ingress layer overnight to an undersized memory limit, then fixed a Helm schema validation failure on the first upgrade attempt.
In a production EKS cluster running a multi-tenant microservices platform, the Traefik ingress controller entered a CrashLoopBackOff state overnight. By morning, the entire ingress layer was unreachable — every HTTP and HTTPS route across all namespaces returned no response. Users and downstream services were completely blocked.
Environment
| Component | Detail |
|---|---|
| Kubernetes | v1.34 (managed EKS) |
| Ingress controller | Traefik v3.6.6 |
| Helm chart | traefik-38.0.2 |
| Routing model | Kubernetes Gateway API (HTTPRoute, Gateway, GatewayClass) |
| Namespace count | 30+ |
Step 1 — Initial Detection
The first signal was a CrashLoopBackOff on the Traefik pod with a restart count exceeding 100 after 15 hours:
```
kubectl get pods -A | grep -i traefik
```

```
traefik   traefik-6cf4b8bd9c-bhr92   0/1   CrashLoopBackOff   102   15h
```
A restart count that high over such a long window meant the pod had been failing repeatedly throughout the night. Combined with reports that all ingress endpoints were unreachable, this became the immediate priority.
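To surface this pattern without scanning the listing by eye, the same output can be filtered by restart count. This is a minimal sketch that assumes the default `kubectl get pods -A` column order (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE); the function name `high_restarts` and its default threshold are illustrative, not part of any tool.

```shell
# Print pods whose RESTARTS column exceeds a threshold (default 10).
# Reads `kubectl get pods -A` output on stdin and assumes the default columns:
# NAMESPACE NAME READY STATUS RESTARTS AGE
high_restarts() {
  awk -v limit="${1:-10}" '
    # The header row is skipped because "RESTARTS" is not numeric.
    $5 ~ /^[0-9]+$/ && $5 + 0 > limit + 0 { print $1 "/" $2, $4, $5 }
  '
}

# Usage (against a live cluster):
#   kubectl get pods -A | high_restarts 50
```

Run periodically, a filter like this would have flagged the Traefik pod long before the restart count passed 100.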
Step 2 — Reading the Crash Logs
The --previous flag retrieves logs from the last terminated container instance, not the currently waiting one:
```
kubectl logs -n traefik traefik-6cf4b8bd9c-bhr92 --previous
```
The output showed Traefik starting up cleanly:
```
INF Traefik version 3.6.6 built on 2025-12-29T15:47:44Z version=3.6.6
INF Starting provider aggregator *aggregator.ProviderAggregator
INF Starting provider *crd.Provider
INF Starting provider *gateway.Provider
INF Starting provider *acme.ChallengeTLSALPN
```
Then — nothing. The log simply stopped mid-startup with no error message, no panic trace, no graceful shutdown message.
This pattern is consistent with a kernel OOM kill: SIGKILL cannot be caught or handled, so the process is terminated before it has a chance to write anything to stderr.
Step 3 — Confirming OOMKill via Pod Description
kubectl logs only shows what the container wrote before dying. The authoritative termination reason lives in the pod description:
```
kubectl describe pod -n traefik traefik-6cf4b8bd9c-bhr92
```
The key section:
```
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 23 Mar 2026 11:06:52 +0300
      Finished:     Mon, 23 Mar 2026 11:06:56 +0300
```
Exit code 137 is 128 + 9, which means the container was terminated by signal 9 (SIGKILL). The OOMKilled reason identifies the sender as the kernel's out-of-memory killer. The pod started, ran for 4 seconds, consumed enough memory to breach its limit, and was forcibly terminated.
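The 128 + N convention applies to any signal-terminated container, not only OOM kills. A small helper makes the arithmetic explicit; the function name `decode_exit` is hypothetical, and the sketch relies only on the shell's built-in `kill -l` to map a signal number to its name.

```shell
# Decode a container exit code using the 128 + signal-number convention.
decode_exit() {
  code=$1
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    # `kill -l N` prints the signal name without the SIG prefix, e.g. KILL.
    echo "killed by signal $sig (SIG$(kill -l "$sig"))"
  else
    echo "exited with status $code"
  fi
}

# Usage:
#   decode_exit 137
```

Note that exit code 137 alone only proves SIGKILL; it is the `Reason: OOMKilled` field that attributes the kill to the memory limit.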
The resource configuration in the same describe output told the full story:
```
    Limits:
      cpu:     200m
      memory:  128Mi
    Requests:
      cpu:     50m
      memory:  64Mi
    Environment:
      GOMEMLIMIT:  115MiB (limits.memory)
```
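GOMEMLIMIT=115MiB is roughly 90% of the 128Mi memory limit. Because it appears under Environment, it is set in the pod spec (in this deployment, by the Helm chart) rather than computed by the Go runtime on its own. With a soft limit of 115MiB and a hard limit of 128Mi, Traefik had almost no room to load its routing configuration before hitting the ceiling.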
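The 115MiB figure is consistent with taking 90% of the 128Mi limit and rounding down. The sketch below reproduces that arithmetic; the 90% ratio is the one quoted in this article, and the function name `gomemlimit_mib` is illustrative.

```shell
# Estimate GOMEMLIMIT in MiB as 90% of a container memory limit given in MiB.
# The 90% ratio is an assumption taken from this article; shell integer
# division rounds down.
gomemlimit_mib() {
  echo $(( $1 * 90 / 100 ))
}

# Usage:
#   gomemlimit_mib 128   # soft limit for the original 128Mi container limit
#   gomemlimit_mib 512   # soft limit after raising the limit to 512Mi
```

At 128Mi, the gap between the Go soft limit and the kernel's hard limit is only about 13MiB.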
Step 4 — Why 128Mi Is Insufficient at Scale
Traefik’s memory consumption at startup is not flat — it scales with the size of the routing table. When using the Gateway API provider (--providers.kubernetesgateway), Traefik performs the following reconciliation on every start or restart:
- Lists all route objects: HTTPRoute, GRPCRoute, TCPRoute across all namespaces
- Resolves backend references: validates that each backendRef points to a real Service
- Loads TLS secrets: fetches every referenced kubernetes.io/tls secret across namespaces
- Builds the routing tree: constructs an in-memory trie of hostname + path prefix → backend
In a cluster with 30+ namespaces and hundreds of routes, this initial reconciliation alone can consume 200–400MiB depending on route complexity and TLS secret count. The 128Mi limit had been appropriate when the platform was smaller; as new services and namespaces were added, the routing table outgrew it.
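The inputs to that reconciliation can be counted directly, which gives a rough measure of how far the routing table has grown. This is a sketch, not a sizing tool: the function names are hypothetical, `KUBECTL` is an override introduced here for testing, and a kind whose CRD is not installed simply counts as zero.

```shell
# Count objects of one kind across all namespaces.
# Extra arguments (for example a field selector) are passed through to kubectl.
count_objects() {
  kind=$1
  shift
  "${KUBECTL:-kubectl}" get "$kind" -A --no-headers "$@" 2>/dev/null | wc -l | tr -d ' '
}

# Print the object counts that drive Traefik's startup reconciliation.
routing_inventory() {
  echo "httproutes:  $(count_objects httproutes.gateway.networking.k8s.io)"
  echo "grpcroutes:  $(count_objects grpcroutes.gateway.networking.k8s.io)"
  echo "tcproutes:   $(count_objects tcproutes.gateway.networking.k8s.io)"
  echo "tls secrets: $(count_objects secrets --field-selector type=kubernetes.io/tls)"
}

# Usage (against a live cluster):
#   routing_inventory
```

The TLS secret count covers every secret of that type, including ones no Gateway references, so it is an upper bound on what Traefik loads.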
Step 5 — Retrieving the Current Helm Values
Before any upgrade, I retrieved the current Helm values to understand what was in play:
```
helm get values traefik -n traefik
```

```yaml
ingressRoute:
  dashboard:
    enabled: false
nodeSelector:
  node-type: workload
providers:
  kubernetesGateway:
    enabled: true
  kubernetesIngress:
    enabled: false
resources:
  limits:
    cpu: 200m
    memory: 128Mi
  requests:
    cpu: 50m
    memory: 64Mi
service:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
  type: LoadBalancer
```
The values were minimal. Only the resources block required changes.
Step 6 — The First Upgrade Attempt (and Why It Failed)
The first helm upgrade attempt used --reuse-values without pinning the chart version:
```
helm upgrade traefik traefik/traefik \
  --reuse-values \
  --set "resources.limits.memory=512Mi" \
  --set "resources.requests.memory=128Mi" \
  -n traefik
```
This failed:
```
Error: UPGRADE FAILED: values don't meet the specifications of the schema(s) in the following chart(s):
traefik:
- at '/ports': validation failed
- at '/ports/websecure': additional properties 'tls', 'middlewares' not allowed
- at '/ports/web': additional properties 'redirections' not allowed
- at '/rbac': additional properties 'secretResourceNames' not allowed
```
Without --version, Helm resolved to the latest chart version in the repository, traefik-39.0.x, a major bump from the deployed 38.0.2. Chart version 39 introduced schema changes that removed several fields present in v38. The --reuse-values flag carries forward the previous release's full computed values, including the old chart's defaults, instead of picking up the new chart's defaults. That is why fields that never appeared in the user-supplied values, such as ports.websecure.tls and rbac.secretResourceNames, still failed JSON Schema validation against the new chart.
This is a well-known Helm footgun: without an explicit version pin, upgrades can silently pull a major chart version with breaking changes.
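A cheap guard is to compare the major version of the deployed chart with the one about to be installed. The helpers below are a sketch: both function names are hypothetical, and `chart_version` assumes a plain `name-X.Y.Z` chart identifier with no pre-release suffix.

```shell
# Extract the version from a chart identifier such as "traefik-38.0.2".
# Assumes no pre-release suffix (a name like "traefik-39.0.0-rc.1" would break this).
chart_version() {
  echo "${1##*-}"
}

# Succeed (exit status 0) when two versions differ in their major component.
is_major_bump() {
  [ "${1%%.*}" != "${2%%.*}" ]
}

# Usage, assuming jq is available to read the "chart" field of `helm list -o json`:
#   deployed=$(chart_version "$(helm list -n traefik -o json | jq -r '.[0].chart')")
#   is_major_bump "$deployed" "39.0.0" && echo "major bump: read the chart changelog first"
```

Wiring a check like this into an upgrade script turns a silent major bump into an explicit decision.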
Step 7 — The Correct Upgrade
The fix was to pin --version 38.0.2 and change only the resources block, raising both memory and CPU:
```
helm upgrade traefik traefik/traefik \
  --version 38.0.2 \
  -n traefik \
  --reuse-values \
  --set "resources.limits.memory=512Mi" \
  --set "resources.requests.memory=128Mi" \
  --set "resources.limits.cpu=500m" \
  --set "resources.requests.cpu=100m" \
  --wait --timeout 120s
```

```
Release "traefik" has been upgraded. Happy Helming!
STATUS: deployed
REVISION: 4
```
The new pod came up within 30 seconds:
```
kubectl get pods -n traefik
```

```
NAME                       READY   STATUS    RESTARTS   AGE
traefik-5dcd6664cf-rsd5s   1/1     Running   0          29s
```
Step 8 — Verifying Recovery
With Traefik running, a sweep of all HTTPS endpoints confirmed the ingress layer was restored. Routes that had been returning connection errors were now responding with expected HTTP status codes from their backends.
The updated resource configuration:
```
    Limits:
      cpu:     500m
      memory:  512Mi
    Requests:
      cpu:     100m
      memory:  128Mi
```
With a 512Mi limit, Traefik completed its full startup reconciliation, built the routing table for all active namespaces, and began serving traffic without hitting the OOM killer.
Resource Sizing Reference
For Traefik using the Gateway API provider, memory consumption at startup correlates primarily with route count and TLS secret volume:
| Route count | Recommended memory limit |
|---|---|
| < 50 | 128Mi |
| 50–200 | 256Mi |
| 200–500 | 512Mi |
| 500+ | 768Mi–1Gi |
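The table can be encoded as a lookup for use in a review script. This is a sketch of the article's guidance rather than a measured model: the function name is hypothetical, and because the table's ranges share their endpoints, the boundary values 200 and 500 are assigned to the larger tier here.

```shell
# Map a route count to the memory limit suggested by the sizing table above.
# Boundary values (50, 200, 500) fall into the larger tier, which is an
# assumption made here because the table's ranges overlap at their endpoints.
recommended_memory_limit() {
  routes=$1
  if [ "$routes" -lt 50 ]; then
    echo "128Mi"
  elif [ "$routes" -lt 200 ]; then
    echo "256Mi"
  elif [ "$routes" -lt 500 ]; then
    echo "512Mi"
  else
    echo "768Mi-1Gi"
  fi
}

# Usage:
#   recommended_memory_limit 350
```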
CPU is less critical at steady state — Traefik is predominantly I/O-bound during normal operation. The 500m CPU limit provides startup headroom without over-provisioning.
Production Rules
1. kubectl describe reveals what kubectl logs cannot
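When a container is OOMKilled, it is terminated instantly and writes nothing further to its logs. The Last State.Reason: OOMKilled field in kubectl describe pod is the authoritative indicator; the same value is stored in the pod status under containerStatuses[].lastState.terminated.reason.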
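For scripting or alerting, the termination reason can be read straight from the pod status instead of parsing the describe output. A minimal sketch: the function name and the `KUBECTL` override are introduced here for illustration, while the JSONPath follows the standard pod status layout.

```shell
# Print the last termination reason for each container in a pod.
# Arguments: namespace, pod name.
last_termination_reason() {
  "${KUBECTL:-kubectl}" get pod "$2" -n "$1" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
}

# Usage (against a live cluster):
#   last_termination_reason traefik traefik-6cf4b8bd9c-bhr92
```

An empty result means no container in the pod has a recorded previous termination.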
2. Always pin --version during helm upgrade
Without a version pin, Helm resolves to the latest chart, which may have incompatible schema changes. Before any upgrade, confirm the deployed chart version with helm list -n <namespace> (CHART column), then use helm search repo <chart> --versions to see which versions the repository currently offers.
```
helm search repo traefik/traefik --versions | head -5
```
3. GOMEMLIMIT is derived from resources.limits.memory
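GOMEMLIMIT is a soft memory limit that the Go runtime reads from its environment; the runtime does not infer it from the container limit by itself. In this deployment it was set in the pod spec to about 90% of the container memory limit. A low limit directly constrains Go's garbage collector and can cause excessive GC pressure even before the kernel OOM kill occurs.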
4. Resource limits need to grow with the platform
Ingress controller memory requirements scale with routing table size. As new services and namespaces are added, limits that were appropriate at initial deployment become insufficient. Periodic review using kubectl top pods prevents silent degradation.
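A periodic review can be reduced to one number: current usage as a percentage of the limit. The helper below is a sketch that assumes both values are expressed in Mi, as `kubectl top` reports memory; the function name `mem_pct` and the 80% threshold in the usage note are illustrative.

```shell
# Print memory usage as a whole percentage of the limit.
# Both arguments must be in Mi (for example "412Mi" and "512Mi").
mem_pct() {
  used=${1%Mi}
  limit=${2%Mi}
  echo $(( used * 100 / limit ))
}

# Usage (against a live cluster; column 3 of `kubectl top pod` is memory):
#   used=$(kubectl top pod -n traefik --no-headers | awk 'NR==1 {print $3}')
#   [ "$(mem_pct "$used" 512Mi)" -ge 80 ] && echo "Traefik is above 80% of its memory limit"
```

Steady-state usage understates the startup peak described in Step 4, so a threshold with generous margin is the safer choice.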