
Diagnosing and Fixing an OOMKilled Traefik Ingress Controller on EKS

Traced a Traefik CrashLoopBackOff that took down the entire ingress layer overnight to an undersized memory limit, then fixed a Helm schema validation failure on the first upgrade attempt.

Gideon Warui

In a production EKS cluster running a multi-tenant microservices platform, the Traefik ingress controller entered a CrashLoopBackOff state overnight. By morning, the entire ingress layer was unreachable — every HTTP and HTTPS route across all namespaces returned no response. Users and downstream services were completely blocked.


Environment

Component            Detail
Kubernetes           v1.34 (managed EKS)
Ingress controller   Traefik v3.6.6
Helm chart           traefik-38.0.2
Routing model        Kubernetes Gateway API (HTTPRoute, Gateway, GatewayClass)
Namespace count      30+

Step 1 — Initial Detection

The first signal was a CrashLoopBackOff on the Traefik pod with a restart count exceeding 100 after 15 hours:

kubectl get pods -A | grep -i traefik
traefik   traefik-6cf4b8bd9c-bhr92   0/1   CrashLoopBackOff   102   15h

A restart count that high over such a long window meant the pod had been failing repeatedly throughout the night. Combined with reports that all ingress endpoints were unreachable, this became the immediate priority.
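On a cluster with many namespaces, the same check can be narrowed to only the pods that are crash-looping. The sketch below is one way to do it; the function name is hypothetical, and it assumes the default column order of kubectl get pods -A (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Print "namespace/pod restarts" for every pod in CrashLoopBackOff.
# Reads `kubectl get pods -A --no-headers` output on stdin, assuming the
# default columns: NAMESPACE NAME READY STATUS RESTARTS AGE
crashlooping_pods() {
  awk '$4 == "CrashLoopBackOff" { print $1 "/" $2, $5 }'
}

# Usage against a live cluster:
#   kubectl get pods -A --no-headers | crashlooping_pods
```

Filtering on the STATUS column rather than grepping for a name catches any controller that is crash-looping, not just Traefik.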


Step 2 — Reading the Crash Logs

The --previous flag retrieves logs from the last terminated container instance, not the currently waiting one:

kubectl logs -n traefik traefik-6cf4b8bd9c-bhr92 --previous

The output showed Traefik starting up cleanly:

INF Traefik version 3.6.6 built on 2025-12-29T15:47:44Z version=3.6.6
INF Starting provider aggregator *aggregator.ProviderAggregator
INF Starting provider *crd.Provider
INF Starting provider *gateway.Provider
INF Starting provider *acme.ChallengeTLSALPN

Then — nothing. The log simply stopped mid-startup with no error message, no panic trace, no graceful shutdown message.

This is the characteristic signature of a forced kill such as a kernel OOM kill: the process is terminated with SIGKILL, which it cannot catch, so it never gets a chance to write anything to stderr.


Step 3 — Confirming OOMKill via Pod Description

kubectl logs only shows what the container wrote before dying. The authoritative termination reason lives in the pod description:

kubectl describe pod -n traefik traefik-6cf4b8bd9c-bhr92

The key section:

State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Mon, 23 Mar 2026 11:06:52 +0300
  Finished:     Mon, 23 Mar 2026 11:06:56 +0300

Exit code 137 is 128 + 9: the container was terminated by signal 9 (SIGKILL). Combined with Reason: OOMKilled, this confirms the kill came from the kernel’s Out-Of-Memory killer. The pod started, ran for 4 seconds, consumed enough memory to breach its limit, and was forcibly terminated.
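The 128 + N convention applies to any container that dies from a signal, so the arithmetic is worth having at hand. A minimal sketch, with a hypothetical helper name:

```shell
# Convert a container exit code above 128 into the number of the
# signal that terminated the process (exit code = 128 + signal).
signal_from_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo $((code - 128))
  else
    echo "exit code $code does not indicate a signal" >&2
    return 1
  fi
}

signal_from_exit_code 137   # prints 9, the signal number of SIGKILL
```

Note that SIGKILL alone does not prove an OOM kill; the Reason field in the pod status is what ties the signal to memory.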

The resource configuration in the same describe output told the full story:

Limits:
  cpu:     200m
  memory:  128Mi
Requests:
  cpu:      50m
  memory:   64Mi

Environment:
  GOMEMLIMIT:  115MiB  (limits.memory)

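GOMEMLIMIT=115MiB is derived from the container memory limit: it is roughly 90% of 128Mi and reaches the process as an environment variable in the pod spec, which the Go runtime then reads as a soft memory limit. The Go runtime does not compute this value on its own. At 115MiB, Traefik had almost no room to load its routing configuration before hitting the ceiling.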
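The 115MiB figure is consistent with roughly 90% of the 128Mi limit. A quick arithmetic check, assuming that 90% ratio holds and that the result is rounded down (the helper name is hypothetical):

```shell
# Approximate the GOMEMLIMIT value for a memory limit given in Mi,
# assuming a 90% ratio; shell integer division rounds down.
gomemlimit_for() {
  limit_mi=$1
  echo "$((limit_mi * 90 / 100))MiB"
}

gomemlimit_for 128   # prints 115MiB, matching the describe output
gomemlimit_for 512   # prints 460MiB
```

Under the same assumption, raising the limit to 512Mi would lift the Go soft limit to about 460MiB.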


Step 4 — Why 128Mi Is Insufficient at Scale

Traefik’s memory consumption at startup is not flat — it scales with the size of the routing table. When using the Gateway API provider (--providers.kubernetesgateway), Traefik performs the following reconciliation on every start or restart:

  1. Lists all route objects (HTTPRoute, GRPCRoute, TCPRoute) across all namespaces
  2. Resolves backend references — validates that each backendRef points to a real Service
  3. Loads TLS secrets — fetches every referenced kubernetes.io/tls secret across namespaces
  4. Builds the routing tree — constructs an in-memory trie of hostname + path prefix → backend

In a cluster with 30+ namespaces and hundreds of routes, this initial reconciliation alone can consume 200–400MiB depending on route complexity and TLS secret count. The 128Mi limit had been appropriate when the platform was smaller; as new services and namespaces were added, the routing table outgrew it.
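To get a feel for how large that reconciliation is on a given cluster, the objects involved can simply be counted. This is a rough sketch rather than a measurement of Traefik's memory use; the function name is hypothetical, and the GRPCRoute and TCPRoute kinds can only be counted if their CRDs are installed.

```shell
# Count objects of one kind across all namespaces.
# Extra arguments (such as a field selector) are passed through to kubectl.
count_objects() {
  kubectl get "$@" --all-namespaces --no-headers 2>/dev/null | wc -l | tr -d ' '
}

# Usage against a live cluster:
#   count_objects httproutes.gateway.networking.k8s.io
#   count_objects secrets --field-selector type=kubernetes.io/tls
```

Tracking these two numbers over time gives an early signal that the routing table is outgrowing the limit it was sized for.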


Step 5 — Retrieving the Current Helm Values

Before any upgrade, I retrieved the current Helm values to understand what was in play:

helm get values traefik -n traefik
ingressRoute:
  dashboard:
    enabled: false
nodeSelector:
  node-type: workload
providers:
  kubernetesGateway:
    enabled: true
  kubernetesIngress:
    enabled: false
resources:
  limits:
    cpu: 200m
    memory: 128Mi
  requests:
    cpu: 50m
    memory: 64Mi
service:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
  type: LoadBalancer

The values were minimal. Only the resources block required changes.


Step 6 — The First Upgrade Attempt (and Why It Failed)

The first helm upgrade attempt used --reuse-values without pinning the chart version:

helm upgrade traefik traefik/traefik \
  --reuse-values \
  --set "resources.limits.memory=512Mi" \
  --set "resources.requests.memory=128Mi" \
  -n traefik

This failed:

Error: UPGRADE FAILED: values don't meet the specifications of the schema(s) in the following chart(s):
traefik:
- at '/ports': validation failed
  - at '/ports/websecure': additional properties 'tls', 'middlewares' not allowed
  - at '/ports/web': additional properties 'redirections' not allowed
- at '/rbac': additional properties 'secretResourceNames' not allowed

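Without --version, Helm resolved to the latest chart version available in the repository, traefik-39.0.x, a major bump from the deployed 38.0.2. Chart version 39 changed the values schema and no longer accepts several fields that v38 allowed. The --reuse-values flag replayed the values stored with the current release against the new schema, and those stale field names failed JSON Schema validation. The rejected fields (ports.websecure.tls, rbac.secretResourceNames, and so on) do not appear in the user-supplied values from Step 5, which suggests they came from the v38 chart defaults that --reuse-values carries forward.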

This is a well-known Helm footgun: without an explicit version pin, upgrades can silently pull a major chart version with breaking changes.
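One way to avoid the footgun is to read the deployed chart version from the release and pass it back as the pin. The sketch below only handles the string manipulation; the helper name is hypothetical, and it assumes the CHART column of helm list has the form name-version, as in traefik-38.0.2.

```shell
# Strip the chart name prefix from a "name-version" string as shown in
# the CHART column of `helm list`, e.g. traefik-38.0.2 -> 38.0.2.
chart_version() {
  chart_name=$1
  chart_field=$2
  echo "${chart_field#"$chart_name"-}"
}

chart_version traefik traefik-38.0.2   # prints 38.0.2

# Usage sketch: check the deployed chart, then render the pinned upgrade
# without applying it.
#   helm list -n traefik
#   helm upgrade traefik traefik/traefik --version 38.0.2 \
#     --reuse-values --set "resources.limits.memory=512Mi" \
#     -n traefik --dry-run
```

A --dry-run pass renders the chart without changing the release, so a values problem of this kind should surface before anything is applied.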


Step 7 — The Correct Upgrade

The fix was to pin --version 38.0.2 and change only the resources block (limits and requests for both memory and CPU):

helm upgrade traefik traefik/traefik \
  --version 38.0.2 \
  -n traefik \
  --reuse-values \
  --set "resources.limits.memory=512Mi" \
  --set "resources.requests.memory=128Mi" \
  --set "resources.limits.cpu=500m" \
  --set "resources.requests.cpu=100m" \
  --wait --timeout 120s
Release "traefik" has been upgraded. Happy Helming!
STATUS: deployed
REVISION: 4

The new pod came up within 30 seconds:

kubectl get pods -n traefik
NAME                       READY   STATUS    RESTARTS   AGE
traefik-5dcd6664cf-rsd5s   1/1     Running   0          29s

Step 8 — Verifying Recovery

With Traefik running, a sweep of all HTTPS endpoints confirmed the ingress layer was restored. Routes that had been returning connection errors were now responding with expected HTTP status codes from their backends.
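The exact sweep used is not shown here; a minimal version could look like the following. The function name and the hostnames in the usage line are hypothetical, and the 5-second timeout is an arbitrary choice.

```shell
# Print "host status" for each hostname read from stdin.
# curl reports 000 when no HTTP response was received at all.
sweep_endpoints() {
  while IFS= read -r host; do
    [ -n "$host" ] || continue
    code=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "https://$host/") || true
    echo "$host $code"
  done
}

# Usage:
#   printf '%s\n' app.example.com api.example.com | sweep_endpoints
```

Any non-000 code, even a 404 or 401 from a backend, shows that the ingress layer itself is answering again.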

The updated resource configuration:

Limits:   cpu: 500m   memory: 512Mi
Requests: cpu: 100m   memory: 128Mi

With a 512Mi limit, Traefik completed its full startup reconciliation, built the routing table for all active namespaces, and began serving traffic without hitting the OOM killer.


Resource Sizing Reference

For Traefik using the Gateway API provider, memory consumption at startup correlates primarily with route count and TLS secret volume:

Route count   Recommended memory limit
< 50          128Mi
50–200        256Mi
200–500       512Mi
500+          768Mi–1Gi
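The table can be turned into a small lookup for use in review scripts. This is a sketch with a hypothetical function name; because the table's ranges share their boundary values, it assumes that exactly 200 routes falls in the 256Mi tier and exactly 500 in the 512Mi tier.

```shell
# Map a route count to the recommended memory limit from the table.
# Assumption: boundary values 200 and 500 belong to the lower tier.
recommended_memory_limit() {
  routes=$1
  if   [ "$routes" -lt 50 ];  then echo "128Mi"
  elif [ "$routes" -le 200 ]; then echo "256Mi"
  elif [ "$routes" -le 500 ]; then echo "512Mi"
  else                             echo "768Mi-1Gi"
  fi
}

recommended_memory_limit 350   # prints 512Mi
```

These tiers are starting points; TLS secret volume and route complexity can push actual usage higher.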

CPU is less critical at steady state — Traefik is predominantly I/O-bound during normal operation. The 500m CPU limit provides startup headroom without over-provisioning.


Production Rules

1. kubectl describe reveals what kubectl logs cannot

When a container is OOMKilled, it is terminated instantly and writes no final log line. The Last State Reason: OOMKilled field in kubectl describe pod (lastState.terminated.reason in the container status) is the reliable indicator.
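The same field can be read directly from the pod status, which is handy in scripts. A minimal sketch, assuming the container of interest is the first one in the pod; the function name is hypothetical.

```shell
# Print the last termination reason and exit code of a pod's first container.
last_termination() {
  namespace=$1
  pod=$2
  kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
}

# Usage:
#   last_termination traefik traefik-6cf4b8bd9c-bhr92
```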

2. Always pin --version during helm upgrade

Without a version pin, Helm resolves to the latest chart, which may have incompatible schema changes. Before any upgrade, use helm list -n <namespace> to confirm the currently deployed chart version, and helm search repo <chart> --versions to see which versions the repository offers.

helm search repo traefik/traefik --versions | head -5

3. GOMEMLIMIT is derived from resources.limits.memory

The Go runtime does not set GOMEMLIMIT by itself; it only honors the environment variable. In this deployment the variable appeared in the pod spec at about 90% of the container memory limit (115MiB for 128Mi). A low limit directly constrains Go’s garbage collector and can cause excessive GC pressure even before the kernel OOM kill occurs.

4. Resource limits need to grow with the platform

Ingress controller memory requirements scale with routing table size. As new services and namespaces are added, limits that were appropriate at initial deployment become insufficient. Periodic review using kubectl top pods prevents silent degradation.
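For the periodic review, the raw kubectl top output is easier to act on when expressed as a share of the limit. The sketch below assumes metrics-server is installed, that kubectl top reports memory in Mi, and that the limit is passed in by hand; the function name is hypothetical.

```shell
# Read `kubectl top pods --no-headers` lines (NAME CPU MEMORY) on stdin and
# print each pod's memory usage as a whole-number percentage of the given
# limit in Mi. Lines whose memory column is not in Mi are skipped.
memory_usage_pct() {
  limit_mi=$1
  awk -v limit="$limit_mi" '$3 ~ /Mi$/ {
    used = $3
    sub(/Mi$/, "", used)
    printf "%s %d%%\n", $1, used * 100 / limit
  }'
}

# Usage:
#   kubectl top pods -n traefik --no-headers | memory_usage_pct 512
```

Steady-state usage that creeps toward the limit is a cue to revisit it before the next restart forces a full reconciliation.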