Debug #kubernetes #karpenter #eks #autoscaler #debugging #feature-gates

Karpenter CrashLoopBackOff: 178 Restarts from an Empty Feature Gate Value

Traced a Karpenter startup panic to a single empty string in the FEATURE_GATES environment variable and resolved it with a one-line kubectl set env patch.

Gideon Warui

Environment

Component          Detail
Kubernetes         v1.34 (EKS)
Karpenter          v1.8.0
Deployment method  Direct (not Helm-managed)
Node provisioner   Karpenter with EC2NodeClass and NodePool

Step 1 — Observing the Symptom

During a cluster health check, I found the Karpenter pod in CrashLoopBackOff with an unusually high restart count:

kubectl get pods -n karpenter
NAME                        READY   STATUS             RESTARTS   AGE
karpenter-64d5c66ffb-pxcdk  0/1     CrashLoopBackOff   178        3d

178 restarts over 3 days. The pod had been broken for a long time without anyone noticing, most likely because the cluster’s existing nodes kept running and Karpenter’s absence only shows up when new nodes need to be provisioned.
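
A restart count this high is easy to surface in a routine sweep. The sketch below is one way to do it, assuming the default column layout of kubectl get pods -A --no-headers. The helper name high_restart_pods and the default threshold of 20 are illustrative, not part of any tool.

```shell
# Hypothetical helper: reads `kubectl get pods -A --no-headers` output on stdin
# and prints "namespace/pod restarts" for pods at or above a restart threshold.
high_restart_pods() {
  threshold="${1:-20}"
  # Columns: NAMESPACE NAME READY STATUS RESTARTS AGE
  # RESTARTS may be followed by "(5m ago)"; field 5 is still the count.
  awk -v t="$threshold" '$5 + 0 >= t { print $1 "/" $2, $5 }'
}

# Usage against a live cluster:
#   kubectl get pods -A --no-headers | high_restart_pods 50
```

Reading from stdin keeps the filter separate from the cluster call, so it can be tried on saved output first.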


Step 2 — Reading the Panic Trace

kubectl logs -n karpenter karpenter-64d5c66ffb-pxcdk --previous
panic: parsing feature gates, invalid value of StaticCapacity: , err: strconv.ParseBool: parsing "": invalid syntax

goroutine 1 [running]:
github.com/samber/lo.must({0x1b154c0, 0xc0005f4180}, {0x0, 0x0, 0x0})
        github.com/samber/lo@v1.51.0/errors.go:55 +0x1df
github.com/samber/lo.Must0(...)
        github.com/samber/lo@v1.51.0/errors.go:74
sigs.k8s.io/karpenter/pkg/operator/injection.WithOptionsOrDie({0x2122f50, 0x2ed8a40}, {0xc000124300, 0x2, 0x199?})
        sigs.k8s.io/karpenter@v1.8.0/pkg/operator/injection/injection.go:53 +0x13c
sigs.k8s.io/karpenter/pkg/operator.NewOperator({0x0?, 0xbf00c0020000d4?, 0xc00066d748?})
        sigs.k8s.io/karpenter@v1.8.0/pkg/operator/operator.go:124 +0x6a
main.main()
        github.com/aws/karpenter-provider-aws/cmd/controller/main.go:32 +0x29

The panic message is unambiguous:

parsing feature gates, invalid value of StaticCapacity: ,
err: strconv.ParseBool: parsing "": invalid syntax

Karpenter reads its feature gates from an environment variable called FEATURE_GATES. The value is a comma-separated list of key=value pairs where each value must be a boolean (true or false). The StaticCapacity key was present in the list but had an empty value — set to nothing rather than false or true.

Go’s strconv.ParseBool("") returns an error for an empty string, and Karpenter uses lo.Must0 (a zero-tolerance wrapper that panics on any error) when parsing this configuration. There is no graceful recovery — an empty feature gate value is an immediate fatal panic.
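
To make the failure concrete, here is a rough shell approximation of the literals that strconv.ParseBool accepts. It is not Karpenter’s code, and parse_bool is a hypothetical name. The point is only that the empty string is not on the list.

```shell
# Shell approximation of the literals Go's strconv.ParseBool accepts.
# Prints "true" or "false" on success; returns 1 for anything else,
# including the empty string.
parse_bool() {
  case "$1" in
    1|t|T|TRUE|true|True)    echo true ;;
    0|f|F|FALSE|false|False) echo false ;;
    *) echo "parse_bool: parsing \"$1\": invalid syntax" >&2; return 1 ;;
  esac
}

# Usage:
#   parse_bool "false"   # prints: false
#   parse_bool ""        # error on stderr, non-zero exit status
```

In a shell script the caller can check the exit status and carry on. Karpenter wraps the equivalent error in Must0, so the same input ends the process instead.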


Step 3 — Confirming the Misconfiguration

I inspected the environment variable directly on the deployment:

kubectl get deployment -n karpenter karpenter \
  -o jsonpath='{.spec.template.spec.containers[0].env}' \
  | python3 -c "import json,sys; [print(f'{e[\"name\"]}={e.get(\"value\",\"(valueFrom)\")}') for e in json.load(sys.stdin)]"

The relevant line in the output:

FEATURE_GATES=ReservedCapacity=true,SpotToSpotConsolidation=false,NodeRepair=false,NodeOverlay=false,StaticCapacity=

The trailing StaticCapacity= with no value after the equals sign was the problem. Every other feature gate had an explicit boolean value. StaticCapacity had been added to the list — perhaps when the feature was introduced in a newer Karpenter version — but the value was never filled in.
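
With a long gate list, the empty entry is easy to miss by eye. A minimal sketch that picks it out mechanically, assuming the same deployment and container layout as the command above; empty_gates is a hypothetical helper name.

```shell
# Hypothetical helper: reads a FEATURE_GATES string on stdin and prints
# the name of every gate whose value is empty or missing.
empty_gates() {
  tr ',' '\n' | awk -F= 'NF > 0 && $2 == "" { print $1 }'
}

# Usage against the live deployment (the JSONPath filter selects the variable by name):
#   kubectl get deployment -n karpenter karpenter \
#     -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="FEATURE_GATES")].value}' \
#     | empty_gates
```

For the value shown above, this prints only StaticCapacity.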


Step 4 — Understanding the Feature Gates Mechanism

Karpenter’s feature gates are controlled via the FEATURE_GATES environment variable using a format identical to Kubernetes’ own feature gate syntax:

FEATURE_GATES=FeatureA=true,FeatureB=false,FeatureC=true

All values must be boolean strings. The parser validates strictly at startup and has no fallback default for a malformed entry. That strictness is defensible: silently ignoring a misconfigured feature gate could lead to unexpected autoscaling behaviour in production.

The StaticCapacity feature gate (introduced in Karpenter v1.7+) controls Karpenter’s static capacity support, which, as far as I can tell from the upstream documentation, lets a NodePool hold a fixed number of nodes rather than scaling purely in response to pending pods. It defaults to false when not specified, but once it is named in the FEATURE_GATES string it must carry an explicit value.

The likely origin: when Karpenter’s deployment configuration was updated to add StaticCapacity to the feature gate list, the value was accidentally left empty — possibly by a template that generated the list without providing the value.


Step 5 — The Fix

The fix was a single environment variable patch on the deployment:

kubectl set env deployment/karpenter -n karpenter \
  "FEATURE_GATES=ReservedCapacity=true,SpotToSpotConsolidation=false,NodeRepair=false,NodeOverlay=false,StaticCapacity=false"

kubectl set env rewrites the environment variable in the Deployment’s pod template, which triggers a rolling update that replaces the crashing pod. I monitored the rollout:

kubectl rollout status deployment/karpenter -n karpenter --timeout=60s
Waiting for deployment "karpenter" rollout to finish: 0 of 1 updated replicas are available...
deployment "karpenter" successfully rolled out

The new pod came up immediately:

kubectl get pods -n karpenter
NAME                        READY   STATUS    RESTARTS   AGE
karpenter-b4d4cf89c-sz45x   1/1     Running   0          25s

Step 6 — Verifying Recovery

I checked the logs to confirm clean startup:

kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=10
{"level":"INFO","time":"2026-03-23T08:45:13Z","logger":"controller","message":"unknown field \"status.nodes\"","controller":"nodepool.counter","NodePool":{"name":"default"}}
{"level":"INFO","time":"2026-03-23T08:45:18Z","logger":"controller","message":"unknown field \"status.nodes\"","controller":"nodepool.counter","NodePool":{"name":"production"}}

No more panics. Karpenter was reconciling NodePool objects normally.

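The INFO-level unknown field "status.nodes" messages are a separate, minor concern: they point to version drift between the installed Karpenter CRDs and the controller. Warnings of this form normally come from the API server when a client writes a field that the installed schema does not define, so the more likely reading is that the nodepool.counter controller is setting status.nodes while the installed NodePool CRD predates that field. This is non-fatal and typically occurs when the controller is upgraded without upgrading the CRDs alongside it. It did not appear to affect node provisioning, though bringing the CRDs in line with the controller version is the clean fix.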
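
One way to see which side of the drift the cluster is on is to look at the installed CRD schema directly. This is a sketch under assumptions: it takes the NodePool CRD to be named nodepools.karpenter.sh, and crd_has_status_nodes is a hypothetical helper name.

```shell
# Hypothetical helper: reports whether the installed NodePool CRD schema
# defines status.nodes. Assumes the CRD is named nodepools.karpenter.sh.
crd_has_status_nodes() {
  # kubectl prints nothing for a JSONPath key that does not exist.
  schema="$(kubectl get crd nodepools.karpenter.sh \
    -o jsonpath='{.spec.versions[*].schema.openAPIV3Schema.properties.status.properties.nodes}')"
  if [ -n "$schema" ]; then
    echo "status.nodes is defined in the installed CRD"
  else
    echo "status.nodes is missing from the installed CRD"
  fi
}

# Compare with the controller image actually running:
#   kubectl get deployment karpenter -n karpenter \
#     -o jsonpath='{.spec.template.spec.containers[0].image}'
```

If the field is missing from the schema while the controller logs keep mentioning it, the CRDs are the older half of the pair.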


Impact of Karpenter Being Down

While Karpenter was in CrashLoopBackOff, the cluster’s existing nodes continued to operate normally — pods already scheduled were unaffected. The impact was limited to:

  1. No new node provisioning: If a pod was unschedulable due to resource constraints, it would remain in Pending indefinitely rather than triggering a new node
  2. No node consolidation: Underutilised nodes were not being reclaimed, leading to potential cost inefficiency
  3. No spot interruption handling: The interruption queue was not being processed, meaning spot instance termination events were not being acted upon

In a cluster that was already at capacity, this would have caused scheduling failures. In this case, the existing nodes had sufficient headroom, masking the impact.
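
A quick way to check whether the outage actually left workloads stranded is to count pods stuck in Pending. The sketch below assumes cluster-wide read access; pending_pod_count is a hypothetical helper name.

```shell
# Hypothetical helper: counts pods currently in the Pending phase cluster-wide.
pending_pod_count() {
  # With no matches kubectl writes "No resources found" to stderr only,
  # so stdout is empty and the count is 0.
  kubectl get pods -A --field-selector=status.phase=Pending --no-headers 2>/dev/null \
    | awk 'NF > 0 { n++ } END { print n + 0 }'
}

# Usage:
#   pending_pod_count
```

Pending also covers pods that are still pulling images, so a non-zero count is a prompt to look at the pod events, not proof of a capacity shortfall.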


How This Misconfiguration Likely Occurred

This type of misconfiguration most plausibly arises in one of three ways:

Scenario 1: Template generation without a value. A Helm chart, Kustomize overlay, or Terraform variable was updated to include StaticCapacity in the feature gate list, but the corresponding value variable was never set, so the rendered string ended in StaticCapacity= with nothing after the equals sign.

Scenario 2: Manual edit that left a dangling key. A direct edit of the deployment or configmap left an entry with no value, such as StaticCapacity=,NextFeature=true in the middle of the list or simply StaticCapacity= at the end.

Scenario 3: Version upgrade without reviewing defaults. When upgrading Karpenter to a version that introduced StaticCapacity, a configuration template added the key but did not specify the default value.
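
The template path in Scenario 1 is easy to reproduce in plain shell. This is an illustration only: STATIC_CAPACITY and both function names are hypothetical, and I do not know which templating tool produced the value on this cluster.

```shell
# Hypothetical template step: builds FEATURE_GATES from a shell variable.
# STATIC_CAPACITY is an assumed variable name, not one Karpenter defines.
render_gates_unsafe() {
  # An unset or empty STATIC_CAPACITY silently expands to nothing.
  echo "ReservedCapacity=true,StaticCapacity=${STATIC_CAPACITY}"
}

render_gates_guarded() {
  # ${VAR:?message} writes the message to stderr and makes a non-interactive
  # shell exit with an error when VAR is unset or empty.
  echo "ReservedCapacity=true,StaticCapacity=${STATIC_CAPACITY:?must be true or false}"
}

# Usage:
#   unset STATIC_CAPACITY
#   render_gates_unsafe            # prints: ReservedCapacity=true,StaticCapacity=
#   gates="$(render_gates_guarded)" || echo "refusing to render"
```

Called inside a command substitution, the guarded version fails the substitution instead of emitting StaticCapacity=. Note that set -u alone would catch an unset variable but not one set to the empty string.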


Prevention

Validate feature gate format before applying

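validate_feature_gates() {
  local gates="$1" pair key value
  local -a pairs
  IFS=',' read -ra pairs <<< "$gates"
  for pair in "${pairs[@]}"; do
    # Every entry must be key=value; a bare key has no '=' at all.
    if [[ "$pair" != *=* ]]; then
      echo "ERROR: Feature gate '$pair' is missing '=value'" >&2
      return 1
    fi
    key="${pair%%=*}"
    value="${pair#*=}"   # everything after the first '='
    if [[ -z "$value" ]]; then
      echo "ERROR: Feature gate '$key' has empty value" >&2
      return 1
    fi
    if [[ "$value" != "true" && "$value" != "false" ]]; then
      echo "ERROR: Feature gate '$key' has invalid value '$value' (must be true or false)" >&2
      return 1
    fi
  done
  echo "OK: All feature gates valid"
}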

validate_feature_gates "ReservedCapacity=true,SpotToSpotConsolidation=false,StaticCapacity=false"
# OK: All feature gates valid

Use explicit defaults in configuration templates

When managing feature gates via Helm or Kustomize, always specify a default value even for features that are disabled:

# values.yaml
featureGates:
  reservedCapacity: "true"
  spotToSpotConsolidation: "false"
  nodeRepair: "false"
  nodeOverlay: "false"
  staticCapacity: "false"   # always explicit, never empty

Monitor CrashLoopBackOff on infrastructure pods

Karpenter, cert-manager, external-secrets, and similar infrastructure operators can fail silently from the perspective of application teams: workloads keep running while the operator underneath them is broken. A dedicated alert on CrashLoopBackOff for pods in infrastructure namespaces (kube-system, karpenter, cert-manager, external-secrets) catches these failures before they accumulate 178 restarts unnoticed.
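
Where no alerting stack is in place yet, even a scheduled script is better than nothing. The following is a minimal cron-style sketch, not a substitute for a real alert rule. It assumes the default column layout of kubectl get pods -A --no-headers, and crashloop_check is a hypothetical name.

```shell
# Hypothetical check: reads `kubectl get pods -A --no-headers` output on stdin,
# prints crash-looping pods in the given namespaces, and returns 1 if any exist.
crashloop_check() {
  namespaces="$1"   # space-separated list, e.g. "kube-system karpenter"
  awk -v list="$namespaces" '
    BEGIN { n = split(list, ns, " "); for (i = 1; i <= n; i++) watch[ns[i]] = 1 }
    ($1 in watch) && $4 == "CrashLoopBackOff" { print $1 "/" $2 " restarts=" $5; found = 1 }
    END { exit found + 0 }
  '
}

# Usage:
#   kubectl get pods -A --no-headers \
#     | crashloop_check "kube-system karpenter cert-manager external-secrets" \
#     || echo "infrastructure pods are crash-looping"
```

The non-zero exit status lets a scheduler or CI job treat a crash-looping infrastructure pod as a failed run.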


Production Rules

  1. Go’s strconv.ParseBool("") returns an error. Any configuration system that produces empty-value key=value pairs and feeds them to a strict boolean parser will crash. Karpenter uses Must0 which panics on error — there is no recovery.

  2. High restart counts on infrastructure pods are silent failures. 178 restarts over 3 days went unnoticed because existing workloads continued running. Infrastructure component health requires dedicated monitoring, not just workload monitoring.

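  3. kubectl set env is a fast, targeted patch. For environment variable changes on running deployments, it is faster than editing the deployment YAML and triggers a proper rolling update. If the deployment is rendered from a template or managed by GitOps, fix the source as well, or the next apply will reintroduce the empty value.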

  4. Feature gate format is strict. Both Kubernetes and Karpenter treat empty values as invalid. When adding a new feature gate to a configuration, always include the value even if it is the default.


Commands Reference

# Check pod restart count
kubectl get pods -n karpenter

# Read logs from crashed container
kubectl logs -n karpenter <pod> --previous

# List all environment variables on a deployment
kubectl get deployment <name> -n <namespace> \
  -o jsonpath='{.spec.template.spec.containers[0].env}'

# Patch an environment variable
kubectl set env deployment/<name> -n <namespace> \
  "FEATURE_GATES=key1=true,key2=false,key3=false"

# Monitor rollout
kubectl rollout status deployment/<name> -n <namespace> --timeout=60s

# Read live logs after fix
kubectl logs -n <namespace> -l app.kubernetes.io/name=<app> --tail=20