Karpenter CrashLoopBackOff: 178 Restarts from an Empty Feature Gate Value
Traced a Karpenter startup panic to a single empty string in the FEATURE_GATES environment variable and resolved it with a one-line kubectl set env patch.
Environment
| Component | Detail |
|---|---|
| Kubernetes | v1.34 (EKS) |
| Karpenter | v1.8.0 |
| Deployment method | Direct (not Helm-managed) |
| Node provisioner | Karpenter with EC2NodeClass and NodePool |
Step 1 — Observing the Symptom
During a cluster health check, I found the Karpenter pod in CrashLoopBackOff with an unusually high restart count:
kubectl get pods -n karpenter
NAME READY STATUS RESTARTS AGE
karpenter-64d5c66ffb-pxcdk 0/1 CrashLoopBackOff 178 3d
178 restarts over 3 days. The pod had been broken for a long time without anyone noticing — likely because the cluster’s existing nodes continued operating, and Karpenter is only invoked when new nodes need to be provisioned.
Step 2 — Reading the Panic Trace
kubectl logs -n karpenter karpenter-64d5c66ffb-pxcdk --previous
panic: parsing feature gates, invalid value of StaticCapacity: , err: strconv.ParseBool: parsing "": invalid syntax
goroutine 1 [running]:
github.com/samber/lo.must({0x1b154c0, 0xc0005f4180}, {0x0, 0x0, 0x0})
github.com/samber/lo@v1.51.0/errors.go:55 +0x1df
github.com/samber/lo.Must0(...)
github.com/samber/lo@v1.51.0/errors.go:74
sigs.k8s.io/karpenter/pkg/operator/injection.WithOptionsOrDie({0x2122f50, 0x2ed8a40}, {0xc000124300, 0x2, 0x199?})
sigs.k8s.io/karpenter@v1.8.0/pkg/operator/injection/injection.go:53 +0x13c
sigs.k8s.io/karpenter/pkg/operator.NewOperator({0x0?, 0xbf00c0020000d4?, 0xc00066d748?})
sigs.k8s.io/karpenter@v1.8.0/pkg/operator/operator.go:124 +0x6a
main.main()
github.com/aws/karpenter-provider-aws/cmd/controller/main.go:32 +0x29
The panic message is unambiguous:
parsing feature gates, invalid value of StaticCapacity: ,
err: strconv.ParseBool: parsing "": invalid syntax
Karpenter reads its feature gates from an environment variable called FEATURE_GATES. The value is a comma-separated list of key=value pairs where each value must be a boolean (true or false). The StaticCapacity key was present in the list but had an empty value — set to nothing rather than false or true.
Go’s strconv.ParseBool("") returns an error for an empty string, and Karpenter uses lo.Must0 (a zero-tolerance wrapper that panics on any error) when parsing this configuration. There is no graceful recovery — an empty feature gate value is an immediate fatal panic.
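The failure mode can be reproduced outside Karpenter. The following is a minimal Python sketch, not Karpenter's actual code: the function names parse_bool and parse_feature_gates are illustrative, and the accepted literals are copied from Go's strconv.ParseBool, which accepts 1, t, T, TRUE, true, True and their false counterparts. An empty string is not among them.

```python
# Python re-creation of the failure mode; not Karpenter's actual code.
# The accepted literals mirror Go's strconv.ParseBool.
_TRUE = {"1", "t", "T", "TRUE", "true", "True"}
_FALSE = {"0", "f", "F", "FALSE", "false", "False"}


def parse_bool(text: str) -> bool:
    """Strict boolean parse: anything outside the accepted literals is an error."""
    if text in _TRUE:
        return True
    if text in _FALSE:
        return False
    raise ValueError(f"parsing {text!r}: invalid syntax")


def parse_feature_gates(raw: str) -> dict:
    """Parse 'A=true,B=false' into a dict, failing on the first bad value."""
    gates = {}
    for pair in raw.split(","):
        # partition() leaves value empty for both "Key=" and "Key"
        key, _, value = pair.partition("=")
        try:
            gates[key] = parse_bool(value)
        except ValueError as err:
            raise ValueError(f"invalid value of {key}: {value}, err: {err}") from err
    return gates
```

Calling parse_feature_gates("ReservedCapacity=true,StaticCapacity=") raises on the second pair. In Karpenter the equivalent error is wrapped in a panic rather than handled, which is why the pod never gets past startup.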
Step 3 — Confirming the Misconfiguration
I inspected the environment variable directly on the deployment:
kubectl get deployment -n karpenter karpenter \
-o jsonpath='{.spec.template.spec.containers[0].env}' \
| python3 -c "import json,sys; [print(f'{e[\"name\"]}={e.get(\"value\",\"(valueFrom)\")}') for e in json.load(sys.stdin)]"
The relevant line in the output:
FEATURE_GATES=ReservedCapacity=true,SpotToSpotConsolidation=false,NodeRepair=false,NodeOverlay=false,StaticCapacity=
The trailing StaticCapacity= with no value after the equals sign was the problem. Every other feature gate had an explicit boolean value. StaticCapacity had been added to the list — perhaps when the feature was introduced in a newer Karpenter version — but the value was never filled in.
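For repeated checks, the inline one-liner can be swapped for a small helper that reports only the gates with empty values. This is a sketch under assumptions: the function name empty_feature_gates is made up for illustration, and it expects the JSON array that the jsonpath query above prints for the first container's env list.

```python
import json


def empty_feature_gates(env_json: str, var_name: str = "FEATURE_GATES") -> list:
    """Return the gate names whose value is empty in a container's env list.

    env_json is the JSON array of {"name": ..., "value": ...} entries
    taken from .spec.template.spec.containers[0].env.
    """
    for entry in json.loads(env_json):
        if entry.get("name") == var_name:
            # Drop empty fragments so stray commas are not reported as gates
            pairs = [p for p in entry.get("value", "").split(",") if p]
            # partition() yields an empty third element for "Key=" and "Key"
            return [p.partition("=")[0] for p in pairs if not p.partition("=")[2]]
    return []


sample = '[{"name":"FEATURE_GATES","value":"ReservedCapacity=true,StaticCapacity="}]'
print(empty_feature_gates(sample))  # ['StaticCapacity']
```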
Step 4 — Understanding the Feature Gates Mechanism
Karpenter’s feature gates are controlled via the FEATURE_GATES environment variable using a format identical to Kubernetes’ own feature gate syntax:
FEATURE_GATES=FeatureA=true,FeatureB=false,FeatureC=true
All values must be boolean strings. The parser performs strict validation on startup with no fallback defaults for malformed entries. This design is intentional — silently ignoring a misconfigured feature gate could lead to unexpected autoscaling behaviour in production.
The StaticCapacity feature gate (available as an alpha feature in Karpenter v1.8) enables static NodePools, which keep a fixed number of nodes, set through the NodePool's replicas field, instead of scaling with pending pods. It defaults to false when not specified, but when it appears in the FEATURE_GATES string it must have an explicit value.
The likely origin: when Karpenter’s deployment configuration was updated to add StaticCapacity to the feature gate list, the value was accidentally left empty — possibly by a template that generated the list without providing the value.
Step 5 — The Fix
The fix was a single environment variable patch on the deployment:
kubectl set env deployment/karpenter -n karpenter \
"FEATURE_GATES=ReservedCapacity=true,SpotToSpotConsolidation=false,NodeRepair=false,NodeOverlay=false,StaticCapacity=false"
kubectl set env changes the variable in the deployment's pod template, which triggers a rolling update that replaces the crashing pod. I monitored the rollout:
kubectl rollout status deployment/karpenter -n karpenter --timeout=60s
Waiting for deployment "karpenter" rollout to finish: 0 of 1 updated replicas are available...
deployment "karpenter" successfully rolled out
The new pod came up immediately:
kubectl get pods -n karpenter
NAME READY STATUS RESTARTS AGE
karpenter-b4d4cf89c-sz45x 1/1 Running 0 25s
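The corrected string passed to kubectl set env can also be derived mechanically from the broken one. The sketch below is illustrative only: fill_empty_gates is a hypothetical helper, and defaulting to false assumes that false is the intended setting for every gate left empty, which held for StaticCapacity here but should be checked for each gate.

```python
def fill_empty_gates(raw: str, default: str = "false") -> str:
    """Give every gate with a missing or empty value an explicit default."""
    fixed = []
    for pair in raw.split(","):
        if not pair:
            continue  # skip fragments left by stray commas
        key, _, value = pair.partition("=")
        fixed.append(f"{key}={value or default}")
    return ",".join(fixed)


broken = (
    "ReservedCapacity=true,SpotToSpotConsolidation=false,"
    "NodeRepair=false,NodeOverlay=false,StaticCapacity="
)
print(fill_empty_gates(broken))
```

The printed string ends in StaticCapacity=false and leaves the other gates unchanged, matching the value used in the patch above.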
Step 6 — Verifying Recovery
I checked the logs to confirm clean startup:
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=10
{"level":"INFO","time":"2026-03-23T08:45:13Z","logger":"controller","message":"unknown field \"status.nodes\"","controller":"nodepool.counter","NodePool":{"name":"default"}}
{"level":"INFO","time":"2026-03-23T08:45:18Z","logger":"controller","message":"unknown field \"status.nodes\"","controller":"nodepool.counter","NodePool":{"name":"production"}}
No more panics. Karpenter was reconciling NodePool objects normally.
The INFO-level unknown field "status.nodes" messages are a separate, minor concern: they indicate version drift between the installed Karpenter CRDs and the controller. The message is an API server warning relayed by the controller. The controller writes a status.nodes field that the installed NodePool CRD schema does not define, so the API server drops it. This typically happens when the controller is upgraded without also updating the CRDs. It is non-fatal and does not affect node provisioning behaviour.
Impact of Karpenter Being Down
While Karpenter was in CrashLoopBackOff, the cluster’s existing nodes continued to operate normally — pods already scheduled were unaffected. The impact was limited to:
- No new node provisioning: If a pod was unschedulable due to resource constraints, it would remain in Pending indefinitely rather than triggering a new node
- No node consolidation: Underutilised nodes were not being reclaimed, leading to potential cost inefficiency
- No spot interruption handling: The interruption queue was not being processed, meaning spot instance termination events were not being acted upon
In a cluster that was already at capacity, this would have caused scheduling failures. In this case, the existing nodes had sufficient headroom, masking the impact.
How This Misconfiguration Likely Occurred
The most common path to this type of misconfiguration:
Scenario 1 — Template generation without a value:
A Helm chart, Kustomize overlay, or Terraform variable was updated to include StaticCapacity in the feature gate list, but the corresponding value variable was not set — resulting in StaticCapacity= with an empty string.
Scenario 2 — Manual edit with a trailing comma:
A direct edit of the deployment or configmap left a dangling entry such as StaticCapacity=,NextFeature=true or simply StaticCapacity= at the end of the list.
Scenario 3 — Version upgrade without reviewing defaults:
When upgrading Karpenter to a version that introduced StaticCapacity, a configuration template added the key but did not specify the default value.
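Scenario 1 is easy to reproduce with any string-based renderer. The sketch below uses plain Python rather than Helm, Kustomize, or Terraform, and both function names are hypothetical. It shows how an unset value turns into a dangling StaticCapacity= entry, and how a type check at render time stops it before it reaches the cluster.

```python
def render_naive(gates: dict) -> str:
    """Naive rendering: an unset (None) value silently becomes an empty string."""
    return ",".join(f"{name}={value or ''}" for name, value in gates.items())


def render_checked(gates: dict) -> str:
    """Render only real booleans; refuse anything unset or non-boolean."""
    parts = []
    for name, value in gates.items():
        if not isinstance(value, bool):
            raise ValueError(f"feature gate {name!r} needs True or False, got {value!r}")
        parts.append(f"{name}={'true' if value else 'false'}")
    return ",".join(parts)


# The new key was added to the template input, but its value was never set.
print(render_naive({"NodeOverlay": "false", "StaticCapacity": None}))
# NodeOverlay=false,StaticCapacity=
```

With render_checked, the same unset value raises a ValueError when the configuration is generated instead of producing a string that crashes the controller at startup.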
Prevention
Validate feature gate format before applying
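# Validate a comma-separated FEATURE_GATES string before applying it.
validate_feature_gates() {
  local gates="$1" pair key value
  local -a pairs
  IFS=',' read -ra pairs <<< "$gates"
  for pair in "${pairs[@]}"; do
    # An entry without '=' has no value at all.
    if [[ "$pair" != *=* ]]; then
      echo "ERROR: Feature gate '$pair' is missing '=value'" >&2
      return 1
    fi
    key="${pair%%=*}"    # text before the first '='
    value="${pair#*=}"   # text after the first '='
    if [[ -z "$value" ]]; then
      echo "ERROR: Feature gate '$key' has empty value" >&2
      return 1
    fi
    if [[ "$value" != "true" && "$value" != "false" ]]; then
      echo "ERROR: Feature gate '$key' has invalid value '$value' (must be true or false)" >&2
      return 1
    fi
  done
  echo "OK: All feature gates valid"
}
validate_feature_gates "ReservedCapacity=true,SpotToSpotConsolidation=false,StaticCapacity=false"
# OK: All feature gates valid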
Use explicit defaults in configuration templates
When managing feature gates via Helm or Kustomize, always specify a default value even for features that are disabled:
# values.yaml
featureGates:
  reservedCapacity: "true"
  spotToSpotConsolidation: "false"
  nodeRepair: "false"
  nodeOverlay: "false"
  staticCapacity: "false" # always explicit, never empty
Monitor CrashLoopBackOff on infrastructure pods
Karpenter, cert-manager, external-secrets, and similar infrastructure operators can fail silently from the perspective of application teams — workloads keep running while the control plane is broken. A dedicated alert on CrashLoopBackOff for pods in infrastructure namespaces (kube-system, karpenter, cert-manager, external-secrets) catches these failures before they accumulate 178 restarts unnoticed.
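Where no alerting stack is in place yet, a periodic check over the output of kubectl get pods -o json gives a rough equivalent. The following is a minimal sketch, not a replacement for proper alerting: the function name and the min_restarts threshold are assumptions, while the fields it reads (status.containerStatuses[].state.waiting.reason and restartCount) are part of the standard Pod status.

```python
import json


def crashlooping_containers(pods_json: str, min_restarts: int = 5) -> list:
    """Return (pod, container, restarts) for containers waiting in CrashLoopBackOff."""
    findings = []
    for pod in json.loads(pods_json).get("items", []):
        pod_name = pod["metadata"]["name"]
        for status in pod.get("status", {}).get("containerStatuses", []):
            # "waiting" is only present while the container is not running
            waiting = status.get("state", {}).get("waiting") or {}
            if (waiting.get("reason") == "CrashLoopBackOff"
                    and status.get("restartCount", 0) >= min_restarts):
                findings.append((pod_name, status["name"], status["restartCount"]))
    return findings
```

One way to use it is to pipe kubectl get pods -n karpenter -o json into a small script that passes sys.stdin.read() to crashlooping_containers and exits non-zero when the list is not empty. Clusters that already run kube-state-metrics can alert on its kube_pod_container_status_waiting_reason metric instead.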
Production Rules
- Go’s strconv.ParseBool("") returns an error. Any configuration system that produces empty-value key=value pairs and feeds them to a strict boolean parser will crash. Karpenter uses Must0, which panics on error, so there is no recovery.
- High restart counts on infrastructure pods are silent failures. 178 restarts over 3 days went unnoticed because existing workloads continued running. Infrastructure component health requires dedicated monitoring, not just workload monitoring.
- kubectl set env is a fast, targeted patch. For environment variable changes on running deployments, it is faster than editing the deployment YAML and triggers a proper rolling update.
- Feature gate format is strict. Both Kubernetes and Karpenter treat empty values as invalid. When adding a new feature gate to a configuration, always include the value even if it is the default.
Commands Reference
# Check pod restart count
kubectl get pods -n karpenter
# Read logs from crashed container
kubectl logs -n karpenter <pod> --previous
# List all environment variables on a deployment
kubectl get deployment <name> -n <namespace> \
-o jsonpath='{.spec.template.spec.containers[0].env}'
# Patch an environment variable
kubectl set env deployment/<name> -n <namespace> \
"FEATURE_GATES=key1=true,key2=false,key3=false"
# Monitor rollout
kubectl rollout status deployment/<name> -n <namespace> --timeout=60s
# Read live logs after fix
kubectl logs -n <namespace> -l app.kubernetes.io/name=<app> --tail=20