Installing Karpenter 1.8 on EKS 1.34: Four Errors and a Working Cluster

#karpenter #eks #aws #spot-instances #cost-optimization

Installed Karpenter 1.8 on EKS 1.34 by working through a Helm registry migration, a version compatibility gap, a feature gate parsing bug, and a missing aws-auth entry — alongside a cost audit that cut $23/month.

Gideon Warui

The Problem

An AWS alert arrived: “You have exceeded 85% of your AWS Free Tier usage for CloudWatch.” The cluster was consuming 4.75GB of the 5GB monthly limit after only two weeks of operation.

Part 1: The CloudWatch Investigation

Finding the Culprit

I started by identifying what was generating all these logs:

aws logs describe-log-groups --region us-east-2 \
  --query 'logGroups[*].[logGroupName,storedBytes]' \
  --output table

The output revealed the problem immediately:

/aws/eks/cluster-name/cluster    748,482,560 bytes (748MB)

EKS control plane logs were generating approximately 250MB per day. At that rate, the cluster would hit 7.5GB/month — well over the 5GB Free Tier limit.
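The projection is simple arithmetic; a quick sketch using the observed rate from the log-group growth above:

```python
# Observed control plane log ingestion rate (from the 748MB / ~2 weeks above).
DAILY_MB = 250
FREE_TIER_GB = 5  # CloudWatch Free Tier monthly ingestion limit

monthly_gb = DAILY_MB * 30 / 1000
print(f"Projected: {monthly_gb} GB/month vs {FREE_TIER_GB} GB Free Tier")
```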

The Root Cause

EKS control plane logging was enabled for all log types in the Terraform configuration:

# Before: All logging enabled
cluster_enabled_log_types = ["api", "audit", "authenticator", "controllerManager", "scheduler"]

The audit logs alone generated hundreds of megabytes daily, logging every API call to the Kubernetes API server.

The Fix

Disabling EKS control plane logging entirely in Terraform resolved the issue:

# After: Logging disabled to stay within Free Tier
cluster_enabled_log_types = []
cluster_log_retention_days = 1

The CloudWatch log group also became conditional:

resource "aws_cloudwatch_log_group" "cluster" {
  count             = length(var.cluster_enabled_log_types) > 0 ? 1 : 0
  name              = "/aws/eks/${local.cluster_name}/cluster"
  retention_in_days = var.cluster_log_retention_days
  tags              = local.common_tags
}

Deleting the existing log group reclaimed the 748MB of stored data still counting against the quota:

aws logs delete-log-group \
  --log-group-name /aws/eks/cluster-name/cluster \
  --region us-east-2

Key Takeaway: EKS control plane logging is useful for active debugging but expensive for always-on usage. For development clusters, enable it only when actively debugging issues.

Part 2: The Real Cost Analysis

With the CloudWatch issue addressed, deeper investigation into AWS spending revealed hidden costs.

The Hidden Truth: AWS Credits

Running aws ce get-cost-and-usage revealed something interesting:

Cost Type            Amount
Unblended Cost       $17.32
Net Unblended Cost   $0.00

AWS credits ($160) were masking actual infrastructure costs. The real monthly run rate was approximately $177/month, which would become $420–490/month once credits expired — well above the $400/month budget target.

Cost Breakdown

Service                Monthly Cost
EC2 (6 nodes)          ~$40
RDS (prod + nonprod)   ~$33
EKS Control Plane      ~$32
VPC (NAT Gateway)      ~$32
Other                  ~$40
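As a quick check, the line items above sum to the stated run rate:

```python
# Monthly cost line items from the breakdown above.
costs = {
    "EC2 (6 nodes)": 40,
    "RDS (prod + nonprod)": 33,
    "EKS Control Plane": 32,
    "VPC (NAT Gateway)": 32,
    "Other": 40,
}
total = sum(costs.values())
print(f"~${total}/month")  # matches the ~$177/month run rate above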

Quick Wins

  1. RDS Nonprod Downsizing: The nonprod database was running on db.t4g.small, but CloudWatch metrics showed:

    • Average CPU: 4.5%
    • Average connections: 4.8
    • Max connections: 20

    Downsizing from db.t4g.small ($23/month) to db.t4g.micro ($8/month) saved ~$15/month.

  2. CloudWatch Logging Disabled: Saved ~$8/month on log ingestion costs.

Part 3: The Spot Instance Challenge

The Goal

Move sandbox/development workloads to Spot instances to cut costs, with automatic fallback to On-Demand when Spot capacity is unavailable.

First Attempt: Hard NodeSelector (Failed)

My initial approach used a nodeSelector to force pods onto Spot nodes:

spec:
  template:
    spec:
      nodeSelector:
        eks.amazonaws.com/capacityType: SPOT

Result: Pods stuck in Pending state.

Events:
  Warning  FailedScheduling  0/6 nodes are available:
  1 node(s) didn't match Pod's node affinity/selector,
  5 node(s) had untolerated taint(s)

The Spot node had reached its 17-pod limit (t3.large ENI limitation), and the hard selector prevented fallback to On-Demand nodes.

Second Attempt: Preferred Node Affinity (Partial Success)

Switching to preferredDuringSchedulingIgnoredDuringExecution provided flexibility:

spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: eks.amazonaws.com/capacityType
                    operator: In
                    values:
                      - SPOT
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: workload-type
                    operator: NotIn
                    values:
                      - production

Result: Pods scheduled successfully, but only 1 of 3 landed on Spot (the node was full).

The ENI Limit Problem

Each EC2 instance type has a maximum number of ENIs and IPs it can support. For t3.large:

  • Max ENIs: 3
  • IPs per ENI: 12
  • Max pods: ~17 (accounting for system pods)

With 17 pods already on the Spot node, no new pods could schedule there regardless of affinity preferences.
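For background, the VPC CNI's theoretical ceiling follows the commonly documented formula maxPods = ENIs × (IPs per ENI − 1) + 2 (one IP per ENI is reserved for the ENI itself, plus 2 for host-networking pods); the effective schedulable count on a given node can be lower depending on kubelet max-pods settings and system pods. A sketch under those assumptions:

```python
def eni_max_pods(enis: int, ips_per_eni: int) -> int:
    """Theoretical VPC CNI pod ceiling without prefix delegation:
    each ENI's primary IP is reserved, plus 2 host-network pods."""
    return enis * (ips_per_eni - 1) + 2

print(eni_max_pods(3, 12))  # 35 theoretical for t3.large
```

The ~17 effective limit observed here is lower than the theoretical ceiling, presumably reflecting this cluster's kubelet configuration and system overhead.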

The Solution: VPC CNI Prefix Delegation

AWS VPC CNI supports “prefix delegation” which assigns /28 prefixes (16 IPs) instead of individual IPs:

kubectl set env daemonset aws-node -n kube-system \
  ENABLE_PREFIX_DELEGATION=true
kubectl set env daemonset aws-node -n kube-system \
  WARM_PREFIX_TARGET=1
kubectl rollout restart daemonset aws-node -n kube-system

Important: Existing nodes retain their original pod limits. Only newly provisioned nodes receive the higher capacity (~110 pods per node for t3.large).
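With prefix delegation, each slot that previously held one secondary IP holds a /28 prefix (16 IPs) instead, and EKS recommends capping smaller instance types at 110 pods per node — a sketch under those assumptions:

```python
def prefix_delegation_max_pods(enis: int, ips_per_eni: int,
                               prefix_size: int = 16, cap: int = 110) -> int:
    # Each former secondary-IP slot now carries a /28 prefix (16 IPs);
    # the practical limit is capped at the EKS-recommended 110 pods/node.
    theoretical = enis * (ips_per_eni - 1) * prefix_size + 2
    return min(cap, theoretical)

print(prefix_delegation_max_pods(3, 12))  # 110 for t3.large
```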

Part 4: Installing Karpenter — The Hard Way

Why Karpenter?

Without Karpenter, EKS Managed Node Groups impose several limitations:

  • Slow scaling (2–5 minutes)
  • Separate node groups required for Spot vs On-Demand
  • No automatic rebalancing when Spot capacity returns
  • No dynamic instance type selection based on pod requirements

Karpenter solves all of these problems.

Prerequisites Check

Terraform had already created the necessary IAM infrastructure:

# Verify IAM role exists
aws iam get-role --role-name cluster-karpenter-controller

# Verify SQS queue for interruption handling
aws sqs get-queue-url --queue-name cluster-karpenter-interruption

# Verify instance profile for nodes
aws iam get-instance-profile --instance-profile-name cluster-karpenter-node

All resources existed. Installation should have been straightforward. It was not.

Error 1: Chart Version Not Found

helm upgrade --install karpenter karpenter/karpenter \
  --version 1.0.0 \
  ...

Error:

Error: chart "karpenter" matching 1.0.0 not found in karpenter index

Root Cause: The old Helm repository (charts.karpenter.sh) only contains pre-1.0 versions. Karpenter 1.x is distributed via an OCI registry instead.

Fix: Use the OCI registry:

helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version "1.1.1" \
  ...

Error 2: Kubernetes Version Incompatibility

panic: validating kubernetes version, karpenter version is not
compatible with K8s version 1.34

Root Cause: Karpenter 1.1.1 does not support Kubernetes 1.34.

Karpenter Compatibility Matrix:

Kubernetes   Minimum Karpenter
1.31         >= 1.0.5
1.32         >= 1.2
1.33         >= 1.5
1.34         >= 1.8
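The matrix is easy to encode as a pre-flight check in tooling — a hypothetical helper (the function name is illustrative; the data is from the matrix above):

```python
# Minimum Karpenter version per Kubernetes minor, per the matrix above.
MIN_KARPENTER = {"1.31": "1.0.5", "1.32": "1.2", "1.33": "1.5", "1.34": "1.8"}

def min_karpenter_for(k8s_minor: str) -> str:
    """Return the minimum compatible Karpenter version for a K8s minor."""
    try:
        return MIN_KARPENTER[k8s_minor]
    except KeyError:
        raise ValueError(f"no compatibility data for Kubernetes {k8s_minor}")

print(min_karpenter_for("1.34"))  # 1.8
```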

Fix: Install Karpenter 1.8.0:

helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version "1.8.0" \
  ...

Error 3: Feature Gate Parsing Failure

panic: parsing feature gates, invalid value of StaticCapacity: ,
err: strconv.ParseBool: parsing "": invalid syntax

Root Cause: Karpenter 1.8.0 introduced a new feature gate (StaticCapacity) that the Helm chart set to an empty string.

Examining the deployment revealed:

- name: FEATURE_GATES
  value: ReservedCapacity=true,SpotToSpotConsolidation=false,...,StaticCapacity=

Note the trailing StaticCapacity= with no value.
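A Python analogue of the failing parse makes the panic easy to see — a simplified sketch, not Karpenter's actual code (Go's strconv.ParseBool also accepts forms like "1" and "T", omitted here):

```python
def parse_feature_gates(gates_str: str) -> dict:
    """Parse "Key=bool,Key=bool" pairs; an empty value is not a valid
    boolean, mirroring Go's strconv.ParseBool("") failure."""
    gates = {}
    for pair in gates_str.split(","):
        key, _, value = pair.partition("=")
        if value not in ("true", "false"):
            raise ValueError(f"invalid value of {key}: {value!r}")
        gates[key] = (value == "true")
    return gates

try:
    parse_feature_gates("ReservedCapacity=true,StaticCapacity=")
except ValueError as err:
    print(err)  # invalid value of StaticCapacity: ''
```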

Fix: Patch the deployment directly:

kubectl set env deploy/karpenter -n karpenter \
  FEATURE_GATES="ReservedCapacity=true,SpotToSpotConsolidation=false,NodeRepair=false,NodeOverlay=false,StaticCapacity=false"

Error 4: Nodes Launch But Never Join Cluster

With Karpenter running, a test deployment triggered node provisioning. Karpenter launched EC2 instances, but they never joined the cluster:

kubectl get nodeclaims
NAME            TYPE        CAPACITY   ZONE         NODE   READY     AGE
default-qshpb   t2.xlarge   spot       us-east-2b          Unknown   7m

The NodeClaim showed READY: Unknown and Node not registered with cluster. Checking the EC2 instance’s cloud-init logs revealed:

WARNING: Unhandled unknown content-type (application/node.eks.aws) userdata

Red Herring: This warning is expected behavior and NOT the problem. The application/node.eks.aws content type gets processed by nodeadm, not cloud-init.

Root Cause: The Karpenter node IAM role was missing from the aws-auth ConfigMap. Without this entry, nodes cannot authenticate to the Kubernetes API.

For AL2023 nodes to register, they require membership in BOTH the system:bootstrappers AND system:nodes groups.

Fix: Add the Karpenter node role to aws-auth:

kubectl get configmap aws-auth -n kube-system -o yaml
# Only had managed node group role, missing Karpenter role!

kubectl patch configmap aws-auth -n kube-system --type merge -p '{
  "data": {
    "mapRoles": "- rolearn: arn:aws:iam::ACCOUNT_ID:role/eks-node-role\n  groups:\n  - system:bootstrappers\n  - system:nodes\n  username: system:node:{{EC2PrivateDNSName}}\n- rolearn: arn:aws:iam::ACCOUNT_ID:role/karpenter-node-role\n  groups:\n  - system:bootstrappers\n  - system:nodes\n  username: system:node:{{EC2PrivateDNSName}}\n"
  }
}'
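For readability, the mapRoles string in that patch decodes to the following YAML (ACCOUNT_ID and the role names are placeholders, as in the patch above):

```yaml
mapRoles: |
  - rolearn: arn:aws:iam::ACCOUNT_ID:role/eks-node-role
    groups:
      - system:bootstrappers
      - system:nodes
    username: system:node:{{EC2PrivateDNSName}}
  - rolearn: arn:aws:iam::ACCOUNT_ID:role/karpenter-node-role
    groups:
      - system:bootstrappers
      - system:nodes
    username: system:node:{{EC2PrivateDNSName}}
```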

After patching aws-auth, the next test deployment worked immediately — a new SPOT node joined in ~60 seconds.

Finally Working

After fixing the feature gates and aws-auth, Karpenter started successfully:

kubectl get pods -n karpenter
NAME                         READY   STATUS    RESTARTS   AGE
karpenter-55674fd845-l7zhq   1/1     Running   0          39s
karpenter-55674fd845-nkj45   1/1     Running   0          39s

Applying NodePool Configuration

With Karpenter running, applying the NodePool configuration completed the setup:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["t", "m", "c", "r"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["medium", "large", "xlarge"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]  # Spot first, On-Demand fallback
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
  limits:
    cpu: 100
    memory: 200Gi

The manifest also defines an EC2NodeClass and a system NodePool (omitted above for brevity):

kubectl apply -f platform/karpenter/nodepool.yaml

nodepool.karpenter.sh/default created
ec2nodeclass.karpenter.k8s.aws/default created
nodepool.karpenter.sh/system created

The Results

Scale Up Test

Creating a test deployment with 3 replicas requiring Karpenter nodes:

kubectl get nodeclaims
NAME            TYPE        CAPACITY   ZONE         NODE                           READY   AGE
default-g5n5h   t2.medium   spot       us-east-2b   ip-10-0-26-223.compute.local   True    75s

Result: Karpenter provisioned a SPOT t2.medium instance in ~60 seconds. All 3 pods scheduled successfully on the new node.

Scale Down Test

After deleting the test deployment, Karpenter’s consolidation policy (WhenEmptyOrUnderutilized with consolidateAfter: 1m) removed the empty node:

kubectl get nodeclaims
No resources found

Result: The empty node was terminated within 90 seconds of becoming idle.

Final Architecture

Component                   Status
VPC CNI Prefix Delegation   Enabled
Karpenter 1.8.0             Running
Default NodePool            Spot-first with On-Demand fallback
System NodePool             On-Demand only for critical workloads

Cost Savings Summary

Optimization                  Monthly Savings
CloudWatch logging disabled   ~$8
RDS nonprod downsized         ~$15
Karpenter spot provisioning   Variable
Total confirmed               ~$23+/month

Production Rules

  1. ENI limits determine scheduling capacity. A t3.large caps at ~17 pods. A full Spot node with a hard nodeSelector will block all new pod scheduling with no fallback. Know the limit before designing your affinity strategy.

  2. Karpenter version compatibility is strict. Always check the compatibility matrix before picking a version. K8s 1.34 requires Karpenter >= 1.8. The OCI registry (oci://public.ecr.aws/karpenter/karpenter) is required for 1.x — the old Helm chart repo only has pre-1.0 versions.

  3. Helm charts can have bugs. When Karpenter crashes immediately after installation, inspect the actual FEATURE_GATES env var on the deployment, not just the Helm values. A chart bug setting an empty value causes an immediate panic with a misleading error.

  4. Use soft affinity over hard selectors. preferredDuringSchedulingIgnoredDuringExecution falls back gracefully when preferred nodes are full or unavailable. Hard nodeSelector does not.

  5. Karpenter nodes require manual aws-auth entries. Unlike managed node groups, Karpenter node roles are not automatically added to aws-auth. Add the role with both system:bootstrappers and system:nodes groups, or nodes will launch and never register.

  6. The application/node.eks.aws warning in cloud-init is a red herring. If Karpenter-provisioned nodes aren’t joining the cluster, check aws-auth before spending time on cloud-init logs. The nodeadm process handles that content type — cloud-init is not the problem.
