Zero HPAs, Unbounded Containers, and an OOMKilled ArgoCD Controller on EKS
A platform audit on Amazon EKS revealed missing HPAs across 16 services, unbounded platform components, and an ArgoCD application controller OOMKilling under reconciliation load.
The platform had been running for months with no horizontal pod autoscaling. Every service — web servers, Celery workers, background processors — ran at a static replica count set at deployment time. Platform components like ArgoCD, cert-manager, and external-secrets had no resource limits. The ArgoCD application controller was configured with a memory limit that would prove catastrophic under load.
A single audit session exposed all three problems. This post documents the discovery, the fixes, the incident that emerged mid-session, and the technical patterns that now govern how autoscaling is deployed on the platform.
Background
I run the platform on Amazon EKS with GitOps via ArgoCD. Application manifests live in separate Git repositories per service, structured using a Kustomize base/overlay pattern. Platform components (ArgoCD, cert-manager, Karpenter, ExternalSecrets Operator, VictoriaMetrics stack) are managed through a central infrastructure repository.
Karpenter provisions EC2 instances directly via two NodePools: a default pool using Spot instances for non-production workloads, and a production pool using On-Demand instances for revenue-critical services (tainted to ensure isolation).
I had migrated the observability stack — VictoriaMetrics, VictoriaLogs, Vector DaemonSet, and Grafana — earlier in the same session, establishing the monitoring foundation needed to make sense of the metrics that were about to become relevant.
Part 1: The HPA Audit
I started a routine cluster audit with:
kubectl get hpa -A
The output was empty. No resources found. Across 16 production and sandbox applications — four Django web services, four Celery worker processes, two Java services — not a single HorizontalPodAutoscaler existed.
The next question was whether the metrics API was even available:
kubectl top nodes
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)
metrics-server was not installed. Without it, autoscaling/v2 HPAs cannot function — the controller cannot read CPU utilization from the metrics.k8s.io API group.
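Whether the metrics API is absent or merely unhealthy can also be read from the APIService objects. The sketch below is a minimal illustration, not part of the original audit: it assumes the input is the parsed output of `kubectl get apiservices -o json`, and the function name `metrics_api_status` is hypothetical.

```python
def metrics_api_status(apiservices: dict) -> str:
    """Classify the metrics.k8s.io API as 'missing', 'unavailable', or 'available'.

    `apiservices` is assumed to be the parsed JSON from
    `kubectl get apiservices -o json` (an assumption of this sketch).
    """
    for item in apiservices.get("items", []):
        if item.get("spec", {}).get("group") != "metrics.k8s.io":
            continue
        # The APIService is registered; check its Available condition.
        for cond in item.get("status", {}).get("conditions", []):
            if cond.get("type") == "Available" and cond.get("status") == "True":
                return "available"
        return "unavailable"
    # No APIService for the group at all: nothing is serving the metrics API.
    return "missing"


# Hypothetical sample resembling a cluster where the backend is not serving.
sample = {
    "items": [
        {
            "metadata": {"name": "v1beta1.metrics.k8s.io"},
            "spec": {"group": "metrics.k8s.io", "version": "v1beta1"},
            "status": {"conditions": [{"type": "Available", "status": "False"}]},
        }
    ]
}
print(metrics_api_status(sample))
```

A result other than `available` means CPU-based HPAs have no data to act on, whatever their manifests say.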
Installing metrics-server
Installing metrics-server on EKS requires one non-obvious flag:
helm upgrade --install metrics-server metrics-server/metrics-server \
  -n kube-system \
  --set 'args={--kubelet-insecure-tls}'
Without --kubelet-insecure-tls, metrics-server fails with TLS certificate errors. EKS kubelets present a certificate whose CN is the node hostname, but metrics-server connects using the node IP — the name doesn’t match and verification fails. The flag disables certificate validation for kubelet connections (acceptable within a cluster where network trust already exists).
After installation:
kubectl top nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
ip-10-0-x-x.us-east-2.compute.internal 87m 4% 1821Mi 48%
Creating HPAs
The platform runs four production Django services, each with a web deployment and a Celery worker deployment. I created HPAs for all eight combinations.
Web service pattern (CPU 70%, scale to 3):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-web
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
The worker pattern uses an 80% CPU target and scales to 4: workers are CPU-intensive by design, so they are allowed to run hotter before a scale-up triggers. The 300-second scale-down stabilization window prevents premature scale-down between task bursts.
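To make the targets concrete, here is a rough sketch of the replica calculation the HPA documentation describes: desired = ceil(current × currentUtilization / targetUtilization), skipped when the ratio is within a tolerance (0.1 by default) and clamped to the min/max bounds. It deliberately ignores stabilization windows, missing metrics, and not-ready pods; the function name is hypothetical. Utilization is measured relative to the container's CPU request, so changing a request shifts the scaling point.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=1, max_replicas=3, tolerance=0.1):
    """Approximate the HPA replica calculation for a single CPU metric.

    Simplified model: no stabilization windows, no handling of
    missing metrics or unready pods.
    """
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: the HPA leaves the count alone.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds declared in the HPA spec.
    return max(min_replicas, min(max_replicas, desired))


# Web pattern (target 70%, max 3): one pod at 95% of its CPU request.
print(desired_replicas(1, 95, 70))
# Worker pattern (target 80%, max 4): one pod at 200% of its CPU request.
print(desired_replicas(1, 200, 80, max_replicas=4))
```

With these inputs the web deployment goes to 2 replicas and the worker deployment to 3.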
Celery beat was deliberately excluded. It is a task scheduler — it must run as exactly one instance. A second beat process would duplicate every scheduled task. It runs as a single-instance deployment with no HPA, and its resources were reduced to match actual usage (CPU request from 250m to 100m — beat spends most of its time sleeping between task intervals).
Part 2: ArgoCD vs HPA
Within minutes of creating HPAs and adding them to Kustomize overlays, an unexpected interaction emerged. ArgoCD was resetting pod replicas back to the value in Git — overriding the HPA’s scaling decisions.
The reason: selfHeal: true. When ArgoCD detects drift between the live cluster state and the desired state in Git, it corrects the drift. An HPA that has scaled a deployment from 1 to 2 replicas creates apparent drift. ArgoCD heals it back to 1. The HPA scales back to 2. ArgoCD heals again. An infinite loop.
The fix is ignoreDifferences in the ArgoCD Application:
apiVersion: argoproj.io/v1alpha1
kind: Application
spec:
  # ...
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas
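This tells ArgoCD to ignore the spec.replicas field when computing drift, so selfHeal no longer sees the HPA’s scaling as a change to revert. The HPA manages replicas; ArgoCD manages everything else. Every ArgoCD Application with an associated HPA needs this configuration. One caveat: ignoreDifferences only affects the diff. A sync triggered by some other change still applies the manifest as written, including any replicas value in Git, unless the field is omitted from the manifest or the Application sets the RespectIgnoreDifferences=true sync option.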
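The effect of ignoring a JSON pointer during drift detection can be modelled in a few lines. This is a simplified illustration, not ArgoCD's implementation: it only handles object keys (no array indexes), and the names `drop_pointer` and `out_of_sync` are hypothetical.

```python
import copy


def drop_pointer(obj, pointer):
    """Return a copy of obj with the field at a JSON pointer removed.

    Only dict keys are supported; array indexes are out of scope here.
    """
    result = copy.deepcopy(obj)
    parts = [p.replace("~1", "/").replace("~0", "~") for p in pointer.split("/")[1:]]
    node = result
    for key in parts[:-1]:
        if not isinstance(node, dict) or key not in node:
            return result  # Path does not exist; nothing to remove.
        node = node[key]
    if isinstance(node, dict):
        node.pop(parts[-1], None)
    return result


def out_of_sync(desired, live, ignored_pointers=()):
    """Compare desired and live state after dropping the ignored fields."""
    for pointer in ignored_pointers:
        desired = drop_pointer(desired, pointer)
        live = drop_pointer(live, pointer)
    return desired != live


git = {"spec": {"replicas": 1, "template": {"image": "web:1.0"}}}
live = {"spec": {"replicas": 2, "template": {"image": "web:1.0"}}}  # HPA scaled up

print(out_of_sync(git, live))                      # replicas differ
print(out_of_sync(git, live, ["/spec/replicas"]))  # replicas ignored
```

Without the pointer, the HPA's change registers as drift; with it, the two states compare equal while any other field change is still detected.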
Part 3: Platform Resource Audit
With application autoscaling addressed, I turned the audit to the platform components themselves. The findings were uniform: zero resource limits on almost everything.
kubectl get pods -n argocd -o json | python3 -c "
import sys, json
pods = json.load(sys.stdin)['items']
for p in pods:
    for c in p['spec']['containers']:
        r = c.get('resources', {})
        print(f\"{p['metadata']['name']}/{c['name']}: limits={r.get('limits', 'NONE')}\")"
ArgoCD Redis: no limits. cert-manager: no limits. external-secrets-operator: no limits. kube-state-metrics: no limits. victoria-metrics-operator: no limits.
A container with no limits can consume all memory on a node, triggering OOM kills of neighboring pods or saturating the node entirely. On a cost-optimized cluster running multiple namespaces on shared nodes, this is particularly dangerous.
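The same check can be run across every namespace and reduced to a list of offenders. The sketch below is one possible shape for that, assuming the input is the parsed output of `kubectl get pods -A -o json`; the function name is hypothetical, and init containers are included as an extension beyond the one-liner above.

```python
def containers_without_limits(pod_list):
    """List (namespace, pod, container, missing) for containers lacking limits.

    `pod_list` is assumed to be parsed JSON from `kubectl get pods -A -o json`.
    """
    findings = []
    for pod in pod_list.get("items", []):
        meta = pod["metadata"]
        spec = pod["spec"]
        containers = spec.get("containers", []) + spec.get("initContainers", [])
        for c in containers:
            limits = (c.get("resources") or {}).get("limits") or {}
            missing = [res for res in ("cpu", "memory") if res not in limits]
            if missing:
                findings.append(
                    (meta.get("namespace", "default"), meta["name"], c["name"], missing)
                )
    return findings


# Hypothetical sample: one unbounded container, one fully bounded.
sample = {
    "items": [
        {
            "metadata": {"namespace": "argocd", "name": "argocd-redis-0"},
            "spec": {"containers": [{"name": "redis", "resources": {}}]},
        },
        {
            "metadata": {"namespace": "cert-manager", "name": "cert-manager-0"},
            "spec": {
                "containers": [
                    {
                        "name": "controller",
                        "resources": {"limits": {"cpu": "200m", "memory": "256Mi"}},
                    }
                ]
            },
        },
    ]
}
for ns, pod, container, missing in containers_without_limits(sample):
    print(f"{ns}/{pod}/{container}: missing limits for {', '.join(missing)}")
```

Fed from a file or a pipe, this gives a cluster-wide view rather than one namespace at a time.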
I added resource limits to all platform components via their Helm values files:
ArgoCD redis (platform/argocd/values.yaml):
redis:
  resources:
    requests:
      cpu: 50m
      memory: 64Mi
    limits:
      cpu: 200m
      memory: 128Mi
cert-manager (platform/cert-manager/values-cert-manager.yaml):
resources:
  requests:
    cpu: 50m
    memory: 64Mi
  limits:
    cpu: 200m
    memory: 256Mi
webhook:
  resources:
    requests:
      cpu: 25m
      memory: 32Mi
    limits:
      cpu: 100m
      memory: 64Mi
cainjector:
  resources:
    requests:
      cpu: 25m
      memory: 64Mi
    limits:
      cpu: 100m
      memory: 128Mi
The pattern for setting resource limits on Helm-managed platform components: create a dedicated values-<component>.yaml file, add it to the deploy script with --values, and track it in the infrastructure repository alongside the component’s other configuration.
Part 4: The OOM Incident
Midway through applying the platform changes, all ArgoCD application syncs stopped progressing. Every application showed Syncing status indefinitely. I queried sync operations:
kubectl get applications -n argocd -o json | python3 -c "
import sys, json
apps = json.load(sys.stdin)['items']
for a in apps:
    op = a.get('status', {}).get('operationState', {})
    print(a['metadata']['name'], op.get('phase', 'no-op'))"
Every app returned Running. No app was completing.
My initial hypothesis was a deadlock — perhaps a sync wave ordering issue. Looking at the application controller pod:
kubectl describe pod argocd-application-controller-0 -n argocd
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Restart Count: 11
Eleven OOMKill restarts. The application controller was being killed repeatedly before it could complete any sync operation.
Root cause: The ArgoCD application controller was configured with a 512Mi memory limit. With 16 applications reconciling simultaneously — including newly added apps causing heightened reconciliation activity — the controller exceeded 512Mi and was killed. The controller would restart, begin reconciling again, exceed the limit, get killed, and loop.
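A restart loop like this is visible in the pod status long before anyone reads `kubectl describe` output. As a sketch of how it could be surfaced automatically, the function below scans parsed `kubectl get pods -n argocd -o json` output for containers whose last termination reason was OOMKilled. The function name and the sample data are hypothetical; the status fields used (`containerStatuses`, `lastState.terminated.reason`, `restartCount`) are standard pod status fields.

```python
def oom_killed_containers(pod_list, min_restarts=1):
    """Return (pod, container, restart_count) for containers last killed by OOM."""
    hits = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for status in statuses:
            terminated = (status.get("lastState") or {}).get("terminated") or {}
            restarts = status.get("restartCount", 0)
            if terminated.get("reason") == "OOMKilled" and restarts >= min_restarts:
                hits.append((pod["metadata"]["name"], status["name"], restarts))
    return hits


# Hypothetical sample mirroring the describe output above.
sample = {
    "items": [
        {
            "metadata": {"name": "argocd-application-controller-0"},
            "status": {
                "containerStatuses": [
                    {
                        "name": "application-controller",
                        "restartCount": 11,
                        "lastState": {
                            "terminated": {"reason": "OOMKilled", "exitCode": 137}
                        },
                    }
                ]
            },
        },
        {
            "metadata": {"name": "argocd-server-abc"},
            "status": {
                "containerStatuses": [
                    {"name": "server", "restartCount": 0, "lastState": {}}
                ]
            },
        },
    ]
}
for pod, container, restarts in oom_killed_containers(sample):
    print(f"{pod}/{container}: OOMKilled, {restarts} restarts")
```

Raising `min_restarts` filters out one-off kills and leaves only containers stuck in a loop.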
The fix was immediate:
# Patch the StatefulSet directly (takes effect on next pod creation)
kubectl patch statefulset argocd-application-controller -n argocd --type json \
-p '[
{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"2Gi"},
{"op":"replace","path":"/spec/template/spec/containers/0/resources/requests/memory","value":"512Mi"}
]'
# Delete the running pod to trigger recreation with new limits
kubectl delete pod argocd-application-controller-0 -n argocd
Within 90 seconds, the new pod was running. Within five minutes, all 16 applications had completed their syncs and returned to Synced/Healthy.
I updated the Helm values to make the change permanent:
# platform/argocd/values.yaml
controller:
  resources:
    requests:
      cpu: 100m
      memory: 512Mi
    limits:
      cpu: 1
      memory: 2Gi
Why 512Mi was insufficient: ArgoCD’s application controller holds in-memory state for every managed application — resource trees, diff state, sync history. With 16 apps containing dozens of Kubernetes resources each, plus active sync operations loading full resource manifests, the working set exceeded 512Mi under load. The ArgoCD documentation recommends at minimum 1Gi for production deployments with many applications; 2Gi provides comfortable headroom.
Results
By the end of the session:
- 8 HPAs active across 4 production services (web + worker per service)
- metrics-server running, kubectl top functional
- All platform components have resource limits
- ArgoCD application controller running at 2Gi (was 512Mi)
- All 16 applications Synced/Healthy
- Celery-beat CPU reduced 60% across 4 services (250m to 100m request)
- Worker baseline replicas halved (2 to 1; HPA scales up as needed)
Production Rules
Audit kubectl get hpa -A before assuming autoscaling is configured. HPAs are not created automatically. Without explicit HPA manifests in the application repository, deployments run at static replica counts indefinitely.
metrics-server is a prerequisite, not optional. HPA controllers cannot function without the metrics.k8s.io API group. On EKS, always install with --kubelet-insecure-tls.
ArgoCD selfHeal and HPA are incompatible without ignoreDifferences. Every ArgoCD Application that manages a deployment with an HPA must include ignoreDifferences for apps/Deployment/spec/replicas. Without it, ArgoCD and the HPA fight in a continuous loop.
Platform components need resource limits too. It is easy to focus resource governance on application workloads while leaving platform components unbounded. ArgoCD, cert-manager, external-secrets, and observability components all need explicit limits proportional to their workload.
ArgoCD controller memory scales with application count. A limit sized for a handful of applications is insufficient once the cluster manages many. At 16+ apps, 512Mi is not enough: the controller OOMKills under reconciliation load. Configure at least 1–2Gi for production deployments, and check restart counts with kubectl describe pod argocd-application-controller-0 -n argocd whenever sync operations stall.