ArgoCD Sync Wave Deadlock: How a Broken Deployment Blocked Its Own Fix
Traced two days of ArgoCD OutOfSync/Degraded state to three concurrent root causes: a sync wave deadlock from a missing ConfigMap environment variable, a liveness probe depending on tools absent from the container image, and a stuck sync operation that blocked all subsequent fixes.
An ArgoCD application reported OutOfSync/Degraded for two days. Every pod was in CrashLoopBackOff, and pushing fixes to the repository had no effect. The sync operation was stuck in the Running phase, blocking all new syncs, including the ones containing the fix. Three problems ran concurrently; untangling them required working through each one.
The Symptoms
The ArgoCD dashboard showed two applications in a bad state:
NAME                     SYNC STATUS   HEALTH STATUS
<project-a>-sandbox      OutOfSync     Degraded
<project-a>-production   Synced        Progressing
Checking pods confirmed the damage:
kubectl get pods -n <namespace>
NAME                             READY   STATUS             RESTARTS
celery-beat-9d6cb559c-6r4j2      0/1     CrashLoopBackOff   51
celery-worker-6c7f9b95ff-fd4ql   0/1     CrashLoopBackOff   38
<project-a>-web-6988d5b5c4-nz    0/1     CrashLoopBackOff   35
Three different pods, three different crash reasons.
Problem 1: Web Pod — Missing Environment Variable
The web pod logs showed a clear error:
Starting entrypoint script...
Error: DATABASE environment variable is not set.
Investigation
The DATABASE env var had been moved between commits. It was originally defined in the Kustomize overlay patch (patch-env-web-sandbox.yaml); a later commit moved it to the base ConfigMap (configmap-app.yaml) and removed it from the overlay.
The base ConfigMap on the cluster still had the old version — without DATABASE. The new ConfigMap could not be applied because the ArgoCD sync was stuck.
Root Cause Chain
- A sync operation started against an older commit
- The sync applied deployment changes (new image tag, removed DATABASE from env list)
- The sync waited for deployments to become healthy before applying remaining resources
- Pods crashed because the ConfigMap (still old) lacked DATABASE
- ArgoCD could not complete the sync → could not apply the new ConfigMap → deadlock
Problem 2: Celery Worker — Liveness Probe Failure
The celery-worker pods showed a different crash pattern. Logs indicated a clean startup:
[INFO/MainProcess] Connected to amqp://...@rabbitmq.<namespace>.svc.cluster.local:5672/...
[INFO/MainProcess] celery@celery-worker-7f89fb6cb-pjk4d ready.
worker: Warm shutdown (MainProcess)
The worker started, connected to RabbitMQ, then received a graceful shutdown signal. Kubernetes events revealed the cause:
kubectl get events -n <namespace> --field-selector reason=Unhealthy
Warning Unhealthy Liveness probe failed: sh: 1: ps: not found
The Broken Probe
The base deployment defined this liveness probe:
livenessProbe:
  exec:
    command:
      - sh
      - -c
      - "ps aux | grep '[c]elery'"
  initialDelaySeconds: 30
  periodSeconds: 60
  failureThreshold: 3
The Docker image was built on a minimal Python base that does not include procps, so the ps command does not exist and the probe fails on every run, once every 60 seconds. After 3 consecutive failures (about 3 minutes), Kubernetes kills the container. The container restarts, runs for about 3 minutes, and is killed again. Production had accumulated 288 restarts in 26 hours.
The Fix
Replace ps aux | grep with a /proc filesystem check that works on any Linux container:
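livenessProbe:
  exec:
    command:
      - sh
      - -c
      - "grep -q '[c]elery' /proc/[0-9]*/cmdline 2>/dev/null"
  initialDelaySeconds: 30
  periodSeconds: 60
  failureThreshold: 3
The /proc/[pid]/cmdline file contains the command line of each running process. Searching these files for “celery” accomplishes the same check as ps aux | grep celery without requiring any additional packages. The bracketed pattern '[c]elery' is kept from the original probe for the same reason it was there: the probe's own sh -c process carries the search text in its command line, so a plain celery pattern would match the probe itself and the check would always pass. grep -q is used directly on the files instead of piping into a second grep, because the cmdline files are NUL-separated and some grep builds may print only a "binary file matches" notice for them, not the matching text.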
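To make the /proc layout concrete, here is a minimal Python sketch of the same scan. It is illustrative only and is not what the probe runs. The names `find_processes`, `make_fake_proc`, and the `proc_root` parameter are hypothetical, added so the logic can be exercised against a throwaway directory in place of a live system.

```python
import os
import tempfile


def find_processes(needle: str, proc_root: str = "/proc") -> list[int]:
    """Return PIDs whose command line contains `needle`.

    Each <proc_root>/<pid>/cmdline holds the process arguments
    separated by NUL bytes.
    """
    matches = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():  # only numeric directories are processes
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                raw = fh.read()
        except OSError:  # process exited or file not readable
            continue
        args = [a.decode(errors="replace") for a in raw.split(b"\0") if a]
        if any(needle in a for a in args):
            matches.append(int(entry))
    return sorted(matches)


def make_fake_proc(processes: dict[int, list[str]]) -> str:
    """Build a /proc-like tree in a temp directory (demo helper only)."""
    root = tempfile.mkdtemp()
    for pid, argv in processes.items():
        os.makedirs(os.path.join(root, str(pid)))
        with open(os.path.join(root, str(pid), "cmdline"), "wb") as fh:
            fh.write(b"\0".join(a.encode() for a in argv) + b"\0")
    return root


# Made-up process table: one Celery worker and one unrelated process.
fake = make_fake_proc({1: ["celery", "-A", "app", "worker"], 7: ["sleep", "60"]})
print(find_processes("celery", fake))  # [1]
```

Splitting on NUL bytes is the only parsing needed, which is why the check works without procps installed.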
Problem 3: The Stuck Sync Operation
This was the most insidious issue. ArgoCD’s sync operation was stuck in the Running phase:
kubectl get application <project-a>-sandbox -n argocd \
-o jsonpath='{.status.operationState.phase}'
# Running
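The phase on its own does not say whether an operation is stuck or merely in progress; its age does. As a rough sketch, the check below takes the Application object as a Python dict (for example, parsed from `kubectl get application <name> -n argocd -o json`) and flags an operation that has been Running longer than a chosen threshold. The function name, the one-hour threshold, and the sample object are assumptions for illustration; `phase` and `startedAt` are the fields under `.status.operationState`.

```python
from datetime import datetime, timedelta, timezone


def is_sync_stuck(app: dict, now: datetime, max_age: timedelta) -> bool:
    """True if the operation has been in the Running phase longer than max_age."""
    state = app.get("status", {}).get("operationState") or {}
    if state.get("phase") != "Running":
        return False
    started = state.get("startedAt")
    if not started:
        return False
    # Timestamps are expected in the form 2024-01-01T10:00:00Z (UTC).
    started_at = datetime.strptime(started, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    return now - started_at > max_age


# Hypothetical, trimmed-down Application object; the dates are made up.
sample = {
    "status": {
        "operationState": {"phase": "Running", "startedAt": "2024-01-01T10:00:00Z"}
    }
}
now = datetime(2024, 1, 3, 10, 0, 0, tzinfo=timezone.utc)
print(is_sync_stuck(sample, now, timedelta(hours=1)))  # True
```

The threshold is a judgment call: it needs to be longer than the slowest legitimate sync for the application in question.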
The application used sync waves:
| Wave | Resources |
|---|---|
| 0 | Certificates, ReferenceGrants |
| 10 | ConfigMap, Secrets, Deployments, Services |
| 20 | HTTPRoutes |
ArgoCD applies each wave sequentially and waits for all resources in a wave to be healthy before proceeding. When deployments in wave 10 crashed, the wave never completed. The sync operation hung indefinitely.
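The gating behavior can be pictured with a toy model. This is not ArgoCD's implementation, only a sketch of the rule described above: apply waves in ascending order and stop at the first wave that does not become fully healthy. The function name `run_sync` and the set-based health check are assumptions; the wave contents mirror the table.

```python
from typing import Optional


def run_sync(
    waves: dict[int, list[str]], healthy: set[str]
) -> tuple[list[str], Optional[int]]:
    """Apply waves in order; return (applied resources, wave blocked on or None)."""
    applied: list[str] = []
    for wave in sorted(waves):
        applied.extend(waves[wave])
        # The next wave is only reached if everything in this one is healthy.
        if not all(resource in healthy for resource in waves[wave]):
            return applied, wave
    return applied, None


waves = {
    0: ["Certificate", "ReferenceGrant"],
    10: ["ConfigMap", "Secret", "Deployment", "Service"],
    20: ["HTTPRoute"],
}
# Everything would be healthy except the crashing Deployment.
healthy = {"Certificate", "ReferenceGrant", "ConfigMap", "Secret", "Service", "HTTPRoute"}

applied, blocked_on = run_sync(waves, healthy)
print(blocked_on)              # 10
print("HTTPRoute" in applied)  # False
```

One unhealthy resource in wave 10 is enough to keep wave 20 from ever being applied.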
The retry policy had limit: 5 with exponential backoff (5s, 10s, 20s, 40s, capped at 3m). After 5 failed attempts (~8 minutes), ArgoCD stopped retrying. New commits arrived with fixes, but ArgoCD could not start a new sync because the old operation was still Running.
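The backoff sequence is easy to reproduce. The sketch below assumes the delay before retry n is duration × factor^n, capped at maxDuration, which matches the sequence quoted above; `backoff_delays` is a hypothetical helper, not an ArgoCD API.

```python
def backoff_delays(
    duration: float, factor: float, max_duration: float, limit: int
) -> list[float]:
    """Delay in seconds before each retry: duration * factor**n, capped."""
    return [min(duration * factor**n, max_duration) for n in range(limit)]


# 5s base, factor 2, 3m cap, 5 retries (the policy described above).
delays = backoff_delays(duration=5, factor=2, max_duration=180, limit=5)
print(delays)       # [5, 10, 20, 40, 80]
print(sum(delays))  # 155
```

The waits alone add up to 155 seconds, so the cap of 3 minutes is never reached within five retries. The rest of the roughly 8 minutes is presumably the time each attempt spent waiting on unhealthy resources before it was counted as failed; that part is an inference, not something measured here.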
Terminating the Stuck Sync
The first step was clearing the stuck operation:
kubectl patch application <project-a>-sandbox -n argocd \
--type merge -p '{"operation": null}'
The second step was forcing a fresh sync from inside the ArgoCD server pod (since the ArgoCD CLI was not installed locally):
kubectl exec -n argocd argocd-server-88f8db87b-vnshh -- sh -c \
'argocd login localhost:8080 --insecure --plaintext \
--username admin --password $ARGOCD_PWD && \
argocd app sync <project-a>-sandbox \
--server localhost:8080 --insecure --plaintext --force --prune'
The sync completed in 10 seconds. All resources applied, pods started cleanly.
Preventing Recurrence
Two architectural changes prevent this class of failure:
1. Remove Sync Waves
Sync waves create ordering dependencies that can deadlock. For this application, the ordering was unnecessary:
- Certificates do not need to exist before deployments start
- HTTPRoutes reference Services, but Kubernetes handles missing backends gracefully: the route is accepted but returns 503 until backends are ready
- ConfigMaps and Deployments in the same wave still deadlock if the ConfigMap update is blocked
Removing all argocd.argoproj.io/sync-wave annotations allows ArgoCD to apply all resources simultaneously:
# Before (base/kustomization.yaml)
commonAnnotations:
  argocd.argoproj.io/sync-wave: "10"
# After — removed entirely
Sync waves remain useful when resources have genuine ordering requirements (e.g., CRDs before CRs, namespaces before resources). For standard application deployments, they add risk without benefit.
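Before removing annotations, it helps to know where they are. The following is a rough audit sketch that walks a manifest directory and reports the wave numbers each YAML file declares. It uses a plain regular expression, not a YAML parser, so it will also match commented-out lines; the function name and the demo files are made up for illustration.

```python
import re
import tempfile
from pathlib import Path

# Matches lines such as:  argocd.argoproj.io/sync-wave: "10"
WAVE_RE = re.compile(r"argocd\.argoproj\.io/sync-wave:\s*[\"']?(-?\d+)[\"']?")


def find_sync_waves(root: Path) -> dict[str, list[int]]:
    """Map each *.yaml file under root (relative path) to its wave numbers."""
    found: dict[str, list[int]] = {}
    for path in sorted(root.rglob("*.yaml")):
        waves = [int(value) for value in WAVE_RE.findall(path.read_text())]
        if waves:
            found[str(path.relative_to(root))] = waves
    return found


# Demonstration on a throwaway directory with two made-up manifests.
demo = Path(tempfile.mkdtemp())
(demo / "kustomization.yaml").write_text(
    'commonAnnotations:\n  argocd.argoproj.io/sync-wave: "10"\n'
)
(demo / "service.yaml").write_text("kind: Service\n")
print(find_sync_waves(demo))  # {'kustomization.yaml': [10]}
```

An empty result after the cleanup is a quick confirmation that no wave annotations remain in the repository.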
2. Set Unlimited Retries
With retry.limit: 5, ArgoCD gives up after a few minutes. Setting the limit to -1 (unlimited) ensures ArgoCD keeps retrying with exponential backoff:
syncPolicy:
  automated:
    prune: true
    selfHeal: true
  retry:
    limit: -1 # unlimited retries
    backoff:
      duration: 5s
      factor: 2
      maxDuration: 3m
With unlimited retries, when a fix commit lands, ArgoCD picks it up on the next retry cycle (at most 3 minutes later) and applies it automatically. The exponential backoff prevents excessive API calls while the application is broken.
This change was applied across all 19 ArgoCD Application manifests in 4 repositories.
Bonus: Diagnosing an Unrelated Application Bug
While investigating, another pair of degraded applications surfaced:
<project-b>-production Synced Degraded
<project-b>-sandbox Synced Degraded
Both showed identical Python tracebacks:
File "/usr/src/app/<core-system>/views.py", line 226, in <module>
@api_view(["GET"])
^^^^^^^^
NameError: name 'api_view' is not defined. Did you mean: 'APIView'?
A missing import (from rest_framework.decorators import api_view) in the application code broke all pods built from the latest image. The older pods from the previous ReplicaSet were still running and serving traffic: the Kubernetes rolling update strategy preserved availability even with a broken new deployment.
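The reason every new pod failed at startup, and not only when the endpoint was called, is that decorators are evaluated when the module body runs, which happens at import time. A minimal reproduction, using a made-up stand-in for views.py so that it needs no third-party packages:

```python
# Stand-in for views.py: the decorator name is used but never imported.
BROKEN_MODULE = '''
class APIView:
    pass

@api_view(["GET"])
def health(request):
    return "ok"
'''

try:
    # Running the module body is what importing the module does at startup.
    exec(compile(BROKEN_MODULE, "views.py", "exec"), {})
    failed_at_import = False
except NameError as exc:
    failed_at_import = True
    print(f"import failed: {exc}")

print(failed_at_import)  # True
```

The function `health` is never called, yet the module still fails to load. That is why the error shows `in <module>` in the traceback and why it takes down every process that imports the file.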
This is an application code fix, not an infrastructure fix. The old pods continue serving until the code is patched.
Also Fixed: ArgoCD Project Namespace Allowlists
Two applications showed Unknown status:
kubectl get application <project-c>-sandbox -n argocd \
-o jsonpath='{.status.conditions[0].message}'
application destination server 'https://kubernetes.default.svc' and
namespace '<project-c>-sandbox' do not match any of the allowed
destinations in project 'dev'
ArgoCD Projects act as a policy layer controlling which namespaces an application can deploy to. The <project-c>-sandbox and <project-c>-production namespaces were missing from the dev and prod project destination lists.
# platform/argocd/projects/dev.yaml
destinations:
  # ... existing namespaces ...
  - namespace: <project-c>-sandbox
    server: https://kubernetes.default.svc
After adding the namespaces and applying the project updates, both applications synced to Healthy immediately.
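The allowlist check itself is simple enough to sketch. The version below is a simplification and not ArgoCD's code: it treats each destination as a server and namespace pair, allows shell-style wildcards such as `*`, and ignores any other matching rules ArgoCD applies. The function name and the example namespaces are hypothetical.

```python
from fnmatch import fnmatchcase

K8S = "https://kubernetes.default.svc"


def destination_allowed(destinations: list[dict], server: str, namespace: str) -> bool:
    """True if (server, namespace) matches at least one allowed destination."""
    return any(
        fnmatchcase(server, dest.get("server", ""))
        and fnmatchcase(namespace, dest.get("namespace", ""))
        for dest in destinations
    )


# Made-up project allowlist that predates the new application.
dev_destinations = [{"namespace": "existing-sandbox", "server": K8S}]
print(destination_allowed(dev_destinations, K8S, "example-sandbox"))  # False

# Adding the namespace to the project makes the destination valid.
dev_destinations.append({"namespace": "example-sandbox", "server": K8S})
print(destination_allowed(dev_destinations, K8S, "example-sandbox"))  # True
```

Running a check like this in CI against the project manifests would surface a missing destination before the application ever reaches the cluster.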
Results
| Metric | Before | After |
|---|---|---|
| Healthy apps | 10/16 | 14/16 |
| <project-a> pod restarts | 288 (26h) | 0 |
| Stuck sync operations | 1 | 0 |
| Apps with unlimited retries | 0 | 19 |
| Apps with sync wave deadlock risk | 2 | 0 |
The two remaining Degraded apps (<project-b>-*) require an application code fix.
Production Rules
- Sync waves are a liability for standard deployments. They introduce ordering dependencies that create deadlocks when any resource in a wave fails. Reserve sync waves for genuine ordering requirements (CRDs before CRs).
- Set retry.limit: -1. A limit of 5 causes ArgoCD to abandon broken applications after a few minutes. Unlimited retries with exponential backoff ensure fix commits are picked up automatically without manual intervention.
- Liveness probes must use tools present in the container image. Minimal images may lack ps, curl, or wget. The /proc filesystem is always available on Linux and provides process information without additional packages.
- A stuck sync operation blocks all future syncs. Clear it with kubectl patch application <name> -n argocd --type merge -p '{"operation": null}' before forcing a fresh sync.
- ArgoCD Project destination lists are a common blind spot. When adding a new application, verify the target namespace is in the project’s allowed destinations. The InvalidSpecError appears in .status.conditions but is easy to miss in the UI.
- Compare working and broken environments. Production was partially working while sandbox was completely broken. Comparing the two revealed which issues were image-specific (liveness probe) versus sync-specific (stuck operation).