An ArgoCD application reported OutOfSync/Degraded for two days. Every pod was in CrashLoopBackOff, and pushing fixes to the repository had no effect. The sync operation was stuck in the Running phase, blocking all new syncs, including the ones containing the fix. Three problems ran concurrently; untangling them required working through each one.

The Symptoms

The ArgoCD dashboard showed two applications in a bad state:

NAME                       SYNC STATUS   HEALTH STATUS
<project-a>-sandbox        OutOfSync     Degraded
<project-a>-production     Synced        Progressing

Checking pods confirmed the damage:

kubectl get pods -n <namespace>

NAME                              READY   STATUS             RESTARTS
celery-beat-9d6cb559c-6r4j2       0/1     CrashLoopBackOff   51
celery-worker-6c7f9b95ff-fd4ql    0/1     CrashLoopBackOff   38
<project-a>-web-6988d5b5c4-nz     0/1     CrashLoopBackOff   35

Three different pods, three different crash reasons.

Problem 1: Web Pod — Missing Environment Variable

The web pod logs showed a clear error:

Starting entrypoint script...
Error: DATABASE environment variable is not set.

Investigation

The DATABASE env var had been moved between commits. It was originally defined in the Kustomize overlay patch (patch-env-web-sandbox.yaml); a later commit moved it to the base ConfigMap (configmap-app.yaml) and removed it from the overlay.

The base ConfigMap on the cluster still had the old version — without DATABASE. The new ConfigMap could not be applied because the ArgoCD sync was stuck.

Root Cause Chain

A sync operation started against an older commit
The sync applied deployment changes (new image tag, removed DATABASE from env list)
The sync waited for deployments to become healthy before applying remaining resources
Pods crashed because the ConfigMap (still old) lacked DATABASE
ArgoCD could not complete the sync → could not apply the new ConfigMap → deadlock

Problem 2: Celery Worker — Liveness Probe Failure

The celery-worker pods showed a different crash pattern. Logs indicated a clean startup:

[INFO/MainProcess] Connected to amqp://...@rabbitmq.<namespace>.svc.cluster.local:5672/...
[INFO/MainProcess] celery@celery-worker-7f89fb6cb-pjk4d ready.

worker: Warm shutdown (MainProcess)

The worker started, connected to RabbitMQ, then received a graceful shutdown signal. Kubernetes events revealed the cause:

kubectl get events -n <namespace> --field-selector reason=Unhealthy

Warning  Unhealthy  Liveness probe failed: sh: 1: ps: not found

The Broken Probe

The base deployment defined this liveness probe:

livenessProbe:
  exec:
    command:
      - sh
      - -c
      - "ps aux | grep '[c]elery'"
  initialDelaySeconds: 30
  periodSeconds: 60
  failureThreshold: 3

The Docker image was built on a minimal Python base that does not include procps. The ps command does not exist, so the probe fails every 60 seconds. After 3 consecutive failures (3 minutes), Kubernetes kills the container. The container restarts, runs for 3 minutes, gets killed again. Production had accumulated 288 restarts in 26 hours.
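
The restart count can be sanity-checked against the probe settings. The arithmetic below uses only the figures quoted above. Attributing the gap between the 3-minute probe window and the observed cycle to the initial delay and restart back-off is an assumption, not something the events confirmed.

```python
# Figures quoted in the text above.
restarts = 288
window_hours = 26
period_seconds = 60
failure_threshold = 3

# Mean time between restarts over the observation window.
mean_cycle_seconds = window_hours * 3600 / restarts

# Time spent failing probes before the container is restarted, as described above.
probe_window_seconds = period_seconds * failure_threshold

print(f"mean restart cycle: {mean_cycle_seconds:.0f}s")
print(f"probe failure window: {probe_window_seconds}s")
print(f"remainder per cycle: {mean_cycle_seconds - probe_window_seconds:.0f}s")
```

The mean cycle works out to 325 seconds, of which 180 are the probe window. The remaining 145 seconds per cycle are plausibly startup delay plus restart back-off.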

The Fix

Replace ps aux | grep with a /proc filesystem check that works on any Linux container:

livenessProbe:
  exec:
    command:
      - sh
      - -c
      - "grep -q '[c]elery' /proc/[0-9]*/cmdline 2>/dev/null"
  initialDelaySeconds: 30
  periodSeconds: 60
  failureThreshold: 3

The /proc/[pid]/cmdline file contains the command line of each running process. Searching these files for “celery” accomplishes the same check as ps aux | grep celery without requiring any additional packages.
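
For readers who want to see the mechanics, here is a minimal Python sketch of the same /proc scan. It is not the probe that was deployed. The function name `process_running` and the `proc_root` parameter are hypothetical, added so the logic can be exercised against a temporary directory. The sketch skips its own PID, since a checker whose own command line contains the search term would otherwise match itself.

```python
import os
from pathlib import Path


def process_running(needle: str, proc_root: str = "/proc") -> bool:
    """Return True if any process other than this one has `needle` in its command line."""
    own_pid = str(os.getpid())
    for entry in Path(proc_root).iterdir():
        # Only numeric directory names are process IDs; skip our own entry.
        if not entry.name.isdigit() or entry.name == own_pid:
            continue
        try:
            raw = (entry / "cmdline").read_bytes()
        except OSError:
            # The process exited between listing and reading; ignore it.
            continue
        # Arguments are NUL-separated; join them with spaces for matching.
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace")
        if needle in cmdline:
            return True
    return False
```

Calling `process_running("celery")` inside the worker container would report whether a Celery process is present, using nothing beyond the Python interpreter the image already ships.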

Problem 3: The Stuck Sync Operation

This was the most insidious issue. ArgoCD’s sync operation was stuck in the Running phase:

kubectl get application <project-a>-sandbox -n argocd \
  -o jsonpath='{.status.operationState.phase}'
# Running

The application used sync waves:

Wave	Resources
0	Certificates, ReferenceGrants
10	ConfigMap, Secrets, Deployments, Services
20	HTTPRoutes

ArgoCD applies each wave sequentially and waits for all resources in a wave to be healthy before proceeding. When deployments in wave 10 crashed, the wave never completed. The sync operation hung indefinitely.
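
The wave gating can be illustrated with a toy model. This is a sketch of the behavior described above, not ArgoCD's implementation. The function `run_sync` and its health callback are hypothetical, and the model simply stops where a real sync would keep waiting.

```python
from typing import Callable


def run_sync(
    waves: dict[int, list[str]], is_healthy: Callable[[str], bool]
) -> tuple[list[str], bool]:
    """Apply waves in ascending order; report what was applied and whether the sync finished."""
    applied: list[str] = []
    for wave in sorted(waves):
        applied.extend(waves[wave])
        if not all(is_healthy(name) for name in waves[wave]):
            # A real sync would keep waiting here; the model just stops.
            return applied, False
    return applied, True


waves = {
    0: ["Certificate", "ReferenceGrant"],
    10: ["ConfigMap", "Secret", "Deployment", "Service"],
    20: ["HTTPRoute"],
}

# Assume the Deployment never becomes healthy because its pods are crash-looping.
applied, finished = run_sync(waves, is_healthy=lambda name: name != "Deployment")
print(applied, finished)
```

In this model the wave 20 resources are never applied and the operation never reports completion, which matches the stuck Running phase.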

The retry policy had limit: 5 with exponential backoff (5s, 10s, 20s, 40s, capped at 3m). After 5 failed attempts (~8 minutes), ArgoCD stopped retrying. New commits arrived with fixes, but ArgoCD could not start a new sync because the old operation was still Running.
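
The backoff schedule can be reproduced with a few lines of arithmetic. The sketch assumes each delay is the base duration multiplied by the factor once per previous retry and capped at the maximum. That matches the sequence quoted above but is not taken from ArgoCD's source. The function name `backoff_delays` is hypothetical.

```python
def backoff_delays(duration: float, factor: float, max_duration: float, limit: int) -> list[float]:
    """Delay before each retry: duration * factor**n, capped at max_duration."""
    return [min(duration * factor**n, max_duration) for n in range(limit)]


# Values from the retry policy: 5s base, factor 2, 3m cap, 5 retries.
delays = backoff_delays(duration=5, factor=2, max_duration=180, limit=5)
print(delays)       # [5, 10, 20, 40, 80]
print(sum(delays))  # 155
```

Under that assumption the waits alone total 155 seconds, so most of the roughly 8 minutes would have been spent inside the sync attempts themselves rather than between them.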

Terminating the Stuck Sync

The first step was clearing the stuck operation:

kubectl patch application <project-a>-sandbox -n argocd \
  --type merge -p '{"operation": null}'

The next step was forcing a fresh sync from inside the ArgoCD server pod, since the ArgoCD CLI was not installed locally:

kubectl exec -n argocd argocd-server-88f8db87b-vnshh -- sh -c \
  'argocd login localhost:8080 --insecure --plaintext \
    --username admin --password "$ARGOCD_PWD" && \
   argocd app sync <project-a>-sandbox \
    --server localhost:8080 --insecure --plaintext --force --prune'

The sync completed in 10 seconds. All resources applied, pods started cleanly.
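
A stuck operation like this one can be detected before it costs two days. The following is a minimal sketch that inspects the JSON form of an Application and flags an operation that has been Running longer than a threshold. It is illustrative, not part of the original remediation. The function name `stuck_operation`, the sample document, and the one-hour threshold are all hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone


def stuck_operation(app: dict, now: datetime, max_age: timedelta) -> bool:
    """True if the Application has a sync operation Running for longer than max_age."""
    state = app.get("status", {}).get("operationState") or {}
    started_at = state.get("startedAt")
    if state.get("phase") != "Running" or not started_at:
        return False
    # Timestamps are RFC 3339 with a trailing Z; normalize for fromisoformat.
    started = datetime.fromisoformat(started_at.replace("Z", "+00:00"))
    return now - started > max_age


# Hypothetical, trimmed output of `kubectl get application <name> -n argocd -o json`.
sample = json.loads("""
{"status": {"operationState": {"phase": "Running", "startedAt": "2024-01-01T00:00:00Z"}}}
""")
now = datetime(2024, 1, 3, tzinfo=timezone.utc)
print(stuck_operation(sample, now, max_age=timedelta(hours=1)))
```

The threshold is a judgment call: it needs to be longer than the slowest legitimate sync for the application in question.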

Preventing Recurrence

Two architectural changes prevent this class of failure:

1. Remove Sync Waves

Sync waves create ordering dependencies that can deadlock. For this application, the ordering was unnecessary:

Certificates do not need to exist before deployments start
HTTPRoutes reference Services, but Kubernetes handles missing backends gracefully: the route is accepted and returns 503 until the backends are ready
ConfigMaps and Deployments in the same wave still deadlock if the ConfigMap update is blocked

Removing all argocd.argoproj.io/sync-wave annotations allows ArgoCD to apply all resources simultaneously:

# Before (base/kustomization.yaml)
commonAnnotations:
  argocd.argoproj.io/sync-wave: "10"

# After — removed entirely

Sync waves remain useful when resources have genuine ordering requirements (e.g., CRDs before CRs, namespaces before resources). For standard application deployments, they add risk without benefit.

2. Set Unlimited Retries

With retry.limit: 5, ArgoCD gives up after a few minutes. Setting the limit to -1 (unlimited) ensures ArgoCD keeps retrying with exponential backoff:

syncPolicy:
  automated:
    prune: true
    selfHeal: true
  retry:
    limit: -1  # unlimited retries
    backoff:
      duration: 5s
      factor: 2
      maxDuration: 3m

With unlimited retries, when a fix commit lands, ArgoCD picks it up on the next retry cycle (at most 3 minutes later) and applies it automatically. The exponential backoff prevents excessive API calls while the application is broken.

This change was applied across all 19 ArgoCD Application manifests in 4 repositories.
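
Rolling a setting out across many manifests is easier to verify with a small audit. The sketch below assumes the Application manifests have already been parsed into dictionaries (for example from `kubectl get applications -n argocd -o json`). The helper names and sample entries are hypothetical.

```python
def retry_limit(app: dict):
    """Return spec.syncPolicy.retry.limit, or None if no retry policy is set."""
    retry = (app.get("spec", {}).get("syncPolicy") or {}).get("retry") or {}
    return retry.get("limit")


def apps_without_unlimited_retry(apps: list[dict]) -> list[str]:
    """Names of Applications whose retry limit is missing or non-negative."""
    flagged = []
    for app in apps:
        limit = retry_limit(app)
        if limit is None or limit >= 0:
            flagged.append(app["metadata"]["name"])
    return flagged


# Hypothetical sample: one bounded, one unlimited, one with no retry policy at all.
apps = [
    {"metadata": {"name": "app-bounded"}, "spec": {"syncPolicy": {"retry": {"limit": 5}}}},
    {"metadata": {"name": "app-unlimited"}, "spec": {"syncPolicy": {"retry": {"limit": -1}}}},
    {"metadata": {"name": "app-no-retry"}, "spec": {}},
]
print(apps_without_unlimited_retry(apps))
```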

Bonus: Diagnosing an Unrelated Application Bug

While investigating, another pair of degraded applications surfaced:

<project-b>-production   Synced   Degraded
<project-b>-sandbox      Synced   Degraded

Both showed identical Python tracebacks:

File "/usr/src/app/<core-system>/views.py", line 226, in <module>
    @api_view(["GET"])
     ^^^^^^^^
NameError: name 'api_view' is not defined. Did you mean: 'APIView'?

A missing import (from rest_framework.decorators import api_view) in the application code broke all pods built from the latest image. The older pods from the previous ReplicaSet were still running and serving traffic: the Kubernetes rolling update strategy preserved availability even with a broken new deployment.

This is an application code fix, not an infrastructure fix. The old pods continue serving until the code is patched.
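
The traceback points at module import time rather than request time because decorators are evaluated when the function is defined. The self-contained sketch below reproduces that behavior without Django REST Framework. The source string and the stand-in for `APIView` are hypothetical.

```python
# A view module that uses @api_view without importing it.
source = '''
@api_view(["GET"])
def health(request):
    return "ok"
'''

# Stand-in namespace: APIView is available, api_view is not.
namespace = {"APIView": object}
error = None
try:
    # Executing the module body evaluates the decorator immediately.
    exec(compile(source, "views.py", "exec"), namespace)
except NameError as exc:
    error = exc

print(error)
print("health" in namespace)
```

The function is never bound in the namespace, so the module fails to load as a whole. That is why every pod built from the image crashed on startup instead of failing on a single endpoint.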

Also Fixed: ArgoCD Project Namespace Allowlists

Two applications showed Unknown status:

kubectl get application <project-c>-sandbox -n argocd \
  -o jsonpath='{.status.conditions[0].message}'

application destination server 'https://kubernetes.default.svc' and
namespace '<project-c>-sandbox' do not match any of the allowed
destinations in project 'dev'

ArgoCD Projects act as a policy layer controlling which namespaces an application can deploy to. The <project-c>-sandbox and <project-c>-production namespaces were missing from the dev and prod project destination lists.

# platform/argocd/projects/dev.yaml
destinations:
  # ... existing namespaces ...
  - namespace: <project-c>-sandbox
    server: https://kubernetes.default.svc

After adding the namespaces and applying the project updates, both applications synced to Healthy immediately.
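
The policy check can be approximated before deploying a new application. The sketch below is a simplified model that assumes a destination matches when both its server and namespace patterns match, with shell-style wildcards. It ignores other matching features ArgoCD supports, such as cluster names. The function name `destination_allowed` is hypothetical.

```python
from fnmatch import fnmatchcase


def destination_allowed(server: str, namespace: str, destinations: list[dict]) -> bool:
    """Simplified check: server and namespace must both match a single destination entry."""
    return any(
        fnmatchcase(server, d.get("server", "")) and fnmatchcase(namespace, d.get("namespace", ""))
        for d in destinations
    )


# Project destinations before the fix, using the placeholder names from this article.
destinations = [
    {"namespace": "<project-a>-sandbox", "server": "https://kubernetes.default.svc"},
]
server = "https://kubernetes.default.svc"

print(destination_allowed(server, "<project-a>-sandbox", destinations))
print(destination_allowed(server, "<project-c>-sandbox", destinations))
```

Running a check like this against the project definition when a new Application manifest is added would surface the mismatch before ArgoCD reports Unknown.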

Results

Metric	Before	After
Healthy apps	10/16	14/16
<project-a> pod restarts	288 (26h)	0
Stuck sync operations	1	0
Apps with unlimited retries	0	19
Apps with sync wave deadlock risk	2	0

The two remaining Degraded apps (<project-b>-*) require an application code fix.

Production Rules

Sync waves are a liability for standard deployments. They introduce ordering dependencies that create deadlocks when any resource in a wave fails. Reserve sync waves for genuine ordering requirements (CRDs before CRs).
Set retry.limit: -1. A limit of 5 causes ArgoCD to abandon broken applications after a few minutes. Unlimited retries with exponential backoff ensure fix commits are picked up automatically, without manual intervention.
Liveness probes must use tools present in the container image. Minimal images may lack ps, curl, or wget. The /proc filesystem is always available on Linux and provides process information without additional packages.
A stuck sync operation blocks all future syncs. Clear it with kubectl patch application <name> -n argocd --type merge -p '{"operation": null}' before forcing a fresh sync.
ArgoCD Project destination lists are a common blind spot. When adding a new application, verify the target namespace is in the project’s allowed destinations. The InvalidSpecError appears in .status.conditions but is easy to miss in the UI.
Compare working and broken environments. Production was partially working while sandbox was completely broken. Comparing the two revealed which issues were image-specific (liveness probe) versus sync-specific (stuck operation).