Breaking the cert-manager and Gateway API Bootstrap Deadlock

After restoring a crashed Traefik ingress controller, every HTTPS endpoint across 20+ namespaces was returning a 404 with a self-signed certificate — not the expected Let’s Encrypt certificates. Even routes that had been serving production traffic for weeks were broken.

I traced this to a bootstrap deadlock between cert-manager and the Kubernetes Gateway API: cert-manager needed Traefik to route HTTP traffic to complete ACME certificate challenges, but Traefik needed the resulting TLS secrets to exist before it would program the Gateway — and therefore route any traffic at all.

Environment

Component	Detail
Ingress controller	Traefik v3.6.6 with Gateway API provider
Certificate manager	cert-manager with Let’s Encrypt (HTTP-01 challenge)
Routing model	`Gateway`, `HTTPRoute`, `GatewayClass`
TLS reference model	Cross-namespace via `ReferenceGrant`

Step 1 — Observing the Symptom

With Traefik running, a curl sweep of all HTTPS endpoints showed:

curl -o /dev/null -sk -w "%{http_code}" https://app.example.com
# 404

Every endpoint returned 404. More telling was the TLS certificate being presented:

echo | openssl s_client -connect app.example.com:443 -servername app.example.com 2>/dev/null \
  | openssl x509 -noout -issuer -enddate

issuer=CN = TRAEFIK DEFAULT CERT
notAfter=Mar 23 08:13:18 2027 GMT

Traefik was serving its default self-signed certificate, not a Let’s Encrypt certificate. The notAfter timestamp lined up with the time of day Traefik had restarted minutes earlier, which fits a default certificate generated at startup with a fixed validity period. Traefik had started fresh and had not loaded any of the real TLS certificates from Kubernetes secrets.
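
To run the same issuer check across several hosts, a small wrapper helps. This is a minimal sketch and not part of the original investigation: the function names classify_issuer and sweep_issuers are hypothetical, and matching on the text "Let's Encrypt" assumes the issuer line contains that organisation name.

```shell
# Classify an issuer line as printed by `openssl x509 -noout -issuer`.
# The labels (default / letsencrypt / other) are illustrative names only.
classify_issuer() {
  case "$1" in
    *"TRAEFIK DEFAULT CERT"*) echo "default" ;;
    *"Let's Encrypt"*)        echo "letsencrypt" ;;
    *)                        echo "other" ;;
  esac
}

# Print "<classification><TAB><host>" for every host passed as an argument.
# The body runs in a subshell so its variables do not leak into the caller.
sweep_issuers() (
  for host in "$@"; do
    issuer=$(echo | openssl s_client -connect "${host}:443" -servername "$host" 2>/dev/null \
      | openssl x509 -noout -issuer 2>/dev/null)
    printf '%s\t%s\n' "$(classify_issuer "$issuer")" "$host"
  done
)

# Usage: sweep_issuers app.example.com api.example.com identity.example.com
```

A host that cannot be reached yields an empty issuer and is reported as "other", so that label needs a second look before drawing conclusions.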

Step 2 — Checking the Gateway Status

In the Kubernetes Gateway API model, a Gateway resource defines named listeners. Each listener binds a port and protocol, optionally a hostname, and for HTTPS a TLS secret. If a listener fails to resolve its configuration, that failure surfaces as Programmed: False in the Gateway’s status.

kubectl describe gateway traefik-gateway -n traefik | tail -40

The output revealed the problem immediately:

Conditions:
  Type:     Programmed
  Status:   False
  Reason:   Invalid
  Message:  Error while retrieving certificate: getting secret: secret "app-prod-tls-secret" not found

  Type:     Programmed
  Status:   False
  Reason:   Invalid
  Message:  Error while retrieving certificate: getting secret: secret "app-sandbox-tls-secret" not found

The two listeners in this excerpt were in a Programmed: False state because their referenced TLS secrets did not exist in the cluster.

The specific check for listener status:

kubectl get gateway traefik-gateway -n traefik \
  -o jsonpath='{range .status.listeners[*]}{.name}{"\t"}{.conditions[?(@.type=="Programmed")].status}{"\n"}{end}'

web                    True
websecure-keycloak     True
websecure-directus     True
...
websecure-app-prod     False
websecure-app-sandbox  False

19 out of 23 listeners were True. Four were False, two of which appear in the excerpt above.
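
With 23 listeners, picking the failures out by eye is error-prone. The filter below is a small sketch that reads the tab-separated output of the jsonpath query above and prints only the listeners that are not programmed. The function name failed_listeners is hypothetical.

```shell
# Read "name<TAB>status" lines on stdin and print the names of listeners
# whose Programmed status is anything other than "True" (including empty).
failed_listeners() {
  awk -F '\t' 'NF && $2 != "True" { print $1 }'
}

# Usage (same jsonpath as above):
#   kubectl get gateway traefik-gateway -n traefik \
#     -o jsonpath='{range .status.listeners[*]}{.name}{"\t"}{.conditions[?(@.type=="Programmed")].status}{"\n"}{end}' \
#     | failed_listeners
```

Piping the result through wc -l gives the failure count directly.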

Step 3 — The Critical Gateway API Behaviour

This is where the Gateway API diverges significantly from the legacy Ingress model.

With Ingress resources, each Ingress object is independent. If one references a missing TLS secret, only that specific Ingress is affected — all others continue working.

With Gateway API, a single invalid listener blocks the entire Gateway from being programmed.

When even one listener fails ResolvedRefs (because its TLS secret doesn’t exist), Traefik marks the Gateway’s overall Programmed condition as False and stops routing all traffic across all listeners — including the 19 listeners whose configuration was perfectly valid.

This is the architectural reason every HTTPS route was broken even though only four TLS secrets were missing.
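
The Gateway status carries conditions at two levels: one list per listener and one list for the Gateway as a whole. The earlier query reads the per-listener list. As a sketch, the Gateway-level Programmed condition can be read on its own like this (the function name gateway_programmed is hypothetical):

```shell
# Print the Gateway-level Programmed condition status (True/False/Unknown).
# Arguments: <gateway-name> <namespace>
gateway_programmed() {
  kubectl get gateway "$1" -n "$2" \
    -o jsonpath='{.status.conditions[?(@.type=="Programmed")].status}'
}

# Usage: gateway_programmed traefik-gateway traefik
```

Comparing this value with the per-listener statuses shows whether a failing listener has been propagated to the whole Gateway.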

Step 4 — Tracing the Missing Secrets to cert-manager

The missing secrets were expected to be created by cert-manager after completing ACME HTTP-01 certificate challenges. Checking cert-manager’s state:

kubectl get certificate -A | grep -v True

app-production   app-prod-tls     False   app-prod-tls-secret     17h
app-sandbox      app-sandbox-tls  False   app-sandbox-tls-secret  17h

Both certificates had been Ready: False for 17 hours. Checking the underlying ACME challenges:

kubectl get challenges -A

NAMESPACE        NAME                            STATE     DOMAIN           AGE
app-production   app-prod-tls-1-...-challenge    pending   app.example.com  17h
app-sandbox      app-sandbox-tls-1-...-challenge pending   app.example.com  17h

Both challenges had been stuck in pending for 17 hours. The challenge detail revealed why:

kubectl describe challenge -n app-production <challenge-name>

Status:
  Presented:   true
  Processing:  true
  Reason:      Waiting for HTTP-01 challenge propagation: did not get expected response
               when querying endpoint, expected "<token>.<thumbprint>" but got:
  State:       pending

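The ACME solver was presenting the challenge token (a temporary HTTP route served by a short-lived pod), but requests for it were coming back empty. This message comes from the self-check cert-manager runs before it asks Let’s Encrypt to validate, so the challenge was failing before Let’s Encrypt was even involved.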
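
The failing request can be repeated by hand. The sketch below assumes the Challenge resource exposes the hostname, token, and expected response as .spec.dnsName, .spec.token, and .spec.key, which is my understanding of cert-manager’s Challenge schema. The function names are hypothetical. The URL path is the standard HTTP-01 location.

```shell
# Build the well-known HTTP-01 URL for a host and token.
acme_challenge_url() {
  printf 'http://%s/.well-known/acme-challenge/%s\n' "$1" "$2"
}

# Fetch the challenge URL for one Challenge resource and compare the body
# with the key cert-manager expects. Prints "ok" or "mismatch".
# Arguments: <namespace> <challenge-name>
check_challenge() (
  ns=$1; name=$2
  host=$(kubectl get challenge "$name" -n "$ns" -o jsonpath='{.spec.dnsName}')
  token=$(kubectl get challenge "$name" -n "$ns" -o jsonpath='{.spec.token}')
  expected=$(kubectl get challenge "$name" -n "$ns" -o jsonpath='{.spec.key}')
  got=$(curl -s --max-time 8 "$(acme_challenge_url "$host" "$token")")
  if [ -n "$expected" ] && [ "$got" = "$expected" ]; then
    echo "ok"
  else
    echo "mismatch"
    exit 1   # exits the subshell only, so this becomes the function's status
  fi
)

# Usage: check_challenge app-production <challenge-name>
```

An empty body here, as in the status message above, points at routing rather than at the solver pod.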

Step 5 — Identifying the Deadlock

The full causal chain:

cert-manager ACME HTTP-01 challenge
  → requires Traefik to route HTTP traffic to the solver pod
  → requires the Gateway to be Programmed
  → requires all listener TLS secrets to exist
  → requires cert-manager to have completed the ACME challenge
  → [DEADLOCK]

The two new listeners had been added to the Gateway referencing TLS secrets that cert-manager was supposed to create. But cert-manager couldn’t create those secrets because the ACME challenge couldn’t be validated. And the challenge couldn’t be validated because the Gateway wasn’t routing. And the Gateway wasn’t routing because the secrets didn’t exist.

Neither system could make progress without the other acting first.
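
The precondition for this deadlock, a listener pointing at a secret that does not exist, can be checked directly. The following is a sketch under two assumptions: each listener has at most one certificateRef, and kubectl’s jsonpath prints an empty string for listeners without a tls block. The function name missing_listener_secrets is hypothetical.

```shell
# Print "<namespace>/<secret>" for every TLS secret referenced by a Gateway
# listener that does not currently exist. A certificateRef without a
# namespace is resolved in the Gateway's own namespace.
# Arguments: <gateway-name> <gateway-namespace>
missing_listener_secrets() (
  gw=$1; gw_ns=$2
  kubectl get gateway "$gw" -n "$gw_ns" \
    -o jsonpath='{range .spec.listeners[*]}{.tls.certificateRefs[*].namespace}{"|"}{.tls.certificateRefs[*].name}{"\n"}{end}' \
  | while IFS='|' read -r ns name; do
      [ -n "$name" ] || continue   # listener without TLS, e.g. plain HTTP
      ns=${ns:-$gw_ns}
      kubectl get secret "$name" -n "$ns" >/dev/null 2>&1 </dev/null \
        || echo "$ns/$name"
    done
)

# Usage: missing_listener_secrets traefik-gateway traefik
```

Any line of output is a listener that cannot resolve until something creates that secret.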

Step 6 — Verifying the ReferenceGrant Setup

In the Gateway API model, secrets referenced by a Gateway listener must either be in the same namespace as the Gateway or be explicitly permitted via a ReferenceGrant. I verified this to rule out a permissions issue:

kubectl get referencegrant -A | grep app

app-production   allow-traefik-app-prod-tls    17h
app-sandbox      allow-traefik-app-sandbox-tls 17h

Both ReferenceGrant resources existed and were correctly scoped. The secrets simply did not exist yet — there was no permissions problem, only a timing one.
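
Listing the grants only proves they exist. To check their scope, the from and to entries can be printed. This is a sketch with a hypothetical function name, show_grant. For the grants above one would expect a from entry of kind Gateway in the traefik namespace and a to entry of kind Secret.

```shell
# Print each "from" entry of a ReferenceGrant as "<kind>/<namespace>" and
# each "to" entry as "-> <kind>/<name>". The name is empty when the grant
# covers every object of that kind in the grant's namespace.
# Arguments: <referencegrant-name> <namespace>
show_grant() {
  kubectl get referencegrant "$1" -n "$2" \
    -o jsonpath='{range .spec.from[*]}{.kind}{"/"}{.namespace}{"\n"}{end}{range .spec.to[*]}{"-> "}{.kind}{"/"}{.name}{"\n"}{end}'
}

# Usage: show_grant allow-traefik-app-prod-tls app-production
```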

Step 7 — The Fix: Injecting Temporary TLS Secrets

The solution was to inject self-signed TLS secrets under the exact names the Gateway listeners expected. This would:

Allow all Gateway listeners to resolve their certificateRef references
Allow the Gateway to become Programmed: True
Allow Traefik to begin routing all traffic — including the ACME solver HTTP routes
Allow cert-manager’s HTTP-01 challenge validation to succeed
Allow cert-manager to create the real Let’s Encrypt secrets, which would automatically replace the temporary ones

I generated a temporary self-signed certificate and created a secret from it under each of the four missing names:

# Generate a temporary self-signed certificate
openssl req -x509 -nodes -newkey rsa:2048 -days 1 \
  -keyout /tmp/tls.key -out /tmp/tls.crt \
  -subj "/CN=temp-bootstrap" 2>/dev/null

# Create the missing TLS secrets in the correct namespaces
for ns_secret in \
  "app-production/app-prod-tls-secret" \
  "app-sandbox/app-sandbox-tls-secret" \
  "service-production/service-prod-tls-secret" \
  "service-sandbox/service-sandbox-tls-secret"; do
  ns="${ns_secret%%/*}"
  secret="${ns_secret##*/}"
  kubectl create secret tls "$secret" -n "$ns" \
    --cert=/tmp/tls.crt --key=/tmp/tls.key \
    --dry-run=client -o yaml | kubectl apply -f -
done
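
The loop above uses kubectl apply, which also rewrites a secret that already exists. A more cautious variant, sketched here with a hypothetical function name, creates the placeholder only when nothing exists under that name, so a certificate that cert-manager has already issued is left alone.

```shell
# Create a TLS secret only when none exists under that name.
# Arguments: <namespace> <secret-name> <cert-file> <key-file>
ensure_placeholder_secret() (
  ns=$1; secret=$2; crt=$3; key=$4
  if kubectl get secret "$secret" -n "$ns" >/dev/null 2>&1; then
    echo "skip $ns/$secret (already exists)"
  else
    kubectl create secret tls "$secret" -n "$ns" --cert="$crt" --key="$key" >/dev/null \
      && echo "created $ns/$secret"
  fi
)

# Usage:
#   ensure_placeholder_secret app-production app-prod-tls-secret /tmp/tls.crt /tmp/tls.key
```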

Step 8 — Forcing Gateway Reconciliation

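After creating the secrets, I forced an immediate reconcile by annotating the Gateway object. Traefik’s controller watches referenced secrets for changes, so this step should not be strictly necessary, but an annotation update changes the Gateway object itself and triggered reconciliation faster. The annotation key below is arbitrary; any change to the object’s metadata should have the same effect: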

kubectl annotate gateway traefik-gateway -n traefik \
  "kubectl.kubernetes.io/last-applied-restart=$(date -u +%s)" \
  --overwrite

After a few seconds, all listeners were programmed:

kubectl get gateway traefik-gateway -n traefik \
  -o jsonpath='{range .status.listeners[*]}{.name}{"\t"}{.conditions[?(@.type=="Programmed")].status}{"\n"}{end}'

web                    True
websecure-keycloak     True
websecure-directus     True
...
websecure-app-prod     True
websecure-app-sandbox  True

All 23 listeners now reported Programmed: True.
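
Instead of re-running the query until the statuses flip, kubectl wait can block until the Gateway-level condition holds. A minimal sketch, with a hypothetical wrapper name and an assumed default timeout of 60 seconds:

```shell
# Block until the Gateway reports condition Programmed=True, or time out.
# Arguments: <gateway-name> <namespace> [timeout, default 60s]
wait_gateway_programmed() {
  kubectl wait --for=condition=Programmed "gateway/$1" -n "$2" --timeout="${3:-60s}"
}

# Usage: wait_gateway_programmed traefik-gateway traefik
```

This checks the Gateway’s top-level conditions only, so the per-listener query is still worth running once afterwards.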

Step 9 — Verifying Route Recovery

I confirmed recovery with a curl sweep across all HTTPS endpoints:

for host in app.example.com api.example.com identity.example.com; do
  code=$(curl -o /dev/null -sk -w "%{http_code}" --max-time 8 "https://${host}")
  echo "$code  $host"
done

200  app.example.com
302  api.example.com
302  identity.example.com

Routes that had been broken for 17+ hours were responding correctly.
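
Working routes do not by themselves show that cert-manager has replaced the temporary certificates. As a sketch, the Certificate resources from Step 4 can be waited on until they report Ready. The function name and the 300-second timeout are assumptions.

```shell
# Wait for each "<namespace>/<certificate>" argument to report Ready.
# Prints one line per certificate that did not become Ready in time and
# returns non-zero if any failed.
wait_certificates_ready() (
  rc=0
  for ref in "$@"; do
    ns=${ref%%/*}; cert=${ref##*/}
    if ! kubectl wait --for=condition=Ready "certificate/$cert" -n "$ns" \
         --timeout=300s >/dev/null 2>&1; then
      echo "not ready: $ref"
      rc=1
    fi
  done
  exit "$rc"   # exits the subshell only, so this becomes the function's status
)

# Usage: wait_certificates_ready app-production/app-prod-tls app-sandbox/app-sandbox-tls
```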

Why This Deadlock Is Gateway API-Specific

The legacy Kubernetes Ingress model would not have produced this deadlock because:

Each Ingress object is independently evaluated
A missing tls.secretName only affects that specific Ingress
All other Ingress objects continue routing regardless

The Gateway API model consolidates all listener configuration into a single Gateway resource. The benefit is a unified, cross-namespace control plane. The trade-off is that a single misconfigured listener can take down all listeners on the same Gateway.

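In this setup the behaviour is by design rather than a bug: Traefik treated a Gateway with an unresolvable listener as not Programmed. As far as I can tell, the Gateway API spec leaves it to the implementation whether one invalid listener invalidates the whole Gateway, so other controllers may behave differently. Either way, it creates a bootstrapping vulnerability when new listeners reference secrets that are expected to be created by an automated process that itself depends on routing being available.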

Prevention Strategies

Strategy 1: Pre-create placeholder secrets before adding Gateway listeners

Before adding a new TLS listener to a Gateway, create a placeholder secret (even self-signed) under the expected name. The listener resolves immediately, keeping the Gateway programmed while cert-manager issues the real certificate.

# Before deploying a new service with TLS
openssl req -x509 -nodes -newkey rsa:2048 -days 1 \
  -keyout /tmp/tls.key -out /tmp/tls.crt \
  -subj "/CN=placeholder" 2>/dev/null

kubectl create secret tls new-service-tls-secret \
  -n new-service-namespace \
  --cert=/tmp/tls.crt \
  --key=/tmp/tls.key
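
A placeholder is only useful if it is eventually replaced, and a one-day certificate that is never replaced will expire silently. The sketch below reads the subject of the certificate stored in a secret and checks whether it is still one of the placeholder CNs used in this post. The function names are hypothetical, and base64 -d assumes GNU coreutils or a recent macOS.

```shell
# Print the subject of the certificate stored in a TLS secret.
# Arguments: <secret-name> <namespace>
secret_subject() {
  kubectl get secret "$1" -n "$2" -o jsonpath='{.data.tls\.crt}' \
    | base64 -d | openssl x509 -noout -subject
}

# Succeed if a subject line still names one of the placeholder CNs used in
# this post ("placeholder" or "temp-bootstrap"). Both spacing styles that
# different OpenSSL builds print are matched.
is_placeholder_subject() {
  case "$1" in
    *"CN = placeholder"*|*"CN=placeholder"*) return 0 ;;
    *"CN = temp-bootstrap"*|*"CN=temp-bootstrap"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   is_placeholder_subject "$(secret_subject new-service-tls-secret new-service-namespace)" \
#     && echo "still the placeholder"
```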

Strategy 2: Use DNS-01 challenge instead of HTTP-01

DNS-01 ACME challenges do not depend on HTTP routing. They validate domain ownership by creating a DNS TXT record, which cert-manager handles via a DNS provider API. This completely avoids the routing dependency.

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
    - dns01:
        route53:
          region: us-east-1
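          hostedZoneID: <hosted-zone-id>
          # No accessKeyID is set here, so this relies on ambient AWS
          # credentials (for example an IAM role bound to cert-manager).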
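
With DNS-01, the thing to watch during issuance is a TXT record rather than an HTTP route. The record lives at _acme-challenge.<domain>, and for a wildcard name it lives on the base domain. A small sketch for deriving that name, with a hypothetical function name:

```shell
# Derive the TXT record name an ACME DNS-01 challenge uses for a domain.
# A leading "*." (wildcard) is dropped, since the record lives on the
# base name.
acme_txt_name() {
  printf '_acme-challenge.%s\n' "${1#\*.}"
}

# Usage (requires dig):
#   dig +short TXT "$(acme_txt_name app.example.com)"
```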

Strategy 3: Decouple new listener rollout from Gateway updates

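When onboarding a new service, either put its listener on a separate Gateway, so a failure cannot affect existing listeners, or stage the rollout: issue the certificate first, then add the new listener (and the HTTPRoute that targets it via parentRefs sectionName) only after the TLS secret exists and is verified.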
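
The "exists and is verified" step can be turned into a gate in a deploy script. This is a sketch only: the function name require_tls_secret and the file name gateway.yaml in the usage line are hypothetical.

```shell
# Succeed only if the secret exists and has type kubernetes.io/tls.
# Arguments: <namespace> <secret-name>
require_tls_secret() (
  ns=$1; secret=$2
  type=$(kubectl get secret "$secret" -n "$ns" -o jsonpath='{.type}' 2>/dev/null) || {
    echo "missing: $ns/$secret"
    exit 1
  }
  [ "$type" = "kubernetes.io/tls" ] || {
    echo "wrong type ($type): $ns/$secret"
    exit 1
  }
  echo "ok: $ns/$secret"
)

# Usage:
#   require_tls_secret new-service-namespace new-service-tls-secret \
#     && kubectl apply -f gateway.yaml
```

The gate only confirms that a TLS secret is present. It does not check who issued the certificate, so a placeholder passes as well.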

Commands Reference

# Check Gateway listener programmed status
kubectl get gateway <name> -n <namespace> \
  -o jsonpath='{range .status.listeners[*]}{.name}{"\t"}{.conditions[?(@.type=="Programmed")].status}{"\n"}{end}'

# Get detailed Gateway listener error messages
kubectl describe gateway <name> -n <namespace>

# Check cert-manager certificate status
kubectl get certificate -A

# Check ACME challenge status
kubectl get challenges -A

# Check ACME challenge detail
kubectl describe challenge -n <namespace> <challenge-name>

# List ReferenceGrant resources
kubectl get referencegrant -A

# Generate a temporary self-signed TLS secret
openssl req -x509 -nodes -newkey rsa:2048 -days 1 \
  -keyout /tmp/tls.key -out /tmp/tls.crt \
  -subj "/CN=temp-bootstrap" 2>/dev/null

kubectl create secret tls <secret-name> -n <namespace> \
  --cert=/tmp/tls.crt --key=/tmp/tls.key