Breaking the cert-manager and Gateway API Bootstrap Deadlock
Traced a 17-hour HTTPS outage across 20+ namespaces to a bootstrap deadlock between cert-manager's ACME HTTP-01 solver and the Gateway API's all-or-nothing listener programming model, resolved by injecting temporary placeholder TLS secrets.
After restoring a crashed Traefik ingress controller, every HTTPS endpoint across 20+ namespaces was returning a 404 with a self-signed certificate — not the expected Let’s Encrypt certificates. Even routes that had been serving production traffic for weeks were broken.
I traced this to a bootstrap deadlock between cert-manager and the Kubernetes Gateway API: cert-manager needed Traefik to route HTTP traffic to complete ACME certificate challenges, but Traefik needed the resulting TLS secrets to exist before it would program the Gateway — and therefore route any traffic at all.
Environment
| Component | Detail |
|---|---|
| Ingress controller | Traefik v3.6.6 with Gateway API provider |
| Certificate manager | cert-manager with Let’s Encrypt (HTTP-01 challenge) |
| Routing model | Gateway, HTTPRoute, GatewayClass |
| TLS reference model | Cross-namespace via ReferenceGrant |
Step 1 — Observing the Symptom
With Traefik running, a curl sweep of all HTTPS endpoints showed:
curl -sk -w "%{http_code}" https://app.example.com
# 404
Every endpoint returned 404. More telling was the TLS certificate being presented:
echo | openssl s_client -connect app.example.com:443 -servername app.example.com 2>/dev/null \
| openssl x509 -noout -issuer -enddate
issuer=CN = TRAEFIK DEFAULT CERT
notAfter=Mar 23 08:13:18 2027 GMT
Traefik was serving its default self-signed certificate rather than any Let’s Encrypt certificate. The notAfter timestamp lined up with Traefik’s restart minutes earlier: the default certificate is generated at startup, so its expiry falls a fixed validity period after the start time. Traefik had started fresh and had not loaded any of the real TLS certificates from Kubernetes secrets.
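The issuer check can be wrapped in a small helper so it is easy to repeat across many hosts. This is a minimal sketch, assuming the default certificate's issuer always contains the string TRAEFIK DEFAULT CERT as in the output above; the function name is hypothetical.

```shell
# Hypothetical helper: reads `openssl x509 -noout -issuer` output on stdin and
# succeeds (exit 0) if it names Traefik's built-in default certificate.
is_traefik_default_cert() {
  grep -q 'TRAEFIK DEFAULT CERT'
}

# Usage against a live host (same openssl pipeline as above):
#   echo | openssl s_client -connect "$host:443" -servername "$host" 2>/dev/null \
#     | openssl x509 -noout -issuer | is_traefik_default_cert \
#     && echo "default cert: $host"
```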
Step 2 — Checking the Gateway Status
In the Kubernetes Gateway API model, a Gateway resource defines named listeners — each listener binds to a port, hostname, and optionally a TLS secret. If any listener fails to resolve its configuration, the Gateway reports Programmed: False.
kubectl describe gateway traefik-gateway -n traefik | tail -40
The output revealed the problem immediately:
Conditions:
Type: Programmed
Status: False
Reason: Invalid
Message: Error while retrieving certificate: getting secret: secret "app-prod-tls-secret" not found
Type: Programmed
Status: False
Reason: Invalid
Message: Error while retrieving certificate: getting secret: secret "app-sandbox-tls-secret" not found
The tail of the output showed two listeners in a Programmed: False state because their referenced TLS secrets did not exist in the cluster.
The specific check for listener status:
kubectl get gateway traefik-gateway -n traefik \
-o jsonpath='{range .status.listeners[*]}{.name}{"\t"}{.conditions[?(@.type=="Programmed")].status}{"\n"}{end}'
web True
websecure-keycloak True
websecure-directus True
...
websecure-app-prod False
websecure-app-sandbox False
19 out of 23 listeners were True. Four were False.
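With 23 listeners, it helps to print only the ones that are not programmed. The filter below is a sketch that consumes the tab-separated output of the jsonpath query above; the function name is hypothetical.

```shell
# Hypothetical helper: reads "<listener>\t<status>" lines (the jsonpath output
# above) on stdin and prints only the listeners that are not Programmed=True.
# A listener with no Programmed condition at all is also reported.
unprogrammed_listeners() {
  awk -F '\t' 'NF >= 1 && $2 != "True" { print $1 }'
}

# Usage:
#   kubectl get gateway traefik-gateway -n traefik -o jsonpath='...' \
#     | unprogrammed_listeners
```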
Step 3 — The Critical Gateway API Behaviour
This is where the Gateway API diverges significantly from the legacy Ingress model.
With Ingress resources, each Ingress object is independent. If one references a missing TLS secret, only that specific Ingress is affected — all others continue working.
With the Gateway API as Traefik implemented it here, a single invalid listener blocked the entire Gateway from being programmed.
When even one listener fails ResolvedRefs (because its TLS secret doesn’t exist), Traefik marks the Gateway’s overall Programmed condition as False and stops routing all traffic across all listeners — including the 19 listeners whose configuration was perfectly valid.
This is the architectural reason every HTTPS route was broken even though only four of the 23 listeners referenced missing TLS secrets.
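The difference between the two models can be illustrated with a toy calculation over the listener statuses. This is only a model of the behaviour observed in this incident, not Traefik's actual logic, and the function name is hypothetical.

```shell
# Toy model, not Traefik's actual logic: stdin is "<listener>\t<status>" lines.
# Prints how many listeners would serve traffic under each model.
compare_models() {
  awk -F '\t' '
    NF >= 2 { total++; if ($2 == "True") ok++ }
    END {
      # Ingress-style: every valid object serves, invalid ones are skipped.
      printf "per-object: %d/%d serving\n", ok, total
      # Behaviour observed here: one invalid listener blocks all of them.
      printf "all-or-nothing: %d/%d serving\n", (ok == total ? total : 0), total
    }'
}
```

Fed the 19 True and 4 False statuses from Step 2, this prints 19/23 for the per-object model and 0/23 for the all-or-nothing model.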
Step 4 — Tracing the Missing Secrets to cert-manager
The missing secrets were expected to be created by cert-manager after completing ACME HTTP-01 certificate challenges. Checking cert-manager’s state:
kubectl get certificate -A | grep -v True
app-production app-prod-tls False app-prod-tls-secret 17h
app-sandbox app-sandbox-tls False app-sandbox-tls-secret 17h
Both certificates had been Ready: False for 17 hours. Checking the underlying ACME challenges:
kubectl get challenges -A
NAMESPACE NAME STATE DOMAIN AGE
app-production app-prod-tls-1-...-challenge pending app.example.com 17h
app-sandbox app-sandbox-tls-1-...-challenge pending app.example.com 17h
Both challenges had been stuck in pending for 17 hours. The challenge detail revealed why:
kubectl describe challenge -n app-production <challenge-name>
Status:
Presented: true
Processing: true
Reason: Waiting for HTTP-01 challenge propagation: did not get expected response
when querying endpoint, expected "<token>.<thumbprint>" but got:
State: pending
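The ACME solver was presenting the challenge token (a temporary HTTP route served by a short-lived pod), but requests to the challenge URL were coming back with an empty response. This message comes from cert-manager’s own self-check, which must pass before Let’s Encrypt is asked to validate, so the challenge never got that far.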
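The same check can be reproduced by hand to see what the challenge URL returns. The sketch below assumes the token and the expected value are copied from the Challenge resource (I am assuming its spec.token and spec.key fields); the function name and the FETCH override are hypothetical.

```shell
# Hypothetical helper mirroring the check described above. FETCH is the command
# used to retrieve a URL; it defaults to curl and can be swapped out for testing.
FETCH=${FETCH:-curl -s --max-time 8}

http01_self_check() {
  domain=$1 token=$2 expected=$3
  # HTTP-01 challenges are served from this well-known path (RFC 8555).
  url="http://${domain}/.well-known/acme-challenge/${token}"
  body=$($FETCH "$url") || body=""
  if [ "$body" = "$expected" ]; then
    echo "ok: $url"
  else
    echo "mismatch: $url returned '${body}'"
    return 1
  fi
}

# Usage:
#   http01_self_check app.example.com "<token>" "<token>.<thumbprint>"
```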
Step 5 — Identifying the Deadlock
The full causal chain:
cert-manager ACME HTTP-01 challenge
→ requires Traefik to route HTTP traffic to the solver pod
→ requires the Gateway to be Programmed
→ requires all listener TLS secrets to exist
→ requires cert-manager to have completed the ACME challenge
→ [DEADLOCK]
The two new listeners had been added to the Gateway referencing TLS secrets that cert-manager was supposed to create. But cert-manager couldn’t create those secrets because the ACME challenge couldn’t be validated. And the challenge couldn’t be validated because the Gateway wasn’t routing. And the Gateway wasn’t routing because the secrets didn’t exist.
Neither system could make progress without the other acting first.
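The chain can be written down as "X waits on Y" pairs and checked mechanically for a loop. This is a minimal sketch with hypothetical node names, and it assumes each item waits on exactly one other item.

```shell
# Hypothetical helper: stdin is one "<thing> <what it waits on>" pair per line.
# Walks the chain from the given start node and reports a loop if it revisits
# a node it has already passed through.
find_wait_cycle() {
  awk -v start="$1" '
    { waits_on[$1] = $2 }
    END {
      node = start; path = node
      while (node in waits_on) {
        seen[node] = 1
        node = waits_on[node]
        path = path " -> " node
        if (node in seen) { print "deadlock: " path; exit 1 }
      }
      print "ok: " path
    }'
}

# The chain from this incident, using made-up labels for each stage:
#   printf '%s\n' \
#     'challenge routing' \
#     'routing gateway-programmed' \
#     'gateway-programmed tls-secrets' \
#     'tls-secrets challenge' | find_wait_cycle challenge
```

Removing the last pair, which is what a pre-existing secret does, lets the walk terminate without a loop.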
Step 6 — Verifying the ReferenceGrant Setup
In the Gateway API model, secrets referenced by a Gateway listener must either be in the same namespace as the Gateway or be explicitly permitted via a ReferenceGrant. I verified this before attempting a fix, to rule out a permissions issue:
kubectl get referencegrant -A | grep app
app-production allow-traefik-app-prod-tls 17h
app-sandbox allow-traefik-app-sandbox-tls 17h
Both ReferenceGrant resources existed and were correctly scoped. The secrets simply did not exist yet — there was no permissions problem, only a timing one.
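For reference, a correctly scoped grant for this pattern has roughly the following shape. The grants from the incident are not reproduced in this write-up, so this is a sketch of the general form; the function name and argument order are hypothetical.

```shell
# Hypothetical helper: prints a ReferenceGrant that lets Gateways in one
# namespace reference a single named Secret in another namespace.
render_reference_grant() {
  grant_name=$1 secret_ns=$2 secret_name=$3 gateway_ns=$4
  cat <<EOF
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: ${grant_name}
  namespace: ${secret_ns}
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: Gateway
      namespace: ${gateway_ns}
  to:
    - group: ""
      kind: Secret
      name: ${secret_name}
EOF
}

# Usage:
#   render_reference_grant allow-traefik-app-prod-tls app-production \
#     app-prod-tls-secret traefik | kubectl apply -f -
```

The grant lives in the namespace of the Secret, not the namespace of the Gateway.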
Step 7 — The Fix: Injecting Temporary TLS Secrets
The solution was to inject self-signed TLS secrets under the exact names the Gateway listeners expected. This would:
- Allow all Gateway listeners to resolve their certificateRefs references
- Allow the Gateway to become Programmed: True
- Allow Traefik to begin routing all traffic, including the ACME solver HTTP routes
- Allow cert-manager’s HTTP-01 challenge validation to succeed
- Allow cert-manager to create the real Let’s Encrypt secrets, which would automatically replace the temporary ones
I generated a temporary self-signed certificate and applied it to all four missing secret locations:
# Generate a temporary self-signed certificate
openssl req -x509 -nodes -newkey rsa:2048 -days 1 \
-keyout /tmp/tls.key -out /tmp/tls.crt \
-subj "/CN=temp-bootstrap" 2>/dev/null
# Create the missing TLS secrets in the correct namespaces
for ns_secret in \
"app-production/app-prod-tls-secret" \
"app-sandbox/app-sandbox-tls-secret" \
"service-production/service-prod-tls-secret" \
"service-sandbox/service-sandbox-tls-secret"; do
ns="${ns_secret%%/*}"
secret="${ns_secret##*/}"
kubectl create secret tls "$secret" -n "$ns" \
--cert=/tmp/tls.crt --key=/tmp/tls.key \
--dry-run=client -o yaml | kubectl apply -f -
done
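Once routing is back, it is worth confirming that cert-manager has actually replaced each placeholder. A minimal sketch, assuming the placeholder was generated with the temp-bootstrap common name used above; the function name is hypothetical.

```shell
# Hypothetical helper: reads `openssl x509 -noout -subject` output on stdin and
# succeeds if the certificate is still the temp-bootstrap placeholder.
is_placeholder_cert() {
  # Matches both "CN = temp-bootstrap" and "/CN=temp-bootstrap" output styles.
  grep -Eq 'CN ?= ?temp-bootstrap'
}

# Usage against one of the secrets created above:
#   kubectl get secret app-prod-tls-secret -n app-production \
#     -o jsonpath='{.data.tls\.crt}' | base64 -d \
#     | openssl x509 -noout -subject | is_placeholder_cert \
#     && echo "still placeholder"
```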
Step 8 — Forcing Gateway Reconciliation
After creating the secrets, I forced an immediate reconcile by annotating the Gateway object. Traefik’s controller watches referenced secrets for changes, but an annotation update triggers reconciliation faster. The annotation key has no special meaning; any metadata change on the Gateway produces an update event:
kubectl annotate gateway traefik-gateway -n traefik \
"kubectl.kubernetes.io/last-applied-restart=$(date -u +%s)" \
--overwrite
After a few seconds, all listeners were programmed:
kubectl get gateway traefik-gateway -n traefik \
-o jsonpath='{range .status.listeners[*]}{.name}{"\t"}{.conditions[?(@.type=="Programmed")].status}{"\n"}{end}'
web True
websecure-keycloak True
websecure-directus True
...
websecure-app-prod True
websecure-app-sandbox True
All 23 listeners Programmed: True.
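Rather than re-running the status query by hand, the wait can be scripted. This is a sketch that polls with the same jsonpath query; the function name and the default attempt count and delay are arbitrary choices of mine.

```shell
# Hypothetical helper: polls until every listener on the Gateway reports
# Programmed=True, or gives up after a number of attempts.
wait_for_listeners() {
  gw=$1 ns=$2 attempts=${3:-30} delay=${4:-2}
  n=0
  while [ "$n" -lt "$attempts" ]; do
    statuses=$(kubectl get gateway "$gw" -n "$ns" \
      -o jsonpath='{range .status.listeners[*]}{.name}{"\t"}{.conditions[?(@.type=="Programmed")].status}{"\n"}{end}')
    # Keep only the listeners whose status is anything other than "True".
    pending=$(printf '%s\n' "$statuses" | awk -F '\t' 'NF >= 1 && $2 != "True"')
    if [ -n "$statuses" ] && [ -z "$pending" ]; then
      echo "all listeners programmed"
      return 0
    fi
    n=$((n + 1))
    sleep "$delay"
  done
  echo "timed out waiting for $gw" >&2
  return 1
}

# Usage:
#   wait_for_listeners traefik-gateway traefik
```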
Step 9 — Verifying Route Recovery
I confirmed recovery with a curl sweep across all HTTPS endpoints:
for host in app.example.com api.example.com identity.example.com; do
code=$(curl -o /dev/null -sk -w "%{http_code}" --max-time 8 "https://${host}")
echo "$code $host"
done
200 app.example.com
302 api.example.com
302 identity.example.com
Routes that had been broken for 17+ hours were responding correctly.
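For a longer host list, the sweep is easier to read if it flags failures and returns a non-zero status when anything is broken. A sketch, assuming that 2xx and 3xx responses count as healthy (as the 302s above do); the function names are hypothetical.

```shell
# Hypothetical helper: treats 2xx and 3xx as healthy, anything else as broken.
# curl reports 000 when it could not connect at all.
classify_code() {
  case $1 in
    2??|3??) echo "ok" ;;
    *)       echo "FAIL" ;;
  esac
}

# Sweep a list of hosts and return non-zero if any of them failed.
sweep() {
  failed=0
  for host in "$@"; do
    code=$(curl -o /dev/null -sk -w "%{http_code}" --max-time 8 "https://${host}")
    result=$(classify_code "$code")
    echo "$result $code $host"
    [ "$result" = "ok" ] || failed=1
  done
  return "$failed"
}

# Usage:
#   sweep app.example.com api.example.com identity.example.com
```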
Why This Deadlock Is Gateway API-Specific
The legacy Kubernetes Ingress model would not have produced this deadlock because:
- Each Ingress object is independently evaluated
- A missing tls.secretName only affects that specific Ingress
- All other Ingress objects continue routing regardless
The Gateway API model consolidates all listener configuration into a single Gateway resource. The benefit is a unified, cross-namespace control plane. The trade-off is that a single misconfigured listener can take down all listeners on the same Gateway.
This behaviour is deliberate rather than a bug: Traefik reports Programmed: False when a listener cannot be fully resolved. The Gateway API spec leaves it to the implementation whether one invalid listener invalidates the whole Gateway, so other controllers may degrade more gracefully. Either way, it creates a bootstrapping vulnerability when new listeners reference secrets that are expected to be created by an automated process that itself depends on routing being available.
Prevention Strategies
Strategy 1: Pre-create placeholder secrets before adding Gateway listeners
Before adding a new TLS listener to a Gateway, create a placeholder secret (even self-signed) under the expected name. The listener resolves immediately, keeping the Gateway programmed while cert-manager issues the real certificate.
# Before deploying a new service with TLS
openssl req -x509 -nodes -newkey rsa:2048 -days 1 \
-keyout /tmp/tls.key -out /tmp/tls.crt \
-subj "/CN=placeholder" 2>/dev/null
kubectl create secret tls new-service-tls-secret \
-n new-service-namespace \
--cert=/tmp/tls.crt \
--key=/tmp/tls.key
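If this step is automated, it should never overwrite a real certificate. The wrapper below is a sketch that creates the placeholder only when no secret of that name exists yet; the function name is hypothetical.

```shell
# Hypothetical helper: creates a placeholder TLS secret only when no secret of
# that name exists, so a real certificate is never overwritten.
ensure_placeholder_secret() {
  secret=$1 ns=$2 cert=$3 key=$4
  if kubectl get secret "$secret" -n "$ns" >/dev/null 2>&1; then
    echo "exists, leaving untouched: $ns/$secret"
    return 0
  fi
  kubectl create secret tls "$secret" -n "$ns" --cert="$cert" --key="$key" >/dev/null \
    && echo "created placeholder: $ns/$secret"
}

# Usage:
#   ensure_placeholder_secret new-service-tls-secret new-service-namespace \
#     /tmp/tls.crt /tmp/tls.key
```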
Strategy 2: Use DNS-01 challenge instead of HTTP-01
DNS-01 ACME challenges do not depend on HTTP routing. They validate domain ownership by creating a DNS TXT record, which cert-manager handles via a DNS provider API. This completely avoids the routing dependency.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - dns01:
          route53:
            region: us-east-1
            hostedZoneID: <hosted-zone-id>
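When debugging a DNS-01 issuance, it helps to know which TXT record to look up. The helper below is a small sketch that derives the record name; the function name is hypothetical.

```shell
# Hypothetical helper: prints the TXT record name an ACME DNS-01 challenge uses
# for a given domain. A wildcard validates against the base domain's record.
dns01_record_name() {
  domain=${1#\*.}          # strip a leading "*." if present
  echo "_acme-challenge.${domain}"
}

# Inspect the record while a challenge is in flight (requires dig):
#   dig +short TXT "$(dns01_record_name app.example.com)"
```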
Strategy 3: Decouple new listener rollout from Gateway updates
When onboarding a new service, either use a separate Gateway or stage the rollout: add the new listener, and the HTTPRoute that attaches to it through parentRefs.sectionName, only after the TLS secret exists and is verified.
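A route pinned to a single listener looks roughly like this. It is a sketch: the Gateway name and namespace match this cluster, while the route, service, and port arguments are hypothetical placeholders.

```shell
# Hypothetical helper: prints an HTTPRoute that attaches to exactly one named
# listener on traefik-gateway via parentRefs.sectionName.
render_http_route() {
  route=$1 ns=$2 listener=$3 host=$4 svc=$5 port=$6
  cat <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: ${route}
  namespace: ${ns}
spec:
  parentRefs:
    - name: traefik-gateway
      namespace: traefik
      sectionName: ${listener}
  hostnames:
    - ${host}
  rules:
    - backendRefs:
        - name: ${svc}
          port: ${port}
EOF
}

# Usage, once the listener and its TLS secret are in place:
#   render_http_route app app-production websecure-app-prod \
#     app.example.com app-svc 80 | kubectl apply -f -
```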
Commands Reference
# Check Gateway listener programmed status
kubectl get gateway <name> -n <namespace> \
-o jsonpath='{range .status.listeners[*]}{.name}{"\t"}{.conditions[?(@.type=="Programmed")].status}{"\n"}{end}'
# Get detailed Gateway listener error messages
kubectl describe gateway <name> -n <namespace>
# Check cert-manager certificate status
kubectl get certificate -A
# Check ACME challenge status
kubectl get challenges -A
# Check ACME challenge detail
kubectl describe challenge -n <namespace> <challenge-name>
# List ReferenceGrant resources
kubectl get referencegrant -A
# Generate a temporary self-signed TLS secret
openssl req -x509 -nodes -newkey rsa:2048 -days 1 \
-keyout /tmp/tls.key -out /tmp/tls.crt \
-subj "/CN=temp-bootstrap" 2>/dev/null
kubectl create secret tls <secret-name> -n <namespace> \
--cert=/tmp/tls.crt --key=/tmp/tls.key