Debug #kubernetes#networkpolicy#dns#aks#coredns

Network Policy Blocked DNS and the Pods Couldn't Tell You Why

External API calls from the backend started failing with EAI_AGAIN after adding Network Policies. The fix required supporting both CoreDNS and Azure DNS in the same egress rule.

· Gideon Warui

Two days after deploying Network Policies, the backend’s health check endpoint started returning errors:

{"status":"unhealthy","service":"safaricom-api","error":"getaddrinfo EAI_AGAIN sandbox.safaricom.co.ke"}

The pods were running. The service was reachable. Only outbound DNS resolution to external hosts was broken.


Environment

| Component | Detail |
| --- | --- |
| Cluster | AKS (<cluster>), West Europe |
| Backend | Node.js, port 3000 |
| DNS policy on pods | dnsPolicy: Default (Azure DNS, 168.63.129.16) |
| CoreDNS namespace | kube-system |

What the Network Policy said

The backend’s egress rule for DNS allowed UDP 53 to the kube-system namespace only:

egress:
  - to:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: kube-system
    ports:
      - protocol: UDP
        port: 53

This works when pods use dnsPolicy: ClusterFirst — which routes DNS through CoreDNS, running in kube-system. But the backend pods had dnsPolicy: Default, which bypasses CoreDNS entirely and sends DNS queries directly to the node’s DNS server — Azure’s virtual DNS at 168.63.129.16, outside the cluster.

The Network Policy allowed DNS egress to kube-system, but 168.63.129.16 is not in kube-system. Those queries were silently dropped. EAI_AGAIN is what Node.js surfaces when DNS resolution times out.


Why Default instead of ClusterFirst

dnsPolicy: Default was set intentionally in an earlier session to resolve a different DNS issue — Node.js was caching stale DNS responses from CoreDNS and failing to pick up updated service endpoints. At the time it fixed the problem. The downstream effect on the Network Policy wasn’t noticed until a day later.


The fix

Two options:

Option A: revert to ClusterFirst DNS

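Change pods back to dnsPolicy: ClusterFirst and set ndots: 2. Under ClusterFirst the resolver defaults to ndots: 5, so any hostname with fewer than five dots is tried against every cluster search domain before it is resolved as-is. Lowering it to 2 lets external names like sandbox.safaricom.co.ke resolve directly: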

dnsConfig:
  options:
    - name: ndots
      value: "2"
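
For context, here is a rough sketch of where those fields would sit in a Deployment's pod template. The Deployment name, labels, and image are placeholders made up for illustration; only dnsPolicy, dnsConfig, and port 3000 come from the text above.

```yaml
# Sketch only: metadata, labels, and image are hypothetical placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
spec:
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      dnsPolicy: ClusterFirst      # resolve through CoreDNS in kube-system
      dnsConfig:
        options:
          - name: ndots            # merged with the options ClusterFirst generates
            value: "2"
      containers:
        - name: backend
          image: <backend-image>   # placeholder, not the real image
          ports:
            - containerPort: 3000
```

With this variant, the original single egress rule to kube-system should already cover the pod's DNS traffic, since every query goes to CoreDNS.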

Option B: keep Default DNS and extend the egress rules

Keep dnsPolicy: Default but add a second egress rule allowing DNS to the Azure resolver:

egress:
  # CoreDNS (dnsPolicy: ClusterFirst)
  - to:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: kube-system
    ports:
      - protocol: UDP
        port: 53
  # Azure DNS (dnsPolicy: Default)
  - to:
      - ipBlock:
          cidr: 168.63.129.16/32
    ports:
      - protocol: UDP
        port: 53

I went with Option B to avoid touching the pod spec. The 168.63.129.16/32 CIDR is an Azure-reserved virtual IP: it’s the platform endpoint that Azure VMs use for DNS (as well as DHCP and load balancer health probes), and it is the same in every region.
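
To show where the egress fragment sits, here is a sketch of a complete NetworkPolicy manifest built around it. This is not the actual k8s/network-policy-backend.yaml: the policy name, namespace, and podSelector labels are assumptions, and the TCP entries are an optional addition that the original rule does not have (resolvers retry over TCP when a UDP reply is truncated).

```yaml
# Sketch only: name, namespace, and podSelector labels are hypothetical.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-egress-dns
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Egress
  egress:
    # CoreDNS (dnsPolicy: ClusterFirst)
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP   # optional addition, not in the original rule
          port: 53
    # Azure DNS (dnsPolicy: Default)
    - to:
        - ipBlock:
            cidr: 168.63.129.16/32
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP   # optional addition, not in the original rule
          port: 53
    # Egress rules for application traffic (e.g. the external API) are omitted here.
```

Because policyTypes includes Egress, anything not listed is denied, so a real policy would also need the rules for the backend's application traffic, which this post doesn't show.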

Applied the updated Network Policy:

kubectl apply -f k8s/network-policy-backend.yaml --context <cluster>

Health check immediately returned:

{"status":"healthy","service":"safaricom-api"}

The same rule in the wrong namespace

While fixing the backend, I also noticed the frontend Network Policy had the same issue, plus a second problem. The frontend’s egress was restricted to kube-system for DNS and the backend pod for application traffic. But the ingress rule meant to let the NGINX controller namespace reach the frontend didn’t actually match that namespace.

The NGINX Ingress controller runs in the ingress-nginx namespace. The frontend’s ingress rule was:

ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            name: ingress-nginx

The label name: ingress-nginx wasn’t actually on the namespace — the correct label key in newer Kubernetes is kubernetes.io/metadata.name. The traffic was getting through anyway because there was a broad fallback rule, but the intent wasn’t being enforced.

Fixed the label selector and tightened the egress DNS rule to use 168.63.129.16/32 instead of the open 0.0.0.0/0 that had been used as a temporary workaround.
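
The two corrected fragments would look roughly like this. The rest of the frontend policy isn't shown in this post, so treat this as a sketch of just the parts described above rather than the full file.

```yaml
# Sketch of the corrected fragments only; the surrounding policy is not shown in the post.
ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            # Label set automatically by Kubernetes on every namespace
            kubernetes.io/metadata.name: ingress-nginx
egress:
  # DNS to the Azure resolver only, replacing the temporary 0.0.0.0/0 rule
  - to:
      - ipBlock:
          cidr: 168.63.129.16/32
    ports:
      - protocol: UDP
        port: 53
```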


What to watch for

When you write a Network Policy DNS egress rule, the destination depends entirely on dnsPolicy:

| dnsPolicy | DNS goes to | Network Policy needs |
| --- | --- | --- |
| ClusterFirst (default) | CoreDNS in kube-system | namespaceSelector: kube-system |
| Default | Node resolver (Azure: 168.63.129.16) | ipBlock: 168.63.129.16/32 |
| None | Whatever dnsConfig.nameservers says | Match accordingly |
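
As an illustration of the None case, here is a minimal sketch pairing a pod's dnsConfig with a matching egress rule. The resolver address 10.0.0.53 is made up, and the sketch assumes it is a plain IP outside the cluster's Service range, since how ipBlock treats Service IPs depends on the CNI.

```yaml
# Pod spec fragment (10.0.0.53 is a hypothetical resolver address)
dnsPolicy: None
dnsConfig:
  nameservers:
    - 10.0.0.53
---
# Matching Network Policy egress fragment: same address as a /32 ipBlock
egress:
  - to:
      - ipBlock:
          cidr: 10.0.0.53/32
    ports:
      - protocol: UDP
        port: 53
```

The point is that the nameserver list and the ipBlock have to be kept in sync by hand; nothing in Kubernetes links the two.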

If you change dnsPolicy on a pod after the Network Policy is already in place, the DNS egress rule still points at the old resolver and DNS will silently break. EAI_AGAIN is the symptom. The pod logs won’t tell you why.

Namespace selector labels are not always what you expect. name: is a user-applied convention; kubernetes.io/metadata.name: is the system-guaranteed label on every namespace since Kubernetes 1.21. Use the latter.
