Debug · #azure #aks #terraform #nsg #networking

AKS Was Running But the Site Was Unreachable: an NSG Story

The cluster was healthy and the pods were running, but requests from outside the corporate network timed out. The cause was an NSG rule that allowed only two internal source ranges; the fix was a Terraform boolean toggle.

· Gideon Warui

Two weeks after the initial AKS deployment, the site was inaccessible from outside the corporate network. From <internal-ip> (on-prem) or <internal-cidr> (internal VNET), everything worked. From a mobile phone or a home network, the connection timed out at the TCP layer. No HTTP response, no TLS handshake — just silence.


Environment

Component         Detail
Cluster           AKS (<cluster>), West Europe
Load Balancer IP  <internal-ip>
NSG               Applied to the AKS node subnet
Ingress           NGINX Ingress Controller

Diagnosing from the outside

curl -v --connect-timeout 10 http://<internal-ip>/
# * Trying <internal-ip>:80...
# * Connection timed out after 10001 milliseconds
# curl: (28) Connection timed out after 10001 milliseconds

Not a 403, not a 404, not a TLS error: the connection timed out before the TCP handshake completed. That ruled out application-layer problems and pointed to the network boundary: either the Azure Load Balancer wasn't forwarding the traffic, or something was dropping it after the load balancer.

The Azure Load Balancer in front of an AKS cluster forwards traffic to the node subnet without rewriting the source address. The NSG on that subnet therefore evaluates each packet against the original client IP and decides what reaches the nodes.


The NSG rule

In Terraform, the relevant rule:

resource "azurerm_network_security_rule" "aks_allow_http_https" {
  name                       = "Allow-HTTP-HTTPS"
  priority                   = 110
  direction                  = "Inbound"
  access                     = "Allow"
  protocol                   = "Tcp"
  source_port_range          = "*"
  destination_port_ranges    = ["80", "443"]
  source_address_prefixes    = [
    "<internal-cidr>",   # Corporate VNET
    "<internal-ip>"      # On-prem router
  ]
  destination_address_prefix = "*"
}

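Two source prefixes, both internal. Traffic from any other address matched no allow rule and fell through to the NSG's default DenyAllInBound rule, which drops packets silently. That is why external clients saw a timeout rather than a refused connection.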

The security intent was correct: the site should eventually be restricted to known IP ranges. But at this stage of testing, with external stakeholders needing access, the restriction was premature.


The fix: Terraform boolean toggle

Rather than deleting the rule or manually editing the CIDR list for each access change, I added a boolean variable to switch between public and corporate-only modes:

# variables.tf
variable "allow_public_http_https" {
  description = "Allow public HTTP/HTTPS access. Set to false to restrict to corporate network only."
  type        = bool
  default     = false
}

# main.tf (or network module)
resource "azurerm_network_security_rule" "aks_allow_http_https_public" {
  count = var.allow_public_http_https ? 1 : 0

  name                       = "Allow-HTTP-HTTPS-Public"
  priority                   = 110
  direction                  = "Inbound"
  access                     = "Allow"
  protocol                   = "Tcp"
  source_port_range          = "*"
  destination_port_ranges    = ["80", "443"]
  source_address_prefix      = "Internet"
  destination_address_prefix = "*"
  # ...
}

resource "azurerm_network_security_rule" "aks_allow_http_https_restricted" {
  count = var.allow_public_http_https ? 0 : 1

  name                       = "Allow-HTTP-HTTPS-Corporate"
  priority                   = 110
  direction                  = "Inbound"
  access                     = "Allow"
  protocol                   = "Tcp"
  source_port_range          = "*"
  destination_port_ranges    = ["80", "443"]
  source_address_prefixes    = [
    "<internal-cidr>",
    "<internal-ip>"
  ]
  destination_address_prefix = "*"
  # ...
}

count = var.allow_public_http_https ? 1 : 0 is the standard Terraform pattern for conditional resource creation. The two resources use inverted conditions on the same variable, so exactly one of the rules exists at any time.
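Because each rule is declared with count, a reference to it is a list of zero or one elements. If a pipeline or another module needs to know which rule is live, outputs can collapse the pair into single values. A minimal sketch, assuming Terraform 0.15 or later for one(); the output names are hypothetical and not part of the original configuration:

```hcl
# Hypothetical outputs; the names are illustrative only.
output "http_https_access_mode" {
  description = "Which HTTP/HTTPS NSG rule is currently deployed."
  value       = var.allow_public_http_https ? "public" : "corporate-only"
}

output "http_https_rule_name" {
  description = "Name of the single NSG rule that exists for ports 80/443."

  # Each splat yields a list of 0 or 1 names. Because the two count
  # conditions are inverted, concat() produces exactly one element,
  # and one() unwraps it (it errors if the list holds more than one).
  value = one(concat(
    azurerm_network_security_rule.aks_allow_http_https_public[*].name,
    azurerm_network_security_rule.aks_allow_http_https_restricted[*].name,
  ))
}
```

After an apply, terraform output http_https_access_mode reports the current mode without opening the portal.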

In terraform.tfvars:

allow_public_http_https = true

Apply:

cd infra/
terraform apply -auto-approve

Verify from a network outside the corporate ranges:

curl -I http://<internal-ip>
# HTTP/1.1 200 OK

The AKS API server is separate

Changing the node subnet NSG only affects application traffic (ports 80/443). It has no effect on kubectl access to the API server, which has its own authorized IP range list:

# Set the API server authorized IP ranges.
# This replaces the existing list, so include every range that still needs access.
az aks update \
  --name <cluster> \
  --resource-group <resource-group> \
  --api-server-authorized-ip-ranges "<internal-ip>/32,<internal-ip>/32"

These are two separate controls. The NSG governs what reaches the workload nodes. The apiServerAccessProfile.authorizedIpRanges setting governs what can reach the Kubernetes control plane. On a locked-down cluster you have to configure both: the NSG so clients can reach the site, and the authorized IP ranges so kubectl can reach the API server.
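If the cluster itself is managed in Terraform (the post only shows the az CLI route, so this is an assumption), the authorized list can live in code next to the NSG rules. A fragment rather than a complete resource, assuming a recent azurerm provider (the 4.x series or later 3.x releases) where the setting is the api_server_access_profile block. The resource label and variable name are hypothetical, and the cluster's other required arguments are elided:

```hcl
# Hypothetical variable; the name is illustrative only.
variable "api_server_authorized_ip_ranges" {
  description = "CIDR ranges allowed to reach the Kubernetes API server."
  type        = list(string)
}

resource "azurerm_kubernetes_cluster" "aks" {
  # ... name, location, node pool, identity, and other required arguments

  # Controls kubectl/API access only; independent of the node subnet NSG.
  api_server_access_profile {
    authorized_ip_ranges = var.api_server_authorized_ip_ranges
  }
}
```

The list is declarative here as well: whatever the variable holds becomes the complete set of authorized ranges.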


Production rule

With allow_public_http_https = true in terraform.tfvars, the site became accessible from anywhere. The Kubernetes API server remained restricted to known CIDRs. The database and internal services were unaffected — their NSG rules were separate and unchanged.

The toggle makes it easy to flip back:

# In terraform.tfvars
allow_public_http_https = false

# Then, from infra/
terraform apply -auto-approve

That re-enables the corporate-only rule and removes the public one in a single operation. No manual NSG edits, no state drift.
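A variation on the toggle, not what was deployed here: both rules use priority 110, and Azure does not allow two rules with the same priority and direction in one NSG, so a flip may depend on the old rule being deleted before the new one is created. A single rule whose source switches on the variable sidesteps that ordering question, because Terraform can update it in place. A sketch, assuming that setting the unused source argument to null behaves like omitting it; the resource label is hypothetical and the elided arguments are the same as in the rules above:

```hcl
resource "azurerm_network_security_rule" "aks_allow_http_https_toggle" {
  name                    = "Allow-HTTP-HTTPS"
  priority                = 110
  direction               = "Inbound"
  access                  = "Allow"
  protocol                = "Tcp"
  source_port_range       = "*"
  destination_port_ranges = ["80", "443"]

  # Exactly one of the two source arguments is set; the other is null.
  source_address_prefix   = var.allow_public_http_https ? "Internet" : null
  source_address_prefixes = var.allow_public_http_https ? null : [
    "<internal-cidr>",
    "<internal-ip>",
  ]

  destination_address_prefix = "*"
  # ...
}
```

The trade-off is that the rule keeps one name in both modes, so the rule list no longer shows at a glance which mode is active.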
