Connecting Prod AKS to Log Analytics: Container Insights Migration and a Plaintext OTP Leak

The prod AKS cluster had been running without usable log aggregation. The OMS monitoring addon was enabled, with the ama-logs DaemonSet present on every node, but it was pointed at <legacy-workspace>, a workspace I had moved away from. No alert fired when this happened. No dashboard broke. The logs were quietly going to a workspace nobody was reading.

The check took two commands. The fix introduced a three-to-five minute monitoring gap and uncovered a plaintext OTP leak that had been running undetected in production.

What Container Insights actually does

The ama-logs DaemonSet runs one pod per node. It reads container stdout/stderr from /var/log/containers/*.log — symlinks that kubelet maintains pointing at the actual containerd log files on the node. It does not reach inside running containers. It does not read log files written to container-local paths.

This distinction matters more than it sounds. A service that writes logs to a file inside the container — rather than to stdout — is invisible to Log Analytics, invisible to Defender for Containers, and invisible to any alerting built on top. For the platform team, that service simply does not exist as far as observability goes.

ContainerLogV2 is the destination table. Relevant fields:

Field	Notes
`TimeGenerated`	Timestamp of the log record (UTC), not the ingestion time
`PodNamespace`	K8s namespace
`ContainerName`	Container name within the pod
`LogMessage`	The log line
`LogSource`	Always `stdout` or `stderr` — file logs never appear here

The LogSource field is the tell: only stdout and stderr ever reach this table. If you query ContainerLogV2 for a service and get no results, either the addon is misconfigured or the service is logging to a file.

CKA/CKAD cert lens: a DaemonSet runs one pod on each eligible node. Its pods go through the same scheduling machinery as regular pods (tolerations, node selectors, affinity), but the DaemonSet controller creates one pod per matching node instead of reconciling a replica count. Exam trap: kubectl scale does not work on DaemonSets. Scale is implicit: one pod per node that matches the selector.
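To see the one-pod-per-node rule in practice, compare how many pods the DaemonSet controller wants with how many are ready. This is a minimal sketch, assuming the agent runs as the ama-logs DaemonSet in kube-system and that kubectl already points at the cluster. The function name is made up for illustration.

```shell
# Succeeds when every node that should run the DaemonSet has a ready pod.
# Usage: daemonset_ready <namespace> <name>
daemonset_ready() {
  # desiredNumberScheduled: nodes the controller wants a pod on.
  # numberReady: pods that are running and passing readiness checks.
  status=$(kubectl get daemonset "$2" --namespace "$1" \
    --output jsonpath='{.status.desiredNumberScheduled} {.status.numberReady}') || return 1
  desired=${status% *}
  ready=${status#* }
  echo "desired=$desired ready=$ready"
  [ -n "$desired" ] && [ "$desired" = "$ready" ]
}
```

Usage: daemonset_ready kube-system ama-logs. Both fields are read-only status; the pod count only changes when the set of matching nodes changes.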

The migration

Discovering the misconfiguration:

az aks show \
  --resource-group <resource-group> \
  --name <cluster> \
  --query "addonProfiles.omsagent"

The logAnalyticsWorkspaceResourceID value in the config block of the response pointed at <legacy-workspace>. The addon was active, the agent was running, but it was reporting to the wrong destination.
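For a scripted version of this check, the query can be narrowed to the workspace ID alone and compared against the workspace you expect. This is a sketch, assuming the addon profile exposes the ID as config.logAnalyticsWorkspaceResourceID. Both function names are hypothetical. The comparison is case-insensitive because Azure resource IDs are, and the API does not always return the casing you supplied.

```shell
# Print the workspace resource ID the monitoring addon reports to.
# Usage: addon_workspace_id <resource-group> <cluster>
addon_workspace_id() {
  # Assumed key path inside the addon profile; adjust if your output differs.
  az aks show --resource-group "$1" --name "$2" \
    --query "addonProfiles.omsagent.config.logAnalyticsWorkspaceResourceID" \
    --output tsv
}

# Succeeds when the addon reports to the expected workspace.
# Usage: addon_points_at <resource-group> <cluster> <expected-workspace-id>
addon_points_at() {
  # Lowercase both sides before comparing.
  actual=$(addon_workspace_id "$1" "$2" | tr '[:upper:]' '[:lower:]')
  expected=$(printf '%s' "$3" | tr '[:upper:]' '[:lower:]')
  [ -n "$actual" ] && [ "$actual" = "$expected" ]
}
```

A non-zero exit from addon_points_at is the condition that went unnoticed here, so it is a reasonable thing to run on a schedule or in a pipeline.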

Moving the addon to <log-workspace> requires a disable-then-enable cycle. Azure CLI does not accept an in-place workspace update when the addon is already active:

az aks disable-addons \
  --resource-group <resource-group> \
  --name <cluster> \
  --addons monitoring

az aks enable-addons \
  --resource-group <resource-group> \
  --name <cluster> \
  --addons monitoring \
  --workspace-resource-id /subscriptions/<subscription-id>/resourcegroups/<resource-group>/providers/microsoft.operationalinsights/workspaces/<log-workspace>

The disable step terminates the ama-logs pods. The enable step recreates them with the new workspace registration. Total gap: roughly three to five minutes. Unavoidable without a dual-workspace forwarding setup.
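The two commands can be wrapped so that the enable step never runs after a failed disable, and so the script blocks until the agent is back. This is a sketch under assumptions: the function name is invented, the DaemonSet is assumed to be ama-logs in kube-system, and kubectl is assumed to already point at the cluster. It does not shorten the gap; it only makes the end of it observable.

```shell
# Usage: move_monitoring_addon <resource-group> <cluster> <workspace-resource-id>
move_monitoring_addon() {
  # Stop at the first failure so the cluster state stays easy to reason about.
  az aks disable-addons --resource-group "$1" --name "$2" \
    --addons monitoring || return 1
  az aks enable-addons --resource-group "$1" --name "$2" \
    --addons monitoring --workspace-resource-id "$3" || return 1
  # Block until every eligible node has a ready agent pod again,
  # or give up after 10 minutes.
  kubectl rollout status daemonset/ama-logs --namespace kube-system --timeout=10m
}
```

If the disable succeeds and the enable fails, the cluster is left with no monitoring addon at all, so the exit code is worth checking before moving on.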

Verification

After the addon restarted, the first KQL check:

ContainerLogV2
| where TimeGenerated > ago(15m)
| summarize count() by PodNamespace
| order by count_ desc

<namespace> appeared with 1,274 entries in the first fifteen minutes. Log ingestion was confirmed.

One gotcha with az monitor log-analytics query: the --workspace parameter takes the workspace customer GUID, not the full resource ID. Using the resource ID returns PathNotFoundError. Retrieve the GUID first:

az monitor log-analytics workspace show \
  --resource-group <resource-group> \
  --workspace-name <log-workspace> \
  --query customerId -o tsv

Use that GUID as the --workspace value in subsequent CLI queries.
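Chaining the two steps looks roughly like this. It is a sketch, assuming the log-analytics CLI extension is installed, since that extension provides az monitor log-analytics query. The function name is hypothetical.

```shell
# Run a KQL query against a workspace identified by resource group and name.
# Usage: query_workspace <resource-group> <workspace-name> <kql>
query_workspace() {
  # Resolve the customer GUID first; the query command does not take
  # the full resource ID.
  guid=$(az monitor log-analytics workspace show \
    --resource-group "$1" --workspace-name "$2" \
    --query customerId --output tsv) || return 1
  az monitor log-analytics query --workspace "$guid" \
    --analytics-query "$3" --output table
}
```

Usage: query_workspace <resource-group> <log-workspace> 'ContainerLogV2 | where TimeGenerated > ago(15m) | summarize count() by PodNamespace'. Single quotes keep the shell from interpreting the pipes inside the KQL.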

General-purpose log browsing query for the workspace:

ContainerLogV2
| where TimeGenerated > ago(1h)
| where PodNamespace == "<namespace>"
| project TimeGenerated, ContainerName, LogSource, LogMessage
| order by TimeGenerated desc

The OTP leak

While I was scanning the log stream to confirm data quality, a pattern appeared repeatedly in the backend service logs:

otpppppp [redacted]

A six-digit OTP value — plaintext — in a console.log line. Running a targeted query:

ContainerLogV2
| where TimeGenerated > ago(24h)
| where PodNamespace == "<namespace>"
| where LogMessage contains "otpppppp"
| project TimeGenerated, LogMessage
| order by TimeGenerated desc

140+ instances in 24 hours. A debug log statement left in the backend service was writing real OTP values to stdout in production. Every OTP issued to a real user was sitting in the log workspace, readable by anyone with Log Analytics Reader access on the workspace.

This is the kind of finding that never surfaces without log aggregation. The backend service runs, the OTPs work, users authenticate — nothing is broken. The leak is invisible until you have a place to look.

The fix is a one-line removal in the backend service. The app team was notified with the query output as evidence.

Stdout vs file logging

The app team's regular application logs had been going to files inside the container. The ama-logs agent never saw them because it reads the node’s log files, not paths inside containers; only direct console output, such as the OTP debug line, reached stdout.

The fix they committed to: switch the backend logger to write to stdout/stderr. Once the new image is deployed, no infra change is needed — the agent picks up the log stream automatically.

The implication for any service that migrates: historical file-based logs do not carry over. Log Analytics will only have entries from when stdout logging started, and files written inside the container disappear with the old pods unless they sit on a persistent volume. If log retention matters, export the file-based logs before switching.

What is still missing

Diagnostic settings. The prod cluster has no control plane log forwarding. kube-apiserver, kube-audit, kube-controller-manager, and kube-scheduler logs are not going anywhere. Audit logs in particular are the record of who created or deleted which resource and when — the minimum required for a security incident investigation. Adding a diagnostic settings profile pointing to <log-workspace> is the next piece of work.
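One possible shape for that work, as a sketch rather than the final configuration. The setting name control-plane-logs and both function names are placeholders, and the category list is just the four logs named above.

```shell
# Build the JSON array that --logs expects from a list of category names.
# Usage: log_categories_json <category>...
log_categories_json() {
  sep=""
  printf '['
  for category in "$@"; do
    printf '%s{"category":"%s","enabled":true}' "$sep" "$category"
    sep=","
  done
  printf ']'
}

# Create a diagnostic setting that forwards control plane logs to a workspace.
# Usage: forward_control_plane_logs <resource-group> <cluster> <workspace-resource-id>
forward_control_plane_logs() {
  # Diagnostic settings attach to the cluster's full resource ID.
  cluster_id=$(az aks show --resource-group "$1" --name "$2" \
    --query id --output tsv) || return 1
  az monitor diagnostic-settings create \
    --name control-plane-logs \
    --resource "$cluster_id" \
    --workspace "$3" \
    --logs "$(log_categories_json kube-apiserver kube-audit \
      kube-controller-manager kube-scheduler)"
}
```

Unlike the query command, --workspace here takes the full workspace resource ID. kube-audit is the high-volume category; AKS also offers kube-audit-admin, which leaves out get and list events, if ingestion cost becomes a concern.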

UAT cluster. Still pointed at <legacy-workspace>. The same disable/enable migration applies.