Blog

AI infrastructure labs, incident write-ups, and platform engineering from real production work.

#incident Field Notes

Azure Blob Private Link Looked Configured But Wasn't: Three Misconfigs That Left Public Access Open

Diagnosed and fixed a blob storage Private Link setup where the private endpoint was in the wrong VNet, the DNS A record was in an orphaned zone, and public access was never disabled.

#incident Field Notes

Connecting Prod AKS to Log Analytics: Container Insights Migration and a Plaintext OTP Leak

Migrated the prod AKS cluster's Container Insights pipeline to a unified Log Analytics workspace and discovered 140+ plaintext OTP values being logged in production.

#platform Field Notes

Replacing a Rogue Azure Function with a Proper ADF Orchestration Pipeline

Built an ADF orchestration pipeline to chain Extract, Transform, and HRIS History into a single trigger, replacing an unsafe Azure Function that had been running the same workload as a shadow ETL.

#incident Field Notes

D365 Silently Dropped 20 OData Columns: SCD2 Saved the Data

Diagnosed HRIS dashboard failures after Dynamics 365 stopped returning WorkerStatus, Gender, and MaritalStatus from its Workers OData entity, then recovered the values from an SCD2 history table that had been quietly capturing them for months.

#incident Field Notes

A Shadow Azure Function Wiped Every Gold Table at 2am

Traced a second HRIS outage to an Azure Function running a parallel ETL that truncated all staging and Gold tables when D365 OData returned zero rows. The function was found by pulling its source code from a SquashFS deployment package on Azure Blob Storage.

#platform Field Notes

Standing Up a Full Analytics Platform on AKS in One Session: GitOps, DuckDB, and 14 Kubernetes Manifests

Built and deployed a telecom analytics platform on Azure Kubernetes Service in a single session — Kafka, DuckDB, MLflow, Prefect, Grafana, FastAPI — all wired through ArgoCD GitOps with CI/CD image tagging.

#platform Field Notes

Churn Prediction with RandomForest, MLflow, and DuckDB in a Kubernetes Init Container

Trained a telecom churn prediction model using feature-engineered DuckDB Gold layer data, tracked with MLflow, orchestrated by Prefect, and executed as a Kubernetes init container on every deployment.

#platform Field Notes

DuckDB as an Embedded Lakehouse on Kubernetes: Medallion Layers in a Single File

Used DuckDB as a single-file analytical database running inside a Kubernetes pod, with Bronze/Silver/Gold medallion layers, shared PVC storage, and read-only connections for secure query serving.

#platform Field Notes

Kafka KRaft to DuckDB: A Medallion Lakehouse Pipeline on Kubernetes

Built a streaming analytics pipeline using Kafka in KRaft mode, DuckDB as an embedded lakehouse with Bronze/Silver/Gold layers, and a CronJob-based refresh cycle on AKS.

#platform Field Notes

Querying the Kubernetes Metrics API from a Pod: RBAC, Python Client, and Grafana Without Prometheus

Wired up a FastAPI application to read node and pod metrics from the Kubernetes metrics-server API using in-cluster config, a scoped ClusterRole, and the Python kubernetes client.

#debug Field Notes

Eight Commits to Fix a Sub-Path: Nginx Ingress Rewrites for Grafana, Prefect, MLflow, and ArgoCD

Routed six services under one IP using Nginx ingress path matching, rewrite-target annotations, and application-level sub-path configuration on AKS.

#platform Field Notes

Natural Language to SQL with Claude Haiku: Schema Grounding, Validation, and a Read-Only DuckDB Connection

Built an NLP2SQL interface using Claude Haiku via Azure AI Foundry, grounded on the DuckDB schema, with regex validation and a read-only connection as defense in depth.

#platform Field Notes

Plotly.js Embedded Analytics in a FastAPI Portal

Skipped Grafana for the analytics portal and embedded Plotly.js directly in Jinja2 templates — KPI cards, funnel charts, DOM-based table rendering, and a 15-second refresh loop.

#platform Field Notes

Building a RAG Pipeline with ChromaDB and Sentence-Transformers in a Kubernetes Pod

Implemented document ingestion, paragraph-aware chunking, and semantic search using ChromaDB with all-MiniLM-L6-v2 embeddings, deployed as part of a FastAPI application on AKS.

#debug Field Notes

Diagnosing 401 Invalid Signature on a Bill Payment Webhook Azure Function

Traced a persistent 401 on an Azure Function webhook verifying RSA-signed payment notifications to three compounding failures: wrong body in the test request, a hardcoded test organisation code in Terraform, and a key mode mismatch between the deployed function and the Postman collection.

#platform Field Notes

Cleaning Up a Kubernetes Manifest Directory That Got Away From You

The k8s/ directory had stale ingresses, ambiguously named files, missing service manifests, plaintext credentials in a text file, and image tags months out of date. Here is how it was restructured.

#platform Field Notes

terraform.tfstate, a Live VPN Key, and 100MB of Provider Binaries Committed on Day One

Audited a six-month-old Terraform repo and found the state file, a live VPN pre-shared key, and all provider binaries committed in the initial push, then removed them from tracking and migrated state to an Azure Storage backend.

#debug Field Notes

Installing Karpenter 1.8 on EKS 1.34: Four Errors and a Working Cluster

Installed Karpenter 1.8 on EKS 1.34 by working through a Helm registry migration, a version compatibility gap, a feature gate parsing bug, and a missing aws-auth entry — alongside a cost audit that cut $23/month.

#platform Field Notes

Kit Confirmation Emails Not Sending From a Static Astro Site

Traced a silent 200 OK with no confirmation email to three causes: a wrong API version, a 12-hour per-address suppression window, and a per-form template scope that doesn't inherit globally.

#platform Field Notes

Wiring Azure File Persistent Storage for Notification and Batch Services on AKS Staging

Added PVCs and wired volume mounts for the notification and batch services across two namespaces on AKS staging, replacing stale AWS StorageClass references and correcting two naming and access mode mistakes made in the process.

#platform Field Notes

PeerDB Enterprise on EKS: Helm Repository Move, Temporal Schema Bootstrap, and Catalog Secret Keys

Deployed PeerDB Enterprise on EKS against an RDS catalog backend and resolved three undocumented blockers: the moved Helm repository, missing Temporal schema migrations, and a six-key catalog secret requirement.

#incident Field Notes

Diagnosing and Fixing an OOMKilled Traefik Ingress Controller on EKS

Traced a Traefik CrashLoopBackOff that took down the entire ingress layer overnight to an undersized memory limit, then fixed a Helm schema validation failure on the first upgrade attempt.

#debug Field Notes

Karpenter CrashLoopBackOff: 178 Restarts from an Empty Feature Gate Value

Traced a Karpenter startup panic to a single empty string in the FEATURE_GATES environment variable and resolved it with a one-line kubectl set env patch.

#debug Field Notes

Karpenter GC Controller Failing With AccessDenied: Missing iam:ListInstanceProfiles

Traced recurring AccessDenied errors in Karpenter's instance profile garbage collection controller to a missing iam:ListInstanceProfiles action and patched the controller IAM policy to fix it.

#debug Field Notes

Diagnosing 15 Hours of ContainerCreating: `replicas: 2` Against a ReadWriteOnce EBS Volume on EKS

Traced a 15-hour silent ContainerCreating stall to a Deployment running two replicas against a single ReadWriteOnce EBS PVC, where AWS rejected the second EC2 volume attachment with no events and no logs.

#rca Field Notes

cert-manager ACME HTTP-01 Leak: 22,514 Stale HTTPRoutes OOMKilled Traefik

Diagnosed an 18-hour full ingress outage caused by cert-manager leaking 22,514 stale ACME solver HTTPRoutes after TLS certificates were deployed before DNS was configured, compounded by Gateway API blocking all 23 listeners when new listeners referenced non-existent TLS secrets.

#debug Field Notes

Breaking the cert-manager and Gateway API Bootstrap Deadlock

Traced a 17-hour HTTPS outage across 20+ namespaces to a bootstrap deadlock between cert-manager's ACME HTTP-01 solver and the Gateway API's all-or-nothing listener programming model, resolved by injecting temporary placeholder TLS secrets.

#platform Field Notes

Replacing kube-prometheus-stack with VictoriaMetrics on EKS

Replaced kube-prometheus-stack and Loki with VictoriaMetrics, VictoriaLogs, and Vector on EKS, cutting the observability memory footprint by 56% and adding dual-sink log archival to S3.

#platform Field Notes

Zero HPAs, Unbounded Containers, and an OOMKilled ArgoCD Controller on EKS

A platform audit on Amazon EKS revealed missing HPAs across 16 services, unbounded platform components, and an ArgoCD application controller OOMKilling under reconciliation load.

#platform Field Notes

Retiring the Latest Tag: Environment-Specific Image Tagging for Kubernetes

Retired the latest tag in favor of YYYYMMDD-HHMMSS-{env} image tags and conditional pipeline logic enforcing build separation between UAT and production.

#incident Field Notes

Next.js RCE in Production: How the Attack Unfolded and What Stopped It

A manually deployed image with a downgraded Next.js version was exploited via GHSA-9qr9-h5gf-34mp within hours of deployment; a pre-existing Network Policy denying internet egress prevented the attacker from downloading xmrig and completing the compromise.

#platform Field Notes

Manual Kubernetes Deployment When the Pipeline Breaks

When Azure DevOps went down with unshipped PRs in flight, I rebuilt frontend and backend images manually using git worktrees, pushed to ACR, and rolled out to two namespaces without downtime.

#platform Field Notes

How I Migrated 3.2 Million SharePoint Files to Azure Blob: Four Bugs That Almost Stopped It

Built and operated a distributed Python migration service on AKS that moved 3.26 million files and ~15TB from SharePoint Online to Azure Blob Storage across three drives, diagnosing and resolving four production bugs along the way.

#infra L01

Prometheus ServiceMonitor: The CRD Nobody Reads Until Something Doesn't Scrape

How ServiceMonitor wiring actually works in kube-prometheus-stack, why scrape targets go missing, and the three fields that control everything.

#incident Field Notes

ArgoCD Sync Wave Deadlock: How a Broken Deployment Blocked Its Own Fix

Traced two days of ArgoCD OutOfSync/Degraded state to three concurrent root causes: a sync wave deadlock from a missing ConfigMap environment variable, a liveness probe depending on tools absent from the container image, and a stuck sync operation that blocked all subsequent fixes.

#platform Field Notes

Keycloak SMTP via ExternalSecrets and Fineract Production Gateway Wiring

Wired Keycloak SMTP through ExternalSecrets and AWS Secrets Manager across dev and prod, then diagnosed three gaps blocking Fineract production routing: the ArgoCD project, the destination, and the Gateway listener.

#platform Field Notes

Deploying a New Django Service on EKS: GitOps Setup with ArgoCD, External Secrets, and Gateway API

Documented the end-to-end GitOps setup for onboarding a new Django service to EKS, covering ECR, RDS, Secrets Manager, Kustomize overlays, cert-manager TLS, Gateway API routing, and CI/CD pipeline wiring.

#debug Field Notes

RabbitMQ Cluster Operator: The Secret Format Nobody Documents

Traced a RabbitMQ init container mount failure to undocumented secret key requirements and resolved it with External Secrets Operator templating.

#debug Field Notes

ArgoCD Sync Waves and Gateway API: HTTPRoute Order Is Not Optional

Diagnosed an ArgoCD sync loop where Gateway API HTTPRoutes failed on missing backend Services, and fixed it by annotating resources with explicit sync wave numbers.

#incident Field Notes

Kinsing Hit the Cluster and the Security Contexts Held

Kinsing malware attempted to install a crypto miner on a Kubernetes pod. Read-only root filesystem and dropped capabilities blocked every step of the attack chain.

#platform Field Notes

Wiring TLS to AKS via ExternalSecret and Azure Key Vault

Stored TLS certificates in Azure Key Vault and synced them into Kubernetes as a TLS secret using the External Secrets Operator, then wired them to the NGINX Ingress.

#debug Field Notes

Azure Key Vault 403: Application ID Is Not the Object ID

The backend returned a 403 Forbidden on every Key Vault operation. The access policy had the wrong ID — Application ID instead of Service Principal Object ID.

#debug Field Notes

Network Policy Blocked DNS and the Pods Couldn't Tell You Why

External API calls from the backend started failing with EAI_AGAIN after Network Policies were added. The fix required supporting both CoreDNS and Azure DNS in the same egress rule.

#debug Field Notes

AKS Was Running But the Site Was Unreachable: an NSG Story

The cluster was healthy and the pods were running, but requests from outside the corporate network timed out. An NSG rule was allowing only two CIDRs. Fixed it with a Terraform boolean toggle.

#debug Field Notes

Four Things That Broke on the First AKS Deployment

Deployed a Node.js/Next.js app to AKS for the first time and hit MySQL timeouts, a hardcoded localhost URL, an ingress rewrite stripping the API prefix, and a LimitRange wall — all in the same session.