Standing Up a Full Analytics Platform on AKS in One Session: GitOps, DuckDB, and 14 Kubernetes Manifests

I needed a working demo of a Data Science Centre of Excellence platform for a telecom operator. Not slides. Not a diagram. A running system with streaming data, ML predictions, BI dashboards, and a self-serve query interface — all on Kubernetes, all deployed through GitOps.

The constraint was time. I had roughly twelve hours from first commit to demo-ready. The platform needed to look and behave like production: authenticated portal, role-based access, real-time pipeline, infrastructure monitoring. This is how I built it.

The Stack

The architecture is a medallion lakehouse pattern running entirely inside a single AKS cluster:

Kafka (KRaft mode) — event streaming, no ZooKeeper
DuckDB — embedded OLAP database, Bronze/Silver/Gold layers in a single file
Prefect — workflow orchestration for the ETL pipeline
MLflow — model experiment tracking and registry
ChromaDB — vector store for RAG document search
Grafana — infrastructure monitoring (node/pod metrics)
FastAPI — API layer, portal, embedded analytics dashboards
ArgoCD — GitOps continuous deployment
GitHub Actions — CI pipeline, image builds, tag propagation

Everything lives in a single Git repository. ArgoCD watches infra/k8s/ and auto-syncs.

Infrastructure Foundation

Node Pool and Scheduling

The AKS cluster uses a dedicated node pool (<client>dscoe) with taints to isolate demo workloads from anything else running on the cluster:

nodeSelector:
  agentpool: <client>dscoe
tolerations:
- key: "workload"
  operator: "Equal"
  value: "<namespace>"
  effect: "NoSchedule"

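Every Deployment, Job, and CronJob in the project carries this block. The nodeSelector pins pods to the dedicated pool; without it they can land on the default node pool and compete with unrelated workloads. The toleration lets them schedule despite the pool's taint, and that taint keeps everything else off the demo nodes.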
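
Because one missing block silently moves a pod to another pool, a small check over parsed manifests can help. The sketch below is an illustration, not part of the original repo: the function name and the POOL and TAINT_VALUE placeholders are assumptions, and it expects pod specs that have already been loaded into dictionaries.

```python
# Hypothetical helper: verify that a pod spec carries the scheduling block.
# POOL and TAINT_VALUE are placeholders for the real pool name and taint value.
POOL = "demo-pool"
TAINT_VALUE = "demo"


def has_scheduling_block(pod_spec: dict, pool: str = POOL,
                         taint_value: str = TAINT_VALUE) -> bool:
    """Return True if the pod spec selects the pool and tolerates its taint."""
    selects_pool = pod_spec.get("nodeSelector", {}).get("agentpool") == pool
    tolerates = any(
        t.get("key") == "workload"
        and t.get("operator", "Equal") == "Equal"
        and t.get("value") == taint_value
        and t.get("effect") == "NoSchedule"
        for t in pod_spec.get("tolerations", [])
    )
    return selects_pool and tolerates


good = {
    "nodeSelector": {"agentpool": POOL},
    "tolerations": [
        {"key": "workload", "operator": "Equal",
         "value": TAINT_VALUE, "effect": "NoSchedule"}
    ],
}
missing_toleration = {"nodeSelector": {"agentpool": POOL}}

print(has_scheduling_block(good))                # True
print(has_scheduling_block(missing_toleration))  # False
```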

Shared Storage with PVC

The DuckDB database, trained ML model, and ChromaDB vector store all need to survive pod restarts and be shared between the init container (pipeline) and the main API container. A single PVC handles this:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: <client>-data
  namespace: <namespace>
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi

ReadWriteOnce means the volume can be mounted read-write by only one node at a time. This forced a deployment strategy decision: Recreate instead of RollingUpdate. With RollingUpdate, the new pod starts while the old pod still holds the PVC. If the new pod is scheduled onto a different node, the volume cannot attach there and the rollout stalls. Recreate tears down the old pod first, then starts the new one.

strategy:
  type: Recreate

The PVC is mounted at /data and shared via volumeMounts between the init container and the main API container. Both reference the same paths:

env:
- name: DB_PATH
  value: "/data/<client>_lakehouse.duckdb"
- name: MODEL_PATH
  value: "/data/churn_model.joblib"
- name: CHROMA_PATH
  value: "/data/chroma_db"
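
Both containers can then resolve the same locations from the environment. A minimal sketch of how that might look on the Python side, assuming a hypothetical resolve_paths helper (not taken from the original code) that uses the manifest values as fallbacks:

```python
import os
from pathlib import PurePosixPath
from typing import Dict, Mapping

# Fallbacks mirror the manifest values; "<client>" is the article's placeholder.
DEFAULT_PATHS = {
    "DB_PATH": "/data/<client>_lakehouse.duckdb",
    "MODEL_PATH": "/data/churn_model.joblib",
    "CHROMA_PATH": "/data/chroma_db",
}


def resolve_paths(env: Mapping[str, str] = os.environ) -> Dict[str, PurePosixPath]:
    """Read the three storage paths from the environment, with fallbacks."""
    return {
        name: PurePosixPath(env.get(name, default))
        for name, default in DEFAULT_PATHS.items()
    }


# Override one variable, as a local test run might, and keep the other defaults.
paths = resolve_paths({"DB_PATH": "/tmp/test.duckdb"})
print(paths["DB_PATH"])     # /tmp/test.duckdb
print(paths["MODEL_PATH"])  # /data/churn_model.joblib
```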

Init Container Pattern

The API pod uses an init container to run the full data pipeline before the API starts serving:

initContainers:
- name: pipeline-seed
  image: <acr-registry>.azurecr.io/<client>-api:<sha>
  command: ["python", "flows/dscoe_flow.py"]
  env: [...]
  volumeMounts:
  - name: app-data
    mountPath: /data

The flow runs: produce synthetic events, consume from Kafka, transform through Bronze, Silver, and Gold layers, train the churn model, write to DuckDB and joblib. Only after all of this completes does the main api container start.

Every fresh deployment gets a clean, fully-hydrated dataset. The trade-off is startup time (~60 seconds for the pipeline), but for a demo this is acceptable.
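
The Bronze, Silver, and Gold progression can be illustrated without Kafka or DuckDB. The sketch below uses plain Python lists in place of tables. The event fields (customer_id, event_type, amount) and the cleaning rules are hypothetical and are not taken from the actual flow.

```python
from collections import defaultdict

# Bronze: raw events exactly as consumed (hypothetical fields).
bronze = [
    {"customer_id": "c1", "event_type": "call", "amount": "2.50"},
    {"customer_id": "c1", "event_type": "call", "amount": "bad"},   # malformed amount
    {"customer_id": "c2", "event_type": "data", "amount": "1.00"},
    {"customer_id": None, "event_type": "call", "amount": "3.00"},  # missing key
    {"customer_id": "c1", "event_type": "data", "amount": "4.00"},
]


def to_silver(rows):
    """Silver: drop rows that fail validation and cast amount to float."""
    clean = []
    for row in rows:
        if not row["customer_id"]:
            continue
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        clean.append({**row, "amount": amount})
    return clean


def to_gold(rows):
    """Gold: aggregate per customer for dashboards and model features."""
    totals = defaultdict(lambda: {"events": 0, "total_amount": 0.0})
    for row in rows:
        agg = totals[row["customer_id"]]
        agg["events"] += 1
        agg["total_amount"] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(len(silver))  # 3
print(gold["c1"])   # {'events': 2, 'total_amount': 6.5}
```

In the real pipeline each layer would be a table in the DuckDB file rather than a list, but the shape of the work is the same: keep the raw input, validate and type it, then aggregate.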

Kafka in KRaft Mode

Kafka runs as a StatefulSet in KRaft mode — no ZooKeeper dependency:

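command:
- /bin/bash
- -c
- |
  # Generate a cluster ID and format storage; --ignore-formatted skips
  # formatting when the directory was already formatted on a previous start.
  CLUSTER_ID="$(kafka-storage random-uuid)"
  kafka-storage format --ignore-formatted -t "$CLUSTER_ID" -c /etc/kafka/kraft-server.properties
  # exec so the broker process receives termination signals directly.
  exec kafka-server-start /etc/kafka/kraft-server.properties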

The image is apache/kafka:3.7.0. I hit one gotcha: the PVC mounted at /var/lib/kafka contained a lost+found directory from the ext4 filesystem. Kafka expects every directory in its log directory to be named as a topic partition, so it rejected lost+found and refused to start. The fix was mounting a subdirectory:

volumeMounts:
- name: kafka-data
  mountPath: /var/lib/kafka/data
  subPath: kafka-data

The subPath creates a clean directory inside the PVC, bypassing the lost+found issue entirely.
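
To see why lost+found trips the broker, it helps to look at the naming rule. The check below is a simplified stand-in for that rule, not Kafka's actual code, and the topic names are illustrative.

```python
import re

# Simplified approximation: a partition directory is "<topic>-<partition number>".
PARTITION_DIR = re.compile(r"^(?P<topic>.+)-(?P<partition>\d+)$")


def looks_like_partition_dir(name: str) -> bool:
    """Return True if the directory name has the topic-partition shape."""
    return PARTITION_DIR.match(name) is not None


for name in ["events-0", "__consumer_offsets-12", "lost+found"]:
    print(name, looks_like_partition_dir(name))
# events-0 True
# __consumer_offsets-12 True
# lost+found False
```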

CI/CD Pipeline

GitHub Actions runs on every push to master:

Test job — installs dependencies, runs pytest against auth tests
Build & Push — builds the Docker image, tags with git SHA, pushes to Azure Container Registry
Tag propagation — sed replaces the image tag in 05-api.yaml and 12-producer-stream.yaml, commits back to the repo

- name: Update image tag in K8s manifests
  run: |
    SHA_TAG="${ACR}/${IMAGE}:${GIT_SHA}"
    sed -i "s|image: ${ACR}/${IMAGE}:.*|image: ${SHA_TAG}|" \
      infra/k8s/05-api.yaml infra/k8s/12-producer-stream.yaml

- name: Commit updated manifest
  run: |
    git config user.name  "<redacted>"
    git config user.email "<email>"
    git add infra/k8s/05-api.yaml infra/k8s/12-producer-stream.yaml
    git diff --cached --quiet || git commit -m "ci: update image tag [skip ci]"
    git push

The [skip ci] in the commit message prevents an infinite loop — the manifest commit would otherwise trigger another build.
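
The sed substitution from the tag-propagation step can be reproduced in Python to check its behavior locally before trusting it in CI. This is a sketch under assumptions: the function name, registry, image name, and sample manifest are all hypothetical, and it escapes the registry name where the sed pattern does not.

```python
import re


def update_image_tag(manifest: str, acr: str, image: str, git_sha: str) -> str:
    """Rewrite every 'image: <acr>/<image>:<tag>' entry to point at git_sha."""
    new_ref = f"{acr}/{image}:{git_sha}"
    # '.' does not match newlines, so '.*' stops at the end of each line,
    # which mirrors sed's line-by-line substitution.
    pattern = re.compile(rf"image: {re.escape(acr)}/{re.escape(image)}:.*")
    # A function replacement avoids backslash interpretation in new_ref.
    return pattern.sub(lambda _match: f"image: {new_ref}", manifest)


manifest = """\
      initContainers:
      - name: pipeline-seed
        image: example.azurecr.io/demo-api:abc123
      containers:
      - name: api
        image: example.azurecr.io/demo-api:abc123
      - name: sidecar
        image: docker.io/library/busybox:1.36
"""

updated = update_image_tag(manifest, "example.azurecr.io", "demo-api", "def456")
print(updated)
```

Both the init container and the main container lines are rewritten, which matters here because they share one image; the unrelated sidecar image is left alone.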

ArgoCD watches the repo with automated sync:

syncPolicy:
  automated:
    prune: true
    selfHeal: true
  syncOptions:
  - CreateNamespace=true

prune: true removes resources from the cluster that are no longer in Git. selfHeal: true reverts manual changes. The full loop: code push, CI builds image, CI updates manifest tag, ArgoCD detects the tag change, ArgoCD syncs, new pod with new image.

CronJob for Pipeline Refresh

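The streaming producer continuously pushes events to Kafka. Because the pipeline runs in the API pod's init container, restarting the pod is what refreshes the data. To keep the API’s data current, a CronJob restarts the API deployment every 30 minutes: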

apiVersion: batch/v1
kind: CronJob
metadata:
  name: pipeline-refresh
spec:
  schedule: "*/30 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pipeline-refresh-sa
          containers:
          - name: trigger
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              kubectl rollout restart deployment/<client>-api -n <namespace>
              kubectl rollout status deployment/<client>-api -n <namespace> --timeout=300s

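The CronJob uses a dedicated ServiceAccount with a Role scoped to only get and patch on deployments. No broader access. Note that kubectl rollout status watches the Deployment, so if that second command reports a permissions error, the Role also needs list and watch. The bitnami/kubectl image provides kubectl without needing to bake it into the application image.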
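
The patch verb is needed because a rollout restart works by stamping the pod template with a kubectl.kubernetes.io/restartedAt annotation; the changed template is what triggers new pods. A sketch of building an equivalent patch body in Python follows. The restart_patch name is hypothetical, and the UTC timestamp format is an assumption rather than a copy of kubectl's exact output.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def restart_patch(now: Optional[datetime] = None) -> dict:
    """Build a patch body that changes the pod template annotation."""
    # Expects a timezone-aware datetime; defaults to the current UTC time.
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {"kubectl.kubernetes.io/restartedAt": stamp}
                }
            }
        }
    }


fixed = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(json.dumps(restart_patch(fixed), indent=2))
```

Any change to the pod template has the same effect, which is why the Role needs no delete or create permissions on pods.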

One issue I hit: bitnami/kubectl:1.28 was removed from Docker Hub mid-session, causing ImagePullBackOff. I switched to bitnami/kubectl:latest to unblock the demo.

RBAC for Metrics

The FastAPI application queries the Kubernetes metrics-server API directly (instead of relying on Prometheus/Mimir) for infrastructure monitoring. This required a ServiceAccount with ClusterRole access:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: <client>-api-metrics-reader
rules:
- apiGroups: ["metrics.k8s.io"]
  resources: ["nodes", "pods"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["nodes", "pods"]
  verbs: ["get", "list"]

The ClusterRole is necessary (not a namespaced Role) because node metrics are cluster-scoped. Pod metrics could be namespace-scoped, but using a single ClusterRole for both keeps the config simple.

The Python code uses the kubernetes library with in-cluster config:

from kubernetes import client, config

config.load_incluster_config()
metrics_api = client.CustomObjectsApi()
nodes = metrics_api.list_cluster_custom_object("metrics.k8s.io", "v1beta1", "nodes")
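
The metrics API reports usage as Kubernetes quantity strings (for example CPU in nanocores and memory in Ki), so they need converting before they can be plotted. The sketch below shows one way to do that. The helper names and the sample response are hypothetical, and the parser covers only the common suffixes, not the full quantity grammar.

```python
from decimal import Decimal

CPU_SUFFIXES = {"n": Decimal("1e-9"), "u": Decimal("1e-6"), "m": Decimal("1e-3")}
MEM_SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3,
}


def cpu_cores(quantity: str) -> float:
    """Convert a CPU quantity such as '250m' or '250000000n' to cores."""
    for suffix, factor in CPU_SUFFIXES.items():
        if quantity.endswith(suffix):
            return float(Decimal(quantity[: -len(suffix)]) * factor)
    return float(quantity)


def memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '2097152Ki' or '1Gi' to bytes."""
    for suffix, factor in MEM_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(Decimal(quantity))


def summarize(node_metrics: dict) -> dict:
    """Map node name to (cores used, bytes used)."""
    return {
        item["metadata"]["name"]: (
            cpu_cores(item["usage"]["cpu"]),
            memory_bytes(item["usage"]["memory"]),
        )
        for item in node_metrics.get("items", [])
    }


# Hypothetical response in the shape returned for the "nodes" resource.
sample = {
    "items": [
        {"metadata": {"name": "node-a"},
         "usage": {"cpu": "250000000n", "memory": "2097152Ki"}}
    ]
}
print(summarize(sample))  # {'node-a': (0.25, 2147483648)}
```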

Manifest Organization

The infra/k8s/ directory contains 14 manifests, numbered for apply order:

00-namespace.yaml        # Namespace
01-kafka.yaml            # Kafka StatefulSet (KRaft)
02-mlflow.yaml           # MLflow server
03-prefect.yaml          # Prefect orchestration server
04-chromadb.yaml         # ChromaDB vector store
05-api.yaml              # FastAPI deployment + service
06-grafana.yaml          # Grafana + dashboard ConfigMaps
07-ingress.yaml          # Nginx ingress (4 resources)
08-argocd-ingress.yaml   # ArgoCD ingress (separate namespace)
09-data-pvc.yaml         # Shared PVC
11-pipeline-cron.yaml    # CronJob + RBAC
12-producer-stream.yaml  # Kafka streaming producer
13-api-rbac.yaml         # ServiceAccount + ClusterRole for metrics

ArgoCD applies everything in infra/k8s/; the numbering is for human readability, not execution order. Kubernetes reconciles the rest: a pod that references a PVC stays Pending until the PVC exists, so resources created in the same sync converge on their own.

Configuration Rules

PVC + RollingUpdate = deadlock on RWO volumes

If a Deployment mounts a ReadWriteOnce volume, use the Recreate strategy or switch to ReadWriteMany (which requires a different storage class, such as Azure Files).

Init containers are underrated for data seeding

Instead of running a separate Job and coordinating timing, the init container guarantees the pipeline completes before the API starts. The pod isn’t ready until the init container exits 0.

Git-based image tagging with `sed` is crude but effective

The CI pipeline directly modifies the manifest and commits. No Helm values file, no Kustomize overlay, no external tool. For a single-service repo, this is the simplest path from push to deploy.

CronJobs need their own RBAC

The pipeline-refresh CronJob runs kubectl rollout restart, which requires patch on deployments. A common mistake is running CronJobs with the default ServiceAccount, which has no permissions to do this. Scope the Role tightly: get and patch on deployments in the target namespace (plus list and watch if the job also calls kubectl rollout status), nothing else.

`bitnami/kubectl` tags get pruned

Pin to a minor-version tag (bitnami/kubectl:1.30) rather than a patch release, or accept latest for non-production use. Old tags can disappear from Docker Hub without warning.

KRaft mode eliminates ZooKeeper but requires explicit bootstrapping

The storage directory must be formatted with a generated cluster ID before the broker starts. The three-command startup script handles this.

Result

From first commit to running demo: about twelve hours. The platform serves a login portal with role-based access, embedded Plotly.js analytics dashboards, a Grafana infrastructure monitoring panel, MLflow model registry, Prefect pipeline orchestration, and a natural language query interface powered by Claude Haiku. All deployed through ArgoCD, all refreshing automatically, all running on two AKS nodes.

The Git history tells the story — 40+ commits, half of them fixes. Infrastructure work is debugging work. The manifests that shipped look clean; the path to get there was not.