Standing Up a Full Analytics Platform on AKS in One Session: GitOps, DuckDB, and 14 Kubernetes Manifests
Built and deployed a telecom analytics platform on Azure Kubernetes Service in a single session — Kafka, DuckDB, MLflow, Prefect, Grafana, FastAPI — all wired through ArgoCD GitOps with CI/CD image tagging.
I needed a working demo of a Data Science Centre of Excellence platform for a telecom operator. Not slides. Not a diagram. A running system with streaming data, ML predictions, BI dashboards, and a self-serve query interface — all on Kubernetes, all deployed through GitOps.
The constraint was time. I had roughly twelve hours from first commit to demo-ready. The platform needed to look and behave like production: authenticated portal, role-based access, real-time pipeline, infrastructure monitoring. This is how I built it.
The Stack
The architecture is a medallion lakehouse pattern running entirely inside a single AKS cluster:
- Kafka (KRaft mode) — event streaming, no ZooKeeper
- DuckDB — embedded OLAP database, Bronze/Silver/Gold layers in a single file
- Prefect — workflow orchestration for the ETL pipeline
- MLflow — model experiment tracking and registry
- ChromaDB — vector store for RAG document search
- Grafana — infrastructure monitoring (node/pod metrics)
- FastAPI — API layer, portal, embedded analytics dashboards
- ArgoCD — GitOps continuous deployment
- GitHub Actions — CI pipeline, image builds, tag propagation
Everything lives in a single Git repository. ArgoCD watches infra/k8s/ and auto-syncs.
Infrastructure Foundation
Node Pool and Scheduling
The AKS cluster uses a dedicated node pool (<client>dscoe) with taints to isolate demo workloads from anything else running on the cluster:
nodeSelector:
  agentpool: <client>dscoe
tolerations:
  - key: "workload"
    operator: "Equal"
    value: "<namespace>"
    effect: "NoSchedule"
Every deployment, job, and CronJob in the project carries this block. Without it, pods land on the default node pool and compete with unrelated workloads. The taint ensures nothing else schedules onto the demo nodes.
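For orientation, here is a minimal sketch of where that block sits inside a Deployment's pod template. The Deployment name, labels, and container entry are placeholders invented for illustration; only the nodeSelector and tolerations values come from the manifests described above.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-workload          # hypothetical name
  namespace: <namespace>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-workload       # hypothetical label
  template:
    metadata:
      labels:
        app: example-workload
    spec:
      # Scheduling block from the article: pin to the pool, tolerate its taint.
      nodeSelector:
        agentpool: <client>dscoe
      tolerations:
        - key: "workload"
          operator: "Equal"
          value: "<namespace>"
          effect: "NoSchedule"
      containers:
        - name: app               # hypothetical container
          image: <acr-registry>.azurecr.io/<client>-api:<sha>
```

The two halves do different jobs: the toleration only permits scheduling onto the tainted nodes, while the nodeSelector is what forces the pod onto that pool. For a CronJob, the same block goes under jobTemplate.spec.template.spec.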
Shared Storage with PVC
The DuckDB database, trained ML model, and ChromaDB vector store all need to survive pod restarts and be shared between the init container (pipeline) and the main API container. A single PVC handles this:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: <client>-data
  namespace: <namespace>
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
ReadWriteOnce means the volume can be mounted by only one node at a time. This forced a deployment strategy decision: Recreate instead of RollingUpdate. With RollingUpdate, the new pod starts while the old pod still holds the PVC; if the new pod is scheduled onto a different node, it can never attach the volume and the rollout deadlocks. Recreate tears down the old pod first, then starts the new one.
strategy:
  type: Recreate
The PVC is mounted at /data and shared via volumeMounts between the init container and the main API container. Both reference the same paths:
env:
  - name: DB_PATH
    value: "/data/<client>_lakehouse.duckdb"
  - name: MODEL_PATH
    value: "/data/churn_model.joblib"
  - name: CHROMA_PATH
    value: "/data/chroma_db"
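The wiring between the PVC and the two containers could look roughly like the fragment below. It is a sketch of the pod template only, not the full manifest: the volume name app-data and the claim name are taken from the snippets in this article, the container name api is assumed from the prose, and images, env, and ports are left out.

```yaml
spec:
  template:
    spec:
      volumes:
        - name: app-data
          persistentVolumeClaim:
            claimName: <client>-data    # the PVC defined above
      initContainers:
        - name: pipeline-seed
          volumeMounts:
            - name: app-data
              mountPath: /data          # pipeline writes DuckDB, model, Chroma here
      containers:
        - name: api                     # assumed container name
          volumeMounts:
            - name: app-data
              mountPath: /data          # API reads the same files
```

Because both containers mount the same volume at the same path, the DB_PATH, MODEL_PATH, and CHROMA_PATH values can be identical in each.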
Init Container Pattern
The API pod uses an init container to run the full data pipeline before the API starts serving:
initContainers:
  - name: pipeline-seed
    image: <acr-registry>.azurecr.io/<client>-api:<sha>
    command: ["python", "flows/dscoe_flow.py"]
    env: [...]
    volumeMounts:
      - name: app-data
        mountPath: /data
The flow runs: produce synthetic events, consume from Kafka, transform through Bronze, Silver, and Gold layers, train the churn model, write to DuckDB and joblib. Only after all of this completes does the main api container start.
Every fresh deployment gets a clean, fully-hydrated dataset. The trade-off is startup time (~60 seconds for the pipeline), but for a demo this is acceptable.
Kafka in KRaft Mode
Kafka runs as a StatefulSet in KRaft mode — no ZooKeeper dependency:
command:
  - /bin/bash
  - -c
  - |
    export CLUSTER_ID=$(kafka-storage random-uuid)
    kafka-storage format -t $CLUSTER_ID -c /etc/kafka/kraft-server.properties
    kafka-server-start /etc/kafka/kraft-server.properties
The image is apache/kafka:3.7.0. I hit one gotcha: the PVC mount at /var/lib/kafka contained a lost+found directory from the ext4 filesystem. Kafka’s log directory scan treated it as a corrupt segment and refused to start. The fix was mounting a subdirectory:
volumeMounts:
  - name: kafka-data
    mountPath: /var/lib/kafka/data
    subPath: kafka-data
The subPath creates a clean directory inside the PVC, bypassing the lost+found issue entirely.
CI/CD Pipeline
GitHub Actions runs on every push to master:
- Test job: installs dependencies, runs pytest against auth tests
- Build & Push: builds the Docker image, tags with git SHA, pushes to Azure Container Registry
- Tag propagation: sed replaces the image tag in 05-api.yaml and 12-producer-stream.yaml, commits back to the repo
- name: Update image tag in K8s manifests
  run: |
    SHA_TAG="${ACR}/${IMAGE}:${GIT_SHA}"
    sed -i "s|image: ${ACR}/${IMAGE}:.*|image: ${SHA_TAG}|" \
      infra/k8s/05-api.yaml infra/k8s/12-producer-stream.yaml
- name: Commit updated manifest
  run: |
    git config user.name "<redacted>"
    git config user.email "<email>"
    git add infra/k8s/05-api.yaml infra/k8s/12-producer-stream.yaml
    git diff --cached --quiet || git commit -m "ci: update image tag [skip ci]"
    git push
The [skip ci] in the commit message prevents an infinite loop — the manifest commit would otherwise trigger another build.
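A skeleton of the workflow around those steps might look like the following. This is a sketch under assumptions, not the actual file: the workflow name, Python version, dependency file, test path, and the way GIT_SHA is populated are all guesses, and the registry login and Docker build steps are omitted.

```yaml
name: ci                                  # hypothetical workflow name
on:
  push:
    branches: [master]
permissions:
  contents: write                         # the job pushes the manifest commit
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"          # assumed version
      - run: pip install -r requirements.txt   # assumed dependency file
      - run: pytest tests/                     # assumed test path
  build-and-push:
    needs: test                           # build only if tests pass
    runs-on: ubuntu-latest
    env:
      GIT_SHA: ${{ github.sha }}          # assumed source of the tag
    steps:
      - uses: actions/checkout@v4
      # ... registry login, docker build and push, then the two steps shown above
```

One detail worth knowing: commits pushed with the default GITHUB_TOKEN do not trigger new workflow runs, so [skip ci] matters most when the push uses a personal access token or deploy key. Keeping it either way is a cheap safeguard.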
ArgoCD watches the repo with automated sync:
syncPolicy:
  automated:
    prune: true
    selfHeal: true
  syncOptions:
    - CreateNamespace=true
prune: true removes resources from the cluster that are no longer in Git. selfHeal: true reverts manual changes. The full loop: code push, CI builds image, CI updates manifest tag, ArgoCD detects the tag change, ArgoCD syncs, new pod with new image.
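For context, the syncPolicy above lives inside an ArgoCD Application resource. A minimal sketch follows; the application name, the argocd namespace, the project, and the repository URL are placeholders I have assumed, while the path, branch, and sync settings come from the article.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: <client>-dscoe                    # hypothetical application name
  namespace: argocd                       # assumes ArgoCD's usual install namespace
spec:
  project: default                        # assumed project
  source:
    repoURL: https://github.com/<org>/<repo>.git   # placeholder
    targetRevision: master
    path: infra/k8s                       # directory ArgoCD watches
  destination:
    server: https://kubernetes.default.svc   # the cluster ArgoCD itself runs in
    namespace: <namespace>
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```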
CronJob for Pipeline Refresh
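The streaming producer continuously pushes events to Kafka. To keep the API’s data current, a CronJob restarts the API deployment every 30 minutes. Each restart reruns the init container, so the pipeline rebuilds the dataset from the latest events: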
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pipeline-refresh
spec:
  schedule: "*/30 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pipeline-refresh-sa
          containers:
            - name: trigger
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - |
                  kubectl rollout restart deployment/<client>-api -n <namespace>
                  kubectl rollout status deployment/<client>-api -n <namespace> --timeout=300s
The CronJob uses a dedicated ServiceAccount with a Role scoped to only get and patch on deployments. No broader access. The bitnami/kubectl image provides kubectl without needing to bake it into the application image.
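The RBAC objects behind that ServiceAccount could be sketched as below. The ServiceAccount name is the one the CronJob references; the Role and RoleBinding names are hypothetical, and the verbs follow the description above.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pipeline-refresh-sa
  namespace: <namespace>
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pipeline-refresh-role             # hypothetical name
  namespace: <namespace>
rules:
  - apiGroups: ["apps"]                   # Deployments live in the apps group
    resources: ["deployments"]
    verbs: ["get", "patch"]               # as described in the article
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pipeline-refresh-binding          # hypothetical name
  namespace: <namespace>
subjects:
  - kind: ServiceAccount
    name: pipeline-refresh-sa
    namespace: <namespace>
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pipeline-refresh-role
```

A caveat on the verbs: kubectl rollout status watches the Deployment rather than fetching it once, so if that step reports a forbidden error, list and watch likely need to be added alongside get and patch.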
One issue I hit: bitnami/kubectl:1.28 was removed from Docker Hub mid-session, causing ImagePullBackOff. Changed to bitnami/kubectl:latest to unblock.
RBAC for Metrics
The FastAPI application queries the Kubernetes metrics-server API directly (instead of relying on Prometheus/Mimir) for infrastructure monitoring. This required a ServiceAccount with ClusterRole access:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: <client>-api-metrics-reader
rules:
  - apiGroups: ["metrics.k8s.io"]
    resources: ["nodes", "pods"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["nodes", "pods"]
    verbs: ["get", "list"]
The ClusterRole is necessary (not a namespaced Role) because node metrics are cluster-scoped. Pod metrics could be namespace-scoped, but using a single ClusterRole for both keeps the config simple.
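A ClusterRole grants nothing until it is bound to an identity. A minimal sketch of the binding is shown here; the ServiceAccount and binding names are hypothetical, and only the ClusterRole name comes from the manifest above.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: <client>-api-sa                   # hypothetical name
  namespace: <namespace>
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: <client>-api-metrics-reader-binding   # hypothetical name
subjects:
  - kind: ServiceAccount
    name: <client>-api-sa
    namespace: <namespace>
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: <client>-api-metrics-reader
```

The API Deployment would then set serviceAccountName to that ServiceAccount in its pod spec so the in-cluster client picks up the mounted token.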
The Python code uses the kubernetes library with in-cluster config:
from kubernetes import client, config
config.load_incluster_config()
metrics_api = client.CustomObjectsApi()
nodes = metrics_api.list_cluster_custom_object("metrics.k8s.io", "v1beta1", "nodes")
Manifest Organization
The infra/k8s/ directory contains 14 manifests, numbered in rough dependency order:
00-namespace.yaml # Namespace
01-kafka.yaml # Kafka StatefulSet (KRaft)
02-mlflow.yaml # MLflow server
03-prefect.yaml # Prefect orchestration server
04-chromadb.yaml # ChromaDB vector store
05-api.yaml # FastAPI deployment + service
06-grafana.yaml # Grafana + dashboard ConfigMaps
07-ingress.yaml # Nginx ingress (4 resources)
08-argocd-ingress.yaml # ArgoCD ingress (separate namespace)
09-data-pvc.yaml # Shared PVC
11-pipeline-cron.yaml # CronJob + RBAC
12-producer-stream.yaml # Kafka streaming producer
13-api-rbac.yaml # ServiceAccount + ClusterRole for metrics
ArgoCD applies everything in infra/k8s/; the numbering is for human readability, not execution order. Kubernetes reconciles dependencies on its own: a pod that references a PVC stays Pending until the claim exists, so it simply waits if the PVC is created later in the same sync.
Configuration Rules
PVC + RollingUpdate = deadlock on RWO volumes
If a rollout can place the new pod on a different node from the old one, a ReadWriteOnce volume blocks it. Use the Recreate strategy or switch to ReadWriteMany (which requires a different storage class, such as Azure Files).
Init containers are underrated for data seeding
Instead of running a separate Job and coordinating timing, the init container guarantees the pipeline completes before the API starts. The pod isn’t ready until the init container exits 0.
Git-based image tagging with sed is crude but effective
The CI pipeline directly modifies the manifest and commits. No Helm values file, no Kustomize overlay, no external tool. For a single-service repo, this is the simplest path from push to deploy.
CronJobs need their own RBAC
The pipeline-refresh CronJob runs kubectl rollout restart, which requires patch on deployments. A common mistake is running CronJobs with the default ServiceAccount, which has no permissions. Scope the Role tightly — get and patch on deployments in the target namespace, nothing else.
bitnami/kubectl tags get pruned
Pin to a minor-version tag (bitnami/kubectl:1.30) rather than a patch release, or accept latest for non-production use. Publishers can remove old tags from Docker Hub without warning.
KRaft mode eliminates ZooKeeper but requires explicit bootstrapping
A cluster ID must be generated and the storage directory formatted with it before the broker starts. The three-line startup script handles this cleanly.
Result
From first commit to running demo: about twelve hours. The platform serves a login portal with role-based access, embedded Plotly.js analytics dashboards, a Grafana infrastructure monitoring panel, MLflow model registry, Prefect pipeline orchestration, and a natural language query interface powered by Claude Haiku. All deployed through ArgoCD, all refreshing automatically, all running on two AKS nodes.
The Git history tells the story — 40+ commits, half of them fixes. Infrastructure work is debugging work. The manifests that shipped look clean; the path to get there was not.