Blog Field Notes DuckDB as an Embedded Lakehouse on Kubernetes: Medallion Layers in a Single File
Platform #duckdb#lakehouse#medallion#kubernetes#pvc#olap#analytics#embedded-database

DuckDB as an Embedded Lakehouse on Kubernetes: Medallion Layers in a Single File

Used DuckDB as a single-file analytical database running inside a Kubernetes pod, with Bronze/Silver/Gold medallion layers, shared PVC storage, and read-only connections for secure query serving.

· Gideon Warui
ON THIS PAGE

When I needed a fast, embeddable analytical database for the <client> demo — one that could scale from development to a live platform without managing a separate Postgres cluster or a heavyweight Spark job — I chose DuckDB. Not because it’s trendy, but because it solved three hard problems in one tool: zero operational overhead, ACID semantics, and the ability to evolve a schema inside the same file across medallion transformation layers.

This is how I architected a complete lakehouse on Kubernetes using DuckDB’s single-file persistence, Prefect orchestration, and read-only API connections.

Why DuckDB Over Postgres or Spark

I evaluated three paths:

Postgres + pg_partman would mean managing a separate service, backups, upgrade cycles, and connection pool complexity. For a data platform where transformations are batch-driven and analytical queries dominate, I’d be paying for transactional guarantees I didn’t need.

Spark would mean a cluster, YARN or K8s operator overhead, JVM memory thrashing on small datasets, and needing separate storage (S3, HDFS). For a demo that needs to run on 2 CPU cores and 2GB RAM, it was overkill.

DuckDB meant:

  • Single .duckdb file, embedded directly in the Python process
  • Native columnar storage and vectorized execution — analytical queries run 10-50x faster than Postgres on aggregations
  • ACID transactions and write-ahead logging built in
  • SQL that’s stricter and more standard than MySQL or SQLite
  • No daemon, no network port, no connection pooling to debug

The tradeoff: single-writer (or read-only followers). For a batch pipeline that runs once per hour and serves hundreds of read-only queries, that’s not a constraint — it’s a feature.

Medallion Architecture in One File

I organized the lakehouse into three tables inside the same DuckDB file:

Bronze (bronze_raw_events) ingests raw JSON from Kafka without transformation:

  • Event metadata: event_id, event_type (BSS, OSS, IoT), subscriber_id, event_timestamp
  • Raw payload as JSON: raw_payload — contains nested BSS plans, signal metrics, device battery
  • A quality_score computed upstream by a Prefect task

Silver (silver_clean_events) flattens, validates, and filters:

CREATE OR REPLACE TABLE silver_clean_events AS
SELECT
    event_id,
    event_type,
    source_system,
    UPPER(subscriber_id)            AS subscriber_id,
    CAST(event_timestamp AS TIMESTAMP) AS event_timestamp,
    quality_score,
    ingested_at,

    -- BSS fields (null for non-BSS rows)
    CASE WHEN event_type = 'BSS'
         THEN raw_payload->>'plan'             END                          AS bss_plan,
    CASE WHEN event_type = 'BSS'
         THEN CAST(raw_payload->>'data_used_mb' AS INTEGER) END            AS bss_data_used_mb,
    CASE WHEN event_type = 'BSS'
         THEN CAST(raw_payload->>'bill_amount_zmw' AS DOUBLE) END          AS bss_bill_zmw,

    -- OSS fields
    CASE WHEN event_type = 'OSS'
         THEN CAST(raw_payload->>'signal_dbm'  AS INTEGER) END             AS oss_signal_dbm,
    CASE WHEN event_type = 'OSS'
         THEN CAST(raw_payload->>'latency_ms'  AS INTEGER) END             AS oss_latency_ms,

    -- IoT fields
    CASE WHEN event_type = 'IoT'
         THEN CAST(raw_payload->>'battery_pct' AS INTEGER) END             AS iot_battery_pct

FROM bronze_raw_events
WHERE
    quality_score >= 45
    AND subscriber_id IS NOT NULL
    AND event_timestamp IS NOT NULL
;

The Silver layer enforces three things:

  1. Type safety: Cast strings to INTEGER, DOUBLE, TIMESTAMP — any malformed data fails here and gets quarantined
  2. Quality threshold: Drop records with quality_score < 45 — they never pollute downstream
  3. Denormalization: Flatten JSON into typed columns so Gold doesn’t have to do it again

Gold (gold_subscriber_features) aggregates to subscriber-level features for ML:

CREATE OR REPLACE TABLE gold_subscriber_features AS
SELECT
    subscriber_id,

    -- Event volume
    COUNT(*)                                                AS total_events,
    COUNT(*) FILTER (WHERE event_type = 'BSS')             AS bss_events,
    COUNT(*) FILTER (WHERE event_type = 'OSS')             AS oss_events,

    -- Network quality
    ROUND(AVG(oss_signal_dbm), 1)                          AS avg_signal_dbm,
    ROUND(AVG(oss_latency_ms), 1)                          AS avg_latency_ms,
    ROUND(AVG(oss_packet_loss_pct), 2)                     AS avg_packet_loss_pct,

    -- Billing behaviour
    ROUND(AVG(bss_bill_zmw), 2)                            AS avg_bill_zmw,
    ROUND(AVG(bss_data_used_mb), 0)                        AS avg_data_used_mb,
    MAX(bss_days_overdue)                                  AS max_days_overdue,
    COUNT(*) FILTER (
        WHERE event_type = 'BSS' AND bss_payment_status = 'overdue'
    )                                                       AS overdue_count,

    -- Device health
    ROUND(AVG(iot_battery_pct), 1)                         AS avg_battery_pct,

    -- Data quality
    ROUND(AVG(quality_score), 1)                           AS avg_quality_score,

    -- Derived churn-risk label
    CASE
        WHEN MAX(bss_days_overdue) > 30
             OR AVG(oss_signal_dbm) < -90
             OR AVG(oss_latency_ms) > 150
             THEN 1
        ELSE 0
    END                                                     AS churn_label

FROM silver_clean_events
GROUP BY subscriber_id
HAVING COUNT(*) >= 2
;

Each layer is idempotent — CREATE OR REPLACE TABLE allows the pipeline to be rerun without manual cleanup. The aggregation applies FILTER clauses to compute event-type-specific metrics without bloating the row width.

Kubernetes Architecture: Init Container + Shared PVC

The trick to running a single-writer, multi-reader DuckDB setup is to separate concerns:

  1. Init container (pipeline-seed) runs the medallion transforms and exits
  2. API container starts after, mounts the same PVC, and serves read-only queries
  3. Deployment strategy: Recreate ensures only one pod writes at a time

The manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: <client>-api
  namespace: <namespace>
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: <client>-api
  template:
    metadata:
      labels:
        app: <client>-api
    spec:
      nodeSelector:
        agentpool: <client>pool  # REVIEW: redacted — confirm
      initContainers:
      - name: pipeline-seed
        image: <acr-registry>.azurecr.io/<client>-api:latest
        command: ["python", "flows/<client>_flow.py"]
        env:
        - name: DB_PATH
          value: "/data/<client>_lakehouse.duckdb"
        resources:
          requests:
            memory: "256Mi"
            cpu: "100m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
        volumeMounts:
        - name: app-data
          mountPath: /data
      containers:
      - name: api
        image: <acr-registry>.azurecr.io/<client>-api:latest
        ports:
        - containerPort: 8000
        env:
        - name: DB_PATH
          value: "/data/<client>_lakehouse.duckdb"
        - name: MODEL_PATH
          value: "/data/churn_model.joblib"
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
        volumeMounts:
        - name: app-data
          mountPath: /data
        readinessProbe:
          httpGet:
            path: /api/v1/health
            port: 8000
          initialDelaySeconds: 15
          periodSeconds: 10
      volumes:
      - name: app-data
        persistentVolumeClaim:
          claimName: <client>-data

The PVC itself is minimal:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: <client>-data
  namespace: <namespace>
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi

Key decisions

  • Recreate strategy, not RollingUpdate: A single .duckdb file cannot be safely accessed by two processes simultaneously. If the new pod starts before the old one terminates, WAL corruption follows. Recreate forces a hard kill, then start fresh — acceptable for a batch system.
  • Init container writes, main container reads: The init container runs the Prefect flow, which loads Bronze, runs the Silver and Gold transforms, and exits cleanly. The main API container then starts, mounts the completed .duckdb file, and never writes to it.
  • ReadWriteOnce PVC: One writer and one reader at a time is sufficient; RWO avoids NFS overhead.

Read-Only Connections and Null Safety

The API connects to DuckDB in read-only mode:

def get_db():
    """Get a read-only DuckDB connection."""
    return duckdb.connect(DB_PATH, read_only=True)

Read-only mode enforces that any attempt to INSERT, UPDATE, or DELETE fails with a hard error — no accidental writes, no surprises.

But read-only doesn’t mean failure-proof. DuckDB returns NULL for missing aggregates and computed columns. The API must handle NaN and infinity:

@app.get("/api/v1/subscribers")
def list_subscribers(limit: int = 20, risk: Optional[str] = None):
    con = get_db()
    try:
        q = (
            "SELECT subscriber_id, total_events, "
            "COALESCE(avg_signal_dbm, 0) AS avg_signal_dbm, "
            "COALESCE(avg_latency_ms, 0) AS avg_latency_ms, "
            "COALESCE(avg_bill_zmw, 0) AS avg_bill_zmw, "
            "COALESCE(max_days_overdue, 0) AS max_days_overdue, "
            "COALESCE(churn_label, 0) AS churn_label "
            "FROM gold_subscriber_features"
        )
        if risk == "high":
            q += " WHERE churn_label = 1"
        elif risk == "low":
            q += " WHERE churn_label = 0"
        q += f" ORDER BY churn_label DESC LIMIT {limit}"
        rows = con.execute(q).fetchdf()
        rows = rows.fillna(0)
        records = rows.to_dict(orient="records")
        for rec in records:
            for k, v in rec.items():
                if isinstance(v, float) and (np.isnan(v) or np.isinf(v)):
                    rec[k] = 0.0
        return records
    finally:
        con.close()

Three layers of defense:

  1. COALESCE in SQL: Convert NULL to 0 before returning from DuckDB
  2. Pandas fillna: Catch any NULL that slipped through
  3. NaN/Inf check: Catch floating-point pathologies before JSON serialization

This is paranoid, but production APIs need to be. A single NaN in a JSON response breaks downstream clients.

Self-Serve Queries: LLM-to-SQL Against Gold

The self-serve module uses Claude Haiku to translate natural language into DuckDB SQL, then executes read-only:

@router.post("/sql")
def query_sql(req: QueryRequest):
    if not LLM_AVAILABLE:
        raise HTTPException(503, "LLM not available")

    client = anthropic.Anthropic(api_key=ANTHROPIC_KEY)

    try:
        message = client.messages.create(
            model=HAIKU_MODEL,
            max_tokens=512,
            system=SYSTEM_PROMPT,
            messages=[{"role": "user", "content": req.question}],
        )
        raw = message.content[0].text
    except Exception as e:
        raise HTTPException(502, f"LLM request failed: {str(e)}")

    sql = extract_sql(raw)

    if sql == "CANNOT_ANSWER":
        raise HTTPException(422, "I can't answer that with available subscriber data.")

    if not validate_sql(sql):
        raise HTTPException(422, "Generated query was not a valid SELECT statement.")

    try:
        con = duckdb.connect(DB_PATH, read_only=True)
        rows = con.execute(sql).fetchdf()
        con.close()
        rows = rows.fillna(0)
        records = rows.to_dict(orient="records")
        for rec in records:
            for k, v in rec.items():
                if isinstance(v, float) and (v != v or v == float('inf')):
                    rec[k] = 0.0
        return {"sql": sql, "results": records}
    except Exception as e:
        raise HTTPException(500, f"Query execution failed: {str(e)}")

The system prompt provides schema context:

TABLE: gold_subscriber_features
- subscriber_id (VARCHAR)
- total_events (INTEGER)
- bss_events (INTEGER)
- oss_events (INTEGER)
- avg_signal_dbm (DOUBLE)
- avg_latency_ms (DOUBLE)
- avg_packet_loss_pct (DOUBLE)
- avg_bill_zmw (DOUBLE)
- avg_data_used_mb (DOUBLE)
- max_days_overdue (INTEGER)
- churn_label (INTEGER) — 0 = retained, 1 = at risk

TABLE: silver_clean_events
[fields...]

The validator rejects anything that’s not a SELECT:

def validate_sql(sql: str) -> bool:
    normalized = sql.strip().upper()
    if not normalized.startswith("SELECT"):
        return False
    dangerous = ["INSERT", "UPDATE", "DELETE", "DROP", "CREATE", "ALTER"]
    for keyword in dangerous:
        if re.search(rf'\b{keyword}\b', normalized):
            return False
    return True

Because the connection is read-only, even a malicious LLM output can’t corrupt data. The validator is defense-in-depth, not the primary gate.

Schema Design Decisions

Why not normalize further?

The Silver layer could split bss_*, oss_*, and iot_* fields into separate tables joined on event_id. But that adds query complexity and doesn’t save disk space — columnar DuckDB compression handles nulls efficiently. A single wide table is faster to query and easier to reason about.

Why quality_score at Bronze, not Silver?

The quality computation happens upstream during ingestion (a Prefect task validates JSON schema, checks for missing required fields, assigns a score 0-100). Including it at Bronze allows Silver to make filtering decisions, and Gold can report it back to the API. Computing it later would mean losing the ability to explain why records were dropped.

Why HAVING COUNT(*) >= 2 at Gold?

A subscriber with a single event might skew statistics (one data point = no mean, no variance). Requiring 2+ events ensures meaningful aggregates. For a telecom platform, a subscriber with only one event is likely a test or erroneous record anyway.

Operational Insights

Init container seed time

256Mi and 100m CPU is enough for the pipeline to load ~100k Bronze records, flatten to ~95k Silver (after 5% quality filter), and aggregate to ~50k Gold subscribers in under 60 seconds. The limits (2Gi, 1000m) protect against runaway queries.

Read-only latency

Querying Gold from the API is sub-10ms for most queries. The /api/v1/analytics/summary endpoint (which scans the entire Gold table for counts and averages) completes in ~50ms on 50k rows.

Disaster recovery

The .duckdb file is the entire state. If the PVC is lost, the next pod init will rebuild it from Kafka + Bronze fixtures. No separate backup strategy needed — the source of truth is the event stream, not the lakehouse.

Scaling beyond this

If subscribers exceed 1M or queries need sub-second latency, migrate to Postgres with bulk-load from DuckDB. For now, DuckDB is the pragmatic choice.

Conclusion

DuckDB proved that a single-file lakehouse is viable on Kubernetes when you accept the constraint of batch-driven updates and read-heavy workloads. The medallion architecture inside one file eliminates versioning and synchronization headaches. Init containers and read-only connections enforce a clean separation between compute (writing the lakehouse) and serving (reading it). The result is a platform that requires zero operational overhead, scales to millions of events, and can run on a $100/month AKS cluster.

For platforms that don’t yet know if they need Spark or Postgres, DuckDB buys clarity. Build the medallion layers first, prove the schema, then graduate to a distributed system only when you hit its natural limits.

#duckdb#lakehouse#medallion#kubernetes#pvc#olap#analytics#embedded-database