Building a RAG Pipeline with ChromaDB and Sentence-Transformers in a Kubernetes Pod

The Problem

The <client> platform needed a self-serve analytics interface where non-technical analysts could ask natural language questions and receive relevant platform documentation. A traditional keyword search would fail for semantic queries: “How do I check subscriber signal quality?” should surface the OSS metrics section, but keyword matching wouldn’t connect “signal quality” to “avg_signal_dbm.”

I built a semantic search layer using Retrieval-Augmented Generation (RAG) to solve this. The system had to be lightweight enough to run in a Kubernetes pod alongside the main API, persistent across restarts, and capable of ingesting documentation with minimal latency overhead.

Architecture Overview

The RAG pipeline consists of three components:

Document Ingestion: Ingest and chunk platform documentation
Embedding & Storage: Encode chunks and store them in ChromaDB with SentenceTransformer embeddings
Semantic Search: Accept user queries and return relevant document passages via the FastAPI endpoint

All three components live in the same Python module (src/embeddings.py) and are consumed by the FastAPI router (src/selfserve.py).

ChromaDB Client Setup

I chose ChromaDB’s PersistentClient so that the vector index is written to disk and survives pod restarts without depending on an external service.

import chromadb

def get_client():
    """Return a persistent ChromaDB client backed by the on-disk store at CHROMA_PATH."""
    return chromadb.PersistentClient(path=CHROMA_PATH)

CHROMA_PATH is configured in config.py and points to a local filesystem path — in production, backed by a persistent volume in Kubernetes. This is simpler than running a separate ChromaDB service and avoids network latency between the API and the vector database.

In the Kubernetes manifest (infra/k8s/04-chromadb.yaml), I initially declared a separate ChromaDB deployment, but after testing, the in-process persistent client proved faster and required less operational overhead. The trade-off is that the index lives on a volume mounted into the API pod. The pod spec uses a host path mount, and a host path is only shared between pods scheduled on the same node, so this setup suits a single replica. Multiple replicas spread across nodes would need either a shared persistent volume or the separate ChromaDB service.

Embedding Model Selection

I selected all-MiniLM-L6-v2 from Sentence-Transformers for three reasons. It is small: about 22 million parameters, roughly 90 MB on disk as float32 weights, which fits comfortably in a container with 512 Mi requested. It is fast: it encodes a paragraph in <100ms on a single CPU core. And it handles technical documentation queries without fine-tuning.

The model is downloaded on first use from Hugging Face and cached locally in the container. No GPUs required.

from chromadb.utils import embedding_functions

def get_collection(client=None):
    """Get or create the platform docs collection with SentenceTransformer embeddings."""
    if client is None:
        client = get_client()
    # The same embedding function is used at ingestion time and at query time.
    ef = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name="all-MiniLM-L6-v2"
    )
    return client.get_or_create_collection(COLLECTION_NAME, embedding_function=ef)

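ChromaDB’s SentenceTransformerEmbeddingFunction wraps the model and handles tokenization and inference. all-MiniLM-L6-v2 itself ends with a normalization layer, so the vectors come out unit-length. Loading the model is the expensive part: ChromaDB caches loaded models by name within a process, so constructing the embedding function again on later calls is cheap, but the first call in each process pays the load cost.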

Document Ingestion and Chunking

The ingestion function reads the platform documentation (a single Markdown file) and splits it into chunks using paragraph boundaries:

def ingest_docs():
    """
    Ingest docs/<client>_platform_overview.md into ChromaDB.
    Splits content into ~500 character chunks and upserts to the collection.
    Returns the number of chunks ingested.
    """
    docs_path = os.path.join(os.path.dirname(__file__), "..", "docs", "<client>_platform_overview.md")

    try:
        with open(docs_path, encoding="utf-8") as f:
            content = f.read()
    except FileNotFoundError:
        raise FileNotFoundError(f"Documentation file not found at {docs_path}")

    # Split into chunks of ~500 chars
    chunks = []
    current = ""
    for para in content.split("\n\n"):
        if len(current) + len(para) > 500:
            if current:
                chunks.append(current.strip())
            current = para
        else:
            current += "\n\n" + para if current else para
    if current:
        chunks.append(current.strip())

    col = get_collection()
    col.upsert(
        documents=chunks,
        ids=[f"doc_{i}" for i in range(len(chunks))]
    )
    return len(chunks)

Chunking strategy: I iterate through paragraphs (split on double newlines) and accumulate them into a running chunk. When adding the next paragraph would push the chunk past 500 characters, I emit the chunk and start a new one with that paragraph. Paragraphs are never split, so a single paragraph longer than 500 characters becomes an oversized chunk of its own. This preserves semantic boundaries, since paragraphs are contextually coherent, while keeping most chunks small enough for the model to embed efficiently.
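The packing rule can be isolated as a pure function, which makes the edge cases easy to inspect. This is a slightly tidied sketch rather than the module's code: chunk_paragraphs and max_chars are illustrative names, and unlike the original loop it strips and skips empty paragraphs.

```python
def chunk_paragraphs(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks of up to ~max_chars."""
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue  # skip empty paragraphs produced by runs of blank lines
        # Flush the running chunk if adding this paragraph would exceed the budget.
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


# Three 200-character paragraphs followed by one 700-character paragraph.
sample = "\n\n".join(["a" * 200, "b" * 200, "c" * 200, "d" * 700])
sizes = [len(c) for c in chunk_paragraphs(sample)]
print(sizes)  # [402, 200, 700]
```

The last chunk shows the oversized-paragraph case: the 700-character paragraph is emitted whole because the splitter never cuts inside a paragraph.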

Why 500 characters? all-MiniLM-L6-v2 truncates input beyond 256 word pieces by default, and ~500 characters typically encodes to 150-200 tokens, leaving some headroom. Larger chunks (1000+ chars) risk silent truncation, don’t improve retrieval quality for this use case, and increase latency; smaller chunks (100 chars) fragment the semantics.

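Upserting to ChromaDB: col.upsert() takes the list of chunks and positional IDs (doc_0, doc_1, etc.). Re-running ingestion overwrites the entry for each ID instead of duplicating it, and ChromaDB embeds each document using the collection’s embedding function. One caveat: because the IDs are positional, if the documentation shrinks, chunks with higher indices from an earlier run stay in the collection until they are deleted.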
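One possible way to make re-ingestion track the document is to derive IDs from chunk content and work out which stored IDs are no longer wanted. The sketch below is an assumption, not the pipeline's code: chunk_id and plan_sync are hypothetical helpers, and the existing IDs would come from something like col.get()["ids"], with stale ones removed via col.delete(ids=...).

```python
import hashlib


def chunk_id(chunk: str) -> str:
    """Derive a deterministic ID from the chunk text (hypothetical scheme)."""
    digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
    return f"doc_{digest[:16]}"


def plan_sync(existing_ids: set[str], chunks: list[str]) -> tuple[dict[str, str], set[str]]:
    """Return (new chunks to upsert keyed by ID, stale IDs to delete)."""
    wanted = {chunk_id(c): c for c in chunks}
    to_upsert = {i: c for i, c in wanted.items() if i not in existing_ids}
    stale = existing_ids - set(wanted)
    return to_upsert, stale


old = ["Intro paragraph.", "Metrics section.", "Removed section."]
new = ["Intro paragraph.", "Metrics section (revised)."]
existing = {chunk_id(c) for c in old}
to_upsert, stale = plan_sync(existing, new)
print(len(to_upsert), len(stale))  # 1 2
```

Unchanged chunks keep their ID and are skipped, so only the revised chunk needs to be embedded again.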

Semantic Search

The search function accepts a user query and returns the top-n relevant document chunks:

def search(query: str, n_results: int = 3) -> list[str]:
    """
    Return top-n relevant document chunks for the query.
    Uses semantic similarity via SentenceTransformer embeddings.
    """
    col = get_collection()
    results = col.query(query_texts=[query], n_results=n_results)
    return results["documents"][0] if results["documents"] else []

The flow:

Encode the query using the same embedding function as the documents
Find the stored document embeddings nearest to the query embedding via the collection’s vector index
Return the top-3 documents (by default) closest to the query

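ChromaDB’s .query() method handles embedding on the fly, so the embedding function doesn’t need to be called explicitly. Note that a collection created without an explicit distance setting uses squared L2 distance, not cosine. Because all-MiniLM-L6-v2 produces unit-length embeddings, both metrics rank results identically, so retrieval order is unaffected. To get cosine distances in the results, set the space when the collection is created (for example, metadata={"hnsw:space": "cosine"}, depending on the ChromaDB version).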
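For unit-length embeddings, ranking by squared L2 distance and ranking by cosine similarity give the same order, because ||a - b||^2 = 2 - 2 cos(a, b). A minimal sketch with made-up three-dimensional vectors (not real model output) illustrates the identity:

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Dot product, which equals cosine similarity for unit vectors only."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


query = normalize([1.0, 2.0, 0.5])
docs = {
    "doc_a": normalize([1.0, 2.1, 0.4]),
    "doc_b": normalize([-1.0, 0.3, 2.0]),
    "doc_c": normalize([0.9, 1.0, 1.0]),
}

# Highest similarity first versus smallest distance first.
by_cosine = sorted(docs, key=lambda k: cosine_similarity(query, docs[k]), reverse=True)
by_l2 = sorted(docs, key=lambda k: squared_l2(query, docs[k]))
print(by_cosine == by_l2)  # True
```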

FastAPI Integration

The self-serve module mounts a RAG endpoint under the main API:

@router.post("/rag")
def query_rag(req: QueryRequest):
    if not RAG_AVAILABLE:
        raise HTTPException(503, "RAG not available — ChromaDB not initialized")
    try:
        docs = rag_search(req.question)
        return {"question": req.question, "results": docs}
    except Exception as e:
        raise HTTPException(500, f"RAG search failed: {str(e)}")

And in the main FastAPI app (src/05_api.py):

app.include_router(query_router, prefix="/api/v1/query", tags=["Self-Serve"])

This mounts the RAG endpoint at POST /api/v1/query/rag. A client sends:

{
  "question": "What are the key metrics for churn risk?"
}

And receives:

{
  "question": "What are the key metrics for churn risk?",
  "results": [
    "The RandomForest model tracks churn as a binary label...",
    "Feature importances consistently show max_days_overdue, avg_signal_dbm...",
    "Churn Rate: % of subscribers flagged as churned in a given month..."
  ]
}
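A client call against this endpoint could look like the following standard-library sketch. The base URL is a placeholder, and build_rag_request is a hypothetical helper that only constructs the request so it can be inspected without a running server.

```python
import json
from urllib.request import Request


def build_rag_request(base_url: str, question: str) -> Request:
    """Build (but do not send) a POST request for the RAG endpoint."""
    body = json.dumps({"question": question}).encode("utf-8")
    return Request(
        url=f"{base_url.rstrip('/')}/api/v1/query/rag",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "http://localhost:8000" is an assumed address for local testing.
req = build_rag_request("http://localhost:8000", "What are the key metrics for churn risk?")
print(req.get_method(), req.full_url)
```

Sending it is then a matter of passing req to urllib.request.urlopen with a timeout and reading the "results" list from the decoded JSON body.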

Kubernetes Deployment

The Kubernetes manifest (infra/k8s/04-chromadb.yaml) declares two resources: a Deployment and a Service. In the current architecture, ChromaDB runs in-process within the FastAPI pod, not as a separate service. The manifest exists for reference but is not deployed in the live cluster.

For a production setup with multiple API pod replicas accessing a shared vector index:

Keep ChromaDB as a separate Deployment (as declared in the manifest)
Configure the API pods to connect to the ChromaDB service at chromadb:8000
Mount a persistent volume claim for the ChromaDB data
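On the application side, switching between the in-process client and the separate service can be reduced to a configuration choice. The sketch below is one way to do it under stated assumptions: the environment variable names CHROMA_HOST, CHROMA_PORT and CHROMA_PATH and both helper functions are hypothetical, while chromadb.HttpClient and chromadb.PersistentClient are the library's client constructors.

```python
import os


def resolve_chroma_target(env=os.environ):
    """Decide between in-process and client/server mode from environment variables.

    CHROMA_HOST, CHROMA_PORT and CHROMA_PATH are hypothetical variable names.
    """
    host = env.get("CHROMA_HOST")
    if host:
        return {"mode": "http", "host": host, "port": int(env.get("CHROMA_PORT", "8000"))}
    return {"mode": "persistent", "path": env.get("CHROMA_PATH", "./chroma_data")}


def make_client(target):
    """Create the matching ChromaDB client (chromadb is imported lazily)."""
    import chromadb

    if target["mode"] == "http":
        return chromadb.HttpClient(host=target["host"], port=target["port"])
    return chromadb.PersistentClient(path=target["path"])


print(resolve_chroma_target({"CHROMA_HOST": "chromadb"}))
print(resolve_chroma_target({}))
```

With this shape, a single-replica deployment leaves CHROMA_HOST unset, and a multi-replica deployment sets it to the service name without any code change.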

The FastAPI container includes the necessary dependencies:

chromadb
sentence-transformers

These are installed via requirements.txt and baked into the container image.

Performance in Practice

Paragraph-aware chunking was the right call: unlike fixed-size token splitting, paragraph boundaries respect semantic structure. Queries like “how do I identify at-risk subscribers?” correctly surface the example workflow section.

all-MiniLM-L6-v2 required no GPU and was fast enough for interactive queries. The 384-dim vectors fit in memory alongside other application state. The PersistentClient and embedding function abstraction required minimal setup and eliminated boilerplate. Running ChromaDB in-process meant no separate instance to manage and no network calls between API and vector store. Search latency for three results was consistently <200ms.

ChromaDB API Version Migration

When I built the System Status page, the health check for ChromaDB used /api/v1/heartbeat — the endpoint documented in most tutorials and the ChromaDB docs at the time of initial development. The check returned “error” even though ChromaDB was running and serving queries.

Debugging from inside the API pod:

kubectl exec <api-pod-name> -n <namespace> -- \
  python3 -c "from urllib.request import urlopen; \
  r = urlopen('http://chromadb:8000/api/v1/heartbeat', timeout=3); \
  print(r.status, r.read().decode())"

urllib.error.HTTPError: HTTP Error 410: Gone

410 Gone: not 404, not 500. ChromaDB’s latest image had migrated to the v2 API, and the v1 heartbeat endpoint had been explicitly retired with a 410 status code rather than silently removed. The fix:

("ChromaDB", "http://chromadb:8000/api/v2/heartbeat"),

Which returns:

{"nanosecond heartbeat": 1777086854309564736}

Running chromadb/chroma:latest in the Kubernetes manifest means the API contract can shift under you between pod restarts. The 410 status code was a courteous signal; most breaking changes in container images are silent. Either pin to a specific tag (chromadb/chroma:0.5.x) or build health checks that probe multiple API versions.
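A health check that probes more than one API version might be sketched as follows. This is an illustration rather than the System Status page's code: fetch_status and probe_heartbeat are hypothetical helpers, and the fetch function is injectable so the logic can be exercised without a live ChromaDB instance.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_status(url: str, timeout: float = 3.0):
    """Return (http_status, body_text); HTTP errors are reported, not raised."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode()
    except HTTPError as err:
        return err.code, ""  # e.g. 410 for a retired endpoint
    except (URLError, OSError):
        return None, ""  # connection refused, DNS failure, timeout


def probe_heartbeat(base_url, versions=("v2", "v1"), fetch=fetch_status):
    """Try each API version in order; return the first that answers 200."""
    seen = {}
    for version in versions:
        status, _body = fetch(f"{base_url}/api/{version}/heartbeat")
        seen[version] = status
        if status == 200:
            return {"healthy": True, "version": version, "seen": seen}
    return {"healthy": False, "version": None, "seen": seen}


def fake_fetch(url, timeout=3.0):
    """Stand-in for a server that has retired v1 and serves v2."""
    return (200, "{}") if "/api/v2/" in url else (410, "")


result = probe_heartbeat("http://chromadb:8000", versions=("v1", "v2"), fetch=fake_fetch)
print(result)  # {'healthy': True, 'version': 'v2', 'seen': {'v1': 410, 'v2': 200}}
```

Recording the status seen for each version keeps the 410 visible in the result, which is useful when deciding whether to update the configured endpoint.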

Limitations and Trade-Offs

Single file, single collection: The current ingestion reads a single Markdown file. To support multiple documentation sources (API specs, Kafka schema, etc.), I would need to ingest multiple files into separate collections or use a single collection with metadata filtering, and add a collection registry to route queries to the right index.
Paragraph chunking can be coarse for dense technical content: A section with multiple subsections is treated as one paragraph if its parts are separated by single newlines instead of double newlines, so it ends up in a single chunk that may run well past 500 characters. Markdown formatting matters. For highly structured docs, a recursive hierarchical chunking strategy (split by heading, then by paragraph, then by sentence) would be more robust.
No re-ranking: The top-3 results are returned as-is. A re-ranking step (using a larger, slower model like BGE or ColBERT) could improve precision, but the latency trade-off didn’t justify it for this use case.
No streaming or partial updates: Ingestion is all-or-nothing. To add a new doc section without re-ingesting the entire file, append-only chunking with a changelog would be needed.
Embedding drift over time: If documentation changes semantically — field names change, sections are reorganised — old embeddings become stale. A monitoring job to flag similarity mismatches would be valuable.
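The first two levels of the hierarchical strategy mentioned above (heading, then paragraph) can be sketched as below. This is an assumption about how such a splitter could look, not part of the current module: both function names are hypothetical, sentence-level splitting is omitted, and a line starting with "#" inside a code fence would be misread as a heading.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown: str) -> list[tuple[str, str]]:
    """Split Markdown into (heading, body) sections at ATX headings."""
    sections: list[tuple[str, str]] = []
    title, lines = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section before starting a new one.
            if title or any(ln.strip() for ln in lines):
                sections.append((title, "\n".join(lines).strip()))
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    if title or any(ln.strip() for ln in lines):
        sections.append((title, "\n".join(lines).strip()))
    return sections


def chunk_sections(markdown: str, max_chars: int = 500) -> list[dict]:
    """Chunk each section by paragraph, tagging every chunk with its heading."""
    chunks = []
    for title, body in split_by_heading(markdown):
        current = ""
        for para in filter(None, (p.strip() for p in body.split("\n\n"))):
            if current and len(current) + len(para) > max_chars:
                chunks.append({"heading": title, "text": current})
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        if current:
            chunks.append({"heading": title, "text": current})
    return chunks


doc = "# Metrics\n\nSignal is avg_signal_dbm.\n\n## Churn\n\nChurn is a binary label.\n"
for c in chunk_sections(doc):
    print(c["heading"], "->", c["text"])
```

Because chunks never cross a heading, the heading can be stored as per-chunk metadata at upsert time, which also gives the metadata filtering option from the first limitation something to filter on.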

Performance Characteristics

Measurements taken on a single-replica FastAPI pod with 1 CPU core and 1 Gi memory:

Ingestion (44 paragraphs into 32 chunks): ~5 seconds (one-time, on pod startup)
Query latency (3 results): 120-180 ms (including embedding + similarity search)
Memory overhead: ~150 MB for the Sentence-Transformer model in memory; ChromaDB collection with 32 chunks adds ~5 MB
Disk footprint: ~80 MB for the Chroma database file; the model cache adds ~100 MB (both compressed in the container image)

For a pod with 512 Mi requested memory and 1 Gi limit, this is well within bounds.
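The write-up does not say how these figures were collected. One way to take comparable latency numbers is a small timing harness like the sketch below, where measure_latency is a hypothetical helper and the timed callable is a stand-in for the real search function.

```python
import statistics
import time


def measure_latency(fn, *args, runs: int = 20, warmup: int = 2):
    """Call fn repeatedly and return latency statistics in milliseconds."""
    for _ in range(warmup):
        fn(*args)  # warm caches (e.g. lazy model loading) before timing
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "min_ms": samples[0],
        "median_ms": statistics.median(samples),
        "max_ms": samples[-1],
    }


# Stand-in workload; in the pod this would be the search function and a real query.
stats = measure_latency(sorted, "how do I check subscriber signal quality?")
print(sorted(stats))  # ['max_ms', 'median_ms', 'min_ms']
```

The warm-up calls matter here because the first query in a process includes model loading, which would otherwise dominate the reported range.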

Integration with Self-Serve Analytics

The RAG endpoint complements the text-to-SQL /api/v1/query/sql endpoint. An analyst might:

Ask the RAG endpoint: “What metrics indicate network quality issues?”
Receive documentation about avg_signal_dbm and avg_latency_ms
Use the SQL endpoint to query: “Show me subscribers with signal below -100 dBm”
Combine insights to prioritize retention actions

This two-stage pattern — semantic search for context, then structured data access — avoids over-reliance on LLMs for deterministic queries.

Production Rule

The in-process ChromaDB architecture trades distributed resilience for operational simplicity. For a single-replica internal tool, that trade-off is correct.

The main lesson: paragraph-aware chunking and careful embedding model selection matter more than infrastructure complexity for small-to-medium documentation corpora. Get the chunk boundaries right before adding retrieval infrastructure.