Platform #victoriametrics #victorialogs #vector #grafana #prometheus #loki #eks #kubernetes #observability #cost-optimization

Replacing kube-prometheus-stack with VictoriaMetrics on EKS

Replaced kube-prometheus-stack and Loki with VictoriaMetrics, VictoriaLogs, and Vector on EKS, cutting the observability memory footprint by 56% and adding dual-sink log archival to S3.

Gideon Warui

I was running kube-prometheus-stack, Loki, and Promtail on EKS. The setup worked, but Prometheus alone requested 512Mi of memory for a cluster running fewer than 20 applications, and Loki on filesystem storage had no path to durable log archival for compliance.

I replaced the stack with VictoriaMetrics, VictoriaLogs, and Vector. The result: a 56% reduction in observability memory requests and a dual-sink log pipeline — VictoriaLogs for hot queryable storage and S3 for compliance archival. Six gotchas came up during the rollout; each is documented inline below.

The existing setup:

  • kube-prometheus-stack: Prometheus StatefulSet (512Mi memory request), Alertmanager, kube-state-metrics, node-exporter
  • Loki (SimpleScalable mode, filesystem storage): read and write replicas with local filesystem backend
  • Promtail DaemonSet: log shipping from nodes to Loki

Prometheus’s memory footprint was disproportionate for this cluster size: the StatefulSet requested 512Mi with a 1Gi limit, and most of that went to WAL and index overhead. Loki with filesystem storage provided no durability guarantees and no long-term archival path for compliance.

The replacement architecture:

Metrics:  VictoriaMetrics K8s Stack (vmstack)
           └── VMSingle       — Prometheus-compatible TSDB, single binary
           └── VMAgent        — scraping (replaces Prometheus scrape config)
           └── VMAlert        — alerting rules evaluation
           └── Alertmanager   — alert routing
           └── kube-state-metrics + node-exporter

Logs:     VictoriaLogs        — log database (Elasticsearch-compatible API)
           └── retention 7d, 10Gi PVC, port 9428

Log ship: Vector (DaemonSet)
           └── Sink 1: VictoriaLogs — hot storage, queryable via Grafana
           └── Sink 2: S3 — cold archive (date-partitioned, gzip, 90d lifecycle)

Grafana:  Standalone Helm release
           └── Prometheus datasource → VMSingle
           └── VictoriaLogs datasource → victoriametrics-logs-datasource plugin

Everything runs in the observability namespace.


Part 1: VictoriaMetrics Stack (vmstack)

The victoria-metrics-k8s-stack chart is a near drop-in replacement for kube-prometheus-stack. It supports the same ServiceMonitor/PodMonitor CRDs, the same Grafana dashboards (imported by their grafana.com IDs), and a Prometheus-compatible API. The difference is that the Prometheus StatefulSet is replaced by VMSingle, a single binary that handles TSDB storage and query serving, with VMAgent taking over scraping.

# values-vmstack.yaml (abbreviated)
vmsingle:
  enabled: true
  spec:
    retentionPeriod: "7d"
    storage:
      storageClassName: gp2
      resources:
        requests:
          storage: 10Gi
    resources:
      requests:
        memory: 256Mi   # vs 512Mi for Prometheus — 50% reduction
        cpu: 100m
      limits:
        memory: 1Gi
        cpu: 500m

vmagent:
  enabled: true
  spec:
    resources:
      requests:
        memory: 64Mi
        cpu: 50m

grafana:
  enabled: false  # managed as separate Helm release

Grafana is disabled in the vmstack chart and installed separately. This gives independent upgrade control over the visualization layer versus the metrics backend.

Gotcha 1: VMSingle PVC stuck Pending

After the first helm install, the VMSingle pod was pending. Describing the PVC:

kubectl describe pvc vmsingle-vmstack -n observability
Events:
  Warning  ProvisioningFailed  no persistent volumes available for this claim
             and no storage class is set

The cluster had no default StorageClass. The vmstack chart’s storageClassName was set in values.yaml, but the VictoriaMetrics operator generates the PVC spec from its own CR, and on first reconcile it did not propagate the storageClassName field — creating a PVC with no storage class.

Fix:

# Patch gp2 as the cluster default StorageClass
kubectl patch storageclass gp2 -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

# Delete the stuck PVC so the operator recreates it (now picks up the default)
kubectl delete pvc vmsingle-vmstack -n observability

The operator immediately reconciled and provisioned a new PVC correctly.
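
A pre-flight check for this failure mode could look like the sketch below. The helper name default_storageclasses is hypothetical. It filters "NAME FLAG" lines so it can be tried without a cluster, and the commented kubectl pipeline, which assumes kubectl already points at the EKS cluster, shows one way to feed it live data.

```shell
# Print the names of StorageClasses flagged as the cluster default.
# Input: one "NAME FLAG" pair per line; FLAG is "true" for a default class
# and empty (or anything else) otherwise.
default_storageclasses() {
  awk '$2 == "true" { print $1 }'
}

# Offline examples with made-up sample data (not real cluster output):
printf 'gp2 \n'     | default_storageclasses   # prints nothing: no default set
printf 'gp2 true\n' | default_storageclasses   # prints gp2

# Against a live cluster:
#   kubectl get storageclass -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.annotations.storageclass\.kubernetes\.io/is-default-class}{"\n"}{end}' \
#     | default_storageclasses
```

If the live pipeline prints nothing, the cluster has no default StorageClass and the PVC in this gotcha would stay Pending.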


Part 2: VictoriaLogs

VictoriaLogs replaces Loki. It exposes an Elasticsearch-compatible bulk ingest API at /insert/elasticsearch/_bulk — any shipper with an Elasticsearch sink (Vector, Fluentd, Logstash) writes to it without a custom output plugin.

# values-victorialogs.yaml
server:
  retentionPeriod: 7d
  persistentVolume:
    enabled: true
    storageClassName: gp2
    size: 10Gi
  resources:
    requests:
      memory: 64Mi   # Loki read+write combined was ~300Mi, so roughly 79% lower
      cpu: 50m
    limits:
      memory: 512Mi
      cpu: 500m
  extraArgs:
    envflag.enable: "true"
    envflag.prefix: "VM_"
    loggerFormat: json

Loki in SimpleScalable mode runs separate read and write replicas; even at one replica each, the combined memory request comes to roughly 300Mi. VictoriaLogs with a 64Mi request handles the same log volume for this cluster size with headroom to spare.


Part 3: Vector DaemonSet

Vector replaces Promtail. The reason for the switch: Promtail only ships to Loki, while Vector can simultaneously write to VictoriaLogs (hot storage, 7-day queryable) and S3 (cold archive, compliance retention).

# values-vector.yaml (abbreviated)
role: Agent  # DaemonSet mode

serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::ACCOUNT:role/vector-s3-logs"

tolerations:
  - operator: Exists
    effect: NoSchedule
  - operator: Exists
    effect: NoExecute

customConfig:
  sources:
    kubernetes_logs:
      type: kubernetes_logs
      exclude_paths_glob_patterns:
        - "/var/log/pods/observability_vector-*/**"

  transforms:
    enrich_logs:
      type: remap
      inputs: [kubernetes_logs]
      source: |
        .cluster = "my-cluster"
        if exists(.kubernetes.pod_labels."app.kubernetes.io/name") {
          .app = .kubernetes.pod_labels."app.kubernetes.io/name"
        }

  sinks:
    victorialogs:
      type: elasticsearch
      inputs: [enrich_logs]
      endpoints:
        - "http://victorialogs-victoria-logs-single-server.observability.svc.cluster.local:9428/insert/elasticsearch/"
      bulk:
        action: index
        index: "k8s-logs"
      request:
        headers:
          VL-Msg-Field: message
          VL-Time-Field: timestamp
          VL-Stream-Fields: cluster,kubernetes.pod_namespace,kubernetes.pod_name,kubernetes.container_name
      healthcheck:
        enabled: false

    s3_archive:
      type: aws_s3
      inputs: [enrich_logs]
      bucket: my-logs-bucket
      region: us-east-2
      key_prefix: "k8s/%F/"
      compression: gzip
      encoding:
        codec: json
      batch:
        max_bytes: 10485760
        timeout_secs: 300

The S3 sink uses IRSA — no static credentials in the pod. The ServiceAccount annotation maps to an IAM role with s3:PutObject on the target bucket. The %F in key_prefix expands to the current date (YYYY-MM-DD), producing date-partitioned prefixes like k8s/2026-03-22/.
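
The date partitioning and the 90-day retention can be sketched as follows. %F is strftime shorthand for %Y-%m-%d, so date(1) can preview the prefix for a given day. The lifecycle rule is an assumption about how the 90-day window is enforced: the rule ID and the choice to expire objects, rather than transition them to another storage class, are not taken from the original setup.

```shell
# Preview the key prefix Vector writes for a given day.
# (-d with a fixed date is GNU date syntax.)
prefix="$(date -u -d '2026-03-22' +'k8s/%F/')"
echo "$prefix"   # k8s/2026-03-22/

# Hypothetical lifecycle rule: expire archived objects after 90 days.
lifecycle_json='{
  "Rules": [
    {
      "ID": "expire-k8s-logs",
      "Status": "Enabled",
      "Filter": { "Prefix": "k8s/" },
      "Expiration": { "Days": 90 }
    }
  ]
}'
echo "$lifecycle_json"

# To apply it (assumes AWS CLI credentials allowed to manage the bucket):
#   aws s3api put-bucket-lifecycle-configuration \
#     --bucket my-logs-bucket \
#     --lifecycle-configuration "$lifecycle_json"
```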

Gotcha 2: Vector customConfig passes through Helm tpl

The customConfig field in the Vector Helm chart is rendered through Helm’s tpl function, so any {{ }} in the config is parsed as a Go template expression. Strftime specifiers such as the %F in k8s/%F/ pass through untouched, but a key prefix that contains {{, such as Vector’s own field templating (k8s/{{ kubernetes.pod_namespace }}/), fails with a template parse error.

The fix: avoid {{ }} in customConfig values, or escape each delimiter as a quoted string inside a template action. Write {{ "{{" }} for a literal {{ and {{ "}}" }} for a literal }}.
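
As a rough illustration, the escaping can be automated with sed. The helper name escape_for_tpl and the @@OPEN@@/@@CLOSE@@ placeholder tokens are hypothetical. The sketch assumes the standard Go-template idiom of emitting a delimiter as a quoted string, and the sample prefix uses Vector's field-template syntax.

```shell
# Escape Go-template delimiters so Helm's tpl emits them literally.
# Placeholder tokens keep the braces introduced by the first replacement
# from being escaped again by the second.
escape_for_tpl() {
  sed -e 's/{{/@@OPEN@@/g' -e 's/}}/@@CLOSE@@/g' \
      -e 's/@@OPEN@@/{{ "{{" }}/g' -e 's/@@CLOSE@@/{{ "}}" }}/g'
}

# A Vector-templated key prefix, escaped for use inside customConfig:
printf '%s\n' 'k8s/{{ kubernetes.pod_namespace }}/%F/' | escape_for_tpl
# prints: k8s/{{ "{{" }} kubernetes.pod_namespace {{ "}}" }}/%F/
```

After tpl renders the escaped string, Vector receives the original k8s/{{ kubernetes.pod_namespace }}/%F/ and applies its own templating.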

Gotcha 3: encoding.codec not accepted in elasticsearch sink

The first iteration of the Vector config included encoding.codec: json on the VictoriaLogs elasticsearch sink, mirroring the S3 sink. Vector rejected this at startup:

error: unknown field `codec`, expected one of `except_fields`, `only_fields`, `timestamp_format`

The elasticsearch sink type does not accept encoding.codec — it always serializes as JSON. The field is only valid on sinks like aws_s3, file, and console. Remove it from the elasticsearch sink entirely.

Gotcha 4: Vector self-log feedback loop

Without the exclude_paths_glob_patterns on the kubernetes_logs source, Vector ships its own logs back into VictoriaLogs, which generates more logs, which Vector ships — a low-volume but unnecessary feedback loop. Exclude Vector’s own pod log paths:

exclude_paths_glob_patterns:
  - "/var/log/pods/observability_vector-*/**"

Part 4: Grafana

Grafana is installed as a separate Helm release to decouple its upgrade lifecycle from the metrics backend. Two non-obvious configuration requirements for Grafana 12 with VictoriaMetrics:

Gotcha 5: Grafana 12 removed [alerting].enabled

Grafana 12 removed the legacy [alerting] config section. Setting alerting.enabled: true in grafana.ini produces a startup warning and has no other effect. Only unified_alerting is supported:

grafana.ini:
  unified_alerting:
    enabled: true
  # Do NOT include:
  # alerting:
  #   enabled: true

Gotcha 6: Grafana RWO PVC requires Recreate strategy

Grafana uses a ReadWriteOnce PVC for its SQLite database. An RWO volume can be attached to only one node at a time. The default Deployment rolling update strategy creates a new pod before terminating the old one, so when the new pod is scheduled onto a different node it cannot mount the PVC until the old pod releases it, and the rollout stalls indefinitely.

Fix:

# In Grafana Helm values
deploymentStrategy:
  type: Recreate

This terminates the existing pod before starting the replacement, ensuring the PVC is free when the new pod starts.

VictoriaLogs datasource plugin

Grafana does not ship a built-in datasource for VictoriaLogs. The victoriametrics-logs-datasource plugin must be explicitly installed:

plugins:
  - victoriametrics-logs-datasource

datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        url: http://vmsingle-vmstack.observability.svc.cluster.local:8428
        isDefault: true

      - name: VictoriaLogs
        type: victoriametrics-logs-datasource
        url: http://victorialogs-victoria-logs-single-server.observability.svc.cluster.local:9428
        jsonData:
          maxLines: 1000

The Prometheus datasource points at VMSingle using the standard Prometheus API — no VictoriaMetrics-specific datasource plugin needed for metrics.
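
Before wiring up dashboards, both backends can be checked from a workstation. The sketch below assumes local port-forwards to the two services named in the datasource URLs; the function names are hypothetical. It uses the Prometheus-compatible instant query endpoint on VMSingle and the LogsQL query endpoint on VictoriaLogs.

```shell
# Local endpoints, assuming port-forwards such as:
#   kubectl -n observability port-forward svc/vmsingle-vmstack 8428:8428
#   kubectl -n observability port-forward svc/victorialogs-victoria-logs-single-server 9428:9428
VM_URL="${VM_URL:-http://localhost:8428}"
VL_URL="${VL_URL:-http://localhost:9428}"

# Instant query through the Prometheus-compatible API on VMSingle.
check_metrics() {
  curl -fsS "${VM_URL}/api/v1/query" --data-urlencode 'query=up'
}

# LogsQL query for the last five minutes, capped at five log lines.
check_logs() {
  curl -fsS "${VL_URL}/select/logsql/query" \
    --data-urlencode 'query=_time:5m' \
    --data-urlencode 'limit=5'
}

# Usage, once the port-forwards are up:
#   check_metrics && check_logs
```

A non-empty JSON response from both calls suggests the URLs in the datasource definitions will resolve the same way from inside the cluster.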


Results

All four Helm releases running in the observability namespace:

Release       Chart                          Purpose
vmstack       vm/victoria-metrics-k8s-stack  Metrics: vmsingle, vmagent, vmalert, alertmanager, kube-state-metrics, node-exporter
victorialogs  vm/victoria-logs-single        Log database
vector        vector/vector                  Log shipping (DaemonSet)
grafana       grafana/grafana                Dashboards
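
The four releases could be installed with something like the following sketch. Release names, chart names, and the first three values files come from this post; values-grafana.yaml is a hypothetical file name, and the run helper is an illustrative dry-run wrapper that only prints each command unless APPLY=1 is set.

```shell
# Dry-run by default: print each command; set APPLY=1 to execute it.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

NS=observability

install_all() {
  run helm repo add vm https://victoriametrics.github.io/helm-charts/
  run helm repo add vector https://helm.vector.dev
  run helm repo add grafana https://grafana.github.io/helm-charts
  run helm repo update

  run helm upgrade --install vmstack vm/victoria-metrics-k8s-stack \
    -n "$NS" --create-namespace -f values-vmstack.yaml
  run helm upgrade --install victorialogs vm/victoria-logs-single \
    -n "$NS" -f values-victorialogs.yaml
  run helm upgrade --install vector vector/vector \
    -n "$NS" -f values-vector.yaml
  # values-grafana.yaml is a hypothetical name; the post does not name this file.
  run helm upgrade --install grafana grafana/grafana \
    -n "$NS" -f values-grafana.yaml
}

install_all
```

Running it as-is prints the eight helm commands for review; APPLY=1 executes them in the same order.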

Memory requests before and after:

Component               Before   After    Change
Prometheus / VMSingle   512Mi    256Mi    -50%
Loki (read + write)     ~300Mi   -        replaced
VictoriaLogs            -        64Mi     new
Promtail / Vector       ~64Mi    64Mi     same
Total (metrics + logs)  ~876Mi   ~384Mi   -56%
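
The totals can be re-derived with shell arithmetic. This sketch sums only the components listed in the table, using the approximate Loki and Promtail figures as given.

```shell
# Memory requests in Mi, taken from the table (Loki and Promtail are approximate).
before=$((512 + 300 + 64))   # Prometheus + Loki + Promtail
after=$((256 + 64 + 64))     # VMSingle + VictoriaLogs + Vector

# Integer percentage reduction, rounded to the nearest whole percent.
reduction=$(( ((before - after) * 100 + before / 2) / before ))

echo "before=${before}Mi after=${after}Mi reduction=${reduction}%"
# before=876Mi after=384Mi reduction=56%
```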

Grafana dashboards load with both Kubernetes cluster metrics (from VMSingle via kube-state-metrics and node-exporter) and structured logs (from VictoriaLogs via the plugin). The S3 sink runs continuously, batching up to 10 MiB per object with a 5-minute flush interval.


Production Rules

VictoriaMetrics is a drop-in replacement for Prometheus. The victoria-metrics-k8s-stack chart supports ServiceMonitor/PodMonitor CRDs and a Prometheus-compatible API. Existing Grafana dashboards from grafana.com work without modification — point the datasource at VMSingle’s port 8428.

VictoriaLogs uses the Elasticsearch bulk API for ingest. Any log shipper with an Elasticsearch output writes to it without modification. The VL-Msg-Field, VL-Time-Field, and VL-Stream-Fields headers tell VictoriaLogs which fields hold the message, the timestamp, and the stream labels.

Vector’s customConfig is rendered through Helm’s tpl function. Template syntax in values is evaluated, not passed through literally, so avoid {{ }} in string values or escape the delimiters (for example, {{ "{{" }} for a literal {{).

encoding.codec is not valid on Vector’s elasticsearch sink. The sink always produces JSON output. Adding the field causes Vector to fail at startup.

Grafana on an RWO PVC must use deploymentStrategy.type: Recreate. Rolling updates with a ReadWriteOnce PVC will stall. Recreate terminates the old pod first, freeing the PVC before the new pod starts.
