Replacing kube-prometheus-stack with VictoriaMetrics on EKS
Replaced kube-prometheus-stack and Loki with VictoriaMetrics, VictoriaLogs, and Vector on EKS, cutting the observability memory footprint by 56% and adding dual-sink log archival to S3.
I was running kube-prometheus-stack, Loki, and Promtail on EKS. The setup worked, but Prometheus alone requested 512Mi of memory for a cluster running fewer than 20 applications, and Loki on filesystem storage had no path to durable log archival for compliance.
I replaced the stack with VictoriaMetrics, VictoriaLogs, and Vector. The result: a 56% reduction in observability memory requests and a dual-sink log pipeline — VictoriaLogs for hot queryable storage and S3 for compliance archival. Six gotchas came up during the rollout; each is documented inline below.
The existing setup:
- kube-prometheus-stack: Prometheus StatefulSet (512Mi memory request), Alertmanager, kube-state-metrics, node-exporter
- Loki (SimpleScalable mode, filesystem storage): read and write replicas with local filesystem backend
- Promtail DaemonSet: log shipping from nodes to Loki
Prometheus’s memory footprint was disproportionate for this cluster size — the StatefulSet requested 512Mi and limited at 1Gi, spending most of that on WAL and index overhead. Loki with filesystem storage provided no durability guarantees and no long-term archival path for compliance.
The replacement architecture:
Metrics:   VictoriaMetrics K8s Stack (vmstack)
  ├── VMSingle: Prometheus-compatible TSDB, single binary
  ├── VMAgent: scraping (replaces Prometheus scrape config)
  ├── VMAlert: alerting rules evaluation
  ├── Alertmanager: alert routing
  └── kube-state-metrics + node-exporter
Logs:      VictoriaLogs: log database (Elasticsearch-compatible API)
  └── retention 7d, 10Gi PVC, port 9428
Log ship:  Vector (DaemonSet)
  ├── Sink 1: VictoriaLogs, hot storage, queryable via Grafana
  └── Sink 2: S3, cold archive (date-partitioned, gzip, 90d lifecycle)
Grafana:   Standalone Helm release
  ├── Prometheus datasource → VMSingle
  └── VictoriaLogs datasource → victoriametrics-logs-datasource plugin
Everything runs in the observability namespace.
Part 1: VictoriaMetrics Stack (vmstack)
The victoria-metrics-k8s-stack chart is designed as a near drop-in replacement for kube-prometheus-stack. It works with existing ServiceMonitor/PodMonitor objects, the same Grafana dashboards (by grafana.com ID), and a Prometheus-compatible query API, but it replaces the Prometheus StatefulSet with VMSingle, a single binary that handles TSDB storage and query serving. Scraping in this stack is handled by VMAgent.
# values-vmstack.yaml (abbreviated)
vmsingle:
  enabled: true
  spec:
    retentionPeriod: "7d"
    storage:
      storageClassName: gp2
      resources:
        requests:
          storage: 10Gi
    resources:
      requests:
        memory: 256Mi  # vs 512Mi for Prometheus (50% reduction)
        cpu: 100m
      limits:
        memory: 1Gi
        cpu: 500m
vmagent:
  enabled: true
  spec:
    resources:
      requests:
        memory: 64Mi
        cpu: 50m
grafana:
  enabled: false  # managed as separate Helm release
Grafana is disabled in the vmstack chart and installed separately. This gives independent upgrade control over the visualization layer versus the metrics backend.
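To illustrate the CRD compatibility mentioned at the start of this part, here is a minimal sketch of a ServiceMonitor of the kind kube-prometheus-stack users already have. The application name, namespace, labels, and port name are hypothetical, and the sketch assumes the Prometheus Operator CRDs are still installed and that the VictoriaMetrics operator's conversion of Prometheus objects is enabled in the chart version in use.

```yaml
# Hypothetical ServiceMonitor for an app called "example-app".
# All names, labels, and the port name are illustrative assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: observability
spec:
  # Select Services carrying this label
  selector:
    matchLabels:
      app.kubernetes.io/name: example-app
  # Look for those Services in the "default" namespace
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: http-metrics   # named port on the Service
      path: /metrics
      interval: 30s
```

If conversion is working, the target should appear in VMAgent's targets page after the next reconcile; that is worth checking before removing the old Prometheus.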
Gotcha 1: VMSingle PVC stuck Pending
After the first helm install, the VMSingle pod was pending. Describing the PVC:
kubectl describe pvc vmsingle-vmstack -n observability
Events:
  Warning  ProvisioningFailed  no persistent volumes available for this claim and no storage class is set
The cluster had no default StorageClass. The vmstack chart’s storageClassName was set in values.yaml, but the VictoriaMetrics operator generates the PVC spec from its own CR, and on first reconcile it did not propagate the storageClassName field — creating a PVC with no storage class.
Fix:
# Patch gp2 as the cluster default StorageClass
kubectl patch storageclass gp2 -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
# Delete the stuck PVC so the operator recreates it (now picks up the default)
kubectl delete pvc vmsingle-vmstack -n observability
The operator immediately reconciled and provisioned a new PVC correctly.
Part 2: VictoriaLogs
VictoriaLogs replaces Loki. It exposes an Elasticsearch-compatible bulk ingest API at /insert/elasticsearch/_bulk — any shipper with an Elasticsearch sink (Vector, Fluentd, Logstash) writes to it without a custom output plugin.
# values-victorialogs.yaml
server:
  retentionPeriod: 7d
  persistentVolume:
    enabled: true
    storageClassName: gp2
    size: 10Gi
  resources:
    requests:
      memory: 64Mi  # Loki read+write combined was ~300Mi (about 79% reduction)
      cpu: 50m
    limits:
      memory: 512Mi
      cpu: 500m
  extraArgs:
    envflag.enable: "true"
    envflag.prefix: "VM_"
    loggerFormat: json
Loki in SimpleScalable mode runs separate read and write replicas; even at one replica each, the combined memory request comes to roughly 300Mi. VictoriaLogs at a 64Mi request handles the same log volume for this cluster size with headroom to spare.
Part 3: Vector DaemonSet
Vector replaces Promtail. The reason for the switch: Promtail only ships to Loki, while Vector can simultaneously write to VictoriaLogs (hot storage, 7-day queryable) and S3 (cold archive, compliance retention).
# values-vector.yaml (abbreviated)
role: Agent  # DaemonSet mode
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::ACCOUNT:role/vector-s3-logs"
tolerations:
  - operator: Exists
    effect: NoSchedule
  - operator: Exists
    effect: NoExecute
customConfig:
  sources:
    kubernetes_logs:
      type: kubernetes_logs
      exclude_paths_glob_patterns:
        - "/var/log/pods/observability_vector-*/**"
  transforms:
    enrich_logs:
      type: remap
      inputs: [kubernetes_logs]
      source: |
        .cluster = "my-cluster"
        if exists(.kubernetes.pod_labels."app.kubernetes.io/name") {
          .app = .kubernetes.pod_labels."app.kubernetes.io/name"
        }
  sinks:
    victorialogs:
      type: elasticsearch
      inputs: [enrich_logs]
      endpoints:
        - "http://victorialogs-victoria-logs-single-server.observability.svc.cluster.local:9428/insert/elasticsearch/"
      bulk:
        action: index
        index: "k8s-logs"
      request:
        headers:
          VL-Msg-Field: message
          VL-Time-Field: timestamp
          VL-Stream-Fields: cluster,kubernetes.pod_namespace,kubernetes.pod_name,kubernetes.container_name
      healthcheck:
        enabled: false
    s3_archive:
      type: aws_s3
      inputs: [enrich_logs]
      bucket: my-logs-bucket
      region: us-east-2
      key_prefix: "k8s/%F/"
      compression: gzip
      encoding:
        codec: json
      batch:
        max_bytes: 10485760
        timeout_secs: 300
The S3 sink uses IRSA — no static credentials in the pod. The ServiceAccount annotation maps to an IAM role with s3:PutObject on the target bucket. The %F in key_prefix expands to the current date (YYYY-MM-DD), producing date-partitioned prefixes like k8s/2026-03-22/.
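The 90-day lifecycle on the archive bucket is configured on the S3 side, not in Vector. The post does not say how that rule was created, so the following is only a sketch of one way to express it, assuming CloudFormation is used and that "90d lifecycle" means objects expire after 90 days. The logical resource name and rule ID are made up for illustration.

```yaml
# Sketch only: CloudFormation is an assumption, as are the resource
# name (LogsBucket) and the rule Id. Bucket name and prefix match the
# Vector sink config above.
AWSTemplateFormatVersion: "2010-09-09"
Resources:
  LogsBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: my-logs-bucket
      LifecycleConfiguration:
        Rules:
          - Id: expire-k8s-logs-after-90-days
            Status: Enabled
            Prefix: k8s/            # only objects written by the Vector sink
            ExpirationInDays: 90    # delete objects 90 days after creation
```

Scoping the rule to the k8s/ prefix keeps it from touching anything else stored in the same bucket.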
Gotcha 2: Vector customConfig passes through Helm tpl
The customConfig field in the Vector Helm chart is rendered via Helm's tpl function, so any {{ }} syntax in the config is interpreted as a Go template expression. strftime specifiers such as the %F in k8s/%F/ are safe, but a key prefix that interpolates an event field with {{ }}, such as Vector's {{ kubernetes.pod_namespace }} template syntax, fails with a template parse error.
The fix: avoid {{ }} in customConfig values, or escape the braces so Helm emits them literally, for example {{ "{{" }} and {{ "}}" }}.
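As a sketch of what escaping looks like in practice, the fragment below partitions the S3 prefix by namespace using a Vector field template. This partitioning is not part of the original setup; it is shown only to illustrate the escaping. It assumes Helm renders the quoted-brace expressions to literal braces before Vector reads the config, which is worth confirming with helm template before deploying.

```yaml
# Illustration only: per-namespace prefixes are not used in the setup above.
customConfig:
  sinks:
    s3_archive:
      type: aws_s3
      inputs: [enrich_logs]
      bucket: my-logs-bucket
      region: us-east-2
      # Helm evaluates each quoted-brace expression to a literal brace pair,
      # so Vector should receive: k8s/{{ kubernetes.pod_namespace }}/%F/
      key_prefix: 'k8s/{{ "{{" }} kubernetes.pod_namespace {{ "}}" }}/%F/'
      compression: gzip
      encoding:
        codec: json
```

Single quotes around the YAML value let the double quotes inside the template expressions pass through without further escaping.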
Gotcha 3: encoding.codec not accepted in elasticsearch sink
The first iteration of the Vector config included encoding.codec: json on the VictoriaLogs elasticsearch sink, mirroring the S3 sink. Vector rejected this at startup:
error: unknown field `codec`, expected one of `except_fields`, `only_fields`, `timestamp_format`
The elasticsearch sink type does not accept encoding.codec — it always serializes as JSON. The field is only valid on sinks like aws_s3, file, and console. Remove it from the elasticsearch sink entirely.
Gotcha 4: Vector self-log feedback loop
Without the exclude_paths_glob_patterns on the kubernetes_logs source, Vector ships its own logs back into VictoriaLogs, which generates more logs, which Vector ships — a low-volume but unnecessary feedback loop. Exclude Vector’s own pod log paths:
exclude_paths_glob_patterns:
  - "/var/log/pods/observability_vector-*/**"
Part 4: Grafana
Grafana is installed as a separate Helm release to decouple its upgrade lifecycle from the metrics backend. Two non-obvious configuration requirements for Grafana 12 with VictoriaMetrics:
Gotcha 5: Grafana 12 removed [alerting].enabled
Grafana 12 has no legacy [alerting] config section (legacy alerting was removed upstream in Grafana 11). Setting alerting.enabled: true in grafana.ini produces a startup warning and is otherwise ignored. Only unified_alerting is supported:
grafana.ini:
  unified_alerting:
    enabled: true
  # Do NOT include:
  # alerting:
  #   enabled: true
Gotcha 6: Grafana RWO PVC requires Recreate strategy
Grafana uses a ReadWriteOnce PVC for its SQLite database. An RWO volume can be attached to only one node at a time. The default Deployment rolling update strategy creates a new pod before terminating the old one, so a new pod scheduled onto a different node cannot mount the PVC until the old pod releases it, and the rollout stalls indefinitely.
Fix:
# In Grafana Helm values
deploymentStrategy:
  type: Recreate
This terminates the existing pod before starting the replacement, ensuring the PVC is free when the new pod starts.
VictoriaLogs datasource plugin
Grafana does not ship a built-in datasource for VictoriaLogs. The victoriametrics-logs-datasource plugin must be explicitly installed:
plugins:
  - victoriametrics-logs-datasource
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        url: http://vmsingle-vmstack.observability.svc.cluster.local:8428
        isDefault: true
      - name: VictoriaLogs
        type: victoriametrics-logs-datasource
        url: http://victorialogs-victoria-logs-single-server.observability.svc.cluster.local:9428
        jsonData:
          maxLines: 1000
The Prometheus datasource points at VMSingle using the standard Prometheus API — no VictoriaMetrics-specific datasource plugin needed for metrics.
Results
All four Helm releases running in the observability namespace:
| Release | Chart | Purpose |
|---|---|---|
| vmstack | vm/victoria-metrics-k8s-stack | Metrics: vmsingle, vmagent, vmalert, alertmanager, kube-state-metrics, node-exporter |
| victorialogs | vm/victoria-logs-single | Log database |
| vector | vector/vector | Log shipping (DaemonSet) |
| grafana | grafana/grafana | Dashboards |
Memory requests before and after:
| Component | Before | After | Change |
|---|---|---|---|
| Prometheus/VMSingle | 512Mi | 256Mi | -50% |
| Loki (read + write) | ~300Mi | — | replaced |
| VictoriaLogs | — | 64Mi | new |
| Promtail / Vector | ~64Mi | 64Mi | same |
| Total (metrics + logs) | ~876Mi | ~384Mi | -56% |
Grafana dashboards load with both Kubernetes cluster metrics (from VMSingle via kube-state-metrics and node-exporter) and structured logs (from VictoriaLogs via the plugin). The S3 sink runs continuously, batching up to 10 MiB per object with a 5-minute flush interval.
Production Rules
VictoriaMetrics is a drop-in replacement for Prometheus. The victoria-metrics-k8s-stack chart supports ServiceMonitor/PodMonitor CRDs and a Prometheus-compatible API. Existing Grafana dashboards from grafana.com work without modification — point the datasource at VMSingle’s port 8428.
VictoriaLogs uses the Elasticsearch bulk API for ingest. Any log shipper with an Elasticsearch output writes to it without a custom output plugin. The VL-Msg-Field, VL-Time-Field, and VL-Stream-Fields headers tell VictoriaLogs which fields hold the message, the timestamp, and the stream labels.
Vector’s customConfig is rendered through Helm’s tpl function. Avoid {{ }} in string values, or escape them. Template syntax in values is evaluated, not passed through literally.
encoding.codec is not valid on Vector’s elasticsearch sink. The sink always produces JSON output. Adding the field causes Vector to fail at startup.
Grafana on an RWO PVC must use deploymentStrategy.type: Recreate. Rolling updates with a ReadWriteOnce PVC can stall, because the new pod starts while the old pod still holds the volume. Recreate terminates the old pod first, freeing the PVC before the new pod starts.