Deploy ClickHouse at Scale: Kubernetes Helm Chart, Monitoring, and Backup Scripts

2026-03-08

Opinionated, production-ready guide to run ClickHouse on Kubernetes: Helm config, Prometheus/Grafana, automated S3 backups, and DR playbooks.

If you manage large analytics workloads, you know the cost of reinventing operational patterns: flaky stateful deployments, missing observability, and brittle backups. This hands-on guide gives you an opinionated, production-ready path to running ClickHouse on Kubernetes in 2026: Helm configuration, Prometheus/Grafana observability, automated backup/restore scripts, and practical disaster recovery (DR) playbooks.

The short story (most important first)

ClickHouse adoption surged through 2024–2025 and into 2026 as organizations moved OLAP workloads out of monoliths and into cloud-native pipelines—backed by sizable investment and fast-moving operator/Helm ecosystems. To run ClickHouse reliably at scale you must solve four problems: deployment, storage & replication, observability, and backup/DR. This article gives an opinionated Helm values template, Prometheus/Grafana integration, automated S3 backups, and DR runbooks you can use today.

Why this matters in 2026

ClickHouse's growth and increased funding (late 2025) accelerated richer cloud integrations and operator maturity. That means more organizations run ClickHouse on Kubernetes for real-time analytics, event reporting, and data warehousing use cases. Kubernetes stateful workloads are now first-class: storage snapshots, CSI drivers, and operator patterns are robust, but they require disciplined configuration and automation to avoid data loss at scale.

Opinionated architecture overview

Design choices I recommend for production:

  • Use a dedicated ClickHouse namespace (Kubernetes RBAC and resource quotas).
  • Deploy with a ClickHouse Operator + Helm—operator manages config, replicas and shard/replica sets.
  • PersistentVolumes via CSI with encrypted storage (cloud-provider or on-prem CSI). Enable snapshots.
  • Replicated tables + replica placement across AZs/regions for HA.
  • Backups to S3-style object storage with lifecycle policies and immutability where possible.
  • Observability via Prometheus + Grafana and alerting for replication lag, replica count, disk pressure, and query latency.

1) Opinionated Helm chart values (example)

Many teams use Altinity's ClickHouse Operator or the community operator packaged as a Helm chart. Below is an opinionated values.yaml snippet you can start from. It assumes the operator chart is named clickhouse-operator and you use a StorageClass named fast-ssd-encrypted.

# values.yaml (opinionated)
replicaCount: 3
image:
  repository: clickhouse/clickhouse-server
  tag: 23.12
resources:
  limits:
    cpu: 8
    memory: 32Gi
  requests:
    cpu: 2
    memory: 8Gi
persistence:
  enabled: true
  storageClass: fast-ssd-encrypted
  size: 1Ti
  accessMode: ReadWriteOnce
operator:
  enabled: true
  createCustomResources: true
service:
  type: ClusterIP
  port: 9000
metrics:
  enabled: true
  port: 8123
  exporter:
    enabled: true
    image: percona/clickhouse_exporter:latest
securityContext:
  runAsUser: 101
  fsGroup: 101
podDisruptionBudget:
  enabled: true
  minAvailable: 2
nodeSelector:
  workload: analytics
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "analytics"
    effect: "NoSchedule"

Notes:

  • Set resource requests and limits to match node class.
  • Use node selectors and tolerations to isolate I/O-sensitive pods.
  • Enable the ClickHouse metrics exporter so Prometheus can scrape it.

2) Installing with Helm and Operator

Deploy the operator and a ClickHouseCluster (or CHI) custom resource. Example commands:

# add helm repo for clickhouse operator (example)
helm repo add altinity https://altinity.github.io/helm-charts/
helm repo update

# install operator
helm upgrade --install clickhouse-operator altinity/clickhouse-operator -n clickhouse --create-namespace -f values.yaml

# apply a ClickHouse custom resource (CHI) describing shards/replicas
kubectl apply -f clickhouse-cluster.yaml -n clickhouse

Keep your CHI in source control. The operator reconciles configuration changes and performs rolling updates. Prefer controlled upgrades during a maintenance window, and test in staging first.
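The clickhouse-cluster.yaml applied above is not spelled out in this article; a minimal CHI for the Altinity operator might look like the sketch below. The cluster name, shard/replica counts, and storage size are placeholders to adapt to your topology.

```yaml
# clickhouse-cluster.yaml -- minimal ClickHouseInstallation (CHI) sketch
apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: analytics
  namespace: clickhouse
spec:
  configuration:
    clusters:
      - name: main
        layout:
          shardsCount: 2      # placeholder: size shards to bound per-node data
          replicasCount: 3    # placeholder: matches replicaCount in values.yaml
  defaults:
    templates:
      dataVolumeClaimTemplate: data-volume
  templates:
    volumeClaimTemplates:
      - name: data-volume
        spec:
          storageClassName: fast-ssd-encrypted
          accessModes: [ReadWriteOnce]
          resources:
            requests:
              storage: 1Ti
```

The operator expands this into StatefulSets, per-replica services, and the ClickHouse remote_servers config, so topology changes stay declarative.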

3) Observability: Prometheus and Grafana

Integrate ClickHouse metrics into your Prometheus stack; use a ServiceMonitor if you run kube-prometheus-stack. Key signals to watch: replica delay, total table bytes, query duration, and queued mutations (exact metric names vary by exporter and version).

ServiceMonitor example

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: clickhouse-servicemonitor
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: clickhouse
  namespaceSelector:
    matchNames:
      - clickhouse
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics

Recommended Grafana dashboards:

  • ClickHouse Overview: CPU, Memory, Disk, Query Latency
  • Table Metrics: row counts, bytes, merges
  • Replication: replica lag, parts count, quorum errors

Tip: Use alerting rules for these conditions:

  • Replica lag > 1 minute or growing
  • Disk usage > 80% on any data volume
  • Mutations in queue > threshold (indicates broken merges)
  • High query latency sustained for 5 minutes

Sample Prometheus alert rule (YAML; substitute the replica-lag metric name your exporter actually exposes)

groups:
- name: clickhouse.rules
  rules:
  - alert: ClickHouseReplicaLag
    expr: ch_replica_delay_seconds > 60
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "ClickHouse replica lag > 60s"
      description: "Replica lag is {{ $value }} seconds on {{ $labels.instance }}"
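For the disk-usage condition above, assuming node_exporter metrics are available and the data volume is mounted at /var/lib/clickhouse (both assumptions; adjust to your setup), a companion rule could look like:

```yaml
groups:
- name: clickhouse-disk.rules
  rules:
  - alert: ClickHouseDiskPressure
    # node_exporter filesystem metrics; the mountpoint is an assumption
    expr: |
      (node_filesystem_avail_bytes{mountpoint="/var/lib/clickhouse"}
        / node_filesystem_size_bytes{mountpoint="/var/lib/clickhouse"}) < 0.20
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "ClickHouse data volume over 80% full on {{ $labels.instance }}"
```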

4) Backups: Automated S3 backups with clickhouse-backup

By 2026 the community has largely standardized around tools like clickhouse-backup for logical snapshots backed by S3-compatible storage. For production you want incremental, encrypted backups with retention and lifecycle policies.

Install clickhouse-backup on an operator sidecar or CronJob

Recommended pattern: run backups from a CronJob that invokes the clickhouse-backup CLI (the tool can also be driven through its server/API mode). The script below uses the CLI with S3.

#!/bin/bash
# /opt/scripts/ch-backup.sh
set -euo pipefail

# clickhouse-backup reads its remote-storage settings from env vars, e.g.:
#   REMOTE_STORAGE=s3, S3_BUCKET, S3_REGION, S3_ACCESS_KEY, S3_SECRET_KEY
DATE=$(date -u +%Y-%m-%dT%H-%M-%SZ)
BACKUP_NAME="ch-backup-${DATE}"

# create a local backup
clickhouse-backup create "${BACKUP_NAME}"

# push it to the configured S3 bucket
clickhouse-backup upload "${BACKUP_NAME}"

# free local disk once the backup is safely in S3
clickhouse-backup delete local "${BACKUP_NAME}"

# remote retention: set BACKUPS_TO_KEEP_REMOTE (e.g. 14) in the tool's
# config rather than pruning by hand

Wrap the script in a Kubernetes CronJob. Use a service account with fine-grained S3 permissions. Store secrets in sealed-secrets or a secrets manager and mount as env vars.

Example Kubernetes CronJob snippet

apiVersion: batch/v1
kind: CronJob
metadata:
  name: clickhouse-backup
  namespace: clickhouse
spec:
  schedule: "0 2 * * *" # daily at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: clickhouse-backup-sa
          containers:
          - name: clickhouse-backup
            image: ghcr.io/altinity/clickhouse-backup:latest
            command: ["/bin/sh","-c","/opt/scripts/ch-backup.sh"]
            env:
            - name: S3_BUCKET
              valueFrom:
                secretKeyRef:
                  name: ch-s3-secret
                  key: bucket
          restartPolicy: OnFailure

5) Restore and DR playbook

Backups are only useful if you can restore them reliably. Define a tested DR playbook that includes these phases: Failover identification, Restore validation, and Cut-over.

Fast restore options (production)

  1. Snapshot restore using clickhouse-backup: download the backup from S3 and restore it onto a new ClickHouse cluster. Note this recovers to the backup's creation time, not an arbitrary point in time.
  2. Replica rebuilds: If replicas exist across regions, use replication and replace failed nodes to let replicas catch up.
  3. PV snapshots: For failing clusters where storage is intact, restore CSI snapshots and reattach PVs to new pods. This is the quickest if supported by CSI.

Restore example with clickhouse-backup

# 1. Download from the remote storage configured for clickhouse-backup
clickhouse-backup download my-backup

# 2. Restore schema and data on an empty ClickHouse server
clickhouse-backup restore my-backup

# 3. Start service and validate
kubectl rollout restart statefulset/clickhouse -n clickhouse
# Run validation queries

Validation queries should include checksum comparisons of table counts, simple aggregates, and application-level smoke tests.
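As a sketch, the count comparison can be scripted with two small helpers. The function names are ours, and dump_counts assumes clickhouse-client is on the PATH with a CH_HOST env var set:

```shell
#!/bin/bash
set -euo pipefail

# Dump per-table active row counts as "table<TAB>rows" TSV.
# Assumes clickhouse-client on PATH and CH_HOST set (hypothetical env var).
dump_counts() {
  clickhouse-client --host "${CH_HOST}" -q \
    "SELECT table, sum(rows) FROM system.parts WHERE active GROUP BY table ORDER BY table FORMAT TSV"
}

# compare_counts SRC.tsv RESTORED.tsv
# Sorts both listings and diffs them; prints OK on match, drift on mismatch.
compare_counts() {
  diff <(sort "$1") <(sort "$2") && echo "OK: row counts match"
}
```

Run dump_counts against the source and the restored cluster, then compare the two files before cutting traffic over.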

6) Disaster Recovery: Cross-region and runbook

DR at scale means planning for region loss, corrupt backups, and operator bugs. Key practices:

  • Multi-region replicas: Maintain at least one replica in a separate region. Configure replication and read-only traffic preference.
  • Immutable backups: Use S3 Object Lock or retention policies to protect against accidental deletion or ransomware.
  • Periodic DR drills: Test restores quarterly. Measure RTO and RPO.
  • Documented runbooks: Keep automated scripts plus manual fallback steps in a runbook repository with role assignments.
"A tested restore is worth more than a thousand untested backups."
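One concrete way to enforce the immutability point with the AWS CLI (the bucket name is an example, and Object Lock must already be enabled on the bucket at creation time):

```
aws s3api put-object-lock-configuration \
  --bucket ch-backups-example \
  --object-lock-configuration '{
    "ObjectLockEnabled": "Enabled",
    "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}}
  }'
```

COMPLIANCE mode prevents deletion by any principal, including the root account, until the retention window expires.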

Sample DR runbook checklist

  1. Detect outage (Prometheus rule fires)
  2. Notify on-call and create incident channel
  3. Assess if replicas are healthy in other zones/regions
  4. If region loss: spin up ClickHouse cluster in recovery region using latest verified backup
  5. Run data validation suite (counts, checksums, smoke queries)
  6. Failover read traffic via DNS or load balancer
  7. Postmortem and restore primary cluster if possible

7) Security, compliance, and licensing notes (practical)

In 2026 you should verify licensing for enterprise features (some vendors offer proprietary tooling). For open-source ClickHouse:

  • Enable TLS for client and inter-server communication.
  • Use network policies to limit access to ClickHouse pods.
  • Encrypt data-at-rest using the storage provider and S3 encryption for backups.
  • Audit: enable query auditing where required.

8) Troubleshooting checklist

Common operational issues and quick checks:

  • Replica lag: Check network, replica config, and disk I/O. Look for stalled merges.
  • High disk usage: Enable TTLs and partitioning, and tune merge policies to be more aggressive. Consider adding nodes and rebalancing.
  • Slow queries: Profile with system.query_log and consider materialized views or pre-aggregated tables.
  • Backups failing: Check S3 permissions, IAM roles, and clickhouse-backup logs.

9) Advanced strategies for scale

When your cluster grows beyond single-cluster limits, consider:

  • Sharding by time or tenant to keep per-node storage bounded.
  • Federated queries only for light cross-cluster joins; push heavy joins into ETL pipelines.
  • Use materialized views to accelerate common aggregations and reduce query load.
  • Automated rebalancing scripts to redistribute data when adding nodes (scripting against ClickHouse Keeper/ZooKeeper or operator helpers).
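The materialized-view bullet can be made concrete with a small ClickHouse SQL sketch; the table and column names here are hypothetical:

```sql
-- hypothetical raw events table
CREATE TABLE events (
  ts DateTime,
  tenant String,
  value Float64
) ENGINE = MergeTree ORDER BY (tenant, ts);

-- pre-aggregate hourly sums so dashboards avoid scanning raw rows
CREATE MATERIALIZED VIEW events_hourly
ENGINE = SummingMergeTree ORDER BY (tenant, hour)
AS SELECT
  tenant,
  toStartOfHour(ts) AS hour,
  sum(value) AS total
FROM events
GROUP BY tenant, hour;
```

Queries against events_hourly read pre-aggregated rows instead of the raw table, which is where most of the query-load reduction comes from.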

10) Checklist before going to production

  • Helm/operator deployed and CRs in GitOps.
  • StorageClass validated for throughput and IOPS.
  • Monitoring in place with alerts and runbooks for each alert.
  • Backup CronJobs and S3 lifecycle policies configured, tested restore documented.
  • Security hardening: TLS, RBAC, network policies.
  • DR plan documented, with at least one successful restore performed in staging.

Appendix: Quick resources & commands

  • Operator repos: Altinity clickhouse-operator (GitHub)
  • clickhouse-backup: community tool for S3-based backups
  • Grafana dashboards: search "ClickHouse" in Grafana Dashboards directory (IDs change; prefer curated dashboards from your operator vendor)
  • Prometheus exporters: percona clickhouse_exporter or altinity exporters

Final takeaways

Running ClickHouse on Kubernetes in 2026 is increasingly mainstream, but it’s not turnkey. Use an operator + opinionated Helm values, integrate metrics and alerting early, automate S3-backed backups with clickhouse-backup, and practice DR drills. The real ROI comes when teams standardize templates and runbooks so incidents resolve predictably.

Actionable next steps:

  1. Clone the operator Helm chart and create a values.yaml aligned to your instance types.
  2. Deploy monitoring and set the recommended alerts for replica lag and disk usage.
  3. Implement daily S3 backups using the provided CronJob pattern and test restores in staging.

Call to action

If you want a ready-to-run starter kit, download our ClickHouse Helm + Monitoring + Backup repo for Kubernetes — includes Helm values, ServiceMonitor, Grafana dashboard templates, and a tested backup/restore pipeline. Run the first DR drill within 7 days.
