Migrating Analytics from Snowflake to ClickHouse: Checklist, Scripts, and Benchmarks

2026-03-07
10 min read

A practical 2026 playbook to migrate analytics from Snowflake to ClickHouse with schema mapping, ETL rewrites, benchmark scripts, and pitfalls.

Stop reinventing analytics: a practical playbook to migrate from Snowflake to ClickHouse in 2026

If you’re spending cycles and credits wrestling with Snowflake costs, or you need sub-second analytical queries for high-concurrency dashboards, this playbook gives you a step-by-step migration path that dev teams and platform engineers can run. It includes schema mapping, ETL rewrite examples, benchmark scripts, cost-comparison heuristics, and a checklist of common pitfalls to avoid.

The big picture (2026 context)

The analytics landscape accelerated in late 2024–2025 as teams chased lower latency and lower cost-per-query. ClickHouse’s 2025 funding and wider ecosystem growth pushed it into mainstream consideration for high-volume OLAP workloads. In 2026, the key migration drivers are:

  • Cloud cost pressure and predictability — Snowflake’s credit model can be opaque; teams want fixed-node economics.
  • High-concurrency, real-time dashboards — ClickHouse offers fast vectorized execution and high QPS with MergeTree-based engines.
  • Open-source & hybrid deployments — ClickHouse Cloud and self-hosted options give more deployment choices.

What this playbook delivers

  • Concrete schema mapping rules from Snowflake to ClickHouse
  • ETL rewrite examples: Snowflake -> S3 -> ClickHouse and streaming alternatives
  • Benchmark scripts (bash + Python) to measure performance and cost
  • Checklist and common pitfalls for production rollouts

1) Schema differences & mapping rules (practical)

Snowflake and ClickHouse differ in type system, nullability, indexing, and DDL semantics. Below are pragmatic mappings to port schemas reliably.

Key mapping rules

  • Numeric types: Snowflake NUMBER/DECIMAL -> ClickHouse Decimal(precision,scale) for exact money; Float/Double -> Float64.
  • Integer types: Snowflake INTEGER -> Int32/Int64 depending on range. Prefer Int64 for ID fields unless heavy compression is needed.
  • String: Snowflake VARCHAR/STRING/CHAR -> ClickHouse String. For low-cardinality dimensions use LowCardinality(String) to reduce memory and improve group-by performance.
  • Time & Date: Snowflake TIMESTAMP_NTZ/TIMESTAMP_TZ -> ClickHouse DateTime64(3). ClickHouse treats the timezone as column metadata over a UTC timestamp, so store UTC and convert in queries or at the application layer.
  • Semi-structured: Snowflake VARIANT/OBJECT/ARRAY have no direct ClickHouse equivalent. Store raw JSON as String and parse with the JSONExtract* functions, or model known shapes with Nested columns and Map types.
  • Nullability: ClickHouse historically favored non-nullable columns and now supports Nullable(T), but it adds storage and CPU overhead. Use it sparingly; consider sentinel values (e.g., empty string, 0) when appropriate.
  • Primary keys & indexes: Snowflake defines but does not enforce primary keys, and ClickHouse’s MergeTree primary key is a sort key, not a uniqueness constraint. Choose ORDER BY columns to optimize range scans and GROUP BY performance.
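To apply these rules across many tables, the mapping can be scripted. The sketch below is a minimal, illustrative translator (the `snowflake_to_clickhouse` helper and its mapping table are assumptions for this guide, not a complete type system); extend it with your own types and cardinality hints.

```python
import re

def snowflake_to_clickhouse(sf_type: str, low_cardinality: bool = False) -> str:
    """Map a Snowflake column type to a pragmatic ClickHouse equivalent."""
    t = sf_type.strip().upper()
    m = re.match(r'(NUMBER|DECIMAL)\((\d+),\s*(\d+)\)$', t)
    if m:
        precision, scale = int(m.group(2)), int(m.group(3))
        if scale == 0:
            return 'Int64'  # integer-valued NUMBER; consider UInt64 for non-negative IDs
        return f'Decimal({precision},{scale})'
    if t in ('FLOAT', 'DOUBLE', 'REAL'):
        return 'Float64'
    if t in ('INTEGER', 'INT', 'BIGINT', 'SMALLINT'):
        return 'Int64'
    if t.startswith(('VARCHAR', 'STRING', 'CHAR', 'TEXT')):
        return 'LowCardinality(String)' if low_cardinality else 'String'
    if t.startswith('TIMESTAMP'):
        return 'DateTime64(3)'  # store UTC; convert in queries
    if t in ('VARIANT', 'OBJECT', 'ARRAY'):
        return 'String'  # raw JSON; parse with JSONExtract* at query time
    return 'String'  # safe fallback for unmapped types

print(snowflake_to_clickhouse('NUMBER(38,0)'))   # Int64
print(snowflake_to_clickhouse('NUMBER(18,2)'))   # Decimal(18,2)
print(snowflake_to_clickhouse('VARCHAR', low_cardinality=True))
```

Feed it the output of Snowflake's DESCRIBE TABLE to draft ClickHouse DDL, then review the result by hand; ORDER BY and PARTITION BY still need human judgment.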

Example: Snowflake table -> ClickHouse DDL

-- Snowflake
  CREATE TABLE events (
    event_id NUMBER(38,0),
    user_id NUMBER(38,0),
    event_type VARCHAR,
    payload VARIANT,
    ts TIMESTAMP_NTZ
  );

  -- ClickHouse (recommended mapping)
  CREATE TABLE analytics.events (
    event_id UInt64,
    user_id UInt64,
    event_type LowCardinality(String),
    payload String, -- store JSON as String or use JSON functions
    ts DateTime64(3)
  ) ENGINE = MergeTree()
  PARTITION BY toYYYYMM(ts)
  ORDER BY (user_id, ts);
  

2) ETL rewrite examples

We present two common migration patterns: batch dump (Snowflake -> S3 -> ClickHouse) and near-real-time streaming (Snowflake -> Kafka -> ClickHouse). Both include concrete commands you can adapt to your environment.

Pattern A — Batch: Snowflake export to S3 then ClickHouse ingest

Steps: export from Snowflake using COPY INTO to S3 as compressed CSV/Parquet, then use ClickHouse client or S3 table function to insert.

# Snowflake: export parquet to S3
  COPY INTO 's3://my-bucket/snowflake-exports/events_'
  FROM my_db.public.events
  FILE_FORMAT = (TYPE = PARQUET COMPRESSION = 'SNAPPY')
  HEADER = TRUE
  SINGLE = FALSE;

  # ClickHouse: ingest with the s3 table function (the server pulls files directly)
  INSERT INTO analytics.events
  SELECT
    toUInt64(event_id) AS event_id,   -- adjust casts to the exported Parquet types
    toUInt64(user_id) AS user_id,
    event_type,
    toString(payload) AS payload,     -- VARIANT arrives as a JSON string
    ts
  FROM s3('https://my-bucket.s3.amazonaws.com/snowflake-exports/events_*.parquet', 'Parquet');
  -- private bucket: s3(url, 'AWS_KEY_ID', 'AWS_SECRET_KEY', 'Parquet')
  

Note: ClickHouse's s3 table function lets the server pull files directly; avoid local download for large datasets.
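After each batch lands, validate row counts before moving to the next one. A minimal sketch, assuming a snowflake-connector connection and a clickhouse-connect client configured as elsewhere in this guide; the table names and per-month predicates are illustrative:

```python
def counts_match(source_count: int, dest_count: int, tolerance: int = 0) -> bool:
    """True when the destination row count is within tolerance of the source."""
    return abs(source_count - dest_count) <= tolerance

def validate_month(sf_conn, ch_client, month: str) -> bool:
    """Compare per-month row counts. sf_conn: snowflake.connector connection;
    ch_client: clickhouse-connect client."""
    sf_q = ("SELECT count(*) FROM my_db.public.events "
            f"WHERE TO_VARCHAR(ts, 'YYYY-MM') = '{month}'")
    ch_q = ("SELECT count(*) FROM analytics.events "
            f"WHERE formatDateTime(ts, '%Y-%m') = '{month}'")
    sf_count = sf_conn.cursor().execute(sf_q).fetchone()[0]
    ch_count = ch_client.query(ch_q).result_rows[0][0]
    return counts_match(sf_count, ch_count)
```

A nonzero tolerance is only appropriate while ingestion is still running; a completed batch should match exactly.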

Pattern B — Near-real-time: Snowflake -> Kafka -> ClickHouse

Use Snowpipe or a CDC connector (Debezium/Fivetran/Airbyte) to stream changes into Kafka, then use ClickHouse's Kafka engine or materialized views to consume.

-- ClickHouse: create Kafka engine table and materialized view
  CREATE TABLE kafka_events (
    raw String
  ) ENGINE = Kafka SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'events',
    kafka_group_name = 'ch_events',
    kafka_format = 'JSONEachRow';

  CREATE MATERIALIZED VIEW mv_events TO analytics.events AS
  SELECT
    JSONExtractUInt(raw, 'event_id') AS event_id, -- already returns UInt64; no extra cast needed
    JSONExtractUInt(raw, 'user_id') AS user_id,
    JSONExtractString(raw, 'event_type') AS event_type,
    JSONExtractRaw(raw, 'payload') AS payload, -- JSONExtractRaw keeps nested objects intact
    parseDateTimeBestEffort(JSONExtractString(raw, 'ts')) AS ts
  FROM kafka_events;
  

3) Benchmark and cost scripts

Benchmarking must measure both latency and economics. Below are reusable scripts to compare query latency and to estimate Snowflake credits vs ClickHouse node-hours.

What to measure

  • Query latency P50/P95/P99 under target concurrency
  • Throughput — rows/sec ingested and queries/sec served
  • Cost — Snowflake credits consumed vs ClickHouse cloud/node cost
  • Storage cost — compressed storage on Snowflake vs ClickHouse (S3 or local)

Simple bash benchmark runner (ClickHouse)

#!/bin/bash
  # run_bench_ch.sh - measure per-query wall-clock time for ClickHouse
  CH_HOST=localhost
  CH_USER=default
  QUERY_FILE=queries.sql   # one query per line
  LOG=bench_ch.log

  echo "Starting ClickHouse bench: $(date)" > "$LOG"
  while IFS= read -r q; do
    [ -z "$q" ] && continue
    start=$(date +%s%3N)   # GNU date: epoch milliseconds
    clickhouse-client --host "$CH_HOST" --user "$CH_USER" --query "$q" >/dev/null
    end=$(date +%s%3N)
    echo "$q|$((end-start))ms" >> "$LOG"
  done < "$QUERY_FILE"
  echo "Done: $(date)" >> "$LOG"
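The log can then be summarized into the P50/P95/P99 figures called for above. A short sketch, assuming the `query|milliseconds` log format written by the bash runner:

```python
import statistics

def latency_percentiles(timings_ms):
    """P50/P95/P99 from a list of millisecond timings (needs several data points)."""
    qs = statistics.quantiles(timings_ms, n=100)  # 99 cut points
    return {'p50': qs[49], 'p95': qs[94], 'p99': qs[98]}

def parse_bench_log(path):
    """Parse 'query|123ms' (or 'query|123') lines from the benchmark log."""
    timings = []
    with open(path) as fh:
        for line in fh:
            if '|' in line:
                _, ms = line.rsplit('|', 1)
                timings.append(float(ms.strip().rstrip('ms')))
    return timings

print(latency_percentiles(list(range(1, 101))))
```

Run it as `latency_percentiles(parse_bench_log('bench_ch.log'))`; percentiles from fewer than a few dozen samples are noisy, so repeat each query.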
  

Python runner to compare Snowflake and ClickHouse and estimate cost

"""
  bench_compare.py
  Run the same SQL against Snowflake and ClickHouse, record latency and estimate cost.
  Requires: snowflake-connector-python, clickhouse-connect
  """
  import time
  import json
  import clickhouse_connect
  import snowflake.connector

  # Config
  sf_cfg = {'user': 'USER', 'password': 'PWD', 'account': 'acct', 'warehouse': 'WH'}
  ch_cfg = {'host': 'localhost', 'port': 8123}  # clickhouse-connect speaks HTTP (8123), not native 9000

  # NB: dialects differ; adjust per system (e.g. Snowflake prefers DATEADD over INTERVAL arithmetic)
  queries = ["SELECT count(*) FROM analytics.events WHERE ts >= now() - INTERVAL 1 DAY"]

  # Clients
  sf = snowflake.connector.connect(**sf_cfg)
  ch = clickhouse_connect.get_client(**ch_cfg)  # factory function, not a Client constructor

  # Cost assumptions -- replace with your vendor pricing
  credits_per_hour = 1.0    # depends on warehouse size (XS=1, S=2, M=4, ...)
  credit_price = 3.00       # $/credit, contract-dependent
  ch_node_hour_cost = 0.50  # $/node-hour placeholder

  results = []
  for q in queries:
      t0 = time.time(); ch.query(q); ch_time = time.time() - t0
      t0 = time.time(); sf.cursor().execute(q).fetchall(); sf_time = time.time() - t0
      # crude estimate: cost proportional to busy time
      sf_cost = credits_per_hour * (sf_time / 3600.0) * credit_price
      ch_cost = ch_node_hour_cost * (ch_time / 3600.0)
      results.append({'query': q, 'ch_ms': ch_time * 1000, 'sf_ms': sf_time * 1000,
                      'sf_cost': sf_cost, 'ch_cost': ch_cost})

  print(json.dumps(results, indent=2))
  

Notes: replace the placeholder costs with your vendor pricing (Snowflake credit price, ClickHouse Cloud node cost, or self-hosted infra cost), and run the queries under realistic concurrency (a thread pool or a load-testing harness) rather than one at a time.
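For a first pass at target concurrency, a thread pool is enough. A hedged sketch; `run_query` is whatever client call you benchmark (e.g. `lambda q: ch.query(q)` with a clickhouse-connect client), stubbed here with a sleep:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_concurrent(run_query, queries, concurrency=16, repeats=4):
    """Fire each query `repeats` times from `concurrency` workers;
    returns per-call latencies in milliseconds."""
    workload = list(queries) * repeats
    def timed(q):
        t0 = time.perf_counter()
        run_query(q)
        return (time.perf_counter() - t0) * 1000
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed, workload))

# Stub standing in for a real client call, just to show the shape
latencies = run_concurrent(lambda q: time.sleep(0.01), ['q1', 'q2'], concurrency=4)
print(f"{len(latencies)} calls, max {max(latencies):.1f} ms")
```

Threads are fine here because both drivers release the GIL while waiting on the network; feed the resulting list into the percentile helper used elsewhere in this guide.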

4) Performance tuning knobs (ClickHouse focus)

  • ORDER BY: This determines how MergeTree physically sorts data. Put the columns you filter on first, generally lower-cardinality columns before higher-cardinality ones, so range scans skip more granules and compression improves.
  • PARTITION BY: Use monthly partitions (toYYYYMM(ts)) for time-series retention and efficient TTL DROP PARTITION.
  • Compression: Choose LZ4 (default) for speed; ZSTD for better compression at higher CPU cost.
  • LowCardinality: Use for string dimensions with many repeats to reduce memory usage and accelerate GROUP BY.
  • Materialized views: Pre-aggregate heavy group-bys. Beware of ordering and resource impact when building them concurrently.

5) Common migration pitfalls and how to avoid them

Handle these early; they are the top causes of failed or delayed migrations.

  1. Assuming 1:1 SQL parity

    Snowflake has functions and extensions (e.g., VARIANT, semi-structured query helpers) that ClickHouse doesn’t replicate exactly. Test complex SQL and rewrite using ClickHouse equivalents (JSONExtract*, array functions).

  2. Underestimating ORDER BY & PARTITION design

    Choosing a poor ORDER BY causes poor compression and slow range scans. Prototype with realistic data shapes (skew, cardinality) and iterate.

  3. Ignoring nullability and sentinel planning

    Nullable columns increase storage and CPU overhead. Where possible, normalize inputs or use sentinel values and document them across teams.

  4. Cost comparison errors

    Don’t compare ClickHouse per-query cost to Snowflake credits directly without normalizing for concurrency and idle costs. Include infra baseline (nodes or cloud), storage, and operational overhead.

  5. Security & governance gaps

    Map Snowflake roles and masking policies to ClickHouse RBAC and row-level security (via views or external proxy). Audit access and encryption-at-rest before cutting production traffic.

  6. Not validating eventual consistency

    ClickHouse performs merges asynchronously: engines such as ReplacingMergeTree and SummingMergeTree only collapse or aggregate rows when parts merge, so reads may see pre-merge duplicates (use FINAL or query-time aggregation), and replicated tables converge eventually across replicas. For strict transactional semantics, retain the source system or rethink consumer logic.

  7. Data freshness & retention mismatch

    Snowflake Time Travel and failsafe features don’t exist in ClickHouse. Implement backups and retention explicitly (S3 backups or snapshots).
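Pitfall 4's cost normalization is simple arithmetic once you account for the different billing models: Snowflake bills only while the warehouse runs, provisioned ClickHouse bills around the clock. A sketch with illustrative placeholder numbers:

```python
def monthly_cost_snowflake(credits_per_hour, busy_hours, credit_price):
    """Snowflake bills credits only while the warehouse runs (plus auto-suspend lag)."""
    return credits_per_hour * busy_hours * credit_price

def monthly_cost_clickhouse(nodes, node_hour_cost, hours_per_month=730):
    """Provisioned ClickHouse bills nodes around the clock, busy or idle."""
    return nodes * node_hour_cost * hours_per_month

# Illustrative placeholders: M warehouse (4 credits/hr) busy 200 h/month at $3/credit
sf = monthly_cost_snowflake(4, 200, 3.00)   # 2400.0
ch = monthly_cost_clickhouse(3, 0.50)       # 1095.0
print(f"Snowflake ${sf:.0f}/mo vs ClickHouse ${ch:.0f}/mo")
```

The crossover depends on utilization: at low busy-hours Snowflake's pay-per-use wins; at sustained dashboard load the fixed-node model usually does. Plug in your own measured busy-hours rather than guessing.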

6) Migration checklist (step-by-step)

  1. Run a discovery of schemas, queries, and job owners (catalog all objects).
  2. Classify workloads: interactive dashboards, backfills, ETL pipelines, long-running ad-hoc queries.
  3. Prototype: choose 1–2 representative dashboards and a staging dataset; map schema and run benchmark scripts.
  4. Rewrite ETL for chosen ingestion pattern (batch vs streaming). Include data validation checksums and row counts per batch.
  5. Performance tune ORDER BY/PARTITION and test under concurrency; use low-cardinality types where applicable.
  6. Implement access control, encryption, and audit logging. Update IAM and DB proxies.
  7. Run a parallel production validation period: route a subset of traffic to ClickHouse and compare results (row-level diff sampling).
  8. Cut over gradually: migrate read-only dashboards first, then ETLs, then switch writers.
  9. Keep a rollback plan: snapshot or redirect ingestion back to Snowflake for a rollback window.
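Step 7's row-level diff sampling can be as simple as hashing a deterministic sample of rows on both sides and comparing the sets. A minimal sketch; the helper names and the sampling predicate are illustrative:

```python
import hashlib

def row_fingerprint(row):
    """Stable hash of a row tuple so samples can be compared across systems."""
    canonical = '|'.join('' if v is None else str(v) for v in row)
    return hashlib.sha256(canonical.encode()).hexdigest()

def diff_samples(source_rows, dest_rows):
    """Fingerprints present on only one side; two empty sets mean the samples agree."""
    src = {row_fingerprint(r) for r in source_rows}
    dst = {row_fingerprint(r) for r in dest_rows}
    return src - dst, dst - src

# Pull the same deterministic slice from both systems,
# e.g. WHERE cityHash64(event_id) % 1000 = 0 on the ClickHouse side
only_src, only_dst = diff_samples([(1, 'click'), (2, 'view')],
                                  [(1, 'click'), (2, 'view')])
print(len(only_src), len(only_dst))
```

Make sure both queries normalize timestamps and numeric formatting the same way before hashing, or every row will look different for the wrong reasons.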

Pro tip: Keep Snowflake for ad-hoc exploration during migration. It’s often faster for unknown data discovery while ClickHouse becomes your high-scale serving layer.

7) Real-world example: migrating a 1TB event table

Summary of a successful migration we’ve seen in 2025–2026:

  • Use-case: analytics dashboards with 1,000 concurrent users, sub-second median latencies required.
  • Strategy: Export month-by-month Parquet snapshots to S3, then server-side ingest to a Partitioned MergeTree with ORDER BY (user_id, ts).
  • Results: 3–5x lower query latency on high-concurrency dashboards and ~45% lower monthly spend when comparing Snowflake credits vs ClickHouse Cloud node costs (including storage on S3).
  • Lessons: Materialized views for 3 high-cardinality group-bys reduced query cost further; however, the initial partition and order tuning required several iterations.
8) Looking ahead

  • ClickHouse ecosystem growth: managed services, richer cloud integrations, and better SQL compatibility layers are maturing in 2025–2026.
  • Hybrid architectures: many teams run Snowflake for exploratory analysis and ClickHouse for serving high-QPS dashboards; expect tooling to standardize this pattern.
  • Vectorized UDFs and ML integration: ClickHouse is adding better integration with model-scoring pipelines; plan for nearline feature stores.

Actionable takeaways

  • Start with a 2–4 week proof-of-concept: pick representative queries and datasets and run the benchmark scripts in this guide.
  • Design ORDER BY and PARTITION to reflect query patterns, not Snowflake primary-key assumptions.
  • Use streaming for low-latency needs; batch exports for bulk historical replays.
  • Always validate correctness with row-level checksums and sampling across both systems during cutover.

Conclusion & next steps

Migrating analytics from Snowflake to ClickHouse is a high-return engineering effort when your workloads demand predictable costs and low-latency, high-concurrency reads. Use the schema mapping rules, ETL patterns, and benchmark scripts above as your migration backbone. Expect to iterate on order_by/partitioning and materialized views to reach target performance.

Call to action

If you want a custom migration checklist or a tailored benchmark for your dataset, download our sample repo (DDL, scripts, and runners) or contact our migration workshop at codenscripts.com/migrate. Start a free POC this quarter and reduce your analytics cost and latency while keeping Snowflake for exploratory workloads during the transition.


Related Topics

#Databases#Migration#DevOps
