Migrating Analytics from Snowflake to ClickHouse: Checklist, Scripts, and Benchmarks
2026-03-07

A practical 2026 playbook to migrate analytics from Snowflake to ClickHouse with schema mapping, ETL rewrites, benchmark scripts, and pitfalls.

Stop reinventing analytics: a practical playbook to migrate from Snowflake to ClickHouse in 2026

If you’re spending cycles and credits wrestling with Snowflake costs, or you need sub-second analytical queries for high-concurrency dashboards, this playbook gives you a step-by-step migration path that dev teams and platform engineers can run. It includes schema mapping, ETL rewrite examples, benchmark scripts, cost comparison heuristics, and a checklist of common pitfalls to avoid.

The big picture (2026 context)

The analytics landscape accelerated in late 2024–2025 as teams chased lower latency and lower cost-per-query. ClickHouse’s 2025 funding and wider ecosystem growth pushed it into mainstream consideration for high-volume OLAP workloads. In 2026, the key migration drivers are:

  • Cloud cost pressure and predictability — Snowflake’s credit model can be opaque; teams want fixed-node economics.
  • High-concurrency, real-time dashboards — ClickHouse offers fast vectorized execution and high QPS with MergeTree-based engines.
  • Open-source & hybrid deployments — ClickHouse Cloud and self-hosted options give more deployment choices.

What this playbook delivers

  • Concrete schema mapping rules from Snowflake to ClickHouse
  • ETL rewrite examples: Snowflake -> S3 -> ClickHouse and streaming alternatives
  • Benchmark scripts (bash + Python) to measure performance and cost
  • Checklist and common pitfalls for production rollouts

1) Schema differences & mapping rules (practical)

Snowflake and ClickHouse differ in type system, nullability, indexing, and DDL semantics. Below are pragmatic mappings to port schemas reliably.

Key mapping rules

  • Numeric types: Snowflake NUMBER/DECIMAL -> ClickHouse Decimal(precision,scale) for exact money; Float/Double -> Float64.
  • Integer types: Snowflake INTEGER -> Int32/Int64 depending on range. Prefer Int64 for ID fields unless heavy compression is needed.
  • String: Snowflake VARCHAR/STRING/CHAR -> ClickHouse String. For low-cardinality dimensions use LowCardinality(String) to reduce memory and improve group-by performance.
  • Time & Date: Snowflake TIMESTAMP_NTZ/TIMESTAMP_TZ -> ClickHouse DateTime64(3), with timezone handling at the application level. ClickHouse stores values as UTC-based epochs and treats the timezone as column metadata; store UTC and convert at query time (see the sketch after the DDL example below).
  • Semi-structured: Snowflake VARIANT/OBJECT/ARRAY -> ClickHouse JSON type options: use String for raw JSON or JSONExtract functions; ClickHouse also supports Nested columns and Map-related functions but lacks a direct VARIANT equivalent.
  • Nullability: ClickHouse historically favored non-nullable columns; it now supports Nullable(T), but use it sparingly because it adds storage and CPU overhead. Consider sentinel values (e.g., empty string, 0) where appropriate.
  • Primary keys & indexes: Snowflake does not enforce primary keys, and neither does ClickHouse: the MergeTree primary/ordering key is a sort key backing a sparse index, not a uniqueness constraint. Choose ORDER BY columns to optimize range scans and GROUP BY performance.

Example: Snowflake table -> ClickHouse DDL

  -- Snowflake
  CREATE TABLE events (
    event_id NUMBER(38,0),
    user_id NUMBER(38,0),
    event_type VARCHAR,
    payload VARIANT,
    ts TIMESTAMP_NTZ
  );

  -- ClickHouse (recommended mapping)
  CREATE TABLE analytics.events (
    event_id UInt64,
    user_id UInt64,
    event_type LowCardinality(String),
    payload String, -- store JSON as String or use JSON functions
    ts DateTime64(3)
  ) ENGINE = MergeTree()
  PARTITION BY toYYYYMM(ts)
  ORDER BY (user_id, ts);
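
Since ts above is stored as UTC, timezone conversion happens only at read time. Here is a minimal sketch of that pattern via clickhouse-connect; the host, table, and 'America/New_York' timezone are placeholder assumptions:

  # tz_query.py -- render UTC-stored timestamps in a local timezone at query time (sketch)
  import clickhouse_connect

  client = clickhouse_connect.get_client(host='localhost', port=8123)  # adjust host/credentials

  rows = client.query("""
      SELECT
          toTimeZone(ts, 'America/New_York') AS local_ts,  -- convert on read; storage stays UTC
          event_type,
          count(*) AS events
      FROM analytics.events
      WHERE ts >= now() - INTERVAL 1 DAY
      GROUP BY local_ts, event_type
      ORDER BY local_ts DESC
      LIMIT 10
  """).result_rows

  for local_ts, event_type, n in rows:
      print(local_ts, event_type, n)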
  

2) ETL rewrite examples

We present two common migration patterns: batch dump (Snowflake -> S3 -> ClickHouse) and near-real-time streaming (Snowflake -> Kafka -> ClickHouse). Both show concrete commands that you can adapt to production.

Pattern A — Batch: Snowflake export to S3 then ClickHouse ingest

Steps: export from Snowflake using COPY INTO to S3 as compressed CSV/Parquet, then use ClickHouse client or S3 table function to insert.

  -- Snowflake: export Parquet to S3 (assumes a storage integration or stage credentials are configured)
  COPY INTO 's3://my-bucket/snowflake-exports/events_'
  FROM my_db.public.events
  FILE_FORMAT = (TYPE = PARQUET COMPRESSION = 'SNAPPY')
  HEADER = TRUE
  SINGLE = FALSE;

  -- ClickHouse: ingest using the s3 table function (the server pulls the files directly)
  INSERT INTO analytics.events
  SELECT
    toUInt64(event_id) AS event_id,   -- NUMBER(38,0) unloads as Decimal; cast to the target type
    toUInt64(user_id) AS user_id,
    event_type,
    payload,                          -- VARIANT unloads as JSON text, which maps to String
    ts
  FROM s3('https://s3.amazonaws.com/my-bucket/snowflake-exports/events_*.parquet', 'Parquet');
  

Note: ClickHouse's s3 table function lets the server pull files directly; avoid local download for large datasets.

Pattern B — Near-real-time: Snowflake -> Kafka -> ClickHouse

Use a CDC connector (Debezium, Fivetran, Airbyte) on the upstream source, or Snowflake streams and tasks, to publish change events into Kafka, then use ClickHouse's Kafka engine plus a materialized view to consume them.

  -- ClickHouse: create Kafka engine table and materialized view
  CREATE TABLE kafka_events (
    raw String
  ) ENGINE = Kafka SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'events',
    kafka_group_name = 'ch_events',
    kafka_format = 'JSONAsString';  -- one JSON document per message, delivered as a single String column

  CREATE MATERIALIZED VIEW mv_events TO analytics.events AS
  SELECT
    JSONExtractUInt(raw, 'event_id') AS event_id,
    JSONExtractUInt(raw, 'user_id') AS user_id,
    JSONExtractString(raw, 'event_type') AS event_type,
    JSONExtractRaw(raw, 'payload') AS payload,  -- keep nested payload as raw JSON text
    parseDateTime64BestEffort(JSONExtractString(raw, 'ts')) AS ts
  FROM kafka_events;
  

3) Benchmark and cost scripts

Benchmarking must measure both latency and economics. Below are reusable scripts to compare query latency and to estimate Snowflake credits vs ClickHouse node-hours.

What to measure

  • Query latency P50/P95/P99 under target concurrency
  • Throughput — rows/sec ingested and queries/sec served
  • Cost — Snowflake credits consumed vs ClickHouse cloud/node cost
  • Storage cost — compressed storage on Snowflake vs ClickHouse (S3 or local)

Simple bash benchmark runner (ClickHouse)

  #!/bin/bash
  # run_bench_ch.sh - measure per-query timings for ClickHouse (one query per line in queries.sql)
  CH_HOST=localhost
  CH_USER=default
  QUERY_FILE=queries.sql
  LOG=bench_ch.log

  echo "Starting ClickHouse bench: $(date)" > "$LOG"
  while IFS= read -r q; do
    [ -z "$q" ] && continue
    start=$(date +%s%3N)
    clickhouse-client --host "$CH_HOST" --user "$CH_USER" --query="$q" >/dev/null
    end=$(date +%s%3N)
    echo "$q|$((end-start)) ms" >> "$LOG"
  done < "$QUERY_FILE"
  echo "Done: $(date)" >> "$LOG"
  

Python runner to compare Snowflake and ClickHouse and estimate cost

"""
  bench_compare.py
  Run same SQL against Snowflake and ClickHouse, record latency and estimate cost.
  Requires: snowflake-connector-python, clickhouse-connect
  """
  import time
  import json
  import clickhouse_connect
  import snowflake.connector

  # Config
  sf_cfg = { 'user': 'USER', 'password': 'PWD', 'account': 'acct', 'warehouse': 'WH' }
  ch_cfg = { 'host': 'localhost', 'port': 9000 }

  queries = ["SELECT count(*) FROM analytics.events WHERE ts >= now() - interval 1 day;",]

  # Snowflake client
  sf = snowflake.connector.connect(**sf_cfg)
  ch = clickhouse_connect.Client(**ch_cfg)

  results = []
  for q in queries:
      t0 = time.time(); _ = ch.query(q); ch_time = time.time() - t0
      t0 = time.time(); _ = sf.cursor().execute(q).fetchall(); sf_time = time.time() - t0
      # crude Snowflake cost estimate: warehouse size to credits/hr mapping
      credits_per_hour = 1.0  # fill in based on WH size
      sf_cost = credits_per_hour * (sf_time/3600.0) *  (/*$credit_price*/ 3.00) # update price
      ch_node_hour_cost = 0.50 # $/node-hour placeholder
      ch_cost = ch_node_hour_cost * (ch_time/3600.0)
      results.append({'query': q, 'ch_ms': ch_time*1000, 'sf_ms': sf_time*1000, 'sf_cost': sf_cost, 'ch_cost': ch_cost})

  print(json.dumps(results, indent=2))
  

Notes: replace placeholder costs with your vendor pricing (Snowflake credit price, ClickHouse Cloud node cost or self-hosted infra cost). Run the runner under realistic concurrency using tools like ghz/vegeta or custom thread pools.
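
For a rough look at behavior under load without external tools, here is a minimal thread-pool sketch against ClickHouse; it assumes clickhouse-connect, a local server, and placeholder values for the query and concurrency level:

  # bench_concurrency.py -- rough concurrency probe for ClickHouse (sketch; adjust host, query, thread count)
  import time
  from concurrent.futures import ThreadPoolExecutor
  import clickhouse_connect

  QUERY = "SELECT count(*) FROM analytics.events WHERE ts >= now() - INTERVAL 1 DAY"  # placeholder query
  CONCURRENCY = 32            # simulated number of simultaneous dashboard users
  REQUESTS_PER_WORKER = 20

  def worker(_):
      # one client per thread; avoid sharing a single client across threads
      client = clickhouse_connect.get_client(host='localhost', port=8123)
      latencies = []
      for _ in range(REQUESTS_PER_WORKER):
          t0 = time.time()
          client.query(QUERY)
          latencies.append((time.time() - t0) * 1000)
      return latencies

  with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
      all_ms = sorted(ms for batch in pool.map(worker, range(CONCURRENCY)) for ms in batch)

  def pct(p):
      # nearest-rank percentile over the collected latencies
      return all_ms[min(len(all_ms) - 1, int(p / 100 * len(all_ms)))]

  print(f"P50={pct(50):.1f} ms  P95={pct(95):.1f} ms  P99={pct(99):.1f} ms  ({len(all_ms)} queries)")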

4) Performance tuning knobs (ClickHouse focus)

  • ORDER BY: This determines how MergeTree physically sorts data and builds its sparse primary index. Put columns that appear most often in filters first, generally ordered from lower to higher cardinality.
  • PARTITION BY: Use monthly partitions (toYYYYMM(ts)) for time-series retention and efficient TTL DROP PARTITION.
  • Compression: Choose LZ4 (default) for speed; ZSTD for better compression at higher CPU cost.
  • LowCardinality: Use for string dimensions with many repeats to reduce memory usage and accelerate GROUP BY.
  • Materialized views: Pre-aggregate heavy group-bys. Be aware of the insert-time overhead they add and the resource cost of backfilling several of them concurrently.
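
To make these knobs concrete, here is a minimal sketch that applies them in one DDL statement via clickhouse-connect; the table name, ZSTD level, ORDER BY choice, and 90-day TTL are illustrative assumptions rather than values taken from the schema above:

  # create_tuned_events.py -- illustrative tuned table (names, codec level, and TTL are assumptions)
  import clickhouse_connect

  client = clickhouse_connect.get_client(host='localhost', port=8123)

  client.command("""
  CREATE TABLE IF NOT EXISTS analytics.events_tuned (
      event_id   UInt64,
      user_id    UInt64,
      event_type LowCardinality(String),       -- repeated dimension values: cheaper GROUP BY
      payload    String CODEC(ZSTD(3)),        -- better compression for large JSON blobs at extra CPU cost
      ts         DateTime64(3)
  ) ENGINE = MergeTree
  PARTITION BY toYYYYMM(ts)                    -- monthly partitions: cheap retention via DROP PARTITION
  ORDER BY (event_type, user_id, ts)           -- filter columns first, roughly low to high cardinality
  TTL toDateTime(ts) + INTERVAL 90 DAY         -- explicit retention; ClickHouse has no Time Travel
  """)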

5) Common migration pitfalls and how to avoid them

Handle these early; they are the top causes of failed or delayed migrations.

  1. Assuming 1:1 SQL parity

    Snowflake has functions and extensions (e.g., VARIANT and other semi-structured query helpers) that ClickHouse doesn’t replicate exactly. Test complex SQL and rewrite it using ClickHouse equivalents (JSONExtract*, array functions); see the rewrite sketch after this list.

  2. Underestimating ORDER BY & PARTITION design

    Choosing a poor ORDER BY causes poor compression and slow range scans. Prototype with realistic data shapes (skew, cardinality) and iterate.

  3. Ignoring nullability and sentinel planning

    Nullable columns increase storage and CPU overhead. Where possible, normalize inputs or use sentinel values and document them across teams.

  4. Cost comparison errors

    Don’t compare ClickHouse per-query cost to Snowflake credits directly without normalizing for concurrency and idle costs. Include infra baseline (nodes or cloud), storage, and operational overhead.

  5. Security & governance gaps

    Map Snowflake roles and masking policies to ClickHouse RBAC and row-level security (via views or external proxy). Audit access and encryption-at-rest before cutting production traffic.

  6. Not validating eventual consistency

    Background merges and replication in ClickHouse are asynchronous, so recent reads can see pre-merge duplicates or partial aggregates with engines like ReplacingMergeTree or SummingMergeTree. For strict transactional semantics, retain the source system or rethink consumer logic.

  7. Data freshness & retention mismatch

    Snowflake Time Travel and failsafe features don’t exist in ClickHouse. Implement backups and retention explicitly (S3 backups or snapshots).
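
As a small illustration of pitfall 1, here is a hedged sketch of rewriting one hypothetical Snowflake VARIANT query into ClickHouse JSON functions; the column path and table names are assumptions, and the ClickHouse version assumes payload is stored as a raw JSON String:

  # variant_rewrite.py -- one hypothetical query in both dialects (sketch)
  snowflake_sql = """
  SELECT payload:device:os::string AS os, count(*) AS n
  FROM my_db.public.events
  WHERE payload:device:os IS NOT NULL
  GROUP BY 1
  """

  clickhouse_sql = """
  SELECT JSONExtractString(payload, 'device', 'os') AS os, count(*) AS n
  FROM analytics.events
  WHERE JSONExtractString(payload, 'device', 'os') != ''
  GROUP BY os
  """

  # Print the pair side by side for review before porting dashboards
  for name, sql in [("snowflake", snowflake_sql), ("clickhouse", clickhouse_sql)]:
      print(f"-- {name}\n{sql}")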

6) Migration checklist (step-by-step)

  1. Run a discovery of schemas, queries, and job owners (catalog all objects).
  2. Classify workloads: interactive dashboards, backfills, ETL pipelines, long-running ad-hoc queries.
  3. Prototype: choose 1–2 representative dashboards and a staging dataset; map the schema and run the benchmark scripts.
  4. Rewrite ETL for the chosen ingestion pattern (batch vs streaming). Include data validation checksums and row counts per batch.
  5. Performance-tune ORDER BY/PARTITION and test under concurrency; use low-cardinality types where applicable.
  6. Implement access control, encryption, and audit logging. Update IAM and DB proxies.
  7. Run a parallel production validation period: route a subset of traffic to ClickHouse and compare results with row-level diff sampling (see the validation sketch after this checklist).
  8. Cut over gradually: migrate read-only dashboards first, then ETLs, then switch writers.
  9. Keep a rollback plan: snapshot or redirect ingestion back to Snowflake for a rollback window.

Pro tip: Keep Snowflake for ad-hoc exploration during migration. It’s often faster for unknown data discovery while ClickHouse becomes your high-scale serving layer.
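
For checklist steps 4 and 7, here is a minimal sketch of per-day row-count validation across both systems; connection settings and table names are placeholders, and you can extend the queries with checksums (for example, sums of numeric columns) for stronger guarantees:

  # validate_counts.py -- compare per-day row counts between Snowflake and ClickHouse (sketch)
  import snowflake.connector
  import clickhouse_connect

  sf = snowflake.connector.connect(user='USER', password='PWD', account='acct', warehouse='WH')
  ch = clickhouse_connect.get_client(host='localhost', port=8123)

  sf_rows = sf.cursor().execute(
      "SELECT ts::date AS d, count(*) FROM my_db.public.events GROUP BY 1 ORDER BY 1"
  ).fetchall()
  ch_rows = ch.query(
      "SELECT toDate(ts) AS d, count(*) FROM analytics.events GROUP BY d ORDER BY d"
  ).result_rows

  sf_counts = {str(d): n for d, n in sf_rows}
  ch_counts = {str(d): n for d, n in ch_rows}

  for day in sorted(set(sf_counts) | set(ch_counts)):
      a, b = sf_counts.get(day, 0), ch_counts.get(day, 0)
      status = "OK" if a == b else "MISMATCH"
      print(f"{day}  snowflake={a}  clickhouse={b}  {status}")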

7) Real-world example: migrating a 1TB event table

Summary of a successful migration we’ve seen in 2025–2026:

  • Use-case: analytics dashboards with 1,000 concurrent users and sub-second median latencies required.
  • Strategy: export month-by-month Parquet snapshots to S3, then server-side ingest into a partitioned MergeTree with ORDER BY (user_id, ts).
  • Results: 3–5x lower query latency on high-concurrency dashboards and roughly 45% lower monthly spend when comparing Snowflake credits against ClickHouse Cloud node costs (including storage on S3).
  • Lessons: materialized views for three high-cardinality group-bys reduced query cost further; however, the initial partition and ORDER BY tuning took several iterations.

What to watch in 2026

  • ClickHouse ecosystem growth: managed services, richer cloud integrations, and better SQL compatibility layers are maturing in 2025–2026.
  • Hybrid architectures: many teams run Snowflake for exploratory analysis and ClickHouse for serving high-QPS dashboards; expect tooling to standardize this pattern.
  • Vectorized UDFs and ML integration: ClickHouse is adding better integration with model-scoring pipelines; plan for nearline feature stores.

Actionable takeaways

  • Start with a 2–4 week proof of concept: pick representative queries and datasets and run the benchmark scripts in this guide.
  • Design ORDER BY and PARTITION to reflect query patterns, not Snowflake primary-key assumptions.
  • Use streaming for low-latency needs; use batch exports for bulk historical replays.
  • Always validate correctness with row-level checksums and sampling across both systems during cutover.

Conclusion & next steps

Migrating analytics from Snowflake to ClickHouse is a high-return engineering effort when your workloads demand predictable costs and low-latency, high-concurrency reads. Use the schema mapping rules, ETL patterns, and benchmark scripts above as your migration backbone. Expect to iterate on ORDER BY, partitioning, and materialized views to reach target performance.

Call to action

If you want a custom migration checklist or a tailored benchmark for your dataset, download our sample repo (DDL, scripts, and runners) or contact our migration workshop at codenscripts.com/migrate. Start a free POC this quarter and reduce your analytics cost and latency while keeping Snowflake for exploratory workloads during the transition.
