A deep dive into the MergeTree engine - parts, granules, sparse indexes, merges, mutations, data skipping indices, compression codecs, TTL, and the key settings that control it all.

ClickHouse earns its reputation as one of the fastest analytical databases available today. But that speed is not magic - it is a direct result of how the MergeTree engine organizes, stores, and processes data. If you operate ClickHouse in production, understanding this engine is not optional. It is the foundation for every optimization effort and troubleshooting session you will ever face.

This post walks through the core mechanics of MergeTree - how data is physically stored, how the sparse index works, how merges and mutations behave, and the key settings and knobs that let you control it all.

Parts: The Unit of Storage

Every INSERT into a MergeTree table creates a new data part on disk. A part is a self-contained directory containing column data files, an index file, and metadata - all sorted by the table's ORDER BY key. Parts are immutable once written. ClickHouse never modifies a part in place.

Parts come in two formats:

  • Wide format: each column is stored as a separate file on disk. This is the default for larger parts and is optimal for analytical scans that only touch a subset of columns.
  • Compact format: all columns are packed into a single file. This is used for smaller parts to reduce the number of file descriptors and filesystem overhead from frequent small inserts.

The min_bytes_for_wide_part and min_rows_for_wide_part settings control when ClickHouse switches from compact to wide format: a part is written in wide format only once it reaches both thresholds, and stays compact if it falls below either one.
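
Both are table-level MergeTree settings and can be overridden per table. A minimal sketch, using a hypothetical table (the values shown here are just the defaults, spelled out for illustration):

CREATE TABLE tiny_events (
    id UInt64,
    payload String
)
ENGINE = MergeTree
ORDER BY id
SETTINGS min_bytes_for_wide_part = 10485760,  -- 10 MiB; below this (or the row threshold) parts stay compact
         min_rows_for_wide_part = 0;          -- 0 disables the row-count threshold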

You can inspect your parts at any time:

SELECT
    table,
    partition,
    name,
    part_type,
    rows,
    bytes_on_disk
FROM system.parts
WHERE active AND database = 'your_db'
ORDER BY bytes_on_disk DESC;
  

Granules and the Sparse Index

Inside each part, data is divided into granules - the smallest unit of data ClickHouse reads during a query. Each granule contains up to index_granularity rows (default: 8192). The "up to" is important: index_granularity_bytes (default: 10MB) is also enabled by default, which means ClickHouse will start a new granule when either the row count limit or the byte size limit is reached, whichever comes first. For tables with wide rows, granules may contain significantly fewer than 8192 rows. ClickHouse never splits a row across granules, so every granule holds an integer number of rows.
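
You can see the effective granule size of your own parts by comparing row and mark counts in system.parts (there is roughly one mark per granule, so the ratio is an approximation):

SELECT
    name,
    rows,
    marks,
    round(rows / marks) AS approx_rows_per_granule
FROM system.parts
WHERE active AND database = 'your_db' AND table = 'your_table'
ORDER BY rows DESC;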

For each part, ClickHouse writes a primary index file that records the primary key value at the start of each granule, plus mark files that record the byte offsets of each granule within the column data files. This is what makes the MergeTree index sparse: instead of indexing every row, it indexes only granule boundaries.

This sparse design is what allows ClickHouse to handle billions of rows while keeping the index small enough to fit entirely in RAM. When a query filters on the primary key, ClickHouse uses binary search over the primary index entries to locate the relevant granules and reads only those, skipping everything else.
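
You can check how large that in-memory index actually is - system.parts reports the bytes of primary index loaded for each part:

SELECT
    table,
    formatReadableSize(sum(primary_key_bytes_in_memory)) AS pk_in_memory
FROM system.parts
WHERE active AND database = 'your_db'
GROUP BY table
ORDER BY sum(primary_key_bytes_in_memory) DESC;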

The tradeoff: when reading a single primary key range, ClickHouse may read up to index_granularity * 2 extra rows at the boundaries. For analytical workloads scanning thousands or millions of rows, this overhead is negligible. For point lookups on individual rows, it means you are always reading at least one full granule.

Primary Key vs. Sorting Key

These two concepts are related but distinct, and confusing them leads to suboptimal schema design.

The sorting key (ORDER BY) determines the physical sort order of data within each part. Every part's data is sorted by this expression.

The primary key (PRIMARY KEY) determines which columns are included in the sparse index (the marks file). By default, the primary key equals the sorting key. But you can define a primary key that is a prefix of the sorting key:

CREATE TABLE events (
    tenant_id UInt32,
    timestamp DateTime,
    event_type LowCardinality(String),
    payload String
)
ENGINE = MergeTree
ORDER BY (tenant_id, timestamp, event_type)
PRIMARY KEY (tenant_id, timestamp);
  

Here, data is physically sorted by all three columns, but only tenant_id and timestamp appear in the sparse index. This keeps the index smaller while still allowing ClickHouse to benefit from the full sort order during merges and queries that scan sorted ranges.
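
You can verify which granules a query actually touches with EXPLAIN indexes = 1, which shows the parts and granules remaining after primary key analysis (the filter values below are illustrative):

EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE tenant_id = 42
  AND timestamp >= now() - INTERVAL 1 DAY;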

There is another benefit: better compression. ClickHouse compresses column data within each part, and compression algorithms work best when similar values are adjacent. Sorting by additional columns beyond what the primary key needs groups identical or similar values together. In the example above, sorting by event_type as the third column means all rows with the same tenant, timestamp range, and event type are physically next to each other on disk - leading to significantly better compression ratios for the event_type column and any columns that correlate with it.

The rule is simple: the primary key must always be a prefix of the sorting key expression tuple.

The Merge Process

Over time, background threads pick up small parts and merge them into larger ones. This is the "merge" in MergeTree. Fewer, larger parts mean fewer files to open and less overhead when scanning data. The engine continuously optimizes the physical layout of your data without manual intervention.

When Merges Happen

Merges are triggered automatically by a pool of background worker threads. The pool size is controlled by the server-level background_pool_size setting (default: 16), and background_merges_mutations_concurrency_ratio (default: 2) multiplies it to determine the maximum number of concurrent merge and mutation tasks - so by default, up to 32 tasks can run simultaneously. Merges and mutations share this pool.

After every insert creates a new part, ClickHouse evaluates whether a merge is warranted. The merge selection algorithm picks parts within the same partition, preferring to combine parts of similar size to avoid repeatedly rewriting large data. The total size of parts that can be combined in a single merge is bounded by max_bytes_to_merge_at_max_space_in_pool (default: 150 GB) when disk space is plentiful. When the background pool is nearly full - specifically, when the number of free task slots drops below number_of_free_entries_in_pool_to_lower_max_size_of_merge (default: 8) - ClickHouse automatically reduces the maximum allowed merge size. This keeps slots available for smaller, more urgent merges rather than letting a few large merges monopolize the entire pool. At the other extreme, max_bytes_to_merge_at_min_space_in_pool (default: 1 MB) caps the size of merges that are still allowed to run when free disk space is scarce, so small merges can keep reducing part counts even under pressure.
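
Merges currently in flight are visible in system.merges - a useful first stop when part counts start climbing:

SELECT
    table,
    round(elapsed, 1) AS elapsed_sec,
    round(progress * 100, 1) AS progress_pct,
    num_parts,
    formatReadableSize(total_size_bytes_compressed) AS size,
    formatReadableSize(memory_usage) AS memory
FROM system.merges
ORDER BY elapsed DESC;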

You can also trigger merges manually with OPTIMIZE TABLE (merges one set of parts) or OPTIMIZE TABLE ... FINAL (forces a full compaction into a single part per partition). Use FINAL sparingly in production - it is an expensive blocking operation that rewrites all data. The setting min_age_to_force_merge_seconds can be used to automatically force merges on partitions where all parts are older than a given age - useful for ensuring old data eventually gets fully compacted even when the normal heuristics would not select it.
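
For example (the partition value assumes, hypothetically, that events is partitioned by toYYYYMM(timestamp), which the earlier CREATE TABLE does not actually declare):

-- Compact a single partition; expensive, so scope it narrowly
OPTIMIZE TABLE events PARTITION 202401 FINAL;

-- Force merges on partitions whose parts are all older than one day
ALTER TABLE events MODIFY SETTING min_age_to_force_merge_seconds = 86400;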

When Merges Cannot Keep Up

Merges are not free. They consume CPU, memory, and disk I/O. When ingest rate outpaces the merge capacity, part counts grow. ClickHouse has a multi-layered set of safety mechanisms for this.

Active parts thresholds (per partition):

  • parts_to_delay_insert (default: 150): when the number of active parts in a partition exceeds this value, ClickHouse adds an artificial delay to each INSERT. This is not a fixed slowdown - the delay grows exponentially as the part count approaches parts_to_throw_insert, applying increasing back-pressure on writers to give merges time to catch up.
  • parts_to_throw_insert (default: 300): when the part count reaches this threshold, inserts are rejected entirely with a "Too many parts" error. This is ClickHouse protecting itself from an unsustainable workload.

Inactive parts thresholds (per partition):

After parts are merged, the old source parts become inactive and are scheduled for deletion. If cleanup lags behind - for example, because parts are still held by long-running queries - inactive parts accumulate on disk. Two parallel settings protect against this:

  • inactive_parts_to_delay_insert (default: 0, disabled): when set, delays inserts once inactive parts exceed this threshold.
  • inactive_parts_to_throw_insert (default: 0, disabled): when set, rejects inserts if inactive parts pile up beyond this limit.

Global limit:

  • max_parts_in_total (default: 100,000): a safety limit across all partitions of a table. If the total part count exceeds this, inserts are rejected regardless of per-partition counts.
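
To see how close a table is to these thresholds, count the active parts per partition:

SELECT
    partition,
    count() AS active_parts
FROM system.parts
WHERE active AND database = 'your_db' AND table = 'your_table'
GROUP BY partition
ORDER BY active_parts DESC;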

The most common cause of "too many parts" is not a configuration problem - it is a workload problem. Doing many small inserts (single rows or just a handful of rows per insert) creates a flood of tiny parts that merges cannot consolidate fast enough. The fix is almost always batching your inserts - either application-side or by enabling ClickHouse's async inserts feature, which accumulates rows internally and flushes them in larger batches.
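
A minimal sketch of the async-insert route, enabled per statement (the values match the events schema from earlier and are placeholders; in practice these settings usually live in a user profile or in the client connection settings rather than on every INSERT):

INSERT INTO events
SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES (42, now(), 'page_view', '{}');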

Over-granular partitioning is another common culprit. Daily partitions instead of monthly partitions multiply the number of parts the engine has to manage. Each partition merges independently, so more partitions means more parallel merge work for the same data volume.

Mutations: When Data Gets Rewritten

When you run an ALTER TABLE ... UPDATE or ALTER TABLE ... DELETE, ClickHouse creates a mutation. Unlike regular merges, mutations rewrite affected parts to produce new versions with the modified data. This can be extremely expensive - a mutation on a large table effectively rewrites every part.

Mutations run in the background and can take a long time. You can monitor their progress:

SELECT
    table,
    command,
    create_time,
    parts_to_do,
    is_done,
    latest_fail_reason
FROM system.mutations
WHERE NOT is_done
ORDER BY create_time;
  

A few practical guidelines for mutations:

  • Do not use them for routine data retention. If you regularly delete old data with ALTER TABLE ... DELETE WHERE date < ..., switch to TTL instead. A monthly delete mutation on a large table means a full rewrite of every affected part.
  • Run mutations during low traffic. They compete for the same I/O and CPU resources as merges and queries.
  • For large rewrites, consider INSERT ... SELECT + rename. Creating a new table with the desired data and swapping it in via RENAME TABLE is often faster and less disruptive than a mutation. You can also chunk this partition by partition using INSERT INTO new_table SELECT ... FROM old_table WHERE partition_key = ... combined with ALTER TABLE ... DROP PARTITION. The tradeoff is disk space: this approach requires enough free space to hold both the old and new copies simultaneously, while mutations work on individual parts with limited parallelism and only need temporary space for the parts currently being rewritten.
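
A sketch of that last pattern - the filter is illustrative, and EXCHANGE TABLES assumes an Atomic database (the default in current versions):

-- Build a corrected copy of the table
CREATE TABLE events_new AS events;

INSERT INTO events_new
SELECT *
FROM events
WHERE event_type != 'debug';    -- keep everything except the rows to be removed

-- Swap the new table in atomically (or use RENAME TABLE on older setups)
EXCHANGE TABLES events AND events_new;

-- events_new now holds the old data; drop it once you are confident
DROP TABLE events_new;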

Data Skipping Indices

The primary sparse index only helps when queries filter on the sorting key columns. For queries that filter on other columns, ClickHouse would normally scan every granule. Data skipping indices solve this by storing lightweight summaries for non-key columns, allowing ClickHouse to skip granules that cannot match.

ALTER TABLE events
    ADD INDEX idx_event_type event_type
    TYPE set(100) GRANULARITY 4;
  

The GRANULARITY parameter controls how many granules are grouped into one index block. A value of 4 means the index stores one summary per 4 consecutive granules.
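
One thing to keep in mind: ADD INDEX only applies to parts written after the index is defined. To build the index for existing data, materialize it explicitly - this runs as a mutation, with the costs described earlier:

ALTER TABLE events MATERIALIZE INDEX idx_event_type;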

The main index types:

  • minmax: stores the minimum and maximum value per index block. Effective for range filters on numeric or date columns.
  • set(max_rows): stores up to max_rows unique values per block. If the block has more unique values than the limit, the index entry is not created for that block. Works well for low-cardinality columns.
  • bloom_filter(false_positive_rate): a probabilistic structure that can tell you a value is definitely not in a block. The default false-positive rate is 0.025. Good for high-cardinality columns where you filter for specific values.
  • ngrambf_v1(n, size, hashes, seed): an n-gram bloom filter for substring matching (e.g., LIKE '%search_term%').
  • tokenbf_v1(size, hashes, seed): a token-based bloom filter for text search on tokenized content.

Data skipping indices are not a replacement for choosing the right sorting key. They are a complement for secondary access patterns that the primary key does not cover. If a column is in your most important filter, it probably belongs in your ORDER BY.

Compression and Column Settings

ClickHouse compresses column data by default using LZ4. But you can choose different codecs per column, and this choice can have a significant impact on storage size and query performance.

CREATE TABLE metrics (
    timestamp DateTime CODEC(DoubleDelta, LZ4),
    value Float64 CODEC(Gorilla, LZ4),
    tag LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY timestamp;
  

Specialized codecs like DoubleDelta (for monotonically increasing sequences like timestamps) and Gorilla (for floating-point values that change slowly) can dramatically reduce data size by encoding the differences between consecutive values before applying general-purpose compression.

A few things we frequently find when reviewing schemas:

  • String columns with a small number of distinct values not using LowCardinality. This is a common miss. If a column like event_type or status has tens or hundreds of distinct values, LowCardinality(String) stores a dictionary of unique values and replaces each occurrence with a small integer pointer. This is not just a compression technique - it also makes filtering and grouping faster. It is always worth checking: run a query to count distinct values, and if the cardinality is low, switch to LowCardinality (see the sketch after this list).
  • Oversized integer types. An Int64 for a column whose values never exceed 255 wastes bytes in every granule. Use Int8, Int16, or UInt8 where the data range allows. These add up at scale.
  • Using heavier compression (ZSTD) on rarely-queried archive columns, while keeping LZ4 (faster decompression) for hot columns. This is a valid strategy when storage cost matters more than read latency for specific columns.
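
A minimal sketch of that cardinality check and the fix, assuming a hypothetical plain String column named status:

-- How many distinct values does the column actually hold?
SELECT uniqExact(status) FROM your_table;

-- If the count is low, change the type; this rewrites the column via a mutation
ALTER TABLE your_table MODIFY COLUMN status LowCardinality(String);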

You can inspect the storage impact of each column:

SELECT
    column,
    type,
    formatReadableSize(sum(column_data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed,
    round(sum(column_data_uncompressed_bytes) / sum(column_data_compressed_bytes), 2) AS ratio
FROM system.parts_columns
WHERE active AND database = 'your_db' AND table = 'your_table'
GROUP BY column, type
ORDER BY sum(column_data_compressed_bytes) DESC;
  

TTL: Automatic Data Lifecycle

MergeTree supports TTL (Time-To-Live) rules at both the column and table level. TTL is not just for deletion - it can trigger data movement, recompression, or aggregation.

Table-level TTL supports several actions:

  • DELETE: remove expired rows (the default).
  • TO DISK 'disk_name' / TO VOLUME 'volume_name': move expired parts to cheaper storage tiers (e.g., from SSD to HDD, or from local disk to S3).
  • RECOMPRESS codec: apply heavier compression to aged data. For example, switch from LZ4 to ZSTD(3) after 30 days.
  • GROUP BY: aggregate expired rows into summary records before deleting the originals. Useful for downsampling time-series data.

For example, several of these actions can be combined in a single TTL expression:

ALTER TABLE metrics
    MODIFY TTL
        timestamp + INTERVAL 7 DAY TO DISK 'cold_storage',
        timestamp + INTERVAL 30 DAY RECOMPRESS CODEC(ZSTD(3)),
        timestamp + INTERVAL 90 DAY DELETE;
  

TTL rules are enforced during merges. The merge_with_ttl_timeout setting controls how frequently ClickHouse triggers TTL-specific merges. If you need TTL to be applied promptly, reduce this value - but be aware this creates additional merge load.
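
Both merge_with_ttl_timeout and ttl_only_drop_parts (covered in the settings reference below) are table-level MergeTree settings and can be changed in place - the values here are illustrative:

ALTER TABLE metrics
    MODIFY SETTING merge_with_ttl_timeout = 3600, ttl_only_drop_parts = 1;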

For data retention, TTL is almost always preferable to periodic ALTER TABLE ... DELETE mutations. TTL integrates with the existing merge process and avoids the expensive full-part rewrites that mutations cause.

Key Settings Reference

Here is a summary of the most important MergeTree settings and what they control:

Granularity and part format:

  • index_granularity (default: 8192): rows per granule. Smaller values make point queries more precise but increase index size.
  • index_granularity_bytes (default: 10MB): maximum bytes per granule. Enables adaptive granularity for variable-width rows.
  • min_bytes_for_wide_part (default: 10MB): parts below this size are stored in compact format.
  • min_rows_for_wide_part (default: 0): row-count threshold for wide format (0 = use the bytes threshold only).

Merge behavior:

  • max_bytes_to_merge_at_max_space_in_pool (default: 150 GB): maximum total size of source parts for a single automatic merge when disk space is plentiful.
  • max_bytes_to_merge_at_min_space_in_pool (default: 1 MB): maximum size of merges still allowed when free disk space is scarce, so small merges can keep reducing part counts.
  • number_of_free_entries_in_pool_to_lower_max_size_of_merge (default: 8): when fewer than this many background pool slots are free, ClickHouse reduces the maximum merge size to keep slots open for smaller merges.
  • min_age_to_force_merge_seconds (default: 0, disabled): when set, forces merges on partitions where all parts are older than this threshold - useful for compacting old data.
  • merge_with_ttl_timeout (default: 14400, i.e. 4 hours): minimum number of seconds between TTL-triggered merges.
  • ttl_only_drop_parts (default: 0): when enabled, TTL drops entire parts instead of rewriting them row by row (more efficient when whole parts expire).

Insert throttling and part count limits:

  • parts_to_delay_insert (default: 150): active part count per partition at which inserts start being throttled with exponential backoff.
  • parts_to_throw_insert (default: 300): active part count per partition at which inserts are rejected.
  • inactive_parts_to_delay_insert (default: 0, disabled): inactive part count per partition at which inserts are throttled. Protects against cleanup lag.
  • inactive_parts_to_throw_insert (default: 0, disabled): inactive part count per partition at which inserts are rejected.
  • max_parts_in_total (default: 100,000): global part count limit across all partitions of a table.

Key Takeaways

  • Every INSERT creates an immutable part. Parts come in compact or wide format depending on their size. Background merges consolidate parts; mutations rewrite them.
  • The sparse index works at the granule level (default 8192 rows), not at the row level. It is designed to fit in RAM and enables ClickHouse to skip irrelevant granules during reads.
  • The sorting key controls physical data order; the primary key (a prefix of the sorting key) controls what goes into the sparse index.
  • "Too many parts" is almost always a workload problem (small inserts, over-granular partitions), not a configuration problem. Batch your inserts or use async inserts.
  • Data skipping indices (minmax, set, bloom_filter) let you accelerate queries that filter on non-key columns, but they complement - not replace - a well-chosen sorting key.
  • Choose the right data types and compression codecs. LowCardinality for low-cardinality strings and right-sized integers can reduce storage by orders of magnitude.
  • Use TTL for data lifecycle management instead of periodic delete mutations. TTL can move data to cheaper storage, recompress it, or aggregate it - all integrated with the merge process.

Everything covered in this post applies to the base MergeTree engine, but ClickHouse ships with a family of specialized variants that extend it. ReplacingMergeTree deduplicates rows by primary key during merges - useful for upsert-style workloads. AggregatingMergeTree and SummingMergeTree pre-aggregate rows during merges, which is powerful for materialized views that maintain running totals or rollups. CollapsingMergeTree and VersionedCollapsingMergeTree handle row-level change tracking by collapsing insert/cancel pairs. Each of these inherits all the mechanics described above - parts, granules, sparse indexes, merges, TTL - and adds its own merge-time logic on top. Any of these engines can also be prefixed with Replicated (e.g., ReplicatedMergeTree, ReplicatedAggregatingMergeTree) to get automatic multi-replica synchronization via ClickHouse Keeper or ZooKeeper, and all of them can be sharded across nodes using a Distributed table for horizontal scaling.