Understand why ALTER TABLE UPDATE and DELETE operations can grind your ClickHouse cluster to a halt, and learn practical alternatives like insert-select-rename and partition-level operations.

One of the most common causes of unexpected slowdowns in ClickHouse production systems is mutations that have gone wrong. The symptoms are vague - general system degradation, queries getting slower, background operations stalling - and the root cause is not always obvious until you check system.mutations and find an ALTER TABLE operation that has been running for hours or even days.

Understanding what mutations are, why they are expensive, and when to avoid them entirely is essential knowledge for anyone operating ClickHouse at scale.

What Mutations Are and Why They Are Expensive

In ClickHouse, a mutation is any operation that logically modifies existing data in a MergeTree table. The two most common triggers are:

  • ALTER TABLE ... UPDATE - modifying values in existing rows
  • ALTER TABLE ... DELETE - removing rows that match a condition

Because MergeTree data parts are immutable, ClickHouse cannot modify them in place. Instead, a mutation rewrites every affected part, producing new versions with the changes applied. In many cases, this means a full table rewrite - every part in the table is read, transformed, and written back to disk.

This is fundamentally different from an UPDATE or DELETE in a traditional row-store database. In ClickHouse, even a seemingly simple ALTER TABLE events DELETE WHERE user_id = 42 may need to rewrite billions of rows across hundreds of parts, consuming significant CPU, disk I/O, and time.

Mutations run as background operations, and while they execute, they compete for resources with regular merges and queries. A long-running mutation on a large table can visibly degrade the entire system's performance.
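You can observe this contention directly: running mutations appear alongside regular merges in system.merges, flagged by the is_mutation column. A quick sketch for checking what background work is currently consuming resources:

SELECT
    database,
    table,
    elapsed,
    progress,
    is_mutation
FROM system.merges
ORDER BY elapsed DESC

If the longest-running entries have is_mutation = 1 while merge progress elsewhere stalls, the mutation is likely the source of the slowdown.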

Diagnosing Stuck or Slow Mutations

You can check the status of all currently running mutations with a straightforward query:

SELECT
    database,
    table,
    command,
    create_time,
    now() - create_time AS elapsed,
    parts_to_do,
    is_done
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time ASC

This shows you what mutation command is running, when it started, how many parts still need processing, and whether it has completed. If you see a mutation that has been running for a long time with parts_to_do not decreasing, you have a stuck or severely throttled mutation.

Common causes of slow mutations include:

  • Large tables. A mutation on a table with hundreds of billions of rows is inherently slow, regardless of how targeted your WHERE clause is.
  • Competing workloads. Heavy insert traffic or ongoing merges compete with mutations for I/O and CPU. The mutation gets throttled to avoid starving other operations.
  • Broad conditions. A DELETE or UPDATE with a condition that touches most partitions means most parts need rewriting.
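If a stuck mutation is doing more harm than good, it can be cancelled with KILL MUTATION. A sketch, using the mutation_id reported by system.mutations (the database, table, and id below are placeholders):

KILL MUTATION
WHERE database = 'default'
  AND table = 'events'
  AND mutation_id = 'mutation_3.txt';

Note that killing a mutation does not roll it back: parts that were already rewritten keep the change, and only the remaining work is abandoned.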

Alternatives to Heavy Mutations

The best way to deal with mutations is to avoid them where possible. Here are the practical alternatives.

Use TTL for Data Retention

One of the most common mutation patterns we see is a monthly or periodic ALTER TABLE ... DELETE WHERE timestamp < ... to purge old data. This is inefficient - it rewrites the entire table just to remove old rows. Instead, use ClickHouse's built-in TTL mechanism:

ALTER TABLE events MODIFY TTL timestamp + INTERVAL 90 DAY;
ALTER TABLE events MODIFY SETTING ttl_only_drop_parts = 1;

By default, TTL works through special TTL merges that rewrite parts - reading all rows and writing back only the non-expired ones. While this is still more efficient than a full table mutation, the real performance win comes from enabling ttl_only_drop_parts = 1, which tells ClickHouse to drop entire parts where all rows have expired rather than rewriting them row by row. When your TTL expression is aligned with your partition key (e.g., both use monthly granularity), this effectively drops whole partitions worth of data as a lightweight metadata operation - orders of magnitude faster and with negligible system impact. If you are running periodic delete mutations for data retention, switching to TTL with ttl_only_drop_parts should be your first priority.
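For a new table, this alignment between TTL and partition key can be baked in from the start. A sketch, assuming monthly partitions and a hypothetical events_v2 table with illustrative columns:

CREATE TABLE events_v2
(
    timestamp DateTime,
    user_id UInt64,
    payload String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (user_id, timestamp)
TTL timestamp + INTERVAL 90 DAY
SETTINGS ttl_only_drop_parts = 1;

Because every part belongs to exactly one month, a part becomes droppable as soon as all rows in its month are older than 90 days, so expiration never triggers a row-by-row rewrite.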

Insert-Select-Rename

For large data transformations - changing column types, restructuring data, or applying complex updates - consider the insert-select-rename pattern instead of a mutation. The approach is:

  1. Create a new table with the desired schema or structure.
  2. INSERT INTO new_table SELECT ... FROM old_table with the transformations applied in the SELECT.
  3. Rename the old table to a backup name and rename the new table to the original name.

This sounds counterintuitive, but it is often faster, less throttled, and more predictable than running an ALTER TABLE mutation on a live table. The insert-select operation runs as a regular query and does not compete with the merge process the way a mutation does.
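As a concrete sketch of the three steps (table and column names are placeholders; the example assumes the transformation is a column type change):

-- 1. Create a new table and adjust the schema while it is still empty
CREATE TABLE events_new AS events;
ALTER TABLE events_new MODIFY COLUMN user_id UInt64;

-- 2. Copy the data; any transformation happens in the SELECT
INSERT INTO events_new SELECT * FROM events;

-- 3. Swap the tables in a single RENAME
RENAME TABLE events TO events_old, events_new TO events;

On the Atomic database engine, the multi-table RENAME in step 3 swaps both names in one operation, so queries never see a missing table. Keep events_old around until you have verified the new data, then drop it.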

Chunk by Partition

If you must run a mutation, you can reduce its impact by chunking it partition by partition rather than running it against the entire table at once. Using the insert-select approach, you can process one partition at a time:

INSERT INTO new_table
SELECT * FROM old_table
WHERE toYYYYMM(timestamp) = 202601;

This gives you more control over the pace, lets you monitor progress, and allows other operations to proceed between chunks.
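ClickHouse can also scope a mutation itself to a single partition with IN PARTITION, and dropping a whole partition is a near-instant metadata operation. A sketch, assuming the same monthly partitioning:

-- Mutation limited to one partition: only that partition's parts are rewritten
ALTER TABLE events DELETE IN PARTITION 202601 WHERE user_id = 42;

-- Removing an entire partition avoids the rewrite altogether
ALTER TABLE events DROP PARTITION 202601;

When the rows you want to delete line up exactly with a partition boundary, prefer DROP PARTITION: it detaches the parts rather than rewriting them.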

Schedule During Low Traffic

If a mutation is genuinely necessary and the alternatives do not apply, at minimum schedule it during low-traffic periods. Since mutations compete with queries and merges for resources, running them when the cluster is otherwise idle minimizes the blast radius.

Key Takeaways

  • Mutations (ALTER TABLE UPDATE/DELETE) rewrite data parts and can mean a full table rewrite. They are far more expensive than equivalent operations in row-store databases.
  • Monitor running mutations via system.mutations and watch for stuck or long-running operations that are degrading system performance.
  • Replace periodic DELETE mutations with TTL for data retention. Enable ttl_only_drop_parts = 1 and align the TTL expression with your partition key for maximum efficiency - this drops entire parts as a lightweight operation rather than rewriting them.
  • For large transformations, prefer the insert-select-rename pattern over mutations. It is often faster, more predictable, and avoids competing with the merge process.
  • When mutations are unavoidable, chunk them by partition and run them during low-traffic windows.