How ClickHouse cluster definitions, ZooKeeper/Keeper paths, and replication actually fit together - and why getting the distinction right matters for every topology decision you make.
The words "cluster," "shard," and "replica" show up in ClickHouse config files and in ZooKeeper/Keeper paths. They look the same. They mean different things. Confusing one for the other is behind most of the cluster misconfiguration issues we see in production.
This post unpacks what a cluster definition actually does, how replication works independently of it, and where the two must agree.
How a Node Knows Its Cluster Topology
Each ClickHouse node has a config.xml (or a file in config.d/) that declares cluster definitions under <remote_servers>. A node can belong to multiple clusters at once. Each cluster block is its own independent topology declaration.
<remote_servers>
    <my_cluster>
        <shard>
            <replica>
                <host>ch-node-1</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>ch-node-2</host>
                <port>9000</port>
            </replica>
        </shard>
    </my_cluster>
</remote_servers>
Those <host> entries are what surface in system.clusters - the system table reflecting these config declarations at runtime. Each row is one replica within one cluster, as seen by the local node:
SELECT
    cluster,
    shard_num,
    replica_num,
    host_name,
    is_local
FROM system.clusters
ORDER BY cluster, shard_num, replica_num;
is_local tells you whether a row refers to the current node. Run this on every node in your fleet, compare the output, and you can reconstruct exactly which nodes consider themselves part of which cluster/shard/replica position.
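If the nodes can reach each other, you can also gather that fleet-wide view in a single query with the clusterAllReplicas table function - a sketch using the my_cluster name from the example above:
-- Ask every replica in my_cluster for its own view of system.clusters;
-- hostName() is evaluated on each remote node, so each row says who reported it
SELECT
    hostName() AS reporting_node,
    cluster,
    shard_num,
    replica_num,
    host_name,
    is_local
FROM clusterAllReplicas('my_cluster', system.clusters)
ORDER BY reporting_node, cluster, shard_num, replica_num;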
Worth knowing: errors_count increments when ClickHouse fails to connect to a peer. A nonzero value alone does not indicate a problem - transient bumps happen during ZooKeeper/Keeper restarts, brief network blips, or rolling upgrades. What matters is the trend. Monitor errors_count over time alongside system.errors and your ZooKeeper connection metrics (session expirations, outstanding requests). A steadily climbing count points to a real connectivity issue; a one-time bump that stops growing is usually harmless.
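A quick snapshot of both from SQL (the trend itself still has to come from your monitoring system):
-- Per-peer connection error counters as this node sees them
SELECT cluster, shard_num, host_name, errors_count
FROM system.clusters
WHERE errors_count > 0;

-- Server-wide error counters, useful for spotting recurring Keeper or network exceptions
SELECT name, value, last_error_time
FROM system.errors
ORDER BY value DESC
LIMIT 10;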
The Two Layers: Routing vs. Replication
Here is the thing people get wrong most often. The shard_num and replica_num values in system.clusters are labels, not enforcement. ClickHouse never validates that your data is actually sharded or replicated according to these numbers. The cluster definition is the routing layer - it tells ClickHouse where to send queries and writes. Replication is controlled by something else entirely: the ZooKeeper/Keeper path used when creating a ReplicatedMergeTree table.
With the Atomic database engine (the default since ClickHouse 20.10), the recommended approach is to not specify engine parameters at all and rely on the server-level default_replica_path and default_replica_name settings:
<default_replica_path>/clickhouse/tables/{uuid}/{shard}</default_replica_path>
<default_replica_name>{replica}</default_replica_name>
This lets you write:
CREATE TABLE hits (...)
ENGINE = ReplicatedMergeTree
ORDER BY ...
No explicit paths. ClickHouse substitutes the table's UUID into the Keeper path automatically. When you use ON CLUSTER, the initiator generates a single UUID and pushes it to all nodes, so every replica of the same table shares the same UUID-based path without you having to coordinate anything.
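For example, with the my_cluster definition from earlier (the column list is purely illustrative):
-- One DDL statement executed on every host in the cluster definition; the initiator
-- generates the table UUID once, so every replica lands on the same Keeper path
CREATE TABLE hits ON CLUSTER my_cluster
(
    event_time DateTime,
    user_id UInt64
)
ENGINE = ReplicatedMergeTree
ORDER BY (user_id, event_time);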
If you do need to specify paths explicitly, use the {uuid} macro rather than embedding the table name:
CREATE TABLE hits (...)
ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{uuid}/{shard}',
    '{replica}'
)
Why {uuid} over table names? A table name in the path creates a brittle coupling. With ON CLUSTER, ClickHouse generates a consistent UUID across all nodes, so the path stays stable even if the table is renamed. The downside: when adding a new replica to an existing cluster, you need to create the table with the same UUID as the existing replicas. You can retrieve the original DDL, UUID included, from an existing node by enabling the show_table_uuid_in_table_create_query_if_not_nil setting before running SHOW CREATE TABLE.
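For example, using the illustrative hits table from above:
-- On a node that already has the table: include the UUID in the generated DDL
SET show_table_uuid_in_table_create_query_if_not_nil = 1;
SHOW CREATE TABLE hits;
-- The output contains a UUID '...' clause; re-running that CREATE statement on the
-- new node creates the table under the same Keeper path as the existing replicas.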
{shard} and {replica} are macros, substituted from the node's <macros> config block:
<macros>
    <shard>01</shard>
    <replica>ch-node-1</replica>
</macros>
Two nodes with the same {shard} value share the same Keeper path prefix. They replicate to each other - regardless of what shard_num their cluster config declares. Different {shard} values mean different Keeper paths and independent replication groups.
So you can have a cluster config that says nodes A and B belong to the same shard, but if their macros produce different Keeper paths, they will never replicate data to each other. The config says "route here." Keeper says "replicate there." Two independent layers, and both need to be right.
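Both layers are easy to inspect from SQL on any node:
-- What the macros resolve to on this node (the routing-to-replication glue)
SELECT * FROM system.macros;

-- The Keeper path each replicated table actually registered under -
-- the replication layer's ground truth, independent of system.clusters
SELECT database, table, zookeeper_path, replica_name
FROM system.replicas;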
Where the Layers Must Agree
The two layers are independent, but several features break if they disagree.
Internal Replication
With Distributed tables and internal_replication = true (the recommended setting for any ReplicatedMergeTree setup), ClickHouse picks one replica per shard to write to and lets the replication engine handle propagation. For this to work, the cluster config must correctly map which nodes are replicas of the same shard.
<shard>
    <internal_replication>true</internal_replication>
    <replica>
        <host>ch-node-1</host>
        <port>9000</port>
    </replica>
    <replica>
        <host>ch-node-2</host>
        <port>9000</port>
    </replica>
</shard>
When internal_replication = false (the default), the Distributed table writes to all replicas independently. If those replicas are ReplicatedMergeTree tables, you get double replication - the Distributed table writes to both, and each replica also replicates to the other via Keeper. Duplicate data.
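A minimal sketch of the routing layer on top, reusing the hits table and my_cluster from earlier (the default database name is an assumption):
-- Inserts into hits_all go to one replica per shard because
-- internal_replication = true; Keeper-based replication handles the rest
CREATE TABLE hits_all AS hits
ENGINE = Distributed(my_cluster, default, hits, rand());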
ON CLUSTER DDL
ON CLUSTER DDL reads the cluster definition to figure out which hosts need to execute a statement. The initiator creates a task entry in Keeper, and a background DDLWorker on each node picks up tasks addressed to its host. If the hostname in the cluster config does not match what the node reports as its own identity, the task never gets picked up. Schema propagation breaks silently.
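Stuck tasks are visible in system.distributed_ddl_queue (column names vary slightly between ClickHouse versions):
-- Entries that never reach 'Finished' for some hosts usually mean the hostname
-- in the cluster config does not match how those nodes identify themselves
SELECT entry, host, status, query
FROM system.distributed_ddl_queue
WHERE status != 'Finished';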
Parallel Replicas
ClickHouse can query a single shard's replicas in parallel to speed up large scans. The cluster definition must list multiple replicas within a shard (not multiple shards) for this to work. The coordinator splits the data range across replicas using the sampling key. If the cluster config does not correctly declare which nodes are replicas of the same shard, parallel replicas simply will not work.
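With the replicas declared correctly, enabling it is a per-query setting. This is the older, sampling-key based form, and the exact settings differ between ClickHouse versions:
-- Split the scan of each shard across up to two of its replicas;
-- the underlying table needs a sampling key for this mechanism
SELECT count()
FROM hits_all
SETTINGS max_parallel_replicas = 2;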
Common Pitfalls
Cluster definition drift. Nodes with different cluster definitions - different hosts listed, different shard assignments - make ON CLUSTER queries behave unpredictably. Tasks get created for hosts that do not exist, or existing hosts get skipped. Use config management (Ansible, Kubernetes ConfigMaps, or Keeper-based config substitutions) to keep cluster.xml identical everywhere.
Keeper path collisions. When multiple ClickHouse clusters share one Keeper ensemble, the paths for ReplicatedMergeTree tables must be unique per cluster. Two clusters creating tables with the same Keeper path will incorrectly treat each other as replicas of the same table. The cleanest fix is using {uuid} in your default_replica_path (e.g. /clickhouse/tables/{uuid}/{shard}) - UUIDs are globally unique so collisions are impossible. If you are using explicit paths without {uuid}, add a {cluster} macro to disambiguate: /clickhouse/{cluster}/tables/{shard}/{database}/{table}.
Hostname mismatches. The DDLWorker matches tasks by hostname. If the initiator places a task for ch-node-1 but that node identifies itself as localhost or a Docker-internal hostname, the task never executes. Check with SELECT hostName() on each node and confirm it matches the cluster config.
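A quick check, using the example cluster name:
-- 'self' should match the host_name of the row this node marks as local;
-- a mismatch is the classic cause of silently skipped ON CLUSTER tasks
SELECT hostName() AS self, host_name, is_local
FROM system.clusters
WHERE cluster = 'my_cluster';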
Tables going read-only. Replicated tables go read-only when a node loses its Keeper connection. A common cause: running Keeper on the same host as ClickHouse. Under load, resource contention starves Keeper. Monitor system.replicas for is_readonly = 1 and give Keeper dedicated resources.
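The check itself is one query:
-- Replicated tables currently refusing writes, usually after a lost Keeper session
SELECT database, table, is_readonly, zookeeper_exception
FROM system.replicas
WHERE is_readonly = 1;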
Key Takeaways
- <remote_servers> is a routing map. The Keeper path on each ReplicatedMergeTree table is a replication contract. They are independent layers, and they need to stay consistent with each other.
- Set internal_replication = true when using ReplicatedMergeTree with Distributed tables. The default (false) causes duplicate data.
- Prefer default_replica_path with {uuid} over hardcoding table names in Keeper paths. With Atomic databases, you can omit engine parameters entirely and let ClickHouse handle path generation. If adding replicas later, retrieve the original DDL with SHOW CREATE TABLE and the show_table_uuid_in_table_create_query_if_not_nil setting enabled.
- Keep cluster configs identical across all nodes. Config drift is a silent source of topology bugs that are painful to diagnose after the fact.
- Use system.clusters to audit topology. is_local confirms node identity; monitor the trend of errors_count alongside system.errors and Keeper connection metrics rather than reacting to one-off nonzero values.