How ClickHouse cluster definitions, ZooKeeper/Keeper paths, and replication actually fit together - and why getting the distinction right matters for every topology decision you make.
The words "cluster," "shard," and "replica" show up in ClickHouse config files and in ZooKeeper/Keeper paths. They look the same. They mean different things. Confusing one for the other is behind most of the cluster misconfiguration issues we see in production.
This post unpacks what a cluster definition actually does, how replication works independently of it, and where the two must agree.
How a Node Knows Its Cluster Topology
Each ClickHouse node has a config.xml (or a file in config.d/) that declares cluster definitions under <remote_servers>. A node can belong to multiple clusters at once. Each cluster block is its own independent topology declaration.
<remote_servers>
    <my_cluster>
        <shard>
            <replica>
                <host>ch-node-1</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>ch-node-2</host>
                <port>9000</port>
            </replica>
        </shard>
    </my_cluster>
</remote_servers>
Those <host> entries are what surface in system.clusters - the system table reflecting these config declarations at runtime. Each row is one replica within one cluster, as seen by the local node:
SELECT
cluster,
shard_num,
replica_num,
host_name,
is_local
FROM system.clusters
ORDER BY cluster, shard_num, replica_num;
is_local tells you whether a row refers to the current node. Run this on every node in your fleet, compare the output, and you can reconstruct exactly which nodes consider themselves part of which cluster/shard/replica position.
Worth knowing: errors_count increments when ClickHouse fails to connect to a peer. Nonzero values point to misconfigured hosts or network partitions.
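If the nodes can already reach each other, the node-by-node audit can be collapsed into one query with the clusterAllReplicas table function, which runs a query on every replica of a cluster. 'my_cluster' here is a stand-in for your actual cluster name:

```sql
-- Runs against every replica in the cluster; each group of rows is
-- one node's view of the topology. Disagreements between groups
-- indicate config drift.
SELECT
    hostName() AS queried_node,
    cluster,
    shard_num,
    replica_num,
    host_name,
    errors_count
FROM clusterAllReplicas('my_cluster', system.clusters)
ORDER BY queried_node, cluster, shard_num, replica_num;
```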
The Two Layers: Routing vs. Replication
Here is the thing people get wrong most often. The shard_num and replica_num values in system.clusters are labels, not enforcement. ClickHouse never validates that your data is actually sharded or replicated according to these numbers. The cluster definition is the routing layer - it tells ClickHouse where to send queries and writes. Replication is controlled by something else entirely: the ZooKeeper/Keeper path used when creating a ReplicatedMergeTree table.
CREATE TABLE hits (...)
ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{shard}/hits',
    '{replica}'
)
{shard} and {replica} are macros, substituted from the node's <macros> config block:
<macros>
    <shard>01</shard>
    <replica>ch-node-1</replica>
</macros>
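To confirm what a given node will actually substitute, query system.macros, which lists each configured macro and its resolved value:

```sql
-- One row per macro defined on this node.
SELECT macro, substitution FROM system.macros;
```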
Two nodes with the same {shard} value share the same Keeper path prefix. They replicate to each other - regardless of what shard_num their cluster config declares. Different {shard} values mean different Keeper paths and independent replication groups.
So you can have a cluster config that says nodes A and B belong to the same shard, but if their macros produce different Keeper paths, they will never replicate data to each other. The config says "route here." Keeper says "replicate there." Two independent layers, and both need to be right.
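A quick way to see which layer "won" on a live node is system.replicas, which records the fully substituted Keeper path each replicated table registered under. Using the hits table from the example above:

```sql
-- zookeeper_path is the path after macro substitution. Two nodes
-- replicate to each other if and only if these paths match.
SELECT table, zookeeper_path, replica_name
FROM system.replicas
WHERE table = 'hits';
```

Compare the output across nodes: identical zookeeper_path values form one replication group, whatever the cluster config claims.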
Where the Layers Must Agree
The two layers are independent, but several features break if they disagree.
Internal Replication
With Distributed tables and internal_replication = true (the recommended setting for any ReplicatedMergeTree setup), ClickHouse picks one replica per shard to write to and lets the replication engine handle propagation. For this to work, the cluster config must correctly map which nodes are replicas of the same shard.
<shard>
    <internal_replication>true</internal_replication>
    <replica>
        <host>ch-node-1</host>
        <port>9000</port>
    </replica>
    <replica>
        <host>ch-node-2</host>
        <port>9000</port>
    </replica>
</shard>
When internal_replication = false (the default), the Distributed table writes to all replicas independently. If those replicas are ReplicatedMergeTree tables, you get double replication - the Distributed table writes to both, and each replica also replicates to the other via Keeper. Duplicate data.
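For completeness, a Distributed table over the example cluster might look like this (hits_dist and my_cluster are the hypothetical names used earlier). With internal_replication = true in the cluster config, inserts through hits_dist go to one replica per shard and Keeper-based replication handles the rest:

```sql
-- Distributed(cluster, database, table, sharding_key)
CREATE TABLE hits_dist AS hits
ENGINE = Distributed('my_cluster', currentDatabase(), 'hits', rand());
```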
ON CLUSTER DDL
ON CLUSTER DDL reads the cluster definition to figure out which hosts need to execute a statement. The initiator creates a task entry in Keeper, and a background DDLWorker on each node picks up tasks addressed to its host. If the hostname in the cluster config does not match what the node reports as its own identity, the task never gets picked up. Schema propagation breaks silently.
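When a statement seems to hang or only half-apply, the task queue itself can be inspected via system.distributed_ddl_queue. A sketch (column names may vary slightly across ClickHouse versions):

```sql
-- One row per (task, host). A task stuck without a matching row for
-- a given host often means a hostname mismatch.
SELECT entry, host, status
FROM system.distributed_ddl_queue
ORDER BY entry DESC
LIMIT 20;
```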
Parallel Replicas
ClickHouse can query a single shard's replicas in parallel to speed up large scans. The cluster definition must list multiple replicas within a shard (not multiple shards) for this to work. The coordinator splits the data range across replicas using the sampling key. If the cluster config does not correctly declare which nodes are replicas of the same shard, parallel replicas simply will not work.
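Enabling the feature is settings-driven. The setting names below apply to recent ClickHouse versions, and the cluster name is again a stand-in:

```sql
-- Read the same shard from up to 3 replicas in parallel.
SET allow_experimental_parallel_reading_from_replicas = 1;
SET max_parallel_replicas = 3;
SET cluster_for_parallel_replicas = 'my_cluster';

SELECT count() FROM hits;
```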
Common Pitfalls
Cluster definition drift. Nodes with different cluster definitions - different hosts listed, different shard assignments - make ON CLUSTER queries behave unpredictably. Tasks get created for hosts that do not exist, or existing hosts get skipped. Use config management (Ansible, Kubernetes ConfigMaps, or Keeper-based config substitutions) to keep cluster.xml identical everywhere.
Keeper path collisions. When multiple ClickHouse clusters share one Keeper ensemble, the paths for ReplicatedMergeTree tables must be unique per cluster. Two clusters creating tables with the same Keeper path will incorrectly treat each other as replicas of the same table. Add a {cluster} macro to your paths: /clickhouse/{cluster}/tables/{shard}/{database}/{table}.
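Concretely, {cluster} is just another entry in each node's <macros> block; the value 'prod' below is illustrative:

```xml
<macros>
    <cluster>prod</cluster>
    <shard>01</shard>
    <replica>ch-node-1</replica>
</macros>
```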
Hostname mismatches. The DDLWorker matches tasks by hostname. If the initiator places a task for ch-node-1 but that node identifies itself as localhost or a Docker-internal hostname, the task never executes. Check with SELECT hostName() on each node and confirm it matches the cluster config.
Tables going read-only. Replicated tables go read-only when a node loses its Keeper connection. A common cause: running Keeper on the same host as ClickHouse. Under load, resource contention starves Keeper. Monitor system.replicas for is_readonly = 1 and give Keeper dedicated resources.
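A minimal monitoring query for this, using system.replicas:

```sql
-- Any rows returned mean a replicated table has dropped to read-only,
-- usually because the Keeper session was lost.
SELECT database, table, is_readonly, zookeeper_exception
FROM system.replicas
WHERE is_readonly = 1;
```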
Key Takeaways
- <remote_servers> is a routing map. The Keeper path on each ReplicatedMergeTree table is a replication contract. They are independent layers, and they need to stay consistent with each other.
- Set internal_replication = true when using ReplicatedMergeTree with Distributed tables. The default (false) causes duplicate data.
- Use macros ({shard}, {replica}, {cluster}) in your ReplicatedMergeTree paths so one CREATE TABLE statement works correctly across all nodes via ON CLUSTER.
- Keep cluster configs identical across all nodes. Config drift is a silent source of topology bugs that are painful to diagnose after the fact.
- Use system.clusters to audit topology. is_local and errors_count are your primary tools for validating that the routing layer matches reality.