Keep your ClickHouse replicas in sync with practical monitoring queries. Learn how to track replication lag and queue depth, and how to separate insert and query workloads across replicas.
ClickHouse is a distributed system, and in any replicated setup, keeping replicas in sync is a fundamental operational concern. Replication issues tend to be silent - they do not cause immediate query failures. Instead, they quietly accumulate lag until a failover exposes stale data, or a growing replication queue starts consuming resources. By the time you notice, the problem may be significant.
Monitoring replication health should be part of your baseline operational checks, right alongside resource utilization and part counts. This post covers what to watch, how to query for it, and how to architect your cluster to reduce replication pressure.
How Replication Works in ClickHouse
When you use the ReplicatedMergeTree engine family, ClickHouse maintains replicas of your data across multiple nodes, coordinating through ZooKeeper or ClickHouse Keeper. The mechanics are straightforward: when data is inserted into one replica, the new parts are registered in the coordination layer and the other replicas fetch and apply them locally. Inserts can go to any replica - there is no single primary node.
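For reference, here is a minimal sketch of a replicated table definition - the table name, columns, and the {shard}/{replica} macros are illustrative and assume the usual macro configuration on each server:

CREATE TABLE events
(
    event_time DateTime,
    user_id    UInt64,
    payload    String
)
-- first argument: ZooKeeper/Keeper path, second: this node's replica name
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY (user_id, event_time)

Every node that creates the table with the same coordination path joins the same replica set, and parts inserted on any of them are fetched by the rest.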
Each replica maintains a replication queue - a list of operations (inserts, merges, mutations) that it needs to apply to stay in sync with the others. Under normal conditions, this queue stays short and items are processed quickly. Problems arise when the queue grows faster than a replica can process it, or when certain operations get stuck.
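To see what is actually sitting in a replica's queue, broken down by operation type, you can group system.replication_queue directly - a quick sketch:

SELECT
    database,
    table,
    type,  -- GET_PART for replicated inserts, MERGE_PARTS for merges, MUTATE_PART for mutations
    count() AS entries
FROM system.replication_queue
GROUP BY database, table, type
ORDER BY entries DESC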
ClickHouse supports quorum inserts, where a write is only considered successful once a specified number of replicas have confirmed it. This provides stronger consistency guarantees but adds latency and complexity. Whether you use quorum inserts depends on your consistency requirements.
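As a sketch (the table and values here are illustrative), quorum writes are controlled by the insert_quorum setting, and select_sequential_consistency makes reads in the same session wait for quorum-confirmed data:

-- require acknowledgement from 2 replicas before the insert is considered successful
SET insert_quorum = 2;
-- make SELECTs in this session read only quorum-confirmed data
SET select_sequential_consistency = 1;

INSERT INTO events VALUES (now(), 42, 'example payload');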
What to Monitor
The core replication health check queries system.replicas, which provides per-table replication status for every replicated table on the node:
SELECT
database,
table,
is_leader,
is_readonly,
total_replicas,
active_replicas,
queue_size,
inserts_in_queue,
merges_in_queue,
log_max_index - log_pointer AS replication_lag
FROM system.replicas
ORDER BY (log_max_index - log_pointer) DESC
The key columns to focus on:
queue_size - The total number of operations waiting to be processed. A consistently growing queue means the replica is falling behind. This should be your primary alerting metric for replication health.
inserts_in_queue and merges_in_queue - These break down the queue by operation type. A large number of inserts in the queue means the replica is not keeping up with incoming data. A large number of merges means background operations are stacking up.
replication_lag (computed as log_max_index - log_pointer) - This represents how far behind the replica is in processing the replication log. A value near zero is healthy. A steadily increasing value is a problem.
active_replicas vs total_replicas - If active replicas drops below total replicas, a node is unreachable. This should trigger an immediate alert.
is_readonly - A replica in readonly mode cannot accept inserts or process mutations. This is an error state that needs immediate investigation - it typically indicates a problem with the coordination layer.
Setting Up Alerts
For Prometheus-based monitoring, the most important alerting rules are:
- Replication queue size exceeding a threshold over a sustained period (not just a momentary spike).
- Replication lag (log_max_index - log_pointer) growing over time.
- Active replicas dropping below the expected total.
- Readonly replicas - any replica entering readonly mode should be an immediate alert.
These can be derived from the ClickHouse Prometheus exporter or by querying system tables and exposing the results.
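If you query the system tables yourself and push the results into your monitoring system, a single query can surface all four conditions - the thresholds below (queue size over 100, lag over 1000 log entries) are illustrative and should be tuned to your workload:

SELECT
    database,
    table,
    queue_size,
    log_max_index - log_pointer AS replication_lag,
    total_replicas - active_replicas AS missing_replicas,
    is_readonly
FROM system.replicas
WHERE queue_size > 100
   OR (log_max_index - log_pointer) > 1000
   OR active_replicas < total_replicas
   OR is_readonly

Any row returned here deserves a closer look; sustained results should page someone.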
Diagnosing and Fixing Replication Lag
When you detect replication lag, the diagnosis process follows a predictable pattern.
Check resource utilization first. A replica that is CPU-bound, memory-constrained, or hitting disk I/O limits will naturally fall behind on replication. The fix here is not replication-specific - it is addressing the underlying resource bottleneck.
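One quick way to check whether background work on the node itself is saturated is to look at the background pool metrics - metric names differ somewhat across ClickHouse versions, so treat this as a sketch:

SELECT metric, value
FROM system.metrics
WHERE metric LIKE 'Background%PoolTask'
ORDER BY metric

Values that sit persistently near the configured pool sizes mean the fetch, merge, and mutation threads are fully occupied.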
Look for stuck operations in the queue. A single large mutation or a problematic merge can block the queue and prevent subsequent operations from being processed. Check system.replication_queue for entries that have been retrying for an extended period.
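A sketch of that check - the retry and postpone thresholds are arbitrary cut-offs to flag entries that have been struggling:

SELECT
    database,
    table,
    type,
    create_time,
    num_tries,
    last_exception,
    num_postponed,
    postpone_reason
FROM system.replication_queue
WHERE num_tries > 10
   OR num_postponed > 100
ORDER BY create_time ASC

The last_exception and postpone_reason columns usually tell you why an entry is not making progress.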
Review insert pressure. If the insert rate is very high and inserts are going to all replicas, each replica needs to process both its own inserts and replicate inserts from others. This can create a multiplier effect on resource consumption.
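A rough way to gauge insert pressure is to count how many parts appeared on each table recently - the 10-minute window below is arbitrary:

SELECT
    database,
    table,
    count() AS new_parts
FROM system.parts
WHERE active
  AND modification_time > now() - INTERVAL 10 MINUTE
GROUP BY database, table
ORDER BY new_parts DESC

A large number of small, freshly created parts usually means inserts are too frequent or batches too small; batching more aggressively (or using asynchronous inserts) reduces the number of parts every replica has to fetch and merge.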
Separating Insert and Query Replicas
One architectural technique for managing replication pressure is to designate specific replicas for inserts and others for serving queries. Rather than having every replica handle both workloads, you direct your write traffic to a subset of replicas and your read traffic to others. This prevents heavy insert workloads from degrading query performance and gives query-serving replicas time to catch up on replication.
This separation can be implemented at the load balancer or application level - ClickHouse itself does not enforce a primary/secondary distinction, so you control the routing.
Key Takeaways
- Replication issues are silent until they are not. Monitor system.replicas proactively - do not wait for a failover or stale data report to discover lag.
- queue_size and replication_lag (log_max_index - log_pointer) are your primary health indicators. Alert on sustained growth, not momentary spikes.
- When diagnosing lag, check resource utilization first, then look for stuck operations in the replication queue, then review insert pressure.
- Consider separating insert and query replicas to isolate workloads and reduce replication pressure on the nodes serving user-facing queries.