ClickHouse incremental materialized views only ever see one insert block, so a gaps-and-islands window function looks impossible. Here is how to seed the window with stored state and run it inside the MV - with verified SQL, three silent traps, and the sharding constraint that actually limits it.
Most analytics teams have written some version of this query: given a stream of events, each carrying the current state of an entity, collapse consecutive events with the same state into a single segment - one row per contiguous stay, with a start time, an end time, and where the entity went next.
This is the classic gaps-and-islands problem, and in a batch SQL engine you reach for a window function. The instinct is that window functions have no place in a real-time ClickHouse materialized view, because an incremental MV only ever sees one freshly-inserted block, never the full per-entity history a window needs.
That instinct is almost right. The trick this post builds up to is that you can run the window function inside the MV, if you scope it to a single insert block and seed it with one row of stored state. The result reproduces the batch query incrementally:
- a current-state table (one row per entity),
- a closed-intervals table (the completed segments),
- two materialized views that maintain them, and
- a union view that stitches the closed segments to the still-open one.
Every SQL block below was executed and verified on ClickHouse 24.7. We will also hit three traps that don't raise errors - a silent column collision, a firing-order myth, and a hard sharding constraint - and show what actually happens.
The problem
Say we ingest an event stream. Each event tells us the latest state_id of an entity (think: the warehouse a pallet is in, the region a vehicle is driving through, the status of an order - pick your domain):
events(tenant_id, entity_id, state_id, event_time)
Events arrive frequently - periodic heartbeats and change events - so the raw stream is noisy and repetitive:
tenant entity state event_time
acme e-1 A 09:00
acme e-1 A 09:05 <- still in A, heartbeat
acme e-1 A 09:10 <- still in A, heartbeat
acme e-1 B 09:12 <- moved to B
acme e-1 B 09:30 <- still in B
acme e-1 C 09:41 <- moved to C
We want one row per stay ("segment" / "island"):
tenant entity state next_state start_time end_time
acme e-1 A B 09:00 09:12
acme e-1 B C 09:12 09:41
acme e-1 C (open) 09:41 now()
The last segment is still open - the entity is in C and has not left yet, so its end_time is "now".
The batch answer, and why it does not fit an MV
In a batch warehouse you would detect state changes with lagInFrame, mark the start of each island, and aggregate:
-- Batch gaps-and-islands. Reads the entire history every run.
SELECT
tenant_id, entity_id, state_id,
min(event_time) AS start_time,
max(event_time) AS end_time
FROM
(
SELECT *,
-- a new island starts whenever state differs from the previous event
sum(is_new_island) OVER (PARTITION BY tenant_id, entity_id ORDER BY event_time) AS island_id
FROM
(
SELECT *,
state_id != lagInFrame(state_id) OVER (PARTITION BY tenant_id, entity_id ORDER BY event_time) AS is_new_island
FROM events
)
)
GROUP BY tenant_id, entity_id, state_id, island_id;
Correct and readable. It is also a full historical recompute: every time you want fresh segments you re-scan the entire event table. For a live dashboard over a high-volume stream, that is exactly the "full aggregation on every query" anti-pattern that incremental materialized views exist to avoid.
A ClickHouse incremental materialized view is a trigger on
INSERT. ItsSELECTruns over the rows in the inserted block only, never the whole table. A window partitioned byentity_idneeds every prior event for that entity to find the previous island boundary, and those rows are not in the incoming block.
So instead of looking back through all history, we keep a tiny bit of state and push it forward as each event lands - then feed that one stored row back into the window as a seed.
The pattern
entity_state_current- aReplacingMergeTreeholding exactly one row per entity: its current (open) segment.entity_state_segments- aMergeTreeof closed intervals.- Two materialized views on
events: a transition detector that appends closed intervals, and a state updater that maintains the current row. entity_state_view- aUNION ALLview exposing closed intervals plus the open one as a single logical table.
The current-state table
-- Current state: latest segment per entity. One row per (tenant, entity).
-- ReplacingMergeTree(last_event_ts) keeps the row with the highest
-- last_event_ts, so duplicate/out-of-order inserts collapse to the newest.
CREATE TABLE entity_state_current
(
tenant_id String,
entity_id String,
state_id LowCardinality(String),
start_time DateTime64(3, 'UTC'), -- when the current segment began
last_event_ts DateTime64(3, 'UTC') -- most recent event seen for entity
)
ENGINE = ReplacingMergeTree(last_event_ts)
ORDER BY (tenant_id, entity_id);
Two best-practice choices:
LowCardinality(String)forstate_id- dictionary-encode string columns with under ~10K distinct values.- The sort key
(tenant_id, entity_id)is also the dedup key.ReplacingMergeTreededuplicates on the fullORDER BYtuple, giving exactly "one row per entity within a tenant", with the lower-cardinalitytenant_idleading.
The closed-intervals table
-- Closed intervals: one row per completed stay.
CREATE TABLE entity_state_segments
(
tenant_id String,
entity_id String,
state_id LowCardinality(String), -- the state being left
next_state_id LowCardinality(String), -- the state moved into
start_time DateTime64(3, 'UTC'),
end_time DateTime64(3, 'UTC'),
last_event_ts DateTime64(3, 'UTC')
)
ENGINE = MergeTree
ORDER BY (tenant_id, entity_id, start_time);
The sort key matches how segments are queried - "the timeline for this entity, in order" - so filtering on the (tenant_id, entity_id) prefix skips granules.
First cut: a simple state-diff MV (and why it is not enough)
The obvious transition detector joins each incoming event to the stored state and emits a row when they differ:
-- FIRST CUT - correct only when each event arrives in its own insert.
CREATE MATERIALIZED VIEW entity_state_transition_mv TO entity_state_segments AS
SELECT
e.tenant_id,
s.prev_state_id AS state_id, -- state being left
e.state_id AS next_state_id,
s.prev_start AS start_time,
e.event_time AS end_time,
e.entity_id,
s.prev_last AS last_event_ts
FROM events AS e
INNER JOIN
(
SELECT tenant_id, entity_id,
state_id AS prev_state_id,
start_time AS prev_start,
last_event_ts AS prev_last
FROM entity_state_current FINAL
) AS s
ON s.tenant_id = e.tenant_id AND s.entity_id = e.entity_id
WHERE s.prev_state_id != e.state_id;
Trap #1: the silent column collision
The first version of that query selected s.state_id directly - but both events and entity_state_current expose a column literally named state_id:
-- Subtly broken: e and s both have a `state_id`
SELECT e.tenant_id, s.state_id, e.state_id AS next_state_id, s.start_time, ...
FROM events AS e
INNER JOIN (SELECT * FROM entity_state_current FINAL) AS s
ON s.tenant_id = e.tenant_id AND s.entity_id = e.entity_id
WHERE s.state_id != e.state_id;
This ran without error and produced rows - but every state_id (the state being left) came out empty, even though s.start_time from the same joined row was correct:
leaving moved_to start_time end_time
[] [B] 2026-01-01 09:00:00 2026-01-01 09:12:00 <- should be [A]
[] [C] 2026-01-01 09:12:00 2026-01-01 09:41:00 <- should be [B]
Inside the MV's join the ambiguous s.state_id silently resolved to the default (empty string). The fix is to alias every colliding column in the joined subquery (state_id AS prev_state_id) so the two sides share no names, which is why the query above reads s.prev_state_id, s.prev_start, s.prev_last. Never SELECT * from the joined side of an MV, and never assume a bad column resolution will throw. Here it just wrote quietly wrong data.
Trap #2: transitions hide inside a single insert block
Even with the aliases fixed, the first-cut MV only compares the stored state to the incoming event. If a single INSERT carries several events for the same entity - common with batched ingestion - the join sees only the pre-insert state, so intra-block changes are invisible. Insert all six sample rows at once and you get:
-- closed segments: (none)
-- current: C, start_time 09:41
Zero segments. The A->B and B->C transitions happened within the block and were never detected; the state jumped straight to C. The result is correct only if you can guarantee one event per insert, a guarantee you usually cannot make.
Moving the window function into the MV
Here is the resolution. The MV's SELECT does see the whole inserted block at once, so a window/gaps-and-islands computation over the block is perfectly legal. The only thing missing is the boundary with the previous block, and that is exactly one row: the stored current state. So we seed the block with it.
Per entity in the block we build an ordered list of points:
[ stored state ] ++ [ sorted block events ]
(the seed) (this insert)
then walk it as a gaps-and-islands: every point whose state differs from the previous one closes the prior run (emit a segment) and starts a new one. The final, unclosed run is the new current state. Because the seed carries the true run-start from earlier blocks, a run that began three inserts ago still gets the right start_time.
ClickHouse historically rejects
UNIONinside a materialized view, and where newer builds accept it the behavior has been unreliable. On 24.7 the dependable move is to avoid it entirely: collect the block withgroupArray, prepend the seed witharrayConcat, and do the island walk with array functions.
The transition detector (windowed)
CREATE MATERIALIZED VIEW entity_state_transition_mv TO entity_state_segments AS
WITH
-- One row per entity in this block: the stored "seed" state (if any) plus
-- the block's events sorted by time.
block AS
(
SELECT
e.tenant_id AS tenant_id,
e.entity_id AS entity_id,
max(e.event_time) AS last_event_ts,
any(s.prev_state_id) AS seed_state,
any(s.prev_start) AS seed_start,
max(s.has_seed) AS has_seed,
arraySort(x -> x.1, groupArray((e.event_time, e.state_id))) AS evs
FROM events AS e
LEFT JOIN
(
SELECT tenant_id, entity_id,
state_id AS prev_state_id,
start_time AS prev_start,
toUInt8(1) AS has_seed
FROM entity_state_current FINAL
) AS s ON s.tenant_id = e.tenant_id AND s.entity_id = e.entity_id
GROUP BY e.tenant_id, e.entity_id
),
-- Gaps-and-islands over (seed ++ block):
-- boundary[i] = 1 where the state differs from the previous point
-- rs[i] = the run-start time, carried forward from the last boundary
walked AS
(
SELECT
tenant_id, entity_id, last_event_ts,
if(has_seed = 1, arrayConcat([(seed_start, seed_state)], evs), evs) AS pts,
arrayMap(x -> x.2, pts) AS st,
arrayMap(x -> x.1, pts) AS ts,
length(pts) AS n,
arrayMap(i -> if(i = 1, 0, st[i] != st[i - 1]), range(1, n + 1)) AS boundary,
arrayMap(p -> p.2,
arrayFill(p -> p.1 = 1,
arrayMap(i -> (if(i = 1, 1, st[i] != st[i - 1]), ts[i]), range(1, n + 1)))) AS rs
FROM block
)
SELECT
tenant_id, entity_id,
st[idx - 1] AS state_id, -- run being closed
st[idx] AS next_state_id, -- run it transitions into
rs[idx - 1] AS start_time, -- when the closed run began (may predate this block)
ts[idx] AS end_time, -- when the change happened
last_event_ts
FROM walked
ARRAY JOIN range(2, n + 1) AS idx -- one candidate per adjacent pair of points
WHERE boundary[idx] = 1; -- ...kept only where the state actually changed
The run-start trick is arrayFill: it forward-fills over positions where the predicate is false, keeping the values where it is true. Mark each boundary point with its own timestamp and leave non-boundaries to inherit the most recent boundary, and rs[i] becomes "the time the current run started" at every index.
The state updater (also block-aware)
The updater must be block-aware too. A naive multiIf updater that writes one row per event lets ReplacingMergeTree keep whichever has the highest last_event_ts - which, for a block like [A@09:00, A@09:05], picks 09:05 and loses the true run start of 09:00. So the updater runs the same island walk and emits the final, still-open run per entity:
CREATE MATERIALIZED VIEW entity_state_update_mv TO entity_state_current AS
WITH
block AS
(
SELECT
e.tenant_id AS tenant_id,
e.entity_id AS entity_id,
max(e.event_time) AS last_event_ts,
any(s.prev_state_id) AS seed_state,
any(s.prev_start) AS seed_start,
max(s.has_seed) AS has_seed,
arraySort(x -> x.1, groupArray((e.event_time, e.state_id))) AS evs
FROM events AS e
LEFT JOIN
(
SELECT tenant_id, entity_id,
state_id AS prev_state_id,
start_time AS prev_start,
toUInt8(1) AS has_seed
FROM entity_state_current FINAL
) AS s ON s.tenant_id = e.tenant_id AND s.entity_id = e.entity_id
GROUP BY e.tenant_id, e.entity_id
),
walked AS
(
SELECT
tenant_id, entity_id, last_event_ts,
if(has_seed = 1, arrayConcat([(seed_start, seed_state)], evs), evs) AS pts,
arrayMap(x -> x.2, pts) AS st,
arrayMap(x -> x.1, pts) AS ts,
length(pts) AS n,
arrayMap(p -> p.2,
arrayFill(p -> p.1 = 1,
arrayMap(i -> (if(i = 1, 1, st[i] != st[i - 1]), ts[i]), range(1, n + 1)))) AS rs
FROM block
)
SELECT
tenant_id, entity_id,
st[n] AS state_id, -- final, still-open run
rs[n] AS start_time, -- when it began (carried from the seed if unchanged)
last_event_ts
FROM walked;
Verified across batching shapes
The whole point of seeding the window is that the result no longer depends on how events are grouped into inserts. All three of these produce the identical, correct timeline:
CASE 1 all six events in ONE insert -> A->B 09:00-09:12, B->C 09:12-09:41
CASE 2 one event per insert -> A->B 09:00-09:12, B->C 09:12-09:41
CASE 3 split [A,A] then [B,C] -> A->B 09:00-09:12, B->C 09:12-09:41
Note CASE 3: A->B still starts at 09:00. The run-start survived across the block boundary because it rode in on the seed.
The union view
-- Closed intervals + the currently-open stay, as one logical table.
CREATE OR REPLACE VIEW entity_state_view AS
SELECT tenant_id, entity_id, state_id, next_state_id, start_time, end_time, last_event_ts
FROM entity_state_segments
UNION ALL
SELECT
tenant_id, entity_id, state_id,
'' AS next_state_id, -- open interval: nowhere "next" yet
start_time,
now64(3, 'UTC') AS end_time, -- still open, so it ends "now"
last_event_ts
FROM entity_state_current FINAL;
Querying the timeline is now a sort-key-aligned read - no window, no full recompute:
SELECT state_id, next_state_id, start_time, end_time
FROM entity_state_view
WHERE tenant_id = 'acme' AND entity_id = 'e-1'
ORDER BY start_time;
A B 2026-01-01 09:00:00 2026-01-01 09:12:00
B C 2026-01-01 09:12:00 2026-01-01 09:41:00
C (empty) 2026-01-01 09:41:00 2026-06-18 12:07:55 <- open, end = now()
Order-key pushdown in the seed join
The expensive part of each insert is the LEFT JOIN to entity_state_current inside the block CTE. Its cost hinges on one rule: the join keys must be a left-anchored prefix of the right table's ORDER BY.
entity_state_current is sorted by (tenant_id, entity_id), and the join is on exactly s.tenant_id = e.tenant_id AND s.entity_id = e.entity_id - the whole sort key, in order. That lets ClickHouse serve the lookup from the sorted, primary-indexed data instead of hash-building the entire state table on every insert. Because the inputs are already sorted on the join key, you can pick the merge join explicitly:
SET join_algorithm = 'full_sorting_merge';
full_sorting_merge skips the sort step when inputs are already ordered on the join key. The default hash join would instead load the right side into memory in full - on a large state table inside a per-insert trigger, that is the difference between a cheap ordered probe and an OOM risk.
Three things combine to keep the join cheap:
- Join keys
(tenant_id, entity_id)are the sort-key prefix, so granules are skipped. - The small inserted block is the driving side; the subquery selects only the columns the seed needs.
- Sorted-on-key inputs make
full_sorting_mergea low-memory, no-extra-sort choice.
Anti-pattern: if the state table were sorted (entity_id, tenant_id) while you led the join with tenant_id, or you wrapped a key in a function (ON lower(s.entity_id) = lower(e.entity_id)), the prefix no longer matches and ClickHouse scans/hashes the whole table on every insert. Keep the join keys raw and in sort-key order.
Tip - prune the seed to the block. The
GROUP BY (tenant_id, entity_id)means the join only needs state for entities present in this block. If your stream is multi-tenant and a block is single-tenant, also addWHERE tenant_id = ...inside the joined subquery to prune the right side before the join.
Myth-buster: firing order does not matter here
A natural worry: both MVs read entity_state_current and one of them writes it, so surely they must fire in a specific order, and parallel_view_processing must be off? Tested, and no. I ran the pipeline three ways:
- transition MV created first, then updater (the "intended" order),
- updater created first (named so it sorts and is created first), then transition,
SET parallel_view_processing = 1.
All three produced identical, correct output. The reason: within a single INSERT, each MV's join reads entity_state_current as it was committed before that insert. A sibling MV's write during the same insert is not visible to the other MV. Both views observe the same pre-insert snapshot, so their relative order cannot change the result.
This is a feature of the design, not luck: the transition detector and the updater both derive everything from (seed ++ block), where the seed is the pre-insert state. Neither depends on the other's output. Do not add a parallel_view_processing = 0 cargo-cult setting for this pattern. It is not needed.
The real limiter: one shard and replica per entity
Here is the constraint that genuinely matters. The seed read (entity_state_current FINAL) is local to the node processing the insert. It sees only the state that node has committed. For the seed to be the entity's true current state, every event for a given entity must be ingested on the same shard and the same replica.
- Across shards: if an entity's events are hash-routed to different shards, each shard holds only a fragment of that entity's state. The seed is wrong, and segments fracture. You must shard so an entity co-locates with its state - for example, distribute by
cityHash64(tenant_id, entity_id)and give all three tables (events,entity_state_current,entity_state_segments) the same sharding key. - Across replicas:
ReplicatedMergeTreeconverges replicas eventually. The seed read happens at insert time, before replication catches up - so if consecutive events for an entity land on different replicas of the same shard, a freshly written state row may not be visible yet, and you can momentarily re-seed from stale state. Pin an entity's events to a single replica (sticky/consistent routing onentity_id); do not rely on "any replica of the shard."
In short: this is a local, single-writer-per-entity pattern. It scales horizontally by partitioning entities across nodes, not by spreading one entity's events around. If your ingestion cannot guarantee sticky per-entity routing, prefer the refreshable-MV approach below, which recomputes from the full table and is immune to where each event landed.
Operational gotchas
FINALin the seed subquery is intentional, and is notOPTIMIZE ... FINAL. Readingentity_state_current FINALgives the deduplicated current row. TheFINALmodifier in a SELECT is the right way to dedupReplacingMergeTreereads; never scheduleOPTIMIZE TABLE ... FINALto "pre-deduplicate", let background merges run.- Stale open intervals. An entity that stops emitting keeps an open interval whose
end_timeisnow()forever. Treatlast_event_tsas a freshness signal (for example, ignore open intervals wherenow() - last_event_ts > timeout). - At-least-once delivery. A duplicated event re-enters the same run, so the island walk produces no spurious boundary, and
ReplacingMergeTree(last_event_ts)collapses duplicate state rows. - Out-of-order within a block is handled - the block is
arraySorted by time before the walk. Out-of-order across blocks (a late event after its successor already advanced the state) is not; clamp or route by event time if that happens. - Backfill is separate. Incremental MVs only process data inserted after they exist. To seed history, run the batch window-function query once into the target tables, then let the MVs take over.
When to reach for a refreshable MV instead
This trigger-based design shines when you need fresh segments at insert time, the per-event bookkeeping is cheap, and you control routing so each entity is sticky to one node. If you can tolerate minutes of staleness, or you cannot guarantee per-entity routing, a refreshable materialized view that periodically re-runs the batch window query into a MergeTree target is simpler and routing-agnostic, at the cost of recomputing each refresh.
| Incremental (this pattern) | Refreshable MV | |
|---|---|---|
| Freshness | At insert time | Every N minutes |
| Cost model | Tiny work per event | Full recompute per refresh |
| Routing | Must be sticky per entity | Indifferent |
| Best for | High-volume live segments, sticky ingest | Complex joins, tolerable lag |
Takeaways
- A batch "segment / gaps-and-islands" query assumes it can see an entity's whole history. A naive MV cannot, and a simple state-diff MV silently misses transitions that share an insert block.
- You can run the window inside the MV by scoping it to one block and seeding it with the stored current state, then walking
(seed ++ block)as a gaps-and-islands with array functions (avoidUNIONinside the MV). Make the updater block-aware too, orReplacingMergeTreewill keep the wrong run-start. - Alias colliding columns across the join - an ambiguous
s.state_idreads as empty without erroring. - Align the seed join's keys with the state table's sort key for an ordered-key lookup instead of a full-table hash build.
- Firing order is irrelevant here (both MVs see the pre-insert snapshot); the binding constraint is that every event for an entity must hit the same shard and replica. This is a local, single-writer-per-entity pattern - scale it by partitioning entities across nodes, not by spreading one entity around.
If you are running ClickHouse in production and want a second set of eyes on a streaming materialized-view design like this - sharding, routing, and all - BigData Boutique does exactly this kind of work.