Gaps and Islands in ClickHouse: Moving a Window Function Into Materialized Views

ClickHouse incremental materialized views only ever see one insert block, so a gaps-and-islands window function looks impossible. Here is how to seed the window with stored state and run it inside the MV - with verified SQL, three silent traps, and the sharding constraint that actually limits it.

Most analytics teams have written some version of this query: given a stream of events, each carrying the current state of an entity, collapse consecutive events with the same state into a single segment - one row per contiguous stay, with a start time, an end time, and where the entity went next.

This is the classic gaps-and-islands problem, and in a batch SQL engine you reach for a window function. The instinct is that window functions have no place in a real-time ClickHouse materialized view, because an incremental MV only ever sees one freshly-inserted block, never the full per-entity history a window needs.

That instinct is almost right. The trick this post builds up to is that you can run the window function inside the MV, if you scope it to a single insert block and seed it with one row of stored state. The result reproduces the batch query incrementally:

a current-state table (one row per entity),
a closed-intervals table (the completed segments),
two materialized views that maintain them, and
a union view that stitches the closed segments to the still-open one.

Every SQL block below was executed and verified on ClickHouse 24.7. We will also hit three traps that don't raise errors - a silent column collision, a firing-order myth, and a hard sharding constraint - and show what actually happens.

The problem

Say we ingest an event stream. Each event tells us the latest state_id of an entity (think: the warehouse a pallet is in, the region a vehicle is driving through, the status of an order - pick your domain):

events(tenant_id, entity_id, state_id, event_time)

Events arrive frequently - periodic heartbeats and change events - so the raw stream is noisy and repetitive:

tenant  entity  state   event_time
  acme    e-1     A       09:00
  acme    e-1     A       09:05   <- still in A, heartbeat
  acme    e-1     A       09:10   <- still in A, heartbeat
  acme    e-1     B       09:12   <- moved to B
  acme    e-1     B       09:30   <- still in B
  acme    e-1     C       09:41   <- moved to C

We want one row per stay ("segment" / "island"):

tenant  entity  state   next_state  start_time  end_time
  acme    e-1     A       B           09:00       09:12
  acme    e-1     B       C           09:12       09:41
  acme    e-1     C       (open)      09:41       now()

The last segment is still open - the entity is in C and has not left yet, so its end_time is "now".

The batch answer, and why it does not fit an MV

In a batch warehouse you would detect state changes with lagInFrame, mark the start of each island, and aggregate:

-- Batch gaps-and-islands. Reads the entire history every run.
  SELECT
      tenant_id, entity_id, state_id,
      min(event_time) AS start_time,
      max(event_time) AS end_time
  FROM
  (
      SELECT *,
          -- a new island starts whenever state differs from the previous event
          sum(is_new_island) OVER (PARTITION BY tenant_id, entity_id ORDER BY event_time) AS island_id
      FROM
      (
          SELECT *,
              state_id != lagInFrame(state_id) OVER (PARTITION BY tenant_id, entity_id ORDER BY event_time) AS is_new_island
          FROM events
      )
  )
  GROUP BY tenant_id, entity_id, state_id, island_id;

Correct and readable. It is also a full historical recompute: every time you want fresh segments you re-scan the entire event table. For a live dashboard over a high-volume stream, that is exactly the "full aggregation on every query" anti-pattern that incremental materialized views exist to avoid.

A ClickHouse incremental materialized view is a trigger on INSERT. Its SELECT runs over the rows in the inserted block only, never the whole table. A window partitioned by entity_id needs every prior event for that entity to find the previous island boundary, and those rows are not in the incoming block.

So instead of looking back through all history, we keep a tiny bit of state and push it forward as each event lands - then feed that one stored row back into the window as a seed.

The pattern

entity_state_current - a ReplacingMergeTree holding exactly one row per entity: its current (open) segment.
entity_state_segments - a MergeTree of closed intervals.
Two materialized views on events: a transition detector that appends closed intervals, and a state updater that maintains the current row.
entity_state_view - a UNION ALL view exposing closed intervals plus the open one as a single logical table.

The current-state table

-- Current state: latest segment per entity. One row per (tenant, entity).
  -- ReplacingMergeTree(last_event_ts) keeps the row with the highest
  -- last_event_ts, so duplicate/out-of-order inserts collapse to the newest.
  CREATE TABLE entity_state_current
  (
      tenant_id     String,
      entity_id     String,
      state_id      LowCardinality(String),
      start_time    DateTime64(3, 'UTC'),   -- when the current segment began
      last_event_ts DateTime64(3, 'UTC')    -- most recent event seen for entity
  )
  ENGINE = ReplacingMergeTree(last_event_ts)
  ORDER BY (tenant_id, entity_id);

Two best-practice choices:

LowCardinality(String) for state_id - dictionary-encode string columns with under ~10K distinct values.
The sort key (tenant_id, entity_id) is also the dedup key. ReplacingMergeTree deduplicates on the full ORDER BY tuple, giving exactly "one row per entity within a tenant", with the lower-cardinality tenant_id leading.

The closed-intervals table

-- Closed intervals: one row per completed stay.
  CREATE TABLE entity_state_segments
  (
      tenant_id     String,
      entity_id     String,
      state_id      LowCardinality(String),  -- the state being left
      next_state_id LowCardinality(String),  -- the state moved into
      start_time    DateTime64(3, 'UTC'),
      end_time      DateTime64(3, 'UTC'),
      last_event_ts DateTime64(3, 'UTC')
  )
  ENGINE = MergeTree
  ORDER BY (tenant_id, entity_id, start_time);

The sort key matches how segments are queried - "the timeline for this entity, in order" - so filtering on the (tenant_id, entity_id) prefix skips granules.

First cut: a simple state-diff MV (and why it is not enough)

The obvious transition detector joins each incoming event to the stored state and emits a row when they differ:

-- FIRST CUT - correct only when each event arrives in its own insert.
  CREATE MATERIALIZED VIEW entity_state_transition_mv TO entity_state_segments AS
  SELECT
      e.tenant_id,
      s.prev_state_id AS state_id,        -- state being left
      e.state_id      AS next_state_id,
      s.prev_start    AS start_time,
      e.event_time    AS end_time,
      e.entity_id,
      s.prev_last     AS last_event_ts
  FROM events AS e
  INNER JOIN
  (
      SELECT tenant_id, entity_id,
             state_id      AS prev_state_id,
             start_time    AS prev_start,
             last_event_ts AS prev_last
      FROM entity_state_current FINAL
  ) AS s
      ON s.tenant_id = e.tenant_id AND s.entity_id = e.entity_id
  WHERE s.prev_state_id != e.state_id;

Trap #1: the silent column collision

The first version of that query selected s.state_id directly - but both events and entity_state_current expose a column literally named state_id:

-- Subtly broken: e and s both have a `state_id`
  SELECT e.tenant_id, s.state_id, e.state_id AS next_state_id, s.start_time, ...
  FROM events AS e
  INNER JOIN (SELECT * FROM entity_state_current FINAL) AS s
      ON s.tenant_id = e.tenant_id AND s.entity_id = e.entity_id
  WHERE s.state_id != e.state_id;

This ran without error and produced rows - but every state_id (the state being left) came out empty, even though s.start_time from the same joined row was correct:

leaving  moved_to  start_time           end_time
  []       [B]       2026-01-01 09:00:00  2026-01-01 09:12:00   <- should be [A]
  []       [C]       2026-01-01 09:12:00  2026-01-01 09:41:00   <- should be [B]

Inside the MV's join the ambiguous s.state_id silently resolved to the default (empty string). The fix is to alias every colliding column in the joined subquery (state_id AS prev_state_id) so the two sides share no names, which is why the query above reads s.prev_state_id, s.prev_start, s.prev_last. Never SELECT * from the joined side of an MV, and never assume a bad column resolution will throw. Here it just wrote quietly wrong data.

Trap #2: transitions hide inside a single insert block

Even with the aliases fixed, the first-cut MV only compares the stored state to the incoming event. If a single INSERT carries several events for the same entity - common with batched ingestion - the join sees only the pre-insert state, so intra-block changes are invisible. Insert all six sample rows at once and you get:

-- closed segments: (none)
  -- current: C, start_time 09:41

Zero segments. The A->B and B->C transitions happened within the block and were never detected; the state jumped straight to C. The result is correct only if you can guarantee one event per insert, a guarantee you usually cannot make.

Moving the window function into the MV

Here is the resolution. The MV's SELECT does see the whole inserted block at once, so a window/gaps-and-islands computation over the block is perfectly legal. The only thing missing is the boundary with the previous block, and that is exactly one row: the stored current state. So we seed the block with it.

Per entity in the block we build an ordered list of points:

[ stored state ]  ++  [ sorted block events ]
     (the seed)            (this insert)

then walk it as a gaps-and-islands: every point whose state differs from the previous one closes the prior run (emit a segment) and starts a new one. The final, unclosed run is the new current state. Because the seed carries the true run-start from earlier blocks, a run that began three inserts ago still gets the right start_time.

ClickHouse historically rejects UNION inside a materialized view, and where newer builds accept it the behavior has been unreliable. On 24.7 the dependable move is to avoid it entirely: collect the block with groupArray, prepend the seed with arrayConcat, and do the island walk with array functions.

The transition detector (windowed)

CREATE MATERIALIZED VIEW entity_state_transition_mv TO entity_state_segments AS
  WITH
      -- One row per entity in this block: the stored "seed" state (if any) plus
      -- the block's events sorted by time.
      block AS
      (
          SELECT
              e.tenant_id AS tenant_id,
              e.entity_id AS entity_id,
              max(e.event_time)    AS last_event_ts,
              any(s.prev_state_id) AS seed_state,
              any(s.prev_start)    AS seed_start,
              max(s.has_seed)      AS has_seed,
              arraySort(x -> x.1, groupArray((e.event_time, e.state_id))) AS evs
          FROM events AS e
          LEFT JOIN
          (
              SELECT tenant_id, entity_id,
                     state_id   AS prev_state_id,
                     start_time AS prev_start,
                     toUInt8(1) AS has_seed
              FROM entity_state_current FINAL
          ) AS s ON s.tenant_id = e.tenant_id AND s.entity_id = e.entity_id
          GROUP BY e.tenant_id, e.entity_id
      ),
      -- Gaps-and-islands over (seed ++ block):
      --   boundary[i] = 1 where the state differs from the previous point
      --   rs[i]       = the run-start time, carried forward from the last boundary
      walked AS
      (
          SELECT
              tenant_id, entity_id, last_event_ts,
              if(has_seed = 1, arrayConcat([(seed_start, seed_state)], evs), evs) AS pts,
              arrayMap(x -> x.2, pts) AS st,
              arrayMap(x -> x.1, pts) AS ts,
              length(pts) AS n,
              arrayMap(i -> if(i = 1, 0, st[i] != st[i - 1]), range(1, n + 1)) AS boundary,
              arrayMap(p -> p.2,
                  arrayFill(p -> p.1 = 1,
                      arrayMap(i -> (if(i = 1, 1, st[i] != st[i - 1]), ts[i]), range(1, n + 1)))) AS rs
          FROM block
      )
  SELECT
      tenant_id, entity_id,
      st[idx - 1] AS state_id,       -- run being closed
      st[idx]     AS next_state_id,  -- run it transitions into
      rs[idx - 1] AS start_time,     -- when the closed run began (may predate this block)
      ts[idx]     AS end_time,       -- when the change happened
      last_event_ts
  FROM walked
  ARRAY JOIN range(2, n + 1) AS idx  -- one candidate per adjacent pair of points
  WHERE boundary[idx] = 1;           -- ...kept only where the state actually changed

The run-start trick is arrayFill: it forward-fills over positions where the predicate is false, keeping the values where it is true. Mark each boundary point with its own timestamp and leave non-boundaries to inherit the most recent boundary, and rs[i] becomes "the time the current run started" at every index.

The state updater (also block-aware)

The updater must be block-aware too. A naive multiIf updater that writes one row per event lets ReplacingMergeTree keep whichever has the highest last_event_ts - which, for a block like [A@09:00, A@09:05], picks 09:05 and loses the true run start of 09:00. So the updater runs the same island walk and emits the final, still-open run per entity:

CREATE MATERIALIZED VIEW entity_state_update_mv TO entity_state_current AS
  WITH
      block AS
      (
          SELECT
              e.tenant_id AS tenant_id,
              e.entity_id AS entity_id,
              max(e.event_time)    AS last_event_ts,
              any(s.prev_state_id) AS seed_state,
              any(s.prev_start)    AS seed_start,
              max(s.has_seed)      AS has_seed,
              arraySort(x -> x.1, groupArray((e.event_time, e.state_id))) AS evs
          FROM events AS e
          LEFT JOIN
          (
              SELECT tenant_id, entity_id,
                     state_id   AS prev_state_id,
                     start_time AS prev_start,
                     toUInt8(1) AS has_seed
              FROM entity_state_current FINAL
          ) AS s ON s.tenant_id = e.tenant_id AND s.entity_id = e.entity_id
          GROUP BY e.tenant_id, e.entity_id
      ),
      walked AS
      (
          SELECT
              tenant_id, entity_id, last_event_ts,
              if(has_seed = 1, arrayConcat([(seed_start, seed_state)], evs), evs) AS pts,
              arrayMap(x -> x.2, pts) AS st,
              arrayMap(x -> x.1, pts) AS ts,
              length(pts) AS n,
              arrayMap(p -> p.2,
                  arrayFill(p -> p.1 = 1,
                      arrayMap(i -> (if(i = 1, 1, st[i] != st[i - 1]), ts[i]), range(1, n + 1)))) AS rs
          FROM block
      )
  SELECT
      tenant_id, entity_id,
      st[n] AS state_id,      -- final, still-open run
      rs[n] AS start_time,    -- when it began (carried from the seed if unchanged)
      last_event_ts
  FROM walked;

Verified across batching shapes

The whole point of seeding the window is that the result no longer depends on how events are grouped into inserts. All three of these produce the identical, correct timeline:

CASE 1  all six events in ONE insert        -> A->B 09:00-09:12, B->C 09:12-09:41
  CASE 2  one event per insert                -> A->B 09:00-09:12, B->C 09:12-09:41
  CASE 3  split [A,A] then [B,C]               -> A->B 09:00-09:12, B->C 09:12-09:41

Note CASE 3: A->B still starts at 09:00. The run-start survived across the block boundary because it rode in on the seed.

The union view

-- Closed intervals + the currently-open stay, as one logical table.
  CREATE OR REPLACE VIEW entity_state_view AS
  SELECT tenant_id, entity_id, state_id, next_state_id, start_time, end_time, last_event_ts
  FROM entity_state_segments
  UNION ALL
  SELECT
      tenant_id, entity_id, state_id,
      ''                 AS next_state_id,  -- open interval: nowhere "next" yet
      start_time,
      now64(3, 'UTC')    AS end_time,       -- still open, so it ends "now"
      last_event_ts
  FROM entity_state_current FINAL;

Querying the timeline is now a sort-key-aligned read - no window, no full recompute:

SELECT state_id, next_state_id, start_time, end_time
  FROM entity_state_view
  WHERE tenant_id = 'acme' AND entity_id = 'e-1'
  ORDER BY start_time;

A   B          2026-01-01 09:00:00   2026-01-01 09:12:00
  B   C          2026-01-01 09:12:00   2026-01-01 09:41:00
  C   (empty)    2026-01-01 09:41:00   2026-06-18 12:07:55   <- open, end = now()

Order-key pushdown in the seed join

The expensive part of each insert is the LEFT JOIN to entity_state_current inside the block CTE. Its cost hinges on one rule: the join keys must be a left-anchored prefix of the right table's ORDER BY.

entity_state_current is sorted by (tenant_id, entity_id), and the join is on exactly s.tenant_id = e.tenant_id AND s.entity_id = e.entity_id - the whole sort key, in order. That lets ClickHouse serve the lookup from the sorted, primary-indexed data instead of hash-building the entire state table on every insert. Because the inputs are already sorted on the join key, you can pick the merge join explicitly:

SET join_algorithm = 'full_sorting_merge';

full_sorting_merge skips the sort step when inputs are already ordered on the join key. The default hash join would instead load the right side into memory in full - on a large state table inside a per-insert trigger, that is the difference between a cheap ordered probe and an OOM risk.

Three things combine to keep the join cheap:

Join keys (tenant_id, entity_id) are the sort-key prefix, so granules are skipped.
The small inserted block is the driving side; the subquery selects only the columns the seed needs.
Sorted-on-key inputs make full_sorting_merge a low-memory, no-extra-sort choice.

Anti-pattern: if the state table were sorted (entity_id, tenant_id) while you led the join with tenant_id, or you wrapped a key in a function (ON lower(s.entity_id) = lower(e.entity_id)), the prefix no longer matches and ClickHouse scans/hashes the whole table on every insert. Keep the join keys raw and in sort-key order.

Tip - prune the seed to the block. The GROUP BY (tenant_id, entity_id) means the join only needs state for entities present in this block. If your stream is multi-tenant and a block is single-tenant, also add WHERE tenant_id = ... inside the joined subquery to prune the right side before the join.

Myth-buster: firing order does not matter here

A natural worry: both MVs read entity_state_current and one of them writes it, so surely they must fire in a specific order, and parallel_view_processing must be off? Tested, and no. I ran the pipeline three ways:

transition MV created first, then updater (the "intended" order),
updater created first (named so it sorts and is created first), then transition,
SET parallel_view_processing = 1.

All three produced identical, correct output. The reason: within a single INSERT, each MV's join reads entity_state_current as it was committed before that insert. A sibling MV's write during the same insert is not visible to the other MV. Both views observe the same pre-insert snapshot, so their relative order cannot change the result.

This is a feature of the design, not luck: the transition detector and the updater both derive everything from (seed ++ block), where the seed is the pre-insert state. Neither depends on the other's output. Do not add a parallel_view_processing = 0 cargo-cult setting for this pattern. It is not needed.

The real limiter: one shard and replica per entity

Here is the constraint that genuinely matters. The seed read (entity_state_current FINAL) is local to the node processing the insert. It sees only the state that node has committed. For the seed to be the entity's true current state, every event for a given entity must be ingested on the same shard and the same replica.

Across shards: if an entity's events are hash-routed to different shards, each shard holds only a fragment of that entity's state. The seed is wrong, and segments fracture. You must shard so an entity co-locates with its state - for example, distribute by cityHash64(tenant_id, entity_id) and give all three tables (events, entity_state_current, entity_state_segments) the same sharding key.
Across replicas: ReplicatedMergeTree converges replicas eventually. The seed read happens at insert time, before replication catches up - so if consecutive events for an entity land on different replicas of the same shard, a freshly written state row may not be visible yet, and you can momentarily re-seed from stale state. Pin an entity's events to a single replica (sticky/consistent routing on entity_id); do not rely on "any replica of the shard."

In short: this is a local, single-writer-per-entity pattern. It scales horizontally by partitioning entities across nodes, not by spreading one entity's events around. If your ingestion cannot guarantee sticky per-entity routing, prefer the refreshable-MV approach below, which recomputes from the full table and is immune to where each event landed.

Operational gotchas

FINAL in the seed subquery is intentional, and is not OPTIMIZE ... FINAL. Reading entity_state_current FINAL gives the deduplicated current row. The FINAL modifier in a SELECT is the right way to dedup ReplacingMergeTree reads; never schedule OPTIMIZE TABLE ... FINAL to "pre-deduplicate", let background merges run.
Stale open intervals. An entity that stops emitting keeps an open interval whose end_time is now() forever. Treat last_event_ts as a freshness signal (for example, ignore open intervals where now() - last_event_ts > timeout).
At-least-once delivery. A duplicated event re-enters the same run, so the island walk produces no spurious boundary, and ReplacingMergeTree(last_event_ts) collapses duplicate state rows.
Out-of-order within a block is handled - the block is arraySorted by time before the walk. Out-of-order across blocks (a late event after its successor already advanced the state) is not; clamp or route by event time if that happens.
Backfill is separate. Incremental MVs only process data inserted after they exist. To seed history, run the batch window-function query once into the target tables, then let the MVs take over.

When to reach for a refreshable MV instead

This trigger-based design shines when you need fresh segments at insert time, the per-event bookkeeping is cheap, and you control routing so each entity is sticky to one node. If you can tolerate minutes of staleness, or you cannot guarantee per-entity routing, a refreshable materialized view that periodically re-runs the batch window query into a MergeTree target is simpler and routing-agnostic, at the cost of recomputing each refresh.

	Incremental (this pattern)	Refreshable MV
Freshness	At insert time	Every N minutes
Cost model	Tiny work per event	Full recompute per refresh
Routing	Must be sticky per entity	Indifferent
Best for	High-volume live segments, sticky ingest	Complex joins, tolerable lag

Takeaways

A batch "segment / gaps-and-islands" query assumes it can see an entity's whole history. A naive MV cannot, and a simple state-diff MV silently misses transitions that share an insert block.
You can run the window inside the MV by scoping it to one block and seeding it with the stored current state, then walking (seed ++ block) as a gaps-and-islands with array functions (avoid UNION inside the MV). Make the updater block-aware too, or ReplacingMergeTree will keep the wrong run-start.
Alias colliding columns across the join - an ambiguous s.state_id reads as empty without erroring.
Align the seed join's keys with the state table's sort key for an ordered-key lookup instead of a full-table hash build.
Firing order is irrelevant here (both MVs see the pre-insert snapshot); the binding constraint is that every event for an entity must hit the same shard and replica. This is a local, single-writer-per-entity pattern - scale it by partitioning entities across nodes, not by spreading one entity around.

If you are running ClickHouse in production and want a second set of eyes on a streaming materialized-view design like this - sharding, routing, and all - BigData Boutique does exactly this kind of work.