LowCardinality is ClickHouse's dictionary encoding for columns with few unique values. Learn how it works internally, when to apply it, and the storage and query speed gains you can expect.
String columns are often the largest consumers of storage in ClickHouse tables. A column storing country codes, log levels, or event types might have only a handful of distinct values, yet ClickHouse stores the full string for every single row. On a table with a billion rows, that redundancy adds up fast. LowCardinality solves this by applying dictionary encoding at the column level: unique values are stored once in a dictionary, and each row holds only a small integer reference. The result is storage reductions of up to 90% and query speedups of 2-5x on affected columns.
How LowCardinality Works Under the Hood
When you declare a column as LowCardinality(String), ClickHouse builds an internal dictionary of all unique values in that column. Each row stores an integer position pointing into that dictionary instead of the actual string. This is transparent to your queries - you still write WHERE status = 'active', and ClickHouse handles the translation.
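Declaring the type is all it takes; queries are unchanged. A minimal sketch, assuming a hypothetical events table:

```sql
-- Dictionary encoding is declared per column; no other schema changes needed.
CREATE TABLE events
(
    event_time DateTime,
    status     LowCardinality(String),  -- stored as dictionary positions
    user_id    UInt64
)
ENGINE = MergeTree
ORDER BY event_time;

-- Queries look identical to a plain String column.
SELECT count() FROM events WHERE status = 'active';
```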
The dictionary encoding adapts to the data. If a column has fewer than 256 unique values, positions are stored as UInt8 (1 byte each). Up to 65,535 unique values use UInt16 (2 bytes). Beyond that, UInt32. This means a column with 10 distinct status codes across a billion rows stores 10 dictionary entries plus a billion single-byte integers - instead of a billion variable-length strings.
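If you want to see the positions rather than the strings, ClickHouse exposes them through the lowCardinalityIndices and lowCardinalityKeys functions; a sketch, assuming the same hypothetical events table with a LowCardinality status column:

```sql
-- lowCardinalityIndices returns the integer position each row stores;
-- lowCardinalityKeys returns the entries of the column's dictionary.
SELECT
    status,
    lowCardinalityIndices(status) AS dict_position
FROM events
LIMIT 5;
```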
ClickHouse also optimizes how dictionaries are stored on disk. When a LowCardinality column's dictionary stays within 8,192 distinct values (the default of the low_cardinality_max_dictionary_size setting), ClickHouse keeps a single shared dictionary for the data part; beyond that threshold, the column is split across multiple local dictionaries. In distributed queries, ClickHouse operates on dictionary positions for as long as possible, resolving back to actual strings only at the final stage. This reduces network transfer and memory usage across shards.
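The dictionary threshold is exposed as a query-level setting; you can inspect the current values with a query like:

```sql
-- low_cardinality_max_dictionary_size (default 8192) controls when a
-- shared dictionary is split into multiple local dictionaries.
SELECT name, value
FROM system.settings
WHERE name LIKE 'low_cardinality%';
```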
When to Use LowCardinality (and When Not To)
The sweet spot for LowCardinality is columns with fewer than 10,000 unique values. Think status codes, HTTP methods, country names, log levels, event types, browser names - any column where the same strings repeat frequently across rows.
Between 10,000 and 100,000 unique values is a gray zone. You may still see storage benefits, but query performance gains diminish. Above 100,000 unique values, LowCardinality can actually hurt performance compared to plain String because the dictionary overhead outweighs the savings. Columns like UUIDs, email addresses, URLs, or free-text fields should never use LowCardinality.
| Criteria | String | LowCardinality(String) | Enum8/Enum16 |
|---|---|---|---|
| Unique values | Any | < 10,000 ideal | Fixed, known set |
| Schema changes | No action needed | No action needed | Requires ALTER TABLE |
| Storage efficiency | Baseline | Up to 90% reduction | Similar to LowCardinality |
| Query speed (GROUP BY) | Baseline | 2-5x faster | 5-6x faster |
| Flexibility | Full | Full | Values must be predefined |
| Nullable support | Yes | Yes (LowCardinality(Nullable(String))) | No |
A few things worth knowing about type restrictions. By default, ClickHouse only allows LowCardinality on String, FixedString, Date, DateTime, and Nullable combinations of these. Wrapping fixed-size types like Int32 or Float64 in LowCardinality is blocked because those types are already compact - the dictionary overhead would make things worse. If you need to override this (rare, and usually a bad idea), set allow_suspicious_low_cardinality_types = 1.
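For completeness, the override looks like this, though it is rarely advisable; a sketch with a hypothetical metrics table:

```sql
-- Without this setting, the CREATE TABLE below is rejected.
SET allow_suspicious_low_cardinality_types = 1;

CREATE TABLE metrics
(
    ts    DateTime,
    value LowCardinality(Float64)  -- usually counterproductive: Float64 is already compact
)
ENGINE = MergeTree
ORDER BY ts;
```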
For nullable columns, the correct syntax is LowCardinality(Nullable(String)), not Nullable(LowCardinality(String)). ClickHouse handles NULL as a special dictionary entry, so there is no extra overhead for nullability within the LowCardinality wrapper.
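In a table definition, that nesting order looks like this (sessions and referrer are hypothetical names):

```sql
-- NULL is handled as a special dictionary entry; note the nesting order:
-- LowCardinality wraps Nullable, not the other way around.
CREATE TABLE sessions
(
    session_id UInt64,
    referrer   LowCardinality(Nullable(String))
)
ENGINE = MergeTree
ORDER BY session_id;
```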
Performance Impact: Storage and Query Speed
The storage savings from LowCardinality are straightforward to measure. In a typical benchmark with 10 million rows and a string column containing ~50 unique values, storage drops from 157 MB to 94 MB - a 40% reduction on that column alone. With very repetitive data (think log levels with 5-6 values), the compressed size can shrink by up to 90%.
Query performance depends on the operation:
- GROUP BY: 2-5x faster. ClickHouse groups by integer positions instead of comparing strings, and the compact representation fits better in CPU cache.
- WHERE filters: 2-4x faster for equality checks and IN clauses. String comparisons become integer comparisons.
- Time-series with path-like keys: Up to 20-25x faster in some reported cases, particularly when long, repetitive strings (like metric paths) are dictionary-encoded.
- Point lookups: Negligible difference. If you are fetching a single row by primary key, dictionary encoding does not help much.
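A simple way to compare the two representations on your own data is to run the same aggregation with and without the encoding; a sketch, assuming a hypothetical events table with a LowCardinality status column:

```sql
-- Runs on dictionary positions.
SELECT status, count() FROM events GROUP BY status;

-- The CAST strips the dictionary encoding, forcing plain String semantics,
-- so timing both queries isolates the effect of LowCardinality.
SELECT CAST(status AS String) AS s, count() FROM events GROUP BY s;
```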
The tradeoff is a modest increase in insert time. Dictionary maintenance adds overhead during writes - roughly 60% slower inserts in some benchmarks (4.16s vs 2.59s for 10M rows). For most analytical workloads where reads vastly outnumber writes, this is a worthwhile trade.
Applying LowCardinality to Existing Tables
Migrating an existing column to LowCardinality is a single ALTER TABLE statement:
```sql
ALTER TABLE events
MODIFY COLUMN status LowCardinality(String);
```
This triggers a background mutation that rewrites the column data. You can track progress in system.mutations:
```sql
SELECT * FROM system.mutations
WHERE table = 'events' AND is_done = 0;
```
Before applying changes, audit your columns to find the best candidates. Start by identifying the largest string columns:
```sql
SELECT
    column,
    type,
    formatReadableSize(sum(column_data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed
FROM system.parts_columns
WHERE active AND table = 'events' AND type = 'String'
GROUP BY column, type
ORDER BY sum(column_data_compressed_bytes) DESC;
```
Then check cardinality for each candidate:
```sql
SELECT uniq(status) AS unique_values FROM events;
```
If the unique count is under 10,000, wrap it in LowCardinality. If it is in the hundreds or fewer, the gains will be dramatic. Run the storage audit query again after the mutation completes to verify the improvement.
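To check several candidates in one pass, uniq can be applied to each column in a single query; the column names here are hypothetical:

```sql
-- Cardinality check for multiple candidate columns at once.
SELECT
    uniq(status)  AS status_values,
    uniq(country) AS country_values,
    uniq(browser) AS browser_values
FROM events;
```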
Key Takeaways
- `LowCardinality` applies dictionary encoding at the column level: unique values are stored once, and rows store integer positions. It works best on columns with fewer than 10,000 unique values.
- Storage reductions of 40-90% are typical, with query speedups of 2-5x on GROUP BY and WHERE operations.
- Do not use `LowCardinality` on high-cardinality columns (UUIDs, emails, URLs). Above 100,000 unique values, performance degrades compared to plain `String`.
- `Enum` is faster than `LowCardinality` for GROUP BY (5-6x vs 2-5x), but requires schema changes whenever values change. Prefer `LowCardinality` unless your value set is truly fixed.
- Migrating existing columns is straightforward with `ALTER TABLE ... MODIFY COLUMN`. Audit with `system.parts_columns` and `uniq()` to find the best candidates first.
- Insert throughput decreases modestly with `LowCardinality` due to dictionary maintenance. For read-heavy analytical workloads, the query and storage benefits far outweigh this cost.