Reduce ClickHouse storage costs by choosing the right data types, applying LowCardinality where it matters, and tuning compression codecs. Includes diagnostic queries to audit your schema.
Schema design in ClickHouse has an outsized impact on storage costs and query performance. Because ClickHouse is a columnar database, every column is stored and compressed independently - which means a poor data type choice on a single column is multiplied across every row in the table. On a table with billions of rows, the difference between an Int32 and an Int8, or a String and a LowCardinality(String), can translate into a column that is several times larger on disk than it needs to be.
This is not about exotic optimization techniques. It is about choosing the correct data type in the first place - something that is surprisingly often neglected in production schemas.
Why Data Types Matter at Scale
When you define a ClickHouse schema, every column requires a data type. The choice you make determines how much space each value occupies in storage, how effectively compression can reduce that footprint, and how much memory is needed to process the column during queries.
The problem we see repeatedly in production systems is over-provisioned data types. An Int32 used for a column that only ever contains values between 0 and 10. A String for an event type field that has exactly 10 possible values. A Float64 for a metric that could be stored as UInt16. Each of these wastes storage and memory on every single row.
At small scale, this does not matter. At billions of rows, it is the difference between a table that fits comfortably in storage and one that costs significantly more to operate and is slower to query.
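To make this concrete, here is a minimal sketch of what right-sizing looks like - the table and column names are hypothetical, but the type choices mirror the examples above:
-- Hypothetical example: right-sized column types
CREATE TABLE request_metrics
(
    retry_count UInt8,    -- values only ever range from 0 to 10; Int32 would use 4 bytes where 1 suffices
    response_ms UInt16,   -- bounded metric; Float64 would use 8 bytes where 2 suffice
    status_code UInt16    -- HTTP status codes fit comfortably in 2 bytes
)
ENGINE = MergeTree
ORDER BY (status_code, retry_count)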
LowCardinality: The Biggest Quick Win
The single most impactful schema optimization in ClickHouse is the LowCardinality modifier for string columns. If a column has a relatively small number of unique values - think event types, status codes, country names, log levels - wrapping it in LowCardinality enables dictionary encoding. Instead of storing the full string for every row, ClickHouse stores a compact dictionary of unique values and uses integer pointers to reference them.
This is not a subtle optimization or a compression tweak. It is a fundamental change in how the data is stored. A column with 10 unique string values across a billion rows goes from storing a billion strings to storing 10 dictionary entries plus a billion small integer pointers. The storage reduction is dramatic, and query performance typically improves as well since there is less data to read from disk.
-- Instead of this:
CREATE TABLE events (
    event_type String,
    status String,
    country String,
    ...
)

-- Use this:
CREATE TABLE events (
    event_type LowCardinality(String),
    status LowCardinality(String),
    country LowCardinality(String),
    ...
)
LowCardinality is effective when the number of unique values is in the low thousands or less. For high-cardinality columns like user IDs or UUIDs, it provides no benefit - stick with the plain data type.
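If the table already exists, you do not need to recreate it. The following sketch - assuming the events table from the example above - converts a column in place:
-- ClickHouse rewrites the column's data as a background mutation,
-- which can be I/O-heavy on large tables
ALTER TABLE events
MODIFY COLUMN event_type LowCardinality(String);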
Compression Codecs
Beyond data types, ClickHouse lets you specify compression codecs at the column level. The two main options are:
- LZ4 (the default): Fast compression and decompression, moderate compression ratio. Best for columns that are frequently read in queries.
- ZSTD: Higher compression ratio at the cost of slower compression and decompression. Best for columns that are infrequently queried but take up a lot of space.
You can set the codec when creating or altering a table:
ALTER TABLE events
MODIFY COLUMN raw_payload String CODEC(ZSTD(3));
The strategy is straightforward: use ZSTD with a higher compression level for large, rarely queried columns (like raw payloads or verbose log messages), and keep LZ4 for columns that are in your hot query path. This gives you meaningful storage savings without impacting the queries that matter most.
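As a sketch of how that split can look in a table definition (the schema here is illustrative, not taken from a real system):
-- Illustrative example: default LZ4 on hot-path columns, ZSTD(3) on the bulky one
CREATE TABLE events_v2
(
    event_time  DateTime,                -- hot path, default LZ4
    event_type  LowCardinality(String),  -- hot path, default LZ4
    raw_payload String CODEC(ZSTD(3))    -- large and rarely queried
)
ENGINE = MergeTree
ORDER BY event_time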
Keep in mind that heavier compression has a cost during background merges, not just at query time. If a ZSTD-compressed column sits in parts that are merged frequently, the work of decompressing and recompressing it on every merge adds up.
Auditing Your Schema
Rather than guessing which columns to optimize, you can query ClickHouse's system tables to identify the biggest opportunities. Start by finding your most expensive columns by storage size:
SELECT
    column,
    formatReadableSize(sum(column_data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed,
    round(sum(column_data_uncompressed_bytes) /
          sum(column_data_compressed_bytes), 2) AS compression_ratio
FROM system.parts_columns
WHERE active AND table = 'your_table'
GROUP BY column
ORDER BY sum(column_data_compressed_bytes) DESC
This shows you which columns are consuming the most disk space and how well they are compressing. Columns with a low compression ratio are candidates for either a better data type or a more aggressive codec.
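To compare those sizes against the declared types, system.columns holds the current schema; a quick lookup using the same your_table placeholder might look like this:
SELECT name, type
FROM system.columns
WHERE table = 'your_table'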
Next, for any column that stands out, check its actual value distribution:
SELECT
    uniq(event_type) AS unique_values,
    min(event_type) AS min_value,
    max(event_type) AS max_value
FROM your_table
If your String column has 10 unique values, it should be LowCardinality(String) or an Enum. If your Int32 column's max value fits in an Int8 (up to 127) or Int16 (up to 32,767), downsize it. These findings are the low-hanging fruit - easy changes with significant impact.
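The fix for a downsizing opportunity is usually a single statement - for example, assuming a hypothetical retry_count column whose values fit in a UInt8:
-- The type change runs as a mutation that rewrites the column's data on disk
ALTER TABLE your_table
MODIFY COLUMN retry_count UInt8;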
Key Takeaways
- Data type choices are multiplied across every row. At scale, an Int32 where an Int8 suffices, or a String where LowCardinality(String) applies, wastes storage on every single row.
- LowCardinality is the single biggest quick win for string columns with a limited number of unique values. It changes how data is stored, not just how it is compressed.
- Use ZSTD for large, infrequently queried columns and LZ4 (the default) for hot-path columns. Apply codecs at the column level for fine-grained control.
- Audit your schema using system.parts_columns to find the largest columns, then check value distributions to identify downsizing opportunities. Run this analysis regularly, especially as your data grows.