A practical guide to sending data from Logstash to ClickHouse, covering the dedicated output plugin, the HTTP output alternative, batch tuning, and when to consider Vector instead.

Logstash is the log-processing workhorse of the Elastic Stack, routing data from hundreds of input sources through filters and into various outputs. ClickHouse is a column-oriented OLAP database built for sub-second analytical queries over billions of rows. Connecting them lets teams feed structured log and event data into ClickHouse for fast, SQL-based analytics without replacing an existing Logstash-based pipeline.

This post walks through the two main approaches for getting data from Logstash into ClickHouse, the batch-tuning settings that matter most, and the trade-offs you should weigh before committing to this path.

The Dedicated Output Plugin

The most direct route is the logstash-output-clickhouse community plugin. It sends events as JSON batches over ClickHouse's HTTP interface and handles load balancing across multiple hosts out of the box.

Install it like any other Logstash plugin:

bin/logstash-plugin install logstash-output-clickhouse
  

A minimal pipeline configuration looks like this:

output {
    clickhouse {
      http_hosts => ["http://clickhouse-node1:8123", "http://clickhouse-node2:8123"]
      table => "logs"
      flush_size => 10000
      idle_flush_time => 5
      mutations => {
        "timestamp" => "%{@timestamp}"
        "message"   => "%{message}"
        "host"      => "%{[host][name]}"
        "level"     => "%{[log][level]}"
      }
    }
  }
  

The mutations block maps Logstash event fields to ClickHouse column names. You can use simple field references or regex-based transformations. The plugin also supports basic auth via the headers parameter (pass a Base64-encoded Authorization header).

One thing to know upfront: this plugin's GitHub repository was archived in January 2021. It still works with current Logstash versions for basic use cases, but it is not actively maintained. There are no recent fixes, no support for newer ClickHouse authentication methods, and no TLS configuration beyond what HTTP headers can carry. If your requirements are simple - batch JSON inserts with field mapping and retry logic - it will do the job. If you need compression, native protocol support, or active maintenance, read on.

The HTTP Output Alternative

For teams that want a maintained, first-party plugin, Logstash's built-in HTTP output can target ClickHouse's HTTP interface directly. ClickHouse accepts INSERT queries via POST requests in multiple formats, including JSONEachRow - which maps naturally to Logstash events.

output {
    http {
      url => "http://clickhouse-node1:8123/?query=INSERT%20INTO%20logs%20FORMAT%20JSONEachRow"
      http_method => "post"
      format => "json_batch"
      content_type => "application/json"
      headers => {
        "X-ClickHouse-User" => "default"
        "X-ClickHouse-Key"  => "your_password"
      }
      pool_max => 10
      pool_max_per_route => 5
      connect_timeout => 10
      socket_timeout => 30
    }
  }
  

The key here is the query parameter in the URL. ClickHouse parses the INSERT INTO ... FORMAT JSONEachRow statement from the URL and reads the actual row data from the POST body. This means you can use any filter plugin to shape your events into a JSON structure that matches your target table schema, and ClickHouse handles the rest.
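Before wiring Logstash in, it can help to smoke-test the same endpoint by hand. A sketch with curl, assuming the logs table and credentials from the examples above (host, table, and column names are placeholders for your own setup):

    echo '{"timestamp":"2024-01-01 12:00:00","message":"hello","host":"web-1","level":"info"}' |
      curl 'http://clickhouse-node1:8123/?query=INSERT%20INTO%20logs%20FORMAT%20JSONEachRow' \
           -H 'X-ClickHouse-User: default' \
           -H 'X-ClickHouse-Key: your_password' \
           --data-binary @-

An empty response body with HTTP 200 means the insert succeeded; ClickHouse returns errors as plain-text exceptions in the response body, which makes misconfigured schemas easy to spot before Logstash is in the loop.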

This approach has two advantages: the HTTP output plugin is actively maintained by Elastic, and you get full control over ClickHouse connection parameters (compression headers, session IDs, query settings as URL params). The downside is that you lose the dedicated plugin's built-in load balancing and save-on-failure features - though Logstash's persistent queue can partially compensate for the latter.

Batch Tuning for ClickHouse

Getting the pipeline running is the easy part. Getting it to run well requires understanding how ClickHouse handles inserts at the storage layer.

Every INSERT into a MergeTree table creates a new data part on disk. Background merge threads consolidate small parts into larger ones, but if inserts arrive faster than merges can keep up, part counts spike. The ClickHouse docs recommend keeping insert frequency to roughly one per second, with a minimum of 1,000 rows per batch and an ideal range of 10,000 to 100,000 rows (ClickHouse bulk insert best practices).
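You can check whether your batch sizes are keeping part counts under control with a query against the system.parts table; a sketch, assuming the target table is named logs:

    -- Active parts per partition; a steadily climbing count means inserts
    -- are outrunning background merges and batches need to grow.
    SELECT partition, count() AS active_parts
    FROM system.parts
    WHERE active AND database = currentDatabase() AND table = 'logs'
    GROUP BY partition
    ORDER BY active_parts DESC;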

For the dedicated plugin, the relevant settings are:

Setting            | Default | Recommendation
flush_size         | 50      | 10,000 - 50,000
idle_flush_time    | 5s      | 10 - 30s
automatic_retries  | 1       | 3 - 5
request_tolerance  | 5       | 5

The default flush_size of 50 is far too low for ClickHouse. At that batch size, even moderate throughput produces a flood of tiny parts, and ClickHouse will eventually reject inserts with its "Too many parts" error. Bump it to at least 10,000.

For the HTTP output with format => "json_batch", each request carries one pipeline batch, so batching is controlled by Logstash's own pipeline settings (pipeline.batch.size and pipeline.batch.delay in logstash.yml); pool_max only caps concurrent connections. Set pipeline.batch.size to 10,000+ and pipeline.batch.delay to 5000-10000 ms.
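In logstash.yml, that tuning might look like the following (the values are illustrative starting points, not defaults):

    # logstash.yml - sketch of settings for ClickHouse-sized batches
    pipeline.workers: 4          # each worker flushes its own batches
    pipeline.batch.size: 10000   # events per in-memory batch (and per POST)
    pipeline.batch.delay: 5000   # ms to wait for a batch to fill before flushing
    queue.type: persisted        # buffer to disk across restarts

Keep in mind that larger in-flight batches consume more JVM heap, so raise the Logstash heap alongside these settings if you see memory pressure.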

Another optimization worth considering: ClickHouse parses inserts fastest in Native format, followed by RowBinary; JSONEachRow carries meaningful parsing overhead that compounds at high throughput. Since neither Logstash option emits the binary formats, compression is the more practical lever: LZ4 (passed via a Content-Encoding: lz4 header) can reduce data transfer by over 50% according to ClickHouse documentation, which helps when Logstash and ClickHouse sit on different networks.
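ClickHouse's HTTP interface decompresses a request body when a matching Content-Encoding header is set. A sketch using gzip (batch.jsonl is a placeholder file of JSONEachRow rows; swap in lz4 if your server version accepts it):

    # Compress the batch client-side; ClickHouse decompresses on arrival.
    gzip -c batch.jsonl |
      curl 'http://clickhouse-node1:8123/?query=INSERT%20INTO%20logs%20FORMAT%20JSONEachRow' \
           -H 'Content-Encoding: gzip' \
           --data-binary @-

Note that, as far as we know, the stock Logstash HTTP output does not compress request bodies itself, so this technique matters most when a proxy or custom shipper sits between Logstash and ClickHouse.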

If client-side batching is impractical for your workload, ClickHouse's async inserts (async_insert=1) let the server buffer small inserts and flush them in larger batches internally. You can enable this per-query by appending &async_insert=1&wait_for_async_insert=1 to the HTTP URL.

Common Pitfalls

Schema mismatches can fail silently. If your JSONEachRow payload contains a field with no matching column, ClickHouse either drops it or rejects the whole insert, depending on the input_format_skip_unknown_fields setting (its default has varied across versions); omitted columns are typically filled with their default values. Always test your pipeline against a staging table first, and set input_format_skip_unknown_fields=1 in the query URL to be explicit about how extra fields are handled.

Timestamp format matters. ClickHouse DateTime columns expect Unix timestamps or the YYYY-MM-DD HH:MM:SS format. Logstash's default @timestamp is ISO 8601 with milliseconds and a timezone offset. Reformat it with a Logstash filter, use a DateTime64 column, or set date_time_input_format=best_effort on the query URL to avoid silent truncation or parse failures.
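A sketch of a filter block that normalizes @timestamp into a DateTime-friendly string (the timestamp field name matches the mutations example earlier):

    filter {
      # Copy @timestamp into a plain string ClickHouse can parse as DateTime.
      ruby {
        code => 'event.set("timestamp", event.get("@timestamp").time.utc.strftime("%Y-%m-%d %H:%M:%S"))'
      }
    }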

Retry storms after ClickHouse restarts. Both the dedicated plugin and the HTTP output will retry failed requests. If ClickHouse goes down for maintenance and Logstash has been buffering, the restart can trigger a flood of large batch inserts that overwhelm the cluster. Use Logstash's persistent queue (queue.type: persisted) and stagger retry intervals with backoff_time.

Don't partition by day on high-volume tables. This is a ClickHouse-side issue, not a Logstash one, but it surfaces most often in log pipelines. Daily partitions on tables receiving millions of rows per day create too many partitions, each with independent merge processes. Monthly partitioning (toYYYYMM(timestamp)) is almost always the right choice for log data.
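A sketch of a matching target table, assuming the four columns used in the pipeline examples above:

    CREATE TABLE logs
    (
        timestamp DateTime,
        message   String,
        host      LowCardinality(String),
        level     LowCardinality(String)
    )
    ENGINE = MergeTree
    -- Monthly partitions keep partition counts manageable for log volume.
    PARTITION BY toYYYYMM(timestamp)
    ORDER BY (host, level, timestamp);

The ORDER BY key is a judgment call: leading with host suits per-host drill-downs, while leading with timestamp favors broad time-range scans.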

When to Use Vector or Fluent Bit Instead

The archived status of the dedicated Logstash plugin raises a fair question: should you use Logstash at all for this?

Vector (by Datadog) has a native ClickHouse sink with active maintenance, built-in compression, schema mapping, and batch controls. It is written in Rust, uses significantly less memory than Logstash's JVM, and achieves higher throughput on equivalent hardware. The original author of logstash-output-clickhouse recommends switching to Vector.

Fluent Bit is another lightweight alternative with a ClickHouse output plugin, particularly popular in Kubernetes environments where its small footprint matters.

That said, Logstash remains the right tool when:

  • You already run a Logstash fleet and adding ClickHouse is one more output alongside Elasticsearch
  • You rely heavily on Logstash filter plugins (grok, dissect, geoip) that have no direct equivalent in Vector
  • Your team knows Logstash pipeline syntax and does not want to learn a new tool for one integration

                         | Logstash            | Vector                         | Fluent Bit
ClickHouse plugin status | Archived (2021)     | Active                         | Active
Protocol                 | HTTP / JSON         | HTTP / Native                  | HTTP
Memory footprint         | High (JVM)          | Low (Rust)                     | Very low (C)
Filter ecosystem         | 50+ plugins         | Growing                        | Moderate
Best for                 | Existing ELK stacks | New pipelines, high throughput | Kubernetes, edge

Key Takeaways

  • Two paths exist for Logstash-to-ClickHouse: the dedicated logstash-output-clickhouse plugin (simple but archived) and the built-in HTTP output targeting ClickHouse's HTTP interface (maintained, more flexible)
  • Batch size is the single most critical tuning parameter. The default of 50 rows will cause "too many parts" errors in production. Target 10,000+ rows per batch
  • Use JSONEachRow format for the HTTP approach, and watch for schema mismatches and timestamp format issues
  • For new pipelines without existing Logstash investment, Vector is the stronger choice for ClickHouse ingestion - it is actively maintained, faster, and lighter on resources
  • Enable ClickHouse async inserts (async_insert=1) as a safety net when client-side batching cannot guarantee large enough batches