Elasticsearch Data Streams provide powerful ways to manage time series data and other types of append-only data.

In this blog, we explain the benefits and limitations of data streams and how to select and set up the correct type of data stream for your needs. Although we focus on Elasticsearch throughout this article, everything here also applies to OpenSearch unless otherwise noted. Let’s start by defining data streams and explaining why they were introduced.

Why Do We Need Data Streams?

Elasticsearch added the new data streams feature in 2020 as an improved way to manage time series data, replacing the less predictable, harder-to-manage approach of daily rolling indices.

Managing time series data with Elasticsearch was always possible, but before data streams the process was more complex and less efficient. The traditional approach typically involved creating indices with names based on timestamps. However, that could result in variable index and shard sizes, inefficient data compression, and shard hotspots, which in turn increase cluster imbalance and hurt performance.

The limitations of that approach were especially pronounced when dealing with sporadic or irregular data influxes that do not stream uniformly over time. For example, e-commerce sites see data peaks on weekends, before the holiday season, or during special sales events, which leads to indices whose shards are larger for some days and smaller for others. Similarly, an ever-growing business might start with monthly indices due to low data volume, then move to weekly and eventually hourly indices as it grows; manually managing the granularity to keep shard sizes in check is simply too painful.

Data streams have simplified the process and supplied native capabilities that better fit the use case of streaming time series data to Elasticsearch and OpenSearch.

What Capabilities Do Data Streams Offer?

Data streams should be used for rarely updated or append-only data, such as logs and metrics. Using data streams allows us to keep shards balanced across the cluster and thus maintain maximum performance.

They do that by providing an automatic, out-of-the-box optimized data rollover strategy via an alias that functions as a pseudo-index you can write to and read from. This ensures that the data remains well balanced across shards and indices, easily accessible, and optimally compressed, regardless of fluctuating data patterns.

Combining data streams with index lifecycle management lets the cluster determine shard sizes automatically, further improving search performance and keeping the cluster balanced.

Also, unlike rollover aliases, which were the previous method of rolling over data in Elasticsearch, data streams do not require a bootstrap process for the first backing index. That means they are a better fit when we cannot predict when we'll need to start writing data.

Setting Up Data Streams

Setting up a new data stream is similar to defining index mappings through a template. It involves creating a dedicated index template that declares it is a data stream (as opposed to a standard index template) and defines the stream’s structure, index lifecycle, and mappings. This template is the blueprint for the concrete indices created under the data stream. You’ll also need to set up policies for index management, which is outside the scope of this blog, but you’re welcome to contact us to learn more about how to do this.

In Elasticsearch, creating a template can look something like this:

PUT /_index_template/my-index-template
{
    "index_patterns": ["my-data-stream*"],
    "data_stream": { },
    "composed_of": [ "my-mappings", "my-settings" ],
    "priority": 500,
    "_meta": {
        "description": "Template for my time series data",
        "my-custom-meta-field": "More arbitrary metadata"
    }
}
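The composed_of field above references component templates, which hold reusable mappings and settings. As a minimal sketch (the field mappings shown are illustrative assumptions, not part of the template above), one such component template could be created like this:

```
PUT /_component_template/my-mappings
{
  "template": {
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "message": { "type": "text" }
      }
    }
  }
}
```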
In OpenSearch, the index template can be created as follows:

PUT /_index_template/logs-template-nginx
{
  "index_patterns": ["my-data-stream*"],
  "data_stream": {
    "timestamp_field": {
      "name": "request_time"
    }
  },
  "priority": 200,
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  }
}

Make sure you do not use patterns meant for Elasticsearch’s built-in data streams if you don’t need to do so; see here for more information. After setting up the templates, you then need to direct your ingestion source to write to a destination that matches the pattern defined in index_patterns. So, for instance, if you require two separate data streams for events from customers A and B, following the example above, you would ingest into my-data-stream-events-a and my-data-stream-events-b. Ingesting documents to those destinations immediately creates the data streams described in the index templates.
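For example, a first document could be ingested like this (the field values are illustrative; in Elasticsearch the timestamp field must be named @timestamp unless configured otherwise, as in the OpenSearch template above):

```
POST /my-data-stream-events-a/_doc
{
  "@timestamp": "2030-01-01T12:00:00Z",
  "customer": "A",
  "message": "order created"
}
```

Because the request uses an auto-generated document ID, it is treated as a create operation, which is what data streams require.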

Migrating to Data Streams

Transitioning from traditional index-based storage to data streams might seem daunting. However, Elastic has streamlined the process to facilitate smooth migration: Elasticsearch version 7.11 introduced a dedicated API for migrating index aliases to data streams:

POST /_data_stream/_migrate/<ALIAS>

This API endpoint upgrades an index alias to a data stream. The upgrade involves specific modifications to the index's mapping, enabling data streams to utilize their full capabilities. At the time of writing, this migration capability does not exist in OpenSearch.
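As a hypothetical example (my-logs is an assumed alias name), you could run the migration and then inspect the resulting data stream and its backing indices:

```
POST /_data_stream/_migrate/my-logs

GET /_data_stream/my-logs
```

Note that, per Elastic’s documentation, the alias must have a write index and the indices behind it must map a @timestamp date field for the migration to succeed.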

Data Streams’ Limitations and Ways to Get Around Them

Despite their many advantages, data streams are not the best fit for every use case. Some of their limitations are detailed below:

Frequently Updated Data

When you require the existing documents in a cluster to be updated frequently, data streams are not the best option. Data streams are primarily meant for document creation and block update operations on existing data (except update by query). So, for data that needs frequent updates, a write alias, for example, might work better than a data stream.
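An update by query against a data stream is still allowed; a minimal sketch (the field names and script are hypothetical):

```
POST /my-data-stream/_update_by_query
{
  "query": {
    "term": { "user_id": "abc123" }
  },
  "script": {
    "source": "ctx._source.status = 'archived'"
  }
}
```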

Non-Time-Series Data

Data streams are only a good fit for time series data and must contain a timestamp field. So, don't opt for data streams if you are not using a timestamp field to organize and search the data. Instead, you should organize your data to consider the other attributes you will use for searching. For instance, if you have several customers and the data distributes nicely between them, you could separate indices by customer IDs or groups of customer IDs.


Aliases

Data streams cannot be included in aliases along with regular indices. If you have a requirement to query both regular indices and data streams, you can direct your search queries at both the data streams and the regular indices, or at separate aliases containing each.
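For instance, a single search request can target a data stream and a regular index together (both names here are hypothetical):

```
GET /my-data-stream,my-regular-index/_search
{
  "query": {
    "range": { "@timestamp": { "gte": "now-7d" } }
  }
}
```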

Data Retention

Managing the lifecycle of cluster indices from data ingestion to data serving, archiving, and deletion is crucial. Implementing policies to automate the data lifecycle can ensure seamless data transitions while optimizing resource allocation and storage efficiency by defining different data phases for each stage, e.g., hot, warm, cold, frozen, and delete. In Elasticsearch, this index management feature is called index lifecycle management (ILM); in OpenSearch, it is called index state management (ISM).

When using data streams, the backing indices are created with specialized names containing the index creation date (Elasticsearch) or a sequence number (OpenSearch). That differs from user-managed indices with names based on timestamps, where the index name can represent the relevant period of the data it contains. This means that with data streams, understanding what portion of the data each index contains is harder.

That may be irrelevant if we don't need to deal directly with the backing indices at all, but there are some use cases in which we do. Specifically, this may complicate scenarios where we must restore data from cold storage or move it between tiers. When deciding whether to use data streams, such requirements need to be considered along with the relevant capabilities available for managed/non-managed Elasticsearch/OpenSearch.
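When you do need to work with the backing indices directly, the get data stream API lists them; the response excerpt below is truncated and illustrative, showing Elasticsearch’s generated naming scheme:

```
GET /_data_stream/my-data-stream

(excerpt of response)
{
  "data_streams": [
    {
      "name": "my-data-stream",
      "indices": [
        { "index_name": ".ds-my-data-stream-2024.03.01-000001" },
        { "index_name": ".ds-my-data-stream-2024.04.01-000002" }
      ]
    }
  ]
}
```

The date in each name is the index creation date, not the time range of the documents inside it, which is the crux of the limitation described above.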

Data Ingestion

Another critical limitation is that only document creation operations are supported with data streams. When ingesting data directly from an external data source, controlling the indexing operation type is not always possible. That is why some ingestion sources, such as Amazon Kinesis Data Firehose, do not work with data streams. However, both Logstash and Data Prepper work with data streams, as does any application library relying on the bulk API.
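With the bulk API, each action targeting a data stream must use create rather than index; a minimal sketch (field values are illustrative):

```
POST /_bulk
{ "create": { "_index": "my-data-stream" } }
{ "@timestamp": "2030-01-01T12:00:00Z", "message": "first event" }
{ "create": { "_index": "my-data-stream" } }
{ "@timestamp": "2030-01-01T12:00:01Z", "message": "second event" }
```

Actions such as index, update, or delete that target the stream directly are rejected, which is why ingestion tools must support the create operation type to work with data streams.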


Summary

Data streams have simplified and improved the process of managing time series data on Elasticsearch and OpenSearch, especially when dealing with sporadic or irregular data influxes that do not stream uniformly over time. They do so by combining automatic, out-of-the-box optimized data rollover strategies with dedicated backing indices.

It's important to pick the right type of data stream for your needs and be conscious of their limitations. For example, data streams are not the best solution for frequently updated or non-time-based data. In addition, some ingestion sources cannot index data into data streams.

Nevertheless, despite such limitations, data streams remain a powerful way to streamline your time series data, balance your clusters, and improve search performance.

Contact us today to learn more about setting up data streams to meet your search needs.