In some cases a deep paginated search on an Elasticsearch or OpenSearch cluster is required. In this power tip we compare the various available options.

Deep paging in Elasticsearch and OpenSearch refers to retrieving a large number of search results beyond the default page of 10, and often also beyond the default limit of 10,000 documents. Since Elasticsearch is designed for speed and efficiency, fetching deep results can be resource-intensive and impact cluster performance. For cases where this is still needed, Elasticsearch provides several mechanisms to handle deep paging efficiently.

While Elasticsearch was designed to respond quickly with aggregations or the top few matching documents, there are valid scenarios for retrieving thousands of documents or more, such as:

  • Analytics and Reporting: Generating reports that require processing large (potentially filtered) datasets.
  • Full Data Export: Extracting large sets of data for external storage or processing.
  • Data Synchronization: Syncing search indices with external databases.
  • Pagination for Large Applications: Serving large-scale applications with extensive pagination needs.

Following are the various options for performing deep-paging operations, along with their pros and cons.

Using from and size

Elasticsearch provides the from and size parameters for paginated searches. They can be used to retrieve even thousands of documents, up to a configurable limit (index.max_result_window, 10,000 by default), but this comes with performance costs.

The syntax is as follows:

POST my_index/_search
{
  "from": 1000,
  "size": 50,
  "query": {
    "match_all": {}
  }
}

Pros:

  • Simple to implement.
  • Suitable for small-scale pagination.

Cons:

  • Performance Issues: Elasticsearch must compute and then discard all skipped records, since every request starts from the very first position and repeats scoring, sorting and so on.
  • Memory Intensive: Large offsets increase memory usage and slow down queries.
  • Unstable Pages: Unless you sort by a so-called "stable" field, meaning a field that will not receive new out-of-order values during the pagination operation, the same element may appear twice during pagination or be skipped entirely. This applies in particular when no sort is specified and results are ordered by score.
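To make the offset arithmetic concrete, here is a minimal sketch of a hypothetical helper that builds the request body for an arbitrary page number (the helper name and signature are illustrative, not part of any client library):

```python
# Hypothetical helper: builds the from/size request body for a given page.
def page_request(page, page_size, query=None):
    return {
        "from": page * page_size,          # offset = page number * page size
        "size": page_size,
        "query": query or {"match_all": {}},
    }

# Page 21 (zero-based page 20) of 50-document pages starts at offset 1000,
# matching the example request above.
body = page_request(20, 50)
print(body["from"])  # 1000
```

Every such request makes Elasticsearch score and sort the first from + size hits, only to discard the first from of them, which is why deep offsets get progressively more expensive.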

Using Scroll API

The Scroll API is designed for retrieving large numbers of documents efficiently. It maintains a snapshot of the results that is fetched batch by batch, which makes paging fully stable and avoids re-executing the query for every page.

To initiate a scroll, with a 1 minute timeout (1m):

GET my_index/_search?scroll=1m
{
  "size": 1000,
  "query": {
    "match_all": {}
  }
}

Retrieve the next batch:

GET _search/scroll
{
  "scroll": "1m", 
  "scroll_id": "DXF1ZXJ5QW5..."
}
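The overall control flow is: one initial search, then repeated scroll calls until an empty batch comes back. The sketch below demonstrates that loop against an in-memory stand-in client (FakeScrollClient is purely illustrative; a real client would issue the two HTTP requests shown above):

```python
# In-memory stand-in for a scroll session, used only to show the loop shape.
class FakeScrollClient:
    def __init__(self, docs, batch_size):
        self._docs, self._batch, self._pos = docs, batch_size, 0

    def search(self, scroll):          # initial request: _search?scroll=1m
        return self._next_batch()

    def scroll(self, scroll_id, scroll):  # follow-up: _search/scroll
        return self._next_batch()

    def _next_batch(self):
        batch = self._docs[self._pos:self._pos + self._batch]
        self._pos += self._batch
        return {"_scroll_id": "fake-id", "hits": {"hits": batch}}

def scroll_all(client):
    """Drain every document: initial search, then scroll until a batch is empty."""
    resp = client.search(scroll="1m")
    while resp["hits"]["hits"]:
        yield from resp["hits"]["hits"]
        resp = client.scroll(scroll_id=resp["_scroll_id"], scroll="1m")

docs = [{"_id": str(i)} for i in range(25)]
print(len(list(scroll_all(FakeScrollClient(docs, batch_size=10)))))  # 25
```

Note that each follow-up request passes both the scroll_id from the previous response and a fresh keep-alive value, which resets the timeout.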

Pros:

  • Efficient for bulk data retrieval.
  • Avoids re-executing the search query.

Cons:

  • Requires maintaining an open scroll session, which has performance and memory implications.
  • The scroll context has a defined timeout; once it expires, subsequent requests fail with an "expired scroll ID" error.
  • Not suitable for real-time pagination due to its resource usage.

Using Point in Time (PIT) API

The PIT API improves upon the Scroll API by creating a lightweight search context. A PIT is first created with POST my_index/_pit?keep_alive=1m, which returns the id used below. Note that the search request itself does not name an index in its path, since the target index is encoded in the PIT:

POST _search
{
  "size": 1000,
  "query": {
    "match_all": {}
  },
  "pit": {
    "id": "<pit_id>",
    "keep_alive": "1m"
  }
}

Pros:

  • More efficient than the Scroll API.
  • Suitable for paginated requests without keeping state.

Cons:

  • Requires explicitly managing PIT lifetimes.
  • Still holds resources on the cluster.
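A PIT's lifetime is managed explicitly: it is created before searching and should be deleted as soon as it is no longer needed, to release the resources it holds on the cluster. The lifecycle looks roughly as follows (with <pit_id> standing in for the id returned by the create call):

```
POST my_index/_pit?keep_alive=1m

... one or more _search requests referencing the pit id ...

DELETE _pit
{
  "id": "<pit_id>"
}
```

If a PIT is not deleted, it is discarded automatically once its keep_alive expires, but explicit cleanup frees resources sooner.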

Using search_after

search_after provides a stateless approach to deep paging: each request passes the sort values of the last hit of the previous page. The sort should define a total order, so it is advisable to include a tiebreaker field (when combined with a PIT, the implicit _shard_doc field serves this purpose).

GET my_index/_search
{
  "size": 100,
  "query": {
    "match_all": {}
  },
  "sort": [{ "timestamp": "asc" }],
  "search_after": ["2024-01-01T00:00:00"]
}
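The paging loop simply feeds each page's last sort value into the next request. The sketch below shows that loop against an in-memory stand-in (FakeSearchClient is illustrative only; a real client would send the request above over HTTP):

```python
def paginate(client, page_size):
    """Yield all hits by repeatedly passing the last hit's sort values."""
    search_after = None
    while True:
        body = {"size": page_size, "sort": [{"timestamp": "asc"}]}
        if search_after is not None:
            body["search_after"] = search_after
        hits = client.search(body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        search_after = hits[-1]["sort"]  # cursor for the next page

# In-memory stand-in: serves pre-sorted docs strictly after the given sort key.
class FakeSearchClient:
    def __init__(self, docs):
        self._docs = sorted(docs, key=lambda d: d["sort"])

    def search(self, body):
        after = body.get("search_after")
        remaining = [d for d in self._docs if after is None or d["sort"] > after]
        return {"hits": {"hits": remaining[:body["size"]]}}

docs = [{"_id": str(i), "sort": [i]} for i in range(7)]
print([h["_id"] for h in paginate(FakeSearchClient(docs), page_size=3)])
# ['0', '1', '2', '3', '4', '5', '6']
```

Because the cursor lives entirely in the client, the cluster keeps no per-client state between requests, which is what makes this approach cheap for real-time pagination.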

Pros:

  • Avoids performance issues of from and size.
  • Suitable for continuous pagination.
  • Provides "stable sorting" while still avoiding heavy resource usage.

Cons:

  • Requires consistent sorting.
  • No direct random access to pages.

Comparison of Deep Paging Methods

Method        Performance  Use Case                Statefulness
from & size   Poor         Small pagination        Stateless
Scroll API    Moderate     Bulk export             Stateful
PIT API       Good         Large-scale pagination  Semi-stateful
search_after  Best         Real-time paging        Stateless

Conclusion

Deep paging in Elasticsearch requires choosing the right method based on your use case. For bulk exports, the Scroll API is suitable. For real-time pagination, search_after is the best choice. The PIT API is a hybrid approach that balances efficiency and usability. Avoid using from and size for deep pagination whenever possible to maintain cluster performance.