Exporting data from Elasticsearch or OpenSearch, often referred to as "dumping data", is required for various purposes: taking backups, or moving data between systems - for example, loading data stored in Elasticsearch into Spark for batch processing. This post describes how that can be done.
For backup purposes, you should use the built-in Snapshot/Restore API. It is by far the easiest and most efficient way to take backups and restore from them. Some managed Elasticsearch offerings impose limitations around it, but it is still the best option for performing backups.
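As a rough sketch of what that looks like over the REST API, using Python's requests library (the repository name, filesystem location, and cluster URL below are placeholders - managed offerings typically come with a preconfigured repository):

```python
# Minimal sketch: register a snapshot repository and take a snapshot
# via the REST API. Assumes a local cluster whose nodes have the
# repository path whitelisted via the path.repo setting.
import requests

ES = "http://localhost:9200"

# One-time setup: register a shared-filesystem snapshot repository.
requests.put(
    f"{ES}/_snapshot/my_backup",
    json={"type": "fs", "settings": {"location": "/mnt/backups"}},
).raise_for_status()

# Snapshot all indices and block until the snapshot completes.
requests.put(
    f"{ES}/_snapshot/my_backup/snapshot_1",
    params={"wait_for_completion": "true"},
).raise_for_status()
```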
But sometimes you need to dump data from Elasticsearch into, say, JSON format and then load it into other systems - for example into Spark for batch processing, or even into a different Elasticsearch version running in an entirely separate environment. What then?
The Scroll API and its successor, the PIT (point in time) search API, are the way to go. They offer a way to read the results of any search query (or an entire index, or multiple indices) efficiently and without skipping any results. They even support a "slicing" approach that allows reading data in parallel from multiple consumers, thus speeding up the process.
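As a minimal sketch (the cluster URL, index name, and slice count are placeholders), here is how one slice of an index could be exported to newline-delimited JSON with the official Python client's scroll helper; running one such process per slice parallelizes the export:

```python
# Sketch: dump one slice of an index to newline-delimited JSON using
# the scroll helper of the official Python client (elasticsearch-py).
import json
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("http://localhost:9200")

SLICE_ID = 0    # this worker's slice, 0 .. MAX_SLICES - 1
MAX_SLICES = 4  # run one worker per slice to export in parallel

with open(f"my-index.slice{SLICE_ID}.ndjson", "w") as out:
    for hit in scan(
        es,
        index="my-index",
        query={
            "slice": {"id": SLICE_ID, "max": MAX_SLICES},
            "query": {"match_all": {}},
        },
        size=1000,    # documents per scroll page
        scroll="5m",  # keep the scroll context alive between pages
    ):
        out.write(json.dumps(hit["_source"]) + "\n")
```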
There are a handful of utilities that make exporting data from Elasticsearch / OpenSearch a breeze:
- Elastician is a dockerized utility written in Python by our experts, built and optimized to support the many data export and import use-cases we have seen over the years with our customers. Elastician also supports exporting in slices, so a single instance running on a multi-core machine can export data faster.
- ElasticDump is an actively maintained tool written in JavaScript which fully supports OpenSearch as well, and can write to AWS S3 destinations.
- Using Logstash - you can use Logstash's Elasticsearch input to feed data into Logstash and then route it to any of Logstash's many output destinations (even more than one of them); see the sample pipeline after this list. This can be much easier to set up, especially for shops which already run Logstash in their stack. It is probably going to be the slowest and most resource-consuming method, however.
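As a sketch of the Logstash approach (the cluster address, index name, and output path are placeholders), a minimal pipeline reading from the elasticsearch input and writing newline-delimited JSON with the file output could look roughly like this:

```
input {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "my-index"
    query => '{ "query": { "match_all": {} } }'
  }
}

output {
  file {
    path => "/tmp/my-index-export.ndjson"
    codec => json_lines
  }
}
```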
A word of advice: Elasticsearch (and OpenSearch, too) weren't designed to support frequent full data exports. The Snapshot/Restore API can be used for taking frequent backups, but the other APIs mentioned here shouldn't be part of your normal operation with Elasticsearch. Exporting data from Elasticsearch may take a significant amount of time, even with parallelism, and will consume a non-negligible amount of cluster resources.