Should you use Elasticsearch and OpenSearch as a primary database? In this article we look into various use-cases where this is applicable, and present some gotchas you should consider, as well as scenarios in which it isn’t going to work well.
A heavily debated topic is whether you could use Elasticsearch as a primary datastore. This means having all writes go directly to Elasticsearch without persisting the data in a database first. Opponents of this approach have a straightforward argument: Elasticsearch isn’t a database and doesn’t provide the usual guarantees databases do, so don’t use it as one.
However, as the technology has evolved, so have use-cases. We have helped more than a few customers with maintaining clusters storing even petabytes of data natively on Elasticsearch without a main database to back it up.
With decades of combined Elasticsearch and OpenSearch experience, our consultants have worked on thousands of projects where this question was raised. In this article, we are going to look into various use-cases where using Elasticsearch as a primary datastore is applicable, and present some gotchas you should consider, as well as scenarios in which it isn’t going to work well.
We are going to discuss Elasticsearch, but everything in this article also applies to OpenSearch, as at its core it is the same codebase under a different name. For an OpenSearch vs Elasticsearch discussion, see here.
Any user’s foremost concern is deploying technology in a way that may cause outages or data loss. Data loss on Elasticsearch presents less of a risk for today’s users owing to multiple years of development effort from Elastic, but it’s always important to architect your solution correctly for optimal resilience. Whilst there have been some issues with resilience on Elasticsearch in the past, Elastic has since gone a long way in addressing them.
Regardless, whatever is acting as your single source of truth (SSoT) needs to be fault-tolerant, and Elasticsearch is no exception. In fact, when using Elasticsearch as your primary datastore, fault tolerance should be at the forefront of your mind, as without a database backend you’re more likely to suffer data loss. So make sure you deploy Elasticsearch or OpenSearch in a fault-tolerant architecture:
The cluster-based architecture of Elasticsearch offers some protection against unpredicted disasters. This can be further enhanced by increasing the number of nodes in a cluster or distributing the nodes across multiple availability zones (in the case of a cloud deployment) or racks (in the case of private clouds and other types of data centers). The more redundancy you have across data nodes, the better the protection index replication can offer against disasters.
Of course, having the cluster properly bootstrapped with the right number of master nodes is crucial. If you are on an earlier version of Elasticsearch (pre-7), make sure your master-nodes configuration is correct. On later versions the risk is lower, but still make sure you have deployed master nodes correctly and with the right capacity.
Disaster recovery via replicated clusters can also make sense. Usually, this is done in a completely different geographical location, and sometimes even on a different cloud provider. It can be achieved relatively easily by adding a completely autonomous ingestion pipeline into the other cluster(s) or using the cross-cluster replication (CCR) feature in Elasticsearch.
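As a small illustration of what "correct" means for pre-7 clusters: the setting in question is `discovery.zen.minimum_master_nodes`, which should equal a strict majority (quorum) of the master-eligible nodes. The sketch below only does that arithmetic; the function name is ours and is not part of any Elasticsearch client. From version 7 onwards the cluster manages the quorum itself and this setting no longer applies.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Return the quorum (strict majority) for a count of master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("a cluster needs at least one master-eligible node")
    return master_eligible // 2 + 1


# Print the quorum for a few common cluster sizes.
for nodes in (1, 2, 3, 5):
    quorum = minimum_master_nodes(nodes)
    print(f"{nodes} master-eligible node(s) -> quorum of {quorum}")
```

With two master-eligible nodes the quorum is also two, so losing either node blocks master election. That is why three master-eligible nodes is the usual minimum for a resilient cluster.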
Using the right kind of storage is key. For performance reasons, locally attached storage is the best and often recommended (such as the i-instance class on AWS), although those storage classes on cloud environments are ephemeral. If the data node goes away for any reason, the data on the disk will disappear as well. When a new node comes up you may find your data to be missing.
For full fault tolerance, use network storage (such as EBS on AWS). This ensures that if anything happens to the node, its replacement can just attach to the disk and continue from where the previous one left off. No data loss.
If, however, you do decide to go with locally attached storage, you will need a highly replicated cluster. Replicating your indices across as many cluster nodes as possible reduces the chance of data loss when one or several nodes shut down at once. Adding more availability zones may come in useful here as well.
As a side-note, cloud servers with locally attached disks are often cheaper than the server and networked storage of a similar capacity.
Regardless, always make sure you have frequent backups.
There is a native backup capability for Elasticsearch, known as Snapshot and Restore. Snapshot operations are incremental and rather efficient so it is feasible and even recommended to take snapshots of your data frequently, much more than just once a day, for clusters with critical data.
It is also possible to make your system more fault-tolerant by hosting your snapshot repository on separate infrastructure from the production cluster (such as a different cloud object store for multi-cloud resilience). Object stores like S3 and Google Cloud Storage are incredibly safe and cost-efficient for this.
When in Rome do as the Romans do. When using Elasticsearch, use it as a document-store and don’t try to force usage patterns that are not appropriate. No single technology is a silver bullet, and Elasticsearch will only function as your SSoT if you keep its usage close to what it was originally intended for and respect its inherent design decisions.
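To make the snapshot setup more concrete, here is a minimal sketch that builds the two request bodies typically involved: registering an S3 snapshot repository and defining a Snapshot Lifecycle Management (SLM) policy that snapshots every 30 minutes. The repository name, bucket, policy name, and retention values are hypothetical placeholders, and the code only prints the requests rather than sending them. OpenSearch offers a comparable snapshot management feature with a different API, so treat this as Elasticsearch-specific.

```python
import json


def s3_repository_body(bucket: str, base_path: str) -> dict:
    """Body for PUT /_snapshot/<repository> using the S3 repository type."""
    return {"type": "s3", "settings": {"bucket": bucket, "base_path": base_path}}


def slm_policy_body(repository: str, schedule: str, expire_after: str) -> dict:
    """Body for PUT /_slm/policy/<policy_id>."""
    return {
        "schedule": schedule,           # cron expression, seconds field first
        "name": "<snap-{now/d}>",       # date-math pattern for snapshot names
        "repository": repository,
        "config": {"indices": ["*"], "include_global_state": False},
        "retention": {"expire_after": expire_after, "min_count": 5, "max_count": 500},
    }


# Hypothetical names; replace with your own repository, bucket, and policy id.
requests = [
    ("PUT", "/_snapshot/offsite_backups",
     s3_repository_body("example-backup-bucket", "es-prod")),
    ("PUT", "/_slm/policy/frequent-snapshots",
     slm_policy_body("offsite_backups", "0 0/30 * * * ?", "30d")),
]

for method, path, body in requests:
    print(method, path)
    print(json.dumps(body, indent=2))
```

Because snapshots are incremental, a short schedule like this mostly copies segments created since the previous run, which is what makes frequent snapshots practical.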
Elasticsearch by design has a strong preference for append-only data. This means that the original and existing data is more or less immutable, and any new data that is written is merely appended.
Examples of append-only data include logs, metrics, and sensor data, which goes some way to showing why Elasticsearch is so commonly used as a monitoring and observability platform foundation.
That is not to say you can’t update or delete data on Elasticsearch. But if your data model requires frequent updates and deletes across the entire data set, Elasticsearch might not be the right fit for you. If, however, you only need to occasionally delete or update some documents, that is definitely doable.
The more relational your data is, the harder you will have to work to make it suitable for Elasticsearch. Elasticsearch doesn’t support table joins, so you may find yourself with multiple views of your data to accommodate more complex queries. This can be cumbersome and become resource-intensive.
Features like nested documents and join fields (previously known as parent-child) are available to assist with some relational data models, but in our experience, they fail to operate efficiently at scale, so should be used with caution.
This is why Elasticsearch is often used with an ACID-compliant database on the backend which acts as the single source of truth for the entire system. It’s important to remember that Elasticsearch is an OLAP database, not an OLTP database, as it doesn’t support transactions and doesn’t have the required consistency guarantees.
Elasticsearch does not support transactions by design because it is distributed, asynchronous, and concurrent, and favors speed and efficiency over correctness. It is still possible to use a control mechanism known as optimistic concurrency to try and avoid the inconsistencies caused by Elasticsearch’s inherent architecture. Optimistic concurrency assigns a unique, incrementing sequence number to every change committed to a document. This means the receiving node can make sure that changes are applied to a document in the correct order, and can reject a write that was based on a stale version rather than silently overwriting newer data.
Elasticsearch works great when queried for aggregations over data, or for small result sets based on various sort orders (the most popular one being document ranking). It is less suitable for operations requiring frequent access to the entire dataset, or very large result sets (let's say over 100-200 results per query). While there are ways to export data from Elasticsearch, it's not meant to be used this way during normal operation, so if this is what your use-case entails, it's worth reconsidering.
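The mechanism is easier to see in a toy model. The sketch below is an in-memory simulation, not Elasticsearch client code: `TinyStore` and `VersionConflict` are names we made up, and the model tracks a single sequence number while ignoring the primary term. In Elasticsearch itself the equivalent check is requested with the `if_seq_no` and `if_primary_term` parameters on a write, and a stale write is rejected with an HTTP 409 conflict.

```python
from typing import Optional


class VersionConflict(Exception):
    """Raised when a write is based on a stale sequence number."""


class TinyStore:
    """Toy document store that stamps every change with a sequence number."""

    def __init__(self) -> None:
        self._docs = {}      # doc_id -> (seq_no, source)
        self._next_seq = 0

    def get(self, doc_id: str):
        seq_no, source = self._docs[doc_id]
        return seq_no, dict(source)

    def index(self, doc_id: str, source: dict, if_seq_no: Optional[int] = None) -> int:
        # Conditional write: only proceed if the caller saw the latest change.
        if if_seq_no is not None:
            current = self._docs.get(doc_id)
            current_seq = current[0] if current is not None else None
            if current_seq != if_seq_no:
                raise VersionConflict(f"expected seq_no {if_seq_no}, found {current_seq}")
        seq_no = self._next_seq
        self._next_seq += 1
        self._docs[doc_id] = (seq_no, dict(source))
        return seq_no


store = TinyStore()
store.index("item-1", {"stock": 100})

# Two writers read the same version of the document.
seq_a, doc_a = store.get("item-1")
seq_b, doc_b = store.get("item-1")

# Writer A commits first, which advances the sequence number.
store.index("item-1", {"stock": doc_a["stock"] - 30}, if_seq_no=seq_a)

# Writer B's write is now stale: it must re-read and retry.
try:
    store.index("item-1", {"stock": doc_b["stock"] - 50}, if_seq_no=seq_b)
except VersionConflict:
    seq_b, doc_b = store.get("item-1")
    store.index("item-1", {"stock": doc_b["stock"] - 50}, if_seq_no=seq_b)

print(store.get("item-1")[1])
```

Without the conditional check, writer B would have overwritten writer A's change and the stock would have ended at 50 instead of 20. Note that this protects a single document only; it does not give you multi-document transactions.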
One of the biggest constraints of using Elasticsearch without a database is that it requires pre-defined indexes and schemas. Whilst a database may be able to validate a migration before making alterations permanent, you don’t have that luxury with Elasticsearch.
Elasticsearch will automatically identify numerical, date/time, and boolean field mappings, so it is advertised as "schema-less", but in fact setting an explicit schema up front will make your life much easier and will help you avoid many issues down the road. Once an index mapping has been set up, you can add fields to it (but with no backfilling without reindexing, see below), but it isn’t possible to rename or delete a field.
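As an illustration of an explicit schema, the following sketch builds a create-index request body with `dynamic` set to `strict`, so documents carrying unmapped fields are rejected instead of silently extending the mapping. The index name and field names are hypothetical examples for a log-like dataset, and nothing is sent to a cluster.

```python
import json


def explicit_mapping() -> dict:
    """Body for PUT /<index> with an explicit, strict mapping."""
    return {
        "mappings": {
            "dynamic": "strict",  # reject documents containing unmapped fields
            "properties": {
                "@timestamp": {"type": "date"},
                "service": {"type": "keyword"},   # exact-match filtering and aggregations
                "message": {"type": "text"},      # analyzed for full-text search
                "duration_ms": {"type": "long"},
            },
        }
    }


# Hypothetical index name; the body is what matters here.
print("PUT /app-logs-000001")
print(json.dumps(explicit_mapping(), indent=2))
```

Choosing `keyword` versus `text` up front matters because, as described above, changing a field's type later means reindexing.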
Traditionally, when using Elasticsearch with a backing database, you would create new indices to act as a view on top of the database. This view would reflect the updated data schema and you would be able to switch between the original and updated as required. This is much harder to achieve when using Elasticsearch as your primary datastore. It is possible to mimic this capability with append-only data in Elasticsearch, as discussed above.
In Elasticsearch, there is no ALTER TABLE. Once you have defined a field, you are committed to keeping it. Schema planning is critical with Elasticsearch, especially when you plan on using it as your primary datastore.
For example, if you define a field as a string in your schema, continue indexing data, and then redefine it as something else (such as a numeric type), you’ll likely have to reindex all of the historical data. Therefore, whilst Elasticsearch can accommodate a flexible schema, when you are using it as a primary datastore it’s advisable to maintain a standard schema and minimize field-mapping changes wherever possible.
The Reindex API is a handy tool to get familiar with. It is the right way to perform data and schema migrations in Elasticsearch, especially when it's being used as your SSoT. The Reindex API has a couple of other common use cases as well, such as cluster-to-cluster migration.
If you are using the API to reindex a particularly large index, then this could take a while. The Reindex API supports the use of Sliced Scroll, which can break down the reindexing task into more manageable chunks.
Lastly, unless you have planned acceptable downtime, your Elasticsearch application will be carrying out the reindexing and both ongoing and incoming queries simultaneously. This will be demanding on memory, and without proper planning and expertise you may end up with a performance bottleneck.
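A migration along these lines can be sketched as two requests: a sliced, asynchronous reindex into a new index, followed by an atomic alias swap so that clients move to the new index in one step. The alias swap is not described above; it is a common companion step we are assuming here. Index and alias names are hypothetical, and the code only builds and prints the requests.

```python
import json


def reindex_request(source: str, dest: str, slices: str = "auto"):
    """POST /_reindex, sliced for parallelism and run as a background task."""
    path = f"/_reindex?slices={slices}&wait_for_completion=false"
    body = {"source": {"index": source}, "dest": {"index": dest}}
    return "POST", path, body


def alias_swap_request(alias: str, old_index: str, new_index: str):
    """POST /_aliases, moving an alias from the old index to the new one atomically."""
    body = {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }
    return "POST", "/_aliases", body


# Hypothetical names: the application reads and writes through the alias "orders".
steps = [
    reindex_request("orders-v1", "orders-v2"),
    alias_swap_request("orders", "orders-v1", "orders-v2"),
]

for method, path, body in steps:
    print(method, path)
    print(json.dumps(body, indent=2))
```

With `wait_for_completion=false` the reindex call returns a task id immediately, so the swap should only be issued once that task has finished. Writes arriving at the old index during the copy are not carried over automatically, which is one more reason to plan the migration window carefully.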
If you plan on using Elasticsearch as your primary datastore, then you need to have a good plan in place for retention.
Because Elasticsearch favors append-only data and you won’t have a database to offload to, the volumes of data being managed by Elasticsearch can quickly escalate. Not only will this degrade your cluster’s performance, but it will also bring unexpected costs.
Because Elasticsearch isn’t great for deleting data, you might want to consider introducing data tiers so that older or less frequently used data sits on a cheaper, less performant tier. Fortunately, all of the major cloud providers have a variety of warm to cold data storage options available.
You should also consider using index tiering, splitting up indices by either dates or data streams using Index Lifecycle Management. This tool allows you to set policies around index retention based on your own bespoke requirements, and it's a great way to reduce costs while not giving up on data retention requirements.
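For a rough idea of what such a policy looks like, here is a sketch that builds an Index Lifecycle Management policy body with three phases: roll over in the hot phase, compact in the warm phase, and delete after the retention period. All thresholds and the policy name are illustrative assumptions, not recommendations. OpenSearch provides a similar feature called Index State Management, which uses a different policy syntax.

```python
import json


def ilm_policy(rollover_age: str = "1d", warm_after: str = "7d",
               delete_after: str = "90d") -> dict:
    """Body for PUT /_ilm/policy/<name> with hot, warm, and delete phases."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        # Start a new index daily or when a primary shard grows large.
                        "rollover": {"max_age": rollover_age,
                                     "max_primary_shard_size": "50gb"}
                    }
                },
                "warm": {
                    "min_age": warm_after,
                    # Older indices are no longer written to, so compact them.
                    "actions": {"forcemerge": {"max_num_segments": 1}},
                },
                "delete": {
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


# Hypothetical policy name.
print("PUT /_ilm/policy/logs-retention")
print(json.dumps(ilm_policy(), indent=2))
```

Deleting a whole index at the end of its life is far cheaper than deleting individual documents, which is why time-based indices and retention policies fit together so well.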
Now that we’ve covered what to look out for when deploying Elasticsearch without a backing database, we’re going to look at when it’s suitable and recommended to do so.
We recommend considering Elasticsearch as a primary datastore when it is used to persist and query data coming in from append-only data streams, such as logs, metrics, IoT sensor events, and so on. In those cases it's redundant to have yet another database, as Elasticsearch is able to sustain the load and possibly be even more efficient in handling it.
We do recommend using a message broker, such as Apache Kafka or Apache Pulsar, as a gateway for incoming data so that writes to Elasticsearch are made at a constant, predictable rate and the Elasticsearch cluster is not overwhelmed in case of surges.
To avoid a forever-growing cluster, it is important to have a well-defined data retention policy and delete old data on time, possibly putting it on backup or colder tiers before deleting it. Without such a retention policy, you risk losing control of a fast-growing datastore and of the costs that pile up with it.
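One way the consumer side of such a gateway might look is sketched below: events pulled from the broker are grouped into fixed-size batches, and each batch is serialized into the newline-delimited JSON body expected by the Bulk API. Batching is our assumption about how the predictable rate is achieved; the broker client is left out entirely, and the index name and event shape are hypothetical.

```python
import json
from typing import Iterable, Iterator, List


def batches(events: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group an event stream into lists of at most `size` events."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    batch: List[dict] = []
    for event in events:
        batch.append(event)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller batch
        yield batch


def bulk_body(index: str, docs: List[dict]) -> str:
    """Serialize documents as an NDJSON body for POST /_bulk."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                           # document source
    return "\n".join(lines) + "\n"  # the Bulk API requires a trailing newline


# Stand-in for events consumed from a broker topic.
events = ({"sensor": "s1", "reading": n} for n in range(5))

for batch in batches(events, size=2):
    print(bulk_body("sensor-events", batch), end="")
    print("---")
```

In a real consumer, broker offsets would typically be committed only after the bulk request succeeds, so that a failed write is retried rather than lost.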
In this piece, we have looked at the challenges of using Elasticsearch as your primary datastore, as well as some instances in which it might work well. Below are two examples of scenarios in which you should never use Elasticsearch as your single source of truth.
Don't use Elasticsearch as your only datastore when there are strict transactional requirements and optimistic concurrency isn’t appropriate. This is common in financial services or for some highly concurrent CRUD interfaces with strict accuracy requirements, and for those you might find yourself needing transactional storage.
There are other instances where relaxed assumptions and optimistic concurrency aren’t suitable. These may be dictated by compliance, for example, where not enough certainty can be given about the validity of the data. It may not always be acceptable to adopt the principle of “last write wins” in a high-frequency scenario. In these instances, don’t use Elasticsearch as your single source of truth.
If you are using Elasticsearch for artificial intelligence or machine learning workloads, you may find that you have a very complex data model. Because Elasticsearch does not support joins, with a complex data model that can’t be flattened you might find yourself amending or further duplicating models on the serving layer or view.
This is a surefire way of degrading the performance of your Elasticsearch cluster very quickly, and if performance is why you selected Elasticsearch in the first place, that doesn’t make much sense.
It’s fair to say that as event streaming and observability grow in popularity, the requirement to deploy Elasticsearch as a primary datastore will only increase. For those looking for highly performant analytics engines for write-heavy workloads, it’s a good choice. However, as outlined in this article, it’s not without its complications.
That’s where Pulse comes in. Pulse is an intelligent and automated platform designed to support and maintain the most complex of Elasticsearch environments. Powered by decades of cumulative experience, Pulse acts like your auto-pilot and automated consultant for keeping your clusters always in check. If you’re choosing to use Elasticsearch or OpenSearch as your single source of truth, choose Pulse to keep it truthful. Book a demo here.