Avoiding Elasticsearch Disk Watermark Errors

Disk watermarks in Elasticsearch are three thresholds (low, high, and flood stage) that help ensure there is enough disk space available for Elasticsearch to run properly. Disk watermark errors can lead to cluster failures. In this post we will discuss disk watermarks and how they affect your cluster.

The disk watermark feature in Elasticsearch helps prevent nodes from running out of disk space. Elasticsearch monitors disk usage on each node and takes action whenever one of the configured thresholds (or watermarks) is reached. There are three thresholds Elasticsearch monitors - low watermark, high watermark, and flood stage watermark. In this post we will discuss all three watermarks, how to avoid reaching them, and how to resolve a watermark issue should it occur.

Disk Watermark - Definitions

The disk watermarks act in stages that become progressively more restrictive on cluster actions as they are met. The low watermark is met first, followed by the high watermark, then finally the flood stage watermark as a last resort. The cluster first takes action at the low watermark, hoping to prevent further disk usage. If those actions do not stop disk usage from growing, the next watermark is breached and further actions are taken, until the last stop-gap actions are taken in the flood stage. Let’s review each of these stages.

Low Watermark

The low watermark is the disk usage threshold at which Elasticsearch first begins taking action to keep a node from reaching disk capacity. This defaults to 85% disk usage. Once a node has breached this watermark, Elasticsearch will no longer allocate new shards to that node, with one exception: primary shards of newly created indices can still be allocated to it, but their replicas cannot. This setting is configurable by updating the cluster.routing.allocation.disk.watermark.low parameter using the Cluster update settings API. Leaving the default value is recommended for most use cases.
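As a minimal sketch of what such a settings change could look like, the snippet below builds (but does not send) a Cluster update settings request using only the Python standard library. The base URL is a hypothetical local, unsecured cluster, and build_watermark_request is an illustrative helper name, not part of any Elasticsearch client. The same pattern applies to the high and flood stage settings.

```python
import json
import urllib.request

# Setting names for the three disk watermarks.
WATERMARK_SETTINGS = {
    "low": "cluster.routing.allocation.disk.watermark.low",
    "high": "cluster.routing.allocation.disk.watermark.high",
    "flood_stage": "cluster.routing.allocation.disk.watermark.flood_stage",
}


def build_watermark_request(base_url, stage, value):
    """Build (but do not send) a PUT _cluster/settings request for one watermark."""
    body = {"persistent": {WATERMARK_SETTINGS[stage]: value}}
    return urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Assumption: a local cluster without authentication; adjust for a real one.
request = build_watermark_request("http://localhost:9200", "low", "85%")
# To apply the change: urllib.request.urlopen(request)
```

Keeping the request construction separate from the network call makes it easy to inspect the body before anything is changed on the cluster.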

High Watermark

The high watermark represents the next stage, where more aggressive actions are taken to prevent reaching disk capacity. This defaults to 90% disk usage. Once a node has breached this watermark, Elasticsearch will attempt to relocate shards away from that node and onto others (assuming there are other nodes available that have not reached this threshold as well). Unlike the low disk watermark, the high disk watermark affects all shards, including primary shards for new indices. This setting is configurable by updating the cluster.routing.allocation.disk.watermark.high parameter using the Cluster update settings API. Leaving the default value is recommended for most use cases.

Flood Stage Watermark

The flood stage watermark represents the final stage, where the most aggressive actions are taken to prevent reaching disk capacity. This defaults to 95% disk usage. Once a node has breached this watermark, Elasticsearch will place a read-only block (index.blocks.read_only_allow_delete) on every index that has at least one shard on that node. Writes to those indices are rejected, but the indices can still be searched and deleted. In Elasticsearch 7.4 and later, this block is automatically removed once the node's disk usage drops back below the high watermark. This setting is configurable by updating the cluster.routing.allocation.disk.watermark.flood_stage parameter using the Cluster update settings API. Leaving the default value is recommended for most use cases.
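To make the staging concrete, here is a small illustrative function that maps a node's disk usage to the most severe watermark it has exceeded, using the default thresholds of 85%, 90%, and 95%. The function name and return labels are hypothetical and this is only a sketch of the decision order, not how Elasticsearch implements it internally.

```python
def watermark_stage(used_percent, low=85.0, high=90.0, flood_stage=95.0):
    """Return the most severe watermark that the given disk usage exceeds."""
    # Check the most restrictive threshold first so it takes precedence.
    if used_percent > flood_stage:
        return "flood_stage"  # indices with shards on the node are write-blocked
    if used_percent > high:
        return "high"         # shards are relocated away from the node
    if used_percent > low:
        return "low"          # no new shards, except primaries of new indices
    return "ok"


for usage in (80, 87, 92, 96):
    print(f"{usage}% used -> {watermark_stage(usage)}")
```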

Preventing Disk Watermark Errors

The best way to avoid a disk watermark error is to monitor the disk utilization of your cluster and take action before it occurs. If you know Elasticsearch will begin to handicap your cluster beginning at 85% disk utilization, why let it get there in the first place? You can set up monitors and alerts at lower thresholds like 75 or 80 percent. This way you have time to take action before Elasticsearch reaches the first watermark.
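One way such an early-warning check might look is sketched below. It assumes the input rows have the shape returned by GET _cat/allocation?format=json (one object per node with "node" and "disk.percent" fields); the sample rows and the helper name nodes_over_threshold are made up for illustration.

```python
def nodes_over_threshold(allocation_rows, alert_percent=80):
    """Return (node, percent) pairs at or above alert_percent, highest first."""
    flagged = []
    for row in allocation_rows:
        percent = row.get("disk.percent")
        if percent is None:
            # Rows without disk figures (e.g. unassigned shards) are skipped.
            continue
        if int(percent) >= alert_percent:
            flagged.append((row["node"], int(percent)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical sample data; real rows would come from the _cat/allocation API.
sample_rows = [
    {"node": "node-1", "disk.percent": "62"},
    {"node": "node-2", "disk.percent": "83"},
    {"node": "node-3", "disk.percent": "78"},
    {"node": "UNASSIGNED", "disk.percent": None},
]
alerts = nodes_over_threshold(sample_rows, alert_percent=75)
print(alerts)
```

With a 75% alert threshold, node-2 and node-3 are flagged while both are still below the 85% low watermark, which leaves time to act.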

Some actions you can take to avoid disk watermark errors include:

Utilize ILM Policies - If your cluster includes data that is only relevant for a given amount of time (e.g. time series data like application logs), you can use an ILM policy to automatically delete indices after some time.
Utilize Data Tiers - Data tiers allow you to have different nodes for different sets of data. You typically have your hot nodes which contain all of your most relevant, most searched data. Then you have your warm, cold or frozen nodes that contain less and less relevant data. With the data tiers you can use ILM like the suggestion above, but instead of deleting the indices, you can offload them to the next tier so the data is still searchable (while being less performant). This has the added benefit of giving you the ability to use less expensive hardware for other tiers, saving you money in the long run. Additionally, you can still automate data deletions in the end once the data has lost all usefulness.
Review and Update Shard Allocation - If you have six data nodes and your indices are configured to use one primary shard and one replica, all the data for an index will be allocated to only two of the six nodes. If your indices hold disparate amounts of data (e.g. one index having 50MB while another has 50GB), you can easily end up with one node storing the bulk of your data. If this is the case, you will want to update the shard configuration for your indices to better spread the data across all of your data nodes.
Use Best Practices for Data Storage - You should follow best practices like using best_compression on indices, disabling the _source when not needed, etc.
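As a rough illustration of the ILM suggestion above, the following sketch builds the body of an ILM policy that rolls indices over in the hot phase and deletes them after a retention period. The 7-day rollover and 30-day retention values are arbitrary assumptions, and build_ilm_policy is a hypothetical helper; the resulting JSON would be sent with PUT _ilm/policy/<policy-name>.

```python
import json


def build_ilm_policy(rollover_max_age="7d", delete_after="30d"):
    """Build an ILM policy body: roll over in the hot phase, then delete."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {"rollover": {"max_age": rollover_max_age}}
                },
                "delete": {
                    # min_age is measured from rollover for rolled-over indices.
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


# Assumed retention values; choose ones that match your own data.
policy_body = json.dumps(build_ilm_policy(), indent=2)
print(policy_body)
```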

If you are in the market for a monitoring and alerting solution that can help you identify and resolve issues like disk watermark errors, try our Pulse solution for Elasticsearch. Pulse keeps an eye on your cluster and gives actionable recommendations and timely alerts to avoid disk watermark errors and other common problems in Elasticsearch. Plus, you have the ability to tap into world-class Elasticsearch experts who can help you around the clock.

Resolving Disk Watermark

If despite your best efforts your cluster still receives disk watermark errors, don’t fret. There are some steps you can take to help resolve these errors, and remember Elasticsearch is configured to recover gracefully once disk usage is back below the watermarks. It will disable or roll back the actions it took when the watermarks were breached.

Some actions you can take to resolve disk watermark errors include:

Perform a Cluster Cleanup - There are times when we create indices for testing or as a backup for an upgrade, or for some other miscellaneous reason. Deleting these unnecessary indices and data can easily help free up space.
Reduce Replica Count - Though you may not want to remove replicas from high priority indices, you can reduce the number of replicas on lower priority indices, or remove them entirely, to free up space.
Add Storage Capacity - Most people don't want to go this route because it increases costs, but adding storage capacity will quickly resolve the issue. This can be done by adding storage to existing nodes or by adding new nodes to the cluster. Elasticsearch will rebalance shards when a new node is added, thereby reducing the disk usage on the existing nodes.
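For the replica suggestion above, a quick back-of-the-envelope sketch can help decide which indices are worth changing. It assumes each replica takes roughly as much disk as the primary data, which is only an approximation; both helper names are hypothetical, and the settings body would be sent with PUT /<index>/_settings.

```python
def replica_settings_body(replicas):
    """Body for an index settings update that changes the replica count."""
    return {"index": {"number_of_replicas": replicas}}


def estimated_space_freed_gb(primary_size_gb, current_replicas, new_replicas):
    """Rough estimate: each removed replica frees about one copy of the primaries."""
    removed = max(current_replicas - new_replicas, 0)
    return primary_size_gb * removed


# Hypothetical index with 50 GB of primary data and one replica.
freed = estimated_space_freed_gb(50, current_replicas=1, new_replicas=0)
print(f"Dropping to 0 replicas frees roughly {freed} GB cluster-wide")
print(replica_settings_body(0))
```

Keep in mind that an index with zero replicas has no redundancy, so this is best reserved for data that can be rebuilt or restored from a snapshot.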

Conclusion

The disk watermark mechanism is an excellent feature in Elasticsearch that allows it to continue servicing search requests even when free disk space runs low. By taking the necessary actions ahead of time, you should be able to avoid running into disk watermark errors and keep your cluster healthy. If you ever need help in this regard, feel free to reach out and try our Pulse solution today!