An Elasticsearch cluster with unassigned shards is not a healthy cluster. In this tip we explain why, and how to fix that situation.
An optimal Elasticsearch (or OpenSearch) cluster would be running smoothly, efficiently managing data and responding to queries. In some cases shards might become unassigned and disrupt your cluster’s health. Detecting and handling unassigned shards, and especially preventing them from ever appearing, is essential for maintaining a robust Elasticsearch environment.
In this power-tip we will discuss the causes, identification, and solutions for shards that are unassigned in Elasticsearch and OpenSearch clusters, to keep them performing optimally at all times.
What are Unassigned Shards in Elasticsearch?
Unassigned shards are shards that are not allocated to any data node within the Elasticsearch cluster. They may fail to be assigned or to initialize for reasons such as a node hosting the shard leaving the cluster, disk space running out, or an existing index being restored from a snapshot.
This state of shards can impact the performance and health of your Elasticsearch service. It's crucial to comprehend the roles of primary and replica shards in Elasticsearch and the factors that can result in unassigned shards to address them effectively.
Primary and Replica Shards
Shards are the pieces of indexes containing actual data. Primary shards are used to hold part of the data stored in an index, while replica shards are copies of primary shards located on different nodes to ensure access to data in the event of a node failure. Elasticsearch automatically migrates shards as the cluster grows or shrinks to rebalance the cluster.
In short:
- An index must have at least one shard, and it can of course have many shards.
- Primary shards contain the indexed data and its source.
- When data is written to Elasticsearch, it's written to a primary shard.
- Replica shards serve as copies of the primary shards, providing redundancy and failover.
- Having at least one replica shard for each primary shard is advisable for redundancy and often also performance. But it's not a must.
- Having at least one primary shard of an index is a requirement for a healthy index.
- Effectively, having an unassigned primary shard is a significantly more severe situation than having an unassigned replica shard.
An unassigned primary shard is a significant cluster event. It means data is missing from the cluster, and if the situation doesn't resolve itself, it may lead to real data loss. There is a lot you can do to avoid this situation, and not much that can be done once a primary shard is already unassigned. More on that later.
Unassigned replica shards are usually a very minor event. More often than not they are the result of a misconfiguration, such as setting the wrong number of replicas for an index. However, if the configuration was verified and is correct, unassigned replica shards mean the cluster cannot achieve the replication factor it was configured for, and it is raising a flag to let us know about it.
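A quick way to tell which of the two situations you are in is the cluster health API: a red status means at least one primary shard is unassigned, while yellow means all primaries are assigned but some replicas are not. A minimal check, assuming the cluster is reachable at localhost:9200 without authentication:

```
# Cluster-level health: "status" is green, yellow or red,
# and "unassigned_shards" counts shards with no node.
curl -s "localhost:9200/_cluster/health?pretty"

# Per-index health, useful for finding which index is yellow or red.
curl -s "localhost:9200/_cluster/health?level=indices&pretty"
```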
Main Reasons for Unassigned Shards
There are quite a few reasons for shards to become unassigned (or never be assigned to a node in the first place). Here are the main causes:
Insufficient Nodes or Misconfigured Replicas
When there are not enough nodes in the cluster to properly distribute the shards, or when replicas are misconfigured for an index, the cluster will complain about unassigned shards. Those will be unassigned replicas. In these cases, add more data nodes to the cluster or adjust the index replica settings so that the replica shards can be assigned again.
Changing the replica count for an existing index is easy and also a safe operation, as long as you keep the index replicated (excluding the single-node cluster case, of course).
Ensuring your cluster has a sufficient number of nodes and suitably configured replicas is necessary for avoiding unassigned shards and keeping Elasticsearch nodes healthy.
Having the wrong value for the index.number_of_replicas setting is a very common mistake, causing unassigned replica shards and a yellow cluster state. Note that this is a per-index setting, not part of the cluster settings.
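As a sketch of fixing that setting (my-index is a placeholder, and the cluster is assumed to be reachable at localhost:9200 without authentication):

```
# Set the number of replicas for a single index (my-index is a placeholder).
curl -s -X PUT "localhost:9200/my-index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{ "index": { "number_of_replicas": 1 } }'

# Or apply the same setting to every index in the cluster.
curl -s -X PUT "localhost:9200/_all/_settings" \
  -H 'Content-Type: application/json' \
  -d '{ "index": { "number_of_replicas": 1 } }'
```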
Disabled Shard Allocation
Shard allocation in Elasticsearch is the process by which Elasticsearch decides which unallocated shards should be assigned to which nodes in the cluster. When shard allocation is disabled, this process is deactivated: new or recovering shards stay unassigned even when there are nodes available to host them.
To resolve unassigned shards caused by this, re-enable shard allocation by sending a request to the _cluster/settings API endpoint. This will allow Elasticsearch to allocate shards according to its allocation algorithm, ensuring that your cluster remains healthy and functional.
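A minimal sketch of re-enabling allocation for all shard types, assuming it was previously disabled through the same setting:

```
# Re-enable allocation for all shards (primaries and replicas).
curl -s -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{ "persistent": { "cluster.routing.allocation.enable": "all" } }'
```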
Node Failures and Data Loss
Node failures and data loss can also lead to unassigned shards. In such cases, you can utilize the following methods to address the issue:
- Delayed allocation: delay the re-allocation of replica shards that become unassigned because a node left the cluster. You can use the index.unassigned.node_left.delayed_timeout setting to configure the delay (see the sketch after this list).
- Reindex missing data: If data is missing from the original data source, you can reindex it to restore the missing shards.
- Restore from snapshot: If the affected index has a previous snapshot, you can restore it to recover the unassigned shards.
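For example, the delayed-allocation timeout mentioned above can be raised so that a node restarting briefly does not trigger an unnecessary shard shuffle. A sketch, with my-index as a placeholder and the 10-minute delay chosen arbitrarily:

```
# Delay replica re-allocation for 10 minutes after a node leaves,
# giving the node time to rejoin before shards are copied elsewhere.
curl -s -X PUT "localhost:9200/my-index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{ "index.unassigned.node_left.delayed_timeout": "10m" }'
```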
These strategies help you resolve unassigned shards caused by node failures and data loss while minimizing the impact on your cluster's health and on the stability of its data nodes.
Disk Space Issues
Being short on storage can also contribute to unassigned shards in an Elasticsearch cluster. Monitoring disk usage and adjusting the related settings helps prevent unassigned shards and maintain overall cluster health.
Elasticsearch has disk watermarks in place to guarantee adequate disk space across all nodes. However, if these thresholds are surpassed or if a node fails, it can result in unassigned shards and failed allocation attempts.
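The defaults are roughly 85% disk usage (low), 90% (high) and 95% (flood stage), and they can be changed through the cluster settings API. A sketch, where the percentages are illustrative rather than a recommendation:

```
# Adjust the disk watermarks (values here are only illustrative).
curl -s -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{
    "persistent": {
      "cluster.routing.allocation.disk.watermark.low": "85%",
      "cluster.routing.allocation.disk.watermark.high": "90%",
      "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
    }
  }'
```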
Monitoring disk space and ensuring that nodes have adequate resources is necessary to prevent unassigned shards; in some cases, adding more nodes or increasing storage space may be required. Understanding the factors that lead to unassigned shards enables you to take preventive measures and maintain a healthy Elasticsearch cluster.
This might be obvious, but you can sometimes delete indices that are not required anymore to free up some disk space.
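To see how much disk each data node is actually using, and to reclaim space from an index you no longer need (the index name below is a placeholder), something like this would do:

```
# Disk usage and shard counts per data node.
curl -s "localhost:9200/_cat/allocation?v"

# Delete an index that is no longer needed (old-logs-2023.01 is a placeholder).
curl -s -X DELETE "localhost:9200/old-logs-2023.01"
```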
And one last tip: if your service logs (and Elasticsearch logs in particular) are written to the same disk where the data is stored, make sure the logs are rotated and deleted frequently, and ideally move them to a separate disk, to avoid this kind of issue on any particular node.
Identifying and Analyzing Unassigned Shards
Elasticsearch provides APIs that can help you locate unassigned shards and understand the reasons behind their unassigned state.
The _cat/shards API in Elasticsearch allows you to list and obtain information about the shards in your cluster, including unassigned ones. To use it, send a GET request to the _cat/shards API endpoint from Dev Tools or curl and look for shards in the UNASSIGNED state.
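For example, the following limits the output to the columns that matter for this investigation (assuming a reasonably recent Elasticsearch or OpenSearch version where the unassigned.reason column is available):

```
# List all shards with their state and the reason they are unassigned.
curl -s "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state"

# Or filter down to only the UNASSIGNED ones.
curl -s "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED
```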
The Cluster Allocation Explain API is another powerful tool for diagnosing unassigned shards in Elasticsearch. This API provides insights into the reasons behind unassigned shards and suggests potential solutions. To use this API, simply send a request to the _cluster/allocation/explain endpoint.
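Called with no body it explains the first unassigned shard it finds; you can also point it at a specific shard. A sketch, with my-index as a placeholder:

```
# Explain the first unassigned shard Elasticsearch finds.
curl -s "localhost:9200/_cluster/allocation/explain?pretty"

# Or ask about a specific shard (my-index is a placeholder).
curl -s -X GET "localhost:9200/_cluster/allocation/explain?pretty" \
  -H 'Content-Type: application/json' \
  -d '{ "index": "my-index", "shard": 0, "primary": true }'
```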
How to Fix Unassigned Shards?
Fixing an unassigned shard situation is either very easy or very hard.
Unassigned replica shards are easy to fix: just add nodes or update the replica configuration for the affected index(es).
If any primary shards are unassigned, you need to find the node holding the disk where those shard files exist; it could have crashed or gone away entirely. Recovering such shards has to be done either from the disk itself or from a snapshot. Unfortunately there is no easy route here, and no API that can fix it.
Preventing Unassigned Shards in Elasticsearch and OpenSearch
Preventing unassigned shards and maintaining good cluster health requires optimizing and monitoring your Elasticsearch cluster. We recommend following the below set of rules for any production-grade Elasticsearch or OpenSearch cluster:
- Deploy at least 2 data nodes with dedicated master nodes.
- Make sure the index.number_of_replicas setting for every index in the cluster is larger than zero, and no higher than the number of data nodes minus one. A replica cannot live on the same node as its primary, so a higher value will always leave some replicas unassigned.
- Maintain at least 75% free disk space on every data node at all times.
- Use Index Lifecycle Management (ILM) in Elasticsearch (or the equivalent in OpenSearch) to automate index management and optimize resource utilization according to predefined rules. By defining lifecycle stages for indices, such as hot, warm, and cold, ILM lets you manage indices based on their importance and age, remove old indices and free up storage (see the policy sketch after this list).
- Use a trustworthy Elasticsearch monitoring and alerting tool to keep an eye on disk usage, shards status and pressure on nodes in various aspects.
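To illustrate the ILM point above, here is a minimal policy sketch that rolls over indices while they are hot and deletes them after 30 days; the policy name and all thresholds are placeholders, not recommendations:

```
# A minimal ILM policy: roll over hot indices, delete them after 30 days.
curl -s -X PUT "localhost:9200/_ilm/policy/logs-cleanup-policy" \
  -H 'Content-Type: application/json' \
  -d '{
    "policy": {
      "phases": {
        "hot":    { "actions": { "rollover": { "max_age": "7d", "max_size": "50gb" } } },
        "delete": { "min_age": "30d", "actions": { "delete": {} } }
      }
    }
  }'
```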