Poor query performance in Elasticsearch and OpenSearch is likely the top complaint we see. Here is how to go about fixing it.
Something as simple as a slow Elasticsearch query can quickly cause significant complications, affecting everything from Elasticsearch cluster stability and your customers' experiences to your cloud costs.
That’s why it’s crucial to monitor Elasticsearch queries, to identify slow ones and find ways to optimize system performance. That might sound complex, but it doesn’t have to be. In this blog, we explain how to detect, analyze, and resolve slowness in your Elasticsearch clusters.
But before we turn to the solution, it’s essential to understand the source of the problem. So, let’s start by exploring the potential causes of slow queries. Also, please note that although this blog focuses on Elasticsearch, everything here applies to OpenSearch too.
What Causes Slow Elasticsearch Queries?
The most common causes of slow queries on an Elasticsearch cluster include the following issues:
- Expensive Queries - Any query that requires excessive CPU or memory utilization can be considered expensive and can potentially wreak havoc on your cluster. Be particularly cautious with Elasticsearch's n-gram analyzer, as discussed here.
- Bad Cluster Performance - Sometimes it's not your queries that are at fault, but your Elasticsearch cluster that is having a hard time. A high indexing workload, for example, can affect query latency if it causes consistently high CPU usage or disk I/O across the entire cluster or on specific data nodes, with knock-on effects on search speed. In such cases, the slowness does not stem from a problem with the query itself but is a symptom of overall cluster behavior. Review your cluster performance metrics, including CPU utilization, JVM heap usage, and garbage collection, to analyze this correctly.
- Uncacheable Queries - Some common mistakes can make your queries uncacheable and can significantly impact cluster performance. It is often possible to boost query performance without changing search results by applying simple changes to your search request and query DSL, for example by moving clauses into the filter context.
- Overloaded Search Thread Pools - Incorrect Elasticsearch cluster sizing can overload data nodes, causing queries to queue in the search thread pool and even get occasionally rejected.
- Incorrect Sharding Strategy - If the primary shard count is too high or shards are too large in your Elasticsearch cluster, queries can be slow to execute.
- Heavy Aggregations - Queries with expensive-to-run aggregations, such as cardinality aggregations, field-data-intensive aggregations, or those that return many buckets, can significantly slow down searches.
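To make the filter-context point above concrete, here is a minimal sketch in Python (the index field names `message`, `status`, and `timestamp` are illustrative, not from the original post). Both request bodies return the same documents, but the second moves the exact-match and range clauses into the `filter` context, so Elasticsearch can cache them and skip scoring them; rounding the date math to the hour (`now-1h/h`) further helps the range clause get cache hits.

```python
# Slow variant: every clause runs in query (scoring) context,
# so none of them benefit from the filter cache.
scored_query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"message": "timeout"}},
                {"term": {"status": 500}},
                {"range": {"timestamp": {"gte": "now-1h/h"}}},
            ]
        }
    }
}

# Faster variant: same matching documents, but the term and range
# clauses run in filter context, so they are cacheable and are
# excluded from relevance scoring.
filtered_query = {
    "query": {
        "bool": {
            "must": [{"match": {"message": "timeout"}}],
            "filter": [
                {"term": {"status": 500}},
                {"range": {"timestamp": {"gte": "now-1h/h"}}},
            ],
        }
    }
}
```

Only the full-text `match` clause needs scoring here; the yes/no clauses (`term`, `range`) contribute nothing to relevance, which is exactly what makes them safe to move into the filter context.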
Key Metrics for Debugging Slow Elasticsearch Queries
When debugging slow Elasticsearch queries, there are three key metrics to monitor:
- Response Time - The "took" field in the Elasticsearch search response indicates, in milliseconds, how long it took to process the query within the cluster.
- Request Time - This is the end-to-end query time, from when the request is submitted to when the response is received, including serialization, deserialization, and network time. Measure it by recording the time between sending the request and receiving the response.
- Page Time - Page slowness is rarely caused by a delayed Elasticsearch response; more often it comes from concurrent processes, such as database calls or requests to additional auxiliary APIs or services.
When monitoring Elasticsearch queries, pay special attention to these key performance metrics. Log the response time, request time, and, if possible, the page time for each of your queries. The request time should be only slightly above the response time; a consistently large delta between them means there is an underlying issue that must be identified and resolved.
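As a sketch of that logging, the helper below wraps a search call with a timer and compares the measured request time against the cluster-side "took" value. Here `search_fn` is a hypothetical stand-in for your Elasticsearch client's search method (any callable that returns the parsed response dict will do); the stub at the bottom mimics a search response so the sketch is self-contained.

```python
import time

def timed_search(search_fn, body):
    """Run a search callable and report request time alongside the
    cluster-side 'took' value from the response."""
    start = time.monotonic()
    response = search_fn(body)
    request_time_ms = (time.monotonic() - start) * 1000
    took_ms = response.get("took", 0)  # time spent inside the cluster
    # The delta covers serialization, deserialization, and network time.
    overhead_ms = request_time_ms - took_ms
    print(f"took={took_ms}ms request={request_time_ms:.1f}ms "
          f"overhead={overhead_ms:.1f}ms")
    return response, request_time_ms, took_ms

# Usage with a stub standing in for a real client call:
fake_search = lambda body: {"took": 12, "hits": {"total": {"value": 3}}}
resp, req_ms, took_ms = timed_search(fake_search, {"query": {"match_all": {}}})
```

In production you would replace the `print` with a call to your logging or metrics pipeline, so that sustained spikes in the overhead (request time minus "took") surface as a signal distinct from cluster-side slowness.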
Slow Elasticsearch queries can be due to infrastructure issues (see below for details of how these can be addressed with Pulse). But to be absolutely sure of the cause, you should monitor all queries and analyze those performance metrics.
Elasticsearch Query Analytics with Pulse
To resolve query latency issues in your Elasticsearch cluster, it is essential to identify the root cause of the slowness. One way to achieve that is with Pulse, our innovative platform built specifically for deep analysis of Elasticsearch metrics and monitoring of Elasticsearch performance.
Specifically, Pulse’s Query Analytics feature is designed to help users identify and resolve query issues. It not only keeps an eye on your cluster's health but also provides actionable recommendations on what to do, when to do it, and how to do it. Learn more here about this feature and how it can enhance your search performance.
Conclusion
Trying to figure out the root cause of poor query performance can be exhausting and resource-consuming, and it will likely impact your customers' experiences and your revenue. Keeping a close eye on your Elasticsearch performance metrics is one step in the right direction toward keeping performance in check.
The next step is analyzing your Elasticsearch queries to detect query issues, analyze query patterns, and discover the reason for slowness. For example, is this a bad query, or is a particular index more susceptible to slow queries? That can be achieved with the Pulse platform, which monitors clusters, analyzes queries, and recommends practical solutions.
Contact us today to learn how we can help you better understand your cluster's performance and leverage these insights for improved results.