This is a thorough review of tools for monitoring Elasticsearch clusters, to help keep them stable and at high performance.
Elasticsearch is one of the most popular software tools in the industry. It is used for Search, of course, but also for Observability, Security Information and Event Management (SIEM), and in recent Elasticsearch versions, even as a Vector Database. This makes Elasticsearch a critical part of the software stack of many companies.
Once a tool is a vital part of the software stack, we must ensure monitoring and dashboards are set up to help visualize what is happening since many things can easily go wrong in software. Fortunately, when it comes to Elasticsearch, there are many tools to choose from. In this post, we’ll compare some of the top tools you can use for monitoring your Elasticsearch clusters.
Several things need to be considered when monitoring Elasticsearch. A tool should monitor the Elasticsearch process itself, the operating system it runs on, and the Java Virtual Machine (JVM) in which Elasticsearch runs. Monitoring tools should be feature-rich, allowing you to create alerts, visualizations, and dashboards. They should be scalable and cost-effective as well.
Let’s review some of the essential features that make for an ideal monitoring tool for Elasticsearch:
- The ability to collect operating system metrics such as CPU and RAM usage.
- Collection of JVM metrics such as heap usage and Garbage Collection (GC) count.
- Cluster metrics such as query response times, index sizes, and the number of requests.
- Visualizations and dashboards for displaying the collected metrics.
- Alerting that integrates well with popular tools such as Slack, ServiceNow, PagerDuty, and others.
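To make the operating system and JVM metrics above concrete, here is a minimal Python sketch that pulls them out of a response from Elasticsearch's node stats API (`GET /_nodes/stats/os,jvm`). The function name `summarize_node_stats` and the sample payload are illustrative assumptions; the payload is made up and trimmed to the fields used here, and a real response contains many more.

```python
from typing import Any


def summarize_node_stats(stats: dict[str, Any]) -> list[dict[str, Any]]:
    """Pull a few OS and JVM metrics out of a `_nodes/stats` response body."""
    summary = []
    for node_id, node in stats.get("nodes", {}).items():
        jvm = node.get("jvm", {})
        collectors = jvm.get("gc", {}).get("collectors", {})
        summary.append({
            "node": node.get("name", node_id),
            "cpu_percent": node.get("os", {}).get("cpu", {}).get("percent"),
            "heap_used_percent": jvm.get("mem", {}).get("heap_used_percent"),
            # Total GC runs across all collectors (for example young and old).
            "gc_count": sum(c.get("collection_count", 0) for c in collectors.values()),
        })
    return summary


# Made-up sample shaped like a trimmed `GET /_nodes/stats/os,jvm` response.
SAMPLE = {
    "nodes": {
        "abc123": {
            "name": "es-node-1",
            "os": {"cpu": {"percent": 37}},
            "jvm": {
                "mem": {"heap_used_percent": 62},
                "gc": {"collectors": {
                    "young": {"collection_count": 120},
                    "old": {"collection_count": 3},
                }},
            },
        }
    }
}

print(summarize_node_stats(SAMPLE))
```

Every tool reviewed in this post collects some version of these numbers; they differ mainly in how the numbers are stored, charted, and alerted on.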
Other considerations that apply to any software product:
- Ease of use
This post will examine the various tools that offer these features and capabilities. This will help you decide which tool to use for monitoring your Elasticsearch clusters.
The Stack Monitoring application built into Kibana is our first option since it is part of the Elastic Stack. It has monitoring, alerting, and out-of-the-box dashboarding capabilities and is built to work on Elasticsearch data and metrics. It is free and open, like the rest of the Elastic Stack, but it has some limitations: some of its most useful features sit behind a paywall and are not available out of the box.
For example, monitoring a single Elasticsearch cluster with Kibana is free, but it requires a paid subscription if you want to monitor multiple clusters from the same Kibana instance.
Additionally, if you want to use any of its machine learning capabilities, such as anomaly detection, you must obtain a paid license. To view more of the differences between the free and paid versions of Kibana, check the official Elastic site.
Stack Monitoring gives you many insights into your cluster. You can get an overview of the entire cluster at a high level and drill down to the node or index levels. You can view many metrics, such as search and indexing rates, disk usage, JVM heap usage, CPU utilization, and the number of requests. See the complete list of the available metrics. However, each resource (node, index) can only be viewed individually. For example, you cannot look at the metrics of all nodes on one dashboard screen.
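Rates such as the search rate are usually derived from cumulative counters, not reported directly. As a rough sketch of the idea (not a description of how Kibana computes them internally), the function below takes two samples of the `query_total` and `query_time_in_millis` counters found under `indices.search` in the Elasticsearch stats APIs and derives queries per second and average query latency. The function name and the sample numbers are assumptions made for illustration.

```python
def search_rate_and_latency(prev: dict, curr: dict, interval_seconds: float) -> tuple[float, float]:
    """Return (queries per second, average ms per query) between two counter samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    queries = curr["query_total"] - prev["query_total"]
    millis = curr["query_time_in_millis"] - prev["query_time_in_millis"]
    rate = queries / interval_seconds
    # Avoid dividing by zero when no queries ran during the interval.
    avg_latency_ms = millis / queries if queries > 0 else 0.0
    return rate, avg_latency_ms


# Two made-up samples taken 10 seconds apart.
previous = {"query_total": 1000, "query_time_in_millis": 5000}
current = {"query_total": 1300, "query_time_in_millis": 6500}

print(search_rate_and_latency(previous, current, 10))
```

This sketch does not handle counter resets, which happen when a node restarts; a production collector would need to detect a negative difference and discard that interval.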
Kibana also allows you to create small widget visualizations and import the widgets into dashboards for easy use. You can create many kinds of widgets, including charts, graphs, pie charts, maps for geospatial data, and more. Additionally, you can generate reports from the visualizations and export them as needed.
Lastly, Kibana provides alerting capabilities as well. It allows you to create rules and actions such as “Send a message if the CPU level is above a specified threshold for more than 3 minutes”. As for the target systems, there are integrations with popular tools such as Slack, Jira, and ServiceNow. However, many alerting features, such as integrations, require a paid subscription.
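The logic behind a rule like the CPU example above can be sketched in a few lines of Python. This is only an illustration of a "sustained threshold" check, not Kibana's rule engine; the function name `breached_for` and the sample readings are hypothetical.

```python
def breached_for(samples: list[tuple[float, float]], threshold: float,
                 min_duration_seconds: float) -> bool:
    """Check whether the metric has stayed above `threshold` for longer than
    `min_duration_seconds`, up to and including the most recent sample.

    `samples` is a list of (timestamp_seconds, value) pairs sorted by time.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # start of the current breach
        else:
            breach_start = None  # metric recovered, reset the window
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start > min_duration_seconds


# Made-up CPU readings, one per minute.
cpu = [(0, 50), (60, 92), (120, 95), (180, 91), (240, 93), (300, 96)]
print(breached_for(cpu, threshold=90, min_duration_seconds=180))
```

Requiring the breach to last for a minimum duration avoids firing on short spikes, which is why most alerting tools expose both a threshold and a time window.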
- Powerful and flexible dashboards
- Many integrations with other tools
- Large active community
- There are many usability limitations; for example, node graphs are shown individually, never on one graph.
- Some of the most useful functionality is behind a paywall.
Cerebro is an open-source monitoring tool that is lightweight, easy to use, well known, and widely used. However, it is not as fully featured as some of the other options. This is not necessarily bad since Cerebro gives you a clear picture of your Elasticsearch cluster without the extra bells and whistles that may distract you and are not always needed.
Cerebro gives you a quick real-time look into your cluster health statistics. However, its most significant drawback is that it doesn’t support a time picker for historical data. This means that Cerebro cannot show historical metric data, only point-in-time metrics at the given moment. Additionally, Cerebro is only suitable for viewing the current state of the cluster, so if you need features such as alerting, you would have to run a different tool alongside it.
Lastly, the project is no longer regularly updated. The latest commit from its GitHub repository was on July 3, 2021, over two years ago at the time of writing! That said, it is still a great lightweight, free, and open-source alternative to some bigger players in this space.
- Free and open-source
- Lightweight, simple, and easy to use
- Not as powerful or flexible
- No option for viewing historical data
- No integration with other tools
- No longer actively supported
Grafana is an open-source tool for monitoring and visualizing metric data. It works on top of various data sources and is commonly used with Prometheus, an open-source metrics collection and storage tool, as the primary metrics data source. It also has alerting capabilities, allowing you to set up rules with complex query logic so that alerts fire only in specific scenarios. Although not many alerting integrations are built into Grafana, there is a plugin system you can use to install plugins that enable support for most popular alert targets such as Slack, Teams, PagerDuty, ServiceNow, and others.
As its name suggests, Grafana shines the most with its dashboarding and visualization capabilities which are very flexible and customizable. Grafana is available for installation as a free and open-source version you maintain. However, there is also a hosted version by Grafana Labs, which has a basic free tier and paid plans for larger amounts of time series data and storage.
One of Grafana’s main downsides is its learning curve. Grafana itself requires domain expertise to unlock some of its leading features, and additional knowledge is required for the systems it depends on. For example, the most popular data source for Grafana is Prometheus, which collects and stores the metrics that Grafana then queries. To define alerts when using Prometheus as a data source, you need to use the PromQL query language, which adds to the learning curve of Grafana itself.
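To give a feel for what PromQL looks like, the Python sketch below builds a URL for Prometheus's instant query HTTP endpoint (`/api/v1/query`) with a heap usage condition. The metric names are an assumption: they follow the naming used by the community Elasticsearch exporter for Prometheus and may differ in your setup. The base URL, the 0.85 threshold, and the helper name `instant_query_url` are also illustrative.

```python
from urllib.parse import urlencode

# Assumed metric names (community elasticsearch_exporter style); adjust to your exporter.
HEAP_USAGE_QUERY = (
    'elasticsearch_jvm_memory_used_bytes{area="heap"}'
    ' / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.85'
)


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus instant query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical local Prometheus address.
url = instant_query_url("http://localhost:9090", HEAP_USAGE_QUERY)
print(url)
```

The same expression could be pasted into a Grafana alert rule that uses Prometheus as its data source; the point is that the condition is written in PromQL, not in Grafana's own syntax.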
- Flexible and customizable dashboards
- Full-featured dashboards are available to get started quickly
- Free open-source and hosted options
- Learning curve across multiple systems
- Requires maintaining two tools: one for metrics, such as Prometheus, and Grafana itself for dashboards and alerting. Connecting Elasticsearch itself as a data source is theoretically possible but is not a widely used option.
New Relic is a fully featured Observability product. Its Elasticsearch integration makes it simple to pull in various metrics about the cluster, node, and indexes. As a full-featured Observability tool, New Relic allows you to monitor your Elasticsearch clusters in addition to your websites, mobile apps, systems, and applications.
Regarding Elasticsearch monitoring, New Relic gives you nearly all available cluster statistics. You can use them to create visualizations and dashboards, and it supports alerting with many integrations.
Because New Relic is an all-in-one, enterprise-grade product, it takes more time to learn and to get used to its approach to observability. Additionally, since New Relic is a paid solution, it can get very expensive for large teams with lots of data. Although it does have a free tier, that tier is mainly a way to try out the platform features, and most production use cases will require a paid subscription. Pricing is calculated based on the number of users (segmented by user type) and the amount of processed data.
- Fully featured observability platform
- Simple to integrate Elasticsearch data
- Can be expensive
- Not as customizable as other options
Like New Relic, Datadog is an enterprise-grade, robust, fully featured observability tool. It offers insights into every metric available in your Elasticsearch cluster. It supports many application integrations and features monitoring, visualizations, dashboards, and alerting capabilities.
One of the best features is its support for templating. You can quickly retrieve templates for dashboards, reports, and monitoring. This gives you a quick and easy way to get started. You can load a template dashboard to get some basic Elasticsearch metrics and modify it to fit your needs. This provides a better experience than starting from scratch.
The main drawback of Datadog is its cost. In most cases, it is the most expensive monitoring solution on this list. That said, it is definitely worth considering especially if you want a solution to monitor Elasticsearch alongside the rest of your infrastructure and applications.
- Easy to use
- Many out-of-the-box integrations
- Not as customizable as others
Pulse is our Elasticsearch Monitoring solution built by the engineers at BigData Boutique.
Most of the solutions mentioned above provide alerting and graphs, but they still require expert Elasticsearch knowledge to know how to react when a problem arises. Otherwise, it could take hours or days to resolve a critical issue. This type of situation can be very costly for your organization during a software outage since it can result in a loss of income if clients rely on the uptime of your search solution.
That is exactly why we built Pulse.
Pulse offers monitoring, visualizations, dashboards, and alerting. Rather than providing alert functionality with predefined limits and definitions, Pulse intelligently offers custom monitoring suggestions based on your specific cluster configuration and setup. The suggested alerts help with current issues and help prevent future ones by flagging potential misconfigurations that may turn catastrophic only down the road. Additionally, Pulse surfaces only actionable insights, which reduces alert fatigue.
The dashboards are designed to monitor Elasticsearch on all levels: clusters, nodes, indices, and operating system components relevant to running Elasticsearch in production. The metrics and dashboards are based on years of experience consulting clients with Elasticsearch, so we know exactly which are important to focus on.
Another powerful feature of Pulse is Query Analytics, which analyzes the performance of your Elasticsearch queries. Pulse also supports OpenSearch.
In addition to these insights, Pulse comes with support from world-class Elasticsearch expert engineers when needed. With industry-standard SLAs, our engineers can help resolve your issues fast and get you back up and running in no time.
- Powerful and flexible dashboarding
- Provides actionable insights to prevent future issues.
- Simple to set up and use.
- Expert support as part of the product offering.
- No free or open-source option.
- Pulse focuses on Elasticsearch and relevant Operating System metrics only, so it cannot be used for other systems.
- Pulse does not integrate with other monitoring tools.
When it comes to Elasticsearch monitoring, many excellent solutions are available. Some are free, others require a commercial license, and some sit somewhere in between, offering some features for free and requiring a license for others. All options mentioned in this post are viable in our experience, so selecting a monitoring solution for your cluster comes down to personal preference and your specific needs.
Due to our extensive experience with Elasticsearch and after using many different tools over the years, we developed and currently use Pulse ourselves for most use cases. The main advantage of Pulse is that, in addition to monitoring and alerting, it gives actionable insights about the Elasticsearch cluster. In our experience, this is something that should not be underestimated. Too often, engineers face a situation where the problem is known, but there is no idea what caused it, how to fix it, or how to prevent it from occurring again. Understanding the cause and effect of a failure and implementing a solution can sometimes take hours or even days. In these cases, having those expert tips can save time and money and help you zero in on the issue's root cause faster.
Another first-rate alternative would be to stick with the default tool, Kibana. This is especially true when you’re using a hosted Elasticsearch solution like Elastic Cloud, where Kibana is already part of the Elastic Stack deployment and offers all the capabilities and licensed features out of the box.
As the best free and open-source monitoring solution, we recommend Grafana.
If you do not have cost concerns and would rather opt for a more commercial, full observability platform that encompasses not only your Elasticsearch clusters but also your applications, logs, and metrics, then New Relic or Datadog is the way to go.
We hope this helps you make better-informed decisions about your Elasticsearch monitoring needs.