A practitioner comparison of anomaly detection algorithms for observability data: statistical baselines, Isolation Forest, Random Cut Forest, and Prophet, and which one fits the shape of your logs and metrics.

Anomaly Detection Algorithms for Logs and Metrics: Isolation Forest, Random Cut Forest, Prophet, and When Each Works

Most teams reach for machine learning the moment someone says "anomaly detection." That instinct is often wrong. For a single metric with a stable daily rhythm, a z-score against a rolling mean catches the same spikes that an ensemble of trees would, runs in microseconds, and never surprises you at 3 a.m. with a model that drifted. The hard part of anomaly detection is not the algorithm. It is matching the algorithm to the shape of your data and to the cost you are willing to pay for a false positive.

This post compares four families of anomaly detection algorithms that show up repeatedly in observability work: statistical baselines (moving average, z-score, EWMA), Isolation Forest, Random Cut Forest (RCF), and Prophet. Each one assumes something different about your data: whether it streams or arrives in batches, whether it has seasonality, whether it is one dimension or many. Pick the wrong assumption and you get alert fatigue or silent misses. The sections below lay out what each algorithm actually does and where it breaks.

The three categories you are actually choosing between

Anomaly detection algorithms split into three groups based on how they model "normal." Statistical methods model normal as a distribution and flag points that fall too far from it. Tree-based outlier detectors model normal as a region of feature space and flag points that are easy to isolate from the rest. Forecasting methods model normal as a prediction and flag points where reality diverges from the forecast.

That framing matters because it tells you what each method can and cannot see. A z-score on a single time series has no concept of "Tuesday looks different from Sunday." Prophet does. Isolation Forest has no concept of order in time at all; shuffle the rows and, apart from the randomness of tree construction itself, the scores are the same. Random Cut Forest is built specifically to handle order, because it maintains a sketch of a stream and updates it point by point.

Anomaly detection is the task of identifying observations that deviate so much from the rest of the data that they were likely generated by a different process. In observability, that "different process" is usually a bug, an outage, an attack, or a capacity limit.

The other axis is dimensionality. A CPU utilization metric is univariate. A request that carries latency, status code, payload size, and upstream host is multivariate, and an anomaly may live in the combination of values rather than in any single field. The methods below differ sharply in how they handle that.

Statistical baselines: where most metric monitoring should start

For a univariate metric, the workhorses are moving average, z-score, MAD (median absolute deviation), and EWMA (exponentially weighted moving average). A z-score measures how many standard deviations a point sits from a rolling mean. A MAD-based score swaps the mean for the median and the standard deviation for the median absolute deviation, which makes it far more robust when your history already contains spikes. EWMA weights recent observations more heavily than old ones, so the baseline adapts as traffic grows.
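To make the three statistics concrete, here is a minimal sketch in Python with pandas. The function names, the 60-point window, and the synthetic series are assumptions chosen for illustration, not a reference implementation. It shows how an old spike in the history inflates the standard deviation and hides a new spike from the z-score, while the MAD-based score still catches it.

```python
import numpy as np
import pandas as pd


def rolling_zscore(series: pd.Series, window: int) -> pd.Series:
    """Z-score of each point against the mean/std of the preceding `window` points."""
    # shift(1) keeps the current point out of its own baseline.
    baseline = series.shift(1).rolling(window, min_periods=window)
    return (series - baseline.mean()) / baseline.std()


def rolling_mad_score(series: pd.Series, window: int) -> pd.Series:
    """Robust score: distance from the rolling median in units of scaled MAD."""
    past = series.shift(1)
    median = past.rolling(window, min_periods=window).median()
    mad = past.rolling(window, min_periods=window).apply(
        lambda w: np.median(np.abs(w - np.median(w))), raw=True
    )
    # 1.4826 makes MAD comparable to a standard deviation for normal data.
    return (series - median) / (1.4826 * mad)


def ewma_baseline(series: pd.Series, alpha: float) -> pd.Series:
    """Exponentially weighted mean of past points; a higher alpha adapts faster."""
    return series.shift(1).ewm(alpha=alpha, adjust=False).mean()


# Synthetic metric (illustrative): noise around 100, one old spike, one new spike.
rng = np.random.default_rng(0)
values = 100 + rng.normal(0, 2, size=200)
values[150] = 400  # old spike that pollutes the rolling history
values[190] = 130  # the spike we want to catch
metric = pd.Series(values)

z = rolling_zscore(metric, window=60)
robust = rolling_mad_score(metric, window=60)
baseline = ewma_baseline(metric, alpha=0.1)
print(f"z-score at 190: {z[190]:.1f}, MAD score at 190: {robust[190]:.1f}")
```

In this setup the plain z-score at index 190 stays well under a typical threshold of 3 because the spike at index 150 sits inside its window, while the MAD-based score is far above it. That gap is the practical argument for median/MAD on spiky histories.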

These methods are not a fallback you settle for. For a large class of metrics they are the correct choice. They are interpretable: an on-call engineer can read "value was 6 sigma above the 1-hour mean" and act on it. They cost almost nothing to compute, which matters when you are scoring thousands of series every few seconds. And they have no training phase to drift or go stale. ClickHouse can express a rolling z-score directly in SQL over a windowed aggregation, and Prometheus recording rules approximate the same idea with avg_over_time and stddev_over_time.

The limits are real and worth stating plainly. A plain z-score assumes the series is roughly stationary, so a metric with strong daily or weekly seasonality will throw false positives every morning when traffic ramps and miss real problems during quiet hours. The usual fix is to deseasonalize first (compare each point to the same hour last week, or subtract a seasonal baseline) and then apply the statistic. If you find yourself building an increasingly elaborate seasonal correction on top of a z-score, that is the signal to move to a forecasting model instead.
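The deseasonalize-then-score fix can be sketched as follows. This is a minimal illustration that assumes hourly data, a one-week comparison period, and a synthetic traffic series; `seasonal_residual` and `robust_score` are hypothetical helper names. It scores the whole residual series in one pass, so it is a batch illustration rather than a streaming detector.

```python
import numpy as np
import pandas as pd

HOURS_PER_WEEK = 24 * 7


def seasonal_residual(series: pd.Series, period: int = HOURS_PER_WEEK) -> pd.Series:
    """Difference between each point and the point one full period earlier."""
    return series - series.shift(period)


def robust_score(residual: pd.Series) -> pd.Series:
    """Residual in units of scaled MAD, so past incidents do not inflate the scale."""
    center = residual.median()
    mad = (residual - center).abs().median()
    return (residual - center) / (1.4826 * mad)


# Four weeks of hourly data with a strong daily cycle (hypothetical traffic).
hours = np.arange(4 * HOURS_PER_WEEK)
rng = np.random.default_rng(1)
traffic = 1000 + 400 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 10, hours.size)
traffic[600] += 150  # a real anomaly, smaller than the normal daily swing
series = pd.Series(traffic)

# Plain z-score over the raw series versus a robust score on the weekly residual.
raw_z = (series - series.mean()) / series.std()
score = robust_score(seasonal_residual(series))
print(f"raw z at 600: {raw_z[600]:.2f}, deseasonalized score at 600: {score[600]:.1f}")
```

The injected anomaly is smaller than the ordinary daily swing, so the raw z-score barely moves, while the same point stands out clearly once each hour is compared with the same hour a week earlier.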

A standard playbook for metric alerting:

  1. Start with EWMA or a rolling z-score on the raw series.
  2. If mornings and weekends cause false alarms, deseasonalize against the same period last week before scoring.
  3. Switch from mean/std to median/MAD if your history is spiky.
  4. Only reach for ML once you have a concrete shape that statistics cannot capture (strong multi-period seasonality, multivariate interactions, or high cardinality).

Isolation Forest and Random Cut Forest: tree-based outlier detection

Isolation Forest, introduced by Liu, Ting, and Zhou in 2008, builds an ensemble of random trees. At each node it picks a feature at random and a split value at random. Anomalies, being few and different, get isolated into their own leaf after only a handful of splits, so a short average path length across the forest signals an outlier. The scikit-learn IsolationForest implementation makes this a few lines of Python and scales well to high-dimensional batches. It is unsupervised, fast to train, and indifferent to the scale of individual features.

Isolation Forest isolates anomalies instead of profiling normal points. It builds random trees and flags observations that require unusually few random splits to separate from the rest, because anomalies are both rare and distant in feature space.

The catch: Isolation Forest is a batch method with no notion of time. It treats each row as an independent point in feature space. That is ideal for finding outlier log events or unusual request fingerprints across a fixed window, and it handles the multivariate case naturally. It is the wrong tool for a streaming metric where "anomalous" means "different from five minutes ago," because shuffling the timestamps would not change a single score.
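As a rough illustration of the batch, multivariate use case, the sketch below runs scikit-learn's `IsolationForest` over a window of request features. The feature columns and the synthetic data are assumptions made up for the example; only `IsolationForest`, `fit`, and `score_samples` come from scikit-learn.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Hypothetical request features: latency (ms), payload size (KB), status class (2-5).
normal = np.column_stack([
    rng.normal(120, 15, 500),                 # latency
    rng.normal(8, 2, 500),                    # payload size
    rng.choice([2, 3], 500, p=[0.95, 0.05]),  # mostly 2xx, some 3xx
])
# One request that is unusual in combination: slow, huge payload, 5xx.
odd = np.array([[900.0, 250.0, 5.0]])
X = np.vstack([normal, odd])

forest = IsolationForest(n_estimators=200, random_state=0)
forest.fit(X)

# score_samples: lower means easier to isolate, i.e. more anomalous.
scores = forest.score_samples(X)
most_anomalous = int(np.argmin(scores))
print(f"most anomalous row: {most_anomalous}, score: {scores[most_anomalous]:.3f}")
```

Nothing in `X` carries a timestamp, which is the point: the model scores rows as points in feature space, so it suits a fixed window of log events rather than a live metric.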

Random Cut Forest is the streaming answer to that gap. The Robust Random Cut Forest paper by Guha, Mishra, Roy, and Schrijvers (ICML 2016) adapts the isolation idea to data streams by maintaining a forest as a continuously updated sketch of the input. As each new point arrives, RCF estimates how much inserting that point disturbs the existing tree structure (the paper calls this measure collusive displacement) and turns that into an anomaly score. It was designed for streaming, high-dimensional data and degrades gracefully on duplicates and near-duplicates, which are common in metric streams.

RCF is the algorithm behind OpenSearch Anomaly Detection and AWS managed anomaly detection features. In OpenSearch it runs unsupervised over your indexed time series, computing an anomaly grade and a confidence score per interval and adapting as patterns evolve. The plugin's high-cardinality mode is the part most teams underuse: by setting a category field (host, IP, customer, product ID), OpenSearch builds a separate per-entity model, so each entity gets its own baseline rather than being smeared into a global average. AWS documents this scaling to roughly a million entities with adaptive in-memory model management. For per-host or per-tenant observability, that per-entity isolation is usually what separates a useful detector from a noisy one. The OpenSearch team has also published algorithmic improvements that cut false positives substantially in recent releases, so the practical false-positive rate has moved over time.
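For orientation, here is a sketch of what a high-cardinality detector definition might look like, written as a Python dict that would be sent as the JSON body of the plugin's create-detector request (`POST _plugins/_anomaly_detection/detectors`). The index pattern, field names, detector name, and intervals are hypothetical, and the exact set of supported fields should be checked against the Anomaly Detection API documentation for your OpenSearch version.

```python
import json


def build_detector_body(index_pattern, time_field, metric_field, category_field):
    """Build a create-detector request body with one feature and one category field."""
    return {
        "name": f"{metric_field}-per-{category_field}",  # hypothetical naming scheme
        "description": "Per-entity anomaly detection (illustrative)",
        "time_field": time_field,
        "indices": [index_pattern],
        "feature_attributes": [
            {
                "feature_name": f"avg_{metric_field}",
                "feature_enabled": True,
                # Each feature is an aggregation computed once per detection interval.
                "aggregation_query": {
                    f"avg_{metric_field}": {"avg": {"field": metric_field}}
                },
            }
        ],
        "detection_interval": {"period": {"interval": 5, "unit": "Minutes"}},
        "window_delay": {"period": {"interval": 1, "unit": "Minutes"}},
        # The category field is what switches on per-entity models.
        "category_field": [category_field],
    }


# Hypothetical index and field names, for illustration only.
body = build_detector_body("metrics-*", "@timestamp", "cpu_utilization", "host.name")
print(json.dumps(body, indent=2))
```

Leaving out `category_field` would produce a single-stream detector over the aggregated series, which is the "global average" behavior described above.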

Prophet and forecasting-based detection: when seasonality is the whole story

Prophet is an open-source forecasting library from Meta. It fits an additive model that decomposes a series into trend, seasonality (yearly, weekly, daily, plus custom periods you define), and holiday effects. To use it for anomaly detection you forecast the expected value with an uncertainty interval, then flag any actual observation that lands outside that interval. The anomaly is defined relative to what the model expected, not relative to a flat rolling mean.
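The forecast-then-flag step can be sketched as below. The helper names are hypothetical, and the 0.99 interval width is an assumed starting point rather than a recommendation. `forecast_with_prophet` uses Prophet's `ds`/`y` input columns and its `yhat_lower`/`yhat_upper` output columns, and imports the package lazily so the flagging logic can be tried with any forecast frame of the same shape.

```python
import pandas as pd


def flag_outside_interval(actual: pd.DataFrame, forecast: pd.DataFrame) -> pd.DataFrame:
    """Join actuals to a forecast on `ds` and flag points outside the interval."""
    cols = ["ds", "yhat", "yhat_lower", "yhat_upper"]
    merged = actual.merge(forecast[cols], on="ds")
    merged["anomaly"] = (merged["y"] < merged["yhat_lower"]) | (
        merged["y"] > merged["yhat_upper"]
    )
    return merged


def forecast_with_prophet(history: pd.DataFrame, interval_width: float = 0.99) -> pd.DataFrame:
    """Fit Prophet on `history` (columns ds, y) and predict over the same timestamps."""
    from prophet import Prophet  # lazy import; requires the prophet package

    model = Prophet(interval_width=interval_width)
    model.fit(history)
    return model.predict(history[["ds"]])


# Hand-built forecast frame standing in for Prophet output (illustrative values).
timestamps = pd.date_range("2024-01-01", periods=4, freq=pd.Timedelta(hours=1))
actual = pd.DataFrame({"ds": timestamps, "y": [100.0, 105.0, 180.0, 98.0]})
forecast = pd.DataFrame({
    "ds": timestamps,
    "yhat": [100.0, 102.0, 104.0, 101.0],
    "yhat_lower": [90.0, 92.0, 94.0, 91.0],
    "yhat_upper": [110.0, 112.0, 114.0, 111.0],
})
result = flag_outside_interval(actual, forecast)
print(result.loc[result["anomaly"], ["ds", "y", "yhat_upper"]])
```

With real data you would replace the hand-built frame with `forecast_with_prophet(history)`. Widening `interval_width` trades missed anomalies for fewer pages.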

This is the right family when seasonality and known calendar events dominate your signal. Business-hours traffic, weekend dips, end-of-month batch jobs, Black Friday: Prophet models these explicitly, including the ability to register a holiday calendar so a predictable spike does not page anyone. It is robust to missing data and to occasional outliers in the training history, and it does not demand a time-series specialist to operate. Classical alternatives in this family (ARIMA, SARIMA, Holt-Winters) handle similar seasonal structure and are lighter weight, but they require more care in selecting orders and differencing.

The trade-offs are equally concrete. Prophet is a batch fit; you retrain periodically rather than scoring a live stream point by point. It wants several seasonal cycles of history to learn the pattern, so it is poor at cold start and on young metrics. It forecasts a single target series, so even though extra regressors can inform the forecast, anomalies that live in the combination of several fields are out of scope. And on a metric with no real seasonality it adds complexity and tunable knobs without buying you anything a z-score would not. Reserve it for series where the seasonal shape is the dominant feature and where calendar-aware suppression of known spikes is worth the operational overhead of retraining.

The comparison table and how to choose

The honest summary is that none of these dominates. They occupy different corners of a design space defined by streaming vs. batch, univariate vs. multivariate, and whether seasonality is present.

| Algorithm | Supervised? | Streaming or batch | Seasonality handling | Dimensionality | Typical use |
| --- | --- | --- | --- | --- | --- |
| Z-score / EWMA / MAD | Unsupervised | Streaming | None (deseasonalize manually) | Univariate | Default for stable single metrics; cheap, interpretable alerting |
| Prophet (forecasting) | Unsupervised | Batch (periodic retrain) | Strong, explicit (daily/weekly/yearly + holidays) | Univariate | Seasonal business metrics with calendar effects |
| Isolation Forest | Unsupervised | Batch | None (no time concept) | Multivariate | Outlier rows: unusual log events, request fingerprints over a window |
| Random Cut Forest | Unsupervised | Streaming | Learns periodicity from the stream | Multivariate, high-cardinality | Live metric/log streams; per-entity detection in OpenSearch |

A few rules of thumb that fall out of this:

  • Single metric, stable or mildly trending: start with a rolling z-score or EWMA. Do not skip to ML.
  • Single metric, strong daily/weekly cycle and known calendar events: Prophet, or a deseasonalized statistic if you want to stay lightweight.
  • A live stream where you cannot retrain offline, especially many entities: Random Cut Forest, and use OpenSearch's category-field per-entity models for host- or tenant-level granularity.
  • Outlier records in a batch (logs, request features), order does not matter: Isolation Forest.

Whatever you pick, the production failure mode is almost always the same: alert fatigue. A detector that fires on every Monday morning ramp gets muted within a week, after which it might as well not exist. Budget your effort accordingly. Tune thresholds against a labeled window of past incidents, suppress known seasonal and deployment-related spikes, and require a minimum confidence or a sustained breach before paging. Log anomaly detection has its own front-end step worth calling out: raw lines are noisy, so most pipelines first extract templates (Drain-style parsing) or cluster messages, then run frequency or outlier detection on the structured result rather than on raw text. If you are still deciding where logs and metrics should live before any of this, our guides to choosing log management tooling and monitoring Amazon OpenSearch cover the platform layer, and the OpenTelemetry with OpenSearch guide covers getting the signals in cleanly in the first place.
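The "sustained breach before paging" rule mentioned above is simple enough to sketch in a few lines. The function name, threshold, and run length here are illustrative assumptions; the scores could come from any of the detectors discussed in this post.

```python
from typing import Iterable, Iterator


def sustained_breaches(
    scores: Iterable[float], threshold: float, min_consecutive: int
) -> Iterator[int]:
    """Yield the index at which a run of breaches first reaches min_consecutive."""
    run = 0
    for i, score in enumerate(scores):
        # Extend the current run on a breach, reset it otherwise.
        run = run + 1 if score > threshold else 0
        if run == min_consecutive:
            yield i  # page once per sustained run, not on every point in it


# One isolated blip (index 1) and one sustained breach (indices 3-6).
scores = [0.1, 4.2, 0.3, 3.9, 4.5, 5.1, 4.8, 0.2]
pages = list(sustained_breaches(scores, threshold=3.0, min_consecutive=3))
print(pages)  # [5]
```

The isolated blip never pages, and the sustained run pages exactly once, on its third consecutive breach. The cost is detection delay: a larger `min_consecutive` means fewer false pages but a later alert.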

Key takeaways

  • Match the algorithm to the data shape, not to fashion. Streaming vs. batch, univariate vs. multivariate, and seasonal vs. flat are the three questions that decide the choice.
  • Simple statistics (z-score, EWMA, MAD) frequently beat ML on univariate metrics and cost a fraction as much. Start there and earn your way up to ML.
  • Isolation Forest is a batch, time-agnostic outlier finder for multivariate rows. Random Cut Forest is its streaming counterpart and is the algorithm powering OpenSearch Anomaly Detection, including per-entity, high-cardinality detection.
  • Prophet earns its keep only when seasonality and calendar events dominate the signal; it is a univariate batch model that needs several cycles of history.
  • The recurring production problem is false positives, not detection power. Deseasonalize, suppress known spikes, tune against real incidents, and require sustained breaches before you page.

If you are weighing these approaches against the noise problem in your own stack, BigData Boutique builds and tunes anomaly detection on OpenSearch and ClickHouse for production observability workloads.