LLM Evaluation in Production: Frameworks, Metrics, and the Layered System That Ships

A practitioner's guide to LLM evaluation: the layered system (offline regression, online/shadow, human calibration), the metrics that map to business risk, and a 2026 comparison of DeepEval, Ragas, Promptfoo, LangSmith, Braintrust, Phoenix, Langfuse, Opik, and MLflow.

Most teams treat LLM evaluation as a tool-selection problem. It is not. Picking DeepEval over Ragas, or LangSmith over Braintrust, is downstream of a question that almost nobody answers explicitly: what does the evaluation system look like? Shipping reliable LLM features in production requires a layered system - offline regression suites, online and shadow evaluation, and a small set of human-calibrated anchors. Frameworks slot into that architecture; they do not replace it.

This post lays out the architecture first, maps the 2025-2026 framework landscape onto it, and gives an opinionated decision tree for stack selection. It assumes you have already shipped or prototyped an LLM-powered feature and want to stop being surprised by it.

Why LLM Evaluation Is Not Like Traditional ML Evaluation

Classic supervised ML had the convenient property that a single number - accuracy, F1, AUC - summarized model quality on a frozen test set. LLM applications break that on every axis. Outputs are open-ended, sampling is stochastic, and the same prompt can produce different answers across runs. The "test set" is moving because prompts and upstream models change, not just data.

Cost and latency are not ops afterthoughts in this regime; they are first-class quality dimensions. A model that is two points more accurate but three times slower or five times more expensive is rarely the right answer for a user-facing surface. Evaluation has to track all of it together.

Public benchmarks like MMLU-Pro, GPQA, BIG-Bench Hard, and SWE-bench Verified tell you which model is generally smarter. They do not tell you whether your support copilot stops fabricating policy numbers or your coding agent stops over-using a deprecated API. Those are application-level questions, and a benchmark of grad-school physics multiple choice will not answer them.

What to Actually Evaluate: Metrics That Map to Business Risk

The first decision is what failure modes hurt you, not what is easy to measure. The taxonomy below covers the categories most product teams need.

Task metrics. Accuracy on the underlying task: exact match, structured-output validity (is the JSON parseable, does it conform to the schema), tool-call correctness for agents. These are the closest analog to classical ML metrics and the easiest to automate.
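As a minimal sketch of a structured-output validity scorer, the check below parses each output as JSON and confirms that a set of required fields is present with the expected types. The field names in `REQUIRED_FIELDS` are hypothetical and stand in for whatever schema your application defines; a real suite would more likely use a full JSON Schema validator.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "sources": list, "confidence": float}


def structured_output_valid(raw: str, required=REQUIRED_FIELDS) -> bool:
    """Return True if raw parses as a JSON object with the required typed fields."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    return all(
        name in obj and isinstance(obj[name], expected)
        for name, expected in required.items()
    )


def validity_rate(outputs: list[str]) -> float:
    """Fraction of outputs that pass the validity check."""
    if not outputs:
        return 0.0
    return sum(structured_output_valid(o) for o in outputs) / len(outputs)
```

Because the check is deterministic and cheap, it can run on every example in the golden set and on every PR.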

RAG metrics. When retrieval is in the loop, Ragas has standardized a useful taxonomy: faithfulness (is the answer grounded in retrieved context), answer relevancy (does the answer address the query), context precision (is the retrieved context focused), context recall (does the retrieved context contain the answer). Faithfulness and answer relevancy are computable without ground-truth labels using LLM-as-judge. Context recall needs a reference answer to compare against, and context precision is usually computed against one as well.
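To make the two context metrics concrete, here is a small sketch of the arithmetic behind them as we understand the definitions. It is not Ragas's implementation: the per-chunk relevance flags and per-claim support flags are assumed inputs, which in practice would come from human labels or a judge model.

```python
def context_precision(chunk_is_relevant: list[bool]) -> float:
    """Rank-aware precision: mean of precision@k over ranks holding a relevant chunk."""
    hits = 0
    total = 0.0
    for k, relevant in enumerate(chunk_is_relevant, start=1):
        if relevant:
            hits += 1
            total += hits / k  # precision at rank k
    return total / hits if hits else 0.0


def context_recall(claim_is_supported: list[bool]) -> float:
    """Share of expected claims that the retrieved context supports."""
    if not claim_is_supported:
        return 0.0
    return sum(claim_is_supported) / len(claim_is_supported)
```

The rank-aware form rewards retrievers that put relevant chunks first: the same two relevant chunks score higher at ranks 1 and 2 than at ranks 2 and 3.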

Agentic metrics. Trajectory correctness, tool-selection accuracy, step efficiency. These have only recently shown up as first-class scorers in DeepEval, Langfuse, and Braintrust - if your agents have non-trivial multi-step reasoning, you need them.
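A rough sketch of two of these scorers, assuming each golden example carries a reference trajectory expressed as an ordered list of tool names. The function names and the position-by-position comparison are illustrative choices, not the definitions used by any of the frameworks named above.

```python
def tool_selection_accuracy(expected: list[str], actual: list[str]) -> float:
    """Fraction of reference steps where the agent called the expected tool at that position."""
    if not expected:
        return 1.0
    hits = sum(1 for want, got in zip(expected, actual) if want == got)
    return hits / len(expected)


def step_efficiency(expected: list[str], actual: list[str]) -> float:
    """Reference step count divided by actual step count, capped at 1.0."""
    if not actual:
        return 0.0
    return min(1.0, len(expected) / len(actual))
```

Positional matching is deliberately strict: one redundant call early in the trajectory shifts every later step and drags the score down. If your agents can legitimately reorder steps, an alignment-based comparison is a better fit.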

Safety and policy. Toxicity, PII leakage, jailbreak resistance, prompt-injection robustness. These are usually classifier-based and run on every output regardless of task.

Operational metrics. Latency at p50 and p95, tokens per request, cost per successful task. The denominator matters: failed attempts still cost money, so dividing total spend by successful tasks rather than by requests gives the unit cost that actually shows up on the bill.
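The denominator point is easy to see in a few lines. This sketch assumes a hypothetical `Attempt` record per request and uses a nearest-rank percentile; your tracing platform will have its own record shape and percentile method.

```python
import math
from dataclasses import dataclass


@dataclass
class Attempt:
    latency_s: float
    cost_usd: float
    succeeded: bool


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def cost_per_request(attempts: list[Attempt]) -> float:
    """Total spend divided by all attempts, successful or not."""
    return sum(a.cost_usd for a in attempts) / len(attempts)


def cost_per_successful_task(attempts: list[Attempt]) -> float:
    """Total spend (failures included) divided by successful attempts only."""
    successes = sum(a.succeeded for a in attempts)
    total = sum(a.cost_usd for a in attempts)
    return total / successes if successes else float("inf")
```

With four attempts at 0.02 USD each and one failure, cost per request is 0.02 but cost per successful task is about 0.027, because the failed attempt is still paid for.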

Human-calibrated metrics. Preference, helpfulness, tone. Expensive to collect, but they anchor the automated layers - without them, you are optimizing a proxy you have not validated.

The Layered Evaluation System

The single most important idea in this post: evaluation is not one thing. It is three layers that feed each other.

Layer 1 - Offline regression. A versioned golden dataset (typically 100-500 examples), scored by deterministic or LLM-judge scorers, that runs in CI on every PR. Catches regressions before they ship. This is where DeepEval, Promptfoo, Ragas, and similar libraries live.

Layer 2 - Online and shadow evaluation. Sampling production traffic for live scoring, A/B tests, interleaved tests, shadow deploys for new prompt or model versions. This is where hosted platforms (LangSmith, Phoenix, Langfuse, Braintrust, Opik) earn their keep, because they couple tracing infrastructure with eval harnesses.

Layer 3 - Human calibration. A small expert-labeled set, refreshed quarterly, that anchors what "good" means. The other two layers drift without it; LLM judges are notoriously biased toward verbosity and self-similarity.

The layers are not independent. Production failures surfaced by Layer 2 become regression cases in Layer 1. Layer 3 disagreements with the LLM judge force scorer revisions. A team running only Layer 1 ships changes that pass every test against a quality target nobody validated. A team running only Layer 2 catches problems after users see them. You need all three.

Framework Landscape, Mapped to the Layers

Below is the 2026 picture, organized by which layer each tool primarily targets.

Offline and dev-loop frameworks

DeepEval is the closest thing to "pytest for LLMs" - tests as Python functions, decorators that wrap LLM calls, more than fifty built-in metrics including G-Eval and the full Ragas suite. It is the default for teams that want evaluation to look like normal software testing.

Ragas is RAG-native and remains the cleanest taxonomy for retrieval evaluation. It composes well with other frameworks; many teams import Ragas metrics into DeepEval or directly into a hosted platform.

Promptfoo is declarative YAML-driven prompt regression with first-class red-team support. As of 2026 it is part of OpenAI and remains open-source under MIT. Strongest fit for prompt engineering iteration loops where the unit of comparison is the prompt itself.

OpenAI Evals is largely in maintenance mode. Still useful as a reference for declarative YAML eval patterns, but not where active investment is happening; new product teams should not start here.

Platform and hosted (tracing + eval)

LangSmith, Braintrust, Arize Phoenix, Langfuse, and Opik all offer the same shape: tracing capture, dataset management, online evaluators, dashboards, and human review queues. They differ in self-host options, opinionated framework integrations, and pricing model.

Phoenix and Langfuse have permissive open-source licenses and self-host well, which matters for regulated workloads. Braintrust and LangSmith are commercial-first with stronger out-of-the-box scoring and dataset tooling. Opik (Comet) sits in between - open-source core, hosted SaaS available.

For a deeper comparison see our LLM observability tools post.

Enterprise and MLOps

MLflow LLM Evaluate is the right answer for teams already on Databricks or running MLflow as their experiment-tracking spine. As of early 2026 it offers more than fifty built-in metrics and judges, with results that drop into the same MLflow runs as classical model artifacts.

Vertex AI Gen AI Evaluation and the Microsoft Foundry Evaluation SDK (the rebrand of Azure AI Studio's evaluation tooling, with azure-ai-projects v2 unifying agents, inference, and evaluations as of January 2026) are the cloud-native picks. They earn their place when compliance, data residency, or the rest of the ML stack is already on the cloud in question.

Research and model selection

Stanford HELM, EleutherAI's lm-evaluation-harness, MMLU-Pro, GPQA, BBH, and SWE-bench Verified are the right tools for picking which base model to use. They are the wrong tools for evaluating whether your application works. Conflating the two is the most expensive mistake we see in evaluation strategy.

Comparison table

Framework	Primary layer	RAG support	Self-host	CI fit	Best for
DeepEval	Offline (Layer 1)	Strong (Ragas built-in)	OSS	Excellent (pytest-style)	Teams that want eval as Python tests
Ragas	Offline (Layer 1)	Native	OSS	Good	RAG-heavy products
Promptfoo	Offline (Layer 1)	Good	OSS	Excellent (YAML, CI)	Prompt regression and red-team
LangSmith	Online (Layer 2)	Good	Hosted	Via SDK	LangChain-native stacks
Braintrust	Online (Layer 2)	Good	Hosted	Excellent	Multi-model agentic systems
Arize Phoenix	Online (Layer 2)	Good	OSS / hosted	Good	OpenTelemetry-native, self-hosted
Langfuse	Online (Layer 2)	Good	OSS / hosted	Good	OSS-first, self-host friendly
Opik	Online (Layer 2)	Good	OSS / hosted	Good	Comet-aligned shops
MLflow LLM Evaluate	Enterprise	Good	OSS / Databricks	Excellent	Existing MLflow / Databricks users
HELM / lm-eval-harness	Research	N/A	OSS	N/A	Base-model selection only

LLM-as-a-Judge: Where It Works, Where It Quietly Lies

Judge models are seductive because they collapse a hard labeling problem into an API call. They also have well-documented failure modes: position bias (preferring whichever response is shown first), verbosity bias (preferring longer answers), self-preference (a model rates its own outputs higher), and reference leakage when the judge has seen the test data during pretraining.

The original G-Eval paper showed that chain-of-thought-style scoring with explicit rubrics improves correlation with human judgment, but does not eliminate the biases. The LLM-as-a-Judge paper (the MT-Bench work) measured position bias empirically, alongside verbosity and self-enhancement bias. Since then, the practical recipe has stabilized: use a stronger judge than your generator, anchor judgments with rubrics rather than free-form scoring, randomize position in pairwise comparisons, and audit a sample against humans regularly.

Where judges work well: structured criteria (does the output cite at least one source, does the JSON parse, does the response stay on-topic), pairwise preferences with anchored rubrics, and faithfulness checks against retrieved context where the judge has the source material in the prompt.

Where they quietly lie: subjective tone, anything where verbosity correlates with perceived quality, multi-turn coherence, and high-stakes safety judgments. We have written about a specific failure mode in Thinking Fast and Failing Slow - judges look fine on aggregate metrics while missing the cases that matter.

The mitigations: use rubric decomposition (break the judgment into 3-5 binary checks rather than a 1-10 score), use pairwise comparisons rather than absolute scores, run the judge twice with positions swapped and count a win only when both orderings agree, and treat any judge score as a hypothesis until it correlates with a human-labeled holdout.
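Two of these mitigations can be sketched without committing to any judge provider. The `Judge` callable below is a hypothetical stand-in for your own judge-model call; the sketch only shows the position-swap bookkeeping and the rubric arithmetic, and treats order-dependent verdicts as ties.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first_response, second_response)
# -> "first" or "second". Wrap your real judge-model call to match it.
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Run the judge in both orders; return 'a', 'b', or 'tie' if the verdicts disagree."""
    forward = judge(prompt, a, b)   # a is shown first
    backward = judge(prompt, b, a)  # b is shown first
    forward_winner = "a" if forward == "first" else "b"
    backward_winner = "b" if backward == "first" else "a"
    return forward_winner if forward_winner == backward_winner else "tie"


def rubric_score(checks: dict[str, bool]) -> float:
    """Rubric decomposition: share of binary checks that passed."""
    if not checks:
        return 0.0
    return sum(checks.values()) / len(checks)
```

A judge that always prefers whichever response it sees first produces a tie on every pair here, which is the point: position-driven verdicts never count as wins. The tie rate itself is worth tracking as a health signal for the judge.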

The Golden Dataset: The Unsexy Work That Decides Everything

Evaluation frameworks are commodity. Golden datasets are not. The single highest-leverage activity in an evaluation program is curating a few hundred high-quality examples that represent your real distribution.

Where the data comes from: real production traces beat synthetic every time. If you do not have production traffic yet, prompt-generated examples are a starting point but should be replaced as soon as real data exists. Stratify by intent, difficulty, and observed failure mode - random sampling under-represents the long-tail failures that actually matter.
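One way to sketch the stratification step: bucket traces by a stratum key and draw a fixed number from each bucket, so rare intents and failure modes get the same representation as common ones. The trace fields used in the usage note (`intent`, `failure_mode`) are hypothetical.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable


def stratified_sample(
    traces: list[dict],
    key: Callable[[dict], Hashable],
    per_stratum: int,
    seed: int = 0,
) -> list[dict]:
    """Draw up to per_stratum traces from each stratum, reproducibly."""
    rng = random.Random(seed)  # fixed seed so the golden set can be rebuilt
    buckets: dict[Hashable, list[dict]] = defaultdict(list)
    for trace in traces:
        buckets[key(trace)].append(trace)
    sample: list[dict] = []
    for stratum in sorted(buckets, key=repr):  # stable iteration order
        items = buckets[stratum]
        sample.extend(rng.sample(items, min(per_stratum, len(items))))
    return sample
```

Calling `stratified_sample(traces, key=lambda t: (t["intent"], t["failure_mode"]), per_stratum=20)` gives every intent and failure-mode pair up to 20 slots, where uniform random sampling would hand almost all of them to the dominant intent.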

Labeling protocol: two raters plus adjudication, with explicit rubrics. Track inter-annotator agreement (Cohen's kappa or simple percent agreement) - if agreement between your raters falls below ~0.7 kappa, the rubric is the problem, not the model.
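Cohen's kappa corrects raw agreement for the agreement two raters would reach by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected from each rater's label frequencies. A minimal two-rater sketch:

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Percent agreement alone can flatter a rubric: on a set where most items pass, two raters agree often by chance, and kappa is what exposes that.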

Treat eval data as code: version it in the same repo as the application, gate changes through PRs, and snapshot it whenever you change scoring logic so old runs remain comparable. Datasets that drift silently invalidate every comparison you make.
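A lightweight way to make silent drift visible is to stamp every eval run with a fingerprint of the dataset and the scoring logic version. The sketch below is one possible approach using only the standard library; the `scorer_version` string is an assumed convention, not something any framework requires.

```python
import hashlib
import json


def dataset_fingerprint(examples: list[dict], scorer_version: str) -> str:
    """Short, stable hash of the golden set plus the scorer version."""
    canonical = json.dumps(
        {"scorer_version": scorer_version, "examples": examples},
        sort_keys=True,      # key order inside an example does not matter
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

Store the fingerprint next to each run's scores. Two runs are directly comparable only when their fingerprints match; a mismatch means the data or the scoring changed between them.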

Wiring Evals Into CI/CD and Production

The eval system is only useful if it actually blocks bad changes from shipping.

Pre-merge. A fast subset of the golden set runs on every PR. Threshold scorers - "faithfulness must be at least 0.85", "p95 latency must be below 2 seconds", "cost per request must not increase by more than 10%" - act as gates. Slow or expensive scorers (large LLM-judge suites, for example) run nightly or pre-release rather than per-PR.
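The three example thresholds can be expressed as data and checked in a few lines. This is a sketch of the gating logic only; the metric names are hypothetical keys that your eval run would need to emit.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Gate:
    metric: str
    compare: Callable[[float, float], bool]
    threshold: float


# Hypothetical metric names; the thresholds mirror the examples in the text.
GATES = [
    Gate("faithfulness", operator.ge, 0.85),
    Gate("p95_latency_s", operator.lt, 2.0),
    Gate("cost_increase_pct", operator.le, 10.0),
]


def failed_gates(metrics: dict[str, float], gates: list[Gate] = GATES) -> list[str]:
    """Names of gates that fail; a missing metric counts as a failure."""
    return [
        g.metric
        for g in gates
        if g.metric not in metrics or not g.compare(metrics[g.metric], g.threshold)
    ]
```

In CI, the calling script exits non-zero when `failed_gates(...)` is non-empty, which is what turns a score into a merge block. Treating a missing metric as a failure keeps a broken scorer from silently passing the gate.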

Pre-release. The full golden set, the cost and latency budgets, and any safety or red-team suites. A failure blocks the release.

Post-release. Sample 1-5% of production traffic into the eval pipeline. Monitor judge-score distributions for drift; alert when the tail of low-confidence outputs grows. Auto-promote new failure clusters into the regression set so the next release is tested against today's bugs.
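A sketch of the two mechanical pieces here: deterministic traffic sampling and a simple alert on the low-score tail. Hashing the trace ID (rather than calling a random generator) means the same trace is always in or out of the sample, which keeps reruns consistent. The cutoff and tolerance values are placeholders to tune, not recommendations.

```python
import hashlib


def in_eval_sample(trace_id: str, rate: float = 0.02) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def low_score_share(scores: list[float], cutoff: float = 0.5) -> float:
    """Share of judge scores that fall below the cutoff."""
    if not scores:
        return 0.0
    return sum(s < cutoff for s in scores) / len(scores)


def tail_drift_alert(
    baseline: list[float],
    current: list[float],
    cutoff: float = 0.5,
    max_increase: float = 0.05,
) -> bool:
    """Alert when the low-score tail grows by more than max_increase over baseline."""
    growth = low_score_share(current, cutoff) - low_score_share(baseline, cutoff)
    return growth > max_increase
```

Watching the tail rather than the mean matters because a growing cluster of bad outputs can hide behind a stable average.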

Release strategies that pay off: shadow deploys (new version sees production traffic, output discarded, scored offline), canary releases on traffic slices, and feature-flagged prompt and model changes so rollback is one toggle rather than a redeploy.
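The shadow-deploy pattern reduces to one rule: the candidate sees the request, but only the primary's output reaches the user. A minimal synchronous sketch, with `primary` and `candidate` as hypothetical callables wrapping the current and new prompt or model versions:

```python
from typing import Callable


def handle_with_shadow(
    request: str,
    primary: Callable[[str], str],
    candidate: Callable[[str], str],
    shadow_log: list[dict],
) -> str:
    """Serve the primary response; record the candidate's output for offline scoring."""
    response = primary(request)
    record = {"request": request, "primary": response}
    try:
        record["candidate"] = candidate(request)
    except Exception as exc:  # a broken candidate must never affect the user
        record["candidate_error"] = repr(exc)
    shadow_log.append(record)
    return response
```

In a real service the candidate call would run asynchronously or off a queue so it adds no latency to the user-facing path; it is inline here only to keep the sketch short. The logged pairs feed directly into pairwise scoring of new versus current.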

Choosing Your Stack: A Pragmatic Decision Tree

There is no single right stack. Here is how we usually shape recommendations.

Pure LLM app, small team, fast iteration: Promptfoo for prompt regression in CI, DeepEval for Python-test-style eval in the same repo. No hosted platform until traffic justifies it.
RAG-heavy product: Ragas for the metric set, plus a hosted tracer (Phoenix or Langfuse if self-hosting matters; Braintrust or LangSmith if it does not).
Regulated or enterprise stack: MLflow LLM Evaluate if you are on Databricks; Vertex AI Gen AI Evaluation or Microsoft Foundry Evaluation SDK if the rest of the platform is on GCP or Azure. Compliance posture beats marginal feature gaps.
Multi-model agentic systems: Braintrust or LangSmith for tracing depth and custom scorer ergonomics. Pair with DeepEval or a thin custom harness for the offline regression layer.

When to build versus buy: build if you can name the specific leverage an internal eval harness will give you over a two-year horizon (proprietary scoring, data residency, integrations no vendor will build). Otherwise, buy. Eval is one of those areas where vendor velocity outpaces internal projects, and time spent building the harness is time taken from the actual product work.

Key Takeaways

LLM evaluation is a layered system, not a tool. Offline regression, online and shadow evaluation, and human calibration each catch failures the others miss.
Map metrics to business risk before picking a framework. Faithfulness, structured-output validity, latency and cost belong in the same scorecard.
Public benchmarks (MMLU-Pro, GPQA, BBH, SWE-bench Verified) are for picking base models. They do not answer whether your application works.
LLM judges are useful but biased. Use rubric decomposition, pairwise comparisons, and audited human holdouts to keep them honest.
Golden datasets are the differentiator. A few hundred curated examples, versioned alongside code, beat any framework choice.
Wire evals into CI: PR gates on a fast subset, full suite pre-release, traffic sampling post-release. Without enforcement, evaluation is just paperwork.

If you are building an LLM-powered product and want help designing the evaluation stack and the operational discipline around it, our AI consulting team has shipped this for clients across regulated and high-throughput domains.