A practical comparison of Langfuse, LangSmith, and Opik for LLM observability, tracing, and evaluation - covering architecture, pricing, framework support, and when to use each tool.

Getting an LLM-powered application to work in a demo takes days. Keeping it working in production takes months. The gap between the two is observability - the ability to trace what your application is doing, measure how well it's doing it, and figure out why it stopped doing it correctly last Tuesday.

LLM observability is the practice of capturing and analyzing the runtime behavior of LLM applications - including traces of individual LLM calls, retrieval steps, tool invocations, latency, token usage, cost, and systematic evaluation of output quality. As teams ship more agents and RAG pipelines into production, observability has become the difference between "it works on my machine" and "it works for 10,000 users." Three platforms lead this space right now: Langfuse, LangSmith, and Opik. Each takes a different approach to the same problem.

What LLM Observability Actually Means

Traditional application monitoring tracks HTTP status codes, CPU usage, and error rates. LLM observability goes deeper. A single user request to an agentic application might trigger a retrieval step, a reranking call, three LLM invocations with different prompts, two tool calls, and a final synthesis step. If the output is wrong, you need to know which step failed and why.

The core capabilities teams need from an LLM observability platform:

  • Tracing: Capturing the full execution tree of a request - every LLM call, retrieval step, and tool invocation with inputs, outputs, latency, and token counts.
  • Cost tracking: Mapping token usage to actual dollar amounts per request, per user, per feature. At scale, a single poorly written prompt can cost thousands per month.
  • Evaluation: Systematic measurement of output quality through automated scoring, human annotation, or both. This is the hard part - as we've written about before, LLM-as-a-judge has real limitations.
  • Prompt management: Versioning, deploying, and A/B testing prompts without code changes.
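
Cost tracking in particular is simple to reason about: multiply token counts by per-model prices and aggregate along whatever dimension matters (user, feature, request). A minimal sketch, with purely illustrative placeholder prices rather than current vendor rates:

```python
# Minimal sketch of per-request cost tracking: map token usage to dollars
# using a per-model price table. Prices are illustrative placeholders only.
PRICE_PER_1K = {
    # model: (input $/1K tokens, output $/1K tokens) -- hypothetical values
    "gpt-4o-mini": (0.00015, 0.0006),
    "claude-3-5-sonnet": (0.003, 0.015),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of a single LLM call."""
    in_price, out_price = PRICE_PER_1K[model]
    return (input_tokens / 1000) * in_price + (output_tokens / 1000) * out_price

def cost_by_user(calls: list[dict]) -> dict[str, float]:
    """Aggregate cost per user across a batch of traced calls."""
    totals: dict[str, float] = {}
    for c in calls:
        totals[c["user"]] = totals.get(c["user"], 0.0) + request_cost(
            c["model"], c["input_tokens"], c["output_tokens"]
        )
    return totals
```

All three platforms do this bookkeeping for you automatically, but the arithmetic above is what their cost dashboards reduce to.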

All three platforms we're comparing offer these capabilities, but they differ in architecture, licensing, ecosystem integration, and operational overhead.

LangSmith

LangSmith is the observability and evaluation platform built by the team behind LangChain. It started as a tightly coupled companion to the LangChain framework but has since expanded to support any LLM application stack.

Architecture and integration. LangSmith provides SDKs for Python, TypeScript, Go, and Java. It traces applications built with OpenAI SDK, Anthropic SDK, Vercel AI SDK, LlamaIndex, and custom implementations - not just LangChain. That said, the deepest integration remains with LangChain and LangGraph, where tracing is essentially automatic. For other frameworks, you wrap functions with a @traceable decorator or use the OpenTelemetry integration.

Evaluation framework. This is where LangSmith stands out. You can build datasets directly from production traces, define custom evaluators, and run systematic experiments comparing prompt changes or model swaps. The playground lets you iterate on prompts against real production data, which shortens the feedback loop considerably.
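
The experiment loop this automates is worth seeing in miniature. The sketch below uses hard-coded stand-ins for the model variants and an exact-match evaluator; in LangSmith the dataset would come from production traces and the evaluators could be LLM-based:

```python
# Sketch of the experiment loop that systematic evaluation automates:
# run each variant over a dataset and score outputs with an evaluator.
# Models, prompts, and the evaluator are illustrative stand-ins.
dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def variant_v1(q: str) -> str:  # stand-in for "prompt v1 + model"
    return {"2+2": "4", "capital of France": "Paris"}.get(q, "?")

def variant_v2(q: str) -> str:  # stand-in for "prompt v2 + model"
    return {"2+2": "4"}.get(q, "?")

def exact_match(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0

def run_experiment(system, dataset, evaluator) -> float:
    """Return the mean evaluator score for a system over a dataset."""
    scores = [evaluator(system(row["input"]), row["expected"]) for row in dataset]
    return sum(scores) / len(scores)
```

Comparing `run_experiment(variant_v1, ...)` against `run_experiment(variant_v2, ...)` is the manual version of what a LangSmith experiment run reports per prompt or model change.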

Deployment and pricing. LangSmith offers managed cloud, bring-your-own-cloud (BYOC), and self-hosted options. The free Developer tier includes 5,000 traces per month. The Plus plan runs $39/seat/month with 10,000 included traces and additional traces at approximately $2.50-$5.00 per 1,000 traces depending on retention (14-day vs 400-day). Enterprise pricing is custom. One thing worth noting: the LangSmith platform is proprietary (though the client SDKs are MIT-licensed). Both BYOC and self-hosted options require an Enterprise license.

Limitations. Vendor lock-in is the primary concern. While LangSmith now supports multiple frameworks, the evaluation and dataset features work best within the LangChain ecosystem. If you're not using LangChain, you'll do more manual integration work. The closed-source nature also means you can't inspect or modify the platform itself.

Langfuse

Langfuse is an open-source LLM observability platform that has become the default choice for teams that want full control over their observability data. In January 2026, ClickHouse acquired Langfuse as part of a $400M Series D round - a move that signals the strategic importance of LLM observability in the data infrastructure stack.

Architecture and integration. Langfuse is framework-agnostic from the ground up. It integrates with LangChain, LlamaIndex, OpenAI SDK, LiteLLM, Vercel AI SDK, and any custom stack via its Python and TypeScript SDKs. It also supports OpenTelemetry, meaning you can feed LLM traces into the same pipeline as your existing application telemetry. The data model is clean: traces contain observations (spans, generations, events, and dedicated types for agents, tools, chains, retrievers, and more), each with structured metadata.
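
That data model can be sketched as two small types. The field names here are simplified assumptions for illustration; the real Langfuse schema carries considerably more metadata (timestamps, model parameters, token usage, and so on):

```python
from dataclasses import dataclass, field
from typing import Optional

# Simplified sketch of the trace/observation model described above.
@dataclass
class Observation:
    id: str
    type: str                 # "span" | "generation" | "event" | "tool" | ...
    name: str
    parent_id: Optional[str] = None   # nesting builds the execution tree
    input: Optional[dict] = None
    output: Optional[dict] = None

@dataclass
class Trace:
    id: str
    name: str
    observations: list[Observation] = field(default_factory=list)

    def children_of(self, parent_id: Optional[str]) -> list[Observation]:
        """Walk one level of the execution tree."""
        return [o for o in self.observations if o.parent_id == parent_id]
```

A trace is the unit of one user request; observations hang off it as a tree, which is what makes "which step failed" answerable.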

Open source and self-hosting. This is Langfuse's defining feature. The core platform is MIT-licensed and self-hostable, with enterprise features under a separate license. You can run it on Docker Compose for development or deploy it on Kubernetes for production. Langfuse runs on ClickHouse as its analytics backend (which makes the acquisition a natural fit). The project has grown to over 23,000 GitHub stars, with adoption at 19 of the Fortune 50. Post-acquisition, Langfuse remains fully open source with no planned licensing changes.

Evaluation and prompt management. Langfuse provides evaluation capabilities including LLM-as-a-judge, custom scoring functions, and annotation workflows for human review. On the prompt side, it offers built-in versioning, a playground for iteration, and flexible custom dashboards supporting multi-level aggregations across tracing data.
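
The shape of an LLM-as-a-judge scorer is the same on every platform: build a judging prompt, call a model, parse the verdict into a score. The sketch below stubs the judge with a trivial keyword heuristic so it is self-contained; in practice you would call a real model there:

```python
# Sketch of an LLM-as-a-judge scorer. The judge call is stubbed with a
# keyword heuristic so the example runs offline; in practice it would be
# a real model call whose verdict you parse.
def stub_judge(prompt: str) -> str:
    # Pretend-LLM: flags answers that admit uncertainty as unhelpful.
    return "0" if "i don't know" in prompt.lower() else "1"

def judge_helpfulness(question: str, answer: str, judge=stub_judge) -> float:
    prompt = (
        "Rate the answer's helpfulness as 1 (helpful) or 0 (unhelpful).\n"
        f"Question: {question}\nAnswer: {answer}\nScore:"
    )
    return float(judge(prompt))
```

Langfuse's contribution is the plumbing around this: running the scorer over incoming traces, storing scores alongside them, and routing low scores into annotation queues for human review.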

Pricing. Self-hosting is free. The cloud Hobby tier is free with 50,000 units per month (traces, observations, and scores each count as one unit) and 30-day retention. Paid cloud plans start at roughly $29/month. For context, a medium-scale self-hosted deployment runs approximately $3,000-4,000/month in infrastructure costs, compared to $199-300/month for the equivalent cloud Pro tier at mid-market scale.

Limitations. The evaluation framework is less opinionated than LangSmith's - you get scoring primitives, annotation queues, and LLM-as-a-judge, but need to assemble your own evaluation pipelines. The self-hosted option, while powerful, requires operational investment. And despite the growing community, the ecosystem of pre-built integrations and tutorials is still smaller than LangSmith's.

Opik

Opik is the newest entrant, built by Comet ML - a company with deep roots in ML experiment tracking. Released under the Apache 2.0 license, Opik brings a different perspective: it treats LLM observability as an extension of the broader ML experimentation workflow.

Architecture and integration. Opik provides tracing for LLM calls, tool executions, memory operations, and context assembly, with complete input/output pairs, token counts, latency, and cost tracking. It supports integrations with major LLM providers and frameworks, and recently released opik-openclaw, a native plugin for full-stack agent observability. The platform also offers an MCP Server for IDE integration. Opik includes native thread grouping via a thread_id parameter, allowing traces to be displayed as conversation threads in the UI - similar to Langfuse's session concept.
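
The thread grouping amounts to partitioning traces by a shared identifier and ordering them in time. A minimal sketch of the idea (the dict shapes here are illustrative, not Opik's actual trace schema):

```python
from collections import defaultdict

# Sketch of conversation-thread grouping: traces carrying the same
# thread_id are collected together and ordered by timestamp, which is
# what lets a UI display them as one conversation.
def group_by_thread(traces: list[dict]) -> dict[str, list[dict]]:
    threads: dict[str, list[dict]] = defaultdict(list)
    for t in sorted(traces, key=lambda t: t["ts"]):
        threads[t["thread_id"]].append(t)
    return dict(threads)
```

This is also essentially what Langfuse's sessions do; the feature matters because multi-turn failures (an agent that loses context on turn five) are invisible when traces are viewed one at a time.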

Experiment tracking heritage. Where Opik differentiates itself is in the connection to Comet ML's experiment tracking platform. If your team already uses Comet for ML model training experiments, Opik extends that workflow to LLM evaluation and production monitoring. You get a unified view across traditional ML and LLM workloads.

Evaluation and experiments. Opik provides LLM-as-a-judge evaluators, heuristic scoring, and a structured experiments workflow for comparing prompt and model changes. You can seed evaluation datasets directly from production traces, making it easy to build eval sets from real traffic. The connection to Comet's broader experiment tracking means evaluation results can live alongside traditional ML metrics in a single dashboard.
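
Seeding a dataset from traces reduces to filtering and reshaping. One common policy, sketched below with assumed field names, is to keep only well-scored production traces as golden input/expected pairs:

```python
# Sketch of seeding an evaluation dataset from production traces: keep
# traces whose quality score clears a threshold and retain only the
# input/output pair as a dataset item. Field names are assumptions.
def traces_to_dataset(traces: list[dict], min_score: float = 0.8) -> list[dict]:
    return [
        {"input": t["input"], "expected_output": t["output"]}
        for t in traces
        if t.get("score", 0.0) >= min_score
    ]
```

The inverse policy (harvesting low-scored traces as a regression set, with corrected outputs added by annotators) is equally common; either way, eval sets built from real traffic age far better than hand-written ones.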

Pricing. Opik's core features are available across all tiers - open source, cloud, and enterprise - though usage quotas, rate limits, and data retention differ between plans. The open-source version is free to self-host. The cloud Pro plan is $39/month. All plans include unlimited team members. Comet also offers free Pro access for academic users.

Limitations. Opik is the youngest of the three platforms. The community is growing fast (18,000+ GitHub stars), but the ecosystem of guides, third-party integrations, and production case studies is thinner. If you're not already in the Comet ML ecosystem, the experiment tracking advantage is less relevant.

Comparison at a Glance

| Feature | LangSmith | Langfuse | Opik |
| --- | --- | --- | --- |
| License | Proprietary (SDKs are MIT) | MIT (core); separate EE license | Apache 2.0 (open source) |
| Self-hosting | Enterprise license required | Free, fully supported | Free, fully supported |
| Tracing | Full execution trees | Full execution trees | Full execution trees |
| Evaluation | Built-in datasets, evaluators, playground | LLM-as-a-judge, custom scoring, playground | LLM-as-a-judge, heuristic evaluators, experiments |
| Prompt management | Yes, with playground | Yes, with versioning and playground | Yes, with versioning |
| Framework support | All major SDKs; deepest with LangChain | All major SDKs; framework-agnostic | All major SDKs; ties into Comet ML |
| OpenTelemetry | Yes | Yes | Yes |
| Cloud free tier | 5,000 traces/month | 50,000 units/month | Free with usage limits |
| Paid cloud | $39/seat/month + traces | Starting at ~$29/month | $39/month (Pro), custom (Enterprise) |
| GitHub stars | N/A (platform is proprietary) | 23,000+ | 18,000+ |
| Backing | LangChain Inc. | ClickHouse Inc. (acquired Jan 2026) | Comet ML |

When to Use Which

There is no single best LLM observability and evaluation tool. The right choice depends on your existing stack, your operational preferences, and where you are in the build-vs-buy spectrum.

Choose LangSmith if you're building with LangChain or LangGraph and want the tightest possible integration. The evaluation framework is the most complete of the three - with built-in datasets, systematic experiment tracking, and a playground for iterating on prompts against production data. The managed cloud removes operational burden. Accept the vendor lock-in if speed of integration matters more than flexibility.

Choose Langfuse if you need open-source, self-hosted observability and evaluation with no licensing restrictions on the core platform. It's the strongest option for multi-framework environments, and the ClickHouse backing gives it long-term stability. Teams with data residency requirements or those already running ClickHouse will find this the most natural fit.

Choose Opik if you want a permissively licensed (Apache 2.0) platform with strong experiment tracking. If your team already uses Comet for ML model training, Opik extends that workflow to LLM evaluation with a unified view. The structured experiments workflow makes it easy to compare prompt and model changes systematically.

For most teams starting fresh with LLM observability and evaluation, Langfuse offers the best balance of capability, flexibility, and cost. It's open source, framework-agnostic, and backed by a well-funded parent company. But if you're deep in the LangChain ecosystem, LangSmith's tighter integration and more opinionated evaluation framework may be worth the trade-off.

Whatever you choose, the key is to instrument early. Retrofitting observability into a production LLM application is far harder than building it in from day one.