A practical comparison of Langfuse, LangSmith, and Opik for LLM observability, tracing, and evaluation - covering architecture, pricing, framework support, and when to use each tool.
Getting an LLM-powered application to work in a demo takes days. Keeping it working in production takes months. The gap between the two is observability - the ability to trace what your application is doing, measure how well it's doing it, and figure out why it stopped doing it correctly last Tuesday.
LLM observability is the practice of capturing and analyzing the runtime behavior of LLM applications, including traces of individual LLM calls, retrieval steps, tool invocations, latency, token usage, and cost. As teams ship more agents and RAG pipelines into production, observability has become the difference between "it works on my machine" and "it works for 10,000 users." Three platforms dominate this space right now: Langfuse, LangSmith, and Opik. Each takes a different approach to the same problem.
## What LLM Observability Actually Means
Traditional application monitoring tracks HTTP status codes, CPU usage, and error rates. LLM observability goes deeper. A single user request to an agentic application might trigger a retrieval step, a reranking call, three LLM invocations with different prompts, two tool calls, and a final synthesis step. If the output is wrong, you need to know which step failed and why.
The core capabilities teams need from an LLM observability platform:
- Tracing: Capturing the full execution tree of a request - every LLM call, retrieval step, and tool invocation with inputs, outputs, latency, and token counts.
- Cost tracking: Mapping token usage to actual dollar amounts per request, per user, per feature. At scale, a single poorly written prompt can cost thousands per month.
- Evaluation: Systematic measurement of output quality through automated scoring, human annotation, or both. This is the hard part - as we've written about before, LLM-as-a-judge has real limitations.
- Prompt management: Versioning, deploying, and A/B testing prompts without code changes.
All three platforms we're comparing offer these capabilities, but they differ in architecture, licensing, ecosystem integration, and operational overhead.
## LangSmith
LangSmith is the observability and evaluation platform built by the team behind LangChain. It started as a tightly coupled companion to the LangChain framework but has since expanded to support any LLM application stack.
Architecture and integration. LangSmith provides SDKs for Python, TypeScript, Go, and Java. It traces applications built with the OpenAI SDK, Anthropic SDK, Vercel AI SDK, LlamaIndex, and custom implementations - not just LangChain. That said, the deepest integration remains with LangChain and LangGraph, where tracing is essentially automatic. For other frameworks, you wrap functions with a @traceable decorator or use the OpenTelemetry integration.
Evaluation framework. This is where LangSmith stands out. You can build datasets directly from production traces, define custom evaluators, and run systematic experiments comparing prompt changes or model swaps. The playground lets you iterate on prompts against real production data, which shortens the feedback loop considerably.
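The dataset-plus-evaluator loop can be sketched without any particular SDK. Everything below - the exact_match evaluator, the run_experiment helper, the stubbed prompt variants - is illustrative, not LangSmith's actual API; in a real experiment the variants would call an LLM and the dataset would be built from production traces.

```python
def exact_match(output: str, expected: str) -> float:
    """Custom evaluator: 1.0 if the output matches the reference answer, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def run_experiment(app, dataset, evaluator):
    """Apply `app` to every example and return the mean evaluator score."""
    scores = [evaluator(app(ex["input"]), ex["expected"]) for ex in dataset]
    return sum(scores) / len(scores)

# Dataset of input/expected pairs, the kind you would curate from real traffic.
dataset = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

# Two "prompt variants" stubbed as lookup functions; real ones would hit a model.
def variant_a(q):
    return {"2 + 2": "4", "capital of France": "Paris"}.get(q, "")

def variant_b(q):
    return {"2 + 2": "4"}.get(q, "")

print(run_experiment(variant_a, dataset, exact_match))  # 1.0
print(run_experiment(variant_b, dataset, exact_match))  # 0.5
```

The point of the platform is to make this loop cheap to run: datasets accumulate from production, evaluators are versioned alongside prompts, and experiment results are comparable across runs.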
Deployment and pricing. LangSmith offers managed cloud, bring-your-own-cloud (BYOC), and self-hosted options. The free Developer tier includes 5,000 traces per month. The Plus plan runs $39/seat/month with additional traces at approximately $2.50-$5.00 per 1,000 traces depending on retention (14-day vs 400-day). Enterprise pricing is custom. One thing worth noting: LangSmith is closed source. The self-hosted option requires an Enterprise license.
Limitations. Vendor lock-in is the primary concern. While LangSmith now supports multiple frameworks, the evaluation and dataset features work best within the LangChain ecosystem. If you're not using LangChain, you'll do more manual integration work. The closed-source nature also means you can't inspect or modify the platform itself.
## Langfuse
Langfuse is an open-source LLM observability platform that has become the default choice for teams that want full control over their observability data. In January 2026, ClickHouse acquired Langfuse as part of a $400M Series D round - a move that signals the strategic importance of LLM observability in the data infrastructure stack.
Architecture and integration. Langfuse is framework-agnostic from the ground up. It integrates with LangChain, LlamaIndex, OpenAI SDK, LiteLLM, Vercel AI SDK, and any custom stack via its Python and TypeScript SDKs. It also supports OpenTelemetry, meaning you can feed LLM traces into the same pipeline as your existing application telemetry. The data model is clean: traces contain observations (spans, generations, events), each with structured metadata.
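A rough sketch of that data model, with simplified field names rather than the SDK's exact schema: a trace owns a list of observations, and per-generation token counts and costs roll up to a per-request total.

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    """One unit of work inside a trace; field names are illustrative only."""
    kind: str              # "span", "generation", or "event"
    name: str
    input: str = ""
    output: str = ""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    cost_usd: float = 0.0  # generations carry cost; spans and events do not

@dataclass
class Trace:
    """One end-to-end request, e.g. a single RAG query for one user."""
    name: str
    user_id: str
    observations: list = field(default_factory=list)

    def total_cost(self) -> float:
        """Roll up per-generation cost into a per-request dollar amount."""
        return sum(o.cost_usd for o in self.observations)

trace = Trace(name="rag-query", user_id="user-42")
trace.observations.append(Observation("span", "retrieve", input="q"))
trace.observations.append(
    Observation("generation", "answer", prompt_tokens=900,
                completion_tokens=150, cost_usd=0.0031)
)
print(round(trace.total_cost(), 4))  # 0.0031
```

Structuring the data this way is what makes per-user and per-feature cost queries straightforward: cost lives on generations, and everything aggregates up through traces.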
Open source and self-hosting. This is Langfuse's defining feature. The entire platform is MIT-licensed and self-hostable. You can run it on Docker Compose for development or deploy it on Kubernetes for production. Langfuse runs on ClickHouse as its analytics backend (which makes the acquisition a natural fit). The project has grown to over 22,000 GitHub stars and 26 million monthly SDK installs, with adoption at 19 of the Fortune 50. Post-acquisition, Langfuse remains fully open source with no planned licensing changes.
Prompt management and evaluation. Langfuse provides built-in prompt versioning and management, a playground for prompt iteration, and evaluation capabilities including LLM-as-a-judge and custom scoring functions. The custom dashboards are flexible, supporting multi-level aggregations across tracing data with various chart types.
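The value of prompt management is easiest to see as a data structure. This is a toy registry, not Langfuse's API: prompts are stored as numbered versions, and a deployment label resolves to one of them, so promoting a new version requires no code change in the application.

```python
class PromptRegistry:
    """Toy prompt store: numbered versions per prompt name, plus deployment labels."""

    def __init__(self):
        self.versions = {}  # name -> list of templates (index i is version i+1)
        self.labels = {}    # (name, label) -> version number

    def create(self, name: str, template: str) -> int:
        """Append a new version and return its 1-based version number."""
        self.versions.setdefault(name, []).append(template)
        return len(self.versions[name])

    def set_label(self, name: str, label: str, version: int):
        """Point a deployment label (e.g. 'production') at a specific version."""
        self.labels[(name, label)] = version

    def get(self, name: str, label: str = "production") -> str:
        """Resolve a label to its template - what the application calls at runtime."""
        return self.versions[name][self.labels[(name, label)] - 1]

reg = PromptRegistry()
reg.create("summarize", "Summarize: {text}")
v2 = reg.create("summarize", "Summarize in one sentence: {text}")
reg.set_label("summarize", "production", v2)
print(reg.get("summarize"))  # the v2 template is now live
```

Because the application only ever asks for the "production" label, rolling back a bad prompt is a one-line label change rather than a deploy.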
Pricing. Self-hosting is free. The cloud Hobby tier is free with 50,000 observation units per month and 30-day retention. Paid cloud plans start at around $29/month. For context, a medium-scale self-hosted deployment runs approximately $3,000-4,000/month in infrastructure costs, compared to $199-300/month for the equivalent cloud Pro tier at mid-market scale.
Limitations. The evaluation framework is less opinionated than LangSmith's - you get the building blocks but need to assemble more yourself. The self-hosted option, while powerful, requires operational investment. And despite the growing community, the ecosystem of pre-built integrations and tutorials is still smaller than LangSmith's.
## Opik
Opik is the newest entrant, built by Comet ML - a company with deep roots in ML experiment tracking. Released under the Apache 2.0 license, Opik brings a different perspective: it treats LLM observability as an extension of the broader ML experimentation workflow.
Architecture and integration. Opik provides tracing for LLM calls, tool executions, memory operations, and context assembly, with complete input/output pairs, token counts, latency, and cost tracking. It supports integrations with major LLM providers and frameworks, and recently released opik-openclaw, a native plugin for full-stack agent observability. The platform also offers an MCP Server for IDE integration.
Experiment tracking heritage. Where Opik differentiates itself is in the connection to Comet ML's experiment tracking platform. If your team already uses Comet for ML model training experiments, Opik extends that workflow to LLM evaluation and production monitoring. You get a unified view across traditional ML and LLM workloads.
Performance. Independent benchmarks suggest Opik completes trace logging and evaluation operations significantly faster than alternatives - reported at roughly 23 seconds for workloads that take Langfuse over 300 seconds. That speed advantage matters for rapid iteration during development.
Pricing. All Opik versions - open source, cloud, and enterprise - include the full feature set. The open-source version is free to self-host. The cloud version has a free tier with usage limits. All plans include unlimited team members. Comet also offers free Pro access for academic users.
Limitations. Opik is the youngest of the three platforms. The community is growing fast (17,600+ GitHub stars), but the ecosystem of guides, third-party integrations, and production case studies is thinner. If you're not already in the Comet ML ecosystem, the experiment tracking advantage is less relevant.
## Comparison at a Glance
| Feature | LangSmith | Langfuse | Opik |
|---|---|---|---|
| License | Proprietary | MIT (open source) | Apache 2.0 (open source) |
| Self-hosting | Enterprise license required | Free, fully supported | Free, fully supported |
| Tracing | Full execution trees | Full execution trees | Full execution trees |
| Evaluation | Built-in datasets, evaluators, playground | LLM-as-a-judge, custom scoring, playground | LLM-as-a-judge, heuristic evaluators, experiments |
| Prompt management | Yes, with playground | Yes, with versioning and playground | Yes |
| Framework support | All major SDKs; deepest with LangChain | All major SDKs; framework-agnostic | All major SDKs; ties into Comet ML |
| OpenTelemetry | Yes | Yes | Yes |
| Cloud free tier | 5,000 traces/month | 50,000 units/month | Free with usage limits |
| Paid cloud | $39/seat/month + traces | Starting at ~$29/month | Pro and Enterprise tiers |
| GitHub stars | N/A (closed source) | 22,000+ | 17,600+ |
| Backing | LangChain Inc. | ClickHouse Inc. (acquired Jan 2026) | Comet ML |
## When to Use Which
There is no single best LLM observability tool. The right choice depends on your existing stack, your operational preferences, and where you are in the build-vs-buy spectrum.
Choose LangSmith if you're building with LangChain or LangGraph and want the tightest possible integration. The evaluation framework is the most complete of the three, and the managed cloud removes operational burden. Accept the vendor lock-in if speed of integration matters more than flexibility.
Choose Langfuse if you need open-source, self-hosted observability with no licensing restrictions. It's the strongest option for multi-framework environments, and the ClickHouse backing gives it long-term stability. Teams with data residency requirements or those already running ClickHouse will find this the most natural fit.
Choose Opik if you're already using Comet ML for experiment tracking and want a unified platform across ML training and LLM evaluation. The performance advantage in trace logging also makes it worth evaluating if iteration speed during development is a bottleneck.
For most teams starting fresh with LLM observability, Langfuse offers the best balance of capability, flexibility, and cost. It's open source, framework-agnostic, and backed by a well-funded parent company. But if you're deep in the LangChain ecosystem, LangSmith's tighter integration and stronger evaluation tools may be worth the trade-off.
Whatever you choose, the key is to instrument early. Retrofitting observability into a production LLM application is far harder than building it in from day one.