Building an LLM application is one thing. Understanding why it gave a bad answer at 3 AM on a Tuesday is another. LangSmith is the observability, evaluation, and testing platform built by the LangChain team, designed to give developers full visibility into how their LLM applications behave in development and production.
LangSmith captures detailed traces of every step in an LLM workflow - model calls, retrieval steps, tool usage, agent decisions - and provides evaluation frameworks, prompt management, and monitoring dashboards on top of that data. It works natively with LangChain and LangGraph, but also supports applications built with the OpenAI SDK, the Anthropic SDK, LlamaIndex, the Vercel AI SDK, or custom implementations through its Python, TypeScript, Go, and Java SDKs.
Key Features of LangSmith
Tracing: Every LLM application request is captured as a structured trace - a tree of runs showing parent/child relationships, inputs/outputs at each step, latencies, token counts, and errors. You can filter, export, share, and compare traces through the UI or API. For LangChain and LangGraph applications, tracing is automatic with a single environment variable.
Evaluation Framework: LangSmith supports multiple evaluator types: LLM-as-judge scoring against criteria you define, heuristic checks (output validation, code compilation), human evaluation through annotation queues, and pairwise comparisons. Custom evaluators can be written in Python or TypeScript with arbitrary business logic - correctness matching, hallucination detection, guardrails validation. Datasets are built directly from production traces, so you test against real-world inputs.
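A custom evaluator is, at its core, just a function that compares a run's output against a reference and returns a score. The sketch below is illustrative only: the `exact_match` name and the `outputs`/`answer` field layout are assumptions about how an application's runs and dataset examples might be shaped, not LangSmith's exact schema.

```python
# Illustrative custom evaluator: case-insensitive exact-match correctness.
# The dict field names ("outputs", "answer") are assumptions for this sketch.
def exact_match(run: dict, example: dict) -> dict:
    predicted = run["outputs"]["answer"].strip().lower()
    expected = example["outputs"]["answer"].strip().lower()
    # Evaluators report a named feedback key and a score.
    return {"key": "correctness", "score": int(predicted == expected)}
```

In practice a function like this would be registered with LangSmith's evaluation entry point alongside a dataset of examples, so every experiment run gets a `correctness` score attached automatically.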
Prompt Playground: An interactive environment for testing and iterating on prompts with different models and real production data. You can edit messages, swap models, and immediately see how changes affect output - without deploying anything.
Monitoring and Dashboards: Real-time dashboards tracking costs, latency distributions, response quality scores, and usage patterns across your LLM applications. Set up alerts to get notified when performance degrades or costs spike.
Annotation and Feedback: Built-in annotation queues let team members score and label LLM outputs. This human feedback feeds directly into evaluation datasets, creating a loop between production quality and offline testing.
Prompt Hub Integration: A centralized repository for versioning, sharing, and deploying prompts. Every push generates a unique commit hash capturing that exact prompt version and its model configuration. Prompt engineers can iterate through the web interface while developers pull specific versions into application code - separating prompt changes from code deployments.
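The commit-hash idea is essentially content-addressed versioning: hashing the prompt text together with its model configuration yields a stable identifier that changes whenever either changes. The snippet below is a conceptual sketch of that mechanism, not LangSmith's actual implementation; the `prompt_commit_hash` helper is hypothetical.

```python
import hashlib
import json

def prompt_commit_hash(messages: list[dict], model_config: dict) -> str:
    """Derive a stable commit id from a prompt and its model settings.

    Conceptual sketch only: LangSmith computes its own commit hashes.
    This just shows why any edit yields a new version identifier.
    """
    payload = json.dumps(
        {"messages": messages, "config": model_config}, sort_keys=True
    ).encode()
    return hashlib.sha256(payload).hexdigest()[:8]

v1 = prompt_commit_hash(
    [{"role": "system", "content": "You are helpful."}],
    {"model": "gpt-4o", "temperature": 0},
)
v2 = prompt_commit_hash(
    [{"role": "system", "content": "You are terse."}],
    {"model": "gpt-4o", "temperature": 0},
)
# v1 != v2: changing the prompt text produces a new commit id.
```

Because the hash pins both the messages and the model configuration, application code that pulls a specific commit is insulated from later edits made in the web interface.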
How LangSmith Works
The architecture follows a straightforward pattern: instrument your application with the LangSmith SDK, trace data flows to LangSmith's backend, and you interact with it through the dashboard.
For LangChain and LangGraph applications, you set the LANGCHAIN_TRACING_V2 environment variable to true and LangSmith automatically captures every chain execution, agent step, and tool call. For other frameworks, the SDK provides decorators and context managers to wrap the functions you want traced.
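For a LangChain or LangGraph app, enabling tracing is pure environment configuration. A minimal setup might look like the following (the API key value is a placeholder, and the project name is an arbitrary example):

```shell
# Enable automatic LangSmith tracing for LangChain/LangGraph apps.
export LANGCHAIN_TRACING_V2=true
# Authenticate against LangSmith (placeholder value).
export LANGCHAIN_API_KEY="<your-api-key>"
# Optional: group traces under a named project.
export LANGCHAIN_PROJECT="my-app-dev"
```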
Each trace is structured as a tree. The root run represents the top-level request, with child runs for each inner operation - an LLM call, a retrieval step, a tool execution. In the UI, you drill into any run to inspect its inputs, outputs, timing, and token usage.
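The tree structure can be pictured as nested runs, each carrying its own timing and usage data. The field names below are illustrative assumptions for the sketch, not the SDK's actual trace schema:

```python
# Illustrative run tree: a root request with child runs for the
# retrieval step and the LLM call. Field names are assumptions.
trace = {
    "name": "answer_question",
    "latency_ms": 1240,
    "children": [
        {"name": "retrieve_docs", "latency_ms": 180, "children": []},
        {"name": "llm_call", "latency_ms": 1020, "tokens": 812, "children": []},
    ],
}

def total_tokens(run: dict) -> int:
    """Sum token usage across a run and all of its descendants."""
    return run.get("tokens", 0) + sum(total_tokens(c) for c in run["children"])
```

Aggregations like `total_tokens` are the kind of rollup the LangSmith UI surfaces when you drill into a root run.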
From there, you can send traces to evaluation datasets, run automated evals against those datasets, compare prompt versions in the playground, and monitor production metrics on dashboards. The workflow creates a feedback loop: production data feeds evaluation, evaluation results inform prompt changes, and monitoring catches regressions.
LangSmith vs Langfuse
Both platforms solve LLM observability, but they make different trade-offs.
| | LangSmith | Langfuse |
|---|---|---|
| Source model | Proprietary (managed service) | Open source (MIT license) |
| Framework focus | LangChain/LangGraph native, supports others | Framework-agnostic from the start |
| Deployment | Managed cloud, BYOC, self-hosted | Self-hosted or managed cloud |
| Tracing model | Run-based (aligns with LangChain execution) | OpenTelemetry-based |
| Evaluation | Built-in evaluators, annotation queues, playground | Evaluation API, custom scoring functions |
| Prompt management | SHA-based versioning via Prompt Hub | Integer-versioned prompt management |
| Pricing | Free tier (5k traces/mo), Plus at $39/seat/mo | Free self-hosted, usage-based cloud pricing |
LangSmith is the natural choice if your stack is built on LangChain or LangGraph - the integration is seamless and the evaluation tooling is polished. Langfuse is stronger for teams using multiple frameworks or wanting full control through self-hosting with no license constraints.
When to Use LangSmith
LangSmith fits well when:
- Your stack is LangChain or LangGraph: Automatic tracing with zero instrumentation code. The platform is designed around these frameworks' execution model.
- You want managed infrastructure: No servers to run, no databases to maintain. The Developer plan is free for up to 5,000 traces per month, which covers most development and testing workflows.
- Evaluation and testing matter to your workflow: The built-in evaluation framework - with dataset management, automated evals, human annotation queues, and regression testing - is more turnkey than most alternatives.
- You need a prompt engineering workflow: The Playground and Hub give prompt engineers a dedicated environment for iteration and versioning, decoupled from application code.
For teams not committed to the LangChain ecosystem, or those needing a fully open-source solution with no vendor dependency, Langfuse or other framework-agnostic tools may be a better fit.
Pricing Overview
LangSmith offers three tiers. The Developer plan is free with one seat and 5,000 base traces per month. The Plus plan costs $39 per seat per month, includes 10,000 base traces, and adds features like multiple workspaces for environment separation. The Enterprise plan adds advanced administration, security, and deployment options. Additional traces beyond the included allotment cost $0.50 per 1,000 base traces (14-day retention) or $5.00 per 1,000 extended traces (400-day retention).
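As a quick sanity check on the overage math, using the rates listed above (the helper function itself is just for illustration):

```python
def monthly_overage_cost(traces: int, included: int = 10_000,
                         rate_per_1k: float = 0.50) -> float:
    """Cost of base traces (14-day retention) beyond the Plus allotment.

    Illustrative helper using the published rates: $0.50 per 1,000
    base traces over the 10,000 included with the Plus plan.
    """
    extra = max(0, traces - included)
    return extra / 1_000 * rate_per_1k

# 25,000 traces on Plus: 15,000 over the allotment -> 15 * $0.50 = $7.50
cost = monthly_overage_cost(25_000)
```

Extended traces at $5.00 per 1,000 follow the same arithmetic with the higher rate, so the retention choice is a 10x cost factor on overage.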