A vendor-neutral playbook for cutting LLM costs in production, organized by lever: caching, model routing, prompt compression, batching, architecture choices, and token FinOps.
A demo that costs cents per call can turn into a five-figure monthly invoice once it meets real traffic. The bill rarely balloons because one model is expensive. It balloons because of compounding waste: a system prompt that ships on every request, an agent loop that retries five times before giving up, a retrieval step that stuffs 40K tokens of context to answer a one-line question, and a frontier model doing work a small one could handle.
LLM cost optimization is the practice of reducing the dollar cost of serving language-model workloads without degrading output quality past an acceptable threshold, by controlling token volume, model selection, and request patterns. The arithmetic is simple - cost equals input tokens times the input price plus output tokens times the output price, at the rates of the model you called - so every lever pushes on token volume, on the per-token price, or on both. The practical levers cluster into three verbs: cache what repeats, route to the cheapest model that clears the quality bar, and compress the tokens you send and receive. Layered, these routinely cut spend by 70% or more. This playbook walks through each lever, the typical savings, and the trade-off you accept to get it.
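As a minimal sketch of that arithmetic: most providers quote input and output prices separately, so the function below takes both. The prices are illustrative placeholders, not any provider's rate card, and `call_cost` is a name chosen for this example.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Dollar cost of one call, with prices quoted per million tokens."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000

# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
# A request that stuffs 40K tokens of context to produce a 500-token answer:
cost = call_cost(input_tokens=40_000, output_tokens=500,
                 input_price_per_mtok=3.0, output_price_per_mtok=15.0)
print(f"${cost:.4f} per call")
```

At these assumed prices the input side accounts for almost the whole cost of the call, which is why the context-heavy patterns described above dominate the bill.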
Measure before you optimize
You cannot cut what you cannot see. The first move is instrumentation: capture input tokens, output tokens, model name, and latency for every call, then aggregate spend per request, per user, and per feature. Most teams discover that a small slice of traffic drives most of the bill - a single chatty agent, one verbose endpoint, or a batch job that re-embeds the same documents nightly. Without per-feature attribution, you optimize blind and risk cutting the wrong thing.
Token spend hides in four places: retries (failed tool calls and timeouts that silently re-invoke the model), long context (RAG pipelines that retrieve too much), verbose output (no max_tokens cap, no structured-output constraint), and over-large models (a frontier model classifying sentiment). Track each as its own line item. Tools like Langfuse, LangSmith, and Opik trace token counts and cost per span out of the box; our comparison of LLM observability tools covers the trade-offs between them. Treat this as a FinOps discipline, not a one-off audit - the same cost-accountability loop that FinOps applies to cloud infrastructure applies to tokens.
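The attribution step can be sketched in a few lines, assuming you already log one record per model call. The `CallRecord` fields, the model names, and the `PRICES` table are hypothetical; an observability tool would capture the same data for you.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in dollars per million tokens: (input, output).
PRICES = {"small": (0.5, 2.0), "large": (5.0, 20.0)}

@dataclass
class CallRecord:
    feature: str          # tag identifying the product feature that made the call
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    is_retry: bool = False

def record_cost(r: CallRecord) -> float:
    """Dollar cost of a single logged call."""
    in_price, out_price = PRICES[r.model]
    return (r.input_tokens * in_price + r.output_tokens * out_price) / 1_000_000

def spend_by_feature(records: list[CallRecord]) -> dict[str, float]:
    """Aggregate spend per feature so the expensive endpoints stand out."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r.feature] += record_cost(r)
    return dict(totals)

def retry_spend(records: list[CallRecord]) -> float:
    """Spend attributable to retries, tracked as its own line item."""
    return sum(record_cost(r) for r in records if r.is_retry)
```

The same grouping works per user or per request ID; the point is that every call carries a tag you can aggregate on.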
A useful baseline metric is cost per resolved task, not cost per call. An agent that solves a problem in one expensive call can be cheaper than one that grinds through ten cheap ones. Optimize for the unit your business actually cares about.
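A quick illustration of why the unit matters, using made-up numbers for two hypothetical agents. Only the comparison is the point; none of the figures are measurements.

```python
def cost_per_resolved_task(cost_per_call: float, calls_per_task: int,
                           tasks_attempted: int, tasks_resolved: int) -> float:
    """Total spend divided by the number of tasks actually resolved."""
    if tasks_resolved == 0:
        return float("inf")  # nothing resolved: every dollar was wasted
    total_cost = cost_per_call * calls_per_task * tasks_attempted
    return total_cost / tasks_resolved

# Hypothetical agent A: one $0.05 call per task, resolves 90 of 100 tasks.
one_shot = cost_per_resolved_task(0.05, 1, 100, 90)
# Hypothetical agent B: ten $0.01 calls per task, resolves 80 of 100 tasks.
grinder = cost_per_resolved_task(0.01, 10, 100, 80)

print(f"A: ${one_shot:.3f}  B: ${grinder:.3f} per resolved task")
```

With these assumed numbers, agent B is five times cheaper per call and more than twice as expensive per resolved task.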
Route to the right model
Model selection is the highest-leverage decision in the whole stack, because per-token prices between tiers differ by an order of magnitude or more. The default mistake is using a frontier model everywhere. Most production traffic - classification, extraction, routing, short answers, formatting - does not need the strongest model. Right-sizing means matching each task to the smallest model that clears its quality bar, and moving the bulk of traffic down a tier.
Two patterns formalize this. Static routing sends a request class to a fixed model: cheap model for classification, mid-tier for summarization, frontier for open-ended reasoning. Dynamic routing decides per request. The RouteLLM framework from LMSYS trains a router on preference data and reports reaching 95% of GPT-4-class quality while cutting cost by over 85% on MT-Bench by sending only the hard queries to the strong model (LMSYS, 2024). Managed routers (OpenRouter, NotDiamond, Martian) and platform features like Amazon Bedrock and Azure intelligent prompt routing offer the same idea without training your own classifier.
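Static routing, the simpler of the two patterns, can be as small as a lookup table. This is a sketch: the task labels and model names are placeholders, and falling back to the strongest model for unknown task types is one possible design choice, not a rule.

```python
# Hypothetical mapping from request class to model tier.
ROUTES = {
    "classification": "small-model",
    "summarization": "mid-model",
    "reasoning": "frontier-model",
}

def route(task_type: str) -> str:
    """Return the model for a request class; unknown classes go to the strongest tier."""
    return ROUTES.get(task_type, "frontier-model")

print(route("classification"))
print(route("something-new"))
```

Defaulting unknown traffic upward protects quality at the expense of cost; a team confident in its small model might default downward and escalate on failure instead.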
Cascading takes routing further. The FrugalGPT paper (Chen, Zaharia, Zou, Stanford) describes an LLM cascade that queries a cheap model first, scores the answer's reliability, and escalates to a stronger model only when confidence is low. The paper reports matching the best individual LLM's accuracy with up to 98% cost reduction. The trade-off is real: cascades add a scoring step and tail latency on escalation, and a badly tuned confidence threshold either leaks errors or escalates too often. Start with static routing, measure the quality gap, and add cascading only where the savings justify the complexity.
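The cascade control flow can be sketched as follows. This is not the FrugalGPT implementation: the paper learns its scoring function, whereas here the scorer and the models are plain callables you would supply, and all names are illustrative.

```python
from typing import Callable, Sequence

Model = Callable[[str], str]            # prompt -> answer
Scorer = Callable[[str, str], float]    # (prompt, answer) -> confidence in [0, 1]

def cascade(prompt: str, models: Sequence[tuple[str, Model]],
            scorer: Scorer, threshold: float) -> tuple[str, str]:
    """Try models cheapest-first and return (model_name, answer).

    A model's answer is accepted when its score meets the threshold.
    The last model is the fallback, so its answer is returned unscored.
    """
    if not models:
        raise ValueError("cascade needs at least one model")
    for name, model in models[:-1]:
        answer = model(prompt)
        if scorer(prompt, answer) >= threshold:
            return name, answer
    name, model = models[-1]
    return name, model(prompt)
```

Every escalation pays for the cheap call, the scoring step, and the strong call, which is the tail-latency and cost overhead described above. Logging which tier answered each request shows whether the threshold is escalating too often.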
The same tiering holds across the major providers. Anthropic's Claude Opus 4.x, Sonnet 4.x, and Haiku 4.5 tiers, OpenAI's flagship and mini tiers, and Gemini's Pro and Flash tiers all follow one pattern: a cheap small model and an expensive large one, with a 5x to 20x price gap between them.
Cache what repeats
Caching attacks the input-token term directly, and it comes in three flavors that stack.
Provider prompt caching lets you mark a stable prefix - system prompt, tool definitions, few-shot examples, a long document - so the provider stores its computed state and bills reused tokens at a steep discount. Per Anthropic's prompt caching docs, a cache read costs 0.1x the base input price (a 90% discount), while the initial cache write costs 1.25x for a 5-minute TTL or 2x for a 1-hour TTL. On the short TTL, caching pays off after a single read: one write plus one read costs 1.35x the prefix price, against 2x for two uncached calls. The 1-hour TTL needs two reads to come out ahead. OpenAI applies prompt caching automatically for long shared prefixes, and Gemini offers explicit context caching. The engineering rule is structural: put everything stable at the front of the prompt and everything variable at the end, so the cacheable prefix is as long as possible.
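The break-even arithmetic behind those multipliers can be checked with a short sketch. It assumes every read lands inside the TTL and measures cost in units of the base input price for the cached prefix; the function names are illustrative.

```python
def cached_cost(reads: int, write_multiplier: float,
                read_multiplier: float = 0.1) -> float:
    """Relative prefix cost of one cache write followed by `reads` cache reads."""
    return write_multiplier + reads * read_multiplier

def uncached_cost(reads: int) -> float:
    """Relative prefix cost of the same calls with no caching (full price each time)."""
    return 1.0 + reads

def break_even_reads(write_multiplier: float, read_multiplier: float = 0.1) -> int:
    """Smallest number of reads at which caching is strictly cheaper than not caching."""
    if read_multiplier >= 1.0:
        raise ValueError("a read must cost less than a full-price call")
    reads = 0
    while cached_cost(reads, write_multiplier, read_multiplier) >= uncached_cost(reads):
        reads += 1
    return reads

print(break_even_reads(1.25))  # 5-minute TTL write premium
print(break_even_reads(2.0))   # 1-hour TTL write premium
```

Under these assumptions the 5-minute TTL breaks even after one read and the 1-hour TTL after two. A prefix that is written and never read again costs more than not caching at all.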
Semantic response caching eliminates whole calls. Instead of matching prompts byte-for-byte, it embeds the query and returns a stored response when a prior query is close enough in vector space. The open-source GPTCache library does this with a Redis or PostgreSQL backend. Separately, the GPT Semantic Cache research reports reducing API calls by up to 68.8% with positive-hit accuracy above 97% on FAQ-style traffic. The risk is false positives - two questions that look similar but aren't - so tune the similarity threshold conservatively (many teams start around 0.75 to 0.85 cosine similarity) and exclude anything where a stale or near-miss answer is unacceptable.
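The lookup logic is small enough to sketch without any library. This is not GPTCache's API: `SemanticCache` is a hypothetical class, the `embed` function is something you would supply (an embedding model call in practice), and the linear scan stands in for a real vector index.

```python
import math
from typing import Callable, Optional

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class SemanticCache:
    def __init__(self, embed: Callable[[str], list[float]], threshold: float = 0.85):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def get(self, query: str) -> Optional[str]:
        """Return the stored response of the most similar query, or None on a miss."""
        q = self.embed(query)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        """Store a response under the embedding of its query."""
        self.entries.append((self.embed(query), response))
```

On a miss you call the model and `put` the result. Raising the threshold trades hit rate for fewer false positives, which is the tuning knob described above.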
KV caching applies when you self-host. The key-value attention cache lets an inference server reuse computation across turns and requests; engines like vLLM and TGI manage it automatically, and it is the main reason a well-configured self-hosted endpoint sustains high throughput.
Compress the tokens
Every token you don't send is a token you don't pay for. Compression starts with the system prompt: trim it, drop redundant instructions, and avoid restating the same rules in multiple ways. The larger win is usually retrieval. RAG pipelines that "stuff" context inflate input cost on every call and can degrade quality when the relevant passage drowns in noise. Retrieve precisely instead of generously - better chunking, reranking, and a tighter top-k beat a bigger context window on both cost and accuracy. Our RAG architecture guide covers the retrieval design that keeps context lean.
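One way to enforce "precisely, not generously" is a hard token budget on retrieved context. The sketch below assumes chunks arrive with a relevance score from your reranker, and it uses whitespace word count as a stand-in for a real tokenizer; `select_chunks` and its parameters are illustrative names.

```python
from typing import Callable

def select_chunks(scored_chunks: list[tuple[float, str]], top_k: int,
                  token_budget: int,
                  count_tokens: Callable[[str], int] = lambda t: len(t.split())
                  ) -> list[str]:
    """Keep the highest-scoring chunks that fit within a token budget.

    Considers only the top_k chunks by score and skips any chunk that
    would push the running total over the budget.
    """
    ranked = sorted(scored_chunks, key=lambda c: c[0], reverse=True)[:top_k]
    chosen: list[str] = []
    used = 0
    for _score, text in ranked:
        n = count_tokens(text)
        if used + n > token_budget:
            continue  # does not fit; a smaller, lower-ranked chunk still might
        chosen.append(text)
        used += n
    return chosen
```

Swapping the default `count_tokens` for your model's tokenizer makes the budget exact rather than approximate.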
Output tokens are typically priced higher than input tokens, so capping them matters. Set max_tokens to a realistic ceiling, use structured output (JSON schema or function calling) so the model returns data instead of prose, and add stop sequences to cut off rambling. "Answer in one sentence" in the prompt is a cost control, not just a style note. The shift from hand-tuned prompts toward systematically engineered context - covered in "From prompt engineering to context engineering" - is partly a cost story: a well-engineered context hits the quality target with fewer tokens.
Batch the work that can wait
Not every request needs an answer in two seconds. For workloads that tolerate latency - nightly document classification, bulk summarization, embedding backfills, evaluation runs - the asynchronous batch APIs cut the bill in half. Both OpenAI and Anthropic process batched requests asynchronously within a 24-hour window at roughly 50% of standard token prices (VentureBeat, 2024), and the discount applies to input and output tokens alike.
Migration is usually a low-friction lift - you submit requests in bulk (JSONL) and poll for results rather than calling synchronously. The discounts compose: a batched call against a cached prompt stacks the 50% batch discount on top of the cache read discount. The only constraint is that batch is wrong for anything user-facing and interactive. Reserve it for the pipeline work that runs on a schedule, and you get half off a meaningful chunk of total spend for almost no engineering cost.
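A sketch of both halves of that paragraph: serializing requests as JSONL and computing the stacked discount. The record fields are a generic placeholder schema, not any provider's batch format (each provider documents its own), and the stacking function simply assumes the two discounts multiply as described above.

```python
import json

def to_jsonl(prompts: dict[str, str], model: str, max_tokens: int) -> str:
    """Serialize one request per line, keyed by a caller-chosen ID for matching results."""
    lines = [
        json.dumps({"custom_id": custom_id, "model": model,
                    "max_tokens": max_tokens, "prompt": prompt})
        for custom_id, prompt in prompts.items()
    ]
    return "\n".join(lines)

def effective_input_multiplier(batch_multiplier: float = 0.5,
                               cache_read_multiplier: float = 1.0) -> float:
    """Fraction of the standard input price paid when both discounts apply."""
    return batch_multiplier * cache_read_multiplier

payload = to_jsonl({"doc-1": "Classify: late delivery", "doc-2": "Classify: great service"},
                   model="small-model", max_tokens=20)
stacked = effective_input_multiplier(cache_read_multiplier=0.1)
```

With a 0.1x cache read inside a 50% batch, cached input tokens would cost about 5% of the standard price, assuming the cache is actually hit for requests processed in the batch.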
Architecture and governance decisions
Some cost decisions sit above any single request. Three matter most.
Fine-tune a small model versus prompt a large one. When a task is narrow and high-volume, fine-tuning a small model to do it well can beat paying frontier per-token rates plus long few-shot prompts on every call. The break-even depends on volume and the cost of building and maintaining the tuned model; our guide on fine-tuning LLMs when RAG isn't enough walks the decision in depth.
Self-host versus API. Below a volume threshold, API pricing wins because you pay only for what you use. Above it, a self-hosted endpoint with quantization can be cheaper per token, at the cost of running infrastructure. Quantization is the lever here: relative to 16-bit weights, INT8 roughly halves memory with a small quality drop, and INT4 cuts memory by about 75% using methods like AWQ or GPTQ (Latitude, 2025), which lets you serve a given model on cheaper hardware or batch more requests per GPU. The trade-off is operational ownership - GPUs, autoscaling, and reliability all become your problem. Our practical guide to running LLMs locally covers the setup.
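Both numbers in this decision are simple arithmetic. The sketch below estimates weight memory only (it ignores the KV cache, activations, and the small overhead of quantization scales) and computes a break-even volume from a fixed monthly infrastructure cost. The 7B model size and all dollar figures are hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in gigabytes."""
    return params_billion * bits_per_weight / 8

def break_even_mtok_per_month(monthly_infra_cost: float, api_price_per_mtok: float,
                              selfhost_marginal_per_mtok: float = 0.0) -> float:
    """Monthly volume (millions of tokens) above which self-hosting is cheaper."""
    saving_per_mtok = api_price_per_mtok - selfhost_marginal_per_mtok
    if saving_per_mtok <= 0:
        return float("inf")  # self-hosting never catches up
    return monthly_infra_cost / saving_per_mtok

for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: {weight_memory_gb(7, bits)} GB")

# Hypothetical: $2,000/month for a GPU node versus a blended API price of $2 per million tokens.
threshold = break_even_mtok_per_month(2000.0, 2.0)
```

For the hypothetical 7B model, weights take 14 GB at 16-bit, 7 GB at INT8, and 3.5 GB at INT4. The break-even figure leaves out engineering time and on-call burden, so the real threshold sits higher than the arithmetic suggests.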
Governance. Optimization decays without guardrails. Set per-key and per-team budgets, rate limits, and alerts on cost spikes so a runaway agent loop trips an alarm instead of an invoice. Attribute spend back to teams (chargeback) so the people writing the prompts see the bill. This is the FinOps operating model applied to tokens: continuous measurement, accountability, and optimization rather than a quarterly cleanup.
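A budget guardrail can start as a small in-process check before it becomes gateway configuration. This is a sketch with hypothetical names and thresholds; a production version would persist spend and reset it each billing period.

```python
class BudgetGuard:
    """Track spend per team and flag when it approaches or exceeds the limit."""

    def __init__(self, monthly_limits: dict[str, float], alert_ratio: float = 0.8):
        self.limits = monthly_limits
        self.alert_ratio = alert_ratio
        self.spent = {team: 0.0 for team in monthly_limits}

    def record(self, team: str, cost: float) -> str:
        """Add a call's cost and return 'ok', 'alert', or 'blocked'.

        Raises KeyError for a team with no configured budget.
        """
        limit = self.limits[team]
        self.spent[team] += cost
        if self.spent[team] >= limit:
            return "blocked"   # stop issuing calls for this team
        if self.spent[team] >= self.alert_ratio * limit:
            return "alert"     # notify before the limit is hit
        return "ok"
```

Calling `record` after each model call and refusing new calls once it returns "blocked" is what turns a runaway loop into an alarm. The per-team `spent` totals double as the chargeback figures.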
The decision framework
Use this to sequence the levers by effort and payoff. Start at the top - the cheap, high-return moves - and descend only as far as your volume justifies.
| Technique | Effort | Typical savings | Trade-off / risk | When to use |
|---|---|---|---|---|
| Provider prompt caching | Low | Up to ~90% on cached input (Anthropic) | Write premium on first call; needs stable prefix | Repeated system prompts, long shared context |
| Output token control | Low | Varies; targets the higher-priced token | Truncation if cap is too tight | Always |
| Model right-sizing | Low-Med | 5-20x on routed traffic | Quality regression if mis-sized | Classification, extraction, routing |
| Batch API | Low | ~50% (VentureBeat) | Up to 24h latency | Offline / scheduled jobs |
| Semantic caching | Medium | Up to ~69% fewer calls (arXiv) | False-positive cache hits | FAQ, support, repetitive queries |
| Dynamic routing / cascade | Med-High | Up to 85-98% (RouteLLM, FrugalGPT) | Router/scoring complexity, tail latency | High volume, mixed difficulty |
| Self-host + quantization | High | Per-token cheaper at scale (Latitude) | You own the infra and reliability | Very high, steady volume |
Key takeaways
- Cost is tokens times price. Every lever reduces token volume, the per-token price, or both.
- Instrument first. Measure cost per feature and per resolved task before changing anything, then treat tokens as a FinOps discipline.
- Cache, route, compress - in that order of effort-to-payoff. Prompt caching and right-sizing are the cheap wins; routing and self-hosting are for proven volume.
- Batch the work that can wait for a flat ~50% discount, and stack it with prompt caching where both apply.
- Don't introduce quality regressions. Set a quality threshold per task and verify each optimization holds it - a cheaper wrong answer costs more than an expensive right one.
- Govern continuously. Budgets, rate limits, and spike alerts keep a runaway loop from becoming a runaway bill.
FAQ
How do I reduce LLM costs in production? Instrument cost per feature, then apply the three levers in order of payoff: enable provider prompt caching, route most traffic to smaller models, and compress prompts and outputs. Layer batching for offline work. Combined, these commonly cut spend by 70% or more.
What's the cheapest way to run an LLM in production? It depends on volume. At low to moderate volume, a small API model with prompt caching and tight output limits is cheapest because you pay only per use. At very high, steady volume, a self-hosted quantized model can win per token, but you take on the infrastructure and reliability burden.
How do I track LLM cost per feature? Log input tokens, output tokens, and model for every call, tag each call with a feature identifier, and aggregate in an observability tool (Langfuse, LangSmith, or Opik) or your own metrics pipeline. Per-feature attribution is what lets you find and fix the few endpoints driving most of the spend.
Cutting LLM spend is rarely one big change; it's a sequence of measured ones, each verified against a quality bar. BigData Boutique helps engineering teams instrument token spend, right-size models, and build the routing and caching layers that keep production AI affordable. If your LLM bill is growing faster than your usage, get in touch.