A practical engineering guide to Amazon Bedrock pricing - how token billing works, when to use on-demand vs provisioned throughput vs batch, hidden costs beyond inference, and concrete strategies to keep your bill under control.
Amazon Bedrock gives you API access to foundation models from Anthropic, Meta, Mistral, Cohere, and Amazon's own Nova family - without managing infrastructure. But its pricing model is not a single flat rate. You're paying across multiple dimensions: input tokens, output tokens, model choice, pricing tier, and a growing list of add-on features like Knowledge Bases, Guardrails, and Agents. Without understanding these layers, costs can escalate quickly once you move past prototyping.
This guide breaks down how Bedrock pricing actually works, when each pricing tier makes sense, what hidden costs to watch for, and practical strategies to keep spend under control in production.
How Token-Based Billing Works
Bedrock charges per token for most text and embedding models, with separate rates for input and output. A token roughly maps to 3/4 of a word in English - a 1,000-word prompt is approximately 1,333 tokens. Input tokens (your prompt, system instructions, and any context like RAG results) and output tokens (the model's response) are billed at different rates, with output tokens typically costing 2-5x more than input tokens.
The price spread across models is dramatic. Amazon Nova Micro runs at $0.035 per million input tokens and $0.14 per million output tokens. Anthropic Claude 3.5 Sonnet charges $6.00 per million input tokens and $30.00 per million output tokens - roughly 170x more on the input side and 214x more on output. This isn't just a premium for a "better" model; it reflects the computational cost of larger, more capable architectures. Choosing the right model for each task is the single biggest lever you have on cost.
For image generation, pricing shifts to per-image billing. Amazon Nova Canvas charges $0.04-$0.08 per image depending on resolution and quality settings, while Stability AI models bill at hourly rates for provisioned throughput.
A Quick Cost Estimate
Consider a customer support application processing 100,000 queries per day, each with 300 input tokens and 200 output tokens. Using Amazon Nova Lite on-demand, the math works out to roughly $6.60/day, or about $200/month. Switch the same workload to Claude 3.5 Sonnet and the same arithmetic gives roughly $780/day - over $23,000/month. The model choice alone can be the difference between a $200/month service and a $20,000/month one.
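This arithmetic is worth scripting so you can compare models before committing. A minimal sketch using per-million-token rates: the Claude figures come from above, while the Nova Lite rates are the published on-demand figures that reproduce the $6.60/day number - verify both against the current Bedrock pricing page:

```python
# Daily/monthly cost estimator for a token-billed Bedrock workload.
# Rates are USD per million tokens; check current pricing before relying on them.
PRICES = {
    "amazon-nova-lite": {"in": 0.06, "out": 0.24},
    "claude-3.5-sonnet": {"in": 6.00, "out": 30.00},
}

def daily_cost(model: str, queries: int, in_tok: int, out_tok: int) -> float:
    p = PRICES[model]
    return queries * (in_tok * p["in"] + out_tok * p["out"]) / 1_000_000

for model in PRICES:
    cost = daily_cost(model, queries=100_000, in_tok=300, out_tok=200)
    print(f"{model}: ${cost:,.2f}/day (~${cost * 30:,.0f}/month)")
# amazon-nova-lite: $6.60/day (~$198/month)
# claude-3.5-sonnet: $780.00/day (~$23,400/month)
```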
Choosing the Right Pricing Tier
Bedrock offers three main pricing tiers, each suited to different workload patterns.
On-demand is the default: you pay per token with no commitments. It's ideal for development, experimentation, and workloads with unpredictable traffic. The downside is that you're subject to throttling limits during high demand, and per-token costs are at their highest.
Batch inference lets you submit large sets of prompts as a single job and get results asynchronously. You trade latency for cost: batch pricing runs at roughly 50% of on-demand rates across supported models. If you have workloads that don't need real-time responses (nightly summarization jobs, bulk classification, document processing pipelines), batch is the obvious choice. You stage prompts in S3, kick off the job, and collect results when it's done.
Provisioned throughput reserves dedicated capacity at fixed hourly rates with optional 1-month or 6-month commitments. For Meta Llama 3.3 Instruct (70B), the no-commitment rate is $24.00/hour ($17,280/month). A 1-month commitment brings that down to $21.18/hour, and a 6-month commitment to $13.08/hour - a 45% discount. This makes sense when you have consistent, high-volume traffic and need guaranteed latency without throttling. But the economics only work if your utilization is high enough; paying $17,280/month for capacity you use 20% of the time is worse than on-demand.
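Before committing, sketch the break-even against your actual on-demand bill. A rough check using the Llama 3.3 70B hourly rates above, with a hypothetical measured on-demand spend:

```python
# Compare provisioned throughput fixed cost against measured on-demand spend.
# Hourly rates are the Llama 3.3 Instruct (70B) figures quoted above.
HOURS_PER_MONTH = 720

PROVISIONED_HOURLY = {
    "no commitment": 24.00,
    "1-month commit": 21.18,
    "6-month commit": 13.08,
}

on_demand_monthly = 8_000.00  # hypothetical: your measured on-demand bill

for term, rate in PROVISIONED_HOURLY.items():
    fixed = rate * HOURS_PER_MONTH
    delta = fixed - on_demand_monthly
    sign = "+" if delta > 0 else "-"
    print(f"{term}: ${fixed:,.0f}/month ({sign}${abs(delta):,.0f} vs on-demand)")
# At $8,000/month on-demand, even the 6-month commitment ($9,418/month)
# costs more - provisioned only wins at sustained higher volume.
```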
The decision framework is straightforward: start with on-demand for development and variable workloads, move batch-eligible work to batch for the 50% savings, and only commit to provisioned throughput when you have sustained traffic that justifies the fixed hourly cost. One important caveat: fine-tuned custom models require provisioned throughput for inference - there's no on-demand option for them.
Hidden Costs Beyond Inference
Token pricing gets most of the attention, but several Bedrock features add their own billing dimensions that can quietly inflate your bill.
Knowledge Bases are commonly used for RAG (retrieval-augmented generation) pipelines. The retrieval queries themselves are billed - SQL-based Knowledge Bases cost around $2.00 per 1,000 queries. But the bigger cost surprise is the underlying vector store. If you're using Amazon OpenSearch Serverless as the backend, expect a minimum of $600-$700/month just for the OpenSearch infrastructure, regardless of query volume. This is a significant fixed cost that's easy to overlook during prototyping.
Agents don't have their own per-call fee, but a single user query to an Agent can trigger 5-10 internal model calls as it reasons through tool selection, parameter extraction, and response synthesis. Your effective per-query cost is a multiple of the base model rate, and it compounds quickly in multi-step agentic workflows.
Guardrails add $0.15 per 1,000 text units per filter type (a text unit covers up to 1,000 characters). If you're running content filtering, PII detection, and topic blocking on every request, those charges add up - especially on high-throughput applications.
Model customization brings training costs (per-token), mandatory monthly storage fees ($1.95/month per custom model), and the provisioned throughput requirement for inference mentioned earlier. A fine-tuned model that sits idle still costs you storage every month.
Data Automation features for processing documents, images, audio, and video carry their own rates: $0.010 per page for documents, $0.006 per minute for audio, $0.050 per minute for video. Flows (Bedrock's workflow orchestration) bills at $0.035 per 1,000 node transitions - complex multi-step workflows with loops can rack up transitions quickly.
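It's worth estimating these add-ons up front rather than discovering them on the bill. A back-of-envelope sketch using the rates above, with hypothetical volumes (and assuming one text unit per Guardrails check, which undercounts long inputs):

```python
# Monthly estimate for Bedrock add-on charges, using rates quoted above.
# Volumes are hypothetical; adjust for your own traffic.
queries = 3_000_000  # ~100k/day

kb_queries   = queries * (2.00 / 1_000)      # Knowledge Base retrievals
guardrails   = queries * (0.15 / 1_000) * 3  # 3 filter types, 1 text unit each
opensearch   = 700.00                        # OpenSearch Serverless floor
agent_factor = 7                             # ~5-10 model calls per Agent query

print(f"KB retrieval queries:   ${kb_queries:>8,.0f}")
print(f"Guardrails (3 filters): ${guardrails:>8,.0f}")
print(f"Vector store minimum:   ${opensearch:>8,.0f}")
print(f"Agent amplification:    ~{agent_factor}x base inference cost")
# KB retrieval queries:   $  6,000
# Guardrails (3 filters): $  1,350
# Vector store minimum:   $    700
```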
Practical Cost Optimization
Beyond choosing the right pricing tier, several techniques can meaningfully reduce Bedrock costs in production.
Prompt Caching
If your requests share common prefixes - system prompts, few-shot examples, or shared context documents - prompt caching can deliver up to 90% cost reduction on cached tokens with up to 85% latency improvement. Cached tokens on Claude 3.5 Sonnet drop from $6.00 to $0.60 per million tokens for cache reads (with a $7.50/million write cost on first caching). The cache lives for 5 minutes, with the TTL reset on each cache hit, so this works best for applications with sustained request patterns and consistent system prompts. Structure your prompts to maximize the shared prefix across requests.
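With the Bedrock Converse API, you mark where the cacheable prefix ends with an explicit cache point. A minimal sketch, assuming a Claude model that supports caching (the model ID and prompt contents are placeholders, and models enforce a minimum cacheable prefix length):

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Long, stable instructions: the shared prefix worth caching.
system_prompt = "You are a support assistant for ExampleCorp. <...policies, examples...>"

response = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",  # placeholder model ID
    system=[
        {"text": system_prompt},
        {"cachePoint": {"type": "default"}},  # everything above this is cached
    ],
    messages=[{"role": "user", "content": [{"text": "How do I reset my password?"}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```

The response's usage metadata reports cached-token counts, which is the easiest way to confirm the cache is actually being hit.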
Right-Sizing Model Selection
Not every task needs the most capable model. A classification or routing task that Claude 3.5 Sonnet handles at $6.00/$30.00 per million tokens might work just as well with Amazon Nova Micro at $0.035/$0.14 per million tokens - a 170x cost reduction. Consider implementing a tiered model strategy: use a small, cheap model for simple tasks (classification, extraction, routing) and reserve larger models for complex reasoning or generation. This pattern is sometimes called "model routing" - the small model triages and only escalates to the expensive model when needed.
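A minimal sketch of the routing pattern, assuming the Converse API (model IDs are illustrative; some Nova models must be invoked through an inference-profile ID):

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")
CHEAP = "amazon.nova-micro-v1:0"                        # illustrative model IDs
CAPABLE = "anthropic.claude-3-5-sonnet-20241022-v2:0"

def ask(model_id: str, prompt: str) -> str:
    resp = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return resp["output"]["message"]["content"][0]["text"]

def route(query: str) -> str:
    # The cheap model triages; only COMPLEX queries escalate.
    triage = ask(CHEAP, "Reply with exactly SIMPLE or COMPLEX: "
                        f"how hard is this request?\n\n{query}")
    return ask(CHEAP if "SIMPLE" in triage.upper() else CAPABLE, query)
```

The triage call itself costs Nova Micro rates, so the overhead of routing is negligible next to the savings when most traffic stays on the cheap tier.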
Batch Everything You Can
Any workload that tolerates latency should go through batch inference. Nightly report generation, bulk document summarization, dataset annotation, content moderation backlogs - all of these benefit from the 50% batch discount with no quality trade-off. Combine prompts into single job submissions and store results in S3 for downstream processing.
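Submitting a batch job is a control-plane call against the bedrock client (not bedrock-runtime). A sketch with placeholder bucket, role, and model values; the input is a JSONL file with one record per prompt:

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Input: a JSONL file in S3, one {"recordId": "...", "modelInput": {...}}
# object per line. All names and ARNs below are placeholders.
job = bedrock.create_model_invocation_job(
    jobName="nightly-summaries-2025-01-15",
    modelId="amazon.nova-lite-v1:0",
    roleArn="arn:aws:iam::123456789012:role/BedrockBatchRole",
    inputDataConfig={"s3InputDataConfig": {"s3Uri": "s3://my-bucket/batch-in/prompts.jsonl"}},
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://my-bucket/batch-out/"}},
)

# Poll for completion; results land in the output S3 prefix.
status = bedrock.get_model_invocation_job(jobIdentifier=job["jobArn"])["status"]
print(job["jobArn"], status)
```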
Monitoring and Alerting
AWS Cost Explorer supports filtering by usage type, API operation, region, and tags - use it. Tag your Bedrock usage by team, project, and environment. Set up AWS Budgets alerts before you need them, not after the first surprising bill. CloudWatch can track token consumption patterns and trigger alarms when usage exceeds thresholds. The AWS Billing Console shows token counts in the "Usage Quantity" column for line-item analysis.
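For a hard backstop, alarm directly on token throughput. A sketch assuming Bedrock's per-model CloudWatch metrics (the threshold and SNS topic ARN are placeholders):

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Bedrock publishes per-model metrics (InputTokenCount, OutputTokenCount,
# Invocations) under the AWS/Bedrock namespace. Size the threshold to
# your expected traffic.
cloudwatch.put_metric_alarm(
    AlarmName="bedrock-daily-input-tokens",
    Namespace="AWS/Bedrock",
    MetricName="InputTokenCount",
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-3-5-sonnet-20241022-v2:0"}],
    Statistic="Sum",
    Period=86400,                 # one-day aggregation window
    EvaluationPeriods=1,
    Threshold=50_000_000,         # alarm above 50M input tokens/day
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:bedrock-cost-alerts"],
)
```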
One limitation to be aware of: the AWS Pricing Calculator only supports estimates for first-party Amazon Titan and Nova models. For third-party models like Claude or Llama, you'll need to calculate costs manually from the Bedrock pricing page.
Key Takeaways
- Model selection is your biggest cost lever. The price difference between Amazon Nova Micro and Claude 3.5 Sonnet is over 100x. Match model capability to task complexity.
- Use batch inference for anything not real-time. The 50% discount is free money if your workload tolerates async processing.
- Provisioned throughput only makes sense at sustained high utilization. Do the math against on-demand before committing to hourly rates.
- Budget for hidden costs. Knowledge Bases (especially OpenSearch Serverless), Guardrails per-unit charges, Agent multi-call amplification, and custom model storage fees all add up.
- Enable prompt caching early. Structure prompts with shared prefixes and take advantage of up to 90% savings on cached tokens.
- Tag and monitor from day one. AWS Cost Explorer, Budgets, and CloudWatch are essential to prevent billing surprises as usage scales.
Bedrock pricing rewards engineers who understand its structure and plan around it. The managed service convenience is real, but so is the complexity of its billing model. Start with on-demand, measure actual usage patterns, and optimize from there.