A practical, vendor-neutral guide to Amazon Bedrock for engineers - what it is, how its building blocks (foundation models, Knowledge Bases, Agents, AgentCore, Guardrails) fit together, and how to decide when Bedrock is the right choice versus calling provider APIs directly or self-hosting.
Amazon Bedrock has grown from a thin proxy in front of a few foundation models into a broad AI platform with its own RAG service, agent runtime, guardrail layer, evaluations, and now a separate microVM-based agent platform called AgentCore. That growth is easy to mistake for sprawl, but each piece exists to remove a different class of plumbing that teams kept rebuilding by hand in 2023 and 2024 - vector stores, action routing, content filters, observability, identity for agents calling third-party APIs.
This guide walks through the platform from an engineer's perspective: what Bedrock actually is, which of its building blocks are worth adopting, where rolling your own beats the managed path, and how the pieces compose into reference architectures. It is intentionally vendor-neutral. We work with teams that ship on Bedrock and teams that call Anthropic or OpenAI directly, and the right answer is almost always determined by the rest of the stack rather than the model.
What Amazon Bedrock Is
Amazon Bedrock is a fully managed AWS service that exposes foundation models from multiple providers behind a single, IAM-authenticated API, alongside managed building blocks for retrieval-augmented generation, agents, content safety, evaluation, and fine-tuning. Calls stay inside your AWS Region, no data is used to train provider models, and billing flows through your AWS account.
That definition matters because Bedrock is frequently confused with adjacent things. It is not a chatbot UI - the AWS Bedrock console playground exists, but production traffic goes through APIs. It is not SageMaker - Bedrock does not give you a notebook or a training cluster, and the customization options are deliberately narrow. It is not a model gateway like LiteLLM or OpenRouter - those work across clouds and abstract billing, whereas Bedrock works only on AWS and is the billing surface itself.
The point of the abstraction is to let an application call Claude Opus 4.7, Llama 4 Maverick, Mistral Large 3, Amazon Nova 2 Lite, or DeepSeek-V3.2 through the same request shape, with the same IAM role, KMS keys, VPC endpoints, CloudWatch metrics, and CloudTrail events as the rest of your AWS workloads. That uniformity is the whole product. As of May 2026, Bedrock exposes models from 18 providers across more than 100 variants, including Anthropic, Meta, Mistral AI, Cohere, AI21, Amazon (Titan and Nova), DeepSeek, OpenAI's open-weight gpt-oss-120b and gpt-oss-20b, Qwen3, Stability AI, TwelveLabs (video), Writer, Luma AI, and NVIDIA Nemotron, plus 100+ specialized models in Bedrock Marketplace.
One frequent misconception about Bedrock deserves a direct rebuttal: provider model owners do not receive your prompts or completions. When you call Claude on Bedrock, Anthropic does not see the traffic, does not log it, and does not use it for training. The model runs in AWS infrastructure under AWS's terms. This is the legal posture that makes Bedrock the default choice for HIPAA, PCI, and FedRAMP workloads where sending data to a third-party SaaS would require a separate review the security team will not enjoy.
When Bedrock Wins vs Direct APIs vs Self-Hosted
The "should we use Bedrock" question rarely has a clean technical answer. It is mostly a procurement, compliance, and stack-cohesion question. The table below is the shortcut we use in client engagements.
| Concern | Bedrock | Direct provider API | Self-hosted (EKS, SageMaker, bare GPU) |
|---|---|---|---|
| Stack already in AWS | Best fit, single IAM/VPC/KMS surface | Adds another vendor contract | Best fit but you operate the runtime |
| Time to first call | Minutes via IAM | Minutes via API key | Days to weeks |
| Latest model availability | Lag of days to weeks behind providers | Day-zero access | Whenever weights are released |
| Multi-model A/B | Same API across providers | Per-provider SDKs | Per-model serving stacks |
| Compliance (HIPAA, FedRAMP High, PCI) | Inherited from AWS, BAA-eligible | Per-provider, separate review | You own the compliance posture |
| Data residency guarantees | Stays in-Region; no provider training | Depends on provider terms | Wherever you deploy |
| Procurement | One AWS bill | New vendor onboarding | One AWS bill, more headcount |
| Cost at scale | On-demand premium, batch and PT discounts | Sometimes cheaper for high-volume single-model | Cheapest at saturation; expensive idle |
| Custom fine-tunes | Supported but narrow | Provider-dependent | Full control |
| Lowest latency | Region-bound | Provider's lowest-latency endpoint | Co-located with your workload |
A few patterns hold across most engagements. If the workload is already in AWS and includes anything regulated, Bedrock wins by default because the second vendor contract and the second data-flow review are not worth fighting for. If the team needs Claude or GPT the week it ships, going direct is faster - Bedrock typically lags by days to weeks on the freshest models. If utilization is genuinely high and steady for a single model, self-hosting an open-weight model on GPUs can come in 5-10x cheaper per token than any managed option, but the operations cost is real.
Multi-model strategies are the most underrated reason teams adopt Bedrock. Switching between Claude Sonnet, Nova Lite, and Llama 4 to find the cheapest model that meets quality - or to route per-request based on complexity - is trivial when they share an API and an IAM role. Doing the same across three SDKs and three billing surfaces is a project.
Foundation Models and the Inference APIs
Bedrock offers four runtime API families. Picking the right one is mostly about whether you need streaming, tool use, long-running jobs, or bulk throughput.
| API | Use it for | Notes |
|---|---|---|
| InvokeModel / InvokeModelWithResponseStream | Single-shot, model-specific JSON payloads | Stateless, raw provider schema, no document inputs |
| Converse / ConverseStream | Default for chat, multi-turn, tool use, document inputs | Unified message format across models, AWS-recommended |
| StartAsyncInvoke / GetAsyncInvoke | Long jobs, primarily video gen (Nova Reel) | Results delivered to S3 |
| CreateModelInvocationJob (batch) | Large offline workloads | ~50% discount vs on-demand; Converse format supported since Feb 2026 |
The Converse API should be your default. It abstracts the per-model payload differences (Claude's messages shape, Llama's prompt template, Mistral's instruct format) into one schema, and it natively supports tool use, system prompts, multi-turn history, and document inputs (PDF, DOCX, CSV, and more) without bespoke parsing.
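```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# "report.pdf" is a placeholder path; Converse accepts documents as raw bytes.
with open("report.pdf", "rb") as f:
    doc_bytes = f.read()

response = bedrock.converse(
    modelId="anthropic.claude-opus-4-7-20260301-v1:0",
    messages=[{
        "role": "user",
        "content": [
            {"text": "Summarize the attached doc."},
            {"document": {"format": "pdf", "name": "report", "source": {"bytes": doc_bytes}}},
        ],
    }],
    system=[{"text": "You are a concise technical writer."}],
    inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])
```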
Two practical pitfalls catch teams here. First, switching from InvokeModel to Converse mid-project requires rewriting any code that depends on provider-specific response fields (Claude's stop_reason codes, Llama's logprobs). Plan that work; do not assume it is a drop-in. Second, batch inference is the most underused cost lever on the platform. Nightly classification, document summarization, evaluation runs, and any workload that tolerates hours of latency belong in batch. The savings are real and the operational complexity is small (drop JSONL into S3, start a job, read JSONL back).
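The batch workflow described above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: it assumes an Anthropic model (so `modelInput` uses the Anthropic Messages request shape), the `REC...` record ID scheme is made up for the example, and the job name, role ARN, and S3 URIs are whatever your account uses. The `bedrock` argument is expected to be a boto3 `bedrock` control-plane client, passed in so the helper stays easy to test.

```python
import json


def build_batch_records(prompts, max_tokens=512):
    """Return JSONL text with one batch record per prompt.

    Each line carries a recordId and a modelInput in the model's native
    request schema (Anthropic Messages format is assumed here).
    """
    lines = []
    for index, prompt in enumerate(prompts):
        record = {
            "recordId": f"REC{index:08d}",  # hypothetical 11-character ID scheme
            "modelInput": {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": max_tokens,
                "messages": [
                    {"role": "user", "content": [{"type": "text", "text": prompt}]}
                ],
            },
        }
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"


def start_batch_job(bedrock, job_name, role_arn, model_id, input_uri, output_uri):
    """Start a batch job and return its ARN.

    bedrock -- a boto3 "bedrock" control-plane client (not "bedrock-runtime")
    """
    response = bedrock.create_model_invocation_job(
        jobName=job_name,
        roleArn=role_arn,
        modelId=model_id,
        inputDataConfig={"s3InputDataConfig": {"s3Uri": input_uri}},
        outputDataConfig={"s3OutputDataConfig": {"s3Uri": output_uri}},
    )
    return response["jobArn"]
```

Upload the JSONL text to the input S3 location before starting the job; the output location receives one JSONL result line per record, keyed by the same `recordId`.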
For model selection, the safe pattern is to default to a cheap, fast model and only escalate when an evaluation set says you must. Nova Micro and Haiku 4.5 are good defaults for classification and extraction; Sonnet and Mistral Large 3 handle most reasoning; Opus 4.7 and Llama 4 Maverick (both with 1M-token context) are for the genuinely hard or long-context cases. Pay attention to model lifecycle dates - Bedrock publishes legacy and EOL schedules, and pinning to a specific model version (the :0 suffix) is the difference between a stable production app and one that silently changes behavior.
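The "default cheap, escalate only when the evaluation set says so" rule can be expressed as a small selection helper. The sketch below is illustrative only: the `Candidate` type, the model labels, the prices, and the scores are all hypothetical, and the scores are assumed to come from your own evaluation set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    model_id: str       # placeholder label, not a real Bedrock model ID
    cost_per_1k: float  # hypothetical blended price per 1,000 tokens
    eval_score: float   # score on your own evaluation set, 0.0 to 1.0


def cheapest_passing(candidates, min_score):
    """Return the cheapest candidate whose eval score meets the bar, or None."""
    passing = [c for c in candidates if c.eval_score >= min_score]
    return min(passing, key=lambda c: c.cost_per_1k, default=None)


# Usage with made-up numbers: the mid-tier model is the cheapest one above 0.9.
candidates = [
    Candidate("small", 0.10, 0.80),
    Candidate("mid", 1.00, 0.91),
    Candidate("large", 5.00, 0.95),
]
choice = cheapest_passing(candidates, min_score=0.9)
```

A `None` result means no candidate clears the bar, which is a signal to revisit the prompt, the retrieval step, or the threshold rather than to ship the most expensive model by default.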
Tool use through Converse is straightforward and worth understanding before reaching for an agent framework. You declare a tool spec, the model returns a toolUse block when it wants to call one, you execute the tool, and you append the toolResult to the next turn. Many "we need an agent" requirements collapse into a tool-use loop of two or three iterations - cheaper to operate, easier to reason about, and trivial to evaluate.
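```python
# Reuses the `bedrock` client created in the previous example.
tools = [{
    "toolSpec": {
        "name": "lookup_order",
        "description": "Look up an order by ID.",
        "inputSchema": {"json": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        }},
    }
}]

# Illustrative conversation history; in practice this is your running message list.
conversation = [{"role": "user", "content": [{"text": "Where is order 1234?"}]}]

response = bedrock.converse(
    modelId="anthropic.claude-sonnet-4-6-20260101-v1:0",
    messages=conversation,
    toolConfig={"tools": tools},
)
```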
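The declaration above is only half of the loop. A minimal sketch of the other half, executing the requested tool and feeding the result back, might look like this. The function name, the `handlers` mapping, and the `max_turns` cap are assumptions made for illustration; the client is passed in so the loop can be exercised with a stub.

```python
def run_tool_loop(client, model_id, messages, tool_config, handlers, max_turns=5):
    """Drive a Converse tool-use loop until the model stops asking for tools.

    client   -- a boto3 "bedrock-runtime" client (or any object with .converse)
    handlers -- dict mapping tool name to a callable that takes the tool input
                and returns a JSON-serializable dict
    Returns (final_assistant_message, full_message_history).
    """
    messages = list(messages)  # do not mutate the caller's history
    for _ in range(max_turns):
        response = client.converse(
            modelId=model_id, messages=messages, toolConfig=tool_config
        )
        assistant_message = response["output"]["message"]
        messages.append(assistant_message)
        if response["stopReason"] != "tool_use":
            return assistant_message, messages

        # Execute every tool call in this turn and collect the results.
        results = []
        for block in assistant_message["content"]:
            if "toolUse" not in block:
                continue
            call = block["toolUse"]
            output = handlers[call["name"]](call["input"])
            results.append({
                "toolResult": {
                    "toolUseId": call["toolUseId"],
                    "content": [{"json": output}],
                }
            })
        messages.append({"role": "user", "content": results})
    raise RuntimeError("tool loop exceeded max_turns")
```

Capping the number of turns and raising on overflow keeps a confused model from looping indefinitely, which is the same "cap the max steps" discipline that applies to full agents.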
Cross-Region inference profiles deserve a separate mention. Bedrock ships inference profile IDs (the us., eu., and apac. prefixes) that route traffic across Regions transparently to even out capacity and reduce throttling. For latency-sensitive workloads that must stay in a single Region, stick with the Region-specific model ID. For everything else, the cross-Region profile usually wins on availability. Requests can be served from another Region, but only within the profile's geography, so a geography-level data residency posture is unchanged.
Knowledge Bases for RAG
Knowledge Bases for Amazon Bedrock is the managed RAG service. You point it at a data source, choose a chunking strategy and an embedding model, pick a vector store, and you get two APIs back: Retrieve for search-only and RetrieveAndGenerate for grounded generation with source citations. It removes most of the boilerplate around ingestion, chunking, embedding, and retrieval orchestration.
A Knowledge Base in Bedrock is the managed pipeline that takes documents from a connected source, splits them into chunks, embeds them with an embedding model (Titan Embeddings v2, Cohere Embed, or others), writes the vectors and metadata to a vector store, and serves retrieval plus grounded generation through a single API. It handles re-ingestion when source documents change.
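As a rough illustration of the grounded-generation path, the sketch below calls RetrieveAndGenerate and pulls out the answer plus the cited source URIs. The helper name is made up, the Knowledge Base ID and model ARN are whatever your account uses, and only S3-backed citations are extracted here; other connector types report their locations under different keys.

```python
def answer_with_citations(client, kb_id, model_arn, question):
    """Call RetrieveAndGenerate and return (answer_text, source_uris).

    client -- a boto3 "bedrock-agent-runtime" client
    """
    response = client.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
            },
        },
    )
    # Collect distinct S3 URIs from the citations, preserving order.
    sources = []
    for citation in response.get("citations", []):
        for ref in citation.get("retrievedReferences", []):
            uri = ref.get("location", {}).get("s3Location", {}).get("uri")
            if uri and uri not in sources:
                sources.append(uri)
    return response["output"]["text"], sources
```

For search-only use cases, the Retrieve API on the same client returns the ranked chunks without the generation step, which is also the easier surface to evaluate in isolation.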
Connectors supported today: S3, Web Crawler, Confluence, SharePoint, Salesforce, and a Custom direct-ingestion path for anything you can push through an SDK. Vector store backends include OpenSearch Serverless, OpenSearch Service managed cluster, S3 Vectors, Aurora PostgreSQL (pgvector), Pinecone, Redis Enterprise Cloud, MongoDB Atlas, and Neptune Analytics for GraphRAG. The S3 Vectors and OpenSearch managed cluster options were added in 2025 and have changed the cost story significantly - OpenSearch Serverless used to be a $600+/month floor for the smallest KB; S3 Vectors moves the floor near zero for read-heavy, low-update workloads.
Chunking is the part teams misconfigure most often. The choices:
| Strategy | When to use | Trade-off |
|---|---|---|
| Default (~300 tokens, sentence-aware) | First pass for general prose | Good baseline; not optimal for any specific corpus |
| Fixed-size | Predictable, uniform documents | Token control is easy; can split mid-thought |
| Hierarchical (parent/child) | Long technical docs, code, manuals | Better recall at higher index cost |
| Semantic | Mixed corpora where topic boundaries matter | Slower ingestion, often better quality |
| Custom (Lambda) | LangChain/LlamaIndex parity, domain-specific splitters | You own the code |
| No chunking | Pre-chunked input | Honors your existing chunk boundaries |
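In API terms, most of the strategies in the table map to a `chunkingConfiguration` block set on the data source. The helper below is a sketch of that mapping; the function name is invented, and the token sizes, overlap, and breakpoint threshold are illustrative starting points rather than tuned recommendations.

```python
def chunking_configuration(strategy):
    """Return a chunkingConfiguration dict for a few common strategies.

    Token sizes and thresholds are illustrative starting points, not tuned values.
    """
    if strategy == "fixed":
        return {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": 300,
                "overlapPercentage": 20,
            },
        }
    if strategy == "hierarchical":
        return {
            "chunkingStrategy": "HIERARCHICAL",
            "hierarchicalChunkingConfiguration": {
                # Parent chunks first, then child chunks.
                "levelConfigurations": [{"maxTokens": 1500}, {"maxTokens": 300}],
                "overlapTokens": 60,
            },
        }
    if strategy == "semantic":
        return {
            "chunkingStrategy": "SEMANTIC",
            "semanticChunkingConfiguration": {
                "maxTokens": 300,
                "bufferSize": 1,
                "breakpointPercentileThreshold": 95,
            },
        }
    if strategy == "none":
        return {"chunkingStrategy": "NONE"}
    raise ValueError(f"unsupported strategy: {strategy!r}")
```

The returned dict would typically be passed as `vectorIngestionConfiguration={"chunkingConfiguration": ...}` when creating the data source with the `bedrock-agent` client. Changing it later generally means re-ingesting, so it is worth testing two or three strategies against a retrieval evaluation set before committing.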
When should you skip Knowledge Bases and build your own RAG stack on OpenSearch or a vector DB? Three signals: (1) you need hybrid search (BM25 + dense + reranking) with custom weights and rerankers, (2) you need fine-grained tenant isolation, query-time filters, or unusual aggregations, or (3) the retrieval surface is itself a product feature you want to iterate on weekly. Otherwise, take the managed path. We wrote a longer treatment of RAG pipeline architecture and a practical guide to RAG on OpenSearch for the cases where the managed path is not the right call.
One pitfall worth flagging: a Knowledge Base is only as fresh as its ingestion schedule. If your source data changes by the minute, schedule incremental syncs and design around eventual consistency. For change-data-capture-driven RAG, the custom ingestion API plus an event-driven pipeline (S3 + EventBridge + Lambda) is usually the right shape.
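For the simpler of the two shapes mentioned above, an event-triggered sync, a Lambda-style handler might look roughly like this. It is a sketch under assumptions: the function name and argument order are invented, the event is assumed to be an EventBridge notification for an S3 object change, and it starts a regular ingestion job rather than using the custom direct-ingestion path.

```python
def handle_s3_change(event, client, kb_id, data_source_id):
    """Start a Knowledge Base sync when an S3 object changes.

    event  -- an EventBridge event for an S3 object notification
    client -- a boto3 "bedrock-agent" client
    Returns the ID of the ingestion job that was started.
    """
    key = event.get("detail", {}).get("object", {}).get("key", "<unknown>")
    response = client.start_ingestion_job(
        knowledgeBaseId=kb_id,
        dataSourceId=data_source_id,
        # Keep the description short; long object keys are truncated.
        description=f"sync triggered by {key}"[:200],
    )
    return response["ingestionJob"]["ingestionJobId"]
```

A production version would likely need to debounce bursts of events, since overlapping syncs on the same data source may be rejected, and to handle that rejection by retrying later rather than failing the invocation.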
The vector store choice is the biggest cost and operational lever inside Knowledge Bases. A rough buyer's guide:
| Backend | Best for | Watch out for |
|---|---|---|
| OpenSearch Serverless | Default if you want zero ops and small-to-medium corpora | ~$600+/month floor for the smallest setup |
| OpenSearch Service managed cluster | Large corpora, hybrid search, multi-tenant isolation | You operate the cluster; pairs well with Pulse for OpenSearch |
| S3 Vectors | Read-heavy, infrequently updated, cost-sensitive | Newer service; check feature parity for your query patterns |
| Aurora PostgreSQL (pgvector) | Already on Aurora, want vectors next to relational data | pgvector tuning is non-trivial at scale |
| Pinecone, Redis Enterprise Cloud, MongoDB Atlas | Already standardized on one of them | External billing; less tight IAM story |
| Neptune Analytics (GraphRAG) | Entity-heavy domains where relationships matter | Adds graph modeling work |
Bedrock also integrates with Amazon Kendra as a managed retrieval index, which sidesteps the chunking-and-embedding question entirely. Kendra runs full-text and semantic retrieval as a service, with built-in connectors and access controls. The trade-off is cost and flexibility: Kendra is more expensive than the DIY paths and less tunable than OpenSearch, but for enterprise document search with row-level ACLs from SharePoint or Confluence, it is often the fastest path to a defensible answer.
Bedrock Agents and AgentCore
The agent story on Bedrock now has two products, and the distinction matters. Classic Bedrock Agents is the original config-driven service: you define Action Groups (OpenAPI or Lambda function schemas), attach a Knowledge Base, write instructions, and Bedrock handles multi-step reasoning, tool invocation, and response synthesis. It is opinionated, low-code, and tightly bound to Bedrock-hosted models.
Amazon Bedrock AgentCore, announced in preview in July 2025 and expanded at re:Invent 2025, is a separate service (bedrock-agentcore) aimed at production agent workloads. It is framework-agnostic - it runs agents written in Strands, LangGraph, CrewAI, LlamaIndex, or your own code - and model-agnostic, including non-Bedrock models. Its components:
- Runtime: Each session gets a dedicated microVM with up to 8 hours of execution and 100MB payloads. Strong isolation per user session.
- Memory: Managed short-term and long-term memory, with episodic learning across sessions.
- Identity: OAuth/IAM for users and third-party services so agents can call Salesforce or GitHub on a user's behalf.
- Gateway: Turns Lambdas and APIs into MCP-compatible tools.
- Browser: Managed headless browser for agents that need to interact with web UIs.
- Code Interpreter: Sandboxed code execution.
- Observability: OpenTelemetry traces of agent steps, integrated with CloudWatch.
- Policy: Deterministic guardrails added at re:Invent 2025.
- Evaluations: 13 pre-built evals for agent quality.
| | Classic Bedrock Agents | AgentCore |
|---|---|---|
| Framework | Bedrock-native, declarative | Bring your own (Strands, LangGraph, etc.) |
| Model | Bedrock-hosted only | Any model, incl. non-AWS |
| Tool protocol | Action Groups (OpenAPI/Lambda) | MCP via Gateway |
| Session isolation | Logical | microVM per session |
| Long sessions | Limited | Up to 8 hours |
| Best for | Internal apps, quick prototypes, KB-heavy workflows | Production agents, multi-framework teams, regulated workloads |
Use classic Agents when the workflow is short, mostly KB-driven, and you want a low-code path. Use AgentCore when you are running an agent framework you already invested in, when sessions are long, or when you need strong per-session isolation for multi-tenant SaaS. Most production agent work we see now starts on AgentCore. Our agents primer goes deeper into the design questions independent of platform choice.
A subtle point about AgentCore Gateway: it implements Model Context Protocol as the tool surface, which means an agent built on AgentCore can call the same tools as Claude Desktop, Cursor, or any other MCP client. That portability is real - we have seen teams ship the same toolset to an internal AgentCore agent and an employee-facing IDE assistant without rewriting the integrations. AgentCore Identity handles the OAuth dance for tools that need a user's third-party credentials (Salesforce, GitHub, Gmail), which used to be the most painful piece of agent plumbing to build correctly.
The cost shape of agents is also where teams get surprised most often. A single user query into a multi-step agent typically triggers 5-15 model calls as the agent reasons, picks tools, parses results, and synthesizes a response. Effective per-query cost is the base model rate multiplied by step count. Three knobs make this manageable: (1) route the planning step to a cheap fast model and reserve the expensive one for synthesis, (2) cap the max steps and exit gracefully, and (3) cache stable system prompts and tool specs aggressively - prompt caching can drop input token cost by up to 90% for the repeated prefix.
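The step-amplification arithmetic is easy to make concrete. The estimator below is a simplified sketch: it assumes every step has the same token counts, treats cache reads as a flat discount on the cached share of input tokens, and ignores cache-write surcharges. All prices in the usage example are hypothetical.

```python
def agent_query_cost(steps, input_tokens, output_tokens,
                     input_price, output_price,
                     cached_fraction=0.0, cache_discount=0.9):
    """Estimate the dollar cost of one agent query.

    Prices are per 1,000 tokens. cached_fraction is the share of each step's
    input tokens served from the prompt cache; cache_discount is the price
    reduction applied to those tokens.
    """
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    input_cost = (uncached + cached * (1 - cache_discount)) * input_price / 1000
    output_cost = output_tokens * output_price / 1000
    return steps * (input_cost + output_cost)


# Hypothetical workload: 10 steps, 4,000 input and 500 output tokens per step.
uncached_cost = agent_query_cost(10, 4000, 500, 0.003, 0.015)
cached_cost = agent_query_cost(10, 4000, 500, 0.003, 0.015, cached_fraction=0.75)
```

With these made-up numbers the query costs about $0.195 without caching and about $0.114 when three quarters of each step's input is a cached prefix, which is why stable system prompts and tool specs are worth caching before reaching for a cheaper model.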
Guardrails, Custom Models, and Provisioned Throughput
Three smaller pieces, each pulling its weight.
Guardrails for Amazon Bedrock layer policy on top of any model invocation: content filters (hate, insults, sexual, violence, misconduct, prompt attack - text and image), denied topics with natural-language definitions, word filters, sensitive information filters (a set of pre-built PII types plus custom regex with block-or-mask actions), and contextual grounding checks that score model output for relevance and faithfulness to a source document. The ApplyGuardrail API decouples guardrails from inference, so you can run the same policies on self-hosted models or third-party APIs. After AWS cut Guardrails pricing by up to 85% in December 2024, content filters and denied topics each run at $0.15 per 1,000 text units, which makes per-request gating practical for most workloads. We have a deeper Guardrails implementation guide that covers the failure modes.
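Because ApplyGuardrail is independent of inference, it can wrap any text, including output from a model that does not run on Bedrock. A minimal sketch follows; the helper name and return convention are assumptions for illustration, and the guardrail ID and version are whatever you have configured.

```python
def check_text(client, guardrail_id, guardrail_version, text, source="INPUT"):
    """Evaluate text against a guardrail; return (intervened, text_to_use).

    client -- a boto3 "bedrock-runtime" client
    source -- "INPUT" for user prompts, "OUTPUT" for model responses
    When the guardrail intervenes, text_to_use is the guardrail's replacement
    (a blocked message or a masked version of the original).
    """
    response = client.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version,
        source=source,
        content=[{"text": {"text": text}}],
    )
    if response["action"] != "GUARDRAIL_INTERVENED":
        return False, text
    outputs = response.get("outputs", [])
    replacement = outputs[0]["text"] if outputs else ""
    return True, replacement
```

Calling it once with `source="INPUT"` before the model call and once with `source="OUTPUT"` afterwards gives the same two-sided gating that the managed integration applies, at the cost of two extra API calls per request.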
Custom Models has three flavors. Supervised fine-tuning works on a narrow set of base models (Titan, Nova, some Llama). Continued pre-training is offered on Titan and Nova for domain adaptation. The most interesting addition is Custom Model Import, which is GA and now supports Llama 2/3/3.1/3.2, Mistral/Mixtral, Flan-T5, Qwen2/2.5/3 (including VL and MoE variants), GPT-OSS (20B and 120B), and DeepSeek-R1-Distill-Llama (8B/70B). Imported models are billed per Custom Model Unit, and they integrate with the rest of Bedrock - Knowledge Bases, Guardrails, Agents - so an imported open-weight model gets the same managed surface as a first-party one.
Provisioned Throughput reserves dedicated capacity, billed per Model Unit per hour. Commitment terms are no-commit (limited to some base models), 1-month, and 6-month, with deeper discounts for longer commitments. You cannot delete a PT during its commitment, which has caught more than one team off guard. PT has traditionally been required to serve fine-tuned models (imported models are billed per Custom Model Unit instead, as noted above), and it is recommended for latency-critical, high-volume workloads where on-demand throttling is unacceptable. For the cost math, see our Bedrock pricing guide.
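A quick break-even estimate helps decide whether a commitment is even in the right ballpark. The sketch below compares a month of reserved Model Units against on-demand spend at a single blended token price; the function name and all figures are hypothetical, and it deliberately ignores the throughput ceiling of a Model Unit.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def pt_break_even_tokens(model_unit_hourly, model_units, on_demand_price_per_1k):
    """Monthly token volume at which PT cost equals on-demand cost.

    model_unit_hourly      -- hypothetical hourly price of one Model Unit
    on_demand_price_per_1k -- hypothetical blended on-demand price per 1,000 tokens
    """
    monthly_pt_cost = model_unit_hourly * model_units * HOURS_PER_MONTH
    return monthly_pt_cost / on_demand_price_per_1k * 1000


# Made-up prices: one Model Unit at $20/hour versus $0.004 per 1,000 tokens.
break_even = pt_break_even_tokens(20.0, 1, 0.004)
```

If measured monthly volume sits well below the break-even figure, on-demand plus batch is almost certainly cheaper; if it sits above, check next whether the reserved units can actually sustain that volume at peak.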
Security, Observability, and Reference Architectures
For regulated industries, Bedrock's data protection model is the headline reason to choose it. Prompts and completions are not used to train any model. Data stays in the Region you call, or within the profile's geography if you opt into cross-Region inference. Per-Region interface VPC endpoints (com.amazonaws.<region>.bedrock-runtime, bedrock-agent-runtime, plus AgentCore endpoints) keep traffic on the AWS backbone with no internet path. KMS customer-managed keys cover encryption at rest. The service is in scope for SOC 1/2/3, ISO 27001/27017/27018, and PCI DSS, is HIPAA-eligible, and is authorized at FedRAMP High in GovCloud.
Model invocation logging captures full prompt and response payloads to CloudWatch Logs or S3 at the per-Region level - turn it on before going to production and design for the storage cost from day one. Bedrock Evaluations supports three modes: automatic (built-in metrics for summarization, QA, classification), human review, and LLM-as-a-judge where one model scores another's output with explanations. RAG-specific evaluations work directly against Knowledge Bases. AgentCore Observability adds OpenTelemetry traces of agent reasoning steps, which is the only practical way to debug a misbehaving multi-step agent in production.
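Turning on invocation logging is a single control-plane call per Region. The helper below is a sketch with an invented name; the log group, IAM role, and bucket are assumed to exist already, and delivering to both CloudWatch Logs and S3 is a choice made for the example rather than a requirement.

```python
def enable_invocation_logging(client, log_group, role_arn, bucket,
                              prefix="bedrock-logs"):
    """Turn on model invocation logging for the client's Region.

    client   -- a boto3 "bedrock" control-plane client
    role_arn -- role that Bedrock assumes to write to the log group
    """
    client.put_model_invocation_logging_configuration(
        loggingConfig={
            "cloudWatchConfig": {"logGroupName": log_group, "roleArn": role_arn},
            "s3Config": {"bucketName": bucket, "keyPrefix": prefix},
            # Capture prompt and response text; skip image and embedding payloads.
            "textDataDeliveryEnabled": True,
            "imageDataDeliveryEnabled": False,
            "embeddingDataDeliveryEnabled": False,
        }
    )
```

Because the logs contain full prompts and completions, the log group and bucket deserve the same access controls and retention policy as the most sensitive data your application handles.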
Three reference architectures show up repeatedly in client work:
- RAG chatbot: API Gateway → Lambda → RetrieveAndGenerate against a Knowledge Base → Guardrails on input and output → response. Logs to CloudWatch; evaluation jobs run nightly against a held-out set.
- Multi-step agent: AgentCore Runtime hosts a LangGraph agent → AgentCore Gateway exposes internal Lambdas as MCP tools → AgentCore Identity handles OAuth for third-party APIs → AgentCore Memory persists user context → AgentCore Observability traces every step.
- Document understanding pipeline: S3 upload → EventBridge → Step Functions → Bedrock Data Automation for OCR/extraction → Converse API for structured extraction with tool use → results to DynamoDB and Athena. Batch inference for the bulk historical backfill.
The thread across all three: Bedrock is the model surface, but the architecture around it is mostly AWS primitives you already know. That is the point.
A note on evaluation discipline. Almost every production failure we have debugged on a Bedrock workload traces back to a missing or weak evaluation set. Without one, you cannot tell whether switching from Claude Sonnet to Nova Lite costs you 2% quality or 20%. Build the eval harness before you ship the feature. Bedrock Evaluations gives you the plumbing - the dataset, the rubric, and the LLM-as-a-judge prompts are your job, and they are where the actual quality lives. For RAG specifically, evaluate retrieval and generation separately: retrieval quality (was the right chunk in the top-k?) and generation quality (did the model use the chunk correctly?) fail for different reasons and demand different fixes.
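The retrieval half of that split needs very little machinery. A hit-rate-at-k metric, sketched below with illustrative data structures, answers the "was the right chunk in the top-k?" question directly; the function name and the chunk IDs are made up, and the relevance judgments are assumed to come from your own labeled set.

```python
def hit_rate_at_k(retrieved, relevant, k):
    """Fraction of queries with at least one relevant chunk in the top k.

    retrieved -- dict: query -> ranked list of chunk IDs from the retriever
    relevant  -- dict: query -> set of chunk IDs judged relevant
    """
    if not relevant:
        return 0.0
    hits = sum(
        1
        for query, gold in relevant.items()
        if gold & set(retrieved.get(query, [])[:k])
    )
    return hits / len(relevant)


# Toy example: q1's relevant chunk is ranked third, q2's is never retrieved.
retrieved = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
relevant = {"q1": {"c"}, "q2": {"w"}}
rate_at_3 = hit_rate_at_k(retrieved, relevant, k=3)
rate_at_2 = hit_rate_at_k(retrieved, relevant, k=2)
```

A low hit rate points at chunking, embeddings, or filters; a high hit rate combined with poor answers points at the prompt or the generation model, which is why the two are worth tracking as separate numbers.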
The cost side also rewards a small amount of upfront discipline. Bedrock has seven distinct billing dimensions once you turn on the managed services - input tokens, output tokens, cached tokens, Knowledge Base queries, vector store infrastructure, Guardrails text units, and Flows node transitions - plus the implicit cost of agent step amplification. The pricing post linked above has the line-item math. Two patterns are worth internalizing from day one: attach cost-allocation tags to the Bedrock resources each team calls through (application inference profiles are the usual vehicle) so Cost Explorer can break spend down by team and feature, and set CloudWatch alarms on token-count metrics before the first surprise bill, not after.
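A token-volume alarm can be sketched as a set of arguments for CloudWatch's `put_metric_alarm`. The helper name, alarm naming scheme, threshold, and SNS topic are assumptions for illustration; the metric is the per-model input token count that Bedrock publishes under the `AWS/Bedrock` namespace.

```python
def token_alarm_kwargs(model_id, threshold_tokens, sns_topic_arn,
                       period_seconds=3600):
    """Build put_metric_alarm arguments for input-token volume per period."""
    return {
        "AlarmName": f"bedrock-input-tokens-{model_id}",  # hypothetical naming scheme
        "Namespace": "AWS/Bedrock",
        "MetricName": "InputTokenCount",
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_tokens),
        "ComparisonOperator": "GreaterThanThreshold",
        # Quiet periods with no traffic should not trip the alarm.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }
```

The result can be passed straight to a boto3 CloudWatch client as `cloudwatch.put_metric_alarm(**token_alarm_kwargs(...))`. A matching alarm on `OutputTokenCount` is usually worth adding, since output tokens tend to be priced higher than input tokens.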
Picking the Right Pattern
A working decision framework for new projects:
- Pick the model second, not first. Start with the workload constraints (latency, context length, regulatory). The model shortlist falls out of those.
- Default to Converse API. Move to InvokeModel only if you need a provider-specific feature Converse does not expose.
- Default to Knowledge Bases for RAG. Roll your own on OpenSearch only when hybrid search, custom rerankers, or unusual filters are central to the product.
- Default to AgentCore for new agent work. Use classic Agents for short, KB-heavy internal workflows where the low-code path is worth more than framework flexibility.
- Add Guardrails on day one. The pricing makes it cheap, and retrofitting policy after an incident is painful.
- Turn on model invocation logging before the first production request. You cannot debug what you did not log.
- Move batchable work to batch. Half the cost, same quality.
- Pin model versions explicitly. The :0 suffix is not optional in production.
- Hold off on Provisioned Throughput until utilization data justifies it. PT is expensive and non-cancellable.
- Evaluate continuously. Bedrock Evaluations and LLM-as-a-judge are not optional once the workload matters.
Key Takeaways
- Bedrock is the managed AWS surface for foundation models, not a chatbot or a training platform. The value is uniformity across providers and integration with the rest of AWS.
- Choose Bedrock when the stack is AWS-native, when compliance matters, or when multi-model A/B is on the roadmap. Choose direct provider APIs for day-zero model access. Self-host when utilization is high enough to amortize ops cost.
- The Converse API is the right default for inference. Batch inference is the most underused cost lever.
- Knowledge Bases handles managed RAG end-to-end with S3, OpenSearch, Aurora, Pinecone, and others as backends. Roll your own only when retrieval is itself a product feature.
- AgentCore is the production path for agents. Classic Bedrock Agents remains useful for low-code, KB-heavy workflows.
- Guardrails, model invocation logging, and evaluations are baseline production requirements, not nice-to-haves.
At BigData Boutique we hold the AWS AI Services Competency and we have shipped Bedrock workloads for regulated and consumer companies. If you are working through these choices and want a second pair of eyes, get in touch.