What is Context Engineering?

Context engineering is the discipline of designing the full set of information a large language model sees on any given turn -- not just the instruction text, but everything in the context window: the system prompt, retrieved documents, tool definitions, prior tool outputs, conversation history, memory, files, and structured state. It's the natural evolution of prompt engineering once applications grow beyond single-turn chat into agents, RAG systems, and long-running workflows.

The term gained traction in 2024-2025 as practitioners building production LLM systems realized that the prompt was rarely the bottleneck anymore. The hard problems were elsewhere: what to retrieve, what to keep in memory, when to compress, which tools to expose, how to summarize long histories, how to prevent context rot. Calling all of that "prompt engineering" stretched the term past usefulness. "Context engineering" names the actual job.

Prompt Engineering vs Context Engineering

Prompt engineering optimizes the instruction you give to a model. Context engineering optimizes everything the model sees alongside that instruction.

For a single-turn chat application, the two collapse into the same thing. For an agent that uses tools, retrieves documents, calls sub-agents, and runs for thirty turns, they diverge sharply. The agent's system prompt may be a tiny fraction of the actual context on turn 25. The rest is tool definitions, accumulated tool outputs, retrieved chunks, summarized history, and intermediate scratchpad reasoning. Get any of those wrong and the model's behavior degrades, regardless of how well the system prompt is written.

The shift in framing also changes what gets measured. Prompt engineering asks "is the instruction clear?" Context engineering asks "is the model seeing the right information, in the right order, at the right time, without being drowned in irrelevant tokens?"

What's in the Context

On any given LLM turn, the context window typically contains some combination of:

System prompt. Persistent instructions: identity, constraints, output format, safety rules. Usually cached when the provider supports prompt caching.

Tool / function definitions. Schemas describing what tools the model can call -- name, description, parameters. For agents with many tools, these can dominate the context budget.

Retrieved context (RAG). Chunks pulled from a vector database, search engine, or knowledge base for the current query. Choosing how much to retrieve, how to rank, how to format, and what to drop is one of the central problems in context engineering.

Conversation history. Previous user messages, assistant responses, tool calls and results. Grows monotonically unless actively managed.

Memory. Persistent state across sessions -- user preferences, past decisions, derived facts. Short-term memory (within a session) and long-term memory (across sessions) are usually implemented differently.

Scratchpad / reasoning state. For multi-step agents, the model's own intermediate thinking, plans, observations, and tool-output digests.

Files and attachments. PDFs, code, images, structured data the user or upstream system has attached.

The current user query or trigger. The actual thing being asked or the event being responded to.

The context engineering job is to assemble this on every turn, dynamically, with the right tradeoffs between completeness, relevance, latency, and cost.
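As a rough illustration of that assembly step, the sketch below keeps the highest-priority parts that fit in a token budget and then restores their original order. The Part type, the priority numbers, and the 4-characters-per-token estimate are all assumptions made for the example; a real system would use the model's own tokenizer and a richer policy.

```python
from dataclasses import dataclass


@dataclass
class Part:
    name: str      # e.g. "system", "retrieved", "history", "query"
    text: str
    priority: int  # hypothetical convention: lower number = more important


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return max(1, len(text) // 4)


def assemble_context(parts: list[Part], budget: int) -> list[Part]:
    """Keep the most important parts that fit, preserving original order."""
    kept: list[int] = []
    used = 0
    # Visit parts from most to least important and admit each one that fits.
    for i in sorted(range(len(parts)), key=lambda i: parts[i].priority):
        cost = estimate_tokens(parts[i].text)
        if used + cost <= budget:
            kept.append(i)
            used += cost
    return [parts[i] for i in sorted(kept)]


parts = [
    Part("system", "a" * 40, 0),
    Part("history", "b" * 400, 3),
    Part("retrieved", "c" * 80, 2),
    Part("query", "d" * 20, 1),
]
selected = assemble_context(parts, budget=40)
print([p.name for p in selected])
```

With a budget of 40 estimated tokens, the long history part is the one that gets dropped, which is the kind of tradeoff between completeness and cost described above.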

Why Context Engineering Got Hard

Context windows got bigger -- but attention got worse. Modern frontier models support 200K, 1M, even 2M tokens of context. In theory that solves everything. In practice, a model's ability to find and use information degrades as the context grows: the well-known "lost in the middle" effect, and the broader phenomenon of context rot, where signal drowns in noise. A 1M-token context window isn't a license to dump in 1M tokens. The amount of context a model uses effectively is far smaller.

Tools multiplied. Production agents now routinely have dozens of tools available -- internal APIs, search, databases, file systems, code execution, sub-agents. Each tool definition costs tokens. Each tool output costs tokens. Multi-step agents accumulate tool outputs across turns, and without active management, the context fills with stale call traces nobody needs anymore.

Retrieval became the bottleneck. RAG makes context engineering a retrieval problem: which chunks, ranked how, formatted how, with how much surrounding context, deduplicated against what? Modern RAG systems use hybrid search (lexical + vector), reranking, query expansion, and structured retrieval. Each design decision is a context engineering decision.

Memory became architectural. Stateful agents and long-running assistants need persistent memory that's queried, written to, summarized, and pruned. Memory isn't a single thing -- it's an indexed store, a summarization pipeline, a retrieval system, and a write policy.

Cost and latency became real. Under linear per-token pricing, a 100K-token context costs roughly 100x as much in input tokens as a 1K-token context, and it adds significant latency. Context engineering has direct economic impact: every token in every turn is paid for, multiplied by every user.
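The arithmetic can be made concrete with a small sketch. It assumes simple linear per-token pricing, and the price constant is a made-up placeholder rather than any provider's actual rate.

```python
def input_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Input-token cost of one call under simple linear pricing."""
    return tokens / 1_000_000 * usd_per_million


def daily_cost_usd(tokens_per_turn: int, turns_per_user: int,
                   users: int, usd_per_million: float) -> float:
    """Every token in every turn is paid for, multiplied by every user."""
    per_turn = input_cost_usd(tokens_per_turn, usd_per_million)
    return per_turn * turns_per_user * users


PRICE = 3.0  # hypothetical USD per million input tokens, not a real price list

small = input_cost_usd(1_000, PRICE)
large = input_cost_usd(100_000, PRICE)
ratio = large / small  # about 100x, as expected under linear pricing

print(f"ratio: {ratio:.0f}x")
print(f"daily: ${daily_cost_usd(100_000, 20, 1_000, PRICE):,.2f}")
```

Output tokens, caching discounts, and tiered pricing are ignored here, so treat the numbers as an order-of-magnitude guide only.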

Core Techniques

Just-in-time retrieval. Don't pre-load the context with everything that might be relevant. Retrieve on demand based on what the agent is actually doing on the current turn. This is the standard pattern in agentic RAG -- the agent decides when, what, and how to retrieve.

Context compression and summarization. Periodically summarize older conversation turns or tool outputs into a compact representation. Replace the verbose history with the summary. Done well, this preserves the information that matters while reclaiming context budget.
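A minimal sketch of this pattern follows. The summarize argument is a placeholder for whatever produces the summary (in practice, usually another model call); the stub used in the example only counts messages and is not a real summarizer.

```python
from typing import Callable


def compact_history(
    messages: list[str],
    keep_recent: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Replace everything except the most recent messages with one summary."""
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")
    if len(messages) <= keep_recent:
        return list(messages)  # nothing old enough to compact
    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    return ["[summary] " + summarize(older)] + recent


def stub_summarize(older: list[str]) -> str:
    # Placeholder: a real implementation would call a model here.
    return f"{len(older)} earlier messages"


history = ["m1", "m2", "m3", "m4", "m5"]
print(compact_history(history, keep_recent=2, summarize=stub_summarize))
```

Keeping the last few turns verbatim is a design choice: recent turns tend to carry the detail the next step depends on, while older ones can usually survive as a summary.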

Tool result pruning. After a tool call, extract the relevant fields and drop the rest. A 50-row JSON response usually doesn't need to stay in context as 50 rows of JSON.
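One way to sketch this: keep only the fields the task needs, cap the number of rows, and record how many rows were left out so the model knows the result was truncated. The field names and the row cap below are illustrative assumptions.

```python
import json


def prune_rows(rows: list[dict], fields: list[str], max_rows: int) -> dict:
    """Keep selected fields from the first max_rows rows; count the rest."""
    kept = [
        {name: row[name] for name in fields if name in row}
        for row in rows[:max_rows]
    ]
    return {"rows": kept, "omitted": max(0, len(rows) - max_rows)}


# A hypothetical 50-row tool response with a bulky field nobody needs.
raw = [{"id": i, "name": f"item-{i}", "blob": "x" * 500} for i in range(50)]
pruned = prune_rows(raw, fields=["id", "name"], max_rows=3)

# Compact serialization for insertion into the context.
print(json.dumps(pruned, separators=(",", ":")))
```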

Tool selection and routing. When an agent has many tools, don't expose all of them every turn. Use a router (sometimes another LLM call, sometimes embedding similarity, sometimes a rule engine) to surface only the tools relevant to the current step. This is often called "tool RAG" or "dynamic tool loading."
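The sketch below shows the shape of such a router. It scores tools by word overlap between the task and each tool description, which is a deliberately simple stand-in for embedding similarity or an LLM-based router; the tool names and descriptions are invented for the example.

```python
import re


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_tools(task: str, tools: dict[str, str], k: int) -> list[str]:
    """Return up to k tool names whose descriptions overlap with the task."""
    task_words = words(task)
    overlap = {name: len(task_words & words(desc)) for name, desc in tools.items()}
    # Highest overlap first; ties broken by name for deterministic output.
    ranked = sorted(tools, key=lambda name: (-overlap[name], name))
    return [name for name in ranked[:k] if overlap[name] > 0]


TOOLS = {
    "search_docs": "search internal documentation pages",
    "run_sql": "run a sql query against the sales database",
    "send_email": "send an email message to a user",
}

print(select_tools("query the sales database for last month", TOOLS, k=2))
```

Plain word overlap counts stopwords like "the", so a production router would need something more discriminating. The point here is only that the set of tool definitions placed in context is computed per step rather than fixed.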

Structured state outside the context. Persist long-running state (user profile, task list, intermediate artifacts) in an external store -- a database, a file, a memory service -- and pull only the relevant slice into context when needed.
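A minimal sketch of this idea, using Python's built-in sqlite3 module as the external store. The StateStore class and its key names are hypothetical; any database, file, or memory service could play the same role.

```python
import json
import sqlite3


class StateStore:
    """Tiny key-value store; values are stored as JSON text."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS state (key TEXT PRIMARY KEY, value TEXT)"
        )

    def put(self, key: str, value: object) -> None:
        self.db.execute(
            "INSERT OR REPLACE INTO state (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self.db.commit()

    def get_slice(self, keys: list[str]) -> dict:
        """Load only the requested keys; missing keys are skipped."""
        out = {}
        for key in keys:
            row = self.db.execute(
                "SELECT value FROM state WHERE key = ?", (key,)
            ).fetchone()
            if row is not None:
                out[key] = json.loads(row[0])
        return out


store = StateStore()
store.put("profile", {"name": "Ada", "plan": "pro"})
store.put("tasks", ["draft report", "send invoice"])

# Only the task list is pulled into context for this step.
print(store.get_slice(["tasks"]))
```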

Prompt caching. When supported, cache the long static prefix (system prompt + tool definitions) so subsequent calls pay full price only for the variable part. This can reduce cost and latency substantially on long-context workloads.
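Provider caching APIs differ, so the sketch below shows only the part that is under the application's control: keeping the static prefix byte-identical across calls. It assumes the cache matches on an exact prefix, which should be checked against the provider's documentation. The function names are hypothetical and no provider API is called.

```python
import hashlib
import json


def build_prefix(system_prompt: str, tools: list[dict]) -> str:
    """Serialize the static part deterministically (stable order and keys)."""
    ordered = sorted(tools, key=lambda tool: tool["name"])
    return system_prompt + "\n" + json.dumps(ordered, sort_keys=True)


def fingerprint(prefix: str) -> str:
    """Hash of the prefix; a changed hash means the cached prefix was broken."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


def build_prompt(prefix: str, variable_part: str) -> str:
    # Static content first, per-turn content last, so the prefix stays stable.
    return prefix + "\n\n" + variable_part


tools_a = [{"name": "b", "description": "x"}, {"description": "y", "name": "a"}]
tools_b = [{"name": "a", "description": "y"}, {"description": "x", "name": "b"}]

prefix_a = build_prefix("You are a helpful agent.", tools_a)
prefix_b = build_prefix("You are a helpful agent.", tools_b)
print(fingerprint(prefix_a) == fingerprint(prefix_b))
```

Logging the fingerprint per call is a cheap way to notice when something unstable, such as a timestamp or an unordered tool list, has leaked into the prefix.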

Chunking and formatting. How you split documents, what metadata you attach, how you delimit retrieved content (XML tags, markdown, fenced blocks) all affect how reliably the model uses retrieved context.
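As one possible formatting convention, the sketch below wraps each retrieved chunk in an XML-style tag with its source attached. The tag name, the attribute names, and the chunk dictionary shape are assumptions for illustration; escaping keeps chunk text from being mistaken for delimiters.

```python
from xml.sax.saxutils import escape, quoteattr


def format_chunks(chunks: list[dict]) -> str:
    """Render chunks as delimited blocks with index and source metadata."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        source = quoteattr(chunk["source"])  # adds quotes and escapes the value
        body = escape(chunk["text"])         # escapes &, < and >
        blocks.append(f'<document index="{index}" source={source}>\n{body}\n</document>')
    return "\n".join(blocks)


retrieved = [
    {"source": "faq.md", "text": "Refunds are issued within 14 days."},
    {"source": "policy.md", "text": "Orders < 10 EUR are not refundable."},
]
print(format_chunks(retrieved))
```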

Hybrid retrieval. Combine lexical and semantic search (sparse plus dense vectors). Add reranking. Filter by metadata. Generic top-k vector search is rarely the right production retrieval strategy.
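One common way to combine a lexical ranking with a vector ranking is reciprocal rank fusion (RRF), sketched below. It is named here as an illustrative choice, not as the method any particular system uses. The constant k=60 is a conventional default, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


lexical_hits = ["a", "b", "c"]  # e.g. from a keyword search
vector_hits = ["b", "c", "d"]   # e.g. from a dense vector search

print(reciprocal_rank_fusion([lexical_hits, vector_hits]))
```

Documents that appear in both lists rise to the top even if neither retriever ranked them first. A reranker or metadata filter would typically run on the fused list afterwards.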

Memory writes and reads. Decide what to remember, when to write it, how to retrieve it. Recency, frequency, and explicit user instruction are common signals. Avoid memorizing model errors as facts.
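A write policy can be as simple as a gate on where a candidate fact came from. The sketch below accepts facts stated by the user or confirmed by a tool and rejects unverified model inferences; the source labels and the MemoryStore class are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical source labels; a real system would define its own.
TRUSTED_SOURCES = {"user", "verified_tool"}


@dataclass
class MemoryStore:
    facts: dict[str, str] = field(default_factory=dict)
    rejected: list[tuple[str, str]] = field(default_factory=list)

    def write(self, key: str, value: str, source: str) -> bool:
        """Store the fact only if it comes from a trusted source."""
        if source not in TRUSTED_SOURCES:
            # Keep a record for review instead of memorizing a possible error.
            self.rejected.append((key, source))
            return False
        self.facts[key] = value
        return True


memory = MemoryStore()
memory.write("timezone", "Europe/Berlin", source="user")
memory.write("favorite_language", "Rust", source="model_inference")

print(memory.facts)
print(memory.rejected)
```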

Evaluation and observability. You can't engineer what you don't measure. Tools like Langfuse and LangSmith log full traces -- system prompt, retrieved chunks, tool calls, outputs -- so you can see what was actually in context when something went wrong.

Context Engineering for RAG

RAG is mostly a context engineering problem. The model is constant; the prompt is largely constant; what changes per query is the retrieved context. Getting that retrieval right is what separates a useful RAG system from a confused one.

The dimensions to engineer:

Index design. Chunk size, chunk overlap, metadata, embedding model, hybrid lexical + vector
Query construction. Rewriting, expansion, decomposition into sub-queries
Retrieval strategy. Top-k, reranking, multi-stage retrieval, GraphRAG for multi-hop reasoning
Filtering and deduplication. Metadata filters, semantic deduplication, freshness boosts
Formatting. How retrieved chunks are presented to the model, with what delimiters, and with what source metadata
Citation and grounding checks. Post-generation verification that the answer is actually supported by the retrieved context

These choices interact. Changing chunk size shifts what retrieval returns, which shifts what the model attends to, which shifts answer quality. Treating any of them in isolation produces brittle systems. See our practical guide for how this plays out in production.
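To make the chunk size and overlap knobs concrete, here is a minimal word-based chunker. Splitting on whitespace is a simplification for the example; production systems often split on tokens, sentences, or document structure instead.

```python
def chunk_words(text: str, size: int, overlap: int) -> list[str]:
    """Split text into chunks of `size` words, sharing `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    tokens = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "a b c d e f g"
print(chunk_words(sample, size=3, overlap=1))
print(chunk_words(sample, size=4, overlap=0))
```

Changing size or overlap changes which text lands in the same chunk, and therefore what any downstream retriever can return.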

Context Engineering for Agents

Agentic systems accumulate context fast. A single agent run can involve dozens of tool calls, each producing structured output, each adding tokens. Without active management, the agent runs out of effective context (or hits the hard limit) before it finishes the task.

The patterns that work:

Sub-agents with isolated context. A supervisor delegates a sub-task to a specialized agent with its own clean context. The sub-agent returns a compact result, not its full reasoning trace. The supervisor's context stays manageable.
Stateful scratchpads outside the context. The agent writes intermediate findings to an external store (file, database, memory service) and reads back only what's needed.
Aggressive tool output pruning. Keep the result, drop the call details unless they matter again.
Periodic compaction. Summarize older turns into a structured state representation; replace the verbose trace.
Tool budgets and step budgets. Cap how long an agent runs before forcing a checkpoint, summary, or escalation.
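The step-budget pattern from the list above can be sketched as a bounded loop. The step callable stands in for one agent iteration (a model call plus any tool calls) and is a placeholder; here it returns a result when the task is finished and None otherwise.

```python
from typing import Callable, Optional


def run_with_budget(step: Callable[[int], Optional[str]], max_steps: int) -> dict:
    """Run the agent step until it finishes or the budget is spent."""
    for index in range(max_steps):
        result = step(index)
        if result is not None:
            return {"status": "done", "steps": index + 1, "result": result}
    # Budget exhausted: hand control back for a checkpoint, summary, or escalation.
    return {"status": "budget_exhausted", "steps": max_steps, "result": None}


def fake_step(index: int) -> Optional[str]:
    # Placeholder agent that finishes on its fourth step.
    return "final answer" if index == 3 else None


print(run_with_budget(fake_step, max_steps=10))
print(run_with_budget(fake_step, max_steps=2))
```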

Agent runs are also where Langfuse and similar observability tools become essential. You cannot debug agent behavior without seeing the full context the model saw on each step.

Common Failure Modes

Context rot. Quality degrades as irrelevant tokens accumulate. The model gets distracted, confuses old tool outputs with current ones, or fixates on a chunk that was retrieved three turns ago and is no longer relevant.

Lost in the middle. Critical information placed in the middle of a long context is more likely to be overlooked. Put what matters at the start (system prompt) or the end (just before generation).
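One heuristic that follows from this advice is to reorder retrieved items so the strongest ones sit at the edges and the weakest in the middle. The sketch below assumes the input is already sorted best-first; whether the reordering helps for a given model is something to measure rather than assume.

```python
def order_for_edges(ranked: list) -> list:
    """Alternate items between the front and the back of the output."""
    front, back = [], []
    for position, item in enumerate(ranked):
        if position % 2 == 0:
            front.append(item)
        else:
            back.append(item)
    # Reverse the back half so the second-best item ends up last.
    return front + back[::-1]


# 1 is the most relevant item, 5 the least relevant.
print(order_for_edges([1, 2, 3, 4, 5]))
```

For five items the result puts item 1 first, item 2 last, and item 5 in the middle.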

Stale memory poisoning. A memory system that wrote down an early model error as a fact, then keeps retrieving it as ground truth.

Tool definition bloat. Sixty tools in the context, each with verbose descriptions, leaving little room for the actual task. Dynamic tool loading fixes this.

Retrieval drift. A retrieval system that returns plausible but irrelevant chunks. The model dutifully uses them and produces a confidently wrong answer.

Cost blowups. Context that grows monotonically with no compaction strategy. Long-running agent conversations that cost more per turn than the entire previous session combined.
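The growth pattern is easy to quantify. The sketch below totals the input tokens billed over a conversation where every turn resends a context that grows by a fixed amount, with and without a cap that models compaction. The token counts are illustrative assumptions.

```python
from typing import Optional


def total_input_tokens(base: int, growth_per_turn: int, turns: int,
                       cap: Optional[int] = None) -> int:
    """Sum the context size sent on each turn of a conversation."""
    total = 0
    for turn in range(turns):
        context = base + growth_per_turn * turn
        if cap is not None:
            context = min(context, cap)  # compaction keeps context under the cap
        total += context
    return total


unmanaged = total_input_tokens(base=1_000, growth_per_turn=2_000, turns=30)
compacted = total_input_tokens(base=1_000, growth_per_turn=2_000, turns=30, cap=10_000)

print(unmanaged, compacted)
```

Without a cap the total grows quadratically with the number of turns, because each turn pays again for everything before it. With a cap the total grows linearly once the cap is reached.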

Where Context Engineering Fits

Prompt engineering is part of context engineering. RAG is part of context engineering. Memory systems are part of context engineering. Tool design is part of context engineering. It's not a replacement for any of these -- it's the umbrella discipline that ties them together for production LLM systems, and especially for agentic AI.

The framing matters because it shifts where teams invest. Spending weeks tuning a system prompt while retrieval is broken, memory is unmanaged, and tool outputs are blowing up context budget is misallocated effort. Context engineering asks the broader question: of all the things the model sees on this turn, which are pulling their weight, and which are noise?

For a deeper look at how this plays out in real systems, see Everything you need to know before building AI agents and Needle in a haystack: optimizing retrieval and RAG over long context windows.