Context engineering is the discipline of designing the full information environment an LLM receives - not just the prompt, but retrieved knowledge, tool results, and conversation history. Here's what changed and why it matters for production AI systems.

For a while, "prompt engineering" was the skill everyone wanted on their resume. Learn to write better prompts, get better outputs. It worked - until it didn't. As LLM applications moved from chatbot demos to production systems with tools, memory, and multi-step workflows, the idea that you could control an AI system by perfecting a single text input started to fall apart.

The term gaining traction now is context engineering - and it describes something fundamentally different from tweaking prompt phrasing. This post explains what context engineering is, how it relates to prompt engineering, and why it's the skill that actually matters when building production AI systems.

The Rise and Limits of Prompt Engineering

Prompt engineering gave us genuinely useful techniques. Few-shot prompting showed models how to respond by example. Chain-of-thought prompting (Wei et al., 2022) unlocked step-by-step reasoning. System prompts set behavioral boundaries. These techniques work well for single-turn interactions - ask a question, get an answer.

But prompt engineering assumes a static world. You write a prompt, test it, maybe iterate a few times, and ship it. The problems show up when your LLM application needs to:

  1. Call external tools and incorporate their results mid-conversation
  2. Retrieve documents dynamically based on the user's query
  3. Maintain state across dozens of turns in a conversation
  4. Coordinate multiple agents working on subtasks in parallel

In these scenarios, the prompt is just one piece of what the model sees. The rest - retrieved documents, tool outputs, conversation history, system state - often matters far more than the prompt itself. You can have a perfect prompt and still get terrible results if the surrounding context is wrong.

What Is Context Engineering?

Context engineering is the discipline of designing and orchestrating the full information environment an LLM receives at each step of a task - including system instructions, retrieved knowledge, tool results, conversation history, and any other relevant state. It treats the context window as a dynamic, programmable interface rather than a static text box.

Andrej Karpathy put it this way: context engineering is "the delicate art and science of filling the context window with just the right information for the next step." He noted that in every industrial-strength LLM application, this goes far beyond what people think of as "prompting." It includes task descriptions, few-shot examples, RAG results, multimodal data, tool definitions, state, and history - and getting the balance right is non-trivial. Too little context and the model lacks information. Too much and costs go up while performance degrades.

Shopify CEO Tobi Lutke endorsed the framing, calling it "the art of providing all the context for the task to be plausibly solvable by the LLM." Simon Willison noted that unlike "prompt engineering" - which many people read as a pretentious term for typing things into a chatbot - "context engineering" has an inferred definition much closer to the actual work involved.

The four components of an LLM's context at any given step:

  • System instructions: the prompt itself, behavioral rules, output format constraints
  • Retrieved knowledge: documents, code, data pulled in via RAG or search
  • Tool results: outputs from function calls, API responses, database queries
  • Conversation history: prior turns, summaries of earlier exchanges, persistent memories

Prompt engineering covers the first bullet. Context engineering covers all four - and the dynamic orchestration between them.
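The four components above can be sketched as a single assembly step. This is a minimal illustration, not any framework's API - the `Context` class, field names, and `[doc]`/`[tool]` tags are invented for the example:

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    """The four components of an LLM's context at a given step.
    Names and tag formats here are illustrative, not a real API."""
    system_instructions: str
    retrieved_knowledge: list[str] = field(default_factory=list)
    tool_results: list[str] = field(default_factory=list)
    history: list[str] = field(default_factory=list)

    def assemble(self) -> str:
        """Flatten all four components into one prompt string."""
        parts = [self.system_instructions]
        parts += [f"[doc] {d}" for d in self.retrieved_knowledge]
        parts += [f"[tool] {t}" for t in self.tool_results]
        parts += self.history
        return "\n".join(parts)

ctx = Context(
    system_instructions="You are a support agent. Answer concisely.",
    retrieved_knowledge=["Refund policy: 30 days."],
    tool_results=["lookup_order(42) -> shipped 2024-06-01"],
    history=["user: where is my order?"],
)
prompt = ctx.assemble()
```

The point of the sketch is the shape: only the first field is "the prompt" in the traditional sense; the other three are assembled at runtime, which is where context engineering lives.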

Prompt Engineering vs Context Engineering: A Side-by-Side Comparison

Prompt engineering is not obsolete. It's a subset of context engineering. You still need well-written system prompts and instructions. But when people say "we need better prompts," they often mean "we need better context" - and those are different problems requiring different solutions.

  Dimension           Prompt Engineering                        Context Engineering
  Scope               Single text input to the model            Full information pipeline across the system
  Nature              Static - written once, updated manually   Dynamic - assembled at runtime per request
  Approach            Manual craft and iteration                System design and architecture
  Interaction model   Single-turn or simple multi-turn          Multi-step, agentic, tool-using workflows
  Failure mode        "The model misunderstood me"              "The model had the wrong information"
  Skill set           Writing, domain knowledge                 Software engineering, information architecture

LangChain's context engineering framework breaks the discipline into four strategies that map well to production systems:

  • Write: save information outside the context window for later (scratchpads, long-term memory)
  • Select: pull the right information in when needed (RAG, tool selection, memory retrieval)
  • Compress: reduce tokens while retaining what matters (summarization, trimming old messages)
  • Isolate: split context across separate agents or components so each gets only what it needs
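Of the four strategies, compress is the easiest to show in code. Here is a naive sketch that trims the oldest messages to fit a token budget, with whitespace word counts standing in for a real tokenizer:

```python
def compress_history(messages, max_tokens,
                     count_tokens=lambda m: len(m.split())):
    """Compress strategy: keep the most recent messages that fit the
    budget, dropping the oldest first. The whitespace-based token
    count is a stand-in for a real tokenizer."""
    kept, total = [], 0
    for msg in reversed(messages):  # walk newest to oldest
        cost = count_tokens(msg)
        if total + cost > max_tokens:
            break  # oldest messages past this point are dropped
        kept.append(msg)
        total += cost
    return list(reversed(kept))  # restore chronological order

recent = compress_history(
    ["one two three", "four five", "six seven eight nine"],
    max_tokens=6,
)
```

A production version would summarize the dropped messages rather than discard them outright (the write strategy), but the budget-driven loop is the core mechanism.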

Context Engineering in Practice

What does this look like in real systems? Here are the patterns we see most often in production.

RAG as Context Engineering

Retrieval-augmented generation is context engineering in its purest form. You're not changing the prompt - you're changing what knowledge the model has access to when it generates a response. The engineering challenge isn't writing the query; it's building the pipeline that retrieves the right documents, ranks them, filters irrelevant noise, and fits the result within the token budget.
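A toy version of that pipeline, with word-overlap scoring standing in for a real embedding or BM25 retriever - the function name, corpus, and scoring are all illustrative:

```python
def build_rag_context(query, corpus, top_k=3, token_budget=50):
    """Sketch of a retrieval pipeline: score, rank, filter, then pack
    documents into a token budget. Word overlap is a placeholder for
    a real retriever; word counts stand in for a real tokenizer."""
    q_words = set(query.lower().split())

    def score(doc):
        return len(q_words & set(doc.lower().split()))

    ranked = sorted(corpus, key=score, reverse=True)
    relevant = [d for d in ranked if score(d) > 0][:top_k]  # filter noise

    selected, used = [], 0
    for doc in relevant:  # pack greedily until the budget is spent
        cost = len(doc.split())
        if used + cost > token_budget:
            break
        selected.append(doc)
        used += cost
    return selected

corpus = [
    "the refund policy allows returns within 30 days",
    "our office dog is named Rex",
    "refund requests require an order number",
]
docs = build_rag_context("how do I get a refund", corpus)
```

Every stage here - scoring, ranking, filtering, packing - is a place where a real system can fail without the prompt ever changing.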

In our work building RAG systems with Shraga (our open-source RAG framework), the difference between a working and failing system almost always comes down to retrieval quality and context assembly - not prompt wording.

Tool Use and Function Results

When an LLM calls a tool - a database query, an API, a code interpreter - the result becomes part of its context for the next reasoning step. A customer support agent that retrieves the wrong account data will give a confidently wrong answer no matter how good its system prompt is. The context engineering challenge here is deciding which tools to expose, how to format their outputs, and how to handle failures.
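One common pattern for the output-formatting and failure-handling part: wrap every tool call in an envelope so the model always sees a structured result, including explicit failures, instead of a raw traceback or silence. The JSON envelope shape below is an assumption for illustration, not a standard:

```python
import json

def run_tool(name, fn, *args):
    """Wrap a tool call so the result the model sees is always
    structured, whether the call succeeded or failed. The envelope
    fields here are illustrative, not a standard format."""
    try:
        payload = {"tool": name, "status": "ok", "result": fn(*args)}
    except Exception as exc:
        payload = {"tool": name, "status": "error", "error": str(exc)}
    return json.dumps(payload)

def get_account(account_id):
    """Hypothetical tool backed by an in-memory table."""
    accounts = {"a1": {"plan": "pro", "balance": 42.0}}
    if account_id not in accounts:
        raise KeyError(f"unknown account {account_id}")
    return accounts[account_id]

ok = run_tool("get_account", get_account, "a1")
err = run_tool("get_account", get_account, "zzz")
```

Surfacing the failure explicitly matters: a model that sees `"status": "error"` can retry or ask for clarification, while one that sees nothing will often fabricate an answer.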

Research from LangChain found that applying RAG to tool descriptions - dynamically selecting which tools to show the model based on the task - improved accuracy 3-fold compared to dumping all tool descriptions into the context.
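The idea can be sketched with keyword overlap in place of the embedding retrieval a real system would use - the tool list and scoring below are illustrative only:

```python
def select_tools(task, tools, top_k=2):
    """Rank tool descriptions by word overlap with the task and expose
    only the top-k to the model. Overlap scoring is a stand-in for
    embedding-based retrieval over tool descriptions."""
    task_words = set(task.lower().split())

    def relevance(tool):
        return len(task_words & set(tool["description"].lower().split()))

    ranked = sorted(tools, key=relevance, reverse=True)
    return [t for t in ranked if relevance(t) > 0][:top_k]

tools = [
    {"name": "query_db", "description": "run a sql query against the orders database"},
    {"name": "send_email", "description": "send an email to a customer"},
    {"name": "get_weather", "description": "look up the weather forecast"},
]
chosen = select_tools("find recent orders in the database", tools)
```

The effect is the same as document RAG: the model's context contains only the tool definitions relevant to the task at hand, instead of every tool the system knows about.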

Memory and Context Window Management

Long-running agents accumulate context fast. Anthropic's multi-agent research system uses up to 15x more tokens than a standard chat interaction. Without active management, you either blow past the context window or drown the model in irrelevant history.

Production strategies include:

  • Auto-compaction: Claude Code triggers summarization at 95% context capacity, compressing earlier conversation into key points
  • Context isolation: Anthropic's multi-agent system gives each subagent its own context window, letting them explore independently before condensing results back to a lead agent. This approach outperformed single-agent setups by over 90%
  • Selective memory: systems like ChatGPT and Cursor generate long-term memories across sessions, choosing what to persist and what to discard
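An auto-compaction trigger like the first bullet can be sketched as follows. The threshold, word-count tokenizer, and placeholder summary string are all illustrative; a real system would call the model to produce the summary:

```python
def maybe_compact(messages, window_limit, threshold=0.95,
                  count=lambda m: len(m.split())):
    """Auto-compaction sketch: once usage crosses the threshold of the
    context window, replace all but the last two messages with a
    summary stub. The stub marks where a model-generated summary
    would go in a real system."""
    used = sum(count(m) for m in messages)
    if used < threshold * window_limit:
        return messages  # still under budget, keep everything
    recent = messages[-2:]
    summary = f"[summary of {len(messages) - len(recent)} earlier messages]"
    return [summary] + recent

compacted = maybe_compact(["one two three four"] * 10, window_limit=40)
```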

Why Agents Fail

According to LangChain's 2025 State of Agent Engineering report, 57% of organizations now have AI agents in production, but 32% cite quality as their top barrier. Most quality failures trace back not to model capability but to poor context management - the model had the wrong information, too much information, or stale information when it needed to make a decision.

Four context failure modes recur in production systems: poisoning (wrong information injected), distraction (irrelevant context drowning out the relevant), confusion (contradictory information), and clash (conflicting instructions from different context sources). Each is a context engineering problem, not a prompt engineering problem.

Getting Started: From Prompt Tweaker to Context Architect

If you're building LLM-powered applications and still spending most of your time tweaking prompt wording, here's how to shift toward context engineering:

  1. Map your context: trace every piece of information your LLM receives at each step. System prompt, retrieved docs, tool results, conversation history - draw it out. You'll likely find context you didn't realize was there, and gaps where context should be.

  2. Instrument and observe: 89% of organizations now have some form of agent observability, but only 52% have proper evaluations (LangChain, 2025). Trace which context inputs correlate with good and bad outputs. The problem is rarely "the model is dumb" - it's usually "the model saw the wrong things."

  3. Design context pipelines: treat context assembly as a software engineering problem. Build pipelines that retrieve, rank, filter, and format context dynamically. Frameworks like LangGraph, LlamaIndex, and Shraga provide building blocks for this, though many production systems end up with custom pipelines tuned to their specific needs.

  4. Budget your tokens: treat the context window like memory in an embedded system - a hard constraint you design around, not an afterthought. Decide what gets included, what gets summarized, and what gets dropped. Profile your token usage across real requests.

  5. Test context, not just prompts: when something goes wrong, don't just tweak the prompt. Examine the full context the model received. Build test cases that exercise different context compositions, not just different prompt phrasings.
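One way to make step 4's budgeting concrete is a fixed-share allocation across the four context components. The window size, output reserve, and shares below are illustrative numbers, not recommendations - real systems tune them per workload:

```python
def allocate_budget(window=8000, reserve_for_output=1000, shares=None):
    """Token-budgeting sketch: split the usable window between context
    components by fixed shares. All numbers are illustrative."""
    if shares is None:
        shares = {"system": 0.10, "retrieved": 0.45,
                  "tools": 0.20, "history": 0.25}
    usable = window - reserve_for_output  # leave room for the model's reply
    return {name: int(usable * frac) for name, frac in shares.items()}

budget = allocate_budget()
```

Even a crude allocation like this forces the design question that matters: when retrieval returns more than its share, something must be ranked, compressed, or dropped - deliberately, not by truncation.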

Key Takeaways

  • Context engineering is the discipline of designing the full information environment an LLM receives - system instructions, retrieved knowledge, tool results, and conversation history - not just the prompt text.
  • Prompt engineering is a subset of context engineering, not a separate practice. You still need good prompts, but they're one component of a larger system.
  • Production AI agents fail primarily due to bad context, not bad prompts. The model had the wrong information, not the wrong instructions.
  • The four strategies for managing context are write (persist for later), select (retrieve what's relevant), compress (fit within token limits), and isolate (separate concerns across agents).
  • Building reliable LLM applications requires treating context assembly as a software engineering discipline - with pipelines, observability, and systematic testing.