

“Set your heart upon your work, but never on its reward.”
Bhagavad Gita

Prompt engineering had a good run. For about two years, the ability to craft the right prompt was a genuine competitive advantage. People wrote books about it. Companies hired for it. "Prompt engineer" appeared on job titles unironically.
That era is over.
Not because prompts don't matter — they do — but because the bottleneck moved. With models like Claude Opus 4.6 offering million-token context windows and tool-use capabilities that can pull in information dynamically, the hard problem isn't "what do I say to the model?" It's "what information does the model need, in what structure, at what point in the interaction?"
That's context engineering. And it's a fundamentally different discipline.
Prompt engineering was about the instruction layer: how you phrase the ask, what persona you set, what few-shot examples you include. It operated on a fixed, relatively small input.
Context engineering operates on the entire information space the model has access to. It includes:

- The system prompt, instructions, and constraints
- Core domain knowledge and reference material
- Retrieved documents from RAG pipelines
- Conversation history, raw or compressed
- Tool definitions and tool results
- User preferences and other structural metadata
This is systems architecture, not copywriting. The skill set is closer to database design than creative writing.
The mental model that works best: treat the context window as a fixed-size cache with specific performance characteristics.
Like any cache, it has:

- A fixed capacity: the context window
- An eviction policy: what gets compressed or dropped as the window fills
- Hot and cold regions: attention is strongest at the beginning and end, weakest in the middle
- A cost per entry: every token spent on one piece of information is unavailable for another
The best context engineers I work with think about information placement, density, and relevance the way a database engineer thinks about indexing and query optimization.
```typescript
interface ContextArchitecture {
  // Static context: always present, loaded at initialization
  systemPrompt: string;      // Instructions, persona, constraints
  domainKnowledge: string;   // Core reference material
  outputSchema: string;      // Expected response format

  // Dynamic context: loaded based on the current task
  retrievedDocuments: Document[];  // From RAG pipeline
  conversationHistory: Message[];  // Compressed/summarized
  toolResults: ToolResult[];       // From MCP servers or function calls

  // Metadata context: structural information
  availableTools: ToolDefinition[];
  userPreferences: UserContext;
}
```

Don't dump everything into the context at once. Load information in layers based on relevance:
Layer 1 — Always present: System prompt, core instructions, output format requirements. This is your "hot" cache — small, high-value, never evicted.
Layer 2 — Task-specific: Retrieved documents, relevant code files, conversation history. Loaded when the task is identified, refreshed as the task evolves.
Layer 3 — On-demand: Detailed reference material, API documentation, edge case examples. Loaded only when the model requests it via tool calls.
```typescript
async function buildContext(task: Task): Promise<Context> {
  // Layer 1: Always present (~2K tokens)
  const base = {
    systemPrompt: SYSTEM_PROMPT,
    outputFormat: task.expectedFormat,
    constraints: task.constraints,
  };

  // Layer 2: Task-specific (~10-50K tokens)
  const [documents, history, examples] = await Promise.all([
    retrieveRelevantDocs(task.query, { limit: 10 }),
    getRecentConversation(task.sessionId, { maxTokens: 8000 }),
    getRelatedExamples(task.type, { limit: 3 }),
  ]);

  // Layer 3: Available on-demand via tools
  const tools = [
    searchDocsTool,    // Model can search for more docs if needed
    readFileTool,      // Model can read specific files
    queryDatabaseTool, // Model can look up specific records
  ];

  return { ...base, documents, history, examples, tools };
}
```

This approach keeps the initial context lean while giving the model the ability to pull in more information when needed. It's the difference between loading an entire database into memory versus giving the application a connection string.
Long conversations accumulate tokens fast. A 50-message conversation easily consumes 20-30K tokens, most of which is redundant or outdated.
The pattern that works: progressive summarization. As conversations grow, older messages get compressed into summaries while recent messages remain verbatim.
```typescript
function compressConversation(
  messages: Message[],
  budget: number
): Message[] {
  const recentWindowSize = 10;
  const recent = messages.slice(-recentWindowSize);
  const older = messages.slice(0, -recentWindowSize);

  if (estimateTokens(recent) > budget) {
    // Even recent messages exceed budget — aggressive compression
    return [summarizeMessages(messages)];
  }

  if (older.length === 0) return recent;

  const olderBudget = budget - estimateTokens(recent);
  const compressed = summarizeMessages(older, { maxTokens: olderBudget });
  return [compressed, ...recent];
}
```

The key insight: the model doesn't need the exact wording of a message from 30 turns ago. It needs the information from that message. Summaries preserve information while reducing token count.
Given the same information, structured representations consume fewer tokens and are more reliably interpreted by the model.
Instead of:

```
The user John Smith signed up on January 15, 2026. He has a premium
subscription that costs $29/month. He's based in San Francisco and
works at Acme Corp as a Senior Engineer. His last login was February
14, 2026 and he has 3 active projects.
```
Use:

```json
{
  "user": "John Smith",
  "signup": "2026-01-15",
  "plan": "premium ($29/mo)",
  "location": "San Francisco",
  "company": "Acme Corp",
  "role": "Senior Engineer",
  "lastLogin": "2026-02-14",
  "activeProjects": 3
}
```

The JSON version uses ~40% fewer tokens, and the model extracts individual fields more reliably. For tabular data, markdown tables are similarly efficient.
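The markdown-table point can be made concrete with a small helper. This is an illustrative sketch, not from the article, and assumes flat records with consistent keys:

```typescript
// Render an array of flat records as a markdown table — a compact,
// structured representation for tabular context.
function toMarkdownTable(rows: Record<string, string | number>[]): string {
  if (rows.length === 0) return "";
  const headers = Object.keys(rows[0]);
  const lines = [
    `| ${headers.join(" | ")} |`,
    `|${headers.map(() => "---").join("|")}|`,
    ...rows.map((r) => `| ${headers.map((h) => String(r[h])).join(" | ")} |`),
  ];
  return lines.join("\n");
}
```

Ten user records rendered this way pay the header cost once, instead of repeating field names in every row the way per-record JSON objects do.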
The "lost in the middle" problem is real and measurable. Models attend most strongly to the beginning and end of the context window, with weaker attention to the middle.
Practical implications:

- Put system instructions and critical reference material at the top, where attention is strongest.
- Put the current query and few-shot examples at the bottom, where recency keeps attention high.
- Reserve the middle for bulk material the model can afford to skim: domain knowledge, conversation history, retrieved documents.
```typescript
function assembleContext(parts: ContextParts): string {
  return [
    parts.systemPrompt,         // Top: highest attention
    parts.criticalReference,    // Near top: high attention
    parts.domainKnowledge,      // Middle: moderate attention
    parts.conversationHistory,  // Middle: moderate attention
    parts.retrievedDocuments,   // Below middle: lower attention
    parts.fewShotExamples,      // Near bottom: high attention (recency)
    parts.currentQuery,         // Bottom: highest attention (recency)
  ].join("\n\n");
}
```

A recurring debate in 2026: "With million-token windows, do we still need RAG?"
The answer is yes — but the role of RAG has changed.
Old RAG (2024): Compensated for small context windows. You had to retrieve because you couldn't fit everything. The quality bar was "better than nothing."
New RAG (2026): Serves as a relevance filter. Even with a million tokens, you don't want to fill the window with irrelevant information. RAG's job is now precision — surfacing the right 10 documents from a corpus of 100,000 — not just fitting documents into the window.
The optimal architecture uses both: RAG handles breadth (searching across all your data), while long context handles depth (reasoning over the retrieved information). They're complementary, not competing.
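The two-stage shape can be sketched in a few lines. This is a toy: keyword overlap stands in for embedding similarity, and all names (`retrieveTopK`, `buildRagPrompt`) are illustrative:

```typescript
interface Doc {
  id: string;
  text: string;
}

// Stage 1 (breadth): score every document against the query and keep
// only the top k. Keyword overlap stands in for embedding similarity.
function retrieveTopK(corpus: Doc[], query: string, k: number): Doc[] {
  const terms = new Set(query.toLowerCase().split(/\s+/));
  return corpus
    .map((doc) => ({
      doc,
      score: doc.text
        .toLowerCase()
        .split(/\s+/)
        .filter((w) => terms.has(w)).length,
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((s) => s.doc);
}

// Stage 2 (depth): pack only the retrieved documents into the prompt,
// leaving the long context window for reasoning rather than noise.
function buildRagPrompt(corpus: Doc[], query: string, k = 10): string {
  const docs = retrieveTopK(corpus, query, k);
  const sources = docs.map((d) => `[${d.id}]\n${d.text}`).join("\n\n");
  return `Answer using only these sources:\n\n${sources}\n\nQuestion: ${query}`;
}
```

Retrieval filters a corpus of 100,000 down to the relevant few; the long-context call then reasons over those few in full.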
The Model Context Protocol deserves specific mention because it's becoming the standard plumbing for context engineering.
MCP servers are, fundamentally, context providers. They give the model access to information (resources) and capabilities (tools) through a standardized interface. A well-designed MCP architecture is context engineering made concrete:

- Resources supply information: the documents, records, and files that fill task-specific context.
- Tools supply capabilities: the on-demand retrieval path that lets the model decide when more context is worth fetching.
- Prompts supply reusable instruction templates: always-present context, packaged and versioned.
If you're doing context engineering without MCP, you're building custom plumbing that the protocol already handles. The protocol won because it turns context management from a bespoke implementation problem into a composable architecture problem.
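The "MCP servers are context providers" framing can be sketched as a plain interface. To be clear, this is a conceptual sketch, not the actual MCP SDK; the real protocol adds JSON-RPC transport, capability negotiation, and schemas for tool arguments:

```typescript
// Conceptual shape of an MCP-style context provider — NOT the real SDK.
interface Resource {
  uri: string;
  read(): Promise<string>; // Information the host can load into context
}

interface Tool {
  name: string;
  description: string;
  call(args: Record<string, unknown>): Promise<string>; // On-demand context
}

interface ContextProvider {
  listResources(): Resource[];
  listTools(): Tool[];
}

// A toy provider exposing one resource and one tool.
const docsProvider: ContextProvider = {
  listResources: () => [
    {
      uri: "docs://style-guide",
      read: async () => "Use sentence case for headings.",
    },
  ],
  listTools: () => [
    {
      name: "search_docs",
      description: "Full-text search over documentation",
      call: async ({ query }) => `Results for: ${String(query)}`,
    },
  ],
};
```

The point of the shape: the host decides what to load eagerly (resources), and the model decides what to fetch lazily (tools). That split is the layered-loading pattern from earlier, expressed as an interface.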
The hardest part of context engineering: how do you know your context architecture is good?
The metrics I track:
| Metric | What it measures | Target |
|---|---|---|
| Context utilization | % of context tokens that contribute to the response | > 60% |
| Retrieval precision | % of retrieved documents cited in response | > 70% |
| Token efficiency | Quality per token spent | Increasing over time |
| Grounding rate | % of claims traceable to context | > 90% for factual tasks |
| Latency overhead | Time spent building context vs. model inference | < 30% of total |
The first metric — context utilization — is the most underrated. If you're sending 50K tokens of context and the model only references 5K of it, you're wasting 90% of your context budget. That's not just inefficient — it's actively harmful, because irrelevant context can distract the model from the relevant information.
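One rough way to measure context utilization, assuming the prompt asks the model to cite sources with a `[chunk:ID]` convention (the citation format and function name are assumptions, not a standard):

```typescript
// Approximate context utilization: the share of context chunks whose
// IDs the model cited in its response. Assumes the prompt instructed
// the model to cite sources as [chunk:ID].
function contextUtilization(
  chunkIds: string[],
  response: string
): number {
  if (chunkIds.length === 0) return 0;
  const cited = chunkIds.filter((id) => response.includes(`[chunk:${id}]`));
  return cited.length / chunkIds.length;
}
```

Citation-based measurement undercounts chunks the model used without citing, so treat it as a lower bound; it's still enough to catch the 50K-in, 5K-used pathology.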
Think in information architecture, not prompt tricks. The structure of your context matters more than the wording of your instructions.
Start small, add incrementally. Begin with the minimum viable context and add information only when you can measure its impact on output quality.
Instrument everything. Track what's in your context, how much of it the model uses, and how output quality correlates with context composition.
Design for eviction. Your context will exceed the window eventually. Decide upfront what gets compressed, what gets dropped, and what's always retained.
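"Design for eviction" can be sketched as a priority-ordered trim: each context part carries a priority, and the lowest-priority parts are dropped first until the budget fits. The names and the chars/4 token estimate are illustrative:

```typescript
interface ContextPart {
  name: string;
  content: string;
  priority: number; // Higher = retained longer; always-present parts get the max
}

// Drop the lowest-priority parts until the estimated token count fits.
function evictToBudget(parts: ContextPart[], budget: number): ContextPart[] {
  const kept = [...parts].sort((a, b) => b.priority - a.priority);
  const tokens = (p: ContextPart) => Math.ceil(p.content.length / 4);
  while (kept.length > 0 && kept.reduce((s, p) => s + tokens(p), 0) > budget) {
    kept.pop(); // Lowest-priority part goes first
  }
  // Restore the original ordering for assembly
  return parts.filter((p) => kept.includes(p));
}
```

A fuller version would compress before dropping (summarize history rather than discard it), but the upfront decision is the same: rank every part by how much you'd miss it.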
Use MCP. Seriously. The protocol turns context management from an ad-hoc problem into a composable system. Every hour spent building custom tool integrations is an hour wasted.
Prompt engineering was about asking the right question. Context engineering is about building the right information environment. The shift is from craftsmanship to architecture — and the architects are just getting started.
