

“Set your heart upon your work, but never on its reward.”
Bhagavad Gita

Prompt engineering had a good run. For about two years, the ability to craft the right prompt was a genuine competitive advantage. People wrote books about it. Companies hired for it. "Prompt engineer" appeared on job titles unironically.
That era is over.
Not because prompts don't matter — they do — but because the bottleneck moved. With models like Claude Opus 4.6 offering million-token context windows and tool-use capabilities that can pull in information dynamically, the hard problem isn't "what do I say to the model?" It's "what information does the model need, in what structure, at what point in the interaction?"
That's context engineering. And it's a fundamentally different discipline.
Prompt engineering was about the instruction layer: how you phrase the ask, what persona you set, what few-shot examples you include. It operated on a fixed, relatively small input.
Context engineering operates on the entire information space the model has access to. It includes:

- The system prompt, instructions, and constraints
- Core domain knowledge and reference material
- Retrieved documents from RAG pipelines
- Conversation history, raw or compressed
- Tool definitions and tool results
- User preferences and other structural metadata
This is systems architecture, not copywriting. The skill set is closer to database design than creative writing.
The mental model that works best: treat the context window as a fixed-size cache with specific performance characteristics.
Like any cache, it has:

- A fixed capacity: the context window
- An eviction policy: what gets compressed or dropped as the window fills
- Hot and cold regions: attention is strongest at the beginning and end, weakest in the middle
- A cost per entry: every token spent on one piece of information is unavailable for another
The best context engineers I work with think about information placement, density, and relevance the way a database engineer thinks about indexing and query optimization.
```typescript
interface ContextArchitecture {
  // Static context: always present, loaded at initialization
  systemPrompt: string;      // Instructions, persona, constraints
  domainKnowledge: string;   // Core reference material
  outputSchema: string;      // Expected response format

  // Dynamic context: loaded based on the current task
  retrievedDocuments: Document[];  // From RAG pipeline
  conversationHistory: Message[];  // Compressed/summarized
  toolResults: ToolResult[];       // From MCP servers or function calls

  // Metadata context: structural information
  availableTools: ToolDefinition[];
  userPreferences: UserContext;
}
```

Don't dump everything into the context at once. Load information in layers based on relevance:
Layer 1 — Always present: System prompt, core instructions, output format requirements. This is your "hot" cache — small, high-value, never evicted.
Layer 2 — Task-specific: Retrieved documents, relevant code files, conversation history. Loaded when the task is identified, refreshed as the task evolves.
Layer 3 — On-demand: Detailed reference material, API documentation, edge case examples. Loaded only when the model requests it via tool calls.
```typescript
async function buildContext(task: Task): Promise<Context> {
  // Layer 1: Always present (~2K tokens)
  const base = {
    systemPrompt: SYSTEM_PROMPT,
    outputFormat: task.expectedFormat,
    constraints: task.constraints,
  };

  // Layer 2: Task-specific (~10-50K tokens)
  const [documents, history, examples] = await Promise.all([
    retrieveRelevantDocs(task.query, { limit: 10 }),
    getRecentConversation(task.sessionId, { maxTokens: 8000 }),
    getRelatedExamples(task.type, { limit: 3 }),
  ]);

  // Layer 3: Available on-demand via tools
  const tools = [
    searchDocsTool,    // Model can search for more docs if needed
    readFileTool,      // Model can read specific files
    queryDatabaseTool, // Model can look up specific records
  ];

  return { ...base, documents, history, examples, tools };
}
```

This approach keeps the initial context lean while giving the model the ability to pull in more information when needed. It's the difference between loading an entire database into memory versus giving the application a connection string.
Long conversations accumulate tokens fast. A 50-message conversation easily consumes 20-30K tokens, most of which is redundant or outdated.
The pattern that works: progressive summarization. As conversations grow, older messages get compressed into summaries while recent messages remain verbatim.
```typescript
function compressConversation(
  messages: Message[],
  budget: number
): Message[] {
  const recentWindowSize = 10;
  const recent = messages.slice(-recentWindowSize);
  const older = messages.slice(0, -recentWindowSize);

  if (estimateTokens(recent) > budget) {
    // Even recent messages exceed budget — aggressive compression
    return [summarizeMessages(messages)];
  }

  if (older.length === 0) return recent;

  const olderBudget = budget - estimateTokens(recent);
  const compressed = summarizeMessages(older, { maxTokens: olderBudget });
  return [compressed, ...recent];
}
```

The key insight: the model doesn't need the exact wording of a message from 30 turns ago. It needs the information from that message. Summaries preserve information while reducing token count.
Given the same information, structured representations consume fewer tokens and are more reliably interpreted by the model.
Instead of:

```
The user John Smith signed up on January 15, 2026. He has a premium
subscription that costs $29/month. He's based in San Francisco and
works at Acme Corp as a Senior Engineer. His last login was February
14, 2026 and he has 3 active projects.
```
Use:

```json
{
  "user": "John Smith",
  "signup": "2026-01-15",
  "plan": "premium ($29/mo)",
  "location": "San Francisco",
  "company": "Acme Corp",
  "role": "Senior Engineer",
  "lastLogin": "2026-02-14",
  "activeProjects": 3
}
```

The JSON version uses ~40% fewer tokens, and the model extracts individual fields more reliably. For tabular data, markdown tables are similarly efficient.
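The markdown-table point can be made concrete with a small helper. This is an illustrative sketch, not from the article, and assumes flat records with consistent keys:

```typescript
// Render an array of flat records as a markdown table — a compact,
// structured representation for tabular context.
function toMarkdownTable(rows: Record<string, string | number>[]): string {
  if (rows.length === 0) return "";
  const headers = Object.keys(rows[0]);
  const lines = [
    `| ${headers.join(" | ")} |`,
    `|${headers.map(() => "---").join("|")}|`,
    ...rows.map((r) => `| ${headers.map((h) => String(r[h])).join(" | ")} |`),
  ];
  return lines.join("\n");
}
```

Ten user records rendered this way pay the header cost once, instead of repeating field names in every row the way per-record JSON objects do.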
The "lost in the middle" problem is real and measurable. Models attend most strongly to the beginning and end of the context window, with weaker attention to the middle.
Practical implications:

- Put system instructions and critical reference material at the top, where attention is strongest.
- Put the current query and few-shot examples at the bottom, where recency keeps attention high.
- Reserve the middle for bulk material the model can afford to skim: domain knowledge, conversation history, retrieved documents.
```typescript
function assembleContext(parts: ContextParts): string {
  return [
    parts.systemPrompt,         // Top: highest attention
    parts.criticalReference,    // Near top: high attention
    parts.domainKnowledge,      // Middle: moderate attention
    parts.conversationHistory,  // Middle: moderate attention
    parts.retrievedDocuments,   // Below middle: lower attention
    parts.fewShotExamples,      // Near bottom: high attention (recency)
    parts.currentQuery,         // Bottom: highest attention (recency)
  ].join("\n\n");
}
```

A recurring debate in 2026: "With million-token windows, do we still need RAG?"
The answer is yes — but the role of RAG has changed.
Old RAG (2024): Compensated for small context windows. You had to retrieve because you couldn't fit everything. The quality bar was "better than nothing."
New RAG (2026): Serves as a relevance filter. Even with a million tokens, you don't want to fill the window with irrelevant information. RAG's job is now precision — surfacing the right 10 documents from a corpus of 100,000 — not just fitting documents into the window.
The optimal architecture uses both: RAG handles breadth (searching across all your data), while long context handles depth (reasoning over the retrieved information). They're complementary, not competing.
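The two-stage shape can be sketched in a few lines. This is a toy: keyword overlap stands in for embedding similarity, and all names (`retrieveTopK`, `buildRagPrompt`) are illustrative:

```typescript
interface Doc {
  id: string;
  text: string;
}

// Stage 1 (breadth): score every document against the query and keep
// only the top k. Keyword overlap stands in for embedding similarity.
function retrieveTopK(corpus: Doc[], query: string, k: number): Doc[] {
  const terms = new Set(query.toLowerCase().split(/\s+/));
  return corpus
    .map((doc) => ({
      doc,
      score: doc.text
        .toLowerCase()
        .split(/\s+/)
        .filter((w) => terms.has(w)).length,
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((s) => s.doc);
}

// Stage 2 (depth): pack only the retrieved documents into the prompt,
// leaving the long context window for reasoning rather than noise.
function buildRagPrompt(corpus: Doc[], query: string, k = 10): string {
  const docs = retrieveTopK(corpus, query, k);
  const sources = docs.map((d) => `[${d.id}]\n${d.text}`).join("\n\n");
  return `Answer using only these sources:\n\n${sources}\n\nQuestion: ${query}`;
}
```

Retrieval filters a corpus of 100,000 down to the relevant few; the long-context call then reasons over those few in full.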
The Model Context Protocol deserves specific mention because it's becoming the standard plumbing for context engineering.
MCP servers are, fundamentally, context providers. They give the model access to information (resources) and capabilities (tools) through a standardized interface. A well-designed MCP architecture is context engineering made concrete:

- Resources supply information: the documents, records, and files that fill task-specific context.
- Tools supply capabilities: the on-demand retrieval path that lets the model decide when more context is worth fetching.
- Prompts supply reusable instruction templates: always-present context, packaged and versioned.
If you're doing context engineering without MCP, you're building custom plumbing that the protocol already handles. The protocol won because it turns context management from a bespoke implementation problem into a composable architecture problem.
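The "MCP servers are context providers" framing can be sketched as a plain interface. To be clear, this is a conceptual sketch, not the actual MCP SDK; the real protocol adds JSON-RPC transport, capability negotiation, and schemas for tool arguments:

```typescript
// Conceptual shape of an MCP-style context provider — NOT the real SDK.
interface Resource {
  uri: string;
  read(): Promise<string>; // Information the host can load into context
}

interface Tool {
  name: string;
  description: string;
  call(args: Record<string, unknown>): Promise<string>; // On-demand context
}

interface ContextProvider {
  listResources(): Resource[];
  listTools(): Tool[];
}

// A toy provider exposing one resource and one tool.
const docsProvider: ContextProvider = {
  listResources: () => [
    {
      uri: "docs://style-guide",
      read: async () => "Use sentence case for headings.",
    },
  ],
  listTools: () => [
    {
      name: "search_docs",
      description: "Full-text search over documentation",
      call: async ({ query }) => `Results for: ${String(query)}`,
    },
  ],
};
```

The point of the shape: the host decides what to load eagerly (resources), and the model decides what to fetch lazily (tools). That split is the layered-loading pattern from earlier, expressed as an interface.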
The hardest part of context engineering: how do you know your context architecture is good?
The metrics I track:
| Metric | What it measures | Target |
|---|---|---|
| Context utilization | % of context tokens that contribute to the response | > 60% |
| Retrieval precision | % of retrieved documents cited in response | > 70% |
| Token efficiency | Quality per token spent | Increasing over time |
| Grounding rate | % of claims traceable to context | > 90% for factual tasks |
| Latency overhead | Time spent building context vs. model inference | < 30% of total |
The first metric — context utilization — is the most underrated. If you're sending 50K tokens of context and the model only references 5K of it, you're wasting 90% of your context budget. That's not just inefficient — it's actively harmful, because irrelevant context can distract the model from the relevant information.
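One rough way to measure context utilization, assuming the prompt asks the model to cite sources with a `[chunk:ID]` convention (the citation format and function name are assumptions, not a standard):

```typescript
// Approximate context utilization: the share of context chunks whose
// IDs the model cited in its response. Assumes the prompt instructed
// the model to cite sources as [chunk:ID].
function contextUtilization(
  chunkIds: string[],
  response: string
): number {
  if (chunkIds.length === 0) return 0;
  const cited = chunkIds.filter((id) => response.includes(`[chunk:${id}]`));
  return cited.length / chunkIds.length;
}
```

Citation-based measurement undercounts chunks the model used without citing, so treat it as a lower bound; it's still enough to catch the 50K-in, 5K-used pathology.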
Think in information architecture, not prompt tricks. The structure of your context matters more than the wording of your instructions.
Start small, add incrementally. Begin with the minimum viable context and add information only when you can measure its impact on output quality.
Instrument everything. Track what's in your context, how much of it the model uses, and how output quality correlates with context composition.
Design for eviction. Your context will exceed the window eventually. Decide upfront what gets compressed, what gets dropped, and what's always retained.
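"Design for eviction" can be sketched as a priority-ordered trim: each context part carries a priority, and the lowest-priority parts are dropped first until the budget fits. The names and the chars/4 token estimate are illustrative:

```typescript
interface ContextPart {
  name: string;
  content: string;
  priority: number; // Higher = retained longer; always-present parts get the max
}

// Drop the lowest-priority parts until the estimated token count fits.
function evictToBudget(parts: ContextPart[], budget: number): ContextPart[] {
  const kept = [...parts].sort((a, b) => b.priority - a.priority);
  const tokens = (p: ContextPart) => Math.ceil(p.content.length / 4);
  while (kept.length > 0 && kept.reduce((s, p) => s + tokens(p), 0) > budget) {
    kept.pop(); // Lowest-priority part goes first
  }
  // Restore the original ordering for assembly
  return parts.filter((p) => kept.includes(p));
}
```

A fuller version would compress before dropping (summarize history rather than discard it), but the upfront decision is the same: rank every part by how much you'd miss it.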
Use MCP. Seriously. The protocol turns context management from an ad-hoc problem into a composable system. Every hour spent building custom tool integrations is an hour wasted.
Prompt engineering was about asking the right question. Context engineering is about building the right information environment. The shift is from craftsmanship to architecture — and the architects are just getting started.
