

“Set your heart upon your work, but never on its reward.”
Bhagavad Gita

Nicholas Carlini did something in early 2026 that most people thought was at least a year away: he set up 16 Claude agents, pointed them at the task of building a C compiler, and walked away. The agents divided the work, implemented a lexer, parser, type checker, code generator, and optimizer — coordinating across shared interfaces without human intervention.
The result compiled real C programs. Not toy examples. Real programs with structs, pointers, function calls, and control flow.
The qualifier "almost" in every headline about this experiment matters. The compiler had edge cases it couldn't handle, optimizations that produced incorrect code under specific conditions, and entire C language features that were stubbed out. But the fact that autonomous agents produced a working compiler — one that could bootstrap simple programs — is a proof point that changes the conversation about what's possible.
Having built multi-agent systems in production, I found the architecture choices more interesting than the output. How do you get 16 agents to coordinate on a complex software project? The answer reveals both the promise and the current limits of multi-agent AI.
Carlini's experiment used a pattern that's becoming standard in multi-agent systems: hierarchical task decomposition with shared interfaces.
The high-level structure:
┌─────────────┐
│  Architect  │   Defines specs + interfaces
└──────┬──────┘
       │
       ├───────────┬──────────┬───────────┐
       ▼           ▼          ▼           ▼
  ┌──────────┐ ┌────────┐ ┌────────┐ ┌─────────┐
  │  Lexer   │ │ Parser │ │  Type  │ │ CodeGen │   Module agents
  │  Agent   │ │ Agent  │ │ Check  │ │  Agent  │   implement
  └──────────┘ └────────┘ │ Agent  │ └─────────┘   independently
       │           │      └────────┘      │
       │           │           │          │
       └───────────┴─────┬────┴───────────┘
                         │
                 ┌───────▼────────┐
                 │  Integration   │   Assembles + tests
                 │     Agent      │
                 └───────┬────────┘
                         │
                 ┌───────▼────────┐
                 │ Review Agents  │   Cross-module checks
                 └────────────────┘
The critical insight: the agents didn't need to communicate with each other directly. They communicated through shared artifacts — interface definitions, type signatures, and test cases. The architect agent created the "contract" that all other agents built against.
This is the same pattern that makes microservice architectures work in human engineering teams. Define the interfaces, let teams work independently, integrate at the boundaries.
The experiment would have failed without strong interface definitions. Here's why:
When 16 agents work independently, they'll make different assumptions about data structures, error handling, naming conventions, and memory management. Without explicit interfaces, these assumptions diverge. Module A expects a linked list of tokens. Module B produces an array. Integration fails.
The architect agent's primary job was producing interface definitions like:
// Token interface: Lexer → Parser
typedef struct {
    TokenType type;
    const char *value;
    size_t length;
    SourceLocation location;
} Token;

typedef struct {
    Token *tokens;
    size_t count;
    size_t capacity;
} TokenStream;

// The parser agent codes against this interface
// without knowing how the lexer agent produces tokens.
// The contract is the type + documented invariants.
// AST interface: Parser → Type Checker → Code Generator
typedef struct ASTNode {
    NodeKind kind;
    Type *type;              // Set by type checker, NULL from parser
    SourceLocation loc;
    union {
        struct { BinaryOp op; struct ASTNode *left, *right; } binary;
        struct { UnaryOp op; struct ASTNode *operand; } unary;
        struct { const char *name; struct ASTNode **args; size_t argc; } call;
        struct { int64_t value; } integer_literal;
        // ... more variants
    };
} ASTNode;

These interfaces served as the coordination mechanism. Each module agent received the interface it consumed (input) and the interface it produced (output). As long as the contracts were honored, the modules composed.
The compiler worked, but the failures were instructive. They reveal where multi-agent coordination breaks down:
The hardest bugs involved behavior that spanned multiple modules. Memory management was the biggest offender — when does an AST node get freed? The parser allocates it, the type checker annotates it, the code generator reads it. Each agent made locally reasonable decisions about memory ownership that were globally inconsistent.
This is the distributed ownership problem: when no single agent has a complete view of a resource's lifecycle. It's the same problem human teams face with shared state, and it's just as hard for AI agents.
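One common cure for the distributed-ownership problem is to remove per-node ownership entirely: every phase allocates AST nodes from a single arena owned by the pipeline driver, and the driver frees the whole arena once after code generation. A minimal sketch of that idea, not the experiment's actual code:

```c
#include <stdlib.h>

/* A bump-pointer arena: every pipeline phase allocates from it,
 * nobody frees individual nodes, the driver frees everything once. */
typedef struct {
    char  *buf;
    size_t used;
    size_t cap;
} Arena;

static Arena arena_new(size_t cap) {
    Arena a = { malloc(cap), 0, cap };
    return a;
}

static void *arena_alloc(Arena *a, size_t n) {
    n = (n + 7) & ~(size_t)7;               /* keep 8-byte alignment */
    if (a->buf == NULL || a->used + n > a->cap) return NULL;
    void *p = a->buf + a->used;
    a->used += n;
    return p;
}

static void arena_free(Arena *a) {          /* one free for the whole AST */
    free(a->buf);
    a->buf = NULL;
    a->used = a->cap = 0;
}
```

With this scheme the parser, type checker, and code generator never have to agree on who frees what; the lifecycle question is answered once, in the interface, rather than sixteen times, in each agent's head.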
Despite detailed interfaces, agents still made implicit assumptions. The parser assumed the lexer would never produce consecutive whitespace tokens (it did, for preserving source locations). The code generator assumed all expressions had been type-checked (they hadn't, for certain compound literals). The type checker assumed the AST was acyclic (it was, except for a recursive typedef edge case).
Each assumption was reasonable in isolation. Together, they created subtle bugs that only manifested with specific input programs.
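The cheapest defense against implicit assumptions is to turn each one into an executable check that runs at the interface boundary. A hypothetical validator for a simplified Lexer → Parser token contract (types are cut down for illustration; the no-consecutive-whitespace rule is the one the parser wrongly assumed):

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { TOK_IDENT, TOK_NUMBER, TOK_WHITESPACE, TOK_EOF } TokenType;

typedef struct {
    TokenType type;
    size_t    offset;   /* byte offset into the source */
} Token;

typedef struct {
    Token *tokens;
    size_t count;
} TokenStream;

/* Check the documented invariants of the Lexer -> Parser contract.
 * Returns true iff the stream honors every rule the parser relies on. */
static bool tokenstream_valid(const TokenStream *s) {
    if (s->count == 0 || s->tokens[s->count - 1].type != TOK_EOF)
        return false;                         /* must end with EOF */
    for (size_t i = 1; i < s->count; i++) {
        if (s->tokens[i].offset < s->tokens[i - 1].offset)
            return false;                     /* offsets non-decreasing */
        if (s->tokens[i].type == TOK_WHITESPACE &&
            s->tokens[i - 1].type == TOK_WHITESPACE)
            return false;                     /* no consecutive whitespace */
    }
    return true;
}
```

Run a validator like this on every lexer output in the integration tests and a violated assumption fails loudly at the boundary, with a named rule, instead of surfacing three phases later as a mystery bug.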
When the lexer produced an error token for invalid input, the expected behavior was graceful degradation — report the error and continue parsing if possible. Instead, the error token propagated through the parser (which didn't handle it), into the type checker (which crashed), and the error message was meaningless because it referred to a type-checking phase when the actual problem was a lexical error.
This is the multi-agent equivalent of exception handling: when something goes wrong in one agent's domain, how does the system as a whole respond? The answer, without explicit error handling protocols, is "badly."
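One workable protocol is to make errors first-class values in every interface: each phase must either handle an error explicitly or pass it through untouched, so the diagnostic keeps pointing at the phase that actually failed. A toy sketch of what that could look like here (all types and names are illustrative, not from the experiment):

```c
#include <stddef.h>

typedef enum { TOK_NUMBER, TOK_ERROR, TOK_EOF } TokenType;
typedef enum { NODE_LITERAL, NODE_ERROR } NodeKind;

typedef struct {
    TokenType   type;
    const char *text;
    int         line;
} Token;

typedef struct {
    NodeKind    kind;
    int         line;       /* where the original problem was */
    const char *message;    /* the phase that found it says why */
} ASTNode;

/* The parser's contract: it never forwards a raw error token.
 * It converts it into an error node that later phases skip over. */
static ASTNode parse_primary(Token t) {
    ASTNode n;
    n.line = t.line;
    if (t.type == TOK_ERROR) {
        n.kind = NODE_ERROR;
        n.message = "lexical error";  /* blame the right phase */
    } else {
        n.kind = NODE_LITERAL;
        n.message = NULL;
    }
    return n;
}

/* The type checker's contract: error nodes pass through unchanged,
 * instead of crashing or re-labeling the failure as a type error. */
static ASTNode check(ASTNode n) {
    if (n.kind == NODE_ERROR) return n;
    /* ... real type checking for the other node kinds ... */
    return n;
}
```

The key property is that the pass-through rule is part of every interface, so no single agent can "forget" about errors without visibly breaking its contract.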
The compiler experiment is impressive, but it doesn't mean a multi-agent setup is always the better choice. Based on what I've built and what the experiment reveals, multi-agent wins when:
The task is naturally decomposable into independent modules. The compiler's pipeline structure (lex → parse → type check → generate) maps perfectly to independent agents. Each phase has clear inputs and outputs.
The total task exceeds a single agent's context window. A complete C compiler is too much code for one agent to hold in context. By splitting into modules, each agent only needs to understand its piece plus the interfaces.
Parallelism matters. The module agents worked simultaneously. If the lexer, parser, and code generator were implemented sequentially by one agent, the wall-clock time would have been roughly 4x longer.
Diverse expertise is needed. An agent specialized in parsing (with relevant examples in its context) will produce better parser code than a generalist agent that also needs to think about code generation and optimization.
The flip side: stick with a single agent when coordination costs dominate. The task requires global coherence. If every decision depends on every other decision, splitting across agents introduces coordination overhead that exceeds the benefit. Writing a cohesive essay is better with one agent. Writing a compiler is better with sixteen.
The task is small enough to fit in context. If one agent can hold the entire problem — the specification, relevant examples, and working code — adding more agents just adds latency and coordination bugs.
Error handling is critical. Multi-agent error propagation is an unsolved problem. If a failure in one component needs sophisticated recovery that depends on the state of other components, a single agent with full visibility will handle it better.
The interfaces aren't clear upfront. If you can't define clean boundaries between modules before starting, multi-agent coordination becomes a moving target. The compiler worked because compiler architecture is well-understood. A novel system with unclear boundaries would struggle.
If you're building multi-agent systems, here's what I've learned works:
The architect agent pattern from the compiler experiment is the right approach. Before any implementation agent starts, you need three artifacts: interface definitions for every module boundary, a written spec for each module, and an integration test suite that exercises the boundaries.
This is upfront investment that pays off exponentially. Every hour spent on interface design saves days of integration debugging.
Agents need a place to read and write shared artifacts (code files, test results, documentation). A simple file system works. A git repository is better — you get versioning and conflict detection for free.
const agentConfig = {
  architect: {
    writes: ["specs/", "interfaces/", "tests/integration/"],
    reads: [],
  },
  lexerAgent: {
    writes: ["src/lexer/", "tests/lexer/"],
    reads: ["specs/lexer.md", "interfaces/token.h"],
  },
  parserAgent: {
    writes: ["src/parser/", "tests/parser/"],
    reads: ["specs/parser.md", "interfaces/token.h", "interfaces/ast.h"],
  },
  integrationAgent: {
    writes: ["tests/integration/results/"],
    reads: ["src/", "tests/", "interfaces/"],
  },
};

The compiler experiment wasn't a one-shot pipeline. When integration tests failed, the integration agent reported failures back to the responsible module agents, who fixed their code and resubmitted. This feedback loop ran multiple times before the compiler was complete.
The pattern: implement → integrate → test → report → fix → repeat. Without this loop, you get a waterfall that fails at the end with no recovery path.
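Stripped of the agents themselves, the loop is a small driver: run the integration tests, map each failure to the owning module, dispatch a fix, and repeat until everything is green or a retry budget runs out. A toy simulation with stubbed test results (nothing here is from the experiment's code):

```c
#include <stdbool.h>

#define NUM_MODULES 4
#define MAX_ROUNDS  10

/* Stubbed state: which modules currently fail integration.
 * In a real system this would come from running the test suite. */
static bool failing[NUM_MODULES] = { false, true, false, true };

static int run_integration_tests(void) {
    for (int m = 0; m < NUM_MODULES; m++)
        if (failing[m]) return m;   /* report first failing module */
    return -1;                      /* all green */
}

static void dispatch_fix(int module) {
    /* Stand-in for "send the failure report to the module agent
     * and wait for a resubmission". */
    failing[module] = false;
}

/* implement -> integrate -> test -> report -> fix -> repeat */
static int feedback_loop(void) {
    for (int round = 0; round < MAX_ROUNDS; round++) {
        int m = run_integration_tests();
        if (m < 0) return round;    /* rounds needed to converge */
        dispatch_fix(m);
    }
    return -1;                      /* retry budget exhausted */
}
```

The retry budget matters: without it, a module agent that keeps resubmitting broken code turns the loop into an infinite one.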
The most useful monitoring in a multi-agent system happens at the interfaces between agents. Track contract violations caught at each boundary, integration test pass rates per interface, and how many fix cycles each module needs before integration succeeds.
The 16-agent compiler isn't just a cool experiment. It's a preview of how software will be built:
The architect role becomes critical. Someone — human or AI — needs to decompose systems into modules, define interfaces, and design coordination protocols. This is the highest-leverage activity in a multi-agent world.
Interface design is the new implementation skill. When agents handle implementation, the quality of the system depends on the quality of the interfaces. A well-defined interface produces good code. A vague interface produces bugs.
Testing becomes the integration contract. In the compiler experiment, the integration test suite was the source of truth for whether the system worked. Agents that passed their unit tests but failed integration were sent back to fix their code. The test suite, not the specification, was the ultimate arbiter.
The compiler took "almost" to a place that was unthinkable a year ago. The gap between "almost" and "fully" is where the interesting engineering problems live — and they're solvable. Not through better models, but through better coordination architecture.
That's the real lesson: multi-agent AI is a systems engineering problem, not a model capability problem. The models are ready. The architectures are catching up.
