

“Set your heart upon your work, but never on its reward.”
Bhagavad Gita

Nicholas Carlini did something in early 2026 that most people thought was at least a year away: he set up 16 Claude agents, pointed them at the task of building a C compiler, and walked away. The agents divided the work, implemented a lexer, parser, type checker, code generator, and optimizer — coordinating across shared interfaces without human intervention.
The result compiled real C programs. Not toy examples. Real programs with structs, pointers, function calls, and control flow.
The qualifier "almost" in every headline about this experiment matters. The compiler had edge cases it couldn't handle, optimizations that produced incorrect code under specific conditions, and entire C language features that were stubbed out. But the fact that autonomous agents produced a working compiler — one that could bootstrap simple programs — is a proof point that changes the conversation about what's possible.
Having built multi-agent systems in production, I found the architecture choices more interesting than the output. How do you get 16 agents to coordinate on a complex software project? The answer reveals both the promise and the current limits of multi-agent AI.
Carlini's experiment used a pattern that's becoming standard in multi-agent systems: hierarchical task decomposition with shared interfaces.
The high-level structure:
┌─────────────┐
│  Architect  │   Defines specs + interfaces
└──────┬──────┘
       │
       ├───────────┬──────────┬───────────┐
       ▼           ▼          ▼           ▼
  ┌──────────┐ ┌────────┐ ┌────────┐ ┌─────────┐
  │  Lexer   │ │ Parser │ │  Type  │ │ CodeGen │   Module agents
  │  Agent   │ │ Agent  │ │ Check  │ │  Agent  │   implement
  └──────────┘ └────────┘ │ Agent  │ └─────────┘   independently
       │           │      └────────┘      │
       │           │           │          │
       └───────────┴─────┬────┴───────────┘
                         │
                 ┌───────▼────────┐
                 │  Integration   │   Assembles + tests
                 │     Agent      │
                 └───────┬────────┘
                         │
                 ┌───────▼────────┐
                 │ Review Agents  │   Cross-module checks
                 └────────────────┘
The critical insight: the agents didn't need to communicate with each other directly. They communicated through shared artifacts — interface definitions, type signatures, and test cases. The architect agent created the "contract" that all other agents built against.
This is the same pattern that makes microservice architectures work in human engineering teams. Define the interfaces, let teams work independently, integrate at the boundaries.
The experiment would have failed without strong interface definitions. Here's why:
When 16 agents work independently, they'll make different assumptions about data structures, error handling, naming conventions, and memory management. Without explicit interfaces, these assumptions diverge. Module A expects a linked list of tokens. Module B produces an array. Integration fails.
The architect agent's primary job was producing interface definitions like:
// Token interface: Lexer → Parser
typedef struct {
    TokenType type;
    const char *value;
    size_t length;
    SourceLocation location;
} Token;

typedef struct {
    Token *tokens;
    size_t count;
    size_t capacity;
} TokenStream;

// The parser agent codes against this interface
// without knowing how the lexer agent produces tokens.
// The contract is the type + documented invariants.
// AST interface: Parser → Type Checker → Code Generator
typedef struct ASTNode {
    NodeKind kind;
    Type *type;              // Set by type checker, NULL from parser
    SourceLocation loc;
    union {
        struct { BinaryOp op; struct ASTNode *left, *right; } binary;
        struct { UnaryOp op; struct ASTNode *operand; } unary;
        struct { const char *name; struct ASTNode **args; size_t argc; } call;
        struct { int64_t value; } integer_literal;
        // ... more variants
    };
} ASTNode;

These interfaces served as the coordination mechanism. Each module agent received the interface it consumed (input) and the interface it produced (output). As long as the contracts were honored, the modules composed.
The compiler worked, but the failures were instructive. They reveal where multi-agent coordination breaks down:
The hardest bugs involved behavior that spanned multiple modules. Memory management was the biggest offender — when does an AST node get freed? The parser allocates it, the type checker annotates it, the code generator reads it. Each agent made locally reasonable decisions about memory ownership that were globally inconsistent.
This is the distributed ownership problem: when no single agent has a complete view of a resource's lifecycle. It's the same problem human teams face with shared state, and it's just as hard for AI agents.
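One common cure for the distributed-ownership problem is to remove per-node ownership entirely: every phase allocates AST nodes from a single arena owned by the pipeline driver, and the driver frees the whole arena once after code generation. A minimal sketch of that idea, not the experiment's actual code:

```c
#include <stdlib.h>

/* A bump-pointer arena: every pipeline phase allocates from it,
 * nobody frees individual nodes, the driver frees everything once. */
typedef struct {
    char  *buf;
    size_t used;
    size_t cap;
} Arena;

static Arena arena_new(size_t cap) {
    Arena a = { malloc(cap), 0, cap };
    return a;
}

static void *arena_alloc(Arena *a, size_t n) {
    n = (n + 7) & ~(size_t)7;               /* keep 8-byte alignment */
    if (a->buf == NULL || a->used + n > a->cap) return NULL;
    void *p = a->buf + a->used;
    a->used += n;
    return p;
}

static void arena_free(Arena *a) {          /* one free for the whole AST */
    free(a->buf);
    a->buf = NULL;
    a->used = a->cap = 0;
}
```

With this scheme the parser, type checker, and code generator never have to agree on who frees what; the lifecycle question is answered once, in the interface, rather than sixteen times, in each agent's head.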
Despite detailed interfaces, agents still made implicit assumptions. The parser assumed the lexer would never produce consecutive whitespace tokens (it did, for preserving source locations). The code generator assumed all expressions had been type-checked (they hadn't, for certain compound literals). The type checker assumed the AST was acyclic (it was, except for a recursive typedef edge case).
Each assumption was reasonable in isolation. Together, they created subtle bugs that only manifested with specific input programs.
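The cheapest defense against implicit assumptions is to turn each one into an executable check that runs at the interface boundary. A hypothetical validator for a simplified Lexer → Parser token contract (types are cut down for illustration; the no-consecutive-whitespace rule is the one the parser wrongly assumed):

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { TOK_IDENT, TOK_NUMBER, TOK_WHITESPACE, TOK_EOF } TokenType;

typedef struct {
    TokenType type;
    size_t    offset;   /* byte offset into the source */
} Token;

typedef struct {
    Token *tokens;
    size_t count;
} TokenStream;

/* Check the documented invariants of the Lexer -> Parser contract.
 * Returns true iff the stream honors every rule the parser relies on. */
static bool tokenstream_valid(const TokenStream *s) {
    if (s->count == 0 || s->tokens[s->count - 1].type != TOK_EOF)
        return false;                         /* must end with EOF */
    for (size_t i = 1; i < s->count; i++) {
        if (s->tokens[i].offset < s->tokens[i - 1].offset)
            return false;                     /* offsets non-decreasing */
        if (s->tokens[i].type == TOK_WHITESPACE &&
            s->tokens[i - 1].type == TOK_WHITESPACE)
            return false;                     /* no consecutive whitespace */
    }
    return true;
}
```

Run a validator like this on every lexer output in the integration tests and a violated assumption fails loudly at the boundary, with a named rule, instead of surfacing three phases later as a mystery bug.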
When the lexer produced an error token for invalid input, the expected behavior was graceful degradation — report the error and continue parsing if possible. Instead, the error token propagated through the parser (which didn't handle it), into the type checker (which crashed), and the error message was meaningless because it referred to a type-checking phase when the actual problem was a lexical error.
This is the multi-agent equivalent of exception handling: when something goes wrong in one agent's domain, how does the system as a whole respond? The answer, without explicit error handling protocols, is "badly."
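One workable protocol is to make errors first-class values in every interface: each phase must either handle an error explicitly or pass it through untouched, so the diagnostic keeps pointing at the phase that actually failed. A toy sketch of what that could look like here (all types and names are illustrative, not from the experiment):

```c
#include <stddef.h>

typedef enum { TOK_NUMBER, TOK_ERROR, TOK_EOF } TokenType;
typedef enum { NODE_LITERAL, NODE_ERROR } NodeKind;

typedef struct {
    TokenType   type;
    const char *text;
    int         line;
} Token;

typedef struct {
    NodeKind    kind;
    int         line;       /* where the original problem was */
    const char *message;    /* the phase that found it says why */
} ASTNode;

/* The parser's contract: it never forwards a raw error token.
 * It converts it into an error node that later phases skip over. */
static ASTNode parse_primary(Token t) {
    ASTNode n;
    n.line = t.line;
    if (t.type == TOK_ERROR) {
        n.kind = NODE_ERROR;
        n.message = "lexical error";  /* blame the right phase */
    } else {
        n.kind = NODE_LITERAL;
        n.message = NULL;
    }
    return n;
}

/* The type checker's contract: error nodes pass through unchanged,
 * instead of crashing or re-labeling the failure as a type error. */
static ASTNode check(ASTNode n) {
    if (n.kind == NODE_ERROR) return n;
    /* ... real type checking for the other node kinds ... */
    return n;
}
```

The key property is that the pass-through rule is part of every interface, so no single agent can "forget" about errors without visibly breaking its contract.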
The compiler experiment is impressive, but it doesn't mean a multi-agent setup is always the better choice. Based on what I've built and what the experiment reveals, multi-agent wins when:
The task is naturally decomposable into independent modules. The compiler's pipeline structure (lex → parse → type check → generate) maps perfectly to independent agents. Each phase has clear inputs and outputs.
The total task exceeds a single agent's context window. A complete C compiler is too much code for one agent to hold in context. By splitting into modules, each agent only needs to understand its piece plus the interfaces.
Parallelism matters. The module agents worked simultaneously. If the lexer, parser, and code generator were implemented sequentially by one agent, the wall-clock time would have been roughly 4x longer.
Diverse expertise is needed. An agent specialized in parsing (with relevant examples in its context) will produce better parser code than a generalist agent that also needs to think about code generation and optimization.
The flip side: stick with a single agent when coordination costs dominate. The task requires global coherence. If every decision depends on every other decision, splitting across agents introduces coordination overhead that exceeds the benefit. Writing a cohesive essay is better with one agent. Writing a compiler is better with sixteen.
The task is small enough to fit in context. If one agent can hold the entire problem — the specification, relevant examples, and working code — adding more agents just adds latency and coordination bugs.
Error handling is critical. Multi-agent error propagation is an unsolved problem. If a failure in one component needs sophisticated recovery that depends on the state of other components, a single agent with full visibility will handle it better.
The interfaces aren't clear upfront. If you can't define clean boundaries between modules before starting, multi-agent coordination becomes a moving target. The compiler worked because compiler architecture is well-understood. A novel system with unclear boundaries would struggle.
If you're building multi-agent systems, here's what I've learned works:
The architect agent pattern from the compiler experiment is the right approach. Before any implementation agent starts, you need three artifacts: interface definitions for every module boundary, a written spec for each module, and an integration test suite that exercises the boundaries.
This is upfront investment that pays off exponentially. Every hour spent on interface design saves days of integration debugging.
Agents need a place to read and write shared artifacts (code files, test results, documentation). A simple file system works. A git repository is better — you get versioning and conflict detection for free.
const agentConfig = {
  architect: {
    writes: ["specs/", "interfaces/", "tests/integration/"],
    reads: [],
  },
  lexerAgent: {
    writes: ["src/lexer/", "tests/lexer/"],
    reads: ["specs/lexer.md", "interfaces/token.h"],
  },
  parserAgent: {
    writes: ["src/parser/", "tests/parser/"],
    reads: ["specs/parser.md", "interfaces/token.h", "interfaces/ast.h"],
  },
  integrationAgent: {
    writes: ["tests/integration/results/"],
    reads: ["src/", "tests/", "interfaces/"],
  },
};

The compiler experiment wasn't a one-shot pipeline. When integration tests failed, the integration agent reported failures back to the responsible module agents, who fixed their code and resubmitted. This feedback loop ran multiple times before the compiler was complete.
The pattern: implement → integrate → test → report → fix → repeat. Without this loop, you get a waterfall that fails at the end with no recovery path.
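Stripped of the agents themselves, the loop is a small driver: run the integration tests, map each failure to the owning module, dispatch a fix, and repeat until everything is green or a retry budget runs out. A toy simulation with stubbed test results (nothing here is from the experiment's code):

```c
#include <stdbool.h>

#define NUM_MODULES 4
#define MAX_ROUNDS  10

/* Stubbed state: which modules currently fail integration.
 * In a real system this would come from running the test suite. */
static bool failing[NUM_MODULES] = { false, true, false, true };

static int run_integration_tests(void) {
    for (int m = 0; m < NUM_MODULES; m++)
        if (failing[m]) return m;   /* report first failing module */
    return -1;                      /* all green */
}

static void dispatch_fix(int module) {
    /* Stand-in for "send the failure report to the module agent
     * and wait for a resubmission". */
    failing[module] = false;
}

/* implement -> integrate -> test -> report -> fix -> repeat */
static int feedback_loop(void) {
    for (int round = 0; round < MAX_ROUNDS; round++) {
        int m = run_integration_tests();
        if (m < 0) return round;    /* rounds needed to converge */
        dispatch_fix(m);
    }
    return -1;                      /* retry budget exhausted */
}
```

The retry budget matters: without it, a module agent that keeps resubmitting broken code turns the loop into an infinite one.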
The most useful monitoring in a multi-agent system happens at the interfaces between agents. Track contract violations caught at each boundary, integration test pass rates per interface, and how many fix cycles each module needs before integration succeeds.
The 16-agent compiler isn't just a cool experiment. It's a preview of how software will be built:
The architect role becomes critical. Someone — human or AI — needs to decompose systems into modules, define interfaces, and design coordination protocols. This is the highest-leverage activity in a multi-agent world.
Interface design is the new implementation skill. When agents handle implementation, the quality of the system depends on the quality of the interfaces. A well-defined interface produces good code. A vague interface produces bugs.
Testing becomes the integration contract. In the compiler experiment, the integration test suite was the source of truth for whether the system worked. Agents that passed their unit tests but failed integration were sent back to fix their code. The test suite, not the specification, was the ultimate arbiter.
The compiler took "almost" to a place that was unthinkable a year ago. The gap between "almost" and "fully" is where the interesting engineering problems live — and they're solvable. Not through better models, but through better coordination architecture.
That's the real lesson: multi-agent AI is a systems engineering problem, not a model capability problem. The models are ready. The architectures are catching up.
