The Ensemble Age
Why the future of AI agents is competition, not coordination
In my previous piece on agent swarms, I described how AI agents crossed a threshold in early 2026. Cursor ran 2,000 agents to build a browser. Carlini’s 16 parallel Claudes produced a working C compiler. The leverage had shifted from human expertise to massive parallel iteration with brute-force verification.
But I left something out. Something I keep seeing in practice that none of these projects fully captured.
The swarm pieces all describe monocultures. Thousands of GPT-5.2 instances. Sixteen Claude Opus 4.6 agents. Dozens of Claude Code sessions. One model, many copies. And I think this tells us something important about where we are and, more to the point, where we are not yet.
The two-model world
Talk to any serious AI-augmented software engineer right now and they will describe the same pattern. They bounce between Claude Opus 4.6 Ultrathink and Codex 5.4 Extra High. Neither is simply “better”; they are different in ways that turn out to matter a lot.
Nathan Lambert captured it well in his recent review of GPT-5.4. He described Claude as an “intent-understanding” partner and Codex as a precise executor. He uses Claude for “things I need more of an opinion on” and Codex “to churn through an overwhelmingly specific TODO list.” One of his readers described a staged workflow: use Claude Opus to “align on intent,” spin up a Codex sandbox, have Opus reassess progress, then let Codex run. Lambert pays for subscriptions to both, at $100 and $200 a month respectively, and switches between them based on the task.
This is not a preference thing. It is a practice with real structure. Engineers use Claude to spec, Codex to graft, Claude to review. They hand problems one model can’t crack to the other. They treat the two systems as different minds rather than interchangeable compute.
The open question: is this diversity meaningful, or just surface-level variation between two very large neural networks trained on mostly the same internet?
A datacenter of geniuses or clones?
Imagine you could clone Einstein. You put a hundred Einsteins in a room. Would they produce a hundred times the physics?
Almost certainly not. The real Einstein needed Grossmann for differential geometry. He needed Bose for quantum statistics. He needed his arguments with Bohr to sharpen his thinking, and he needed Hilbert breathing down his neck on general relativity to force the final sprint. The breakthroughs came from collision. Research on scientific teams confirms this: team diversity has a measurable positive effect on output, and the biggest advances tend to come from interdisciplinary contact rather than deeper specialisation in a single field.
What we have built so far with agent swarms is, essentially, a datacenter of very impressive clones. Carlini’s 16 Claudes ran the same weights, the same training, the same reasoning patterns. When they all converged on the same Linux kernel bug and started overwriting each other’s fixes, that was a diversity failure dressed up as a coordination problem. Identical minds hit the same wall in the same way.
The power of ensemble intelligence is in productive difference. More of the same is just more of the same.
Ensemble intelligence is not new
Machine learning figured this out decades ago. Combining diverse models reduces error in ways no single model can match. On Kaggle, winning solutions are almost never a single model. They are ensembles, and the key ingredient is that the component models make different kinds of mistakes. Different architectures, different data subsets, sometimes entirely different learning algorithms. You don’t need each model to be the best individually. You need them to fail in different places.
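To make “fail in different places” concrete, here is a toy simulation, my own sketch rather than anything from a Kaggle writeup: three classifiers with the same 20% error rate, once with independent failures and once as exact clones. Majority voting only helps in the first case.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, error_rate = 100_000, 0.20

# Diverse ensemble: three models, each wrong 20% of the time,
# but on independently chosen examples.
diverse_wrong = rng.random((3, n_examples)) < error_rate      # True = wrong
diverse_vote_wrong = diverse_wrong.sum(axis=0) >= 2           # majority is wrong

# Clone ensemble: three copies of one model share the exact same failures,
# so a majority vote changes nothing.
clone_wrong = rng.random(n_examples) < error_rate

print(f"single model error:          {error_rate:.3f}")
print(f"diverse majority-vote error: {diverse_vote_wrong.mean():.3f}")  # ~0.104
print(f"clone majority-vote error:   {clone_wrong.mean():.3f}")         # ~0.200
```

The arithmetic is the whole point: with independent errors, the majority is wrong with probability 3 × 0.2² × 0.8 + 0.2³ ≈ 0.104, while three correlated copies stay at 0.2.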
Recent research on multi-agent debate with LLMs makes the same point. A 2024 study had a diverse set of medium-capacity models (Gemini-Pro, Mixtral 8x7B, and PaLM 2-M) debate math problems over four rounds. They hit 91% accuracy on GSM-8K. Three copies of Gemini-Pro debating each other? 82%.
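The debate loop itself is simple to sketch. What follows is my own schematic, not the study’s code: ask() stands in for whatever LLM API client you use, and the prompts and aggregation step are simplified.

```python
def ask(model: str, prompt: str) -> str:
    """Placeholder for a call to some LLM API; not a real client."""
    raise NotImplementedError

def debate(models: list[str], question: str, rounds: int = 4) -> list[str]:
    # Round 1: each model answers the question independently.
    answers = [ask(m, f"Solve step by step:\n{question}") for m in models]
    # Later rounds: each model sees the others' answers and may revise its own.
    for _ in range(rounds - 1):
        answers = [
            ask(
                m,
                f"Question:\n{question}\n\nOther agents answered:\n"
                + "\n---\n".join(a for j, a in enumerate(answers) if j != i)
                + "\n\nReconsider and give your final answer.",
            )
            for i, m in enumerate(models)
        ]
    return answers  # aggregate with a majority vote or a judge model

# e.g. debate(["gemini-pro", "mixtral-8x7b", "palm-2-m"], "A train leaves at ...")
```

The diversity benefit presumably lives in the revision step: a model gains nothing from being shown the same reasoning it would have produced itself.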
This is also how functional human organisations work, obviously. A hospital works because surgeons, anaesthesiologists, nurses, and radiologists bring genuinely different ways of thinking to the same patient. A good executive team is not five people who see things the same way. The value is in the complementary blind spots, the fact that someone else on the team catches the thing you walked right past.
We are entering a world of ensemble intelligence. The models we orchestrate will need to be as different from each other as the colleagues we rely on.
But are the models really different?
There is a problem with everything I just said, and it is a serious one.
A NeurIPS 2025 paper by Jiang et al., “Artificial Hivemind,” studied what happens when you ask 70+ language models the same open-ended questions. Not coding tasks with test suites, but questions like “write a metaphor about time” or “generate a motto for a social media page.” The kind of thing where you would expect diverse minds to produce diverse outputs.
They don’t. When 25 different models each generated 50 responses to the prompt “write a metaphor about time,” the 1,250 responses clustered into just two groups: “time is a river” and “time is a weaver.” Twenty-five models. Different families: GPT, Claude, Llama, Qwen, Mistral, DeepSeek, Gemini. Two metaphors. The researchers measured pairwise similarity between responses from different models and found it ranged from 71% to 82%. Some cross-family pairs were more similar to each other than responses within a single model. DeepSeek-V3 and GPT-4o hit 82% similarity. Different companies, different architectures, different training pipelines, but near-identical outputs.
The paper calls this “inter-model homogeneity,” and it exists alongside “intra-model repetition” (a single model producing near-identical outputs across 50 samples, even at high temperature). The combined effect is what they call the “Artificial Hivemind”: models converging on the same ideas with minor variations in phrasing, as if they are all drawing from the same well. Which, in a sense, they are. Overlapping pretraining data, similar alignment processes, RLHF reward models that have been trained to prefer the same style of response. The paper warns directly that “model ensembles may not yield true diversity when their constituents share overlapping alignment and training priors.”
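This kind of measurement is easy to approximate yourself. The sketch below is my illustration rather than the paper’s methodology: it scores cross-model similarity as the average cosine similarity between sentence embeddings of the two response sets, which may differ from the metric Jiang et al. actually report.

```python
# Rough sketch: inter-model homogeneity as average embedding similarity.
# Assumes the sentence-transformers package; the Hivemind paper's own metric
# may be defined differently.
from itertools import combinations
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def mean_cross_similarity(responses_a: list[str], responses_b: list[str]) -> float:
    """Average cosine similarity between two models' sets of responses."""
    a = encoder.encode(responses_a, normalize_embeddings=True)
    b = encoder.encode(responses_b, normalize_embeddings=True)
    return float((a @ b.T).mean())   # normalised embeddings -> dot product is cosine

# responses = {"gpt-4o": [...50 metaphors...], "deepseek-v3": [...], ...}
# for m1, m2 in combinations(responses, 2):
#     print(m1, m2, round(mean_cross_similarity(responses[m1], responses[m2]), 2))
```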
This puts the Kaggle ensemble analogy under real pressure. In machine learning competitions, diversity works because you can train models on different subsets of data, use different algorithms, apply different regularisation. The diversity is engineered into the system. With frontier LLMs, the diversity may be largely cosmetic. Different brand names, different API endpoints, similar minds underneath.
So, when engineers report that bouncing between Claude and Codex produces better results, what are they responding to? There are at least three possible sources of the perceived diversity, and they have very different implications.
The first is genuine cognitive diversity in the model weights. Claude and Codex were trained by different teams with different priorities, different data mixes, different alignment methods. This could produce real differences in how they decompose problems, which solution paths they explore, what they are willing to attempt. If this is the main driver, the ensemble thesis holds up well. But the Hivemind paper suggests this kind of diversity is thinner than it looks, at least for open-ended generation. Whether the same holds for code reasoning (where there are concrete right and wrong answers, and the solution space is more structured) is an open question nobody has rigorously tested.
The second is the tooling and harness. Claude Code and Codex are not just different models; they are different products. They have different system prompts, different tool-use architectures, different approaches to context management, different verification loops. When an engineer says “Claude specs better and Codex grinds better,” they may be describing differences in how the products are configured more than differences in the underlying model cognition. If this is the main driver, the diversity is real, but it lives in the scaffolding, not the model. And you could potentially get the same effect by wrapping the same model in two different agent harnesses with different instructions and tools.
The third is the verification and feedback environment. An engineer who uses Claude for specification and Codex for implementation is not just using two models. They are imposing a workflow structure, a form of human-designed decomposition, that forces the models into complementary roles. The diversity here comes from the process design, not the models themselves. It would work with any two sufficiently capable models.
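One way to see how much could live in the scaffolding rather than the weights: wrap the same model in two different harnesses and impose the staged workflow on top. Everything in this sketch is hypothetical, the model name, the prompts, the tool lists; it exists only to make the second and third sources concrete.

```python
from dataclasses import dataclass, field

@dataclass
class Harness:
    """A 'product' wrapper: one model plus a system prompt plus allowed tools."""
    model: str
    system_prompt: str
    tools: list[str] = field(default_factory=list)

    def run(self, task: str) -> str:
        # Placeholder: call self.model with this harness's prompt and tools.
        raise NotImplementedError

# Same weights underneath, two very different "products".
spec_harness = Harness(
    model="frontier-model-x",
    system_prompt="Interrogate the user's intent, surface ambiguities, "
                  "and produce a precise implementation plan.",
    tools=["read_file", "search_docs"],
)
impl_harness = Harness(
    model="frontier-model-x",
    system_prompt="Execute the plan exactly. Make small changes and "
                  "run the tests after every one.",
    tools=["read_file", "edit_file", "run_tests", "shell"],
)

def staged_workflow(problem: str) -> str:
    plan = spec_harness.run(problem)        # "align on intent"
    result = impl_harness.run(plan)         # grind through the TODO list
    return spec_harness.run(                # reassess against the original plan
        f"Review this result against the plan.\nPlan:\n{plan}\nResult:\n{result}"
    )
```

If bouncing between Claude and Codex still beats this kind of single-model setup, the diversity is in the weights; if it doesn’t, most of it was scaffolding and process all along.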
I suspect the honest answer is that all three contribute, but we don’t know the proportions. And that matters enormously for the ensemble thesis. If most of the value comes from genuine model diversity, then the future is multi-vendor orchestration, and we need many different frontier labs pushing in different directions. If most of it comes from tooling and harness design, then the important work is in agent architecture and prompt engineering, not model diversity per se. If it’s mostly workflow structure, then the insight is about task decomposition, and the specific models are somewhat interchangeable.
The Hivemind paper’s data pushes toward a pessimistic reading on the first source. But it tested open-ended creative tasks, not the kind of structured problem solving where engineers report the biggest benefits from model switching. The gap between “write a metaphor about time” and “debug why this compiler segfaults on ARM” is enormous, and the degree of real cognitive diversity between models might be very different in these two settings. This needs research. If the frontier is jagged, as Mollick argues, we need to understand what percentage of the jags are shared across models versus genuinely different. That’s the question that determines whether multi-model ensembles are a deep architectural insight or a temporary workaround.
The future is competition, not delegation
Here is where I think the current conversation about agent orchestration goes wrong.
Most of the frameworks assume a coordinator. A planner hands out work, specialists execute, results flow back up. It is a management hierarchy applied to AI, and it works fine when the hard part is decomposition. Split a codebase into modules, assign each module to an agent, merge the results. Cursor’s planner-worker-judge pipeline. Gas Town’s Mayor-Witness-Polecat system. These are delegation machines.
But delegation assumes you know which agent should do the work. It assumes the coordinator can evaluate task difficulty and match it to capability. For tasks at the frontier, the tasks nobody knows the best approach to, delegation fails. You cannot assign a breakthrough.
The better model is competition. And it already exists. This is exactly how Kaggle ensembles work at competition time. You don’t assign one model to solve the problem. You let many models solve it independently, then aggregate. You don’t necessarily know which model had the insight, and asking might not even be a coherent question. The winning prediction often emerges from a weighted average where no individual model’s contribution is separable.
I think agentic orchestration is heading this way. You would still route trivially solvable stuff (web search, simple lookups) to known cheaper agents. But for the hard problems, you’d want a competitive arena. Multiple frontier models attacking the same problem independently. The system evaluates outputs, not plans. The resolution comes from whichever reasoning chain happens to find the crack, and we might not be able to trace which model made the decisive move.
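A minimal version of that arena is easy to sketch. Everything here is illustrative: solve() stands for whatever agent harness you run end to end, and score() is whatever verification you have available, test suites, a judge model, a human.

```python
from concurrent.futures import ThreadPoolExecutor

def solve(model: str, problem: str) -> str:
    """Placeholder: run one agent end to end and return its candidate solution."""
    raise NotImplementedError

def score(problem: str, candidate: str) -> float:
    """Placeholder: tests passed, judge-model rating, human review, and so on."""
    raise NotImplementedError

def arena(models: list[str], problem: str) -> str:
    """Every model attacks the same problem independently; outputs are judged, not plans."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        candidates = list(pool.map(lambda m: solve(m, problem), models))
    best = max(zip(models, candidates), key=lambda mc: score(problem, mc[1]))
    return best[1]   # nobody assigned the breakthrough; the evaluation picked it
```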
The engineer workflows I described earlier are a manual version of this. When you hand a problem Claude can’t solve to Codex, or the reverse, you are running a human-mediated competition. You are the ensemble function. The question is what happens when you automate yourself out of that loop.
How humans handle hard problems
The competitive-ensemble framing may sound odd if you think of AI agents as workers who need management. But it describes how humans handle hard problems.
In any team that works well, you don’t assign the breakthrough to a specific person. Ideas get thrown around, challenged, recombined. Someone says something half-baked that triggers a connection in someone else’s head. The solution emerges from the interaction. Peer review works this way. So do markets, and so do adversarial legal systems. We converge on truth and quality through structured disagreement, not by having a manager tell each participant what to conclude.
Ethan Mollick’s work on the “jagged frontier” makes this even more pointed. AI excels at some tasks and fails at seemingly simpler ones, and every model has a different jagged profile. Claude’s jaggedness looks nothing like Codex’s. When you run both on the same problem, you cover more of the solution space than either can alone. And this stays true even as models improve, because new capabilities come with new blind spots. The jaggedness shifts; it doesn’t smooth out.
The validation problem
Right now, the ensemble approach works because engineers can validate outputs. You can read the code. You can run the tests. You can check whether the compiler boots Linux. The reason it is safe to bounce between Claude and Codex today is that you can tell when either one is wrong.
But Mollick’s work on bottlenecks points at something uncomfortable. AI keeps pushing into domains where verification gets harder. Code you can compile and test, so the feedback loop is tight. A legal analysis, an architectural plan, a strategic recommendation? The feedback loop might be months long. You might not know it was wrong until the consequences show up.
If AI keeps progressing, we will all increasingly work in hard-to-verify territory with frontier models. Code was the easy case. Code had compilers and test suites and CI pipelines. What is the equivalent for strategy? For research synthesis? For medical reasoning? I don’t think anyone has good answers yet.
We will need new ways to validate, and the closest analogy is how we already handle fallible humans in high-stakes roles. We don’t verify a surgeon’s work by redoing the surgery. We use credentialing, institutional oversight, peer review, outcome tracking, malpractice liability. We build trust and accountability structures around fallible people because direct verification is often impossible.
The same kind of infrastructure will need to exist for AI ensembles. Not just “did the test pass” but “do we trust this system’s track record on this class of problem?” Agreement metrics across models. Adversarial red-teaming as ongoing practice rather than a one-off before launch. Audit trails that let you reconstruct which model said what.
And this is where the competitive model helps. If three different frontier models independently reach the same conclusion through different reasoning chains, that is meaningfully stronger evidence than one model reaching it three times with the same biases. Disagreement between models becomes a signal worth paying attention to.
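The simplest version of that signal is plain bookkeeping. The sketch below is deliberately crude, it matches conclusions by exact string equality where real answers would need semantic matching, but it shows the shape: consensus across independent models raises confidence, and dissent gets flagged rather than averaged away.

```python
from collections import Counter

def agreement_signal(conclusions: dict[str, str]) -> tuple[str | None, float]:
    """conclusions maps model name -> the conclusion it reached independently.
    Returns (consensus answer, or None if no majority; agreement fraction)."""
    counts = Counter(conclusions.values())
    answer, votes = counts.most_common(1)[0]
    support = votes / len(conclusions)
    # Three models converging through different reasoning chains is stronger
    # evidence than one model repeating itself three times with the same biases.
    return (answer if support > 0.5 else None, support)

# agreement_signal({"claude": "approve", "codex": "approve", "gemini": "reject"})
# -> ("approve", 0.667), and the dissent itself is worth a human look.
```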
What’s next?
The agent swarms I wrote about in my last piece proved that parallel AI work is possible. They were all monocultures, though, and monocultures are fragile in exactly the ways we saw: convergent failures, identical blind spots, overwriting each other on the same bugs.
The next phase will be systems that deliberately combine different models because diversity of thought produces better outcomes. I expect the orchestration layer to evolve toward something closer to a Kaggle competition pipeline. Multiple models generate candidate solutions. Evaluation functions, some automated and some human, select and combine the best results. The system learns over time which model combinations work for which problem types. The coordinator doesn’t assign work. It designs the contest and judges the entries.
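The “learns over time” part could start as nothing more sophisticated than the ledger sketched below. This is a hypothetical illustration, not any existing framework: record which models win which contests per problem type, then bias future contests toward combinations with a good track record while leaving room for unproven entrants.

```python
from collections import defaultdict

class ContestLedger:
    """Track win rates per (problem type, model) and pick future contestants."""

    def __init__(self):
        self.entries = defaultdict(lambda: {"wins": 0, "runs": 0})

    def record(self, problem_type: str, model: str, won: bool) -> None:
        entry = self.entries[(problem_type, model)]
        entry["runs"] += 1
        entry["wins"] += int(won)

    def contestants(self, problem_type: str, models: list[str], k: int = 3) -> list[str]:
        def win_rate(model: str) -> float:
            entry = self.entries[(problem_type, model)]
            # Smoothed rate: unproven models get the benefit of the doubt,
            # so the contest keeps sampling for diversity.
            return (entry["wins"] + 1) / (entry["runs"] + 2)
        return sorted(models, key=win_rate, reverse=True)[:k]
```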
The engineer bouncing between Claude and Codex today, copy-pasting problems from one to the other when they get stuck, is a prototype of this. Clumsy, manual, and I think correct about the underlying architecture. What happens when that loop runs automatically, with dozens of models instead of two, is the question worth working on.
We built the swarms. Now we need to make them different from each other.

