Plausible Bullshit
There is a concept in philosophy, owed to Harry Frankfurt, called “bullshit.” It is distinct from lying. A liar knows the truth and deliberately says something else. A bullshitter doesn’t care what the truth is. The output just needs to sound right.
Large language models are, by disposition, bullshitters. “Sounding right” is literally what they were optimised to do. This is extremely useful. Until you need to know whether what you’re looking at is any good.
This has implications for how we sell and deliver agentic AI work. Uncomfortable ones.
Nobody buys a model by reading the weights
When you buy a machine learning model, you do not ask to inspect the parameters. You would learn nothing. Instead, you ask for benchmarks, evals, performance on holdout test sets, maybe a whitepaper. The weights are the medium. The evals are the proof.
Now consider how we often try to sell agentic AI. We build an app front-end. We demo the interface. The client reacts to the screen.
This is like choosing an off-the-shelf fraud model because the chart used a nice font and colour scheme, rather than reading the metrics the chart reports. And our instinct to lead with the app is understandable. Apps are visible. They’re demo-able. They’re the thing you can screenshot in a board deck. But that instinct is leading us somewhere bad.
The real value of an agentic system lives in four components that are almost entirely invisible:
Instructions, scripts, and templates. The text that tells the agent how to do provably good work.
A user workflow. How the human collaborates with the agent in a way that makes the LLM better, not worse. (This is what people now call “context engineering,” and getting it wrong is why Deep Research gives you a 50-page report that you then have to edit, which is a bit like hiring a research assistant who drops an encyclopedia on your desk and says “it’s in there somewhere.”)
Evals. Measurements of every aspect of output quality and how the agent got there.
A feedback loop. Human at first, increasingly automated, hill-climbing on each of those components over time.
Component (1), the instructions, is the only one that’s cheap to produce by asking an LLM. Components (2) through (4) require domain expertise, user testing, and engineering discipline. Which is, of course, exactly why people skip them.
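To make the unglamorous parts concrete: here is roughly what components (3) and (4) reduce to in practice. This is a minimal sketch, not a real API; `run_agent` and `grade` are stand-ins for your agent harness and your domain-expert rubric, and the task format is invented.

```python
import statistics

def run_agent(instructions: str, task: str) -> str:
    # Stub: replace with a real call to your agent.
    return f"[agent output for: {task}]"

def grade(output: str, reference: str) -> float:
    # Stub: replace with an expert rubric or automated judge; returns 0-1.
    return float(reference.lower() in output.lower())

def evaluate(skill_text: str, holdout_tasks: list[dict]) -> dict:
    """Score one version of the instruction text against a held-out task set."""
    scores = [
        grade(run_agent(skill_text, t["input"]), t["reference"])
        for t in holdout_tasks
    ]
    return {
        "mean": statistics.mean(scores),
        "worst_case": min(scores),  # regressions hide in the tail
        "n_tasks": len(scores),
    }

# The feedback loop (component 4) is then: propose a change to the skill
# text, re-run evaluate(), and keep the change only if the scores improve.
```

Nothing in that sketch is hard to write. What is hard is the part it assumes away: a task set that represents real work, and a `grade` function that domain experts actually endorse.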
Text is the new weights
Here is the core problem: text looks evaluable but isn’t.
You would never pretend to understand model weights by looking at them. But if I show you a prompt, an agent skill, or a set of instructions, you will read them, form an opinion, and think you can assess quality by inspection. You will almost certainly be wrong.
The value lives in the interaction effects with specific models, contexts, and tasks that you cannot observe by reading. A skill that has been deployed with clients for months, evaluated by domain experts, and hill-climbed through hundreds of iterations looks exactly the same as one that someone asked Claude to draft ten minutes ago. The words might even be similar. The performance will not be.
And this gap is about to get much wider in both directions.
On the floor: any LLM can now produce plausible-sounding output on any topic. The minimum viable bullshit is free and instant. Researchers at Princeton and UC Berkeley recently formalised this as the “Bullshit Index” and found that RLHF fine-tuning (the thing that makes models helpful and conversational) actually increases bullshit production. The model learns to produce outputs that get thumbs-up reactions, which is not the same as producing outputs that are true. The system is optimised to sound right, not to be right.
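The intuition behind that index is simple enough to sketch, even if the paper’s exact protocol differs from what follows. Treat the correlation-based definition below as my reading of the paper’s spirit: compare what the model believes (its internal probability that a statement is true) with what it actually asserts.

```python
import numpy as np

def bullshit_index(beliefs: np.ndarray, claims: np.ndarray) -> float:
    """1 minus |correlation| between belief and assertion.
    Near 0: claims track beliefs. Near 1: claims ignore beliefs."""
    r = np.corrcoef(beliefs, claims.astype(float))[0, 1]
    return 1.0 - abs(r)

rng = np.random.default_rng(0)
beliefs = rng.uniform(0, 1, 500)        # model's internal P(statement is true)
honest = (beliefs > 0.5).astype(int)    # asserts exactly what it believes
flatterer = rng.integers(0, 2, 500)     # asserts regardless of belief

print(bullshit_index(beliefs, honest))     # low: claims track beliefs
print(bullshit_index(beliefs, flatterer))  # near 1: claims are noise
```

Note that a model can score badly here without ever lying. It just has to stop consulting its beliefs before it speaks.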
On the ceiling: techniques like GEPA (from a celebrated 2025 machine learning paper) demonstrate that you can systematically evolve text artifacts (prompts, instructions, agent architectures) against measurable objectives. Published results show 10-20 percentage point performance gains from optimising only the text, with the model held constant. Karpathy’s new autoresearch project automates entire research workflows. A disciplined team with good evals can now embed enormous value into text through an invisible optimisation process. And the output looks like... text.
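Strip away the reflective machinery and Pareto frontiers of the real systems, and the loop underneath is plain hill-climbing over text. A deliberately toy sketch: `propose_edit` and `score` stand in for an LLM mutation step and a full eval run, and the scoring function here is invented for illustration.

```python
import random

def propose_edit(text: str) -> str:
    # Stub: in real systems, an LLM reflects on eval failures and rewrites the text.
    return text + random.choice([" Be concise.", " Cite sources.", " Verify numbers."])

def score(text: str) -> float:
    # Stub: in real systems, this runs the full eval suite on held-out tasks.
    return sum(kw in text for kw in ("concise", "sources", "Verify")) / 3

def hill_climb(text: str, budget: int = 50) -> str:
    best, best_score = text, score(text)
    for _ in range(budget):
        candidate = propose_edit(best)
        s = score(candidate)
        if s > best_score:  # keep only measured improvements
            best, best_score = candidate, s
    return best

print(hill_climb("You are a research assistant."))
```

The loop is trivial. The leverage is entirely in the two stubs, which is exactly where the invisible cost lives.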
So, we have a market where: (a) the low-quality version of the product looks exactly like the high-quality version, (b) the buyer can’t tell the difference by inspection, and (c) the high-quality version is genuinely expensive to produce because the cost is in the evals and iteration, not in the writing.
The market for lemons
This is not just a metaphor, but a precise description of a well-studied economic failure mode.
In Akerlof’s market for lemons, buyers who can’t distinguish quality from junk offer an average price. That average price doesn’t cover the costs of the quality sellers (because evals, iteration, and domain expertise are expensive). So, the quality sellers leave or cut corners. Average quality drops. Average willingness-to-pay drops. The market converges on lemons. This won a Nobel Prize.
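Invented numbers make the unraveling vivid. Say eval-hardened agent work costs 80 to produce and is worth 100 to the buyer, while a first-draft lemon costs 5 and is worth 20:

```python
# Invented numbers, purely illustrative of Akerlof's mechanism.
quality_cost, quality_value = 80, 100  # eval-hardened skill
lemon_cost, lemon_value = 5, 20        # first-draft LLM output

# Buyers who can't tell the two apart bid the expected value of the mix:
share_quality = 0.5
bid = share_quality * quality_value + (1 - share_quality) * lemon_value
print(bid)  # 60.0 -- below quality_cost, so quality sellers exit

# Once quality exits, the mix is all lemons and the bid follows it down:
bid = 0.0 * quality_value + 1.0 * lemon_value
print(bid)  # 20.0 -- the market has converged on lemons
```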
Now look at the market for agentic AI. The buyer can’t distinguish a hill-climbed, eval-hardened skill from a first-draft LLM output. So, they benchmark on what they can see: the app, the demo, the slide deck. We respond rationally by investing in what the buyer benchmarks on. The invisible components (the workflow, the evals, the feedback loop) get less investment because they don’t help close the deal.
The quality of the agent work quietly degrades. We sell more apps. The apps look great. The lemons are writing themselves.
This is the incentive structure we are all operating in. Smart people follow incentives, and that’s what makes it dangerous.
The way out of a lemons market is always the same: costly signals. Signals that are expensive enough to produce that faking them is uneconomic. A university degree is a costly signal (years of your life). A brand is a costly signal (decades of reputation). For agent work, the costly signal is evals. Specifically, evals where the client controls the test set, because you can’t overfit to something you haven’t seen. “We ran evals and they were great, trust us” is a cheap signal. “Here is our performance on your holdout data” is an expensive one. That’s the difference.
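Mechanically, the client-controlled holdout is simple, which is part of its appeal. A sketch with invented names:

```python
# Sketch of a client-controlled holdout protocol (all names invented).
import random

def split_tasks(tasks: list, holdout_frac: float = 0.3, seed: int = 7):
    shuffled = random.Random(seed).sample(tasks, len(tasks))
    cut = int(len(shuffled) * holdout_frac)
    return shuffled[cut:], shuffled[:cut]  # (dev: shared with vendor, holdout: secret)

tasks = [{"input": f"case {i}", "reference": f"answer {i}"} for i in range(100)]
dev, holdout = split_tasks(tasks)
# 1. Vendor hill-climbs their skill on `dev`, then freezes it.
# 2. Client runs the frozen skill on `holdout` and reports the score.
# 3. Because the vendor never saw `holdout`, the number can't be overfit.
```

The expense is real: someone has to assemble representative tasks and references, and the vendor has to accept being graded. That expense is the point.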
The compounding problem
There is a second reason the app-first approach is a dead end.
Every enterprise AI application I have seen to date can be bested by a general-purpose agent (Claude Code, Copilot, Codex) given the right instructions, tools, and context. Every one. The bespoke app adds a branded interface and subtracts flexibility, composability, and the user’s ability to bring their own context to the work. It is a net negative.
If we build a bespoke app for one customer, the instructions, workflows, evals, and feedback loops are locked inside that project. Confidentiality and silos mean we cannot systematically improve our skills across projects, even when we tell ourselves we will. (We always tell ourselves we will.) So we end up with 100 projects, each running its own version of plausible-BS-tier text, rather than one set of skills that gets measurably better over time.
Compare this to what we should be building: domain-specific skills and evals that improve with every deployment. A reusable skill for research synthesis. A battle-tested workflow for document review. An eval suite for data extraction that has been hardened across dozens of deployments. Each project makes the next one better. The text artifact gets hill-climbed. The feedback loop tightens. The value compounds.
This only works if the unit of delivery is the skill (plugged into a general-purpose agent), not the app (walled off in its own context). And it only works if we invest in evals and feedback loops as seriously as we invest in the demo.
The persistent agent
The final piece of this argument is about where the industry is heading, and it should worry anyone whose business model is building disconnected AI applications.
The future is not one app per task. It is one agent per person and one agent per team: a persistent, general-purpose agent that the worker or team communicates with across many surfaces depending on the work at hand. The agent accumulates context over time. Skills and instructions plug into it. The worker’s history, preferences, and domain knowledge persist across tasks rather than being abandoned every time they close a tab.
Every “AI-powered” enterprise app I have encountered works against this. Each one creates its own silo of context, its own interface to learn, its own disconnected LLM that knows nothing about the user’s other work. They don’t even make it easy to export information for the user’s own agent, let alone connect to one. They are actively training organisations to think about AI the wrong way: as a set of disconnected tools rather than a composable capability layer around each worker.
The teams that see this will build skills, workflows, and evals that plug into the agent people already use. The ones that don’t will keep building apps that look impressive in demos and lose to a well-prompted Copilot within a month of delivery.
Selling the invisible
None of this means we should stop making things that users can look at and react to. The demo still matters. The interface still matters. People need something concrete to say yes to.
But we need to change what we’re building underneath.
Don’t ask to see the prompt. Ask to see the evals. What metrics are being tracked? What’s the before and after? What does the holdout test set look like? What does “good” mean, and who defined it? How many iterations of hill-climbing have these artifacts been through? You would never buy a model based on how the training loss curve was formatted. Apply the same standard to text.
And be especially suspicious of impressive-looking output, including our own. A 50-page report with confident prose and clean formatting is the default output of a system that doesn’t care whether it’s right. The question is never “does this look good?” It is always “how do we know this is good?”
Making work look good is now a commodity. Any LLM can do it in seconds. The firms that win will be the ones that can prove their work is good. The costly signal. The holdout evals. The compounding skill library. The feedback loops that close.
Everything else is plausible bullshit. And there will be plenty of that to go around.

