The Deliberation Will Not Be Automated
Why agent orchestration protocols won't replace software
Introduction
Every major wave of computing technology carries with it a species of project that mistakes the scaffolding for the building. In the 1990s, CORBA promised that if we just got the inter-object communication protocol right, distributed systems would assemble themselves. In the 2000s, the Semantic Web wagered that if web content were annotated with machine-readable ontologies, software would understand its own purpose. In the 2010s, service mesh architectures like Istio proposed that if the communication fabric between services were sufficiently smart, the services themselves could be trivial.
None of these replaced software. Some became useful infrastructure for specific niches. Most became cautionary tales about the gap between elegant specification and messy reality.
We are now watching a new entrant in this tradition: the agent orchestration protocol. Projects in this space propose that if we standardize how AI agents discover each other, negotiate decisions, record outcomes, and price their own reasoning, then the resulting multi-agent systems will be able to replace — or at least subsume — the role that conventional software plays in the world.
One such project is ai-manifests.org, a family of four interoperable specifications originated by David H. Friedel Jr. and affiliated with MarketAlly. The project defines protocols for agent discovery (mcp-manifest), multi-agent deliberation through calibration-weighted voting (adp-manifest), append-only signed decision journals (adj-manifest), and a "cognitive budget" model that prices the metabolic cost of reasoning (acb-manifest). Reference implementations ship in C#, Python, and TypeScript under Apache 2.0, with the specs themselves released under the Community Specification License 1.0.
It is a well-executed project. The specs are real, the libraries are published, the licensing is thoughtful. This essay is not about whether ai-manifests is competently built — it is. The question is whether the class of problem it addresses is the bottleneck that prevents AI agents from replacing software, and whether the assumptions embedded in its design will survive contact with the world it's trying to create.
The argument here is that they won't, for reasons that are structural rather than incidental.
The Competence Gap: Orchestrating What Doesn't Reliably Work
The most fundamental problem with agent orchestration protocols is temporal. They are solving problem number four on a list where we have not yet solved problem number one.
The hard problem in AI agents today is not coordination. It is competence. Getting a single LLM-based agent to reliably perform a multi-step task — file a bug report, refactor a module, analyze a financial statement — without hallucinating facts, losing track of its own state, or making a quietly catastrophic decision is an unsolved problem in the general case. Systems like Claude Code, Devin, and various AutoGPT descendants can do impressive things in favorable conditions, but their failure modes are unpredictable and their reliability is nowhere near what conventional software provides.
Building a multi-agent deliberation protocol on top of this is like designing a parliamentary procedure for a room full of agents who each have a coin-flip chance of forgetting what the motion is about halfway through the vote. The spec can be perfect. The participants are the constraint.
Consider what the ADP (Agent Deliberation Protocol) actually proposes: agents submit proposals, other agents vote on them using calibration-weighted scoring, the system applies reversibility-tiered thresholds to decide whether to commit, and the outcome is recorded in a tamper-evident journal. This is a reasonable design for a world in which agents are reliable reasoners whose principal failure mode is disagreement. But that is not our world. Our world is one in which agents' principal failure mode is incorrectness — and a voting system over unreliable reasoners doesn't converge on truth, it converges on the most common error.
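To make the mechanism concrete, here is a minimal sketch of calibration-weighted voting with reversibility-tiered thresholds. The field names, threshold values, and tally rule are illustrative assumptions for this essay, not the actual adp-manifest schema:

```python
# Minimal sketch of ADP-style calibration-weighted voting with
# reversibility-tiered thresholds. All names and numbers are hypothetical.
from dataclasses import dataclass

@dataclass
class Vote:
    agent_id: str
    approve: bool
    calibration: float  # historical-accuracy weight in [0, 1]

# Stricter approval thresholds for less reversible decisions.
THRESHOLDS = {"reversible": 0.5, "costly": 0.67, "irreversible": 0.9}

def commit(votes: list[Vote], tier: str) -> bool:
    total = sum(v.calibration for v in votes)
    approval = sum(v.calibration for v in votes if v.approve)
    return total > 0 and approval / total >= THRESHOLDS[tier]

votes = [Vote("a", True, 0.9), Vote("b", True, 0.6), Vote("c", False, 0.8)]
print(commit(votes, "reversible"))    # True  (weighted approval ~0.65 >= 0.5)
print(commit(votes, "irreversible"))  # False (~0.65 < 0.9)
```

Nothing in this tally knows whether any vote was correct.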
This is a well-known result in collective decision theory. The Condorcet jury theorem tells us that majority voting among independent agents converges on the correct answer only when each agent is individually more likely to be right than wrong. When individual accuracy drops below 50%, adding more voters makes the group less likely to be correct. LLM agents on complex tasks are not reliably above that threshold, and they are definitely not independent — they share training data, architectural biases, and failure modes. A deliberation protocol that aggregates their votes is amplifying correlated noise.
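The effect is easy to check numerically. The sketch below computes the probability that a simple majority of n independent voters is correct, given per-voter accuracy p; independence is itself a generous assumption for agents that share training data:

```python
# Condorcet jury effect: probability that a simple majority of n independent
# voters is correct, given per-voter accuracy p (n odd, so no ties).
from math import comb

def majority_correct(p: float, n: int) -> float:
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

for p in (0.6, 0.5, 0.4):
    print(p, majority_correct(p, 11), majority_correct(p, 101))
# p=0.6: ~0.75 -> ~0.98  (more voters help)
# p=0.5:  0.50 ->  0.50  (a coin flip stays a coin flip)
# p=0.4: ~0.25 -> ~0.02  (more voters actively hurt)
```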
The ai-manifests project acknowledges this indirectly through its "calibration-weighted voting" mechanism, which adjusts vote weights based on agents' historical accuracy. This is a good idea in principle. In practice, it requires a ground-truth signal against which to calibrate — you need to know which past decisions were right. For the kinds of high-stakes, open-ended decisions where multi-agent deliberation would be most valuable (strategic planning, architectural choices, risk assessment), ground truth is precisely what you lack. Calibration becomes circular: agents are weighted by their agreement with outcomes that were themselves determined by weighted agreement.
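The circularity fits in a few lines. In the hypothetical loop below, the only available "outcome" signal is the weighted consensus itself, so a dissenting agent is down-weighted regardless of whether it was right:

```python
# Circular calibration in miniature: with no external ground truth, agents
# are "calibrated" against the weighted consensus their weights produced.
def run_round(weights: dict[str, float], votes: dict[str, bool]) -> dict[str, float]:
    total = sum(weights.values())
    consensus = sum(weights[a] for a, v in votes.items() if v) / total >= 0.5
    # Reward agreement with the consensus, not agreement with the truth.
    return {a: w * (1.1 if votes[a] == consensus else 0.9)
            for a, w in weights.items()}

weights = {"a": 1.0, "b": 1.0, "c": 1.0}
for _ in range(10):
    weights = run_round(weights, {"a": True, "b": True, "c": False})
print(weights)  # "c" is steadily down-weighted for dissent, right or wrong
```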
None of this means multi-agent systems are useless. It means that the bottleneck is agent capability, not agent coordination, and building coordination infrastructure before capability is mature is building the highway before the car is invented. The highway might even be well-designed. It is still premature.
The Abstraction Trap: Generality as a Liability
Software works because someone encoded domain knowledge into deterministic logic that executes the same way every time. A scheduler for an AI accelerator chip works because its author understands the memory hierarchy, the dataflow constraints, the timing requirements of the specific hardware, and has encoded that understanding into code that makes decisions at nanosecond granularity with zero ambiguity. A flight control system works because its control loops are mathematically proven to converge. A database works because its transaction protocol guarantees ACID properties through carefully designed lock ordering and write-ahead logging.
Agent orchestration protocols operate at a fundamentally different level of abstraction. They define how agents talk about decisions without constraining what makes a decision good. The ADP spec tells you how to run a vote. It tells you nothing about whether the thing being voted on makes sense. The ADJ spec gives you a tamper-evident log of what happened. It tells you nothing about whether what happened was correct.
This is the abstraction trap: maximum generality produces minimum utility for any specific problem. When everything is parameterizable, nothing is solved. The spec essentially says "agents can decide things" and provides the plumbing for that, but the actual intellectual content — the part where somebody figures out what the right answer is — remains entirely outside the protocol. It's left to the "evaluator function," which each agent provides independently.
Compare this with how actual infrastructure protocols succeed. TCP/IP works because it solves a specific, well-defined problem (reliable ordered byte delivery over unreliable networks) with strong guarantees (delivery or explicit failure, ordering preservation, congestion response). HTTP works because it maps cleanly onto a universal interaction pattern (request-response over named resources). SQL works because the relational model provides real mathematical leverage for a common class of data problems. Each of these protocols succeeds by being opinionated about something that matters.
What is the ADP opinionated about? Voting. But voting is not the hard part of multi-agent decision-making. The hard part is the reasoning that produces each vote. By remaining agnostic about that — as it must, to remain general — the protocol leaves the hard problem untouched and wraps the easy problem in ceremony.
This is a pattern that repeats in the history of computing. SOAP and WSDL were maximally general protocols for service interaction. They were technically complete and formally well-specified. They were largely replaced by REST, which succeeded precisely by being less general — by imposing constraints (statelessness, uniform interface, resource orientation) that made the common case simple. Generality in protocol design is almost always a sign that the designers don't yet know what the important constraints are. Once you know, you lock them down, and the protocol becomes useful by becoming opinionated.
The Biological Metaphor Problem: Cognitive Budgets and False Analogies
The ACB (Agent Cognitive Budget) spec introduces a pricing model for deliberation that explicitly invokes neuroscience: decisions are priced "the way the brain does," with cheap habitual decisions, expensive contested ones, and "metabolic cost" metaphors throughout.
This is evocative. It is also misleading in ways that obscure the actual economics of AI inference.
The brain's metabolic costs are deeply tied to its physical substrate. Glucose consumption varies by brain region and correlates roughly with neural activity intensity. Habitual decisions route through basal ganglia pathways that are metabolically cheap because they've been physically carved through synaptic strengthening over thousands of repetitions. Novel decisions recruit prefrontal cortex resources that are metabolically expensive because they require maintaining and manipulating working memory representations against competing signals.
LLM inference has completely different cost drivers. The dominant cost is context length multiplied by model size, modulated by batch efficiency and hardware utilization. A "contested" multi-agent decision doesn't cost more because the reasoning is inherently harder — it costs more because you're doing more rounds of inference, filling longer context windows, and generating more tokens. The cost function is linear in tokens and quadratic in context length (for attention computation), not sigmoidal in difficulty.
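That cost structure is simple enough to write down. The coefficients below are invented; the point is the shape of the function, not the numbers:

```python
# Rough shape of LLM serving cost for one inference call: attention work
# scales with the square of context length, generation work scales linearly
# with tokens produced. Coefficients are made up for illustration.
def inference_cost(context_tokens: int, output_tokens: int,
                   attn_coeff: float = 1e-9, gen_coeff: float = 1e-6) -> float:
    attention = attn_coeff * context_tokens ** 2   # quadratic in context
    generation = gen_coeff * output_tokens         # linear in output
    return attention + generation

# A "contested" decision costs more only because it runs more rounds over
# longer contexts, not because the question is intrinsically harder.
print(inference_cost(2_000, 500))     # one short exchange
print(inference_cost(50_000, 4_000))  # a long multi-round deliberation
```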
The ACB spec's "habit-memory discounts" — where repeated similar decisions become cheaper — have no natural analogue in LLM inference. Running the same prompt twice costs the same as running it the first time (absent caching, which is an engineering optimization, not a cognitive one). The biological metaphor suggests that agent systems will develop something like intuition or muscle memory. They won't, because the underlying compute doesn't work that way. Each inference is stateless. There is no synaptic trace to cheapen.
This matters because pricing models shape behavior. If you price deliberation using a biological metaphor that doesn't match the actual cost structure, you'll get perverse incentives: agents might avoid "expensive" contested decisions that are actually cheap in compute terms, or underinvest in "habitual" decisions that genuinely need fresh reasoning. The metaphor becomes a tax on clear thinking about what things actually cost.
A more honest pricing model would be straightforwardly economic: cost per token, cost per round, cost per agent-hour, with market-based pricing for scarce resources. This is less poetic than "metabolic cost" but has the advantage of describing reality.
The Adoption Problem: Specs Without Gravity
Protocol adoption follows power laws. The protocols that win are not the best-designed; they are the ones that solve the most immediate pain for the most developers at the moment when those developers are looking for a solution. Timing and distribution matter more than technical merit.
MCP (Model Context Protocol), which ai-manifests' discovery layer builds on, has achieved a degree of adoption because it was pushed by Anthropic — a company with a large and growing developer base — and because it solved an immediate, painful problem: connecting LLMs to external tools without writing bespoke integration code for each tool-model pair. The developer pain was real and present. The solution was timely.
The ai-manifests project solves a different kind of problem: federated multi-agent decision-making with auditable outcomes across trust boundaries. This is a real problem — in the sense that you can describe scenarios where it would be valuable — but it is not a present problem. Almost no production systems today federate autonomous agents across organizational boundaries. The systems that come closest (multi-model evaluation pipelines, agent swarm frameworks like CrewAI or AutoGen) operate within a single trust domain and use ad-hoc coordination rather than formal protocols.
This creates a bootstrapping problem. The spec is useful when there are many ADP-compliant agents that need to interoperate. But nobody will build ADP-compliant agents until there are other ADP-compliant agents to interoperate with. The reference templates lower the barrier to entry, but they don't create the demand. You can make it easy to build a fax machine, but that doesn't help if nobody else has one.
Historical precedent suggests what breaks this cycle: either a dominant platform puts its weight behind the protocol (as Google did with Kubernetes, or as Anthropic is attempting with MCP), or a killer application demonstrates value that only the protocol can deliver (as the web browser did for TCP/IP in consumer adoption, or as email did for SMTP). The ai-manifests project has neither. It is affiliated with MarketAlly, which does not have the market position to drive adoption, and there is no killer application that makes the case that calibration-weighted multi-agent voting is the missing piece in anyone's production system.
This is not a criticism of the project's ambition. It is an observation about how infrastructure gets adopted. The best-designed protocol that nobody uses is architecturally equivalent to the protocol that was never written.
The Determinism Problem: Why Software Is Not a Committee
There is a deeper philosophical issue beneath the technical objections, and it is worth making explicit.
Software — real, production software that runs the world — is valuable precisely because it is not a deliberation. A function that computes a hash returns the same hash every time. A scheduler that assigns work to cores does so according to fixed priority rules. A transaction manager that commits or rolls back does so based on deterministic evaluation of invariants. The whole point is the absence of ambiguity.
Multi-agent deliberation protocols propose replacing this with consensus among stochastic reasoners. Instead of a function that computes the answer, you have a committee that votes on what the answer might be. Instead of a deterministic outcome, you have a probabilistic one that depends on which agents happen to participate, what their calibration weights happen to be, and whether the voting threshold happened to be met.
For certain categories of problems, this is genuinely useful. Problems where there is no single right answer — content curation, style selection, open-ended planning — benefit from integrating multiple perspectives. Problems where the search space is too large for deterministic enumeration — drug discovery, architectural design, creative synthesis — can benefit from stochastic exploration by multiple agents.
But these are not the problems that constitute the bulk of what software does. The vast majority of production software is deterministic, verifiable against a specification, and deliberately designed to remove human judgment from the loop — not to add more sources of stochastic judgment. Payroll software doesn't deliberate about how much to pay you. Air traffic control doesn't vote on whether to separate aircraft. The Linux kernel's memory allocator doesn't convene a panel to decide where to place a page.
The category of problems where multi-agent deliberation adds value is real but narrow. Treating it as a general replacement for software development is a category error — like arguing that brainstorming sessions should replace engineering drawings. Both have their place. They are not substitutes.
The Audit Illusion: Journals Without Accountability
The ADJ (Agent Deliberation Journal) spec is perhaps the most technically interesting component of the ai-manifests family. It proposes an append-only, hash-chained record of every deliberation outcome, signed by participating agents, supporting replay and audit.
This is good engineering for a real concern: if autonomous agents are making decisions, there should be a record of what was decided and why. The design — hash chaining, cryptographic signatures, append-only semantics — is directly borrowed from blockchain and tamper-evident log literature, and it's the right set of primitives for the job.
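The core primitive is compact. Here is the hash-chaining idea in miniature, with illustrative field names rather than the actual adj-manifest schema, and with agent signatures omitted for brevity:

```python
# Tamper-evident journal in miniature: each entry hashes its predecessor,
# so editing any past entry breaks every hash link after it.
import hashlib
import json

def append_entry(chain: list[dict], payload: dict) -> list[dict]:
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = {"payload": payload, "prev_hash": prev_hash}
    body["hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()
    ).hexdigest()
    return chain + [body]

chain: list[dict] = []
chain = append_entry(chain, {"proposal": "X", "votes": {"a": 0.73}, "passed": True})
chain = append_entry(chain, {"proposal": "Y", "votes": {"a": 0.41}, "passed": False})
# Rewriting entry 0 now invalidates entry 1's prev_hash link.
```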
The problem is what the journal actually records. It captures votes and outcomes, not reasoning. It tells you that Agent A voted for Proposal X with weight 0.73, and that the proposal passed threshold. It does not tell you why Agent A voted that way, whether its reasoning was sound, or whether the same agent would vote differently if asked again with the same information (it would, because LLM inference is non-deterministic at temperature > 0).
This creates an audit illusion. The journal looks like accountability — it's signed, it's tamper-evident, it's replayable — but it provides the form of accountability without the substance. Real accountability requires understanding why a decision was made, not just that it was made. A hash-chained record of stochastic votes is a very precise record of something that was inherently imprecise.
Contrast this with how actual software systems handle auditability. A financial system's audit trail records the specific inputs, the specific rules applied, and the specific outputs, and anyone can re-derive the output from the inputs and rules. The audit trail is useful because the underlying process is deterministic. You can point to the line of code that produced the result. A deliberation journal for stochastic agents offers no such re-derivability. Replaying the deliberation would produce different votes (different random seeds, different attention patterns), and there is no "line of code" to point to — just a probability distribution from which a sample was drawn.
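The property a financial audit trail has, and a deliberation journal lacks, can be stated as a one-line check: recompute the output from the recorded inputs and rules, and compare. A hypothetical sketch:

```python
# Re-derivable audit: the record carries inputs and a rule version, and
# verification is recomputation. Names and numbers are hypothetical.
def apply_rules_v3(inputs: dict) -> float:
    # Deterministic rule: same inputs, same output, every time.
    return round(inputs["gross"] * (1 - inputs["tax_rate"]), 2)

record = {"inputs": {"gross": 5000.0, "tax_rate": 0.3},
          "rules": "v3", "output": 3500.0}
assert apply_rules_v3(record["inputs"]) == record["output"]  # audit passes
```

No analogous assert exists for a sampled vote; replaying it yields a different sample.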
Conclusion
Projects like ai-manifests are not wrong about the future they're pointing toward. Multi-agent systems will become more important. Agents will need to discover each other, coordinate, and maintain records. The problems these specs address are real problems that someone will eventually need to solve.
But the project is wrong about when this future arrives and what the bottleneck is. The bottleneck is not the absence of coordination protocols. It is the absence of agents reliable enough to be worth coordinating. Building the deliberation infrastructure before the deliberators are competent is not just premature — it risks locking in design decisions that are optimized for today's (limited) agent capabilities rather than tomorrow's (unknown) ones.
The specifications also reflect a subtle but important conceptual error: the assumption that the hard part of software is decision-making, and that if you formalize decision-making, you've captured the essence of what software does. In reality, the hard part of software is encoding domain knowledge into deterministic, verifiable, repeatable processes. Decision-making is what you do when you don't know enough to write software yet. It is a symptom of incomplete understanding, not a replacement for complete understanding.
Software will not be replaced by committees of agents voting on what to do. It will be replaced — if at all — by better software, possibly written by AI agents that understand the domain well enough to produce deterministic solutions rather than deliberated ones.
Where Do We Go from Here?
If the critique above is correct — that agent orchestration protocols are premature and somewhat misdirected — then what should the AI agent ecosystem be investing in?
Agent reliability first, coordination second. The single highest-leverage investment is making individual agents more reliable at bounded, well-defined tasks. This means better tool use, better self-correction, better uncertainty quantification (knowing when you don't know), and better failure modes (failing loudly and specifically rather than silently and subtly). Projects like Anthropic's tool-use training, OpenAI's function calling improvements, and various retrieval-augmented generation frameworks are on this track. Until a single agent can be trusted to reliably file a correct bug report, building infrastructure for agents to vote on bug reports is premature.
Narrow, opinionated protocols over general ones. When coordination protocols do become necessary, they should be narrow and opinionated rather than maximally general. A protocol for "AI code review agents coordinating on a pull request" will be more useful than a general "agents deliberating on anything" protocol, because the constraints of the specific domain (code must compile, tests must pass, style must conform) provide the ground truth that calibration requires and that general deliberation lacks.
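As a sketch of what "opinionated" buys, consider a hypothetical message type for code-review agents only, in which deterministic evidence is a required field rather than an optional virtue:

```python
# Hypothetical narrow protocol: verdicts for one domain (pull requests),
# with deterministic evidence baked into the message type itself.
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class ReviewVerdict:
    pr_id: str
    agent_id: str
    verdict: Literal["approve", "request_changes"]
    tests_passed: bool   # ground truth the general deliberation case lacks
    lints_clean: bool
    comments: list[str] = field(default_factory=list)

def can_merge(verdicts: list[ReviewVerdict]) -> bool:
    # Deterministic gates come first; the vote only settles matters of taste.
    gates = all(v.tests_passed and v.lints_clean for v in verdicts)
    approvals = sum(v.verdict == "approve" for v in verdicts)
    return gates and approvals > len(verdicts) / 2
```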
Deterministic verification over stochastic consensus. Instead of having agents vote on whether a decision is correct, invest in tools that let agents verify whether a decision is correct. Type checkers, test suites, formal verification, simulation environments — these provide deterministic ground truth that doesn't depend on the reliability of the verifier's "opinions." The future of reliable AI agents is more likely to look like "agent proposes, deterministic system disposes" than "agents vote and the majority wins."
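In code, the pattern is a gate, not a ballot. The sketch below assumes a git repository with a pytest suite and mypy configured; the commands are real, but the workflow is an illustration, not anyone's shipped system:

```python
# "Agent proposes, deterministic system disposes": the agent's patch is only
# a candidate, and deterministic verifiers decide whether it lands.
import subprocess

def accept_patch(patch_file: str) -> bool:
    # Apply the proposed patch, then run deterministic gates in order.
    if subprocess.run(["git", "apply", patch_file]).returncode != 0:
        return False
    for gate in (["python", "-m", "pytest", "-q"],
                 ["python", "-m", "mypy", "."]):
        if subprocess.run(gate).returncode != 0:
            subprocess.run(["git", "checkout", "--", "."])  # revert the patch
            return False  # fail loudly; no vote can override a red test suite
    return True
```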
Transparency of reasoning over transparency of votes. If audit trails are important — and they are — invest in mechanistic interpretability, chain-of-thought faithfulness, and reasoning traces that actually explain why an agent reached a conclusion. A tamper-evident log of opaque votes is less useful than a lossy but interpretable record of the reasoning process. The field of AI interpretability is working on this, and its outputs will be more valuable for accountability than any journal format.
Let the protocols emerge from practice. The most successful infrastructure protocols in computing history were extracted from working systems, not designed in advance. HTTP was a formalization of how Tim Berners-Lee's early web servers already worked. REST was a description of the architectural style that had already made the web successful. Kubernetes was an externalization of Borg, which Google had been running internally for a decade. The best multi-agent coordination protocol will almost certainly be extracted from a production system that's already working, not designed top-down from first principles by someone who hasn't yet needed it.
The agent era is coming. The infrastructure it will need is real. But the infrastructure will be built by the people who are currently struggling with production agent systems and discovering, through painful experience, which coordination problems actually matter. It will not be built by specifying, in advance, the protocols for a world that doesn't exist yet.
The deliberation will not be automated. At least, not yet.
See also: www.ai-committees.org — the same argument, less politely.