Why multi-agent systems fail: three causes and how to fix them
Multi-agent systems often show managerial problems: agents fail to share information, follow roles mechanically, or drift into unproductive chatter. Today let's see why sound engineering matters more than prompt refinement.
Multi-agent systems built on large language models are in high demand: several specialized agents are expected to outperform a single generalist. In practice, the gains are often marginal. In some cases, multi-agent systems even perform worse than a single agent.
Why does this happen?
A recent study from the University of California, Berkeley, “Why Do Multi-Agent LLM Systems Fail?”, offers a systematic answer. The authors analyzed more than 1,600 multi-agent systems (including ChatDev, MetaGPT, HyperAgent, and others) and identified the main sources of failure.
What counts as failure, and how bad is it?
Across tested systems, failure rates ranged from 41% to 86%, depending on the task (programming, math, general reasoning). A failure is defined as any case where the system does not reach the core intent of the user request.
Example:
A task requires logging into a service to perform an action. One agent knows the API requires a phone number as the login identifier but does not share this fact. Another agent repeatedly attempts to authenticate using an email address and fails.
Notably, failures do not include hallucinations, extra steps, suboptimal answers, or mistakes made by individual agents. The problem is not isolated errors; it is systemic breakdown.
The evidence suggests that most failures stem from system design, not from weak LLMs. The issue is architecture, coordination rules, and outcome control.
Three sources of failure in multi-agent systems
1. System design failures
These errors arise from how the system is structured:
- Agents violate task requirements.
- Agents ignore their assigned roles (e.g., a “reviewer” writes code).
- The system loops endlessly.
- Context is lost.
- The system fails to recognize when the task is complete.
These issues occur even when all agents use the same model. The paper describes a case where ChatDev failed to implement a Wordle-like game. The system consistently violated the specification by selecting words from a fixed list instead of randomizing them. Once final authority was handed to a specific agent, the task succeeded.
Conclusion: MAS architecture matters as much as model choice. Agent roles, workflows, and requirement validation must be designed explicitly.
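To make this concrete, here is a minimal sketch of what explicit requirement validation could look like. The checks, word list, and `pick_word` function are hypothetical examples (not from the paper), standing in for the Wordle-like case above: a "final authority" step that refuses to mark the task complete while any requirement is violated.

```python
# Illustrative sketch (hypothetical names, not the paper's implementation):
# an orchestrator validates explicit requirements before accepting output.
import random

WORDS = ["crane", "slate", "pious", "mound", "thick"]

def pick_word():
    """Candidate implementation produced by a 'coder' agent."""
    return random.choice(WORDS)

REQUIREMENTS = [
    # (description, check) pairs encoding the task specification.
    ("word is chosen at random", lambda run: len({run() for _ in range(20)}) > 1),
    ("word is five letters", lambda run: len(run()) == 5),
]

def validate(run):
    """The 'final authority' step: list every violated requirement."""
    return [desc for desc, check in REQUIREMENTS if not check(run)]

# An empty list means the task may be marked complete.
print(validate(pick_word))  # []
```

A deterministic `pick_word` that always returned the first list entry would fail the randomness check, which is exactly the specification violation described above.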
2. Agent misalignment
This class of failures stems from poor coordination:
- Agents do not ask clarifying questions when information is missing.
- Critical knowledge is not shared.
- Messages from other agents are ignored.
- Agents drift away from the shared goal.
- Reasoning and action diverge.
The authors describe this as a lack of “theory of mind.” Agents cannot reason about what other agents know or need to know.
Conclusion: The problem is not message format. It is the inability of agents to model each other’s knowledge and intent.
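One common engineering answer is to force knowledge into shared state rather than hoping agents volunteer it. The sketch below (hypothetical class and key names, not from the paper) revisits the login example: a shared "blackboard" where facts must be published, and missing information fails loudly instead of being silently guessed.

```python
# Illustrative sketch: a shared blackboard so critical facts are published
# once and read by every agent, instead of living in one agent's context.
class Blackboard:
    def __init__(self):
        self.facts = {}

    def publish(self, key, value):
        self.facts[key] = value

    def require(self, key):
        # Fail loudly when information is missing, rather than guessing.
        if key not in self.facts:
            raise KeyError(f"missing shared fact: {key}")
        return self.facts[key]

board = Blackboard()

# Agent A knows the API expects a phone number and must share that fact.
board.publish("login_identifier", "phone_number")

# Agent B consults shared state before acting, instead of assuming "email".
print(board.require("login_identifier"))  # phone_number
```

The point is not this particular data structure, but the contract: an agent that needs a fact asks for it explicitly, and the system surfaces the gap instead of letting the agent retry with a wrong assumption.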
3. Weak result verification
Even when a system produces output, it often:
- Stops too early.
- Skips validation.
- Performs shallow, formal checks.
In one example, the system generates a chess simulator. The code compiles, but the rules are implemented incorrectly. The task is marked as complete anyway. Current verification mechanisms in MAS are weak and often limited to syntactic checks.
Conclusion: Outcomes improve significantly with multi-level validation: checking not only whether the system runs, but whether it actually solves the problem semantically.
How to build a system that doesn’t fail
It is tempting to blame MAS failures on LLM hallucinations. The study suggests otherwise. Multi-agent systems fail for the same reasons poorly run organizations fail: unclear roles, opaque processes, weak quality control, and ineffective communication. Even highly capable agents cannot compensate for a flawed structure.
A MAS is not just a list of agents; it is a system, and from an engineering perspective, allocating roles is an act of system design. This is why moving from GenAI experiments to stable business value requires experience in process design, automation, and system engineering, not just prompt tuning.
To help clients extract business value from generative and classical ML, we run focused workshops. These sessions align technology choices with real business constraints. During the workshop, we address:
- Which technology fits your goals (multi-agent systems vs. classical ML);
- Where fast ROI is realistically achievable;
- How to ensure data security and output quality;
- How to mitigate project-specific risks.
If a single agent is an employee, a multi-agent system is a business entity. And an organization needs structure, roles, and control.