Rethinking the Value of Multi-Agent Workflow: A Strong Single Agent Baseline
Jiawei Xu, Arief Koesdwiady, Sisong Bei, Yan Han, Baixiang Huang, Dakuo Wang, Yutong Chen, Zheshen Wang, Peihao Wang, Pan Li, Ying Ding

TL;DR
This paper demonstrates that a single large language model can replicate the performance of multi-agent workflows across various tasks, offering efficiency benefits and establishing a strong baseline for future multi-agent system research.
Contribution
It shows that single-agent approaches can match multi-agent workflows' performance, introduces OneFlow for optimized single-agent workflows, and highlights limitations in current heterogeneous multi-agent systems.
Findings
Single-agent can match multi-agent workflow performance.
KV cache reuse provides efficiency advantages.
Single-LLM methods cannot currently replicate heterogeneous workflows.
Abstract
Recent advances in LLM-based multi-agent systems (MAS) show that workflows composed of multiple LLM agents with distinct roles, tools, and communication patterns can outperform single-LLM baselines on complex tasks. However, most frameworks are homogeneous, where all agents share the same base LLM and differ only in prompts, tools, and positions in the workflow. This raises the question of whether such workflows can be simulated by a single agent through multi-turn conversations. We investigate this across seven benchmarks spanning coding, mathematics, general question answering, domain-specific reasoning, and real-world planning and tool use. Our results show that a single agent can reach the performance of homogeneous workflows with an efficiency advantage from KV cache reuse, and can even match the performance of an automatically optimized heterogeneous workflow. Building on this…
Peer Reviews
Decision·Submitted to ICLR 2026
The author Reframes multi-agent research with a rigorous single-agent equivalence argument. they provide Six general benchmarks + domain-specific tasks. they also Quantifies KV-cache benefits clearly. it shows that OneFlow’s dual-meta LLM + MCTS is a creative and reproducible design. and it clearly delineates where single-agent simulation applies and where heterogeneity still matters.
Limited empirical heterogeneity analysis: Pilot study is small; results inconclusive about real multi-model synergy. Simulation of KV cache: Since APIs hide internal caching, efficiency results are theoretical. A small open-weight replication (e.g., LLaMA-3 8B) would strengthen credibility. Ablations: Lack of ablation on MCTS parameters (α, β, iterations) and meta-LLM roles; unclear how much each contributes. Over-dependence on closed models: Limits reproducibility beyond cost estimation. Wr
- **S1.** The paper tackles a timely and important issue whether multi-agent systems provide real advantages over single-agent reasoning when the base LLM is homogeneous. - **S2.** Well-explained theoretical formulation that logically connects shared KV cache to computational efficiency. - **S3.** Comprehensive experimental coverage across six benchmarks and one domain-specific dataset.
- **W1.** The OneFlow framework largely replicates the AFlow architecture with minor adaptations. The use of MCTS for workflow generation is not new, and the manuscript does not clearly articulate what conceptual or technical innovation distinguishes OneFlow from AFlow. - **W2.** The evaluation primarily relies on closed-weight models (GPT-4o-mini, Claude 3.5 Haiku), and the KV-cache advantages are simulated rather than directly measured. The real experiments using open models capable of genuin
* The paper proposes an interesting point of view. * The experiments test on six benchmark and report both accuracy and cost to support claims.
* The OneFlow methods composes of two parts: search for optimized workflow and perform single LLM implementation. The first part seems like an improved version of Aflow and lacks novelty, for example, the critic prompt is adopted from AFlow. * The costs for single-agent are simulated due to closed-weight APIs; add open-weight runs (or vendor KV-sharing APIs) to validate real-world latency/$ savings * While the method mentions tool calling, the benchmark tested are static QA/math/code; include to
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMulti-Agent Systems and Negotiation · Semantic Web and Ontologies · Topic Modeling
