Stop Overvaluing Multi-Agent Debate -- We Must Rethink Evaluation and Embrace Model Heterogeneity

Hangfan Zhang; Zhiyao Cui; Jianhao Chen; Xinrun Wang; Qiaosheng Zhang; Zhen Wang; Dinghao Wu; Shuyue Hu

arXiv:2502.08788 · cs.CL · June 24, 2025

TL;DR

This paper critically evaluates multi-agent debate (MAD) methods, showing that they often fail to outperform simple single-agent baselines such as Chain-of-Thought and Self-Consistency, and argues that model heterogeneity is key to meaningful progress in the field.

Contribution

The paper provides a systematic evaluation of 5 representative MAD methods across 9 benchmarks using 4 foundational models, highlights the limitations of current evaluation practices, and advocates embracing model heterogeneity in future research.

Findings

01

MAD methods often do not outperform simple single-agent baselines

02

Model heterogeneity improves MAD performance

03

Current evaluation practices are insufficient

Abstract

Multi-agent debate (MAD) has gained significant attention as a promising line of research to improve the factual accuracy and reasoning capabilities of large language models (LLMs). Despite its conceptual appeal, current MAD research suffers from critical limitations in evaluation practices, including limited benchmark coverage, weak baseline comparisons, and inconsistent setups. This paper presents a systematic evaluation of 5 representative MAD methods across 9 benchmarks using 4 foundational models. Surprisingly, our findings reveal that MAD methods often fail to outperform simple single-agent baselines such as Chain-of-Thought and Self-Consistency, even when consuming significantly more inference-time computation. To advance MAD research, we further explore the role of model heterogeneity and find that it acts as a universal antidote that consistently improves current MAD frameworks. Based on our…
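To make the comparison in the abstract concrete, the following is a minimal sketch of a Self-Consistency baseline next to a toy debate loop. It is an illustration under stated assumptions, not the paper's implementation or any of the 5 evaluated MAD methods: the `Agent` callable type, the `sample` function, and the round structure are all hypothetical stand-ins for real LLM calls.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical agent interface (an assumption, not from the paper):
# takes a question plus the peers' previous answers and returns an answer.
Agent = Callable[[str, Sequence[str]], str]


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    counts = Counter(answers)
    return max(counts, key=counts.get)


def self_consistency(sample: Callable[[str], str], question: str, n: int) -> str:
    """Single-agent baseline: draw n independent answers and take a vote."""
    return majority_vote([sample(question) for _ in range(n)])


def debate(agents: List[Agent], question: str, rounds: int) -> str:
    """Toy debate: in each round every agent sees the previous round's answers."""
    # Opening round: no peer answers are available yet.
    answers = [agent(question, []) for agent in agents]
    for _ in range(rounds):
        answers = [agent(question, answers) for agent in agents]
    return majority_vote(answers)
```

In this sketch, a heterogeneous setup would simply mean that the callables in `agents` are backed by different underlying models, whereas a homogeneous setup would reuse one model for every agent. Note that `debate` makes `len(agents) * (rounds + 1)` calls, so a fair comparison would give `self_consistency` a comparable sample budget `n`.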

Taxonomy

Topics: Game Theory and Applications · Auction Theory and Applications · Economic Policies and Impacts