M3MAD-Bench: Are Multi-Agent Debates Really Effective Across Domains and Modalities?
Ao Li, Jinghui Zhang, Luyu Li, Yuxiang Duan, Lang Gao, Mingcai Chen, Weijun Qin, Shaopeng Li, Fengxian Ji, Ning Liu, Lizhen Cui, Xiuying Chen, Yuntao Du

TL;DR
M3MAD-Bench is a benchmark for evaluating multi-agent debate methods across multiple domains and modalities. It addresses two limitations of prior work: fragmented, inconsistent evaluation settings and a restriction to text-only inputs.
Contribution
The paper introduces a standardized, extensible benchmark for evaluating multi-agent debate across diverse tasks, modalities, and metrics, enabling fairer and more comprehensive comparisons.
Findings
MAD methods show varying effectiveness across domains and modalities.
Multimodal inputs improve debate quality in complex reasoning tasks.
Efficiency metrics reveal trade-offs between performance and resource consumption.
Abstract
As an agent-level reasoning and coordination paradigm, Multi-Agent Debate (MAD) orchestrates multiple agents through structured debate to improve answer quality and support complex reasoning. However, existing research on MAD suffers from two fundamental limitations: evaluations are conducted under fragmented and inconsistent settings, hindering fair comparison, and are largely restricted to single-modality scenarios that rely on textual inputs only. To address these gaps, we introduce M3MAD-Bench, a unified and extensible benchmark for evaluating MAD methods across Multi-domain tasks, Multi-modal inputs, and Multi-dimensional metrics. M3MAD-Bench establishes standardized protocols over five core task domains: Knowledge, Mathematics, Medicine, Natural Sciences, and Complex Reasoning, and systematically covers both pure text and vision-language datasets, enabling controlled…
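To make the debate paradigm described in the abstract concrete, here is a minimal sketch of one common debate pattern: agents answer independently, then revise over a fixed number of rounds after seeing their peers' answers, and a majority vote picks the final answer. This is an illustrative assumption, not the protocol used by M3MAD-Bench or by any specific MAD method. The names `run_debate`, `majority_vote`, `stubborn`, and `follower` are hypothetical, and the toy agents stand in for model-backed agents.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical agent interface: (question, peers' previous answers) -> answer.
Agent = Callable[[str, List[str]], str]


def majority_vote(answers: List[str]) -> str:
    """Return the most common answer; ties go to the earliest-seen answer."""
    return Counter(answers).most_common(1)[0][0]


def run_debate(question: str, agents: List[Agent], rounds: int = 2) -> str:
    # Initial round: every agent answers independently (no peer answers yet).
    answers = [agent(question, []) for agent in agents]
    for _ in range(rounds):
        # Each agent sees the other agents' previous answers and may revise.
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    return majority_vote(answers)


# Toy stand-ins for model-backed agents (purely illustrative).
def stubborn(answer: str) -> Agent:
    # Always returns the same answer, regardless of peers.
    return lambda question, peers: answer


def follower(initial: str) -> Agent:
    # Adopts the peers' majority answer once peers have spoken.
    return lambda question, peers: majority_vote(peers) if peers else initial


agents = [stubborn("4"), stubborn("4"), follower("5")]
final_answer = run_debate("What is 2 + 2?", agents)
```

In this toy run the dissenting agent switches to the majority answer after the first debate round. Real MAD methods differ in how agents exchange arguments and how the final answer is aggregated.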
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Multi-Agent Systems and Negotiation
