Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness

Yongjin Yang; Euiin Yi; Jongwoo Ko; Kimin Lee; Zhijing Jin; Se-Young Yun

arXiv:2505.22960·cs.AI·June 23, 2025

Open Access

TL;DR

This paper systematically evaluates multi-agent debate (MAD) as a test-time scaling method, identifying its strengths and limitations across tasks, model sizes, and agent configurations to guide future MAD system development.

Contribution

The paper provides a comprehensive empirical comparison of MAD against self-agent test-time scaling methods, identifying the conditions under which MAD excels or falls short.

Findings

01

MAD offers limited benefits for mathematical reasoning, though its benefit grows as task difficulty increases.

02

Agent diversity shows little impact on mathematical reasoning performance.

03

In safety tasks, MAD can increase vulnerability, but diverse agent configurations help reduce attack success.

Abstract

The remarkable growth in large language model (LLM) capabilities has spurred exploration into multi-agent systems, with debate frameworks emerging as a promising avenue for enhanced problem-solving. These multi-agent debate (MAD) approaches, where agents collaboratively present, critique, and refine arguments, potentially offer improved reasoning, robustness, and diverse perspectives over monolithic models. Despite prior studies leveraging MAD, a systematic understanding of its effectiveness compared to self-agent methods, particularly under varying conditions, remains elusive. This paper seeks to fill this gap by conceptualizing MAD as a test-time computational scaling technique, distinguished by collaborative refinement and diverse exploration capabilities. We conduct a comprehensive empirical investigation comparing MAD with strong self-agent test-time scaling baselines on…
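
As a rough illustration of the present, critique, and refine loop the abstract describes, the following is a minimal sketch and not the authors' implementation. The `Agent` signature, the `debate` and `majority_vote` helpers, and the stub agents `stubborn` and `conformist` are hypothetical stand-ins for LLM calls, and aggregating final answers by majority vote is an assumption made here for simplicity.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical agent interface (an assumption, not from the paper):
# takes the question and the other agents' latest answers,
# and returns this agent's (possibly revised) answer.
Agent = Callable[[str, List[str]], str]


def majority_vote(answers: List[str]) -> str:
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def debate(question: str, agents: List[Agent], rounds: int = 2) -> str:
    """Run a simple debate: independent answers, then rounds of revision."""
    # Initial round: each agent answers without seeing any peers.
    answers = [agent(question, []) for agent in agents]
    for _ in range(rounds):
        # Each agent sees the other agents' latest answers and may revise.
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    # Assumed aggregation rule: majority vote over the final answers.
    return majority_vote(answers)


def stubborn(answer: str) -> Agent:
    """Stub agent that never changes its answer."""
    return lambda question, peers: answer


def conformist(initial: str) -> Agent:
    """Stub agent that adopts the majority of its peers once it sees them."""
    return lambda question, peers: majority_vote(peers) if peers else initial


if __name__ == "__main__":
    agents = [stubborn("4"), stubborn("4"), conformist("5")]
    print(debate("What is 2 + 2?", agents, rounds=1))
```

With `rounds=0` the same function reduces to independent sampling followed by a majority vote, which is the kind of self-agent baseline a debate would be compared against under an equal number of agent calls per round.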

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Reinforcement Learning in Robotics · Topic Modeling