DeliberationBench: When Do More Voices Hurt? A Controlled Study of Multi-LLM Deliberation Protocols

Vaarunay Kaushal; Taranveer Singh

arXiv:2601.08835 · cs.CL · January 15, 2026

TL;DR

This study evaluates multi-LLM deliberation protocols and finds that simple selection methods outperform complex deliberation, often with less computational cost, challenging assumptions about the benefits of increased complexity.

Contribution

We introduce DELIBERATIONBENCH, a benchmark that systematically compares multi-LLM deliberation protocols against simple baseline methods.

Findings

01

The best-single response baseline achieves a roughly 6x higher win rate than the best deliberation protocol (82.5% vs. 13.8%).

02

Deliberation protocols are 1.5-2.5x more computationally expensive than the baseline.

03

Complex deliberation does not improve, and may harm, system performance.

Abstract

Multi-agent systems where Large Language Models (LLMs) deliberate to form consensus have gained significant attention, yet their practical value over simpler methods remains under-scrutinized. We introduce DELIBERATIONBENCH, a controlled benchmark evaluating three deliberation protocols against a strong baseline of selecting the best response from a pool of model outputs. Across 270 questions and three independent seeds (810 total evaluations), we find a striking negative result: the best-single baseline achieves an 82.5% ± 3.3% win rate, dramatically outperforming the best deliberation protocol (13.8% ± 2.6%). This 6.0x performance gap is statistically significant (p < 0.01) and comes at 1.5-2.5x higher computational cost. Our findings challenge assumptions that complexity enhances quality in multi-LLM systems.
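
The headline figures in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' evaluation code: the variable names and the `mean_and_spread` helper are assumptions, and the source does not say whether the reported ± values are standard deviations across seeds or some other interval, so the helper simply shows one plausible way to aggregate per-seed win rates.

```python
from statistics import mean, stdev

# Figures quoted in the abstract.
QUESTIONS = 270
SEEDS = 3
BEST_SINGLE_WIN_RATE = 0.825        # best-single baseline
BEST_DELIBERATION_WIN_RATE = 0.138  # best deliberation protocol

# 270 questions x 3 seeds gives the 810 total evaluations.
total_evaluations = QUESTIONS * SEEDS

# Ratio of the two win rates, reported as the 6.0x gap.
gap = BEST_SINGLE_WIN_RATE / BEST_DELIBERATION_WIN_RATE


def mean_and_spread(per_seed_rates):
    """Hypothetical helper: mean and sample standard deviation of
    per-seed win rates. Assumes the spread is a standard deviation,
    which the source does not confirm."""
    return mean(per_seed_rates), stdev(per_seed_rates)


print(f"{total_evaluations} evaluations, gap = {gap:.1f}x")
# prints "810 evaluations, gap = 6.0x"
```

The unrounded ratio is about 5.98, which rounds to the 6.0x stated in the abstract.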

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · Advanced Graph Neural Networks · Multimodal Machine Learning Applications