Statistical Scouting Finds Debate-Safe but Not Debate-Useful Cases: A Matched-Ceiling Study of Open-Weight LLM Reasoning Protocols
Julia Hu, Alfred Shen, Kumar Lakshmipathi

TL;DR
Under a matched ceiling of 960 generated tokens per example, this study compares greedy decoding, three-sample voting, and two-agent debate on open-weight LLMs. Vote entropy indicates when debate is safe but not when it is needed, so the per-example routing headroom is hard to recover from cheap pre-deliberation signals.
Contribution
A matched-ceiling analysis of debate protocols showing that vote entropy predicts when debate is safe but not when it is necessary, and that the benefit of debate is difficult to recover from cheap signals.
Findings
Vote entropy predicts debate safety but not necessity.
High vote entropy reduces debate backfire.
Most debate-helpful examples occur with unanimous but wrong votes.
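To make the vote-entropy signal concrete, here is a minimal sketch. It assumes vote entropy means the Shannon entropy (in bits) of the empirical answer distribution across the sampled votes; the paper's exact definition is not given in this summary, and the function name `vote_entropy` is hypothetical.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Shannon entropy (bits) of the empirical distribution over sampled answers.

    Assumption: answers are already normalized so that equal answers
    compare equal as strings.
    """
    counts = Counter(votes)
    total = len(votes)
    # Sum p * log2(1/p) over each distinct answer.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Three-sample voting, as in the study's voting protocol.
unanimous = vote_entropy(["Paris", "Paris", "Paris"])  # 0.0
split = vote_entropy(["Paris", "Paris", "Lyon"])       # about 0.918
scattered = vote_entropy(["Paris", "Lyon", "Nice"])    # log2(3)
```

Under this definition a unanimous vote always has zero entropy whether or not the answer is right, which is consistent with the finding that unanimous but wrong votes are invisible to this signal.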
Abstract
When should a language model answer directly, sample and vote, or engage in multi-agent debate? Recent work shows voting often explains much of the gain attributed to debate, while selective-debate systems activate deliberation only on uncertain examples. We ask: under a matched ceiling on generated tokens (960 per example), how much per-example routing headroom exists, and how much is recoverable from cheap pre-deliberation signals? We evaluate greedy decoding, three-sample voting, and a two-agent critique-revise debate on MuSiQue and GSM8K using Llama 3.1 8B Instruct and Ministral 3 8B Instruct. On MuSiQue, an oracle selecting the correct protocol per example gains +14.0 and +13.7 pp over the best fixed one. The best fixed protocol is model- and dataset-dependent: each (model, dataset) cell has a different winner. This headroom is hard to recover from cheap ex-ante signals. A…
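The oracle headroom described in the abstract can be illustrated with a short sketch. It assumes the oracle scores an example as correct whenever at least one protocol answers it correctly, and that headroom is the oracle's accuracy minus that of the best fixed protocol, in percentage points. The function name and the toy correctness values are hypothetical and are not the paper's data.

```python
def routing_headroom(correct):
    """Return (best fixed protocol, oracle headroom in percentage points).

    correct: dict mapping protocol name -> list of per-example booleans,
    all lists the same length and aligned by example.
    """
    protocols = list(correct)
    n = len(correct[protocols[0]])
    # Accuracy of always using a single protocol.
    fixed = {p: sum(correct[p]) / n for p in protocols}
    best = max(fixed, key=fixed.get)
    # Oracle: per example, pick any protocol that gets it right (assumption).
    oracle = sum(any(correct[p][i] for p in protocols) for i in range(n)) / n
    return best, 100 * (oracle - fixed[best])


# Toy, made-up correctness table for five examples.
toy = {
    "greedy": [True, False, True, False, False],
    "vote":   [True, True, True, False, False],
    "debate": [False, False, True, True, False],
}
best, headroom_pp = routing_headroom(toy)
```

In this toy table the best fixed protocol is voting at 60% accuracy, the oracle reaches 80%, and the headroom is about 20 pp. A router built on cheap signals can at best approach the oracle, which is why the gap between the two is the quantity of interest.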
