Multi-Agent Reasoning Improves Compute Efficiency: Pareto-Optimal Test-Time Scaling
Florian Valentin Wunderlich, Lars Benedikt Kaesberg, Jan Philip Wahle, Terry Ruas, Bela Gipp

TL;DR
This paper systematically analyzes inference scaling strategies, including multi-agent ones, for their trade-off between accuracy and compute cost in language models, and identifies Pareto-optimal configurations across benchmarks and model sizes.
Contribution
The paper presents a comprehensive evaluation of inference scaling methods, showing where multi-agent debate and mixture-of-agents outperform traditional approaches such as self-consistency under a fixed compute budget.
Findings
Inference scaling improves accuracy by up to 7.1% on MMLU-Pro.
Debate and mixture-of-agents outperform self-consistency at equal compute budgets.
Mixture-of-agents is most efficient when parallel generations exceed sequential aggregations.
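Self-consistency, the baseline in these comparisons, is commonly implemented as a majority vote over several independently sampled answers. The sketch below illustrates only that voting step; the function name `majority_vote` and the example answers are hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among sampled generations.

    Ties are broken in favor of the answer that appeared first,
    which is an arbitrary choice made for this sketch.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common(1) returns a list with one (answer, count) pair
    return counts.most_common(1)[0][0]


# Hypothetical final answers extracted from four sampled generations
sampled = ["B", "A", "B", "C"]
final_answer = majority_vote(sampled)
```

Each additional sample adds one full generation to the compute budget, which is why the comparison to debate and mixture-of-agents is made at equal compute.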
Abstract
Advances in inference methods have enabled language models to improve their predictions without additional training. These methods often prioritize raw performance over cost-effective compute usage. However, computational efficiency is key for real-world applications with resource constraints. We provide a systematic analysis of four inference scaling strategies (self-consistency, self-refinement, multi-agent debate, and mixture-of-agents) to study the trade-off between their compute cost and performance. We evaluate the methods on two reasoning benchmarks (MMLU-Pro, BBH) and include extensive parameter configurations (e.g., scaling the number of parallel predictions, agents, and debate rounds) across different model sizes. Across 34 configurations and over 100 evaluations, we compute the Pareto-optimal front to select methods that achieve the best accuracy with the lowest computational budget. Notably,…
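A Pareto-optimal front over (compute cost, accuracy) pairs keeps every configuration for which no other configuration is at least as cheap and at least as accurate while being strictly better on one of the two. The following is a minimal sketch of that selection, not the authors' implementation; the `Config` type, the configuration names, and all numbers are made up for illustration and do not come from the paper.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str        # hypothetical configuration label
    cost: float      # compute budget (lower is better), arbitrary units
    accuracy: float  # benchmark accuracy (higher is better)


def pareto_front(configs):
    """Return the configurations not dominated by any other one."""
    # Sort by ascending cost; at equal cost, put higher accuracy first.
    ordered = sorted(configs, key=lambda c: (c.cost, -c.accuracy))
    front = []
    best_accuracy = float("-inf")
    for c in ordered:
        # A configuration is kept only if it beats every cheaper
        # (or equally cheap) configuration on accuracy.
        if c.accuracy > best_accuracy:
            front.append(c)
            best_accuracy = c.accuracy
    return front


# Illustrative values only, not results from the paper
configs = [
    Config("cfg_a", 1.0, 0.60),
    Config("cfg_b", 2.0, 0.58),  # dominated by cfg_a
    Config("cfg_c", 3.0, 0.66),
    Config("cfg_d", 3.0, 0.64),  # dominated by cfg_c
    Config("cfg_e", 5.0, 0.66),  # same accuracy as cfg_c at higher cost
]
front_names = [c.name for c in pareto_front(configs)]
```

Sorting once and sweeping with a running maximum keeps the selection at O(n log n), which is more than sufficient for a few dozen configurations.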
