Self-Consistency Is Losing Its Edge: Diminishing Returns and Rising Costs in Modern LLMs
Chiyan Loo

TL;DR
This paper argues that self-consistency yields diminishing accuracy returns at rising token cost as large language models improve, particularly on problems they already solve reliably, and that multi-path sampling should therefore be applied more selectively.
Contribution
The study shows that, on modern models, increasing the number of sampled reasoning paths yields minimal accuracy improvements at higher cost, and it advocates applying self-consistency selectively rather than indiscriminately.
Findings
Accuracy gains from sampling more reasoning paths are minimal with Gemini 2.5 models: 0.4% on HotpotQA across 20 samples and 1.6% on MATH-500.
Token costs scale nearly linearly with sample count.
Performance plateaus early and, in some configurations, declines at high sample counts, suggesting that additional paths add noise rather than signal.
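The near-linear cost scaling noted above can be illustrated with a rough token-count model. This is a minimal sketch, not the paper's cost accounting: the function names, the `prompt_billed_once` flag, and the token counts in the usage lines are all hypothetical and chosen only for illustration.

```python
def total_tokens(n_samples, prompt_tokens, completion_tokens,
                 prompt_billed_once=False):
    """Estimate tokens consumed when sampling n_samples reasoning paths.

    Assumption (not from the paper): every path uses the same number of
    completion tokens, and the prompt is either re-sent for each path or
    billed once (for example, when one request returns several candidates).
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    if prompt_billed_once:
        return prompt_tokens + n_samples * completion_tokens
    return n_samples * (prompt_tokens + completion_tokens)


def cost_ratio(n_samples, prompt_tokens, completion_tokens,
               prompt_billed_once=False):
    """Token cost of n_samples paths relative to a single path."""
    single = total_tokens(1, prompt_tokens, completion_tokens)
    many = total_tokens(n_samples, prompt_tokens, completion_tokens,
                        prompt_billed_once)
    return many / single


# Hypothetical token counts: 200 prompt tokens, 800 completion tokens per path.
print(cost_ratio(20, 200, 800))                           # prompt re-sent per path
print(cost_ratio(20, 200, 800, prompt_billed_once=True))  # prompt billed once
```

Under either assumption the cost grows in proportion to the number of paths, which is why a small accuracy gain spread over many samples is expensive per point gained.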
Abstract
Self-consistency -- sampling multiple reasoning paths and selecting the most frequent answer -- was designed for an era when language models made frequent, unpredictable errors. This study argues that the technique has become increasingly wasteful as models grow stronger, and may degrade performance on problems that modern models already solve reliably. Using Gemini 2.5 models on HotpotQA and MATH-500, we show that accuracy gains from increasing the number of sampled reasoning paths are minimal -- 0.4% on HotpotQA across 20 samples, and 1.6% on MATH-500 -- while token costs scale nearly linearly with sample count. Critically, performance plateaued early and in some configurations declined at high sample counts, suggesting that additional paths introduce noise rather than signal when models already solve problems reliably. As inference costs rise with model scale, indiscriminate…
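The voting step that the abstract describes (selecting the most frequent answer among sampled reasoning paths) can be sketched as follows. This is an illustrative sketch only: `majority_vote` and its normalization step are assumptions, not the paper's implementation, and the sampled answers are supplied as plain strings rather than produced by a model.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among sampled reasoning paths.

    Assumption: answers are compared after trimming whitespace and
    lowercasing. Ties go to the answer that was sampled first, because
    Counter.most_common keeps first-encountered order for equal counts.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    return counts.most_common(1)[0][0]


# Hypothetical final answers extracted from five sampled paths.
sampled = ["42", "42 ", "41", "42", "17"]
print(majority_vote(sampled))
```

When a model already answers a question correctly on nearly every path, the vote returns the same answer as a single sample would, so the extra paths add cost without changing the outcome.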
