More Test-Time Compute Can Hurt: Overestimation Bias in LLM Beam Search
Gal Dalal, Assaf Hallak, Gal Chechik, Yftah Ziser

TL;DR
This paper reveals that increasing beam width in large language models can actually harm output quality due to overestimation bias caused by scorer noise, and it provides a theoretical framework to determine optimal beam width based on scorer signal-to-noise ratio.
Contribution
The paper introduces a novel analysis based on Extreme Value Theory that explains when wider beam search degrades performance and offers practical diagnostics for optimal beam width selection.
Findings
Overestimation bias grows with candidate pool size.
Optimal beam width depends on the scorer's signal-to-noise ratio.
Perplexity-guided beam search shows diminishing benefit at any width, while PRM-guided beam search improves with larger beams.
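As a rough illustration of why the useful width in the findings above would scale exponentially with the signal-to-noise ratio, consider a heuristic sketch. It assumes i.i.d. Gaussian scorer noise, and the symbols $k$ (pool size), $\sigma$ (noise scale), and $\Delta$ (quality advantage of correct paths) are chosen here for illustration. This is a standard extreme-value approximation, not necessarily the paper's exact derivation or constant.

```latex
% Assumption: noise terms \varepsilon_1,\dots,\varepsilon_k are i.i.d. N(0,\sigma^2).
% Standard asymptotic for the expected maximum of k Gaussians:
\mathbb{E}\Big[\max_{1 \le i \le k} \varepsilon_i\Big] \approx \sigma \sqrt{2 \ln k}
% Heuristic: selection noise overwhelms the true advantage \Delta once
\sigma \sqrt{2 \ln k} \gtrsim \Delta
\quad\Longleftrightarrow\quad
k \gtrsim \exp\!\Big(\frac{\Delta^2}{2\sigma^2}\Big)
```

Under these assumptions the break-even pool size is exponential in the squared ratio $\Delta/\sigma$, so a small improvement in scorer quality permits a much wider beam.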
Abstract
Wider beam search should improve LLM reasoning, but when should you stop widening? Prior work on beam width selection has focused on inference efficiency \citep{qin2025dsbd, freitag2017beam}, without analyzing whether wider search can hurt output quality. We present an analysis, grounded in Extreme Value Theory, that answers this question. Beam selection over noisy scorer outputs introduces a systematic overestimation bias that grows with the candidate pool size, and we derive a maximum useful beam width beyond which search degrades performance. This critical width depends on the signal-to-noise ratio of the scorer: it grows exponentially as the quality advantage of correct paths over incorrect ones increases relative to the scorer noise. We validate this theory by comparing perplexity-guided and PRM-guided beam search…
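The overestimation effect described in the abstract can be sketched with a small simulation. This is a toy model, not the paper's experimental setup: it assumes i.i.d. Gaussian scorer noise, a pool with exactly one correct candidate, and the hypothetical function names `selection_bias` and `top_is_correct_rate`.

```python
import math
import random
import statistics


def selection_bias(k, sigma, trials=2000, seed=0):
    """Average of the best noisy score among k candidates whose true quality is 0.

    Any positive value is pure overestimation: the selected score looks
    better than the candidate really is. Noise model (assumed): N(0, sigma^2).
    """
    rng = random.Random(seed)
    return statistics.fmean(
        max(rng.gauss(0.0, sigma) for _ in range(k)) for _ in range(trials)
    )


def top_is_correct_rate(k, delta, sigma, trials=2000, seed=0):
    """Fraction of trials in which the single correct candidate scores highest.

    Toy assumption: one correct candidate with true quality `delta` and
    k - 1 incorrect candidates with true quality 0, all scored with noise.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        correct = delta + rng.gauss(0.0, sigma)
        best_wrong = max(
            (rng.gauss(0.0, sigma) for _ in range(k - 1)), default=-math.inf
        )
        wins += correct > best_wrong
    return wins / trials


if __name__ == "__main__":
    for k in (1, 4, 16, 64, 256):
        bias = selection_bias(k, sigma=1.0)
        rate = top_is_correct_rate(k, delta=1.0, sigma=1.0)
        print(f"k={k:4d}  bias={bias:5.2f}  top-is-correct={rate:.2f}")
```

In this sketch the bias rises with the pool size while the chance that the top-scored candidate is the correct one falls. A real beam also gains additional correct candidates as it widens, which the toy model deliberately ignores.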
Taxonomy
Topics: Natural Language Processing Techniques · Machine Learning and Data Classification · Machine Learning and Algorithms
