TL;DR
This paper introduces an intrinsic reward based on entropy centroids. The method groups high-entropy tokens into clusters to assess model uncertainty during inference more reliably than per-token signals, which improves response selection in large language models.
Contribution
It proposes the entropy centroid as a new measure of model uncertainty, enabling more stable and effective response selection without external reward models.
Findings
The Lowest Centroid method outperforms existing baselines across tasks.
Performance gains are stable and grow with model size.
A lower entropy centroid correlates with higher response quality.
Abstract
An effective way to scale up test-time compute of large language models is to sample multiple responses and then select the best one, as in Grok Heavy and Gemini Deep Think. Existing selection methods often rely on external reward models, which requires training a strong reward model and introduces additional computation overhead. As an alternative, previous approaches have explored intrinsic signals, such as confidence and entropy, but these signals are noisy with naive aggregation. In this work, we observe that high-entropy tokens tend to cluster into consecutive groups during inference, providing a more stable notion of model uncertainty than individual tokens. Together, these clusters reveal temporal patterns of model uncertainty throughout the inference process. Motivated by this observation, we propose to use the temporal structure of uncertainty as an intrinsic reward. To this…
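The selection idea described in the abstract can be sketched roughly as follows. The abstract is truncated before it defines the centroid, so everything specific here is an assumption made for illustration rather than the paper's method: the entropy threshold, the rule that a cluster is a run of consecutive above-threshold tokens, the centroid formula (an entropy-weighted mean of normalized cluster positions), and all function names.

```python
import math
from typing import List, Sequence, Tuple


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_clusters(
    entropies: Sequence[float], threshold: float
) -> List[Tuple[int, int]]:
    """Group above-threshold tokens into runs of consecutive positions.

    Returns (start, end) index pairs, both inclusive.
    """
    clusters: List[Tuple[int, int]] = []
    start = None
    for i, h in enumerate(entropies):
        if h > threshold:
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            clusters.append((start, i - 1))  # the run ended at the previous token
            start = None
    if start is not None:
        clusters.append((start, len(entropies) - 1))
    return clusters


def entropy_centroid(entropies: Sequence[float], threshold: float) -> float:
    """Entropy-weighted mean position of the clusters, scaled to [0, 1].

    ASSUMPTION: this formula is illustrative; the source does not give one.
    Returns 0.0 when the response has no high-entropy cluster.
    """
    clusters = high_entropy_clusters(entropies, threshold)
    if not clusters:
        return 0.0
    last_index = max(len(entropies) - 1, 1)
    weighted_sum = 0.0
    total_mass = 0.0
    for start, end in clusters:
        mass = sum(entropies[start : end + 1])  # total entropy in the cluster
        midpoint = (start + end) / 2 / last_index  # normalized cluster position
        weighted_sum += mass * midpoint
        total_mass += mass
    return weighted_sum / total_mass


def select_lowest_centroid(
    candidates: Sequence[Sequence[float]], threshold: float
) -> int:
    """Return the index of the sampled response with the lowest centroid."""
    scores = [entropy_centroid(e, threshold) for e in candidates]
    return min(range(len(scores)), key=scores.__getitem__)


# Two hypothetical responses, each given as a list of per-token entropies.
early_uncertainty = [0.1, 2.0, 2.0, 0.1, 0.1]
late_uncertainty = [0.1, 0.1, 0.1, 2.0, 2.0]
best = select_lowest_centroid([early_uncertainty, late_uncertainty], threshold=1.0)
```

Under this assumed definition, a response whose uncertainty is concentrated early gets a lower centroid than one whose uncertainty appears late, so `best` points at the first candidate. No external reward model is involved: the score depends only on the token distributions the model already produced while sampling.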
