OptScale: Probabilistic Optimality for Inference-time Scaling
Youkang Wang, Jian Wang, Rubing Chen, Xiao-Yong Wei

TL;DR
This paper introduces OptScale, a probabilistic framework and algorithm for principled inference-time scaling of LLMs, optimizing sampling efficiency while maintaining reasoning performance.
Contribution
It derives the first theoretical lower bound on the number of samples needed to reach a target performance level and develops OptScale, a practical method for dynamic, optimal sampling in LLM inference.
Findings
OptScale reduces sampling overhead significantly.
It maintains or improves reasoning performance compared to state-of-the-art methods.
Experimental results on reasoning benchmarks validate the effectiveness of OptScale.
Abstract
Inference-time scaling has emerged as a powerful technique for enhancing the reasoning performance of Large Language Models (LLMs). However, existing approaches often rely on heuristic strategies for parallel sampling, lacking a principled foundation. To address this gap, we propose a probabilistic framework that formalizes the optimality of inference-time scaling under the assumption that parallel samples are independently and identically distributed (i.i.d.) and that the Best-of-N selection strategy follows a probability distribution that can be estimated. Within this framework, we derive a theoretical lower bound on the number of samples required to achieve a target performance level, providing the first principled guidance for compute-efficient scaling. Leveraging this insight, we develop OptScale, a practical algorithm that dynamically determines the optimal number of…
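The idea of a sample lower bound under the i.i.d. assumption can be illustrated with a generic order-statistics argument. This sketch is not the paper's derivation. It assumes, hypothetically, that each sample independently "succeeds" (for example, its verifier score clears a threshold) with probability p, and that Best-of-N succeeds if at least one sample does. Then P(success with N samples) = 1 - (1 - p)^N, and the smallest N that reaches a target τ is ⌈log(1 - τ) / log(1 - p)⌉. The helper names `min_samples` and `estimate_p`, the score threshold, and the Laplace-smoothed estimate of p from a small pilot batch are illustrative assumptions. OptScale's actual estimator and stopping rule may differ.

```python
import math


def min_samples(p_success: float, target: float) -> int:
    """Smallest N with 1 - (1 - p_success)**N >= target.

    Hypothetical model: i.i.d. samples, each clearing a quality threshold
    with probability p_success, and an ideal Best-of-N selector that
    succeeds whenever at least one sample clears it.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")
    if p_success <= 0.0:
        raise ValueError("p_success must be positive for a finite N")
    if p_success >= 1.0:
        return 1
    # Closed-form bound: (1 - p)^N <= 1 - target  =>  N >= log(1 - target) / log(1 - p).
    n = max(1, math.ceil(math.log(1.0 - target) / math.log(1.0 - p_success)))
    # Guard against floating-point rounding in the log ratio.
    while n > 1 and 1.0 - (1.0 - p_success) ** (n - 1) >= target:
        n -= 1
    while 1.0 - (1.0 - p_success) ** n < target:
        n += 1
    return n


def estimate_p(pilot_scores: list[float], threshold: float) -> float:
    """Laplace-smoothed fraction of pilot samples whose (hypothetical)
    verifier score clears the threshold; smoothing keeps the estimate in (0, 1)."""
    hits = sum(1 for s in pilot_scores if s >= threshold)
    return (hits + 1) / (len(pilot_scores) + 2)


if __name__ == "__main__":
    pilot = [0.2, 0.9, 0.4, 0.7]  # hypothetical verifier scores from a pilot batch
    p_hat = estimate_p(pilot, threshold=0.8)  # (1 + 1) / (4 + 2) = 1/3
    print(min_samples(p_hat, target=0.95))  # prints 8
```

Estimating p from a cheap pilot batch and then sizing the remaining budget is one simple way to make the sample count depend on the query; a fixed N ignores how easy or hard each problem is.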
