What should post-training optimize? A test-time scaling law perspective
Muheng Li, Jian Qian, Wenlong Mou

TL;DR
This paper introduces Tail-Extrapolated estimators to improve test-time selection in large language models, effectively addressing the budget mismatch between training and deployment for best-of-N response selection.
Contribution
It proposes novel tail-extrapolation methods for post-training optimization, enabling better best-of-N performance with limited training rollouts.
Findings
TEA and Prefix-TEA improve best-of-N performance across models and datasets.
The methods effectively extrapolate upper-tail reward statistics from small rollout groups.
Experiments demonstrate robustness under various training and test-time budget settings.
Abstract
Large language models are increasingly deployed with test-time strategies: sample N responses, score them with a reward model or verifier, and return the best. This deployment rule exposes a mismatch in post-training: standard objectives optimize the mean reward of a single response, whereas best-of-N performance is governed by the upper tail of the reward distribution. Recent test-time-aware objectives partly address this mismatch, but typically assume that training can use the same per-prompt rollout budget as deployment, which is impractical when post-training must cover many prompts while deployment can allocate much larger per-prompt test-time compute. We study this budget-mismatch regime, where only a smaller number of per-prompt rollouts is available during training but the target objective is best-of-N deployment. Under structural assumptions on the reward tails, we show that the…
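The gap between mean reward and best-of-N reward described in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's method: the two Gaussian "policies", their parameters, and the helper names `best_of_n` and `simulate` are hypothetical and chosen only so that both policies share the same mean reward but differ in their upper tails.

```python
import random
import statistics


def best_of_n(rewards):
    """Return (index, reward) of the highest-scoring response."""
    idx = max(range(len(rewards)), key=rewards.__getitem__)
    return idx, rewards[idx]


def simulate(sample_reward, n, trials, seed=0):
    """Monte Carlo estimate of (mean single-response reward, mean best-of-n reward).

    sample_reward: callable taking a random.Random and returning one reward.
    """
    rng = random.Random(seed)
    means, bests = [], []
    for _ in range(trials):
        rewards = [sample_reward(rng) for _ in range(n)]
        means.append(statistics.fmean(rewards))
        bests.append(best_of_n(rewards)[1])
    return statistics.fmean(means), statistics.fmean(bests)


# Hypothetical reward distributions (assumptions for illustration only):
# both have mean 0.5, but policy B has a much heavier upper tail.
def policy_a(rng):
    return rng.gauss(0.5, 0.05)


def policy_b(rng):
    return rng.gauss(0.5, 0.30)


N = 16  # illustrative deployment budget
mean_a, best_a = simulate(policy_a, N, trials=2000)
mean_b, best_b = simulate(policy_b, N, trials=2000)

print(f"policy A: mean={mean_a:.3f}  best-of-{N}={best_a:.3f}")
print(f"policy B: mean={mean_b:.3f}  best-of-{N}={best_b:.3f}")
```

In this toy setting a mean-reward objective cannot distinguish the two policies, while best-of-N selection favors the one with the heavier upper tail. The sketch draws N rewards per trial, which is exactly the budget that the abstract says is unavailable during training.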
