What should post-training optimize? A test-time scaling law perspective

Muheng Li; Jian Qian; Wenlong Mou

arXiv:2605.10716·cs.LG·May 12, 2026

TL;DR

This paper introduces tail-extrapolated estimators for post-training language models that are deployed with best-of-N response selection. The estimators address the mismatch between the small per-prompt rollout budget available during training and the larger budget used at test time.

Contribution

The paper proposes tail-extrapolation methods for post-training that improve best-of-N performance when only a few rollouts per prompt are available during training.

Findings

01. TEA and Prefix-TEA improve best-of-N performance across models and datasets.

02. The methods extrapolate upper-tail reward statistics from small rollout groups.

03. Experiments demonstrate robustness under various training and test-time budget settings.

Abstract

Large language models are increasingly deployed with test-time strategies: sample $N$ responses, score them with a reward model or verifier, and return the best. This deployment rule exposes a mismatch in post-training: standard objectives optimize the mean reward of a single response, whereas best-of-$N$ performance is governed by the upper tail of the reward distribution. Recent test-time-aware objectives partly address this mismatch, but typically assume that training can use the same per-prompt rollout budget as deployment, which is impractical when post-training must cover many prompts while deployment can allocate much larger per-prompt test-time compute. We study this budget-mismatch regime, where only $m \ll N$ per-prompt rollouts are available during training but the target objective is best-of-$N$ deployment. Under structural assumptions on the reward tails, we show that the…
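The deployment rule in the abstract, and the gap between mean reward and best-of-$N$ reward, can be illustrated with a small simulation. This is a minimal sketch and not the paper's estimator: the function names `best_of_n` and `mean_vs_best_of_n`, and the two toy reward distributions `narrow` and `wide`, are assumptions introduced here for illustration only.

```python
import random
from typing import Callable, Sequence, Tuple


def best_of_n(responses: Sequence[str], score: Callable[[str], float]) -> str:
    # Deployment rule: score every sampled response and return the top one.
    return max(responses, key=score)


def mean_vs_best_of_n(
    draw_reward: Callable[[random.Random], float],
    n: int,
    trials: int = 2000,
    seed: int = 0,
) -> Tuple[float, float]:
    # Monte Carlo estimates of (mean single-response reward, mean best-of-n reward).
    rng = random.Random(seed)
    single_total = 0.0
    best_total = 0.0
    for _ in range(trials):
        rewards = [draw_reward(rng) for _ in range(n)]
        single_total += rewards[0]   # reward of one sampled response
        best_total += max(rewards)   # reward after best-of-n selection
    return single_total / trials, best_total / trials


# Two hypothetical policies with the same mean reward (0.5) but different tails.
def narrow(rng: random.Random) -> float:
    return rng.uniform(0.4, 0.6)


def wide(rng: random.Random) -> float:
    return rng.uniform(0.0, 1.0)


if __name__ == "__main__":
    for name, policy in [("narrow", narrow), ("wide", wide)]:
        mean_reward, bon_reward = mean_vs_best_of_n(policy, n=16)
        print(f"{name}: mean={mean_reward:.3f} best-of-16={bon_reward:.3f}")
```

Under these toy assumptions the two policies are indistinguishable to a mean-reward objective, yet the wider distribution yields a clearly higher best-of-16 reward, which is the sense in which the upper tail governs best-of-$N$ performance.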
