SCATR: Simple Calibrated Test-Time Ranking
Divya Shyamal, Marta Knežević, Lan Tran, Chanakya Ekbote, Vijay Lingam, Paul Pu Liang

TL;DR
SCATR is an efficient method that learns a lightweight scorer from a small calibration set to improve test-time ranking of candidate responses from large language models, offering a strong balance between accuracy and efficiency.
Contribution
Introduces SCATR, a simple, calibration-based ranking method that rivals learned scorers at significantly lower training and inference cost.
Findings
SCATR improves on confidence heuristics by up to 9% on benchmarks.
Achieves comparable accuracy to LoRA fine-tuning with 8000x fewer parameters.
Reduces training and inference latency by up to 150x and 1000x, respectively.
Abstract
Test-time scaling (TTS) improves large language models (LLMs) by allocating additional compute at inference time. In practice, TTS is often achieved through parallel scaling: generating multiple candidate responses and selecting the best via a Best-of-N (BoN) strategy. Its effectiveness therefore hinges on the scoring function. Learned scorers such as process reward models (PRMs) can be strong, but they are expensive to train and run. Lightweight confidence heuristics based on token log-probabilities are much cheaper, yet we find that they often perform substantially worse. To improve on lightweight confidence heuristics without incurring the full cost of stronger learned scorers, we introduce SCATR, a simple and efficient BoN ranking method that learns a lightweight scorer from a small calibration set using hidden representations from the base model. Across coding and mathematical…
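The Best-of-N pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the scorer's architecture or how hidden representations are pooled, so the sketch assumes a logistic-regression probe over a fixed-length feature vector per response. All function names (`best_of_n`, `mean_logprob`, `fit_linear_scorer`, `score`) and the toy feature values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def best_of_n(candidates: Sequence[str], scores: Sequence[float]) -> str:
    """Best-of-N selection: return the candidate with the highest score."""
    if not candidates or len(candidates) != len(scores):
        raise ValueError("need at least one candidate and one score per candidate")
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index]


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Confidence heuristic: average token log-probability of one response."""
    return sum(token_logprobs) / len(token_logprobs)


def _sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_linear_scorer(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    lr: float = 0.5,
    epochs: int = 500,
) -> Tuple[List[float], float]:
    """ASSUMPTION: a logistic-regression probe trained by batch gradient descent.

    `features` stand in for pooled hidden representations of calibration
    responses; `labels` mark each response as correct (1) or incorrect (0).
    """
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def score(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Estimated probability that the response with features `x` is correct."""
    return _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Toy calibration set (made-up 2-D features, not real hidden states).
calib_x = [[1.0, 0.2], [0.9, -0.1], [-1.0, 0.3], [-0.8, 0.0]]
calib_y = [1, 1, 0, 0]
w, b = fit_linear_scorer(calib_x, calib_y)

# Rank N = 3 candidate responses with the fitted scorer and keep the best.
candidates = ["answer A", "answer B", "answer C"]
cand_feats = [[-0.5, 0.1], [0.7, 0.0], [0.1, 0.2]]
chosen = best_of_n(candidates, [score(f, w, b) for f in cand_feats])
```

Swapping the learned scores for `mean_logprob` values in the call to `best_of_n` gives the confidence-heuristic baseline the abstract compares against; only the scoring function changes, while the selection step stays the same.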
