Ranking Reasoning LLMs under Test-Time Scaling

Mohsen Hariri; Michael Hinczewski; Jing Ma; Vipin Chaudhary

arXiv:2603.10960 · cs.LG · May 12, 2026

TL;DR

This paper introduces Scorio, a library for ranking reasoning large language models under test-time scaling, and shows that most of its ranking methods agree closely with a Bayesian gold standard on four Olympiad-style math benchmarks.

Contribution

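The paper formalizes dense benchmark ranking under test-time scaling, evaluates statistical ranking methods (paired-comparison models, IRT models, voting rules, and graph- and spectral-based methods), and releases Scorio as an open-source library.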

Findings

01

Most full-trial rankings closely match the Bayesian gold standard Bayes_U@80 (mean Kendall's τ_b = 0.93 to 0.95).

02

Between 19 and 34 methods recover exactly the same ordering in the full-trial regime.

03

Using greedy decoding as an empirical prior (Bayes_R0@N) reduces variance but can bias rankings relative to stochastic sampling.
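The agreement figures above are Kendall's τ_b values between a method's ranking and the gold-standard ranking. As a rough illustration of what that statistic measures, here is a minimal pure-Python sketch. The function name `kendall_tau_b` and the toy rank lists are hypothetical and are not taken from Scorio; the library's own API may differ.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length lists of ranks or scores.

    Hypothetical helper for illustration only; not part of Scorio.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    concordant = discordant = 0
    tied_only_x = tied_only_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominator terms
        if dx == 0:
            tied_only_x += 1
        elif dy == 0:
            tied_only_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1

    # Pairs untied in x times pairs untied in y, under the square root.
    untied_x = concordant + discordant + tied_only_y
    untied_y = concordant + discordant + tied_only_x
    denom = sqrt(untied_x * untied_y)
    if denom == 0:
        return float("nan")  # undefined when one list is constant
    return (concordant - discordant) / denom


# Toy example: five models, with the method swapping the last two places.
gold_ranks = [1, 2, 3, 4, 5]
method_ranks = [1, 2, 3, 5, 4]
tau = kendall_tau_b(gold_ranks, method_ranks)
print(round(tau, 3))
```

With five models there are ten pairs; a single swapped pair leaves nine concordant and one discordant, so τ_b = (9 - 1) / 10 = 0.8. A value of 1 means the two orderings are identical.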

Abstract

Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but ranking models in this regime remains underexplored. We formalize dense benchmark ranking under test-time scaling and introduce Scorio, a library that implements statistical ranking methods such as paired-comparison models, item response theory (IRT) models, voting rules, and graph- and spectral-based methods. Across $20$ reasoning models on four Olympiad-style math benchmarks (AIME'24, AIME'25, HMMT'25, and BrUMO'25; up to $N = 80$ trials), most full-trial rankings agree closely with the Bayesian gold standard $\mathrm{Bayes}_{U}@80$ (mean Kendall's $\tau_{b} = 0.93$--$0.95$), and $19$--$34$ methods recover exactly the same ordering. In the single-trial regime, the best methods reach $\tau_{b} \approx 0.86$. Using greedy decoding as an empirical prior ($\mathrm{Bayes}_{R_{0}}@N$) reduces…
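To make the setting concrete: each model is sampled several times per question, and a ranking is derived from the resulting grid of correct/incorrect outcomes. The sketch below ranks models by a per-question posterior mean under a uniform Beta(1, 1) prior. This is an assumed stand-in chosen for illustration, not the paper's definition of $\mathrm{Bayes}_{U}@N$ and not Scorio's API; the function names, model names, and outcome data are all hypothetical.

```python
def posterior_mean_score(outcomes):
    """Average over questions of the Beta(1, 1) posterior mean success rate.

    outcomes: list of per-question lists of 0/1 trial results.
    Hypothetical helper; an assumed simplification, not the paper's estimator.
    """
    total = 0.0
    for trials in outcomes:
        n = len(trials)      # number of sampled trials for this question
        s = sum(trials)      # number of correct trials
        total += (s + 1) / (n + 2)
    return total / len(outcomes)


def rank_models(results):
    """Return model names sorted from best to worst, plus their scores."""
    scores = {name: posterior_mean_score(o) for name, o in results.items()}
    order = sorted(scores, key=lambda name: scores[name], reverse=True)
    return order, scores


# Made-up data: 3 models, 2 questions, 4 trials per question.
results = {
    "model_a": [[1, 1, 1, 0], [1, 0, 1, 1]],
    "model_b": [[1, 0, 0, 0], [0, 0, 1, 1]],
    "model_c": [[1, 1, 1, 1], [1, 1, 1, 0]],
}
order, scores = rank_models(results)
print(order)
```

When every model has the same number of trials per question, this score is an increasing affine function of mean accuracy, so it orders models the same way plain averaging does. The two can differ once trial counts vary or a non-uniform prior is used.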

Code & Models

Repositories

mohsenhariri/scorio (GitHub)
