Confident Rankings with Fewer Items: Adaptive LLM Evaluation with Continuous Scores

Esma Balkır; Alice Pernthaller; Marco Basaldella; José Hernández-Orallo; Nigel Collier

arXiv:2601.13885·cs.CL·January 21, 2026

Open Access

TL;DR

This paper extends adaptive testing methods to continuous scoring in LLM evaluation, enabling reliable rankings with fewer items and lower costs by modeling scores with heteroskedastic normal distributions.

Contribution

It introduces a new heteroskedastic normal model for continuous scores and an uncertainty-aware ranker with adaptive stopping, improving efficiency and accuracy in LLM evaluation.
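As a rough illustration of what an uncertainty-aware ranker with adaptive stopping could look like, the sketch below assumes each model's ability estimate is summarized by an independent normal posterior (mean, standard deviation). It declares a pair of models ranked once the probability of one ordering reaches a threshold, and stops when every pair is decided. The function names, the independence assumption, and the 0.95 default threshold are all hypothetical choices for this example, not details taken from the paper.

```python
import math
from itertools import combinations


def prob_greater(mean_i, sd_i, mean_j, sd_j):
    """P(theta_i > theta_j), assuming independent normal posteriors."""
    z = (mean_i - mean_j) / math.sqrt(sd_i ** 2 + sd_j ** 2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def confident_pairs(posteriors, threshold=0.95):
    """Return (winner, loser) pairs whose ordering probability meets the threshold.

    `posteriors` maps a model name to a (mean, sd) tuple; the threshold
    is a hypothetical default chosen for illustration.
    """
    decided = []
    for (name_i, (m_i, s_i)), (name_j, (m_j, s_j)) in combinations(posteriors.items(), 2):
        p = prob_greater(m_i, s_i, m_j, s_j)
        if p >= threshold:
            decided.append((name_i, name_j))
        elif 1.0 - p >= threshold:
            decided.append((name_j, name_i))
    return decided


def should_stop(posteriors, threshold=0.95):
    """Stop testing once every pair of models has a confident ordering."""
    n = len(posteriors)
    return len(confident_pairs(posteriors, threshold)) == n * (n - 1) // 2
```

With wide posteriors, close models stay undecided and testing continues; as the standard deviations shrink with more items, the same means eventually yield a full confident ranking.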

Findings

01  Uses only 2% of items compared to full testing.

02  Improves ranking correlation by 0.12 τ over random sampling.

03  Achieves 95% accuracy on confident predictions.

Abstract

Computerized Adaptive Testing (CAT) has proven effective for efficient LLM evaluation on multiple-choice benchmarks, but modern LLM evaluation increasingly relies on generation tasks where outputs are scored continuously rather than marked correct/incorrect. We present a principled extension of IRT-based adaptive testing to continuous bounded scores (ROUGE, BLEU, LLM-as-a-Judge) by replacing the Bernoulli response distribution with a heteroskedastic normal distribution. Building on this, we introduce an uncertainty-aware ranker with adaptive stopping criteria that achieves reliable model ranking while testing as few items, and as cheaply, as possible. We validate our method on five benchmarks spanning n-gram-based, embedding-based, and LLM-as-judge metrics. Our method uses 2% of the items while improving ranking correlation by 0.12 τ over random sampling, with 95% accuracy on…
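To make the idea of swapping the Bernoulli response for a heteroskedastic normal more concrete, here is a minimal sketch. It assumes a 2PL-style logistic mean with hypothetical item parameters `a` (discrimination) and `b` (difficulty), and a variance proportional to μ(1 − μ) with a hypothetical scale `k`, so that noise shrinks as the expected score approaches the bounds. The abstract does not give the exact parameterization, so this variance form is an assumption made for illustration only.

```python
import math


def expected_score(theta, a, b):
    """Logistic mean in (0, 1), as for a 2PL IRT item (assumed form)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def score_variance(mu, k):
    """Assumed heteroskedastic variance: largest at mu = 0.5, smaller near 0 or 1."""
    return k * mu * (1.0 - mu)


def log_likelihood(score, theta, a, b, k):
    """Normal log-density of a continuous score given ability theta."""
    mu = expected_score(theta, a, b)
    var = score_variance(mu, k)
    return -0.5 * math.log(2.0 * math.pi * var) - (score - mu) ** 2 / (2.0 * var)
```

Under these assumptions, a score close to the item's expected value receives a higher log-likelihood than one far from it, and items whose expected score sits near a bound contribute a tighter distribution than items near the middle of the range.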

Taxonomy

Topics: Psychometric Methodologies and Testing · Topic Modeling · Machine Learning and Data Classification