Confident Rankings with Fewer Items: Adaptive LLM Evaluation with Continuous Scores
Esma Balkır, Alice Pernthaller, Marco Basaldella, José Hernández-Orallo, Nigel Collier

TL;DR
This paper extends adaptive testing methods to continuous scoring in LLM evaluation, enabling reliable rankings with fewer items and lower costs by modeling scores with heteroskedastic normal distributions.
Contribution
It introduces a new heteroskedastic normal model for continuous scores and an uncertainty-aware ranker with adaptive stopping, improving efficiency and accuracy in LLM evaluation.
Findings
Uses only 2% of the items that full-benchmark evaluation would require.
Improves ranking correlation by 0.12 τ over random sampling.
Achieves 95% accuracy on confident predictions.
Abstract
Computerized Adaptive Testing (CAT) has proven effective for efficient LLM evaluation on multiple-choice benchmarks, but modern LLM evaluation increasingly relies on generation tasks where outputs are scored continuously rather than marked correct/incorrect. We present a principled extension of IRT-based adaptive testing to continuous bounded scores (ROUGE, BLEU, LLM-as-a-Judge) by replacing the Bernoulli response distribution with a heteroskedastic normal distribution. Building on this, we introduce an uncertainty-aware ranker with adaptive stopping criteria that achieves reliable model ranking while testing as few items, and as cheaply, as possible. We validate our method on five benchmarks spanning n-gram-based, embedding-based, and LLM-as-judge metrics. Our method uses 2% of the items while improving ranking correlation by 0.12 τ over random sampling, with 95% accuracy on…
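Illustrative sketch
The idea of swapping the Bernoulli response for a heteroskedastic normal can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the 2PL-style mean curve, the variance form k·μ(1−μ), the standard normal prior, the grid posterior, and every function name below are hypothetical choices made for this example, and the abstract does not specify any of them.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def expected_score(theta, a, b):
    """Mean score in (0, 1) from a 2PL-style item curve (assumed form)."""
    return sigmoid(a * (theta - b))

def score_variance(mu, k=0.1, floor=1e-4):
    """Assumed heteroskedastic variance: largest at mu = 0.5, smaller near the bounds."""
    return k * mu * (1.0 - mu) + floor

def log_likelihood(theta, items, scores):
    """Normal log-likelihood of continuous scores given ability theta."""
    total = 0.0
    for (a, b), y in zip(items, scores):
        mu = expected_score(theta, a, b)
        var = score_variance(mu)
        total += -0.5 * (math.log(2.0 * math.pi * var) + (y - mu) ** 2 / var)
    return total

def grid_posterior(items, scores, grid):
    """Normalized posterior over theta on a grid, with a standard normal prior (assumed)."""
    logp = [log_likelihood(t, items, scores) - 0.5 * t * t for t in grid]
    m = max(logp)  # subtract the max before exponentiating for numerical stability
    w = [math.exp(lp - m) for lp in logp]
    z = sum(w)
    return [x / z for x in w]

def posterior_mean_sd(grid, post):
    mean = sum(t * p for t, p in zip(grid, post))
    var = sum((t - mean) ** 2 * p for t, p in zip(grid, post))
    return mean, math.sqrt(var)

def prob_a_beats_b(mean_a, sd_a, mean_b, sd_b):
    """P(theta_a > theta_b) under independent normal approximations (assumed)."""
    z = (mean_a - mean_b) / math.sqrt(sd_a ** 2 + sd_b ** 2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical item parameters (discrimination a, difficulty b) and scores in [0, 1].
grid = [i / 10 for i in range(-40, 41)]
items = [(1.0, -1.0), (1.5, 0.0), (1.2, 1.0)]
post_strong = grid_posterior(items, [0.9, 0.8, 0.6], grid)
post_weak = grid_posterior(items, [0.5, 0.3, 0.1], grid)
strong = posterior_mean_sd(grid, post_strong)
weak = posterior_mean_sd(grid, post_weak)
p_win = prob_a_beats_b(*strong, *weak)
```

An adaptive stopping rule in this spirit could stop testing a pair of models once `p_win` (or `1 - p_win`) exceeds a chosen confidence threshold; the threshold itself would be a tuning choice, not something stated in the abstract.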
Taxonomy
Topics: Psychometric Methodologies and Testing · Topic Modeling · Machine Learning and Data Classification
