Adaptive Testing for LLM Evaluation: A Psychometric Alternative to Static Benchmarks
Peiyu Li, Xiuxiu Tang, Si Chen, Ying Cheng, Ronald Metoyer, Ting Hua, Nitesh V. Chawla

TL;DR
ATLAS introduces an adaptive testing framework based on Item Response Theory to evaluate large language models efficiently, reducing the number of required items by up to 90% while maintaining measurement accuracy.
Contribution
It presents a novel adaptive testing method for LLM evaluation using IRT, significantly decreasing testing resources needed compared to static benchmarks.
Findings
Reduces item count by up to 90% while maintaining accuracy.
Ability estimates closely match raw accuracy across benchmarks.
Provides finer discrimination among models with similar accuracy.
Abstract
Evaluating large language models (LLMs) typically requires thousands of benchmark items, making the process expensive, slow, and increasingly impractical at scale. Existing evaluation protocols rely on average accuracy over fixed item sets, treating all items as equally informative despite substantial variation in difficulty and discrimination. We introduce ATLAS, an adaptive testing framework based on Item Response Theory (IRT) that estimates model ability using Fisher information-guided item selection. ATLAS reduces the number of required items by up to 90% while maintaining measurement precision. For instance, it matches whole-bank ability estimates using only 41 items (0.157 MAE) on HellaSwag (5,600 items). We further reconstruct accuracy from ATLAS's ability estimates and find that reconstructed accuracies closely match raw accuracies across all five benchmarks, indicating that…
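The item-selection idea in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a three-parameter logistic (3PL) IRT model with already-calibrated item parameters, and the function names (`p_correct`, `fisher_information`, `select_next_item`, `estimate_theta`) and the grid-search estimator are hypothetical choices made here for clarity.

```python
import numpy as np

def p_correct(theta, a, b, c):
    """3PL probability of a correct answer: c + (1 - c) * sigmoid(a * (theta - b)).

    a = discrimination, b = difficulty, c = guessing floor (assumed pre-calibrated).
    """
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def fisher_information(theta, a, b, c):
    """Fisher information each 3PL item carries about ability theta."""
    p = p_correct(theta, a, b, c)
    return a**2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def select_next_item(theta, a, b, c, administered):
    """Index of the not-yet-administered item with maximum information at theta."""
    info = fisher_information(theta, a, b, c)
    for i in administered:
        info[i] = -np.inf  # never pick the same item twice
    return int(np.argmax(info))

def estimate_theta(responses, a, b, c, grid=None):
    """Grid-search maximum-likelihood ability estimate (an assumed, simple estimator).

    responses is a 0/1 array aligned with the item parameter arrays a, b, c.
    """
    if grid is None:
        grid = np.linspace(-4.0, 4.0, 161)
    p = p_correct(grid[:, None], a[None, :], b[None, :], c[None, :])
    loglik = (responses * np.log(p) + (1 - responses) * np.log(1 - p)).sum(axis=1)
    return float(grid[np.argmax(loglik)])
```

An adaptive loop would alternate `select_next_item` and `estimate_theta` on the responses collected so far, stopping once a precision or item-count threshold is reached; the stopping rule ATLAS actually uses is not stated in this excerpt.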
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The core ideas here are interesting and can motivate new research in improvements to how we perform LLM evaluation. It’s nice to pull in techniques from other fields like psychometrics and CAT to see how we can improve our field. The experimental results showing efficiency versus baselines are good.
I want to be upfront: There is concurrent work at COLM 2025 called Fluid Language Model Benchmarking that is highly similar (it also draws inspiration from psychometrics and CAT, fits IRT models on the Open LLM Leaderboard, uses Fisher information for item selection, and baselines against tinyBenchmarks and MetaBench). I did not penalize this work for overlap with concurrent work, as the COLM paper was published around the same time this paper was submitted. That being said, there are some issue
- The authors tackle an important problem: making evaluations cheaper for large language models, given that many benchmarks are used to track a given model's capabilities.
- The proposed methodology is clear: it formalizes evaluation as a latent-ability measurement with a three-parameter logistic model.
- The results are nice compared to other baselines like TinyBenchmarks, with a huge reduction in the size of the evaluation sets.
- Code and calibrated items are missing in the provided link.
- There are some inherent issues with IRT framing and using reduced sizes for evaluation of language models. See [1]
- Inconsistent claims and results: the main text of the paper mentions good fits with RMSEA $\leq 0.05$ but Table 4 reports otherwise.
- Current framework is only applicable to MCQ tasks, but MCQ benchmarks have many inherent problems [2]. Generalizability to many modern evals which are free-form like math, coding, reas
ATLAS draws upon several decades of research in psychometrics and shows that the methods developed in that field can be fruitfully applied in the context of LLM evaluation. I liked it that the authors thought very carefully about how best to adapt IRT/CAT to the LLM domain (e.g., by using common-person calibration). The experimental setup is also sound, and the authors convincingly show that ATLAS offers advantages for LLM evaluation (but see my concerns below).
There are currently several weaknesses that undermine the contribution of the paper. If the authors address them, I will consider raising my score.
- The experimental section misses key details. Specifically, it is unclear whether the LLMs used for evaluation were already used for fitting the IRT models (which would be problematic). Further, if there _was_ a clear train-test split, it is unclear how it was determined. This limits the credibility of the reported results.
- For measuring precisi
- Clear, concise, accurate title
- Important problem - accurately evaluating language models' capabilities at specific tasks is good
- The paper is well written and easy to follow. (Note: I believe some details are omitted, which I pointed out under Questions)
1. The goal when evaluating language models is “How good is model X on task Y?” Here, the primary metric of interest is the MAE between an IRT estimate on a subset of data and the corresponding IRT estimate on all the data. Thus, MAE is a proxy metric that doesn’t really capture what we care about.
2. When considering efficiency, the real concern for practitioners is that evaluating models requires paying for accelerators (GPUs, TPUs, whatever) to run these models. Something like “Selection Time
Taxonomy
Topics: Psychometric Methodologies and Testing · Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling
