Leveraging Computerized Adaptive Testing for Cost-effective Evaluation of Large Language Models in Medical Benchmarking
Tianpeng Zheng, Zhehan Jiang, Jiayi Liu, Shicong Feng

TL;DR
This paper introduces a computerized adaptive testing framework based on item response theory to efficiently evaluate the medical knowledge of large language models, significantly reducing assessment time and cost while maintaining accuracy.
Contribution
It develops and validates a novel adaptive testing method for medical LLM benchmarking, enabling rapid, low-cost, and reliable performance measurement.
Findings
Near-perfect correlation (r = 0.988) with full-item-bank estimates
Reduced assessment time from hours to minutes
Significant decrease in token usage and computational cost
Abstract
The rapid proliferation of large language models (LLMs) in healthcare creates an urgent need for scalable and psychometrically sound evaluation methods. Conventional static benchmarks are costly to administer repeatedly, vulnerable to data contamination, and lack calibrated measurement properties for fine-grained performance tracking. We propose and validate a computerized adaptive testing (CAT) framework grounded in item response theory (IRT) for efficient assessment of standardized medical knowledge in LLMs. The study comprises a two-phase design: a Monte Carlo simulation to identify optimal CAT configurations and an empirical evaluation of 38 LLMs using a human-calibrated medical item bank. Each model completed both the full item bank and an adaptive test that dynamically selected items based on real-time ability estimates and terminated upon reaching a predefined reliability…
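The adaptive loop described in the abstract (select an item from the current ability estimate, update the estimate, stop once a precision target is met) can be sketched as follows. This is a minimal illustration, not the paper's configuration: it assumes a two-parameter logistic (2PL) IRT model, maximum-information item selection, expected a posteriori (EAP) estimation with a standard normal prior, and a hypothetical stopping rule of standard error at or below 0.3 (roughly a reliability of 0.91 if ability has unit variance). The item bank, the simulated test taker, and all function names are invented for the example.

```python
import math
import random

def p_correct(theta, a, b):
    """2PL probability of a correct response (assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

# Quadrature grid for ability, from -4 to 4 in steps of 0.1.
GRID = [-4.0 + 0.1 * k for k in range(81)]

def eap_estimate(responses, items):
    """EAP ability estimate and posterior SD under a N(0, 1) prior."""
    weights = []
    for t in GRID:
        log_w = -0.5 * t * t  # log prior, up to a constant
        for idx, correct in responses:
            p = p_correct(t, *items[idx])
            log_w += math.log(p if correct else 1.0 - p)
        weights.append(math.exp(log_w))
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def run_cat(items, answer_fn, se_target=0.3, max_items=50):
    """Administer items adaptively until the SE target or item cap is hit."""
    responses, used = [], set()
    theta, se = 0.0, float("inf")
    while len(responses) < min(max_items, len(items)):
        candidates = [i for i in range(len(items)) if i not in used]
        # Pick the unused item that is most informative at the current estimate.
        nxt = max(candidates, key=lambda i: item_information(theta, *items[i]))
        used.add(nxt)
        responses.append((nxt, answer_fn(nxt)))
        theta, se = eap_estimate(responses, items)
        if se <= se_target:
            break
    return theta, se, len(responses)

# Hypothetical bank of (discrimination a, difficulty b) pairs.
rng = random.Random(0)
bank = [(rng.uniform(0.8, 2.0), rng.uniform(-3.0, 3.0)) for _ in range(200)]

# Simulated test taker with a true ability of 1.0 (stands in for an LLM).
def simulated_model(i):
    return rng.random() < p_correct(1.0, *bank[i])

theta_hat, se_hat, n_used = run_cat(bank, simulated_model)
print(f"theta={theta_hat:.2f} se={se_hat:.2f} items={n_used} of {len(bank)}")
```

In a real evaluation, `answer_fn` would send the selected question to the model under test and score its answer, so each item skipped by the stopping rule is a prompt that never has to be run.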
Taxonomy
Topics: Psychometric Methodologies and Testing · Artificial Intelligence in Healthcare and Education · Intelligent Tutoring Systems and Adaptive Learning
