Leveraging Computerized Adaptive Testing for Cost-effective Evaluation of Large Language Models in Medical Benchmarking

Tianpeng Zheng; Zhehan Jiang; Jiayi Liu; Shicong Feng

arXiv:2603.23506·cs.CL·March 26, 2026

TL;DR

This paper introduces a computerized adaptive testing (CAT) framework based on item response theory (IRT) for evaluating the medical knowledge of large language models. It substantially reduces assessment time and cost while closely matching ability estimates from the full item bank.

Contribution

It develops and validates a novel adaptive testing method for medical LLM benchmarking, enabling rapid, low-cost, and reliable performance measurement.

Findings

01 Near-perfect correlation (r = 0.988) between adaptive-test estimates and full item bank estimates

02 Assessment time reduced from hours to minutes

03 Significant decrease in token usage and computational cost

Abstract

The rapid proliferation of large language models (LLMs) in healthcare creates an urgent need for scalable and psychometrically sound evaluation methods. Conventional static benchmarks are costly to administer repeatedly, vulnerable to data contamination, and lack calibrated measurement properties for fine-grained performance tracking. We propose and validate a computerized adaptive testing (CAT) framework grounded in item response theory (IRT) for efficient assessment of standardized medical knowledge in LLMs. The study comprises a two-phase design: a Monte Carlo simulation to identify optimal CAT configurations and an empirical evaluation of 38 LLMs using a human-calibrated medical item bank. Each model completed both the full item bank and an adaptive test that dynamically selected items based on real-time ability estimates and terminated upon reaching a predefined reliability…
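The adaptive loop described in the abstract (select an item from the current ability estimate, update the estimate, stop once a precision target is met) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. It assumes a two-parameter logistic (2PL) IRT model, maximum-Fisher-information item selection, expected a posteriori (EAP) scoring under a standard normal prior, and a standard-error stopping rule. The abstract specifies none of these choices. All names (`run_cat`, `answer_fn`, `se_target`) are hypothetical.

```python
import math
import random

# Ability grid for numerical EAP scoring (assumed range: -4 to 4 in steps of 0.1).
GRID = [-4.0 + 0.1 * k for k in range(81)]


def p_correct(theta, a, b):
    """2PL probability of a correct response (a: discrimination, b: difficulty)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def eap_estimate(responses):
    """Return (theta_hat, se) from a list of (a, b, correct) under an N(0, 1) prior."""
    log_weights = []
    for theta in GRID:
        log_w = -0.5 * theta * theta  # log of the unnormalized N(0, 1) prior
        for a, b, correct in responses:
            # Clamp to avoid log(0) at extreme abilities.
            p = min(max(p_correct(theta, a, b), 1e-12), 1.0 - 1e-12)
            log_w += math.log(p if correct else 1.0 - p)
        log_weights.append(log_w)
    peak = max(log_weights)
    weights = [math.exp(w - peak) for w in log_weights]
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


def run_cat(item_bank, answer_fn, se_target=0.3, max_items=None):
    """Administer items adaptively until the posterior SE reaches se_target.

    item_bank: list of (a, b) tuples, assumed to be pre-calibrated.
    answer_fn: callable taking an item index and returning True if the
               model under test answered that item correctly.
    """
    limit = len(item_bank) if max_items is None else max_items
    remaining = list(range(len(item_bank)))
    responses = []
    theta, se = 0.0, 1.0  # start at the prior mean and prior SD
    while remaining and len(responses) < limit and se > se_target:
        # Pick the unused item that is most informative at the current estimate.
        best = max(remaining, key=lambda i: fisher_information(theta, *item_bank[i]))
        remaining.remove(best)
        a, b = item_bank[best]
        responses.append((a, b, bool(answer_fn(best))))
        theta, se = eap_estimate(responses)
    return theta, se, len(responses)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic item bank and a simulated test taker with true ability 1.0.
    bank = [(rng.uniform(0.8, 2.0), rng.uniform(-3.0, 3.0)) for _ in range(200)]
    simulated = lambda i: rng.random() < p_correct(1.0, *bank[i])
    theta, se, used = run_cat(bank, simulated, se_target=0.3)
    print(f"theta={theta:.2f} se={se:.2f} items_used={used}")
```

With a unit-variance prior, a standard-error target maps approximately onto a reliability target via reliability ≈ 1 − SE², so `se_target=0.3` corresponds to a reliability of roughly 0.91. In practice `answer_fn` would wrap a call to the LLM being evaluated and grade its answer, which is where the savings in tokens and time would come from.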

Taxonomy

Topics: Psychometric Methodologies and Testing · Artificial Intelligence in Healthcare and Education · Intelligent Tutoring Systems and Adaptive Learning