An Interpretable and Scalable Framework for Evaluating Large Language Models
Xinhao Qu, Qiang Heng, Hao Zeng, Xiaoqian Liu

TL;DR
This paper introduces a scalable, interpretable framework for evaluating large language models using a reformulation of Item Response Theory, providing faster and more insightful assessments than existing methods.
Contribution
The paper presents an efficient majorization-minimization approach to large-scale LLM evaluation that improves numerical stability, interpretability, and computational speed over conventional IRT estimation.
Findings
Achieves significant speedups over conventional IRT estimation methods.
Maintains or improves estimation accuracy.
Provides insights into item difficulty and discrimination.
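To make "difficulty" and "discrimination" concrete, here is a minimal sketch of a two-parameter logistic (2PL) item response function. The summary does not say which IRT parameterization the paper uses, so the 2PL form and the names `p_correct`, `theta`, `a`, and `b` are assumptions for illustration only.

```python
import math

def p_correct(theta, a, b):
    """Assumed 2PL item response function (illustrative, not from the paper).

    theta: latent ability of the model being evaluated
    a:     item discrimination (how sharply the item separates abilities)
    b:     item difficulty (ability at which P(correct) = 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# An easy item (b = -1) versus a hard item (b = 2) for the same ability.
easy = p_correct(theta=0.5, a=1.0, b=-1.0)
hard = p_correct(theta=0.5, a=1.0, b=2.0)

# A highly discriminating item reacts more strongly to the same ability gap.
gap_low_a = p_correct(1.0, 0.5, 0.0) - p_correct(-1.0, 0.5, 0.0)
gap_high_a = p_correct(1.0, 3.0, 0.0) - p_correct(-1.0, 3.0, 0.0)
```

Under this parameterization, average accuracy treats every item alike, whereas `a` and `b` let hard or highly discriminating items carry different information about ability.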
Abstract
Evaluation of large language models (LLMs) is increasingly critical, yet standard benchmarking methods rely on average accuracy, overlooking both the inherent stochasticity of LLM outputs and the heterogeneity of benchmark items. Item Response Theory (IRT) offers a principled framework for modeling latent model abilities and item characteristics, but conventional methods are computationally expensive and numerically unstable, limiting large-scale implementations. To address these challenges, we propose an interpretable and scalable framework for LLM evaluation based on the majorization-minimization principle. Our approach reformulates the problem as a sequence of constrained matrix factorization subproblems, enabling stable and efficient parameter estimation with theoretical guarantees for identifiability and convergence. Experiments on synthetic and real-world datasets, including…
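The idea of turning likelihood-based estimation into a sequence of matrix factorization subproblems can be sketched with a standard quadratic majorizer of the logistic loss. This is not the authors' algorithm: the rank-2 structure, the unconstrained truncated-SVD solve, and the names `mm_step` and `fit` are assumptions chosen only to show the mechanics. Because the second derivative of the logistic loss is at most 1/4, the loss at logits M is bounded above by a quadratic around the current iterate, and minimizing that bound reduces to a least-squares fit of a "working response" matrix.

```python
import numpy as np

def sigmoid(x):
    # tanh form of the logistic function; avoids overflow for large |x|
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def neg_log_lik(M, Y):
    # Bernoulli negative log-likelihood with logits M and binary responses Y
    return float(np.sum(np.logaddexp(0.0, M) - Y * M))

def mm_step(M, Y, rank):
    # Working response from the curvature bound 1/4:
    # minimizing the quadratic majorizer equals minimizing ||M_new - Z||_F^2.
    Z = M + 4.0 * (Y - sigmoid(M))
    # Best rank-`rank` least-squares fit of Z via truncated SVD.
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def fit(Y, rank=2, n_iter=50):
    M = np.zeros_like(Y, dtype=float)  # feasible start (rank 0)
    losses = [neg_log_lik(M, Y)]
    for _ in range(n_iter):
        M = mm_step(M, Y, rank)
        losses.append(neg_log_lik(M, Y))
    return M, losses

# Synthetic responses from an assumed 2PL model: logit = a_j * (theta_i - b_j),
# which is a rank-2 matrix over (model i, item j).
rng = np.random.default_rng(0)
theta = rng.normal(size=30)            # hypothetical model abilities
a = rng.uniform(0.5, 2.0, size=40)     # hypothetical item discriminations
b = rng.normal(size=40)                # hypothetical item difficulties
logits = a * (theta[:, None] - b)
Y = (rng.random((30, 40)) < sigmoid(logits)).astype(float)

M_hat, losses = fit(Y)
```

Each iterate minimizes an upper bound that touches the loss at the previous iterate, so the recorded `losses` never increase. The paper's constrained subproblems and identifiability conditions would replace the plain truncated SVD used here.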
