Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory

Hongli Zhou; Hui Huang; Ziqing Zhao; Lvyuan Han; Huicheng Wang; Kehai Chen; Muyun Yang; Wei Bao; Jian Dong; Bing Xu; Conghui Zhu; Hailong Cao; Tiejun Zhao

arXiv:2505.15055·cs.CL·January 19, 2026

TL;DR

This paper critically examines the effectiveness of large language model benchmarks, introduces an improved IRT-based framework called PSN-IRT, and demonstrates its ability to better evaluate and optimize benchmark design for LLMs.

Contribution

It introduces PSN-IRT, an enhanced IRT framework for more accurate LLM evaluation, and provides a comprehensive analysis revealing shortcomings of current benchmarks.

Findings

01 Current benchmarks show significant measurement shortcomings.

02 PSN-IRT can construct smaller, more aligned benchmarks.

03 Enhanced evaluation leads to better understanding of LLM capabilities.

Abstract

The evaluation of large language models (LLMs) via benchmarks is widespread, yet inconsistencies between leaderboards and poor separability among top models raise concerns about whether benchmarks accurately reflect authentic model capabilities. This paper provides a critical analysis of benchmark effectiveness, examining prominent LLM benchmarks using results from diverse models. We first propose Pseudo-Siamese Network for Item Response Theory (PSN-IRT), an enhanced Item Response Theory framework that incorporates a rich set of item parameters within an IRT-grounded architecture. PSN-IRT can be used to estimate item characteristics and model abilities accurately and reliably. Based on PSN-IRT, we conduct an extensive analysis of 11 LLM benchmarks comprising 41,871 items, revealing significant and varied shortcomings in their measurement quality. Furthermore, we…
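
For readers unfamiliar with Item Response Theory, the sketch below shows the standard four-parameter logistic (4PL) item response function from classical psychometrics. It is background only and is not the PSN-IRT architecture, which the abstract does not specify. The function name `irt_4pl` and the parameter names (`theta` for ability, `a` for discrimination, `b` for difficulty, `c` for the guessing floor, `d` for the upper asymptote) are illustrative assumptions, not identifiers from the paper or its repository.

```python
import math


def irt_4pl(theta: float, a: float, b: float, c: float = 0.0, d: float = 1.0) -> float:
    """Probability of a correct response under the textbook 4PL IRT model.

    theta: ability of the test taker (here, hypothetically, an LLM)
    a:     item discrimination (slope at the inflection point)
    b:     item difficulty (ability at which the curve is halfway between c and d)
    c:     lower asymptote (chance of a correct answer by guessing)
    d:     upper asymptote (ceiling on the success probability)

    All names are illustrative assumptions, not taken from the paper.
    """
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))


if __name__ == "__main__":
    # A four-option multiple-choice item: guessing floor of 0.25.
    for theta in (-2.0, 0.0, 2.0):
        p = irt_4pl(theta, a=1.5, b=0.0, c=0.25, d=1.0)
        print(f"theta={theta:+.1f}  P(correct)={p:.3f}")
```

With `c = 0` and `d = 1` this reduces to the two-parameter logistic model. A low discrimination `a` flattens the curve, which is one way an item can fail to separate strong models from each other.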

Code & Models

Repositories

Joe-Hall-Lee/PSN-IRT (PyTorch, official)

Taxonomy

Topics: Topic Modeling

Methods: Sparse Evolutionary Training