Beyond Scores: Diagnostic LLM Evaluation via Fine-Grained Abilities
Xu Zhang, Xudong Gong, Jiacheng Qin, Qiang Wang, JiaQi Liao, Zhe Wang, Dawei Feng, Bo Ding

TL;DR
This paper introduces a cognitive diagnostic framework that estimates fine-grained abilities of large language models across multiple domains, enabling more targeted evaluation and improvement.
Contribution
It develops a 35-dimensional ability taxonomy for mathematics and employs multidimensional Item Response Theory for ability estimation and prediction across diverse scientific domains.
Findings
Strong criterion validity demonstrated across benchmarks.
Accurate prediction of unseen items with AUC 0.80-0.89 within benchmarks.
Consistent diagnostic performance across physics, chemistry, and computer science.
Abstract
Current evaluations of large language models aggregate performance across diverse tasks into single scores. This obscures fine-grained ability variation, limiting both targeted model improvement and ability-guided model selection for specific tasks. Motivated by this gap, we propose a cognitive diagnostic framework that estimates model abilities across multiple fine-grained dimensions. For mathematics, we construct a 35-dimensional ability taxonomy grounded in cognitive theory and domain knowledge. The framework employs multidimensional Item Response Theory with an item-ability association matrix to estimate fine-grained ability levels, which in turn enable prediction of performance on unseen items (benchmark questions). Evaluated on 41 models, our approach demonstrates strong criterion validity, consistent ability estimates across benchmarks, and accurate prediction of unseen items with AUC…
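To make the mechanism in the abstract concrete, here is a minimal sketch of how an item-ability association matrix can gate a multidimensional IRT model, and how predictions on held-out items could be scored with AUC. The summary does not give the paper's exact parameterization, so this assumes a common compensatory two-parameter logistic form; the names `mirt_prob`, `auc`, `theta`, `a`, `b`, and `Q` are illustrative and not taken from the paper.

```python
import numpy as np

def mirt_prob(theta, a, b, Q):
    """Probability that each model answers each item correctly.

    Assumed compensatory 2PL form (not confirmed by the source):
        P[i, j] = sigmoid(sum_k Q[j, k] * a[j, k] * theta[i, k] - b[j])

    theta: (n_models, K) ability estimates per model
    a:     (n_items, K)  item discrimination per ability dimension
    b:     (n_items,)    item difficulty
    Q:     (n_items, K)  binary item-ability association matrix
    """
    # Masking with Q means only the abilities tagged for an item affect it.
    logits = theta @ (a * Q).T - b
    return 1.0 / (1.0 + np.exp(-logits))

def auc(y_true, y_score):
    """Rank-based AUC (Mann-Whitney U); tied scores share their average rank."""
    y_true = np.asarray(y_true).ravel()
    y_score = np.asarray(y_score, dtype=float).ravel()
    order = np.argsort(y_score, kind="mergesort")
    ranks = np.empty(len(y_score), dtype=float)
    ranks[order] = np.arange(1, len(y_score) + 1)
    for s in np.unique(y_score):
        tied = y_score == s
        ranks[tied] = ranks[tied].mean()
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Toy example: one model, two ability dimensions, one item that only
# requires the first ability. The weak second ability is ignored.
theta = np.array([[2.0, -5.0]])
a = np.array([[1.0, 1.0]])
b = np.array([0.0])
Q = np.array([[1, 0]])
p = mirt_prob(theta, a, b, Q)
```

In a full pipeline the parameters `theta`, `a`, and `b` would be fitted to observed correct/incorrect responses (for example by maximum likelihood), and `auc` would then compare the predicted probabilities against held-out responses.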
