Scalable Text-Embedding-informed Cognitive Diagnosis of Large Language Models
Jia Liu, Zhiyu Xu, and Yuqi Gu

TL;DR
This paper introduces a scalable, fine-grained evaluation method for large language models using cognitive diagnosis models, incorporating textual information to diagnose strengths and weaknesses across numerous capabilities.
Contribution
The paper adapts psychometric cognitive diagnosis models to large-scale LLM evaluation, enabling detailed capability profiling with textual priors and efficient estimation algorithms.
Findings
Simulations demonstrate accurate estimation of mastery profiles.
The method is applied effectively to the MATH Level 5 benchmark.
The analysis uncovers detailed LLM strengths and weaknesses.
Abstract
Large language models (LLMs) have achieved remarkable performance on diverse benchmarks, yet existing evaluation practices largely rely on coarse summary metrics that obscure underlying reasoning abilities. In this work, we propose novel methodologies to adapt cognitive diagnosis models (CDMs) in psychometrics to LLM evaluation, enabling fine-grained diagnosis via multidimensional discrete capability profiles and interpretable characterizations of LLM strengths and weaknesses. First, to enable CDM-based evaluation at benchmark scale (more than 1000 items), we propose a scalable method that jointly estimates LLM mastery profiles and the item-attribute Q-matrix, addressing key challenges posed by high-dimensional latent attributes (K > 20), large item pools, and the prohibitive computational cost of existing marginal maximum likelihood-based estimation. Second, we incorporate item-level…
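To make the terms "mastery profile" and "Q-matrix" concrete, here is a minimal sketch based on the DINA model, a standard CDM from the psychometrics literature. The excerpt above does not say which CDM the authors use, so DINA is an assumption chosen for illustration; the toy Q-matrix, the profile, the slip and guess values, and the function names are all hypothetical. The sketch shows only how a known profile and Q-matrix determine response probabilities, not the paper's joint estimation method.

```python
from typing import List


def ideal_response(alpha: List[int], q_row: List[int]) -> int:
    """Return 1 if the profile masters every attribute the item requires, else 0."""
    return int(all(a >= q for a, q in zip(alpha, q_row)))


def dina_prob_correct(alpha: List[int], q_row: List[int],
                      slip: float, guess: float) -> float:
    """DINA response probability: 1 - slip with full mastery, guess otherwise."""
    if ideal_response(alpha, q_row) == 1:
        return 1.0 - slip
    return guess


# Hypothetical toy setup: 4 items, K = 3 attributes.
# Q[j][k] = 1 means item j requires attribute k.
Q = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
]

# Hypothetical mastery profile of one LLM: masters attributes 0 and 1 only.
alpha = [1, 1, 0]

# Assumed item parameters, shared across items for simplicity.
slip, guess = 0.1, 0.2

etas = [ideal_response(alpha, q_row) for q_row in Q]
probs = [dina_prob_correct(alpha, q_row, slip, guess) for q_row in Q]

for j, (eta, p) in enumerate(zip(etas, probs)):
    print(f"item {j}: ideal response = {eta}, P(correct) = {p:.2f}")
```

In this toy case only the last item requires an attribute the profile lacks, so it is the only item answered at the guessing rate. Estimation runs in the opposite direction: given observed responses, recover the profiles and, in the setting the abstract describes, the Q-matrix as well.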
Taxonomy
Topics: Psychometric Methodologies and Testing · Topic Modeling · Mental Health via Writing
