Quantifying construct validity in large language model evaluations
Ryan Othniel Kearns

TL;DR
This paper introduces the structured capabilities model, which separates the underlying capabilities of large language models from their raw benchmark results, improving construct validity and prediction accuracy.
Contribution
The paper presents the structured capabilities model, which combines scaling laws with a treatment of measurement error to extract interpretable, generalisable capabilities from LLM benchmarks. It outperforms latent factor models on fit indices and improves out-of-distribution prediction.
Findings
Structured capabilities outperform latent factor models in fit indices.
Structured capabilities show better out-of-distribution prediction.
Existing models fail to properly separate scale from capabilities.
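To make the last finding concrete, here is a minimal synthetic sketch of how a latent factor extracted from benchmark scores alone can end up tracking model size. It is not the paper's structured capabilities model. The data are simulated, the function names (simulate_scores, first_factor) are hypothetical, and the first principal component is used as a crude stand-in for a single-factor model. The simulated scores depend only on log parameter count plus noise, so any "capability" recovered from them is really a proxy for scale.

```python
import numpy as np

def simulate_scores(n_models=200, n_benchmarks=6, noise=0.3, seed=0):
    """Simulate benchmark scores driven only by log model size plus noise.

    All values are synthetic and chosen for illustration only.
    """
    rng = np.random.default_rng(seed)
    # Log parameter counts spread between 1e8 and 1e11 parameters.
    log_size = rng.uniform(np.log(1e8), np.log(1e11), size=n_models)
    z = (log_size - log_size.mean()) / log_size.std()
    # Each benchmark responds to scale with its own positive loading.
    loadings = rng.uniform(0.5, 1.0, size=n_benchmarks)
    errors = noise * rng.standard_normal((n_models, n_benchmarks))
    scores = np.outer(z, loadings) + errors
    return log_size, scores

def first_factor(scores):
    """Return first-principal-component scores as a crude single latent factor."""
    centered = scores - scores.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]

log_size, scores = simulate_scores()
factor = first_factor(scores)

# The sign of a principal component is arbitrary, so compare magnitudes.
corr = abs(np.corrcoef(factor, log_size)[0, 1])
print(f"|corr(first factor, log size)| = {corr:.2f}")
```

In this toy setting the extracted factor is almost entirely a restatement of model size, which is the failure mode the finding describes. Separating scale from capability would require putting size into the model explicitly rather than leaving it to be absorbed by the factor.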
Abstract
The LLM community often reports benchmark results as if they are synonymous with general model capabilities. However, benchmarks can have problems that distort performance, like test set contamination and annotator error. How can we know that a benchmark is a reliable indicator of some capability that we want to measure? This question concerns the construct validity of LLM benchmarks, and it requires separating benchmark results from capabilities when we model and predict LLM performance. Both social scientists and computer scientists propose formal models - latent factor models and scaling laws - for identifying the capabilities underlying benchmark scores. However, neither technique is satisfactory for construct validity. Latent factor models ignore scaling laws, and as a result, the capabilities they extract often proxy model size. Scaling laws ignore measurement error, and as a…
Taxonomy
Topics: Computational and Text Analysis Methods · Natural Language Processing Techniques · Authorship Attribution and Profiling
