Quantifying construct validity in large language model evaluations

Ryan Othniel Kearns

arXiv:2602.15532·cs.AI·February 18, 2026

TL;DR

This paper introduces the structured capabilities model, a novel approach that better isolates and measures the true capabilities of large language models from benchmark results, improving construct validity and prediction accuracy.

Contribution

The paper presents the structured capabilities model, integrating scaling laws and measurement error to extract interpretable, generalisable capabilities from LLM benchmarks, outperforming existing models.

Findings

01. Structured capabilities outperform latent factor models in fit indices.

02. Structured capabilities show better out-of-distribution prediction.

03. Existing models fail to properly separate scale from capabilities.
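The third finding can be made concrete with a toy simulation. The sketch below is not the paper's structured capabilities model. It assumes synthetic data, uses the first principal component as a stand-in for a latent factor model, and uses a linear trend in log model size as a stand-in for a scaling law. All names (`simulate`, `first_component`, `residualize`) and all loadings are hypothetical choices made for illustration.

```python
import numpy as np


def simulate(n_models=200, seed=0):
    """Synthetic benchmark scores driven by log model size, one skill, and noise."""
    rng = np.random.default_rng(seed)
    log_size = rng.uniform(0.0, 3.0, n_models)   # assumed scale variable
    skill = rng.normal(0.0, 1.0, n_models)       # assumed capability, independent of size
    size_loading = np.ones(6)                    # every benchmark improves with scale
    skill_loading = np.array([0.6, 0.6, 0.6, 0.0, 0.0, 0.0])  # only three benchmarks need the skill
    noise = rng.normal(0.0, 0.3, (n_models, 6))  # measurement error
    scores = np.outer(log_size, size_loading) + np.outer(skill, skill_loading) + noise
    return log_size, skill, scores


def first_component(scores):
    """Return each model's score on the first principal component."""
    centered = scores - scores.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, 0] * s[0]


def residualize(scores, log_size):
    """Remove a per-benchmark linear trend in log size (ordinary least squares)."""
    design = np.column_stack([np.ones_like(log_size), log_size])
    coef, *_ = np.linalg.lstsq(design, scores, rcond=None)
    return scores - design @ coef


log_size, skill, scores = simulate()
raw_factor = first_component(scores)
adjusted_factor = first_component(residualize(scores, log_size))

# Absolute correlations, since the sign of a principal component is arbitrary.
print("raw factor vs log size:  ", abs(np.corrcoef(raw_factor, log_size)[0, 1]))
print("adjusted factor vs skill:", abs(np.corrcoef(adjusted_factor, skill)[0, 1]))
```

Under these assumptions, the factor extracted from raw scores mostly tracks log size, while the factor extracted after removing the size trend mostly tracks the simulated skill. This only illustrates the confound in a setting built to show it and says nothing about how the paper itself combines the two model families.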

Abstract

The LLM community often reports benchmark results as if they are synonymous with general model capabilities. However, benchmarks can have problems that distort performance, like test set contamination and annotator error. How can we know that a benchmark is a reliable indicator of some capability that we want to measure? This question concerns the construct validity of LLM benchmarks, and it requires separating benchmark results from capabilities when we model and predict LLM performance. Both social scientists and computer scientists propose formal models (latent factor models and scaling laws) for identifying the capabilities underlying benchmark scores. However, neither technique is satisfactory for construct validity. Latent factor models ignore scaling laws, and as a result, the capabilities they extract often proxy model size. Scaling laws ignore measurement error, and as a…

Taxonomy

Topics: Computational and Text Analysis Methods · Natural Language Processing Techniques · Authorship Attribution and Profiling