Evaluating General-Purpose AI with Psychometrics
Xiting Wang, Liming Jiang, Jose Hernandez-Orallo, David Stillwell, Luning Sun, Fang Luo, Xing Xie

TL;DR
This paper advocates for shifting from task-specific benchmarks to psychometric, construct-oriented evaluation methods for general-purpose AI, aiming for more reliable, valid, and scientifically grounded assessments of AI capabilities.
Contribution
It introduces a psychometric framework for AI evaluation, addressing current limitations of task-based benchmarks and proposing a scientifically rigorous alternative.
Findings
Psychometric methods can better predict AI performance on unforeseen tasks.
Construct-oriented evaluation improves reliability and validity of AI assessments.
A framework for integrating psychometrics into AI evaluation is proposed.
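As a rough illustration of what construct-oriented evaluation can look like, the sketch below uses the Rasch (one-parameter logistic) model from item response theory, a standard psychometric technique. It is an assumption made for illustration and is not taken from the paper. The function names, item difficulties, and response pattern are hypothetical. The idea is to estimate a single latent ability from scored responses to items of known difficulty, then use that ability to predict success on an item the system has not seen.

```python
import math


def p_correct(ability, difficulty):
    """Rasch (1PL) model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def estimate_ability(responses, difficulties, steps=50):
    """Maximum-likelihood ability estimate via Newton-Raphson.

    responses: list of 0/1 scores; difficulties: known item difficulties.
    The estimate is only finite when the responses are mixed.
    """
    if len(responses) != len(difficulties):
        raise ValueError("responses and difficulties must have equal length")
    if sum(responses) in (0, len(responses)):
        raise ValueError("all-correct or all-wrong patterns have no finite estimate")
    theta = 0.0
    for _ in range(steps):
        probs = [p_correct(theta, b) for b in difficulties]
        # Gradient of the log-likelihood: observed minus expected score.
        grad = sum(x - p for x, p in zip(responses, probs))
        # Fisher information at the current ability estimate.
        info = sum(p * (1.0 - p) for p in probs)
        step = grad / info
        # Clamp the update to keep the iteration stable.
        theta += max(-1.0, min(1.0, step))
        if abs(step) < 1e-9:
            break
    return theta


# Hypothetical calibration items and one system's scored responses.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
responses = [1, 1, 1, 0, 0]

theta = estimate_ability(responses, difficulties)
# Predicted success probability on an unseen item of assumed difficulty 3.0.
print(theta, p_correct(theta, 3.0))
```

The point of the design is that the prediction for the new item depends only on the estimated ability and the item's difficulty, not on that item having appeared in any benchmark.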
Abstract
Comprehensive and accurate evaluation of general-purpose AI systems such as large language models allows for effective mitigation of their risks and deepened understanding of their capabilities. Current evaluation methodology, mostly based on benchmarks of specific tasks, falls short of adequately assessing these versatile AI systems, as present techniques lack a scientific foundation for predicting their performance on unforeseen tasks and explaining their varying performance on specific task items or user inputs. Moreover, existing benchmarks of specific tasks raise growing concerns about their reliability and validity. To tackle these challenges, we suggest transitioning from task-oriented evaluation to construct-oriented evaluation. Psychometrics, the science of psychological measurement, provides a rigorous methodology for identifying and measuring the latent constructs that…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI)
Methods: Focus
