Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead
Tom Sühr, Florian E. Dorner, Olawale Salaudeen, Augustin Kelava, Samira Samadi

TL;DR
This paper argues against evaluating AI with human psychological and educational tests and advocates developing principled, AI-specific evaluation frameworks instead.
Contribution
It highlights the ontological and validity issues of applying human tests to AI and calls for creating dedicated AI evaluation methods.
Findings
Human tests are theory-driven and calibrated for humans.
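Benchmark performance may not validly measure traits such as "intelligence", given known issues with validity, data contamination, cultural bias, and sensitivity to superficial prompt changes.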
Developing AI-specific tests can improve evaluation validity.
Abstract
Large Language Models (LLMs) have achieved remarkable results on a range of standardized tests originally designed to assess human cognitive and psychological traits, such as intelligence and personality. While these results are often interpreted as strong evidence of human-like characteristics in LLMs, this paper argues that such interpretations constitute an ontological error. Human psychological and educational tests are theory-driven measurement instruments, calibrated to a specific human population. Applying these tests to non-human subjects without empirical validation risks mischaracterizing what is being measured. Furthermore, a growing trend frames AI performance on benchmarks as measurements of traits such as "intelligence", despite known issues with validity, data contamination, cultural bias and sensitivity to superficial prompt changes. We argue that interpreting…
