On Benchmarking Human-Like Intelligence in Machines
Lance Ying, Katherine M. Collins, Lionel Wong, Ilia Sucholutsky, Ryan Liu, Adrian Weller, Tianmin Shu, Thomas L. Griffiths, Joshua B. Tenenbaum

TL;DR
This paper critiques current AI benchmarking methods for lacking human-like evaluation criteria, highlights biases in existing benchmarks, and proposes five recommendations to improve assessment of human-like intelligence in AI systems.
Contribution
It identifies key shortcomings in current AI benchmarks and offers concrete guidelines for developing more human-centric evaluation standards.
Findings
Current benchmarks lack human-validated labels
Existing tasks do not capture human response variability
Biases and flaws are prevalent in benchmark designs
Abstract
Recent benchmark studies have claimed that AI has approached or even surpassed human-level performance on various cognitive tasks. However, this position paper argues that current AI evaluation paradigms are insufficient for assessing human-like cognitive capabilities. We identify a set of key shortcomings: a lack of human-validated labels, inadequate representation of human response variability and uncertainty, and reliance on simplified and ecologically invalid tasks. We support our claims with a human evaluation study on ten existing AI benchmarks, whose results suggest significant biases and flaws in task and label designs. To address these limitations, we propose five concrete recommendations for developing future benchmarks that will enable more rigorous and meaningful evaluations of human-like cognitive capacities in AI, with implications for applications of such AI.
