On Benchmarking Human-Like Intelligence in Machines

Lance Ying; Katherine M. Collins; Lionel Wong; Ilia Sucholutsky; Ryan Liu; Adrian Weller; Tianmin Shu; Thomas L. Griffiths; Joshua B. Tenenbaum

arXiv:2502.20502·cs.AI·March 3, 2025

TL;DR

This position paper argues that current AI evaluation paradigms are insufficient for assessing human-like cognitive capabilities, reports biases and flaws found in a human evaluation study of ten existing benchmarks, and proposes five recommendations for future benchmarks.

Contribution

The paper identifies key shortcomings in current AI benchmarks and offers five concrete recommendations for building benchmarks that evaluate human-like cognitive capacities more rigorously.

Findings

01 Current benchmarks lack human-validated labels

02 Existing tasks do not capture human response variability

03 Biases and flaws are prevalent in benchmark designs

Abstract

Recent benchmark studies have claimed that AI has approached or even surpassed human-level performance on various cognitive tasks. However, this position paper argues that current AI evaluation paradigms are insufficient for assessing human-like cognitive capabilities. We identify a set of key shortcomings: a lack of human-validated labels, inadequate representation of human response variability and uncertainty, and reliance on simplified and ecologically invalid tasks. We support our claims by conducting a human evaluation study on ten existing AI benchmarks, which suggests significant biases and flaws in task and label designs. To address these limitations, we propose five concrete recommendations for developing future benchmarks that will enable more rigorous and meaningful evaluations of human-like cognitive capacities in AI, with various implications for such AI applications.
