Recommendations and Reporting Checklist for Rigorous & Transparent Human Baselines in Model Evaluations

Kevin L. Wei; Patricia Paskov; Sunishchal Dev; Michael J. Byun; Anka Reuel; Xavier Roberts-Gaal; Rachel Calcott; Evie Coxon; Chinmay Deshpande

arXiv:2506.13776 · cs.AI · November 4, 2025

TL;DR

This position paper argues that human baselines in foundation model evaluations must be more rigorous and more transparent, and it provides a reporting checklist to improve measurement and reporting practices so that human and AI performance can be compared meaningfully.

Contribution

The paper offers a framework and checklist for designing, executing, and reporting human baselines, addressing shortcomings in current AI evaluation practice.

Findings

01 Identified gaps in existing human baselining methods

02 Developed a systematic checklist for better reporting

03 Reviewed 115 studies to assess baseline quality

Abstract

In this position paper, we argue that human baselines in foundation model evaluations must be more rigorous and more transparent to enable meaningful comparisons of human vs. AI performance, and we provide recommendations and a reporting checklist towards this end. Human performance baselines are vital for the machine learning community, downstream users, and policymakers to interpret AI evaluations. Models are often claimed to achieve "super-human" performance, but existing baselining methods are neither sufficiently rigorous nor sufficiently well-documented to robustly measure and assess performance differences. Based on a meta-review of the measurement theory and AI evaluation literatures, we derive a framework with recommendations for designing, executing, and reporting human baselines. We synthesize our recommendations into a checklist that we use to systematically review 115 human…

Code & Models

Repositories

kevinlwei/human-baselines (Official)

Taxonomy

Topics: Complex Systems and Decision Making