From Human-Level AI Tales to AI Leveling Human Scales

Peter Romero; Fernando Martínez-Plumed; Zachary R. Tidler; Matthieu Téhénan; Sipeng Chen; Álvaro David Gómez Antón; Luning Sun; Manuel Cebrian; Lexin Zhou; Yael Moros Daval; Daniel Romero-Alvarado; Félix Martí Pérez; Kevin Wei; José Hernández-Orallo

arXiv:2602.18911 · cs.LG · April 8, 2026

TL;DR

This paper introduces a framework to calibrate AI performance metrics against a global human population scale, enabling more meaningful comparisons across capabilities.

Contribution

It proposes a novel calibration method using demographic data and large language models to standardize AI benchmarks on a human-anchored scale.

Findings

01. Calibrated scales for reasoning, comprehension, and knowledge using public human test data.

02. Estimated the base of the logarithmic scale by extrapolating demographic profiles with LLMs.

03. Validated the calibration quality through group slicing and post-stratification.
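
The post-stratification named in the third finding can be illustrated with a minimal sketch. This is the textbook form of the technique, not the paper's own procedure: per-group success rates from a sample are reweighted by each group's share of the target population. The function name, the group labels, and all numbers below are hypothetical and chosen only for illustration.

```python
def poststratified_rate(sample, population_shares):
    """Reweight per-group success rates by population shares.

    sample: dict mapping group -> (n_success, n_total) observed in the sample.
    population_shares: dict mapping group -> share of the target population.
    Shares are renormalized over the groups that appear in the sample.
    """
    total_share = sum(population_shares[group] for group in sample)
    if total_share <= 0:
        raise ValueError("population shares must sum to a positive value")

    estimate = 0.0
    for group, (n_success, n_total) in sample.items():
        if n_total <= 0:
            raise ValueError(f"group {group!r} has no observations")
        weight = population_shares[group] / total_share
        estimate += weight * (n_success / n_total)
    return estimate


# Hypothetical data: group_a is over-represented in the sample
# relative to its share of the target population.
sample = {"group_a": (90, 100), "group_b": (5, 10)}
shares = {"group_a": 0.2, "group_b": 0.8}

raw_rate = (90 + 5) / (100 + 10)                     # pooled sample rate
adjusted_rate = poststratified_rate(sample, shares)  # population-weighted rate
```

With these made-up numbers the pooled rate is dominated by the over-sampled group, while the adjusted rate reflects the assumed population mix. That gap is the kind of distortion a narrow human baseline can introduce.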

Abstract

Comparing AI models to "human level" is often misleading when benchmark scores are incommensurate or human baselines are drawn from a narrow population. To address this, we propose a framework that calibrates items against the 'world population' and reports performance on a common, human-anchored scale. Concretely, we build on a set of multi-level scales for different capabilities where each level should represent a probability of success of the whole world population on a logarithmic scale with a base $B$. We calibrate each scale for each capability (reasoning, comprehension, knowledge, volume, etc.) by compiling publicly released human test data spanning education and reasoning benchmarks (PISA, TIMSS, ICAR, UKBioBank, and ReliabilityBench). The base $B$ is estimated by extrapolating between samples with two demographic profiles using LLMs, with the hypothesis that they condense rich…
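
One way to read "a probability of success on a logarithmic scale with a base $B$" is sketched below. It assumes, purely for illustration, that an item at level $L$ is solved by a fraction $B^{-L}$ of the world population, so that $L = -\log_B p$. The abstract does not state this exact mapping, and the function names and the base value of 10 are hypothetical.

```python
import math


def level_from_success_rate(p, base):
    """Map a population success rate p in (0, 1] to a level.

    Assumes the illustrative mapping p = base ** (-level).
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    if base <= 1:
        raise ValueError("base must be greater than 1")
    return -math.log(p, base)


def success_rate_from_level(level, base):
    """Inverse of level_from_success_rate under the same assumption."""
    if base <= 1:
        raise ValueError("base must be greater than 1")
    return base ** (-level)


# Hypothetical base; the paper estimates B rather than fixing it.
B = 10.0
level = level_from_success_rate(0.01, B)     # item solved by 1% of people
recovered = success_rate_from_level(level, B)
```

Under this assumed mapping, each step up the scale divides the share of people expected to succeed by $B$, which is why the estimate of $B$ determines how the levels translate into population proportions.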
