From Human-Level AI Tales to AI Leveling Human Scales
Peter Romero, Fernando Martínez-Plumed, Zachary R. Tidler, Matthieu Téhénan, Sipeng Chen, Álvaro David Gómez Antón, Luning Sun, Manuel Cebrian, Lexin Zhou, Yael Moros Daval, Daniel Romero-Alvarado, Félix Martí Pérez, Kevin Wei, José Hernández-Orallo

TL;DR
This paper introduces a framework to calibrate AI performance metrics against a global human population scale, enabling more meaningful comparisons across capabilities.
Contribution
It proposes a novel calibration method using demographic data and large language models to standardize AI benchmarks on a human-anchored scale.
Findings
Calibrated scales for reasoning, comprehension, and knowledge using public human test data.
Estimated the base of the logarithmic scale by extrapolating demographic profiles with LLMs.
Validated the calibration quality through group slicing and post-stratification.
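The post-stratification step mentioned above can be illustrated with a minimal sketch. It assumes the textbook form of the technique, in which per-group success rates observed in a sample are reweighted by each group's share of the target population. The function name, group labels, and numbers are hypothetical and do not come from the paper.

```python
def post_stratify(sample_rates, population_shares):
    """Reweight per-group sample success rates by population shares.

    sample_rates: dict mapping group -> success rate observed in the sample.
    population_shares: dict mapping group -> share (or count) of that group
        in the target population; shares are normalized internally.
    Both arguments are illustrative assumptions, not the paper's data format.
    """
    total = sum(population_shares.values())
    if total <= 0:
        raise ValueError("population shares must sum to a positive value")
    missing = set(population_shares) - set(sample_rates)
    if missing:
        raise ValueError(f"no sample rate for groups: {sorted(missing)}")
    # Weighted average: each group's rate counts in proportion to its
    # share of the target population, not its share of the sample.
    return sum(
        (population_shares[g] / total) * sample_rates[g]
        for g in population_shares
    )


# Hypothetical example: group_a is over-represented in the sample, so the
# raw sample mean would overstate the population success rate.
rates = {"group_a": 0.8, "group_b": 0.2}
shares = {"group_a": 0.25, "group_b": 0.75}
estimate = post_stratify(rates, shares)
```

Normalizing the shares inside the function lets callers pass either proportions or raw population counts.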
Abstract
Comparing AI models to "human level" is often misleading when benchmark scores are incommensurate or human baselines are drawn from a narrow population. To address this, we propose a framework that calibrates items against the 'world population' and reports performance on a common, human-anchored scale. Concretely, we build on a set of multi-level scales for different capabilities, where each level should represent a probability of success of the whole world population on a logarithmic scale with a given base. We calibrate each scale for each capability (reasoning, comprehension, knowledge, volume, etc.) by compiling publicly released human test data spanning education and reasoning benchmarks (PISA, TIMSS, ICAR, UKBioBank, and ReliabilityBench). The base is estimated by extrapolating between samples with two demographic profiles using LLMs, with the hypothesis that they condense rich…
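As a rough illustration of what a logarithmic, population-anchored level could mean, the sketch below assumes one possible reading of the abstract: the fraction of the world population succeeding at level `level` is `base ** (-level)`. This mapping, the function names, and the example base of 10 are assumptions made for illustration; the abstract does not state the exact formula or the estimated base.

```python
import math


def level_to_probability(level: float, base: float) -> float:
    """Assumed mapping: share of the population succeeding at this level."""
    if base <= 1.0:
        raise ValueError("base must be greater than 1")
    return base ** (-level)


def probability_to_level(p: float, base: float) -> float:
    """Inverse of the assumed mapping: level whose success share is p."""
    if base <= 1.0:
        raise ValueError("base must be greater than 1")
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p) / math.log(base)


# With a hypothetical base of 10, each extra level cuts the share of
# people who succeed by a factor of ten.
share_at_level_2 = level_to_probability(2, base=10)
level_for_one_percent = probability_to_level(0.01, base=10)
```

Under this reading, level 0 corresponds to items that everyone solves, and the base controls how quickly difficulty grows from one level to the next.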
