Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation
David Gringras, Misha Salahshoor

TL;DR
This study audits academic AI evaluations and finds that the models they test lag well behind the current AI frontier. Many papers omit detailed model configuration and overgeneralize their results into claims about AI capabilities.
Contribution
It introduces a bibliometric framework to measure publication lag and proposes reporting standards to improve transparency in AI capability evaluations.
Findings
The median paper evaluates models ~10.85 ECI (Epoch AI Capabilities Index) behind the frontier.
The lag is growing by 5.53 ECI per year.
Only 3.2% of abstracts disclose reasoning-mode status.
Abstract
Readers of applied-domain LLM capability evaluations want to know what AI systems can currently do. That literature answers a related, but consequentially different, question: what older, cheaper, less-elicited models could do months or years earlier (a 2026 paper evaluating GPT-4o-mini zero-shot, say, against a frontier of reasoning-capable, tool-using systems like GPT-5.5 Pro and Claude Opus 4.7), often reported with sparse configuration details and abstracted upward into claims about "AI" that propagate through citations, media, and policy. We measure the 'publication elicitation gap' (the gap between these answers) in a pre-registered audit of 112,303 LLM-keyword-matched candidate records (2022-01 to 2026-04; 18,574 admissible, 4,766 full-paper texts retrievable), comparing tested models to the contemporaneous frontier on the Epoch AI Capabilities Index (ECI), reproduced under Arena…
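As a rough illustration of the comparison the abstract describes, the sketch below computes a per-paper lag as the frontier ECI at the publication date minus the best ECI among the models a paper tested. This is an assumption about how such a gap could be operationalized, not the authors' pre-registered method. The model names, release dates, and scores are hypothetical placeholders, and `frontier_eci` and `frontier_lag` are illustrative helper names.

```python
from datetime import date

# Hypothetical release dates and ECI scores: placeholders, not real values.
MODELS = {
    "model_a": (date(2023, 3, 1), 100.0),
    "model_b": (date(2024, 5, 1), 120.0),
    "model_c": (date(2025, 8, 1), 140.0),
}


def frontier_eci(as_of):
    """Return the highest ECI among models released on or before `as_of`."""
    scores = [eci for released, eci in MODELS.values() if released <= as_of]
    if not scores:
        raise ValueError("no model released by this date")
    return max(scores)


def frontier_lag(tested, published):
    """Frontier ECI at publication minus the best ECI among tested models.

    Assumption: a paper is credited with the strongest model it evaluated.
    """
    best_tested = max(MODELS[name][1] for name in tested)
    return frontier_eci(published) - best_tested


# A paper published in September 2025 that tested only the oldest model.
print(frontier_lag(["model_a"], date(2025, 9, 1)))  # 40.0
```

Crediting each paper with its strongest tested model is a conservative choice. Scoring by the weakest or the median tested model would yield larger gaps.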
