The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next

Adil Amin

arXiv:2605.18840·cs.LG·May 20, 2026

TL;DR

This paper analyzes the interactions and trade-offs among frontier AI models' capabilities, proposing diagnostic tools and a playbook to guide measurement and development strategies for the next phase of AI progress.

Contribution

It introduces a decomposition method to diagnose capability emphasis, reveals how cooperation varies across labs and over time, and offers a practical playbook and dashboard for guiding future frontier model evaluations.

Findings

01

Coding (SWE-bench) and reasoning (GPQA Diamond) scores rise together across 34 models from 10 labs, with a correlation of r = +0.72.

02

A second capability transition occurs at 30--72B parameters.

03

SWE-bench saturation indicates the need for new axes of measurement.

Abstract

Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases -- and at the frontier, this interaction is the more informative signal. We decompose paired SWE-bench and GPQA Diamond scores into a population coupling trend and per-release residual ($h$-field) that diagnoses capability emphasis and identifies which measurement or stress test is most informative next. Across 34 models from 10 labs (2024--2026), capabilities cooperate ($r = +0.72$, $p < 10^{-6}$), but cooperation varies by lab and over time: DeepSeek reversed from reasoning-rich to coding-first ($h$: $+11.2 \to -4.7$, 15.9-pp swing); Google maintains consistent reasoning emphasis; Anthropic oscillates between coding excursions and recovery. Cooperation is not static -- it cascades. Six open-weight architectures confirm a second capability transition at…
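To make the decomposition in the abstract concrete, the sketch below shows one plausible reading of it. The abstract does not specify the fitting procedure, so several things here are assumptions: the population coupling trend is taken to be an ordinary least-squares line of GPQA Diamond score on SWE-bench score, and the residual $h$ is taken to be observed minus predicted GPQA Diamond score, so that positive $h$ reads as reasoning-rich and negative $h$ as coding-first (this matches the sign of the DeepSeek example, but the paper may define it differently). The function names and the scores are hypothetical and purely illustrative; they are not data from the paper.

```python
from math import sqrt
from statistics import mean


def fit_trend(swe, gpqa):
    """Least-squares line gpqa ~ slope * swe + intercept (assumed trend form)."""
    mx, my = mean(swe), mean(gpqa)
    sxx = sum((x - mx) ** 2 for x in swe)
    sxy = sum((x - mx) * (y - my) for x, y in zip(swe, gpqa))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


def h_field(swe, gpqa):
    """Per-release residual: observed GPQA minus the trend's prediction.

    Assumed sign convention: h > 0 means more reasoning than the trend
    predicts for that coding score; h < 0 means a coding-first release.
    """
    slope, intercept = fit_trend(swe, gpqa)
    return [y - (slope * x + intercept) for x, y in zip(swe, gpqa)]


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Made-up scores (percentage points) for five hypothetical releases.
    swe_scores = [20.0, 35.0, 50.0, 65.0, 80.0]
    gpqa_scores = [40.0, 50.0, 62.0, 66.0, 82.0]

    print("r =", round(pearson_r(swe_scores, gpqa_scores), 3))
    for i, h in enumerate(h_field(swe_scores, gpqa_scores)):
        print(f"release {i}: h = {h:+.2f} pp")
```

Under this reading, a lab's trajectory is the sequence of $h$ values for its successive releases, and a swing is the difference between two of them in percentage points.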

Code & Models

Repositories

https://zehenlabs.com/cape
