TL;DR
This paper analyzes the interactions and trade-offs among frontier AI models' capabilities, proposing diagnostic tools and a playbook to guide measurement and development strategies for the next phase of AI progress.
Contribution
It introduces a decomposition method to diagnose capability emphasis, reveals how cooperation varies across labs and over time, and offers a practical playbook and dashboard for guiding future frontier model evaluations.
Findings
Across 34 models from 10 labs, SWE-bench and GPQA Diamond scores are positively correlated (+0.72): capabilities tend to cooperate rather than trade off.
A second capability transition occurs at 30--72B parameters.
SWE-bench saturation indicates the need for new axes of measurement.
Abstract
Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases -- and at the frontier, this interaction is the more informative signal. We decompose paired SWE-bench and GPQA Diamond scores into a population coupling trend and a per-release residual (-field) that diagnoses capability emphasis and identifies which measurement or stress test is most informative next. Across 34 models from 10 labs (2024--2026), capabilities cooperate (correlation of +0.72), but cooperation varies by lab and over time: DeepSeek reversed from reasoning-rich to coding-first (a 15.9-pp swing); Google maintains consistent reasoning emphasis; Anthropic oscillates between coding excursions and recovery. Cooperation is not static -- it cascades. Six open-weight architectures confirm a second capability transition at…
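The decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the "population coupling trend" can be approximated by an ordinary least-squares line of GPQA Diamond score against SWE-bench score, and that the per-release residual is the vertical distance from that line. The function name `decompose`, the sign convention, and all scores below are hypothetical and invented for illustration; they are not data from the paper.

```python
import numpy as np


def decompose(swe, gpqa):
    """Split paired benchmark scores into a fitted trend and per-release residuals.

    Assumption: the trend is an OLS line gpqa ~ slope * swe + intercept.
    The paper may use a different fit; this is only an illustrative stand-in.
    """
    swe = np.asarray(swe, dtype=float)
    gpqa = np.asarray(gpqa, dtype=float)

    # Population-level coupling: Pearson correlation and a straight-line fit.
    r = np.corrcoef(swe, gpqa)[0, 1]
    slope, intercept = np.polyfit(swe, gpqa, deg=1)

    # Per-release residual: how far each release sits above or below the trend.
    residual = gpqa - (slope * swe + intercept)
    return r, slope, intercept, residual


# Hypothetical scores in percentage points (NOT taken from the paper).
swe = [30.0, 45.0, 50.0, 62.0, 70.0]
gpqa = [48.0, 55.0, 66.0, 68.0, 80.0]

r, slope, intercept, residual = decompose(swe, gpqa)

for s, g, e in zip(swe, gpqa, residual):
    # Assumed convention: positive residual = more reasoning than the trend
    # predicts; negative residual = coding-leaning release.
    leaning = "reasoning-leaning" if e > 0 else "coding-leaning"
    print(f"SWE={s:5.1f}  GPQA={g:5.1f}  residual={e:+6.2f}  ({leaning})")
```

Under this reading, a lab "reversing" its emphasis corresponds to the residual changing sign between successive releases, and the size of that change in percentage points is the swing.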
