Leaderboard Incentives: Model Rankings under Strategic Post-Training
Yatong Chen, Guanhua Zhang, Moritz Hardt

TL;DR
This paper models the strategic incentives that benchmarks create for competing model developers and shows that a suitable evaluation protocol, tune-before-test, can align model rankings with true quality.
Contribution
It introduces a game-theoretic framework for benchmark incentives and proves that a specific evaluation protocol can induce truthful model rankings.
Findings
Current benchmarks induce games in which no Nash equilibrium between model developers exists, which suggests one explanation for misaligned incentives.
The tune-before-test protocol induces a unique Nash equilibrium.
Proper evaluation design can align model rankings with true quality.
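As a rough illustration of the first finding, the sketch below brute-forces a tiny two-developer contest and looks for pure-strategy Nash equilibria. Everything in it is an assumption made for illustration and is not taken from the paper: the additive score (latent quality plus effort), the winner-take-all prize, the linear effort cost, and the names QUALITY, EFFORTS, PRIZE, and COST.

```python
from itertools import product

# Hypothetical toy setup, NOT the paper's model. Integer payoffs avoid float ties.
QUALITY = (1, 0)      # latent quality of developer 0 and developer 1
EFFORTS = range(4)    # benchmark-specific effort levels 0..3
PRIZE = 10            # value of ranking first on the leaderboard
COST = 3              # cost per unit of effort


def payoff(player, efforts):
    """Payoff of `player` when observed score = latent quality + effort."""
    scores = [q + e for q, e in zip(QUALITY, efforts)]
    mine, other = scores[player], scores[1 - player]
    if mine > other:
        reward = PRIZE
    elif mine == other:
        reward = PRIZE // 2   # split the prize on a tie
    else:
        reward = 0
    return reward - COST * efforts[player]


def is_pure_nash(efforts):
    """True if no developer gains by unilaterally changing its effort."""
    for player in (0, 1):
        current = payoff(player, efforts)
        for alt in EFFORTS:
            deviated = list(efforts)
            deviated[player] = alt
            if payoff(player, deviated) > current:
                return False
    return True


# Enumerate all 16 effort profiles and keep those that are stable.
pure_equilibria = [e for e in product(EFFORTS, repeat=2) if is_pure_nash(e)]
print(pure_equilibria)  # -> []
```

In this toy, the stronger developer always wants to match the weaker one's effort, while the weaker developer either leapfrogs or drops out, so no effort profile is stable. Because the toy game is finite, it still has a mixed-strategy equilibrium; the paper's nonexistence result concerns its own model, which this sketch does not reproduce.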
Abstract
Influential benchmarks incentivize competing model developers to strategically allocate post-training resources toward improvements on the leaderboard, a phenomenon dubbed benchmaxxing or training on the test task. In this work, we initiate a principled study of the incentive structure that benchmarks induce. We model benchmarking as a Stackelberg game between a benchmark designer who chooses an evaluation protocol and multiple model developers who compete simultaneously in a subgame given by the designer's choice. Each competitor has a model of unknown latent quality and can inflate its observed score by allocating resources to benchmark-specific improvements. First, we prove that current benchmarks induce games for which no Nash equilibrium between model developers exists. This result suggests one explanation for why current practice leads to misaligned incentives, prompting model…
Taxonomy
Topics: Artificial Intelligence in Games · Experimental Behavioral Economics Studies · Reinforcement Learning in Robotics
