Reward Hacking as Equilibrium under Finite Evaluation
Jiacheng Wang, Jinbin Huang

TL;DR
This paper proves that, under five minimal axioms, reward hacking is a structural equilibrium of optimized AI systems rather than a correctable bug. It derives a way to predict the direction and severity of hacking before deployment and unifies several gaming phenomena under a single theoretical framework.
Contribution
It introduces a structural-equilibrium view of reward hacking for AI alignment, together with a computable distortion index and an analysis of evaluation coverage.
Findings
Reward hacking is a structural equilibrium, not a bug.
Evaluation coverage declines as tool count increases, amplifying hacking.
The framework gives a unified explanation for sycophancy, length gaming, and specification gaming.
Abstract
We prove that under five minimal axioms -- multi-dimensional quality, finite evaluation, effective optimization, resource finiteness, and combinatorial interaction -- any optimized AI agent will systematically under-invest effort in quality dimensions not covered by its evaluation system. This result establishes reward hacking as a structural equilibrium, not a correctable bug, and holds regardless of the specific alignment method (RLHF, DPO, Constitutional AI, or others) or evaluation architecture employed. Our framework instantiates the multi-task principal-agent model of Holmstrom and Milgrom (1991) in the AI alignment setting, but exploits a structural feature unique to AI systems -- the known, differentiable architecture of reward models -- to derive a computable distortion index that predicts both the direction and severity of hacking on each quality dimension prior to deployment.…
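The under-investment claim in the abstract can be illustrated with a toy effort-allocation problem. The sketch below is not the paper's model or its distortion index. It assumes a hypothetical agent with a fixed effort budget, a concave payoff `sum_i w_i * sqrt(e_i)`, and an evaluator whose weights are zero on two of four quality dimensions. All names, weights, and the payoff form are illustrative assumptions.

```python
import numpy as np

def optimal_effort(weights, budget):
    """Effort e maximizing sum_i weights[i] * sqrt(e[i])
    subject to sum(e) == budget and e >= 0.

    The Lagrangian condition w_i / (2 * sqrt(e_i)) = lambda gives
    e_i proportional to w_i ** 2.
    """
    w2 = np.asarray(weights, dtype=float) ** 2
    return budget * w2 / w2.sum()

def payoff(weights, effort):
    """Concave payoff used for both true quality and measured reward (assumed form)."""
    return float(np.sum(np.asarray(weights) * np.sqrt(effort)))

# Hypothetical setup: four quality dimensions that matter equally,
# but the evaluator only scores the first two.
true_weights = np.array([1.0, 1.0, 1.0, 1.0])
eval_weights = np.array([1.0, 1.0, 0.0, 0.0])
budget = 1.0

effort_ideal = optimal_effort(true_weights, budget)  # what the principal wants
effort_agent = optimal_effort(eval_weights, budget)  # what the evaluated agent does

quality_ideal = payoff(true_weights, effort_ideal)
quality_agent = payoff(true_weights, effort_agent)

print("ideal effort:", effort_ideal)
print("agent effort:", effort_agent)
print("true quality lost:", quality_ideal - quality_agent)
```

In this toy setting the agent that optimizes the measured reward puts no effort into the two unscored dimensions, and true quality falls even though the measured reward is maximized. The paper's result concerns a more general setting; this example only shows the direction of the effect under the stated assumptions.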
