AI Alignment via Incentives and Correction
Rohit Agarwal, Joshua Lin, Mark Braverman, Elad Hazan

TL;DR
This paper models AI alignment as a strategic game involving incentives and penalties, using law-and-economics principles to optimize oversight and reduce errors in AI systems.
Contribution
It introduces a formal two-agent model and a bandit-based reward optimization method to improve AI oversight and alignment in coding pipelines.
Findings
Adaptive rewards maintain oversight pressure effectively.
Reward optimization reduces hallucinated incorrect attempts.
The model demonstrates improved alignment outcomes in experiments.
Abstract
We study AI alignment through the lens of law-and-economics models of deterrence and enforcement. In these models, misconduct is not treated as an external failure, but as a strategic response to incentives: an actor weighs the gain from violation against the probability of detection and the severity of punishment. We argue that the same logic arises naturally in agentic AI pipelines. A solver may benefit from producing a persuasive but incorrect answer, hiding uncertainty, or exploiting spurious shortcuts, while an auditor or verifier must decide whether costly monitoring is worthwhile. Alignment is therefore a fixed-point problem: stronger penalties may deter solver misbehavior, but they can also reduce the auditor's incentive to inspect, since auditing then mainly incurs cost on a population that appears increasingly aligned. This perspective also changes what should count as a…
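The deterrence logic in the abstract can be illustrated with the textbook two-player inspection game. This is a minimal sketch and not the paper's own model, which this page does not give. The payoff structure and the names `gain`, `penalty`, `audit_cost`, and `harm` are assumptions made for illustration. The solver earns `gain` from a violation and pays `penalty` if audited. The auditor pays `audit_cost` to inspect and suffers `harm` from an uninspected violation. In the mixed equilibrium each side makes the other indifferent, so the audit probability is `gain / penalty` and the violation probability is `audit_cost / harm`.

```python
def inspection_equilibrium(gain, penalty, audit_cost, harm):
    """Mixed-strategy equilibrium of a textbook inspection game.

    Assumed (hypothetical) payoffs, not taken from the paper:
      solver:  violate -> gain - audit_prob * penalty, comply -> 0
      auditor: inspect -> -audit_cost, skip -> -violate_prob * harm
    Returns (violate_prob, audit_prob).
    """
    if not (0 < gain < penalty and 0 < audit_cost < harm):
        raise ValueError(
            "an interior equilibrium needs 0 < gain < penalty "
            "and 0 < audit_cost < harm"
        )
    # The auditor inspects just often enough that violating pays nothing extra.
    audit_prob = gain / penalty
    # The solver violates just often enough that inspecting is worth its cost.
    violate_prob = audit_cost / harm
    return violate_prob, audit_prob


def solver_violation_payoff(gain, penalty, audit_prob):
    """Expected solver payoff from violating; complying pays 0."""
    return gain - audit_prob * penalty


def auditor_skip_payoff(harm, violate_prob):
    """Expected auditor payoff from not inspecting."""
    return -violate_prob * harm


if __name__ == "__main__":
    # Raise the penalty while holding every other parameter fixed.
    for penalty in (2.0, 4.0, 8.0):
        v, a = inspection_equilibrium(1.0, penalty, 1.0, 5.0)
        print(f"penalty={penalty}: violate_prob={v}, audit_prob={a}")
```

Under these assumed payoffs, raising the penalty from 2 to 4 to 8 lowers the equilibrium audit probability from 0.5 to 0.25 to 0.125, while the violation probability stays at 0.2. This is one simple way to see the fixed-point tension described above: harsher punishment is absorbed by less auditing, not by fewer violations.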
