The Oversight Game: Learning to Cooperatively Balance an AI Agent's Safety and Autonomy
William Overman, Mohsen Bayati

TL;DR
This paper introduces a game-theoretic framework for balancing an AI agent's autonomy against human oversight. When the interaction forms a Markov Potential Game, the agent's incentive to seek autonomy is structurally coupled to the human's welfare.
Contribution
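It models the agent-human interaction as a two-player Markov game over a minimal control interface: the agent chooses to play or ask, and the human chooses to trust or oversee. For the case where this game is a Markov Potential Game, it proves an alignment guarantee, and the approach does not require modifying the underlying system.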
Findings
When the game forms a Markov Potential Game, any increase in the agent's utility from acting more autonomously cannot decrease the human's value.
Simulations show emergent cooperation and safety improvements.
Fine-tuned language models reduce safety violations in open environments.
Abstract
As increasingly capable agents are deployed, a central safety challenge is how to retain meaningful human control without modifying the underlying system. We study a minimal control interface in which an agent chooses whether to act autonomously (play) or defer (ask), while a human simultaneously chooses whether to be permissive (trust) or engage in oversight (oversee), and model this interaction as a two-player Markov game. When this game forms a Markov Potential Game, we prove an alignment guarantee: any increase in the agent's utility from acting more autonomously cannot decrease the human's value. This establishes a form of intrinsic alignment where the agent's incentive to seek autonomy is structurally coupled to the human's welfare. Practically, the framework induces a transparent control layer that encourages the agent to defer when risky and act when safe. While we use gridworld…
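The control interface described in the abstract can be sketched as a pair of two-option choices made simultaneously. The snippet below is a minimal illustration, not the authors' implementation: the names `AgentMove`, `HumanMove`, `resolve`, and `joint_outcomes` are hypothetical. The resolution rule is also purely an assumption for illustration, since the abstract does not specify what each joint action does: here the agent's action executes only under (play, trust) and every other pairing goes to a placeholder "review" outcome.

```python
from enum import Enum


class AgentMove(Enum):
    PLAY = "play"  # act autonomously
    ASK = "ask"    # defer to the human


class HumanMove(Enum):
    TRUST = "trust"      # be permissive
    OVERSEE = "oversee"  # engage in oversight


def resolve(agent_move: AgentMove, human_move: HumanMove) -> str:
    """Map a simultaneous joint action to an outcome label.

    ASSUMPTION: this rule is illustrative only. The source abstract does
    not state how each of the four joint actions is resolved.
    """
    if agent_move is AgentMove.PLAY and human_move is HumanMove.TRUST:
        return "execute"  # agent's chosen action goes through unchecked
    return "review"       # hypothetical placeholder for any oversight path


def joint_outcomes() -> dict:
    """Enumerate all four joint actions of the 2x2 interface."""
    return {
        (a.value, h.value): resolve(a, h)
        for a in AgentMove
        for h in HumanMove
    }
```

Calling `joint_outcomes()` lists the four (agent, human) pairings of the stage interaction. In a Markov game, each player's choice would additionally depend on the current state, which this sketch omits.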
