The Partially Observable Off-Switch Game
Andrew Garber, Rohan Subramani, Linus Luu, Mark Bedaywi, Stuart Russell, Scott Emmons

TL;DR
This paper models the AI shutdown problem when humans have only limited information and the AI may hold private information, and shows that this asymmetry produces complex strategic behavior and tradeoffs in AI deference and safety.
Contribution
It introduces the Partially Observable Off-Switch Game (PO-OSG), a novel game-theoretic model of the shutdown problem with asymmetric information, and analyzes the strategic AI behaviors that arise in it.
Findings
Optimal AI agents may sometimes avoid shutdown despite rational human oversight.
Increasing the amount of communication or information available never lowers expected payoffs, but bounded communication can reduce AI deference.
Strategic AI deference depends on information asymmetry and communication, affecting safety considerations.
Abstract
A wide variety of goals could cause an AI to disable its off switch because "you can't fetch the coffee if you're dead" (Russell 2019). Prior theoretical work on this shutdown problem assumes that humans know everything that AIs do. In practice, however, humans have only limited information. Moreover, in many of the settings where the shutdown problem is most concerning, AIs might have vast amounts of private information. To capture these differences in knowledge, we introduce the Partially Observable Off-Switch Game (PO-OSG), a game-theoretic model of the shutdown problem with asymmetric information. Unlike when the human has full observability, we find that in optimal play, even AI agents assisting perfectly rational humans sometimes avoid shutdown. As expected, increasing the amount of communication or information available always increases (or leaves unchanged) the agents' expected…
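The claim that an AI may avoid shutdown even in optimal play with a rational human can be illustrated with a small toy game. The sketch below is not the paper's construction: the binary private observations `a` (AI) and `h` (human), the three AI moves (act, wait, off), and the payoff table `U` are all assumptions chosen for illustration, in the spirit of an off-switch game with a shared payoff. It enumerates every deterministic joint policy and reports the best one.

```python
from itertools import product

# Hypothetical toy game (illustrative assumption, not taken from the paper).
# The AI privately observes a in {0, 1}; the human privately observes h in {0, 1}.
# Both are uniform and independent. U[(a, h)] is the shared payoff if the action
# goes ahead; switching off always yields 0.
ACT, WAIT, OFF, ALLOW = "act", "wait", "off", "allow"
U = {(0, 0): -4.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): -1.0}


def expected_payoff(ai_policy, human_policy):
    """Expected shared payoff of a deterministic joint policy.

    ai_policy[a] is ACT (bypass the human), WAIT (defer), or OFF.
    human_policy[h] is ALLOW or OFF and only matters when the AI waits.
    """
    total = 0.0
    for (a, h), u in U.items():
        move = ai_policy[a]
        goes_ahead = move == ACT or (move == WAIT and human_policy[h] == ALLOW)
        total += 0.25 * (u if goes_ahead else 0.0)
    return total


def best_joint_policy(ai_moves=(ACT, WAIT, OFF)):
    """Brute-force search over all deterministic joint policies."""
    best = None
    for ai in product(ai_moves, repeat=2):
        for human in product((ALLOW, OFF), repeat=2):
            value = expected_payoff(ai, human)
            if best is None or value > best[0]:
                best = (value, ai, human)
    return best


if __name__ == "__main__":
    print("optimal play:      ", best_joint_policy())
    # Restrict the AI to always defer, to compare against full deference.
    print("always-defer best: ", best_joint_policy(ai_moves=(WAIT,)))
```

With these assumed payoffs, the best joint policy has the AI act without deferring when it observes `a = 1` and wait when it observes `a = 0`, because waiting itself then tells the human which case holds. Forcing the AI to always defer yields a strictly lower expected payoff in this toy example.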
