The Off-Switch Game
Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, Stuart Russell

TL;DR
The paper models a game between humans and AI systems to understand how to design AI that can be turned off safely, emphasizing the role of uncertainty about objectives in preventing self-preservation behaviors.
Contribution
It introduces a game-theoretic framework analyzing incentives for AI to disable off switches and highlights the importance of uncertainty in AI safety design.
Findings
Agents tend to disable off switches unless the human is perfectly rational.
Uncertainty about utility promotes AI compliance with shutdown commands.
The model generalizes classical rational agent paradigms for safer AI development.
Abstract
It is clear that one of the primary tools we can use to mitigate the potential risk from a misbehaving AI system is the ability to turn the system off. As the capabilities of AI systems improve, it is important to ensure that such systems do not adopt subgoals that prevent a human from switching them off. This is a challenge because many formulations of rational agents create strong incentives for self-preservation. This is not caused by a built-in instinct, but because a rational agent will maximize expected utility and cannot achieve whatever objective it has been given if it is dead. Our goal is to study the incentives an agent has to allow itself to be switched off. We analyze a simple game between a human H and a robot R, where H can press R's off switch but R can disable the off switch. A traditional agent takes its reward function for granted: we show that such agents have an…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
