The Off-Switch Game

Dylan Hadfield-Menell; Anca Dragan; Pieter Abbeel; Stuart Russell

arXiv:1611.08219 · cs.AI · June 19, 2017

TL;DR

The paper models a simple game between a human H and a robot R, in which H can press R's off switch and R can disable it, to study when an AI system has an incentive to let itself be switched off. It emphasizes that R's uncertainty about its objective is what counteracts the incentive for self-preservation.

Contribution

The paper introduces a game-theoretic model of the incentives an AI system has to disable its off switch, and highlights uncertainty about the objective as a key ingredient in designing systems that can be switched off safely.

Findings

01

Agents that take their reward function for granted have an incentive to disable the off switch, except in the special case where the human is perfectly rational.

02

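An agent that is uncertain about the utility of its actions, and treats the human's decision as evidence about that utility, has an incentive to leave the off switch intact.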

03

The authors argue that this uncertain-objective setting is a useful generalization of the classical AI paradigm of rational agents, and that it leads to safer designs.
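
The first two findings can be illustrated with a small numerical sketch. It is not the paper's code, and it rests on simplifying assumptions: R chooses among acting directly (which bypasses the off switch), switching itself off (worth 0), and deferring to H; R's belief about the utility U_a of acting is a list of equally weighted samples; and H's imperfection is modeled by a hypothetical `human_error_rate`, the probability that H makes the wrong call. The function names and the error-rate model are chosen for illustration only.

```python
from statistics import fmean


def option_values(belief_samples, human_error_rate=0.0):
    """Expected utility to R of each option, under R's belief about U_a.

    belief_samples: equally weighted draws from R's belief over U_a, the
        utility of taking the action (being switched off is worth 0).
    human_error_rate: hypothetical probability that H makes the wrong call
        when R defers (0.0 means H is perfectly rational).
    """
    eps = human_error_rate
    act = fmean(belief_samples)  # bypass H and act, i.e. disable the switch
    off = 0.0                    # R switches itself off
    # Defer to H: with probability 1 - eps, H allows the action only when
    # u >= 0 (yielding max(u, 0)); with probability eps, H does the
    # opposite (yielding min(u, 0)).
    defer = fmean([(1 - eps) * max(u, 0.0) + eps * min(u, 0.0)
                   for u in belief_samples])
    return act, off, defer


def incentive_to_defer(belief_samples, human_error_rate=0.0):
    """Value of deferring minus the best option that ignores H.

    Positive: R prefers to leave the off switch usable.
    Negative: R prefers to act or shut down without consulting H.
    """
    act, off, defer = option_values(belief_samples, human_error_rate)
    return defer - max(act, off)


if __name__ == "__main__":
    certain = [1.0, 1.0, 1.0]     # R is sure the action is worth 1
    uncertain = [-2.0, 1.0, 3.0]  # R is unsure even about the sign
    for name, belief in (("certain", certain), ("uncertain", uncertain)):
        for eps in (0.0, 0.2):
            print(name, eps, round(incentive_to_defer(belief, eps), 3))
```

Under these assumptions, an R that is certain about U_a gains nothing from deferring when H is perfectly rational and loses from deferring as soon as H can err, whereas an R whose belief puts weight on both positive and negative utilities gains from deferring to a rational H. A sufficiently error-prone H can erase that gain.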

Abstract

It is clear that one of the primary tools we can use to mitigate the potential risk from a misbehaving AI system is the ability to turn the system off. As the capabilities of AI systems improve, it is important to ensure that such systems do not adopt subgoals that prevent a human from switching them off. This is a challenge because many formulations of rational agents create strong incentives for self-preservation. This is not caused by a built-in instinct, but because a rational agent will maximize expected utility and cannot achieve whatever objective it has been given if it is dead. Our goal is to study the incentives an agent has to allow itself to be switched off. We analyze a simple game between a human H and a robot R, where H can press R's off switch but R can disable the off switch. A traditional agent takes its reward function for granted: we show that such agents have an…
