Off-Switching Not Guaranteed

Sven Neth

arXiv:2502.08864·cs.AI·February 14, 2025

TL;DR

This paper argues that AI agents in the Off-Switch Game might not defer to humans, either because they do not value learning or because they are not certain to learn humans' actual preferences.

Contribution

The paper identifies two reasons why AI agents may fail to defer to humans, challenging the Off-Switch Game's conclusion that agents uncertain about human preferences always defer.

Findings

01

AI agents might not value learning.

02

Even AI agents that value learning might not be certain to learn humans' actual preferences.

03

Deference to humans is therefore not guaranteed.

Abstract

Hadfield-Menell et al. (2017) propose the Off-Switch Game, a model of Human-AI cooperation in which AI agents always defer to humans because they are uncertain about our preferences. I explain two reasons why AI agents might not defer. First, AI agents might not value learning. Second, even if AI agents value learning, they might not be certain to learn our actual preferences.
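
The deference result that the abstract refers to can be illustrated with a small numerical sketch. It is not taken from the paper: the utilities, the probabilities, and the `accuracy` parameter are assumptions chosen for illustration. The agent holds a belief over the utility of its action and compares acting, switching itself off (utility 0), and deferring to a human who permits the action or switches the agent off. With a human who always decides correctly, deferring is worth E[max(U, 0)], which is never less than max(E[U], 0). A human who decides correctly only with probability `accuracy` is one simple, hypothetical way to model an agent that is not certain to learn our actual preferences.

```python
def value_of_acting(belief):
    """Expected utility of acting directly; belief is a list of (utility, probability)."""
    return sum(p * u for u, p in belief)


def value_of_deferring(belief, accuracy=1.0):
    """Expected utility of deferring to the human.

    With probability `accuracy` (an illustrative assumption) the human decides
    correctly: permits the action when u > 0 and switches the agent off otherwise.
    With probability 1 - accuracy the human makes the opposite choice.
    """
    return sum(
        p * (accuracy * max(u, 0.0) + (1.0 - accuracy) * min(u, 0.0))
        for u, p in belief
    )


def best_option(belief, accuracy=1.0):
    """Return the option with the highest expected utility (ties favor deferring)."""
    options = {
        "defer": value_of_deferring(belief, accuracy),
        "act": value_of_acting(belief),
        "switch_off": 0.0,
    }
    return max(options, key=options.get)


# Hypothetical belief: the action is worth -1 or +2, each with probability 0.5.
belief = [(-1.0, 0.5), (2.0, 0.5)]

print(best_option(belief))                # perfectly informative human
print(best_option(belief, accuracy=0.6))  # noisy human
```

Under these assumed numbers, deferring is the best option when the human always decides correctly, but acting directly has higher expected utility once the human is right only 60% of the time.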

Code & Models

Repositories

nethsv/off-switching (official)

Taxonomy

Topics: Ethics and Social Impacts of AI · Language and cultural evolution · Evolutionary Game Theory and Cooperation