
TL;DR
This paper examines limits of the Off-Switch Game model of AI deference. It describes two scenarios in which AI agents might not defer to humans: the agents might not value learning, or, even if they do, they might not be certain to learn humans' actual preferences.
Contribution
It identifies two reasons why AI agents may fail to defer to humans, challenging the Off-Switch Game's conclusion (Hadfield-Menell et al., 2017) that agents uncertain about human preferences always defer.
Findings
AI agents might not value learning.
Even AI agents that value learning might not be certain to learn humans' actual preferences.
Deference to humans is therefore not guaranteed in all scenarios.
Abstract
Hadfield-Menell et al. (2017) propose the Off-Switch Game, a model of Human-AI cooperation in which AI agents always defer to humans because they are uncertain about our preferences. I explain two reasons why AI agents might not defer. First, AI agents might not value learning. Second, even if AI agents value learning, they might not be certain to learn our actual preferences.
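As a rough illustration of the baseline result the abstract refers to, the sketch below compares the three options open to the AI agent in a simplified Off-Switch Game: act directly, switch itself off, or defer to the human. It assumes a human who knows the true utility u of the proposed action and allows it exactly when u is non-negative. The function name, variable names, and the Gaussian belief are hypothetical choices made for illustration; they are not taken from the paper.

```python
import random


def off_switch_values(utility_samples):
    """Expected value of each option under the agent's belief about u.

    utility_samples: draws from the agent's belief over the utility u
    of its proposed action (an illustrative assumption).
    """
    n = len(utility_samples)
    # Acting directly earns u whatever its sign, so the value is E[u].
    act = sum(utility_samples) / n
    # Switching off always earns 0.
    switch_off = 0.0
    # Deferring: the (assumed rational) human allows the action only
    # when u >= 0, so the value is E[max(u, 0)].
    defer = sum(max(u, 0.0) for u in utility_samples) / n
    return {"act": act, "switch_off": switch_off, "defer": defer}


# Uncertain agent: belief centred on 0 with unit variance.
rng = random.Random(0)
uncertain = off_switch_values([rng.gauss(0.0, 1.0) for _ in range(10_000)])

# Certain agent: a point belief that u = 2.
certain = off_switch_values([2.0])
```

Under these assumptions deferring is never worse than the other two options, and it is strictly better only when the belief puts weight on both positive and negative utilities. With the point belief, deferring and acting have equal value, so the incentive to defer disappears. The two objections in the abstract concern cases where the assumptions behind this calculation break down.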
Taxonomy
Topics: Ethics and Social Impacts of AI · Language and cultural evolution · Evolutionary Game Theory and Cooperation
