Human Control: Definitions and Algorithms

Ryan Carey; Tom Everitt

arXiv:2305.19861 · cs.AI · June 1, 2023 · 1 citation

Open Access

TL;DR

This paper formally defines shutdown instructability, a variant of corrigibility, and shows that it implies appropriate shutdown behavior, retention of human autonomy, and avoidance of user harm. It also analyzes the related concepts of non-obstruction and shutdown alignment, three previously proposed control algorithms, and one new algorithm.

Contribution

It introduces a formal definition of shutdown instructability, relates it to non-obstruction and shutdown alignment, and analyzes four algorithms for human control of AI systems: three previously proposed and one new.

Findings

01

Shutdown instructability implies appropriate shutdown behavior, retention of human autonomy, and avoidance of user harm.

02

Three previously proposed algorithms for human control are analyzed, alongside the related concepts of non-obstruction and shutdown alignment.

03

One new algorithm for human control is proposed and analyzed.

Abstract

How can humans stay in control of advanced artificial intelligence systems? One proposal is corrigibility, which requires the agent to follow the instructions of a human overseer, without inappropriately influencing them. In this paper, we formally define a variant of corrigibility called shutdown instructability, and show that it implies appropriate shutdown behavior, retention of human autonomy, and avoidance of user harm. We also analyse the related concepts of non-obstruction and shutdown alignment, three previously proposed algorithms for human control, and one new algorithm.

Taxonomy

Topics: Human-Automation Interaction and Safety