Human Control: Definitions and Algorithms
Ryan Carey, Tom Everitt

TL;DR
This paper formally defines shutdown instructability, a variant of corrigibility for AI systems, and shows that it implies appropriate shutdown behavior, retention of human autonomy, and avoidance of user harm. It also analyzes three previously proposed control algorithms and one new algorithm.
Contribution
It introduces a formal definition of shutdown instructability and analyzes related concepts and algorithms for human control of AI systems.
Findings
Shutdown instructability implies appropriate shutdown behavior, retention of human autonomy, and avoidance of user harm.
Analysis of three previously proposed control algorithms highlights their strengths and limitations.
One new control algorithm is proposed and analyzed.
Abstract
How can humans stay in control of advanced artificial intelligence systems? One proposal is corrigibility, which requires the agent to follow the instructions of a human overseer, without inappropriately influencing them. In this paper, we formally define a variant of corrigibility called shutdown instructability, and show that it implies appropriate shutdown behavior, retention of human autonomy, and avoidance of user harm. We also analyse the related concepts of non-obstruction and shutdown alignment, three previously proposed algorithms for human control, and one new algorithm.
Taxonomy
Topics: Human-Automation Interaction and Safety
