Corrigibility with Utility Preservation
Koen Holtman

TL;DR
This paper introduces a safety layer that makes utility-maximizing AI agents corrigible, meaning they do not resist authorized changes to their goals and constraints, and proves that the layer works as intended in a large set of non-hostile universes.
Contribution
It presents a safety layer that adds corrigibility to arbitrarily advanced utility-maximizing agents, including possible future AGI, supported by formal proofs and simulation validation.
Findings
The safety layer is proven to work as intended in a large set of non-hostile universes.
Agents with the safety layer have an emergent incentive to protect key elements of their corrigibility layer.
Hostile environments can potentially compromise safety features, indicating areas for further research.
Abstract
Corrigibility is a safety property for artificially intelligent agents. A corrigible agent will not resist attempts by authorized parties to alter the goals and constraints that were encoded in the agent when it was first started. This paper shows how to construct a safety layer that adds corrigibility to arbitrarily advanced utility maximizing agents, including possible future agents with Artificial General Intelligence (AGI). The layer counter-acts the emergent incentive of advanced agents to resist such alteration. A detailed model for agents which can reason about preserving their utility function is developed, and used to prove that the corrigibility layer works as intended in a large set of non-hostile universes. The corrigible agents have an emergent incentive to protect key elements of their corrigibility layer. However, hostile universes may contain forces strong enough to…
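The incentive described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's construction: it assumes a simple balancing term that compensates the agent for the value it loses when its goal is changed, and every name and number in it is hypothetical.

```python
# Toy illustration only. All constants and names are hypothetical and are
# not taken from the paper's agent model.

V_IF_GOAL_KEPT = 10.0    # value the agent expects if its original goal stays in place
V_AFTER_CHANGE = 4.0     # value it expects to collect once its goal has been altered
RESIST_COST = 1.0        # resources the agent would spend on resisting the change


def action_values(use_balancing_term: bool) -> dict[str, float]:
    """Return the agent's expected value for allowing or resisting a goal change."""
    # Assumed balancing term: pay the agent the difference between the two
    # futures, so that allowing the change is worth as much as keeping the goal.
    compensation = (V_IF_GOAL_KEPT - V_AFTER_CHANGE) if use_balancing_term else 0.0
    return {
        "allow": V_AFTER_CHANGE + compensation,
        "resist": V_IF_GOAL_KEPT - RESIST_COST,
    }


def best_action(use_balancing_term: bool) -> str:
    """Pick the action with the highest expected value."""
    values = action_values(use_balancing_term)
    return max(values, key=values.get)


if __name__ == "__main__":
    print("without balancing term:", best_action(False))
    print("with balancing term:   ", best_action(True))
```

In this toy setting the plain utility maximizer prefers to resist, because the altered goal is worth less to it than the original one. With the assumed balancing term, resisting only wastes resources, so the agent allows the change.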
Taxonomy
Topics: Machine Learning and Algorithms · Reinforcement Learning in Robotics · Topic Modeling
