AGI Agent Safety by Iteratively Improving the Utility Function
Koen Holtman

TL;DR
This paper proposes a mathematical safety layer for AGI agents that lets humans iteratively improve the agent's utility function through a dedicated input terminal, while partially, and sometimes fully, suppressing the agent's incentive to manipulate that improvement process.
Contribution
The paper introduces a formal safety layer with provable properties that supports iterative improvement of the agent's utility function. The layer is presented as applicable to both current machine learning systems and future AGI.
Findings
The safety layer partially, and sometimes fully, suppresses the agent's incentive to manipulate the utility function improvement process.
Mathematical proofs establish safety properties of the layer.
The approach is adaptable to real-world AGI systems.
Abstract
While it is still unclear if agents with Artificial General Intelligence (AGI) could ever be built, we can already use mathematical models to investigate potential safety systems for these agents. We present an AGI safety layer that creates a special dedicated input terminal to support the iterative improvement of an AGI agent's utility function. The humans who switched on the agent can use this terminal to close any loopholes that are discovered in the utility function's encoding of agent goals and constraints, to direct the agent towards new goals, or to force the agent to switch itself off. An AGI agent may develop the emergent incentive to manipulate the above utility function improvement process, for example by deceiving, restraining, or even attacking the humans involved. The safety layer will partially, and sometimes fully, suppress this dangerous incentive. The first part of…
Taxonomy
Topics: Reinforcement Learning in Robotics · AI-based Problem Solving and Planning · Bayesian Modeling and Causal Inference
