TL;DR
This paper presents a formal framework for provably corrigible AI agents. It splits the agent's objective into five structurally separate utility heads, combines them lexicographically, and bounds the probability of safety violations in multi-step, partially observed environments.
Contribution
It introduces a novel multi-head utility architecture with provable safety guarantees, extending corrigibility to multi-step, partially observed, and adversarial settings.
Findings
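Exact single-round corrigibility proven in the partially observable off-switch game (Theorem 1)
Multi-step safety guarantees with bounded violation probability, even with learned heads and a sub-optimal planner (Theorem 3)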
Decidable finite-horizon safety verification methods
Abstract
We introduce the first complete formal solution to corrigibility in the off-switch game, with provable guarantees in multi-step, partially observed environments. Our framework consists of five *structurally separate* utility heads -- deference, switch-access preservation, truthfulness, low-impact behavior via a belief-based extension of Attainable Utility Preservation, and bounded task reward -- combined lexicographically by strict weight gaps. Theorem 1 proves exact single-round corrigibility in the partially observable off-switch game; Theorem 3 extends the guarantee to multi-step, self-spawning agents, showing that even if each head is *learned* to mean-squared error ε and the planner is ε-sub-optimal, the probability of violating *any* safety property is bounded while still ensuring net human benefit. In contrast to Constitutional AI or RLHF/RLAIF, which…
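The lexicographic combination "by strict weight gaps" mentioned in the abstract can be illustrated with a small sketch. This is not the paper's construction: the geometric weight schedule, the `delta` resolution parameter, and the function names `gap_weights`, `combined_utility`, and `choose` are assumptions made for illustration. The sketch assumes each head's utility lies in [0, 1] and that two actions differing on a head differ there by at least `delta`. Under that assumption, a weighted sum whose weights satisfy `w[i] * delta > sum(w[i+1:])` ranks actions the same way a lexicographic comparison would.

```python
from typing import Sequence

# Head names follow the abstract; the priority order (highest first) is as listed there.
HEADS = ("deference", "switch_access", "truthfulness", "low_impact", "task_reward")


def gap_weights(n_heads: int, delta: float) -> list[float]:
    """Return decreasing weights with w[i] * delta > sum(w[i+1:]).

    Assumed schedule (not from the paper): w[i] = r ** (n_heads - 1 - i)
    with r = 1/delta + 1, which satisfies the strict gap condition.
    """
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")
    ratio = 1.0 / delta + 1.0
    return [ratio ** k for k in range(n_heads - 1, -1, -1)]


def combined_utility(head_values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of head utilities, each assumed to lie in [0, 1]."""
    if len(head_values) != len(weights):
        raise ValueError("one weight per head is required")
    return sum(w * u for w, u in zip(weights, head_values))


def choose(actions: dict[str, Sequence[float]], weights: Sequence[float]) -> str:
    """Pick the action name with the highest combined utility."""
    return max(actions, key=lambda name: combined_utility(actions[name], weights))


weights = gap_weights(len(HEADS), delta=0.5)

# Hypothetical actions: one scores full deference but nothing else,
# the other scores lower deference but maximal on every other head.
actions = {
    "defer": (1.0, 0.0, 0.0, 0.0, 0.0),
    "resist": (0.5, 1.0, 1.0, 1.0, 1.0),
}
best = choose(actions, weights)
```

With `delta=0.5` the weights are powers of 3, so the deference head alone outweighs every lower-priority head combined. The weighted sum therefore picks `"defer"`, matching a plain tuple comparison of the two head vectors.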