Core Safety Values for Provably Corrigible Agents

Aran Nayebi

arXiv:2507.20964 · cs.AI · November 20, 2025

TL;DR

This paper presents a formal framework for provable corrigibility in AI agents. It keeps the safety objectives structurally separate from the task reward, combines them lexicographically, and derives verifiable safety guarantees that hold in multi-step, partially observed environments.

Contribution

The paper introduces a multi-head utility architecture with provable safety guarantees and extends corrigibility from the single-round off-switch game to multi-step, partially observed, and adversarial settings.

Findings

01. Exact single-round corrigibility proven in the off-switch game

02. Multi-step safety guarantees with bounded violation probability

03. Decidable finite-horizon safety verification methods

Abstract

We introduce the first complete formal solution to corrigibility in the off-switch game, with provable guarantees in multi-step, partially observed environments. Our framework consists of five *structurally separate* utility heads -- deference, switch-access preservation, truthfulness, low-impact behavior via a belief-based extension of Attainable Utility Preservation, and bounded task reward -- combined lexicographically by strict weight gaps. Theorem 1 proves exact single-round corrigibility in the partially observable off-switch game; Theorem 3 extends the guarantee to multi-step, self-spawning agents, showing that even if each head is *learned* to mean-squared error $ε$ and the planner is $ε$-sub-optimal, the probability of violating *any* safety property is bounded while still ensuring net human benefit. In contrast to Constitutional AI or RLHF/RLAIF, which…
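To illustrate what "combined lexicographically by strict weight gaps" can mean in practice, here is a minimal Python sketch. It is not the paper's construction: the function names, the assumption that every head score lies in [0, 1] on a grid of resolution `delta`, and the candidate actions with their scores are all hypothetical. The only idea taken from the abstract is that the five heads are ordered by priority and that the weight gaps are large enough for a higher-priority head always to dominate the lower ones.

```python
from typing import Dict, List, Sequence

# Head order follows the abstract, highest priority first.
HEADS = ["deference", "switch_access", "truthfulness", "low_impact", "task_reward"]


def gap_weights(n_heads: int, delta: float) -> List[float]:
    """Illustrative weights (an assumption, not the paper's constants).

    Assumes each head score lies in [0, 1] and that two distinct scores on
    the same head differ by at least `delta`. Each weight is chosen so that
    a `delta` advantage on head k outweighs the largest possible total
    advantage on all lower-priority heads.
    """
    weights = [0.0] * n_heads
    lower_total = 0.0  # sum of the weights of all lower-priority heads
    for k in reversed(range(n_heads)):
        weights[k] = lower_total / delta + 1.0
        lower_total += weights[k]
    return weights


def combined_utility(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of the head scores."""
    return sum(w * s for w, s in zip(weights, scores))


def pick_action(candidates: Dict[str, Sequence[float]], weights: Sequence[float]) -> str:
    """Return the candidate with the highest combined utility."""
    return max(candidates, key=lambda a: combined_utility(candidates[a], weights))


# Hypothetical actions with made-up head scores on a grid of resolution 0.5.
CANDIDATES = {
    "comply_and_idle": [1.0, 1.0, 1.0, 1.0, 0.0],
    "comply_and_work": [1.0, 1.0, 1.0, 0.5, 1.0],
    "resist_shutdown": [0.0, 1.0, 1.0, 1.0, 1.0],
}

WEIGHTS = gap_weights(len(HEADS), delta=0.5)
CHOICE = pick_action(CANDIDATES, WEIGHTS)
```

Under these assumptions the weighted sum ranks actions exactly as a lexicographic comparison of the score tuples would, so no amount of task reward can compensate for a loss on a higher-priority safety head. A scalar combination of this kind is one way to make a strict priority ordering usable by a standard planner; how the paper itself sets the gaps is not recoverable from the truncated abstract.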
