Information-Theoretic Limits of Safety Verification for Self-Improving Systems

Arsenios Scrivens

arXiv:2603.28650 · cs.LG · April 3, 2026

TL;DR

This paper asks whether a safety gate can permit unbounded beneficial self-modification while keeping cumulative risk bounded. It proves that classifier-based gates cannot do so under power-law risk schedules, bounds the utility they can achieve, and shows that a Lipschitz verifier can escape the impossibility.

Contribution

It formalizes the trade-offs between unbounded utility and bounded risk, providing new impossibility results and bounds for safety verification in self-modifying AI systems.

Findings

01

Power-law risk schedules limit classifier utility to subpolynomial growth.

02

A Lipschitz verifier can escape the impossibility by achieving zero risk with positive TPR.

03

Empirical validation on GPT-2 demonstrates the practical relevance of the theoretical bounds.

Abstract

Can a safety gate permit unbounded beneficial self-modification while maintaining bounded cumulative risk? We formalize this question through dual conditions -- requiring sum delta_n < infinity (bounded risk) and sum TPR_n = infinity (unbounded utility) -- and establish a theory of their (in)compatibility. Classification impossibility (Theorem 1): For power-law risk schedules delta_n = O(n^{-p}) with p > 1, any classifier-based gate under overlapping safe/unsafe distributions satisfies TPR_n <= C_alpha * delta_n^beta via Hölder's inequality, forcing sum TPR_n < infinity. This impossibility is exponent-optimal (Theorem 3). A second independent proof via the NP counting method (Theorem 4) yields a 13% tighter bound without Hölder's inequality. Universal finite-horizon ceiling (Theorem 5): For any summable risk schedule, the exact maximum achievable classifier utility is U*(N, B) = N *…
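The dual conditions in the abstract can be illustrated numerically. The sketch below is not the paper's code: it takes a power-law schedule delta_n = n^{-p}, applies the stated ceiling TPR_n <= C_alpha * delta_n^beta, and prints running totals of risk and of the utility ceiling. The values of p, C_alpha, and beta are hypothetical placeholders, since the visible part of the abstract does not give them. They are chosen so that p * beta > 1, which is the case in which a direct comparison with the series sum n^{-p*beta} shows the ceiling is summable.

```python
def partial_sums(terms):
    """Return the running totals of a sequence of floats."""
    total, out = 0.0, []
    for t in terms:
        total += t
        out.append(total)
    return out


def power_law_schedule(n_steps, p):
    """Risk budget delta_n = n^(-p) for n = 1..n_steps (leading constant set to 1)."""
    return [n ** (-p) for n in range(1, n_steps + 1)]


def tpr_ceiling(deltas, c_alpha, beta):
    """Upper bound C_alpha * delta_n^beta on TPR_n, capped at 1 because TPR is a rate."""
    return [min(1.0, c_alpha * d ** beta) for d in deltas]


# Hypothetical values for illustration only; none of them come from the paper.
P, C_ALPHA, BETA = 2.0, 1.0, 0.8

deltas = power_law_schedule(100_000, P)
risk = partial_sums(deltas)                                    # sum of delta_n
utility_cap = partial_sums(tpr_ceiling(deltas, C_ALPHA, BETA))  # ceiling on sum of TPR_n

# Both running totals flatten out as n grows: bounded risk comes with bounded utility.
for n in (10, 1_000, 100_000):
    print(n, round(risk[n - 1], 4), round(utility_cap[n - 1], 4))
```

Under these assumed constants both columns level off, which is the incompatibility the abstract describes: once the per-step ceiling is summable, the gate cannot satisfy sum TPR_n = infinity.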
