Empirical Validation of the Classification-Verification Dichotomy for AI Safety Gates

Arsenios Scrivens

arXiv:2604.00072 · cs.LG · April 2, 2026

TL;DR

This paper shows empirically that classifier-based safety gates cannot maintain reliable oversight of an AI system as it self-improves, and proposes Lipschitz-based verification methods that avoid this limitation.

Contribution

It provides the first comprehensive empirical evidence that classifier-based safety gates cannot support reliable AI self-improvement, and introduces Lipschitz ball verification as an alternative with provable guarantees.

Findings

01

All tested classifiers fail to guarantee safety in self-improving neural controllers.

02

The Lipschitz ball verifier achieves zero false accepts across multiple dimensions and scales.

03

Chain-based Lipschitz verification enables safe, unbounded parameter-space traversal with reward improvements.
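
The summary does not spell out how the Lipschitz ball verifier or its chained variant is constructed, so the following is only a minimal sketch of one common reading of the idea, not the paper's implementation. It assumes a hypothetical safety margin function `margin_fn` that is positive exactly on safe parameters, can be evaluated in a trusted way at a single point, and is Lipschitz in the parameters with a known constant `lipschitz`. Under those assumptions, every point within distance `margin / lipschitz` of a verified center is also safe, and re-centering the ball on each accepted candidate gives a chain of balls. All names here (`ball_radius`, `accepts`, `chain_verify`, `toy_margin`) are illustrative.

```python
import math


def ball_radius(margin, lipschitz):
    """Radius of the ball certified safe around a center with the given margin.

    Assumption: margin_fn is `lipschitz`-Lipschitz, so inside this radius the
    margin cannot drop to zero or below.
    """
    if lipschitz <= 0:
        raise ValueError("lipschitz constant must be positive")
    return max(margin, 0.0) / lipschitz


def accepts(center, radius, candidate):
    """Accept a candidate only if it lies strictly inside the certified ball."""
    return math.dist(center, candidate) < radius


def chain_verify(start, proposals, margin_fn, lipschitz):
    """Walk through proposals, re-centering the ball on each accepted one.

    Returns the final center and a list of accept/reject decisions.
    The start point is assumed to have been verified safe already.
    """
    center = list(start)
    radius = ball_radius(margin_fn(center), lipschitz)
    decisions = []
    for candidate in proposals:
        if accepts(center, radius, candidate):
            # The candidate is certified safe; build a new ball around it.
            center = list(candidate)
            radius = ball_radius(margin_fn(center), lipschitz)
            decisions.append(True)
        else:
            decisions.append(False)
    return center, decisions


def toy_margin(theta):
    """Toy margin: safe set is the open unit ball; this function is 1-Lipschitz."""
    return 1.0 - math.sqrt(sum(x * x for x in theta))


if __name__ == "__main__":
    proposals = [[0.5, 0.0], [0.9, 0.0], [1.2, 0.0], [0.95, 0.0]]
    final, decisions = chain_verify([0.0, 0.0], proposals, toy_margin, 1.0)
    print(final, decisions)
```

In this toy setup the third proposal lies outside the safe set and is rejected, while the others are accepted one ball at a time. The gate never classifies a candidate from learned features: it accepts only what the Lipschitz bound certifies, which is why a sound bound rules out false accepts at the cost of possibly rejecting safe candidates that fall outside the current ball.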

Abstract

Can classifier-based safety gates maintain reliable oversight as AI systems improve over hundreds of iterations? We provide comprehensive empirical evidence that they cannot. On a self-improving neural controller (d=240), eighteen classifier configurations -- spanning MLPs, SVMs, random forests, k-NN, Bayesian classifiers, and deep networks -- all fail the dual conditions for safe self-improvement. Three safe RL baselines (CPO, Lyapunov, safety shielding) also fail. Results extend to MuJoCo benchmarks (Reacher-v4 d=496, Swimmer-v4 d=1408, HalfCheetah-v4 d=1824). At controlled distribution separations up to delta_s=2.0, all classifiers still fail -- including the NP-optimal test and MLPs with 100% training accuracy -- demonstrating structural impossibility. We then show the impossibility is specific to classification, not to safe self-improvement itself. A Lipschitz ball verifier…
