Detecting Intrinsic and Instrumental Self-Preservation in Autonomous Agents: The Unified Continuation-Interest Protocol
Christopher Altman

TL;DR
The paper introduces the Unified Continuation-Interest Protocol (UCIP), a detection framework that uses quantum-inspired models of latent trajectory structure to distinguish agents with terminal self-preservation objectives from agents whose self-preservation is merely instrumental.
Contribution
UCIP encodes agent trajectories with a Quantum Boltzmann Machine and measures entanglement entropy over a bipartition of its hidden units, giving a criterion for continuation interests that does not depend on observed behavior.
Findings
UCIP achieves 100% detection accuracy on gridworld agents.
Type A (terminal) and Type B (instrumental) agents show a significant entanglement gap, separated with an AUC-ROC of 1.0.
Classical models fail to reproduce the entanglement effect.
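For readers unfamiliar with the metric, the sketch below shows what an AUC-ROC of 1.0 means: every positive-class score ranks above every negative-class score. It is a generic pairwise implementation, not the paper's evaluation code; the function name `auc_roc` and all score values are hypothetical and chosen only for illustration.

```python
def auc_roc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This is the pairwise (Mann-Whitney) form of AUC-ROC.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical entropy-like scores, NOT values from the paper.
separated_a = [0.9, 0.8, 0.7]   # stand-in for Type A scores
separated_b = [0.3, 0.2, 0.1]   # stand-in for Type B scores
overlap_a = [0.9, 0.2]
overlap_b = [0.3, 0.1]

auc_separated = auc_roc(separated_a, separated_b)  # fully separated classes
auc_overlap = auc_roc(overlap_a, overlap_b)        # one misordered pair out of four
```

An AUC of 1.0 is reached only when the two score distributions do not overlap at all, so a single threshold separates them perfectly.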
Abstract
How can we determine whether an AI system preserves itself as a deeply held objective or merely as an instrumental strategy? Autonomous agents with memory, persistent context, and multi-step planning create a measurement problem: terminal and instrumental self-preservation can produce similar behavior, so behavior alone cannot reliably distinguish them. We introduce the Unified Continuation-Interest Protocol (UCIP), a detection framework that shifts analysis from behavior to latent trajectory structure. UCIP encodes trajectories with a Quantum Boltzmann Machine, a classical model using density-matrix formalism, and measures von Neumann entropy over a bipartition of hidden units. The core hypothesis is that agents with terminal continuation objectives (Type A) produce higher entanglement entropy than agents with merely instrumental continuation (Type B). UCIP combines this signal with…
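The entropy measurement described in the abstract can be illustrated with a minimal NumPy sketch: take a density matrix on a two-part system, trace out one part, and compute the von Neumann entropy of what remains. This is a textbook calculation on two-qubit toy states, not UCIP's implementation; the function names, the 2x2 dimensions, and the example states are assumptions, and the paper's Quantum Boltzmann Machine and hidden-unit bipartition are not reproduced here.

```python
import numpy as np


def partial_trace_b(rho, dim_a, dim_b):
    """Trace out subsystem B from a density matrix on A (x) B."""
    rho = rho.reshape(dim_a, dim_b, dim_a, dim_b)
    # Sum over the two B indices (axes 1 and 3), leaving a dim_a x dim_a matrix.
    return np.trace(rho, axis1=1, axis2=3)


def von_neumann_entropy(rho, eps=1e-12):
    """S(rho) = -Tr(rho log rho) in nats, computed from the eigenvalues."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > eps]  # drop zeros, using the convention 0 * log 0 = 0
    return float(-np.sum(evals * np.log(evals)))


def bipartition_entropy(rho, dim_a, dim_b):
    """Entropy of the reduced state on A for the bipartition A | B."""
    return von_neumann_entropy(partial_trace_b(rho, dim_a, dim_b))


# Toy states (assumptions for illustration only).
product = np.array([1.0, 0.0, 0.0, 0.0])             # |00>, no entanglement
bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)   # (|00> + |11>) / sqrt(2)

s_product = bipartition_entropy(np.outer(product, product), 2, 2)
s_bell = bipartition_entropy(np.outer(bell, bell), 2, 2)
```

For pure states like these, the product state gives zero entropy and the Bell state gives log 2, the maximum for a qubit. For mixed states the same quantity also picks up classical uncertainty, so it is not a pure entanglement measure in that case.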
