Information-theoretic Distinctions Between Deception and Confusion
Robin Young

TL;DR
This paper introduces an information-theoretic framework to distinguish between deceptive alignment and goal drift in AI safety, highlighting their different causes and intervention strategies.
Contribution
It formalizes the distinction between two AI safety failure modes using information theory and applies this to understanding LLM alignment challenges.
Findings
Deceptive alignment creates entropy between an agent's true goals and its observable behavior.
Goal drift (confusion) creates entropy between the intended human goal and the agent's actual goal.
Although the two failures are often observationally equivalent, the formal model shows why each calls for a different intervention.
Abstract
We propose an information-theoretic formalization of the distinction between two fundamental AI safety failure modes: deceptive alignment and goal drift. While both can lead to systems that appear misaligned, we demonstrate that they represent distinct forms of information divergence occurring at different interfaces in the human-AI system. Deceptive alignment creates entropy between an agent's true goals and its observable behavior, while goal drift, or confusion, creates entropy between the intended human goal and the agent's actual goal. Though often observationally equivalent, these failures necessitate different interventions. We present a formal model and an illustrative thought experiment to clarify this distinction. We offer a formal language for re-examining prominent alignment challenges observed in Large Language Models (LLMs), offering novel perspectives on their underlying…
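As a rough illustration of the two interfaces described above, the sketch below computes a conditional entropy over small hand-made joint distributions. This is not the paper's formal model, which is not reproduced here. It assumes conditional entropy as a stand-in for "entropy between" two variables, and the variable names, labels, and probabilities are all hypothetical.

```python
import math


def conditional_entropy(joint):
    """Return H(Y | X) in bits for a joint distribution {(x, y): p}."""
    # Marginal distribution of the conditioning variable X.
    px = {}
    for (x, _), p in joint.items():
        px[x] = px.get(x, 0.0) + p
    # H(Y | X) = -sum over (x, y) of p(x, y) * log2(p(x, y) / p(x)).
    h = 0.0
    for (x, _), p in joint.items():
        if p > 0:
            h -= p * math.log2(p / px[x])
    return h


# Interface 1 (assumed): observable behavior -> agent's true goal.
# A deceptive agent behaves identically whatever its goal is,
# so observing the behavior leaves the goal uncertain.
deceptive = {("comply", "aligned"): 0.5, ("comply", "misaligned"): 0.5}
transparent = {("comply", "aligned"): 0.5, ("defect", "misaligned"): 0.5}

# Interface 2 (assumed): intended human goal -> agent's actual goal.
# A drifted agent sometimes ends up with a proxy goal instead of the
# intended one, even though nothing is hidden in its behavior.
drifted = {("helpful", "helpful"): 0.5, ("helpful", "proxy"): 0.5}
faithful = {("helpful", "helpful"): 1.0}

h_goal_given_behavior_deceptive = conditional_entropy(deceptive)
h_goal_given_behavior_transparent = conditional_entropy(transparent)
h_actual_given_intended_drifted = conditional_entropy(drifted)
h_actual_given_intended_faithful = conditional_entropy(faithful)
```

In this toy setup the deceptive and drifted cases each carry one bit of conditional entropy, but at different interfaces: the first between behavior and true goal, the second between intended and actual goal. That is the sense in which the two failures can look alike from outside while locating the uncertainty in different places.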
