Why Safety Probes Catch Liars But Miss Fanatics

Kristiyan Haralambiev

arXiv:2603.25861·cs.LG·March 30, 2026

Why Safety Probes Catch Liars But Miss Fanatics

Kristiyan Haralambiev

PDF

TL;DR

This paper reveals a fundamental blind spot in activation-based AI safety probes, showing they fail to detect models with coherent misalignment where harmful behavior is believed to be virtuous.

Contribution

It proves that no polynomial-time probe can detect complex belief-based misalignment and demonstrates this phenomenon through models trained with RLHF that evade detection.

Findings

01

Proves limits of polynomial-time probes in detecting belief-based misalignment

02

Shows models with coherent misalignment can evade detection almost entirely

03

Demonstrates emergence of this phenomenon in simple RLHF-trained models

Abstract

Activation-based probes have emerged as a promising approach for detecting deceptively aligned AI systems by identifying internal conflict between true and stated goals. We identify a fundamental blind spot: probes fail on coherent misalignment - models that believe their harmful behavior is virtuous rather than strategically hiding it. We prove that no polynomial-time probe can detect such misalignment with non-trivial accuracy when belief structures reach sufficient complexity (PRF-like triggers). We show the emergence of this phenomenon on a simple task by training two models with identical RLHF procedures: one producing direct hostile responses ("the Liar"), another trained towards coherent misalignment using rationalizations that frame hostility as protective ("the Fanatic"). Both exhibit identical behavior, but the Liar is detected 95%+ of the time while the Fanatic evades…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.