The Geometry of Harmful Intent: Training-Free Anomaly Detection via Angular Deviation in LLM Residual Streams

Isaac Llorente-Saguer

arXiv:2603.27412 · cs.LG · March 31, 2026

TL;DR

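LatentBiopsy is a training-free geometric method that detects harmful prompts by measuring how far a prompt's residual-stream activation deviates in angle from a reference direction fitted on safe prompts. The abstract reports AUROC ≥ 0.937 across all six evaluated model variants, with minimal overhead.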

Contribution

The paper introduces a training-free approach that uses residual-stream geometry to detect harmful prompts, and shows that it remains effective across base, instruction-tuned, and abliterated model variants.

Findings

01. Harmful prompts have tightly clustered angular distributions.

02. The geometry persists even after the refusal mechanism is ablated.

03. Opposite ring orientations are observed in different model families.

Abstract

We present LatentBiopsy, a training-free method for detecting harmful prompts by analysing the geometry of residual-stream activations in large language models. Given 200 safe normative prompts, LatentBiopsy computes the leading principal component of their activations at a target layer and characterises new prompts by their radial deviation angle $\theta$ from this reference direction. The anomaly score is the negative log-likelihood of $\theta$ under a Gaussian fit to the normative distribution, flagging deviations symmetrically regardless of orientation. No harmful examples are required for training. We evaluate two complete model triplets from the Qwen3.5-0.8B and Qwen2.5-0.5B families: base, instruction-tuned, and abliterated (refusal direction surgically removed via orthogonalisation). Across all six variants, LatentBiopsy achieves AUROC $\geq$ 0.937 for…
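The scoring pipeline described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the author's implementation: the function names (`angle_to`, `fit_reference`, `anomaly_score`) are hypothetical, and two details the abstract does not specify are assumed here, namely that the principal component is computed on mean-centred activations and that the angle is measured from the raw (uncentred) activation vector. Extracting activations from a model at the target layer is left out; the inputs are assumed to be arrays of shape `(n_prompts, d_model)`.

```python
import numpy as np


def angle_to(acts, direction):
    """Angle in radians between each row of `acts` and `direction`."""
    cos = acts @ direction / (
        np.linalg.norm(acts, axis=-1) * np.linalg.norm(direction)
    )
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return np.arccos(np.clip(cos, -1.0, 1.0))


def fit_reference(safe_acts):
    """Fit the reference direction and the Gaussian over angles.

    safe_acts: (n_prompts, d_model) activations of safe prompts.
    Returns (direction, mu, sigma).
    """
    # Assumption: PCA on mean-centred activations; the leading right
    # singular vector is the leading principal component.
    centred = safe_acts - safe_acts.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    direction = vt[0]
    # Assumption: angles are measured from the raw activations.
    theta = angle_to(safe_acts, direction)
    return direction, theta.mean(), theta.std()


def anomaly_score(acts, direction, mu, sigma):
    """Gaussian negative log-likelihood of each prompt's angle."""
    theta = angle_to(acts, direction)
    return 0.5 * np.log(2 * np.pi * sigma**2) + (theta - mu) ** 2 / (2 * sigma**2)
```

Because the score depends on $(\theta - \mu)^2$, a prompt whose angle falls below the normative mean is flagged just as strongly as one that falls equally far above it, which matches the abstract's description of symmetric flagging. The sign of a principal component is arbitrary, but since the same `direction` is used for fitting and scoring, the result does not depend on which sign the SVD returns.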
