The Geometry of Harmful Intent: Training-Free Anomaly Detection via Angular Deviation in LLM Residual Streams
Isaac Llorente-Saguer

TL;DR
LatentBiopsy is a training-free geometric method that detects harmful prompts in large language models by analyzing residual stream angular deviations, achieving high accuracy with minimal overhead.
Contribution
It introduces a novel, training-free approach using residual stream geometry for harmful prompt detection, effective across multiple model variants and ablation conditions.
Findings
Harmful prompts have tightly clustered angular distributions.
This geometry persists even after the refusal mechanism is ablated.
Opposite ring orientations are observed in different model families.
Abstract
We present LatentBiopsy, a training-free method for detecting harmful prompts by analysing the geometry of residual-stream activations in large language models. Given 200 safe normative prompts, LatentBiopsy computes the leading principal component of their activations at a target layer and characterises new prompts by their radial deviation angle from this reference direction. The anomaly score is the negative log-likelihood of this angle under a Gaussian fit to the normative distribution, flagging deviations symmetrically regardless of orientation. No harmful examples are required for training. We evaluate two complete model triplets from the Qwen3.5-0.8B and Qwen2.5-0.5B families: base, instruction-tuned, and abliterated (refusal direction surgically removed via orthogonalisation). Across all six variants, LatentBiopsy achieves AUROC 0.937 for…
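The scoring pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the function names (`angle_to`, `fit_reference`, `gaussian_nll`, `anomaly_score`) are hypothetical, the choice to mean-centre activations before PCA and to measure the angle on raw activations is an assumption, and the random arrays stand in for real residual-stream activations.

```python
import numpy as np

def angle_to(acts, direction):
    """Angle (radians) between each row of `acts` and a reference direction."""
    cos = acts @ direction / (np.linalg.norm(acts, axis=1) * np.linalg.norm(direction))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def fit_reference(safe_acts):
    """Fit the reference direction and the Gaussian over normative angles.

    Assumption: PCA is run on mean-centred activations; the abstract does
    not say whether centring is applied.
    """
    centred = safe_acts - safe_acts.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    direction = vt[0]
    theta = angle_to(safe_acts, direction)
    return direction, theta.mean(), theta.std(ddof=1)

def gaussian_nll(theta, mu, sigma):
    """Negative log-likelihood of an angle under N(mu, sigma^2)."""
    return 0.5 * np.log(2 * np.pi * sigma**2) + (theta - mu) ** 2 / (2 * sigma**2)

def anomaly_score(acts, direction, mu, sigma):
    """Higher score = angle further from the normative mean, in either direction."""
    return gaussian_nll(angle_to(acts, direction), mu, sigma)

# Synthetic stand-ins for layer activations (not real model data).
rng = np.random.default_rng(0)
safe_acts = rng.normal(size=(200, 64))  # 200 safe normative prompts
new_acts = rng.normal(size=(5, 64))     # prompts to be scored

direction, mu, sigma = fit_reference(safe_acts)
scores = anomaly_score(new_acts, direction, mu, sigma)
```

Because the score depends only on the squared distance of the angle from the normative mean, a prompt deviating by the same amount above or below that mean receives the same score, which matches the symmetric flagging the abstract describes.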
