Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams
Isaac Llorente-Saguer

TL;DR
This paper shows that harmful intent is linearly separable in the residual-stream activations of 12 language models, enabling detection across four architectural families and three alignment variants.
Contribution
It introduces a simple supervised method that fits a harmful-intent direction in the residual stream from 100 labelled examples per class, together with a geometric analysis showing that the recovered direction depends on the extraction protocol.
Findings
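Harmful intent is linearly separable across 12 models spanning four architectural families and three alignment variants.
A supervised direction reaches mean effective AUROC 0.982 and TPR@1%FPR 0.797, and generalises to three held-out harm benchmarks and a hard-benign control.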
Detection directions vary significantly with extraction protocols, indicating protocol-specific features.
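The two metrics reported in the findings can be computed from per-prompt scores as in the minimal sketch below. It is an illustration, not the paper's evaluation code: the function names `auroc` and `tpr_at_fpr` are hypothetical, and the tie handling and thresholding convention are assumptions.

```python
import numpy as np

def auroc(pos_scores, neg_scores):
    """Probability that a random harmful prompt outscores a random benign one.

    Ties count as half a win (assumed convention).
    """
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

def tpr_at_fpr(pos_scores, neg_scores, max_fpr=0.01):
    """True-positive rate at the loosest threshold whose FPR stays <= max_fpr."""
    pos = np.asarray(pos_scores, dtype=float)
    neg_desc = np.sort(np.asarray(neg_scores, dtype=float))[::-1]
    # Number of benign prompts we are allowed to flag.
    k = int(np.floor(max_fpr * len(neg_desc)))
    # Flag a prompt when its score is strictly above the (k+1)-th largest benign score.
    threshold = neg_desc[k] if k < len(neg_desc) else -np.inf
    return float(np.mean(pos > threshold))
```

With only a few hundred benign examples, a 1% false-positive budget allows just a handful of benign prompts above the threshold, so TPR@1%FPR is far more sensitive to the tail of the benign score distribution than AUROC is.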
Abstract
Aligned language models refuse harmful instructions, but the representations through which they recognise such instructions are less well characterised than the behaviours they produce. Harmful intent is linearly separable from residual-stream activations across 12 models spanning four architectural families (Qwen2.5, Qwen3.5, Llama-3.2, Gemma-3) and three alignment variants (base, instruction-tuned, abliterated), with parameter scales from 0.5B to 1.3B and a within-family scale extension to 9B on Qwen3.5. A direction fitted from 100 labelled examples per class via Soft-AUC optimisation reaches mean effective AUROC 0.982 and TPR@1%FPR 0.797, generalises to three held-out harm benchmarks and a hard-benign control, and matches its instruction-tuned counterpart within AUROC in abliterated variants from which the refusal mechanism has been removed. The supervised strategies all…
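To make the idea of fitting a direction by Soft-AUC optimisation concrete, here is a minimal sketch on synthetic data. It assumes that "Soft-AUC" means a sigmoid relaxation of the pairwise ranking objective behind AUROC, which the abstract does not spell out. The synthetic Gaussian "activations", the temperature `tau`, the learning rate, the mean-difference initialisation, and every function name are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def make_synthetic(rng, n, direction, shift):
    """Stand-in for residual-stream activations (assumption: Gaussian clouds)."""
    d = direction.shape[0]
    x_neg = rng.normal(size=(n, d))                      # "benign" class
    x_pos = rng.normal(size=(n, d)) + shift * direction  # "harmful" class
    return x_pos, x_neg

def fit_soft_auc_direction(x_pos, x_neg, tau=1.0, lr=0.5, steps=200):
    """Gradient ascent on mean sigmoid((w.x_pos - w.x_neg) / tau), with ||w|| = 1."""
    w = x_pos.mean(axis=0) - x_neg.mean(axis=0)  # difference-of-means init
    w /= np.linalg.norm(w)
    for _ in range(steps):
        # All pairwise score gaps between positive and negative examples.
        diff = (x_pos @ w)[:, None] - (x_neg @ w)[None, :]
        s = 1.0 / (1.0 + np.exp(-diff / tau))
        g = s * (1.0 - s)  # derivative of the sigmoid at each pair
        grad = (g.sum(axis=1) @ x_pos - g.sum(axis=0) @ x_neg) / (g.size * tau)
        w = w + lr * grad
        w /= np.linalg.norm(w)  # project back onto the unit sphere
    return w

def pairwise_auroc(pos_scores, neg_scores):
    """Exact AUROC as the fraction of correctly ranked pairs (ties count half)."""
    p = np.asarray(pos_scores)[:, None]
    n = np.asarray(neg_scores)[None, :]
    return float(np.mean((p > n) + 0.5 * (p == n)))

rng = np.random.default_rng(0)
true_direction = np.zeros(16)
true_direction[0] = 1.0

# 100 labelled examples per class, as in the abstract.
train_pos, train_neg = make_synthetic(rng, 100, true_direction, shift=3.0)
w = fit_soft_auc_direction(train_pos, train_neg)

# Score fresh samples by projecting onto the fitted direction.
test_pos, test_neg = make_synthetic(rng, 500, true_direction, shift=3.0)
print("held-out AUROC:", pairwise_auroc(test_pos @ w, test_neg @ w))
```

The sigmoid replaces the non-differentiable indicator "positive outscores negative" in the AUROC definition, which is what makes the objective amenable to gradient methods. In a real setting the rows of `x_pos` and `x_neg` would be activations read from a chosen layer and token position, and that choice is the extraction protocol the findings refer to.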
