Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams

Isaac Llorente-Saguer

arXiv:2604.18901·cs.LG·May 12, 2026

TL;DR

This paper shows that harmful intent is linearly separable in the residual-stream activations of 12 language models spanning four architectural families and three alignment variants (base, instruction-tuned, abliterated), so that a single fitted direction detects it with high AUROC.

Contribution

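It introduces a simple supervised method that fits a harmful-intent direction in the residual stream from 100 labelled examples per class via Soft-AUC optimisation, together with a geometric analysis showing that the recovered directions depend on the extraction protocol.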

Findings

01

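Harmful intent is linearly separable from residual-stream activations across 12 models spanning four architectural families (Qwen2.5, Qwen3.5, Llama-3.2, Gemma-3) and three alignment variants.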

02

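A supervised direction fitted from 100 labelled examples per class reaches mean effective AUROC 0.982 and TPR@1%FPR 0.797, and generalises to three held-out harm benchmarks and a hard-benign control.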

03

Detection directions vary significantly with extraction protocols, indicating protocol-specific features.
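
The two metrics quoted in finding 02 can be illustrated with a minimal sketch. It assumes each prompt has already been reduced to a scalar score (for example, a projection onto a detection direction) and that higher scores mean "more likely harmful". The function names `auroc` and `tpr_at_fpr` are hypothetical, and the paper's "effective AUROC" may be defined differently from the plain AUROC computed here.

```python
import numpy as np

def auroc(pos_scores, neg_scores):
    """Probability that a random harmful prompt outscores a random benign one.

    Ties between a positive and a negative score count as one half.
    """
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

def tpr_at_fpr(pos_scores, neg_scores, max_fpr=0.01):
    """True-positive rate at the loosest threshold whose FPR is <= max_fpr.

    A prompt is flagged when its score is strictly above the threshold.
    """
    pos = np.asarray(pos_scores, dtype=float)
    neg_desc = np.sort(np.asarray(neg_scores, dtype=float))[::-1]
    # Number of benign prompts we may flag without exceeding max_fpr.
    allowed = int(np.floor(max_fpr * len(neg_desc)))
    # Using the (allowed + 1)-th largest benign score as the threshold means
    # at most `allowed` benign scores lie strictly above it.
    threshold = neg_desc[allowed] if allowed < len(neg_desc) else -np.inf
    return float(np.mean(pos > threshold))

# Toy scores, invented purely for illustration.
harmful = [0.9, 0.8, 0.4]
benign = [0.7, 0.3, 0.2, 0.1]
print(auroc(harmful, benign))
print(tpr_at_fpr(harmful, benign, max_fpr=0.01))
```

With only a handful of benign examples, a 1% FPR budget allows no false positives at all, so the threshold falls at the highest benign score. That is why TPR@1%FPR is a much stricter figure than AUROC.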

Abstract

Aligned language models refuse harmful instructions, but the representations through which they recognise such instructions are less well characterised than the behaviours they produce. Harmful intent is linearly separable from residual-stream activations across 12 models spanning four architectural families (Qwen2.5, Qwen3.5, Llama-3.2, Gemma-3) and three alignment variants (base, instruction-tuned, abliterated), with parameter scales from 0.5B to 1.3B and a within-family scale extension to 9B on Qwen3.5. A direction fitted from 100 labelled examples per class via Soft-AUC optimisation reaches mean effective AUROC 0.982 and TPR@1%FPR 0.797, generalises to three held-out harm benchmarks and a hard-benign control, and matches its instruction-tuned counterpart within ±0.003 AUROC in abliterated variants from which the refusal mechanism has been removed. The supervised strategies all…
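
As a rough illustration of fitting a direction by Soft-AUC optimisation, the sketch below makes several assumptions that this summary does not confirm: that "Soft-AUC" means a sigmoid-smoothed pairwise ranking objective, that the direction is kept at unit norm, and that plain gradient ascent is an adequate optimiser. Synthetic Gaussian vectors stand in for residual-stream activations, and every name (`make_synthetic_activations`, `fit_soft_auc_direction`, `temperature`) is hypothetical. The paper's actual loss, layer choice, and extraction protocol may differ.

```python
import numpy as np

def make_synthetic_activations(n_per_class=100, dim=16, shift=3.0, seed=0):
    """Stand-in data: benign = unit Gaussian noise, harmful = noise + shift * planted direction."""
    rng = np.random.default_rng(seed)
    planted = rng.normal(size=dim)
    planted /= np.linalg.norm(planted)
    benign = rng.normal(size=(n_per_class, dim))
    harmful = rng.normal(size=(n_per_class, dim)) + shift * planted
    return harmful, benign, planted

def fit_soft_auc_direction(pos, neg, steps=200, lr=0.5, temperature=1.0):
    """Gradient ascent on mean sigmoid(w . (x_pos - x_neg) / temperature) over all pairs.

    This smooth objective approaches the empirical AUROC as temperature -> 0.
    """
    # Assumption: initialise from the difference of class means, then refine.
    w = pos.mean(axis=0) - neg.mean(axis=0)
    w /= np.linalg.norm(w)
    # All positive-minus-negative difference vectors, shape (n_pos * n_neg, dim).
    diffs = (pos[:, None, :] - neg[None, :, :]).reshape(-1, pos.shape[1])
    for _ in range(steps):
        margins = np.clip(diffs @ w / temperature, -50.0, 50.0)
        s = 1.0 / (1.0 + np.exp(-margins))
        # Gradient of the mean sigmoid with respect to w.
        grad = (s * (1.0 - s)) @ diffs / (temperature * len(diffs))
        w = w + lr * grad
        w /= np.linalg.norm(w)  # keep the direction on the unit sphere
    return w

harmful, benign, planted = make_synthetic_activations()
direction = fit_soft_auc_direction(harmful, benign)
cosine_to_planted = float(direction @ planted)
print(f"cosine similarity to planted direction: {cosine_to_planted:.3f}")
```

Projecting a new activation onto `direction` gives a scalar detection score. On real models the interesting questions are which layer and token position the activations come from, which this sketch deliberately leaves out.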
