The Geometry of Harmfulness in LLMs through Subconcept Probing
McNair Shah, Saleena Angeline, Adhitya Rajendra Kumar, Naitik Chheda, Kevin Zhu, Vasu Sharma, Sean O'Brien, Will Cai

TL;DR
This paper develops a multidimensional probing framework for harmful behavior in large language models: it learns a linear probe for each of 55 harmfulness subconcepts, shows that the resulting directions span a low-rank subspace, and uses ablation and steering in that subspace to reduce harmful outputs.
Contribution
It introduces a multidimensional probing method for harmfulness subconcepts in LLMs and evaluates ablation and steering of the resulting subspace, with dominant-direction steering giving near elimination of harmfulness at a low decrease in utility.
Findings
Harmfulness subspaces are low-rank and interpretable.
Ablation of dominant harmfulness directions reduces harmful outputs.
Steering along the dominant direction of the harmfulness subspace nearly eliminates harmful behavior with a low decrease in utility.
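The ablation and steering operations named in these findings can be illustrated with plain linear algebra. The following is a minimal sketch, not the authors' implementation: it assumes activations are rows of a NumPy array, that ablation means projecting out a direction or subspace, and that steering means adding a scaled unit vector. The function names (`ablate_direction`, `ablate_subspace`, `steer`) and the strength parameter `alpha` are hypothetical.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)

def ablate_direction(h, d):
    """Remove the component of each activation (row of h) along direction d."""
    d = unit(d)
    return h - np.outer(h @ d, d)

def ablate_subspace(h, D, tol=1e-10):
    """Project out the span of the rows of D (k x dim) from each activation.

    An orthonormal basis is taken from the SVD so that linearly dependent
    probe directions do not cause trouble.
    """
    _, S, Vt = np.linalg.svd(D, full_matrices=False)
    basis = Vt[S > tol]                 # (rank x dim), orthonormal rows
    return h - (h @ basis.T) @ basis

def steer(h, d, alpha):
    """Shift every activation by alpha along the unit direction d.

    The sign and size of alpha are assumptions of this sketch.
    """
    return h + alpha * unit(d)

# Toy usage with random data standing in for model activations.
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))             # 5 activations of width 8
d = rng.normal(size=8)                  # one hypothetical harmfulness direction
D = rng.normal(size=(3, 8))             # three hypothetical probe directions

h_no_dir = ablate_direction(h, d)
h_no_sub = ablate_subspace(h, D)
h_steered = steer(h, d, alpha=-2.0)
```

After ablation, the activations have no remaining component along the removed direction or subspace; steering leaves all orthogonal components untouched and changes only the coordinate along `d`.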
Abstract
Recent advances in large language models (LLMs) have intensified the need to understand and reliably curb their harmful behaviours. We introduce a multidimensional framework for probing and steering harmful content in model internals. For each of 55 distinct harmfulness subconcepts (e.g., racial hate, employment scams, weapons), we learn a linear probe, yielding 55 interpretable directions in activation space. Collectively, these directions span a harmfulness subspace that we show is strikingly low-rank. We then test ablation of the entire subspace from model internals, as well as steering and ablation in the subspace's dominant direction. We find that dominant direction steering allows for near elimination of harmfulness with a low decrease in utility. Our findings advance the emerging view that concept subspaces provide a scalable lens on LLM behaviour and offer practical tools for…
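The pipeline the abstract describes (one linear probe per subconcept, then an analysis of the subspace the probe directions span) can be sketched on synthetic data. This is an illustration under stated assumptions, not the paper's code: the probe is a logistic regression fitted by gradient descent, the "activations" are Gaussian toy data, the sizes are far smaller than the paper's 55 subconcepts, and the rank is read off from the singular values of the stacked directions. All names (`fit_probe`, `shared`, `explained`, `dominant`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_concepts, n_per_class = 32, 6, 200   # toy sizes, not the paper's

# Assumption of this toy setup: every subconcept direction is a shared
# direction plus a small subconcept-specific perturbation.
shared = rng.normal(size=dim)
shared /= np.linalg.norm(shared)

def fit_probe(X, y, lr=0.1, steps=300):
    """Fit a logistic-regression probe by gradient descent.

    Returns the weight vector normalised to unit length, which serves
    as the probe direction for one subconcept.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probability
        g = p - y                                # gradient of the log loss
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w / np.linalg.norm(w)

directions = []
for _ in range(n_concepts):
    true_dir = shared + 0.3 * rng.normal(size=dim) / np.sqrt(dim)
    harmless = rng.normal(size=(n_per_class, dim))
    harmful = rng.normal(size=(n_per_class, dim)) + 3.0 * true_dir
    X = np.vstack([harmless, harmful])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    directions.append(fit_probe(X, y))

D = np.stack(directions)                         # (n_concepts x dim)

# Singular values show how many dimensions the probe directions occupy.
_, S, Vt = np.linalg.svd(D, full_matrices=False)
explained = S**2 / np.sum(S**2)                  # variance share per component
dominant = Vt[0]                                 # dominant direction of the subspace

print("explained variance per component:", np.round(explained, 3))
```

In this toy setup most of the variance falls on the first component because the data were constructed that way; whether the same holds for a real model's activations is the empirical question the paper addresses.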
