LLMs Encode Harmfulness and Refusal Separately
Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, Weiyan Shi

TL;DR
This paper shows that LLMs encode harmfulness and refusal as separate internal concepts. This separation enables safety mechanisms that detect unsafe inputs from the model's internal harmfulness representation, which is more robust than its refusal signal.
Contribution
The work identifies a distinct harmfulness direction in LLMs' internal representations, separate from refusal, and demonstrates its use for robust safety safeguards against jailbreaks and finetuning attacks.
Findings
Harmfulness and refusal are encoded as separate directions in LLMs.
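Steering along the harmfulness direction can lead the model to interpret harmless instructions as harmful, whereas steering along the refusal direction tends to elicit refusal directly without reversing the model's judgment of harmfulness.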
Latent Guard based on harmfulness outperforms or matches existing safety methods.
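To make the idea of a safeguard built on an internal harmfulness representation concrete, here is a minimal sketch. It is not the paper's Latent Guard, whose design is not described in this summary. It assumes that a harmfulness direction can be estimated as the difference between mean activations on harmful and harmless prompts, and that an input is flagged when its projection onto that direction exceeds a midpoint threshold. The class `HarmfulnessProbe` and the toy activation vectors are hypothetical.

```python
import math


def unit(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


class HarmfulnessProbe:
    """Hypothetical detector: threshold the projection onto a harmfulness direction.

    Assumption: the direction is the normalized difference of class means.
    The paper's actual extraction and decision rule may differ.
    """

    def __init__(self, harmful_acts, harmless_acts):
        mu_harmful = mean(harmful_acts)
        mu_harmless = mean(harmless_acts)
        self.direction = unit([a - b for a, b in zip(mu_harmful, mu_harmless)])
        # Midpoint between the two class means along the direction.
        self.threshold = 0.5 * (
            dot(mu_harmful, self.direction) + dot(mu_harmless, self.direction)
        )

    def score(self, activation):
        return dot(activation, self.direction)

    def is_harmful(self, activation):
        return self.score(activation) > self.threshold


# Toy 3-dimensional "activations" standing in for real hidden states.
harmful_acts = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2], [2.2, 0.0, -0.1]]
harmless_acts = [[0.0, 0.1, 0.1], [0.2, -0.2, 0.0], [-0.1, 0.0, -0.1]]
probe = HarmfulnessProbe(harmful_acts, harmless_acts)
```

In a real model the activations would be hidden states read from a chosen layer and token position. Those choices are left open here because the summary does not specify them.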
Abstract
LLMs are trained to refuse harmful instructions, but do they truly understand harmfulness beyond just refusing? Prior work has shown that LLMs' refusal behaviors can be mediated by a one-dimensional subspace, i.e., a refusal direction. In this work, we identify a new dimension to analyze safety mechanisms in LLMs, i.e., harmfulness, which is encoded internally as a separate concept from refusal. There exists a harmfulness direction that is distinct from the refusal direction. As causal evidence, steering along the harmfulness direction can lead LLMs to interpret harmless instructions as harmful, but steering along the refusal direction tends to elicit refusal responses directly without reversing the model's judgment on harmfulness. Furthermore, using our identified harmfulness concept, we find that certain jailbreak methods work by reducing the refusal signals without reversing the…
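The notions of a "direction" and of "steering" used in the abstract can be illustrated with a small sketch. It assumes the difference-in-means recipe commonly used to extract such directions. This summary does not state how the authors extract the harmfulness direction, so treat that choice, the synthetic activations, and all function names as assumptions. The two concepts are placed on different axes by construction, so the low cosine similarity here illustrates the computation and is not evidence for the paper's claim.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the toy data is reproducible
DIM = 8


def noisy(center, sigma=0.1):
    """Sample a synthetic activation near `center` with Gaussian noise."""
    return [c + rng.gauss(0.0, sigma) for c in center]


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def difference_in_means(pos, neg):
    """Unit vector pointing from the mean of `neg` to the mean of `pos`."""
    diff = [a - b for a, b in zip(mean_vector(pos), mean_vector(neg))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def project(hidden, direction):
    """Scalar projection of `hidden` onto a unit `direction`."""
    return sum(h * d for h, d in zip(hidden, direction))


def cosine(u, v):
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def steer(hidden, direction, alpha):
    """Add `alpha` times the direction to a hidden state (activation steering)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Synthetic setup: harmfulness lives on axis 0, refusal on axis 1 (by construction).
origin = [0.0] * DIM
harm_center = [3.0 if i == 0 else 0.0 for i in range(DIM)]
refuse_center = [3.0 if i == 1 else 0.0 for i in range(DIM)]

harmful = [noisy(harm_center) for _ in range(50)]
harmless = [noisy(origin) for _ in range(50)]
refused = [noisy(refuse_center) for _ in range(50)]
complied = [noisy(origin) for _ in range(50)]

harm_dir = difference_in_means(harmful, harmless)
refusal_dir = difference_in_means(refused, complied)

# Steering a harmless activation along the harmfulness direction raises its
# harmfulness projection by alpha, while its refusal projection barely moves
# when the two directions are close to orthogonal.
alpha = 4.0
steered = steer(harmless[0], harm_dir, alpha)
harm_shift = project(steered, harm_dir) - project(harmless[0], harm_dir)
refusal_shift = project(steered, refusal_dir) - project(harmless[0], refusal_dir)
similarity = cosine(harm_dir, refusal_dir)
```

In an actual experiment, `steer` would be applied to a transformer's residual stream during the forward pass, and the effect would be judged from the model's outputs instead of from projections.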
Taxonomy
Topics: Biomedical Ethics and Regulation
