LLMs Encode Harmfulness and Refusal Separately

Jiachen Zhao; Jing Huang; Zhengxuan Wu; David Bau; Weiyan Shi

arXiv:2507.11878 · cs.CL · December 16, 2025

TL;DR

This paper shows that LLMs encode harmfulness and refusal as separate internal concepts. Because the internal harmfulness representation is more robust than the refusal signal, it can be used to build safety mechanisms that detect unsafe inputs.

Contribution

The work identifies a distinct harmfulness direction in LLMs' internal representations, separate from refusal, and demonstrates its use for robust safety safeguards against jailbreaks and finetuning attacks.

Findings

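01 Harmfulness and refusal are encoded as separate directions in LLMs.

02 Steering along the harmfulness direction changes the model's judgment of whether an instruction is harmful, whereas steering along the refusal direction tends to elicit refusals without changing that judgment.

03 Latent Guard, a safeguard based on the internal harmfulness representation, outperforms or matches existing safety methods.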
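
The third finding can be pictured with a small sketch. This summary does not describe how Latent Guard is actually built, so everything below is an assumption for illustration: a hidden state is projected onto a given harmfulness direction and flagged when the projection exceeds a threshold. The names `calibrate_threshold` and `is_flagged`, the midpoint calibration rule, and the toy two-dimensional activations are all hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def calibrate_threshold(harmful_acts, harmless_acts, direction):
    """Midpoint between the mean projections of two labeled activation sets.

    Assumption: a simple midpoint rule, chosen only for illustration.
    """
    harmful_mean = sum(dot(h, direction) for h in harmful_acts) / len(harmful_acts)
    harmless_mean = sum(dot(h, direction) for h in harmless_acts) / len(harmless_acts)
    return (harmful_mean + harmless_mean) / 2.0


def is_flagged(hidden, direction, threshold):
    """Flag an input whose projection onto the direction exceeds the threshold."""
    return dot(hidden, direction) > threshold


# Toy 2-d activations (made up); the first axis plays the harmfulness direction.
harm_direction = [1.0, 0.0]
harmful_acts = [[2.0, 0.3], [1.6, -0.2]]
harmless_acts = [[0.1, 0.5], [-0.1, 0.4]]

threshold = calibrate_threshold(harmful_acts, harmless_acts, harm_direction)
flag_a = is_flagged([1.5, 0.0], harm_direction, threshold)
flag_b = is_flagged([0.2, 3.0], harm_direction, threshold)
```

In this toy setup the second input has a large component off the harmfulness axis but is not flagged, since only the projection onto the chosen direction matters.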

Abstract

LLMs are trained to refuse harmful instructions, but do they truly understand harmfulness beyond just refusing? Prior work has shown that LLMs' refusal behaviors can be mediated by a one-dimensional subspace, i.e., a refusal direction. In this work, we identify a new dimension to analyze safety mechanisms in LLMs, i.e., harmfulness, which is encoded internally as a separate concept from refusal. There exists a harmfulness direction that is distinct from the refusal direction. As causal evidence, steering along the harmfulness direction can lead LLMs to interpret harmless instructions as harmful, but steering along the refusal direction tends to elicit refusal responses directly without reversing the model's judgment on harmfulness. Furthermore, using our identified harmfulness concept, we find that certain jailbreak methods work by reducing the refusal signals without reversing the…
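
The ideas of a direction and of steering along it can be sketched as follows. The abstract excerpt does not say how the directions are extracted, so this sketch assumes a difference-of-means estimate, which is one common choice and not necessarily the authors' method. The function names, the steering strength `alpha`, and the three-dimensional toy activations are hypothetical; plain Python lists keep the example dependency-free.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Scale a nonzero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def difference_of_means_direction(positive, negative):
    """Unit vector pointing from the mean of `negative` to the mean of `positive`."""
    mp, mn = mean_vector(positive), mean_vector(negative)
    return unit([a - b for a, b in zip(mp, mn)])


def steer(hidden, direction, alpha):
    """Add alpha times a unit direction to a hidden state."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations (made up): one contrast per concept.
harmful = [[2.0, 0.0, 0.1], [2.2, 0.1, -0.1]]
harmless = [[0.0, 0.0, 0.1], [0.2, 0.1, -0.1]]
refused = [[0.1, 1.9, 0.0], [0.0, 2.1, 0.2]]
complied = [[0.1, -0.1, 0.0], [0.0, 0.1, 0.2]]

harm_dir = difference_of_means_direction(harmful, harmless)
refusal_dir = difference_of_means_direction(refused, complied)

# Both are unit vectors, so their dot product is their cosine similarity.
similarity = dot(harm_dir, refusal_dir)

hidden = [0.1, 0.0, 0.1]
steered = steer(hidden, harm_dir, alpha=2.0)
```

The toy data is constructed so that the two directions are orthogonal, which makes the point easy to see: steering along `harm_dir` raises the projection onto the harmfulness direction by `alpha` while leaving the projection onto `refusal_dir` unchanged. Real model activations would not separate this cleanly.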

Taxonomy

Topics: Biomedical Ethics and Regulation