Linearly Decoding Refused Knowledge in Aligned Language Models

Aryan Shrivastava; Ari Holtzman

arXiv:2507.00239·cs.CL·July 2, 2025

Open Access 3 Reviews

TL;DR

This paper demonstrates that much of the refused harmful information in aligned language models remains linearly decodable from their hidden states, indicating that instruction-tuning suppresses but does not eliminate such information.

Contribution

It reveals that refused knowledge in instruction-tuned LMs is linearly accessible and persists from base models, challenging assumptions about the effectiveness of alignment techniques.

Findings

01

Refused information is highly linearly decodable with correlations exceeding 0.8.

02

Probes trained on base models sometimes transfer to their instruction-tuned versions, revealing persistent internal representations.

03

Decoded information correlates with the models' subtle generative behaviors, indicating active use of suppressed knowledge.
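The probe-transfer finding can be illustrated with a small sketch. It is not the authors' code: the hidden states are synthetic, the "tuned" states are simply the "base" states plus noise, and the ridge probe, the helper names (`fit_ridge_probe`, `pearson`), and all sizes are assumptions made for illustration only.

```python
import numpy as np

def fit_ridge_probe(X, y, alpha=1.0):
    """Closed-form ridge regression probe; returns (weights, bias)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    d = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc for the probe direction.
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(d), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def pearson(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(0)
n, d = 400, 32  # hypothetical number of prompts and hidden size

# Synthetic stand-ins: the target is linearly encoded in the "base" states.
direction = rng.normal(size=d)
base_h = rng.normal(size=(n, d))
target = base_h @ direction + 0.1 * rng.normal(size=n)

# Assumption: instruction-tuning only perturbs the representation slightly.
tuned_h = base_h + 0.3 * rng.normal(size=(n, d))

train, test = slice(0, 300), slice(300, n)
w, b = fit_ridge_probe(base_h[train], target[train])

# Evaluate the same probe on held-out base states and on tuned states.
r_base = pearson(base_h[test] @ w + b, target[test])
r_transfer = pearson(tuned_h[test] @ w + b, target[test])
print(f"base r = {r_base:.3f}, transfer r = {r_transfer:.3f}")
```

In this toy setting the probe transfers because the tuned states were constructed to stay close to the base states. Whether real instruction-tuned models behave this way is the empirical question the paper addresses.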

Abstract

Most commonly used language models (LMs) are instruction-tuned and aligned using a combination of fine-tuning and reinforcement learning, causing them to refuse user requests deemed harmful by the model. However, jailbreak prompts can often bypass these refusal mechanisms and elicit harmful responses. In this work, we study the extent to which information accessed via jailbreak prompts is decodable using linear probes trained on LM hidden states. We show that a great deal of initially refused information is linearly decodable. For example, across models, the response of a jailbroken LM for the average IQ of a country can be predicted by a linear probe with Pearson correlations exceeding $0.8$. Surprisingly, we find that probes trained on base models (which do not refuse) sometimes transfer to their instruction-tuned versions and are capable of revealing information that jailbreaks…
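For reference, the quantities named in the abstract can be written out as follows. This is a minimal formalization, assuming an affine probe applied to one hidden-state vector per prompt. The symbols $h_i$, $w$, $b$, $y_i$, and $\hat{y}_i$ are notation introduced here, not taken from the paper.

$$
\hat{y}_i = w^\top h_i + b,
\qquad
r = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

Here $h_i$ is the hidden state for prompt $i$, $y_i$ is the value given by the jailbroken model, $\hat{y}_i$ is the probe's prediction, and $r$ is the Pearson correlation between the two over $n$ prompts.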

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The paper introduces an interesting experiment transferring probes from base models to aligned models, providing direct evidence that representations of refused knowledge persist through instruction-tuning.
2. The study goes beyond just finding representations; it connects them to generative behavior by showing that probe predictions correlate with the model's judgments in a separate pairwise comparison task.

Weaknesses

### 1. Model

The paper evaluates only small models (2B, 6B, 9B) from just two model families (Gemma and Yi). This is insufficient to support the broad claims about aligned language models in general. Missing:

1) Evaluation on larger, more capable models (13B, 30B, 70B+)
2) Testing on diverse model families (Llama, Qwen, Mistral, etc.)
3) Discussion of why only these specific models were chosen
4) Analysis of whether findings scale with model size: do larger models show stronger or weaker lin

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. This research investigates a critical question in AI alignment and interpretability: whether knowledge that a large language model is trained to refuse remains accessible after safety tuning.
2. The research is supported by a thorough and transparent experimental design, featuring a clear methodology for training linear probes, meticulous documentation of prompt engineering, and comprehensive validation procedures.

Weaknesses

1. The analysis is strictly empirical and would be strengthened by a theoretical or mechanistic explanation for the observed phenomena.
2. The research relies on manually constructed data, which limits the generalizability of its findings to real-world LLM alignment and jailbreaking scenarios.
3. The use of linear probes is a significant limitation, as this method is ineffective in common, practical scenarios where data is not linearly separable.

Reviewer 03 · Rating 2 · Confidence 4

Strengths

The paper is mostly easy to follow. It introduces a simple idea and performs experiments based on it. It also does a good job explaining the relation to prior work.

Weaknesses

While I do not find anything in the paper factually wrong, the contribution and scope of experiments are very limited, to the point where I think the paper likely requires additional content (experiments) or would be better suited to a workshop. The paper presents very little in the sense of new methods and is more of an understanding paper. I have no issue with this sort of submission, but they are a lot more compelling when they present results that are surprising or interesting. The fact that inf

Taxonomy

Topics: Topic Modeling · Language and cultural evolution · Explainable Artificial Intelligence (XAI)