REAL: Reading Out Transformer Activations for Precise Localization in Language Model Steering

Li-Ming Zhan; Bo Liu; Chengqiang Xie; Jiannong Cao; Xiao-Ming Wu

arXiv:2506.08359·cs.CL·October 2, 2025

Open Access · 3 Reviews

TL;DR

REAL introduces a novel method to identify and utilize behavior-relevant modules within Transformer-based language models, enabling more precise and effective inference-time steering without altering model parameters.

Contribution

The paper presents a new framework, REAL, that uses vector-quantized autoencoders to quantify module relevance for behavior control in LLMs, improving steering effectiveness.

Findings

01 Achieves 20% average improvement over ITI in truthfulness steering.

02 Modules identified by REAL generalize well across domains.

03 Effective in multiple tasks including truthfulness and alignment.

Abstract

Inference-time steering aims to alter a large language model's (LLM's) responses without changing its parameters, but a central challenge is identifying the internal modules that most strongly govern the target behavior. Existing approaches often rely on simplistic cues or ad hoc heuristics, leading to suboptimal or unintended effects. We introduce REAL, a framework for identifying behavior-relevant modules (attention heads or layers) in Transformer models. For each module, REAL trains a vector-quantized autoencoder (VQ-AE) on its hidden activations and uses a shared, learnable codebook to partition the latent space into behavior-relevant and behavior-irrelevant subspaces. REAL quantifies a module's behavioral relevance by how well its VQ-AE encodings discriminate behavior-aligned from behavior-violating responses via a binary classification metric; this score guides both module…
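The scoring idea in the abstract (quantize a module's activations, then measure how well the resulting codes separate behavior-aligned from behavior-violating responses) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: plain k-means stands in for a trained VQ-AE codebook, held-out accuracy of a majority-label-per-code rule stands in for the paper's binary classification metric, the activations are synthetic, and all function names (`fit_codebook`, `quantize`, `relevance_score`, `make_module`) are hypothetical.

```python
import numpy as np


def quantize(x, codebook):
    """Return the index of the nearest codebook vector for each row of x."""
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def fit_codebook(x, num_codes, iters=20, seed=0):
    """Plain k-means; an assumed stand-in for a learned VQ-AE codebook."""
    rng = np.random.default_rng(seed)
    codebook = x[rng.choice(len(x), size=num_codes, replace=False)].copy()
    for _ in range(iters):
        codes = quantize(x, codebook)
        for k in range(num_codes):
            members = x[codes == k]
            if len(members) > 0:
                codebook[k] = members.mean(axis=0)
    return codebook


def relevance_score(train_x, train_y, test_x, test_y, num_codes=4):
    """Held-out accuracy of predicting the label from the quantized code.

    Assumed proxy for a behavior-discrimination score: each code is tagged
    with the majority label of the training activations assigned to it.
    """
    codebook = fit_codebook(train_x, num_codes)
    train_codes = quantize(train_x, codebook)
    code_label = np.zeros(num_codes, dtype=int)
    for k in range(num_codes):
        members = train_y[train_codes == k]
        if len(members) > 0:
            code_label[k] = int(members.mean() >= 0.5)
    preds = code_label[quantize(test_x, codebook)]
    return float((preds == test_y).mean())


def make_module(separation, n=200, dim=8, seed=0):
    """Synthetic activations: label 1 = behavior-aligned, 0 = behavior-violating."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    x = rng.normal(size=(n, dim))
    x[:, 0] += separation * (2 * y - 1)  # only dimension 0 carries the behavior
    return x, y


# Two hypothetical modules: one encodes the behavior, one does not.
modules = {
    "informative": make_module(4.0, seed=1),
    "uninformative": make_module(0.0, seed=2),
}

scores = {}
for name, (x, y) in modules.items():
    half = len(x) // 2
    scores[name] = relevance_score(x[:half], y[:half], x[half:], y[half:])

# Rank modules by score, highest first.
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

In this toy setup the module whose activations carry the label should rank first, while the other should score near chance. A real pipeline would replace the synthetic arrays with per-head or per-layer activations collected from the model.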

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The paper proposes a novel module localization method for activation steering of LLMs. The method can be generally applied to many activation steering methods that operate on different modules of LLMs.
2. The paper conducts extensive experiments across multiple tasks and models, and demonstrates the empirical strength of the method over multiple baselines.
3. The paper performs in-depth ablation analyses of the proposed method. Particularly, Figure 4 and Figure 5 show that the proposed meth

Weaknesses

1. The proposed localization method does not show a strong empirical gain over other fine-tuning-based localization methods. The main empirical strength of the method is demonstrated in comparison with methods that are based on simple linear probes (e.g., ITI for truthfulness and SpARE for NQSwap and MacNoise), but the proposed method only leads to a 1-2% absolute performance increase over other localization methods that also require fine-tuning (LoFiT for TruthfulQA). This makes sense because line

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. The studied problem is interesting: LLM steering is a promising research direction, especially since inference-time steering can modify model behavior toward desirable outcomes without altering model weights.
2. The method is approach-agnostic and can be integrated with different steering techniques such as ITI and LoFiT.
3. The experiments cover diverse scenarios and tasks, including truthfulness steering, open-domain QA involving knowledge conflicts, and general alignment objectives.
4.

Weaknesses

1. The performance improvement is relatively marginal, and several results are missing or incomplete, making it difficult to fully assess the effectiveness and potential of the proposed approach.
2. The evaluated models are relatively small in scale, raising concerns about the approach’s applicability and scalability to larger LLMs.
3. The paper lacks theoretical analysis to support or explain the observed empirical results. The proposed approach is also not well-motivated.
4. As shown in Figure

Reviewer 03 · Rating 2 · Confidence 3

Strengths

1. The authors ran experiments across different behavioral datasets to validate the effectiveness of the proposed method on steering different kinds of behavior.
2. The proposed method is novel in the sense that it adopts a VQ-AE to map activations to a disentangled, quantized latent space.
3. The writing of the preliminaries is very clear.

Weaknesses

1. The computation of the behavior-discriminative scores is described only in the text at Line 249, and it remains a bit unclear to me. It would be appreciated if the authors could formalize this computation.
2. Truthfulness is one of the testbeds selected in this paper. The authors show the effectiveness of the proposed method on the TruthfulQA benchmark using the MC tasks. There is also another task, i.e., generation, on TruthfulQA, where there are metrics like Truth and Info measuring the truthfulness

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: Softmax · Attention Is All You Need