Mechanistic Indicators of Steering Effectiveness in Large Language Models
Mehdi Jafari, Hao Xue, Flora Salim

TL;DR
This paper investigates internal signals within large language models to predict and diagnose the success of activation-based steering, aiming to understand the mechanistic factors influencing steering effectiveness.
Contribution
It introduces entropy-based measures and demonstrates their predictive power for steering reliability, advancing understanding of internal model mechanisms.
Findings
Mechanistic signals can predict steering success with high accuracy.
Entropy preservation and KL divergence correlate with effective steering.
Proposed evaluation baseline improves assessment of steering methods.
Abstract
Activation-based steering enables Large Language Models (LLMs) to exhibit targeted behaviors by intervening on intermediate activations without retraining. Despite its widespread use, the mechanistic factors that govern when steering succeeds or fails remain poorly understood, as prior work has relied primarily on black-box outputs or LLM-based judges. In this study, we investigate whether the reliability of steering can be diagnosed using internal model signals. We focus on two information-theoretic measures: the entropy-derived Normalized Branching Factor (NBF), and the Kullback-Leibler (KL) divergence between steered activations and targeted concepts in the vocabulary space. We hypothesize that effective steering corresponds to structured entropy preservation and coherent KL alignment across decoding steps. Building on a reliability study demonstrating high inter-judge agreement…
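The two measures named in the abstract can be illustrated with a small sketch. The abstract does not give the exact definitions, so everything here is an assumption made for illustration: the branching factor is taken as exp(H) of the next-token distribution and normalized by vocabulary size, the KL term is computed as KL(steered || concept) over a toy four-token vocabulary, and all function names and values are hypothetical rather than the authors' implementation.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability tokens contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def normalized_branching_factor(probs):
    """ASSUMED definition: exp(entropy) / vocabulary size.

    Equals 1.0 for a uniform distribution and 1/|V| for a one-hot one.
    The paper's actual NBF may be defined differently.
    """
    return math.exp(entropy(probs)) / len(probs)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; q is floored at eps to avoid division by zero."""
    return sum(
        pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0
    )


# Toy next-token logits over a 4-token vocabulary (hypothetical values).
base_probs = softmax([2.0, 1.0, 0.5, 0.1])
steered_probs = softmax([0.5, 3.0, 0.5, 0.1])

# Hypothetical target-concept distribution in vocabulary space.
concept_probs = [0.05, 0.85, 0.05, 0.05]

if __name__ == "__main__":
    print("NBF (base):   ", normalized_branching_factor(base_probs))
    print("NBF (steered):", normalized_branching_factor(steered_probs))
    print("KL(base || concept):   ", kl_divergence(base_probs, concept_probs))
    print("KL(steered || concept):", kl_divergence(steered_probs, concept_probs))
```

In this toy setup the steered distribution sits closer to the concept distribution than the unsteered one, which is the kind of alignment the abstract hypothesizes for effective steering. Tracking both quantities at every decoding step, rather than once, would be needed to examine the "across decoding steps" part of the hypothesis.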
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Computational and Text Analysis Methods
