Efficient and accurate steering of Large Language Models through attention-guided feature learning
Parmida Davarmanesh, Ashia Wilson, Adityanarayanan Radhakrishnan

TL;DR
This paper presents an attention-guided framework for steering large language models. It improves how concept-relevant features and layers are selected, which makes steering model responses toward semantic concepts more effective and more scalable.
Contribution
The authors introduce an attention-guided steering method that automatically identifies the features and layers relevant to a concept, improving steering success across several LLM architectures.
Findings
Nearly doubled the number of successfully steered concepts
Improved steering across models up to 70 billion parameters
Provided insights into concept feature distribution across layers
Abstract
Steering, or direct manipulation of internal activations to guide LLM responses toward specific semantic concepts, is emerging as a promising avenue for both understanding how semantic concepts are stored within LLMs and advancing LLM capabilities. Yet, existing steering methods are remarkably brittle, with seemingly non-steerable concepts becoming completely steerable based on subtle algorithmic choices in how concept-related features are extracted. In this work, we introduce an attention-guided steering framework that overcomes three core challenges associated with steering: (1) automatic selection of relevant token embeddings for extracting concept-related features; (2) accounting for heterogeneity of concept-related features across LLM activations; and (3) identification of layers most relevant for steering. Across a steering benchmark of 512 semantic concepts, our framework…
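The abstract defines steering as direct manipulation of internal activations. As a rough illustration of that idea only, the sketch below uses a generic difference-of-means baseline: it takes the mean activation for prompts that express a concept, subtracts the mean for prompts that do not, and adds the resulting direction to a hidden state. This is not the paper's attention-guided method, whose details are not given in this summary. The function names, the strength parameter `alpha`, and the toy numbers are all hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Coordinate-wise mean of equally sized vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_direction(with_concept: List[Vector],
                      without_concept: List[Vector]) -> Vector:
    """Difference of class means, normalized to unit length.

    Assumption: a generic baseline for extracting a concept feature,
    not the attention-guided extraction described in the paper.
    """
    mu_pos = mean_vector(with_concept)
    mu_neg = mean_vector(without_concept)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to extract")
    return [d / norm for d in diff]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Additive steering: shift one activation along the concept direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional "activations" (made-up numbers, not model outputs).
pos = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
neg = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]
direction = concept_direction(pos, neg)
steered = steer([0.5, 0.5, 0.5], direction, alpha=4.0)
```

In a real model the vectors would be hidden states read from a chosen layer and token position. Which token and which layer to use are exactly the choices the abstract lists as challenges (1) and (3).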
Taxonomy
Topics: Topic Modeling · Advanced Graph Neural Networks · Machine Learning in Materials Science
