Interpretable Steering of Large Language Models with Feature Guided Activation Additions
Samuel Soo, Chen Guang, Wesley Teng, Chandrasekaran Balaganesh, Tan Guoxian, Yan Ming

TL;DR
This paper introduces Feature Guided Activation Additions (FGAA), an activation steering method for large language models. FGAA builds steering vectors from sparse autoencoder (SAE) features selected by optimization, and the authors report more precise and more interpretable control than existing techniques.
Contribution
FGAA is a novel activation steering approach that enhances control over LLM outputs by operating in a sparse autoencoder latent space, offering better steering effects and interpretability.
Findings
FGAA outperforms existing steering methods like CAA and SAE-TS.
Steering effectiveness varies with scale and model capabilities.
Trade-offs exist between steering strength and model performance.
Abstract
Effective and reliable control over large language model (LLM) behavior is a significant challenge. While activation steering methods, which add steering vectors to a model's hidden states, are a promising approach, existing techniques often lack precision and interpretability in how they influence model outputs. We introduce Feature Guided Activation Additions (FGAA), a novel activation steering method that leverages insights from Contrastive Activation Addition (CAA) and Sparse Autoencoder-Targeted Steering (SAE-TS). By operating in the latent space of a Sparse Autoencoder (SAE) and employing optimization techniques to select desired SAE features, FGAA constructs precise steering vectors that provide better steering effects while maintaining coherence of steered model outputs. In this regard, evaluations on Gemma-2-2B and Gemma-2-9B models across various steering tasks demonstrate…
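To make the phrase "add steering vectors to a model's hidden states" concrete, here is a minimal NumPy sketch of generic activation addition with a contrastive (mean-difference) steering vector, in the spirit of CAA. It is not the FGAA procedure: the SAE feature selection and optimization steps described in the abstract are not shown. The function names, the toy dimensions, the random "activations", and the `scale` value are all illustrative assumptions, not taken from the paper.

```python
import numpy as np


def contrastive_steering_vector(pos_acts, neg_acts):
    """Mean activation on positive examples minus mean on negative examples.

    Both inputs have shape (n_examples, d_model). This is a CAA-style
    construction, used here only as an illustration.
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


def add_steering(hidden, vector, scale=1.0):
    """Add the scaled steering vector to the hidden state at every token.

    hidden has shape (n_tokens, d_model); vector has shape (d_model,).
    """
    return hidden + scale * vector


# Toy data standing in for residual-stream activations (assumption:
# a real setup would record these from a chosen layer of the model).
rng = np.random.default_rng(0)
d_model = 8
pos_acts = rng.normal(size=(16, d_model)) + 1.0  # "desired behavior" prompts
neg_acts = rng.normal(size=(16, d_model))        # "undesired behavior" prompts

v = contrastive_steering_vector(pos_acts, neg_acts)

hidden = rng.normal(size=(5, d_model))           # 5 tokens at one layer
steered = add_steering(hidden, v, scale=4.0)
```

The `scale` parameter is the knob behind the trade-off noted in the findings: a larger value pushes the hidden states further along the steering direction, at the risk of degrading the coherence of the output.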
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Sparse Autoencoder
