Interpretable Steering of Large Language Models with Feature Guided Activation Additions

Samuel Soo, Chen Guang, Wesley Teng, Chandrasekaran Balaganesh, Tan Guoxian, Yan Ming

arXiv:2501.09929 · cs.LG · April 3, 2025

Open Access

TL;DR

This paper introduces FGAA, an activation steering method for large language models that builds steering vectors from sparse autoencoder (SAE) features selected through optimization. It aims for more precise and interpretable control and outperforms existing techniques such as CAA and SAE-TS.

Contribution

FGAA is a novel activation steering approach that enhances control over LLM outputs by operating in a sparse autoencoder latent space, offering better steering effects and interpretability.

Findings

1. FGAA outperforms existing steering methods such as CAA and SAE-TS.

2. Steering effectiveness varies with scale and model capabilities.

3. Trade-offs exist between steering strength and model performance.

Abstract

Effective and reliable control over large language model (LLM) behavior is a significant challenge. While activation steering methods, which add steering vectors to a model's hidden states, are a promising approach, existing techniques often lack precision and interpretability in how they influence model outputs. We introduce Feature Guided Activation Additions (FGAA), a novel activation steering method that leverages insights from Contrastive Activation Addition (CAA) and Sparse Autoencoder-Targeted Steering (SAE-TS). By operating in the latent space of a Sparse Autoencoder (SAE) and employing optimization techniques to select desired SAE features, FGAA constructs precise steering vectors that provide better steering effects while maintaining coherence of steered model outputs. In this regard, evaluations on Gemma-2-2B and Gemma-2-9B models across various steering tasks demonstrate…
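
The basic operation the abstract describes, adding a steering vector to a model's hidden states, can be illustrated with a small sketch. This is not FGAA itself: it shows only the generic addition step, plus a CAA-style way of obtaining a direction as the difference between mean activations on positive and negative examples. The function names, the toy numbers, and the `scale` parameter are assumptions made for illustration; no real model or SAE is involved.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Average a list of equal-length activation vectors elementwise."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def contrastive_steering_vector(pos: List[Vector], neg: List[Vector]) -> Vector:
    """CAA-style direction: mean(positive activations) - mean(negative activations)."""
    mean_pos, mean_neg = mean_vector(pos), mean_vector(neg)
    return [p - q for p, q in zip(mean_pos, mean_neg)]


def add_steering(hidden: List[Vector], steer: Vector, scale: float) -> List[Vector]:
    """Add scale * steer to the hidden state at every token position."""
    return [[h + scale * s for h, s in zip(token, steer)] for token in hidden]


# Toy activations with hidden size 3 (made-up numbers, not from any model).
pos_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
neg_acts = [[0.0, 0.0, 2.0], [0.0, 2.0, 2.0]]
steer = contrastive_steering_vector(pos_acts, neg_acts)

# Hidden states for two token positions; the steering scale is a free choice.
hidden_states = [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]
steered = add_steering(hidden_states, steer, scale=2.0)
```

In this toy setting the scale controls how strongly the hidden states are pushed along the direction, which is where the trade-off between steering strength and model performance noted in the findings would show up in practice.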

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Sparse Autoencoder