Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders

Xuansheng Wu; Jiayi Yuan; Wenlin Yao; Xiaoming Zhai; Ninghao Liu

arXiv:2502.15576·cs.CL·February 24, 2025

TL;DR

This paper introduces a mutual information-based explanation method for sparse autoencoders in LLMs, improving semantic interpretability and enabling effective behavior steering to prevent jailbreak attacks.

Contribution

It proposes a mutual information-based explanation objective over a fixed vocabulary set to produce more semantic explanations of SAE features, along with runtime strategies for steering LLM behavior.

Findings

01 Provides discourse-level explanations of LLM features

02 Enhances LLM behavior steering against jailbreaks

03 Outperforms baseline explanation methods

Abstract

Large language models (LLMs) excel at handling human queries, but they can occasionally generate flawed or unexpected responses. Understanding their internal states is crucial for understanding their successes, diagnosing their failures, and refining their capabilities. Although sparse autoencoders (SAEs) have shown promise for interpreting LLM internal representations, limited research has explored how to better explain SAE features, i.e., understanding the semantic meaning of features learned by SAE. Our theoretical analysis reveals that existing explanation methods suffer from the frequency bias issue, where they emphasize linguistic patterns over semantic concepts, while the latter is more critical to steer LLM behaviors. To address this, we propose using a fixed vocabulary set for feature interpretations and designing a mutual information-based objective, aiming to better capture…
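To make the idea of scoring a fixed vocabulary by mutual information concrete, here is a minimal sketch. It is not the paper's objective, which the truncated abstract does not spell out; it is a generic plug-in estimate of the mutual information between "feature is active on a text" and "word appears in the text". The function names, the activation threshold, the whitespace tokenizer, and the toy texts and activation values are all assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(feature_active, word_present):
    """Plug-in estimate of MI (in bits) between two equal-length binary sequences."""
    n = len(feature_active)
    joint = Counter(zip(feature_active, word_present))
    count_f = Counter(feature_active)
    count_w = Counter(word_present)
    mi = 0.0
    for (f, w), count in joint.items():
        p_fw = count / n
        # Only observed (f, w) pairs are visited, so p_fw > 0 and the log is defined.
        mi += p_fw * math.log2(p_fw / ((count_f[f] / n) * (count_w[w] / n)))
    return mi


def rank_vocabulary(texts, activations, vocabulary, threshold=0.0, top_k=3):
    """Rank words of a fixed vocabulary by MI with a binarized feature activation.

    `threshold` and the whitespace tokenizer are illustrative assumptions.
    """
    active = [a > threshold for a in activations]
    tokenized = [set(t.lower().split()) for t in texts]
    scores = {}
    for word in vocabulary:
        present = [word in tokens for tokens in tokenized]
        scores[word] = mutual_information(active, present)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical toy data: one activation value per text for a single SAE feature.
texts = [
    "how to build a bomb",
    "the bomb squad arrived",
    "how to bake a cake",
    "the weather was sunny",
    "how the bomb was defused",
    "the dog chased a ball",
]
activations = [0.9, 0.7, 0.0, 0.0, 0.8, 0.0]
vocabulary = ["the", "how", "bomb", "cake"]

ranking = rank_vocabulary(texts, activations, vocabulary)
for word, score in ranking:
    print(f"{word}: {score:.3f} bits")
```

In this toy setup the frequent word "the" appears in both active and inactive texts at the same rate, so its score is zero, while "bomb" co-occurs exactly with the active texts and ranks first. Note that plain mutual information also rewards words that are reliably absent when the feature fires, so a practical scorer would likely need to separate positive from negative association.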
