Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders

Dong Shu; Xuansheng Wu; Haiyan Zhao; Mengnan Du; Ninghao Liu

arXiv:2505.08080 · cs.LG · September 24, 2025

TL;DR

This paper introduces Gradient Sparse Autoencoders (GradSAE), a method that identifies influential latent features in large language models by considering output-side gradient information, improving interpretability and steering capabilities.

Contribution

The paper proposes GradSAE, a novel approach that incorporates output gradients to identify influential latents, advancing interpretability and control of large language models.

Findings

1. GradSAE identifies influential latents by their causal influence on the model's output.

2. Incorporating output gradients improves model interpretability.

3. GradSAE enhances model steering by focusing on high-impact latents.

Abstract

Sparse Autoencoders (SAEs) have recently emerged as powerful tools for interpreting and steering the internal representations of large language models (LLMs). However, conventional approaches to analyzing SAEs typically rely solely on input-side activations, without considering the causal influence between each latent feature and the model's output. This work is built on two key hypotheses: (1) activated latents do not contribute equally to the construction of the model's output, and (2) only latents with high causal influence are effective for model steering. To validate these hypotheses, we propose Gradient Sparse Autoencoder (GradSAE), a simple yet effective method that identifies the most influential latents by incorporating output-side gradient information.
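
The two hypotheses can be made concrete with a toy example. The abstract does not state GradSAE's exact scoring rule, so the sketch below assumes a common attribution score, the absolute value of activation times output gradient, and applies it to a made-up SAE with three latents feeding a linear readout. All weights, sizes, and function names (`encode`, `output_gradient`, `steering_effect`, and so on) are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
# Toy illustration only: every weight, size, and name here is made up.
# Assumed score: |activation * d(output)/d(latent)|, not the paper's exact rule.

def relu(x):
    return x if x > 0.0 else 0.0

def encode(h, W_enc, b_enc):
    """SAE encoder: z = ReLU(W_enc @ h + b_enc)."""
    return [relu(sum(w * x for w, x in zip(row, h)) + b)
            for row, b in zip(W_enc, b_enc)]

def model_output(z, W_dec, readout):
    """Scalar output: readout . (W_dec @ z), a stand-in for a target logit."""
    recon = [sum(w * a for w, a in zip(row, z)) for row in W_dec]
    return sum(r * x for r, x in zip(readout, recon))

def output_gradient(W_dec, readout):
    """d(output)/d(z_j). The z -> output map is linear in this toy,
    so the gradient is readout^T W_dec and does not depend on z."""
    n_latents = len(W_dec[0])
    return [sum(readout[i] * W_dec[i][j] for i in range(len(W_dec)))
            for j in range(n_latents)]

def rank(scores):
    """Latent indices sorted by descending score."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

# Hidden state with 2 dimensions, SAE with 3 latents.
h = [1.0, 2.0]
W_enc = [[1.0, 1.0], [0.5, 0.0], [0.0, 0.5]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[0.1, 0.0, 2.0],
         [0.0, 1.0, 0.0]]
readout = [1.0, 0.2]

z = encode(h, W_enc, b_enc)                      # input-side activations
grad = output_gradient(W_dec, readout)           # output-side gradients
influence = [abs(a * g) for a, g in zip(z, grad)]

by_activation = rank(z)          # ranking from input-side activations alone
by_influence = rank(influence)   # ranking from the assumed gradient-weighted score

def steering_effect(j, delta=1.0):
    """Change in the output when latent j is increased by delta."""
    steered = list(z)
    steered[j] += delta
    return model_output(steered, W_dec, readout) - model_output(z, W_dec, readout)

print("by activation:", by_activation)
print("by influence: ", by_influence)
print("steering effect of latent 0:", steering_effect(0))
print("steering effect of latent 2:", steering_effect(2))
```

In this toy, latent 0 has the largest activation but a small output gradient, so it ranks first by activation and only second by the gradient-weighted score, while latent 2 moves to the top. Steering latent 2 by the same amount also shifts the output far more than steering latent 0, which mirrors the second hypothesis. In a real LLM the gradient would come from backpropagation through the remaining layers rather than a closed-form expression.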

Taxonomy

Methods: Sparse Autoencoder