Analyze Feature Flow to Enhance Interpretation and Steering in Language Models

Daniil Laptev; Nikita Balagansky; Yaroslav Aksenov; Daniil Gavrilov

arXiv:2502.03032 · cs.LG · July 28, 2025

TL;DR

This paper presents a novel, data-free method to trace feature evolution across layers in large language models, enhancing interpretability and enabling targeted control over model outputs.

Contribution

It introduces a cosine similarity-based approach to map feature flow across layers, allowing for detailed interpretability and direct steering of language models.
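As a rough illustration of cosine similarity-based matching, the sketch below pairs each feature in one layer with its most similar feature in the next layer. It assumes each feature is represented by a direction vector (for example, a row of a sparse autoencoder decoder matrix); the function name `match_features`, the array shapes, and the synthetic data are hypothetical and not taken from the paper.

```python
import numpy as np

def match_features(dec_a, dec_b):
    """For each feature in layer A, find the most cosine-similar feature in layer B.

    dec_a: array of shape (n_features_a, d_model), one feature direction per row
    dec_b: array of shape (n_features_b, d_model)
    Returns (best_index, best_similarity), each of length n_features_a.
    """
    # Normalize rows to unit length so the dot product equals cosine similarity.
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    sims = a @ b.T                      # (n_features_a, n_features_b)
    best = sims.argmax(axis=1)          # closest layer-B feature for each layer-A feature
    return best, sims[np.arange(len(a)), best]

# Synthetic stand-in for two layers: layer B holds shuffled, slightly
# perturbed copies of the layer-A directions (an assumption for the demo).
rng = np.random.default_rng(0)
dec_a = rng.normal(size=(8, 16))
perm = rng.permutation(8)
dec_b = dec_a[perm] + 0.01 * rng.normal(size=(8, 16))

best, score = match_features(dec_a, dec_b)
```

No input text is needed at any point, which is the sense in which such a matching is data-free: only the feature directions themselves are compared.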

Findings

01

Granular feature flow graphs reveal feature persistence and transformation.

02

Cross-layer feature maps enable targeted manipulation of model behavior.

03

The method clarifies how features develop through the forward pass and supports targeted thematic control in text generation.

Abstract

We introduce a new approach to systematically map features discovered by sparse autoencoders across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for…
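The abstract does not spell out the steering procedure, so the following is only a minimal sketch of one common way to amplify or suppress a feature: adding a scaled copy of its unit direction to the hidden states. The names `steer` and `projection`, the scale `alpha`, and the random data are hypothetical and may differ from what the authors do.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Shift hidden states along a feature direction.

    hidden: array of shape (seq_len, d_model)
    direction: array of shape (d_model,), the feature direction (assumed given)
    alpha: positive to amplify the feature, negative to suppress it
    """
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

def projection(hidden, direction):
    """Component of each hidden state along the unit feature direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden @ unit

# Random stand-ins for a layer's hidden states and one feature direction.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 16))
direction = rng.normal(size=16)

amplified = steer(hidden, direction, alpha=4.0)
suppressed = steer(hidden, direction, alpha=-4.0)
```

Under this sketch, the projection of every token's hidden state onto the feature direction shifts by exactly `alpha`, while components orthogonal to the direction are left unchanged.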

Videos

Analyze Feature Flow to Enhance Interpretation and Steering in Language Models · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling