Tracking the Feature Dynamics in LLM Training: A Mechanistic Study

Yang Xu; Yi Wang; Hengguan Huang; Hao Wang

arXiv:2412.17626·cs.LG·June 4, 2025

Open Access·3 Reviews

TL;DR

This paper introduces SAE-Track, a new method for analyzing how features in large language models evolve during training, revealing insights into feature formation, semantic changes, and vector drift.

Contribution

SAE-Track is a novel approach that enables continuous tracking of feature dynamics in LLMs, advancing mechanistic interpretability.

Findings

01 Features undergo semantic evolution during training

02 Feature vectors exhibit directional drift over time

03 The method reveals underlying processes of feature formation
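One way to make the directional-drift finding concrete is to follow a single feature's decoder vector across checkpoints and compare each snapshot with its final direction. The sketch below is illustrative only: the choice of cosine similarity, the function names, and the toy vectors are assumptions and are not taken from the paper or its code.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0

def drift_to_final(decoder_vectors):
    """Similarity of one feature's decoder vector at each checkpoint to its final direction.

    decoder_vectors: one vector per checkpoint, ordered by training step (assumed layout).
    """
    final = decoder_vectors[-1]
    return [cosine(v, final) for v in decoder_vectors]

# Toy trajectory: a hypothetical feature direction rotating from the x-axis to the y-axis.
trajectory = drift_to_final([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
```

Values that rise toward 1 over the checkpoints would indicate a feature direction settling into its final orientation; a flat curve near 1 would indicate little drift.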

Abstract

Understanding training dynamics and feature evolution is crucial for the mechanistic interpretability of large language models (LLMs). Although sparse autoencoders (SAEs) have been used to identify features within LLMs, a clear picture of how these features evolve during training remains elusive. In this study, we (1) introduce SAE-Track, a novel method for efficiently obtaining a continual series of SAEs, providing the foundation for a mechanistic study that covers (2) the semantic evolution of features, (3) the underlying processes of feature formation, and (4) the directional drift of feature vectors. Our work provides new insights into the dynamics of features in LLMs, enhancing our understanding of training mechanisms and feature evolution. For reproducibility, our code is available at https://github.com/Superposition09m/SAE-Track.
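The abstract does not spell out how the continual series of SAEs is obtained. A minimal sketch of one plausible reading follows, assuming each checkpoint's SAE is warm-started from the previous checkpoint's SAE and then trained briefly. The architecture (ReLU encoder, linear decoder, L1 penalty), the hyperparameters, and every name below are illustrative assumptions, not the authors' implementation.

```python
import copy
import random

def init_sae(d_model, n_features, rng):
    """Small random init; feature j has one encoder row and one decoder row."""
    def rand_rows():
        return [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(n_features)]
    return {"enc": rand_rows(), "bias": [0.0] * n_features, "dec": rand_rows()}

def forward(sae, x):
    """Return (feature activations, reconstruction) for one activation vector x."""
    feats = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(sae["enc"], sae["bias"])]
    recon = [sum(f * row[k] for f, row in zip(feats, sae["dec"])) for k in range(len(x))]
    return feats, recon

def mean_loss(sae, acts):
    """Mean squared reconstruction error over a list of activation vectors."""
    total = 0.0
    for x in acts:
        _, recon = forward(sae, x)
        total += sum((r - xi) ** 2 for r, xi in zip(recon, x))
    return total / len(acts)

def train_sae(sae, acts, epochs, lr=0.01, l1=1e-3):
    """Per-sample SGD on squared error plus an L1 penalty on feature activations."""
    for _ in range(epochs):
        for x in acts:
            feats, recon = forward(sae, x)
            err = [r - xi for r, xi in zip(recon, x)]
            for j, f in enumerate(feats):
                if f <= 0.0:
                    continue  # inactive ReLU: no gradient reaches feature j
                dec_j = sae["dec"][j]
                g = 2.0 * sum(e * d for e, d in zip(err, dec_j)) + l1  # dLoss/df_j
                for k in range(len(x)):
                    dec_j[k] -= lr * 2.0 * f * err[k]
                    sae["enc"][j][k] -= lr * g * x[k]
                sae["bias"][j] -= lr * g
    return sae

def track_saes(acts_per_ckpt, n_features, first_epochs=200, later_epochs=20, seed=0):
    """Train one SAE per checkpoint, warm-starting each from its predecessor (assumption)."""
    rng = random.Random(seed)
    d_model = len(acts_per_ckpt[0][0])
    series = [train_sae(init_sae(d_model, n_features, rng), acts_per_ckpt[0], first_epochs)]
    for acts in acts_per_ckpt[1:]:
        # Copy the previous SAE so earlier snapshots stay frozen, then train briefly.
        series.append(train_sae(copy.deepcopy(series[-1]), acts, later_epochs))
    return series
```

Under this assumption, feature index j refers to the same slot in every SAE of the series, which is what would make per-feature tracking across checkpoints straightforward. It also only helps if activations change slowly between consecutive checkpoints.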

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 2

Strengths

- Addresses an important interpretability gap by enabling continuous tracking of feature evolution rather than static analyses.
- Combines semantic, geometric, and dynamical perspectives, offering a comprehensive view of feature development.

Weaknesses

- The experiments rely on relatively small open-source models, which may limit the generalizability of the conclusions to larger LLMs.
- The definition and identification of the three phases could be clarified. It may be useful to compare how the phases emerge across different perspectives (semantic concepts, feature dynamics, decoder vector evolution) and whether they match up to ensure consistency.
- Although some results for additional layers are provided in the appendix, it would strengthe

Reviewer 02·Rating 6·Confidence 4

Strengths

The authors present a well-written and comprehensive study of feature evolution during LLM training. They introduce a novel method for training sequences of SAEs on training checkpoints and study the evolution of features during training in a variety of ways. They study both token- and concept-level feature formation qualitatively and quantitatively and provide original analysis of feature trajectories in the latent space. During their analysis, they also propose a new metric to measure the prog

Weaknesses

It is unclear to what degree these results are influenced by the continual training approach used for the SAEs. There should be a comparison with training SAEs separately on each checkpoint to determine a) how similar the features from those SAEs are to those from the continually trained models and b) how the feature evolution looks for the standard approach. Additionally, quality measures should be reported to evaluate how well the SAEs are capturing the LLMs' representations. Some analysis is

Reviewer 03·Rating 2·Confidence 4

Strengths

- The track-SAE method gives a way to track features dynamically, assuming that dense checkpoints are available.
- The geometric perspective in fig. 3 is interesting.
- The writing of the paper is well organised, and the limitation section is well documented and clear.

Weaknesses

- A key limitation of the proposed method is its reliance on the assumption that representations do not vary significantly from checkpoint to checkpoint. This assumption only holds for models with very dense checkpoints—a scenario that is often impractical or unavailable for very large models. Consequently, the SAE-track method is applicable to only a narrow subset of models. This not only increases training costs but also limits the analysis of neuron tracking: it would be impossible to track a

Taxonomy

Topics·Natural Language Processing Techniques