Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability

Usha Bhalla; Alex Oesterling; Claudio Mayrink Verdun; Himabindu Lakkaraju; Flavio P. Calmon

arXiv:2511.05541·cs.CL·February 27, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces Temporal Sparse Autoencoders (T-SAEs), which incorporate a contrastive loss to leverage the sequential nature of language, resulting in more coherent and interpretable semantic features without sacrificing reconstruction quality.

Contribution

The paper proposes T-SAEs, which use the temporal structure of language and a contrastive loss over adjacent tokens to disentangle semantic from syntactic features without supervision, improving the interpretability of language models.

Findings

01. T-SAEs recover smoother, more coherent semantic concepts.

02. They outperform traditional SAEs in interpretability without losing reconstruction quality.

03. Semantic structure emerges clearly even without explicit semantic supervision.

Abstract

Translating the internal representations and computations of models into concepts that humans can understand is a key goal of interpretability. While recent dictionary learning methods such as Sparse Autoencoders (SAEs) provide a promising route to discover human-interpretable features, they often only recover token-specific, noisy, or highly local concepts. We argue that this limitation stems from neglecting the temporal structure of language, where semantic content typically evolves smoothly over sequences. Building on this insight, we introduce Temporal Sparse Autoencoders (T-SAEs), which incorporate a novel contrastive loss encouraging consistent activations of high-level features over adjacent tokens. This simple yet powerful modification enables SAEs to disentangle semantic from syntactic features in a self-supervised manner. Across multiple datasets and models, T-SAEs recover…
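The abstract's central idea, a contrastive term that rewards consistent high-level activations on adjacent tokens, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the InfoNCE-style form of the loss, the ReLU encoder with an L1 penalty, the choice of the first `n_high` latents as the "high-level" features, and all names and hyperparameters (`temporal_contrastive_loss`, `tsae_loss`, `temperature`, `alpha`, `l1`) are assumptions made for illustration only.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def temporal_contrastive_loss(z_t, z_next, temperature=0.1, eps=1e-8):
    """InfoNCE-style loss (an assumed form, not the paper's exact loss).

    z_t, z_next: (batch, k) high-level activations at token t and token t+1.
    Row i of z_t and row i of z_next come from the same sequence (positive
    pair); all other rows in the batch act as negatives.
    """
    a = z_t / (np.linalg.norm(z_t, axis=1, keepdims=True) + eps)
    b = z_next / (np.linalg.norm(z_next, axis=1, keepdims=True) + eps)
    logits = a @ b.T / temperature  # (batch, batch) scaled cosine similarities
    # Symmetric cross-entropy with the diagonal entries as the correct matches.
    loss_ab = -np.mean(np.diag(log_softmax(logits, axis=1)))
    loss_ba = -np.mean(np.diag(log_softmax(logits, axis=0)))
    return 0.5 * (loss_ab + loss_ba)


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Plain ReLU sparse autoencoder: returns (latents, reconstruction)."""
    z = np.maximum(x @ W_enc + b_enc, 0.0)
    x_hat = z @ W_dec + b_dec
    return z, x_hat


def tsae_loss(x_t, x_next, params, n_high, alpha=1.0, l1=1e-3):
    """Reconstruction + sparsity + temporal contrastive term.

    Only the first n_high latents (a hypothetical split) are pushed to be
    consistent across adjacent tokens; the remaining latents are left free
    to capture token-local information.
    """
    z_t, xhat_t = sae_forward(x_t, *params)
    z_n, xhat_n = sae_forward(x_next, *params)
    recon = np.mean((x_t - xhat_t) ** 2) + np.mean((x_next - xhat_n) ** 2)
    sparsity = l1 * (np.abs(z_t).mean() + np.abs(z_n).mean())
    contrast = temporal_contrastive_loss(z_t[:, :n_high], z_n[:, :n_high])
    return recon + sparsity + alpha * contrast
```

In this sketch, adjacent-token pairs with nearly identical high-level activations yield a lower contrastive term than mismatched pairs, which is the behavior the abstract describes. The negatives drawn from other sequences in the batch are what distinguish a contrastive objective from a plain similarity penalty.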

Peer Reviews

Decision·ICLR 2026 Oral

Reviewer 01 · Rating 4 · Confidence 2

Strengths

1. The topic of temporal SAEs is promising and interesting. 2. The experiments are conducted on multiple models and datasets. 3. The t-SNE visualizations are impressive.

Weaknesses

[1] In Figure 1(c), what do the numbers on the x-axis mean? Do they indicate the position within a long sentence? [2] Could you also discuss the limitations of this study and potential future directions? [3] The experiments are conducted on Pythia-160m and Gemma2-2b, models with small parameter counts. The reviewer understands this might be constrained by computational resources. I am not asking for additional experiments. However, could you discuss the motivations for choosing these models, and whether

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The presentation of the paper is very clear: the motivation, method, and results are all presented in a way that is easy to follow. The spliced-text visualizations are particularly strong, providing an "it just works" demonstration that is more compelling than the quantitative metrics alone. The experimental validation is robust. The authors demonstrate their method's contribution through: (a) The smooth, semantic features in the visualizations are neat, (b) probing results confirm the high-lev

Weaknesses

A major weakness of this work is the overall lack of proper contextualization against highly relevant prior work. The central problem (that the i.i.d. assumption for tokens is a flaw) and the corresponding solution (that temporal dynamics should be leveraged for smoother extracted features) are not new. For instance, a paper published earlier at ICLR 2025 identified the very same problem and proposed a similar temporal modification to SAEs (https://scholar.google.com/citations?view_op=view_cit

Reviewer 03 · Rating 10 · Confidence 4

Strengths

This work addresses a core shortcoming of SAEs as an interpretability technique, which is that the interpretable features they find are often too specific to individual tokens to be useful. For instance, a feature might be "Sentences endings or periods" (Figure 1), which is interpretable and useful to the SAE's reconstruction, but is not useful for downstream applications like steering. In this regard, the addition of temporal consistency is a natural evolution of the SAE architecture. The disc

Weaknesses

The contrastive loss has a complicated structure, and the authors do not motivate or explain why it has that form. There is insufficient explanation of why this contrastive loss outperforms the naive temporal similarity loss term (Lines 454-458 and Table 2). The case study in Section 4.5 shows quantitative results, but does not compare them to a classical SAE. This makes it hard to judge whether the T-SAE architecture is an improvement over previous methods in this context.

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis