Priors in Time: Missing Inductive Biases for Language Model Interpretability

Ekdeep Singh Lubana; Can Rager; Sai Sumedh R. Hindupur; Valerie Costa; Greta Tuckute; Oam Patel; Sonia Krishna Murthy; Thomas Fel; Daniel Wurgaft; Eric J. Bigelow; Johnny Lin; Demba Ba; Martin Wattenberg; Fernanda Viegas; Melanie Weber; Aaron Mueller

arXiv:2511.01836·cs.LG·November 25, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces Temporal Feature Analysis, an interpretability method for language models with a temporal inductive bias. It addresses a limitation of existing methods such as Sparse Autoencoders, which assume that concepts are independent across time, and it better captures the temporal dynamics of language model representations.

Contribution

We propose Temporal Feature Analysis, a novel interpretability objective with a temporal inductive bias, inspired by neuroscience, to better analyze language model dynamics.

Findings

01

Temporal Feature Analyzers successfully parse garden path sentences.

02

They identify event boundaries and distinguish slow-moving from fast-moving information.

03

Existing Sparse Autoencoders show significant limitations in these tasks.

Abstract

Recovering meaningful concepts from language model activations is a central aim of interpretability. While existing feature extraction methods aim to identify concepts that are independent directions, it is unclear if this assumption can capture the rich temporal structure of language. Specifically, via a Bayesian lens, we demonstrate that Sparse Autoencoders (SAEs) impose priors that assume independence of concepts across time, implying stationarity. Meanwhile, language model representations exhibit rich temporal dynamics, including systematic growth in conceptual dimensionality, context-dependent correlations, and pronounced non-stationarity, in direct conflict with the priors of SAEs. Taking inspiration from computational neuroscience, we introduce a new interpretability objective -- Temporal Feature Analysis -- which possesses a temporal inductive bias to decompose representations…
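To make the idea of a temporal decomposition concrete, here is a minimal sketch that splits each activation into a part predictable from the previous time step and a residual. It is not the paper's Temporal Feature Analysis objective: the ridge-regression predictor, the one-step context window, the function names `split_predictable_novel` and `lag1_autocorr`, and the synthetic AR(1) data standing in for language model activations are all assumptions chosen for illustration.

```python
import numpy as np


def split_predictable_novel(acts, ridge=1e-3):
    """Split activations into a predictable part and a residual.

    acts: array of shape (T, d), one activation vector per time step.
    Assumption: the "predictable" part is a ridge-regression forecast of
    x_t from x_{t-1}; this is an illustrative stand-in, not the paper's method.
    """
    past, present = acts[:-1], acts[1:]
    d = acts.shape[1]
    # Closed-form ridge solution W = (P^T P + ridge * I)^{-1} P^T X
    W = np.linalg.solve(past.T @ past + ridge * np.eye(d), past.T @ present)
    predictable = past @ W
    novel = present - predictable
    return predictable, novel


def lag1_autocorr(x):
    """Mean lag-1 autocorrelation across dimensions of a (T, d) array."""
    x = x - x.mean(axis=0)
    num = (x[:-1] * x[1:]).sum(axis=0)
    den = (x ** 2).sum(axis=0)
    return float((num / den).mean())


# Synthetic stand-in for activations (hypothetical data): a slowly drifting
# AR(1) signal plus i.i.d. noise at every time step.
rng = np.random.default_rng(0)
T, d = 500, 8
slow = np.zeros((T, d))
for t in range(1, T):
    slow[t] = 0.95 * slow[t - 1] + rng.normal(scale=0.1, size=d)
acts = slow + rng.normal(scale=0.05, size=(T, d))

predictable, novel = split_predictable_novel(acts)
print("lag-1 autocorrelation, predictable part:", lag1_autocorr(predictable))
print("lag-1 autocorrelation, residual part:   ", lag1_autocorr(novel))
```

On this toy data the predictable part inherits the slow drift and stays strongly correlated across adjacent steps, while the residual is close to uncorrelated in time. A prior that treats every time step as independent fits the residual far better than it fits the full signal.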

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1. *Formal Characterization of SAE Priors:* The paper provides a clear and valuable formalization of the implicit assumptions in standard SAEs. Characterizing the priors as i.i.d. across time (Proposition 3.1) cleanly articulates the central problem. 2. *Strong Empirical Analysis:* The claims about the data's temporal structure are well-supported by a robust empirical analysis. The use of multiple metrics, including intrinsic dimensionality (U-statistic) and autocorrelation, demonstrates the no

Weaknesses

1. *Evaluation Relies on Correlational Evidence:* The primary evidence for the Temporal SAE's improved interpretability is qualitative (dendrograms in Fig. 5, 7) or correlational (CKA with slow/fast Fourier signals in Table 3, Fig. 6). While standard metrics like reconstruction (Table 1) are included as sanity checks, the paper never demonstrates that the disentangled $z_p$ and $z_n$ features are _causally_ more useful than standard SAE features at steering model behavior. - The qualitative c

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The idea of decomposing the activation into two parts is interesting. 2. The core problem, and the flawed assumption made by the standard SAE, is well introduced. 3. The proposed method is interesting. 4. The temporal SAE can successfully identify concepts that the standard SAE cannot.

Weaknesses

1. The authors claim their temporal SAE is better than the standard SAE. However, an SAE can be difficult to evaluate, since there is no "ground truth". Therefore, the authors should evaluate their SAE on downstream tasks such as model steering. 2. What about the interpretable concepts in the predictable component? Why are we not finding interpretable features in the predictable component as well? 3. The paper never evaluates its new features on any downstream tasks. The evaluati

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The primary advantage of this paper lies in its thoughtful incorporation of linguistic insights into the analysis, design, and evaluation of SAE methods and LLM representations. The empirical experiments in Section 4, presented in Figure 2, clearly show that LLM representations of token sequences conform to the non-stationarity of natural language, featuring continuous increments of new information regulated by underlying dependency structures. The evaluation of the proposed Temporal SAE on g

Weaknesses

1. One of the primary motivations (and functions) of the original SAE is to decompose the polysemous LLM representations into distinct monosemantic features that are more interpretable and less ambiguous, which can further be utilized in steering experiments. This paper, however, makes no attempt to semantically interpret the learned features in this regard—likely because the architectural design of the Temporal SAE, where the feature activations of the predictable part are obtained via an atten

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Topic Modeling