Understanding the Emergence of Seemingly Useless Features in Next-Token Predictors

Mark Rofin; Jalal Naghiyev; Michael Hahn

arXiv:2603.14087·cs.LG·March 17, 2026

TL;DR

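This paper investigates why Transformers develop features that appear redundant for predicting the immediate next token. It identifies the gradient components responsible and analyzes their influence on feature emergence in OthelloGPT, a small language model, and a pretrained LLM, where features with extreme influence tend to relate to formal reasoning domains such as code.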

Contribution

The paper introduces a method for estimating how strongly each gradient component influences the emergence of specific features, offering insight into how hidden features of Transformers develop during training.

Findings

01

Specific components of the next-token prediction gradient give rise to features that appear redundant for predicting the immediate next token.

02

In a pretrained LLM, features with extremely high or low influence on future tokens tend to relate to formal reasoning domains such as code.

03

The framework is used to interpret the origins of the world model in OthelloGPT and of syntactic features in a small language model.

Abstract

Trained Transformers have been shown to compute abstract features that appear redundant for predicting the immediate next token. We identify which components of the gradient signal from the next-token prediction objective give rise to this phenomenon, and we propose a method to estimate the influence of those components on the emergence of specific features. After validating our approach on toy tasks, we use it to interpret the origins of the world model in OthelloGPT and syntactic features in a small language model. Finally, we apply our framework to a pretrained LLM, showing that features with extremely high or low influence on future tokens tend to be related to formal reasoning domains such as code. Overall, our work takes a step toward understanding hidden features of Transformers through the lens of their development during training.
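
As a rough illustration of what it can mean to split a gradient by where the loss signal comes from, the sketch below uses a hypothetical scalar toy model; it is not the paper's setup and does not reproduce the authors' definitions. The reviews further down describe a three-way decomposition (direct, pre-cached, shared), whereas this sketch shows only a simpler two-way split. Everything here is an assumption made for illustration: the model `h_i = w * x_i`, the mixing coefficients `a` and `b`, and the function names. The loss at position `j` reads the hidden value at `j` and the hidden values at earlier positions, so the gradient with respect to the shared weight `w` separates into a same-position part and a cross-position part.

```python
def predictions(w, xs, a=1.0, b=0.5):
    """Toy model (hypothetical): h_i = w * x_i, and the prediction at
    position j reads h_j plus a scaled sum of all earlier hidden values."""
    hs = [w * x for x in xs]
    return [a * hs[j] + b * sum(hs[:j]) for j in range(len(xs))]


def loss(w, xs, ys, a=1.0, b=0.5):
    """Sum of squared errors over all positions."""
    preds = predictions(w, xs, a, b)
    return sum(0.5 * (p - y) ** 2 for p, y in zip(preds, ys))


def split_gradient(w, xs, ys, a=1.0, b=0.5):
    """Split dL/dw by path: loss at j through h_j (same position) versus
    loss at j through h_i with i < j (cross position)."""
    preds = predictions(w, xs, a, b)
    same_pos, cross_pos = 0.0, 0.0
    for j, (p, y) in enumerate(zip(preds, ys)):
        residual = p - y
        same_pos += residual * a * xs[j]
        cross_pos += residual * b * sum(xs[:j])
    return same_pos, cross_pos


def numeric_gradient(w, xs, ys, a=1.0, b=0.5, eps=1e-6):
    """Central finite difference, used only to check the analytic split."""
    return (loss(w + eps, xs, ys, a, b) - loss(w - eps, xs, ys, a, b)) / (2 * eps)


xs = [1.0, 2.0, 3.0]
ys = [0.0, 1.0, 2.0]
same_pos, cross_pos = split_gradient(0.3, xs, ys)
total = same_pos + cross_pos
```

Here `cross_pos` is the share of the update to `w` that exists only because later positions read earlier hidden values; setting `b = 0` removes that path and leaves the same-position term alone. The two parts sum to the full gradient, which the finite-difference helper can confirm.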

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The paper introduces a novel decomposition of gradient signal into direct, pre-cached and shared components. This framework extends [Wu et al., 2024](https://arxiv.org/pdf/2404.00859)'s work by studying the emergence of features as a function of the different gradient signals.
- The authors offer a valuable methodological advance for interpreting training dynamics.
- The framework is used in four diverse settings: toy algorithmic tasks, OthelloGPT, TinyStories, and Gemma 2 2B, and is used to e

Weaknesses

- The influence estimation framework requires retraining the models with the same random seed and data order, which is infeasible for large-scale models. This limits the method’s practical applicability.
- The role of "pre-cached" features in text generation models (section 4.3) is unclear. No statistically significant results are shown. Absolute levels of influence are not informative.
- In the pre-cached features analysis on Gemma 2 (section 5.1), the authors claim that SAE features with high pre-c

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- **Originality.** The three-way gradient decomposition is novel and well formalized. It offers a developmental view of feature emergence that complements prior interpretability work.
- **Quality.** The theoretical analysis is rigorous and the experiments are well designed. Ablations on toy tasks and real models support the framework.
- **Structure and coherence.** The presentation follows a logical order from theory to empirical evidence. Figures and proofs are consistent with the stated cl

Weaknesses

- **Lack of empirical validation of \(Q(w)\).** Proposition 5.1 defines \(Q(w)\) as an influence proxy, but it is not validated against true influence ratios despite available data.
- **Modified optimizer.** The use of a non-standard Adam variant with separate moments for each gradient component could alter convergence. A control experiment with standard Adam is needed.
- **Correlation without causation.** The Gemma 2 analysis links high \(Q(w)\) to formal reasoning features but does not tes

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- Studies the utility of emergent features that allow predictions beyond the next token in next-token prediction pre-training.
- Provides additional confirmation of hypotheses and results from previous literature (Wu et al.'s pre-caching speculation/breadcrumbs hypothesis, OthelloGPT and board-state encoding).

Weaknesses

- The premise of the study feels partially undefined: the assumption that Transformers trained on NTP will converge to greedy decoding feels unintuitive or imposed by fiat. This can be mitigated with a citation, if appropriate, where the problem space is defined in the Introduction.
- Very minor: the use of "rose" makes the information difficult to read on a screen and borderline invisible on an eInk display.

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications