Incremental Learning of Sparse Attention Patterns in Transformers
Oğuz Kaan Yüksel, Rodrigo Alvarez Lucendo, Nicolas Flammarion

TL;DR
This paper provides a theoretical analysis of how transformers learn sparse attention patterns incrementally, revealing staged learning dynamics, the role of early stopping, and the progression through simpler hypothesis classes.
Contribution
It introduces a high-order Markov chain task to model incremental learning in transformers and characterizes the stage-wise convergence and dynamics of attention pattern specialization.
Findings
Transformers learn attention patterns in stages, from simple to complex.
Early stopping biases models toward simpler, more generalizable patterns.
Differential equations model the transition dynamics and convergence stages.
Abstract
This paper introduces a high-order Markov chain task to investigate how transformers learn to integrate information from multiple past positions with varying statistical significance. We demonstrate that transformers learn this task incrementally: each stage is defined by the acquisition of specific information through sparse attention patterns. Notably, we identify a shift in learning dynamics from competitive, where heads converge on the most statistically dominant pattern, to cooperative, where heads specialize in distinct patterns. We model these dynamics using simplified differential equations that characterize the trajectory and prove stage-wise convergence results. Our analysis reveals that transformers ascend a complexity ladder by passing through simpler, misspecified hypothesis classes before reaching the full model class. We further show that early stopping acts as an…
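The abstract does not spell out the task definition, so the following is only a minimal sketch of one way a high-order Markov chain with unequally important past positions could be generated. It assumes the next-token distribution is a weighted mixture of per-lag transition matrices; the lags, the weights, and all function names (`make_transition_matrix`, `next_token_distribution`, `sample_sequence`) are hypothetical choices for illustration, not the authors' construction.

```python
import random


def make_transition_matrix(vocab_size, rng):
    """Random row-stochastic matrix: row i is a distribution over the next token."""
    rows = []
    for _ in range(vocab_size):
        weights = [rng.random() for _ in range(vocab_size)]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


def next_token_distribution(history, lags, lag_weights, matrices):
    """Mix the per-lag rows: sum_k lag_weights[k] * matrices[k][history[-lags[k]]]."""
    vocab_size = len(matrices[0])
    probs = [0.0] * vocab_size
    for lag, weight, matrix in zip(lags, lag_weights, matrices):
        row = matrix[history[-lag]]  # distribution conditioned on the token `lag` steps back
        for token in range(vocab_size):
            probs[token] += weight * row[token]
    return probs


def sample_sequence(length, vocab_size, lags, lag_weights, seed=0):
    """Sample one sequence; the first max(lags) tokens are drawn uniformly."""
    rng = random.Random(seed)
    matrices = [make_transition_matrix(vocab_size, rng) for _ in lags]
    seq = [rng.randrange(vocab_size) for _ in range(max(lags))]
    while len(seq) < length:
        probs = next_token_distribution(seq, lags, lag_weights, matrices)
        seq.append(rng.choices(range(vocab_size), weights=probs, k=1)[0])
    return seq[:length], matrices


# Assumed setup: three relevant past positions with decreasing importance.
sequence, matrices = sample_sequence(
    length=32, vocab_size=4, lags=[1, 3, 5], lag_weights=[0.6, 0.3, 0.1]
)
print(sequence)
```

Under this assumed construction, a model that attends only to the lag with the largest weight already captures most of the predictable signal, which is one intuition for why the dominant pattern might be picked up first and the weaker lags later.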
Peer Reviews
Decision · Submitted to ICLR 2026
The paper is well-motivated, and the problem setup is clearly presented. The task design is well-suited for studying learning dynamics in a simplified and controlled testbed that also lends itself to theoretical analysis (though the theoretical part later relies on additional simplifications). For the proposed task, the empirical study effectively reveals and visualizes the inner mechanisms and learning dynamics, offering nice insight into how attention heads specialize during training. The ov
- The theoretical analysis is conducted on a much simpler setup than the synthetic task used in the empirical study. This level of simplification is not inherently problematic for a theoretical treatment, provided the simplified model replicates the key behaviors and offers a tractable framework for analysis. However, the theory presented here is only partial: it focuses solely on the initial stage of training, where all heads learn the same pattern under specific initialization assumptions. In
1. The synthetic task cleanly isolates how transformers learn hierarchical sparse dependencies, providing a tractable testbed for analyzing training dynamics.
2. The theoretical analysis offers a principled explanation for the competitive phase, complementing empirical observations.
3. Experiments on dataset size reveal that limited data induces implicit regularization (learning fewer blocks), deepening understanding of generalization in data-scarce regimes.
1. While previous works like [1] have studied how transformers learn causal structure, this paper provides a more detailed analysis of the training dynamics; however, most of the analysis remains experimental, and the finding that the model first learns to attend to the most important tokens and then refines the pattern seems intuitive and not surprising.
2. The theoretical analysis lacks a clear explanation; for example, V(t) and s(t) are not defined, making it hard to grasp the main statement
* The figures are thoughtfully designed and effectively convey the key results.
* The theoretical development is clear and satisfying: it provides a principled dynamical systems view of an empirically observed phenomenon.
* The connection between optimization dynamics and low-rank tensor factorization is elegant and offers intuitive insight into head specialization.
1. Missing Discussion of Generalization. The abstract mentions “generalization,” yet this topic is not revisited in the main text.
2. Limited Scope of Contribution. The first stated contribution (analyzing a simplified, single-layer model) should be reframed as a limitation rather than a contribution. It remains unclear how the conclusions would extend to deeper or larger-scale architectures or to more general types of sequences.
3. Unclear Cognitive Relevance. The setup, modeling the next token
Taxonomy
Topics: Embodied and Extended Cognition · Neural dynamics and brain function · Motor Control and Adaptation
