Incremental Learning of Sparse Attention Patterns in Transformers

Oğuz Kaan Yüksel; Rodrigo Alvarez Lucendo; Nicolas Flammarion

arXiv:2602.19143·cs.LG·February 24, 2026

Open Access · 3 Reviews

TL;DR

This paper provides a theoretical analysis of how transformers learn sparse attention patterns incrementally, revealing staged learning dynamics, the role of early stopping, and the progression through simpler hypothesis classes.

Contribution

It introduces a high-order Markov chain task to model incremental learning in transformers and characterizes the stage-wise convergence and dynamics of attention pattern specialization.

Findings

01. Transformers learn attention patterns in stages, from simple to complex.

02. Early stopping biases models toward simpler, more generalizable patterns.

03. Simplified differential equations model the training trajectory and yield stage-wise convergence results.

Abstract

This paper introduces a high-order Markov chain task to investigate how transformers learn to integrate information from multiple past positions with varying statistical significance. We demonstrate that transformers learn this task incrementally: each stage is defined by the acquisition of specific information through sparse attention patterns. Notably, we identify a shift in learning dynamics from competitive, where heads converge on the most statistically dominant pattern, to cooperative, where heads specialize in distinct patterns. We model these dynamics using simplified differential equations that characterize the trajectory and prove stage-wise convergence results. Our analysis reveals that transformers ascend a complexity ladder by passing through simpler, misspecified hypothesis classes before reaching the full model class. We further show that early stopping acts as an…
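To make the idea of a high-order Markov chain with unequally important past positions concrete, here is a minimal sketch of one possible generator. It is not the paper's construction: it uses a mixture-of-lags chain in which each step picks one lag at random and samples the next token from a transition matrix tied to that lag. The names `make_transition` and `sample_sequence`, and the parameters `lags`, `weights`, and `vocab_size`, are illustrative assumptions.

```python
import random


def make_transition(vocab_size, rng):
    """Return a random row-stochastic matrix as a list of lists (hypothetical helper)."""
    rows = []
    for _ in range(vocab_size):
        raw = [rng.random() for _ in range(vocab_size)]
        total = sum(raw)
        rows.append([r / total for r in raw])
    return rows


def sample_sequence(length, lags, weights, vocab_size, seed=0):
    """Sample tokens where x_t depends on x_{t-lag} for a randomly chosen lag.

    Assumption: `weights[k]` is the probability mass given to `lags[k]`,
    so a larger weight makes that past position statistically more dominant.
    """
    rng = random.Random(seed)
    # One transition matrix per lag; each row is a distribution over next tokens.
    transitions = [make_transition(vocab_size, rng) for _ in lags]
    order = max(lags)
    # Fill the first `order` positions uniformly so every lag has a token to read.
    seq = [rng.randrange(vocab_size) for _ in range(order)]
    while len(seq) < length:
        k = rng.choices(range(len(lags)), weights=weights)[0]
        prev = seq[-lags[k]]
        nxt = rng.choices(range(vocab_size), weights=transitions[k][prev])[0]
        seq.append(nxt)
    return seq[:length]


# Lag 1 dominates, lag 5 contributes least.
seq = sample_sequence(length=32, lags=[1, 3, 5], weights=[0.6, 0.3, 0.1], vocab_size=4)
print(seq)
```

In a chain like this, a predictor that attends only to the most heavily weighted lag already captures much of the signal, which gives a rough intuition for why a model might pick up the dominant position first and the weaker ones later.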

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

The paper is well-motivated, and the problem setup is clearly presented. The task design is well-suited for studying learning dynamics in a simplified and controlled testbed that also lends itself to theoretical analysis (though the theoretical part later relies on additional simplifications). For the proposed task, the empirical study effectively reveals and visualizes the inner mechanisms and learning dynamics, offering nice insight into how attention heads specialize during training. The ov

Weaknesses

- The theoretical analysis is conducted on a much simpler setup than the synthetic task used in the empirical study. This level of simplification is not inherently problematic for a theoretical treatment, provided the simplified model replicates the key behaviors and offers a tractable framework for analysis. However, the theory presented here is only partial: it focuses solely on the initial stage of training, where all heads learn the same pattern under specific initialization assumptions. In

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The synthetic task cleanly isolates how transformers learn hierarchical sparse dependencies, providing a tractable testbed for analyzing training dynamics.
2. The theoretical analysis offers a principled explanation for the competitive phase, complementing empirical observations.
3. Experiments on dataset size reveal that limited data induces implicit regularization (learning fewer blocks), deepening understanding of generalization in data-scarce regimes.

Weaknesses

1. While previous works like [1] have studied how transformers learn causal structure, this paper provides a more detailed analysis of the training dynamics; however, most of the analysis remains experimental, and the finding that the model first learns to attend to the most important tokens and then refines the pattern seems intuitive and not surprising.
2. The theoretical analysis lacks a clear explanation; for example, V(t) and s(t) are not defined, making it hard to grasp the main statement

Reviewer 03 · Rating 6 · Confidence 3

Strengths

* The figures are thoughtfully designed and effectively convey the key results.
* The theoretical development is clear and satisfying: it provides a principled dynamical systems view of an empirically observed phenomenon.
* The connection between optimization dynamics and low-rank tensor factorization is elegant and offers intuitive insight into head specialization.

Weaknesses

1. Missing Discussion of Generalization. The abstract mentions “generalization,” yet this topic is not revisited in the main text.
2. Limited Scope of Contribution. The first stated contribution, analyzing a simplified, single-layer model, should be reframed as a limitation rather than a contribution. It remains unclear how the conclusions would extend to deeper or larger-scale architectures or to more general types of sequences.
3. Unclear Cognitive Relevance. The setup—modeling the next token

Taxonomy

Topics: Embodied and Extended Cognition · Neural dynamics and brain function · Motor Control and Adaptation