Focus and Dilution: The Multi-stage Learning Process of Attention

Zheng-An Chen; Pengxiao Lin; Zhi-Qin John Xu; Tao Luo

arXiv:2605.01199·cs.LG·May 5, 2026

TL;DR

This paper uncovers a recurrent focus-dilution cycle in Transformer attention training, explaining its stages and dynamics through gradient-flow analysis and stage-wise linearization, supported by experiments on synthetic and real data.

Contribution

It provides a rigorous, stage-wise explanation of the focus-dilution cycle in attention learning, derived for a one-layer Transformer trained on Markovian data. The phenomenon was previously not well understood.

Findings

01. Embedding and projection rapidly condense to a rank-one structure.

02. Attention shifts focus toward high-frequency tokens during training.

03. The focus-dilution cycle repeats, driving attention dynamics in Transformers.

Abstract

Transformer-based models have achieved remarkable success across a wide range of domains, yet our understanding of their training dynamics remains limited. In this work, we identify a recurrent focus-dilution cycle in attention learning and provide a rigorous explanation in a one-layer Transformer setting for Markovian data via gradient-flow analysis. Using stage-wise linearization around critical points, we show that a single focus-dilution cycle can be decomposed into a sequence of distinct stages. First, embedding and projection rapidly condense to a rank-one structure, while attention parameters remain effectively frozen. Then, the attention parameters begin to increase, inducing a frequency-driven focus toward high-frequency tokens. As attention continues to evolve, it generates next-order perturbations in embeddings, leading to a mass-redistribution mechanism that progressively…
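
The two quantities the abstract leans on, token frequency under Markovian data and rank-one condensation of a weight matrix, can be made concrete with a small NumPy sketch. This is an illustration under stated assumptions and not the authors' code: the 3-token transition matrix `P`, the helper names, and the "top singular value energy" used as a rank-one measure are all hypothetical choices made for this example.

```python
import numpy as np


def stationary_distribution(P):
    """Stationary distribution of a row-stochastic transition matrix P.

    Interpreted here as the long-run token frequency (an assumption of
    this sketch, not a definition taken from the paper).
    """
    eigvals, eigvecs = np.linalg.eig(P.T)
    # The left eigenvector with eigenvalue closest to 1 is the stationary one.
    idx = np.argmin(np.abs(eigvals - 1.0))
    pi = np.real(eigvecs[:, idx])
    return pi / pi.sum()


def sample_markov_chain(P, length, rng):
    """Sample a token sequence from the chain with a uniform initial token."""
    n = P.shape[0]
    seq = np.empty(length, dtype=np.int64)
    seq[0] = rng.integers(n)
    for t in range(1, length):
        seq[t] = rng.choice(n, p=P[seq[t - 1]])
    return seq


def rank_one_energy(W):
    """Fraction of the squared Frobenius norm carried by the top singular value.

    A value near 1 means W is close to rank one.
    """
    s = np.linalg.svd(W, compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))


rng = np.random.default_rng(0)

# Hypothetical chain in which token 0 is the high-frequency token.
P = np.array([[0.8, 0.1, 0.1],
              [0.6, 0.3, 0.1],
              [0.6, 0.1, 0.3]])

pi = stationary_distribution(P)
seq = sample_markov_chain(P, 5000, rng)
empirical = np.bincount(seq, minlength=3) / len(seq)

# An exactly rank-one matrix and a slightly perturbed copy.
u = rng.normal(size=8)
v = rng.normal(size=5)
W_rank_one = np.outer(u, v)
W_noisy = W_rank_one + 0.01 * rng.normal(size=(8, 5))

print("stationary frequencies:", pi)
print("empirical frequencies: ", empirical)
print("rank-one energy (exact):", rank_one_energy(W_rank_one))
print("rank-one energy (noisy):", rank_one_energy(W_noisy))
```

For this particular `P`, solving the balance equations by hand gives stationary frequencies of 0.75, 0.125, and 0.125, so token 0 plays the role of the high-frequency token. In an actual training run, one could track `rank_one_energy` on the embedding and projection matrices over time to look for the early condensation stage; how the paper itself measures this is not stated in the abstract.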
