Focus and Dilution: The Multi-stage Learning Process of Attention
Zheng-An Chen, Pengxiao Lin, Zhi-Qin John Xu, Tao Luo

TL;DR
This paper uncovers a recurrent focus-dilution cycle in Transformer attention training, explaining its stages and dynamics through gradient-flow analysis and stage-wise linearization, supported by experiments on synthetic and real data.
Contribution
The paper gives a rigorous, stage-wise explanation of the focus-dilution cycle in attention learning, a phenomenon that was not previously well understood.
Findings
Embedding and projection rapidly condense to a rank-one structure while attention parameters remain effectively frozen.
Attention shifts focus toward high-frequency tokens during training.
The focus-dilution cycle repeats, driving attention dynamics in Transformers.
Abstract
Transformer-based models have achieved remarkable success across a wide range of domains, yet our understanding of their training dynamics remains limited. In this work, we identify a recurrent focus-dilution cycle in attention learning and provide a rigorous explanation in a one-layer Transformer setting for Markovian data via gradient-flow analysis. Using stage-wise linearization around critical points, we show that a single focus-dilution cycle can be decomposed into a sequence of distinct stages. First, embedding and projection rapidly condense to a rank-one structure, while attention parameters remain effectively frozen. Then, the attention parameters begin to increase, inducing a frequency-driven focus toward high-frequency tokens. As attention continues to evolve, it generates next-order perturbations in embeddings, leading to a mass-redistribution mechanism that progressively…
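One way to make the "condense to a rank-one structure" stage concrete is to track how much of a weight matrix's energy sits in its top singular value. The following is a minimal sketch, not the authors' method: the function name `rank_one_fraction`, the matrix sizes, and the synthetic matrices are all assumptions chosen for illustration.

```python
import numpy as np


def rank_one_fraction(matrix):
    """Fraction of the squared Frobenius norm captured by the top singular value.

    A value near 1.0 means the matrix is close to rank one.
    (Hypothetical diagnostic, not taken from the paper.)
    """
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    total_energy = np.sum(singular_values ** 2)
    if total_energy == 0:
        return 0.0
    return float(singular_values[0] ** 2 / total_energy)


rng = np.random.default_rng(0)
vocab_size, dim = 8, 4  # hypothetical sizes

# A generic random matrix stands in for an uncondensed embedding.
random_embedding = rng.normal(size=(vocab_size, dim))

# An outer product plus small noise stands in for a condensed embedding.
u = rng.normal(size=(vocab_size, 1))
v = rng.normal(size=(1, dim))
condensed_embedding = u @ v + 1e-3 * rng.normal(size=(vocab_size, dim))

print("random:   ", rank_one_fraction(random_embedding))
print("condensed:", rank_one_fraction(condensed_embedding))
```

Applied to embedding or projection weights saved at successive training checkpoints, a diagnostic of this kind would be expected to rise toward 1.0 during the condensation stage, assuming the rank-one description holds for the model in question.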
