Training Dynamics of Transformers to Recognize Word Co-occurrence via Gradient Flow Analysis
Hongru Yang, Bhavya Kailkhura, Zhangyang Wang, Yingbin Liang

TL;DR
This paper analyzes the gradient flow training dynamics of a shallow transformer trained to recognize the co-occurrence of two designated words. It shows that training splits into two phases and gives theoretical insight into how the attention and MLP layers evolve.
Contribution
It offers a framework for analyzing the coupled gradient flow dynamics of all transformer components (three attention matrices and a linear MLP layer) trained simultaneously from random initialization, with theoretical proofs and experimental validation.
Findings
Gradient flow divides training into two phases.
In Phase 1, the MLP quickly aligns with the target signals.
In Phase 2, the attention matrices and the MLP are optimized jointly.
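For reference, gradient flow is the continuous-time limit of gradient descent with vanishing step size. The notation below is generic and illustrative, not taken from the paper: θ collects the three attention matrices and the linear MLP weights, and L is the training loss.

```latex
\frac{d\theta(t)}{dt} = -\nabla_{\theta} L\bigl(\theta(t)\bigr),
\qquad \theta = (W_Q, W_K, W_V, W_{\mathrm{MLP}}),
\qquad \theta(0) \text{ drawn at random},

\frac{d}{dt} L\bigl(\theta(t)\bigr)
  = \Bigl\langle \nabla_{\theta} L, \frac{d\theta}{dt} \Bigr\rangle
  = -\bigl\lVert \nabla_{\theta} L\bigl(\theta(t)\bigr) \bigr\rVert^{2} \le 0 .
```

The second line is the standard fact that the loss never increases along a gradient flow trajectory. Because every component of θ follows the gradient of the same loss, the components evolve as one coupled dynamical system.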
Abstract
Understanding the training dynamics of transformers is important to explain the impressive capabilities behind large language models. In this work, we study the dynamics of training a shallow transformer on a task of recognizing co-occurrence of two designated words. In the literature on the training dynamics of transformers, several simplifications are commonly adopted, such as weight reparameterization, attention linearization, special initialization, and the lazy regime. In contrast, we analyze the gradient flow dynamics of simultaneously training three attention matrices and a linear MLP layer from random initialization, and provide a framework for analyzing such dynamics via a coupled dynamical system. We establish near minimum loss and characterize the attention model after training. We discover that gradient flow serves as an inherent mechanism that naturally divides the training…
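To make the setup in the abstract concrete, here is a minimal NumPy sketch of a co-occurrence task and a one-layer softmax-attention model with three attention matrices and a linear read-out. It is an illustration under stated assumptions, not the paper's construction: the vocabulary size, token ids, random embeddings, mean pooling, 1/sqrt(d) scaling, logistic loss, and all function names are hypothetical choices made for this example.

```python
import numpy as np

# All sizes, token ids, and architecture details below are illustrative
# assumptions, not values or definitions taken from the paper.
VOCAB, SEQ_LEN, DIM = 10, 6, 8
WORD_A, WORD_B = 0, 1  # hypothetical ids of the two designated words


def label_tokens(tokens):
    """Return +1 where both designated words occur in a sequence, else -1."""
    has_a = (tokens == WORD_A).any(axis=1)
    has_b = (tokens == WORD_B).any(axis=1)
    return np.where(has_a & has_b, 1.0, -1.0)


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def init_params(rng, scale=0.01):
    """Small random initialization of all trainable weights."""
    return {
        "W_Q": scale * rng.normal(size=(DIM, DIM)),
        "W_K": scale * rng.normal(size=(DIM, DIM)),
        "W_V": scale * rng.normal(size=(DIM, DIM)),
        "w_mlp": scale * rng.normal(size=DIM),  # linear read-out layer
    }


def forward(params, embed, tokens):
    """One softmax-attention layer, mean pooling, then a linear layer."""
    x = embed[tokens]                                  # (n, L, d)
    q, k, v = x @ params["W_Q"], x @ params["W_K"], x @ params["W_V"]
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(DIM))  # (n, L, L)
    pooled = (attn @ v).mean(axis=1)                   # (n, d)
    return pooled @ params["w_mlp"], attn              # (n,), (n, L, L)


def logistic_loss(outputs, labels):
    """Mean of log(1 + exp(-y * f(x))) for labels y in {-1, +1}."""
    return np.logaddexp(0.0, -labels * outputs).mean()


rng = np.random.default_rng(0)
embed = rng.normal(size=(VOCAB, DIM))  # fixed random embeddings (assumption)
tokens = rng.integers(0, VOCAB, size=(32, SEQ_LEN))
labels = label_tokens(tokens)
params = init_params(rng)
outputs, attn = forward(params, embed, tokens)
loss = logistic_loss(outputs, labels)
print(f"loss at random initialization: {loss:.4f}")
```

With a small initialization scale the model output is close to zero, so the initial loss sits near log 2. Gradient flow would then be approximated by gradient descent on all four weight arrays at once with a very small step size.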
Taxonomy
Topics: Image Processing and 3D Reconstruction · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Softmax
