Training Dynamics of Transformers to Recognize Word Co-occurrence via Gradient Flow Analysis

Hongru Yang; Bhavya Kailkhura; Zhangyang Wang; Yingbin Liang

arXiv:2410.09605·cs.LG·October 15, 2024

Open Access

TL;DR

This paper analyzes the training dynamics of a shallow transformer trained to recognize the co-occurrence of two designated words. It shows that gradient flow naturally splits training into two phases and gives theoretical insight into how the attention and MLP layers evolve during training.

Contribution

The paper offers a framework for analyzing the coupled gradient flow dynamics of all transformer components from random initialization, supported by theoretical proofs and experimental validation.

Findings

01 Gradient flow divides training into two phases.

02 MLP quickly aligns with target signals in Phase 1.

03 Attention matrices and MLP jointly optimize in Phase 2.

Abstract

Understanding the training dynamics of transformers is important for explaining the impressive capabilities behind large language models. In this work, we study the dynamics of training a shallow transformer on the task of recognizing the co-occurrence of two designated words. In the literature on the training dynamics of transformers, several simplifications are commonly adopted, such as weight reparameterization, attention linearization, special initialization, and the lazy regime. In contrast, we analyze the gradient flow dynamics of simultaneously training three attention matrices and a linear MLP layer from random initialization, and provide a framework for analyzing such dynamics via a coupled dynamical system. We establish near minimum loss and characterize the attention model after training. We discover that gradient flow serves as an inherent mechanism that naturally divides the training…
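To make the setup in the abstract more concrete, here is a minimal sketch of a co-occurrence task and a one-layer attention model with three attention matrices and a linear MLP, all drawn from small random initialization. This is an illustration under stated assumptions, not the paper's exact architecture: the vocabulary size, sequence length, one-hot embedding, sum pooling, and every name below (`WORD_A`, `W_mlp`, `forward`, and so on) are hypothetical choices. Only the forward pass is shown; gradient flow would correspond to the small-step-size limit of gradient descent on all four matrices at once.

```python
import numpy as np

VOCAB, SEQ_LEN, HIDDEN = 8, 5, 4   # hypothetical sizes, not from the paper
WORD_A, WORD_B = 0, 1              # hypothetical ids of the two designated words


def label_of(tokens):
    """Return +1.0 if both designated words occur in the sequence, else -1.0."""
    return 1.0 if (WORD_A in tokens and WORD_B in tokens) else -1.0


def make_example(rng):
    """Sample a random token sequence together with its co-occurrence label."""
    tokens = rng.integers(0, VOCAB, size=SEQ_LEN)
    return tokens, label_of(tokens)


def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def init_params(rng, scale=0.01):
    """Small random initialization of three attention matrices and a linear MLP."""
    return {
        "W_Q": scale * rng.standard_normal((VOCAB, VOCAB)),
        "W_K": scale * rng.standard_normal((VOCAB, VOCAB)),
        "W_V": scale * rng.standard_normal((VOCAB, VOCAB)),
        "W_mlp": scale * rng.standard_normal((VOCAB, HIDDEN)),
    }


def forward(params, tokens):
    """Softmax attention followed by a linear MLP, sum-pooled to one scalar score."""
    X = np.eye(VOCAB)[tokens]                              # one-hot, (SEQ_LEN, VOCAB)
    scores = (X @ params["W_Q"]) @ (X @ params["W_K"]).T   # (SEQ_LEN, SEQ_LEN)
    attn = softmax(scores)                                 # each row sums to 1
    hidden = attn @ X @ params["W_V"]                      # (SEQ_LEN, VOCAB)
    out = (hidden @ params["W_mlp"]).sum()                 # assumed sum pooling
    return out, attn


rng = np.random.default_rng(0)
params = init_params(rng)
tokens, label = make_example(rng)
score, attn = forward(params, tokens)
```

The sign of `score` would be compared against `label` when training such a model. All four matrices in `params` are updated simultaneously in the setting the abstract describes, rather than being frozen or reparameterized.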

Taxonomy

Topics: Image Processing and 3D Reconstruction · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Softmax