Non-asymptotic Convergence of Training Transformers for Next-token Prediction
Ruiquan Huang, Yingbin Liang, Jing Yang

TL;DR
This paper provides a detailed non-asymptotic analysis of training dynamics for a one-layer transformer in next-token prediction, revealing convergence properties and generalization capabilities.
Contribution
It introduces a two-stage training algorithm with proven convergence rates and offers new insights into the structural properties influencing transformer training performance.
Findings
Both layers converge sub-linearly, in direction, to their corresponding max-margin solutions
Cross-entropy loss converges linearly during training
Trained transformers show strong prediction ability under dataset shift
Abstract
Transformers have achieved extraordinary success in modern machine learning due to their excellent ability to handle sequential data, especially in next-token prediction (NTP) tasks. However, the theoretical understanding of their performance in NTP is limited, with existing studies focusing mainly on asymptotic performance. This paper provides a fine-grained non-asymptotic analysis of the training dynamics of a one-layer transformer consisting of a self-attention module followed by a feed-forward layer. We first characterize the essential structural properties of training datasets for NTP using a mathematical framework based on partial orders. Then, we design a two-stage training algorithm, where the pre-processing stage for training the feed-forward layer and the main stage for training the attention layer exhibit fast convergence performance. Specifically, both layers converge…
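To make the setup in the abstract concrete, the following is a minimal sketch of a one-layer model (softmax self-attention followed by a linear feed-forward layer) trained in two stages: first the feed-forward weights with attention held fixed, then the attention weights with the feed-forward layer frozen. It is an illustration under stated assumptions, not the paper's algorithm: the toy dataset, the one-hot embeddings, the use of the last token as the query, plain gradient descent, and every name and step size (`W_att`, `W_ff`, `two_stage_train`, `lr_ff`, `lr_att`) are hypothetical choices made here.

```python
import numpy as np


def softmax(v):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(v - v.max())
    return e / e.sum()


def forward(X, W_att, W_ff):
    # X: (T, d) token embeddings; the last token acts as the query (assumption).
    q = X[-1]
    attn = softmax(X @ W_att @ q)      # attention weights over the T positions
    ctx = X.T @ attn                   # attention-weighted context vector, shape (d,)
    probs = softmax(W_ff @ ctx)        # next-token distribution over the vocabulary
    return attn, ctx, probs


def mean_loss(data, W_att, W_ff):
    # Average cross-entropy of the true next token.
    return float(np.mean([-np.log(forward(X, W_att, W_ff)[2][y]) for X, y in data]))


def grads(data, W_att, W_ff):
    # Hand-derived gradients of the average cross-entropy loss.
    g_att = np.zeros_like(W_att)
    g_ff = np.zeros_like(W_ff)
    for X, y in data:
        attn, ctx, probs = forward(X, W_att, W_ff)
        dz = probs.copy()
        dz[y] -= 1.0                                   # dL/dlogits
        g_ff += np.outer(dz, ctx)                      # dL/dW_ff
        d_attn = X @ (W_ff.T @ dz)                     # dL/dattn
        d_scores = attn * (d_attn - attn @ d_attn)     # back through the softmax
        g_att += np.outer(X.T @ d_scores, X[-1])       # dL/dW_att
    n = len(data)
    return g_att / n, g_ff / n


def two_stage_train(data, dim, vocab, steps=200, lr_ff=0.5, lr_att=0.1):
    # Step counts and learning rates are illustrative, not taken from the paper.
    W_att = np.zeros((dim, dim))       # zero weights give uniform attention
    W_ff = np.zeros((vocab, dim))
    loss_init = mean_loss(data, W_att, W_ff)

    # Stage 1 (pre-processing): train the feed-forward layer, attention fixed.
    for _ in range(steps):
        _, g_ff = grads(data, W_att, W_ff)
        W_ff -= lr_ff * g_ff
    loss_stage1 = mean_loss(data, W_att, W_ff)

    # Stage 2 (main): train the attention layer, feed-forward layer frozen.
    for _ in range(steps):
        g_att, _ = grads(data, W_att, W_ff)
        W_att -= lr_att * g_att
    loss_stage2 = mean_loss(data, W_att, W_ff)

    return W_att, W_ff, (loss_init, loss_stage1, loss_stage2)


def toy_data():
    # Hypothetical dataset: tokens 0 and 1 are informative, 2 and 3 are filler;
    # the next token equals the informative token that appears in the sequence.
    eye = np.eye(4)
    seqs = [([0, 2, 3], 0), ([2, 1, 3], 1), ([3, 0, 2], 0), ([1, 3, 2], 1)]
    return [(eye[s], y) for s, y in seqs]
```

Calling `two_stage_train(toy_data(), dim=4, vocab=4)` returns the two weight matrices and the average loss at initialization, after stage 1, and after stage 2, which makes it easy to inspect how each stage changes the loss on this toy problem.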
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis
Methods: Softmax · Attention Is All You Need
