Non-asymptotic Convergence of Training Transformers for Next-token Prediction

Ruiquan Huang; Yingbin Liang; Jing Yang

arXiv:2409.17335·cs.LG·October 1, 2024

Open Access

TL;DR

This paper provides a detailed non-asymptotic analysis of training dynamics for a one-layer transformer in next-token prediction, revealing convergence properties and generalization capabilities.

Contribution

It introduces a two-stage training algorithm with proven convergence rates and characterizes, through a framework based on partial orders, the structural properties of NTP training datasets that shape training.

Findings

01 Both layers converge sub-linearly to max-margin solutions

02 Cross-entropy loss converges linearly during training

03 Trained transformers show strong prediction ability under dataset shift

Abstract

Transformers have achieved extraordinary success in modern machine learning due to their excellent ability to handle sequential data, especially in next-token prediction (NTP) tasks. However, the theoretical understanding of their performance in NTP is limited, with existing studies focusing mainly on asymptotic performance. This paper provides a fine-grained non-asymptotic analysis of the training dynamics of a one-layer transformer consisting of a self-attention module followed by a feed-forward layer. We first characterize the essential structural properties of training datasets for NTP using a mathematical framework based on partial orders. Then, we design a two-stage training algorithm, where the pre-processing stage for training the feed-forward layer and the main stage for training the attention layer exhibit fast convergence performance. Specifically, both layers converge…
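To make the setup in the abstract concrete, here is a minimal NumPy sketch of a one-layer transformer (self-attention followed by a linear feed-forward layer) trained in two stages: first the feed-forward layer with attention frozen, then the attention layer with the feed-forward layer frozen. Everything specific here is an assumption for illustration and is not taken from the paper: the parametrization by `W_kq` and `W_ov`, one-hot embeddings, the last token acting as the query, plain gradient descent, the step sizes, and the toy dataset.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))  # shift for numerical stability
    return e / e.sum()

def forward(X, W_kq, W_ov):
    # X: (T, d) token embeddings; the last row acts as the query token (assumption).
    a = softmax(X @ W_kq @ X[-1])  # attention weights over the T positions
    h = X.T @ a                    # attention output, shape (d,)
    p = softmax(W_ov @ h)          # predicted next-token distribution, shape (V,)
    return a, h, p

def loss_and_grads(X, y, W_kq, W_ov):
    # Cross-entropy loss for one (sentence, next token) pair and its gradients.
    a, h, p = forward(X, W_kq, W_ov)
    dz = p.copy()
    dz[y] -= 1.0                   # dL/dlogits = p - onehot(y)
    g_ov = np.outer(dz, h)         # gradient for the feed-forward layer
    da = X @ (W_ov.T @ dz)         # back through h = X^T a
    ds = a * (da - a @ da)         # back through the attention softmax
    g_kq = np.outer(X.T @ ds, X[-1])  # gradient for the attention layer
    return -np.log(p[y]), g_kq, g_ov

def mean_loss(data, W_kq, W_ov):
    return float(np.mean([loss_and_grads(X, y, W_kq, W_ov)[0] for X, y in data]))

def train_stage(data, W_kq, W_ov, layer, steps, lr):
    # Plain gradient descent on one layer while the other stays frozen.
    W_kq, W_ov = W_kq.copy(), W_ov.copy()
    for _ in range(steps):
        grads = [loss_and_grads(X, y, W_kq, W_ov) for X, y in data]
        if layer == "ov":
            W_ov -= lr * np.mean([g[2] for g in grads], axis=0)
        else:
            W_kq -= lr * np.mean([g[1] for g in grads], axis=0)
    return W_kq, W_ov

# Hypothetical toy data: tokens 0 and 1 determine the next token,
# token 2 is a distractor, token 3 is the query at the end of each sentence.
E = np.eye(4)  # one-hot embeddings, so d = V = 4
sentences = [([0, 2, 3], 0), ([2, 0, 3], 0), ([1, 2, 3], 1), ([2, 1, 3], 1)]
data = [(E[s], y) for s, y in sentences]

W_kq, W_ov = np.zeros((4, 4)), np.zeros((4, 4))
loss0 = mean_loss(data, W_kq, W_ov)
W_kq, W_ov = train_stage(data, W_kq, W_ov, "ov", steps=200, lr=0.5)  # pre-processing stage
loss1 = mean_loss(data, W_kq, W_ov)
W_kq, W_ov = train_stage(data, W_kq, W_ov, "kq", steps=200, lr=0.1)  # main stage
loss2 = mean_loss(data, W_kq, W_ov)
print(loss0, loss1, loss2)
```

The script prints the mean cross-entropy loss at initialization, after the feed-forward stage, and after the attention stage. The first value equals log 4 because the zero-initialized model predicts uniformly over the four tokens. The sketch illustrates only the stage structure; it does not reproduce the paper's algorithm or its convergence rates.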

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis

Methods: Softmax · Attention Is All You Need