Understanding Generalization in Transformers: Error Bounds and Training Dynamics Under Benign and Harmful Overfitting

Yingying Zhang; Zhenyu Wu; Jian Li; Yong Liu

arXiv:2502.12508·cs.LG·February 19, 2025

Open Access

TL;DR

This paper develops a generalization theory for two-layer transformers, analyzing how training dynamics and noise influence overfitting and test errors, supported by experiments that validate the theoretical bounds.

Contribution

The paper presents generalization error bounds for a two-layer transformer under benign and harmful overfitting, organized by training stage and signal-to-noise ratio.

Findings

01. Error bounds vary across training stages

02. Training dynamics significantly impact generalization

03. Experimental results confirm theoretical predictions

Abstract

Transformers serve as the foundational architecture for many successful large-scale models, demonstrating the ability to overfit the training data while maintaining strong generalization on unseen data, a phenomenon known as benign overfitting. However, research on how training dynamics influence error bounds in the context of benign overfitting has been limited. This paper addresses this gap by developing a generalization theory for a two-layer transformer trained with label-flip noise. Specifically, we present generalization error bounds for both benign and harmful overfitting under varying signal-to-noise ratios (SNR), where the training dynamics are categorized into three distinct stages, each with its corresponding error bounds. Additionally, we conduct extensive experiments to identify key factors that influence test errors in transformers. Our experimental results align closely…
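As a rough illustration of the label-flip noise mentioned in the abstract, the sketch below negates each binary label independently with a fixed probability. The excerpt does not specify the paper's exact noise model, so the ±1 label encoding, the independent flips, and the names `flip_labels`, `flip_prob`, and `seed` are all assumptions made for this example.

```python
import random


def flip_labels(labels, flip_prob, seed=0):
    """Return a copy of `labels` (assumed to be in {-1, +1}) in which each
    entry is independently negated with probability `flip_prob`.

    Hypothetical helper: the noise model here is an assumption, not the
    paper's definition.
    """
    if not 0.0 <= flip_prob <= 1.0:
        raise ValueError("flip_prob must lie in [0, 1]")
    rng = random.Random(seed)  # local generator so results are reproducible
    return [-y if rng.random() < flip_prob else y for y in labels]


# Small usage example: corrupt a toy label vector and count the flips.
clean = [1, -1, 1, 1, -1, -1, 1, -1]
noisy = flip_labels(clean, flip_prob=0.25)
flipped = sum(c != n for c, n in zip(clean, noisy))
print(f"{flipped} of {len(clean)} labels flipped")
```

A model that fits `noisy` perfectly has memorized the flipped labels. Under this reading, overfitting is benign when test error on clean labels stays low anyway, and harmful when it does not.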

Taxonomy

Topics: Forecasting Techniques and Applications · Neural Networks and Applications