Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization
Jiarui Jiang, Wei Huang, Miao Zhang, Taiji Suzuki, Liqiang Nie

TL;DR
This paper provides the first theoretical analysis of benign overfitting in Vision Transformers, revealing conditions under which they generalize well despite overfitting, supported by both theory and experiments.
Contribution
It introduces a novel theoretical framework for understanding benign overfitting in transformers, addressing challenges posed by softmax and weight interdependence.
Findings
Establishes a sharp condition for generalization based on the signal-to-noise ratio.
Characterizes the training dynamics and convergence behavior of transformers.
Validates the theoretical insights through experimental simulations.
Abstract
Transformers have demonstrated great power in the recent development of large foundational models. In particular, the Vision Transformer (ViT) has brought revolutionary changes to the field of vision, achieving significant accomplishments on the experimental side. However, their theoretical capabilities, particularly in terms of generalization when trained to overfit training data, are still not fully understood. To address this gap, this work delves deeply into the benign overfitting perspective of transformers in vision. To this end, we study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer under gradient descent on a certain data distribution model. By developing techniques that address the challenges posed by softmax and the interdependent nature of multiple weights in transformer optimization, we successfully…
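The architecture named in the abstract, a softmax self-attention layer followed by a fully connected layer, can be sketched as a forward pass. This is a minimal illustration and not the paper's exact parameterization: the single head, the 1/sqrt(d_k) score scaling, the mean pooling over tokens, the scalar output, and the names `init_params`, `forward`, `W_q`, `W_k`, `W_v`, and `w_o` are all assumptions made here. Training by gradient descent is not shown.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(d, d_k, d_v, rng, scale=0.1):
    """Small random initialization (hypothetical choice, not from the paper)."""
    W_q = scale * rng.normal(size=(d, d_k))
    W_k = scale * rng.normal(size=(d, d_k))
    W_v = scale * rng.normal(size=(d, d_v))
    w_o = scale * rng.normal(size=(d_v,))
    return W_q, W_k, W_v, w_o


def forward(X, W_q, W_k, W_v, w_o):
    """Single-head softmax self-attention followed by a linear layer.

    X has shape (num_tokens, d), one row per image patch (token).
    Returns a scalar score and the attention matrix.
    """
    # Pairwise query-key scores between tokens (scaling is an assumption).
    scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(W_q.shape[1])
    # Each row is a probability distribution over tokens.
    A = softmax(scores, axis=-1)
    # Attention-weighted values, shape (num_tokens, d_v).
    H = A @ (X @ W_v)
    # Mean-pool over tokens, then apply the fully connected layer.
    return float(H.mean(axis=0) @ w_o), A


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 tokens of dimension 8 (toy sizes)
params = init_params(d=8, d_k=5, d_v=6, rng=rng)
output, attention = forward(X, *params)
```

The softmax couples every token's weight to all the others in its row, and the output depends on products of `W_q`, `W_k`, `W_v`, and `w_o`; these are the two sources of difficulty the abstract attributes to the analysis.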
Taxonomy
Topics: Neural Networks and Applications
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Layer Normalization · Dense Connections · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding
