Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization

Jiarui Jiang; Wei Huang; Miao Zhang; Taiji Suzuki; Liqiang Nie

arXiv:2409.19345·cs.LG·November 25, 2024

TL;DR

This paper provides the first theoretical analysis of benign overfitting in Vision Transformers, revealing conditions under which they generalize well despite overfitting, supported by both theory and experiments.

Contribution

It introduces a novel theoretical framework for understanding benign overfitting in transformers, addressing challenges posed by softmax and weight interdependence.

Findings

1. Established a sharp condition for generalization based on signal-to-noise ratio.

2. Characterized training dynamics and convergence behavior of transformers.

3. Validated theoretical insights through experimental simulations.

Abstract

Transformers have demonstrated great power in the recent development of large foundational models. In particular, the Vision Transformer (ViT) has brought revolutionary changes to the field of vision, achieving significant accomplishments on the experimental side. However, their theoretical capabilities, particularly in terms of generalization when trained to overfit training data, are still not fully understood. To address this gap, this work delves deeply into the benign overfitting perspective of transformers in vision. To this end, we study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer under gradient descent on a certain data distribution model. By developing techniques that address the challenges posed by softmax and the interdependent nature of multiple weights in transformer optimization, we successfully…
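The architecture named in the abstract (a softmax self-attention layer followed by a fully connected layer) can be sketched roughly as below. This is only an illustrative forward pass, not the authors' implementation: the data model in `make_sample` (one signal token plus Gaussian noise tokens), the mean pooling over tokens, the initialization scale, and all function and parameter names are assumptions made for the example. Gradient descent training is not shown.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def make_sample(rng, mu, num_tokens, noise_std):
    # Hypothetical data model (an assumption, not taken from the source):
    # token 0 carries the label-dependent signal y * mu, the rest are noise.
    y = rng.choice([-1.0, 1.0])
    X = noise_std * rng.standard_normal((num_tokens, mu.shape[0]))
    X[0] = y * mu
    return X, y

def init_params(rng, d, d_v, scale=0.01):
    # Small random initialization; the scale is an arbitrary choice.
    W_q = scale * rng.standard_normal((d, d))
    W_k = scale * rng.standard_normal((d, d))
    W_v = scale * rng.standard_normal((d, d_v))
    w_out = scale * rng.standard_normal(d_v)
    return W_q, W_k, W_v, w_out

def forward(X, W_q, W_k, W_v, w_out):
    # Self-attention with row-wise softmax over tokens.
    scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(W_q.shape[1])
    A = softmax(scores, axis=-1)
    H = A @ (X @ W_v)                    # shape: (num_tokens, d_v)
    # Fully connected layer on mean-pooled tokens (pooling is an assumption).
    out = float(H.mean(axis=0) @ w_out)
    return out, A
```

The sign of `out` would serve as the predicted label in a binary setup like this one, and the ratio between the norm of `mu` and `noise_std` plays the role of a signal-to-noise ratio in the sketch.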

Taxonomy

Topics: Neural Networks and Applications

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Layer Normalization · Dense Connections · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding