Benign Overfitting in Token Selection of Attention Mechanism
Keitaro Sakamoto, Issei Sato

TL;DR
This paper investigates how attention mechanisms in transformers can overfit label noise yet still generalize well, revealing delayed generalization and supporting findings with experiments.
Contribution
It provides a theoretical analysis of benign overfitting in attention token selection, highlighting the role of signal-to-noise ratio and delayed generalization.
Findings
Attention achieves benign overfitting despite fitting label noise
Delayed generalization occurs after initial overfitting phase
Experimental validation on synthetic and real datasets
Abstract
The attention mechanism is a fundamental component of the transformer model and plays a significant role in its success. However, the theoretical understanding of how attention learns to select tokens is still an emerging area of research. In this work, we study the training dynamics and generalization ability of the attention mechanism in classification problems with label noise. We show that, under a characterization in terms of the signal-to-noise ratio (SNR), the token selection of the attention mechanism achieves benign overfitting, i.e., it maintains high generalization performance despite fitting label noise. Our work also demonstrates an interesting delayed acquisition of generalization after an initial phase of overfitting. Finally, we provide experiments on both synthetic and real-world datasets to support our theoretical analysis.
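To make "token selection" concrete, the following is a minimal sketch of how softmax attention can concentrate its weight on a single token. It is not the paper's model or training procedure: the data layout (one clean signal token at index 0, the rest small Gaussian noise), the helper names `make_sequence` and `attention_weights`, and all constants are assumptions chosen for illustration, and label noise is not modeled here.

```python
import numpy as np


def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()


def attention_weights(X, p):
    # One attention score per token: the inner product of the token with p.
    return softmax(X @ p)


def make_sequence(rng, num_tokens=5, dim=8, signal_norm=3.0, noise_std=0.1):
    # Hypothetical toy data: token 0 is a pure signal vector mu,
    # every other token is low-variance Gaussian noise.
    X = noise_std * rng.standard_normal((num_tokens, dim))
    mu = np.zeros(dim)
    mu[0] = signal_norm
    X[0] = mu
    return X, mu


rng = np.random.default_rng(0)
X, mu = make_sequence(rng)
direction = mu / np.linalg.norm(mu)

# Growing the norm of p along the signal direction sharpens the softmax,
# moving it from uniform averaging toward selecting the signal token.
for scale in (0.0, 1.0, 5.0):
    weights = attention_weights(X, scale * direction)
    print(f"scale={scale}: weight on signal token = {weights[0]:.4f}")
```

With `scale = 0` every token receives equal weight; as the scale grows, the weight on the signal token approaches 1. In this toy setup the high-norm signal token plays the role of a high-SNR input, which is the regime where selection is easiest.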
Taxonomy
Topics: Manufacturing Process and Optimization
Methods: Softmax · Attention Is All You Need
