Benign Overfitting in Token Selection of Attention Mechanism

Keitaro Sakamoto; Issei Sato

arXiv:2409.17625 · cs.LG · May 20, 2025

TL;DR

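This paper analyzes how token selection in the attention mechanism can fit label noise yet still generalize well (benign overfitting), shows that generalization can emerge only after an initial overfitting phase, and supports the theory with experiments on synthetic and real-world datasets.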

Contribution

The paper provides a theoretical analysis of benign overfitting in attention token selection, characterized in terms of the signal-to-noise ratio (SNR), and identifies a delayed acquisition of generalization.

Findings

01. Attention achieves benign overfitting despite fitting label noise

02. Delayed generalization occurs after an initial overfitting phase

03. Experimental validation on synthetic and real-world datasets

Abstract

The attention mechanism is a fundamental component of the transformer model and plays a significant role in its success. However, the theoretical understanding of how attention learns to select tokens is still an emerging area of research. In this work, we study the training dynamics and generalization ability of the attention mechanism in classification problems with label noise. We show that, under a characterization in terms of the signal-to-noise ratio (SNR), the token selection of the attention mechanism achieves benign overfitting, i.e., it maintains high generalization performance despite fitting label noise. Our work also demonstrates an interesting delayed acquisition of generalization after an initial phase of overfitting. Finally, we provide experiments on both synthetic and real-world datasets to support our theoretical analysis.
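As a rough illustration of what "token selection" and "label noise" mean in this setting, the following is a minimal sketch, not the authors' implementation. It assumes a single-query softmax attention layer that pools the tokens with weights softmax(Xp) and scores the pooled vector with a fixed linear head. The names `attention_output`, `flip_labels`, `p`, `nu`, and the noise rate `eta` are hypothetical and chosen only for this example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_output(tokens, p, nu):
    """Assumed toy model: f(X) = nu . sum_t softmax(X p)_t * x_t.

    tokens: list of T token vectors (each a list of d floats)
    p:      query-like parameter that decides which token is selected
    nu:     fixed linear head applied to the pooled token
    Returns the scalar output and the attention weights.
    """
    scores = [sum(a * b for a, b in zip(x, p)) for x in tokens]
    weights = softmax(scores)
    d = len(p)
    pooled = [sum(w * x[i] for w, x in zip(weights, tokens)) for i in range(d)]
    return sum(a * b for a, b in zip(nu, pooled)), weights


def flip_labels(labels, eta, rng):
    """Label noise: flip each label in {-1, +1} independently with probability eta."""
    return [-y if rng.random() < eta else y for y in labels]


# One "signal" token along the first axis and two small "noise" tokens.
tokens = [[2.0, 0.0], [0.0, 0.1], [0.0, -0.1]]
p = [3.0, 0.0]   # aligned with the signal direction (hypothetical values)
nu = [1.0, 0.0]

output, weights = attention_output(tokens, p, nu)
noisy = flip_labels([1, -1, 1, 1], eta=0.25, rng=random.Random(0))
print(weights, output, noisy)
```

When `p` is aligned with the signal direction, nearly all attention weight falls on the signal token, so the output sign follows the clean label. In this toy picture, fitting a flipped training label would require `p` to shift weight toward a sample-specific noise token instead.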

Code & Models

Repositories

keitaroskmt/benign-attention (PyTorch, official)

Taxonomy

Topics: Manufacturing Process and Optimization

Methods: Softmax · Attention Is All You Need