One-Layer Transformer Provably Learns One-Nearest Neighbor In Context

Zihao Li; Yuan Cao; Cheng Gao; Yihan He; Han Liu; Jason M. Klusowski; Jianqing Fan; Mengdi Wang

arXiv:2411.10830·cs.LG·November 19, 2024

TL;DR

This paper proves that a one-layer transformer with softmax attention can learn to implement the one-nearest neighbor classifier, providing theoretical insight into transformers' in-context learning abilities for classical nonparametric methods.

Contribution

It offers a theoretical demonstration that a single-layer transformer can learn to perform one-nearest neighbor classification, highlighting the role of softmax attention.

Findings

01. Transformer learns one-nearest neighbor rule from prompts

02. Gradient descent training converges despite nonconvex loss

03. Provides theoretical understanding of transformers' in-context learning

Abstract

Transformers have achieved great success in recent years. Interestingly, transformers have shown particularly strong in-context learning capability -- even without fine-tuning, they are still able to solve unseen tasks well purely based on task-specific prompts. In this paper, we study the capability of one-layer transformers in learning one of the most classical nonparametric estimators, the one-nearest neighbor prediction rule. Under a theoretical framework where the prompt contains a sequence of labeled training data and unlabeled test data, we show that, although the loss function is nonconvex when trained with gradient descent, a single softmax attention layer can successfully learn to behave like a one-nearest neighbor classifier. Our result gives a concrete example of how transformers can be trained to implement nonparametric machine learning algorithms, and sheds light on the…
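
As a rough illustration of why softmax attention can mimic a one-nearest neighbor rule, the sketch below scores each labeled example by its negative squared distance to the query and takes a softmax-weighted average of the labels. This is not the paper's trained model: the distance-based score, the function names, and the hand-set scale `beta` are assumptions made for illustration, and nothing here is learned by gradient descent. As `beta` grows, the softmax weights concentrate on the closest example, so the output approaches the one-nearest neighbor label.

```python
import numpy as np

def softmax_attention_predict(X_train, y_train, x_query, beta):
    """Softmax-weighted average of labels.

    Scores are -beta * ||x_i - x_query||^2 (an illustrative choice,
    not the parameterization analyzed in the paper).
    """
    sq_dists = np.sum((X_train - x_query) ** 2, axis=1)
    scores = -beta * sq_dists
    scores -= scores.max()          # subtract the max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()        # softmax over the labeled examples
    return float(weights @ y_train)

def one_nn_predict(X_train, y_train, x_query):
    """Label of the training point closest to the query."""
    sq_dists = np.sum((X_train - x_query) ** 2, axis=1)
    return float(y_train[np.argmin(sq_dists)])

# Hypothetical toy prompt: three labeled points and one query.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
y_train = np.array([1.0, -1.0, -1.0])
x_query = np.array([0.1, 0.1])

nn_label = one_nn_predict(X_train, y_train, x_query)
for beta in (0.0, 1.0, 10.0, 100.0):
    pred = softmax_attention_predict(X_train, y_train, x_query, beta)
    print(f"beta={beta:>6}: attention={pred:+.4f}  1-NN={nn_label:+.1f}")
```

With `beta = 0` the attention weights are uniform and the prediction is just the mean label; increasing `beta` moves the prediction toward the label of the closest point.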

Taxonomy

Topics: Neural Networks and Applications

Methods: Attention Is All You Need · Softmax