Provably learning a multi-head attention layer
Sitan Chen, Yuanzhi Li

TL;DR
This paper gives the first provable algorithms and lower bounds for learning multi-head attention layers in transformers. The lower bounds show that, in the worst case, the complexity must depend exponentially on the number of heads. The analysis focuses on Boolean inputs, which model the discrete nature of tokens.
Contribution
It introduces the first nontrivial algorithms and lower bounds for provably learning multi-head attention layers from random examples, advancing theoretical understanding of transformer components.
Findings
Algorithm learns attention layer with small error under certain conditions
Exponential lower bounds show worst-case hardness with respect to number of heads
Techniques extend to continuous distributions like Gaussian
Abstract
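The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models. Given a sequence length $k$, attention matrices $\mathbf{\Theta}_1,\ldots,\mathbf{\Theta}_m \in \mathbb{R}^{d\times d}$, and projection matrices $\mathbf{W}_1,\ldots,\mathbf{W}_m \in \mathbb{R}^{d\times d}$, the corresponding multi-head attention layer $F: \mathbb{R}^{k\times d} \to \mathbb{R}^{k\times d}$ transforms length-$k$ sequences of $d$-dimensional tokens $\mathbf{X} \in \mathbb{R}^{k\times d}$ via $F(\mathbf{X}) \triangleq \sum_{i=1}^{m} \mathrm{softmax}(\mathbf{X}\mathbf{\Theta}_i\mathbf{X}^\top)\mathbf{X}\mathbf{W}_i$. In this work, we initiate the study of provably learning a multi-head attention layer from random examples and give the first nontrivial upper and lower bounds for this problem: - Provided $\{\mathbf{W}_i,…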
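To make the object being learned concrete, here is a minimal pure-Python sketch of the forward map of a multi-head attention layer. It assumes the standard form F(X) = sum over heads i of softmax(X Theta_i X^T) X W_i, with the softmax applied to each row, which is how the layer in the abstract is understood here. The helper names (`matmul`, `softmax_rows`, `multi_head_attention`, `random_boolean_sequence`) are illustrative and not taken from the paper, and this is not the paper's learning algorithm, only the function a learner would try to recover from labeled examples.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]


def softmax_rows(A):
    """Apply a numerically stable softmax to each row of A."""
    out = []
    for row in A:
        shift = max(row)  # subtracting the row max avoids overflow in exp
        exps = [math.exp(v - shift) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def multi_head_attention(X, thetas, Ws):
    """Return sum_i softmax(X Theta_i X^T) X W_i for a k x d input X.

    thetas and Ws are equal-length lists of d x d matrices, one pair per head.
    """
    k, d = len(X), len(X[0])
    Xt = transpose(X)
    out = [[0.0] * d for _ in range(k)]
    for theta, W in zip(thetas, Ws):
        scores = matmul(matmul(X, theta), Xt)              # k x k attention scores
        head = matmul(matmul(softmax_rows(scores), X), W)  # k x d output of this head
        out = [[o + h for o, h in zip(r_out, r_head)]
               for r_out, r_head in zip(out, head)]
    return out


def random_boolean_sequence(k, d, rng):
    """Draw a k x d input with independent uniform entries in {-1, +1}."""
    return [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(k)]


if __name__ == "__main__":
    rng = random.Random(0)
    k, d, m = 4, 3, 2  # sequence length, token dimension, number of heads
    # Gaussian parameters are an arbitrary choice for this demo.
    thetas = [[[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]
              for _ in range(m)]
    Ws = [[[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]
          for _ in range(m)]
    X = random_boolean_sequence(k, d, rng)
    Y = multi_head_attention(X, thetas, Ws)  # one labeled example is the pair (X, Y)
    print(len(Y), len(Y[0]))
```

A random labeled example in the learning problem is then a pair `(X, multi_head_attention(X, thetas, Ws))` with `X` drawn by `random_boolean_sequence`. One quick sanity check: with every attention matrix set to zero, the softmax is uniform, so each head reduces to averaging the tokens before applying its projection.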
Taxonomy
Topics: Brain Tumor Detection and Classification · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: Attention Is All You Need · Focus · Softmax · Linear Layer · Multi-Head Attention
