Superiority of Multi-Head Attention in In-Context Linear Regression
Yingqian Cui, Jie Ren, Pengfei He, Jiliang Tang, Yue Xing

TL;DR
This paper compares single-head and multi-head softmax attention theoretically on in-context linear regression. Both achieve a prediction loss of O(1/D) in the number of in-context examples D, and multi-head attention has the smaller multiplicative constant.
Contribution
The paper offers the first exact theoretical analysis demonstrating the superiority of multi-head attention over single-head attention in in-context learning for linear regression.
Findings
The prediction loss of multi-head attention has a smaller multiplicative constant than that of single-head attention.
For both single-head and multi-head attention, the prediction loss is O(1/D), where D is the number of in-context examples, so performance improves as more examples are provided.
Multi-head attention is generally preferred across the additional data scenarios considered: noisy labels, local examples, correlated features, and prior knowledge.
Abstract
We present a theoretical analysis of the performance of transformers with softmax attention in in-context learning with linear regression tasks. While the existing literature predominantly focuses on the convergence of transformers with single-/multi-head attention, our research centers on comparing their performance. We conduct an exact theoretical analysis to demonstrate that multi-head attention with a substantial embedding dimension performs better than single-head attention. When the number of in-context examples D increases, the prediction loss using single-/multi-head attention is O(1/D), and the loss for multi-head attention has a smaller multiplicative constant. In addition to the simplest data distribution setting, we consider more scenarios, e.g., noisy labels, local examples, correlated features, and prior knowledge. We observe that, in general, multi-head attention is…
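To make the setting in the abstract concrete, here is a minimal NumPy sketch of in-context linear regression with softmax attention. It is an illustration under stated assumptions, not the paper's construction: the Gaussian data model, the function names, and the hand-set scalars `c` (attention scale) and `v` (value scale) are all hypothetical, and the two-head predictor with opposite-sign scales is just one simple way to combine heads. No parameters are trained and no loss values are claimed.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def make_prompt(D, d, rng):
    """Sample one task: D noiseless examples (x_i, y_i = w^T x_i) and a query.

    Assumption: w, x_i, and the query are i.i.d. standard Gaussian.
    """
    w = rng.standard_normal(d)
    X = rng.standard_normal((D, d))
    y = X @ w
    x_q = rng.standard_normal(d)
    return X, y, x_q, float(x_q @ w)


def single_head(X, y, x_q, c=1.0, v=1.0):
    """One softmax head: attend from the query to the D examples, average labels."""
    attn = softmax(c * (X @ x_q))  # weights over the D in-context examples
    return v * float(attn @ y)


def two_head(X, y, x_q, c=1.0, v=1.0):
    """Two heads with opposite attention scales, combined with opposite signs.

    Hypothetical combination chosen for illustration only.
    """
    return single_head(X, y, x_q, c, v) - single_head(X, y, x_q, -c, v)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y, x_q, target = make_prompt(D=200, d=4, rng=rng)
    print("target     :", target)
    print("single head:", single_head(X, y, x_q))
    print("two heads  :", two_head(X, y, x_q))
```

One property of this sketch can be checked directly: the two-head predictor is an odd function of the query (flipping the sign of `x_q` flips the sign of the output), which matches the symmetry of the linear target `w^T x_q`. A single softmax head does not have this property in general.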
Taxonomy
Topics: Fault Detection and Control Systems · EEG and Brain-Computer Interfaces · Distributed Sensor Networks and Detection Algorithms
Methods: Attention Is All You Need · Linear Layer · Linear Regression · Multi-Head Attention · Softmax
