Superiority of Multi-Head Attention in In-Context Linear Regression

Yingqian Cui; Jie Ren; Pengfei He; Jiliang Tang; Yue Xing

arXiv:2401.17426 · cs.LG · February 1, 2024 · 2 cites

Open Access

TL;DR

This paper compares single-head and multi-head softmax attention theoretically on in-context linear regression. For both, the prediction loss is O(1/D) in the number of in-context examples D, but multi-head attention achieves a smaller multiplicative constant.

Contribution

The paper offers an exact theoretical analysis showing that multi-head attention with a substantial embedding dimension outperforms single-head attention in in-context learning for linear regression.

Findings

01

The prediction loss of multi-head attention has a smaller multiplicative constant than that of single-head attention.

02

As the number of in-context examples D increases, the prediction loss of both single-head and multi-head attention is O(1/D).

03

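Multi-head attention is generally preferred beyond the simplest data setting, including scenarios with noisy labels, local examples, correlated features, and prior knowledge.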
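
One way to read findings 01 and 02 together is as a statement about leading-order constants. The block below is only a schematic restatement: the symbols L_single, L_multi, c_single, and c_multi are notation assumed here for illustration and are not taken from the paper, and the exact constants depend on the setting analyzed there.

```latex
% Schematic reading of the findings; all symbols are assumed notation.
% D is the number of in-context examples.
\[
  L_{\mathrm{single}}(D) = \frac{c_{\mathrm{single}}}{D} + o\!\left(\frac{1}{D}\right),
  \qquad
  L_{\mathrm{multi}}(D) = \frac{c_{\mathrm{multi}}}{D} + o\!\left(\frac{1}{D}\right),
  \qquad
  c_{\mathrm{multi}} < c_{\mathrm{single}}.
\]
```

Under this reading both losses shrink at the same rate in D, and the advantage of multi-head attention shows up as a constant-factor gap rather than a faster rate.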

Abstract

We present a theoretical analysis of the performance of a transformer with softmax attention in in-context learning with linear regression tasks. While the existing literature predominantly focuses on the convergence of transformers with single-/multi-head attention, our research centers on comparing their performance. We conduct an exact theoretical analysis to demonstrate that multi-head attention with a substantial embedding dimension performs better than single-head attention. As the number of in-context examples D increases, the prediction loss of both single- and multi-head attention is O(1/D), and the loss for multi-head attention has a smaller multiplicative constant. In addition to the simplest data distribution setting, we consider more scenarios, e.g., noisy labels, local examples, correlated features, and prior knowledge. We observe that, in general, multi-head attention is…
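
To make the setting concrete, here is a minimal simulation sketch of in-context linear regression with softmax attention, using only the Python standard library. It is not the paper's construction: the function names, the fixed `scale` parameter, the hand-set output scalings, and the two-head design (two heads with opposite-sign scores whose outputs are subtracted) are all assumptions chosen for illustration, and nothing here is trained. The prompts use standard Gaussian features and noiseless labels y = <theta, x>.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_average(xs, ys, x_query, scale):
    """Softmax-weighted average of labels, with scores scale * <x_query, x_i>."""
    scores = [scale * dot(x_query, x) for x in xs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return sum(e * y for e, y in zip(exps, ys)) / total


def single_head_predict(xs, ys, x_query, scale=0.5):
    # Assumption: for standard Gaussian features and many examples, the
    # weighted average approaches scale * <theta, x_query>; divide it out.
    return attention_average(xs, ys, x_query, scale) / scale


def two_head_predict(xs, ys, x_query, scale=0.5):
    # Hypothetical two-head combination (not taken from the paper): heads
    # with opposite-sign scores, subtracted and rescaled.
    pos = attention_average(xs, ys, x_query, scale)
    neg = attention_average(xs, ys, x_query, -scale)
    return (pos - neg) / (2.0 * scale)


def make_prompt(rng, dim, num_examples):
    """Draw one task theta, its in-context examples, and a query point."""
    theta = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    xs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_examples)]
    ys = [dot(theta, x) for x in xs]  # noiseless labels
    x_query = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return xs, ys, x_query, dot(theta, x_query)


def mean_squared_error(predict, dim=3, num_examples=40, trials=200, seed=0):
    """Monte Carlo estimate of the squared prediction error of `predict`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs, ys, x_query, target = make_prompt(rng, dim, num_examples)
        total += (predict(xs, ys, x_query) - target) ** 2
    return total / trials


if __name__ == "__main__":
    for num_examples in (20, 40, 80):
        one = mean_squared_error(single_head_predict, num_examples=num_examples)
        two = mean_squared_error(two_head_predict, num_examples=num_examples)
        print(num_examples, one, two)
```

One property of this sketch can be checked by hand: the regression target <theta, x_query> flips sign when x_query does, and so does the two-head output, whereas a single softmax-weighted average does not in general. Whether the simulated two-head error comes out smaller depends on `scale`, the dimension, and D, so the printed numbers should be read as an experiment to run, not as a reproduction of the paper's result.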

Taxonomy

Topics: Fault Detection and Control Systems · EEG and Brain-Computer Interfaces · Distributed Sensor Networks and Detection Algorithms

Methods: Attention Is All You Need · Linear Layer · Linear Regression · Multi-Head Attention · Softmax