How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression
Xingwu Chen, Lei Zhao, Difan Zou

TL;DR
This paper investigates how trained transformers use multi-head attention in in-context learning for sparse linear regression, revealing layer-specific patterns and providing theoretical explanations for their mechanisms.
Contribution
It offers a comprehensive analysis of multi-head attention utilization in trained transformers, combining experimental observations with theoretical insights for sparse linear regression.
Findings
Multiple heads are essential in the first layer, but usually only a single head is used in subsequent layers.
The first layer preprocesses data, and later layers perform simple optimization steps.
The preprocess-then-optimize approach outperforms naive gradient descent and ridge regression.
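
To make the setting and the two baselines named above more concrete, here is a minimal NumPy sketch of a sparse linear regression task, plain gradient descent on the in-context examples, and closed-form ridge regression. It is an illustration under stated assumptions, not the authors' code: the function names, the dimensions (n, d, s), the noise level, the step size, and the ridge penalty are all hypothetical choices, and the paper's first-layer preprocessing step is deliberately not implemented here.

```python
import numpy as np


def make_sparse_task(n=20, d=16, s=3, noise=0.1, seed=0):
    """Sample one sparse linear regression task (all sizes are illustrative)."""
    rng = np.random.default_rng(seed)
    w_true = np.zeros(d)
    support = rng.choice(d, size=s, replace=False)  # s nonzero coordinates
    w_true[support] = rng.normal(size=s)
    X = rng.normal(size=(n, d))                     # in-context inputs
    y = X @ w_true + noise * rng.normal(size=n)     # noisy in-context labels
    return X, y, w_true


def gradient_descent(X, y, steps=10, lr=None):
    """Plain gradient descent on the mean squared error, starting from zero."""
    n, d = X.shape
    if lr is None:
        # Assumed step size: 1 / (largest eigenvalue of X^T X / n),
        # which keeps the quadratic loss from increasing.
        lr = 1.0 / np.linalg.eigvalsh(X.T @ X / n)[-1]
    w = np.zeros(d)
    losses = []
    for _ in range(steps):
        residual = X @ w - y
        losses.append(0.5 * np.mean(residual ** 2))  # loss before this step
        w = w - lr * (X.T @ residual) / n            # one optimization step
    return w, losses


def ridge(X, y, lam=1.0):
    """Closed-form ridge regression: solve (X^T X + lam * I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

For example, `X, y, w_true = make_sparse_task()` followed by `gradient_descent(X, y)` and `ridge(X, y)` gives two weight estimates that can be compared against `w_true`. Neither baseline uses the sparsity of the true weights, which is the kind of structure a preprocessing step could exploit.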
Abstract
Despite the remarkable success of transformer-based models in various real-world tasks, their underlying mechanisms remain poorly understood. Recent studies have suggested that transformers can implement gradient descent as an in-context learner for linear regression problems and have developed various theoretical analyses accordingly. However, these works mostly focus on the expressive power of transformers by designing specific parameter constructions, lacking a comprehensive understanding of their inherent working mechanisms post-training. In this study, we consider a sparse linear regression problem and investigate how a trained multi-head transformer performs in-context learning. We experimentally discover that the utilization of multi-heads exhibits different patterns across layers: multiple heads are utilized and essential in the first layer, while usually only a single head is…
Taxonomy
Topics: Neural Networks and Applications · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Softmax · Linear Regression · Linear Layer · Focus · Multi-Head Attention
