In-Context Learning of a Linear Transformer Block: Benefits of the MLP Component and One-Step GD Initialization
Ruiqi Zhang, Jingfeng Wu, Peter L. Bartlett

TL;DR
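This paper shows that a Linear Transformer Block, which pairs linear attention with a linear MLP component, can achieve near-optimal in-context learning for linear regression under a Gaussian prior with non-zero mean. It does so by implementing one-step gradient descent with a learnable initialization, whereas linear attention alone incurs an irreducible approximation error.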
Contribution
It establishes a theoretical connection between Linear Transformer Blocks with MLPs and gradient descent estimators, showing the benefits of the MLP component for in-context learning.
Findings
LTB achieves nearly Bayes-optimal ICL risk for linear regression with a Gaussian prior and a non-zero mean.
The MLP component reduces approximation error: linear attention alone incurs an irreducible additive approximation error.
LTB can be optimized efficiently with gradient flow despite non-convexity.
Abstract
We study the in-context learning (ICL) ability of a Linear Transformer Block (LTB) that combines a linear attention component and a linear multi-layer perceptron (MLP) component. For ICL of linear regression with a Gaussian prior and a non-zero mean, we show that LTB can achieve nearly Bayes optimal ICL risk. In contrast, using only linear attention must incur an irreducible additive approximation error. Furthermore, we establish a correspondence between LTB and one-step gradient descent estimators with learnable initialization, in the sense that every such estimator can be implemented by an LTB estimator and every optimal LTB estimator that minimizes the in-class ICL risk is effectively such an estimator. Finally, we show that …
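The one-step gradient descent estimator with learnable initialization mentioned in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration and not taken from the paper: the names `one_step_gd_predict`, `icl_risk`, `beta0`, and `Gamma`, the use of a matrix step size, the isotropic Gaussian covariates, and the specific dimensions and noise level. The zero-initialization run is only a stand-in baseline for comparison; it is not claimed to reproduce the paper's linear-attention analysis.

```python
import numpy as np


def one_step_gd_predict(X, y, x_query, beta0, Gamma):
    """Predict <beta1, x_query>, where beta1 is one GD step from beta0.

    beta0 (initialization) and Gamma (matrix step size) play the role of
    the learnable parameters; both names are hypothetical.
    """
    M = X.shape[0]
    # Gradient of the least-squares loss (1 / 2M) * ||X beta - y||^2 at beta0.
    grad = X.T @ (X @ beta0 - y) / M
    beta1 = beta0 - Gamma @ grad
    return float(beta1 @ x_query)


def icl_risk(beta0, Gamma, mu, M=10, noise_std=0.1, trials=500, seed=0):
    """Monte Carlo estimate of the squared prediction error on a query point.

    Each trial draws a task w ~ N(mu, I), M context examples with
    x ~ N(0, I) and y = <w, x> + noise, and one query example.
    """
    rng = np.random.default_rng(seed)
    d = mu.shape[0]
    sq_errors = []
    for _ in range(trials):
        w = mu + rng.standard_normal(d)
        X = rng.standard_normal((M, d))
        y = X @ w + noise_std * rng.standard_normal(M)
        x_query = rng.standard_normal(d)
        y_query = w @ x_query + noise_std * rng.standard_normal()
        pred = one_step_gd_predict(X, y, x_query, beta0, Gamma)
        sq_errors.append((pred - y_query) ** 2)
    return float(np.mean(sq_errors))


if __name__ == "__main__":
    d = 5
    mu = 3.0 * np.ones(d)      # non-zero prior mean (illustrative value)
    Gamma = 0.5 * np.eye(d)    # fixed scalar step size, not tuned
    print("zero initialization:    ", icl_risk(np.zeros(d), Gamma, mu))
    print("initialization at mean: ", icl_risk(mu, Gamma, mu))
```

In this toy setup the step size is fixed and not trained. The point is only that starting the single gradient step at the prior mean leaves a much smaller residual than starting at zero when the prior mean is far from the origin.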
Taxonomy
Topics: Structural Health Monitoring Techniques · Neural Networks and Applications · Non-Destructive Testing Techniques
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Dropout · Dense Connections · Label Smoothing · Adam · Softmax · Layer Normalization · Linear Regression
