In-Context Learning of a Linear Transformer Block: Benefits of the MLP Component and One-Step GD Initialization

Ruiqi Zhang; Jingfeng Wu; Peter L. Bartlett

arXiv:2402.14951 · stat.ML · February 26, 2024 · 1 citation

TL;DR

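This paper shows that a Linear Transformer Block (LTB), which pairs linear attention with a linear MLP, achieves nearly Bayes optimal in-context learning risk for linear regression under a Gaussian prior with non-zero mean. Linear attention alone incurs an irreducible additive approximation error. The LTB closes this gap by implementing one-step gradient descent with a learnable initialization.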

Contribution

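It establishes a correspondence between Linear Transformer Block (LTB) estimators and one-step gradient descent estimators with learnable initialization (GD-$\beta$): every GD-$\beta$ estimator can be implemented by an LTB, and every optimal LTB estimator is effectively a GD-$\beta$ estimator. This shows how the MLP component benefits in-context learning.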

Findings

1. LTB achieves nearly Bayes optimal ICL risk for linear regression.

2. The MLP component reduces approximation error compared to linear attention.

3. LTB can be optimized efficiently with gradient flow despite non-convexity.

Abstract

We study the in-context learning (ICL) ability of a Linear Transformer Block (LTB) that combines a linear attention component and a linear multi-layer perceptron (MLP) component. For ICL of linear regression with a Gaussian prior and a non-zero mean, we show that LTB can achieve nearly Bayes optimal ICL risk. In contrast, using only linear attention must incur an irreducible additive approximation error. Furthermore, we establish a correspondence between LTB and one-step gradient descent estimators with learnable initialization (GD-$\beta$), in the sense that every GD-$\beta$ estimator can be implemented by an LTB estimator and every optimal LTB estimator that minimizes the in-class ICL risk is effectively a GD-$\beta$ estimator. Finally, we show that GD-$\beta$ …
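To make the idea of a one-step gradient descent estimator with learnable initialization concrete, here is a minimal sketch in plain Python. The exact form is an assumption and is not quoted from the abstract: starting from an initialization `beta`, take one step on the average squared loss over the context examples, scaled by a matrix step size `Gamma`, and predict with the resulting weights. The names `gd_beta_predict`, `beta`, `Gamma`, and the toy data are all hypothetical and chosen only for illustration.

```python
# Sketch of a one-step GD estimator with learnable initialization (GD-beta).
# Assumed form (an illustration, not taken verbatim from the paper):
#   w = beta - Gamma @ ((1/n) * X^T (X beta - y)),  prediction = <w, x_query>


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [dot(row, v) for row in M]


def gd_beta_predict(X, y, x_query, beta, Gamma):
    """Predict the label of x_query after one GD step from initialization beta."""
    n = len(X)
    d = len(beta)
    # Residuals of the initial guess on the context examples.
    residuals = [dot(x_i, beta) - y_i for x_i, y_i in zip(X, y)]
    # Gradient of the loss (1/2n) * sum_i (<x_i, beta> - y_i)^2 at beta.
    grad = [sum(r * x_i[j] for r, x_i in zip(residuals, X)) / n for j in range(d)]
    # One preconditioned step with the (hypothetical) matrix step size Gamma.
    step = matvec(Gamma, grad)
    w = [b - s for b, s in zip(beta, step)]
    return dot(w, x_query)


# Toy noiseless context: labels generated by a weight vector far from zero.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_true = [2.0, -1.0]
y = [dot(x_i, w_true) for x_i in X]
x_query = [3.0, 1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]

# Initialization equal to the true weights: the gradient vanishes.
pred_at_truth = gd_beta_predict(X, y, x_query, beta=w_true, Gamma=identity)
# Zero initialization: a single step only partially recovers the weights.
pred_from_zero = gd_beta_predict(X, y, x_query, beta=[0.0, 0.0], Gamma=identity)
```

In this toy setting the true label of the query is 5.0. Starting from the true weights reproduces it exactly, while starting from zero yields 3.0 after one step. With `beta` fixed at zero the sketch reduces to plain one-step GD from a zero start, which illustrates why a learnable initialization can help when task weights are not centered at zero.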

Videos

In-Context Learning of a Linear Transformer Block: Benefits of the MLP Component and One-Step GD Initialization · SlidesLive

Taxonomy

Topics: Structural Health Monitoring Techniques · Neural Networks and Applications · Non-Destructive Testing Techniques

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Dropout · Dense Connections · Label Smoothing · Adam · Softmax · Layer Normalization · Linear Regression