On the Role of Transformer Feed-Forward Layers in Nonlinear In-Context Learning
Haoyuan Sun, Ali Jadbabaie, Navid Azizan

TL;DR
This paper shows how Transformer feed-forward layers enable nonlinear in-context learning (ICL) by implementing gradient descent on kernel regression losses, and why depth is needed to overcome the limited expressivity of a single block.
Contribution
It demonstrates that feed-forward layers are essential for nonlinear ICL, showing how deep Transformers distribute kernel computations to overcome single-block limitations.
Findings
Feed-forward layers enable nonlinear ICL via gradient descent on kernel losses.
Single Transformer blocks have limited expressivity due to their dimensions.
Deep Transformers distribute computations across blocks to achieve richer nonlinear representations.
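To make the first finding concrete, the following is a minimal sketch of what "gradient descent on a kernel loss" means for an in-context prompt. It is an illustration under stated assumptions, not the paper's construction: the feature map `quadratic_features`, the zero initialization, and the step size `lr` are all hypothetical choices made here. The sketch checks that one gradient step on a least-squares loss in feature space yields the same prediction as a weighted sum of kernel evaluations between the prompt inputs and the query.

```python
import numpy as np

def quadratic_features(x):
    """Hypothetical feature map: [1, x, all pairwise products x_i * x_j]."""
    x = np.asarray(x, dtype=float)
    return np.concatenate(([1.0], x, np.outer(x, x).ravel()))

def one_step_kernel_gd_prediction(X, y, x_query, lr):
    """Predict after one GD step from w = 0 on
    L(w) = 1/(2n) * sum_i (w . phi(x_i) - y_i)^2."""
    Phi = np.stack([quadratic_features(x) for x in X])  # shape (n, p)
    n = len(y)
    grad_at_zero = -(Phi.T @ y) / n   # gradient of L at w = 0
    w = -lr * grad_at_zero            # one gradient descent step
    return float(w @ quadratic_features(x_query))

def kernel_form_prediction(X, y, x_query, lr):
    """Same prediction written with the kernel k(x_i, x_q) = phi(x_i) . phi(x_q)."""
    phi_q = quadratic_features(x_query)
    k = np.array([quadratic_features(x) @ phi_q for x in X])
    return float(lr / len(y) * (y @ k))
```

Because the features include products of input coordinates, the resulting predictor is nonlinear in the query even though the update itself is a plain least-squares gradient step.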
Abstract
Transformer-based models demonstrate a remarkable ability for in-context learning (ICL), where they can adapt to unseen tasks from a few prompt examples without parameter updates. Recent research has illuminated how Transformers perform ICL, showing that the optimal linear self-attention (LSA) mechanism can implement one step of gradient descent for linear least-squares objectives when trained on random linear regression tasks. Building on this, we investigate ICL for nonlinear function classes. We first prove that LSA is inherently incapable of outperforming linear predictors on nonlinear tasks, underscoring why prior solutions cannot readily extend to these problems. To overcome this limitation, we analyze a Transformer block consisting of LSA and feed-forward layers inspired by gated linear units (GLU), which are a standard component of modern Transformers. We show that this block…
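The linear baseline mentioned in the abstract can be sketched as follows. This is a simplified illustration, not the trained LSA weights from the literature: it assumes zero initialization and a hypothetical step size `lr`, and it uses unnormalized dot-product scores (no softmax) as a stand-in for linear self-attention. The sketch shows that one gradient step on the linear least-squares loss gives the same prediction as a linear-attention-style readout, and that in this construction the prediction is linear in the query.

```python
import numpy as np

def gd_step_prediction(X, y, x_query, lr):
    """Predict after one GD step from w = 0 on 1/(2n) * ||X w - y||^2."""
    n = len(y)
    w = lr * (X.T @ y) / n            # w = 0 - lr * gradient at w = 0
    return float(w @ x_query)

def linear_attention_prediction(X, y, x_query, lr):
    """Same prediction as an attention-style readout with linear scores:
    scores are x_i . x_query, values are the labels y_i."""
    n = len(y)
    scores = X @ x_query              # unnormalized scores, no softmax
    return float(lr / n * (y @ scores))
```

Since the readout depends on the query only through inner products with the prompt inputs, scaling or adding queries scales or adds predictions, which is consistent with the abstract's point that this mechanism alone cannot do better than a linear predictor.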
Taxonomy
Topics: Neural Networks and Applications
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Linear Regression · Label Smoothing · Layer Normalization · Softmax · Adam
