Distributional Associations vs In-Context Reasoning: A Study of Feed-forward and Attention Layers
Lei Chen, Joan Bruna, Alberto Bietti

TL;DR
This paper investigates the distinct roles of feed-forward and attention layers in Transformer models. It shows, through empirical and theoretical analysis, that feed-forward layers tend to learn distributional associations while attention layers focus on in-context reasoning.
Contribution
The paper offers an empirical and theoretical account of how feed-forward and attention layers in Transformers differ in their roles in knowledge storage and reasoning.
Findings
Feed-forward layers learn simple distributional associations like bigrams.
Attention layers focus on in-context reasoning tasks.
Disparities between layers are linked to gradient noise in training.
Abstract
Large language models have been successful at tasks involving basic forms of in-context reasoning, such as generating coherent language, as well as storing vast amounts of knowledge. At the core of the Transformer architecture behind such models are feed-forward and attention layers, which are often associated with knowledge and reasoning, respectively. In this paper, we study this distinction empirically and theoretically in a controlled synthetic setting where certain next-token predictions involve both distributional and in-context information. We find that feed-forward layers tend to learn simple distributional associations such as bigrams, while attention layers focus on in-context reasoning. Our theoretical analysis identifies the noise in the gradients as a key factor behind this discrepancy. Finally, we illustrate how similar disparities emerge in pre-trained models through…
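To make the contrast between distributional and in-context information concrete, here is a minimal Python sketch. It is a hypothetical toy construction, not necessarily the task used in the paper: the token ids `TRIGGER` and `NOISE`, the mixing probability `alpha`, and all function names are assumptions made for illustration. After a trigger token, the next token is either a generic token (the same across the whole corpus) or a target token that is fixed only within the current sequence. A corpus-wide bigram count can only capture the generic token, whereas recovering the target requires looking back into the context.

```python
import random
from collections import Counter

TRIGGER, NOISE = 0, 1          # hypothetical special token ids (assumption)
VOCAB = list(range(2, 12))     # ordinary tokens

def make_sequence(rng, length=32, alpha=0.3):
    """Build one sequence with a per-sequence target token.

    Each TRIGGER is followed by NOISE with probability alpha,
    otherwise by the target chosen for this sequence.
    """
    target = rng.choice(VOCAB)
    seq = []
    while len(seq) < length:
        if rng.random() < 0.2:
            follower = NOISE if rng.random() < alpha else target
            seq += [TRIGGER, follower]
        else:
            seq.append(rng.choice(VOCAB))
    return seq[:length], target

def bigram_prediction(corpus):
    """Distributional guess: most frequent token after TRIGGER, corpus-wide."""
    counts = Counter()
    for seq, _ in corpus:
        for prev, nxt in zip(seq, seq[1:]):
            if prev == TRIGGER:
                counts[nxt] += 1
    return counts.most_common(1)[0][0]

def in_context_prediction(seq):
    """In-context guess: first non-NOISE token that followed TRIGGER
    in this sequence, or None if no such token has appeared."""
    for prev, nxt in zip(seq, seq[1:]):
        if prev == TRIGGER and nxt != NOISE:
            return nxt
    return None

rng = random.Random(0)
corpus = [make_sequence(rng) for _ in range(2000)]
global_guess = bigram_prediction(corpus)
print("corpus-wide bigram guess after TRIGGER:", global_guess)
seq, target = corpus[0]
print("in-context guess:", in_context_prediction(seq), "| target:", target)
```

In this toy setup the target varies from sequence to sequence, so its probability mass after the trigger is spread over the vocabulary when counted across the corpus, while the generic token accumulates in a single bigram. This loosely mirrors the split described in the abstract, with the bigram counter standing in for a distributional association and the context lookup standing in for in-context reasoning.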
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Attention Is All You Need · Softmax · Layer Normalization · Pythia · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer
