Distributional Associations vs In-Context Reasoning: A Study of Feed-forward and Attention Layers

Lei Chen; Joan Bruna; Alberto Bietti

arXiv:2406.03068 · cs.LG · March 10, 2025

TL;DR

This paper investigates the distinct roles of feed-forward and attention layers in Transformer models. In a controlled synthetic setting, it shows that feed-forward layers tend to learn simple distributional associations such as bigrams, while attention layers focus on in-context reasoning. The findings are supported by both empirical and theoretical analysis.

Contribution

The paper provides an empirical and theoretical account of how feed-forward and attention layers in Transformers differ in their roles in knowledge storage and reasoning.

Findings

01. Feed-forward layers learn simple distributional associations like bigrams.

02. Attention layers focus on in-context reasoning tasks.

03. Disparities between layers are linked to gradient noise in training.

Abstract

Large language models have been successful at tasks involving basic forms of in-context reasoning, such as generating coherent language, as well as storing vast amounts of knowledge. At the core of the Transformer architecture behind such models are feed-forward and attention layers, which are often associated with knowledge and reasoning, respectively. In this paper, we study this distinction empirically and theoretically in a controlled synthetic setting where certain next-token predictions involve both distributional and in-context information. We find that feed-forward layers tend to learn simple distributional associations such as bigrams, while attention layers focus on in-context reasoning. Our theoretical analysis identifies the noise in the gradients as a key factor behind this discrepancy. Finally, we illustrate how similar disparities emerge in pre-trained models through…
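To make the distinction in the abstract concrete, the following is a minimal toy sketch, not the paper's synthetic task or model. It contrasts a bigram predictor (a distributional association that looks only at the last token) with an in-context copy rule (which looks back through the context for an earlier occurrence of the last token). The corpus, the context, and the function names `fit_bigram`, `predict_bigram`, and `predict_in_context` are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def fit_bigram(corpus):
    """Count, for each token, how often each next token follows it."""
    counts = defaultdict(Counter)
    for seq in corpus:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_bigram(counts, context):
    """Distributional prediction: most frequent successor of the last token.

    Everything in the context except the last token is ignored.
    """
    successors = counts.get(context[-1])
    if not successors:
        return None
    return successors.most_common(1)[0][0]


def predict_in_context(context):
    """In-context prediction: copy the token that followed the most recent
    earlier occurrence of the last token, if there is one."""
    last = context[-1]
    for i in range(len(context) - 2, -1, -1):
        if context[i] == last:
            return context[i + 1]
    return None


# Hypothetical toy corpus in which "to" is most often followed by "the".
corpus = [
    ["went", "to", "the", "store"],
    ["talk", "to", "the", "team"],
    ["gave", "to", "Mary"],
]
counts = fit_bigram(corpus)

# Hypothetical context in which "to" was previously followed by "Mary".
context = ["gave", "to", "Mary", "then", "to"]

print(predict_bigram(counts, context))   # the
print(predict_in_context(context))       # Mary
```

The two predictors disagree on the same context: the bigram rule returns the globally frequent successor, while the copy rule returns the token supported by the context itself. This is the kind of conflict between distributional and in-context information that the abstract describes, here reduced to lookup rules rather than trained layers.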

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Attention Is All You Need · Softmax · Layer Normalization · Pythia · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer