Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers

Tiberiu Musat

arXiv:2411.12118 · cs.LG · October 29, 2025

TL;DR

This paper investigates how multi-layer transformers develop attention mechanisms to solve retrieval tasks. It finds that attention heads emerge in a specific sequence during training, guided by an implicit curriculum.

Contribution

The paper introduces the retrieval problem as a reasoning task, shows empirically that large language models can solve it without fine-tuning, and uncovers the sequential emergence of attention heads during training.

Findings

01

Transformers can solve the retrieval problem only with a minimum number of layers, which grows logarithmically with the input size.

02

Attention heads emerge in a specific sequence during training.

03

An implicit curriculum guides the emergence of attention mechanisms; successful learning occurs only when it is present.

Abstract

In this paper, I introduce the retrieval problem, a simple yet common reasoning task that can be solved only by transformers with a minimum number of layers, which grows logarithmically with the input size. I empirically show that large language models can solve the task under different prompting formulations without any fine-tuning. To understand how transformers solve the retrieval problem, I train several transformers on a minimal formulation. Successful learning occurs only under the presence of an implicit curriculum. I uncover the learned mechanisms by studying the attention maps in the trained transformers. I also study the training process, uncovering that attention heads always emerge in a specific sequence guided by the implicit curriculum.
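The abstract refers to studying attention maps. As a rough illustration only, the sketch below computes a standard single-head scaled dot-product attention map with a causal mask in plain Python. It is not the paper's code: the function name `attention_map`, the toy query and key vectors, and the choice of a causal mask are all assumptions made for this example.

```python
import math


def attention_map(queries, keys, causal=True):
    """Return attention weights softmax(Q K^T / sqrt(d)), one row per query.

    Hypothetical helper for illustration; not taken from the paper.
    """
    d = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        # Scaled dot-product score of query i against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        if causal:
            # Mask future positions so token i attends only to positions <= i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        # Numerically stable softmax over the (possibly masked) scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy example: 3 tokens with made-up 2-dimensional query/key vectors.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = attention_map(Q, K)
```

Each row of `A` is a probability distribution over the positions a token may attend to. Plotting such rows for every head and layer of a trained model is one common way to inspect which tokens a head retrieves from.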

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- The paper makes a strong contribution to the field of mechanistic interpretability and furthers our understanding of the transformer architecture and training behavior. The finding on the sequential emergence of attention heads for reasoning circuits is especially valuable.
- The **problem statement** at hand is presented nicely, and the paper follows a logical progression building up to the final insights.
- The **structure and flow of the experiments** are sensible, starting with higher-l

Weaknesses

**Readability:**
- Section 5 "THEORETICAL ANALYSIS OF INFORMATION FLOW" is quite **hard to follow** and requires some time to understand, especially with limited prior knowledge. Concrete examples of what "E", "F", and so forth might mean in the training and evaluation context could help the reader grasp the theoretical analysis more quickly. Augmenting this section with examples, e.g. from Sec. 4, would make it more accessible to a larger audience.

**Missing clarity:**
- Large portions

Reviewer 02 · Rating 6 · Confidence 3

Strengths

This is good work for understanding the abilities of LLMs, especially the emergent abilities of models.
1. A novel idea for studying the reasoning ability of LLMs.
2. The finding is exciting and fits with human intuition.
3. The attention visualizations describe how an LLM approaches retrieval reasoning tasks.

Weaknesses

1. Can you provide a more complex example?
2. I would hope to see a general framework that unifies more tasks with your retrieval task.
3. There are a lot of chapters in the article, and I can't quite understand the relationship between them.

Reviewer 03 · Rating 5 · Confidence 3

Strengths

● The paper provides insights into how LLMs perform retrieval using attention heads.
● Introducing these tasks gives a clear way to study transformers' retrieval abilities.
● The study highlights the role of learning curriculum in the development of retrieval mechanisms.

Weaknesses

1. **Lack of Experimental Validation for Theoretical Claims**: Theoretical claims like Theorem 1 lack empirical support, making it hard to verify their practical impact.
2. **Insufficient Details on Experimental Setup**: The paper lacks detailed explanations for key experimental settings. The implicit curriculum (IC) formulation, which performs better than non-IC, is not clearly defined, and Section 8’s description of manually reverse-engineering circuits lacks detail, making the experiments di

Taxonomy

Topics: Advanced Memory and Neural Computing

Methods: Softmax · Attention Is All You Need