Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers
Tiberiu Musat

TL;DR
This paper investigates how multi-layer transformers develop attention mechanisms to solve retrieval tasks. It finds that attention heads emerge in a fixed sequence during training, guided by an implicit curriculum.
Contribution
It introduces the retrieval problem as a reasoning task, shows that large language models can solve it under different prompting formulations without fine-tuning, and uncovers the sequential emergence of attention heads during training.
Findings
Transformers can solve the retrieval problem only with a minimum number of layers, which grows logarithmically with the input size.
Attention heads emerge in a specific sequence during training.
Implicit curriculum guides the emergence of attention mechanisms.
Abstract
In this paper, I introduce the retrieval problem, a simple yet common reasoning task that can be solved only by transformers with a minimum number of layers, which grows logarithmically with the input size. I empirically show that large language models can solve the task under different prompting formulations without any fine-tuning. To understand how transformers solve the retrieval problem, I train several transformers on a minimal formulation. Successful learning occurs only under the presence of an implicit curriculum. I uncover the learned mechanisms by studying the attention maps in the trained transformers. I also study the training process, uncovering that attention heads always emerge in a specific sequence guided by the implicit curriculum.
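As a rough illustration of what a single retrieval step by an attention head can look like, the sketch below implements one softmax attention lookup over a toy set of key-value pairs. It is a minimal sketch under stated assumptions, not the paper's formulation: the one-hot encoding, the `scale` parameter, and the names `attention_retrieve` and `one_hot` are all hypothetical and chosen only for this example.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_retrieve(query, keys, values, scale=1.0):
    # Hypothetical helper: one attention head acting as a soft lookup.
    # Scores are scaled dot products between the query and each key.
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # The output is the attention-weighted average of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


def one_hot(index, size):
    # Assumed toy encoding: each symbol is a one-hot vector.
    return [1.0 if i == index else 0.0 for i in range(size)]


# Toy context of three (key, value) pairs over a 4-symbol vocabulary:
# symbol 0 maps to 3, symbol 1 maps to 0, symbol 2 maps to 1.
keys = [one_hot(0, 4), one_hot(1, 4), one_hot(2, 4)]
values = [one_hot(3, 4), one_hot(0, 4), one_hot(1, 4)]

# Query for symbol 1; a large scale makes the softmax nearly one-hot.
query = one_hot(1, 4)
output, weights = attention_retrieve(query, keys, values, scale=10.0)
retrieved = max(range(len(output)), key=lambda i: output[i])
```

Here `weights` plays the role of one row of an attention map: nearly all of the mass falls on the position whose key matches the query, and the head copies out that position's value. Chaining several such lookups is one way to picture a multi-step retrieval, although how the trained transformers actually stack their heads is what the paper itself examines.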
Peer Reviews
Decision: ICLR 2025 Poster
- The paper makes a strong contribution to the field of mechanistic interpretability, furthering our understanding of the transformer architecture and training behavior. The finding on the sequential emergence of attention heads for reasoning circuits is especially valuable.
- The **problem statement** at hand is presented nicely, and the paper follows a logical progression building up to the final insights.
- The **structure and flow of the experiments** are sensible, starting with higher-l
**Readability:**
- Section 5 "THEORETICAL ANALYSIS OF INFORMATION FLOW" is quite **hard to follow** and requires some time to understand, especially with limited prior knowledge. Concrete examples of what "E", "F", and so forth might mean in the training and evaluation context could help the reader grasp the theoretical analysis more quickly. Augmenting this section with examples, e.g. from Sec. 4, would make it more accessible to a larger audience.
**Missing clarity:**
- Large portions
This is a good job for understanding the abilities of LLMs, especially their emergent abilities.
1. A novel idea for studying the reasoning ability of LLMs.
2. The finding is exciting and fits human intuition.
3. The attention visualizations describe how LLMs handle retrieval reasoning tasks.
1. Can you provide a more complex example?
2. I would like to see a general framework that unifies more tasks with your retrieval task.
3. The article has many sections, and I can't quite understand the relationship between them.
● The paper provides insights into how LLMs perform retrieval using attention heads.
● Introducing these tasks gives a clear way to study transformers' retrieval abilities.
● The study highlights the role of learning curriculum in the development of retrieval mechanisms.
1. **Lack of Experimental Validation for Theoretical Claims**: Theoretical claims like Theorem 1 lack empirical support, making it hard to verify their practical impact.
2. **Insufficient Details on Experimental Setup**: The paper lacks detailed explanations for key experimental settings. The implicit curriculum (IC) formulation, which performs better than non-IC, is not clearly defined, and Section 8’s description of manually reverse-engineering circuits lacks detail, making the experiments di
Taxonomy
Topics: Advanced Memory and Neural Computing
Methods: Softmax · Attention Is All You Need
