Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers

Siyu Chen; Heejune Sheen; Tianhao Wang; Zhuoran Yang

arXiv:2409.10559·cs.LG·September 18, 2024

TL;DR

This paper gives a theoretical analysis of how transformers learn to perform in-context learning. By proving convergence of the training dynamics, it shows what role each component plays in implementing the induction head mechanism.

Contribution

It offers a rigorous proof that all transformer components collaboratively learn a generalized induction head mechanism during training on Markov chain data.

Findings

01 Transformer components converge to a model that exhibits induction-head-like behavior.

02 The first attention layer acts as a copier of past tokens.

03 The feed-forward layer functions as a feature selector.

Abstract

In-context learning (ICL) is a cornerstone of large language model (LLM) functionality, yet its theoretical foundations remain elusive due to the complexity of transformer architectures. In particular, most existing work only theoretically explains how the attention mechanism facilitates ICL under certain data models. It remains unclear how the other building blocks of the transformer contribute to ICL. To address this question, we study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data, where each token in the Markov chain statistically depends on the previous $n$ tokens. We analyze a sophisticated transformer model featuring relative positional embedding, multi-head softmax attention, and a feed-forward layer with normalization. We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model that…
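
To make the data model concrete, here is a minimal Python sketch of an $n$-gram Markov chain sampler, paired with a simple in-context predictor that matches the most recent $n$ tokens against earlier positions and counts what followed them. It is only an illustration of induction-head-like prediction, not the paper's transformer or its limiting model. The function names, the random transition weights, and the uniform fallback are all assumptions made for this example.

```python
import random
from collections import Counter


def sample_ngram_chain(length, vocab_size, n, rng):
    """Sample a sequence in which each token depends on the previous n tokens.

    Assumption: transition weights are drawn at random per context, lazily.
    """
    table = {}  # context tuple -> unnormalized next-token weights
    seq = [rng.randrange(vocab_size) for _ in range(n)]  # arbitrary initial context
    while len(seq) < length:
        ctx = tuple(seq[-n:])
        if ctx not in table:
            table[ctx] = [rng.random() for _ in range(vocab_size)]
        seq.append(rng.choices(range(vocab_size), weights=table[ctx])[0])
    return seq


def induction_predict(seq, n, vocab_size):
    """Estimate the next-token distribution from the context alone.

    Finds every earlier position whose preceding n tokens equal the last
    n tokens of seq, then normalizes the counts of the tokens that followed.
    """
    query = tuple(seq[-n:])
    counts = Counter()
    for i in range(n, len(seq)):
        if tuple(seq[i - n:i]) == query:
            counts[seq[i]] += 1
    total = sum(counts.values())
    if total == 0:
        # Assumption: fall back to uniform when the context never occurred before.
        return [1.0 / vocab_size] * vocab_size
    return [counts[t] / total for t in range(vocab_size)]


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = sample_ngram_chain(length=200, vocab_size=3, n=2, rng=rng)
    print(induction_predict(tokens, n=2, vocab_size=3))
```

On the periodic sequence `[0, 1, 2, 0, 1, 2, 0, 1]` with `n=2`, the context `(0, 1)` was followed by `2` both times it appeared earlier, so the predictor puts all its mass on token `2`.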

Taxonomy

Topics: Neural Networks and Applications

Methods: Attention Is All You Need · Softmax