Towards Understanding Transformers in Learning Random Walks
Wei Shi, Yuan Cao

TL;DR
This paper gives a theoretical and empirical analysis of how a one-layer transformer learns random walks on circles. It shows that the trained attention focuses on the direct parent state and that the model makes optimal predictions, and it offers insights into the limitations of gradient-descent training.
Contribution
It offers the first theoretical demonstration that transformers can learn random walks in an interpretable way, highlighting the role of softmax attention as a token selector and of the value matrix as the executor of a one-step probability transition.
Findings
After training with gradient descent, a one-layer transformer achieves optimal accuracy in predicting random walks on circles.
The trained softmax attention acts as a token selector that focuses on the direct parent state.
Gradient descent with small initialization may fail in simple tasks.
Abstract
Transformers have proven highly effective across various applications, especially in handling sequential data such as natural languages and time series. However, transformer models often lack clear interpretability, and the success of transformers has not been well understood in theory. In this paper, we study the capability and interpretability of transformers in learning a family of classic statistical models, namely random walks on circles. We theoretically demonstrate that, after training with gradient descent, a one-layer transformer model can achieve optimal accuracy in predicting random walks. Importantly, our analysis reveals that the trained model is interpretable: the trained softmax attention serves as a token selector, focusing on the direct parent state; subsequently, the value matrix executes a one-step probability transition to predict the location of the next state based…
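The mechanism described in the abstract (attention selects the parent state, then the value matrix applies a one-step transition) can be illustrated with a hand-built toy model. The sketch below is not the authors' trained model or code. It assumes a walk on a circle of `K` states that moves +1 with probability `p` and -1 otherwise, one-hot state embeddings, hand-set attention scores that favor the last position, and a value matrix set equal to the transition matrix. All function names and the `sharpness` parameter are hypothetical.

```python
import math


def transition_matrix(K, p):
    """Row i is the next-state distribution from state i.

    Assumed walk: move to (i + 1) % K with probability p,
    and to (i - 1) % K with probability 1 - p.
    """
    P = [[0.0] * K for _ in range(K)]
    for i in range(K):
        P[i][(i + 1) % K] += p
        P[i][(i - 1) % K] += 1.0 - p
    return P


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_next(states, K, p, sharpness=20.0):
    """Toy one-layer attention: select the last token, then apply the transition."""
    n = len(states)
    # Hand-set scores (an assumption, not learned): large score on the last
    # position, i.e. the direct parent of the state to be predicted.
    scores = [sharpness if t == n - 1 else 0.0 for t in range(n)]
    attn = softmax(scores)
    # Attention-weighted average of one-hot state embeddings.
    pooled = [0.0] * K
    for weight, state in zip(attn, states):
        pooled[state] += weight
    # Value matrix assumed equal to the transition matrix: out = pooled @ P.
    P = transition_matrix(K, p)
    out = [sum(pooled[i] * P[i][j] for i in range(K)) for j in range(K)]
    return attn, out


attn, probs = predict_next([0, 1, 2, 1, 2, 3], K=6, p=0.7)
prediction = max(range(6), key=lambda j: probs[j])
```

With the attention weight concentrated on the last token, the output is close to the row of the transition matrix for the current state, so the most likely next state is the one a Bayes-optimal predictor would choose under the assumed walk.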
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI) · Big Data and Digital Economy
