Unveiling Transformers with LEGO: a synthetic reasoning task
Yi Zhang, Arturs Backurs, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Tal Wagner

TL;DR
This paper introduces LEGO, a synthetic reasoning task for studying Transformer models, analyzing how data, architecture, and pretraining influence learning, and proposing new attention mechanisms to improve robustness and efficiency.
Contribution
The paper presents LEGO, a novel synthetic reasoning task, and investigates how Transformers learn it, revealing structured attention patterns and proposing a new LEGO attention module.
Findings
Transformers trained on LEGO develop structured attention patterns, including a novel association pattern that globally attends only to identical tokens.
Pretraining on unrelated tasks can facilitate LEGO task learning through structured attention.
The LEGO attention module reduces computational cost and can improve performance.
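The association pattern mentioned in the findings can be pictured as an attention matrix in which each position spreads its weight only over positions holding the same token. The sketch below is a hand-built illustration of that idea, not the authors' LEGO attention module; the function name `association_attention`, the uniform weighting, the inclusion of the query position itself, and the absence of a causal mask are all assumptions made for clarity.

```python
def association_attention(tokens):
    """Row i puts uniform weight on every position holding the same token as position i.

    Hypothetical helper for illustration: weights are uniform and include the
    query position itself, which is an assumption rather than a detail from the paper.
    """
    n = len(tokens)
    weights = []
    for i in range(n):
        # 1.0 where the token at position j is identical to the token at position i.
        matches = [1.0 if tokens[j] == tokens[i] else 0.0 for j in range(n)]
        total = sum(matches)  # at least 1.0, since a token always matches itself
        weights.append([m / total for m in matches])
    return weights


# Toy token sequence in an assumed clause format.
tokens = ["a", "=", "+", "1", ";", "b", "=", "-", "a", ";"]
attn = association_attention(tokens)
```

Here the row for the first `a` splits its weight evenly between the two occurrences of `a`, while the row for `b` attends only to itself. A learned head would only approximate such a pattern.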
Abstract
We propose a synthetic reasoning task, LEGO (Learning Equality and Group Operations), that encapsulates the problem of following a chain of reasoning, and we study how the Transformer architectures learn this task. We pay special attention to data effects such as pretraining (on seemingly unrelated NLP tasks) and dataset composition (e.g., differing chain length at training and test time), as well as architectural variants such as weight-tied layers or adding convolutional components. We study how the trained models eventually succeed at the task, and in particular, we manage to understand some of the attention heads as well as how the information flows in the network. In particular, we have identified a novel association pattern that globally attends only to identical tokens. Based on these observations we propose a hypothesis that here pretraining helps for LEGO tasks due to…
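To make "following a chain of reasoning" concrete, here is a minimal sketch of a LEGO-style instance, assuming the sign group {+1, -1}: one clause assigns a constant, every other clause sets a variable to plus or minus an earlier variable, and the clauses are shuffled so the chain must be followed to resolve them. The function names, the tuple clause format, and the single-letter variable names are illustrative assumptions, not the authors' data pipeline.

```python
import random


def make_lego_chain(length, rng):
    """Build a chain of sign-group clauses; return (shuffled clauses, true values)."""
    names = rng.sample(list("abcdefghijklmnopqrstuvwxyz"), length)
    values = {}
    clauses = []
    # Root clause assigns a constant from {+1, -1}; source None marks the constant.
    root_sign = rng.choice([1, -1])
    values[names[0]] = root_sign
    clauses.append((names[0], root_sign, None))
    # Every later variable is plus or minus the previous variable in the chain.
    for prev, cur in zip(names, names[1:]):
        sign = rng.choice([1, -1])
        values[cur] = sign * values[prev]
        clauses.append((cur, sign, prev))
    rng.shuffle(clauses)  # presentation order no longer matches resolution order
    return clauses, values


def format_clause(clause):
    """Render a clause such as ('b', -1, 'a') as the string 'b = -a'."""
    target, sign, source = clause
    rhs = "1" if source is None else source
    return f"{target} = {'+' if sign == 1 else '-'}{rhs}"


def solve(clauses):
    """Resolve every variable by repeatedly applying clauses whose source is known."""
    known = {}
    pending = list(clauses)
    while pending:
        remaining = []
        for target, sign, source in pending:
            if source is None:
                known[target] = sign
            elif source in known:
                known[target] = sign * known[source]
            else:
                remaining.append((target, sign, source))
        if len(remaining) == len(pending):
            raise ValueError("clauses do not form a resolvable chain")
        pending = remaining
    return known


rng = random.Random(0)
clauses, truth = make_lego_chain(6, rng)
sentence = "; ".join(format_clause(c) for c in clauses)
resolved = solve(clauses)
```

Varying `length` between training and test data would mimic the differing-chain-length setting the abstract mentions; `solve` plays the role of a reference solver against which model predictions could be checked.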
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Neural Networks and Applications
Methods: Multi-Head Attention · Attention Is All You Need · Test · Linear Layer · Label Smoothing · Dense Connections · Absolute Position Encodings · Adam · Position-Wise Feed-Forward Layer · Dropout
