Structure Development in List-Sorting Transformers
Einar Urdshals, Jasmina Urdshals

TL;DR
This paper investigates how a one-layer attention-only transformer organizes its attention heads while learning to sort lists of numbers. It finds a tendency towards simpler structures and shows that features of the training data shape the model's internal organization.
Contribution
The paper shows that two main attention modes emerge in a one-layer transformer trained to sort lists, and it links these modes to features of the training data and to the training dynamics.
Findings
Attention heads organize into vocabulary-splitting and copy-suppression modes.
Vocabulary-splitting occurs with or without weight decay, suggesting an inherent simplicity bias.
Training data features influence the model's internal structure development.
Abstract
We study how a one-layer attention-only transformer develops relevant structures while learning to sort lists of numbers. At the end of training, the model organizes its attention heads in two main modes that we refer to as vocabulary-splitting and copy-suppression. Both represent simpler modes than having multiple heads handle overlapping ranges of numbers. Interestingly, vocabulary-splitting is present regardless of whether we use weight decay, a common regularization technique thought to drive simplification, supporting the thesis that neural networks naturally prefer simpler solutions. We relate copy-suppression to a mechanism in GPT-2 and investigate its functional role in our model. Guided by insights from a developmental analysis of the model, we identify features in the training data that drive the model's final acquired solution. This provides a concrete example of how the…
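The setup described in the abstract can be illustrated with a minimal NumPy sketch of a sorting example and a one-layer attention-only forward pass (embedding, causal multi-head attention, unembedding, no MLP). This is not the authors' code: the sequence layout (unsorted list, separator, sorted list), the use of distinct numbers, and all sizes (`vocab_size`, `list_len`, `d_model`, `n_heads`) are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np

def make_example(rng, vocab_size=64, list_len=10):
    """Build one sequence: unsorted list, separator token, sorted list.
    The layout and the distinct-number sampling are assumptions."""
    sep = vocab_size  # hypothetical separator id, one past the number tokens
    xs = rng.choice(vocab_size, size=list_len, replace=False)
    return np.concatenate([xs, [sep], np.sort(xs)])

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(rng, vocab_size=64, list_len=10, d_model=32, n_heads=2):
    """Random (untrained) weights; all sizes are illustrative assumptions."""
    d_head = d_model // n_heads
    n_tokens = vocab_size + 1      # number tokens plus the separator
    max_len = 2 * list_len + 1

    def w(*shape):
        return rng.normal(0.0, 0.02, size=shape)

    heads = [(w(d_model, d_head), w(d_model, d_head),
              w(d_model, d_head), w(d_head, d_model))
             for _ in range(n_heads)]
    return w(n_tokens, d_model), w(max_len, d_model), heads, w(d_model, n_tokens)

def attention_only_forward(tokens, params):
    """One attention layer on the residual stream, then unembed (no MLP)."""
    W_E, W_pos, heads, W_U = params
    T = len(tokens)
    x = W_E[tokens] + W_pos[:T]                  # token + positional embedding
    mask = np.tril(np.ones((T, T), dtype=bool))  # causal mask
    out = x.copy()
    for W_Q, W_K, W_V, W_O in heads:
        scores = (x @ W_Q) @ (x @ W_K).T / np.sqrt(W_Q.shape[1])
        scores = np.where(mask, scores, -np.inf)
        out = out + softmax(scores) @ (x @ W_V) @ W_O  # each head adds to the stream
    return out @ W_U                             # logits over the vocabulary

rng = np.random.default_rng(0)
seq = make_example(rng)
logits = attention_only_forward(seq, init_params(rng))
```

In a sketch like this, each head's contribution can be inspected separately, which is the kind of per-head analysis that would reveal whether heads divide the number range between them.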
Taxonomy
Topics: DNA and Biological Computing · Advanced biosensing and bioanalysis techniques
Methods: Attention Is All You Need · Cosine Annealing · Adam · Softmax · Dropout · Linear Warmup With Cosine Annealing · Attention Dropout · Linear Layer · Byte Pair Encoding
