The emergence of sparse attention: impact of data distribution and benefits of repetition
Nicolas Zucchet, Francesco d'Angelo, Andrew K. Lampinen, Stephanie C.Y. Chan

TL;DR
This paper investigates how sparse attention emerges over training in Transformers. Combining theoretical analysis of a toy model with experiments on small Transformers, it shows that emergence timing depends on task structure, architecture, and optimizer choice, and that repetition can accelerate emergence.
Contribution
It provides a theoretical framework and empirical evidence explaining the mechanics and timing of sparse attention emergence in neural networks.
Findings
Emergence timing follows power laws in quantities set by task structure, architecture, and optimizer choice.
Repetition significantly speeds up the emergence process.
Results are validated on an in-context associative recall task.
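To make the power-law finding concrete, here is a minimal sketch of how such a relationship could be checked: if emergence time t scales as c * x^alpha in some task or model quantity x, then log t is linear in log x, and the exponent is the slope of a straight-line fit. The function name `fit_power_law`, the variable x, and all numbers below are assumptions made for illustration; the data are synthetic and are not results from the paper.

```python
import numpy as np

def fit_power_law(x, t):
    """Fit t ~ c * x**alpha by least squares in log-log space.

    Returns (alpha, c). Hypothetical helper, not taken from the paper.
    """
    # np.polyfit returns coefficients from highest degree down: [slope, intercept]
    slope, intercept = np.polyfit(np.log(x), np.log(t), deg=1)
    return slope, np.exp(intercept)

# Synthetic, illustrative values only: x stands in for some task or model
# quantity, t for the training step at which sparse attention emerges.
x = np.array([8.0, 16.0, 32.0, 64.0, 128.0])
t = 50.0 * x ** 1.5  # generated from an assumed exponent of 1.5

alpha, c = fit_power_law(x, t)
print(f"fitted exponent alpha = {alpha:.3f}, prefactor c = {c:.3f}")
```

On real measurements the points would be noisy, so one would also inspect the residuals of the log-log fit before concluding that a power law holds.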
Abstract
Emergence is a fascinating property of large language models and neural networks more broadly: as models scale and train for longer, they sometimes develop new abilities in sudden ways. Despite initial studies, we still lack a comprehensive understanding of how and when these abilities emerge. To address this gap, we study the emergence over training of sparse attention, a critical and frequently observed attention pattern in Transformers. By combining theoretical analysis of a toy model with empirical observations on small Transformers trained on a linear regression variant, we uncover the mechanics driving sparse attention emergence and reveal that emergence timing follows power laws based on task structure, architecture, and optimizer choice. We additionally find that repetition can greatly speed up emergence. Finally, we confirm these results on a well-studied in-context associative…
