The Role of Sparsity for Length Generalization in Transformers
Noah Golowich, Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach

TL;DR
This paper develops a theoretical framework showing that length generalization in transformers depends on sparse token dependencies, and introduces a new training method to enhance this ability in language models.
Contribution
It formalizes the concept of sparse dependencies for length generalization and proposes Predictive Position Coupling to improve transformer performance on longer sequences.
Findings
Length generalization occurs when each predicted token depends on a small, fixed number of previous tokens.
This sparse dependency structure is the key condition for successful length generalization.
Predictive Position Coupling broadens the set of tasks on which transformers can length-generalize.
Abstract
Training large language models to predict beyond their training context lengths has drawn much attention in recent years, yet the principles driving such behavior of length generalization remain underexplored. We propose a new theoretical framework to study length generalization for the next-token prediction task, as performed by decoder-only transformers. Conceptually, we show that length generalization occurs as long as each predicted token depends on a small (fixed) number of previous tokens. We formalize such tasks via a notion we call k-sparse planted correlation distributions, and show that an idealized model of transformers which generalize attention heads successfully length-generalize on such tasks. As a bonus, our theoretical model justifies certain techniques to modify positional embeddings which have been introduced to improve length generalization, such as position…
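To make the sparsity idea concrete, here is a minimal toy sketch in Python. It is not the paper's formal definition of a planted correlation distribution. It only illustrates a task in which each new token is a function of a fixed, small set of earlier tokens, regardless of sequence length. The function names, the modular-sum rule, and the use of backward offsets are all assumptions made for illustration.

```python
import random


def sparse_next_token(prefix, offsets, vocab_size):
    """Toy rule (an assumption, not from the paper): the next token is the
    sum, modulo vocab_size, of the tokens found at the given backward
    offsets from the end of the prefix. Only len(offsets) tokens are read,
    however long the prefix is."""
    return sum(prefix[-o] for o in offsets) % vocab_size


def sample_sequence(length, offsets, vocab_size, rng):
    """Sample a sequence whose tokens follow the sparse rule above.
    The first max(offsets) tokens are drawn uniformly at random as a seed."""
    seq = [rng.randrange(vocab_size) for _ in range(max(offsets))]
    while len(seq) < length:
        seq.append(sparse_next_token(seq, offsets, vocab_size))
    return seq[:length]


if __name__ == "__main__":
    rng = random.Random(0)
    offsets = (1, 3)  # each token depends on k = 2 earlier tokens
    short = sample_sequence(20, offsets, vocab_size=10, rng=rng)
    long = sample_sequence(200, offsets, vocab_size=10, rng=rng)
    # The same two-token rule governs both lengths.
    print(len(short), len(long))
```

In this sketch the rule that produces a token at position 200 is the same as the one at position 20, and it reads the same number of earlier tokens. That is the intuition behind the abstract's claim that a predictor which has learned the rule on short sequences has, in principle, what it needs on longer ones.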
Taxonomy
Methods: Softmax · Attention Is All You Need
