A Formal Framework for Understanding Length Generalization in Transformers
Xinting Huang, Andy Yang, Satwik Bhattamishra, Yash Sarrof, Andreas Krebs, Hattie Zhou, Preetum Nakkiran, Michael Hahn

TL;DR
This paper develops a rigorous theoretical framework to analyze and predict the ability of causal transformers with absolute positional encodings to generalize to longer sequences, explaining empirical successes and failures.
Contribution
It introduces a formal theoretical model for length generalization in transformers, characterizing identifiable functions and enabling provable predictions of generalization capabilities.
Findings
The theory predicts when transformers will succeed or fail at length generalization.
Experimental validation confirms the theory's accuracy across various tasks.
The framework explains many empirical observations about transformer generalization.
Abstract
A major challenge for transformers is generalizing to sequences longer than those observed during training. While previous works have empirically shown that transformers can either succeed or fail at length generalization depending on the task, theoretical understanding of this phenomenon remains limited. In this work, we introduce a rigorous theoretical framework to analyze length generalization in causal transformers with learnable absolute positional encodings. In particular, we characterize those functions that are identifiable in the limit from sufficiently long inputs with absolute positional encodings under an idealized inference scheme using a norm-based regularizer. This enables us to prove the possibility of length generalization for a rich family of problems. We experimentally validate the theory as a predictor of success and failure of length generalization across a range of…
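The train-short, test-long setup that the abstract refers to can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's experimental code: the parity task, the helper names (make_examples, accuracy_by_length, fit_lookup), and the memorizing lookup "model" are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import random
from collections import defaultdict


def make_examples(task, lengths, per_length, seed=0):
    """Sample (input, label) pairs of the given lengths for a labeling function."""
    rng = random.Random(seed)
    examples = []
    for n in lengths:
        for _ in range(per_length):
            x = tuple(rng.randint(0, 1) for _ in range(n))
            examples.append((x, task(x)))
    return examples


def accuracy_by_length(predict, examples):
    """Return {length: fraction of examples where predict(x) equals the label}."""
    correct, total = defaultdict(int), defaultdict(int)
    for x, y in examples:
        total[len(x)] += 1
        correct[len(x)] += int(predict(x) == y)
    return {n: correct[n] / total[n] for n in sorted(total)}


def parity(x):
    """Toy target function (an assumption for illustration): parity of a bit string."""
    return sum(x) % 2


def fit_lookup(train):
    """Hypothetical stand-in for a trained model: memorize training pairs, guess 0 otherwise."""
    table = dict(train)
    return lambda x: table.get(x, 0)


# Train on short inputs (lengths 1..8), evaluate on strictly longer ones (9..16).
train = make_examples(parity, lengths=range(1, 9), per_length=200)
test = make_examples(parity, lengths=range(9, 17), per_length=200, seed=1)

model = fit_lookup(train)
in_dist = accuracy_by_length(model, train)   # accuracy on training lengths
out_dist = accuracy_by_length(model, test)   # accuracy on unseen, longer lengths
```

Reporting accuracy per length rather than as a single number makes it visible where performance starts to degrade once inputs exceed the training lengths. Replacing fit_lookup with a real trained transformer is left open here, since the abstract does not specify the training configuration.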
Taxonomy
Topics: Manufacturing Process and Optimization
Methods: Sparse Evolutionary Training
