TL;DR
This paper analyzes why recurrent models struggle with length generalization and proposes simple training interventions, like state initialization with noise, that significantly improve their ability to process much longer sequences.
Contribution
It introduces the unexplored states hypothesis and demonstrates effective, low-cost interventions to enhance length generalization in recurrent models.
Findings
The interventions enable models to generalize to sequences 100 times longer than their training context.
Training with state initialization improves long-sequence performance.
Empirical and theoretical analysis supports the unexplored states hypothesis.
Abstract
Recently, recurrent models such as state space models and linear attention have become popular due to their linear complexity in the sequence length. Thanks to their recurrent nature, in principle they can process arbitrarily long sequences, but their performance sometimes drops considerably beyond their training context lengths, i.e. they fail to length generalize. In this work, we provide comprehensive empirical and theoretical analysis to support the unexplored states hypothesis, which posits that models fail to length generalize when, during training, they are only exposed to a limited subset of the distribution of all attainable states (i.e. states that would be attained if the recurrence were applied to long sequences). Furthermore, we investigate simple training interventions that aim to increase the coverage of the states that the model is trained on, e.g. by initializing the state…
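To make the unexplored states hypothesis concrete, here is a minimal sketch on a toy scalar recurrence. It is an illustration under stated assumptions, not the paper's model or training procedure: the recurrence h_t = decay * h_{t-1} + x_t, the constant inputs, the Gaussian initial state, and all function names (run_recurrence, sample_initial_state, max_state_seen) are hypothetical and chosen only for this example.

```python
import random


def run_recurrence(inputs, h0=0.0, decay=0.9):
    """Toy scalar recurrence h_t = decay * h_{t-1} + x_t; returns every visited state."""
    h = h0
    states = []
    for x in inputs:
        h = decay * h + x
        states.append(h)
    return states


def sample_initial_state(rng, noise_std=0.0):
    """Return 0.0 when noise_std is 0, else a Gaussian draw (an assumed noise scheme)."""
    if noise_std == 0.0:
        return 0.0
    return rng.gauss(0.0, noise_std)


def max_state_seen(num_sequences, length, noise_std, seed=0):
    """Largest state visited over a batch of constant-input training sequences."""
    rng = random.Random(seed)
    best = float("-inf")
    for _ in range(num_sequences):
        h0 = sample_initial_state(rng, noise_std)
        best = max(best, max(run_recurrence([1.0] * length, h0)))
    return best


if __name__ == "__main__":
    short = run_recurrence([1.0] * 8)    # stand-in for the training context
    long = run_recurrence([1.0] * 800)   # sequence 100 times longer
    print("last state, short sequence:", short[-1])
    print("last state, long sequence:", long[-1])
    print("max state, zero init:", max_state_seen(100, 8, noise_std=0.0))
    print("max state, noisy init:", max_state_seen(100, 8, noise_std=5.0))
```

With a zero initial state, eight steps of constant input only bring the state to about 5.70, while the long sequence approaches 10, so the range in between is never visited during training. Sampling the initial state from a Gaussian lets short training sequences start from, and pass through, states outside that narrow range, which is the kind of coverage increase the abstract describes.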
