LongSSM: On the Length Extension of State-space Models in Language Modelling
Shida Wang

TL;DR
This paper studies why state-space models in language modeling struggle to extend to longer sequences, traces the difficulty to polynomial extrapolation, and proposes a simple change to hidden state initialization that improves length extension.
Contribution
It introduces a novel hidden state initialization method that enhances length extension in state-space models and reduces the need for long training sequences.
Findings
Changing hidden state initialization improves length extension.
Long training sequences are beneficial but not necessary.
The method enables efficient training of long-memory models.
Abstract
In this paper, we investigate the length extension of state-space models (SSMs) in language modeling. Length extension involves training models on short sequences and testing them on longer ones. We show that state-space models trained with zero hidden state initialization have difficulty with length extension. We explain this difficulty by pointing out that length extension is equivalent to polynomial extrapolation. Based on this theory, we propose a simple yet effective method, changing the hidden state initialization scheme, to improve length extension. Moreover, our method shows that a long training sequence length is beneficial but not necessary for length extension. Changing the hidden state initialization enables the efficient training of long-memory models with a smaller training context length.
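To make the role of hidden state initialization concrete, here is a minimal sketch. It is not the paper's implementation: it assumes a toy diagonal linear recurrence h[t] = lam * h[t-1] + x[t], and the function name `ssm_scan`, the decay `lam`, and the choice of carrying over the last hidden state of the preceding chunk are all illustrative assumptions. The abstract says only that the initialization scheme is changed, not what it is changed to. The sketch shows that a short chunk processed from a zero hidden state sees different states than the same chunk inside a longer sequence, while a non-zero initialization can reproduce the long-sequence states exactly.

```python
import numpy as np


def ssm_scan(x, lam, h0=None):
    """Run the toy diagonal recurrence h[t] = lam * h[t-1] + x[t].

    x   : array of shape (T, d), the input sequence
    lam : array of shape (d,), per-channel decay (assumed |lam| < 1)
    h0  : array of shape (d,) or None; None means zero initialization
    Returns the hidden states as an array of shape (T, d).
    """
    h = np.zeros(x.shape[1]) if h0 is None else np.asarray(h0, dtype=float)
    states = np.empty_like(x, dtype=float)
    for t in range(x.shape[0]):
        h = lam * h + x[t]
        states[t] = h
    return states


rng = np.random.default_rng(0)
x = rng.normal(size=(64, 4))   # one "long" sequence of length 64
lam = np.full(4, 0.9)          # hypothetical decay, chosen for illustration

# Reference: the whole sequence processed in one pass from a zero state.
full = ssm_scan(x, lam)

# The same sequence split into two short chunks of length 32.
first = ssm_scan(x[:32], lam)
second_zero = ssm_scan(x[32:], lam)                  # zero initialization
second_carry = ssm_scan(x[32:], lam, h0=first[-1])   # assumed: carry state over

# Carrying the state reproduces the long-sequence hidden states.
matches_full = np.allclose(np.concatenate([first, second_carry]), full)
# Zero initialization does not: the chunk never sees the earlier context.
zero_init_differs = not np.allclose(second_zero, full[32:])
```

In this toy setting, the short chunks with carried state pass through the same hidden states as the long sequence, which gives one intuition for why a different initialization could reduce the need for long training sequences.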
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
