LongSSM: On the Length Extension of State-space Models in Language Modelling

Shida Wang

arXiv:2406.02080·cs.CL·June 5, 2024

TL;DR

This paper studies why state-space language models trained on short sequences struggle on longer ones, traces the difficulty to the equivalence between length extension and polynomial extrapolation, and proposes a simple change to the hidden-state initialization that improves length extension.

Contribution

It proposes changing the hidden-state initialization scheme, which improves length extension in state-space models and reduces the need for long training sequences.

Findings

1. Changing hidden state initialization improves length extension.

2. Long training sequences are beneficial but not necessary.

3. The method enables efficient training of long-memory models.

Abstract

In this paper, we investigate the length extension of state-space models (SSMs) in language modeling. Length extension involves training models on short sequences and testing them on longer ones. We show that state-space models trained with zero hidden-state initialization have difficulty with length extension. We explain this difficulty by pointing out that length extension is equivalent to polynomial extrapolation. Based on the theory, we propose a simple yet effective method, changing the hidden-state initialization scheme, to improve length extension. Moreover, our method shows that a long training sequence length is beneficial but not necessary for length extension. Changing the hidden-state initialization enables efficient training of long-memory models with a smaller training context length.
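
To make the role of hidden-state initialization concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the diagonal linear recurrence, the function name `ssm_scan`, and all parameter values are assumptions chosen for illustration, and carrying the state over from a preceding chunk is only one possible non-zero initialization (the abstract does not spell out the exact scheme). The sketch shows that processing a long sequence in short chunks reproduces the full-length outputs when each chunk starts from the previous chunk's final state, whereas restarting from zeros does not.

```python
from typing import List, Optional, Tuple


def ssm_scan(
    xs: List[float],
    lam: List[float],
    b: List[float],
    c: List[float],
    h0: Optional[List[float]] = None,
) -> Tuple[List[float], List[float]]:
    """Run a toy diagonal linear SSM (hypothetical, for illustration only).

    Recurrence per step: h <- lam * h + b * x,  y = sum(c * h).
    If h0 is None, the hidden state starts at zero.
    Returns the outputs and the final hidden state.
    """
    h = list(h0) if h0 is not None else [0.0] * len(lam)
    ys = []
    for x in xs:
        h = [l * hi + bi * x for l, hi, bi in zip(lam, h, b)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys, h


# Illustrative parameters (assumed, not taken from the paper).
lam = [0.9, 0.5]
b = [1.0, 1.0]
c = [0.5, -0.25]
xs = [1.0, 0.0, -1.0, 2.0, 0.5, 0.0, 1.5, -0.5]

# Reference: one pass over the full sequence, starting from zeros.
full_ys, _ = ssm_scan(xs, lam, b, c)

# Two short chunks; the second starts from the first chunk's final state.
first_ys, h_mid = ssm_scan(xs[:4], lam, b, c)
carried_ys, _ = ssm_scan(xs[4:], lam, b, c, h0=h_mid)

# Same second chunk, but with the hidden state reset to zeros.
reset_ys, _ = ssm_scan(xs[4:], lam, b, c)

# Largest deviation from the full-length outputs on the second chunk.
carried_gap = max(abs(u - v) for u, v in zip(full_ys[4:], carried_ys))
reset_gap = max(abs(u - v) for u, v in zip(full_ys[4:], reset_ys))
print("carried-state gap:", carried_gap)
print("zero-reset gap:", reset_gap)
```

With a zero reset, a model only ever sees hidden states reachable within one short chunk, which is one way to picture why longer test sequences fall outside what training covered.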

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling