DeciMamba: Exploring the Length Extrapolation Potential of Mamba

Assaf Ben-Kish; Itamar Zimerman; Shady Abu-Hussein; Nadav Cohen; Amir Globerson; Lior Wolf; Raja Giryes

arXiv:2406.14528·cs.LG·April 11, 2025

TL;DR

DeciMamba is a context-extension method for Mamba that lets a trained model handle sequences longer than those seen during training, improving performance on long-range NLP tasks without retraining.

Contribution

The paper introduces DeciMamba, a method that enables Mamba models to extrapolate to longer sequences by addressing the restricted effective receptive field imposed by the sequence length used during training.

Findings

01

DeciMamba significantly improves sequence length extrapolation.

02

It achieves faster inference on long-range NLP tasks.

03

Empirical results show effective long-context processing without retraining.

Abstract

Long-range sequence processing poses a significant challenge for Transformers due to their quadratic complexity in input length. A promising alternative is Mamba, which demonstrates high performance and achieves Transformer-level capabilities while requiring substantially fewer computational resources. In this paper we explore the length-generalization capabilities of Mamba, which we find to be relatively limited. Through a series of visualizations and analyses we identify that the limitations arise from a restricted effective receptive field, dictated by the sequence length used during training. To address this constraint, we introduce DeciMamba, a context-extension method specifically designed for Mamba. This mechanism, built on top of a hidden filtering mechanism embedded within the S6 layer, enables the trained model to extrapolate well even without additional training. Empirical…
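The abstract does not spell out the mechanism, so the following is only a rough sketch of the general idea the reviews below describe (token importance scoring followed by token discarding), not the authors' implementation. It assumes each token already has a scalar importance score, and it assumes a per-layer token budget that shrinks by a decay factor, in the spirit of the hyperparameters Reviewer 01 lists. The names `decimate`, `pool_sizes`, `keep`, `max_len`, and `decay` are hypothetical.

```python
from typing import List, Sequence


def decimate(tokens: Sequence, scores: Sequence[float], keep: int) -> List:
    """Keep the `keep` highest-scoring tokens, preserving their original order.

    `scores` is an assumed per-token importance value; how it is derived
    from the S6 layer is not specified here.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if keep >= len(tokens):
        return list(tokens)
    # Rank positions by descending score (ties go to the earlier position),
    # then sort the surviving positions so sequence order is preserved.
    top = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))[:keep]
    return [tokens[i] for i in sorted(top)]


def pool_sizes(max_len: int, decay: float, num_layers: int) -> List[int]:
    """Hypothetical per-layer token budgets: max_len, max_len * decay, ..."""
    return [max(1, int(max_len * decay ** layer)) for layer in range(num_layers)]


if __name__ == "__main__":
    toks = list("abcdef")
    scr = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7]
    print(decimate(toks, scr, keep=3))
    print(pool_sizes(max_len=8, decay=0.5, num_layers=3))
```

Keeping the surviving tokens in their original order matters for a recurrent model such as Mamba, since the state update depends on the order in which tokens arrive.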

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 6·Confidence 4

Strengths

Overall, I like this paper. I think that Mamba is a very appealing method due to its low inference cost, and finding methods that extend the context length for Mamba is a very important question. I appreciate that the authors considered diverse benchmarks, and the gap between Mamba and DeciMamba is pretty consistent in some cases. I also appreciated the scientific approach in the paper, which consists of isolating the problematic component in Mamba and proposing a method to allevi…

Weaknesses

I have a few concerns regarding this paper that I list below:
- **Hyperparameter choices**: I agree that the fast decay of the sum of the discrete time steps may explain the lack of length generalization. However, the approach looks a bit hacky in that it introduces multiple novel hyperparameters: the decay factor, the maximal length of the sequence after the first decimating layer, and the number of layers to decimate. And it does not seem very clear how to make these choices wit…

Reviewer 02·Rating 3·Confidence 4

Strengths

- The investigation of token importance scoring in the context of Mamba represents a valuable contribution to the field, especially given the growing interest in alternatives to attention-based mechanisms
- The method achieves substantial improvements while maintaining implementation simplicity, making it readily applicable in practical scenarios

Weaknesses

- Limited scope of application: The method's restriction to the prefilling phase significantly limits its practical impact, especially considering the increasing demand for both long-text prefilling and generation in modern applications
- Potential information loss: The token discarding approach may have unintended consequences in scenarios requiring comprehensive context understanding. This is particularly problematic in tasks like document question-answering, where discarded tokens during pref…

Reviewer 03·Rating 3·Confidence 5

Strengths

1. The proposed method is straightforward and can be directly combined with Mamba to improve the length generalization abilities.
2. The authors analyze the shortcomings of Mamba in terms of its length generalization abilities.

Weaknesses

1. The experimental results are limited. The authors only combine the method with the Mamba model and verify the effectiveness of their method. It would be more convincing if the method can be validated on a broader range of SSM-based and linear-attention-based models.
2. The experimental results are not satisfactory. As shown in Table 1, DeciMamba fails to achieve more than a 10% LongBench score on most datasets. In contrast, models with a context window size of only 4k, as reported in the orig…

Code & Models

Videos

DeciMamba: Exploring the Length Extrapolation Potential of Mamba· slideslive
