Differential Mamba
Nadav Schneider, Itamar Zimerman, Eliya Nachmani

TL;DR
This paper introduces a novel differential mechanism for the Mamba architecture that reduces the overallocation of attention to irrelevant context and improves language modeling performance, especially on retrieval tasks.
Contribution
We adapt differential design techniques to Mamba, developing a new mechanism that improves its efficiency and effectiveness in language modeling tasks.
Findings
Enhanced retrieval capabilities in Mamba-based models
Superior performance over vanilla Mamba on benchmarks
Effective mitigation of attention overallocation issues
Abstract
Sequence models like Transformers and RNNs often overallocate attention to irrelevant context, leading to noisy intermediate representations. This degrades LLM capabilities by promoting hallucinations, weakening long-range and retrieval abilities, and reducing robustness. Recent work has shown that differential design can mitigate this issue in Transformers, improving their effectiveness across various applications. In this paper, we explore whether these techniques, originally developed for Transformers, can be applied to Mamba, a recent architecture based on selective state-space layers that achieves Transformer-level performance with greater efficiency. We show that a naive adaptation of differential design to Mamba is insufficient and requires careful architectural modifications. To address this, we introduce a novel differential mechanism for Mamba, empirically validated on…
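To make the "differential design" idea from prior Transformer work concrete, the following is a minimal sketch of differential attention: two softmax attention maps are computed and the second, scaled by a factor, is subtracted from the first so that attention mass both maps assign to irrelevant context cancels out. This illustrates the Transformer technique the abstract refers to, not the Mamba mechanism proposed in the paper. The function names, the fixed scalar `lam` (learnable in practice), and the omission of per-head normalization are simplifying assumptions made for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(queries, keys):
    """Row-wise softmax of scaled dot products between queries and keys."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        for q in queries
    ]

def differential_attention(q1, k1, q2, k2, values, lam):
    """Apply (A1 - lam * A2) to the values, where A1 and A2 are softmax maps.

    lam is a plain scalar here (an assumption for this sketch).
    """
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    # Subtract the second map from the first; weights shared by both maps cancel.
    diff = [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in diff
    ]
```

With `lam = 0` the function reduces to ordinary softmax attention. With identical query/key pairs and `lam = 1` the two maps cancel exactly and the output is zero, which shows the noise-cancelling behaviour in its most extreme form.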
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Generative Adversarial Networks and Image Synthesis
Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces
