Mamba-3: Improved Sequence Modeling using State Space Principles

Aakash Lahoti; Kevin Y. Li; Berlin Chen; Caitlin Wang; Aviv Bick; J. Zico Kolter; Tri Dao; Albert Gu

arXiv:2603.15569·cs.LG·March 17, 2026

TL;DR

Mamba-3 applies three changes inspired by state space models (SSMs) to linear sequence models, improving accuracy on language tasks and inference efficiency while remaining hardware-efficient and matching Mamba-2's perplexity with half the state size.

Contribution

The paper presents three methodological improvements grounded in state space models (trapezoidal discretization, a complex-valued SSM, and a MIMO formulation) that increase the expressiveness and performance of linear models on sequence tasks.

Findings

01. Mamba-3 outperforms previous models in accuracy on retrieval, state tracking, and language modeling tasks.

02. Achieves comparable perplexity with half the state size of Mamba-2.

03. Demonstrates improved efficiency on the performance-efficiency Pareto frontier.

Abstract

Scaling inference-time compute has emerged as an important driver of LLM performance, making inference efficiency a central focus of model design alongside model quality. While the current Transformer-based models deliver strong model quality, their quadratic compute and linear memory make inference expensive. This has spurred the development of sub-quadratic models with reduced linear compute and constant memory requirements. However, many recent linear models trade off model quality and capability for algorithmic efficiency, failing on tasks such as state tracking. Moreover, their theoretically linear inference remains hardware-inefficient in practice. Guided by an inference-first perspective, we introduce three core methodological improvements inspired by the state space model (SSM) viewpoint of linear models. We combine: (1) a more expressive recurrence derived from SSM…

Peer Reviews

Decision·ICLR 2026 Oral

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- **Theoretical Elegance and Insight:** The paper's standout strength is its use of classical SSM theory to motivate architectural changes. The derivation of a data-dependent RoPE from a complex state-space is a beautiful and insightful result that bridges two important lines of work.
- **Targeted Problem Solving:** Each of the three methodological changes directly targets a known weakness in prior models. Trapezoidal discretization for expressivity, complexification for state-tracking, and MIM

Weaknesses

My major concern is that retrieval capabilities still lag Transformers. While the paper is honest about this, and Mamba-3 improves over Mamba-2, the results in Table 2 confirm that a fundamental gap in retrieval performance versus Transformer models remains. This is an inherent challenge for fixed-state recurrent models and represents a key limitation. Could the authors add some discussion of hybrid models? I think we are in an age where hybrid models are becoming more imp

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. The proposed Generalized Trapezoidal Discretization is more accurate than the Euler's rule used in Mamba-2 because it uses a second-order approximation of the integral.
2. The complex SSM is novel and is equivalent to a real SSM with data-dependent rotary embeddings (RoPE). Detailed theoretical analyses are provided.
3. The MIMO method efficiently addresses the I/O problem, which is a key bottleneck for Mamba-2.
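The accuracy claim in the first point can be illustrated with a textbook sketch. This is not the paper's formulation: it assumes a scalar ODE h'(t) = a·h(t) + x(t) with a known input x(t) = cos(t), an exact state decay exp(a·dt) per step, and two ways of approximating the input integral over a step (a one-point right-endpoint rule standing in for the "Euler" case, and a two-point trapezoid rule). The function names `simulate` and `exact_state` are hypothetical.

```python
import math


def simulate(rule, a, x, t_end, n_steps):
    """Integrate h'(t) = a*h(t) + x(t) from h(0) = 0 using n_steps uniform steps."""
    dt = t_end / n_steps
    alpha = math.exp(a * dt)  # exact decay of the state over one step
    h = 0.0
    for k in range(n_steps):
        t0, t1 = k * dt, (k + 1) * dt
        if rule == "euler":
            # One-point rule: only the input at the end of the step contributes.
            h = alpha * h + dt * x(t1)
        elif rule == "trapezoid":
            # Two-point rule: average both endpoints; the earlier input
            # is decayed by alpha because it entered one step ago.
            h = alpha * h + 0.5 * dt * (alpha * x(t0) + x(t1))
        else:
            raise ValueError(f"unknown rule: {rule}")
    return h


def exact_state(a, t):
    """Closed-form solution of h' = a*h + cos(t) with h(0) = 0."""
    return (a * math.exp(a * t) - a * math.cos(t) + math.sin(t)) / (1.0 + a * a)


def error(rule, n_steps, a=-1.0, t_end=2.0):
    """Absolute error of the chosen rule at t_end."""
    return abs(simulate(rule, a, math.cos, t_end, n_steps) - exact_state(a, t_end))


if __name__ == "__main__":
    for rule in ("euler", "trapezoid"):
        coarse, fine = error(rule, 100), error(rule, 200)
        # Halving the step size should roughly halve a first-order error
        # and roughly quarter a second-order error.
        print(rule, coarse, fine, coarse / fine)
```

Under these assumptions, halving the step size cuts the one-point error by about a factor of two and the trapezoid error by about a factor of four, which is what "second-order" means in this context.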

Weaknesses

More complex tasks such as reasoning can be explored to fully demonstrate the capability of Mamba-3.

Reviewer 03 · Rating 6 · Confidence 5

Strengths

1. The three core improvements, trapezoidal discretization, complex-valued SSMs, and MIMO formulation, are not isolated tweaks but creative combinations of classical SSM theory and modern LLM needs. Trapezoidal discretization generalizes Euler’s rule (used in Mamba-2) to a second-order accurate recurrence, while the complex-valued SSM recovers rotational dynamics absent in real-valued counterparts.
2. Rigorous Theory & Comprehensive Empirical Validation. The paper includes rigorous proofs for k
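The link between a complex-valued state and rotational dynamics mentioned in the first point can be checked numerically on a toy case. The sketch below is an illustration under stated assumptions, not the paper's construction: it uses a single scalar complex state with a fixed decay, and a list of arbitrary per-step angles `thetas` stands in for data-dependent values. All function names are hypothetical. It compares three forms: the complex recurrence, a real two-dimensional recurrence with a rotation matrix, and a RoPE-like form that rotates each input by its cumulative angle and undoes the rotation at readout.

```python
import cmath
import math


def complex_recurrence(decay, thetas, xs):
    """h_t = decay * exp(i*theta_t) * h_{t-1} + x_t with a complex scalar state."""
    h = 0j
    for theta, x in zip(thetas, xs):
        h = decay * cmath.exp(1j * theta) * h + x
    return h


def real_rotation_recurrence(decay, thetas, xs):
    """Same dynamics on a real 2-vector: rotate by theta_t, scale, add input."""
    h0, h1 = 0.0, 0.0
    for theta, x in zip(thetas, xs):
        c, s = math.cos(theta), math.sin(theta)
        # The 2x2 rotation [[c, -s], [s, c]] plays the role of exp(i*theta).
        h0, h1 = decay * (c * h0 - s * h1) + x, decay * (s * h0 + c * h1)
    return h0, h1


def cumulative_rotation_form(decay, thetas, xs):
    """RoPE-like form: counter-rotate each input by its cumulative angle,
    run a rotation-free decay recurrence, then rotate the result once."""
    total = 0.0
    acc = 0j
    for theta, x in zip(thetas, xs):
        total += theta
        acc = decay * acc + cmath.exp(-1j * total) * x
    return cmath.exp(1j * total) * acc


if __name__ == "__main__":
    thetas = [0.3, -1.1, 0.7, 2.0]   # assumed per-step angles
    xs = [1.0, -0.5, 2.0, 0.25]      # assumed real inputs
    print(complex_recurrence(0.9, thetas, xs))
    print(real_rotation_recurrence(0.9, thetas, xs))
    print(cumulative_rotation_form(0.9, thetas, xs))
```

All three forms produce the same state up to floating-point error. The third form is the one that resembles rotary embeddings: the rotation moves out of the recurrence and onto the inputs and the readout, with the angle depending on position through a running sum.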

Weaknesses

1. Inherent limitations of retrieval capabilities and insufficient comparison. Mamba-3 significantly lags behind Transformers in information extraction from semi-structured and unstructured data, and the root causes and potential solutions are not explored in depth. Table 2 of the paper shows that Mamba-3 with 1.5B parameters has poor accuracy in real-world retrieval tasks such as SWDE and FDA; even in the synthetic "needle-in-a-haystack" task, when the context length exceeds 2048, the accurac

Code & Models

Models

🤗
MarcoDotIO/mighty-giant-checkpoints
model· 19 dl

Taxonomy

Topics: Machine Learning and Algorithms · Topic Modeling · Machine Learning in Healthcare