Mamba-3: Improved Sequence Modeling using State Space Principles
Aakash Lahoti, Kevin Y. Li, Berlin Chen, Caitlin Wang, Aviv Bick, J. Zico Kolter, Tri Dao, Albert Gu

TL;DR
Mamba-3 applies three state space model (SSM)-inspired changes to linear sequence models, improving accuracy on language tasks and inference efficiency while staying hardware-efficient; it reaches perplexity comparable to Mamba-2 with half the state size.
Contribution
It presents three SSM-based methodological changes (trapezoidal discretization, a complex-valued SSM, and a MIMO formulation) that improve the expressiveness and performance of linear models on sequence tasks.
Findings
Mamba-3 outperforms previous models in accuracy on retrieval, state tracking, and language modeling tasks.
Achieves comparable perplexity with half the state size of Mamba-2.
Demonstrates improved efficiency on the performance-efficiency Pareto frontier.
Abstract
Scaling inference-time compute has emerged as an important driver of LLM performance, making inference efficiency a central focus of model design alongside model quality. While the current Transformer-based models deliver strong model quality, their quadratic compute and linear memory make inference expensive. This has spurred the development of sub-quadratic models with reduced linear compute and constant memory requirements. However, many recent linear models trade off model quality and capability for algorithmic efficiency, failing on tasks such as state tracking. Moreover, their theoretically linear inference remains hardware-inefficient in practice. Guided by an inference-first perspective, we introduce three core methodological improvements inspired by the state space model (SSM) viewpoint of linear models. We combine: (1) a more expressive recurrence derived from SSM…
Peer Reviews
Decision: ICLR 2026 Oral
- **Theoretical Elegance and Insight:** The paper's standout strength is its use of classical SSM theory to motivate architectural changes. The derivation of a data-dependent RoPE from a complex state-space is a beautiful and insightful result that bridges two important lines of work.
- **Targeted Problem Solving:** Each of the three methodological changes directly targets a known weakness in prior models. Trapezoidal discretization for expressivity, complexification for state-tracking, and MIM
My major concern is that retrieval capabilities still lag Transformers. While the paper is honest about this, and Mamba-3 shows improvement over Mamba-2, the results in Table 2 confirm that a fundamental gap in retrieval performance versus Transformer models remains. This is an inherent challenge for fixed-state recurrent models and represents a key limitation. Could the authors add some exploration of hybrid models? I think we are at a point where hybrid models are becoming more imp
1. The proposed Generalized Trapezoidal Discretization is more accurate than Euler's rule used in Mamba-2, because it uses a second-order approximation of the integral.
2. The complex SSM is novel and is equivalent to a real SSM with data-dependent rotary embeddings (RoPE). Detailed theoretical analyses are provided.
3. The MIMO method efficiently solves the I/O problem, which is a key bottleneck for Mamba-2.
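The accuracy claim in the first point can be illustrated on a toy problem. The following is a minimal sketch, not the paper's Generalized Trapezoidal Discretization: it assumes a scalar system h'(t) = a·h(t) + b·x(t) with h(0) = 0, applies the exact state decay exp(a·dt) at every step, and compares a right-endpoint (Euler-style) rule for the input integral against the classical trapezoidal rule. The names `simulate` and `exact_solution`, the input cos(t), and all parameter values are hypothetical choices made for illustration.

```python
import math


def simulate(rule, a, b, x, T, n):
    """Integrate h'(t) = a*h(t) + b*x(t), h(0) = 0, over [0, T] in n steps.

    The state decay over one step is applied exactly (alpha = exp(a*dt));
    only the integral of the input term is approximated.
    """
    dt = T / n
    alpha = math.exp(a * dt)
    h = 0.0
    for k in range(1, n + 1):
        x_prev, x_curr = x((k - 1) * dt), x(k * dt)
        if rule == "euler":
            # Right-endpoint rule: only the current input contributes.
            h = alpha * h + dt * b * x_curr
        elif rule == "trapezoid":
            # Trapezoidal rule: average of both interval endpoints, with
            # the previous input decayed by one step.
            h = alpha * h + 0.5 * dt * b * (alpha * x_prev + x_curr)
        else:
            raise ValueError(f"unknown rule: {rule}")
    return h


def exact_solution(T):
    """Closed form of h'(t) = -h(t) + cos(t), h(0) = 0, evaluated at T."""
    return 0.5 * (math.cos(T) + math.sin(T) - math.exp(-T))


if __name__ == "__main__":
    T = 2.0
    for n in (50, 100, 200):
        for rule in ("euler", "trapezoid"):
            err = abs(simulate(rule, -1.0, 1.0, math.cos, T, n) - exact_solution(T))
            print(f"n={n:4d} {rule:9s} abs error = {err:.3e}")
```

In this toy setting, halving the step size roughly halves the Euler error and roughly quarters the trapezoidal error, which is what first-order versus second-order accuracy means in practice.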
More complex tasks such as reasoning can be explored to fully demonstrate the capability of Mamba-3.
1. The three core improvements, trapezoidal discretization, complex-valued SSMs, and MIMO formulation, are not isolated tweaks but creative combinations of classical SSM theory and modern LLM needs. Trapezoidal discretization generalizes Euler’s rule (used in Mamba-2) to a second-order accurate recurrence, while the complex-valued SSM recovers rotational dynamics absent in real-valued counterparts.
2. Rigorous Theory & Comprehensive Empirical Validation. The paper includes rigorous proofs for k
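The "rotational dynamics" mentioned in the first point can be made concrete with a toy recurrence. The sketch below is not the Mamba-3 layer. It only shows two standard facts: a complex scalar state multiplied by decay·exp(iθ) is the same as a real 2-dimensional state multiplied by a scaled rotation matrix, and a data-dependent angle lets such a state track the parity of a bit stream. The function names and the choice θ = π·bit are hypothetical, chosen for illustration.

```python
import cmath
import math


def complex_step(h, theta, decay=1.0):
    """One step of a complex scalar recurrence: h <- decay * e^{i*theta} * h."""
    return decay * cmath.exp(1j * theta) * h


def real_step(h2, theta, decay=1.0):
    """The same step on a real 2-vector (re, im) using a 2x2 rotation."""
    c, s = math.cos(theta), math.sin(theta)
    re, im = h2
    return (decay * (c * re - s * im), decay * (s * re + c * im))


def run_both(bits):
    """Run the complex and the real recurrence side by side on a bit stream."""
    h = 1 + 0j
    h2 = (1.0, 0.0)
    for bit in bits:
        theta = math.pi * bit  # data-dependent angle: rotate by pi on a 1
        h = complex_step(h, theta)
        h2 = real_step(h2, theta)
    return h, h2


def parity(bits):
    """Read the parity of the bit stream from the sign of the real part."""
    h, _ = run_both(bits)
    return 0 if h.real > 0 else 1


if __name__ == "__main__":
    bits = [1, 0, 1, 1]
    h, h2 = run_both(bits)
    print("complex state:", h)
    print("real 2-d state:", h2)
    print("parity:", parity(bits))
```

Each 1 in the input flips the sign of the state, so the final sign encodes whether the number of ones is odd or even; both formulations hold the same state up to floating-point error.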
1. Inherent limitations of retrieval capabilities and insufficient comparison. Mamba-3 significantly lags behind Transformer in information extraction tasks from semi-structured/unstructured data, and the root causes and potential solutions are not explored in depth. Table 2 of the paper shows that Mamba-3 with 1.5B parameters has poor accuracy in real-world retrieval tasks such as SWDE and FDA; even in the synthetic "needle-in-a-haystack" task, when the context length exceeds 2048, the accurac
Taxonomy
Topics: Machine Learning and Algorithms · Topic Modeling · Machine Learning in Healthcare
