Mamba State-Space Models Are Lyapunov-Stable Learners

John T. Halloran; Manbir Gulati; Paul F. Roysdon

arXiv:2406.00209·cs.LG·September 1, 2025·1 cites

Open Access · 3 Reviews

TL;DR

This paper shows that Mamba state-space models are Lyapunov-stable learners: their recurrent dynamics are inherently stable, which makes them robust to changes introduced by mixed-precision fine-tuning (MPFT) and parameter-efficient fine-tuning (PEFT).

Contribution

The paper proves the Lyapunov stability of Mamba SSMs and empirically shows their robustness to common fine-tuning methods, contrasting with Transformer models.

Findings

01

Mamba LLMs are highly stable under MPFT and PEFT.

02

Transformer LLMs can diverge significantly under the same fine-tuning.

03

Lyapunov stability guarantees robustness of Mamba's recurrent dynamics.
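
The stability idea behind the findings above can be illustrated with a toy recurrence. The following is a minimal sketch, not the paper's model or proof: it assumes a diagonal state matrix with strictly negative entries, positive step sizes, and an exponential discretization of the state matrix. The names `run_diagonal_ssm`, `a`, `b`, `deltas`, and `eps` are hypothetical, and the step sizes are fixed in advance rather than input-dependent as in a selective SSM. Under these assumptions, a small perturbation of the initial state shrinks at every step.

```python
import numpy as np

def run_diagonal_ssm(h0, inputs, deltas, a, b):
    """Iterate h_{t+1} = exp(delta_t * a) * h_t + delta_t * b * x_t elementwise."""
    h = np.asarray(h0, dtype=np.float64).copy()
    states = [h.copy()]
    for x_t, d_t in zip(inputs, deltas):
        # With a < 0 and delta_t > 0, every entry of exp(delta_t * a) lies in (0, 1).
        h = np.exp(d_t * a) * h + d_t * b * x_t
        states.append(h.copy())
    return np.stack(states)

rng = np.random.default_rng(0)
steps, dim = 50, 4
a = -np.linspace(0.5, 2.0, dim)              # assumption: strictly negative diagonal
b = np.ones(dim)                             # assumption: fixed input projection
inputs = rng.normal(size=steps)              # scalar input sequence
deltas = rng.uniform(0.05, 0.5, size=steps)  # assumption: positive step sizes

h0 = np.zeros(dim)
eps = 1e-3 * np.ones(dim)                    # small perturbation of the initial state

clean = run_diagonal_ssm(h0, inputs, deltas, a, b)
perturbed = run_diagonal_ssm(h0 + eps, inputs, deltas, a, b)

# Distance between the two trajectories at each time step.
gap = np.linalg.norm(perturbed - clean, axis=1)

# The slowest-decaying state dimension bounds how fast the gap can shrink.
bound = gap[0] * np.cumprod(np.exp(deltas * a.max()))

print(f"initial gap {gap[0]:.2e}, final gap {gap[-1]:.2e}")
```

Because the recurrence is linear in the state, the two trajectories see the same inputs and their difference is multiplied by `exp(delta_t * a)` at each step, so `gap` decreases monotonically and stays below `bound`. A positive entry in `a` would instead let the gap grow, which is the kind of sensitivity the stability analysis is meant to rule out.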

Abstract

Mamba state-space models (SSMs) have recently outperformed state-of-the-art (SOTA) Transformer large language models (LLMs) in various tasks and been widely adapted. However, a major concern for stable learning in recurrent-based deep models (such as SSMs) is the sensitivity of their recurrent dynamics. Despite widespread adaptation, the sensitivity of Mamba's recurrent dynamics under common fine-tuning methods, e.g., mixed-precision fine-tuning (MPFT) and parameter-efficient fine-tuning (PEFT), remains unexplored. Empirically, we show that Mamba LLMs are extremely stable to changes introduced by combinations of MPFT and PEFT, in stark contrast to Transformer LLMs, which we demonstrate may drastically diverge from their respective full-precision counterparts under different combinations of MPFT and PEFT (despite the near-ubiquitous adaptation of these fine-tuning frameworks for…
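
For readers unfamiliar with PEFT, the sketch below shows a low-rank adapter in the style of LoRA, one widely used PEFT method. It is included only as a familiar example; this excerpt does not state which PEFT configuration the paper uses. The class `LoRALinear` and its parameters `rank`, `alpha`, and `seed` are hypothetical names for illustration, and the code covers the forward pass only, with no training loop.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (LoRA-style sketch)."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                   # frozen base weight
        self.a = rng.normal(scale=0.01, size=(rank, in_dim))   # trainable factor
        self.b = np.zeros((out_dim, rank))                     # trainable, zero at init
        self.scale = alpha / rank

    def delta_weight(self):
        # Low-rank update added to the frozen weight; its rank is at most `rank`.
        return self.scale * self.b @ self.a

    def __call__(self, x):
        # Base projection plus the adapter path, computed without forming delta_weight.
        return x @ self.weight.T + self.scale * (x @ self.a.T) @ self.b.T

    def trainable_fraction(self):
        adapter = self.a.size + self.b.size
        return adapter / (self.weight.size + adapter)

rng = np.random.default_rng(1)
base_weight = rng.normal(size=(64, 64))
layer = LoRALinear(base_weight, rank=4, alpha=8.0)
x = rng.normal(size=(3, 64))

print(f"trainable fraction: {layer.trainable_fraction():.3f}")
```

Because `b` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and fine-tuning only moves the small factors `a` and `b`. Stability under PEFT then becomes a question of how much such low-rank changes, possibly computed in reduced precision, perturb the model's outputs.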

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

1. The paper is well-written and easy to follow.
2. The paper provides a theoretical analysis to support experimental performance.
3. Comprehensive experiments compare the Mamba model with Transformer-based models across different scenarios.

Weaknesses

Though the theoretical analysis of this paper is really solid, most researchers nowadays already use MPFT and PEFT for Mamba, so the contribution of this paper may be limited. The paper would be stronger if it identified weaknesses of MPFT and PEFT for Mamba and then optimized for them.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The paper addresses a very important problem that, until now, has helped to prevent the wider use of SSMs: training resources for an SSM have been greater than a similarly-sized transformer, because the transformer has been known to be compatible with efficient fine-tuning techniques. Addressing this disparity will help more people to consider working with SSMs, as their computational requirements will suddenly be within reach
- The combination of both theory and empirical results is quite per

Weaknesses

- My most major critique is that there aren’t any error bars or sense of the spread/randomness in the empirical results, especially for experiment 2 / figure 1, experiment 4 / figure 3 for the fine-tuned models and experiment 5 / figure 4 for the fine-tuned models.
- My second biggest critique is that each experiment is only performed with one dataset (MMLU, fine-tuning with Alpaca for fine-tuning experiments). Including more than one dataset or task would strengthen the claims you’re drawing f

Reviewer 03 · Rating 3 · Confidence 5

Strengths

This work provides promising evidence that fine-tuning Mamba on instruction-tuning data can improve its in-context learning abilities.

Weaknesses

- Line 51: Some training frameworks do not keep master weights in fp32, which is problematic for Mamba. But Mamba was still trained with mixed precision using AMP.
- Theory 1 seems shallow: Isn't Lemma 1 trivial to derive? \( dx_{t+1}/dx_t \) directly represents the decay rate, which must be \(\leq 1\) to prevent exponential growth in cumulative products. This insight is obvious for any recurrent model, making Theorem 1’s assertion about exponential decay rather redundant.
- Theory 2 seems

Taxonomy

Topics: Neural Networks and Applications · Intelligent Tutoring Systems and Adaptive Learning

Methods: Attention Is All You Need · Dropout · Layer Normalization · Adam · Dense Connections · Residual Connection · Position-Wise Feed-Forward Layer · Linear Layer · Byte Pair Encoding · Absolute Position Encodings