From Markov to Laplace: How Mamba In-Context Learns Markov Chains

Marco Bondaschi; Nived Rajaraman; Xiuying Wei; Kannan Ramchandran; Razvan Pascanu; Caglar Gulcehre; Michael Gastpar; Ashok Vardhan Makkuva

arXiv:2502.10178·cs.LG·February 17, 2025

TL;DR

This paper shows that even a single-layer Mamba, a selective state space model, learns the Laplacian smoothing estimator for Markov chains in context, and it supports this finding with theoretical and empirical analysis.

Contribution

It provides the first formal connection between Mamba's architecture and optimal statistical estimators for Markov chains, highlighting the role of convolution.

Findings

01

Mamba learns the in-context Laplacian smoothing estimator for all Markovian orders.

02

Theoretical characterization of Mamba's representation capacity explains its effectiveness.

03

Empirical results confirm Mamba's ability to learn optimal estimators efficiently.

Abstract

While transformer-based language models have driven the AI revolution thus far, their computational complexity has spurred growing interest in viable alternatives, such as structured state space sequence models (SSMs) and Selective SSMs. Among these, Mamba (S6) and its variant Mamba-2 have shown remarkable inference speed ups over transformers while achieving comparable or superior performance on complex language modeling tasks. However, despite these architectural innovations and empirical successes, the fundamental learning capabilities of Mamba remain poorly understood. In this paper, we address this gap by studying in-context learning (ICL) on Markov chains and uncovering a surprising phenomenon: unlike transformers, even a single-layer Mamba efficiently learns the in-context Laplacian smoothing estimator, which is both Bayes and minimax optimal, for all Markovian orders. To explain…
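To make the estimator named in the abstract concrete, here is a minimal sketch of add-constant (Laplacian) smoothing for an order-k Markov chain. It is an illustration, not the authors' code: the function name `laplace_estimate`, its parameters, and the default `beta=1.0` are assumptions chosen for this example. For each length-k context, the estimate of the next-symbol probability is (count of context followed by symbol + beta) / (count of context + beta × vocabulary size), which matches the posterior predictive distribution when each row of the transition kernel is drawn from a symmetric Dirichlet prior with parameter beta.

```python
from collections import defaultdict


def laplace_estimate(seq, order=1, vocab_size=2, beta=1.0):
    """Add-beta estimate of P(next symbol | last `order` symbols of `seq`).

    Hypothetical helper for illustration; `beta` is the smoothing constant
    (assumed to be 1.0 here, i.e. classic add-one smoothing).
    """
    if order < 1 or len(seq) < order:
        raise ValueError("need order >= 1 and at least `order` symbols")

    # counts[context][s] = number of times `context` was followed by symbol s
    counts = defaultdict(lambda: [0] * vocab_size)
    for t in range(order, len(seq)):
        context = tuple(seq[t - order:t])
        counts[context][seq[t]] += 1

    # Look up the counts for the context formed by the most recent symbols.
    current = tuple(seq[len(seq) - order:])
    c = counts[current]
    total = sum(c)

    # Smoothed probabilities; an unseen context falls back to uniform.
    return [(n + beta) / (total + beta * vocab_size) for n in c]


# Alternating binary sequence: context (0,) was followed by 1 three times.
probs = laplace_estimate([0, 1, 0, 1, 0, 1, 0], order=1, vocab_size=2, beta=1.0)
print(probs)  # [0.2, 0.8]
```

In the example, the context `0` has been followed by `1` three times and by `0` never, so the smoothed estimate is (0 + 1)/(3 + 2) = 0.2 for `0` and (3 + 1)/(3 + 2) = 0.8 for `1`. An in-context learner that implements this estimator would have to produce the same prediction from the prompt alone, without any parameter updates.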

Peer Reviews

Decision·ICLR 2026 Oral

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- I found the idea interesting! Thank you for sharing this from a perspective that is intriguing and different to the typical “we ran a bunch of experiments on the same benchmarks and boosted performance by X%” kind of LM study. It is positive to see Mamba models being connected with optimal statistical estimators.
- The paper is well written. Even though it may not be easy to read on a cursory look (I needed to go back and forth to see symbols and notation frequently), it does contain all the

Weaknesses

- For the non-Markovian setting in Section 5, (ppl on WikiText103 being the only result), I feel like the evaluation could have been a bit more diverse and rigorous. This is the setting that I assume most practitioners (for better or worse) would be interested in, and it may have been useful to do this over a slightly wider range of model sizes (number of layers, dimensionality etc) to help build better intuition.
- There is a lot of related work on architectures that do sub-quadratic attention

Reviewer 02 · Rating 8 · Confidence 2

Strengths

- The paper is clearly written and well motivated.
- The methodology considered in this study is sound and made explicit: Building on the empirical observation that the convolution is the key ingredient in the ICL abilities of Mamba, the authors construct a minimal architecture that allows a theoretical analysis all while keeping key similarities with the original architecture. This allows the derivation of sound theoretical results characterizing Mamba's ability to learn in-context, as function

Weaknesses

- Theorem 1 is insightful in proving the in-context ability of Mamba. However, in my understanding it only tackles the model misspecification problem, in the sense that it proves that the optimal solution can be represented by the model (zero approximation error). I think it would be interesting to discuss the intersections between the theoretically chosen parameters that achieve the result in Theorem 1, and those actually obtained by optimizing Eq 3 on a given training set (with or without dist

Reviewer 03 · Rating 8 · Confidence 3

Strengths

1. This is the first work to investigate theoretically the ICL ability of the Mamba architecture.
2. The paper is well-written; the logic and arrangement are easy to follow, and the author provides enough background in the main text to help readers understand the scientific question being studied. It starts from an empirical observation, finds convolution plays an important role, and then echoes this importance in the theoretical analysis.
3. The paper is perfectly designed and written to shed

Weaknesses

1. One possible weakness is that, to the best of our current understanding, Mamba has not demonstrated the strong ICL ability seen in large transformer-based language models. While this is not a weakness of the theoretical analysis itself, Mamba's actual ICL capability in real-world applications will influence the impact of this theoretical work on the community.
2. The real-world experiment is only on the language modeling task (e.g., perplexity on WikiText-103) instead of a dedicated, real-wo

Code & Models

Repositories

Bond1995/Markov-Mamba · PyTorch · Official

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Machine Learning and Data Classification

Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Convolution · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · ALIGN