From Markov to Laplace: How Mamba In-Context Learns Markov Chains
Marco Bondaschi, Nived Rajaraman, Xiuying Wei, Kannan Ramchandran, Razvan Pascanu, Caglar Gulcehre, Michael Gastpar, Ashok Vardhan Makkuva

TL;DR
This paper reveals that Mamba, a structured state space model, can efficiently learn optimal Laplacian smoothing for Markov chains through in-context learning, supported by theoretical and empirical analysis.
Contribution
It provides the first formal connection between Mamba's architecture and optimal statistical estimators for Markov chains, highlighting the role of convolution.
Findings
Mamba learns the in-context Laplacian smoothing estimator for all Markovian orders.
Theoretical characterization of Mamba's representation capacity explains its effectiveness.
Empirical results confirm Mamba's ability to learn optimal estimators efficiently.
Abstract
While transformer-based language models have driven the AI revolution thus far, their computational complexity has spurred growing interest in viable alternatives, such as structured state space sequence models (SSMs) and Selective SSMs. Among these, Mamba (S6) and its variant Mamba-2 have shown remarkable inference speed ups over transformers while achieving comparable or superior performance on complex language modeling tasks. However, despite these architectural innovations and empirical successes, the fundamental learning capabilities of Mamba remain poorly understood. In this paper, we address this gap by studying in-context learning (ICL) on Markov chains and uncovering a surprising phenomenon: unlike transformers, even a single-layer Mamba efficiently learns the in-context Laplacian smoothing estimator, which is both Bayes and minimax optimal, for all Markovian orders. To explain…
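For readers unfamiliar with the estimator named in the abstract, the following is a minimal Python sketch of add-constant (Laplace) smoothing for an order-k Markov chain, computed from a single observed sequence. It is not the paper's code or Mamba's mechanism. The function name `laplace_next_symbol_probs`, its parameters, and the default `beta=1.0` (the classical rule of succession) are assumptions made for illustration; the excerpt above does not state the exact smoothing constant the paper uses.

```python
def laplace_next_symbol_probs(sequence, order=1, vocab_size=2, beta=1.0):
    """Add-beta estimate of P(next symbol | last `order` symbols).

    `sequence` is a list of integers in range(vocab_size). All names and
    defaults here are illustrative assumptions, not taken from the paper.
    """
    if len(sequence) < order:
        # Not enough context observed yet: fall back to the uniform distribution.
        return [1.0 / vocab_size] * vocab_size

    # The current context is the last `order` symbols (empty tuple if order == 0).
    context = tuple(sequence[len(sequence) - order:])

    # Count which symbols followed earlier occurrences of the same context.
    counts = [0] * vocab_size
    for i in range(order, len(sequence)):
        if tuple(sequence[i - order:i]) == context:
            counts[sequence[i]] += 1

    # Add beta pseudo-counts to every symbol and normalize.
    total = sum(counts)
    return [(c + beta) / (total + beta * vocab_size) for c in counts]


# Usage: in the binary sequence below, 0 was followed by 1 three times,
# so with beta = 1 the estimate is (0 + 1) / (3 + 2) and (3 + 1) / (3 + 2).
probs = laplace_next_symbol_probs([0, 1, 0, 1, 0, 1, 0], order=1, vocab_size=2)
print(probs)  # [0.2, 0.8]
```

The pseudo-count keeps every probability strictly positive, so a transition that has not yet appeared in the context is never assigned probability zero.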
Peer Reviews
Decision·ICLR 2026 Oral
- I found the idea interesting! Thank you for sharing this from a perspective that is intriguing and different to the typical “we ran a bunch of experiments on the same benchmarks and boosted performance by X%” kind of LM study. It is positive to see Mamba models being connected with optimal statistical estimators.
- The paper is well written. Even though it may not be easy to read on a cursory look (I needed to go back and forth to see symbols and notation frequently), it does contain all the
- For the non-Markovian setting in Section 5, (ppl on WikiText103 being the only result), I feel like the evaluation could have been a bit more diverse and rigorous. This is the setting that I assume most practitioners (for better or worse) would be interested in, and it may have been useful to do this over a slightly wider range of model sizes (number of layers, dimensionality etc) to help build better intuition.
- There is a lot of related work on architectures that do sub-quadratic attention
- The paper is clearly written and well motivated.
- The methodology considered in this study is sound and made explicit: Building on the empirical observation that the convolution is the key ingredient in the ICL abilities of Mamba, the authors construct a minimal architecture that allows a theoretical analysis all while keeping key similarities with the original architecture. This allows the derivation of sound theoretical results characterizing Mamba's ability to learn in-context, as function
- Theorem 1 is insightful in proving the in-context ability of Mamba. However, in my understanding it only tackles the model misspecification problem, in the sense that it proves that the optimal solution can be represented by the model (zero approximation error). I think it would be interesting to discuss the intersections between the theoretically chosen parameters that achieve the result in Theorem 1, and those actually obtained by optimizing Eq 3 on a given training set (with or without dist
1. This is the first work to investigate theoretically the ICL ability of the Mamba architecture.
2. The paper is well-written; the logic and arrangement are easy to follow, and the author provides enough background in the main text to help readers understand the scientific question being studied. It starts from an empirical observation, finds convolution plays an important role, and then echoes this importance in the theoretical analysis.
3. The paper is perfectly designed and written to shed
1. One possible weakness is that, to the best of our current understanding, Mamba has not demonstrated the strong ICL ability seen in large transformer-based language models. While this is not a weakness of the theoretical analysis itself, Mamba's actual ICL capability in real-world applications will influence the impact of this theoretical work on the community.
2. The real-world experiment is only on the language modeling task (e.g., perplexity on WikiText-103) instead of a dedicated, real-wo
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Machine Learning and Data Classification
Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Convolution · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · ALIGN
