Mimetic Initialization Helps State Space Models Learn to Recall
Asher Trockman, Hrayr Harutyunyan, J. Zico Kolter, Sanjiv Kumar, Srinadh Bhojanapalli

TL;DR
This paper introduces a structured initialization method for state space models like Mamba, enabling them to better learn recall tasks by mimicking attention mechanisms, thus improving their training efficiency and performance.
Contribution
The paper proposes a mimetic initialization technique that helps state space models learn to recall more effectively, addressing training difficulties rather than capacity limitations.
Findings
Initialization improves Mamba's ability to learn copying tasks
State space models perform better with the proposed initialization
Training becomes easier and more effective for recall tasks
Abstract
Recent work has shown that state space models such as Mamba are significantly worse than Transformers on recall-based tasks due to the fact that their state size is constant with respect to their input sequence length. But in practice, state space models have fairly large state sizes, and we conjecture that they should be able to perform much better at these tasks than previously reported. We investigate whether their poor copying and recall performance could be due in part to training difficulties rather than fundamental capacity constraints. Based on observations of their "attention" maps, we propose a structured initialization technique that allows state space layers to more readily mimic attention. Across a variety of architecture settings, our initialization makes it substantially easier for Mamba to learn to copy and do associative recall from scratch.
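The "attention" maps mentioned in the abstract can be illustrated with a generic selective linear recurrence. The sketch below is not the authors' initialization or Mamba's actual parameterization. It assumes a simplified single-channel recurrence h_t = a_t * h_{t-1} + b_t * x_t, y_t = c_t · h_t, with a scalar decay a_t and state-sized vectors b_t and c_t (all names here are illustrative). Unrolling the recurrence gives y = M x with M[t][s] = (c_t · b_s) * a_{s+1} * ... * a_t for s <= t, so when every a_t is close to 1 the map reduces to causally masked dot products between c_t and b_s, which resembles unnormalized linear attention.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def run_recurrence(a, B, C, x):
    """Run h_t = a_t * h_{t-1} + b_t * x_t and return y_t = c_t . h_t.

    a: list of T scalar decays; B, C: lists of T vectors of length N
    (N is the fixed state size); x: list of T scalar inputs.
    """
    h = [0.0] * len(B[0])
    ys = []
    for a_t, b_t, c_t, x_t in zip(a, B, C, x):
        h = [a_t * h_i + b_i * x_t for h_i, b_i in zip(h, b_t)]
        ys.append(dot(c_t, h))
    return ys


def implicit_attention(a, B, C):
    """Return the lower-triangular map M such that y = M x.

    M[t][s] = (c_t . b_s) * prod(a[s+1 .. t]) for s <= t, and 0 otherwise.
    """
    T = len(a)
    M = [[0.0] * T for _ in range(T)]
    for t in range(T):
        decay = 1.0  # empty product when s == t
        for s in range(t, -1, -1):
            M[t][s] = decay * dot(C[t], B[s])
            decay *= a[s]  # include a[s] before moving to position s - 1
    return M


def apply_map(M, x):
    """Multiply the T x T map M by the input sequence x."""
    return [dot(row, x) for row in M]
```

Comparing `apply_map(implicit_attention(a, B, C), x)` with `run_recurrence(a, B, C, x)` on small inputs confirms that the two views agree. Setting all entries of `a` to 1.0 shows the masked dot-product structure directly.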
Peer Reviews
Decision: Submitted to ICLR 2025
The paper addresses an important and timely question on how to improve recall capabilities of modern State Space Models (SSMs). The proposed initialization is thoroughly ablated with many toy experiments designed to test recall/copy abilities.
While the proposed initialization scheme is sound, the novelty of the idea and technical contribution is incremental. Furthermore, the scope of the work is only limited to Mamba-like models and it is not generic to State Space Models (as stated in the title). To increase the scope of this work, given the relatively small scale experiments conducted, I’d suggest the authors extend their results to other SSM models like: GLA [5], LRU [4] and RetNet [8]. If only Mamba-like models are studied it wou
* The paper presents a wide array of experiments that vary state size, vocabulary size, sequence length, and embedding size to validate their results
* The paper is well written and easy to follow
* The paper provides a way to estimate the capabilities of SSM layers on copying tasks without expensive pretraining
* This paper specifically trains for recall tasks, so it is unclear if this initialization scheme would lead to benefits for pretraining and not lead to performance regressions on non-recall tasks
* Since hybrid models with attention layers that have inherent copying abilities combined with SSM layers are starting to become more prominent over pure SSM architectures, it is unclear if the copying abilities of the SSM components are of prime importance
* The paper does not provide any results for
The paper is well-written and addresses a fundamental limitation of SSMs compared to Transformers, which is still not fully understood. The proposed approach of initializing SSM layers to mimic self-attention is well-motivated, and the observed attention patterns suggest notable similarities in the behavior of the two. The authors present extensive experiments across various tasks and architectural configurations, confirming that the initialization generally improves performance. They also provi
While the authors motivate the problem well, I find that they do not spend enough time discussing the broader implications of their work. From a theoretical perspective, the initialization approach is mainly heuristic, and does not provide a fully satisfactory understanding of why SSMs struggle with such tasks. On the empirical side, it is unclear whether the proposed initialization could improve performance on language-based tasks. Additionally, I have some concerns about the robustness and cla
Taxonomy
Topics: Simulation Techniques and Applications
Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces
