Mimetic Initialization Helps State Space Models Learn to Recall

Asher Trockman; Hrayr Harutyunyan; J. Zico Kolter; Sanjiv Kumar; Srinadh Bhojanapalli

arXiv:2410.11135·cs.LG·October 16, 2024

Open Access·3 Reviews

TL;DR

This paper introduces a structured initialization method for state space models like Mamba, enabling them to better learn recall tasks by mimicking attention mechanisms, thus improving their training efficiency and performance.

Contribution

The paper proposes a mimetic initialization technique that helps state space models learn to recall more effectively, addressing training difficulties rather than capacity limitations.

Findings

01

Initialization improves Mamba's ability to learn copying tasks

02

State space models perform better on copying and associative recall with the proposed initialization

03

Training becomes easier and more effective for recall tasks
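
For readers unfamiliar with the copying task named in the findings, the following is a minimal sketch of how one training example could be built for next-token prediction. The separator id, the `-1` ignore marker, and the function name `make_copy_example` are assumptions made for illustration; the paper's exact data format is not given on this page.

```python
import random


def make_copy_example(seq_len, vocab_size, rng):
    """Build one (inputs, targets) pair for a copying task.

    Data tokens use ids 0..vocab_size-1. The separator id (vocab_size)
    and the ignore marker (-1) are hypothetical choices for this sketch.
    """
    sep = vocab_size
    tokens = [rng.randrange(vocab_size) for _ in range(seq_len)]

    # The model reads the sequence, then the separator, then is fed its
    # own copy one step behind (teacher forcing).
    inputs = tokens + [sep] + tokens[:-1]

    # No loss on the first half; after the separator the model must
    # reproduce the original tokens in order.
    targets = [-1] * seq_len + tokens
    return inputs, targets


if __name__ == "__main__":
    rng = random.Random(0)
    inputs, targets = make_copy_example(seq_len=5, vocab_size=10, rng=rng)
    print(inputs)
    print(targets)
```

A model with a fixed-size state must hold all `seq_len` tokens in that state before the separator arrives, which is why this task is used as a recall probe.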

Abstract

Recent work has shown that state space models such as Mamba are significantly worse than Transformers on recall-based tasks due to the fact that their state size is constant with respect to their input sequence length. But in practice, state space models have fairly large state sizes, and we conjecture that they should be able to perform much better at these tasks than previously reported. We investigate whether their poor copying and recall performance could be due in part to training difficulties rather than fundamental capacity constraints. Based on observations of their "attention" maps, we propose a structured initialization technique that allows state space layers to more readily mimic attention. Across a variety of architecture settings, our initialization makes it substantially easier for Mamba to learn to copy and do associative recall from scratch.
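
The "attention" maps mentioned in the abstract can be illustrated with a simplified, scalar-gated linear recurrence. This is a sketch under stated assumptions, not Mamba's actual parameterization: the recurrence $h_t = a_t h_{t-1} + B_t x_t$, $y_t = \langle C_t, h_t \rangle$ and the names `recurrence_outputs` and `implicit_attention` are chosen here for illustration. Unrolling the recurrence gives $y_t = \sum_{s \le t} M_{t,s} x_s$ with $M_{t,s} = \langle C_t, B_s \rangle \prod_{r=s+1}^{t} a_r$, a causal matrix that plays the role of an attention map.

```python
def recurrence_outputs(a, B, C, x):
    """Run h_t = a[t] * h_{t-1} + B[t] * x[t] and return y_t = <C[t], h_t>."""
    n = len(B[0])
    h = [0.0] * n
    ys = []
    for t in range(len(x)):
        h = [a[t] * h[i] + B[t][i] * x[t] for i in range(n)]
        ys.append(sum(C[t][i] * h[i] for i in range(n)))
    return ys


def implicit_attention(a, B, C):
    """Return M with M[t][s] = <C[t], B[s]> * prod_{r=s+1..t} a[r] for s <= t."""
    T = len(a)
    M = [[0.0] * T for _ in range(T)]
    for t in range(T):
        decay = 1.0  # empty product when s == t
        for s in range(t, -1, -1):
            M[t][s] = decay * sum(c * b for c, b in zip(C[t], B[s]))
            decay *= a[s]  # extend the product to cover r = s for the next step
    return M


if __name__ == "__main__":
    a = [0.9, 0.8, 1.0]
    B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    C = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    x = [2.0, 3.0, 5.0]
    M = implicit_attention(a, B, C)
    via_map = [sum(M[t][s] * x[s] for s in range(3)) for t in range(3)]
    print(via_map, recurrence_outputs(a, B, C, x))
```

In this simplification, gates $a_t$ close to 1 remove the decay factor, so the map reduces to the inner products $\langle C_t, B_s \rangle$, which is the unnormalized linear-attention form.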

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01·Rating 3·Confidence 5

Strengths

The paper addresses an important and timely question on how to improve recall capabilities of modern State Space Models (SSMs). The proposed initialization is thoroughly ablated with many toy experiments designed to test recall/copy abilities.

Weaknesses

While the proposed initialization scheme is sound, the novelty of the idea and technical contribution is incremental. Furthermore, the scope of the work is only limited to Mamba-like models and it is not generic to State Space Models (as stated in the title). To increase the scope of this work, given the relatively small scale experiments conducted, I’d suggest the authors extend their results to other SSM models like: GLA [5], LRU [4] and RetNet [8]. If only Mamba-like models are studied it wou

Reviewer 02·Rating 5·Confidence 4

Strengths

* The paper presents a wide array of experiments that vary state size, vocabulary size, sequence length, and embedding size to validate their results
* The paper is well written and easy to follow
* The paper provides a way to estimate the capabilities of SSM layers on copying tasks without expensive pretraining

Weaknesses

* This paper specifically trains for recall tasks, so it is unclear if this initialization scheme would lead to benefits for pretraining and not lead to performance regressions on non recall tasks
* Since hybrid models with attention layers that have inherent copying abilities combined with SSM layers are starting to become more prominent over pure SSM architectures, it is unclear if the copying abilities of the SSM components are of prime importance
* The paper does not provide any results for

Reviewer 03·Rating 5·Confidence 3

Strengths

The paper is well-written and addresses a fundamental limitation of SSMs compared to Transformers, which is still not fully understood. The proposed approach of initializing SSM layers to mimic self-attention is well-motivated, and the observed attention patterns suggest notable similarities in the behavior of the two. The authors present extensive experiments across various tasks and architectural configurations, confirming that the initialization generally improves performance. They also provi

Weaknesses

While the authors motivate the problem well, I find that they do not spend enough time discussing the broader implications of their work. From a theoretical perspective, the initialization approach is mainly heuristic, and does not provide a fully satisfactory understanding of why SSMs struggle with such tasks. On the empirical side, it is unclear whether the proposed initialization could improve performance on language-based tasks. Additionally, I have some concerns about the robustness and cla

Taxonomy

Topics·Simulation Techniques and Applications

Methods·Mamba: Linear-Time Sequence Modeling with Selective State Spaces