Revisiting associative recall in modern recurrent models
Destiny Okpekpe, Antonio Orvieto

TL;DR
This paper investigates associative recall in modern recurrent models, highlighting the importance of learning rate, contrasting scaling benefits of recurrent versus attention models, and analyzing training dynamics and architecture effects.
Contribution
It provides a detailed analysis of associative recall in recurrent models, emphasizing the role of learning rate, scaling differences, and architectural components, which were previously underexplored.
Findings
Learning rate critically affects recurrent model performance.
Attention models struggle with associative recall when limited to one layer.
Training dynamics of 1-layer transformers resemble induction head formation.
Abstract
Despite the advantageous subquadratic complexity of modern recurrent deep learning models -- such as state-space models (SSMs) -- recent studies have highlighted their potential shortcomings compared to transformers on reasoning and memorization tasks. In this paper, we dive deeper into one such benchmark: associative recall (AR), which has been shown to correlate well with language modeling performance, and inspect in detail the effects of scaling and optimization issues in recently proposed token mixing strategies. We first demonstrate that, unlike for standard transformers, the choice of learning rate plays a critical role in the performance of modern recurrent models: an issue that can severely affect reported performance in previous works and suggests further research is needed to stabilize training. Next, we show that recurrent and attention-based models exhibit contrasting…
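To make the associative recall setup concrete, here is a minimal sketch of how one toy AR sample could be built: a context of key-value pairs followed by queries whose answers are the paired values. The function name `make_ar_example`, its parameters, and the token layout are illustrative assumptions, not the paper's actual data pipeline.

```python
import random


def make_ar_example(num_pairs=4, num_queries=2, vocab_size=32, seed=0):
    """Build one toy associative-recall sample (hypothetical format).

    Keys are drawn from the lower half of the vocabulary and values from
    the upper half, so a token's role is never ambiguous.
    """
    rng = random.Random(seed)
    half = vocab_size // 2
    keys = rng.sample(range(half), num_pairs)  # distinct keys
    values = [rng.randrange(half, vocab_size) for _ in keys]
    mapping = dict(zip(keys, values))

    # Context layout: key_1 value_1 key_2 value_2 ...
    context = [tok for pair in zip(keys, values) for tok in pair]

    # Each query is a key seen in the context; the target is its value.
    queries = rng.sample(keys, num_queries)
    targets = [mapping[q] for q in queries]
    return context, queries, targets


context, queries, targets = make_ar_example()
print("context:", context)
print("queries:", queries)
print("targets:", targets)
```

A model solves the sample if, for every query token, it predicts the value that followed that key in the context. Using several queries per sequence loosely mirrors the multi-query (MQAR) variant mentioned in the reviews.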
Peer Reviews
Decision: Submitted to ICLR 2026
1. This paper demonstrates the differences between SSM and transformer architectures from the perspective of model optimization and stability, providing a new dimension for understanding and analyzing SSM models.
2. The experimental analysis is very interesting, especially in exploring in-context recall-intensive tasks that are a key focus of these linear architectures, which may bring important insights to the development of the RNN community.
3. The paper is well-written, clearly structured, a
1. The paper doesn't adequately explain why even a single-layer Mamba without conv1d can still perform well on MQAR. It would be better to analyze the formation mechanism of induction heads in single-layer recurrent models.
2. The paper presents the various differences between the transformer and SSM in a fragmented manner, seemingly lacking a unified analytical explanation. Is there a connection between the optimization instability, width/depth scaling behavior, and induction heads phenomenon m
### Originality
* Re-frames SSM vs. Transformer comparisons through learnability: finds critical LR-sensitivity for modern SSMs on MQAR/copying, contrasting with robust Transformers.
### Quality
* Executes large LR sweeps and reports results (3 seeds) that expose narrow “goldilocks” regions for Mamba/Hyena, and wide basins for attention.
* Provides scaling experiments showing width helps SSMs while depth helps Transformers; includes ablations (conv on Q/K/V; gating) that isolate drivers.
1. **Most results center on 1–2-layer models on synthetic tasks**. It’s unclear if the LR brittleness persists for deeper stacks (e.g., 12–24 layers) or on standard LM tasks. Adding at least one real-world benchmark (e.g., small LM perplexity) would be better.
2. Many plots average over 3 seeds; for **stability** claims, 3 is thin.
3. Modern recurrent models encompass many SSM (and non-SSM) variants; so the paper should name explicit implementations used in each figure (e.g., Mamba, Hyena, R
1. Understanding the optimization difference between SSMs and Transformers is an important and timely research direction.
2. The authors conduct extensive experiments, with a fine-grained grid search over learning rate hyperparameters.
1. Although the paper presents empirical evidence on the optimization difference between SSMs and Transformers, it lacks theoretical analysis or discussions to explain such observations. Some theoretical grounding can also be found in [1] [2] (which give concrete constructions on 2-layer Transformers and 1-layer Mamba solving associative recall, respectively).
2. Some claims are not well supported, such as the optimization difference in SSM and transformers (see Question 1) and the benefit of w
Taxonomy
Topics: Speech and dialogue systems
