Memory Caching: RNNs with Growing Memory

Ali Behrouz; Zeman Li; Yuan Deng; Peilin Zhong; Meisam Razaviyayn; Vahab Mirrokni

arXiv:2602.24281·cs.LG·March 2, 2026


Open Access · 3 Models · 3 Reviews

TL;DR

This paper introduces Memory Caching (MC), a technique that enhances recurrent neural networks by allowing their memory capacity to grow with sequence length, bridging the performance gap with Transformers in recall tasks.

Contribution

The paper proposes Memory Caching (MC), a novel method that enables RNNs to have scalable memory, improving their performance on long-context tasks and reducing the gap with Transformers.

Findings

1. MC improves RNN performance on language modeling tasks.
2. MC variants achieve competitive results in recall tasks.
3. Memory capacity growth enhances RNNs' ability to handle long sequences.

Abstract

Transformers have been established as the de-facto backbones for most recent advances in sequence modeling, mainly due to their growing memory capacity that scales with the context length. While plausible for retrieval tasks, it causes quadratic complexity and so has motivated recent studies to explore viable subquadratic recurrent alternatives. Despite showing promising preliminary results in diverse domains, such recurrent architectures underperform Transformers in recall-intensive tasks, often attributed to their fixed-size memory. In this paper, we introduce Memory Caching (MC), a simple yet effective technique that enhances recurrent models by caching checkpoints of their memory states (a.k.a. hidden states). Memory Caching allows the effective memory capacity of RNNs to grow with sequence length, offering a flexible trade-off that interpolates between the fixed memory (i.e.,…
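The caching idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear-attention-style update `memory += outer(v, k)`, the fixed `segment_len`, and the averaging in `read` are all assumptions chosen for brevity, and every function name here is hypothetical. The sketch only shows that the number of stored states grows with sequence length while the per-step recurrence stays unchanged.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def outer(v: Vector, k: Vector) -> Matrix:
    """Outer product v k^T, used as a toy associative-memory write."""
    return [[vi * kj for kj in k] for vi in v]


def matvec(m: Matrix, q: Vector) -> Vector:
    """Read from a memory matrix with query q."""
    return [sum(mij * qj for mij, qj in zip(row, q)) for row in m]


def run_with_cache(
    keys: List[Vector], values: List[Vector], segment_len: int
) -> Tuple[Matrix, List[Matrix]]:
    """Run a toy linear recurrence and checkpoint its state.

    Assumption: a checkpoint is taken every `segment_len` tokens and the
    running state is not reset afterwards.
    """
    d_v, d_k = len(values[0]), len(keys[0])
    memory: Matrix = [[0.0] * d_k for _ in range(d_v)]
    cache: List[Matrix] = []
    for t, (k, v) in enumerate(zip(keys, values), start=1):
        update = outer(v, k)
        memory = [
            [m + u for m, u in zip(m_row, u_row)]
            for m_row, u_row in zip(memory, update)
        ]
        if t % segment_len == 0:
            # Store a copy so later updates cannot alter the checkpoint.
            cache.append([row[:] for row in memory])
    return memory, cache


def read(memory: Matrix, cache: List[Matrix], query: Vector) -> Vector:
    """Hypothetical aggregation: average the readouts of all stored states."""
    states = cache + [memory]
    readouts = [matvec(state, query) for state in states]
    return [sum(col) / len(readouts) for col in zip(*readouts)]
```

With a sequence of length `T`, this sketch stores `T // segment_len` checkpoints, so `segment_len = 1` keeps one state per token, while a `segment_len` larger than `T` keeps none and falls back to a single fixed-size state.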

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. Designing a trade-off architecture between the computational complexity of RNNs and transformers is a valuable research area.
2. The proposed method is simple yet effective, and appears to be applicable to various RNN architectures.
3. The paper is clearly structured and includes sufficient experiments.

Weaknesses

1. The paper seems to lack analysis on the impact of the number of input sequence segments on model performance. This could help us identify the trade-off between the complexity of RNN and transformer architectures and explore better memory caching lengths when the model performs sequence modeling.
2. A characteristic of RNN model architectures is their potential to generalize to longer sequences, but the paper does not seem to focus on the performance of the proposed method in terms of length e

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The results are well presented and the method is intuitive to follow. Experimental results show cases where improvements are significant.

Weaknesses

- The experiments are missing a rather important training and inference efficiency comparison of using the cached memories. In general, if the caching is done at a fixed length (suppose $L$), then the memory will grow linearly with the sequence length, potentially limiting the ability of the memory to train on longer contexts as well as conduct significantly longer inference compared to alternative linear models, which normally operate in $O(1)$ memory at inference time.
- The main area where re

Reviewer 03 · Rating 4 · Confidence 5

Strengths

1. The paper effectively addresses the bottleneck of fixed-size memory in RNNs by proposing a mechanism that allows memory to grow with the sequence length, providing a flexible middle ground between traditional RNNs and Transformer models.
2. The MC framework, especially with Gated Residual Memory and Sparse Selective Caching, significantly enhances the retrieval ability of RNNs, improving performance on tasks where sequence history plays a crucial role.

Weaknesses

1. The proposed GRM still leads to quadratic complexity in the worst case when applied to long sequences, making it no better than vanilla attention in terms of computational complexity. If SWA (Sliding Window Attention) is treated as an RNN, SWA+GRM essentially reduces to vanilla attention, undermining the claim of significant computational efficiency improvements over attention-based approaches like Transformers.
2. Despite the enhancement in memory efficiency and retrieval performance, the
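
The worst-case concern in point 1 can be made concrete with a rough count. The assumptions here are not taken from the paper: a checkpoint is stored every $L$ tokens over a sequence of length $T$, and every token reads its current state plus all checkpoints cached so far. The total number of state reads is then

$$
\sum_{t=1}^{T} \left( 1 + \left\lfloor \frac{t}{L} \right\rfloor \right) \;\le\; T + \frac{T(T+1)}{2L} \;=\; O\!\left( T + \frac{T^{2}}{L} \right).
$$

Under these assumptions, $L = 1$ gives $\Theta(T^{2})$ reads, matching the quadratic cost the reviewer describes, while $L = T$ gives $O(T)$ reads. A scheme that reads only a subset of the cached checkpoints would fall outside this count.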

Taxonomy

Topics: Topic Modeling · Information Retrieval and Search Behavior · Natural Language Processing Techniques