Towards Compressive and Scalable Recurrent Memory
Yunchong Song, Jushi Kai, Liming Lu, Kaixi Qiu, Zhouhan Lin

TL;DR
Elastic Memory introduces a novel, scalable recurrent memory architecture based on the HiPPO framework, enabling efficient long-context processing and outperforming existing methods in memory usage and speed.
Contribution
The paper presents Elastic Memory, a new memory architecture that applies optimal online compression and flexible sampling, improving long-context modeling beyond current recurrent memory approaches.
Findings
Outperforms baselines on 32k+ long-context datasets
Uses 16x less memory than Memorizing Transformer at equal parameters
Faster and more effective than Melodi at larger scales
Abstract
Transformers face a quadratic bottleneck in attention when scaling to long contexts. Recent approaches introduce recurrent memory to extend context beyond the current window, yet these often face a fundamental trade-off between theoretical principles and practical scalability. To address this, we introduce Elastic Memory, a novel memory architecture grounded in the HiPPO framework for online function approximation. Elastic Memory treats historical sequence as samples from continuous signals, applying optimal online compression to encode them into a fixed-size memory state. For retrieval, we propose a flexible polynomial sampling mechanism that reconstructs a history summary from this compressed state. Elastic Memory consistently outperformed baselines on long-context (32k+) datasets across three domains. With equal parameters, it beat Memorizing Transformer by 16x memory and…
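As a rough illustration of the compress-then-sample idea in the abstract, the sketch below applies the standard HiPPO-LegS recurrence to a scalar signal and then reads the compressed state back by evaluating the stored Legendre expansion at chosen positions. It is not the paper's implementation: the backward Euler discretization, the function names (`legs_matrices`, `compress`, `sample_history`), the state size `n=8`, and the uniform sampling positions are all assumptions made for illustration, since this summary does not specify the actual architecture or sampling schedule.

```python
import numpy as np
from numpy.polynomial import legendre


def legs_matrices(n):
    """HiPPO-LegS transition matrix A (n x n) and input vector B (n,)."""
    q = np.sqrt(2 * np.arange(n) + 1.0)
    # A[i, j] = sqrt(2i+1) * sqrt(2j+1) for i > j, i + 1 on the diagonal, 0 above.
    A = np.tril(np.outer(q, q), k=-1) + np.diag(np.arange(n) + 1.0)
    return A, q


def compress(signal, n=8):
    """Fold a 1-D sequence into n coefficients, one sample at a time."""
    A, B = legs_matrices(n)
    eye = np.eye(n)
    c = np.zeros(n)
    for k, f_k in enumerate(signal, start=1):
        # Backward Euler step of dc/dt = -(1/t) A c + (1/t) B f at t = k
        # (hypothetical choice; other discretizations are possible).
        c = np.linalg.solve(eye + A / k, c + B * f_k / k)
    return c


def sample_history(c, positions):
    """Evaluate the stored expansion at relative positions in [0, 1]
    (0 = oldest input, 1 = newest input)."""
    scale = np.sqrt(2 * np.arange(len(c)) + 1.0)
    return legendre.legval(2 * np.asarray(positions) - 1, c * scale)


signal = np.sin(np.linspace(0.0, 3.0, 2000))
state = compress(signal, n=8)                       # fixed size, independent of length
summary = sample_history(state, np.linspace(0.0, 1.0, 16))
```

The state stays at `n` numbers however long the input is, and the sampling positions can be changed at read time without recomputing the state. Under this particular discretization the first coefficient equals the sum of the inputs divided by (length + 1), so it tracks the running mean of the history.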
Peer Reviews
Decision · Submitted to ICLR 2026
- Casting memory as HiPPO-based online compression gives a clear mathematical objective (optimal incremental polynomial projection) rather than an ad-hoc summarization heuristic; this grounds architecture choices in prior theory.
- The paper reports state-of-the-art performance on multiple 32k+ datasets, beating the Memorizing Transformer by large memory-efficiency margins and outperforming Melodi across memory sizes (including when Melodi has more parameters). These are high-impact claims if r
- The HiPPO projection order/size (N), the polynomial sampling schedule, and the weighting/measure choices can materially change performance. The paper gives strong aggregate wins, but I want systematic ablations (sensitivity to N, sampling density, training stability, and memory reconstruction quality metrics) to show the results are robust and not brittle.
- The paper makes speed and memory-efficiency claims (e.g., 16× memory advantage; 50% faster at 4× scale), but details on hardware, batch siz
(1) The use of the HiPPO framework provides a solid mathematical basis for memory compression, moving beyond heuristic or ad hoc designs.
(2) Elastic Memory demonstrates state-of-the-art performance across multiple long-context benchmarks, outperforming strong baselines in both accuracy and efficiency.
(3) The architecture achieves its gains without adding extra trainable parameters, making it attractive for practical deployment.
(4) The ability to inject inductive biases at test time v
(1) While the experiments are comprehensive within the long-context language modeling domain, the method is not evaluated on other important settings such as vision or multimodal tasks (e.g., vision-language models), which typically need long context windows because high-resolution images produce a large number of image tokens.
(2) The method is evaluated on models trained from scratch using metrics such as loss and perplexity; its effectiveness when integrated into specific downstream tasks remains to be d
The paper is well-written and clear, with well-defined algorithms and equations. Unfortunately, I am not an expert in HiPPO, or state-space models in general, so it is hard for me to assess novelty and contribution.
The biggest weakness, as far as I am concerned, pertains to the writing style. The method section presents a mathematical progression from HiPPO to elastic memory, but has almost no citations within it. Thus, I cannot tell which parts of the math are novel derivations done by the authors, and which parts are drawn from prior literature. I personally am not an expert in HiPPO or state space models, so I cannot determine contribution without citations. The format of the equations differs enoug
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Ferroelectric and Negative Capacitance Devices · Generative Adversarial Networks and Image Synthesis
