ATLAS: Learning to Optimally Memorize the Context at Test Time
Ali Behrouz, Zeman Li, Praneeth Kacham, Majid Daliri, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, Vahab Mirrokni

TL;DR
ATLAS introduces a novel long-term memory module that enhances context memorization in sequence models, surpassing Transformers and recurrent models in long-context tasks, with significant improvements demonstrated across various benchmarks.
Contribution
The paper proposes ATLAS, a high-capacity memory module that learns to memorize context by optimizing memory with current and past tokens, and introduces DeepTransformers, a new family of Transformer-like architectures.
Findings
ATLAS outperforms Transformers and recurrent models in long-context tasks.
Achieves +80% accuracy at a 10M context length on the BABILong benchmark.
Improves long-term context understanding in various language tasks.
Abstract
Transformers have been established as the most popular backbones in sequence modeling, mainly due to their effectiveness in in-context retrieval tasks and the ability to learn at scale. Their quadratic memory and time complexity, however, bound their applicability in longer sequences and so have motivated researchers to explore effective alternative architectures such as modern recurrent neural networks (a.k.a. long-term recurrent memory modules). Despite their recent success in diverse downstream tasks, they struggle in tasks that require long context understanding and extrapolation to longer sequences. We observe that these shortcomings come from three disjoint aspects in their design: (1) limited memory capacity that is bounded by the architecture of memory and feature mapping of the input; (2) online nature of update, i.e., optimizing the memory only with respect to the last input;…
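As a rough illustration of point (2) in the abstract, the sketch below contrasts an online update, which takes a gradient step on the most recent token only, with an update over a sliding window of the last c tokens. It is a minimal sketch under simplifying assumptions and not the paper's implementation: the memory is a plain matrix trained by vanilla gradient descent on a squared-error loss, and every name in it (`matvec`, `window_loss`, `update_memory`, `run`, `KEYS`, `VALUES`) is hypothetical.

```python
# Toy comparison: online (c = 1) vs. sliding-window (c > 1) memory updates.
# Assumption: a linear memory M that maps keys to values, trained with plain
# gradient descent on squared error. Names and data are illustrative only.

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def window_loss(M, keys, values):
    """Sum of squared errors ||M k - v||^2 over the given (key, value) pairs."""
    total = 0.0
    for k, v in zip(keys, values):
        pred = matvec(M, k)
        total += sum((p - target) ** 2 for p, target in zip(pred, v))
    return total

def update_memory(M, keys, values, t, c, lr):
    """One gradient step on the loss over the last c tokens ending at position t."""
    start = max(0, t - c + 1)
    new_M = [row[:] for row in M]
    for k, v in zip(keys[start:t + 1], values[start:t + 1]):
        # Error is computed with the memory as it was before this step.
        err = [p - target for p, target in zip(matvec(M, k), v)]
        for i, e in enumerate(err):
            for j, k_j in enumerate(k):
                # Gradient of ||M k - v||^2 with respect to M[i][j] is 2 * e_i * k_j.
                new_M[i][j] -= lr * 2.0 * e * k_j
    return new_M

def run(keys, values, c, lr=0.1):
    """Process the sequence once, updating the memory at every position."""
    d_v, d_k = len(values[0]), len(keys[0])
    M = [[0.0] * d_k for _ in range(d_v)]
    for t in range(len(keys)):
        M = update_memory(M, keys, values, t, c, lr)
    return M

# Three tokens with orthonormal keys, so the stored associations do not interfere.
KEYS = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
VALUES = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

loss_online = window_loss(run(KEYS, VALUES, c=1), KEYS, VALUES)
loss_window = window_loss(run(KEYS, VALUES, c=3), KEYS, VALUES)
print("loss with c=1:", loss_online)
print("loss with c=3:", loss_window)
```

With orthonormal keys, updates for different tokens do not interfere, so the only difference between the two runs is how many gradient steps each token receives: one with c = 1, and up to three with c = 3. That is why the windowed run ends with the lower loss in this toy setting.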
Peer Reviews
Decision · Submitted to ICLR 2026
1. Theory is interesting and clearly framed. The capacity analysis (matrix vs deep memory; effect of polynomial mappings) and the Omega sliding-window objective are neat, well-motivated contributions.
2. Broad empirical sweep with solid baselines from modern RNN families; results on RULER NIAH and standard LM/CS tasks are competitive, often best among non-hybrids.
1. Key claims are not directly stress-tested. The headline contribution - “a long-term neural memory module with high capacity and the ability to memorize the context, instead of tokens” - is asserted, but experiments do not isolate these mechanisms. Per-benchmark wins and one ablation (showing c=1 hurts) are helpful, yet there is no targeted evaluation that would uniquely validate context-level memorization vs token-wise memorization (e.g., tasks constructed so token-level memorization provably
- Nice idea and clean motivation. The paper clearly explains the problems with memory state capacity, state update locality, and state optimization and how Atlas addresses them.
- Systematization of prior work. It provides a helpful, unified perspective on earlier memory modules through attentional bias and test-time optimization (particularly Table 3).
- Results on language modeling, language understanding and needle-in-a-haystack tasks show strong performance. Ablation study supports design cho
- Impact of polynomial mapping (value of p) on benchmark results is not clear, especially on long-context ones and MAD tasks, where larger memory capacity should show its benefit.
- Features claimed for Omega (sliding window, global context-aware memory updates) are already in Block-Recurrent Transformers, RMT, ARMT, and MELODI, which use the Transformer as a recurrent cell. Therefore, the novelty in this case is unclear. The text still lacks this discussion.
- On BABILong, results are shown (F
The paper has made contributions with a focus on long-term tasks for recurrent architecture memories. The investigation into expressivity seems to be the main novelty of the paper since it contains a novel technical solution (polynomial kernel) with solid theoretical motivations. The sliding window idea has already been implemented in architectures such as MAG and MAL but the paper makes further steps to integrate the idea into a single architecture. All the contributions have been tested not
From Table 2, Muon does not offer great improvement and it seems this is not a central contribution of the paper. If my understanding is correct, it seems this part can be removed from the main section of the paper. Maybe this is more of a question: why are Atlas++ results not shown in Table 2?
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Linear Layer · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Multi-Head Attention · Attention Is All You Need · Layer Normalization · Byte Pair Encoding
