ATLAS: Learning to Optimally Memorize the Context at Test Time

Ali Behrouz; Zeman Li; Praneeth Kacham; Majid Daliri; Yuan Deng; Peilin Zhong; Meisam Razaviyayn; Vahab Mirrokni

arXiv:2505.23735 · cs.CL · May 30, 2025

TL;DR

ATLAS introduces a novel long-term memory module that enhances context memorization in sequence models, surpassing Transformers and recurrent models in long-context tasks, with significant improvements demonstrated across various benchmarks.

Contribution

The paper proposes ATLAS, a high-capacity memory module that learns to memorize context by optimizing memory with current and past tokens, and introduces DeepTransformers, a new family of Transformer-like architectures.
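A minimal sketch of what "optimizing memory with current and past tokens" could look like. It assumes a linear matrix memory, a squared-error recall loss, uniform weights over the window, and plain gradient descent. These choices, and the names `window_loss`, `window_step`, and `memorize`, are illustrative assumptions and not the paper's implementation. Setting `window=1` reduces to a purely online update that only sees the last token.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def window_loss(M, keys, values):
    """Sum of squared recall errors ||M k - v||^2 over the given pairs."""
    total = 0.0
    for k, v in zip(keys, values):
        pred = matvec(M, k)
        total += sum((p - t) ** 2 for p, t in zip(pred, v))
    return total


def window_step(M, keys, values, lr):
    """One gradient-descent step on window_loss; returns a new matrix."""
    grad = [[0.0] * len(M[0]) for _ in M]
    for k, v in zip(keys, values):
        err = [p - t for p, t in zip(matvec(M, k), v)]
        for i, e in enumerate(err):
            for j, kj in enumerate(k):
                grad[i][j] += 2.0 * e * kj  # d/dM of ||M k - v||^2
    return [[m - lr * g for m, g in zip(mrow, grow)]
            for mrow, grow in zip(M, grad)]


def memorize(keys, values, window, lr=0.1):
    """Stream the tokens; at step t, update on the last `window` tokens.

    Assumption: uniform weights inside the window (hypothetical simplification).
    """
    d_v, d_k = len(values[0]), len(keys[0])
    M = [[0.0] * d_k for _ in range(d_v)]
    for t in range(len(keys)):
        lo = max(0, t - window + 1)
        M = window_step(M, keys[lo:t + 1], values[lo:t + 1], lr)
    return M


# Toy usage: two orthonormal keys, scalar values.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0], [2.0]]
online = memorize(keys, values, window=1, lr=0.5)
windowed = memorize(keys, values, window=2, lr=0.5)
```

With orthonormal keys the two settings end up with the same memory, because the keys do not interfere. The window only matters when keys overlap, since the windowed step then keeps correcting the recall error on recent tokens instead of fitting the newest one alone.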

Findings

1. ATLAS outperforms Transformers and recurrent models on long-context tasks.
2. Achieves +80% accuracy at a 10M context length on the BABILong benchmark.
3. Improves long-term context understanding across various language tasks.

Abstract

Transformers have been established as the most popular backbones in sequence modeling, mainly due to their effectiveness in in-context retrieval tasks and their ability to learn at scale. Their quadratic memory and time complexity, however, limits their applicability to longer sequences and has motivated researchers to explore effective alternative architectures such as modern recurrent neural networks (a.k.a. long-term recurrent memory modules). Despite their recent success in diverse downstream tasks, these recurrent models struggle in tasks that require long-context understanding and extrapolation to longer sequences. We observe that these shortcomings come from three disjoint aspects in their design: (1) limited memory capacity that is bounded by the architecture of memory and feature mapping of the input; (2) online nature of update, i.e., optimizing the memory only with respect to the last input;…
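Point (1) ties memory capacity to the feature mapping of the input, and the reviews on this page refer to polynomial mappings with a degree p. The sketch below is a hedged illustration only. It assumes a plain monomial expansion with no learned coefficients, which may differ from the paper's exact mapping. It expands a key into all monomials of total degree at most p and reports the resulting feature dimension. A linear memory over those features can exactly store at most that many linearly independent feature vectors, so the dimension acts as a capacity ceiling. The names `poly_features` and `feature_dim` are hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb, prod


def poly_features(x, p):
    """Return all monomials of the entries of x with total degree 0..p."""
    feats = []
    for degree in range(p + 1):
        # Each multiset of indices of size `degree` is one monomial.
        for idx in combinations_with_replacement(range(len(x)), degree):
            feats.append(prod(x[i] for i in idx))  # empty product is 1
    return feats


def feature_dim(d, p):
    """Number of monomials of degree <= p in d variables: C(d + p, p)."""
    return comb(d + p, p)


# Feature dimension for a 64-dimensional key at a few degrees.
for p in (1, 2, 3):
    print(p, feature_dim(64, p))
```

The count grows roughly like d^p for fixed p, which is one way to read the claim that capacity depends on the feature mapping and not only on the size of the memory matrix.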

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. Theory is interesting and clearly framed. The capacity analysis (matrix vs deep memory; effect of polynomial mappings) and the Omega sliding-window objective are neat, well-motivated contributions.
2. Broad empirical sweep with solid baselines from modern RNN families; results on RULER NIAH and standard LM/CS tasks are competitive, often best among non-hybrids.

Weaknesses

1. Key claims are not directly stress-tested. The headline contribution - "a long-term neural memory module with high capacity and the ability to memorize the context, instead of tokens" - is asserted, but experiments do not isolate these mechanisms. Per-benchmark wins and one ablation (showing c=1 hurts) are helpful, yet there is no targeted evaluation that would uniquely validate context-level memorization vs token-wise memorization (e.g., tasks constructed so token-level memorization provably

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- Nice idea and clean motivation. The paper clearly explains the problems with memory state capacity, state update locality, and state optimization and how Atlas addresses them.
- Systematization of prior work. It provides a helpful, unified perspective on earlier memory modules through attentional bias and test-time optimization (particularly Table 3).
- Results on language modeling, language understanding and needle-in-a-haystack tasks show strong performance. Ablation study supports design cho

Weaknesses

- The impact of the polynomial mapping (value of p) on benchmark results is not clear, especially on long-context benchmarks and MAD tasks, where larger memory capacity should show its benefit.
- Features claimed for Omega (sliding window, global context-aware memory updates) are already in Block-Recurrent Transformers, RMT, ARMT, and MELODI, which use the Transformer as a recurrent cell. Therefore, the novelty in this case is unclear. The text still lacks this discussion.
- On BABILong, results are shown (F

Reviewer 03 · Rating 6 · Confidence 2

Strengths

The paper makes contributions focused on long-term tasks for the memory of recurrent architectures. The investigation into expressivity seems to be the main novelty of the paper, since it contains a novel technical solution (polynomial kernel) with solid theoretical motivation. The sliding-window idea has already been implemented in architectures such as MAG and MAL, but the paper takes further steps to integrate the idea into a single architecture. All the contributions have been tested not

Weaknesses

From Table 2, Muon does not offer a large improvement, and it does not seem to be a central contribution of the paper. If my understanding is correct, this part could be removed from the main section of the paper. Maybe this is more of a question: why are the Atlas++ results not shown in Table 2?

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Linear Layer · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Multi-Head Attention · Attention Is All You Need · Layer Normalization · Byte Pair Encoding