Just read twice: closing the recall gap for recurrent language models
Simran Arora, Aman Timalsina, Aaryan Singhal, Benjamin Spector, Sabri Eyuboglu, Xinyi Zhao, Ashish Rao, Atri Rudra, and Christopher Ré

TL;DR
This paper addresses the recall limitations of recurrent language models by analyzing information order effects and proposing methods like repeated prompts and non-causal attention to improve long-context understanding and efficiency.
Contribution
It formalizes the impact of data order on recurrent LMs' recall ability, linking it to set disjointness complexity, and introduces novel prompt techniques to mitigate order sensitivity and enhance performance.
Findings
Repeated prompts improve ICL performance by 11 points across models and tasks.
Non-causal prefix-linear attention achieves near-transformer quality with higher throughput.
Memory efficiency and long-context recall are significantly enhanced by proposed methods.
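The repeated-prompt finding can be illustrated with a minimal sketch. It assumes that "repeated prompts" means showing the context and question to the model more than once before it answers, so that on the later pass the model already knows what to look for. The function name `build_repeated_prompt`, the separator, and the exact template are hypothetical and not taken from the paper.

```python
def build_repeated_prompt(context: str, question: str,
                          repeats: int = 2, sep: str = "\n\n") -> str:
    """Build a prompt that shows the (context, question) pair `repeats` times.

    Hypothetical template for illustration only; the authors' actual
    prompt format is not given in this summary.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    # Each block holds the context followed by the question.
    block = f"{context}{sep}{question}"
    # Concatenate the blocks so the prompt ends with the question.
    return sep.join([block] * repeats)


if __name__ == "__main__":
    prompt = build_repeated_prompt(
        context="Alice was born in 1990. Bob was born in 1985.",
        question="When was Bob born?",
    )
    print(prompt)
```

The trade-off in this sketch is prompt length: with `repeats=2` the input is roughly twice as long, which is the cost paid for exposing the model to the information a second time.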
Abstract
Recurrent large language models that compete with Transformers in language modeling perplexity are emerging at a rapid rate (e.g., Mamba, RWKV). Excitingly, these architectures use a constant amount of memory during inference. However, due to the limited memory, recurrent LMs cannot recall and use all the information in long contexts, leading to brittle in-context learning (ICL) quality. A key challenge for efficient LMs is selecting what information to store versus discard. In this work, we observe that the order in which information is shown to the LM impacts the selection difficulty. To formalize this, we show that the hardness of information recall reduces to the hardness of a problem called set disjointness (SD), a quintessential problem in communication complexity that requires a streaming algorithm (e.g., a recurrent model) to decide whether the input sets are disjoint. We empirically and…
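To make the link between set disjointness and data order concrete, here is a minimal sketch of a one-pass streaming check. It assumes the simplest possible strategy, storing every element of whichever set arrives first, and is not the paper's construction; the function name `streaming_disjoint` and the "memory" count are illustrative assumptions.

```python
from typing import Hashable, Iterable, Tuple


def streaming_disjoint(first: Iterable[Hashable],
                       second: Iterable[Hashable]) -> Tuple[bool, int]:
    """Decide whether two sets are disjoint, reading each stream once.

    Returns (is_disjoint, memory_used), where memory_used is the number
    of elements kept in state. Illustrative assumption: the algorithm
    must remember the whole first set before it sees the second.
    """
    seen = set()
    for item in first:
        seen.add(item)  # state grows with the first set
    memory_used = len(seen)

    is_disjoint = True
    for item in second:
        if item in seen:  # a shared element means the sets overlap
            is_disjoint = False
            break
    return is_disjoint, memory_used


if __name__ == "__main__":
    small = {1, 2}
    large = range(100, 1000)
    print(streaming_disjoint(small, large))  # (True, 2)
    print(streaming_disjoint(large, small))  # (True, 900)
```

Both calls give the same answer, but the state needed depends on which set arrives first. This mirrors the abstract's observation that the order in which information is shown affects how hard the store-versus-discard decision is for a model with fixed memory.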
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Sparse Evolutionary Training · Linear Layer · Multi-Head Attention · Attention Is All You Need · Softmax · Byte Pair Encoding · Layer Normalization · Label Smoothing · Absolute Position Encodings · Position-Wise Feed-Forward Layer
