LM2: Large Memory Models
Jikun Kang, Wenqi Wu, Filippos Christianos, Alex J. Chan, Fraser Greenlee, George Thomas, Marvin Purtorab, Andy Toulis

TL;DR
LM2 is a memory-augmented Transformer architecture that improves multi-step reasoning and reasoning over long contexts without sacrificing general capabilities, as shown by results on the BABILong benchmark.
Contribution
The paper presents LM2, a novel decoder-only Transformer with an auxiliary memory module that enhances reasoning and long-context processing while maintaining general performance.
Findings
Outperforms the memory-augmented RMT model by 37.1% on average across BABILong tasks.
Improves on the baseline Llama-3.2 model by 86.3% on average across BABILong tasks.
Enhances multi-hop inference and numerical reasoning capabilities.
Abstract
This paper introduces the Large Memory Model (LM2), a decoder-only Transformer architecture enhanced with an auxiliary memory module that aims to address the limitations of standard Transformers in multi-step reasoning, relational argumentation, and synthesizing information distributed over long contexts. The proposed LM2 incorporates a memory module that acts as a contextual representation repository, interacting with input tokens via cross attention and updating through gating mechanisms. To preserve the Transformer's general-purpose capabilities, LM2 maintains the original information flow while integrating a complementary memory pathway. Experimental results on the BABILong benchmark demonstrate that the LM2 model outperforms both the memory-augmented RMT model by 37.1% and the baseline Llama-3.2 model by 86.3% on average across tasks. LM2 exhibits exceptional capabilities in…
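The abstract describes the memory pathway only at a high level: tokens read from a memory bank via cross attention, the memory is updated through gates, and the original token stream is left intact. The NumPy sketch below is one minimal way such a step could look, not the authors' implementation. The function names (init_params, memory_step), the parameter names, and the specific gate forms (a sigmoid output gate on the read, sigmoid input and forget gates on the write) are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(d, rng):
    # Hypothetical parameter set; names are illustrative, not from the paper.
    names = ["Wq", "Wk", "Wv", "Wo", "Wi", "Wf"]
    return {n: rng.normal(scale=d ** -0.5, size=(d, d)) for n in names}


def memory_step(tokens, memory, params):
    """One assumed read/write step.

    tokens: (T, d) token representations
    memory: (N, d) memory slots
    Returns updated tokens (T, d) and updated memory (N, d).
    """
    d = tokens.shape[-1]

    # Read: tokens act as queries, memory slots as keys and values.
    q = tokens @ params["Wq"]
    k = memory @ params["Wk"]
    v = memory @ params["Wv"]
    attn = softmax(q @ k.T / np.sqrt(d))      # (T, N), rows sum to 1
    read = attn @ v                           # (T, d)

    # Assumed output gate: scales the memory read before it is added to
    # the unchanged token stream (a residual, so the original flow is kept).
    out_gate = sigmoid(read @ params["Wo"])
    new_tokens = tokens + out_gate * read

    # Write: each slot gathers a candidate update from the tokens by
    # reusing the transposed attention scores (an assumption).
    write_attn = softmax((k @ q.T) / np.sqrt(d))   # (N, T)
    candidate = np.tanh(write_attn @ tokens)       # (N, d)

    # Assumed input and forget gates decide how much to write and keep.
    in_gate = sigmoid(candidate @ params["Wi"])
    forget_gate = sigmoid(memory @ params["Wf"])
    new_memory = forget_gate * memory + in_gate * candidate
    return new_tokens, new_memory


rng = np.random.default_rng(0)
T, N, d = 5, 3, 8
params = init_params(d, rng)
tokens = rng.normal(size=(T, d))
memory = rng.normal(size=(N, d))
new_tokens, new_memory = memory_step(tokens, memory, params)
```

In this sketch an all-zero memory contributes nothing to the read, so the tokens pass through unchanged. That is one simple way to see how a residual memory pathway can leave the base model's behavior intact.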
Taxonomy
Topics: Parallel Computing and Optimization Techniques
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax · Absolute Position Encodings · Dropout · Label Smoothing · Byte Pair Encoding
