LM2: Large Memory Models

Jikun Kang; Wenqi Wu; Filippos Christianos; Alex J. Chan; Fraser Greenlee; George Thomas; Marvin Purtorab; Andy Toulis

arXiv:2502.06049·cs.CL·February 11, 2025

TL;DR

LM2 is a memory-augmented, decoder-only Transformer that improves multi-step reasoning and the synthesis of information spread over long contexts, as measured on the BABILong benchmark, while preserving the general-purpose capabilities of the underlying model.

Contribution

The paper presents LM2, a novel decoder-only Transformer with an auxiliary memory module that enhances reasoning and long-context processing while maintaining general performance.

Findings

01

Outperforms the memory-augmented RMT model by 37.1% on average across BABILong tasks.

02

Outperforms the baseline Llama-3.2 model by 86.3% on average across BABILong tasks.

03

Enhances multi-hop inference and numerical reasoning capabilities.

Abstract

This paper introduces the Large Memory Model (LM2), a decoder-only Transformer architecture enhanced with an auxiliary memory module that aims to address the limitations of standard Transformers in multi-step reasoning, relational argumentation, and synthesizing information distributed over long contexts. The proposed LM2 incorporates a memory module that acts as a contextual representation repository, interacting with input tokens via cross attention and updating through gating mechanisms. To preserve the Transformer's general-purpose capabilities, LM2 maintains the original information flow while integrating a complementary memory pathway. Experimental results on the BABILong benchmark demonstrate that the LM2 model outperforms both the memory-augmented RMT model by 37.1% and the baseline Llama-3.2 model by 86.3% on average across tasks. LM2 exhibits exceptional capabilities in…
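
The mechanism described in the abstract (tokens read from a memory bank through cross attention, the memory is updated through gates, and the original information flow is kept) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `init_params` and `memory_step`, the weight names, the specific input/forget/output gate layout, the mean-pooled token summary, and the omission of multi-head attention and layer normalization are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(d, rng):
    # Hypothetical weight names; every matrix is (d, d).
    names = ["W_q", "W_k", "W_v", "W_out", "W_in", "W_forget", "W_c"]
    return {n: rng.normal(scale=d ** -0.5, size=(d, d)) for n in names}


def memory_step(tokens, memory, params):
    """One illustrative memory read and update.

    tokens: (T, d) token representations from a decoder block.
    memory: (N, d) memory slots.
    Returns the updated tokens and the updated memory.
    """
    d = tokens.shape[-1]

    # Read: tokens act as queries, memory slots as keys and values.
    q = tokens @ params["W_q"]
    k = memory @ params["W_k"]
    v = memory @ params["W_v"]
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)  # (T, N)
    read = attn @ v                                # (T, d)

    # Assumed output gate: scales how much memory content is added.
    # The residual sum leaves the original token pathway intact.
    g_out = sigmoid(read @ params["W_out"])
    tokens_out = tokens + g_out * read

    # Write: assumed gated update driven by a mean-pooled token summary.
    summary = tokens.mean(axis=0, keepdims=True)        # (1, d)
    g_in = sigmoid(summary @ params["W_in"])            # (1, d)
    g_forget = sigmoid(memory @ params["W_forget"])     # (N, d)
    candidate = np.tanh(summary @ params["W_c"])        # (1, d)
    memory_out = g_forget * memory + g_in * candidate   # (N, d)

    return tokens_out, memory_out
```

In this sketch an all-zero memory contributes nothing to the read, so the tokens pass through unchanged. That is one simple way to picture a memory pathway that complements the standard Transformer flow without replacing it.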

Taxonomy

Topics: Parallel Computing and Optimization Techniques

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax · Absolute Position Encodings · Dropout · Label Smoothing · Byte Pair Encoding