Memory Grafting: Scaling Language Model Pre-training via Offline Conditional Memory
Runxi Cheng, Yuchen Guan, Yongxian Wei, Qianpu Sun, Qixiu Li, Sinan Du, Feng Xiong, Chun Yuan, Yan Lu, Yeyun Gong

TL;DR
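Memory Grafting scales a language model's conditional memory by running a grafting model offline, storing its frozen hidden states for frequent n-grams, and retrieving them through exact lookup. This adds capacity with limited training and inference overhead and improves benchmark scores over MoE and vanilla Engram baselines.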
Contribution
It proposes a novel offline conditional memory scaling technique that leverages frozen model states and exact lookup for efficient memory expansion in language models.
Findings
Memory Grafting improves benchmark scores over MoE and vanilla Engram baselines.
It scales effectively with limited training and inference overhead.
The method enhances external latent capacity in language models.
Abstract
Scaling conditional memory offers a promising way to increase language-model capacity, but existing methods such as Engram learn large memory tables from scratch during pre-training, making memory scaling expensive and sometimes ineffective. We propose Memory Grafting, a conditional memory scaling method that utilizes frozen hidden states from a grafting model as conditional n-gram memory. Given frequent local n-grams, we run the grafting model offline, store final-token hidden representations as memory values, and let the recipient model retrieve them through exact longest-match suffix lookup. Retrieved memories are adapted by lightweight projections and gates, while a hash-based Engram fallback preserves coverage for unmatched contexts. Since the grafting model is only run offline and exact lookup has expected O(1) complexity with respect to memory-bank size, Memory Grafting expands…
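The retrieval path described in the abstract (exact longest-match suffix lookup with a hash-based fallback for unmatched contexts) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `longest_suffix_match`, `bucket_index`, and `retrieve`, the toy memory bank, the rolling hash, and the `max_n` and `fallback_n` parameters are all hypothetical, and the projections and gates that adapt retrieved memories are omitted. It assumes the memory bank is a hash map keyed by token-id n-grams, so each probe costs expected O(1) regardless of bank size.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Vector = List[float]
NGram = Tuple[int, ...]


def longest_suffix_match(
    context: Sequence[int], bank: Dict[NGram, Vector], max_n: int
) -> Optional[NGram]:
    """Return the longest suffix of `context` (at most max_n tokens) that is a key in `bank`."""
    # Try the longest candidate first; each probe is one expected O(1) dict lookup.
    for n in range(min(max_n, len(context)), 0, -1):
        key = tuple(context[-n:])
        if key in bank:
            return key
    return None


def bucket_index(ngram: NGram, num_buckets: int) -> int:
    """Illustrative polynomial hash of an n-gram into a fixed number of buckets (an assumption)."""
    h = 0
    for tok in ngram:
        h = (h * 1000003 + tok) % num_buckets
    return h


def retrieve(
    context: Sequence[int],
    bank: Dict[NGram, Vector],
    fallback_table: List[Vector],
    max_n: int = 3,
    fallback_n: int = 2,
) -> Tuple[Vector, str]:
    """Return (memory vector, source), where source is "exact" or "fallback"."""
    key = longest_suffix_match(context, bank, max_n)
    if key is not None:
        return bank[key], "exact"
    # No stored n-gram matches: fall back to a hashed table so every context gets a vector.
    ngram = tuple(context[-fallback_n:])
    return fallback_table[bucket_index(ngram, len(fallback_table))], "fallback"


# Toy data: vectors stand in for hidden states precomputed offline by the grafting model.
bank: Dict[NGram, Vector] = {
    (5, 6): [0.1, 0.2],
    (4, 5, 6): [0.3, 0.4],
    (9,): [0.5, 0.6],
}
fallback_table: List[Vector] = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]

hit_long = retrieve([1, 4, 5, 6], bank, fallback_table)   # matches the 3-gram (4, 5, 6)
hit_short = retrieve([7, 5, 6], bank, fallback_table)     # only the 2-gram (5, 6) matches
miss = retrieve([7, 8], bank, fallback_table)             # no suffix stored, uses the fallback
```

In this sketch a lookup makes at most `max_n` dictionary probes, so its cost depends on the maximum n-gram length rather than on how many entries the bank holds.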
