Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

Xin Cheng; Wangding Zeng; Damai Dai; Qinyu Chen; Bingxuan Wang; Zhenda Xie; Kezhao Huang; Xingkai Yu; Zhewen Hao; Yukun Li; Han Zhang; Huishuai Zhang; Dongyan Zhao; Wenfeng Liang

arXiv:2601.07372·cs.CL·January 13, 2026

TL;DR

This paper introduces Engram, a scalable conditional memory module for large language models. It improves knowledge retrieval, reasoning, and long-context understanding by pairing static memory lookup with neural computation, with the split between the two guided by a U-shaped scaling law.

Contribution

The paper proposes Engram, a scalable memory module with O(1) lookup, and formulates the Sparsity Allocation problem, which reveals a U-shaped scaling law governing the trade-off between neural computation (MoE) and static memory (Engram).

Findings

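01. Engram scales to 27B parameters and outperforms a strictly iso-parameter and iso-FLOPs MoE baseline.

02. Gains extend beyond knowledge retrieval (e.g., MMLU +3.4; CMMLU +4.0) to general reasoning (e.g., BBH +5.0; ARC-Challenge +3.7) and domain-specific tasks.

03. Deterministic addressing improves long-context retrieval and efficiency.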

Abstract

While Mixture-of-Experts (MoE) scales capacity via conditional computation, Transformers lack a native primitive for knowledge lookup, forcing them to inefficiently simulate retrieval through computation. To address this, we introduce conditional memory as a complementary sparsity axis, instantiated via Engram, a module that modernizes classic $N$-gram embedding for O(1) lookup. By formulating the Sparsity Allocation problem, we uncover a U-shaped scaling law that optimizes the trade-off between neural computation (MoE) and static memory (Engram). Guided by this law, we scale Engram to 27B parameters, achieving superior performance over a strictly iso-parameter and iso-FLOPs MoE baseline. Most notably, while the memory module is expected to aid knowledge retrieval (e.g., MMLU +3.4; CMMLU +4.0), we observe even larger gains in general reasoning (e.g., BBH +5.0; ARC-Challenge +3.7) and…
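The idea of an $N$-gram embedding with O(1) lookup can be illustrated with a toy hashed table. This is only a minimal sketch and not the paper's Engram module: the class name `HashedNgramMemory`, the polynomial hash, the table size, and the zero vector used for positions without enough left context are all assumptions made for illustration. It shows that the memory slot for a position depends only on the last `n` token ids, so the address is deterministic and the cost per position does not grow with the table size.

```python
import random
from typing import List, Sequence


class HashedNgramMemory:
    """Toy hashed N-gram embedding table (hypothetical, for illustration only)."""

    def __init__(self, num_slots: int, dim: int, n: int = 2, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.num_slots = num_slots
        self.dim = dim
        self.n = n
        # Static memory: one vector per slot, randomly initialised here.
        self.table = [
            [rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_slots)
        ]

    def slot(self, ngram: Sequence[int]) -> int:
        # Deterministic polynomial hash of the token ids (an assumed scheme).
        h = 0
        for tok in ngram:
            h = (h * 1000003 + tok + 1) % (2**61 - 1)
        return h % self.num_slots

    def lookup(self, token_ids: Sequence[int]) -> List[List[float]]:
        # One table read per position; no search over the table is needed.
        out = []
        for i in range(len(token_ids)):
            if i + 1 < self.n:
                out.append([0.0] * self.dim)  # not enough left context yet
            else:
                ngram = token_ids[i + 1 - self.n : i + 1]
                out.append(self.table[self.slot(ngram)])
        return out


memory = HashedNgramMemory(num_slots=1024, dim=4, n=2)
vectors = memory.lookup([5, 17, 5, 17])
```

In this sketch the bigram (5, 17) ends at positions 1 and 3, so both positions read the same slot. Distinct $N$-grams can collide in a hashed table; how such collisions are handled is left out here.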

Code & Models

Models

Kevletesteur/chimere-system (model)

Datasets

openclaw-huy/maxtext-main (dataset, 902 downloads)

Videos

DeepSeek Just Fixed One Of The Biggest Problems With AI (YouTube)

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Expert finding and Q&A systems