Mixture of Chapters: Scaling Learnt Memory in Transformers

Tasmay Pankaj Tibrewal; Pritish Saha; Ankit Meda; Kunal Singh; Pradeep Moturi

arXiv:2603.21096·cs.LG·March 24, 2026

Open Access

TL;DR

This paper introduces learnable sparse memory banks with chapter-based routing for transformers. The design enables scalable knowledge storage and retrieval and improves both performance and robustness in pre-training and fine-tuning.

Contribution

The paper proposes a scalable memory architecture with chapter-based routing that increases transformer capacity and knowledge retention relative to standard transformers.

Findings

01

Scales the memory bank to 262K memory tokens while keeping computation tractable.

02

Outperforms iso-FLOP baseline transformers on pre-training and fine-tuning.

03

Shows improved knowledge retention and robustness to forgetting.

Abstract

Transformers lack an explicit architectural mechanism for storing and organizing knowledge acquired during training. We introduce learnable sparse memory banks: a set of latent tokens, randomly initialized and trained end-to-end, that transformer layers query via cross-attention to retrieve stored knowledge. To scale memory capacity without prohibitive attention costs, we propose chapter-based routing inspired by Mixture-of-Experts architectures, partitioning the memory bank into chapters and training a router to select relevant subsets per input. This enables scaling to 262K memory tokens while maintaining tractable computation. We evaluate our approach against standard transformers (in iso-FLOP settings) on pre-training and instruction fine-tuning across relevant benchmarks. Our models surpass iso-FLOP baselines, suggesting scope for a new axis of scaling, demonstrating that explicit…
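
The mechanism described in the abstract (a memory bank split into chapters, a router that picks a subset per input, and cross-attention over the selected memory tokens) can be illustrated with a minimal NumPy sketch of a single forward read. Everything specific here is an assumption for illustration and is not taken from the paper: the function name `chapter_routed_memory_read`, top-k selection, routing on the mean-pooled hidden state, single-head attention, the residual add, and all sizes. How the router is trained is not covered by the abstract and is left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def chapter_routed_memory_read(hidden, memory, router_w, w_q, w_k, w_v, top_k):
    """Hypothetical forward read from a chaptered memory bank.

    hidden:   (seq, d) token representations from a transformer layer
    memory:   (n_chapters, chapter_size, d) latent memory tokens
    router_w: (d, n_chapters) router projection
    """
    n_chapters, chapter_size, d = memory.shape

    # Assumption: one routing decision per input, from the mean-pooled state.
    router_logits = hidden.mean(axis=0) @ router_w            # (n_chapters,)
    chosen = np.sort(np.argsort(router_logits)[-top_k:])      # top_k chapter ids

    # Only the selected chapters take part in attention.
    selected = memory[chosen].reshape(top_k * chapter_size, d)

    # Single-head cross-attention: queries from hidden, keys/values from memory.
    q = hidden @ w_q
    k = selected @ w_k
    v = selected @ w_v
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)             # (seq, top_k * chapter_size)

    # Assumption: retrieved content is added residually to the hidden states.
    return hidden + attn @ v, chosen

# Toy sizes, chosen only so the example runs quickly.
rng = np.random.default_rng(0)
d, n_chapters, chapter_size, seq, top_k = 16, 8, 32, 5, 2
memory = rng.normal(size=(n_chapters, chapter_size, d))  # randomly initialized
hidden = rng.normal(size=(seq, d))
router_w = rng.normal(size=(d, n_chapters))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

out, chosen = chapter_routed_memory_read(
    hidden, memory, router_w, w_q, w_k, w_v, top_k
)
print(out.shape, chosen)
```

In this sketch each query attends over `top_k * chapter_size` memory tokens instead of all `n_chapters * chapter_size`, which is the sense in which routing keeps attention cost bounded as the bank grows.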

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Graph Neural Networks · Multimodal Machine Learning Applications