Mixture of Chapters: Scaling Learnt Memory in Transformers
Tasmay Pankaj Tibrewal, Pritish Saha, Ankit Meda, Kunal Singh, Pradeep Moturi

TL;DR
This paper introduces learnable sparse memory banks with chapter-based routing for transformers, enabling scalable knowledge storage and retrieval and improving both performance and robustness in pre-training and fine-tuning.
Contribution
The paper proposes a scalable memory architecture with chapter-based routing that increases transformer capacity and knowledge retention relative to standard transformers.
Findings
Scales the memory bank to 262K memory tokens while keeping computation tractable.
Outperforms iso-FLOP baseline transformers on pre-training and fine-tuning.
Shows improved knowledge retention and robustness to forgetting.
Abstract
Transformers lack an explicit architectural mechanism for storing and organizing knowledge acquired during training. We introduce learnable sparse memory banks: a set of latent tokens, randomly initialized and trained end-to-end, that transformer layers query via cross-attention to retrieve stored knowledge. To scale memory capacity without prohibitive attention costs, we propose chapter-based routing inspired by Mixture-of-Experts architectures, partitioning the memory bank into chapters and training a router to select relevant subsets per input. This enables scaling to 262K memory tokens while maintaining tractable computation. We evaluate our approach against standard transformers (in iso-FLOP settings) on pre-training and instruction fine-tuning across relevant benchmarks. Our models surpass iso-FLOP baselines, suggesting scope for a new axis of scaling, demonstrating that explicit…
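The mechanism described in the abstract can be illustrated with a minimal forward-pass sketch. It is not the authors' implementation: the mean-pooled router input, the top-k selection rule, the residual connection, the single attention head, and every name and size below (`chapter_routed_memory_read`, `w_router`, `top_k`, 8 chapters of 16 tokens) are assumptions made for illustration. The abstract states only that the memory bank is partitioned into chapters, that a trained router selects a subset per input, and that layers read the memory through cross-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def chapter_routed_memory_read(hidden, memory, w_router, w_q, w_k, w_v, top_k):
    """Read from a chaptered memory bank via cross-attention (illustrative only).

    hidden:   (seq_len, d) token representations of one input
    memory:   (num_chapters, chapter_size, d) learnable latent memory tokens
    w_router: (d, num_chapters) router weights (hypothetical linear router)
    """
    num_chapters, chapter_size, d = memory.shape

    # Assumption: the router scores chapters from the mean-pooled input.
    pooled = hidden.mean(axis=0)                  # (d,)
    router_logits = pooled @ w_router             # (num_chapters,)
    chosen = np.argsort(router_logits)[-top_k:]   # indices of the top_k chapters

    # Only the selected chapters take part in attention.
    selected = memory[chosen].reshape(top_k * chapter_size, d)

    # Single-head cross-attention: queries from the input, keys/values from memory.
    queries = hidden @ w_q                        # (seq_len, d)
    keys = selected @ w_k                         # (top_k * chapter_size, d)
    values = selected @ w_v                       # (top_k * chapter_size, d)
    attn = softmax(queries @ keys.T / np.sqrt(d), axis=-1)

    # Assumption: the retrieved content is added back residually.
    return hidden + attn @ values, chosen

# Toy sizes, chosen only so the example runs quickly.
rng = np.random.default_rng(0)
num_chapters, chapter_size, d, seq_len, top_k = 8, 16, 32, 5, 2
memory = rng.normal(size=(num_chapters, chapter_size, d))   # randomly initialized
hidden = rng.normal(size=(seq_len, d))
w_router = rng.normal(size=(d, num_chapters))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

out, chosen = chapter_routed_memory_read(
    hidden, memory, w_router, w_q, w_k, w_v, top_k
)
```

With these toy sizes each query attends to 2 × 16 = 32 of the 128 memory tokens, so the attention cost depends on the number of selected chapters rather than on the total size of the bank. The sketch covers only the forward read; how the router is trained is not specified in the text above and is left out.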
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Graph Neural Networks · Multimodal Machine Learning Applications
