Pre-training Limited Memory Language Models with Internal and External Knowledge
Linxi Zhao, Sofian Zalouk, Christian K. Belardi, Justin Lovelace, Jin Peng Zhou, Ryan Thomas Noonan, Dongyoung Go, Kilian Q. Weinberger, Yoav Artzi, Jennifer J. Sun

TL;DR
This paper proposes Limited Memory Language Models (LMLMs) that externalize factual knowledge to external databases during pre-training, enabling more transparent, editable, and verifiable knowledge while maintaining competitive performance.
Contribution
Introduction of LMLMs that externalize knowledge, allowing targeted lookups instead of memorization, improving transparency and editability of factual information.
Findings
LMLMs perform competitively with significantly larger LLMs on standard benchmarks.
LMLMs enable explicit knowledge editing.
External knowledge retrieval improves model transparency.
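To make the editing claim concrete, here is a minimal sketch of what an external fact store could look like. It assumes facts are keyed by (entity, relation) pairs, as one of the reviews below describes. The `FactStore` class, its method names, and the example entity are hypothetical illustrations, not the authors' implementation.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, str]  # (entity, relation)


class FactStore:
    """Hypothetical in-memory store mapping (entity, relation) -> value."""

    def __init__(self) -> None:
        self._facts: Dict[Key, str] = {}

    def upsert(self, entity: str, relation: str, value: str) -> None:
        # Insert a new fact or overwrite an existing one; no model retraining involved.
        self._facts[(entity, relation)] = value

    def lookup(self, entity: str, relation: str) -> Optional[str]:
        # Return the stored value, or None when the fact is absent.
        return self._facts.get((entity, relation))

    def delete(self, entity: str, relation: str) -> bool:
        # Remove a fact; report whether anything was actually removed.
        return self._facts.pop((entity, relation), None) is not None


# Made-up entity and values, used only to show an edit taking effect.
store = FactStore()
store.upsert("Example Corp", "ceo", "Alice")
before = store.lookup("Example Corp", "ceo")
store.upsert("Example Corp", "ceo", "Bob")   # the edit
after = store.lookup("Example Corp", "ceo")
removed = store.delete("Example Corp", "ceo")
missing = store.lookup("Example Corp", "ceo")
```

Because every answer to a lookup comes from one inspectable entry, an edit or deletion is a single write to the store rather than a change to model weights.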
Abstract
Neural language models are black boxes: both linguistic patterns and factual knowledge are distributed across billions of opaque parameters. This entangled encoding makes it difficult to reliably inspect, verify, or update specific facts. We introduce Limited Memory Language Models (LMLMs), a new class of language models that externalizes factual knowledge to an external database during pre-training rather than memorizing it. Our pre-training approach strategically masks externally retrieved factual values from the training loss, thereby teaching the model to perform targeted lookups rather than relying on memorization in model weights. Our experiments demonstrate that LMLMs achieve competitive performance compared to significantly larger LLMs on standard benchmarks, while offering the advantages of explicit, editable, and verifiable knowledge bases.
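The loss masking described in the abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's training code: the function name `masked_nll`, the per-token boolean mask, and the toy probabilities are all hypothetical, and a real setup would compute this over model logits in a deep learning framework.

```python
import math
from typing import List


def masked_nll(token_log_probs: List[float], is_retrieved_value: List[bool]) -> float:
    """Mean negative log-likelihood over tokens NOT marked as retrieved values."""
    kept = [
        -lp
        for lp, masked in zip(token_log_probs, is_retrieved_value)
        if not masked
    ]
    if not kept:
        return 0.0  # every token was masked, so nothing contributes to the loss
    return sum(kept) / len(kept)


# Toy sequence (made-up probabilities). "Paris" stands in for a value that the
# lookup would return, so it is excluded from the loss.
tokens = ["The", "capital", "of", "France", "is", "Paris", "."]
log_probs = [math.log(p) for p in [0.5, 0.25, 0.5, 0.25, 0.5, 0.01, 0.5]]
mask = [False, False, False, False, False, True, False]

loss_masked = masked_nll(log_probs, mask)
loss_full = masked_nll(log_probs, [False] * len(tokens))
```

In this toy case the masked loss is lower than the full loss because the poorly predicted factual token no longer contributes, so the model gets no gradient signal pushing it to memorize that value.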
Peer Reviews
Decision·ICLR 2026 Poster
- The 355M small model trained by the authors has achieved better results on metrics such as FactScore, even when compared to similar models augmented with RAG.
- The training paradigm proposed by the authors is highly innovative, radical yet cost-effective. This is because the cost of searching is far lower than that of memorizing knowledge using large amounts of pre-training data.
- For a pre-training paradigm, it is clearly unreasonable for the authors to evaluate and compare models of the same scale solely using factuality-related evaluation methods. I believe the authors should at least add evaluations on aspects representing instruction following and reasoning capabilities, though I understand this is difficult for an ultra-small-scale model.
- The authors’ training scale is too small to allow us to determine whether the advantages currently achieved can be overwritte
* Introducing masking of retrieved factual values during pre-training to encourage reliance on external lookups rather than weight memorization is a compelling idea. It contributes to ongoing discussions around modularizing knowledge in language models.
* The paper presents a complete pipeline, from data annotation to model training and inference, which may facilitate adoption or extension by other practitioners.
* The method’s success relies heavily on an annotation pipeline using GPT-4o and a trained annotator model to extract factual triples. The manuscript would benefit from deeper analysis of annotation accuracy, coverage, biases introduced by the seed annotations, and scalability to larger or more diverse corpora.
* The focus on (entity, relation → value) triples means the approach externalizes only certain types of knowledge (birth dates, titles, etc.). More complex or contextual knowledge (e.g.,
S1: A lot of experimental analyses are well executed and presented, supporting the core advantages of the proposed method well.
S2: The limitations of the proposed method and future research directions are well stated in the discussion section.
W1: **Limited scope of usage** - While the proposed method can be utilized in knowledge-intensive tasks, whether it can be extended to broader usage is unclear. This is because further tuning of LMLMs will likely require a similar formatting of fine-tuning data by design. For example, I'm not clear about whether the proposed method can be deployed and maintained under instruction tuning.
W2: **Brittleness of DB-style modeling of factual knowledge** - While the proposed method inherits many nice
Taxonomy
Topics: Topic Modeling · Big Data and Digital Economy · Natural Language Processing Techniques
