MLP Memory: A Retriever-Pretrained Memory for Large Language Models
Rubin Wei, Jiaqi Cao, Jiarui Wang, Jushi Kai, Qipeng Guo, Bowen Zhou, Zhouhan Lin

TL;DR
MLP Memory is a lightweight, pretrained parametric module that internalizes retrieval patterns, improving language model accuracy and efficiency without explicit document access, outperforming traditional retrieval-augmented methods.
Contribution
We introduce MLP Memory, a novel pretrained parametric memory that imitates retrieval behavior, bridging the gap between retrieval-based and fine-tuning methods in language models.
Findings
Achieves 17.5% and 24.1% scaling gains on datasets
Improves question-answering benchmarks by 12.3%
Reduces hallucinations by up to 10 points on HaluEval
Abstract
Modern approaches to enhancing Large Language Models' factual accuracy and knowledge utilization face a fundamental trade-off: non-parametric retrieval-augmented generation (RAG) provides flexible access to external knowledge but suffers from high inference latency and shallow integration, while parametric fine-tuning methods like LoRA risk catastrophic forgetting and degraded general capabilities. In this work, we propose MLP Memory, a lightweight parametric module that learns to internalize retrieval patterns without explicit document access. By pretraining an MLP to imitate a kNN retriever's behavior on the entire pretraining dataset, we create a differentiable memory component that captures the benefits of retrieval-based knowledge access in a fully parametric form. Our architecture integrates this pretrained MLP Memory with Transformer decoders through simple probability…
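The abstract is cut off above, so the following is only a minimal sketch of the general idea it describes, not the authors' implementation. It assumes a kNN-LM-style setup: a target next-token distribution is built from retrieved neighbors, a memory module is trained to match that target (KL divergence is used here as one common choice; the abstract does not state the loss), and the memory's output is mixed with the language model's output by linear interpolation. All function names, the `temperature` parameter, and the weight `lam` are hypothetical.

```python
import math


def knn_target_distribution(neighbors, vocab_size, temperature=1.0):
    """Build a next-token distribution from retrieved neighbors.

    neighbors: list of (distance, next_token_id) pairs (hypothetical format).
    Closer neighbors get more weight via a softmax over negative distances.
    """
    min_d = min(d for d, _ in neighbors)  # shift distances for numerical stability
    weights = [math.exp(-(d - min_d) / temperature) for d, _ in neighbors]
    total = sum(weights)
    probs = [0.0] * vocab_size
    for w, (_, tok) in zip(weights, neighbors):
        probs[tok] += w / total  # neighbors sharing a token pool their mass
    return probs


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); an assumed imitation loss for the memory module."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0.0
    )


def interpolate(p_lm, p_mem, lam=0.25):
    """Mix LM and memory distributions; lam is an assumed interpolation weight."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(p_lm, p_mem)]


if __name__ == "__main__":
    target = knn_target_distribution([(0.1, 2), (0.3, 2), (0.9, 1)], vocab_size=4)
    memory_output = [0.1, 0.2, 0.6, 0.1]  # stand-in for an MLP's softmax output
    lm_output = [0.4, 0.3, 0.2, 0.1]      # stand-in for the decoder's softmax output
    print("imitation loss:", kl_divergence(target, memory_output))
    print("final distribution:", interpolate(lm_output, memory_output))
```

In this sketch the retriever is needed only to produce training targets; at inference time only `interpolate` runs, which is consistent with the abstract's claim of operating without explicit document access.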
Peer Reviews
Decision·ICLR 2026 Poster
1. The main idea is to make kNN-LM more storage efficient. It trains an MLP to replace the costly K-V storage. The idea is interesting. 2. The paper describes the approach well. It contains good analyses of different hyperparameters and settings. 3. The approach is compared to the relevant baselines (except the missing comparison mentioned in weakness).
1. The contribution of the paper is limited. It does not change the basic idea of kNN-LM, but tries to provide a better implementation of it. 2. As the main contribution of MLP-memory is to reduce the storage cost of kNN-LM, the main comparison should be done with kNN-LM. However, in the main results on QA, this comparison is missing. The comparison is only done on general NLP tasks. This is insufficient. 3. The goal of MLP-memory is to reproduce kNN-LM at low storage cost. The key research q
The paper addresses an important problem in retrieval-augmented modeling: maintaining factual recall benefits without external retrieval or large storage. MLP Memory provides a clean parametric alternative that approximates kNN-LM behavior through distribution imitation. The method is conceptually clear and empirically well evaluated, with strong results on factual QA and hallucination benchmarks. The use of TTFT and TPS as efficiency metrics is appropriate and provides a runtime-based compariso
Despite strong results, several aspects limit the clarity and generality of the approach. The training cost is not clearly quantified: even if inference is faster, pretraining the MLP Memory requires generating kNN distributions and performing supervised training on them, which likely increases per-step computation and total cost. It is also unclear whether the backbone model is frozen or jointly updated, which affects the practical difficulty of integration. Moreover, the MLP Memory seems to re
(1) The motivation of the paper is clear. The paper identifies a clear gap in the literature and proposes an elegant hybrid approach that bridges parametric and non-parametric methods. (2) Comprehensive evaluation: The experimental design is thorough, evaluating the method across multiple critical dimensions: factual question answering, general NLP capability preservation, hallucination reduction, and inference efficiency. (3) Compelling performance: The results show that MLP Memory not only
(1) The method requires a pre-training phase for the MLP Memory module, which involves constructing a large datastore and optimizing a large model. The computational cost and time of this initial step could be a barrier for some. (2) Unlike RAG, the knowledge stored in the MLP Memory is fixed after pre-training (just like the generative retrieval models). This limits the model's ability to access updated or real-time information without a costly retraining cycle, making it less suitable for
