MLP Memory: A Retriever-Pretrained Memory for Large Language Models

Rubin Wei; Jiaqi Cao; Jiarui Wang; Jushi Kai; Qipeng Guo; Bowen Zhou; Zhouhan Lin

arXiv:2508.01832·cs.CL·March 2, 2026

TL;DR

MLP Memory is a lightweight, pretrained parametric module that internalizes retrieval patterns, improving language model accuracy and efficiency without explicit document access, outperforming traditional retrieval-augmented methods.

Contribution

We introduce MLP Memory, a novel pretrained parametric memory that imitates retrieval behavior, bridging the gap between retrieval-based and fine-tuning methods in language models.

Findings

01

Achieves 17.5% and 24.1% scaling gains on datasets

02

Improves question-answering benchmarks by 12.3%

03

Reduces hallucinations by up to 10 points on HaluEval

Abstract

Modern approaches to enhancing Large Language Models' factual accuracy and knowledge utilization face a fundamental trade-off: non-parametric retrieval-augmented generation (RAG) provides flexible access to external knowledge but suffers from high inference latency and shallow integration, while parametric fine-tuning methods like LoRA risk catastrophic forgetting and degraded general capabilities. In this work, we propose MLP Memory, a lightweight parametric module that learns to internalize retrieval patterns without explicit document access. By pretraining an MLP to imitate a $k$NN retriever's behavior on the entire pretraining dataset, we create a differentiable memory component that captures the benefits of retrieval-based knowledge access in a fully parametric form. Our architecture integrates this pretrained MLP Memory with Transformer decoders through simple probability…
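To make the mechanism in the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes a kNN-LM-style formulation: retrieved neighbors are turned into a target distribution over the vocabulary, a memory module is trained to match that distribution, and at inference the memory's distribution is mixed with the language model's. The function names, the `temperature` and `lam` parameters, the use of KL divergence as the imitation loss, and the linear mixing rule are all assumptions for illustration; the abstract is cut off before it states the exact combination rule.

```python
import math


def knn_distribution(neighbors, vocab_size, temperature=1.0):
    """Build a next-token distribution from retrieved neighbors.

    `neighbors` is a list of (distance, next_token_id) pairs. Following the
    kNN-LM convention (an assumption here), each neighbor contributes weight
    proportional to exp(-distance / temperature) to its token.
    """
    if not neighbors:
        raise ValueError("at least one neighbor is required")
    min_d = min(d for d, _ in neighbors)  # shift distances for numerical stability
    weights = [0.0] * vocab_size
    for d, tok in neighbors:
        weights[tok] += math.exp(-(d - min_d) / temperature)
    total = sum(weights)
    return [w / total for w in weights]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted): one plausible loss for imitating the retriever."""
    return sum(
        t * math.log(t / max(p, eps))
        for t, p in zip(target, predicted)
        if t > 0.0
    )


def interpolate(p_lm, p_mem, lam=0.25):
    """Mix LM and memory distributions; `lam` is a hypothetical mixing weight."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [(1.0 - lam) * a + lam * b for a, b in zip(p_lm, p_mem)]


if __name__ == "__main__":
    vocab_size = 4
    # Toy retrieval result: two neighbors vote for token 1, one for token 2.
    target = knn_distribution([(0.0, 1), (0.0, 1), (0.0, 2)], vocab_size)
    # Stand-in for the memory module's output logits (made-up numbers).
    p_mem = softmax([0.1, 2.0, 1.0, 0.1])
    print("imitation loss:", kl_divergence(target, p_mem))
    # Stand-in for the base LM's next-token distribution.
    p_lm = softmax([1.0, 1.0, 1.0, 1.0])
    print("mixed distribution:", interpolate(p_lm, p_mem, lam=0.5))
```

In this sketch the retriever is needed only to produce `target` during training; at inference only `p_mem` and `p_lm` are computed, which is the property the abstract describes as internalizing retrieval without explicit document access.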

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 2

Strengths

1. The main idea is to make kNN-LM more storage efficient. It trains an MLP to replace the costly K-V storage. The idea is interesting. 2. The paper describes the approach well. It contains good analyses of different hyperparameters and settings. 3. The approach is compared to the relevant baselines (except for the missing comparison mentioned under Weaknesses).

Weaknesses

1. The contribution of the paper is limited. It does not change the basic idea of kNN-LM, but tries to provide a better implementation of it. 2. As the main contribution of MLP Memory is to reduce the storage cost of kNN-LM, the main comparison should be done with kNN-LM. However, in the main results on QA, this comparison is missing. The comparison is only done on general NLP tasks. This is insufficient. 3. The goal of MLP Memory is to reproduce kNN-LM at lower storage cost. The key research q

Reviewer 02·Rating 6·Confidence 4

Strengths

The paper addresses an important problem in retrieval-augmented modeling: maintaining factual recall benefits without external retrieval or large storage. MLP Memory provides a clean parametric alternative that approximates kNN-LM behavior through distribution imitation. The method is conceptually clear and empirically well evaluated, with strong results on factual QA and hallucination benchmarks. The use of TTFT and TPS as efficiency metrics is appropriate and provides a runtime-based compariso

Weaknesses

Despite strong results, several aspects limit the clarity and generality of the approach. The training cost is not clearly quantified: even if inference is faster, pretraining the MLP Memory requires generating kNN distributions and performing supervised training on them, which likely increases per-step computation and total cost. It is also unclear whether the backbone model is frozen or jointly updated, which affects the practical difficulty of integration. Moreover, the MLP Memory seems to re

Reviewer 03·Rating 4·Confidence 4

Strengths

(1) The motivation of the paper is clear. The paper identifies a clear gap in the literature and proposes an elegant hybrid approach that bridges parametric and non-parametric methods. (2) Comprehensive evaluation: The experimental design is thorough, evaluating the method across multiple critical dimensions: factual question answering, general NLP capability preservation, hallucination reduction, and inference efficiency. (3) Compelling performance: The results show that MLP Memory not only

Weaknesses

(1) The method requires a pre-training phase for the MLP Memory module, which involves constructing a large datastore and optimizing a large model. The computational cost and time of this initial step could be a barrier for some. (2) Unlike RAG, the knowledge stored in the MLP Memory is fixed after pre-training (just like the generative retrieval models). This limits the model's ability to access updated or real-time information without a costly retraining cycle, making it less suitable for
