MemoryRewardBench: Benchmarking Reward Models for Long-Term Memory Management in Large Language Models

Zecheng Tang; Baibei Ji; Ruoxi Sun; Haitian Wang; WangJie You; Zhang Yijun; Wenpeng Zhu; Ji Qi; Juntao Li; Min Zhang

arXiv:2601.11969·cs.CL·January 27, 2026

TL;DR

MemoryRewardBench is a new benchmark designed to evaluate reward models' ability to assess long-term memory management in large language models across various tasks and context lengths.

Contribution

This work introduces the first comprehensive benchmark for evaluating reward models' effectiveness in long-term memory assessment in LLMs.

Findings

1. Newer-generation models outperform older ones on memory evaluation tasks.
2. The performance gap between open-source and proprietary models is narrowing.
3. Current reward models have fundamental limitations in memory evaluation.

Abstract

Recent work increasingly adopts memory-centric mechanisms that process long contexts segment by segment, and effective memory management is one of the key capabilities that lets large language models propagate information across the entire sequence. Leveraging reward models (RMs) to evaluate memory quality automatically and reliably is therefore critical. In this work, we introduce MemoryRewardBench, the first benchmark to systematically study the ability of RMs to evaluate long-term memory management processes. MemoryRewardBench covers both long-context comprehension and long-form generation tasks, featuring 10 distinct settings with different memory management patterns and context lengths ranging from 8K to 128K tokens. Evaluations on 13 cutting-edge RMs indicate a diminishing performance gap between open-source and proprietary models, with newer-generation models…
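To make the segment-wise, memory-centric setting concrete, the following is a minimal sketch and not the paper's implementation. It splits a long token sequence into fixed-size segments and folds each segment into a running memory, recording the memory after every step; that sequence of intermediate memories is the kind of process a reward model would be asked to judge. The names `split_into_segments`, `run_with_memory`, and `keep_capitalized` are hypothetical, and the toy update rule (keep capitalized tokens) stands in for an LLM-based memory update.

```python
from typing import Callable, List


def split_into_segments(tokens: List[str], segment_len: int) -> List[List[str]]:
    """Split a token list into consecutive segments of at most segment_len tokens."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def run_with_memory(
    tokens: List[str],
    segment_len: int,
    update_memory: Callable[[str, List[str]], str],
    initial_memory: str = "",
) -> List[str]:
    """Process tokens segment by segment and return the memory after each step."""
    memory = initial_memory
    trajectory = []
    for segment in split_into_segments(tokens, segment_len):
        # Each step sees only the previous memory and the current segment.
        memory = update_memory(memory, segment)
        trajectory.append(memory)
    return trajectory


def keep_capitalized(memory: str, segment: List[str]) -> str:
    """Toy update rule (an assumption for illustration): append capitalized tokens."""
    kept = [tok for tok in segment if tok[:1].isupper()]
    return " ".join(part for part in [memory, *kept] if part)


if __name__ == "__main__":
    tokens = "Alice met Bob in Paris last week".split()
    for step, mem in enumerate(run_with_memory(tokens, 3, keep_capitalized), start=1):
        print(step, repr(mem))
```

Because each update sees only the previous memory and the current segment, anything dropped at one step is unavailable to every later step, which is why the quality of the intermediate memories matters and not only the final answer.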

Code & Models

Datasets

LCM-Lab/MemRewardBench
dataset · 60 downloads

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Topic Modeling · Personal Information Management and User Behavior