RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction

Haonan Bian; Zhiyuan Yao; Sen Hu; Zishan Xu; Shaolei Zhang; Yifu Guo; Ziliang Yang; Xueran Han; Huacan Wang; Ronghao Chen

arXiv:2601.06966·cs.CL·January 13, 2026

Open Access

TL;DR

RealMem introduces a comprehensive benchmark for evaluating large language models' ability to manage long-term, project-oriented memory interactions in realistic scenarios, highlighting current system limitations.

Contribution

This paper presents the first benchmark grounded in realistic project scenarios, with a synthesis pipeline for simulating dynamic memory evolution in LLMs.

Findings

01. Current memory systems struggle to manage long-term project state.

02. RealMem includes over 2,000 cross-session dialogues across eleven scenarios.

03. Experiments show that current memory systems have significant difficulty handling dynamic context dependencies.

Abstract

As Large Language Models (LLMs) evolve from static dialogue interfaces to autonomous general agents, effective memory is paramount to ensuring long-term consistency. However, existing benchmarks primarily focus on casual conversation or task-oriented dialogue, failing to capture **"long-term project-oriented"** interactions where agents must track evolving goals. To bridge this gap, we introduce **RealMem**, the first benchmark grounded in realistic project scenarios. RealMem comprises over 2,000 cross-session dialogues across eleven scenarios, utilizing natural user queries for evaluation. We propose a synthesis pipeline that integrates Project Foundation Construction, Multi-Agent Dialogue Generation, and Memory and Schedule Management to simulate the dynamic evolution of memory. Experiments reveal that current memory systems face significant challenges in managing the long-term…
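
To make the idea of tracking evolving goals across sessions concrete, here is a minimal Python sketch. It is not the paper's synthesis pipeline or its evaluation code. The `ProjectMemory` class, the `build_memory` helper, and the trip-planning facts are all hypothetical, chosen only to illustrate why a later session's update should override an earlier one when a query arrives.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProjectMemory:
    """Toy project-state store (hypothetical, not from the paper).

    Maps each fact key to (session_id, value), keeping the latest session's value.
    """
    state: dict = field(default_factory=dict)

    def update(self, session_id: int, key: str, value: str) -> None:
        # Overwrite only if this session is at least as recent as the stored one.
        current = self.state.get(key)
        if current is None or session_id >= current[0]:
            self.state[key] = (session_id, value)

    def recall(self, key: str) -> Optional[str]:
        # Return the most recent value for a key, or None if it was never stated.
        entry = self.state.get(key)
        return entry[1] if entry is not None else None


def build_memory(sessions: list) -> ProjectMemory:
    """Replay sessions in order; session ids start at 1 (an assumption)."""
    memory = ProjectMemory()
    for session_id, facts in enumerate(sessions, start=1):
        for key, value in facts.items():
            memory.update(session_id, key, value)
    return memory


# Illustrative data only: a project whose budget changes in a later session.
sessions = [
    {"destination": "Kyoto", "budget": "2000 USD"},
    {"budget": "2500 USD"},
    {"departure": "May 3"},
]
memory = build_memory(sessions)
print(memory.recall("budget"))
```

A query about the budget after the third session should reflect the second session's revision rather than the original figure. Real memory systems face a much harder version of this, since facts arrive as free-form dialogue rather than as clean key-value pairs.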

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Speech and dialogue systems