LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

Di Wu; Hongwei Wang; Wenhao Yu; Yuwei Zhang; Kai-Wei Chang; Dong Yu

arXiv:2410.10813 · cs.CL · March 6, 2025 · 3 cites

TL;DR

This paper introduces LongMemEval, a benchmark for evaluating long-term memory in chat assistants, revealing significant performance gaps and proposing optimized memory design strategies to enhance recall and reasoning over sustained interactions.

Contribution

We present LongMemEval, a comprehensive benchmark for long-term memory in chat assistants, and propose a unified framework with design optimizations to improve memory performance.

Findings

01. Commercial chat assistants and long-context LLMs show a 30% accuracy drop when memorizing information across sustained interactions.

02. Memory optimizations significantly improve recall and question answering.

03. The benchmark provides a challenging testbed for future memory system improvements.

Abstract

Recent large language model (LLM)-driven chat assistant systems have integrated memory components to track user-assistant chat histories, enabling more accurate and personalized responses. However, their long-term memory capabilities in sustained interactions remain underexplored. We introduce LongMemEval, a comprehensive benchmark designed to evaluate five core long-term memory abilities of chat assistants: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. With 500 meticulously curated questions embedded within freely scalable user-assistant chat histories, LongMemEval presents a significant challenge to existing long-term memory systems, with commercial chat assistants and long-context LLMs showing a 30% accuracy drop on memorizing information across sustained interactions. We then present a unified framework that breaks down the…

Code & Models

Repositories

xiaowu0162/longmemeval
PyTorch · Official

Datasets

kellyhongg/cleaned-longmemeval-s
dataset · 33 downloads

Videos

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory · SlidesLive

Taxonomy

Topics: Recommender Systems and Techniques · Personal Information Management and User Behavior · AI in Service Interactions