Evaluating Long-Horizon Memory for Multi-Party Collaborative Dialogues

Chuanrui Hu; Tong Li; Xingze Gao; Hongda Chen; Yi Bai; Dannong Xu; Tianwei Lin; Xiaohong Li; Yunyun Han; Jian Pei; and Yafeng Deng

arXiv:2602.01313 · cs.CL · March 12, 2026

TL;DR

This paper introduces EverMemBench, a novel benchmark for evaluating long-term memory in multi-party collaborative dialogues, highlighting current limitations and guiding future development of more capable LLM memory systems.

Contribution

The paper presents EverMemBench, the first benchmark specifically designed for long-horizon multi-party collaborative memory evaluation, with comprehensive multi-dimensional assessment.

Findings

01. Current systems struggle with multi-hop reasoning in multi-party contexts (26% accuracy).

02. Temporal reasoning requires explicit version semantics beyond timestamps.

03. Memory awareness is limited by retrieval methods missing implicit relevance.
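
The second finding can be made concrete with a small sketch. The example below is a hypothetical illustration, not EverMemBench's data format or the method of any evaluated system: the `MemoryRecord` class and its `supersedes` and `is_decision` fields are assumed names introduced here. It shows one way a timestamp-only lookup could return a stale value when a later message merely recalls an old decision, while a lookup that follows explicit revision links returns the current one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MemoryRecord:
    """Hypothetical memory entry; all field names are assumptions."""
    record_id: str
    topic: str
    value: str
    timestamp: int                    # message time in arbitrary units
    supersedes: Optional[str] = None  # id of the record this one revises
    is_decision: bool = True          # False for recaps or passing mentions


def latest_by_timestamp(records: List[MemoryRecord], topic: str) -> str:
    """Timestamp-only baseline: return the most recent mention of the topic."""
    candidates = [r for r in records if r.topic == topic]
    return max(candidates, key=lambda r: r.timestamp).value


def current_by_version(records: List[MemoryRecord], topic: str) -> str:
    """Version-aware lookup: return the newest decision that no other decision revises."""
    decisions = [r for r in records if r.topic == topic and r.is_decision]
    superseded = {r.supersedes for r in decisions if r.supersedes is not None}
    heads = [r for r in decisions if r.record_id not in superseded]
    return max(heads, key=lambda r: r.timestamp).value


# Toy conversation: a decision, a revision, then a later recap of the old plan.
RECORDS = [
    MemoryRecord("r1", "launch_date", "May 1", timestamp=10),
    MemoryRecord("r2", "launch_date", "May 15", timestamp=20, supersedes="r1"),
    MemoryRecord("r3", "launch_date", "May 1", timestamp=30, is_decision=False),
]

print(latest_by_timestamp(RECORDS, "launch_date"))  # May 1 (stale: picks the recap)
print(current_by_version(RECORDS, "launch_date"))   # May 15 (follows the revision link)
```

In this toy setup the recap at time 30 is the newest message, so recency alone picks the outdated date; marking which records are decisions and which ones they revise is what lets the second lookup recover the current value.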

Abstract

Long-term conversational memory in practical LLM applications is inherently collaborative: information is produced by multiple participants, scattered across groups and channels, revised over time, and implicitly grounded in roles and social context. Yet there is currently no established benchmark that evaluates memory under interaction patterns resembling real-world deployment, as existing benchmarks largely focus on dyadic or single-topic dialogues. In this paper, we introduce EverMemBench, the first benchmark designed for long-horizon collaborative memory, built from multi-party, multi-group conversations spanning over one million tokens with dense cross-topic interleaving, temporally evolving decisions, and role-conditioned personas. EverMemBench evaluates memory systems using 2400 QA pairs across three dimensions essential for real applications: fine-grained recall, memory…

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Persona Design and Applications