GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
Jingbo Yang, Kwei-Herng Lai, Xiaowen Wang, Shiyu Chang, Yaar Harari, Evgeniy Gabrilovich

TL;DR
This paper introduces GroupMemBench, a comprehensive benchmark for evaluating memory systems of LLM agents in multi-party conversations, revealing significant gaps in current memory capabilities.
Contribution
It presents a novel benchmark that captures group dynamics, speaker-grounded belief tracking, and audience-adapted language, filling gaps in existing single-user focused benchmarks.
Findings
Leading memory systems achieve only 46.0% accuracy
Knowledge update accuracy is 27.1%
A simple BM25 baseline outperforms most memory systems
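For readers unfamiliar with the BM25 baseline named in the findings, the following is a minimal sketch of Okapi BM25 ranking over speaker-tagged chat messages. It is not the paper's implementation: the class name `BM25`, the whitespace tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, and the sample messages are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization, for illustration only.
    return text.lower().split()


class BM25:
    """Hypothetical minimal Okapi BM25 ranker over a list of messages."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.docs = [tokenize(d) for d in docs]
        self.n = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.n
        self.tf = [Counter(d) for d in self.docs]
        # Document frequency: number of messages containing each term.
        df = Counter()
        for d in self.docs:
            df.update(set(d))
        # Smoothed IDF variant that stays non-negative.
        self.idf = {
            t: math.log(1 + (self.n - f + 0.5) / (f + 0.5)) for t, f in df.items()
        }

    def score(self, query, i):
        # Sum per-term BM25 contributions for message i.
        dl = len(self.docs[i])
        total = 0.0
        for t in tokenize(query):
            f = self.tf[i].get(t, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[t] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=1):
        # Indices of the k highest-scoring messages.
        ranked = sorted(range(self.n), key=lambda i: self.score(query, i), reverse=True)
        return ranked[:k]


# Hypothetical multi-party log; the speaker name is kept as part of the text.
messages = [
    "alice: the launch date is march 3",
    "bob: lunch is at noon today",
    "alice: correction, the launch date moved to april 10",
]
index = BM25(messages)
best = index.top_k("launch date april", k=1)
```

In this toy log the query shares terms with both of Alice's messages, and the rarer term "april" pushes the corrected message to the top. A purely lexical ranker like this has no notion of which statement supersedes which, so it can only surface candidates for a downstream model to reconcile.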
Abstract
Large Language Model (LLM) agents increasingly serve as personal assistants and workplace collaborators, where their utility depends on memory systems that extract, retrieve, and apply information across long-running conversations. However, both existing memory systems and benchmarks are built around the dyadic, single-user setup, even though real deployments routinely span groups and channels in which multiple users interact with the agent and with each other. This mismatch leaves three properties of group memory unmeasured: (i) group dynamics that go beyond concatenated one-on-one chats, (ii) speaker-grounded belief tracking, where per-user memory modeling is needed, and (iii) audience-adapted language, where Theory-of-Mind shifts produce role-specific vocabulary. We introduce GroupMemBench, a benchmark that exposes all three. A graph-grounded synthesis pipeline produces…
