ConvoMem Benchmark: Why Your First 150 Conversations Don't Need RAG
Egor Pakhomov, Erik Nijkamp, Caiming Xiong

TL;DR
This paper introduces a large benchmark for evaluating conversational memory, revealing that simple full-context methods outperform complex RAG systems in early conversations, and highlighting practical transition points for different approaches.
Contribution
The work provides a comprehensive benchmark for conversational memory evaluation and analyzes the effectiveness of simple versus RAG-based methods across conversation lengths.
Findings
Simple full-context approaches achieve 70-82% accuracy on early conversations.
RAG-based systems like Mem0 reach only 30-45% accuracy on short histories.
Long context methods are most effective within the first 30 to 150 conversations.
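The transition points in the findings above can be read as a simple routing rule keyed on how many conversations a user has accumulated. The following is only a minimal sketch under stated assumptions, not the authors' method: the names `MemoryPolicy` and `choose_memory_strategy` and the strategy labels are hypothetical, the thresholds 30 and 150 are taken from the range quoted above, and the behavior past 150 conversations (falling back to retrieval) is an assumption suggested by the paper's title.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryPolicy:
    # Hypothetical thresholds, read off the "first 30 to 150 conversations" finding.
    full_context_limit: int = 30
    long_context_limit: int = 150


def choose_memory_strategy(num_conversations: int,
                           policy: MemoryPolicy = MemoryPolicy()) -> str:
    """Pick an illustrative memory strategy label from the history size."""
    if num_conversations < 0:
        raise ValueError("num_conversations must be non-negative")
    if num_conversations <= policy.full_context_limit:
        # Short history: place every past conversation in the prompt.
        return "full_context"
    if num_conversations <= policy.long_context_limit:
        # Still within the long-context range, but cost and latency grow
        # with history size (assumed trade-off).
        return "full_context_with_tradeoffs"
    # Assumption: past the quoted range, a retrieval-based approach is considered.
    return "retrieval"


if __name__ == "__main__":
    for n in (5, 30, 80, 150, 400):
        print(n, choose_memory_strategy(n))
```

Keeping the thresholds in a small configuration object makes it easy to adjust them for a different model, context window, or latency budget rather than treating 30 and 150 as fixed constants.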
Abstract
We introduce a comprehensive benchmark for conversational memory evaluation containing 75,336 question-answer pairs across diverse categories including user facts, assistant recall, abstention, preferences, temporal changes, and implicit connections. While existing benchmarks have advanced the field, our work addresses fundamental challenges in statistical power, data generation consistency, and evaluation flexibility that limit current memory evaluation frameworks. We examine the relationship between conversational memory and retrieval-augmented generation (RAG). While these systems share fundamental architectural patterns (temporal reasoning, implicit extraction, knowledge updates, and graph representations), memory systems have a unique characteristic: they start from zero and grow progressively with each conversation. This characteristic enables naive approaches that would be…
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Personal Information Management and User Behavior
