ConvoMem Benchmark: Why Your First 150 Conversations Don't Need RAG

Egor Pakhomov; Erik Nijkamp; Caiming Xiong

arXiv:2511.10523·cs.CL·November 14, 2025

TL;DR

This paper introduces a large benchmark (75,336 question-answer pairs) for evaluating conversational memory. It finds that simple full-context methods outperform more complex RAG-based memory systems while conversation histories are still short, and it identifies practical transition points between the two approaches as histories grow.

Contribution

The work provides a comprehensive benchmark for conversational memory evaluation and analyzes the effectiveness of simple versus RAG-based methods across conversation lengths.

Findings

01 Simple full-context approaches achieve 70-82% accuracy on early conversations.

02 RAG-based memory systems such as Mem0 reach only 30-45% accuracy on short histories.

03 Long-context methods are most effective within roughly the first 30 to 150 conversations.
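
The third finding can be read as a rule of thumb for choosing a memory approach from the number of stored conversations. The sketch below is illustrative only and is not code from the paper: the function name, the enum, and the way the 30 to 150 range is split into a "comfortable" zone and a "watch the cost" zone are assumptions made here. Only the numbers 30 and 150 come from the findings above.

```python
from enum import Enum


class MemoryStrategy(Enum):
    """Hypothetical labels for the approaches compared on this page."""
    FULL_CONTEXT = "full_context"                        # whole history in the prompt
    FULL_CONTEXT_WATCH_COST = "full_context_watch_cost"  # still full context, monitor cost
    RETRIEVAL = "retrieval"                              # RAG-style memory system


def choose_memory_strategy(
    num_conversations: int,
    lower_bound: int = 30,    # start of the 30-150 range quoted above
    upper_bound: int = 150,   # end of the 30-150 range quoted above
) -> MemoryStrategy:
    """Pick a strategy from the number of past conversations.

    Assumption: treating the 30-150 range as a middle zone is an
    interpretation for illustration, not a rule stated by the authors.
    """
    if num_conversations < 0:
        raise ValueError("num_conversations must be non-negative")
    if num_conversations <= lower_bound:
        return MemoryStrategy.FULL_CONTEXT
    if num_conversations <= upper_bound:
        return MemoryStrategy.FULL_CONTEXT_WATCH_COST
    return MemoryStrategy.RETRIEVAL
```

For example, `choose_memory_strategy(12)` returns `MemoryStrategy.FULL_CONTEXT`, while `choose_memory_strategy(400)` returns `MemoryStrategy.RETRIEVAL`. The thresholds are parameters so they can be tuned to a given model's context window and cost budget.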

Abstract

We introduce a comprehensive benchmark for conversational memory evaluation containing 75,336 question-answer pairs across diverse categories including user facts, assistant recall, abstention, preferences, temporal changes, and implicit connections. While existing benchmarks have advanced the field, our work addresses fundamental challenges in statistical power, data generation consistency, and evaluation flexibility that limit current memory evaluation frameworks. We examine the relationship between conversational memory and retrieval-augmented generation (RAG). While these systems share fundamental architectural patterns--temporal reasoning, implicit extraction, knowledge updates, and graph representations--memory systems have a unique characteristic: they start from zero and grow progressively with each conversation. This characteristic enables naive approaches that would be…

Code & Models

Datasets

Salesforce/ConvoMem (dataset, 673 downloads)

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Personal Information Management and User Behavior