Evaluating Very Long-Term Conversational Memory of LLM Agents
Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, Yuwei Fang

TL;DR
This paper introduces LoCoMo, a dataset of very long-term dialogues averaging 300 turns across up to 35 sessions, and evaluates how well LLMs maintain long-term memory and coherence in extended conversations.
Contribution
It presents a machine-human pipeline for generating and annotating very long-term dialogues, uses that pipeline to create the LoCoMo dataset, and establishes a benchmark for evaluating long-term memory in LLMs.
Findings
LLMs struggle with understanding lengthy conversations.
Strategies like long-context models or RAG improve performance.
Models still lag behind human capabilities in long-term dialogue understanding.
Abstract
Existing works on long-term open-domain dialogues focus on evaluating model responses within contexts spanning no more than five chat sessions. Despite advancements in long-context large language models (LLMs) and retrieval augmented generation (RAG) techniques, their efficacy in very long-term dialogues remains unexplored. To address this research gap, we introduce a machine-human pipeline to generate high-quality, very long-term dialogues by leveraging LLM-based agent architectures and grounding their dialogues on personas and temporal event graphs. Moreover, we equip each agent with the capability of sharing and reacting to images. The generated conversations are verified and edited by human annotators for long-range consistency and grounding to the event graphs. Using this pipeline, we collect LoCoMo, a dataset of very long-term conversations, each encompassing 300 turns and 9K…
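The abstract says each agent's dialogue is grounded on a persona and a temporal event graph. As a rough illustration only, such a graph could be represented as dated events with causal links. The class name, field names, and sample events below are hypothetical and are not taken from the LoCoMo release.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Event:
    # Hypothetical schema: these field names are assumptions, not LoCoMo's format.
    event_id: str
    day: date
    description: str
    caused_by: list[str] = field(default_factory=list)  # ids of earlier events


def timeline(events: list[Event]) -> list[Event]:
    """Return the events sorted by date, earliest first."""
    return sorted(events, key=lambda e: e.day)


def causes_precede_effects(events: list[Event]) -> bool:
    """Check that no event is dated before an event it is caused by."""
    by_id = {e.event_id: e for e in events}
    return all(
        by_id[cause_id].day <= e.day
        for e in events
        for cause_id in e.caused_by
    )


# Made-up example events for one speaker.
events = [
    Event("e2", date(2023, 5, 10), "Starts a new job", caused_by=["e1"]),
    Event("e1", date(2023, 5, 1), "Passes a job interview"),
]
ordered_ids = [e.event_id for e in timeline(events)]
is_consistent = causes_precede_effects(events)
```

A check like `causes_precede_effects` is one simple way to test the kind of long-range consistency that the abstract says human annotators verify. The paper's actual procedure may differ.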
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · AI in Service Interactions
Methods: Attention Is All You Need · WordPiece · Residual Connection · Linear Layer · Byte Pair Encoding · Weight Decay · Dropout · Multi-Head Attention · Linear Warmup With Linear Decay
