Evaluating Very Long-Term Conversational Memory of LLM Agents

Adyasha Maharana; Dong-Ho Lee; Sergey Tulyakov; Mohit Bansal; Francesco Barbieri; Yuwei Fang

arXiv:2402.17753 · cs.CL · February 28, 2024 · 3 cites

TL;DR

This paper introduces LoCoMo, a dataset of very long-term dialogues averaging 300 turns across up to 35 sessions, and evaluates how well LLMs maintain long-term memory and coherence in extended conversations.

Contribution

It presents a machine-human pipeline for generating and annotating very long-term dialogues, uses it to create the LoCoMo dataset, and establishes benchmarks for evaluating long-term memory in LLMs.

Findings

01

LLMs struggle with understanding lengthy conversations.

02

Strategies like long-context models or RAG improve performance.

03

Models still lag behind human capabilities in long-term dialogue understanding.

Abstract

Existing works on long-term open-domain dialogues focus on evaluating model responses within contexts spanning no more than five chat sessions. Despite advancements in long-context large language models (LLMs) and retrieval augmented generation (RAG) techniques, their efficacy in very long-term dialogues remains unexplored. To address this research gap, we introduce a machine-human pipeline to generate high-quality, very long-term dialogues by leveraging LLM-based agent architectures and grounding their dialogues on personas and temporal event graphs. Moreover, we equip each agent with the capability of sharing and reacting to images. The generated conversations are verified and edited by human annotators for long-range consistency and grounding to the event graphs. Using this pipeline, we collect LoCoMo, a dataset of very long-term conversations, each encompassing 300 turns and 9K…

Code & Models

Repositories

danny911kr/REALTALK

Videos

Evaluating Very Long-Term Conversational Memory of LLM Agents · underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · AI in Service Interactions

Methods: Attention Is All You Need · WordPiece · Residual Connection · Linear Layer · Byte Pair Encoding · Weight Decay · Dropout · Multi-Head Attention · Linear Warmup With Linear Decay