MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems
Yannis Katsis, Sara Rosenthal, Kshitij Fadnis, Chulaka Gunasekara,, Young-Suk Lee, Lucian Popa, Vraj Shah, Huaiyu Zhu, Danish Contractor, Marina, Danilevsky

TL;DR
MTRAG is a comprehensive multi-turn RAG benchmark that evaluates LLMs' retrieval and generation capabilities across diverse, realistic conversational scenarios, revealing current systems' limitations.
Contribution
This paper introduces MTRAG, a new multi-turn RAG benchmark with human and automatic evaluation methods, highlighting the challenges faced by state-of-the-art LLM RAG systems.
Findings
State-of-the-art LLM RAG systems struggle on MTRAG.
Strong retrieval and generation are needed for multi-turn conversations.
MTRAG covers multiple domains and complex question types.
Abstract
Retrieval-augmented generation (RAG) has recently become a very popular task for Large Language Models (LLMs). Evaluating them on multi-turn RAG conversations, where the system is asked to generate a response to a question in the context of a preceding conversation is an important and often overlooked task with several additional challenges. We present MTRAG: an end-to-end human-generated multi-turn RAG benchmark that reflects several real-world properties across diverse dimensions for evaluating the full RAG pipeline. MTRAG contains 110 conversations averaging 7.7 turns each across four domains for a total of 842 tasks. We also explore automation paths via synthetic data and LLM-as-a-Judge evaluation. Our human and automatic evaluations show that even state-of-the-art LLM RAG systems struggle on MTRAG. We demonstrate the need for strong retrieval and generation systems that can handle…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsSpeech and dialogue systems · Power Systems and Technologies · Topic Modeling
MethodsRefunds@Expedia|||How do I get a full refund from Expedia? · Attention Is All You Need · Layer Normalization · Dense Connections · Linear Warmup With Linear Decay · WordPiece · Attention Dropout · Adam · Residual Connection · Dropout
