MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems

Yannis Katsis; Sara Rosenthal; Kshitij Fadnis; Chulaka Gunasekara; Young-Suk Lee; Lucian Popa; Vraj Shah; Huaiyu Zhu; Danish Contractor; Marina Danilevsky

arXiv:2501.03468·cs.CL·January 8, 2025·2 cites

TL;DR

MTRAG is a comprehensive multi-turn RAG benchmark that evaluates LLMs' retrieval and generation capabilities across diverse, realistic conversational scenarios, revealing current systems' limitations.

Contribution

This paper introduces MTRAG, a new multi-turn RAG benchmark with human and automatic evaluation methods, highlighting the challenges faced by state-of-the-art LLM RAG systems.

Findings

01

State-of-the-art LLM RAG systems struggle on MTRAG.

02

Strong retrieval and generation are needed for multi-turn conversations.

03

MTRAG covers multiple domains and complex question types.

Abstract

Retrieval-augmented generation (RAG) has recently become a very popular task for Large Language Models (LLMs). Evaluating them on multi-turn RAG conversations, where the system is asked to generate a response to a question in the context of a preceding conversation, is an important and often overlooked task with several additional challenges. We present MTRAG: an end-to-end human-generated multi-turn RAG benchmark that reflects several real-world properties across diverse dimensions for evaluating the full RAG pipeline. MTRAG contains 110 conversations averaging 7.7 turns each across four domains for a total of 842 tasks. We also explore automation paths via synthetic data and LLM-as-a-Judge evaluation. Our human and automatic evaluations show that even state-of-the-art LLM RAG systems struggle on MTRAG. We demonstrate the need for strong retrieval and generation systems that can handle…
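
The relation between conversations, turns, and tasks in the abstract can be illustrated with a minimal sketch. It assumes that every turn of a conversation becomes one task whose context is all preceding turns; the `Task` class, the `conversation_to_tasks` helper, and the placeholder conversation are hypothetical and are not taken from the benchmark's actual data format.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """One evaluation task: the conversation so far plus the question to answer."""
    history: list[tuple[str, str]]  # earlier (question, answer) pairs
    question: str                   # the current user question


def conversation_to_tasks(turns: list[tuple[str, str]]) -> list[Task]:
    """Expand one conversation into one task per turn.

    Assumption: each turn yields exactly one task whose context is every
    preceding turn. The benchmark's real format may differ.
    """
    return [
        Task(history=turns[:i], question=question)
        for i, (question, _answer) in enumerate(turns)
    ]


# Hypothetical three-turn conversation with placeholder text.
conversation = [("q1", "a1"), ("q2", "a2"), ("q3", "a3")]

tasks = conversation_to_tasks(conversation)
for task in tasks:
    print(len(task.history), task.question)
```

Under this assumption the reported figures are consistent with each other: 842 tasks over 110 conversations is about 7.65 turns per conversation, which rounds to the stated 7.7.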

Code & Models

Repositories

ibm/mt-rag-benchmark
Official

Models

Videos

MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems

Taxonomy

Topics: Speech and dialogue systems · Power Systems and Technologies · Topic Modeling

Methods: Attention Is All You Need · Layer Normalization · Dense Connections · Linear Warmup With Linear Decay · WordPiece · Attention Dropout · Adam · Residual Connection · Dropout