MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations

Sara Rosenthal; Yannis Katsis; Vraj Shah; Lihong He; Lucian Popa; Marina Danilevsky

arXiv:2602.23184·cs.CL·February 27, 2026

Open Access

TL;DR

MTRAG-UN is a benchmark for multi-turn retrieval augmented generation that targets conversations where current retrieval and generation models still struggle: unanswerable, underspecified, and non-standalone questions, and unclear responses.

Contribution

This paper introduces MTRAG-UN, a new benchmark of 666 tasks containing over 2,800 conversation turns across 6 domains, with accompanying corpora, for evaluating multi-turn RAG models on challenging conversation types.

Findings

01 Models struggle with unanswerable questions

02 Models have difficulty with underspecified queries

03 Models perform poorly on conversations with unclear responses

Abstract

We present MTRAG-UN, a benchmark for exploring open challenges in multi-turn retrieval augmented generation, a popular use of large language models. We release a benchmark of 666 tasks containing over 2,800 conversation turns across 6 domains with accompanying corpora. Our experiments show that retrieval and generation models continue to struggle on conversations with UNanswerable, UNderspecified, and NONstandalone questions and UNclear responses. Our benchmark is available at https://github.com/IBM/mt-rag-benchmark

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Natural Language Processing Techniques