MTR-Suite: A Framework for Evaluating and Synthesizing Conversational Retrieval Benchmarks

Junhao Ruan; Abudukeyumu Abudula; Bei Li; Yongjing Yin; Xinyu Liu; Kechen Jiao; Xin Chen; Jingang Wang; Xunliang Cai; Tong Xiao; Jingbo Zhu

arXiv:2605.20729·cs.CL·May 21, 2026


TL;DR

MTR-Suite is a unified framework for auditing, synthesizing, and benchmarking conversational retrieval. It targets two weaknesses of existing benchmarks, costly and sparse human annotation and rigid, unnatural automated heuristics, with an LLM-based auditor, a multi-agent dialogue generation pipeline, and a new general-domain benchmark.

Contribution

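It presents three components: MTR-Eval, an LLM-based auditor that quantifies alignment gaps in earlier benchmarks; MTR-Pipeline, a multi-agent system that synthesizes high-fidelity dialogues; and MTR-Bench, a general-domain benchmark. Together they support assessing and creating conversational retrieval benchmarks with reduced human effort.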

Findings

01 MTR-Pipeline generates high-fidelity dialogues at 1/400th of the cost of human annotation.

02 MTR-Bench mimics production-style challenges (hard topic switching, verbosity) and offers superior discriminative power.

03 Code and data are publicly available for research use.

Abstract

Accurate evaluation of conversational retrieval is pivotal for advancing Retrieval-Augmented Generation (RAG) systems. However, existing conversational retrieval benchmarks suffer from costly, sparse human annotation or rigid, unnatural automated heuristics. To address these challenges, we introduce MTR-Suite, a unified framework for auditing, synthesizing, and benchmarking retrieval. It features: (1) MTR-Eval, an LLM-based auditor quantifying alignment gaps in previous benchmarks; (2) MTR-Pipeline, a multi-agent system using greedy traversal clustering to generate high-fidelity dialogues at 1/400th human cost; and (3) MTR-Bench, a rigorous general-domain benchmark. MTR-Bench mimics production-style challenges (hard topic switching, verbosity), offering superior discriminative power. We make our code and data publicly available to facilitate future research at…
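To make "evaluating conversational retrieval" concrete, the following is a minimal sketch of a generic per-turn Recall@k computation over multi-turn dialogues. It is not MTR-Eval or the MTR-Bench protocol, whose details are not given in the abstract. The data layout (`ranked`, `relevant`), the function names, and the toy passage IDs are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set

# Hypothetical data layout (an assumption, not from the paper):
# a dialogue is a list of turns; each turn holds the retriever's ranked
# passage IDs and the set of gold-relevant passage IDs for that turn.
Turn = Dict[str, object]


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant passages that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def mean_recall_by_turn(dialogues: List[List[Turn]], k: int) -> Dict[int, float]:
    """Average Recall@k separately for each turn position (0 = first turn)."""
    scores: Dict[int, List[float]] = {}
    for dialogue in dialogues:
        for turn_index, turn in enumerate(dialogue):
            score = recall_at_k(turn["ranked"], turn["relevant"], k)
            scores.setdefault(turn_index, []).append(score)
    return {i: sum(s) / len(s) for i, s in sorted(scores.items())}


# Toy data: two dialogues, the first with two turns, the second with one.
toy_dialogues = [
    [
        {"ranked": ["p1", "p2", "p3"], "relevant": {"p1"}},
        {"ranked": ["p9", "p4", "p5"], "relevant": {"p4", "p7"}},
    ],
    [
        {"ranked": ["p2", "p6", "p8"], "relevant": {"p6"}},
    ],
]

if __name__ == "__main__":
    print(mean_recall_by_turn(toy_dialogues, k=2))  # {0: 1.0, 1: 0.5}
```

Reporting the score per turn position, rather than as a single average, is one simple way to see whether retrieval quality degrades in later turns, for example after a topic switch.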

Code & Models

Repositories

rangehow/mtr-suite (GitHub)
