TL;DR
MTR-Suite is a unified framework for auditing, synthesizing, and benchmarking conversational retrieval systems. It targets the costly, sparse human annotation and rigid automated heuristics of existing benchmarks with an LLM-based auditor, a multi-agent dialogue generator, and a new benchmark dataset.
Contribution
It presents MTR-Eval, MTR-Pipeline, and MTR-Bench, which together form a unified approach for auditing existing conversational retrieval benchmarks and creating high-fidelity new ones with reduced human effort.
Findings
MTR-Pipeline generates high-fidelity dialogues at 1/400th of the human annotation cost.
MTR-Bench mimics production-style challenges (hard topic switching, verbosity) and offers high discriminative power.
Code and data are publicly available for research use.
Abstract
Accurate evaluation of conversational retrieval is pivotal for advancing Retrieval-Augmented Generation (RAG) systems. However, existing conversational retrieval benchmarks suffer from costly, sparse human annotation or rigid, unnatural automated heuristics. To address these challenges, we introduce MTR-Suite, a unified framework for auditing, synthesizing, and benchmarking retrieval. It features: (1) MTR-Eval, an LLM-based auditor quantifying alignment gaps in previous benchmarks; (2) MTR-Pipeline, a multi-agent system using greedy traversal clustering to generate high-fidelity dialogues at 1/400th human cost; and (3) MTR-Bench, a rigorous general-domain benchmark. MTR-Bench mimics production-style challenges (hard topic switching, verbosity), offering superior discriminative power. We make our code and data publicly available to facilitate future research at…
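The abstract names "greedy traversal clustering" but does not describe the algorithm. As a rough illustration only, one generic reading is sketched below: start from an unvisited document, repeatedly hop to the most similar unvisited document, and close the cluster when similarity drops below a threshold. Everything here is an assumption rather than the authors' method: the function name `greedy_traversal_clusters`, the bag-of-words cosine similarity, and the `threshold` and `max_size` parameters are all hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def greedy_traversal_clusters(docs, threshold=0.3, max_size=4):
    """Group documents by greedily walking to the nearest unvisited neighbour.

    Hypothetical sketch: not taken from the MTR-Suite code base.
    Returns a list of clusters, each a list of document indices in visit order.
    """
    vecs = [Counter(d.lower().split()) for d in docs]
    unvisited = set(range(len(docs)))
    clusters = []
    while unvisited:
        # Seed a new cluster with the lowest-index unvisited document.
        current = min(unvisited)
        unvisited.remove(current)
        cluster = [current]
        while len(cluster) < max_size and unvisited:
            # Greedy step: pick the unvisited document closest to the current one.
            best = max(sorted(unvisited), key=lambda j: cosine(vecs[current], vecs[j]))
            if cosine(vecs[current], vecs[best]) < threshold:
                break  # nothing similar enough is left; close this cluster
            unvisited.remove(best)
            cluster.append(best)
            current = best
        clusters.append(cluster)
    return clusters


docs = [
    "solar panel efficiency",
    "solar panel installation cost",
    "sourdough bread recipe",
    "bread recipe with rye flour",
]
print(greedy_traversal_clusters(docs))  # [[0, 1], [2, 3]]
```

Under this reading, each cluster could seed one coherent stretch of a synthetic dialogue, and moving from one cluster to the next would correspond to a hard topic switch. Whether MTR-Pipeline works this way is not stated in the abstract.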
