MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation
Xiaoyuan Li, Keqin Bao, Yubo Ma, Moxin Li, Wenjie Wang, Rui Men, Yichang Zhang, Fuli Feng, Dayiheng Liu

TL;DR
MTR-Bench is a comprehensive, fully-automated benchmark designed to evaluate multi-turn reasoning capabilities of large language models across diverse, interactive tasks.
Contribution
It introduces a large-scale, multi-class benchmark for multi-turn reasoning with an automated evaluation framework, filling a critical gap in the assessment of interactive AI systems.
Findings
State-of-the-art models underperform on multi-turn reasoning tasks.
MTR-Bench covers 40 diverse reasoning tasks with 3600 instances.
Automated evaluation enables scalable, human-free assessment.
Abstract
Recent advances in Large Language Models (LLMs) have shown promising results in complex reasoning tasks. However, current evaluations predominantly focus on single-turn reasoning scenarios, leaving interactive tasks largely unexplored. We attribute this to the absence of comprehensive datasets and scalable automatic evaluation protocols. To fill these gaps, we present MTR-Bench for LLMs' Multi-Turn Reasoning evaluation. Comprising 4 classes, 40 tasks, and 3600 instances, MTR-Bench covers diverse reasoning capabilities and fine-grained difficulty levels, and requires multi-turn interactions with the environments. Moreover, MTR-Bench features a fully automated framework spanning both dataset construction and model evaluation, which enables scalable assessment without human intervention. Extensive experiments reveal that even the cutting-edge reasoning models fall short of multi-turn,…
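As an illustration of what "multi-turn interaction with an environment" and "automated evaluation" can mean in practice, the following is a minimal Python sketch. It is not the MTR-Bench framework or any of its tasks: the number-guessing environment, the names GuessNumberEnv, run_episode, binary_search_policy, and evaluate, and the 10-turn budget are all assumptions made for this example. The policy stands in for an LLM, and the environment scores each episode by rule, so no human judgment is needed.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class GuessNumberEnv:
    """Toy environment (hypothetical, not an MTR-Bench task): find a hidden integer."""
    secret: int
    low: int = 1
    high: int = 100
    max_turns: int = 10  # assumed turn budget for this sketch

    def initial_prompt(self) -> str:
        return f"Guess an integer between {self.low} and {self.high}."

    def step(self, action: str) -> Tuple[str, bool]:
        """Check one model action by rule and return (feedback, solved)."""
        try:
            guess = int(action.strip())
        except ValueError:
            return "Invalid format: reply with a single integer.", False
        if guess == self.secret:
            return "Correct.", True
        return ("Too low." if guess < self.secret else "Too high."), False


# A policy maps the dialogue history to the next action; an LLM call would go here.
Policy = Callable[[List[str]], str]


def run_episode(env: GuessNumberEnv, policy: Policy) -> Tuple[bool, int]:
    """Alternate policy actions and environment feedback until solved or out of turns."""
    history = [env.initial_prompt()]
    for turn in range(1, env.max_turns + 1):
        action = policy(history)
        feedback, solved = env.step(action)
        history.extend([action, feedback])
        if solved:
            return True, turn
    return False, env.max_turns


def binary_search_policy(history: List[str]) -> str:
    """Stand-in for a model: narrows the range using the feedback seen so far."""
    low, high = 1, 100
    # history layout: [prompt, action1, feedback1, action2, feedback2, ...]
    for action, feedback in zip(history[1::2], history[2::2]):
        guess = int(action)
        if feedback == "Too low.":
            low = guess + 1
        elif feedback == "Too high.":
            high = guess - 1
    return str((low + high) // 2)


def evaluate(secrets: List[int], policy: Policy) -> float:
    """Success rate over a set of automatically generated instances."""
    results = [run_episode(GuessNumberEnv(secret=s), policy) for s in secrets]
    return sum(solved for solved, _ in results) / len(results)


if __name__ == "__main__":
    print(run_episode(GuessNumberEnv(secret=37), binary_search_policy))
    print(evaluate([1, 37, 100], binary_search_policy))
```

In this sketch, difficulty could be varied by widening the range or shrinking max_turns, which loosely mirrors the idea of graded difficulty mentioned in the abstract; how MTR-Bench actually defines difficulty is not stated in this summary.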
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Focus
