MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation

Xiaoyuan Li; Keqin Bao; Yubo Ma; Moxin Li; Wenjie Wang; Rui Men; Yichang Zhang; Fuli Feng; Dayiheng Liu

arXiv:2505.17123 · cs.CL · May 22, 2025

TL;DR

MTR-Bench is a comprehensive, fully automated benchmark for evaluating the multi-turn reasoning capabilities of large language models across diverse interactive tasks.

Contribution

It introduces a large-scale, multi-class, multi-turn reasoning benchmark with an automated evaluation framework, filling a critical gap in the assessment of interactive AI systems.

Findings

01. State-of-the-art models underperform on multi-turn reasoning tasks.

02. MTR-Bench covers 40 diverse reasoning tasks with 3600 instances.

03. Automated evaluation enables scalable, human-free assessment.

Abstract

Recent advances in Large Language Models (LLMs) have shown promising results in complex reasoning tasks. However, current evaluations predominantly focus on single-turn reasoning scenarios, leaving interactive tasks largely unexplored. We attribute this to the absence of comprehensive datasets and scalable automatic evaluation protocols. To fill these gaps, we present MTR-Bench for LLMs' Multi-Turn Reasoning evaluation. Comprising 4 classes, 40 tasks, and 3600 instances, MTR-Bench covers diverse reasoning capabilities, offers fine-grained difficulty levels, and necessitates multi-turn interactions with the environments. Moreover, MTR-Bench features a fully automated framework spanning both dataset construction and model evaluation, which enables scalable assessment without human intervention. Extensive experiments reveal that even the cutting-edge reasoning models fall short of multi-turn,…
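
To make the idea of automated multi-turn evaluation concrete, here is a minimal sketch of an interaction loop in which a model acts, an environment replies with feedback, and success is scored programmatically. It is not the MTR-Bench framework: the toy number-guessing task, `GuessNumberEnv`, `binary_search_policy`, `run_episode`, and `evaluate` are all hypothetical names introduced purely for illustration, and the policy is a stand-in for an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# One past turn: (guess, feedback string returned by the environment).
History = List[Tuple[int, str]]


@dataclass
class GuessNumberEnv:
    """Toy interactive task (hypothetical; not a task from MTR-Bench)."""
    secret: int
    max_turns: int = 8

    def step(self, guess: int) -> Tuple[str, bool]:
        """Return (feedback, done) for a single guess."""
        if guess == self.secret:
            return "correct", True
        return ("higher" if guess < self.secret else "lower"), False


def binary_search_policy(history: History, low: int = 0, high: int = 100) -> int:
    """Stand-in for a model: narrows the range using all feedback so far."""
    for guess, feedback in history:
        if feedback == "higher":
            low = guess + 1
        elif feedback == "lower":
            high = guess - 1
    return (low + high) // 2


def run_episode(env: GuessNumberEnv,
                policy: Callable[[History], int]) -> Tuple[bool, int]:
    """Run one multi-turn episode; return (solved, turns_used)."""
    history: History = []
    for turn in range(1, env.max_turns + 1):
        guess = policy(history)
        feedback, done = env.step(guess)
        history.append((guess, feedback))
        if done:
            return True, turn
    return False, env.max_turns


def evaluate(secrets: Iterable[int], policy: Callable[[History], int]) -> float:
    """Success rate over a set of instances, computed with no human grading."""
    results = [run_episode(GuessNumberEnv(s), policy) for s in secrets]
    return sum(solved for solved, _ in results) / len(results)
```

For example, `run_episode(GuessNumberEnv(37), binary_search_policy)` solves the instance in three turns (guesses 50, 24, 37). The turn budget `max_turns` is one simple way a difficulty knob could be exposed in a setup like this; how MTR-Bench actually grades difficulty is not specified in the text above.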

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Focus