TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models

Yiran Zhang; Mo Wang; Xiaoyang Li; Kaixuan Ren; Chencheng Zhu; Usman Naseem

arXiv:2506.01341 · cs.CL · November 26, 2025

TL;DR

TurnBench-MS introduces a comprehensive benchmark for evaluating large language models' ability to perform multi-turn, multi-step reasoning through an interactive code-breaking game, highlighting current models' limitations in complex reasoning tasks.

Contribution

This paper presents TurnBench-MS, a novel benchmark with interactive, multi-turn tasks and annotated reasoning steps, designed to assess and improve multi-step reasoning in large language models.

Findings

01

State-of-the-art LLMs achieve 84% accuracy in Classic mode but only 18% in Nightmare mode.

02

Humans achieve 100% accuracy in both modes.

03

TurnBench-MS reveals significant gaps in current models' reasoning capabilities.

Abstract

Despite impressive advances in large language models (LLMs), existing benchmarks often focus on single-turn or single-step tasks, failing to capture the kind of iterative reasoning required in real-world settings. To address this limitation, we introduce TurnBench, a novel benchmark that evaluates multi-turn, multi-step reasoning through an interactive code-breaking task inspired by the "Turing Machine Board Game." In each episode, a model must uncover hidden logical or arithmetic rules by making sequential guesses, receiving structured feedback, and integrating clues across multiple rounds. This dynamic setup requires models to reason over time, adapt based on past information, and maintain consistency across steps, capabilities that are underexplored in current benchmarks. TurnBench includes two modes: Classic, which tests standard reasoning, and Nightmare, which introduces increased complexity…
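To make the episode structure in the abstract concrete, here is a minimal Python sketch of a guess, feedback, and elimination loop. It is an illustration under stated assumptions, not the benchmark's implementation: the three-digit code space, the `VERIFIERS` rule lists, `SECRET_RULES`, and every function name below are hypothetical and do not come from the paper.

```python
from itertools import product

# Assumed code space (hypothetical): three digits, each from 1 to 5.
CODES = list(product(range(1, 6), repeat=3))

# Hypothetical verifiers. Each lists candidate rules the player can see;
# exactly one rule per verifier is secretly active.
VERIFIERS = [
    [lambda c: c[0] < c[1], lambda c: c[0] == c[1], lambda c: c[0] > c[1]],
    [lambda c: sum(c) < 8, lambda c: sum(c) == 8, lambda c: sum(c) > 8],
    [lambda c: c[2] % 2 == 0, lambda c: c[2] % 2 == 1],
]

# Index of the active rule in each verifier (hidden from the solver).
SECRET_RULES = (0, 1, 0)


def predict(hypothesis, guess):
    """Pass/fail flags the verifiers would return if `hypothesis` were true."""
    return tuple(VERIFIERS[i][r](guess) for i, r in enumerate(hypothesis))


def get_feedback(guess):
    """Structured feedback for a guess: one pass/fail flag per verifier."""
    return predict(SECRET_RULES, guess)


def play_episode(max_rounds=10):
    """Guess, read feedback, and prune hypotheses until one remains."""
    hypotheses = list(product(*(range(len(v)) for v in VERIFIERS)))
    history = []
    for _ in range(max_rounds):
        if len(hypotheses) == 1:
            return hypotheses[0], history
        # Pick the guess whose feedback best separates the remaining hypotheses.
        guess = max(CODES, key=lambda c: len({predict(h, c) for h in hypotheses}))
        fb = get_feedback(guess)
        history.append((guess, fb))
        # Integrate the new clue: keep only hypotheses consistent with it.
        hypotheses = [h for h in hypotheses if predict(h, guess) == fb]
    return (hypotheses[0] if len(hypotheses) == 1 else None), history


solved, history = play_episode()
for guess, fb in history:
    print(guess, fb)
print("recovered rules:", solved)
```

The solver here is a brute-force hypothesis filter, chosen only because it makes the "integrate clues across rounds" step explicit. A language model playing the task has to maintain the equivalent state implicitly across turns.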

Taxonomy

Topics: Natural Language Processing Techniques