Multi-Turn Puzzles: Evaluating Interactive Reasoning and Strategic Dialogue in LLMs

Kartikeya Badola; Jonathan Simon; Arian Hosseini; Sara Marie Mc Carthy; Tsendsuren Munkhdalai; Abhimanyu Goyal; Tomáš Kočiský; Shyam Upadhyay; Bahare Fatemi; Mehran Kazemi

arXiv:2508.10142 · cs.CL · August 26, 2025

TL;DR

This paper introduces a new benchmark for evaluating large language models on multi-turn reasoning, dialogue, and information-seeking tasks, highlighting current limitations and guiding future improvements.

Contribution

The paper presents a novel, deterministic benchmark for testing LLMs on complex interactive reasoning tasks, addressing a gap in existing evaluation methods.

Findings

1. Most models struggle with instruction following and reasoning.
2. Significant errors are due to poor planning and incomplete data handling.
3. Current models show substantial room for improvement in interactive scenarios.

Abstract

Large language models (LLMs) excel at solving problems with clear and complete statements, but often struggle with nuanced environments or interactive tasks which are common in most real-world scenarios. This highlights the critical need for developing LLMs that can effectively engage in logically consistent multi-turn dialogue, seek information and reason with incomplete data. To this end, we introduce a novel benchmark comprising a suite of multi-turn tasks each designed to test specific reasoning, interactive dialogue, and information-seeking abilities. These tasks have deterministic scoring mechanisms, thus eliminating the need for human intervention. Evaluating frontier models on our benchmark reveals significant headroom. Our analysis shows that most errors emerge from poor instruction following, reasoning failures, and poor planning. This benchmark provides valuable insights into…

Code & Models

Datasets

arianhosseini/mt_puzzles
dataset · 8 downloads
