Multi-Turn Puzzles: Evaluating Interactive Reasoning and Strategic Dialogue in LLMs
Kartikeya Badola, Jonathan Simon, Arian Hosseini, Sara Marie Mc Carthy, Tsendsuren Munkhdalai, Abhimanyu Goyal, Tomáš Kočiský, Shyam Upadhyay, Bahare Fatemi, Mehran Kazemi

TL;DR
This paper introduces a new benchmark for evaluating large language models on multi-turn reasoning, dialogue, and information-seeking tasks, highlighting current limitations and guiding future improvements.
Contribution
The paper presents a novel benchmark of multi-turn tasks with deterministic scoring for testing LLMs on interactive reasoning, dialogue, and information seeking, addressing a gap in existing evaluation methods.
Findings
Most models struggle with instruction following and reasoning.
Many errors stem from poor planning and from difficulty reasoning with incomplete data.
Frontier models leave significant headroom on interactive tasks.
Abstract
Large language models (LLMs) excel at solving problems with clear and complete statements, but often struggle with nuanced environments or interactive tasks which are common in most real-world scenarios. This highlights the critical need for developing LLMs that can effectively engage in logically consistent multi-turn dialogue, seek information and reason with incomplete data. To this end, we introduce a novel benchmark comprising a suite of multi-turn tasks each designed to test specific reasoning, interactive dialogue, and information-seeking abilities. These tasks have deterministic scoring mechanisms, thus eliminating the need for human intervention. Evaluating frontier models on our benchmark reveals significant headroom. Our analysis shows that most errors emerge from poor instruction following, reasoning failures, and poor planning. This benchmark provides valuable insights into…
