How Clued up are LLMs? Evaluating Multi-Step Deductive Reasoning in a Text-Based Game Environment
Rebecca Ansell, Autumn Toney-Wails

TL;DR
This study evaluates the deductive reasoning abilities of large language models in a text-based game environment. The agents show limited success, and fine-tuning on structured logic puzzles does not reliably improve their multi-step reasoning.
Contribution
Introduces a novel rule-based text game environment for assessing multi-step deductive reasoning in LLMs and analyzes the impact of fine-tuning on reasoning performance.
Findings
LLM agents achieved only four correct wins across 18 simulated games, showing limited deductive success.
Fine-tuning did not consistently enhance reasoning accuracy or gameplay performance.
In some cases, fine-tuning increased reasoning activity without improving correctness.
Abstract
Deducing whodunit proves challenging for LLM agents. In this paper, we implement a text-based multi-agent version of the classic board game Clue as a rule-based testbed for evaluating multi-step deductive reasoning, with six agents drawn from GPT-4o-mini and Gemini-2.5-Flash. We further investigate whether fine-tuning on structured logic puzzles transfers to improved in-game reasoning and gameplay. Across 18 simulated games, agents achieve only four correct wins, indicating difficulty in maintaining consistent deductive reasoning over the course of a full game. Additionally, we find that fine-tuning does not reliably improve performance and, in some cases, appears to increase reasoning volume without improving reasoning precision.
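To make concrete the kind of multi-step deduction a Clue player has to sustain, here is a minimal sketch of elimination-based bookkeeping. It is an illustration only and not the authors' environment or agent code: the `Notebook` class and its methods are hypothetical names, and the card lists are those of the classic board game, which the paper's text-based version may or may not use verbatim.

```python
from dataclasses import dataclass, field

# Card names from the classic board game (an assumption about the testbed).
SUSPECTS = {"Scarlett", "Mustard", "White", "Green", "Peacock", "Plum"}
WEAPONS = {"Candlestick", "Dagger", "Lead Pipe", "Revolver", "Rope", "Wrench"}
ROOMS = {"Kitchen", "Ballroom", "Conservatory", "Dining Room", "Billiard Room",
         "Library", "Lounge", "Hall", "Study"}


@dataclass
class Notebook:
    """Hypothetical helper: cards that could still be in the solution."""
    suspects: set = field(default_factory=lambda: set(SUSPECTS))
    weapons: set = field(default_factory=lambda: set(WEAPONS))
    rooms: set = field(default_factory=lambda: set(ROOMS))

    def eliminate(self, card: str) -> None:
        # A card seen in a hand (ours or one shown to us) cannot be
        # part of the hidden solution, so drop it from every category.
        for category in (self.suspects, self.weapons, self.rooms):
            category.discard(card)

    def accusation(self):
        # Accuse only when each category has exactly one candidate left.
        categories = (self.suspects, self.weapons, self.rooms)
        if any(len(c) != 1 for c in categories):
            return None
        return tuple(next(iter(c)) for c in categories)


# Example: start from our own hand, then record a card shown by another player.
notebook = Notebook()
for card in ("Green", "Rope", "Hall"):
    notebook.eliminate(card)
notebook.eliminate("Library")
```

Each individual elimination is trivial; the difficulty the abstract points to is keeping this state consistent across every turn of a full game, which is where a single missed or misrecorded card can lead to a wrong accusation.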
Taxonomy
Topics: Artificial Intelligence in Games · Multi-Agent Systems and Negotiation · Explainable Artificial Intelligence (XAI)
