Step-by-Step Reasoning to Solve Grid Puzzles: Where do LLMs Falter?
Nemika Tyagi, Mihir Parmar, Mohith Kulkarni, Aswin RRV, Nisarg Patel, Mutsumi Nakamura, Arindam Mitra, Chitta Baral

TL;DR
This paper introduces GridPuzzle, a new dataset and evaluation framework for analyzing the reasoning chains of large language models in solving grid puzzles, revealing that current prompting methods do not improve reasoning performance.
Contribution
It presents a novel dataset, an error taxonomy, and a framework for detailed evaluation of LLM reasoning in grid puzzles, highlighting gaps in current prompting techniques.
Findings
Existing prompting methods do not enhance LLM performance on GridPuzzle.
Analysis reveals common reasoning errors made by LLMs.
The framework enables fine-grained assessment of LLM reasoning chains.
Abstract
Solving grid puzzles involves a significant amount of logical reasoning. Hence, it is a good domain to evaluate the reasoning capability of a model which can then guide us to improve the reasoning ability of models. However, most existing works evaluate only the final predicted answer of a puzzle, without delving into an in-depth analysis of the LLMs' reasoning chains (such as where they falter) or providing any finer metrics to evaluate them. Since LLMs may rely on simple heuristics or artifacts to predict the final answer, it is crucial to evaluate the generated reasoning chain beyond overall correctness measures, for accurately evaluating the reasoning abilities of LLMs. To this end, we first develop GridPuzzle, an evaluation dataset comprising 274 grid-based puzzles with different complexities. Second, we propose a new error taxonomy derived from manual analysis of reasoning chains…
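To make the distinction between final-answer correctness and step-level correctness concrete, here is a minimal sketch in Python. It is not the paper's metric or code. The grid representation, the fact triples, and the names `step_is_correct`, `chain_score`, and `final_answer_correct` are all hypothetical, chosen only to illustrate how a reasoning chain could be scored step by step against a known solution.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: a solved grid maps each entity to its attributes.
Gold = Dict[str, Dict[str, str]]
# Hypothetical representation: each reasoning step asserts (entity, attribute, value) facts.
Step = List[Tuple[str, str, str]]


def step_is_correct(step: Step, gold: Gold) -> bool:
    """A step counts as correct only if every fact it asserts matches the gold grid."""
    return all(gold.get(entity, {}).get(attr) == value for entity, attr, value in step)


def chain_score(chain: List[Step], gold: Gold) -> float:
    """Fraction of steps whose asserted facts all agree with the gold solution."""
    if not chain:
        return 0.0
    return sum(step_is_correct(step, gold) for step in chain) / len(chain)


def final_answer_correct(answer: Gold, gold: Gold) -> bool:
    """Coarse check: does the predicted grid equal the gold grid exactly?"""
    return answer == gold


# Toy puzzle, invented for illustration only.
gold = {
    "Alice": {"pet": "cat", "house": "red"},
    "Bob": {"pet": "dog", "house": "blue"},
}
chain = [
    [("Alice", "pet", "cat")],                           # matches the gold grid
    [("Bob", "house", "red")],                           # contradicts the gold grid
    [("Bob", "pet", "dog"), ("Alice", "house", "red")],  # matches the gold grid
]

print(chain_score(chain, gold))          # share of correct steps in the chain
print(final_answer_correct(gold, gold))  # exact-match check on the final grid
```

In this toy case the final grid can be exactly right while one of three intermediate steps is wrong, which is the kind of gap a final-answer-only evaluation would miss.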
Taxonomy
Topics: Constraint Satisfaction and Optimization · AI-based Problem Solving and Planning · Business Process Modeling and Analysis
Methods: Attention Is All You Need · Adam · Label Smoothing · Linear Layer · Byte Pair Encoding · Layer Normalization · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Dense Connections
