Step-by-Step Reasoning to Solve Grid Puzzles: Where do LLMs Falter?

Nemika Tyagi; Mihir Parmar; Mohith Kulkarni; Aswin RRV; Nisarg Patel; Mutsumi Nakamura; Arindam Mitra; Chitta Baral

arXiv:2407.14790·cs.CL·October 7, 2024

TL;DR

This paper introduces GridPuzzle, a new dataset and evaluation framework for analyzing the reasoning chains of large language models in solving grid puzzles, revealing that current prompting methods do not improve reasoning performance.

Contribution

It presents a novel dataset, an error taxonomy, and a framework for detailed evaluation of LLM reasoning in grid puzzles, highlighting gaps in current prompting techniques.

Findings

01. Existing prompting methods do not enhance LLM performance on GridPuzzle.

02. Analysis reveals common reasoning errors made by LLMs.

03. The framework enables fine-grained assessment of LLM reasoning chains.

Abstract

Solving grid puzzles involves a significant amount of logical reasoning. Hence, it is a good domain to evaluate the reasoning capability of a model which can then guide us to improve the reasoning ability of models. However, most existing works evaluate only the final predicted answer of a puzzle, without delving into an in-depth analysis of the LLMs' reasoning chains (such as where they falter) or providing any finer metrics to evaluate them. Since LLMs may rely on simple heuristics or artifacts to predict the final answer, it is crucial to evaluate the generated reasoning chain beyond overall correctness measures, for accurately evaluating the reasoning abilities of LLMs. To this end, we first develop GridPuzzle, an evaluation dataset comprising 274 grid-based puzzles with different complexities. Second, we propose a new error taxonomy derived from manual analysis of reasoning chains…
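To make the distinction between final-answer accuracy and reasoning-chain evaluation concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption: the toy puzzle, the clue encoding, and the helper names `solve` and `check_chain` are hypothetical and are not taken from GridPuzzle or from the paper's evaluation framework. It only shows that a chain can contain a wrong intermediate step even when other steps, including the final one, agree with the solution.

```python
from itertools import permutations

# Hypothetical toy puzzle (not from GridPuzzle): match three people to three pets.
PEOPLE = ["Ann", "Bob", "Cy"]
PETS = ["cat", "dog", "fish"]

# Each clue is a predicate over a candidate assignment {person: pet}.
CLUES = [
    lambda a: a["Ann"] != "dog",   # Ann does not own the dog
    lambda a: a["Bob"] == "fish",  # Bob owns the fish
]


def solve():
    """Brute-force all person-to-pet assignments and keep those satisfying every clue."""
    solutions = []
    for perm in permutations(PETS):
        assignment = dict(zip(PEOPLE, perm))
        if all(clue(assignment) for clue in CLUES):
            solutions.append(assignment)
    return solutions


def check_chain(steps, solution):
    """Check each claimed step (person, pet, is_match) against the known solution.

    Returns one boolean per step: True if the claim agrees with the solution.
    """
    return [(solution[person] == pet) == is_match for person, pet, is_match in steps]


solution = solve()[0]

# A hypothetical reasoning chain whose second step is wrong.
chain = [
    ("Bob", "fish", True),
    ("Cy", "cat", True),   # incorrect intermediate claim
    ("Ann", "cat", True),
]
print(check_chain(chain, solution))
```

Checking only the last claim would mark this chain as correct, whereas the per-step check exposes the faulty intermediate claim. This sketch compares steps against a known solution; it does not attempt to reproduce the error taxonomy or metrics the abstract refers to.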

Code & Models

Repositories

mihir3009/gridpuzzle (official)

Datasets

Alex-Guha/CogSwitch-GridPuzzle-Reasoning
dataset · 67 downloads

Videos

Step-by-Step Reasoning to Solve Grid Puzzles: Where do LLMs Falter?· underline

Taxonomy

Topics: Constraint Satisfaction and Optimization · AI-based Problem Solving and Planning · Business Process Modeling and Analysis

Methods: Attention Is All You Need · Adam · Label Smoothing · Linear Layer · Byte Pair Encoding · Layer Normalization · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Dense Connections