FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving

Guizhen Chen; Weiwen Xu; Hao Zhang; Hou Pong Chan; Chaoqun Liu; Lidong Bing; Deli Zhao; Anh Tuan Luu; Yu Rong

arXiv:2502.20238·cs.CL·June 3, 2025

TL;DR

FINEREASON is a logic-puzzle benchmark for evaluating and improving the deliberate reasoning of large language models. It examines intermediate steps and reflection rather than final answers alone, and training on its data improves mathematical reasoning performance.

Contribution

The paper presents FINEREASON, a new benchmark with tasks for assessing and enhancing LLMs' intermediate reasoning and reflection capabilities.

Findings

01. Models trained on FINEREASON data improve math reasoning by up to 5.1%.

02. FINEREASON enables detailed evaluation of reasoning steps.

03. The new tasks promote better reflection and correction in LLMs.

Abstract

Many challenging reasoning tasks require not just rapid, intuitive responses, but a more deliberate, multi-step approach. Recent progress in large language models (LLMs) highlights an important shift from the "System 1" way of quick reactions to the "System 2" style of reflection-and-correction problem solving. However, current benchmarks heavily rely on the final-answer accuracy, leaving much of a model's intermediate reasoning steps unexamined. This fails to assess the model's ability to reflect and rectify mistakes within the reasoning process. To bridge this gap, we introduce FINEREASON, a logic-puzzle benchmark for fine-grained evaluation of LLMs' reasoning capabilities. Each puzzle can be decomposed into atomic steps, making it ideal for rigorous validation of intermediate correctness. Building on this, we introduce two tasks: state checking, and state transition, for a…
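The two task names can be made concrete with a toy puzzle. The sketch below is not the benchmark's code. It assumes that state checking means deciding whether a partial puzzle state can still reach a solution, and that state transition means producing the states reachable by one atomic step. The 4x4 Sudoku grid and every name in it (`candidates`, `transitions`, `is_solvable`) are hypothetical choices made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical toy setting: a 4x4 Sudoku, with 0 marking an empty cell.
Grid = Tuple[Tuple[int, ...], ...]
N = 4    # grid side
BOX = 2  # box side


def candidates(grid: Grid, r: int, c: int) -> List[int]:
    """Digits that fit at (r, c) without clashing with its row, column, or box."""
    used = set(grid[r]) | {grid[i][c] for i in range(N)}
    br, bc = (r // BOX) * BOX, (c // BOX) * BOX
    used |= {grid[i][j] for i in range(br, br + BOX) for j in range(bc, bc + BOX)}
    return [d for d in range(1, N + 1) if d not in used]


def first_empty(grid: Grid) -> Optional[Tuple[int, int]]:
    """Position of the first empty cell in row-major order, or None if full."""
    for r in range(N):
        for c in range(N):
            if grid[r][c] == 0:
                return r, c
    return None


def transitions(grid: Grid) -> List[Grid]:
    """State transition (assumed meaning): states one atomic step away.

    An atomic step here fills the first empty cell with a non-clashing digit.
    An empty result on a non-full grid means the solver has to backtrack.
    """
    pos = first_empty(grid)
    if pos is None:
        return []
    r, c = pos
    result = []
    for d in candidates(grid, r, c):
        rows = [list(row) for row in grid]
        rows[r][c] = d
        result.append(tuple(tuple(row) for row in rows))
    return result


def is_solvable(grid: Grid) -> bool:
    """State checking (assumed meaning): can this state still reach a solution?

    Assumes the filled cells are already mutually consistent, which holds for
    any state reached through `transitions` from a consistent start.
    """
    if first_empty(grid) is None:
        return True
    return any(is_solvable(nxt) for nxt in transitions(grid))


good = ((1, 2, 3, 4), (3, 4, 1, 2), (2, 1, 4, 3), (4, 3, 2, 0))
dead = ((1, 2, 3, 0), (0, 0, 0, 4), (0, 0, 0, 0), (0, 0, 0, 0))

print(is_solvable(good), len(transitions(good)))
print(is_solvable(dead), len(transitions(dead)))
```

In `dead`, the first empty cell needs a 4 to complete its row, but a 4 already sits in the same column, so no step is available and the state is unsolvable. Because every step is atomic, each intermediate state can be verified exactly, which is the property the abstract highlights.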

Code & Models

Repositories

DAMO-NLP-SG/FineReason (Official)

Videos

FineReason: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Mathematics, Computing, and Information Processing

Methods: Sparse Evolutionary Training