Evaluating Step-by-step Reasoning Traces: A Survey

Jinu Lee; Julia Hockenmaier

arXiv:2502.12289 · cs.CL · September 23, 2025


TL;DR

This survey reviews the current landscape of evaluating step-by-step reasoning in large language models, highlighting inconsistencies and proposing a taxonomy to guide future research in assessment methods and benchmarks.

Contribution

It introduces a comprehensive taxonomy for reasoning evaluation criteria and reviews existing datasets, evaluators, and findings to address evaluation inconsistencies.

Findings

1. Identifies four key evaluation categories: factuality, validity, coherence, utility.

2. Highlights fragmented evaluation practices across the field.

3. Suggests promising directions for standardized reasoning assessment.

Abstract

Step-by-step reasoning is widely used to enhance the reasoning ability of large language models (LLMs) in complex problems. Evaluating the quality of reasoning traces is crucial for understanding and improving LLM reasoning. However, existing evaluation practices are highly inconsistent, resulting in fragmented progress across evaluator design and benchmark development. To address this gap, this survey provides a comprehensive overview of step-by-step reasoning evaluation, proposing a taxonomy of evaluation criteria with four top-level categories (factuality, validity, coherence, and utility). Based on the taxonomy, we review different datasets, evaluator implementations, and recent findings, leading to promising directions for future research.
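As a rough illustration of how the four top-level categories could be used to organize step-level judgments, here is a minimal Python sketch. Only the category names come from the abstract. The `StepEvaluation` record, the score range of 0 to 1, and the `weakest_step` helper are hypothetical choices made for this example, not anything the survey prescribes.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class Criterion(Enum):
    """The four top-level categories named in the abstract."""
    FACTUALITY = "factuality"
    VALIDITY = "validity"
    COHERENCE = "coherence"
    UTILITY = "utility"


@dataclass
class StepEvaluation:
    """Scores for one reasoning step (hypothetical layout, scores assumed in [0, 1])."""
    step_text: str
    scores: Dict[Criterion, float] = field(default_factory=dict)


def weakest_step(trace: List[StepEvaluation], criterion: Criterion) -> Optional[int]:
    """Return the index of the lowest-scoring step for one criterion.

    Steps that carry no score for the criterion are skipped. Returns None
    when no step was scored. Ties go to the earliest step.
    """
    scored = [
        (ev.scores[criterion], i)
        for i, ev in enumerate(trace)
        if criterion in ev.scores
    ]
    if not scored:
        return None
    return min(scored)[1]


# Illustrative trace with made-up scores.
trace = [
    StepEvaluation("Step 1", {Criterion.VALIDITY: 0.9, Criterion.UTILITY: 0.8}),
    StepEvaluation("Step 2", {Criterion.VALIDITY: 0.2}),
]
print(weakest_step(trace, Criterion.VALIDITY))  # → 1
```

Keeping the criteria as separate keys, rather than collapsing them into one number, lets an evaluator report scores for only the criteria it actually covers.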


Taxonomy

Topics: Semantic Web and Ontologies · Business Process Modeling and Analysis · Natural Language Processing Techniques