Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?

Yancheng He; Shilong Li; Jiaheng Liu; Weixun Wang; Xingyuan Bu; Ge Zhang; Zhongyuan Peng; Zhaoxiang Zhang; Zhicheng Zheng; Wenbo Su; Bo Zheng

arXiv:2502.19361·cs.CL·April 1, 2025

TL;DR

This paper introduces DeltaBench, a benchmark for evaluating Large Language Models' ability to detect errors in long Chain-of-Thought reasoning, analyzing different models and critic systems to understand their effectiveness and limitations.

Contribution

The paper presents DeltaBench, a new benchmark dataset for assessing error detection in long CoT reasoning by LLMs, along with comprehensive analysis of model performance.

Findings

01. Different o1-like models vary in effectiveness for long CoT generation.

02. Existing critic models have limitations in error detection accuracy.

03. DeltaBench provides insights for improving LLM reasoning and critique abilities.

Abstract

Recently, o1-like models have drawn significant attention; these models produce long Chain-of-Thought (CoT) reasoning steps to improve the reasoning abilities of existing Large Language Models (LLMs). In this paper, to understand the quality of these long CoTs and to measure the critique abilities of existing LLMs on them, we introduce DeltaBench, which includes long CoTs generated by different o1-like models (e.g., QwQ, DeepSeek-R1) for different reasoning tasks (e.g., Math, Code, General Reasoning), to measure the ability to detect errors in long CoT reasoning. Based on DeltaBench, we first perform a fine-grained analysis of the generated long CoTs to assess the effectiveness and efficiency of different o1-like models. Then, we conduct extensive evaluations of existing process reward models (PRMs) and critic models to detect the errors of each annotated…

Code & Models

Repositories

openstellarteam/deltabench (Official)

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Software Engineering Research