DeepCritic: Deliberate Critique with Large Language Models

Wenkai Yang; Jingwen Chen; Yankai Lin; Ji-Rong Wen

arXiv:2505.00662·cs.CL·May 2, 2025

TL;DR

DeepCritic introduces a two-stage framework leveraging large language models to generate deliberate, step-wise critiques of math solutions, significantly improving feedback quality and judgment accuracy for automated oversight.

Contribution

The paper presents a novel two-stage training approach for LLM critics, combining supervised fine-tuning with reinforcement learning to enhance critique depth and accuracy.

Findings

01

Outperforms existing LLM critics on error identification benchmarks.

02

Provides more detailed and effective feedback for correcting math solutions.

03

Significantly improves the critique ability of LLMs for mathematical reasoning.

Abstract

As Large Language Models (LLMs) are rapidly evolving, providing accurate feedback and scalable oversight on their outputs becomes an urgent and critical problem. Leveraging LLMs as critique models to achieve automated supervision is a promising solution. In this work, we focus on studying and enhancing the math critique ability of LLMs. Current LLM critics provide critiques that are too shallow and superficial on each step, leading to low judgment accuracy and struggling to offer sufficient feedback for the LLM generator to correct mistakes. To tackle this issue, we propose a novel and effective two-stage framework to develop LLM critics that are capable of deliberately critiquing on each reasoning step of math solutions. In the first stage, we utilize Qwen2.5-72B-Instruct to generate 4.5K long-form critiques as seed data for supervised fine-tuning. Each seed critique consists of…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

The paper innovatively addresses the superficiality of LLM critiques by introducing a two-stage pipeline that combines iterative critique generation (initial + in-depth meta-critiquing) with RL, creatively adapting Monte Carlo sampling for automated RL data in math domains, which extends prior work on scalable oversight (e.g., Saunders et al., 2022) to deliberate reasoning without relying solely on human labels. The methodology is rigorously evaluated across multiple benchmarks, showing substan

Weaknesses

While RL improves performance, the auto-generated data via Monte Carlo sampling discards certain solutions (e.g., fully correct/incorrect ones), potentially introducing biases toward medium-difficulty problems; this could be quantified with diversity metrics to ensure the data represents a wide range of math complexities. Test-time scaling results focus on majority voting and refinement, but lack comparisons with advanced baselines like outcome reward models (ORMs) or hybrid PRM-ORM setups, whi
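The filtering concern raised above, together with the majority-voting setup the review mentions, can be illustrated with a minimal Python sketch. It assumes each problem comes with a list of sampled rollouts already marked correct or incorrect, and that a critic has been sampled several times to produce verdict strings. The function names and data layout are hypothetical and are not taken from the DeepCritic repository; the paper's actual filtering criteria may differ.

```python
from collections import Counter
from typing import Dict, List


def keep_mixed_outcome_problems(rollouts: Dict[str, List[bool]]) -> Dict[str, float]:
    """Return the empirical pass rate for problems whose rollouts disagree.

    Hypothetical helper: `rollouts` maps a problem id to correctness flags
    for its sampled solutions. This layout is an assumption for illustration.
    """
    kept = {}
    for problem_id, outcomes in rollouts.items():
        if not outcomes:
            continue  # nothing sampled, so nothing to estimate
        pass_rate = sum(outcomes) / len(outcomes)
        # Drop problems where every rollout is correct or every rollout is wrong.
        if 0.0 < pass_rate < 1.0:
            kept[problem_id] = pass_rate
    return kept


def majority_vote(verdicts: List[str]) -> str:
    """Pick the most common critic verdict; ties go to the verdict seen first."""
    return Counter(verdicts).most_common(1)[0][0]


if __name__ == "__main__":
    sampled = {
        "p1": [True, True],                  # all correct: discarded
        "p2": [True, False, False, False],   # mixed: kept
        "p3": [False, False],                # all incorrect: discarded
    }
    print(keep_mixed_outcome_problems(sampled))
    print(majority_vote(["correct", "incorrect", "incorrect"]))
```

Under this kind of filter the surviving problems are exactly those with a pass rate strictly between 0 and 1, which is why the reviewer asks whether the resulting data skews toward medium-difficulty problems.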

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. Thorough experiments: the paper contains very thorough experiments to support the idea. Although this is a well-explored scope (SFT + RL), the full set of experiments is still valuable. 2. The paper demonstrates that RL with automatically constructed data (DeepCritic-7B-RL-Numina) also yields substantial gains. This agrees with findings from other papers and shows that automatic rating is valuable.

Weaknesses

1. Lack of novelty: the framework adopted in this paper is well established in the reasoning literature, so this work can be viewed as an application of it to critique capability. 2. Limited domain: the paper focuses solely on mathematical reasoning. While this is a standard testbed, it is unclear whether the deliberate critique approach generalizes to more subjective or less structured domains (e.g., creative writing, complex instruction following). 3. Dependency on a strong teacher: The seed data generation relies

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- Importance and Timeliness of the Problem: The paper tackles a fundamental challenge at the forefront of LLM development. As the community shifts from outcome-based to process-based supervision, improving the quality of automated feedback (i.e., critique) is paramount for achieving scalable oversight and building more reliable and trustworthy LLMs. This work directly addresses a core bottleneck in this research direction. - Novel and Insightful Methodology: The paper's primary contribution is

Weaknesses

- Scalability and Cost of the Data Generation Pipeline: The framework's main strength—its high-quality data—is also a potential weakness in terms of scalability. The data curation process is computationally intensive, requiring multiple long-sequence inference passes from a very large teacher model for each data point. This makes the cost of data generation exceedingly high, posing a significant challenge for scaling the dataset to millions of examples and potentially limiting its feasibility fo

Code & Models

Repositories

rucbm/deepcritic
pytorch · Official
