TL;DR
DeepCritic introduces a two-stage framework leveraging large language models to generate deliberate, step-wise critiques of math solutions, significantly improving feedback quality and judgment accuracy for automated oversight.
Contribution
The paper presents a novel two-stage training approach for LLM critics, combining supervised fine-tuning with reinforcement learning to enhance critique depth and accuracy.
Findings
Outperforms existing LLM critics on error identification benchmarks.
Provides more detailed and effective feedback for correcting math solutions.
Significantly improves the critique ability of LLMs for mathematical reasoning.
Abstract
As Large Language Models (LLMs) are rapidly evolving, providing accurate feedback and scalable oversight on their outputs becomes an urgent and critical problem. Leveraging LLMs as critique models to achieve automated supervision is a promising solution. In this work, we focus on studying and enhancing the math critique ability of LLMs. Current LLM critics provide critiques that are too shallow and superficial on each step, leading to low judgment accuracy and struggling to offer sufficient feedback for the LLM generator to correct mistakes. To tackle this issue, we propose a novel and effective two-stage framework to develop LLM critics that are capable of deliberately critiquing on each reasoning step of math solutions. In the first stage, we utilize Qwen2.5-72B-Instruct to generate 4.5K long-form critiques as seed data for supervised fine-tuning. Each seed critique consists of…
Peer Reviews
Decision: Submitted to ICLR 2026
The paper innovatively addresses the superficiality of LLM critiques by introducing a two-stage pipeline that combines iterative critique generation (initial + in-depth meta-critiquing) with RL, creatively adapting Monte Carlo sampling for automated RL data in math domains, which extends prior work on scalable oversight (e.g., Saunders et al., 2022) to deliberate reasoning without relying solely on human labels. The methodology is rigorously evaluated across multiple benchmarks, showing substan
While RL improves performance, the auto-generated data via Monte Carlo sampling discards certain solutions (e.g., fully correct/incorrect ones), potentially introducing biases toward medium-difficulty problems; this could be quantified with diversity metrics to ensure the data represents a wide range of math complexities. Test-time scaling results focus on majority voting and refinement, but lack comparisons with advanced baselines like outcome reward models (ORMs) or hybrid PRM-ORM setups, whi
1. Thorough experiments: the paper contains very thorough experiments to support the idea. Although this is a well-explored scope (SFT + RL), the full set of experiments is still nice. 2. The paper demonstrates that RL with automatically constructed data (DeepCritic-7B-RL-Numina) also yields substantial gains; this agrees with findings from other papers and shows that automatic rating is valuable.
1. Lack of novelty: the framework adopted in this paper is well established in the reasoning literature; this work can be viewed as an application of it to critique capability. 2. Limited Domain: The paper focuses solely on mathematical reasoning. While this is a standard testbed, it is unclear whether the deliberate critique approach generalizes to more subjective or less structured domains (e.g., creative writing, complex instruction following). 3. Dependency on Strong Teacher: The seed data generation relies
- Importance and Timeliness of the Problem: The paper tackles a fundamental challenge at the forefront of LLM development. As the community shifts from outcome-based to process-based supervision, improving the quality of automated feedback (i.e., critique) is paramount for achieving scalable oversight and building more reliable and trustworthy LLMs. This work directly addresses a core bottleneck in this research direction. - Novel and Insightful Methodology: The paper's primary contribution is
- Scalability and Cost of the Data Generation Pipeline: The framework's main strength—its high-quality data—is also a potential weakness in terms of scalability. The data curation process is computationally intensive, requiring multiple long-sequence inference passes from a very large teacher model for each data point. This makes the cost of data generation exceedingly high, posing a significant challenge for scaling the dataset to millions of examples and potentially limiting its feasibility fo
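The reviews above mention that the RL data is constructed automatically via Monte Carlo sampling. As a rough illustration of that general idea only (not the paper's actual pipeline, whose details are not given here), the sketch below estimates the first erroneous step of a solution by sampling continuations from each step prefix and checking whether any of them still reaches the reference answer. Every name in it (`Rollout`, `prefix_success_rate`, `first_error_step`, `toy_rollout`) and the default of 8 samples are assumptions made for this example; in practice the rollout function would call a sampling LLM.

```python
from typing import Callable, List, Optional

# Hypothetical interface: given a problem and a prefix of solution steps,
# sample one continuation and return its final answer as a string.
Rollout = Callable[[str, List[str]], str]


def prefix_success_rate(problem: str, prefix: List[str], rollout: Rollout,
                        reference: str, n_samples: int) -> float:
    """Fraction of sampled continuations from `prefix` that reach `reference`."""
    hits = sum(rollout(problem, prefix) == reference for _ in range(n_samples))
    return hits / n_samples


def first_error_step(problem: str, steps: List[str], rollout: Rollout,
                     reference: str, n_samples: int = 8) -> Optional[int]:
    """Return the 0-based index of the first step after which no sampled
    continuation reaches the reference answer, or None if every prefix
    still has at least one successful continuation."""
    for i in range(len(steps)):
        rate = prefix_success_rate(problem, steps[: i + 1], rollout,
                                   reference, n_samples)
        if rate == 0.0:
            return i
    return None


def toy_rollout(problem: str, prefix: List[str]) -> str:
    """Deterministic stand-in for a sampling LLM (assumption for the demo):
    any prefix containing a step tagged 'WRONG' can no longer reach '4'."""
    return "5" if any("WRONG" in s for s in prefix) else "4"


steps = [
    "2 + 2 means adding two and two",
    "WRONG: 2 + 2 = 5",
    "So the answer is 5",
]
print(first_error_step("What is 2 + 2?", steps, toy_rollout, reference="4"))  # prints 1
```

A label produced this way is only an estimate: with few samples, a correct but difficult step can be flagged as erroneous simply because no sampled continuation happened to succeed.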
