TL;DR
RefCritic introduces a reinforcement learning-based critic module for large language models that generates high-quality, actionable feedback to improve reasoning and refinement, outperforming supervised methods across multiple benchmarks.
Contribution
The paper presents RefCritic, a novel long-chain-of-thought critic model trained with dual rule-based rewards, significantly enhancing critique quality and model refinement over supervised approaches.
Findings
RefCritic achieves gains of 6.8% and 7.2% on the AIME25 benchmark.
Under majority voting, policy models whose outputs are filtered by RefCritic achieve stronger results.
RefCritic surpasses step-level supervised methods on ProcessBench.
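The majority-voting finding can be pictured with a minimal sketch, assuming the critic emits a binary accept/reject verdict per sampled solution and that rejected samples are simply dropped before voting. The function name `filtered_majority_vote`, the fallback to unfiltered voting when every sample is rejected, and the tie-breaking behavior are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter
from typing import Optional, Sequence


def filtered_majority_vote(
    answers: Sequence[str],
    critic_accepts: Sequence[bool],
) -> Optional[str]:
    """Majority vote over sampled answers, keeping only critic-accepted ones.

    answers: final answers extracted from sampled policy solutions.
    critic_accepts: hypothetical per-sample verdicts from a critic
        (True means the critic judged the solution correct).
    """
    if len(answers) != len(critic_accepts):
        raise ValueError("answers and critic_accepts must have equal length")

    kept = [a for a, ok in zip(answers, critic_accepts) if ok]
    # Assumption: if the critic rejects everything, fall back to plain voting.
    pool = kept if kept else list(answers)
    if not pool:
        return None
    # Counter.most_common breaks ties by first occurrence in the pool.
    return Counter(pool).most_common(1)[0][0]


samples = ["3", "3", "5", "5", "5"]
verdicts = [True, True, False, False, True]
plain = filtered_majority_vote(samples, [True] * len(samples))
filtered = filtered_majority_vote(samples, verdicts)
```

In this toy input, plain voting picks the most frequent answer, while filtering removes two of its votes and changes the winner. Whether that helps in practice depends on the critic's accuracy.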
Abstract
With the rapid advancement of Large Language Models (LLMs), developing effective critic modules for precise guidance has become crucial yet challenging. In this paper, we initially demonstrate that supervised fine-tuning for building critic modules (which is widely adopted in current solutions) fails to genuinely enhance models' critique abilities, producing superficial critiques with insufficient reflections and verifications. To unlock the unprecedented critique capabilities, we propose RefCritic, a long-chain-of-thought critic module based on reinforcement learning with dual rule-based rewards: (1) instance-level correctness of solution judgments and (2) refinement accuracies of the policy model based on critiques, aiming to generate high-quality evaluations with actionable feedback that effectively guides model refinement. We evaluate RefCritic on Qwen2.5-14B-Instruct and…
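The two rule-based rewards named in the abstract can be sketched as follows. This is a hedged illustration only: the abstract states that one reward checks the correctness of the critic's judgment and the other measures the policy model's refinement accuracy given the critique, but the gating on a correct judgment, the `refine_weight` parameter, exact-match answer checking, and all function names here are assumptions rather than the paper's formulation.

```python
from typing import Sequence


def judgment_reward(critic_says_correct: bool, solution_is_correct: bool) -> float:
    """Reward 1: the critic's verdict matches the ground-truth label."""
    return 1.0 if critic_says_correct == solution_is_correct else 0.0


def refinement_reward(refined_answers: Sequence[str], reference_answer: str) -> float:
    """Reward 2: fraction of policy refinements (conditioned on the critique)
    whose final answer matches the reference. Exact string match is an
    assumption standing in for whatever rule-based checker is used."""
    if not refined_answers:
        return 0.0
    target = reference_answer.strip()
    hits = sum(1 for a in refined_answers if a.strip() == target)
    return hits / len(refined_answers)


def dual_reward(
    critic_says_correct: bool,
    solution_is_correct: bool,
    refined_answers: Sequence[str],
    reference_answer: str,
    refine_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    r_judge = judgment_reward(critic_says_correct, solution_is_correct)
    # Assumption: a wrong verdict earns nothing, and refinement is only
    # scored when the original solution was actually incorrect.
    if r_judge == 0.0 or solution_is_correct:
        return r_judge
    r_refine = refinement_reward(refined_answers, reference_answer)
    return r_judge + refine_weight * r_refine


# The critic correctly flags a wrong solution; one of two refinements is fixed.
example = dual_reward(False, False, ["42", "41"], "42")
```

Tying part of the reward to downstream refinement means a critique that merely guesses the right verdict scores lower than one whose feedback actually helps the policy model fix the solution.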
Peer Reviews
Decision · Submitted to ICLR 2026
1. The paper is logically clear and easy to follow. 2. The use of refinement effectiveness as a direct reward for the critic is conceptually clear and technically reasonable. 3. Experiments are extensive, including test time scaling, comparisons with multiple base models and several out-of-distribution benchmarks.
1. The introduction and main experiments lack systematic comparison with recent strong critic baselines such as DeepCritic [1] and RealCritic [2], so the position and advantage of RefCritic in this line of work remain unclear. 2. The paper does not report the computational cost of the dual reward RL training, and hence it is difficult to assess the cost effectiveness and scalability of the proposed approach to larger models or broader deployment. 3. The paper lacks qualitative cases that show
1. This paper provides analyses on SFT-based critic models, which gives some meaningful insights. 2. Empirical results show the superior performance of the proposed method. 3. This paper is overall well-organized.
1. This paper misses an important line of work about critique generation for refinement, such as [1]. In my view, the proposed method is similar to [1], especially the design of R_r (Equation 6). The authors should clearly discuss the difference to highlight their core novelty. 2. Although the authors claim that their method is a long-chain-of-thought critic module, I do not find how this method can improve the long-chain-of-thought generation ability. Now, the methodological design is mainly a
1. The paper is well-written and logically structured. 2. The experimental design and scale are substantial. 3. While the dual-reward concept is not entirely novel, the specific formulation is a meaningful operationalization.
1. The central idea, using refinement performance as a reward signal for training critics, has been explored in prior work. For instance, Training Language Models to Critique With Multi-agent Feedback also leverages feedback loops where critique quality is tied to downstream correction success. The paper would benefit from a more nuanced discussion of these related approaches in Section 2. 2. The method critically relies on binary, verifiable ground truth (e.g., mathematical answers, code execut
