Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning
Zhiheng Xi, Jixuan Huang, Xin Guo, Boyang Hong, Dingwen Yang, Xiaoran Fan, Shuo Li, Zehui Chen, Junjie Ye, Siyu Yuan, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang

TL;DR
Critique-RL introduces a two-stage reinforcement learning method for training critiquing language models without stronger supervision, improving their ability to assess and give feedback on model outputs across in-domain and out-of-domain tasks.
Contribution
It proposes a novel two-stage RL approach that enhances critiquing models' discriminability and helpfulness without requiring stronger supervision signals.
Findings
Achieves 9.02% improvement on in-domain tasks
Achieves 5.70% improvement on out-of-domain tasks
Enhances critic helpfulness and discriminability through two-stage optimization
Abstract
Training critiquing language models to assess and provide feedback on model outputs is a promising way to improve LLMs for complex reasoning tasks. However, existing approaches typically rely on stronger supervisors for annotating critique data. To address this, we propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision. Our approach operates on a two-player paradigm: the actor generates a response, the critic provides feedback, and the actor refines the response accordingly. We first reveal that relying solely on indirect reward signals from the actor's outputs for RL optimization often leads to unsatisfactory critics: while their helpfulness (i.e., providing constructive feedback) improves, the discriminability (i.e., determining whether a response is high-quality or not) remains poor, resulting in marginal performance gains.…
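The two-player loop described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the names `Critique`, `oracle`, `discrimination_reward`, and `rollout` are hypothetical, the exact-match verifier is an assumed stand-in for a rule-based checker, and the binary reward shapes are assumptions made for illustration only. The sketch contrasts an indirect reward (is the refined answer correct?) with a direct discriminability reward (did the critic's verdict on the initial response match the verifier?).

```python
from dataclasses import dataclass


@dataclass
class Critique:
    judged_correct: bool  # critic's verdict on the actor's initial response
    feedback: str         # natural-language feedback passed back to the actor


def oracle(answer: str, reference: str) -> float:
    """Hypothetical rule-based verifier: exact match against a known answer."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def discrimination_reward(critique: Critique, response: str, reference: str) -> float:
    """Direct signal (assumed form): 1.0 when the verdict agrees with the verifier."""
    is_correct = oracle(response, reference) == 1.0
    return 1.0 if critique.judged_correct == is_correct else 0.0


def rollout(question, reference, actor, critic, refine) -> dict:
    """One actor -> critic -> actor round, returning both reward signals."""
    response = actor(question)                      # actor generates a response
    critique = critic(question, response)           # critic provides feedback
    refined = refine(question, response, critique)  # actor refines accordingly
    return {
        "indirect": oracle(refined, reference),
        "discrimination": discrimination_reward(critique, response, reference),
    }


# Toy stand-ins for the two players (assumptions, not real models).
def always_reject(question, response):
    return Critique(judged_correct=False, feedback="Recheck the arithmetic.")


def toy_refine(question, response, critique):
    return "4"  # pretend the actor lands on the right answer after feedback


good_start = rollout("2+2", "4", lambda q: "4", always_reject, toy_refine)
bad_start = rollout("2+2", "4", lambda q: "5", always_reject, toy_refine)
print(good_start, bad_start)
```

In this toy setup the always-reject critic earns the full indirect reward in both rollouts, even though its verdict is wrong when the initial response was already correct. That is one simple way the two signals can diverge, consistent with the abstract's observation that indirect rewards alone can leave discriminability poor.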
Peer Reviews
Decision·ICLR 2026 Poster
1. The proposed two-stage RL method is sound and well-motivated, which deals with the core problem of critique generation. 2. Extensive experiments show the effectiveness of the proposed method. 3. This paper is overall well-written and easy to follow.
1. The design of indirect rewards based on actor refinement is similar to [1], which is not discussed in the current paper. The authors should further clarify the difference between this work and [1] to highlight their novelty. 2. The quality of generated critiques should be individually measured via automatic metrics or human evaluation. [1] Training Language Model to Critique for Better Refinement. ACL 2025 Findings.
- The paper's core originality is its clear diagnosis of a key failure mode in training critics: baseline RL methods create a conflict between "discriminability" (judging correctness) and "helpfulness" (providing feedback), optimizing the latter at the expense of the former. - The paper's quality is high, with a rigorous methodology. The training dynamics in Figure 3 clearly show the baseline's failure, while decisive ablation studies in Table 3 prove that both stages of Critique-RL and its spe
- The paper's primary motivation is to train critics "without stronger supervision". However, the entire method, especially the critical Stage I, is heavily reliant on an "oracle reward function" $r_{oracle}(x,y)$ to compute the direct discrimination reward $r_{dis}$. For the main experiments on math tasks, this oracle is a rule-based verifier that knows the correct answer. This oracle is a form of strong, external supervision. - The framework's success, particularly in Stage II, hinges on a cr
- The proposed two-stage RL method is effective at providing constructive critique feedback for better refinement and precise filtering for effective test-time scaling. - The experiments cover several benchmarks across different tasks. - This paper is well-written and easy to follow.
My main concern lies in the experimental design, as I am not fully convinced that the current experiments sufficiently demonstrate the proposed method’s advantage on complex reasoning tasks. - Since the authors explicitly state in the Abstract and Introduction that their method targets complex reasoning tasks, more challenging benchmarks such as AIME and GPQA should have been included in the evaluation. - Although the main text reports significant improvements on Qwen-3B and Qwen-7B, the appen
