Teaching Language Models to Critique via Reinforcement Learning

Zhihui Xie; Jie Chen; Liyu Chen; Weichao Mao; Jingjing Xu; Lingpeng Kong

arXiv:2502.03492·cs.LG·December 2, 2025

TL;DR

This paper introduces CTRL, a reinforcement learning framework for training critic models that improve code generation by providing effective feedback, leading to higher success rates and better iterative refinement without human supervision.

Contribution

The paper presents a novel RL-based critic training method that enhances code generation models' performance and enables scalable iterative critique-revision without human-labeled data.

Findings

1. Critics trained with CTRL improve pass rates on code benchmarks.
2. CTRL critics act as accurate reward models for code generation.
3. Iterative critique-revision with CTRL yields relative improvements of up to 106.1%.
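The iterative critique-revision idea in the findings can be illustrated with a minimal sketch. It is not the paper's implementation: `critique_revise`, the three callables it takes, and the toy stand-ins below are hypothetical names chosen for illustration, and the stopping rule (stop once the critic judges the solution correct) is an assumption.

```python
from typing import Callable, Tuple


def critique_revise(
    problem: str,
    solution: str,
    critic: Callable[[str, str], str],
    reviser: Callable[[str, str, str], str],
    accepts: Callable[[str], bool],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Alternate critique and revision; return (solution, revisions made)."""
    for round_idx in range(max_rounds):
        critique = critic(problem, solution)
        if accepts(critique):
            # Assumed stopping rule: the critic judges the solution correct.
            return solution, round_idx
        solution = reviser(problem, solution, critique)
    return solution, max_rounds


# Toy stand-ins (hypothetical): a real system would call language models here.
def toy_critic(problem: str, solution: str) -> str:
    if "a + b" in solution:
        return "Judgment: correct"
    return "Judgment: incorrect. Use addition, not subtraction."


def toy_reviser(problem: str, solution: str, critique: str) -> str:
    return solution.replace("a - b", "a + b")


def toy_accepts(critique: str) -> bool:
    return critique.startswith("Judgment: correct")


final, rounds = critique_revise(
    "add two numbers",
    "def add(a, b): return a - b",
    toy_critic,
    toy_reviser,
    toy_accepts,
)
```

In this toy run the first critique rejects the subtraction bug, one revision fixes it, and the second critique accepts, so the loop ends after a single revision. Raising `max_rounds` is the knob that would correspond to spending more test-time compute.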

Abstract

Teaching large language models (LLMs) to critique and refine their outputs is crucial for building systems that can iteratively improve, yet it is fundamentally limited by the ability to provide accurate judgments and actionable suggestions. In this work, we study LLM critics for code generation and propose CTRL, a framework for Critic Training via Reinforcement Learning, which trains a critic model to generate feedback that maximizes correction performance for a fixed generator model without human supervision. Our results demonstrate that critics trained with CTRL significantly enhance pass rates and mitigate compounding errors across both base and stronger generator models. Furthermore, we show that these critic models act as accurate generative reward models and enable test-time scaling through iterative…
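One way to read "feedback that maximizes correction performance for a fixed generator" is as a reward computed without human labels: the critic's feedback is scored by whether the fixed generator's revision then works. The sketch below assumes a binary pass/fail signal from executing test cases, which the abstract does not spell out. `correction_reward`, `run_tests`, and `toy_generator` are hypothetical names, and the bare `exec` call stands in for the sandbox a real setup would need.

```python
from typing import Callable, Sequence, Tuple

TestCase = Tuple[tuple, object]  # (arguments, expected result)


def run_tests(source: str, entry_point: str, cases: Sequence[TestCase]) -> bool:
    """Execute candidate code and check it against the test cases."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # illustration only; real use needs a sandbox
        func = namespace[entry_point]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        return False


def correction_reward(
    problem: str,
    solution: str,
    critique: str,
    generator: Callable[[str, str, str], str],
    entry_point: str,
    cases: Sequence[TestCase],
) -> float:
    """Score a critique by whether the fixed generator's revision passes."""
    revised = generator(problem, solution, critique)
    return 1.0 if run_tests(revised, entry_point, cases) else 0.0


# Toy fixed generator (hypothetical): it only fixes the bug when told how.
def toy_generator(problem: str, solution: str, critique: str) -> str:
    if "addition" in critique:
        return solution.replace("a - b", "a + b")
    return solution


buggy = "def add(a, b): return a - b"
cases = [((1, 2), 3), ((0, 5), 5)]
helpful = correction_reward(
    "add", buggy, "Use addition instead of subtraction.", toy_generator, "add", cases
)
vague = correction_reward("add", buggy, "Looks fine.", toy_generator, "add", cases)
```

Here the actionable critique earns a reward of 1.0 and the vague one earns 0.0, which is the kind of signal a reinforcement learning objective for the critic could be built on. The generator is never updated in this sketch; only the critique varies.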

Code & Models

Models

🤗 Zhihui/CTRL-32B (model, 16 downloads, 5 likes)

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Balanced Selection