Teaching Language Models to Critique via Reinforcement Learning
Zhihui Xie, Jie Chen, Liyu Chen, Weichao Mao, Jingjing Xu, Lingpeng Kong

TL;DR
This paper introduces CTRL, a reinforcement learning framework for training critic models that improve code generation by providing effective feedback, leading to higher success rates and better iterative refinement without human supervision.
Contribution
The paper presents a novel RL-based critic training method that enhances code generation models' performance and enables scalable iterative critique-revision without human-labeled data.
Findings
Critics trained with CTRL improve pass rates on code benchmarks.
CTRL critics act as accurate reward models for code generation.
Iterative critique-revision with CTRL yields up to 106.1% relative improvement on challenging code generation benchmarks.
Abstract
Teaching large language models (LLMs) to critique and refine their outputs is crucial for building systems that can iteratively improve, yet it is fundamentally limited by the ability to provide accurate judgments and actionable suggestions. In this work, we study LLM critics for code generation and propose CTRL, a framework for Critic Training via Reinforcement Learning, which trains a critic model to generate feedback that maximizes correction performance for a fixed generator model without human supervision. Our results demonstrate that critics trained with CTRL significantly enhance pass rates and mitigate compounding errors across both base and stronger generator models. Furthermore, we show that these critic models act as accurate generative reward models and enable test-time scaling through iterative…
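The iterative critique-revision loop mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names critique_revise, Critique, looks_correct, and max_rounds are hypothetical, the generator and critic are passed in as plain callables, and the toy stand-ins at the bottom exist only to make the example runnable. The sketch assumes the critic returns both textual feedback and a correctness judgment, and that the loop stops when the critic accepts the solution or the revision budget runs out.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Critique:
    """Hypothetical critic output: textual feedback plus a correctness verdict."""
    feedback: str
    looks_correct: bool


def critique_revise(
    problem: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Critique],
    revise: Callable[[str, str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Run generate -> critique -> revise until the critic accepts or the budget ends.

    Returns the final solution and the number of revisions performed.
    """
    solution = generate(problem)
    rounds_used = 0
    verdict = critique(problem, solution)
    while not verdict.looks_correct and rounds_used < max_rounds:
        # The generator is fixed; only the feedback it conditions on changes.
        solution = revise(problem, solution, verdict.feedback)
        rounds_used += 1
        verdict = critique(problem, solution)
    return solution, rounds_used


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_generate(problem: str) -> str:
    return "def add(a, b): return a - b"  # deliberately buggy first attempt


def toy_critique(problem: str, solution: str) -> Critique:
    if "-" in solution:
        return Critique("Use + instead of -.", looks_correct=False)
    return Critique("Looks correct.", looks_correct=True)


def toy_revise(problem: str, solution: str, feedback: str) -> str:
    return solution.replace("-", "+")


final_solution, rounds = critique_revise(
    "Write add(a, b).", toy_generate, toy_critique, toy_revise
)
```

In this toy run the critic rejects the first attempt, the generator revises once, and the critic then accepts. In a real system the three callables would wrap model calls, and max_rounds would control how much test-time compute is spent.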
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Balanced Selection
