Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
Xiaoying Zhang, Yipeng Zhang, Hao Sun, Kaituo Feng, Chaochao Lu, Chao Yang, Helen Meng

TL;DR
Critique-GRPO is a novel reinforcement learning framework that combines natural language and numerical feedback to significantly improve large language models' reasoning abilities and self-refinement capabilities.
Contribution
It introduces Critique-GRPO, an online RL method integrating natural language critiques with numerical rewards, enhancing LLM reasoning and self-improvement over existing methods.
Findings
Achieves 15-21.6% Pass@1 improvements on Qwen models.
Attains 7.3% Pass@1 improvement on Llama-3.2-3B-Instruct.
Enables effective self-critiquing and substantial performance gains.
Abstract
Recent advances in reinforcement learning (RL) using numerical rewards have significantly enhanced the complex reasoning capabilities of large language models (LLMs). However, we identify three fundamental limitations of purely numerical feedback: performance plateaus, ineffective spontaneous self-reflection, and persistent failures. We show that plateaued RL models can successfully refine failed solutions when given natural language critiques. Motivated by this, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for policy optimization. This approach enables LLMs to learn simultaneously from initial responses and critique-guided refinements, effectively internalizing the exploration benefits of both stages. Extensive experiments show that Critique-GRPO outperforms all compared supervised and RL-based fine-tuning methods,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsReinforcement Learning in Robotics · Topic Modeling · Multimodal Machine Learning Applications
