GRPO and Reflection Reward for Mathematical Reasoning in Large Language Models
Zhijie Wang

TL;DR
This paper introduces a four-stage framework combining Group Relative Policy Optimization (GRPO) with reflection rewards to enhance mathematical reasoning in large language models, achieving state-of-the-art results.
Contribution
It proposes a novel training framework that proactively encourages reflection in LLMs, improving their reasoning capabilities beyond existing methods.
Findings
GRPO with reflection-encouraged training achieves state-of-the-art performance on mathematical reasoning tasks.
Reflection rewards significantly improve reasoning accuracy.
Full-parameter supervised fine-tuning outperforms low-rank adaptation.
Abstract
The enhancement of reasoning capabilities in large language models (LLMs) has garnered significant attention, with supervised fine-tuning (SFT) and reinforcement learning emerging as the dominant paradigms. While recent studies recognize the importance of reflection in reasoning, existing methods seldom encourage reflection proactively during training. This study focuses on mathematical reasoning and proposes a four-stage framework that integrates Group Relative Policy Optimization (GRPO) with a reflection reward to strengthen LLMs' self-reflective capabilities. The approach also incorporates the established accuracy and format rewards. Experimental results show that GRPO with reflection-encouraged training achieves state-of-the-art performance, and ablation studies confirm the pivotal role of the reflection reward. Comparative evaluations demonstrate…
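To make the reward design in the abstract concrete, here is a minimal Python sketch of how an accuracy, a format, and a reflection reward might be summed and then turned into group-relative advantages in the GRPO style (each sampled completion's reward is normalized by the mean and standard deviation of its group). Everything specific here is an assumption for illustration, not the paper's implementation: the `<think>`/`<answer>` layout, the cue phrases in `REFLECTION_CUES`, the weights, and all function names are hypothetical.

```python
import re
import statistics

# Hypothetical cue phrases; the paper's actual reflection criterion is not stated here.
REFLECTION_CUES = ("wait", "let me check", "re-examine", "verify")


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the assumed <think>/<answer> layout."""
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, completion, flags=re.DOTALL) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the text inside <answer> exactly matches the reference."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0


def reflection_reward(completion: str) -> float:
    """Return 1.0 if any assumed reflection cue appears in the completion."""
    text = completion.lower()
    return 1.0 if any(cue in text for cue in REFLECTION_CUES) else 0.0


def total_reward(completion: str, reference: str,
                 w_acc: float = 1.0, w_fmt: float = 0.5, w_ref: float = 0.5) -> float:
    """Weighted sum of the three rewards; the weights are illustrative only."""
    return (w_acc * accuracy_reward(completion, reference)
            + w_fmt * format_reward(completion)
            + w_ref * reflection_reward(completion))


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward by the mean and std of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, three sampled completions forming a group.
group = [
    "<think>2+2=4. Wait, let me check: yes.</think><answer>4</answer>",
    "<think>2+2=4.</think><answer>4</answer>",
    "The answer is 5",
]
rewards = [total_reward(c, "4") for c in group]
advantages = group_relative_advantages(rewards)
```

In this sketch, a completion that is correct, well formatted, and contains a reflection cue receives the largest advantage in its group, which is the behavior a reflection reward is meant to encourage. A keyword check is only a stand-in; a real reflection signal would need to guard against the model emitting cue phrases without actually re-examining its work.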
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Intelligent Tutoring Systems and Adaptive Learning
