GRPO and Reflection Reward for Mathematical Reasoning in Large Language Models

Zhijie Wang

arXiv:2603.14041 · cs.AI · March 17, 2026

Open Access

TL;DR

This paper introduces a four-stage framework combining Group Relative Policy Optimization (GRPO) with reflection rewards to enhance mathematical reasoning in large language models, achieving state-of-the-art results.

Contribution

It proposes a novel training framework that proactively encourages reflection in LLMs, improving their reasoning capabilities beyond existing methods.

Findings

01. GRPO achieves state-of-the-art performance in mathematical reasoning tasks.

02. Reflection rewards significantly improve reasoning accuracy.

03. Full-parameter supervised fine-tuning outperforms low-rank adaptation.

Abstract

Enhancing the reasoning capabilities of large language models (LLMs) has garnered significant attention, with supervised fine-tuning (SFT) and reinforcement learning emerging as the dominant paradigms. While recent studies recognize the importance of reflection in reasoning, existing methods seldom encourage reflection proactively during training. This study focuses on mathematical reasoning and proposes a four-stage framework that integrates Group Relative Policy Optimization (GRPO) with a reflection reward mechanism to strengthen LLMs' self-reflective capabilities. The approach also incorporates the established accuracy and format rewards. Experimental results show that GRPO with reflection-encouraged training achieves state-of-the-art performance, and ablation studies confirm the pivotal role of the reflection reward. Comparative evaluations demonstrate…
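To make the reward design in the abstract more concrete, here is a minimal Python sketch of how accuracy, format, and reflection rewards might be combined and then turned into GRPO-style group-relative advantages. It is an illustration under stated assumptions, not the paper's implementation: the cue phrases, the `<think>`/`<answer>` tag layout, the reward weights, and every function name are hypothetical, since the abstract does not specify them. The only part taken from the standard GRPO formulation is the normalization of each reward against the mean and standard deviation of its sampling group.

```python
import re
import statistics

# Hypothetical cue phrases: the abstract does not say how reflection is detected.
REFLECTION_CUES = ("wait", "let me check", "let me verify", "re-examine")


def accuracy_reward(completion: str, gold: str) -> float:
    """1.0 if the text inside <answer>...</answer> equals the reference answer."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 1.0 if match and match.group(1).strip() == gold.strip() else 0.0


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed <think>/<answer> layout."""
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, completion, re.DOTALL) else 0.0


def reflection_reward(completion: str) -> float:
    """1.0 if any assumed reflection cue appears (case-insensitive)."""
    lowered = completion.lower()
    return 1.0 if any(cue in lowered for cue in REFLECTION_CUES) else 0.0


def total_reward(completion: str, gold: str,
                 w_acc: float = 1.0, w_fmt: float = 0.5, w_ref: float = 0.5) -> float:
    """Weighted sum of the three rewards; the weights are illustrative assumptions."""
    return (w_acc * accuracy_reward(completion, gold)
            + w_fmt * format_reward(completion)
            + w_ref * reflection_reward(completion))


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; a sample std is also a common choice
    return [(r - mean) / (std + eps) for r in rewards]
```

In this sketch, a group of completions sampled for the same prompt is scored with `total_reward`, and the resulting list is passed to `group_relative_advantages`. A completion that is correct, well formatted, and contains a reflection cue then receives a larger advantage than one that is merely correct, which is one plausible way a reflection reward could steer the policy.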

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Intelligent Tutoring Systems and Adaptive Learning