Multi-Layer GRPO: Enhancing Reasoning and Self-Correction in Large Language Models
Fei Ding, Baiqiao Wang, Zijian Zeng, Youwei Wang

TL;DR
This paper introduces MGRPO, a multi-layer approach to improve reasoning and self-correction in large language models, leading to better performance on mathematical reasoning tasks.
Contribution
MGRPO adds a second layer for error correction, providing implicit supervision and enhancing reasoning and self-correction in LLMs.
Findings
MGRPO outperforms standard GRPO on mathematical benchmarks.
The two-layer structure improves reasoning accuracy.
The self-correction layer improves training stability.
Abstract
The Group Relative Policy Optimization (GRPO) algorithm has demonstrated considerable success in enhancing the reasoning capabilities of large language models (LLMs), as evidenced by DeepSeek-R1. However, the absence of intermediate supervision in GRPO frequently leads to inefficient exploration dynamics. A single error in a complex reasoning chain can invalidate the entire solution, resulting in abrupt reward vanishing and compromising training stability. To address these challenges, we propose MGRPO (Multi-layer GRPO). MGRPO operates in two layers: the first layer employs standard GRPO to generate an initial response. This response, along with the original query, is then fed into a second-layer GRPO process. This second layer is specifically trained to identify and correct errors in the initial response, effectively creating a self-correction loop. This mechanism provides implicit…
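The two-layer flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `group_relative_advantages` and `two_layer_rollout`, the callables `generate`, `correct`, and `reward`, and the choice to sample one correction group per initial response are all assumptions made for illustration. The advantage formula is the usual group-relative normalization (reward minus group mean, divided by group standard deviation); policy updates are omitted entirely.

```python
import statistics
from typing import Callable, List, Tuple


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # If every reward in the group is identical, all advantages are zero,
    # so that group contributes no learning signal.
    return [(r - mean) / (std + eps) for r in rewards]


def two_layer_rollout(
    query: str,
    generate: Callable[[str], str],      # hypothetical: samples an initial response
    correct: Callable[[str, str], str],  # hypothetical: revises (query, response)
    reward: Callable[[str, str], float], # hypothetical: scores a final answer
    group_size: int = 4,
) -> Tuple[List[str], List[float], List[Tuple[List[str], List[float]]]]:
    # Layer 1: sample a group of initial responses and score them.
    first = [generate(query) for _ in range(group_size)]
    first_adv = group_relative_advantages([reward(query, y) for y in first])

    # Layer 2 (assumed layout): each initial response, together with the
    # original query, seeds its own group of attempted corrections.
    layer2 = []
    for y in first:
        fixes = [correct(query, y) for _ in range(group_size)]
        fix_adv = group_relative_advantages([reward(query, z) for z in fixes])
        layer2.append((fixes, fix_adv))
    return first, first_adv, layer2
```

In a real training loop the advantages from both layers would weight the policy-gradient update; here they are only returned so the data flow between the two layers is visible.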
Taxonomy
Topics: Topic Modeling · Reinforcement Learning in Robotics · Multimodal Machine Learning Applications
