iGRPO: Self-Feedback-Driven LLM Reasoning
Ali Hatamizadeh, Shrimai Prabhumoye, Igor Gitman, Ximing Lu, Seungju Han, Wei Ping, Yejin Choi, Jan Kautz

TL;DR
The paper introduces iGRPO, a self-feedback-driven reinforcement learning method that improves large language models' mathematical reasoning through iterative draft refinement, achieving state-of-the-art results on AIME24 and AIME25.
Contribution
iGRPO extends GRPO with dynamic self-conditioning, enabling models to iteratively improve reasoning drafts and outperform previous methods on diverse mathematical benchmarks.
Findings
iGRPO outperforms GRPO across multiple models and benchmarks.
It achieves new state-of-the-art results on AIME24 and AIME25.
The refinement wrapper benefits from a generative judge and delays entropy collapse.
Abstract
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. Reinforcement Learning (RL) is a framework for aligning these models with task-specific rewards, improving overall quality and reliability. Group Relative Policy Optimization (GRPO) is an efficient, value-function-free alternative to Proximal Policy Optimization (PPO) that leverages group-relative reward normalization. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. In Stage 1, iGRPO samples multiple exploratory drafts and selects the highest-reward draft using the same scalar reward signal used for optimization. In Stage 2, it appends this best draft to the original prompt and applies a GRPO-style update on…
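The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the use of population standard deviation, the `eps` stabilizer, and the refinement prompt template are all assumptions made for illustration. The sketch covers only group-relative reward normalization, best-draft selection, and prompt construction, and leaves out the policy update itself.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own sampling group (GRPO-style).

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def select_best_draft(drafts: Sequence[str], reward_fn: Callable[[str], float]) -> str:
    """Stage 1: keep the draft with the highest scalar reward."""
    return max(drafts, key=reward_fn)


def build_refinement_prompt(prompt: str, best_draft: str) -> str:
    """Stage 2 input: the original prompt with the best draft appended.

    The template below is hypothetical; the source does not specify one.
    """
    return f"{prompt}\n\nDraft solution:\n{best_draft}\n\nRefined solution:"


if __name__ == "__main__":
    # Toy reward: 1.0 if the draft contains the expected answer, else 0.0.
    drafts = ["x = 3", "x = 4", "x = 5"]
    reward = lambda d: 1.0 if d.endswith("4") else 0.0
    best = select_best_draft(drafts, reward)
    print(build_refinement_prompt("Solve 2x = 8.", best))
    print(group_relative_advantages([reward(d) for d in drafts]))
```

In this sketch the same `reward_fn` is used for draft selection and for computing advantages, mirroring the abstract's statement that Stage 1 uses the same scalar reward signal as the optimization.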
Taxonomy
Topics: Machine Learning in Materials Science · Multimodal Machine Learning Applications · Topic Modeling
