Transformation-Augmented GRPO for Enhancing Exploration in Reasoning of Large Language Models
Khiem Le, Phuc Nguyen, Youssef Mroueh, Chi-Heng Lin, Shangqian Gao, Ting Hua, Nitesh V. Chawla

TL;DR
TA-GRPO enhances exploration in large language models by using question rephrasing to generate diverse responses, addressing gradient vanishing and diversity collapse issues in reinforcement learning.
Contribution
It introduces a simple method that automatically generates question rephrasings to improve exploration and diversity in reinforcement learning for large language models.
Findings
TA-GRPO improves pass@$k$ on multiple benchmarks.
It increases average pass@32 by 4.97 and 4.34 points for two models.
It matches the exploration quality of baselines trained on more data.
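The findings above are reported as pass@$k$. This page does not say how the metric is estimated, so the following is only a minimal sketch of one widely used choice, the unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$, assuming `n` sampled responses per question of which `c` are correct. The function name and arguments are illustrative, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one question.

    n: number of sampled responses (assumed, not from the paper)
    c: number of those responses that are correct
    k: budget of attempts being evaluated
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset contains no correct response.
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0 there.
    all_wrong = comb(n - c, k) / comb(n, k)
    return 1.0 - all_wrong


if __name__ == "__main__":
    # One correct response out of four samples, evaluated at k = 2.
    print(pass_at_k(n=4, c=1, k=2))
```

A benchmark-level score would then be the mean of this value over all questions.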
Abstract
Group Relative Policy Optimization (GRPO) has become the dominant method for reinforcement learning with verifiable rewards in large language models, but it suffers from two critical limitations: gradient vanishing and diversity collapse. When training questions are too easy or too hard, all sampled responses receive identical rewards, yielding zero gradients. Meanwhile, the model tends to collapse its responses toward a single reasoning pattern rather than exploring diverse strategies. We propose Transformation-Augmented GRPO (TA-GRPO), a simple but effective method that addresses both issues via question rephrasing. For each training question, we automatically generate multiple problem-equivalent rephrasings that alter wording, format, and information order while preserving the underlying meaning. Because these rephrasings shift the model's perceived difficulty, pooling responses…
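The gradient-vanishing issue described in the abstract can be illustrated with a small sketch. It assumes the commonly used GRPO advantage, where each reward is centered by the group mean and scaled by the group standard deviation. The reward values, the `eps` constant, and the pooled group are hypothetical; the abstract is cut off before it describes the pooling step, so the last case only illustrates the general idea that a group with mixed rewards carries a learning signal.

```python
import numpy as np


def group_relative_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """Center rewards by the group mean and scale by the group std.

    eps is an assumed small constant that avoids division by zero.
    """
    rewards = np.asarray(rewards, dtype=float)
    centered = rewards - rewards.mean()
    return centered / (rewards.std() + eps)


# Too-easy question: every sampled response is correct.
too_easy = group_relative_advantages([1, 1, 1, 1])

# Too-hard question: every sampled response is wrong.
too_hard = group_relative_advantages([0, 0, 0, 0])

# Hypothetical pooled group: four responses to the original wording (all
# wrong) plus four responses to a rephrasing (one correct). This grouping
# is an assumption for illustration, not the paper's stated procedure.
pooled = group_relative_advantages([0, 0, 0, 0] + [1, 0, 0, 0])

if __name__ == "__main__":
    print(too_easy)  # all zeros, so this group contributes no gradient
    print(too_hard)  # all zeros as well
    print(pooled)    # nonzero entries, so a learning signal is present
```

With identical rewards the centered values are all zero, so the policy-gradient term for that group is zero regardless of the scaling.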
