DaGRPO: Rectifying Gradient Conflict in Reasoning via Distinctiveness-Aware Group Relative Policy Optimization
Xuan Xie, Xuan Wang, Wenjie Wang, Shuai Chen, Wei Lin

TL;DR
DaGRPO improves reasoning in large language models by rectifying gradient conflicts through sequence-level masking and data augmentation, leading to more stable training and better performance on reasoning benchmarks.
Contribution
The paper introduces DaGRPO, a novel method that addresses gradient conflicts in policy optimization for reasoning tasks by incorporating sequence-level rectification and off-policy data augmentation.
Findings
Achieves an average accuracy gain of +4.7% on math benchmarks.
Significantly reduces training instability and gradient conflicts.
Enhances long-chain reasoning capabilities in LLMs.
Abstract
The evolution of Large Language Models (LLMs) has catalyzed a paradigm shift from superficial instruction following to rigorous long-horizon reasoning. While Group Relative Policy Optimization (GRPO) has emerged as a pivotal mechanism for eliciting such post-training reasoning capabilities due to its exceptional performance, it remains plagued by significant training instability and poor sample efficiency. We theoretically identify the root cause of these issues as the lack of distinctiveness within on-policy rollouts: for routine queries, highly homogeneous samples induce destructive gradient conflicts; whereas for hard queries, the scarcity of valid positive samples results in ineffective optimization. To bridge this gap, we propose Distinctiveness-aware Group Relative Policy Optimization (DaGRPO). DaGRPO incorporates two core mechanisms: (1) Sequence-level Gradient Rectification,…
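The hard-query failure mode described in the abstract can be illustrated with the standard GRPO group-relative advantage, in which each rollout's reward is normalized by the mean and standard deviation of its group. The sketch below is a minimal illustration of plain GRPO only, not of DaGRPO's rectification or augmentation mechanisms, whose details are truncated above. The function name `group_relative_advantages`, the `eps` constant, the binary rewards, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward within its rollout group (GRPO-style).

    Hypothetical helper: `eps` and the population standard deviation
    are illustrative choices, not taken from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hard query: no rollout is correct, so every advantage is zero and the
# group contributes no learning signal.
hard = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Mixed group: the single correct rollout gets a positive advantage and
# the incorrect rollouts get negative ones.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

print(hard)
print(mixed)
```

When every reward in a group is identical, the numerator is zero for each rollout, which is one way to read the abstract's claim that a scarcity of valid positive samples leads to ineffective optimization.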
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Advanced Graph Neural Networks
