DaGRPO: Rectifying Gradient Conflict in Reasoning via Distinctiveness-Aware Group Relative Policy Optimization

Xuan Xie; Xuan Wang; Wenjie Wang; Shuai Chen; Wei Lin

arXiv:2512.06337 · cs.AI · January 1, 2026

TL;DR

DaGRPO improves reasoning in large language models by rectifying gradient conflicts through sequence-level masking and data augmentation, leading to more stable training and better performance on reasoning benchmarks.

Contribution

The paper introduces DaGRPO, a novel method that addresses gradient conflicts in policy optimization for reasoning tasks by incorporating sequence-level rectification and off-policy data augmentation.

Findings

01. Achieves +4.7% average accuracy on math benchmarks.
02. Significantly reduces training instability and gradient conflicts.
03. Enhances long-chain reasoning capabilities in LLMs.

Abstract

The evolution of Large Language Models (LLMs) has catalyzed a paradigm shift from superficial instruction following to rigorous long-horizon reasoning. While Group Relative Policy Optimization (GRPO) has emerged as a pivotal mechanism for eliciting such post-training reasoning capabilities due to its exceptional performance, it remains plagued by significant training instability and poor sample efficiency. We theoretically identify the root cause of these issues as the lack of distinctiveness within on-policy rollouts: for routine queries, highly homogeneous samples induce destructive gradient conflicts; whereas for hard queries, the scarcity of valid positive samples results in ineffective optimization. To bridge this gap, we propose Distinctiveness-aware Group Relative Policy Optimization (DaGRPO). DaGRPO incorporates two core mechanisms: (1) Sequence-level Gradient Rectification,…
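The hard-query failure mode described in the abstract can be illustrated with the group-normalized advantage commonly used in GRPO, where each rollout's reward is standardized against the other rollouts for the same query. The sketch below shows only that baseline computation, not DaGRPO's rectification or augmentation steps, which the truncated abstract does not specify. The function name `group_relative_advantages` and the `eps` stabilizer are illustrative assumptions.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for a single query.

    Hypothetical helper: assumes the common GRPO form (r - mean) / (std + eps).
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


# Hard query: no rollout is correct, so all rewards are identical.
# Every advantage is zero and the group contributes no learning signal.
hard = group_relative_advantages([0, 0, 0, 0])

# Mixed group: the single correct rollout gets a positive advantage,
# and the incorrect rollouts get negative ones.
mixed = group_relative_advantages([1, 0, 0, 0])

print(hard)
print(mixed)
```

When every rollout in a group receives the same reward, the standardized advantages vanish, which is one way to read the abstract's point that scarce positive samples lead to ineffective optimization.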

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Advanced Graph Neural Networks