Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

Gengsheng Li; Tianyu Yang; Junfeng Fang; Mingyang Song; Mao Zheng; Haiyun Guo; Dan Zhang; Jinqiao Wang; Tat-Seng Chua

arXiv:2604.02288 · cs.LG · April 3, 2026

TL;DR

This paper introduces SRPO, a unified reinforcement learning framework that combines the strengths of GRPO and SDPO, improving performance and stability in large language model training.

Contribution

The paper proposes Sample-Routed Policy Optimization (SRPO), an on-policy method that routes each sample to one of two optimization strategies, with the aim of improving both training stability and final performance.

Findings

01. SRPO surpasses both GRPO and SDPO on five benchmarks.

02. SRPO improves average performance by 3.4% over GRPO and 6.3% over SDPO.

03. SRPO reduces per-step compute cost by up to 17.2%.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher's signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples…
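
To make the "coarse credit assignment" point concrete, here is a minimal sketch of the group-relative advantage commonly associated with GRPO. It is not the paper's code. The function names, the `eps` stabilizer, the use of population standard deviation, and the toy rewards are all assumptions made for illustration. The sketch normalizes each rollout's scalar reward against its own group and then copies that single value to every token in the rollout.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's scalar reward against its own group.

    `eps` is a hypothetical stabilizer so an all-equal group does not
    divide by zero. Population std is an assumption; implementations vary.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, token_counts):
    """Assign every token in a rollout that rollout's single advantage."""
    return [[a] * n for a, n in zip(advantages, token_counts)]


# Toy data (assumed): four rollouts for one prompt, scored by a verifier
# as correct (1.0) or failed (0.0), with different response lengths.
rewards = [1.0, 0.0, 0.0, 1.0]
token_counts = [3, 5, 2, 4]

advantages = group_relative_advantages(rewards)
per_token = broadcast_to_tokens(advantages, token_counts)

for reward, token_advs in zip(rewards, per_token):
    print(f"reward={reward:.1f} tokens={len(token_advs)} advantage={token_advs[0]:+.3f}")
```

Every token in a failed rollout receives the same negative value, regardless of which token caused the failure. This is the uniform penalty that the abstract contrasts with denser, logit-level supervision.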
