Orchestrating Tokens and Sequences: Dynamic Hybrid Policy Optimization for RLVR
Zijun Min, Bingshuai Liu, Ante Wang, Long Zhang, Anxiang Zeng, Haibo Zhang, Jinsong Su

TL;DR
This paper introduces Dynamic Hybrid Policy Optimization (DHPO), a reinforcement learning method that combines token-level and sequence-level importance ratios to improve the performance of large language models on reasoning tasks.
Contribution
DHPO integrates GRPO and GSPO within a single clipped surrogate objective, weighting token-level and sequence-level importance ratios and clipping each branch separately, to improve training stability and performance on reasoning benchmarks.
Findings
DHPO outperforms GRPO and GSPO on seven reasoning benchmarks.
DHPO improves training stability through branch-specific clipping.
Experiments demonstrate consistent performance gains across models.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising framework for optimizing large language models in reasoning tasks. However, existing RLVR algorithms focus on different granularities, and each has complementary strengths and limitations. Group Relative Policy Optimization (GRPO) updates the policy with token-level importance ratios, which preserves fine-grained credit assignment but often suffers from high variance and instability. In contrast, Group Sequence Policy Optimization (GSPO) applies a single sequence-level importance ratio across all tokens in a response, which better matches sequence-level rewards but sacrifices token-wise credit assignment. In this paper, we propose Dynamic Hybrid Policy Optimization (DHPO) to bridge GRPO and GSPO within a single clipped surrogate objective. DHPO combines token-level and sequence-level importance ratios using weighting…
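To make the two granularities in the abstract concrete, the sketch below computes a per-token importance ratio (the GRPO view) and a single length-normalized sequence ratio (the GSPO view, as it is commonly defined) from per-token log-probabilities, then mixes the two clipped branches with a fixed weight. The abstract is truncated before it describes DHPO's actual weighting, so the mixing rule here is only an illustrative assumption, not the authors' method. All function names, the `weight` parameter, and the clipping ranges `eps_token` and `eps_seq` are hypothetical.

```python
import math


def token_ratios(logp_new, logp_old):
    """Token-level view: one importance ratio per generated token."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]


def sequence_ratio(logp_new, logp_old):
    """Sequence-level view: exp of the mean token log-ratio, i.e. the
    geometric mean of the token ratios (length-normalized)."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))


def clipped_term(ratio, advantage, eps):
    """PPO-style clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def hybrid_surrogate(logp_new, logp_old, advantage,
                     weight=0.5, eps_token=0.2, eps_seq=0.05):
    """ASSUMPTION: a fixed-weight average of the two clipped branches.

    Each branch gets its own clipping range, then the token-level and
    sequence-level terms are mixed per token and averaged over the response.
    The weighting scheme used by DHPO itself is not given in the source text.
    """
    seq_term = clipped_term(sequence_ratio(logp_new, logp_old), advantage, eps_seq)
    tok_terms = [clipped_term(r, advantage, eps_token)
                 for r in token_ratios(logp_new, logp_old)]
    mixed = [weight * t + (1.0 - weight) * seq_term for t in tok_terms]
    return sum(mixed) / len(mixed)


# Hypothetical per-token log-probabilities for one sampled response.
new_lp = [-1.0, -1.0, -0.7]
old_lp = [-1.2, -0.9, -0.7]
objective = hybrid_surrogate(new_lp, old_lp, advantage=1.0)
```

With `weight=1.0` the sketch reduces to a purely token-level clipped objective, and with `weight=0.0` to a purely sequence-level one, which is the sense in which a weighted objective can sit between the two approaches.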
Taxonomy
Topics: Reinforcement Learning in Robotics · Topic Modeling · Multimodal Machine Learning Applications
