Orchestrating Tokens and Sequences: Dynamic Hybrid Policy Optimization for RLVR

Zijun Min; Bingshuai Liu; Ante Wang; Long Zhang; Anxiang Zeng; Haibo Zhang; Jinsong Su

arXiv:2601.05607·cs.LG·January 12, 2026

TL;DR

This paper introduces Dynamic Hybrid Policy Optimization (DHPO), a reinforcement learning method that combines token-level and sequence-level importance ratios to improve the reasoning performance of large language models.

Contribution

DHPO integrates GRPO and GSPO in a single weighted surrogate objective with branch-specific clipping, which the authors report improves training stability and performance on reasoning benchmarks.
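
To make "weighted surrogate objective" and "branch-specific clipping" concrete, here is a minimal sketch of one plausible reading, not the paper's actual formulation. It assumes a PPO-style clipped term per branch, a fixed mixing weight `w`, and separate clip ranges `eps_tok` and `eps_seq`. All three parameters, their default values, and the function names are hypothetical; this page does not give the paper's weighting scheme.

```python
def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def hybrid_surrogate(tok_ratios, seq_ratio, advantage,
                     w=0.5, eps_tok=0.2, eps_seq=0.1):
    """Average per-token surrogate for one response.

    ASSUMPTION: the mixing form, w, eps_tok and eps_seq are illustrative
    placeholders, not values or equations taken from the paper.
    """
    # Sequence branch: one ratio shared by every token, with its own clip range.
    seq_term = min(seq_ratio * advantage,
                   clip(seq_ratio, 1 - eps_seq, 1 + eps_seq) * advantage)
    total = 0.0
    for r in tok_ratios:
        # Token branch: one ratio per token, with a separate clip range.
        tok_term = min(r * advantage,
                       clip(r, 1 - eps_tok, 1 + eps_tok) * advantage)
        # Weighted combination of the two branches.
        total += w * tok_term + (1 - w) * seq_term
    return total / len(tok_ratios)
```

Under these assumptions, `w=1.0` reduces to a token-only (GRPO-like) clipped objective and `w=0.0` to a sequence-only (GSPO-like) one, which is one way to see how a single objective could bridge the two.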

Findings

1. DHPO outperforms GRPO and GSPO on seven reasoning benchmarks.
2. DHPO improves training stability through branch-specific clipping.
3. Experiments demonstrate consistent performance gains across models.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising framework for optimizing large language models in reasoning tasks. However, existing RLVR algorithms operate at different granularities, each with complementary strengths and limitations. Group Relative Policy Optimization (GRPO) updates the policy with token-level importance ratios, which preserves fine-grained credit assignment but often suffers from high variance and instability. In contrast, Group Sequence Policy Optimization (GSPO) applies a single sequence-level importance ratio across all tokens in a response, which better matches sequence-level rewards but sacrifices token-wise credit assignment. In this paper, we propose Dynamic Hybrid Policy Optimization (DHPO) to bridge GRPO and GSPO within a single clipped surrogate objective. DHPO combines token-level and sequence-level importance ratios using weighting…
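
The two granularities contrasted in the abstract can be illustrated with a short sketch. It assumes per-token log-probabilities under the new and old policies are available, and it assumes the sequence-level ratio takes the length-normalized (geometric-mean) form. The function names and the sample numbers are hypothetical and are not taken from the paper.

```python
import math


def token_level_ratios(logp_new, logp_old):
    """GRPO-style: one importance ratio per token."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]


def sequence_level_ratio(logp_new, logp_old):
    """GSPO-style: a single ratio for the whole response.

    ASSUMPTION: length-normalized, i.e. the geometric mean of the
    per-token ratios.
    """
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))


# Hypothetical per-token log-probabilities for one 4-token response.
logp_old = [-1.0, -2.0, -0.5, -1.5]
logp_new = [-0.9, -2.3, -0.5, -1.4]

tok = token_level_ratios(logp_new, logp_old)   # four values, one per token
seq = sequence_level_ratio(logp_new, logp_old)  # one value shared by all tokens
```

The token-level ratios vary from position to position, which is where both the fine-grained credit assignment and the variance come from. The sequence-level ratio smooths those fluctuations into one number per response.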

Taxonomy

Topics: Reinforcement Learning in Robotics · Topic Modeling · Multimodal Machine Learning Applications