Improving DAPO from a Mixed-Policy Perspective

Hongze Tan; Yuchen Li

arXiv:2507.12931·cs.LG·August 20, 2025

TL;DR

This paper enhances the DAPO algorithm by integrating a pre-trained guiding policy and re-utilizing zero-reward samples, resulting in improved stability and sample efficiency in policy optimization.

Contribution

It introduces two novel modifications to DAPO from a mixed-policy perspective, improving stability and efficiency through off-policy guidance and sample reuse.

Findings

1. Improved training stability and convergence speed.

2. Enhanced sample efficiency by reusing zero-reward samples.

3. Theoretical convergence guarantees within a reinforcement learning framework.

Abstract

This paper introduces two novel modifications to the Dynamic sAmpling Policy Optimization (DAPO) algorithm [1], approached from a mixed-policy perspective. Standard policy gradient methods can suffer from instability and sample inefficiency, particularly in sparse-reward settings. To address this, we first propose a method that incorporates a pre-trained, stable guiding policy ($\pi_{\phi}$) to provide off-policy experience, thereby regularizing the training of the target policy ($\pi_{\text{on}}$). This approach improves training stability and convergence speed by adaptively adjusting the learning step size. Second, we extend this idea to re-utilize zero-reward samples, which are often discarded by dynamic sampling strategies like DAPO's. By treating these samples as a distinct batch guided by the expert policy, we further enhance sample efficiency. We provide a theoretical analysis for both…
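As a rough illustration of the sample-reuse idea in the abstract, the sketch below partitions prompt groups by their reward pattern. It assumes the usual dynamic-sampling convention in which groups whose responses all receive the same reward are filtered out, and it keeps the all-zero groups in a separate bucket rather than discarding them. The function name `partition_groups` and the bucket labels are hypothetical, and how the paper actually trains on the zero-reward batch with the guiding policy is not shown here.

```python
from typing import Dict, List


def partition_groups(group_rewards: List[List[float]]) -> Dict[str, List[int]]:
    """Split prompt groups by reward pattern (illustrative only).

    group_rewards[i] holds the rewards of all responses sampled for prompt i.
    Returns the prompt indices that fall into each bucket.
    """
    buckets: Dict[str, List[int]] = {
        "mixed": [],            # rewards differ within the group: kept by dynamic sampling
        "zero_reward": [],      # every response scored 0: candidates for reuse
        "uniform_nonzero": [],  # every response has the same non-zero reward
    }
    for idx, rewards in enumerate(group_rewards):
        if not rewards:
            continue  # skip empty groups
        if all(r == 0 for r in rewards):
            buckets["zero_reward"].append(idx)
        elif all(r == rewards[0] for r in rewards):
            buckets["uniform_nonzero"].append(idx)
        else:
            buckets["mixed"].append(idx)
    return buckets


# Hypothetical rewards for four prompts with three sampled responses each.
groups = [[1.0, 0.0, 1.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
buckets = partition_groups(groups)
print(buckets)
```

Under these assumptions, the `zero_reward` bucket is the set of groups that a plain filter would drop and that the abstract proposes to treat as a distinct batch.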

Taxonomy

Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification