Improving DAPO from a Mixed-Policy Perspective
Hongze Tan, Yuchen Li

TL;DR
This paper enhances the DAPO algorithm by integrating a pre-trained guiding policy and re-utilizing zero-reward samples, resulting in improved stability and sample efficiency in policy optimization.
Contribution
It introduces two novel modifications to DAPO from a mixed-policy perspective, improving stability and efficiency through off-policy guidance and sample reuse.
Findings
Improved training stability and convergence speed.
Enhanced sample efficiency by reusing zero-reward samples.
Theoretical convergence guarantees within a reinforcement learning framework.
Abstract
This paper introduces two novel modifications to the Dynamic sAmpling Policy Optimization (DAPO) algorithm [1], approached from a mixed-policy perspective. Standard policy gradient methods can suffer from instability and sample inefficiency, particularly in sparse reward settings. To address this, we first propose a method that incorporates a pre-trained, stable guiding policy to provide off-policy experience, thereby regularizing the training of the target policy. This approach improves training stability and convergence speed by adaptively adjusting the learning step size. Second, we extend this idea to re-utilize zero-reward samples, which are often discarded by dynamic sampling strategies like DAPO's. By treating these samples as a distinct batch guided by the expert policy, we further enhance sample efficiency. We provide a theoretical analysis for both…
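The sample-reuse idea in the abstract can be illustrated with a minimal sketch. It assumes a group-based setup in which each prompt has several sampled responses with scalar rewards, and in which dynamic sampling drops groups whose rewards are all identical because their group-normalized advantages are zero. The function names `group_advantages` and `partition_groups` are hypothetical, and the sketch only shows how all-zero groups could be set aside as a distinct batch. How the paper then uses the guiding policy on that batch is not recoverable from the truncated abstract and is not shown.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def partition_groups(groups):
    """Split prompt groups by reward pattern (hypothetical helper).

    groups maps a prompt id to the rewards of its sampled responses.
    Returns three dicts:
      mixed       - rewards differ, so advantages are informative
      zero_reward - every reward is 0; normally discarded, kept here
                    as a separate batch for possible reuse
      saturated   - every reward is the same nonzero value
    """
    mixed, zero_reward, saturated = {}, {}, {}
    for prompt_id, rewards in groups.items():
        if all(r == 0 for r in rewards):
            zero_reward[prompt_id] = rewards
        elif len(set(rewards)) == 1:
            saturated[prompt_id] = rewards
        else:
            mixed[prompt_id] = rewards
    return mixed, zero_reward, saturated


if __name__ == "__main__":
    # Toy rewards, made up for illustration only.
    toy = {"a": [1, 0, 0, 1], "b": [0, 0, 0, 0], "c": [1, 1, 1, 1]}
    mixed, zero_reward, saturated = partition_groups(toy)
    print("mixed:", list(mixed))
    print("zero-reward (reusable):", list(zero_reward))
    print("saturated:", list(saturated))
    print("advantages for 'b':", group_advantages(toy["b"]))
```

In this toy example the all-zero group yields advantages of exactly zero, which is why a standard group-normalized update gets no signal from it and why some separate treatment is needed to learn from such samples.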
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification
