Enhancing Sample Efficiency and Exploration in Reinforcement Learning through the Integration of Diffusion Models and Proximal Policy Optimization
Tianci Gao, Konstantin A. Neusypin, Dmitry D. Dmitriev, Bo Yang, Shengren Rao

TL;DR
This paper introduces PPO-DAP, a novel on-policy reinforcement learning framework that integrates diffusion models to enhance exploration and sample efficiency without altering the core PPO algorithm.
Contribution
It proposes a two-stage method combining offline diffusion pretraining with online adaptation, improving exploration and efficiency in continuous control tasks.
Findings
Consistently improves early learning efficiency across eight MuJoCo tasks.
Matches or exceeds top on-policy baselines in final performance on most tasks.
Maintains modest computational overhead compared to standard PPO.
Abstract
Proximal Policy Optimization (PPO) is widely used in continuous control due to its robustness and stable training, yet it remains sample-inefficient in tasks with expensive interactions and high-dimensional action spaces. This paper proposes PPO-DAP (PPO with Diffusion Action Prior), a strictly on-policy framework that improves exploration quality and learning efficiency without modifying the PPO objective. PPO-DAP follows a two-stage protocol. Offline, we pretrain a conditional diffusion action prior on logged trajectories to cover the action distribution supported by the behavior policy. Online, PPO updates the actor-critic only using newly collected on-policy rollouts, while the diffusion prior is adapted around the on-policy state distribution via parameter-efficient tuning (Adapter/LoRA) over a small parameter subset. For each on-policy state, the prior generates multiple action…
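The abstract states that PPO-DAP leaves the PPO objective unmodified. For readers unfamiliar with that objective, the following is a minimal sketch of the standard PPO clipped surrogate in plain Python. It is not taken from the paper's code: the function name `ppo_clip_objective`, its parameter names, and the default `clip_eps=0.2` are illustrative assumptions, and the diffusion prior is not modeled here.

```python
import math


def ppo_clip_objective(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate over a batch (a quantity to maximize).

    new_logps / old_logps: log-probabilities of the taken actions under the
    current policy and under the policy that collected the rollout.
    advantages: advantage estimates for the same (state, action) pairs.
    All names and the default clip_eps are illustrative assumptions.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space.
        ratio = math.exp(new_lp - old_lp)
        # Restrict the ratio to the interval [1 - eps, 1 + eps].
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the pessimistic (smaller) of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

When the new and old policies agree, the ratio is 1 and the objective reduces to the mean advantage. When the ratio drifts outside the clip interval in the direction that would raise the objective, the clipped term caps the gain, which is what limits the size of each policy update.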
Taxonomy
Topics: Elevator Systems and Control · Traffic control and management · Smart Parking Systems Research
Methods: Entropy Regularization · Diffusion · Proximal Policy Optimization
