Enhancing Sample Efficiency and Exploration in Reinforcement Learning through the Integration of Diffusion Models and Proximal Policy Optimization

Tianci Gao; Konstantin A. Neusypin; Dmitry D. Dmitriev; Bo Yang; Shengren Rao

arXiv:2409.01427 · cs.LG · December 16, 2025 · 2 cites

TL;DR

This paper introduces PPO-DAP, a novel on-policy reinforcement learning framework that integrates diffusion models to enhance exploration and sample efficiency without altering the core PPO algorithm.

Contribution

It proposes a two-stage method combining offline diffusion pretraining with online adaptation, improving exploration and efficiency in continuous control tasks.

Findings

1. Consistently improves early learning efficiency across eight MuJoCo tasks.
2. Matches or exceeds the top on-policy baselines in final performance on most tasks.
3. Adds only modest computational overhead relative to standard PPO.

Abstract

Proximal Policy Optimization (PPO) is widely used in continuous control due to its robustness and stable training, yet it remains sample-inefficient in tasks with expensive interactions and high-dimensional action spaces. This paper proposes PPO-DAP (PPO with Diffusion Action Prior), a strictly on-policy framework that improves exploration quality and learning efficiency without modifying the PPO objective. PPO-DAP follows a two-stage protocol. Offline, we pretrain a conditional diffusion action prior on logged trajectories to cover the action distribution supported by the behavior policy. Online, PPO updates the actor-critic using only newly collected on-policy rollouts, while the diffusion prior is adapted around the on-policy state distribution via parameter-efficient tuning (Adapter/LoRA) over a small parameter subset. For each on-policy state, the prior generates multiple action…
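
The abstract mentions adapting the diffusion prior through parameter-efficient tuning (Adapter/LoRA) over a small parameter subset. As a rough illustration of that general idea, and not the authors' implementation, the sketch below shows a LoRA-style linear layer in NumPy: the pretrained weight stays frozen and only two small low-rank factors would be trained. The class name `LoRALinear`, the layer sizes, the rank, and the scaling `alpha / rank` are all assumptions chosen for the example.

```python
import numpy as np


class LoRALinear:
    """Frozen dense layer plus a trainable low-rank update (LoRA-style sketch).

    Hypothetical illustration only; names and sizes are not from the paper.
    """

    def __init__(self, in_dim, out_dim, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen weight: stands in for one layer of a pretrained prior.
        self.W = rng.normal(0.0, 0.1, size=(out_dim, in_dim))
        # Trainable low-rank factors. B starts at zero, so the layer
        # initially reproduces the frozen layer exactly.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # x: (batch, in_dim) -> (batch, out_dim)
        base = x @ self.W.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def trainable_fraction(self):
        # Share of parameters that online adaptation would touch.
        trainable = self.A.size + self.B.size
        return trainable / (trainable + self.W.size)


layer = LoRALinear(in_dim=64, out_dim=64, rank=4)
x = np.ones((2, 64))
y0 = layer.forward(x)
frac = layer.trainable_fraction()
```

With these assumed sizes, the two factors hold 512 of the layer's 4,608 parameters (about 11%), and because `B` is initialized to zero the adapted layer starts out identical to the frozen one.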

Code & Models

Repositories

tiancigao/diffppo (PyTorch, official)

Taxonomy

Topics: Elevator Systems and Control · Traffic control and management · Smart Parking Systems Research

Methods: Entropy Regularization · Diffusion · Proximal Policy Optimization