Exploratory Diffusion Model for Unsupervised Reinforcement Learning
Chengyang Ying, Huayu Chen, Xinning Zhou, Zhongkai Hao, Hang Su, Jun Zhu

TL;DR
This paper introduces ExDM, a diffusion-model-based approach to unsupervised reinforcement learning that enhances exploration and enables rapid adaptation to downstream tasks by modeling the distribution of explored data.
Contribution
The paper proposes the Exploratory Diffusion Model (ExDM), leveraging diffusion models for better data representation and exploration in unsupervised RL, with theoretical analysis and practical algorithms.
Findings
ExDM outperforms state-of-the-art methods in exploration efficiency.
ExDM enables faster adaptation to downstream tasks.
ExDM effectively handles complex environments.
Abstract
Unsupervised reinforcement learning (URL) aims to pre-train agents by exploring diverse states or skills in reward-free environments, facilitating efficient adaptation to downstream tasks. As the agent cannot access extrinsic rewards during unsupervised exploration, existing methods design intrinsic rewards to model the explored data and encourage further exploration. However, the explored data are always heterogeneous, which demands strong representational power from both intrinsic reward models and pre-trained policies. In this work, we propose the Exploratory Diffusion Model (ExDM), which leverages the strong expressive ability of diffusion models to fit the explored data, simultaneously boosting exploration and providing an efficient initialization for downstream tasks. Specifically, ExDM can accurately estimate the distribution of collected data in the replay buffer…
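To make the idea of rewarding poorly modeled states concrete, here is a minimal toy sketch. It assumes, following one of the peer reviews, that the intrinsic reward is the diffusion model's denoising (density-estimation) loss evaluated at a state. Everything else is a hypothetical stand-in and not the paper's algorithm: states are 1-D, the "model" is a Gaussian fit to the buffer whose optimal noise prediction has a closed form, and the names `alpha_bar`, `fit_gaussian`, `predict_noise`, `intrinsic_reward`, and `NUM_STEPS` are invented for illustration.

```python
import math
import random

NUM_STEPS = 100  # assumed number of diffusion steps (hypothetical)


def alpha_bar(t):
    """Assumed cosine-style signal level for step t in 1..NUM_STEPS, always in (0, 1)."""
    return math.cos(0.5 * math.pi * t / (NUM_STEPS + 1)) ** 2


def fit_gaussian(buffer):
    """Stand-in for training: summarize 1-D buffer states by mean and variance."""
    mu = sum(buffer) / len(buffer)
    var = sum((x - mu) ** 2 for x in buffer) / len(buffer)
    return mu, max(var, 1e-6)  # floor the variance to avoid division by ~0


def predict_noise(x_t, t, mu, var):
    """Closed-form E[eps | x_t] when clean states are N(mu, var).

    This replaces the learned noise-prediction network of a real diffusion model.
    """
    a = math.sqrt(alpha_bar(t))
    b = math.sqrt(1.0 - alpha_bar(t))
    return b * (x_t - a * mu) / (a * a * var + b * b)


def intrinsic_reward(state, mu, var, num_samples=256, seed=0):
    """Monte Carlo estimate of the denoising loss at `state`, used as a novelty bonus."""
    rng = random.Random(seed)  # fixed seed so rewards are comparable across states
    total = 0.0
    for _ in range(num_samples):
        t = rng.randint(1, NUM_STEPS)
        eps = rng.gauss(0.0, 1.0)
        a = math.sqrt(alpha_bar(t))
        b = math.sqrt(1.0 - alpha_bar(t))
        x_t = a * state + b * eps  # forward-noised state
        total += (eps - predict_noise(x_t, t, mu, var)) ** 2
    return total / num_samples


replay_buffer = [-1.0, -0.5, 0.0, 0.5, 1.0]  # toy explored states
mu, var = fit_gaussian(replay_buffer)
r_seen = intrinsic_reward(0.0, mu, var)   # state inside the explored region
r_novel = intrinsic_reward(5.0, mu, var)  # state far from anything in the buffer
print(r_seen < r_novel)
```

In this toy setting the state far from the buffer receives the larger bonus, because the fitted model denoises it poorly. A real implementation would replace the closed-form denoiser with a trained network over high-dimensional states.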
Peer Reviews
Decision: ICLR 2026 Oral
- The authors state that this is the first work to successfully integrate diffusion models into the unsupervised exploration phase of RL. Using the diffusion model's density estimation loss as the intrinsic reward is a significant contribution over prior reward mechanisms (such as RND or ICM).
- The decoupled training scheme (fast Gaussian actor, slow diffusion reward-calculator) is impressive: a clever and practical solution to the primary obstacle of using generative m
- Performance gap: the paper's own experiments (Fig. 3) show that fine-tuning the simple Gaussian policy ($\pi_g$) achieves better final performance than the proposed, more complex diffusion policy fine-tuning algorithm (Algorithm 2). The reason should be explained and analyzed in depth. Compared with Fig. 3(a) and (b), the expert-normalized scores of the proposed algorithm in Fig. 3(c) are small.
- The authors stated that the performance degradatio
- Empirical gains across multiple settings: The figure indicates consistent improvements over strong unsupervised exploration baselines in URL, in cross-embodiment transfer, and when initializing diffusion policies.
- Potentially general mechanism: A diffusion-based exploratory prior could be a broadly applicable way to induce diverse skills or state coverage that helps downstream RL fine-tuning and transfer.
- Sufficient theoretical proof.
None
1. The performance seems to be very strong compared to baselines.
2. The presentation is clear and easy to follow.
1. The motivation is somewhat weak. I'll put my questions in the following section.
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: Diffusion
