Sampling from Energy-based Policies using Diffusion
Vineet Jain, Tara Akhound-Sadegh, Siamak Ravanbakhsh

TL;DR
This paper introduces a diffusion-based sampling method for energy-based policies in reinforcement learning, enabling more expressive, multimodal action distributions and improving sample efficiency in continuous control tasks.
Contribution
It proposes Diffusion Q-Sampling, a novel actor-critic approach that leverages diffusion models to sample from complex energy-based policies, surpassing Gaussian approximations.
Findings
Enhances sample efficiency in continuous control environments.
Captures multimodal action behaviors effectively.
Addresses limitations of Gaussian policy approximations.
Abstract
Energy-based policies offer a flexible framework for modeling complex, multimodal behaviors in reinforcement learning (RL). In maximum entropy RL, the optimal policy is a Boltzmann distribution derived from the soft Q-function, but direct sampling from this distribution in continuous action spaces is computationally intractable. As a result, existing methods typically use simpler parametric distributions, like Gaussians, for policy representation -- limiting their ability to capture the full complexity of multimodal action distributions. In this paper, we introduce a diffusion-based approach for sampling from energy-based policies, where the negative Q-function defines the energy function. Based on this approach, we propose an actor-critic method called Diffusion Q-Sampling (DQS) that enables more expressive policy representations, allowing stable learning in diverse environments. We…
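To make the abstract's central object concrete, the following is a minimal sketch of a Boltzmann policy pi(a|s) proportional to exp(Q(s, a) / alpha), with the negative Q-function acting as the energy. It is not the DQS algorithm and does not use diffusion: the one-dimensional Q-function `q_value`, the temperature `alpha = 0.5`, and the grid-based sampling are all assumptions chosen for illustration. The sketch only shows why a single Gaussian is a poor fit when Q has several modes.

```python
import numpy as np


def q_value(a):
    """Hypothetical bimodal Q-function over a 1-D action, with peaks at a = -1 and a = +1."""
    return -4.0 * (a**2 - 1.0) ** 2


def boltzmann_probs(actions, alpha):
    """Discretized Boltzmann policy: probabilities proportional to exp(Q(a) / alpha)."""
    logits = q_value(actions) / alpha
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


# Assumed action grid and temperature (illustrative values, not from the paper).
actions = np.linspace(-2.0, 2.0, 401)
probs = boltzmann_probs(actions, alpha=0.5)

# Exact sampling is easy on a 1-D grid; it is the continuous, high-dimensional
# case that is intractable and motivates a learned sampler.
rng = np.random.default_rng(0)
samples = rng.choice(actions, size=5000, p=probs)

# A moment-matched Gaussian centres itself between the two modes,
# i.e. near a = 0, where the Q-value is lowest between the peaks.
mean = float(np.sum(probs * actions))
std = float(np.sqrt(np.sum(probs * (actions - mean) ** 2)))

frac_positive = float((samples > 0).mean())
frac_near_zero = float((np.abs(samples) < 0.25).mean())
print(f"Gaussian fit: mean={mean:.3f}, std={std:.3f}")
print(f"samples with a > 0: {frac_positive:.2f}, samples near 0: {frac_near_zero:.4f}")
```

In this toy setting the Boltzmann samples split between the two peaks, while the Gaussian fit places its own mode between them. Lowering `alpha` sharpens the peaks and raising it flattens the distribution, which is the temperature trade-off the reviews discuss.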
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The novel approach can learn multimodal action distributions, which is especially valuable when multiple optimal trajectories exist. 2. By explicitly sampling from the Boltzmann distribution of the Q-function, DQS shows a better ability to balance exploration and exploitation. 3. Experiments on maze tasks and DeepMind Control Suite benchmarks confirm the advantages of DQS.
1. As the authors point out, the temperature of DQS must be tuned manually, unlike in SAC, because computing likelihoods under the diffusion model would be computationally very expensive. 2. No ablation study. Ablations would be beneficial: for example, how sensitive DQS is to different temperature values and to K (the number of Monte Carlo samples, and how it relates to computation cost), or isolating the contribution of each technique introduced.
Proposes a novel Boltzmann policy iteration that is more efficient and is still bound to recover the optimal policy.
Lack of novelty: the paper simply integrates diffusion into traditional SAC, which lacks innovation. Benchmarking in a custom environment is unpersuasive, and the tests are not backed by quantitative data.
- This article proposes sampling with a diffusion strategy obeying a Boltzmann distribution to balance exploration and exploitation, focusing on a very cutting-edge area; - This paper does a multimodal experiment to show that DQS has some multimodality, a point that may be of interest to the RL community; - The writing of the paper is easy to follow.
- The related work is not presented carefully enough, and some works are only cited. In particular, the related work under Online diffusion is scarce; the authors should summarise each approach and where its flaws lie. In addition, the **diffusion & online RL** related work also needs to be expanded: a recent paper accepted at NeurIPS 2024, Diffusion Actor-Critic with Entropy Regulator (Wang et al.), is also in this setting. - You mention that the Q-score method does not hav
Originality - The application of iDEM is (to this reviewer's knowledge) novel; although other methods seek to use diffusion model policies, they typically use other methods for fitting the diffusion model. The application of iDEM is novel. Quality - The empirical results given are strong. The first set of results demonstrates well that DQS can indeed learn a policy which has support on multiple different solution types for problems. The second set of results shows that DQS can learn well, and
046 - The authors give methods of policy representations in the continuous setting. I would suggest that they mention SQL, which allows for the training of expressive policies which come from neither noise injection nor parametric family. These are trained via Stein-variational gradient descent. 071 - The claim is made that "[Diffusion models] have been extensively applied to solve sequential decision-making tasks, especially in offline settings where they can model multimodal datasets from su
Taxonomy
Topics: Environmental Impact and Sustainability · Energy, Environment, Economic Growth
Methods: Diffusion
