Sampling from Energy-based Policies using Diffusion

Vineet Jain; Tara Akhound-Sadegh; Siamak Ravanbakhsh

arXiv:2410.01312·cs.LG·September 9, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces a diffusion-based sampling method for energy-based policies in reinforcement learning, enabling more expressive, multimodal action distributions and improving sample efficiency in continuous control tasks.

Contribution

It proposes Diffusion Q-Sampling, a novel actor-critic approach that leverages diffusion models to sample from complex energy-based policies, surpassing Gaussian approximations.

Findings

01 · Enhances sample efficiency in continuous control environments.

02 · Captures multimodal action behaviors effectively.

03 · Addresses limitations of Gaussian policy approximations.

Abstract

Energy-based policies offer a flexible framework for modeling complex, multimodal behaviors in reinforcement learning (RL). In maximum entropy RL, the optimal policy is a Boltzmann distribution derived from the soft Q-function, but direct sampling from this distribution in continuous action spaces is computationally intractable. As a result, existing methods typically use simpler parametric distributions, like Gaussians, for policy representation -- limiting their ability to capture the full complexity of multimodal action distributions. In this paper, we introduce a diffusion-based approach for sampling from energy-based policies, where the negative Q-function defines the energy function. Based on this approach, we propose an actor-critic method called Diffusion Q-Sampling (DQS) that enables more expressive policy representations, allowing stable learning in diverse environments. We…
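To make the limitation of Gaussian policies concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical one-dimensional, bimodal Q-function (`toy_q`) and a hand-picked temperature. Because the action space is a small grid, the Boltzmann policy pi(a) ∝ exp(Q(a) / temperature) can be normalized and sampled exactly, which is precisely what becomes intractable in high-dimensional continuous action spaces.

```python
import numpy as np

def boltzmann_probs(q_values, temperature):
    """Normalized Boltzmann weights pi(a) ∝ exp(Q(a) / temperature) on a grid."""
    logits = q_values / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def toy_q(actions):
    """Hypothetical bimodal Q-function: equally good actions near -1 and +1."""
    return -4.0 * (actions**2 - 1.0) ** 2

def mass_near_zero(x, width=0.25):
    """Fraction of samples that fall between the two modes."""
    return float(np.mean(np.abs(x) < width))

rng = np.random.default_rng(0)
actions = np.linspace(-2.0, 2.0, 2001)
probs = boltzmann_probs(toy_q(actions), temperature=0.2)

# Exact sampling is only possible here because the 1-D grid is tiny.
samples = rng.choice(actions, size=10_000, p=probs)

# Moment-matched Gaussian: same mean and standard deviation as the samples.
gauss_samples = rng.normal(samples.mean(), samples.std(), size=10_000)

print("mass near a=0, Boltzmann:", mass_near_zero(samples))
print("mass near a=0, Gaussian :", mass_near_zero(gauss_samples))
```

In this toy setting the Boltzmann samples concentrate around the two high-value actions, while the moment-matched Gaussian places a substantial share of its mass on the low-value region between them.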

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The novel approach can learn multimodal actions, which is especially valuable when multiple optimal trajectories exist.
2. By explicitly sampling from the Boltzmann distribution of the Q-function, DQS is shown to balance exploration and exploitation better.
3. Experiments on maze tasks and DeepMind Control Suite benchmarks confirm the advantages of DQS.

Weaknesses

1. As the authors point out, the temperature of DQS must be tuned manually, unlike in SAC, because computing likelihoods under the diffusion model would be computationally very expensive.
2. There is no ablation study. It may be beneficial to include some: for example, how sensitive is DQS to different temperature values and to K (the number of Monte Carlo samples), and how does K relate to computation cost? Ablations could also isolate the contribution of each technique introduced.
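The trade-off between K and computation cost can be illustrated with a small sketch. It assumes a K-sample, self-normalized Monte Carlo estimate of the score of the noised Boltzmann density, in the spirit of the iDEM-style estimator mentioned in the reviews; whether DQS uses exactly this form is an assumption, and `q_fn`, `q_grad`, and `mc_score` are hypothetical names. The toy Q-function is quadratic so that the exact score is available in closed form for comparison.

```python
import numpy as np

def q_fn(a):
    """Hypothetical toy Q-function; with temperature 1 the target is N(0, 1)."""
    return -0.5 * a**2

def q_grad(a):
    """Analytic gradient of the toy Q-function."""
    return -a

def mc_score(a_t, sigma_t, temperature, num_samples, rng):
    """K-sample estimate of the score of the noised Boltzmann density at a_t.

    Estimates grad_{a_t} log E_{a ~ N(a_t, sigma_t^2)}[exp(Q(a) / temperature)].
    Each call costs num_samples evaluations of Q and of its gradient.
    """
    a = a_t + sigma_t * rng.standard_normal(num_samples)
    logits = q_fn(a) / temperature
    w = np.exp(logits - logits.max())  # subtract the max for stability
    w /= w.sum()  # self-normalized importance weights
    return float(np.sum(w * q_grad(a)) / temperature)

rng = np.random.default_rng(0)
a_t, sigma_t = 1.0, 0.5
exact = -a_t / (1.0 + sigma_t**2)  # closed form for this Gaussian toy case

rmse_by_k = {}
for k in (10, 100, 1_000, 10_000):
    estimates = np.array([mc_score(a_t, sigma_t, 1.0, k, rng) for _ in range(200)])
    rmse_by_k[k] = float(np.sqrt(np.mean((estimates - exact) ** 2)))
    print(f"K={k:>6}  RMSE={rmse_by_k[k]:.4f}")
```

In this sketch the estimation error shrinks as K grows, while the number of Q-function and gradient evaluations per score estimate grows linearly with K. A temperature ablation could reuse the same loop with different `temperature` values.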

Reviewer 02 · Rating 3 · Confidence 3

Strengths

Proposes a novel Boltzmann policy iteration that is more efficient and is still bound to recover the optimal policy.

Weaknesses

Lack of novelty: the paper simply integrates diffusion into traditional SAC, which lacks innovation. Benchmarking in a custom environment is unpersuasive, and the evaluation is not quantified with data.

Reviewer 03 · Rating 3 · Confidence 4

Strengths

- This article proposes sampling with a diffusion strategy that follows a Boltzmann distribution to balance exploration and exploitation, focusing on a very cutting-edge area.
- The paper includes a multimodal experiment showing that DQS exhibits some multimodality, a point that may be of interest to the RL community.
- The writing of the paper is easy to follow.

Weaknesses

- The related work is not presented carefully enough, and some works are only cited. In particular, the related work on online diffusion is scarce; the authors should summarise each approach and where its flaws lie. The **diffusion & online RL** related work also needs expanding: a recent paper accepted at NeurIPS 2024, Diffusion Actor-Critic with Entropy Regulator (Wang et al.), also falls under this setting.
- You mention that the Q-score method does not hav

Reviewer 04 · Rating 3 · Confidence 4

Strengths

Originality - The application of iDEM is (to this reviewer's knowledge) novel; although other methods seek to use diffusion model policies, they typically fit the diffusion model in other ways.
Quality - The empirical results given are strong. The first set of results demonstrates well that DQS can indeed learn a policy which has support on multiple different solution types for problems. The second set of results shows that DQS can learn well, and

Weaknesses

046 - The authors give methods of policy representation in the continuous setting. I would suggest that they mention SQL, which allows for the training of expressive policies that come from neither noise injection nor a parametric family. These are trained via Stein variational gradient descent.
071 - The claim is made that "[Diffusion models] have been extensively applied to solve sequential decision-making tasks, especially in offline settings where they can model multimodal datasets from su

Taxonomy

Topics: Environmental Impact and Sustainability · Energy, Environment, Economic Growth

Methods: Diffusion