Enhanced DACER Algorithm with High Diffusion Efficiency

Yinuo Wang; Likun Wang; Mining Tan; Wenjun Zou; Xujie Song; Wenxuan Wang; Tong Liu; Guojian Zhan; Tianze Zhu; Shiqi Liu; Zeyu He; Feihong Zhang; Jingliang Duan; Shengbo Eben Li

arXiv:2505.23426·cs.LG·October 3, 2025

Open Access · 3 Reviews

TL;DR

DACERv2 enhances diffusion-based online reinforcement learning by introducing a Q-gradient guided denoising process and temporal weighting, significantly improving efficiency and performance with fewer diffusion steps.

Contribution

This paper introduces DACERv2, a novel method that improves diffusion policy efficiency in online RL through auxiliary Q-gradient guidance and temporal weighting mechanisms.

Findings

1. DACERv2 outperforms classical and diffusion-based algorithms on OpenAI Gym benchmarks.
2. It achieves higher performance with only five diffusion steps.
3. It demonstrates greater multimodality in control environments.

Abstract

Due to their expressive capacity, diffusion models have shown great promise in offline RL and imitation learning. Diffusion Actor-Critic with Entropy Regulator (DACER) extended this capability to online RL by using the reverse diffusion process as a policy approximator, achieving state-of-the-art performance. However, it still suffers from a core trade-off: more diffusion steps ensure high performance but reduce efficiency, while fewer steps degrade performance. This remains a major bottleneck for deploying diffusion policies in real-time online RL. To mitigate this, we propose DACERv2, which leverages a Q-gradient field objective with respect to action as an auxiliary optimization target to guide the denoising process at each diffusion step, thereby introducing intermediate supervisory signals that enhance the efficiency of single-step diffusion. Additionally, we observe that the…
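The auxiliary objective described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it only shows the general shape of a loss that regresses a denoiser's output onto the action-gradient of Q at every diffusion step. The toy quadratic critic, the linear `score_net`, the temperature `alpha`, and the linear `time_weight` are all hypothetical stand-ins. In particular, the abstract excerpt is truncated before it describes the temporal weighting, so the weighting used here is an arbitrary placeholder.

```python
import numpy as np

def q_grad_action(state, action):
    # Toy critic (assumption): Q(s, a) = -0.5 * ||a - s||^2,
    # so the gradient of Q with respect to the action is (s - a).
    return state - action

def score_net(params, state, action, t):
    # Hypothetical linear "score network" standing in for the denoiser.
    W_s, W_a, b = params
    return state @ W_s + action @ W_a + t * b

def time_weight(t, num_steps):
    # Placeholder per-step weight (assumption, not the paper's schedule).
    return (t + 1) / num_steps

def q_gradient_aux_loss(params, state, noisy_actions, alpha=1.0):
    # noisy_actions has shape (num_steps, batch, action_dim):
    # one batch of intermediate actions per diffusion step.
    num_steps = noisy_actions.shape[0]
    total = 0.0
    for t in range(num_steps):
        # Supervisory signal at step t: scaled Q-gradient at the noisy action.
        target = q_grad_action(state, noisy_actions[t]) / alpha
        pred = score_net(params, state, noisy_actions[t], t)
        per_sample = np.sum((pred - target) ** 2, axis=-1)
        total += time_weight(t, num_steps) * np.mean(per_sample)
    return total

rng = np.random.default_rng(0)
state = rng.normal(size=(4, 2))
noisy_actions = rng.normal(size=(5, 4, 2))  # five diffusion steps

# Parameters that reproduce the toy Q-gradient exactly, and all-zero parameters.
exact = (np.eye(2), -np.eye(2), np.zeros(2))
zero = (np.zeros((2, 2)), np.zeros((2, 2)), np.zeros(2))

loss_exact = q_gradient_aux_loss(exact, state, noisy_actions)
loss_zero = q_gradient_aux_loss(zero, state, noisy_actions)
print(loss_exact, loss_zero)
```

In this toy setting the loss vanishes when the network output matches the Q-gradient at every step and is positive otherwise, which is the sense in which each diffusion step receives its own supervisory signal.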

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The paper addresses a highly important and timely research question, especially as diffusion models are becoming increasingly dominant in the fields of **imitation learning**, **reinforcement learning**, and **Vision-Language-Action (VLA)** modeling.
2. The paper is **well-written** and **easy to follow**, presenting its ideas and methodologies clearly.
3. The **DACER v2** algorithm demonstrates **strong performance** compared to other **online diffusion RL** methods.

Weaknesses

### Major Weaknesses:

1. The authors claim that the **DACER v2** algorithm focuses on improving the diffusion efficiency of the original **DACER**. Accordingly, one would expect **DACER v2** to achieve comparable performance with fewer diffusion denoising steps compared to **DACER** using the full number of steps. However, the experimental results show that **DACER v2** not only maintains efficiency but also exhibits **stronger multi-modality** and **better sample efficiency** with fewer denois

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The proposed algorithm achieves strong performance on state-based OpenAI Gym environments, with higher training and inference efficiency than most baselines.
2. The paper is well-written.

Weaknesses

1. The score function in a standard diffusion SDE is the score function of the perturbed distribution $\int q_{t|0}(a_t|a_0) \frac{e^{\frac{1}{\alpha}Q(s, a_0)}}{Z(s)}da_0$ and is not in the form of Equation (9). Moreover, the non-annealed Langevin dynamics used in this paper may suffer from slow mixing, as shown in [1].
2. The method proposed in this paper is a straightforward combination of the QSM [2] policy training loss (with a newly introduced weighting function) and the DACER policy train

Reviewer 03 · Rating 2 · Confidence 5

Strengths

The overall idea is clearly presented and straightforward to implement. The proposed method is efficient both in terms of training and inference, which makes it preferable for deployment in embodied scenarios.

Weaknesses

The idea of aligning the score networks with the gradient of Q-value functions has been extensively investigated in QSM [1], DAC [2], QGPO [3], iDEM [4], and [5]. One contribution of DACER-v2 seems to be the time-based weighting. However, this is purely heuristic and theoretically unjustified. On the other hand, QGPO, iDEM, and [5] also estimate the time-dependent score, and their estimations are exact in theory. Therefore, the novelty and insight of this paper are limited. Besides, this paper

Taxonomy

Topics: Vehicle emissions and performance · Advanced Numerical Analysis Techniques · Tribology and Lubrication Engineering

Methods: Diffusion