Behavior-Regularized Diffusion Policy Optimization for Offline Reinforcement Learning
Chen-Xiao Gao, Chenyang Wu, Mingjun Cao, Chenjun Xiao, Yang Yu, Zongzhang Zhang

TL;DR
This paper introduces BDPO, an offline RL framework that applies behavior regularization to diffusion-based policies, combining the expressiveness of diffusion policies with the robustness of regularization. It is validated on synthetic 2D tasks and D4RL continuous control benchmarks.
Contribution
It extends behavior-regularized RL to diffusion policies by computing the KL regularization analytically, which allows policy optimization with parameterizations more expressive than Gaussian policies.
Findings
BDPO outperforms existing methods on synthetic 2D tasks.
BDPO achieves superior results on D4RL continuous control benchmarks.
The framework combines the expressiveness of diffusion policies with the robustness provided by behavior regularization.
Abstract
Behavior regularization, which constrains the policy to stay close to some behavior policy, is widely used in offline reinforcement learning (RL) to manage the risk of hazardous exploitation of unseen actions. Nevertheless, existing literature on behavior-regularized RL primarily focuses on explicit policy parameterizations, such as Gaussian policies. Consequently, it remains unclear how to extend this framework to more advanced policy parameterizations, such as diffusion models. In this paper, we introduce BDPO, a principled behavior-regularized RL framework tailored for diffusion-based policies, thereby combining the expressive power of diffusion policies and the robustness provided by regularization. The key ingredient of our method is to calculate the Kullback-Leibler (KL) regularization analytically as the accumulated discrepancies in reverse-time transition kernels along the…
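The abstract is cut off above, so the following is only a minimal sketch of the general identity it alludes to, not the authors' implementation. It assumes two reverse-time Markov chains that start from the same standard Gaussian and use Gaussian transition kernels with a shared per-step noise scale; under those assumptions the trajectory-level KL is the expected sum of per-step Gaussian KLs, each of which has a closed form. The names `policy_mean`, `behavior_mean`, and `sigmas` are hypothetical stand-ins for the learned and behavior denoisers and the noise schedule.

```python
import random


def gaussian_kl_same_var(mu_p, mu_q, sigma):
    """Closed-form KL(N(mu_p, sigma^2 I) || N(mu_q, sigma^2 I)).

    With equal isotropic covariances this is ||mu_p - mu_q||^2 / (2 sigma^2).
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_p, mu_q))
    return sq_dist / (2.0 * sigma ** 2)


def pathwise_kl(policy_mean, behavior_mean, sigmas, dim, num_samples=256, seed=0):
    """Monte Carlo estimate of the KL between two reverse chains.

    Assumption: both chains start from N(0, I) and share the noise scale
    sigmas[n] at step n, so the trajectory KL equals the expected sum of
    per-step kernel KLs along trajectories sampled from the policy chain.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # shared initial noise
        for n, sigma in enumerate(sigmas):
            mu_p = policy_mean(x, n)
            mu_q = behavior_mean(x, n)
            total += gaussian_kl_same_var(mu_p, mu_q, sigma)
            # Advance the trajectory with the policy's kernel.
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mu_p]
    return total / num_samples


if __name__ == "__main__":
    # Toy kernels (hypothetical): the policy mean is shifted by 0.1 per dimension.
    def toy_policy(x, n):
        return [0.9 * xi + 0.1 for xi in x]

    def toy_behavior(x, n):
        return [0.9 * xi for xi in x]

    # Analytically: 3 steps * 2 dims * 0.1**2 / (2 * 0.5**2) = 0.12
    print(pathwise_kl(toy_policy, toy_behavior, sigmas=[0.5, 0.5, 0.5], dim=2))
```

In this toy setup the mean gap is constant, so the estimate does not depend on the sampled trajectories. With learned denoisers the per-step terms vary with the state, which is why the sum is taken in expectation over trajectories.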
Taxonomy
Topics: Elevator Systems and Control · Reinforcement Learning in Robotics · Traffic control and management
Methods: Diffusion · Focus
