Score Regularized Policy Optimization through Diffusion Behavior
Huayu Chen, Cheng Lu, Zhengyi Wang, Hang Su, Jun Zhu

TL;DR
This paper introduces a method that combines diffusion models with policy optimization in offline reinforcement learning, significantly speeding up action sampling without sacrificing performance.
Contribution
It proposes a novel approach to extract deterministic policies from diffusion models, avoiding slow sampling while leveraging diffusion's generative power.
Findings
Speeds up action sampling by over 25 times in locomotion tasks
Maintains state-of-the-art performance in D4RL benchmarks
Effectively regularizes policy gradients using diffusion behavior models
Abstract
Recent developments in offline reinforcement learning have uncovered the immense potential of diffusion modeling, which excels at representing heterogeneous behavior policies. However, sampling from diffusion policies is considerably slow because it necessitates tens to hundreds of iterative inference steps for one action. To address this issue, we propose to extract an efficient deterministic inference policy from critic models and pretrained diffusion behavior models, leveraging the latter to directly regularize the policy gradient with the behavior distribution's score function during optimization. Our method enjoys powerful generative capabilities of diffusion modeling while completely circumventing the computationally intensive and time-consuming diffusion sampling scheme, both during training and evaluation. Extensive results on D4RL tasks show that our method boosts action…
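The regularization described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the quadratic critic, the Gaussian behavior distribution, the linear policy, and all function names below are hypothetical stand-ins chosen so the result can be checked in closed form. In the actual method, the behavior score would come from a pretrained diffusion behavior model rather than an analytic formula.

```python
import numpy as np

# Toy stand-ins (assumptions, not taken from the paper):
#   Q(s, a)   = -0.5 * (a - 2s)^2   -> the critic prefers a = 2s
#   mu(a | s) = N(a; s, 1)          -> behavior data is centered at a = s

def q_grad_action(s, a):
    """dQ/da for the toy critic."""
    return -(a - 2.0 * s)

def behavior_score(s, a):
    """d/da log mu(a|s) for the toy Gaussian behavior distribution.
    A pretrained diffusion behavior model would supply this term in practice."""
    return -(a - s)

def train_policy(beta=1.0, lr=0.5, steps=200):
    """Gradient ascent on a deterministic linear policy a = w * s + b."""
    states = np.linspace(-1.0, 1.0, 21)
    w, b = 0.0, 0.0
    for _ in range(steps):
        a = w * states + b
        # Action-space gradient: critic term plus score regularizer.
        g = q_grad_action(states, a) + behavior_score(states, a) / beta
        # Chain rule through the policy: da/dw = s, da/db = 1.
        w += lr * np.mean(g * states)
        b += lr * np.mean(g)
    return w, b

w, b = train_policy()
print(f"learned policy: a = {w:.3f} * s + {b:.3f}")
```

For this toy problem the maximizer of Q(s, a) + (1/beta) log mu(a|s) is a = 1.5 s when beta = 1, so the learned slope should approach 1.5. A larger beta weakens the regularizer and pushes the slope toward the critic's preferred value of 2. No sampling from the behavior distribution is needed at any point; only its score is evaluated.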
Peer Reviews
Decision: ICLR 2024 poster
In the offline reinforcement learning setting, the paper introduces a novel approach that leverages the diffusion model. This method uses the powerful modeling capabilities of the diffusion model while avoiding the extensive time-consuming iterative inference stage.
The final policy used by the algorithm is still based on a Gaussian distribution. This Gaussian policy might not capture complex distributions as effectively as the diffusion model when dealing with complex offline datasets. The key concern here is whether the complex distribution information modeled by the pretrained diffusion behavior can be adequately captured by a policy based on a Gaussian distribution.
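The concern above can be made concrete with a toy sketch. It is an illustration only, not an experiment from the paper: the two-mode behavior distribution, the flat critic (the Q term is omitted), and the function names are all assumptions. It shows that a single deterministic action that follows the behavior score settles on one mode of a bimodal distribution, and which mode depends on where it starts.

```python
import numpy as np

def bimodal_score(a, means=(-1.0, 1.0), sigma=0.2):
    """d/da log mu(a) for an equal-weight two-component Gaussian mixture."""
    m = np.asarray(means)
    # Unnormalized component densities; shared constants cancel in the ratio.
    dens = np.exp(-0.5 * ((a - m) / sigma) ** 2)
    resp = dens / dens.sum()  # posterior weight of each component at a
    return float(np.sum(resp * (-(a - m) / sigma**2)))

def follow_score(a0, lr=0.01, steps=300):
    """Gradient ascent on log mu(a) starting from a0 (critic term omitted)."""
    a = a0
    for _ in range(steps):
        a += lr * bimodal_score(a)
    return a

print(follow_score(0.3))   # ends near the mode at +1
print(follow_score(-0.3))  # ends near the mode at -1
```

The point estimate covers only one of the two modes, so information about the other mode is not represented in the extracted policy. Whether that loss matters for task return is a separate question that this sketch does not address.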
Algorithmically, the paper provides an interesting insight: in the behavior-regularized policy optimization objective, the gradient of the divergence term is directly related to the score function of the behavior policy distribution. This allows pre-trained diffusion models to be used in these objectives. Measuring the divergence term in offline regularized objectives is generally difficult, where typically a separate model is needed to approximate the be
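The relationship the reviewer describes can be written out as a short derivation. The notation here is assumed for illustration and may differ from the paper's: \pi_\theta is a reparameterized policy with a = g_\theta(s, \xi), \mu is the behavior distribution, \beta is a temperature, and \epsilon^* is the optimal noise predictor of a diffusion model with forward process a_t = \alpha_t a + \sigma_t \epsilon.

```latex
% Behavior-regularized objective
\max_\theta \; \mathbb{E}_{s}\Big[\,
  \mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\, Q(s,a)
  \;-\; \tfrac{1}{\beta}\, D_{\mathrm{KL}}\big(\pi_\theta(\cdot|s)\,\|\,\mu(\cdot|s)\big)
\Big]

% Gradient of the divergence term under reparameterization a = g_\theta(s,\xi).
% The extra term E[\nabla_\theta \log \pi_\theta(a|s)] at fixed a vanishes.
\nabla_\theta D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \mu\big)
  = \mathbb{E}_{\xi}\Big[
      \big(\nabla_a \log \pi_\theta(a|s) - \nabla_a \log \mu(a|s)\big)^{\top}
      \big|_{a = g_\theta(s,\xi)}\;
      \nabla_\theta g_\theta(s,\xi)
    \Big]

% Score of the noised behavior distribution from a noise-prediction model
\nabla_{a_t} \log \mu_t(a_t|s) = -\,\frac{\epsilon^{*}(a_t|s,t)}{\sigma_t}
```

The behavior distribution enters the gradient only through \nabla_a \log \mu(a|s), its score. A noise-prediction diffusion model estimates exactly this quantity at each noise level, which is why a pretrained diffusion behavior model can supply the regularization term without a separate density model and without sampling.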
The paper is a bit hard to follow; while the claims are justified, the paper is not so well written and seems convoluted. I believe this is also because the key idea/trick of the paper is to use pre-trained diffusion models in existing offline RL objectives, so the paper tries to lay out the context for that. However, this makes the paper rather difficult to follow and its full contribution hard to understand. Since the key idea is to use existing pre-trained diffusion models, I
1. The paper clearly states its motivation and presents a clear illustration to demonstrate the derivation of the proposed method, SRPO.
2. The paper provides reproducible details for its experiments, and makes a relatively comprehensive comparison with both conventional behavior regularization methods and recent diffusion-based policies in offline RL, in terms of task performance and computational efficiency.
1. The paper mainly aims to improve the computational efficiency of diffusion-based policies, which the paper highlights as important in computation-sensitive contexts such as robotics, yet there is no experiment on robot scenarios, especially with real data. If such experimental results were provided, the central claim made by the paper would be more convincing.
2. The novelty of the proposed method is limited as it seems to be a combination of previous work, and especially an increm
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion
