Score Regularized Policy Optimization through Diffusion Behavior

Huayu Chen; Cheng Lu; Zhengyi Wang; Hang Su; Jun Zhu

arXiv:2310.07297 · cs.LG · March 18, 2024 · 2 cites

TL;DR

This paper introduces a method that combines diffusion models with policy optimization in offline reinforcement learning, significantly speeding up action sampling without sacrificing performance.

Contribution

It proposes extracting a deterministic inference policy from critic models and pretrained diffusion behavior models, avoiding slow diffusion sampling while retaining diffusion's generative power.

Findings

01 Speeds up action sampling by more than 25 times relative to leading diffusion-based methods in locomotion tasks

02 Maintains state-of-the-art performance on D4RL benchmarks

03 Regularizes the policy gradient with the score function of a pretrained diffusion behavior model
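
The third finding rests on a short identity. What follows is a sketch, not the paper's exact formulation. It assumes the standard behavior-regularized objective with temperature β, critic Q_φ, behavior policy μ, and a deterministic policy π_θ whose entropy does not depend on θ. The symbols are chosen here for illustration, and details such as how diffusion noise levels enter the final loss are omitted.

```latex
\begin{aligned}
\max_{\theta}\; \mathcal{L}(\theta)
  &= \mathbb{E}_{s}\Big[\mathbb{E}_{a\sim\pi_\theta(\cdot|s)}\big[Q_\phi(s,a)\big]
     - \tfrac{1}{\beta}\, D_{\mathrm{KL}}\big[\pi_\theta(\cdot|s)\,\|\,\mu(\cdot|s)\big]\Big], \\
D_{\mathrm{KL}}\big[\pi_\theta(\cdot|s)\,\|\,\mu(\cdot|s)\big]
  &= -\mathcal{H}\big(\pi_\theta(\cdot|s)\big)
     - \mathbb{E}_{a\sim\pi_\theta(\cdot|s)}\big[\log\mu(a|s)\big], \\
\nabla_\theta \mathcal{L}(\theta)
  &= \mathbb{E}_{s}\Big[\big(\nabla_a Q_\phi(s,a)
     + \tfrac{1}{\beta}\,\nabla_a \log\mu(a|s)\big)\big|_{a=\pi_\theta(s)}\,
     \nabla_\theta\pi_\theta(s)\Big].
\end{aligned}
```

The term ∇_a log μ(a|s) is the score of the behavior distribution, which is the quantity a pretrained diffusion behavior model can supply.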

Abstract

Recent developments in offline reinforcement learning have uncovered the immense potential of diffusion modeling, which excels at representing heterogeneous behavior policies. However, sampling from diffusion policies is considerably slow because it necessitates tens to hundreds of iterative inference steps for one action. To address this issue, we propose to extract an efficient deterministic inference policy from critic models and pretrained diffusion behavior models, leveraging the latter to directly regularize the policy gradient with the behavior distribution's score function during optimization. Our method enjoys powerful generative capabilities of diffusion modeling while completely circumventing the computationally intensive and time-consuming diffusion sampling scheme, both during training and evaluation. Extensive results on D4RL tasks show that our method boosts action…
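
The mechanism described in the abstract can be illustrated on a one-dimensional toy problem. The sketch below is not the paper's implementation: the quadratic critic, the Gaussian behavior density with an analytic score (standing in for a pretrained diffusion model), the temperature `beta`, and all function names are assumptions made for illustration. It shows a single deterministic action being pushed by the critic gradient and pulled back by the behavior score, with no sampling loop involved.

```python
# Toy 1-D illustration; every name and constant here is an assumption,
# not taken from the paper or its repository.

def grad_q(a, q_star=2.0):
    """Gradient of a hypothetical critic Q(a) = -0.5 * (a - q_star)**2."""
    return -(a - q_star)


def behavior_score(a, mean=0.0, std=1.0):
    """Score d/da log mu(a) of a Gaussian behavior density N(mean, std**2).

    In the paper this role is played by a pretrained diffusion behavior model;
    an analytic score is used here only to keep the example self-contained.
    """
    return -(a - mean) / std**2


def extract_action(beta=1.0, lr=0.1, steps=200, a_init=0.0):
    """Gradient ascent on Q(a) + (1 / beta) * log mu(a) over a single action."""
    a = a_init
    for _ in range(steps):
        a += lr * (grad_q(a) + behavior_score(a) / beta)
    return a


def closed_form(beta=1.0, q_star=2.0, mean=0.0, std=1.0):
    """Stationary point of the same objective, for checking the ascent."""
    w = 1.0 / (beta * std**2)
    return (q_star + w * mean) / (1.0 + w)


if __name__ == "__main__":
    for beta in (0.25, 1.0, 4.0):
        print(beta, extract_action(beta=beta), closed_form(beta=beta))
```

A larger `beta` weakens the regularization, so the extracted action moves toward the critic's optimum; a smaller `beta` keeps it closer to the behavior mean.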

Peer Reviews

Decision: ICLR 2024 poster

Reviewer 01 · Rating 8 (accept, good paper) · Confidence 3

Strengths

In the offline reinforcement learning setting, the paper introduces a novel approach that leverages the diffusion model. The method uses the diffusion model's powerful modeling capabilities while avoiding its time-consuming iterative inference stage.

Weaknesses

The final policy used by the algorithm is still based on a Gaussian distribution. This Gaussian policy might not capture complex distributions as effectively as the diffusion model when dealing with complex offline datasets. The key concern here is whether the complex distribution information modeled by the pretrained diffusion behavior can be adequately captured by a policy based on a Gaussian distribution.
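
The concern above can be made concrete with a toy experiment. The sketch below is an illustration under stated assumptions, not an experiment from the paper: the behavior density is a hypothetical equal-weight mixture of two Gaussians, its score is computed analytically, and a single-point policy simply follows that score. Under these assumptions the point settles on one mode instead of the low-density region between them; whether committing to one mode is adequate for a given dataset is exactly what the reviewer is asking.

```python
import math

# Hypothetical bimodal behavior density: equal-weight mixture of two Gaussians.
MODES = (-2.0, 2.0)
STD = 0.5


def mixture_score(a):
    """d/da log mu(a) for the two-component Gaussian mixture defined above."""
    weights = [math.exp(-0.5 * ((a - m) / STD) ** 2) for m in MODES]
    total = sum(weights)
    return sum(w * (-(a - m) / STD**2) for w, m in zip(weights, MODES)) / total


def follow_score(a_init, lr=0.05, steps=300):
    """Move a single action along the behavior score (gradient ascent on log mu)."""
    a = a_init
    for _ in range(steps):
        a += lr * mixture_score(a)
    return a


if __name__ == "__main__":
    # Starting slightly right or left of the mixture mean selects a different mode.
    print(follow_score(0.3), follow_score(-0.3))
```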

Reviewer 02 · Rating 3 (reject, not good enough) · Confidence 3

Strengths

Algorithmically, the paper provides an interesting insight: in the behavior-regularized policy optimization objective, the gradient of the divergence term is directly related to the score function of the behavior policy distribution. This allows pre-trained diffusion models to be used in these objectives. Measuring the divergence term in offline regularized objectives is generally difficult, and typically a separate model is needed to approximate the be

Weaknesses

The paper is a bit hard to follow; while the claims are justified, the writing is convoluted. I believe this is because the key idea of the paper is to use pre-trained diffusion models in existing offline RL objectives, so the paper tries to lay out the context for that. However, this makes the full contribution of the work difficult to understand. Since the key idea is to use existing pre-trained diffusion models, I

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 3

Strengths

1. The paper clearly states its motivation and presents a clear illustration of the derivation of the proposed method, SRPO.
2. The paper provides reproducible experimental details and makes a relatively comprehensive comparison with both conventional behavior regularization methods and recent diffusion-based policies in offline RL, in terms of task performance and computational efficiency.

Weaknesses

1. The paper mainly aims to improve the computational efficiency of diffusion-based policies, which it highlights as important in computation-sensitive contexts such as robotics, yet there are no experiments on robot scenarios, especially with real data. Such results would make the paper's central claim more convincing.
2. The novelty of the proposed method is limited, as it seems to be a combination of previous work, and especially an increm

Code & Models

Repositories

thu-ml/srpo
PyTorch · Official

Taxonomy

Topics: Reinforcement Learning in Robotics

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion