Reasoning with Latent Diffusion in Offline Reinforcement Learning
Siddarth Venkatraman, Shivesh Khaitan, Ravi Tej Akella, John Dolan, Jeff Schneider, Glen Berseth

TL;DR
This paper introduces a novel offline reinforcement learning method using latent diffusion to model trajectory sequences as latent skills, improving performance on complex tasks by reducing extrapolation errors and handling multi-modal data.
Contribution
The work proposes leveraging latent diffusion to model in-support trajectories as latent skills, enabling better Q-learning and state representation in offline RL.
Findings
Achieves state-of-the-art results on D4RL benchmarks.
Excels in long-horizon, sparse-reward tasks.
Handles multi-modal data effectively.
Abstract
Offline reinforcement learning (RL) holds promise as a means to learn high-reward policies from a static dataset, without the need for further environment interactions. However, a key challenge in offline RL lies in effectively stitching portions of suboptimal trajectories from the static dataset while avoiding extrapolation errors arising due to a lack of support in the dataset. Existing approaches use conservative methods that are tricky to tune and struggle with multi-modal data (as we show) or rely on noisy Monte Carlo return-to-go samples for reward conditioning. In this work, we propose a novel approach that leverages the expressiveness of latent diffusion to model in-support trajectory sequences as compressed latent skills. This facilitates learning a Q-function while avoiding extrapolation error via batch-constraining. The latent space is also expressive and gracefully copes…
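The batch-constraining idea in the abstract can be illustrated with a minimal sketch: draw candidate latent skills from a generative prior fitted to the dataset, score each with a Q-function, and act on the highest-scoring one, so the maximization never leaves the support of the sampler. Everything below is an assumption made for illustration and is not the authors' implementation: `select_latent`, `toy_sampler`, and `toy_q` are hypothetical names, and the two-mode Gaussian sampler is a stand-in for a trained latent diffusion prior.

```python
import numpy as np


def select_latent(q_fn, sample_latents, state, num_candidates=16):
    """Pick the highest-Q latent among candidates drawn from a prior.

    Restricting the argmax to sampled candidates keeps the choice within
    the support of the sampler (the batch-constraining idea).
    """
    candidates = np.asarray(sample_latents(state, num_candidates))
    q_values = np.array([q_fn(state, z) for z in candidates])
    best = int(np.argmax(q_values))
    return candidates[best], float(q_values[best])


# --- Toy stand-ins (hypothetical, for illustration only) ---
rng = np.random.default_rng(0)


def toy_sampler(state, n):
    # Stand-in for a latent prior p(z | s): two modes near (-1, -1) and (1, 1),
    # mimicking a multi-modal dataset.
    modes = rng.choice([-1.0, 1.0], size=(n, 1))
    return modes + 0.05 * rng.standard_normal((n, 2))


def toy_q(state, z):
    # Hypothetical Q-function that prefers latents close to (1, 1).
    return -float(np.sum((np.asarray(z) - 1.0) ** 2))


state = np.zeros(3)
z_star, q_star = select_latent(toy_q, toy_sampler, state, num_candidates=32)
print(z_star, q_star)
```

In this sketch, the quality of the selected latent depends on both the sampler covering good modes and the number of candidates drawn; a sampler that collapses multi-modal data to its mean would never propose either mode.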
Peer Reviews
Decision: ICLR 2024 poster
Exploring how to leverage expressive models such as diffusion models and Transformers for policy learning is an important direction in RL, especially for the offline setting. To the best of my knowledge, the idea of combining high-level diffusion planning and low-level primitive learning is novel. The paper is well-organized and clearly written.
While well-motivated, I have some questions about this work: 1. My major concern is that the results on offline RL benchmarks may be insufficient to show the advantage of planning with diffusion models on latent action space. While I appreciate that the authors have covered the most popular state-of-the-art baselines in Table 1, I think it is necessary to compare LDCQ with some literature that similarly performs planning on the learned action representation space learned by VAE [1, 2, 3, 4] or
Strengths
The paper proposes a novel idea: Combining BCQ and latent diffusion models can lead to an offline RL algorithm that has the best of both worlds. Overall, this idea and the execution and analysis presented in this work are significant and of high quality. 1. The proposed method aims to fill a gap in existing methods such as Diffuser and Decision Diffuser by taking inspiration from latent diffusion models (LDMs): shifting diffusion into the latent space and separating the training pro
Weaknesses
Overall, the paper has some statements/claims that are stretched, the empirical evidence seems like a mixed bag with certain D4RL environments/tasks silently omitted, and some weaknesses inherited from choosing latent diffusion models.
Stretched claims
1. The phrases “improves credit assignment”, “faster reward propagation” describing the proposed work should be avoided, or backed by empirical evidence. I don’t see how either of these quantities can be measured. I understand ho
This paper proposes an interesting idea that decouples the diffusion model from the policy decoder, allowing the algorithm to be used in both continuous and discrete action environments. Experiment results are primarily for continuous-action D4RL offline tasks, and show relatively improved performance across a few benchmarks. The execution of the algorithm is interesting, but I worry about the easy-ness of the approach. I agree with the authors that the reason that the algorithm is structu
1. Experiment results are difficult to follow. The D4RL results are the primary ones, but the appendix claims to have results for CARLA and Goal Conditioning tasks too? It seems the goal conditioned tasks are not the standard ones in GCRL literature, and it is not clear what the key takeaway is other than the constrained offline RL results + qualitative results evaluating the latents across the horizon.
2. The proposed algorithm is interesting, but might have difficulty with the execution and
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: Diffusion
