TL;DR
This paper introduces a method to improve online reinforcement learning by leveraging expert demonstrations and arbitrary reset capabilities, using auxiliary start state distributions to significantly enhance sample efficiency and achieve state-of-the-art results.
Contribution
It proposes a novel approach that uses auxiliary start state distributions informed by safety and episode length to accelerate online RL learning.
Findings
Significant improvement in sample efficiency with auxiliary start state distributions.
State-of-the-art performance on a sparse-reward hard-exploration environment.
Using safety-informed start states accelerates the learning process.
Abstract
A long-standing problem in online reinforcement learning (RL) is poor sample efficiency, which stems from an inability to explore environments efficiently. Most attempts at efficient exploration tackle this problem in a setting where learning begins from scratch, without prior information available to bootstrap learning. However, such approaches fail to leverage expert demonstrations and simulators that can reset to arbitrary states. These affordances are valuable resources that offer enormous potential to guide exploration and speed up learning. In this paper, we explore how a small number of expert demonstrations and a simulator allowing arbitrary resets can accelerate learning during online RL. We find that training with a suitable choice of an auxiliary start state distribution that may differ from the true start state distribution of the underlying Markov Decision Process…
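The idea in the abstract, starting some training episodes from states other than the true start state, can be illustrated with a minimal sketch. This is not the authors' algorithm: the function name `make_auxiliary_start_sampler`, the mixing probability `mix`, and the optional per-state `weights` are all hypothetical, and the sketch assumes the simulator can be reset to any state visited in an expert demonstration. How the paper actually weights states (for example by safety or episode length) is not specified in the text above, so `weights` is left as a free input.

```python
import random


def make_auxiliary_start_sampler(true_start_states, demo_states,
                                 weights=None, mix=0.5, seed=0):
    """Build a sampler over episode start states (illustrative only).

    With probability `mix`, a start state is drawn from the demonstration
    states (optionally weighted); otherwise it is drawn uniformly from the
    true start states. All names and defaults here are assumptions.
    """
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must be in [0, 1]")
    if weights is not None and len(weights) != len(demo_states):
        raise ValueError("weights must match demo_states in length")
    rng = random.Random(seed)

    def sample():
        # Draw from the auxiliary (demonstration-based) distribution.
        if demo_states and rng.random() < mix:
            return rng.choices(demo_states, weights=weights, k=1)[0]
        # Fall back to the true start state distribution.
        return rng.choice(true_start_states)

    return sample


# Toy usage: two demonstrations on a grid, flattened into candidate states.
demos = [[(0, 0), (0, 1), (0, 2)], [(0, 0), (1, 0), (2, 0)]]
demo_states = [state for trajectory in demos for state in trajectory]
sampler = make_auxiliary_start_sampler([(0, 0)], demo_states, mix=0.5)
starts = [sampler() for _ in range(5)]
```

In a training loop, each sampled state would be passed to whatever arbitrary-reset call the simulator exposes before the episode begins; evaluation would still use the true start state distribution.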
Peer Reviews
Decision·Submitted to ICLR 2025
The introduction of the safety state distribution is quite interesting and the experiments in 5.4 and 5.5 provide useful insight into the benefits of this metric. Because of this, the algorithm is motivated in principle and is quite easy to implement.
Unfortunately, the baselines in the main experiments are not designed for, and do not have access to, resetting to arbitrary states, so they are not directly comparable. JSRL seems to learn a policy that explores, so it will inevitably spend more samples getting to critical states. Other state-of-the-art baselines could also have been included, such as simple behavior cloning or [1]. The MuJoCo experiments could have been augmented to be sparse, like ant maze. Showing that the algorithm match
- The problem discussed in this paper is novel. Low sample efficiency is a long-standing challenge in online RL, and this paper provides a novel perspective on addressing it.
- The experiments cover various settings, including both sparse- and dense-reward tasks and several common baselines, showing the effectiveness of the auxiliary start distribution.
See questions.
Overall, the paper is well-written and easy to follow. The method is presented clearly with sufficient notation. The idea is interesting and straightforward. The algorithm is compatible with many RL algorithms and can be applied to many scenarios.
The paper below seems highly relevant, but the authors did not discuss or compare with it:
- Contrastive Initial State Buffer for Reinforcement Learning (https://arxiv.org/abs/2309.09752v3), which comes with open-sourced code (https://github.com/uzh-rpg/cl_initial_buffer).
Besides, the current experiments cover only a 2D discrete environment (lava bridge) and three MuJoCo tasks. How would the algorithm perform on high-dimensional tasks with sparse rewards (the hard tasks in MetaWorld, for example)?
