DrS: Learning Reusable Dense Rewards for Multi-Stage Tasks

Tongzhou Mu; Minghua Liu; Hao Su

arXiv:2404.16779 · cs.LG · April 26, 2024

TL;DR

DrS introduces a data-driven method to learn reusable dense rewards for multi-stage tasks, reducing human effort and improving RL performance across diverse unseen tasks.

Contribution

We propose DrS, a novel approach that learns dense rewards from sparse rewards and demonstrations, enabling reward reuse in unseen multi-stage tasks.

Findings

01

Reused learned rewards improve RL performance and sample efficiency.

02

The learned rewards achieve performance comparable to human-engineered rewards in some cases.

03

Method generalizes well across 1000+ task variants.

Abstract

The success of many RL techniques heavily relies on human-engineered dense rewards, which typically demand substantial domain expertise and extensive trial and error. In our work, we propose DrS (Dense reward learning from Stages), a novel approach for learning reusable dense rewards for multi-stage tasks in a data-driven manner. By leveraging the stage structures of the task, DrS learns a high-quality dense reward from sparse rewards and demonstrations if given. The learned rewards can be reused in unseen tasks, thus reducing the human effort for reward engineering. Extensive experiments on three physical robot manipulation task families with 1000+ task variants demonstrate that our learned rewards can be reused in unseen tasks, resulting in improved performance and sample efficiency of RL algorithms. The learned rewards even achieve comparable performance to human-engineered…

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 8 · accept, good paper · Confidence 3

Strengths

* As far as I am aware, the overall design of the algorithm (exploiting stage indicators) as well as the form of the learned reward function are novel.
* The method significantly reduces the amount of engineering required to learn reward functions when the task can be broken down into identifiable stages.
* DrS handily outperforms several reasonable baselines and competes with the hand-engineered reward in some cases.
* In addition to the ablation study in the body of the paper, the appendix inc

Weaknesses

* The paper states that demonstrations are optional, but it sounds like they were used in all experiments. I imagine that sample efficiency would deteriorate substantially if no demonstrations are provided and the first stage cannot be solved easily by random exploration, or more generally if any stage cannot be easily solved by a noisy version of a policy that solves the previous stage.
* It is not clear that all sparse-reward tasks can be broken up into stages, and as shown in the ablation stu

Reviewer 02 · Rating 3 · reject, not good enough · Confidence 4

Strengths

- This paper makes a contribution towards automating reward design, which is of paramount importance in the field of RL. Having access to dense rewards takes the burden off of exploration, which in turn reduces the number of samples required to solve a task.
- The method is a niche application of contrastive discriminator learning, which is well-established in the literature.

Weaknesses

- The method requires success and failure trajectories for each stage in the training data, which can be expensive to collect.
- The scope of the method is limited to a family of tasks that can be divided into stages. This prevents it from being applied to other tasks such as locomotion. It also means the method is less general compared to LLM-based rewards with external knowledge [1, 2].
- Similarly, the need for stage indicators prevents the method from scaling to real-world problems, which w

Reviewer 03 · Rating 8 · accept, good paper · Confidence 3

Strengths

- The goal of this work is deriving a dense reward function from an array of training tasks to be repurposed for new, unseen tasks.
- The notion of capturing representations for a `task family` is important for enabling RL agents to learn multi-purpose policies.
- Operating on the understanding that tasks can be broken down into several segments is also logical.
- I like this paper because I think it's important to move away from engineered dense rewards, to more tangible methodologies for lear

Weaknesses

- An ablation study of the method's robustness to bad demonstrations (e.g., suboptimal or noisy ones) would be useful.

Code & Models

Repositories

tongzhoumu/DrS
pytorch

Videos

DrS: Learning Reusable Dense Rewards for Multi-Stage Tasks · slideslive

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning