Automated Rewards via LLM-Generated Progress Functions

Vishnu Sarukkai; Brennan Shacklett; Zander Majercik; Kush Bhatia; Christopher Ré; Kayvon Fatahalian

arXiv:2410.09187·cs.LG·October 28, 2024

Open Access · 3 Reviews

TL;DR

This paper presents an LLM-driven framework for automating reward function generation by estimating task progress, significantly reducing the number of reward samples needed to achieve state-of-the-art policies on complex benchmarks.

Contribution

The authors introduce a novel two-step method leveraging LLMs to generate progress functions and intrinsic rewards, reducing reward sample requirements by 20x compared to prior work.

Findings

01. Achieved state-of-the-art policies on Bi-DexHands with fewer reward samples.

02. LLM-generated progress functions combined with count-based rewards outperform alternatives.

03. Reducing reward sample complexity improves efficiency in policy learning.

Abstract

Large Language Models (LLMs) have the potential to automate reward engineering by leveraging their broad domain knowledge across various tasks. However, they often need many iterations of trial-and-error to generate effective reward functions. This process is costly because evaluating every sampled reward function requires completing the full policy optimization process for each function. In this paper, we introduce an LLM-driven reward generation framework that is able to produce state-of-the-art policies on the challenging Bi-DexHands benchmark with 20x fewer reward function samples than the prior state-of-the-art work. Our key insight is that we reduce the problem of generating task-specific rewards to the problem of coarsely estimating task progress. Our two-step solution leverages the task domain knowledge and the code synthesis abilities of LLMs to author progress functions that…
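The abstract is cut off above, so the following is only a minimal sketch of the general idea as the summary and reviews describe it: a progress function maps a state to a coarse progress estimate, the estimate is discretized into bins, and a count-based intrinsic reward favors rarely visited bins. Every name here (`progress_fn`, `CountBasedProgressReward`, `num_bins`), the 1-D toy state, and the `1/sqrt(count)` bonus are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import defaultdict


def progress_fn(state):
    """Hypothetical stand-in for an LLM-authored progress function.

    Returns a coarse progress estimate in [0, 1] for a toy 1-D reaching task.
    """
    start, goal, pos = state["start"], state["goal"], state["pos"]
    total = abs(goal - start)
    if total == 0:
        return 1.0
    # Fraction of the initial distance to the goal that has been closed.
    return min(1.0, max(0.0, 1.0 - abs(goal - pos) / total))


class CountBasedProgressReward:
    """Assumed count-based bonus over discretized progress bins."""

    def __init__(self, num_bins=10):
        self.num_bins = num_bins
        self.counts = defaultdict(int)

    def bin_index(self, progress):
        # Map progress in [0, 1] to a bin; progress == 1.0 goes in the last bin.
        return min(int(progress * self.num_bins), self.num_bins - 1)

    def reward(self, state):
        b = self.bin_index(progress_fn(state))
        self.counts[b] += 1
        # Bonus shrinks as the same progress bin is visited repeatedly.
        return 1.0 / math.sqrt(self.counts[b])


if __name__ == "__main__":
    shaper = CountBasedProgressReward(num_bins=4)
    for pos in (0.0, 0.0, 5.0, 10.0):
        state = {"start": 0.0, "goal": 10.0, "pos": pos}
        print(pos, shaper.reward(state))
```

In this sketch, revisiting the same progress bin yields a decaying bonus, while reaching a new bin yields the full bonus, which is one common way a count-based reward can encourage a policy to push toward states with higher estimated progress.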

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

1. The paper is well-organized and easy to follow, making complex concepts accessible.
2. The overall motivation behind using LLMs for automated reward engineering is compelling.
3. The experimental results are substantial, demonstrating clear performance gains over existing methods.

Weaknesses

1. The motivation for mapping progress to bins as a representation for intrinsic rewards is not clearly articulated. A common principle in intrinsic reward design is that the representation space should effectively capture the essential aspects of the original observation space. By incrementing counters in this space, the algorithm should be able to explore the full state space effectively. The authors need to provide a clearer rationale for why progress bins are chosen over other potential repr

Reviewer 02 · Rating 8 · Confidence 2

Strengths

- The use of domain-specific discretization within the projected subtask progression space is novel and yields promising results.
- The experiments are thorough, with ablation studies providing valuable insights into the function of each design component.
- The empirical results are robust, demonstrating improved performance alongside reduced sample complexity.

Weaknesses

In Figure 4, could you also provide results from Eureka? It would be interesting to see how the performance of the proposed method and of Eureka improves as more samples become available.

Reviewer 03 · Rating 5 · Confidence 3

Strengths

For the most part, the paper is easy to read and follow; the motivation is clear and the method is described well. Generally, I find the link between automatically generating a symbolic notion of task progress by an LLM (task description to progress function) and utilizing it to derive count-based rewards (which can be very effective if designed well) original and interesting; the experimental results demonstrate the efficacy quite well. I also liked that the evaluation focused on required train

Weaknesses

First, I have doubts regarding the significance of this work. The main message -- you can get more out of an LLM's ability to translate language to RL specifications by placing additional constraints and supplying further human knowledge -- is timely and in line with other recent works from the code generation community. On the other hand, the paper largely follows the problem setting of EUREKA, which, to my knowledge, has yet to demonstrate value beyond being an interesting application of LLMs. Th

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Formal Methods in Verification · AI-based Problem Solving and Planning