Eureka: Human-Level Reward Design via Coding Large Language Models

Yecheng Jason Ma; William Liang; Guanzhi Wang; De-An Huang; Osbert Bastani; Dinesh Jayaraman; Yuke Zhu; Linxi Fan; Anima Anandkumar

arXiv:2310.12931 · cs.RO · May 2, 2024 · 48 cites

TL;DR

Eureka uses large language models such as GPT-4 to automatically generate reward functions for complex reinforcement learning tasks. The generated rewards outperform expert human-designed rewards on most tasks across a diverse set of environments.

Contribution

This paper introduces Eureka, a method that uses LLMs for zero-shot reward design, enabling reinforcement learning agents to acquire complex manipulation skills without task-specific prompts or pre-defined reward templates.

Findings

01 · Eureka outperforms human experts on 83% of tasks in 29 RL environments.

02 · Eureka enables a Shadow Hand to perform pen spinning tricks in simulation.

03 · The method improves reward quality and safety through in-context human feedback.

Abstract

Large Language Models (LLMs) have excelled as high-level semantic planners for sequential decision-making tasks. However, harnessing them to learn complex low-level manipulation tasks, such as dexterous pen spinning, remains an open problem. We bridge this fundamental gap and present Eureka, a human-level reward design algorithm powered by LLMs. Eureka exploits the remarkable zero-shot generation, code-writing, and in-context improvement capabilities of state-of-the-art LLMs, such as GPT-4, to perform evolutionary optimization over reward code. The resulting rewards can then be used to acquire complex skills via reinforcement learning. Without any task-specific prompting or pre-defined reward templates, Eureka generates reward functions that outperform expert human-engineered rewards. In a diverse suite of 29 open-source RL environments that include 10 distinct robot morphologies,…
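The evolutionary optimization over reward code described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not the authors' implementation: `mock_llm` is a hypothetical stand-in for an LLM call (it samples or perturbs two coefficients instead of generating code from environment source), `fitness` replaces RL training with a greedy policy on a 1-D toy task, and each generation mutates the best candidate found so far.

```python
import random
from typing import Callable, Optional, Tuple

TARGET = 5.0  # toy task (assumption): move a 1-D point from x=0 to x=TARGET

Params = Tuple[float, float]


def mock_llm(parent: Optional[Params], rng: random.Random) -> Tuple[Params, str]:
    """Hypothetical stand-in for an LLM call; returns (params, reward source code)."""
    if parent is None:
        # "Zero-shot" proposal: no previous reward code is available as context.
        a, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    else:
        # "In-context improvement": perturb the best reward found so far.
        a = parent[0] + rng.gauss(0.0, 0.3)
        b = parent[1] + rng.gauss(0.0, 0.3)
    code = f"def reward(x, target):\n    return {a!r} * abs(x - target) + {b!r} * x\n"
    return (a, b), code


def compile_reward(code: str) -> Callable[[float, float], float]:
    """Turn generated reward source code into a callable."""
    namespace: dict = {}
    exec(code, namespace)
    return namespace["reward"]


def fitness(reward_fn: Callable[[float, float], float]) -> float:
    """Stand-in for RL training: act greedily on the candidate reward,
    then score the outcome with the true task metric (negative final distance)."""
    x = 0.0
    for _ in range(10):
        x = max((x - 1.0, x, x + 1.0), key=lambda nx: reward_fn(nx, TARGET))
    return -abs(x - TARGET)


def evolutionary_search(iterations: int = 5, samples: int = 8, seed: int = 0):
    """Sample a batch of reward candidates per generation and keep the best."""
    rng = random.Random(seed)
    best_params: Optional[Params] = None
    best_code: Optional[str] = None
    best_score = float("-inf")
    for _ in range(iterations):
        # All candidates in a generation are conditioned on the same parent.
        batch = [mock_llm(best_params, rng) for _ in range(samples)]
        scored = [(fitness(compile_reward(code)), params, code) for params, code in batch]
        top_score, top_params, top_code = max(scored, key=lambda t: t[0])
        if top_score > best_score:
            best_params, best_code, best_score = top_params, top_code, top_score
    return best_code, best_score


if __name__ == "__main__":
    code, score = evolutionary_search()
    print(code)
    print("fitness:", score)
```

In this sketch the score can never exceed 0 (reaching the target exactly), and because the best candidate is carried across generations, running more generations with the same seed cannot lower the returned score.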

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

The submitted manuscript is very well written and presents a novel and interesting approach to automatically generate reward functions for simulated RL environments, which seemingly could be applied to different scenarios. It presents a clever approach to leveraging recent LLMs' zero-shot code generation ability to both understand a simulation environment and to iteratively improve generated reward functions that would be hard to manually author and tune. Moreover, the described evolutionary s

Weaknesses

One of the main weaknesses of the submitted paper is the lack of a Limitations section or any equivalent discussion throughout the text. While the authors claim the generality of Eureka, the proposed approach has only been evaluated on a single base simulator (Isaac Gym) and with a fixed RL algorithm. In other words, the claim seems to be overstated. Another weakness lies in the experiments: while the submitted text showcases different (and relevant) comparisons with human results, the human

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

I love the idea of using an LLM to provide initial versions of the reward functions, and to then improve them using evolutionary search. Moreover, the evaluation shows that the approach can deal with challenging environments, leading to good solutions or solutions for problems that have not been solved before. The work is also well motivated, and could potentially lead to interesting advances in RL itself; it would be quite interesting to see this published and available for further research. The work

Weaknesses

While the paper does a great job of selling the idea, there's a frustrating lack of technical detail in the main part of the paper. One example to illustrate this problem: The subsection on evolutionary search provides no detail on the exact inputs and outputs, or on the specific method being used. This is one core aspect of the proposed approach, and would require more details to be understandable. I understand some parts of this appear in the appendix or will be clear from code relea

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

While the idea of this paper is rather simple, it yields a surprisingly good performance, which reflects a well-structured system. Being able to bring an easy idea to such a complete and well-considered system is commendable. Moreover, this work brings insight to the reward design community by removing the dependency on collecting expert demonstration data. The study suggests that Large Language Models (LLMs) can serve as a cheap alternative to human expert demonstrations for acquiring domain-

Weaknesses

1. Unrealistic assumption of access to the environment source code: The reward code generation in this paper critically depends on having access to the source code of the MDP specification as context for the initial reward proposal. The authors have presented this as a benefit, allowing the LLM to exploit code structure to understand task environments. However, it makes an unrealistic assumption, as most reinforcement learning setups only require access to a black-box simulation. A significa

Code & Models

Repositories

eureka-research/Eureka
jax · Official

Videos

AI Conquers Gravity: Robo-dog, Trained by GPT-4, Stays Balanced on Rolling, Deflating Yoga Ball · youtube

Eureka: Human-Level Reward Design via Coding Large Language Models · slideslive

Taxonomy

Topics: Software Engineering Research · Machine Learning in Materials Science · Reinforcement Learning in Robotics

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Dense Connections · Residual Connection · Absolute Position Encodings · Adam · Byte Pair Encoding