Eureka: Human-Level Reward Design via Coding Large Language Models
Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, Anima Anandkumar

TL;DR
Eureka leverages large language models like GPT-4 to automatically generate effective reward functions for complex reinforcement learning tasks, outperforming human-designed rewards across diverse environments.
Contribution
This paper introduces Eureka, a novel method that uses LLMs for zero-shot reward design, enabling learning of complex manipulation skills without task-specific prompts or templates.
Findings
Eureka outperforms human experts on 83% of tasks in 29 RL environments.
Eureka enables a Shadow Hand to perform pen spinning tricks in simulation.
The method improves reward quality and safety through in-context human feedback.
Abstract
Large Language Models (LLMs) have excelled as high-level semantic planners for sequential decision-making tasks. However, harnessing them to learn complex low-level manipulation tasks, such as dexterous pen spinning, remains an open problem. We bridge this fundamental gap and present Eureka, a human-level reward design algorithm powered by LLMs. Eureka exploits the remarkable zero-shot generation, code-writing, and in-context improvement capabilities of state-of-the-art LLMs, such as GPT-4, to perform evolutionary optimization over reward code. The resulting rewards can then be used to acquire complex skills via reinforcement learning. Without any task-specific prompting or pre-defined reward templates, Eureka generates reward functions that outperform expert human-engineered rewards. In a diverse suite of 29 open-source RL environments that include 10 distinct robot morphologies,…
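The phrase "evolutionary optimization over reward code" can be illustrated with a minimal sketch. This is not the authors' implementation: `make_toy_proposer` is a hypothetical stand-in for the LLM (it perturbs a single coefficient instead of writing new code), and `toy_evaluate` is a hypothetical stand-in for training an RL policy and measuring task fitness. Only the overall loop (propose several candidate reward programs, score each, carry the best one forward as context for the next round) is taken from the abstract's description.

```python
import random
import re
from typing import Callable, List, Tuple


def evolutionary_reward_search(
    propose: Callable[[str, str], List[str]],
    evaluate: Callable[[str], float],
    task: str,
    iterations: int = 3,
) -> Tuple[str, float]:
    """Propose reward programs, score them, and keep the best one seen so far."""
    best_code, best_score = "", float("-inf")
    for _ in range(iterations):
        # The previous best program is passed back as context for the next round.
        for code in propose(task, best_code):
            score = evaluate(code)
            if score > best_score:
                best_code, best_score = code, score
    return best_code, best_score


def make_toy_proposer(seed: int, samples: int = 4) -> Callable[[str, str], List[str]]:
    """Hypothetical stand-in for an LLM: perturbs one coefficient of the reward."""
    rng = random.Random(seed)

    def propose(task: str, previous_best: str) -> List[str]:
        # Recover the coefficient of the previous best program, if there is one.
        match = re.search(r"-(\d+\.\d+) \* dist", previous_best)
        center = float(match.group(1)) if match else 1.0
        coeffs = [abs(center + rng.uniform(-0.5, 0.5)) for _ in range(samples)]
        return [f"def reward(dist):\n    return -{k:.3f} * dist\n" for k in coeffs]

    return propose


def toy_evaluate(code: str) -> float:
    """Hypothetical stand-in for RL training: prefers reward(1.0) close to -2.0."""
    namespace: dict = {}
    exec(code, namespace)  # run the candidate reward program
    return -abs(namespace["reward"](1.0) + 2.0)


if __name__ == "__main__":
    code, score = evolutionary_reward_search(
        make_toy_proposer(seed=0), toy_evaluate, "reach the target", iterations=5
    )
    print(code)
    print(score)
```

In a real system the expensive step is the evaluation, since each candidate reward requires a full RL training run; the toy scorer above only keeps the loop structure visible.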
Peer Reviews
Decision: ICLR 2024 poster
The submitted manuscript is very well written and presents a novel and interesting approach to automatically generate reward functions for simulated RL environments, which seemingly could be applied to different scenarios. It presents a clever approach to leveraging recent LLMs' zero-shot code generation ability to both understand a simulation environment and to iteratively improve generated reward functions that would be hard to manually author and tune. Moreover, the described evolutionary s
One of the main weaknesses of the submitted paper is the lack of a Limitations section/discussion, or such discussion throughout the text. While the authors claim the generality of Eureka, the proposed approach has only been evaluated on a single base simulator (Isaac Gym) and with a fixed RL algorithm. In other words, the claim seems to be overstated. Another weakness is the experimental section: while the submitted text showcases different (and relevant) comparisons with human results, the human
I love the idea of using an LLM to provide initial versions of the reward functions, and then to improve them using evolutionary search. Moreover, the evaluation shows that the approach can deal with challenging environments, leading to good solutions or solutions for problems that have not been solved before. The work is also well motivated and could potentially lead to interesting advances in RL itself; it would be quite interesting to see this published and available for further research. The work
While the paper does a great job of selling the idea, there's a frustrating lack of technical detail in the main part of the paper. One example to illustrate this problem: the subsection on evolutionary search provides no detail on the exact inputs and outputs, or on the specific method being used. This is a core aspect of the proposed approach and would require more detail to be understandable. I understand some parts of this appear in the appendix or will be clear from code relea
While the idea of this paper is rather simple, it yields surprisingly good performance, which reflects a well-structured system. Being able to bring an easy idea to such a complete and well-considered system is commendable. Moreover, this work brings insight to the reward design community by removing the dependency on collecting expert demonstration data. The study suggests that Large Language Models (LLMs) can serve as a cheap alternative to human expert demonstrations for acquiring domain-
1. Unrealistic assumption of access to the environment source code: The reward code generation in this paper critically depends on having access to the source code of the MDP specification as context for the initial reward proposal. The authors have presented this as a benefit, allowing the LLM to exploit code structure to understand task environments. However, it makes an unrealistic assumption, as most reinforcement learning setups only require access to a black-box simulation. A significa
Taxonomy
Topics: Software Engineering Research · Machine Learning in Materials Science · Reinforcement Learning in Robotics
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Dense Connections · Residual Connection · Absolute Position Encodings · Adam · Byte Pair Encoding
