REvolve: Reward Evolution with Large Language Models using Human Feedback

Rishi Hazra; Alkis Sygkounas; Andreas Persson; Amy Loutfi; Pedro Zuidberg Dos Martires

arXiv:2406.01309·cs.NE·May 26, 2025·2 cites

TL;DR

REvolve leverages large language models guided by human feedback to automatically generate and refine reward functions for reinforcement learning in complex, subjective tasks, improving agent performance.

Contribution

This work introduces REvolve, an evolutionary framework that uses human feedback to guide LLMs in designing reward functions for challenging RL tasks.

Findings

01. Agents trained with REvolve rewards outperform baselines.

02. LLMs can encode implicit human knowledge into reward functions.

03. Human feedback effectively guides reward evolution.

Abstract

Designing effective reward functions is crucial to training reinforcement learning (RL) algorithms. However, this design is non-trivial, even for domain experts, due to the subjective nature of certain tasks that are hard to quantify explicitly. In recent works, large language models (LLMs) have been used for reward generation from natural language task descriptions, leveraging their extensive instruction tuning and commonsense understanding of human behavior. In this work, we hypothesize that LLMs, guided by human feedback, can be used to formulate reward functions that reflect human implicit knowledge. We study this in three challenging settings -- autonomous driving, humanoid locomotion, and dexterous manipulation -- wherein notions of "good" behavior are tacit and hard to quantify. To this end, we introduce REvolve, a truly evolutionary framework that uses LLMs for reward design in…

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- Well written, the technical approach is very clear, especially Figure 1 and Algorithm 1, which can help readers quickly understand the technical route of the paper.
- The experiments involve a wide range of task types, covering multiple dimensions of continuous and discrete action spaces and observation spaces for RL agents in virtual simulation environments.
- The authors have compared their work with a variety of baseline algorithms.

Weaknesses

The paper does not clearly describe the foundational setup of the experiments and the comparison metrics. In fact, I do not understand the role of the "fitness score" mentioned in the text. In sections 2 and 3, it is used as a supervisory signal to guide the genetic algorithm in generating the reward function. However, in section 4, it becomes an evaluation metric for the experimental results of the paper. I am unsure whether this practice is appropriate because it raises the following concerns:

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. Paper is written very clearly and is easy to understand.
2. Results included good ablations that tested the effect of human ratings as fitness functions for both REvolve and Eureka.
3. Human evaluations were conducted and full Elo scores reported between REvolve, Eureka, and their respective ablations.
4. Appendix contained detailed experimental setup information.
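
For readers unfamiliar with the Elo scores mentioned in the review above, the following is a minimal sketch of the standard Elo update applied to one pairwise human comparison. It is not taken from the paper: the function names, the K-factor of 32, and the starting rating of 1500 are illustrative assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation: probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k (the K-factor) is an assumed value, not one reported in the source.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two hypothetical policies start at an assumed rating of 1500;
    # a human evaluator prefers policy A.
    print(elo_update(1500.0, 1500.0, 1.0))
```

Because the update is zero-sum, the ratings only express relative preference among the compared methods, which is why they are reported jointly across all systems.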

Weaknesses

1. Natural language feedback to the LLM is extracted from a series of checkboxes that the user ticks. This requires a human to predefine the specific characteristics that the agent must exhibit, which may bias the prompting of the model by constraining the natural language feedback to this small discrete list of checkboxes. So while the authors argue that no manual engineering is required for providing the fitness function, there is manual engineering required to provide a good list of attribute

Reviewer 03 · Rating 6 · Confidence 3

Strengths

REvolve uses an evolutionary algorithm with genetic operators such as mutation, crossover, and selection to overcome the limitations of greedy search in Eureka, and it performs well. REvolve uses LLMs to generate reward functions explicitly and human preferences to guide that generation implicitly, which improves interpretability and alignment with humans.
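
The selection, crossover, and mutation loop this review refers to can be sketched generically as below. This is a plain elitist evolutionary loop under stated assumptions, not the paper's Algorithm 1: `evolve`, `n_keep`, and the callback names are hypothetical, and in a setting like REvolve the `mutate` and `crossover` callbacks would stand in for LLM calls while `fitness` would stand in for a score derived from human feedback.

```python
import random
from typing import Callable, List, TypeVar

T = TypeVar("T")  # an individual, e.g. the source text of a reward function


def evolve(
    population: List[T],
    fitness: Callable[[T], float],      # assumed stand-in for human-feedback fitness
    mutate: Callable[[T], T],           # assumed stand-in for an LLM mutation prompt
    crossover: Callable[[T, T], T],     # assumed stand-in for an LLM crossover prompt
    generations: int = 3,
    n_keep: int = 2,
    seed: int = 0,
) -> T:
    """Generic elitist evolutionary loop; returns the fittest individual found."""
    rng = random.Random(seed)
    for _ in range(generations):
        # Selection: keep the n_keep fittest individuals as parents.
        parents = sorted(population, key=fitness, reverse=True)[:n_keep]
        children: List[T] = []
        # Refill the population with offspring of the surviving parents.
        while len(parents) + len(children) < len(population):
            if len(parents) >= 2 and rng.random() < 0.5:
                a, b = rng.sample(parents, 2)
                children.append(crossover(a, b))
            else:
                children.append(mutate(rng.choice(parents)))
        population = parents + children
    return max(population, key=fitness)


if __name__ == "__main__":
    # Toy problem: individuals are integers and fitness is the value itself.
    best = evolve(
        [1, 2, 3, 4],
        fitness=lambda x: float(x),
        mutate=lambda x: x + 1,
        crossover=lambda a, b: max(a, b) + 1,
    )
    print(best)
```

Keeping several parents per generation, rather than only the single best candidate, is what distinguishes this kind of loop from a purely greedy search.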

Weaknesses

No major weaknesses on the topic of explicit reward function generation, but the paper lacks a comparison with implicit reward design, for example a reward model learned from human preferences.

Videos

REvolve: Reward Evolution with Large Language Models using Human Feedback· slideslive

Taxonomy

Topics: Transportation and Mobility Innovations · Topic Modeling · Artificial Intelligence in Law

Methods: ALIGN