Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback
Qinqing Zheng, Mikael Henaff, Amy Zhang, Aditya Grover, Brandon Amos

TL;DR
This paper introduces ONI, a scalable distributed system that leverages large language model feedback to automatically generate intrinsic rewards for reinforcement learning agents, enabling efficient learning without large offline datasets.
Contribution
The paper presents a novel distributed architecture, ONI, that learns intrinsic rewards from LLM feedback during online RL, overcoming previous scalability and data collection limitations.
Findings
Achieves state-of-the-art results on NetHack tasks
Removes the need for large offline datasets
Demonstrates effective reward modeling with various algorithmic choices
Abstract
Automatically synthesizing dense rewards from natural language descriptions is a promising paradigm in reinforcement learning (RL), with applications to sparse reward problems, open-ended exploration, and hierarchical skill design. Recent works have made promising steps by exploiting the prior knowledge of large language models (LLMs). However, these approaches suffer from important limitations: they are either not scalable to problems requiring billions of environment samples, due to requiring LLM annotations for each observation, or they require a diverse offline dataset, which may not exist or be impossible to collect. In this work, we address these limitations through a combination of algorithmic and systems-level contributions. We propose ONI, a distributed architecture that simultaneously learns an RL policy and an intrinsic reward function using LLM feedback. Our approach…
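The abstract's core idea, learning an intrinsic reward function from LLM feedback while the policy is still collecting experience, can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `stub_llm_label`, `HashRewardModel`, `run_online_loop`, the keyword rule, and the mixing weight `beta` are all hypothetical names and choices introduced for illustration, and the real system is distributed rather than single-threaded.

```python
from collections import deque


def stub_llm_label(caption: str) -> int:
    """Hypothetical stand-in for an LLM call: 1 if the caption looks like progress."""
    return int("gold" in caption or "stairs" in caption)


class HashRewardModel:
    """Toy intrinsic reward model (an assumption): stores one label per caption."""

    def __init__(self):
        self.labels = {}

    def update(self, caption: str, label: int) -> None:
        self.labels[caption] = label

    def reward(self, caption: str) -> float:
        # Captions that have not been annotated yet earn no intrinsic reward.
        return float(self.labels.get(caption, 0))


def run_online_loop(stream, beta=0.5, annotate_per_step=1):
    """Interleave acting and annotation over (caption, extrinsic_reward) pairs."""
    model = HashRewardModel()
    queue = deque()
    totals = []
    for caption, r_ext in stream:
        # Actor side: score the observation with the current reward model.
        totals.append(r_ext + beta * model.reward(caption))
        if caption not in model.labels:
            queue.append(caption)
        # Annotator side: label a bounded number of queued captions per step,
        # so annotation cost does not grow with every environment sample.
        for _ in range(min(annotate_per_step, len(queue))):
            queued = queue.popleft()
            model.update(queued, stub_llm_label(queued))
    return totals, model


stream = [("You see gold.", 0.0), ("It is dark.", 0.0), ("You see gold.", 0.0)]
totals, model = run_online_loop(stream)
print(totals)
```

In this sketch the first visit to a caption earns no intrinsic reward because its label has not arrived yet; only later visits benefit, which mirrors the lag an asynchronous annotator would introduce.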
Peer Reviews
Decision·Submitted to ICLR 2025
1. The proposed research problem (scaling up LLMs to generate dense intrinsic rewards for online learning) is compelling and significant for advancing LLM agents' ability to tackle complex, long-horizon tasks in real-world applications without assuming access to expert demonstrations.
2. The paper demonstrates strong engineering effort, employing online learning with the help of LLMs to train an agent for up to 2 billion environment steps, a notable advance compared to prior work that required e
1. (main) The authors compare their work only with the Motif baseline algorithm. It would be beneficial to include comparisons with goal-conditioned reward design approaches as well. In the related work section, specifically in the goal-conditioned reward design subsection, the authors list relevant studies but do not critique their limitations or highlight the advantages of their approach. Adding these comments, as done in other subsections, would strengthen the discussion.
2. In Figure 5.4, it
ONI seems to be a novel approach with promising results for learning intrinsic rewards. Leveraging LLMs to learn or infer reward functions is an interesting and relevant contemporary topic.
In general, the writing could be improved for clarity. Some sections contain unnecessarily long sentences, which impede understanding. The work could be scoped better to clarify exactly what the contributions are.
- The paper is well-written, and the method is clearly introduced.
- The limitations of previous approaches are effectively summarized, and the problem investigated is significant.
- My primary concern lies in the novelty of the contribution. The proposed method, ONI, appears heavily intertwined with Sample Factory [1], adding only a minor extension. The distributed architecture should, in my view, be credited to Sample Factory rather than this work. Furthermore, both the LLM-based annotation and the intrinsic reward design have been previously introduced in existing research [2,3]. As a result, I believe this paper may not be suitable for acceptance at top-tier venues.
- Ca
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multi-Agent Systems and Negotiation
