Motif: Intrinsic Motivation from Artificial Intelligence Feedback

Martin Klissarov; Pierluca D'Oro; Shagun Sodhani; Roberta Raileanu; Pierre-Luc Bacon; Pascal Vincent; Amy Zhang; Mikael Henaff

arXiv:2310.00166 · cs.AI · October 3, 2023 · 6 cites

TL;DR

Motif derives an intrinsic reward from an LLM's preferences over pairs of captions and uses it to train reinforcement learning agents in NetHack. The LLM never has to interact with the environment, and the resulting agents reach higher game scores and show human-aligned behaviors.

Contribution

The paper presents Motif, a method that uses an LLM's preferences over pairs of captions to build an intrinsic reward for reinforcement learning agents, enabling effective learning in challenging environments without requiring the LLM itself to interact with the environment.

Findings

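01. An agent trained only on Motif's intrinsic reward achieves a higher NetHack game score than an agent trained directly to maximize the score.

02. Combining Motif's intrinsic reward with the environment reward improves performance.

03. Motif's behaviors are human-aligned and can be steered through prompt modifications.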
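The second finding can be pictured as a weighted sum of the two reward signals. The sketch below is illustrative only: the function name `combined_reward` and the weights `w_env` and `w_int` are hypothetical, and this page does not state how the paper actually mixes the two rewards.

```python
def combined_reward(r_env: float, r_int: float,
                    w_env: float = 1.0, w_int: float = 0.1) -> float:
    """Weighted sum of an environment reward and an intrinsic reward.

    Assumption: a linear mix with hand-chosen weights. The default
    weights are placeholders, not values taken from the paper.
    """
    return w_env * r_env + w_int * r_int


# Intrinsic-only training corresponds to setting the environment weight to zero.
intrinsic_only = combined_reward(r_env=5.0, r_int=2.0, w_env=0.0, w_int=1.0)

# Mixed training keeps both terms.
mixed = combined_reward(r_env=1.0, r_int=2.0, w_env=1.0, w_int=0.5)
```

Setting `w_env` to zero recovers the intrinsic-only setting described in the first finding.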

Abstract

Exploring rich environments and evaluating one's actions without prior knowledge is immensely challenging. In this paper, we propose Motif, a general method to interface such prior knowledge from a Large Language Model (LLM) with an agent. Motif is based on the idea of grounding LLMs for decision-making without requiring them to interact with the environment: it elicits preferences from an LLM over pairs of captions to construct an intrinsic reward, which is then used to train agents with reinforcement learning. We evaluate Motif's performance and behavior on the challenging, open-ended and procedurally-generated NetHack game. Surprisingly, by only learning to maximize its intrinsic reward, Motif achieves a higher game score than an algorithm directly trained to maximize the score itself. When combining Motif's intrinsic reward with the environment reward, our method significantly…
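The step of turning pairwise preferences into a reward can be sketched with a Bradley-Terry style cross-entropy loss, a common choice in preference-based reinforcement learning. This is a minimal sketch under stated assumptions, not the paper's implementation: the per-caption reward table, the captions, the labels, and the names `preference_loss` and `fit_rewards` are all hypothetical, and whether the loss matches the paper's equation 1 term by term is not confirmed by this page.

```python
import math


def preference_loss(r_a: float, r_b: float, label: float) -> float:
    """Cross-entropy between an annotator label and a Bradley-Terry prediction.

    label is 1.0 if the first caption is preferred, 0.0 if the second is,
    and 0.5 if the annotator expresses no preference (an assumed encoding).
    """
    p_a = 1.0 / (1.0 + math.exp(-(r_a - r_b)))  # P(first preferred)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p_a + eps)
             + (1.0 - label) * math.log(1.0 - p_a + eps))


def fit_rewards(pairs, steps: int = 200, lr: float = 0.5) -> dict:
    """Fit one scalar reward per caption by gradient descent on the loss.

    A lookup table stands in for the reward network purely for illustration.
    """
    rewards = {}
    for a, b, _ in pairs:
        rewards.setdefault(a, 0.0)
        rewards.setdefault(b, 0.0)
    for _ in range(steps):
        for a, b, label in pairs:
            p_a = 1.0 / (1.0 + math.exp(-(rewards[a] - rewards[b])))
            grad = p_a - label  # d(loss)/d(r_a); d(loss)/d(r_b) is -grad
            rewards[a] -= lr * grad
            rewards[b] += lr * grad
    return rewards


# Hypothetical caption pairs with made-up preference labels.
pairs = [
    ("You kill the newt!", "It's a wall.", 1.0),
    ("You find a hidden door.", "It's a wall.", 1.0),
    ("It's a wall.", "You kill the newt!", 0.0),
]
rewards = fit_rewards(pairs)
```

After fitting, captions that the annotator preferred receive higher reward than the ones they were preferred over, and that learned scalar is what an agent would then be trained to maximize.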

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 8 (accept, good paper) · Confidence 3

Strengths

- Figure 1 provides a nice, clean bird's-eye view of the overall approach and helps with readability.
- The evaluation of the agent on not just the game score but also other dimensions provides a helpful qualitative assessment of the proposed approach and baselines through the spider graph in Figure 4.
- The ablation experiments for the approach are quite exhaustive, covering scaling laws, prior vs. zero knowledge, rewordings of the prompts, etc.

Weaknesses

- Using a 70-billion-parameter LLM to generate a preference dataset from given captions is quite expensive; while I understand this is out of the scope of the paper, perhaps using a large VLM to annotate frames without captions might have been more economical?
- Given that one of the key contributions of the paper is the intrinsic reward function that is learnt from preferences extracted from the LLM, it might be worthwhile having a baseline that gives preferences using a simpler model (say sentiment anal

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 3

Strengths

1. The paper is clear and well presented.
2. The idea of using intrinsic rewards generated from an LLM's preferences is both innovative and practically useful, potentially paving the way for more human-aligned agents.
3. The method scales well with the size of the LLM and is sensitive to prompt modifications, offering flexibility and adaptability.
4. The paper provides a comprehensive analysis, covering not just the quantitative but also the qualitative behaviors of the agents.

Weaknesses

1. The paper could benefit from a more extensive comparison to other methods, especially those that also attempt to integrate LLMs into decision-making agents.
2. There is a lack of discussion on the computational cost and efficiency aspects of implementing Motif.
3. While the paper makes a strong case for Motif, it doesn't delve deeply into the limitations or potential drawbacks of relying on LLMs for intrinsic reward generation.

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 3

Strengths

- S1. First of all, this paper is well-written and well-organized.
- S2. The idea of using an LLM as a preference annotator for preference-based RL is interesting and promising.
- S3. This paper provides a loss function (equation 1) to train an intrinsic reward model.
- S4. This paper shows that training agents with intrinsic rewards is very effective.

Weaknesses

- W1. One of my main questions is whether Motif can be generally applied to other environments. Even though the NetHack Learning Environment (NLE) is a very challenging environment, it seems that the NLE may be one of the environments for which an LLM can easily annotate preferences.

Code & Models

Repositories

facebookresearch/motif · PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications