Improving Multimodal Interactive Agents with Reinforcement Learning from Human Feedback

Josh Abramson; Arun Ahuja; Federico Carnevale; Petko Georgiev; Alex Goldin; Alden Hung; Jessica Landon; Jirka Lhotka; Timothy Lillicrap; Alistair Muldal; George Powell; Adam Santoro; Guy Scully; Sanjana Srivastava; Tamara von Glehn; Greg Wayne; Nathaniel Wong; Chen Yan; Rui Zhu

arXiv:2211.11602 · cs.LG · November 22, 2022 · 5 cites

Open Access

TL;DR

This paper demonstrates how reinforcement learning from human feedback (RLHF) can enhance the behavior of simulated embodied agents by leveraging human judgments to create effective reward models, improving interaction quality.

Contribution

The study introduces a novel 'Inter-temporal Bradley-Terry' (IBT) method to model human judgments and applies RLHF to improve agent performance in complex, embodied environments.

Findings

01. Agents trained with IBT-based rewards outperform baselines.

02. Human judgments effectively guide reinforcement learning in embodied domains.

03. Improved agent behavior aligns better with human preferences.

Abstract

An important goal in artificial intelligence is to create agents that can both interact naturally with humans and learn from their feedback. Here we demonstrate how to use reinforcement learning from human feedback (RLHF) to improve upon simulated, embodied agents trained to a base level of competency with imitation learning. First, we collected data of humans interacting with agents in a simulated 3D world. We then asked annotators to record moments where they believed that agents either progressed toward or regressed from their human-instructed goal. Using this annotation data we leveraged a novel method - which we call "Inter-temporal Bradley-Terry" (IBT) modelling - to build a reward model that captures human judgments. Agents trained to optimise rewards delivered from IBT reward models improved with respect to all of our metrics, including subsequent human judgment during live…
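
As a rough illustration of the Bradley-Terry idea behind the reward model described in the abstract, the sketch below compares two moments within the same episode and scores how well a set of per-timestep values agrees with "progress" and "regress" annotations. It is a minimal sketch under stated assumptions, not the paper's IBT implementation: the function names, the `(t_earlier, t_later, label)` annotation format, and the choice of a sigmoid over the score difference are all hypothetical and chosen only for illustration.

```python
import math


def bt_preference_prob(score_later: float, score_earlier: float) -> float:
    """Bradley-Terry probability that the later moment is judged better.

    Assumption: preference is a logistic function of the score difference.
    """
    return 1.0 / (1.0 + math.exp(-(score_later - score_earlier)))


def intertemporal_bt_loss(scores, annotations):
    """Mean negative log-likelihood of the annotations under the scores.

    scores: one float per timestep (stand-in for a reward model's outputs).
    annotations: list of (t_earlier, t_later, label) tuples, where label is
        +1 for annotated progress and -1 for annotated regress.
        This tuple format is hypothetical.
    """
    total = 0.0
    for t_earlier, t_later, label in annotations:
        p_progress = bt_preference_prob(scores[t_later], scores[t_earlier])
        # Use the probability of the outcome the annotator actually marked.
        p = p_progress if label > 0 else 1.0 - p_progress
        total -= math.log(p)
    return total / len(annotations)


# Toy episode: progress between steps 0 and 2, regress between steps 2 and 3.
annotations = [(0, 2, +1), (2, 3, -1)]
consistent_scores = [0.0, 1.0, 2.0, 1.0]    # rises, then falls
inconsistent_scores = [2.0, 1.0, 0.0, 1.0]  # falls, then rises

loss_consistent = intertemporal_bt_loss(consistent_scores, annotations)
loss_inconsistent = intertemporal_bt_loss(inconsistent_scores, annotations)
```

In this toy setup, scores that rise where annotators marked progress and fall where they marked regress yield the lower loss, which is the property a learned reward model of this kind would be trained toward.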

Taxonomy

Topics: Explainable Artificial Intelligence (XAI)

Methods: Balanced Selection