Reinforcement Learning From Imperfect Corrective Actions And Proxy Rewards

Zhaohui Jiang; Xuening Feng; Paul Weng; Yifei Zhu; Yan Song; Tianze Zhou; Yujing Hu; Tangjie Lv; Changjie Fan

arXiv:2410.05782·cs.LG·October 10, 2024

TL;DR

This paper introduces ICoPro, a novel deep reinforcement learning algorithm that effectively combines imperfect proxy rewards with human corrective actions to improve policy alignment and sample efficiency.

Contribution

The paper proposes ICoPro, a new value-based RL method that learns from both human corrective actions and proxy rewards, and also makes use of pseudo-labels, in settings where either learning signal may be imperfect.
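
To make the idea of training one value function on two learning signals concrete, here is a minimal sketch. It is not the paper's formulation: it assumes a one-step TD loss on the proxy reward and swaps in a generic large-margin loss on the labeler's corrective actions. The function names, the `margin` and `weight` parameters, and the way the two terms are summed are all hypothetical choices made for illustration.

```python
import numpy as np


def td_loss(q, q_next, actions, rewards, dones, gamma=0.99):
    """Mean squared one-step TD error computed from the proxy reward."""
    q = np.asarray(q, dtype=float)
    q_next = np.asarray(q_next, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    dones = np.asarray(dones, dtype=float)
    idx = np.arange(len(actions))
    # Bootstrapped target; terminal transitions (done = 1) do not bootstrap.
    target = rewards + gamma * (1.0 - dones) * q_next.max(axis=1)
    return float(np.mean((q[idx, actions] - target) ** 2))


def margin_loss(q, corrective_actions, margin=1.0):
    """Large-margin loss (an assumption, not taken from the source):
    zero only when the labeler's action beats every other action by `margin`."""
    q = np.asarray(q, dtype=float)
    idx = np.arange(len(corrective_actions))
    penalty = np.full_like(q, margin)
    penalty[idx, corrective_actions] = 0.0  # no penalty on the labeled action
    return float(np.mean((q + penalty).max(axis=1) - q[idx, corrective_actions]))


def combined_loss(q, q_next, actions, rewards, dones, corrective_actions,
                  gamma=0.99, margin=1.0, weight=0.5):
    """Hypothetical weighted sum of the reward-driven and action-driven terms."""
    return (td_loss(q, q_next, actions, rewards, dones, gamma)
            + weight * margin_loss(q, corrective_actions, margin))
```

In this sketch `weight` sets how strongly the corrective actions pull on the Q-values relative to the proxy reward. Because both signals may be imperfect, a fixed weight is only the simplest possible choice.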

Findings

1. ICoPro outperforms baseline methods in aligning with human preferences.

2. The method is more sample-efficient across various tasks.

3. It effectively handles different types of feedback imperfection.

Abstract

In practice, reinforcement learning (RL) agents are often trained with a possibly imperfect proxy reward function, which may lead to a human-agent alignment issue (i.e., the learned policy either converges to non-optimal performance with low cumulative rewards, or achieves high cumulative rewards but in an undesired manner). To tackle this issue, we consider a framework where a human labeler can provide additional feedback in the form of corrective actions, which express the labeler's action preferences, although this feedback may be imperfect as well. In this setting, to obtain a better-aligned policy guided by both learning signals, we propose a novel value-based deep RL algorithm called Iterative learning from Corrective actions and Proxy rewards (ICoPro), which cycles through three phases: (1) Solicit sparse corrective actions from a human labeler on the agent's demonstrated…

Taxonomy

Topics: Reinforcement Learning in Robotics

Methods: ALIGN