Reinforcement Learning From Imperfect Corrective Actions And Proxy Rewards
Zhaohui Jiang, Xuening Feng, Paul Weng, Yifei Zhu, Yan Song, Tianze Zhou, Yujing Hu, Tangjie Lv, Changjie Fan

TL;DR
This paper introduces ICoPro, a novel deep reinforcement learning algorithm that effectively combines imperfect proxy rewards with human corrective actions to improve policy alignment and sample efficiency.
Contribution
The paper proposes ICoPro, a new value-based RL method that integrates human corrective feedback with proxy rewards, including pseudo-labels, to improve learning when both feedback sources may be imperfect.
Findings
ICoPro outperforms baseline methods in aligning with human preferences.
The method is more sample-efficient across various tasks.
It effectively handles different types of feedback imperfection.
Abstract
In practice, reinforcement learning (RL) agents are often trained with a possibly imperfect proxy reward function, which may lead to a human-agent alignment issue (i.e., the learned policy either converges to non-optimal performance with low cumulative rewards, or achieves high cumulative rewards but in an undesired manner). To tackle this issue, we consider a framework where a human labeler can provide additional feedback in the form of corrective actions, which express the labeler's action preferences, although this feedback may be imperfect as well. In this setting, to obtain a better-aligned policy guided by both learning signals, we propose a novel value-based deep RL algorithm called Iterative learning from Corrective actions and Proxy rewards (ICoPro), which cycles through three phases: (1) Solicit sparse corrective actions from a human labeler on the agent's demonstrated…
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: ALIGN
