A State Augmentation based approach to Reinforcement Learning from Human Preferences
Mudit Verma, Subbarao Kambhampati

TL;DR
This paper introduces a state augmentation technique for preference-based reinforcement learning that enhances reward robustness and improves early training performance across multiple domains.
Contribution
The proposed state augmentation method significantly improves reward recovery and early training performance in preference-based reinforcement learning.
Findings
Enhanced reward recovery compared to baseline PEBBLE
Improved early training performance across three domains
Method is effective across diverse tasks, from simple control (Mountain Car) to locomotion (Quadruped-Walk) and robotic manipulation (Sweep-Into)
Abstract
Reinforcement Learning has suffered from poor reward specification and from reward hacking, even in simple domains. Preference-Based Reinforcement Learning attempts to solve this issue by using binary feedback from a human in the loop, who indicates preferences between queried trajectory pairs of the agent's behavior, to learn a reward model. In this work, we present a state augmentation technique that allows the agent's reward model to be robust and to follow an invariance consistency, which significantly improves performance, i.e. the reward recovery and the subsequent return computed using the learned policy, over our baseline PEBBLE. We validate our method on three domains, Mountain Car, a locomotion task of Quadruped-Walk, and a robotic manipulation task of Sweep-Into, and find that using the proposed augmentation the agent not only benefits in the overall performance but does…
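As a rough illustration of how binary feedback on trajectory pairs can train a reward model, the sketch below uses a Bradley-Terry preference model with a cross-entropy loss, a common choice in preference-based RL. The abstract does not state the loss used, so this is an assumption, and every name here (segment_return, preference_probability, preference_loss, the linear reward function) is hypothetical. It does not implement the paper's state augmentation.

```python
import numpy as np


def segment_return(reward_fn, segment):
    """Sum the predicted reward over a segment of shape [T, state_dim]."""
    return float(sum(reward_fn(state) for state in segment))


def preference_probability(reward_fn, seg_a, seg_b):
    """Probability that seg_a is preferred over seg_b (Bradley-Terry, assumed)."""
    diff = segment_return(reward_fn, seg_a) - segment_return(reward_fn, seg_b)
    # Logistic function of the difference in predicted segment returns.
    return 1.0 / (1.0 + np.exp(-diff))


def preference_loss(reward_fn, seg_a, seg_b, label):
    """Binary cross-entropy; label is 1 if the human prefers seg_a, else 0."""
    p = preference_probability(reward_fn, seg_a, seg_b)
    eps = 1e-12  # guards against log(0)
    return float(-(label * np.log(p + eps) + (1 - label) * np.log(1.0 - p + eps)))


# Hypothetical linear reward model, used only to exercise the functions.
weights = np.array([1.0, 1.0])


def linear_reward(state):
    return float(weights @ state)


seg_high = np.ones((3, 2))   # predicted return 6.0 under linear_reward
seg_low = np.zeros((3, 2))   # predicted return 0.0 under linear_reward

p_high = preference_probability(linear_reward, seg_high, seg_low)
loss_agree = preference_loss(linear_reward, seg_high, seg_low, label=1)
loss_disagree = preference_loss(linear_reward, seg_high, seg_low, label=0)
```

Here the loss is small when the human label agrees with the reward model's ranking of the two segments and large when it disagrees, which is the signal that would drive updates to the reward model's parameters.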
Taxonomy
Topics: Reinforcement Learning in Robotics · Autonomous Vehicle Technology and Safety · Data Stream Mining Techniques
