Data Driven Reward Initialization for Preference based Reinforcement Learning
Mudit Verma, Subbarao Kambhampati

TL;DR
This paper introduces a data-driven reward initialization approach for Preference-based Reinforcement Learning that reduces variability and improves performance without additional human effort.
Contribution
The work proposes a novel reward initialization method that makes the initial reward predictions uniform across the state space, which decreases run-to-run variability and improves PbRL performance.
Findings
Reduces reward model variability across runs
Improves overall PbRL performance
Maintains low human effort and computational cost
Abstract
Preference-based Reinforcement Learning (PbRL) methods use binary feedback from the human in the loop (HiL) over queried trajectory pairs to learn a reward model that approximates the human's underlying reward function, which captures their preferences. In this work, we investigate the high variability of initialized reward models, which are sensitive to the random seed of the experiment. This compounds the problem of degenerate reward functions that PbRL methods already suffer from. We propose a data-driven reward initialization method that adds no cost for the human in the loop and negligible cost for the PbRL agent. We show that it makes the predicted rewards of the initialized reward model uniform over the state space, which reduces the variability in the method's performance across multiple runs and is shown to…
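The abstract does not spell out the initialization procedure, so the following is only a minimal sketch of one way the idea could look, not the authors' method. It assumes a small two-layer reward network and a batch of unlabeled states (random stand-ins here; in practice they might come from the agent's own experience), and it regresses the predicted rewards toward a constant before any preference labels are used. The names `init_reward_model`, `predict_reward`, and `flatten_rewards`, the constant target of 0, and all hyperparameters are hypothetical.

```python
import numpy as np

def init_reward_model(rng, state_dim, hidden=32):
    # Randomly initialised two-layer reward model (assumed architecture).
    # Its initial predictions depend on the random seed.
    return {
        "W1": rng.normal(0.0, 1.0 / np.sqrt(state_dim), (state_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, 1)),
        "b2": np.zeros(1),
    }

def predict_reward(params, states):
    # Forward pass: one tanh hidden layer, scalar reward per state.
    h = np.tanh(states @ params["W1"] + params["b1"])
    return (h @ params["W2"] + params["b2"]).ravel()

def flatten_rewards(params, states, target=0.0, lr=0.01, steps=2000):
    # Hypothetical pre-training step: gradient descent on
    # 0.5 * mean((r(s) - target)^2) over unlabeled states, so that the
    # initial predictions become close to constant. No human labels are used.
    params = {k: v.copy() for k, v in params.items()}
    n = states.shape[0]
    for _ in range(steps):
        h = np.tanh(states @ params["W1"] + params["b1"])
        err = h @ params["W2"] + params["b2"] - target   # shape (n, 1)
        grad_h = (err @ params["W2"].T) * (1.0 - h ** 2)  # backprop through tanh
        grads = {
            "W1": states.T @ grad_h / n,
            "b1": grad_h.mean(axis=0),
            "W2": h.T @ err / n,
            "b2": err.mean(axis=0),
        }
        for k in params:
            params[k] -= lr * grads[k]
    return params

rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, (256, 4))   # stand-in for collected states
model = init_reward_model(rng, state_dim=4)
before = predict_reward(model, states)
after = predict_reward(flatten_rewards(model, states), states)
print("std before:", before.std(), "std after:", after.std())
```

Comparing the spread of `before` and `after` over several seeds gives a rough picture of how much of the seed-dependent variation in the initial reward predictions such a step removes.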
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Software Engineering Methodologies
