Learning from Suboptimal Demonstration via Self-Supervised Reward Regression
Letian Chen, Rohan Paleja, Matthew Gombolay

TL;DR
This paper introduces a novel self-supervised reward regression method that effectively learns from suboptimal demonstrations, significantly improving reward estimation and policy performance in robotic tasks.
Contribution
The authors develop a new approach that bootstraps suboptimal demonstrations to synthesize optimality-parameterized data for training reward functions, overcoming limitations of previous ranking-based methods.
Findings
Achieves ~0.95 correlation with ground-truth reward
Policy improvements of ~200% over suboptimal demos
Physical robot demonstration producing faster shots with more topspin
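As a rough illustration of the first finding, the sketch below shows how a correlation between learned and ground-truth returns could be computed. It assumes the Pearson coefficient, which this summary does not confirm; the function name `pearson_correlation` and the sample values are hypothetical and not taken from the paper.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation between two equal-length sequences of returns."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Unnormalized covariance and variances; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up placeholder values, for illustration only (not results from the paper).
ground_truth_returns = [1.0, 2.0, 3.0, 4.0]
learned_returns = [2.1, 3.9, 6.2, 8.0]
r = pearson_correlation(ground_truth_returns, learned_returns)
```

A value near 1 means the learned reward orders and scales trajectories much like the ground-truth reward does, up to a linear rescaling.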
Abstract
Learning from Demonstration (LfD) seeks to democratize robotics by enabling non-roboticist end-users to teach robots to perform a task by providing a human demonstration. However, modern LfD techniques, e.g. inverse reinforcement learning (IRL), assume users provide at least stochastically optimal demonstrations. This assumption fails to hold in most real-world scenarios. Recent attempts to learn from suboptimal demonstration leverage pairwise rankings and follow the Luce-Shepard rule. However, we show these approaches make incorrect assumptions and thus suffer from brittle, degraded performance. We overcome these limitations by developing a novel approach that bootstraps off suboptimal demonstrations to synthesize optimality-parameterized data to train an idealized reward function. We empirically validate we learn an idealized reward function with ~0.95 correlation with…
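For readers unfamiliar with the Luce-Shepard rule mentioned above, here is a minimal sketch of how it turns a pairwise ranking into a likelihood: the probability that one trajectory is preferred over another is a softmax over their predicted returns. This illustrates the ranking-based formulation the abstract critiques, not the authors' proposed method. The function names and the scalar return inputs are hypothetical stand-ins for the output of a learned reward model.

```python
import math


def luce_shepard_preference(return_i, return_j):
    """Probability that trajectory i is preferred over trajectory j.

    Computes exp(R_i) / (exp(R_i) + exp(R_j)) from two predicted returns.
    """
    # Subtract the max before exponentiating for numerical stability;
    # the ratio is unchanged.
    m = max(return_i, return_j)
    e_i = math.exp(return_i - m)
    e_j = math.exp(return_j - m)
    return e_i / (e_i + e_j)


def pairwise_ranking_loss(return_i, return_j):
    """Negative log-likelihood of the observed ranking 'i preferred over j'."""
    return -math.log(luce_shepard_preference(return_i, return_j))


# Hypothetical predicted returns for two trajectories.
p = luce_shepard_preference(2.0, 1.0)
```

Under this rule only the difference between the two returns matters, so equal returns give a preference probability of 0.5, and the loss shrinks as the preferred trajectory's predicted return grows relative to the other.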
Taxonomy
Topics: Reinforcement Learning in Robotics · Machine Learning and Algorithms · Machine Learning and Data Classification
