Learning from Preferences and Mixed Demonstrations in General Settings
Jason R Brown, Carl Henrik Ek, Robert D Mullins

TL;DR
This paper introduces LEOPARD, a scalable algorithm that learns reward functions from diverse human feedback, including preferences and demonstrations, outperforming existing methods in various domains.
Contribution
The paper proposes a new flexible framework for learning from mixed human feedback and introduces LEOPARD, a practical algorithm that effectively combines preferences and demonstrations.
Findings
LEOPARD outperforms baselines when only limited feedback is available.
Combining multiple feedback types improves learning.
LEOPARD is effective across diverse domains.
Abstract
Reinforcement learning is a general method for learning in sequential settings, but it can often be difficult to specify a good reward function when the task is complex. In these cases, preference feedback or expert demonstrations can be used instead. However, existing approaches utilising both together are often ad-hoc, rely on domain-specific properties, or won't scale. We develop a new framing for learning from human data, reward-rational partial orderings over observations, designed to be flexible and scalable. Based on this we introduce a practical algorithm, LEOPARD: Learning Estimated Objectives from Preferences And Ranked Demonstrations. LEOPARD can learn from a broad range of data, including negative demonstrations, to efficiently learn reward functions across a wide range of domains. We find that when a limited amount of preference and demonstration feedback is…
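To make the idea of learning a reward function from ordered feedback concrete, here is a minimal sketch. It is not LEOPARD's objective, which the abstract above does not spell out. It assumes a linear reward over observation features and a standard Bradley-Terry pairwise likelihood, and it treats a demonstration simply as a trajectory ranked above the others. All names (`trajectory_return`, `fit_reward`, the toy trajectories) are hypothetical.

```python
import math

def trajectory_return(weights, trajectory):
    """Sum of linear rewards w . x over every observation x in the trajectory."""
    return sum(sum(w * x for w, x in zip(weights, obs)) for obs in trajectory)

def feature_sum(trajectory, dim):
    """Per-dimension sum of observation features along a trajectory."""
    return [sum(obs[i] for obs in trajectory) for i in range(dim)]

def fit_reward(pairs, dim, lr=0.1, steps=500):
    """Gradient ascent on the Bradley-Terry log-likelihood of (better, worse) pairs."""
    weights = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for better, worse in pairs:
            diff = trajectory_return(weights, better) - trajectory_return(weights, worse)
            p_better = 1.0 / (1.0 + math.exp(-diff))  # model's P(better preferred)
            fb, fw = feature_sum(better, dim), feature_sum(worse, dim)
            for i in range(dim):
                grad[i] += (1.0 - p_better) * (fb[i] - fw[i])
        weights = [w + lr * g / len(pairs) for w, g in zip(weights, grad)]
    return weights

# Toy data (assumed): each observation is a 2-d feature vector.
demo = [[1.0, 0.0], [1.0, 0.0]]    # demonstration, ranked highest
traj_a = [[1.0, 0.0], [0.0, 1.0]]  # preferred over traj_b
traj_b = [[0.0, 1.0], [0.0, 1.0]]

# The ordering demo > traj_a > traj_b, flattened into pairwise comparisons.
pairs = [(demo, traj_a), (traj_a, traj_b), (demo, traj_b)]
weights = fit_reward(pairs, dim=2)
returns = [trajectory_return(weights, t) for t in (demo, traj_a, traj_b)]
```

With this toy data the learned weights favour the first feature, so the fitted returns reproduce the given ordering. Flattening an ordering into pairs is only one way to consume ranked data; the paper's framing over partial orderings may handle it differently.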
