Learning the Preferences of a Learning Agent
Karim Abdel Sadek, Mark Bedaywi, Rhys Gould, Stuart Russell

TL;DR
This paper studies how to infer the preferences of an agent that is itself still learning, by observing its online behavior. It extends inverse reinforcement learning (IRL) from approximately optimal agents to initially suboptimal, learning agents, with theoretical guarantees.
Contribution
The paper formalizes the problem of learning the preferences of a learning agent. The learner is modeled either as no-regret or as converging to an optimal Boltzmann policy over time, and the paper gives a theoretical analysis of preference learning algorithms in these settings.
Findings
Established theoretical guarantees for preference learning algorithms in the no-regret setting.
Demonstrated that certain guarantees are impossible in some learning scenarios.
Abstract
For AI systems to be useful to humans, they must understand and act in accordance with our values and preferences. Since specifying preferences is a hard task, inverse reinforcement learning (IRL) aims to develop methods that allow for inferring preferences from observed behavior. However, IRL assumes the human to be approximately optimal. This is a big limitation in cases where the human themselves may be learning to act optimally in an environment. In this paper, we formalize the problem of learning the preferences of a learning agent: a predictor observes a learner acting online and tries to infer the underlying reward function being (initially suboptimally) optimized by the learner. We model the learner as either being no-regret, or as converging to an optimal Boltzmann policy over time. In each of these settings, we establish theoretical guarantees for various preference learning…
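To make the learner model named in the abstract concrete, here is a minimal sketch of a Boltzmann (softmax) policy whose rationality grows over time. It is an illustration under stated assumptions, not the paper's formalization: the single-state three-action setup, the reward vector `true_rewards`, the inverse-temperature schedule `betas`, and both function names are hypothetical.

```python
import math

def boltzmann_policy(rewards, beta):
    """Softmax action distribution: P(a) is proportional to exp(beta * rewards[a])."""
    m = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp(beta * (r - m)) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]

def expected_regret(rewards, policy):
    """Gap between the best achievable reward and the policy's expected reward."""
    return max(rewards) - sum(p * r for p, r in zip(policy, rewards))

# Hypothetical reward vector over three actions (unknown to the predictor).
true_rewards = [1.0, 0.5, 0.0]

# Hypothetical learning schedule: a larger beta means a more rational learner.
betas = [0.0, 1.0, 4.0, 16.0]
policies = [boltzmann_policy(true_rewards, b) for b in betas]
regrets = [expected_regret(true_rewards, p) for p in policies]

for b, p, r in zip(betas, policies, regrets):
    print(f"beta={b:5.1f}  policy={[round(x, 3) for x in p]}  regret={r:.4f}")
```

At `beta = 0` the learner acts uniformly at random, so its early behavior carries no information about the reward. As `beta` grows, the policy concentrates on the best action and the expected regret shrinks, which is the sense in which a predictor that assumes optimality from the start would misread the early observations.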
