Inverse Reward Design
Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, Anca Dragan

TL;DR
This paper introduces inverse reward design (IRD), a method to infer true objectives from designed rewards, helping autonomous agents avoid undesired behaviors caused by reward misspecification.
Contribution
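It formulates IRD as the problem of inferring the true objective from a designed reward and the training MDP it was designed for, and proposes approximate methods for solving it whose solutions are used for risk-averse planning in unseen scenarios.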
Findings
IRD helps reduce negative side effects of reward misspecification
The approach mitigates reward hacking in autonomous agents
Empirical results demonstrate improved robustness in test environments
Abstract
Autonomous agents optimize the reward function we give them. What they don't know is how hard it is for us to design a reward function that actually captures what we want. When designing the reward, we might think of some specific training scenarios, and make sure that the reward will lead to the right behavior in those scenarios. Inevitably, agents encounter new scenarios (e.g., new types of terrain) where optimizing that same reward may lead to undesired behavior. Our insight is that reward functions are merely observations about what the designer actually wants, and that they should be interpreted in the context in which they were designed. We introduce inverse reward design (IRD) as the problem of inferring the true objective based on the designed reward and the training MDP. We introduce approximate methods for solving IRD problems, and use their solution to plan risk-averse…
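The idea in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration rather than something stated above: the three features (grass, dirt, lava), the trajectory feature counts, the finite weight grid, the Boltzmann-style designer model with parameter `beta`, and the `keep_ratio` cutoff used to define the set of plausible rewards. The sketch treats the designed (proxy) reward as an observation, computes a posterior over candidate true rewards given the training trajectories, and then plans for the worst case over plausible rewards.

```python
import itertools
import math

# Hypothetical feature counts per trajectory: (grass, dirt, lava) steps.
TRAIN_TRAJS = [(4, 0, 0), (2, 2, 0), (0, 4, 0)]  # training MDP contains no lava
TEST_TRAJS = [(4, 1, 0), (5, 0, 2), (0, 5, 0)]   # test MDP adds a lava shortcut

# Finite grid of weight vectors, used both as candidate proxy rewards
# and as candidate true rewards (an illustrative simplification).
WEIGHTS = list(itertools.product([-1.0, 0.0, 1.0], repeat=3))


def dot(w, phi):
    """Linear reward: weights times feature counts."""
    return sum(wi * pi for wi, pi in zip(w, phi))


def best_traj(w, trajs):
    """Trajectory an agent picks when it optimizes weights w literally."""
    return max(trajs, key=lambda phi: dot(w, phi))


def posterior(w_proxy, beta=1.0):
    """Posterior over true weights given the proxy, with a uniform prior.

    Assumed designer model: a proxy is more likely to be chosen when the
    behavior it induces in training scores well under the true reward.
    """
    phi_obs = best_traj(w_proxy, TRAIN_TRAJS)
    unnorm = []
    for w_true in WEIGHTS:
        # Normalizer: how well every alternative proxy would have done.
        z = sum(
            math.exp(beta * dot(w_true, best_traj(w, TRAIN_TRAJS)))
            for w in WEIGHTS
        )
        unnorm.append(math.exp(beta * dot(w_true, phi_obs)) / z)
    total = sum(unnorm)
    return [u / total for u in unnorm]


def risk_averse_plan(post, trajs, keep_ratio=0.9):
    """Pick the trajectory with the best worst-case reward over the
    weights whose posterior is within keep_ratio of the maximum."""
    cutoff = keep_ratio * max(post)
    plausible = [w for w, p in zip(WEIGHTS, post) if p >= cutoff]
    return max(trajs, key=lambda phi: min(dot(w, phi) for w in plausible))


# The designer rewards grass, penalizes dirt, and never considered lava.
W_PROXY = (1.0, -1.0, 0.0)
POST = posterior(W_PROXY)
LITERAL_CHOICE = best_traj(W_PROXY, TEST_TRAJS)
CAUTIOUS_CHOICE = risk_averse_plan(POST, TEST_TRAJS)
```

In this toy setup the training trajectories contain no lava, so the posterior is identical for every candidate lava weight: the proxy carries no information about it. Optimizing the proxy literally at test time takes the lava shortcut `(5, 0, 2)`, while the worst-case planner chooses `(4, 1, 0)` and stays off the lava.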
Taxonomy
Topics: Advanced Multi-Objective Optimization Algorithms · AI-based Problem Solving and Planning · Manufacturing Process and Optimization
