Goal Misgeneralization in Deep Reinforcement Learning
Lauro Langosco, Jack Koch, Lee Sharkey, Jacob Pfau, Laurent Orseau, David Krueger

TL;DR
This paper investigates goal misgeneralization in deep reinforcement learning: an agent can retain its capabilities out-of-distribution yet pursue the wrong goal. It provides empirical evidence of the failure and an analysis of its causes.
Contribution
It formally distinguishes goal from capability generalization failures, provides the first empirical demonstrations of goal misgeneralization, and analyzes its underlying causes.
Findings
Empirical demonstration of goal misgeneralization in RL agents.
Formal distinction between capability and goal generalization failures.
Partial characterization of causes of goal misgeneralization.
Abstract
We study goal misgeneralization, a type of out-of-distribution generalization failure in reinforcement learning (RL). Goal misgeneralization failures occur when an RL agent retains its capabilities out-of-distribution yet pursues the wrong goal. For instance, an agent might continue to competently avoid obstacles, but navigate to the wrong place. In contrast, previous works have typically focused on capability generalization failures, where an agent fails to do anything sensible at test time. We formalize this distinction between capability and goal generalization, provide the first empirical demonstrations of goal misgeneralization, and present a partial characterization of its causes.
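The distinction drawn in the abstract can be illustrated with a toy example. The sketch below is not the paper's experimental setup; the grid, the `proxy_policy` function, and the `diagnose` helper are hypothetical names chosen for illustration. It assumes an agent whose training episodes always placed the reward (a coin) in the top-right corner, so that "go to the corner" and "get the coin" were indistinguishable during training.

```python
SIZE = 5
CORNER = (SIZE - 1, SIZE - 1)  # where the coin always sat during training (assumption)

def proxy_policy(pos):
    """Hypothetical learned behavior: head for the top-right corner, ignoring the coin."""
    x, y = pos
    if x < SIZE - 1:
        return (1, 0)   # move right along the bottom row
    if y < SIZE - 1:
        return (0, 1)   # then move up the right column
    return (0, 0)       # stay put once in the corner

def rollout(policy, coin, max_steps=4 * SIZE):
    """Run one episode from (0, 0); return (final position, whether the coin was reached)."""
    pos = (0, 0)
    for _ in range(max_steps):
        if pos == coin:
            return pos, True
        dx, dy = policy(pos)
        pos = (pos[0] + dx, pos[1] + dy)
    return pos, pos == coin

def diagnose(reached_corner, got_coin):
    """Illustrative labels only, not the paper's formal definitions."""
    if got_coin:
        return "generalizes"
    if reached_corner:
        return "goal misgeneralization"      # competent navigation, wrong target
    return "capability generalization failure"  # no sensible behavior at all

# Training-like episode: coin in the corner, so the proxy goal earns the reward.
train_pos, train_got_coin = rollout(proxy_policy, coin=CORNER)

# Test episode: coin moved to the top-left, off the agent's path.
test_pos, test_got_coin = rollout(proxy_policy, coin=(0, SIZE - 1))

train_label = diagnose(train_pos == CORNER, train_got_coin)
test_label = diagnose(test_pos == CORNER, test_got_coin)
print(train_label, "|", test_label)
```

In the test episode the agent still navigates to the corner without error, which is what separates this failure from one where the agent does nothing sensible: the capability transfers, but the objective it serves is not the intended one.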
Taxonomy
Topics: Reinforcement Learning in Robotics · Experimental Behavioral Economics Studies
