Getting By Goal Misgeneralization With a Little Help From a Mentor
Tu Trinh, Mohamad H. Danesh, Nguyen X. Khanh, Benjamin Plaut

TL;DR
This paper investigates whether allowing reinforcement learning agents to ask a supervisor for help can reduce goal misgeneralization caused by distribution shift. It shows that help requests improve performance, but that methods relying on the agent's internal state are limited by how poorly that state represents the environment.
Contribution
It introduces methods for help-requesting in RL agents and analyzes their effectiveness and limitations in mitigating goal misgeneralization during deployment.
Findings
Help requests improve agent performance under distribution shift.
Internal state-based help requests often occur only after mistakes.
The agent's internal state represents the environment poorly, which limits help-request strategies that rely on it.
Abstract
While reinforcement learning (RL) agents often perform well during training, they can struggle with distribution shift in real-world deployments. One particularly severe risk of distribution shift is goal misgeneralization, where the agent learns a proxy goal that coincides with the true goal during training but not during deployment. In this paper, we explore whether allowing an agent to ask for help from a supervisor in unfamiliar situations can mitigate this issue. We focus on agents trained with PPO in the CoinRun environment, a setting known to exhibit goal misgeneralization. We evaluate multiple methods for determining when the agent should request help and find that asking for help consistently improves performance. However, we also find that methods based on the agent's internal state fail to proactively request help, instead waiting until mistakes have already occurred. Further…
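To make the idea of a help-request rule concrete, here is a minimal sketch of one possible rule: the agent defers to the supervisor whenever the entropy of its action distribution exceeds a threshold. This is an illustrative assumption, not necessarily one of the methods the paper evaluates. The names `action_entropy`, `should_ask_for_help`, `act`, the `mentor_action` callable, and the threshold value are all hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple


def action_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_ask_for_help(probs: Sequence[float], threshold: float) -> bool:
    """Hypothetical rule: ask for help when the policy is uncertain."""
    return action_entropy(probs) > threshold


def act(
    probs: Sequence[float],
    mentor_action: Callable[[], int],
    threshold: float,
) -> Tuple[int, bool]:
    """Return (action, asked_for_help).

    If the rule fires, the supervisor's action is used; otherwise the
    agent takes its own most probable action.
    """
    if should_ask_for_help(probs, threshold):
        return mentor_action(), True
    greedy = max(range(len(probs)), key=lambda a: probs[a])
    return greedy, False


# Usage: a confident policy acts alone, an uncertain one defers.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
print(act(confident, mentor_action=lambda: 3, threshold=1.0))
print(act(uncertain, mentor_action=lambda: 3, threshold=1.0))
```

A rule of this kind only fires once the policy itself becomes uncertain. That is consistent with the limitation noted above: signals derived from the agent's own outputs or internal state may arrive only after a mistake has already been made.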
Taxonomy
Topics: Coaching Methods and Impact
Methods: Entropy Regularization · Focus · Proximal Policy Optimization
