Getting By Goal Misgeneralization With a Little Help From a Mentor

Tu Trinh; Mohamad H. Danesh; Nguyen X. Khanh; Benjamin Plaut

arXiv:2410.21052 · cs.LG · November 12, 2024

TL;DR

This paper investigates whether letting a reinforcement learning agent ask a supervisor for help can mitigate goal misgeneralization under distribution shift. Asking for help consistently improves performance, but methods based on the agent's internal state tend to request help only after a mistake has already occurred.

Contribution

The paper evaluates multiple methods for deciding when an RL agent should request help, and analyzes their effectiveness and limitations in mitigating goal misgeneralization during deployment.

Findings

01. Help requests improve agent performance under distribution shift.

02. Help requests triggered by the agent's internal state often come only after a mistake has occurred.

03. The agent's internal state represents the environment poorly, which limits help-request strategies that rely on it.

Abstract

While reinforcement learning (RL) agents often perform well during training, they can struggle with distribution shift in real-world deployments. One particularly severe risk of distribution shift is goal misgeneralization, where the agent learns a proxy goal that coincides with the true goal during training but not during deployment. In this paper, we explore whether allowing an agent to ask for help from a supervisor in unfamiliar situations can mitigate this issue. We focus on agents trained with PPO in the CoinRun environment, a setting known to exhibit goal misgeneralization. We evaluate multiple methods for determining when the agent should request help and find that asking for help consistently improves performance. However, we also find that methods based on the agent's internal state fail to proactively request help, instead waiting until mistakes have already occurred. Further…
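To make the idea of "deciding when to request help" concrete, here is a minimal sketch of one possible rule: ask the mentor whenever the entropy of the policy's action distribution exceeds a threshold. This is an illustration under stated assumptions, not necessarily one of the methods the paper evaluates. The names `policy_entropy`, `should_ask_for_help`, `act`, the `mentor_action` callback, and the threshold values are all hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple


def policy_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete action distribution."""
    # Zero-probability actions contribute nothing, so skip them to avoid log(0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_ask_for_help(probs: Sequence[float], threshold: float) -> bool:
    """Hypothetical rule: ask when the policy is uncertain (high entropy)."""
    return policy_entropy(probs) > threshold


def act(
    probs: Sequence[float],
    mentor_action: Callable[[], int],
    threshold: float,
) -> Tuple[int, bool]:
    """Return (action, asked_for_help).

    If the rule fires, defer to the mentor (assumed here to be a callback
    that returns an action index). Otherwise take the agent's most
    probable action.
    """
    if should_ask_for_help(probs, threshold):
        return mentor_action(), True
    greedy = max(range(len(probs)), key=lambda a: probs[a])
    return greedy, False


if __name__ == "__main__":
    confident = [0.9, 0.05, 0.05]
    unsure = [0.34, 0.33, 0.33]
    print(act(confident, lambda: 2, threshold=0.5))
    print(act(unsure, lambda: 2, threshold=0.5))
```

A rule like this depends entirely on signals computed from the agent's own outputs. That is the kind of dependence the abstract flags as a limitation: if the agent is confidently pursuing a proxy goal, its action distribution can stay low-entropy, and the rule will not fire until the situation has already gone wrong.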

Taxonomy

Topics: Coaching Methods and Impact

Methods: Entropy Regularization · Focus · Proximal Policy Optimization