Mitigating Goal Misgeneralization via Minimax Regret

Karim Abdel Sadek; Matthew Farrugia-Roberts; Usman Anwar; Hannah Erlebach; Christian Schroeder de Witt; David Krueger; Michael Dennis

arXiv:2507.03068·cs.LG·July 21, 2025

Open Access 3 Reviews

TL;DR

This paper investigates goal misgeneralization in reinforcement learning, demonstrating that minimax expected regret training can mitigate this issue better than traditional maximum expected value methods.

Contribution

The paper formalizes goal misgeneralization, analyzes its occurrence under different training objectives, and empirically shows minimax regret reduces misgeneralization compared to standard methods.

Findings

1. Goal misgeneralization occurs under MEV training.
2. MMER training is more robust to goal misgeneralization.
3. Standard domain randomization often leads to goal misgeneralization.

Abstract

Safe generalization in reinforcement learning requires not only that a learned policy acts capably in new situations, but also that it uses its capabilities towards the pursuit of the designer's intended goal. The latter requirement may fail when a proxy goal incentivizes similar behavior to the intended goal within the training environment, but not in novel deployment environments. This creates the risk that policies will behave as if in pursuit of the proxy goal, rather than the intended goal, in deployment -- a phenomenon known as goal misgeneralization. In this paper, we formalize this problem setting in order to theoretically study the possibility of goal misgeneralization under different training objectives. We show that goal misgeneralization is possible under approximate optimization of the maximum expected value (MEV) objective, but not the minimax expected regret (MMER)…
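The contrast between the two objectives can be illustrated with a toy calculation. The sketch below is not the paper's formal setting or experiments: the two policies, the two levels, the return table, the level distribution, and the tolerance are all hypothetical values chosen for illustration, and the per-level optimum is approximated by the best policy in the table. It assumes an "ambiguous" level where the proxy and intended goals earn the same return and a rare "disambiguating" level where only the intended goal pays off.

```python
import numpy as np

# Hypothetical return table: rows are policies, columns are levels.
# Column 0: ambiguous level (proxy and intended goal coincide).
# Column 1: disambiguating level (only the intended goal is rewarded).
policies = ["intended", "proxy"]
returns = np.array([
    [1.0, 1.0],  # policy pursuing the intended goal
    [1.0, 0.0],  # policy pursuing the proxy goal
])

# Assumed training distribution: disambiguating levels are rare.
level_probs = np.array([0.99, 0.01])

# Assumed slack allowed by approximate optimization.
tolerance = 0.05

# MEV: a policy is acceptable if its expected return over the level
# distribution is within `tolerance` of the best expected return.
expected_value = returns @ level_probs
mev_ok = expected_value >= expected_value.max() - tolerance

# MMER: regret on a level is the best achievable return on that level
# minus the policy's return; a policy is acceptable if its worst-case
# regret is within `tolerance` of the smallest worst-case regret.
best_per_level = returns.max(axis=0)
regret = best_per_level - returns
max_regret = regret.max(axis=1)
mmer_ok = max_regret <= max_regret.min() + tolerance

mev_acceptable = [p for p, ok in zip(policies, mev_ok) if ok]
mmer_acceptable = [p for p, ok in zip(policies, mmer_ok) if ok]

print("approximately MEV-optimal: ", mev_acceptable)
print("approximately MMER-optimal:", mmer_acceptable)
```

In this toy table the proxy policy loses only a small amount of expected value, so it passes the approximate MEV check, while its regret on the disambiguating level is large, so it fails the approximate MMER check regardless of how rare that level is.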

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 2

Strengths

* The idea of focusing on regret-based prioritization to mitigate goal misgeneralization is novel and promising.
* A clear and well-structured theoretical analysis supports the proposed method.
* The empirical evaluation demonstrates the effectiveness of the proposed method.

Weaknesses

* How does the proposed method handle scenarios where the behavior policy's state distribution is highly sub-optimal or incomplete, potentially leading to poor coverage of desirable states?
* The structure of the paper should be clearer.
* While the authors' experiments give algorithmic insights, experiments in more complex scenarios would be more persuasive.

Reviewer 02 · Rating 3 · Confidence 3

Strengths

1. The paper considers both theory and experiments.
2. The experiments consider several variants of the baselines.

Weaknesses

1. I am quite confused about the setting that the paper is actually considering: in the preliminaries, the paper introduces a reward-free UMDP, in which the reward function is unknown to the learner or the learner does not observe a reward signal at all. Then, in the introduction of UED and domain randomization, there is a ground-truth reward R that is optimized by the learner, so it seems we are considering reward-based MDPs again. But then in Sections 3 & 4, for example Definition 3.1, it says rewar

Reviewer 03 · Rating 5 · Confidence 5

Strengths

1. The paper tackles the important problem of goal misgeneralization, which is highly relevant in reinforcement learning, particularly in safety-critical applications. The application of minimax regret to mitigate goal misgeneralization is a novel approach.
2. The paper formalizes the problem of goal misgeneralization using level ambiguity and introduces a framework based on minimax regret. Based on my knowledge of previous work on goal misgeneralization in deep RL, the distinction betw

Weaknesses

**1. There is a lack of experiments** Gridworld environments are fine, since previous works in deep RL studying "goal misgeneralization" have mainly experimented with those [1][2]. But I am wondering why there aren't more experiments in different domains, such as "Keys and Chest" from [1], and especially in domains with denser rewards, such as "Tree Gridworld" from [2]. Testing in only one domain makes it difficult to assess whether the observed improvements are due to the MMER strategy itself or ar

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Domain Adaptation and Few-Shot Learning