MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking

Sebastian Farquhar; Vikrant Varma; David Lindner; David Elson; Caleb Biddulph; Ian Goodfellow; Rohin Shah

arXiv:2501.13011 · cs.LG · April 11, 2025

TL;DR

MONA is a training method that combines short-sighted (myopic) optimization with far-sighted (non-myopic) approval to mitigate multi-step reward hacking in AI systems, even when humans cannot detect the undesired behavior.

Contribution

The paper introduces MONA, an approach that mitigates multi-step reward hacking by pairing myopic optimization with non-myopic approval. It needs no information beyond what ordinary RL receives and does not rely on detecting the reward hacking.
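The contrast between ordinary RL and the myopic objective described above can be illustrated with a toy calculation. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names `ordinary_rl_targets` and `mona_targets`, the per-step `approvals` list, and the two-step reward numbers are all hypothetical. It assumes that ordinary RL credits each step with the discounted sum of later rewards, while a MONA-style target credits each step only with its own reward plus an overseer's approval of that step.

```python
from typing import List


def ordinary_rl_targets(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return-to-go targets: each step is credited with all later rewards."""
    targets: List[float] = []
    running = 0.0
    # Walk the episode backwards so each step accumulates future reward.
    for r in reversed(rewards):
        running = r + gamma * running
        targets.append(running)
    return targets[::-1]


def mona_targets(rewards: List[float], approvals: List[float]) -> List[float]:
    """Myopic targets (illustrative): a step's own reward plus the overseer's
    approval of that step. No reward from later steps flows backwards."""
    if len(rewards) != len(approvals):
        raise ValueError("rewards and approvals must have the same length")
    return [r + a for r, a in zip(rewards, approvals)]


# Hypothetical 2-step episode: step 1 sets up a reward hack (no reward yet),
# step 2 exploits it and collects reward 1.0.
rewards = [0.0, 1.0]
# Hypothetical approvals: the overseer does not see why step 1 would be
# useful, so it gives no approval. It does not need to recognize the hack.
approvals = [0.0, 0.0]

rl = ordinary_rl_targets(rewards)
mona = mona_targets(rewards, approvals)
print("ordinary RL targets:", rl)
print("MONA-style targets: ", mona)
```

In this toy episode, ordinary RL assigns the setup step the full downstream reward, so the setup is reinforced. Under the myopic target the setup step earns nothing unless the overseer approves of it, which removes the incentive to prepare a multi-step hack.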

Findings

01. MONA effectively prevents reward hacking in various simulated environments.

02. It works without human detection of undesired behaviors.

03. The method generalizes across different misalignment failure modes.

Abstract

Future advanced AI systems may learn sophisticated strategies through reinforcement learning (RL) that humans cannot understand well enough to safely evaluate. We propose a training method which avoids agents learning undesired multi-step plans that receive high reward (multi-step "reward hacks") even if humans are not able to detect that the behaviour is undesired. The method, Myopic Optimization with Non-myopic Approval (MONA), works by combining short-sighted optimization with far-sighted reward. We demonstrate that MONA can prevent multi-step reward hacking that ordinary RL causes, even without being able to detect the reward hacking and without any extra information that ordinary RL does not get access to. We study MONA empirically in three settings which model different misalignment failure modes including 2-step environments with LLMs representing delegated oversight and encoded…
