Imperfect World Models are Exploitable

Logan Mondal Bhamidipaty; Esmeralda S. Whitammer; David Abel; Mykel J. Kochenderfer; Subramanian Ramamoorthy

arXiv:2605.15960·cs.AI·May 19, 2026

TL;DR

This paper introduces a formal definition of model exploitation in reinforcement learning, proves that exploitation is essentially unavoidable on large policy sets, and derives a safe planning horizon under a relaxed notion of exploitation.

Contribution

The paper develops a general theory of reward hacking and model exploitation. The theory shows that exploitation is essentially unavoidable on large policy sets, yields the corresponding claim for hacking as a special case, and motivates a notion of safe planning horizon.

Findings

01 Exploitation is essentially unavoidable on large policy sets.

02 The conditions that guarantee unhackability in finite policy sets have no counterpart that precludes exploitation.

03 Under a relaxed notion of exploitation, a safe planning horizon can be derived.

Abstract

We propose a novel definition of model exploitation in reinforcement learning. Informally, a world model is exploitable if it implies that one policy should be strictly preferred over another while the environment's true transition model implies the reverse. We analogize our definition with a prior characterization of reward hacking but show that the associated proof of inevitability does not transfer to exploitation. To overcome this obstruction, we develop a general theory of reward hacking and model exploitation that proves that exploitation is essentially unavoidable on large policy sets and yields the corresponding claim for hacking as a special case. Unfortunately, we also find that the conditions that guarantee unhackability in finite policy sets have no counterpart that precludes exploitation. Consequently, we introduce a relaxed notion of exploitation and derive a safe horizon…
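
The informal definition in the abstract can be made concrete on a toy problem. The following is a minimal sketch, not the paper's formal construction: it assumes a tabular MDP, deterministic stationary policies, and discounted return from a fixed start state as the preference criterion. The function names (`policy_value`, `exploitable_pairs`) and the two-state example are hypothetical and chosen only for illustration.

```python
import itertools
import numpy as np


def policy_value(P, R, policy, gamma, start):
    """Discounted return of a deterministic policy from state `start`.

    P: array (S, A, S) of transition probabilities.
    R: array (S, A) of rewards.
    policy: tuple of length S giving the action taken in each state.
    """
    n_states = P.shape[0]
    idx = np.arange(n_states)
    actions = np.asarray(policy)
    P_pi = P[idx, actions]  # (S, S) transitions induced by the policy
    r_pi = R[idx, actions]  # (S,) rewards induced by the policy
    # Solve the Bellman equation v = r_pi + gamma * P_pi @ v exactly.
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return v[start]


def exploitable_pairs(P_true, P_model, R, gamma=0.9, start=0, tol=1e-9):
    """Ordered pairs (a, b) where the model strictly prefers a over b
    while the true dynamics strictly prefer b over a."""
    n_states, n_actions = R.shape
    policies = list(itertools.product(range(n_actions), repeat=n_states))
    pairs = []
    for a, b in itertools.permutations(policies, 2):
        model_gap = (policy_value(P_model, R, a, gamma, start)
                     - policy_value(P_model, R, b, gamma, start))
        true_gap = (policy_value(P_true, R, a, gamma, start)
                    - policy_value(P_true, R, b, gamma, start))
        if model_gap > tol and true_gap < -tol:
            pairs.append((a, b))
    return pairs


# Hypothetical example: state 0 is the start, state 1 is absorbing.
# Action 0 in state 0 is "safe" (stay, reward 1); action 1 is "risky"
# (reward 0, chance of reaching state 1, which pays 2 per step).
R = np.array([[1.0, 0.0],
              [2.0, 2.0]])

P_true = np.zeros((2, 2, 2))
P_true[0, 0] = [1.0, 0.0]
P_true[0, 1] = [0.9, 0.1]  # the risky action rarely succeeds in reality
P_true[1, :] = [0.0, 1.0]

P_model = P_true.copy()
P_model[0, 1] = [0.1, 0.9]  # the learned model is overoptimistic

pairs = exploitable_pairs(P_true, P_model, R)
```

In this example the model values the risky policy at about 17.8 against 10 for the safe one, while the true dynamics give about 9.5 against 10, so every pair with the risky action in state 0 on one side and the safe action on the other is flagged. Comparing a model against itself returns no pairs, which is a quick sanity check on the strict-reversal condition.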
