Imperfect World Models are Exploitable
Logan Mondal Bhamidipaty, Esmeralda S. Whitammer, David Abel, Mykel J. Kochenderfer, Subramanian Ramamoorthy

TL;DR
This paper introduces a formal framework for analyzing model exploitation in reinforcement learning. It shows that exploitation is essentially unavoidable on large policy sets and derives a safe horizon under a relaxed notion of exploitation.
Contribution
The paper develops a general theory that covers both reward hacking and model exploitation, proves that exploitation is essentially unavoidable on large policy sets (with the corresponding claim for hacking as a special case), and proposes a safe planning horizon.
Findings
Exploitation is essentially unavoidable on large policy sets.
The conditions that guarantee unhackability on finite policy sets have no counterpart that precludes exploitation.
For a relaxed notion of exploitation, a safe horizon can be derived.
Abstract
We propose a novel definition of model exploitation in reinforcement learning. Informally, a world model is exploitable if it implies that one policy should be strictly preferred over another while the environment's true transition model implies the reverse. We analogize our definition with a prior characterization of reward hacking but show that the associated proof of inevitability does not transfer to exploitation. To overcome this obstruction, we develop a general theory of reward hacking and model exploitation that proves that exploitation is essentially unavoidable on large policy sets and yields the corresponding claim for hacking as a special case. Unfortunately, we also find that the conditions that guarantee unhackability in finite policy sets have no counterpart that precludes exploitation. Consequently, we introduce a relaxed notion of exploitation and derive a safe horizon…
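The informal definition in the abstract can be illustrated on a toy problem. The following is a minimal sketch, not the paper's formalism: it assumes a tabular MDP, deterministic policies, discounted return from a fixed start state, and a known reward function shared by the model and the environment. The names `policy_value` and `is_exploited`, the two-state example, and all numeric values are hypothetical choices made for illustration.

```python
from typing import List

# T[s][a][s2] is the probability of moving from state s to s2 under action a.
Transitions = List[List[List[float]]]


def policy_value(T: Transitions, R: List[List[float]], policy: List[int],
                 start: int, gamma: float = 0.9, iters: int = 500) -> float:
    """Discounted value of a deterministic policy at `start`,
    computed by iterative policy evaluation."""
    n = len(T)
    v = [0.0] * n
    for _ in range(iters):
        v = [
            R[s][policy[s]]
            + gamma * sum(T[s][policy[s]][s2] * v[s2] for s2 in range(n))
            for s in range(n)
        ]
    return v[start]


def is_exploited(T_true: Transitions, T_model: Transitions,
                 R: List[List[float]], pi_a: List[int], pi_b: List[int],
                 start: int, gamma: float = 0.9, eps: float = 1e-9) -> bool:
    """True if the model strictly prefers one policy while the true
    dynamics strictly prefer the other (assumed reading of the abstract)."""
    d_model = (policy_value(T_model, R, pi_a, start, gamma)
               - policy_value(T_model, R, pi_b, start, gamma))
    d_true = (policy_value(T_true, R, pi_a, start, gamma)
              - policy_value(T_true, R, pi_b, start, gamma))
    return (d_model > eps and d_true < -eps) or (d_model < -eps and d_true > eps)


# Hypothetical example: state 0 is the start, state 1 is an absorbing goal.
# Action 0 ("safe") stays in state 0 for reward 1 per step.
# Action 1 ("risky") earns 0 but may jump to the goal, which pays 2 per step.
R = [[1.0, 0.0], [2.0, 2.0]]
T_true = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.0, 1.0]]]
T_model = [[[1.0, 0.0], [0.1, 0.9]], [[0.0, 1.0], [0.0, 1.0]]]  # overestimates the jump

safe = [0, 0]
risky = [1, 0]

print(is_exploited(T_true, T_model, R, risky, safe, start=0))
```

In this example the world model overestimates the chance that the risky action reaches the goal, so it values the risky policy at roughly 17.8 against 10 for the safe policy, while under the true transitions the risky policy is worth roughly 9.5. The two orderings are reversed, which is the situation the abstract describes.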
