Goodhart's Law in Reinforcement Learning
Jacek Karwowski, Oliver Hayman, Xingjian Bai, Klaus Kiendlhofer, Charlie Griffin, Joar Skalse

TL;DR
This paper investigates how optimizing imperfect reward proxies in reinforcement learning can lead to decreased performance on true objectives, proposing methods to mitigate this through early stopping and worst-case reward maximization.
Contribution
It introduces a quantification of Goodhart's law in RL, provides a geometric explanation, and develops theoretically grounded methods for early stopping and robust training under reward misspecification.
Findings
Empirical validation of Goodhart's law effects in diverse environments.
Proposed early stopping method effectively avoids performance decline.
Theoretical regret bounds support the proposed methods.
Abstract
Implementing a reward function that perfectly captures a complex task in the real world is impractical. As a result, it is often appropriate to think of the reward function as a proxy for the true objective rather than as its definition. We study this phenomenon through the lens of Goodhart's law, which predicts that increasing optimisation of an imperfect proxy beyond some critical point decreases performance on the true objective. First, we propose a way to quantify the magnitude of this effect and show empirically that optimising an imperfect proxy reward often leads to the behaviour predicted by Goodhart's law for a wide range of environments and reward functions. We then provide a geometric explanation for why Goodhart's law occurs in Markov decision processes. We use these theoretical insights to propose an optimal early stopping method that provably avoids the aforementioned…
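The rise-then-fall pattern described in the abstract can be illustrated with a minimal sketch. It assumes the true-objective return has been recorded at increasing levels of proxy optimisation; the function name `goodhart_drop` and the sample numbers are hypothetical, and this is not the paper's own metric, which is not given in this excerpt.

```python
def goodhart_drop(true_returns):
    """Locate the peak of the true return and measure the later decline.

    true_returns: true-objective returns, ordered by increasing
    optimisation of the proxy reward (an assumed input format).
    Returns (peak_index, drop), where drop = peak value - final value.
    """
    if not true_returns:
        raise ValueError("need at least one measurement")
    # Index of the first maximum: the point an ideal early-stopping rule would pick.
    peak_index = max(range(len(true_returns)), key=lambda i: true_returns[i])
    # How much true return is lost by optimising the proxy all the way.
    drop = true_returns[peak_index] - true_returns[-1]
    return peak_index, drop


# Hypothetical measurements: true return rises, then falls.
curve = [0.10, 0.45, 0.70, 0.62, 0.30]
peak, drop = goodhart_drop(curve)
```

A drop of zero means the true return never declined along the recorded path; a positive drop is the behaviour Goodhart's law predicts.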
Peer Reviews
Decision: ICLR 2024 poster
The problem is clearly very important and a better understanding of proxy rewards, overoptimization, and Goodhart’s law is definitely needed in the community. The paper is presented fairly clearly, except in some areas which I point out later. The paper provides insights from multiple frontiers to help shape this understanding (empirical, theoretical, and conceptual). The theoretical findings are useful, but not entirely surprising given what is known already in the literature (see below). H
My primary complaint is that, although this is a solid analysis, I do not believe it strikes the heart of the Goodhart problem. The position of the paper is that misalignment can be characterized by the worst-case angle between reward functions. This is a fairly well-understood setting (e.g. see ‘simulation lemma’ by Kearns & Singh or any number of classical RL papers). However, it’s unclear how this maps into problems that (1) are beyond the finite case, or (2) are classical examples of Goodhar
The paper investigates an interesting, albeit not entirely surprising phenomenon, and investigates it thoroughly and carefully. The problem of reward misspecification is quite relevant for practical considerations of RL, so gaining some understanding of this problem is appreciated. The paper is well-written and the messages are conveyed clearly. The theoretical contributions, while not exactly practical, are a nice step towards preventing this problem from affecting performance.
While I am overall positive about the paper, I have a few comments and suggestions for possible improvement. - The definition of optimization pressure is a bit strange. Why should we not define it as simply the distance from the optimal policy? For instance, we can say that the optimization pressure is epsilon if we obtain a policy $\hat{\pi}$ such that $J_R(\pi^\star) - J_R(\hat{\pi}) \leq \varepsilon$. I feel that tying the optimization pressure to a certain regularization scheme detracts fr
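The reviewer's alternative definition, $J_R(\pi^\star) - J_R(\hat{\pi}) \leq \varepsilon$, can be checked numerically in a small tabular MDP. The sketch below is only an illustration of that inequality, not the paper's definition of optimisation pressure; the helper names, the discounted-return setting, and the one-state example are assumptions.

```python
def policy_return(P, R, policy, gamma, mu0, sweeps=1000):
    """Discounted return J_R(policy) by iterative policy evaluation.

    P[s][a][t]: transition probability, R[s][a]: reward,
    policy[s][a]: action probability, mu0[s]: initial-state distribution.
    """
    n_states, n_actions = len(R), len(R[0])
    v = [0.0] * n_states
    for _ in range(sweeps):
        # One synchronous Bellman expectation backup over all states.
        v = [
            sum(
                policy[s][a]
                * (R[s][a] + gamma * sum(P[s][a][t] * v[t] for t in range(n_states)))
                for a in range(n_actions)
            )
            for s in range(n_states)
        ]
    return sum(mu0[s] * v[s] for s in range(n_states))


def within_epsilon(j_opt, j_hat, eps):
    """True if the policy's suboptimality gap is at most eps."""
    return j_opt - j_hat <= eps


# Hypothetical one-state MDP with two actions; action 0 pays 1, action 1 pays 0.
P = [[[1.0], [1.0]]]
R = [[1.0, 0.0]]
mu0 = [1.0]
j_opt = policy_return(P, R, [[1.0, 0.0]], 0.5, mu0)  # always take action 0
j_hat = policy_return(P, R, [[0.5, 0.5]], 0.5, mu0)  # uniform random policy
gap = j_opt - j_hat
```

Here the uniform policy earns half the optimal return, so it satisfies the inequality only for $\varepsilon \geq 1$.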
1. This paper is quite novel because it raises an interesting and important observation – the performance of a policy increases first and then decreases. Such observation is caused by inaccurate reward feedback, which indeed exists in real RL applications. 2. This paper quantifies the magnitude of such phenomena and provides a clear geometric explanation. 3. With these insights, this paper proposes an optimal early stopping method with theoretical regret bound analysis. 4. The experimental re
1. The optimal early stopping rule relies on the knowledge of the occupancy measure and the upper bound $\theta$ of the angle between the true reward and the proxy reward. Methods to approximate the occupancy measure are well-researched. My concern is on the approximation of $\theta$, which is a relatively new concept and requires some knowledge of the true reward feedback or true reward samples. When such estimation is not accurate, the stopping method could exhibit negative performance. It wou
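To make the quantity $\theta$ concrete, the sketch below computes the plain Euclidean angle between a true and a proxy reward, each flattened into a vector over state-action pairs. This is an assumption made for illustration: the paper's own definition of the angle may involve a normalisation or projection that this excerpt does not show, and `reward_angle` is a hypothetical helper.

```python
import math


def reward_angle(r_true, r_proxy):
    """Angle in radians between two reward vectors of equal length."""
    if len(r_true) != len(r_proxy):
        raise ValueError("reward vectors must have the same length")
    dot = sum(a * b for a, b in zip(r_true, r_proxy))
    norm = math.sqrt(sum(a * a for a in r_true)) * math.sqrt(sum(b * b for b in r_proxy))
    if norm == 0.0:
        raise ValueError("angle is undefined for an all-zero reward")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.acos(cosine)


# Hypothetical rewards over two state-action pairs.
angle = reward_angle([1.0, 0.0], [1.0, 1.0])
```

In practice the true reward is not fully known, so any such angle has to be estimated from samples; that estimation error is the source of the reviewer's concern.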
Taxonomy
Topics: Auction Theory and Applications · Game Theory and Applications · Supply Chain and Inventory Management
Methods: Early Stopping
