RL, but don't do anything I wouldn't do
Michael K. Cohen, Marcus Hutter, Yoshua Bengio, Stuart Russell

TL;DR
This paper investigates the limitations of KL regularization in reinforcement learning, especially when the base policy is a Bayesian predictive model, and proposes an alternative principle to better control agent behavior.
Contribution
It reveals theoretical flaws in current KL regularization methods when used with Bayesian base policies and proposes a new principle to improve behavior control in RL agents.
Findings
Theoretically demonstrates the failure of KL regularization with Bayesian models.
Empirically shows evidence of the failure in RL-finetuned language models.
Proposes an alternative principle to replace 'Don't do anything I wouldn't do'.
Abstract
In reinforcement learning, if the agent's reward differs from the designers' true utility, even only rarely, the state distribution resulting from the agent's policy can be very bad, in theory and in practice. When RL policies would devolve into undesired behavior, a common countermeasure is KL regularization to a trusted policy ("Don't do anything I wouldn't do"). All current cutting-edge language models are RL agents that are KL-regularized to a "base policy" that is purely predictive. Unfortunately, we demonstrate that when this base policy is a Bayesian predictive model of a trusted policy, the KL constraint is no longer reliable for controlling the behavior of an advanced RL agent. We demonstrate this theoretically using algorithmic information theory, and while systems today are too weak to exhibit this theorized failure precisely, we RL-finetune a language model and find evidence…
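As a rough illustration of the KL-regularized setup the abstract describes (not the paper's construction or its experiments), the sketch below uses the standard closed form for maximizing expected reward minus beta times KL(policy || base) over a finite action set: the optimum is proportional to base(a) * exp(r(a) / beta). The three-action setup, the reward values, and the function names are hypothetical and chosen only for illustration. The point it shows is that an action to which the base policy gives small but nonzero probability costs at most -log(base probability) nats to commit to fully, so a large enough reward can pull the regularized policy onto it.

```python
import math


def kl_regularized_policy(base_probs, rewards, beta):
    """Maximizer of E_pi[r] - beta * KL(pi || base) over a finite action set.

    Closed form: pi(a) is proportional to base(a) * exp(r(a) / beta).
    """
    weights = [p * math.exp(r / beta) for p, r in zip(base_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p(a) = 0 contribute zero."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)


# Hypothetical toy numbers (assumptions, not taken from the paper):
# the base policy puts probability 0.001 on an action it "wouldn't do",
# and a misspecified reward happens to score that action very highly.
base_probs = [0.899, 0.1, 0.001]
rewards = [1.0, 0.5, 10.0]

for beta in (5.0, 1.0, 0.5):
    policy = kl_regularized_policy(base_probs, rewards, beta)
    cost = kl_divergence(policy, base_probs)
    print(f"beta={beta}: policy={[round(x, 4) for x in policy]}, KL={cost:.3f} nats")
```

In this toy setting, lowering beta (a weaker KL penalty) shifts mass toward the rare action, and the KL paid never exceeds log(1000) nats, the cost of putting all mass on it. This is only a finite-action caricature of the abstract's claim about Bayesian base policies.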
Peer Reviews
Decision: UAI 2025 Poster
The paper proposes novel and technically solid ideas. At many points the authors provide additional intuition on the significance of the results, which was useful for grasping the subtleties of the theoretical results. The continual links between the claims made, the results shown, and their implications made the paper easy to follow and clear to understand, although this is not precisely my area of expertise. Also good to include a proof outline as now, e.g. for
The main weakness of the paper concerns the assumptions made in Section 4 (see also questions below). Most of them are sound, but I believe part of the community would not necessarily agree with this. I think the paper would benefit from additional results where some of the assumptions are weakened, investigating which theoretical results would follow in this case. For example, what happens if we drop the assumption that all overoptimized policies lead to unsafe behaviours? Anoth
1. The paper is very well written: The ideas behind the theory are easy to follow and the authors argue compellingly why this might be a good model of the KL-constrained RL problem 2. The paper is very relevant to modern problems. The paper sheds light on a possible explanation of the performance of RL fine-tuning of imitation policies and they suggest a possible avenue to solve the problem. 3. The empirical evidence is compelling and useful to further understand the theory. The authors use int
The argument made by the authors is certainly compelling, and the empirical evidence seems to suggest that the theory applies. I wonder whether algorithmic information theory arguments fit the real-world problem. I guess there are hypotheses that neural networks approximate such algorithmically simple programs; however, we do not know enough about this. This might be the subject of future research, but I'd like the authors to comment more on it.
The studied problem is closely related to large language model training, which is currently a popular topic. The results of the paper are novel and non-trivial, and they explain certain observations made in large language model training. The paper shows the authors' understanding of the root cause of the problem.
My main criticism is the quality of the writing of this paper. The paper reads like the flow of the authors' thoughts, instead of an academic paper. The poor writing makes it hard to evaluate the paper's contributions. Examples: 1. Paragraph right after proposition 2. "Developers of self-driving cars are learning the hard way that this bit of algorithmic information theory has practical analogs: Even with enormous datasets, unprecedented road conditions occur all the time. These results sugge
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Reinforcement Learning in Robotics
Methods: Balanced Selection
