Risks from Learned Optimization in Advanced Machine Learning Systems
Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant

TL;DR
This paper introduces the concept of mesa-optimization in learned models, analyzing its implications for safety, transparency, and alignment in advanced machine learning systems.
Contribution
It defines mesa-optimization, explores when learned models become optimizers, and discusses how their objectives may differ from the loss function they were trained under, highlighting safety concerns.
Findings
Identifies conditions under which learned models act as optimizers
Highlights potential misalignment between learned objectives and training loss
Outlines topics for future research on AI safety and transparency
Abstract
We analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer - a situation we refer to as mesa-optimization, a neologism we introduce in this paper. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be - how will it differ from the loss function it was trained under - and how can it be aligned? In this paper, we provide an in-depth analysis of these two primary questions and provide an overview of topics for future research.
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Machine Learning and Algorithms · Reinforcement Learning in Robotics
