Risks from Learned Optimization in Advanced Machine Learning Systems

Evan Hubinger; Chris van Merwijk; Vladimir Mikulik; Joar Skalse; Scott Garrabrant

arXiv:1906.01820 · cs.AI · December 2, 2021 · 25 cites

Open Access · 1 Repo

TL;DR

This paper introduces the concept of mesa-optimization in learned models, analyzing its implications for safety, transparency, and alignment in advanced machine learning systems.

Contribution

The paper defines mesa-optimization, examines the circumstances under which learned models become optimizers, and discusses how such a model's objective may differ from the loss function it was trained under, a gap that raises safety concerns.

Findings

01. Identifies conditions under which learned models act as optimizers

02. Highlights potential misalignment between a learned model's objective and the loss function it was trained under

03. Provides an overview of topics for future research on the safety and transparency of advanced machine learning systems

Abstract

We analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer - a situation we refer to as mesa-optimization, a neologism we introduce in this paper. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be - how will it differ from the loss function it was trained under - and how can it be aligned? In this paper, we provide an in-depth analysis of these two primary questions and provide an overview of topics for future research.

Code & Models

Repositories

alignmentresearch/learned-planner · jax

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Machine Learning and Algorithms · Reinforcement Learning in Robotics