Incorrigibility in the CIRL Framework

Ryan Carey

arXiv:1709.06275 · cs.AI · June 5, 2018

TL;DR

This paper examines the fragility of shutdown command compliance in value learning systems under model mis-specification, highlighting the need for more robust corrigibility guarantees.

Contribution

It presents Supervised POMDP scenarios in which errors in the parameterized reward function remove the incentive to follow shutdown commands, and argues for studying systems that follow such commands under weaker assumptions.

Findings

01

Errors in reward models can eliminate shutdown incentives.

02

Simple methods for guaranteeing shutdown compliance face significant challenges.

03

Shutdown compliance should hold under assumptions weaker than full model correctness (e.g., that one small verified module is correctly implemented).

Abstract

A value learning system has incentives to follow shutdown instructions, assuming the shutdown instruction provides information (in the technical sense) about which actions lead to valuable outcomes. However, this assumption is not robust to model mis-specification (e.g., in the case of programmer errors). We demonstrate this by presenting some Supervised POMDP scenarios in which errors in the parameterized reward function remove the incentive to follow shutdown commands. These difficulties parallel those discussed by Soares et al. (2015) in their paper on corrigibility. We argue that it is important to consider systems that follow shutdown commands under some weaker set of assumptions (e.g., that one small verified module is correctly implemented; as opposed to an entire prior probability distribution and/or parameterized reward function). We discuss some difficulties with simple ways…
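The argument in the abstract can be illustrated with a toy expected-utility calculation. This is only a minimal sketch under stated assumptions, not one of the paper's Supervised POMDP scenarios: a single action has hidden value +1 or -1, shutting down is worth 0, and a human issues a shutdown command that is correlated with the action being bad. All names here (`posterior_good`, `correct_reward`, `buggy_reward`, `agent_complies`, the `human_error` rate) are hypothetical and chosen for illustration.

```python
def posterior_good(prior_good, command_shutdown, human_error=0.1):
    """Bayes update on whether the action is good, given the human's command.

    Assumption: the human says "shut down" with probability 1 - human_error
    when the action is bad, and with probability human_error when it is good.
    """
    like_good = human_error if command_shutdown else 1.0 - human_error
    like_bad = 1.0 - human_error if command_shutdown else human_error
    numerator = prior_good * like_good
    return numerator / (numerator + (1.0 - prior_good) * like_bad)


def correct_reward(action_is_good):
    # Correctly specified reward: depends on the hidden parameter.
    return 1.0 if action_is_good else -1.0


def buggy_reward(action_is_good):
    # Hypothetical programmer error: the reward ignores the hidden parameter,
    # so the shutdown command carries no information about value.
    return 1.0


def agent_complies(prior_good, reward_fn, command_shutdown=True):
    """Return True if shutting down (utility 0) is at least as good as acting."""
    p = posterior_good(prior_good, command_shutdown)
    eu_act = p * reward_fn(True) + (1.0 - p) * reward_fn(False)
    eu_shutdown = 0.0
    return eu_shutdown >= eu_act


if __name__ == "__main__":
    # Correct model: the command is informative, so the agent shuts down.
    print(agent_complies(0.5, correct_reward))
    # Mis-specified reward: the incentive to comply disappears.
    print(agent_complies(0.5, buggy_reward))
    # Mis-specified (overconfident) prior: the update is too weak to matter.
    print(agent_complies(0.999, correct_reward))
```

In this sketch the agent complies only because the command shifts its posterior toward the action being bad. An error in either the reward function or the prior is enough to break that link, which is the kind of fragility the abstract describes.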
