CoinRun: Solving Goal Misgeneralisation

Stuart Armstrong; Alexandre Maranhão; Oliver Daniels-Koch; Patrick Leask; Rebecca Gorman

arXiv:2309.16166·cs.AI·November 2, 2023

TL;DR

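This paper shows that the ACE (Algorithm for Concept Extrapolation) agent solves the CoinRun challenge, a standard goal misgeneralisation benchmark, without using any new reward information in the new environment. The authors take this as evidence that autonomous agents could be trusted to act in human interests, even in novel situations.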

Contribution

The paper shows that the ACE agent can solve CoinRun, one of the standard goal misgeneralisation challenges in AI alignment, using no additional reward information in the new environment.

Findings

01. The ACE agent solves the CoinRun challenge.

02. No new reward information is needed in the new environment.

03. The result supports the development of trustworthy autonomous AI.

Abstract

Goal misgeneralisation is a key challenge in AI alignment -- the task of getting powerful Artificial Intelligences to align their goals with human intentions and human morality. In this paper, we show how the ACE (Algorithm for Concept Extrapolation) agent can solve one of the key standard challenges in goal misgeneralisation: the CoinRun challenge. It uses no new reward information in the new environment. This points to how autonomous agents could be trusted to act in human interests, even in novel and critical situations.
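To make the failure mode concrete, here is a minimal toy sketch of goal misgeneralisation in a one-dimensional corridor. It is not the ACE algorithm and not the real CoinRun environment; the corridor, the two hand-written policies, and all names (`always_right`, `seek_coin`, `run_episode`, `success_rate`) are assumptions made purely for illustration. It assumes, as background on the benchmark rather than something stated above, that the coin sits at the far right during training, so "go right" and "get the coin" earn identical reward there and only come apart when the coin moves.

```python
from typing import Callable, Sequence

# A policy maps (agent_pos, coin_pos) to a step of -1 (left) or +1 (right).
Policy = Callable[[int, int], int]


def always_right(agent_pos: int, coin_pos: int) -> int:
    """Hypothetical misgeneralised goal: ignore the coin, head right."""
    return 1


def seek_coin(agent_pos: int, coin_pos: int) -> int:
    """Hypothetical intended goal: step toward the coin."""
    return 1 if coin_pos > agent_pos else -1


def run_episode(policy: Policy, coin_pos: int,
                length: int = 7, max_steps: int = 10) -> int:
    """Return 1 if the agent reaches the coin within max_steps, else 0."""
    pos = length // 2  # the agent starts in the middle of the corridor
    for _ in range(max_steps):
        if pos == coin_pos:
            return 1
        step = policy(pos, coin_pos)
        pos = min(max(pos + step, 0), length - 1)  # walls clamp movement
    return 1 if pos == coin_pos else 0


def success_rate(policy: Policy, coin_positions: Sequence[int]) -> float:
    """Fraction of levels in which the policy collects the coin."""
    wins = sum(run_episode(policy, c) for c in coin_positions)
    return wins / len(coin_positions)


# Training levels: the coin is always at the right end (position 6).
train_levels = [6, 6, 6, 6]
# Test levels: the coin may be anywhere, including left of the start.
test_levels = [0, 1, 5, 6]

for name, policy in [("always_right", always_right), ("seek_coin", seek_coin)]:
    print(name,
          "train:", success_rate(policy, train_levels),
          "test:", success_rate(policy, test_levels))
```

Both policies are indistinguishable on the training levels, so reward alone cannot tell them apart; the difference appears only on the shifted test levels, which is the setting in which the abstract says ACE operates without new reward information.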

Taxonomy

Topics: Data Stream Mining Techniques · Time Series Analysis and Forecasting · Bayesian Modeling and Causal Inference

Methods: ALIGN