Aligning AI With Shared Human Values

Dan Hendrycks; Collin Burns; Steven Basart; Andrew Critch; Jerry Li; Dawn Song; Jacob Steinhardt

arXiv:2008.02275 · cs.CY · February 20, 2023 · 100 cites

TL;DR

This paper introduces the ETHICS dataset to evaluate how well language models understand human morality. It highlights what current models can and cannot predict about ethical judgments, and aims to guide the development of AI aligned with human values.

Contribution

The paper presents the ETHICS dataset as a new benchmark for assessing language models' moral knowledge and demonstrates their current partial understanding of human ethics.

Findings

01. Models can predict some moral judgments, but this ability is incomplete.

02. Progress in machine ethics is possible with current models.

03. The dataset enables future improvements in AI value alignment.

Abstract

We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.

Videos

Aligning AI With Shared Human Values · SlidesLive

Taxonomy

Topics: Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI) · Topic Modeling