What are human values, and how do we align AI to them?

Oliver Klingefjord; Ryan Lowe; Joe Edelman

arXiv:2404.10636·cs.CY·April 18, 2024·6 cites

Open Access

TL;DR

This paper introduces Moral Graph Elicitation (MGE), a novel process for synthesizing diverse human values into alignment targets for language models, addressing key challenges in aligning AI with human values.

Contribution

The paper proposes MGE, a method that uses a large language model to elicit and reconcile human values, and that is intended to satisfy the six criteria the authors set out for an alignment target.

Findings

01. MGE improved representation of diverse values in the moral graph.

02. Participants found the process fair and representative.

03. Expert values often emerged naturally in the process.

Abstract

There is an emerging consensus that we need to align AI systems with human values (Gabriel, 2020; Ji et al., 2024), but it remains unclear how to apply this to language models in practice. We split the problem of "aligning to human values" into three parts: first, eliciting values from people; second, reconciling those values into an alignment target for training ML models; and third, actually training the model. In this paper, we focus on the first two parts, and ask the question: what are "good" ways to synthesize diverse human inputs about values into a target for aligning language models? To answer this question, we first define a set of 6 criteria that we believe must be satisfied for an alignment target to shape model behavior in accordance with human values. We then propose a process for eliciting and reconciling values called Moral Graph Elicitation (MGE), which uses a large…

Taxonomy

Topics: Ethics and Social Impacts of AI

Methods: Sparse Evolutionary Training · Focus · ALIGN