What are human values, and how do we align AI to them?
Oliver Klingefjord, Ryan Lowe, Joe Edelman

TL;DR
This paper introduces Moral Graph Elicitation (MGE), a process for eliciting diverse human values and reconciling them into an alignment target for language models.
Contribution
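The paper defines six criteria that an alignment target must satisfy to shape model behavior in accordance with human values, and proposes MGE, a method that uses large language models to elicit and reconcile human values, as a process that satisfies them.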
Findings
MGE improved representation of diverse values in the moral graph
Participants found the process fair and representative
Expert values often emerged naturally in the process
Abstract
There is an emerging consensus that we need to align AI systems with human values (Gabriel, 2020; Ji et al., 2024), but it remains unclear how to apply this to language models in practice. We split the problem of "aligning to human values" into three parts: first, eliciting values from people; second, reconciling those values into an alignment target for training ML models; and third, actually training the model. In this paper, we focus on the first two parts, and ask the question: what are "good" ways to synthesize diverse human inputs about values into a target for aligning language models? To answer this question, we first define a set of 6 criteria that we believe must be satisfied for an alignment target to shape model behavior in accordance with human values. We then propose a process for eliciting and reconciling values called Moral Graph Elicitation (MGE), which uses a large…
Taxonomy
Topics: Ethics and Social Impacts of AI
Methods: Sparse Evolutionary Training · Focus · ALIGN
