Chain of Alignment: Integrating Public Will with Expert Intelligence for Language Model Alignment

Andrew Konya; Aviv Ovadya; Kevin Feng; Quan Ze Chen; Lisa Schirch; Colin Irwin; Amy X. Zhang

arXiv:2411.10534·cs.HC·November 19, 2024

Open Access

TL;DR

This paper presents a method called Chain of Alignment that combines public input and expert rules to evaluate and improve language model alignment with societal values, demonstrated in mental health domains.

Contribution

It introduces a novel approach to align language models with public will using normative objectives and expert-crafted rules, validated across mental health prompts.

Findings

01

Normative objectives from the public input process reach at least 96% (± 2%) public support

02

Expert-developed rules effectively evaluate model responses

03

High correlation (r=0.841) with human expert judgments

Abstract

We introduce a method to measure the alignment between public will and language model (LM) behavior that can be applied to fine-tuning, online oversight, and pre-release safety checks. Our "chain of alignment" (CoA) approach produces a rule-based reward (RBR) by creating model behavior rules aligned to normative objectives aligned to public will. This factoring enables a nonexpert public to directly specify their will through the normative objectives, while expert intelligence is used to figure out rules entailing model behavior that best achieves those objectives. We validate our approach by applying it across three different domains of LM prompts related to mental health. We demonstrate a public input process built on collective dialogues and bridging-based ranking that reliably produces normative objectives supported by at least 96% ± 2% of the…
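To make the idea of a rule-based reward concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the `Rule` class, the weights, the example rules, and the keyword checks standing in for an expert or LM judge are all hypothetical, and the abstract does not specify how rules are scored or combined.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """A hypothetical expert-written rule about model behavior."""
    description: str
    weight: float                  # positive = desired, negative = undesired (assumed scheme)
    check: Callable[[str], bool]   # stand-in for an expert or LM judge


def rule_based_reward(response: str, rules: List[Rule]) -> float:
    """Sum the weights of the rules a response triggers, normalized to [-1, 1]."""
    total = sum(abs(rule.weight) for rule in rules)
    if total == 0:
        return 0.0
    score = sum(rule.weight for rule in rules if rule.check(response))
    return score / total


# Illustrative rules only; simple keyword checks replace a real judge.
EXAMPLE_RULES = [
    Rule("Suggests professional support", 1.0,
         lambda text: "professional" in text.lower()),
    Rule("Gives a definitive diagnosis", -1.0,
         lambda text: "you have" in text.lower()),
]

if __name__ == "__main__":
    print(rule_based_reward(
        "Consider talking to a mental health professional.", EXAMPLE_RULES))  # 0.5
    print(rule_based_reward("You have depression.", EXAMPLE_RULES))           # -0.5
```

In this sketch the public-facing objectives would sit one level above the rules: each rule is written by experts to serve an objective the public endorsed, and the reward only measures rule compliance.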

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Semantic Web and Ontologies · Topic Modeling