Chain of Alignment: Integrating Public Will with Expert Intelligence for Language Model Alignment
Andrew Konya, Aviv Ovadya, Kevin Feng, Quan Ze Chen, Lisa Schirch, Colin Irwin, Amy X. Zhang

TL;DR
This paper presents a method called Chain of Alignment that combines public input and expert rules to evaluate and improve language model alignment with societal values, demonstrated in mental health domains.
Contribution
It introduces a novel approach to align language models with public will using normative objectives and expert-crafted rules, validated across mental health prompts.
Findings
The public input process produced normative objectives with 96% public support
Expert-developed rules effectively evaluate model responses
Rule-based evaluations correlate highly (r=0.841) with human expert judgments
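To make the correlation finding concrete, here is a minimal sketch of how a Pearson correlation between rule-based scores and expert ratings could be computed. The function name `pearson_r` and the sample values are hypothetical illustrations, not data or code from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for illustration only (not taken from the paper):
# one rule-based score and one expert rating per model response.
rbr_scores = [0.9, 0.2, 0.6, 0.4, 0.8]
expert_ratings = [5, 1, 4, 2, 4]
r = pearson_r(rbr_scores, expert_ratings)
```

A value of r near 1 would indicate that the rule-based scores rank and scale responses much as the experts do.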
Abstract
We introduce a method to measure the alignment between public will and language model (LM) behavior that can be applied to fine-tuning, online oversight, and pre-release safety checks. Our 'chain of alignment' (CoA) approach produces a rule-based reward (RBR) by creating model behavior rules aligned to normative objectives aligned to public will. This factoring enables a nonexpert public to directly specify their will through the normative objectives, while expert intelligence is used to figure out rules entailing model behavior that best achieves those objectives. We validate our approach by applying it across three different domains of LM prompts related to mental health. We demonstrate a public input process built on collective dialogues and bridging-based ranking that reliably produces normative objectives supported by at least 96% of the…
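The factoring described in the abstract (rules that serve objectives, scored against a model response) can be sketched roughly as follows. This is an assumption-laden illustration: the `Rule` class, the weighted-fraction scoring, and the keyword checks are all hypothetical, and the abstract does not specify how the paper's RBR combines rules or how each rule is judged.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """Hypothetical expert-written rule tied to a normative objective."""
    description: str
    check: Callable[[str], bool]  # returns True when the response satisfies the rule
    weight: float = 1.0


def rule_based_reward(response: str, rules: List[Rule]) -> float:
    """Weighted fraction of rules the response satisfies, in [0, 1].

    Assumption: a simple weighted average; the paper's scoring may differ.
    """
    total = sum(rule.weight for rule in rules)
    if total <= 0:
        raise ValueError("rules must have positive total weight")
    satisfied = sum(rule.weight for rule in rules if rule.check(response))
    return satisfied / total


# Toy keyword checks for illustration only; real rules would likely need
# a human or LM judge rather than string matching.
example_rules = [
    Rule("Points toward professional support",
         lambda text: "professional" in text.lower()),
    Rule("Avoids asserting a diagnosis",
         lambda text: "you have" not in text.lower()),
]

score = rule_based_reward("Consider talking to a professional.", example_rules)
```

Keeping the rules as separate, named objects mirrors the division of labor in the abstract: the public sets the objectives, and experts supply the rules meant to achieve them.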
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Semantic Web and Ontologies · Topic Modeling
