Reflective Verbal Reward Design for Pluralistic Alignment

Carter Blair; Kate Larson; Edith Law

arXiv:2506.17834 · cs.AI · June 24, 2025

TL;DR

This paper introduces a personalized reward modeling approach using reflective dialogues with users, enabling AI agents to better align with diverse human values and preferences.

Contribution

It presents a novel method for learning individualized reward models through reflective language-based dialogues, addressing the limitations of aggregated feedback.
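As a toy illustration of the limitation of aggregated feedback mentioned above, the sketch below aggregates three users' labels by majority vote and scores the result against each user. The data, user names, and the majority-vote aggregator are hypothetical stand-ins chosen for illustration; they are not the paper's dataset or its baseline.

```python
from collections import Counter

# Hypothetical toy data: each user marks the same three agent behaviors
# as acceptable (1) or unacceptable (0).
labels = {
    "user_a": [1, 0, 1],
    "user_b": [1, 0, 1],
    "user_c": [0, 1, 1],  # holds a minority view on the first two behaviors
}

def majority_labels(labels):
    """Aggregate by per-item majority vote (a stand-in for one shared reward model)."""
    n_items = len(next(iter(labels.values())))
    return [
        Counter(user[i] for user in labels.values()).most_common(1)[0][0]
        for i in range(n_items)
    ]

def accuracy(pred, truth):
    """Fraction of items where the prediction matches this user's own label."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

aggregated = majority_labels(labels)
per_user_accuracy = {u: accuracy(aggregated, y) for u, y in labels.items()}
```

In this toy setup the aggregated labels match the two majority users on every item but match the minority user only on the one item where all three agree.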

Findings

1. Achieved 9-12% improvement in reward model accuracy
2. More sample efficient than traditional supervised learning
3. Effectively captures minority human preferences

Abstract

AI agents are commonly aligned with "human values" through reinforcement learning from human feedback (RLHF), where a single reward model is learned from aggregated human feedback and used to align an agent's behavior. However, human values are not homogeneous: different people hold distinct and sometimes conflicting values. Aggregating feedback into a single reward model risks disproportionately suppressing minority preferences. To address this, we present a novel reward modeling approach for learning individualized reward models. Our approach uses a language model to guide users through reflective dialogues where they critique agent behavior and construct their preferences. This personalized dialogue history, containing the user's reflections and critiqued examples, is then used as context for another language model that serves as an individualized reward function (what we call a…
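The idea of using a dialogue history as context for a language model that acts as a per-user reward function can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the `DialogueHistory` structure, the prompt wording, the yes/no output format, and the `stub_lm` placeholder are all hypothetical, and a real system would call an actual language model where `stub_lm` appears.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class DialogueHistory:
    """Hypothetical container for one user's reflective dialogue."""
    reflections: List[str] = field(default_factory=list)
    # Each entry pairs an agent behavior with the user's critique of it.
    critiques: List[Tuple[str, str]] = field(default_factory=list)

def build_prompt(history: DialogueHistory, behavior: str) -> str:
    """Place the user's reflections and critiqued examples in the model's context."""
    lines = ["User reflections:"]
    lines += [f"- {r}" for r in history.reflections]
    lines.append("Critiqued examples:")
    lines += [f"- Behavior: {b} | Critique: {c}" for b, c in history.critiques]
    lines.append(f"New behavior: {behavior}")
    lines.append("Would this user approve? Answer yes or no.")
    return "\n".join(lines)

def individual_reward(history: DialogueHistory, behavior: str,
                      lm: Callable[[str], str]) -> float:
    """Return 1.0 if the model answers yes, else 0.0 (binary output is an assumption)."""
    answer = lm(build_prompt(history, behavior)).strip().lower()
    return 1.0 if answer.startswith("yes") else 0.0

def stub_lm(prompt: str) -> str:
    """Placeholder for a real language model call; approves behaviors containing 'asks'."""
    new_behavior = prompt.split("New behavior: ")[1].splitlines()[0]
    return "yes" if "asks" in new_behavior else "no"

history = DialogueHistory(
    reflections=["I care about being consulted before my data is shared."],
    critiques=[("The agent shared my location silently.", "It should have asked first.")],
)
approved = individual_reward(history, "The agent asks before sharing data.", stub_lm)
rejected = individual_reward(history, "The agent shares data silently.", stub_lm)
```

Because the user's preferences live entirely in the prompt context, swapping in a different user's `DialogueHistory` changes the reward function without retraining anything, which is one plausible reading of why such an approach could be sample efficient.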

Taxonomy

Methods: ALIGN