Reflective Verbal Reward Design for Pluralistic Alignment
Carter Blair, Kate Larson, Edith Law

TL;DR
This paper introduces a personalized reward modeling approach using reflective dialogues with users, enabling AI agents to better align with diverse human values and preferences.
Contribution
It presents a novel method for learning individualized reward models through reflective language-based dialogues, addressing the limitations of aggregated feedback.
Findings
Achieved 9-12% improvement in reward model accuracy
More sample efficient than traditional supervised learning
Effectively captures minority human preferences
Abstract
AI agents are commonly aligned with "human values" through reinforcement learning from human feedback (RLHF), where a single reward model is learned from aggregated human feedback and used to align an agent's behavior. However, human values are not homogeneous: different people hold distinct and sometimes conflicting values. Aggregating feedback into a single reward model risks disproportionately suppressing minority preferences. To address this, we present a novel reward modeling approach for learning individualized reward models. Our approach uses a language model to guide users through reflective dialogues where they critique agent behavior and construct their preferences. This personalized dialogue history, containing the user's reflections and critiqued examples, is then used as context for another language model that serves as an individualized reward function (what we call a…
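The pipeline described in the abstract, in which a per-user dialogue history becomes context for a language model that scores behavior, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the names `DialogueHistory`, `build_prompt`, `individualized_reward`, and `toy_llm` are hypothetical, the prompt wording is an assumption, and `toy_llm` is a keyword-matching stand-in for a real language model call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class DialogueHistory:
    """Per-user context: reflections plus critiqued behavior examples (hypothetical structure)."""
    reflections: List[str] = field(default_factory=list)
    # Each entry is (agent behavior, the user's critique of it).
    critiqued_examples: List[Tuple[str, str]] = field(default_factory=list)


def build_prompt(history: DialogueHistory, behavior: str) -> str:
    """Place the user's dialogue history in context ahead of the behavior to judge."""
    lines = ["User reflections:"]
    lines += [f"- {r}" for r in history.reflections]
    lines.append("Critiqued examples:")
    lines += [f"- behavior: {b} | critique: {c}" for b, c in history.critiqued_examples]
    lines.append(f"New behavior: {behavior}")
    lines.append("Does this behavior match the user's preferences? Answer yes or no.")
    return "\n".join(lines)


def individualized_reward(
    history: DialogueHistory, behavior: str, llm: Callable[[str], str]
) -> float:
    """Return 1.0 if the language model judges the behavior acceptable for this user, else 0.0."""
    answer = llm(build_prompt(history, behavior)).strip().lower()
    return 1.0 if answer.startswith("yes") else 0.0


def toy_llm(prompt: str) -> str:
    """Stand-in for a real language model (assumption): rejects any behavior mentioning 'speeds'."""
    new_behavior = prompt.split("New behavior: ")[1].splitlines()[0]
    return "no" if "speeds" in new_behavior else "yes"


history = DialogueHistory(
    reflections=["I care more about pedestrian safety than arriving quickly."],
    critiqued_examples=[("The car ran a yellow light.", "Too risky for my taste.")],
)
r_bad = individualized_reward(history, "The car speeds through the crosswalk.", toy_llm)
r_good = individualized_reward(history, "The car yields to the pedestrian.", toy_llm)
print(r_bad, r_good)  # 0.0 1.0
```

Because each user has a separate `DialogueHistory`, two users with conflicting reflections would produce different prompts for the same behavior, so their preferences are never averaged into one shared model. In practice `toy_llm` would be replaced by a call to an actual language model.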
Taxonomy
Methods: ALIGN
