Pareto-Optimal Learning from Preferences with Hidden Context

Ryan Bahlous-Boldi; Li Ding; Lee Spector; Scott Niekum

arXiv:2406.15599·cs.LG·February 10, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces POPL, a method that learns a set of policies covering diverse human preferences without requiring group labels, using Pareto optimization to avoid favoring one group over another.

Contribution

The paper proposes a Pareto-optimal preference learning framework that handles conflicting preferences and achieves pluralistic alignment without requiring group labels.

Findings

1. POPL outperforms baseline methods in reward and policy learning.
2. It is effective in diverse settings: preference learning, RL, robotics, and LLM fine-tuning.
3. It supports fairness and safety in AI alignment.

Abstract

Ensuring AI models align with human values is essential for their safety and functionality. Reinforcement learning from human feedback (RLHF) leverages human preferences to achieve this alignment. However, when preferences are sourced from diverse populations, point estimates of reward can result in suboptimal performance or be unfair to specific groups. We propose Pareto Optimal Preference Learning (POPL), which enables pluralistic alignment by framing discrepant group preferences as objectives with potential trade-offs, aiming for policies that are Pareto-optimal on the preference dataset. POPL utilizes lexicase selection, an iterative process that selects diverse and Pareto-optimal solutions. Our theoretical and empirical evaluations demonstrate that POPL surpasses baseline methods in learning sets of reward functions and policies, effectively catering to distinct groups without…
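The abstract names lexicase selection without spelling it out. The block below is a minimal sketch of the generic lexicase procedure, not the authors' implementation. It assumes each candidate (for example, a hypothetical reward-function hypothesis) has a 0/1 error on each preference comparison, where 0 means the candidate agrees with the recorded preference. The function name `lexicase_select` and the toy error matrix are illustrative assumptions.

```python
import random

def lexicase_select(errors, rng):
    """Pick one candidate index by lexicase selection.

    errors[i][j] is candidate i's error on case j (lower is better).
    Cases are visited in a fresh random order; after each case only the
    candidates with the best error on that case stay in the pool.
    """
    pool = list(range(len(errors)))
    cases = list(range(len(errors[0])))
    rng.shuffle(cases)
    for case in cases:
        best = min(errors[i][case] for i in pool)
        pool = [i for i in pool if errors[i][case] == best]
        if len(pool) == 1:
            break
    # Any remaining ties are broken uniformly at random.
    return rng.choice(pool)

# Toy data (assumed, not from the paper): rows are candidates,
# columns are preference comparisons from two unlabeled groups.
errors = [
    [0, 0, 1, 1],  # candidate 0 agrees with comparisons 0 and 1
    [1, 1, 0, 0],  # candidate 1 agrees with comparisons 2 and 3
    [1, 1, 1, 0],  # candidate 2 is dominated by candidate 1
]

rng = random.Random(0)
selected = [lexicase_select(errors, rng) for _ in range(200)]
counts = {i: selected.count(i) for i in range(len(errors))}
```

In this toy setup, candidates 0 and 1 each serve a different subset of comparisons and both get selected across repeated draws, while the dominated candidate 2 is always filtered out once comparison 2 is reached. No group labels are used: the diversity comes only from the random ordering of individual comparisons.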

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

POPL can capture the Pareto optimal reward functions.

Weaknesses

- The validity of the proposed algorithm is not theoretically explained.
- The scalability of the proposed algorithm remains uncertain.
- There is no way to determine the dimensionality of the underlying reward functions or the number of Pareto optimal policies.

Reviewer 02 · Rating 5 · Confidence 2

Strengths

1. This work concentrates on an important problem: human feedback involves hidden information. Pareto optimality is indeed one possible solution.
2. Experiments on different tasks are given.

Weaknesses

1. I found the presentation of the method quite hard to understand. Lexicase selection, the key idea of the method, is not introduced clearly. I am not clear about how this selection method is conducted. Also, I think a figure of the process would help readers understand.
2. Similarly, I found that many concepts are used without a clear explanation. For example, I am not clear what "hypotheses" refers to when it first appears in Sec. 6.1 and also in Alg. 1. Also, as MDPL is menti

Reviewer 03 · Rating 5 · Confidence 3

Strengths

- This paper provides a novel way to deal with the heterogeneity among different sources.
- Various experiments under different settings are conducted to evaluate the performance.

Weaknesses

- Notations and concepts are sometimes not well-defined. For instance, $\sigma$ first occurs in Section 3 without any definition. I would suggest that the authors define notation more clearly.
- The presentation of experiment results is quite unclear. For example, Figure 3(b) is really chaotic and I can hardly tell what information it conveys.
- Lack of comparison with other methods. From my interpretation, many methods are proposed to solve the similar issue, e.g., Nash learning for RLHF and general p

Taxonomy

Topics: Multi-Criteria Decision Making · Advanced Bandit Algorithms Research · Water resources management and optimization

Methods: ALIGN