TL;DR
This paper introduces a multi-faceted evaluation framework for personalized preference learning in language models, covering performance, fairness, safety, and adaptability, and shows that performance gaps between methods can reach 36% when users strongly disagree.
Contribution
The paper provides a multi-faceted assessment approach for personalized preference learning, addressing the field's lack of standardized evaluation by measuring fairness, unintended effects, and adaptability alongside performance.
Findings
Performance differences between methods reach up to 36% when users strongly disagree
Personalization can cause up to 20% safety misalignment
Across eight personalization methods and three preference datasets, effectiveness varies substantially from method to method
Abstract
While Reinforcement Learning from Human Feedback (RLHF) is widely used to align Large Language Models (LLMs) with human preferences, it typically assumes homogeneous preferences across users, overlooking diverse human values and minority viewpoints. Although personalized preference learning addresses this by tailoring separate preferences for individual users, the field lacks standardized methods to assess its effectiveness. We present a multi-faceted evaluation framework that measures not only performance but also fairness, unintended effects, and adaptability across varying levels of preference divergence. Through extensive experiments comparing eight personalization methods across three preference datasets, we demonstrate that performance differences between methods could reach 36% when users strongly disagree, and personalization can introduce up to 20% safety misalignment. These…