Rethinking Diverse Human Preference Learning through Principal Component Analysis
Feng Luo, Rui Yang, Hao Sun, Chunyuan Deng, Jiarui Yao, Jingyan Shen, Huan Zhang, Hanjie Chen

TL;DR
This paper introduces Decomposed Reward Models (DRMs), a novel PCA-based method to extract and interpret diverse human preferences from binary comparisons, enabling scalable, personalized, and interpretable alignment of language models.
Contribution
The paper presents DRMs, a PCA-based approach that captures the diversity of human preferences from binary comparisons alone, without fine-grained annotations, and yields reward components that can be interpreted and recombined for personalization.
Findings
DRMs identify meaningful preference dimensions such as helpfulness and safety.
DRMs can adapt to new users without additional training.
The approach offers an interpretable alternative to traditional reward models.
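As a rough illustration of adapting to a new user without additional training, the sketch below reweights a fixed set of decomposed reward heads using a handful of that user's comparisons. The weighting rule (accuracy above chance, normalized) and the names `fit_user_weights`, `combined_reward`, and `head_margins` are assumptions made for this example, not the procedure specified by the paper.

```python
import numpy as np

def fit_user_weights(head_margins):
    """Weight each reward head by how well it agrees with one user's choices.

    head_margins: (n_pairs, k) array. Entry [i, j] is head j's reward for the
    response the user preferred minus its reward for the rejected response.
    This rule is a hypothetical choice for illustration only.
    """
    # Fraction of the user's pairs that each head ranks correctly.
    accuracy = (head_margins > 0).mean(axis=0)
    # Keep only heads that beat chance, weighted by their margin over chance.
    weights = np.clip(accuracy - 0.5, 0.0, None)
    total = weights.sum()
    if total == 0:
        # No head beats chance: fall back to a uniform combination.
        return np.full(head_margins.shape[1], 1.0 / head_margins.shape[1])
    return weights / total

def combined_reward(head_scores, weights):
    """Linear combination of per-head scores; no parameters are trained."""
    return head_scores @ weights

# Four comparisons from one hypothetical user, scored by three heads.
margins = np.array([
    [1.0, -0.5,  0.2],
    [0.8, -0.1,  0.3],
    [0.5,  0.4,  0.1],
    [0.9, -0.2, -0.4],
])
user_weights = fit_user_weights(margins)
```

Only the mixing weights change per user; the heads themselves stay fixed, which is what makes this kind of adaptation cheap.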
Abstract
Understanding human preferences is crucial for improving foundation models and building personalized AI systems. However, preferences are inherently diverse and complex, making it difficult for traditional reward models to capture their full range. While fine-grained preference data can help, collecting it is expensive and hard to scale. In this paper, we introduce Decomposed Reward Models (DRMs), a novel approach that extracts diverse human preferences from binary comparisons without requiring fine-grained annotations. Our key insight is to represent human preferences as vectors and analyze them using Principal Component Analysis (PCA). By constructing a dataset of embedding differences between preferred and rejected responses, DRMs identify orthogonal basis vectors that capture distinct aspects of preference. These decomposed rewards can be flexibly combined to align with different…
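The decomposition described in the abstract can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the embeddings are random stand-ins for whatever the reward model backbone produces, and the mean-centering step, the sign convention, and the function name `decompose_preferences` are choices made for this example and are not confirmed by the source.

```python
import numpy as np

def decompose_preferences(chosen_emb, rejected_emb, k):
    """Return k orthonormal directions from PCA on embedding differences.

    chosen_emb, rejected_emb: (n_pairs, dim) arrays of response embeddings.
    Centering and the sign flip below are assumptions of this sketch.
    """
    diffs = chosen_emb - rejected_emb              # one vector per comparison
    centered = diffs - diffs.mean(axis=0)
    # Rows of vt are orthonormal principal directions, sorted by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    directions = vt[:k]
    # A principal direction's sign is arbitrary; orient each one so that
    # preferred responses score higher than rejected ones on average.
    signs = np.sign((diffs @ directions.T).mean(axis=0))
    signs[signs == 0] = 1.0
    directions = directions * signs[:, None]
    explained = singular_values[:k] ** 2 / np.sum(singular_values ** 2)
    return directions, explained

# Synthetic stand-in data: 200 comparisons with 16-dimensional embeddings.
rng = np.random.default_rng(0)
chosen = rng.normal(size=(200, 16))
rejected = rng.normal(size=(200, 16))
directions, explained = decompose_preferences(chosen, rejected, k=4)

# Each direction acts as one reward head: score = embedding . direction.
head_scores = chosen @ directions.T                # shape (200, 4)
```

Because the directions are orthonormal, each head measures a component of preference that the others do not, which is what allows them to be inspected and recombined separately.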
Taxonomy
Topics: Color perception and design
Methods: ALIGN
