Can DPO Learn Diverse Human Values? A Theoretical Scaling Law
Shawn Im, Sharon Li

TL;DR
This paper develops a theoretical framework to analyze how large language models trained with preference learning generalize across diverse human values, highlighting the challenges and limitations in capturing a broad spectrum of human preferences.
Contribution
It introduces a novel theoretical scaling law for preference learning that accounts for value diversity and sample size, providing bounds on generalization error.
Findings
The framework predicts increased difficulty in learning diverse values with limited data.
Empirical validation confirms the theoretical bounds on generalization error.
The results highlight the importance of sample quantity in aligning models with broad human preferences.
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities but often struggle to align with human preferences, leading to harmful or undesirable outputs. Preference learning, which trains models to distinguish between preferred and non-preferred responses based on human feedback, has become a crucial component for ensuring that LLMs align with human values. An essential part of ensuring that LLMs are aligned for all people is accounting for a diverse set of values. This paper introduces a new theoretical framework to analyze how generalization scales with value diversity and sample quantity in models trained with direct preference optimization. Our framework rigorously assesses how well models generalize after a finite number of gradient steps, reflecting real-world LLM training practices. By analyzing the reward margin associated with each sample and its trajectory…
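The reward margin mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes the standard DPO formulation, in which the implicit reward of a response is beta times the log-ratio of policy to reference probabilities, and the per-sample loss is the negative log-sigmoid of the margin between the preferred and non-preferred response. The paper's exact definition may differ. The function names, the beta value, and the log-probabilities below are hypothetical and chosen only for illustration.

```python
import math


def reward_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Implicit reward margin under the standard DPO formulation (assumed here).

    logp_w / logp_l: policy log-probabilities of the preferred (w) and
    non-preferred (l) responses; ref_logp_*: the same under the reference model.
    """
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))


def dpo_loss(margin):
    """Per-sample DPO loss, -log(sigmoid(margin)), computed stably."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, rewrite to avoid overflow in exp().
    return -margin + math.log1p(math.exp(margin))


# Hypothetical values: the policy equals the reference, so the margin is zero.
m_start = reward_margin(-12.0, -11.0, -12.0, -11.0)

# Hypothetical values after training: the policy has raised the preferred
# response and lowered the non-preferred one relative to the reference.
m_trained = reward_margin(-10.0, -13.0, -12.0, -11.0)

print(m_start, dpo_loss(m_start))
print(m_trained, dpo_loss(m_trained))
```

A positive margin means the model ranks the preferred response above the non-preferred one relative to the reference, and the loss decreases as the margin grows. Tracking this quantity per sample over gradient steps is the kind of trajectory the abstract refers to.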
Taxonomy
Topics: Rough Sets and Fuzzy Logic
Methods: Direct Preference Optimization · ALIGN
