Diverse Preference Learning for Capabilities and Alignment
Stewart Slocum, Asher Parker-Sartori, Dylan Hadfield-Menell

TL;DR
This paper introduces Soft Preference Learning, a novel method that enhances diversity and societal viewpoint representation in LLM outputs by decoupling entropy and cross-entropy in preference learning.
Contribution
It proposes a new preference learning approach that improves diversity and societal perspective representation in LLMs, addressing limitations of existing alignment algorithms.
Findings
LLMs trained with Soft Preference Learning show higher accuracy on complex tasks.
Outputs exhibit greater semantic and lexical diversity.
Models better represent diverse societal viewpoints and have improved logit calibration.
Abstract
The ability of LLMs to represent diverse perspectives is critical as they increasingly impact society. However, recent studies reveal that alignment algorithms such as RLHF and DPO significantly reduce the diversity of LLM outputs. Not only do aligned LLMs generate text with repetitive structure and word choice, they also approach problems in more uniform ways, and their responses reflect a narrower range of societal perspectives. We attribute this problem to the KL divergence regularizer employed in preference learning algorithms. This causes the model to systematically overweight majority opinions and sacrifice diversity in its outputs. To address this, we propose Soft Preference Learning, which decouples the entropy and cross-entropy terms in the KL penalty, allowing for fine-grained control over LLM generation diversity. From a capabilities perspective, LLMs trained using Soft…
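The decoupling mentioned in the abstract can be illustrated on a toy discrete distribution. The sketch below is not the authors' implementation. It relies on the standard identity KL(p || q) = H(p, q) - H(p) and assumes, for illustration only, that the two terms receive separate weights `alpha` (entropy) and `beta` (cross-entropy) in an objective of the form E_p[r] + alpha * H(p) - beta * H(p, ref). The symbol names, the function names, and the toy numbers are all hypothetical and do not come from the text above.

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum_y p(y) log p(y)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -sum_y p(y) log q(y)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(p || q) = sum_y p(y) log(p(y) / q(y))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def decoupled_optimum(ref, reward, alpha, beta):
    """Maximizer of E_p[r] + alpha * H(p) - beta * H(p, ref) over distributions p.

    Assumed objective (illustrative). Setting the gradient of the Lagrangian
    to zero gives p(y) proportional to ref(y)**(beta / alpha) * exp(r(y) / alpha).
    With alpha == beta this reduces to the usual KL-regularized solution.
    """
    logits = [(r + beta * math.log(q)) / alpha for r, q in zip(reward, ref)]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]

# Toy example: three candidate responses (hypothetical numbers).
ref = [0.7, 0.2, 0.1]      # reference policy probabilities
reward = [1.0, 0.5, 0.0]   # rewards for each response

coupled = decoupled_optimum(ref, reward, alpha=0.5, beta=0.5)  # plain KL penalty
diverse = decoupled_optimum(ref, reward, alpha=1.0, beta=0.5)  # larger entropy weight

print("entropy with alpha == beta:", entropy(coupled))
print("entropy with alpha >  beta:", entropy(diverse))
```

Under these assumptions, raising `alpha` while holding `beta` fixed acts like a temperature on the combined logits `r + beta * log(ref)`, so the resulting policy has higher entropy than the coupled (plain KL) solution.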
Peer Reviews
Decision: ICLR 2025 Poster
Well organized, lots of empirical results, the suggested approach is simple and principled.
* At a high level, the authors attribute improved problem solving (which is actually not very evident in Figure 3) to improved diversity. Though I could imagine that diversity hypothetically can lead to improved problem-solving performance for methods that use inference-time scaling (because the model is essentially generating more hypotheses for solving a problem), it seems like Figure 3 is not quite telling this story.
* First of all, in the case of easy/medium difficulty problems, their sugge
- To the best of my knowledge, the proposed method, while obvious in hindsight (as most good things are), is novel
- The problem is well posed and mostly convincingly discussed. Acting on the entropy of the distribution has been gaining momentum in the community
- Experiments are quite comprehensive, though some baselines should probably be included. (This is the main reason I am giving a 6 instead of an 8) (edit: improved after rebuttal)
- The quality of the writing is outstanding
- The experimental setting could be stronger: DPL is mostly pitted against vanilla temperature scaling, which on its own does lead to bad outputs. However, top-k, top-p, and, more recently, min-p sampling are all relatively established solutions that should have been incorporated in the analysis. (edit: improved after rebuttal)
- Basic REINFORCE (with no KLD in the reward), which has become very popular again, should also have been compared against. This is more of a desideratum than a har
1. The theoretical analysis on KL-divergence sounds interesting.
2. The experimental design of the article is quite creative.
1. The objective of DPL (Eq. 9) appears interesting, yet the derivation process seems to have many inconsistencies, as detailed in the Questions section.
2. You should define Proposition 3.1, Corollary 3.2, and Proposition A for clarity.
3. The experiments lack comparisons with existing Diverse Preference Learning methods, such as [1,2], and are missing richer alignment baselines for analysis.
4. The experiments lack many implementation details necessary for readers to fully understand or reprod
Taxonomy
Topics: Topic Modeling · Sentiment Analysis and Opinion Mining · Expert finding and Q&A systems
