Preference Optimization with Multi-Sample Comparisons
Chaoqi Wang, Zhuokai Zhao, Chen Zhu, Karthik Abinav Sankararaman, Michal Valko, Xuefei Cao, Zhaorun Chen, Madian Khabsa, Yuxin Chen, Hao Ma, Sinong Wang

TL;DR
This paper introduces multi-sample comparison methods for post-training generative models. By comparing groups of samples rather than single samples, the methods capture group-wise characteristics such as diversity and bias, and they are reported to be more robust to label noise.
Contribution
The paper proposes two multi-sample preference optimization techniques, mDPO and mIPO, which extend existing single-sample preference methods to better optimize collective properties of a model's generations.
Findings
Multi-sample comparison improves diversity and bias optimization.
Multi-sample methods are more robust to label noise.
Multi-sample methods optimize collective characteristics better than single-sample approaches.
Abstract
Recent advancements in generative models, particularly large language models (LLMs) and diffusion models, have been driven by extensive pretraining on large datasets followed by post-training. However, current post-training methods such as reinforcement learning from human feedback (RLHF) and direct alignment from preference methods (DAP) primarily utilize single-sample comparisons. These approaches often fail to capture critical characteristics such as generative diversity and bias, which are more accurately assessed through multiple samples. To address these limitations, we introduce a novel approach that extends post-training to include multi-sample comparisons. To achieve this, we propose Multi-sample Direct Preference Optimization (mDPO) and Multi-sample Identity Preference Optimization (mIPO). These methods improve traditional DAP methods by focusing on group-wise characteristics.…
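As a rough illustration of what a group-wise comparison could look like, the sketch below replaces the single-sample log-ratio margin of DPO and IPO with a margin between group means. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the loss formulas, so the averaging over each group, the function names (`mdpo_loss`, `mipo_loss`), and the defaults for `beta` and `tau` are all hypothetical. Inputs are per-sample sequence log-probabilities under the policy and a frozen reference model.

```python
import math


def _log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def _mean_log_ratio(policy_logps, ref_logps):
    # Average of log pi_theta(y|x) - log pi_ref(y|x) over one group of samples.
    # ASSUMPTION: groups are summarized by a simple mean.
    diffs = [p - r for p, r in zip(policy_logps, ref_logps)]
    return sum(diffs) / len(diffs)


def _group_margin(pol_w, ref_w, pol_l, ref_l):
    # Preferred-group mean log-ratio minus dispreferred-group mean log-ratio.
    return _mean_log_ratio(pol_w, ref_w) - _mean_log_ratio(pol_l, ref_l)


def mdpo_loss(pol_w, ref_w, pol_l, ref_l, beta=0.1):
    # DPO-style logistic loss applied to the group-level margin (hypothetical form).
    return -_log_sigmoid(beta * _group_margin(pol_w, ref_w, pol_l, ref_l))


def mipo_loss(pol_w, ref_w, pol_l, ref_l, tau=0.1):
    # IPO-style squared loss pulling the group-level margin toward 1 / (2 * tau)
    # (hypothetical form).
    return (_group_margin(pol_w, ref_w, pol_l, ref_l) - 1.0 / (2.0 * tau)) ** 2


if __name__ == "__main__":
    # Toy log-probabilities for groups of k = 3 samples (made-up numbers).
    pol_w, ref_w = [-4.0, -5.0, -4.5], [-5.0, -5.5, -5.0]
    pol_l, ref_l = [-6.0, -6.5, -7.0], [-5.5, -6.0, -6.5]
    print("mDPO-style loss:", mdpo_loss(pol_w, ref_w, pol_l, ref_l))
    print("mIPO-style loss:", mipo_loss(pol_w, ref_w, pol_l, ref_l))
```

With groups of size one, both functions reduce to the usual single-sample DPO and IPO objectives for a single preference pair, which is the sense in which a group-wise loss of this kind extends the single-sample methods.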
Peer Reviews
Decision·Submitted to ICLR 2025
1. The paper is well-motivated and considers a novel use case for preference learning. 2. The method is clearly presented, and the theoretical analysis is clear and informative.
1. The experimental setup could be more clearly described, particularly with respect to the data seen for each method (e.g., is the number of pairings fixed, or the number of outputs or per-sample pairwise comparisons?). For instance, how are the datasets constructed for the single-sample and multi-sample methods in the label noise experiments? What data does SFT see in the uniformity/debiasing/diversity experiments? All the samples from the preferred set, or something else? Without an understan
1. The motivation for incorporating more nuanced human feedback is compelling. 2. The paper is well-written, with a clear and organized presentation. 3. Comprehensive experiments demonstrate the effectiveness of the proposed methodology across diverse applications, including random number generation, fair image generation, varied writing styles, and iterative LLM enhancement.
1. While the problem is well-motivated, the proposed approach appears somewhat straightforward and lacks substantial innovation. 2. The methodology relies on the assumption that humans can easily distinguish between different distributions of items. However, in practice, it may be more cognitively challenging for individuals to differentiate between two distributions, and gathering such data could be more costly.
1. **Clear motivation**: The authors identify an important problem in current alignment paradigms, and the running example of generating random numbers is intuitive and helps the reader understand the problem well. 2. **Thorough experiments**: The authors use several use cases to demonstrate the effectiveness of the proposed methods. The results are intuitive and convincing.
1. **Potentially higher dataset requirements**: It seems that in the datasets for multi-sample comparisons, one prompt should have k chosen-rejected pairs of responses. This makes many readily available datasets not applicable, and the datasets need to be adjusted for different k. 2. **Lack of discussion on k**: I think the selection of k deserves more explanation. Under what circumstances would a larger k benefit? 3. **Comparison on general tasks**: It is clear that under scenarios where dive
Taxonomy
Topics: Multi-Criteria Decision Making
Methods: Diffusion
