LIRE: listwise reward enhancement for preference alignment
Mingye Zhu, Yi Liu, Lei Zhang, Junbo Guo, Zhendong Mao

TL;DR
LIRE introduces a listwise reward optimization method for preference alignment in large language models, improving stability, scalability, and effectiveness over traditional pairwise approaches, especially in multi-response scenarios.
Contribution
The paper presents LIRE, a novel listwise reward enhancement technique that simplifies implementation, extends to multi-response settings, and includes a self-enhancement algorithm for better reward refinement.
Findings
LIRE outperforms existing methods on dialogue and summarization benchmarks.
LIRE demonstrates good transferability to out-of-distribution data.
The approach is easy to implement with minimal parameter tuning.
Abstract
Recently, tremendous strides have been made to align the generation of Large Language Models (LLMs) with human values to mitigate toxic or unhelpful content. Leveraging Reinforcement Learning from Human Feedback (RLHF) proves effective and is widely adopted by researchers. However, implementing RLHF is complex, and its sensitivity to hyperparameters renders achieving stable performance and scalability challenging. Furthermore, prevailing approaches to preference alignment primarily concentrate on pairwise comparisons, with limited exploration into multi-response scenarios, thereby overlooking the potential richness within the candidate pool. For the above reasons, we propose a new approach: Listwise Reward Enhancement for Preference Alignment (LIRE), a gradient-based reward optimization approach that incorporates the offline rewards of multiple responses into a streamlined listwise…
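The abstract is cut off before it states the objective, so the following is only a minimal sketch of what a listwise reward objective of this kind could look like, not the authors' exact formulation. It assumes that the policy's sequence log-probabilities for the candidate responses are turned into a softmax distribution over the list, and that the loss is the negative expected offline reward under that distribution. The function name `listwise_reward_loss` and the `temperature` parameter are hypothetical.

```python
import math


def listwise_reward_loss(seq_logprobs, rewards, temperature=1.0):
    """Negative expected reward under a softmax over candidate log-probs.

    seq_logprobs: sequence log-probability of each candidate response under
                  the current policy (assumed to be precomputed).
    rewards:      offline reward score of each candidate response.
    temperature:  hypothetical scaling factor for the log-probabilities.
    Returns (loss, probs), where probs is the distribution over candidates.
    """
    if len(seq_logprobs) != len(rewards):
        raise ValueError("seq_logprobs and rewards must have the same length")
    if not seq_logprobs:
        raise ValueError("at least one candidate response is required")

    # Numerically stable softmax over the (scaled) log-probabilities.
    scaled = [lp / temperature for lp in seq_logprobs]
    max_scaled = max(scaled)
    exps = [math.exp(s - max_scaled) for s in scaled]
    normalizer = sum(exps)
    probs = [e / normalizer for e in exps]

    # Minimizing this loss shifts probability mass toward high-reward candidates.
    loss = -sum(p * r for p, r in zip(probs, rewards))
    return loss, probs


if __name__ == "__main__":
    # Three candidate responses for one prompt, with made-up numbers.
    loss, probs = listwise_reward_loss([-12.0, -10.5, -11.0], [0.2, 0.9, 0.5])
    print(loss, probs)
```

In an actual training loop the log-probabilities would come from a differentiable model, so that the gradient of this loss updates the policy. The pure-Python version here only illustrates how the whole candidate list, rather than a single pair, enters one loss term.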
Peer Reviews
Decision: Submitted to ICLR 2024
1. The proposed methodology for moving from a pairwise approach to a list-wise approach for alignment is well motivated, especially with the ranking information between different LLM model generations becoming increasingly available. 2. The connection drawn between the proposed method and other direct policy improvement methods like DPO (Page 5) is quite informative in providing a different perspective for the gradient update step. 3. The improvements from LIRE are quite impressive, especially o
1. The authors claim that their proposed approach does not require a KL constraint. However, as presented in [1], without a KL constraint, alignment training would lead to a distributional collapse. In my opinion, by training on generations from other (somewhat aligned) LLMs, the authors implicitly leverage the KL constraint, and hence this claim seems a bit strong. 2. The generative distribution defined by the authors is very confusing (Equation 4). As per my understanding, this should be simi
* Alignment is an important problem in AI, and newer methods for alignment are welcome since they can potentially engender discussion and help the community progress. * Current methods rely on regularization objectives [PPO, SLiC] to ensure the aligned policy does not deviate from the anchor policy model; the problem of policy divergence and reward over-optimization is an important one. As such, contributions that improve the robustness of alignment techniques are welcome. * The authors benchmark
* The paper makes some very strong conjectures without substantial backing. One such instance is Section 5.5, where the authors claim that their objective implicitly includes the SFT objective. This is a very strong claim, and I do not believe this is the case. Unless the authors can demonstrate this mathematically, I would suggest they tone down their narrative. * The authors claim that adding SFT loss would prevent the model from reward over-optimization. This is incorrec
1. The paper investigates the impact of a listwise loss function within the Reinforcement Learning from Human Feedback (RLHF) framework. The study elucidates the benefits of the listwise approach, particularly in terms of stability and efficacy. 2. By drawing connections between the proposed listwise loss method and Direct Preference Optimization (DPO), the authors provide theoretical insights. This comparison helps in positioning the proposed method within the broader landscape of reinfor
1. The main novelty of this paper lies in the introduction of a listwise loss and the incorporation of reward scores into optimization. For the listwise loss, the DPO appendix also describes how to extend from binary preferences to multiple examples. The authors ignore this extension and simply treat DPO as limited to binary preferences. Without this comparison, the novelty of this paper is not clear. 2. The experimental results cannot clearly attribute performance improvements to the proposed components. More
Taxonomy
Topics: Data Management and Algorithms
Methods: ALIGN
