Direct Alignment of Language Models via Quality-Aware Self-Refinement

Runsheng Yu; Yong Wang; Xiaoqi Jiao; Youzhi Zhang; James T. Kwok

arXiv:2405.21040·cs.CL·June 3, 2024

Open Access · 3 Reviews

TL;DR

This paper introduces a quality-aware self-refinement method that improves Direct Preference Optimization (DPO), a technique for aligning large language models with human preferences, by using the model's own intrinsic knowledge to estimate the relative quality of responses.

Contribution

It proposes a novel refinement function that uses the LLM's own knowledge to improve the training process of DPO and IPO, leading to better alignment results.

Findings

1. Improved model performance over standard DPO and IPO.
2. Effective use of intrinsic LLM knowledge for self-refinement.
3. Enhanced alignment with human preferences.

Abstract

Reinforcement Learning from Human Feedback (RLHF) has been commonly used to align the behaviors of Large Language Models (LLMs) with human preferences. Recently, a popular alternative is Direct Preference Optimization (DPO), which replaces an LLM-based reward model with the policy itself, thus obviating the need for extra memory and training time to learn the reward model. However, DPO does not consider the relative qualities of the positive and negative responses, and can lead to sub-optimal training outcomes. To alleviate this problem, we investigate the use of intrinsic knowledge within the on-the-fly fine-tuning LLM to obtain relative qualities and help to refine the loss function. Specifically, we leverage the knowledge of the LLM to design a refinement function to estimate the quality of both the positive and negative responses. We show that the constructed refinement function can…
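To make the idea of a quality-aware loss concrete, the following is a minimal sketch of the standard per-pair DPO loss with an optional gap term subtracted inside the sigmoid. The abstract as shown here does not give the paper's refinement function, so `delta` is a hypothetical placeholder for a per-pair quality gap, and where it enters the loss is an assumption for illustration. This is not the authors' Sr-DPO objective. All function and argument names are illustrative.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    logp_w: float,      # log pi(y_w | x) under the policy being trained
    logp_l: float,      # log pi(y_l | x) under the policy being trained
    ref_logp_w: float,  # log pi_ref(y_w | x) under the frozen reference
    ref_logp_l: float,  # log pi_ref(y_l | x) under the frozen reference
    beta: float = 0.1,
    delta: float = 0.0,  # ASSUMPTION: hypothetical per-pair quality gap
) -> float:
    """Per-pair DPO loss; delta = 0 recovers the standard DPO loss."""
    # Implicit reward margin between the preferred and rejected responses.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # A positive delta demands a larger margin before the loss becomes small.
    return -log_sigmoid(margin - delta)


# Toy sequence log-probabilities, chosen only for illustration.
base = dpo_loss(-12.0, -15.0, -13.0, -14.0)
with_gap = dpo_loss(-12.0, -15.0, -13.0, -14.0, delta=0.5)
print(f"standard: {base:.4f}  with gap: {with_gap:.4f}")
```

With `delta = 0` every preference pair is treated identically, which is the limitation the abstract points to. A nonzero, pair-specific `delta` is one simple way a loss can be made to depend on how far apart the two responses are in quality.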

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 3 · Confidence 3

Strengths

- Relatively simple modification of the DPO (and IPO) objective that incorporates per-preference reweighting through a minor change in the input prompt.

Weaknesses

- It is unintuitive why a mere prompt would enforce or align with the margins of an oracle reward. It is well known from the previous literature that language models are poor at assigning numerical feedback the way humans do. This makes the choice of prompt in the refinement operator even less clear. Moreover, the results in Table 4 also suggest that the wording of the prompt does not matter much for performance.
- The naive variation of the refinement operator in Table 4 effectively reduces the $\

Reviewer 02 · Rating 5 · Confidence 4

Strengths

1. The paper introduces a novel method for refining the loss function using the intrinsic knowledge of LLMs, which is a good contribution to the field of model alignment.
2. The authors provide extensive experimental results that demonstrate the effectiveness of the proposed methods across different datasets, reinforcing the validity of their approach.
3. The paper includes a theoretical analysis that supports the design of the refinement function, providing a solid foundation for the proposed

Weaknesses

1. The baselines compared are insufficient. Many improvements to DPO introduce a gap between positive and negative samples; I won't list them all here, but the authors need to compare their approach with such methods.
2. Introducing a gap value into the DPO objective is not a very novel idea. The authors introduce the gap value through a prompt-augmented query $p+x$, but I argue that the choice of $p$ has a large influence on the final results. The related analysis is insufficient.
3. According to my ow

Reviewer 03 · Rating 5 · Confidence 4

Strengths

1. This paper is well written and easy to follow.
2. The proposed refinement function can be integrated into existing optimization frameworks like DPO and IPO, making it compatible with current research and practical applications.
3. The self-refined methods (Sr-DPO and Sr-IPO) show performance improvements compared to traditional DPO and IPO, indicating the potential for future RLHF developments.

Weaknesses

1. Lack of logical consistency. In the Abstract and Introduction sections, the authors emphasize utilizing the intrinsic knowledge of LLMs for self-refinement. However, in the Method section, they only introduce the addition of an extra prompt and the definition of a $\Delta$ objective based on the prompt, which is not convincing in demonstrating the application of LLMs' intrinsic knowledge.
2. Lack of novelty. As previously stated, the authors' method heavily relies on the introduction of an ex

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: Direct Preference Optimization · ALIGN