Fine-Tuning Language Models with Advantage-Induced Policy Alignment

Banghua Zhu; Hiteshi Sharma; Felipe Vieira Frujeri; Shi Dong; Chenguang Zhu; Michael I. Jordan; Jiantao Jiao

arXiv:2306.02231·cs.CL·November 6, 2023·5 cites

TL;DR

This paper introduces Advantage-Induced Policy Alignment (APA), a novel reinforcement learning algorithm that improves the stability, sample efficiency, and performance of language models trained with human feedback, outperforming PPO.

Contribution

The paper proposes a new algorithm, APA, that addresses PPO's limitations by using a squared-error loss based on estimated advantages, and supports it with both empirical and theoretical validation.
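To make the idea of an advantage-based squared-error loss concrete, here is a minimal sketch. It assumes the loss regresses the fine-tuned policy's log-probability onto the initial policy's log-probability plus the estimated advantage scaled by a coefficient, which is the shape suggested by the KL-regularized solution that Reviewer 03 mentions. The function name `apa_style_loss`, the parameter `lam`, and the plain-list inputs are hypothetical and are not taken from the paper or from the microsoft/rlhf-apa repository.

```python
def apa_style_loss(logp_theta, logp_init, advantages, lam=1.0):
    """Mean squared error between log pi_theta and log pi_init + A / lam.

    Assumed form for illustration only; inputs are per-sample floats for
    the sampled actions, and lam is a hypothetical regularization weight.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    if not (len(logp_theta) == len(logp_init) == len(advantages)):
        raise ValueError("all inputs must have the same length")
    if not logp_theta:
        raise ValueError("inputs must be non-empty")

    # Target log-probability implied by the initial policy and the advantage.
    targets = [lp0 + adv / lam for lp0, adv in zip(logp_init, advantages)]

    # Squared deviation of the current policy from that target, averaged.
    squared = [(lp - t) ** 2 for lp, t in zip(logp_theta, targets)]
    return sum(squared) / len(squared)


# With zero advantages, a policy equal to the initial policy has zero loss.
unchanged = apa_style_loss([-1.0, -2.0], [-1.0, -2.0], [0.0, 0.0])

# Non-zero advantages move the target away from the initial policy.
shifted = apa_style_loss([-1.0, -2.0], [-1.0, -2.0], [1.0, -1.0], lam=2.0)
```

Under this assumed form, a larger `lam` shrinks the advantage term and keeps the target closer to the initial policy, which is one way to read the claim that APA gives more direct control over deviation from the initial policy.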

Findings

01 APA outperforms PPO in language tasks
02 APA provides more stable control over policy deviation
03 APA enhances sample efficiency and model performance

Abstract

Reinforcement learning from human feedback (RLHF) has emerged as a reliable approach to aligning large language models (LLMs) to human preferences. Among the plethora of RLHF techniques, proximal policy optimization (PPO) is one of the most widely used methods. Despite its popularity, however, PPO may suffer from mode collapse, instability, and poor sample efficiency. We show that these issues can be alleviated by a novel algorithm that we refer to as Advantage-Induced Policy Alignment (APA), which leverages a squared error loss function based on the estimated advantages. We demonstrate empirically that APA consistently outperforms PPO in language tasks by a large margin, when a separate reward model is employed as the evaluator. In addition, compared with PPO, APA offers a more stable form of control over the deviation from the model's initial policy, ensuring that the model improves its…

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 5: marginally below the acceptance threshold · Confidence 4

Strengths

The research objective is clear and the paper is well-motivated. I have also noticed that PPO is sometimes unstable when training language models, and that model performance may drop as training goes on. The proposed algorithm is simple and seems effective on the evaluation tasks. I appreciate the authors' effort in providing theoretical justification for the design of the loss function and the convergence of the algorithm.

Weaknesses

The major reason I tend to reject is the scope of the evaluation tasks. As RLHF (with PPO) has been well verified on ChatGPT, which has great generalization ability, PPO can be utilized to train language models at scale. In addition to training language models, PPO has stood the test of time in many other areas such as robotics control. I believe that is why researchers use PPO to fine-tune many large language models. However, the evaluation tasks in this paper are confined to a specific domain. I think it i

Reviewer 02 · Rating 8: accept, good paper · Confidence 5

Strengths

- Good presentation of the idea in a clear way. The paper is well written and easy to follow.
- Experimentation. Multiple tasks are evaluated, with a good discussion and comparison of the RLHF tradeoff between KL from the SFT model and reward optimization.

Weaknesses

Minor comments:
- In the main text, there is a reference to a connection to soft Q-learning, but it seems that only the f-divergence interpretation is discussed.
- For the 125M parameter experiments, it seems that PPO is not properly tuned to optimize for the reward.

Reviewer 03 · Rating 3: reject, not good enough · Confidence 4

Strengths

1. The presentation and intuition are clear, following the theoretical solution of the KL-regularized optimization problem with several reasonable modifications.
2. Some of the empirical results consistently demonstrate that the proposed algorithm is promising in terms of stability, sample efficiency, and also reward optimization. The considered datasets are also standard in the RLHF literature.

Overall, I feel that the authors have proposed a promising alternative approach to the PPO algorithm. And it

Weaknesses

1. While I can follow the mathematical derivations, and it is great to see that the loss of APA is provably convergent, I am curious why the square loss is better than the KL-divergence (is it because of the guarantee provided in Theorem 1?). In a more general sense, as the new loss function can be viewed as a different f-divergence, and we know that with an f-divergence as the regularizer we can also obtain other variants of (3), do these f-divergences work better than KL? I

Code & Models

Repositories

microsoft/rlhf-apa
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Reinforcement Learning in Robotics

Methods: Adaptive Pseudo Augmentation · Entropy Regularization · Proximal Policy Optimization