Fine-Tuning Language Models with Advantage-Induced Policy Alignment
Banghua Zhu, Hiteshi Sharma, Felipe Vieira Frujeri, Shi Dong, Chenguang Zhu, Michael I. Jordan, Jiantao Jiao

TL;DR
This paper introduces Advantage-Induced Policy Alignment (APA), a novel reinforcement learning algorithm that improves the stability, sample efficiency, and performance of language models trained with human feedback, outperforming PPO.
Contribution
The paper proposes a new algorithm, APA, that addresses PPO's limitations (mode collapse, instability, and poor sample efficiency) by using a squared error loss based on estimated advantages, and supports it with both empirical and theoretical validation.
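As a rough sketch of where an advantage-based squared error loss can come from, consider the KL-regularized formulation that the reviews below refer to. The notation here is assumed for illustration and is not taken from this page: \pi_{\mathrm{init}} is the initial policy, A(s,a) an estimated advantage, \lambda > 0 a regularization weight, and Z(s) a normalizing constant. The last line additionally assumes that \log Z(s) is dropped (treated as approximately zero); the paper should be consulted for the exact statement.

```latex
% Sketch only: symbols and the Z(s) \approx 1 simplification are assumptions.
\begin{align*}
% KL-regularized advantage maximization at a state s
\max_{\pi_\theta}\;& \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[A(s,a)\big]
  - \lambda\,\mathrm{KL}\big(\pi_\theta(\cdot \mid s)\,\|\,\pi_{\mathrm{init}}(\cdot \mid s)\big) \\
% Closed-form maximizer of the objective above
\pi^\star(a \mid s) &= \frac{1}{Z(s)}\,\pi_{\mathrm{init}}(a \mid s)\,\exp\!\big(A(s,a)/\lambda\big) \\
% Squared error between \log \pi_\theta and \log \pi^\star, with \log Z(s) dropped
\mathcal{L}(\theta) &= \mathbb{E}_{(s,a)}\Big[\big(\log \pi_\theta(a \mid s)
  - \log \pi_{\mathrm{init}}(a \mid s) - A(s,a)/\lambda\big)^2\Big]
\end{align*}
```

Under this reading, \lambda sets how far the fitted policy may move from \pi_{\mathrm{init}}: a larger \lambda shrinks the advantage term and keeps the target closer to the initial policy.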
Findings
APA outperforms PPO in language tasks
APA provides more stable control over policy deviation
APA enhances sample efficiency and model performance
Abstract
Reinforcement learning from human feedback (RLHF) has emerged as a reliable approach to aligning large language models (LLMs) to human preferences. Among the plethora of RLHF techniques, proximal policy optimization (PPO) is among the most widely used methods. Despite its popularity, however, PPO may suffer from mode collapse, instability, and poor sample efficiency. We show that these issues can be alleviated by a novel algorithm that we refer to as Advantage-Induced Policy Alignment (APA), which leverages a squared error loss function based on the estimated advantages. We demonstrate empirically that APA consistently outperforms PPO in language tasks by a large margin, when a separate reward model is employed as the evaluator. In addition, compared with PPO, APA offers a more stable form of control over the deviation from the model's initial policy, ensuring that the model improves its…
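The squared error loss on estimated advantages mentioned in the abstract can be illustrated with a minimal NumPy sketch. Everything named here is an assumption made for illustration: the function name `apa_squared_error_loss`, the argument names, and the specific form of the target (initial log-probability plus advantage divided by a weight `lam`). It is not the authors' implementation, and it omits advantage estimation, batching, and gradient updates.

```python
import numpy as np


def apa_squared_error_loss(logp_theta, logp_init, advantages, lam=1.0):
    """Mean squared error between current log-probs and an advantage-shifted target.

    Hypothetical helper (not from the paper's code). Assumed target:
        log pi_init(a|s) + A(s, a) / lam
    where lam > 0 controls how far the policy may drift from the initial policy.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    logp_theta = np.asarray(logp_theta, dtype=float)
    logp_init = np.asarray(logp_init, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    # Target log-probability implied by the advantage-weighted initial policy.
    target = logp_init + advantages / lam
    # Squared deviation of the current policy's log-probs from that target.
    return float(np.mean((logp_theta - target) ** 2))


# Toy usage: the current policy already matches the target, so the loss is zero.
loss = apa_squared_error_loss(
    logp_theta=[-1.0, -2.0],
    logp_init=[-1.5, -1.5],
    advantages=[0.5, -0.5],
    lam=1.0,
)
print(loss)
```

In this sketch, increasing `lam` shrinks the advantage term, so the minimizer stays closer to the initial policy. This is one way to read the abstract's claim about controlling deviation from the initial policy.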
Peer Reviews
Decision·Submitted to ICLR 2024
The research objective is clear and the paper is well-motivated. I do notice that PPO is sometimes unstable for training language models, and that model performance may drop as training goes on. The proposed algorithm is simple and seems effective on the evaluation tasks. I appreciate the authors' effort in providing theoretical justification for the design of the loss function and the convergence of the algorithm.
The major reason I tend to reject is the scope of the evaluation tasks. As RLHF (with PPO) has been well verified on ChatGPT, which has great generalization ability, PPO can be utilized to train language models at scale. In addition to training language models, PPO has stood the test of time in many other areas, such as robotics control. I believe that is why researchers use PPO to fine-tune many large language models. However, the evaluation tasks in this paper lie in a specific domain. I think it i
- Good presentation of the idea in a clear way. The paper is well written and easy to follow.
- Experimentation. Multiple tasks are evaluated with a good discussion/comparison between the RLHF tradeoff of KL from SFT model as well as reward optimization.
Minor comments:
- In the main text, there is a reference to a connection to soft-q learning but it seems like only the f-divergence interpretation is discussed.
- For the 125M parameter experiments, it seems like PPO is not properly tuned to optimize for the reward.
1. The presentation and intuition are clear, following the theoretical solution of the KL-regularized optimization problem with several reasonable modifications.
2. Some of the empirical results consistently demonstrate that the proposed algorithm is promising in terms of stability, sample efficiency, and also reward optimization. The considered datasets are also standard in the RLHF literature. Overall, I feel that the authors have proposed a promising alternative approach to the PPO algorithm. And it
1. While I can understand the mathematical derivations along the line and it is great to see that the loss of APA is provably convergent, I am curious why the square loss is better than the KL-divergence (is it because of the guarantee provided in Theorem 1?). In a more general sense, as the new loss function can be viewed as a different f-divergence, and we know that with an f-divergence as the regularizer, we can also obtain other variants of (3), do these f-divergences work better than KL? I
Code & Models
- 🤗 berkeley-nest/Starling-LM-7B-alpha (model) · 3.8k dl · ♡ 559
- 🤗 LoneStriker/Starling-LM-7B-alpha-3.0bpw-h6-exl2 (model) · 2 dl
- 🤗 LoneStriker/Starling-LM-7B-alpha-4.0bpw-h6-exl2 (model) · 4 dl · ♡ 1
- 🤗 LoneStriker/Starling-LM-7B-alpha-5.0bpw-h6-exl2 (model) · 2 dl · ♡ 2
- 🤗 LoneStriker/Starling-LM-7B-alpha-6.0bpw-h6-exl2 (model) · 4 dl · ♡ 1
- 🤗 LoneStriker/Starling-LM-7B-alpha-8.0bpw-h8-exl2 (model) · 3 dl · ♡ 2
- 🤗 TheBloke/Starling-LM-7B-alpha-GGUF (model) · 1.8k dl · ♡ 94
- 🤗 TheBloke/Starling-LM-7B-alpha-AWQ (model) · 9 dl · ♡ 9
- 🤗 TheBloke/Starling-LM-7B-alpha-GPTQ (model) · 13 dl · ♡ 10
- 🤗 CallComply/Starling-LM-11B-alpha (model) · 642 dl · ♡ 15
Taxonomy
Topics: Topic Modeling · Reinforcement Learning in Robotics
Methods: Adaptive Pseudo Augmentation · Entropy Regularization · Proximal Policy Optimization
