SALSA: Soup-based Alignment Learning for Stronger Adaptation in RLHF
Atoosa Chegini, Hamid Kazemi, Iman Mirzadeh, Dong Yin, Maxwell Horton, Moin Nabi, Mehrdad Farajtabar, Keivan Alizadeh

TL;DR
SALSA introduces a weight-space averaged reference model for RLHF, enabling larger policy deviations and improved exploration, leading to better alignment, robustness, and out-of-distribution performance in large language models.
Contribution
The paper proposes SALSA, which builds the RLHF reference model by weight-space averaging (a model soup) and uses it for KL regularization in place of the frozen initial policy, giving the policy more room to explore than the traditional KL constraint allows.
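The weight-space averaging described above can be illustrated with a minimal sketch. It assumes each model is represented as a plain dictionary mapping parameter names to lists of floats; the `soup` function, the `StateDict` alias, and the toy models `model_a` and `model_b` are hypothetical names introduced only for illustration, and this summary does not state which checkpoints the paper actually averages.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for real model weights: parameter name -> flat list of values.
StateDict = Dict[str, List[float]]


def soup(state_dicts: Sequence[StateDict], weights: Sequence[float]) -> StateDict:
    """Return the weighted average of several parameter dictionaries."""
    if not state_dicts or len(state_dicts) != len(weights):
        raise ValueError("need at least one model and exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalize so the averaging coefficients sum to 1.
    coeffs = [w / total for w in weights]

    names = state_dicts[0].keys()
    for sd in state_dicts:
        if sd.keys() != names:
            raise ValueError("all models must share the same parameter names")

    averaged: StateDict = {}
    for name in names:
        tensors = [sd[name] for sd in state_dicts]
        if len({len(t) for t in tensors}) != 1:
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Average element by element across models.
        averaged[name] = [
            sum(c * v for c, v in zip(coeffs, values)) for values in zip(*tensors)
        ]
    return averaged


# Toy example: a uniform (equal-weight) soup of two models.
model_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
model_b = {"layer.weight": [3.0, 6.0], "layer.bias": [1.0]}
reference = soup([model_a, model_b], weights=[0.5, 0.5])
```

In this sketch the averaged dictionary would play the role of the reference model in the KL penalty. Unequal `weights` give the non-uniform ratios that the reviews mention as an ablation.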
Findings
SALSA achieves higher win rates than standard PPO with a frozen reference on multiple benchmarks.
Models trained with SALSA show improved robustness and generalization.
SALSA enables larger policy deviations without sacrificing stability.
Abstract
In Large Language Model (LLM) development, Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning models with human values and preferences. RLHF traditionally relies on the Kullback-Leibler (KL) divergence between the current policy and a frozen initial policy as a reference, which is added as a penalty in policy optimization algorithms like Proximal Policy Optimization (PPO). While this constraint prevents models from deviating too far from the initial checkpoint, it limits exploration of the reward landscape, reducing the model's ability to discover higher-quality solutions. As a result, policy optimization is often trapped in a narrow region of the parameter space, leading to suboptimal alignment and performance. This paper presents SALSA (Soup-based Alignment Learning for Stronger Adaptation), a novel approach designed to overcome these limitations by creating a…
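For orientation, the KL-penalized objective that the abstract describes is commonly written as below. The notation ($\pi_\theta$ for the policy, $\pi_{\mathrm{ref}}$ for the frozen reference, $r$ for the reward model, $\beta$ for the penalty coefficient, $\mathcal{D}$ for the prompt distribution) is standard RLHF notation assumed here, not taken from the paper.

$$
\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}\Big[\mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)\Big]
$$

Under the reading given in the TL;DR, SALSA keeps this form but swaps the reference for a weight-averaged model. With assumed averaging coefficients $\alpha_i$ over model parameters $\theta_i$:

$$
\theta_{\mathrm{soup}} = \sum_{i} \alpha_i\, \theta_i, \qquad \sum_{i} \alpha_i = 1, \qquad \pi_{\mathrm{ref}} \;\rightarrow\; \pi_{\theta_{\mathrm{soup}}}
$$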
Peer Reviews
Decision: Submitted to ICLR 2025
The idea of this paper is clear and straightforward. The experimental results also seem promising.
The authors do not sufficiently explain the observed phenomenon (see the Questions section), which makes the method less convincing. It would help if the authors could provide some theoretical analysis, even for a simple case study.
1. The experiments are well done. They demonstrate the advantage of using a model soup as the reference model in RLHF from several angles, and they explore different weighting ratios when averaging two or three models. When the models are weighted equally, the model aligned by the PPO algorithm performs better.
1. If the weight-averaged model performs better than the initial model, it is unsurprising that the model aligned by the PPO algorithm also performs better than the others.
1. The paper illustrates a simple phenomenon: applying model soups to the anchor policy for KL regularization can lead to improved win rates on many benchmarks.
2. The paper includes some ablations, including on how the policies are souped (finding that uniform souping is best). The authors also show that adding a separate KL term for each individual policy does not bring any benefit, whereas a single KL term to a souped policy does.
1. I think some claims made in the paper are not rigorously justified. For example, Figure 2 shows that reward increases with model souping, but it is also important to plot the KL with respect to a fixed anchor policy (whether pi_ref or pi_soups or both). In other words, it may be possible to get a higher reward simply by having a higher KL, and the experiments do not rule this out.
2. Related to (1), simply showing higher win rates in Table 1, without plotting the KL as well, is also not very convincing,
Taxonomy
Topics: Human Pose and Action Recognition
Methods: Entropy Regularization · Proximal Policy Optimization
