WARP: On the Benefits of Weight Averaged Rewarded Policies

Alexandre Ramé; Johan Ferret; Nino Vieillard; Robert Dadashi; Léonard Hussenot; Pierre-Louis Cedoz; Pier Giuseppe Sessa; Sertan Girgin; Arthur Douillard; Olivier Bachem

arXiv:2406.16768·cs.LG·June 25, 2024·1 cites

TL;DR

This paper introduces WARP, a novel method for aligning language models using weight averaging of policies at multiple stages, improving reward optimization while maintaining pre-trained knowledge.

Contribution

WARP proposes a new strategy that merges policies in weight space at three stages to better balance reward maximization and knowledge retention in RLHF.

Findings

01

WARP outperforms other open-source LLMs in reward and alignment quality.

02

Iterative weight averaging refines the reward-KL Pareto front effectively.

03

WARP improves policy performance at fixed KL divergence.

Abstract

Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) by encouraging their generations to have high rewards, using a reward model trained on human preferences. To prevent the forgetting of pre-trained knowledge, RLHF usually incorporates a KL regularization; this forces the policy to remain close to its supervised fine-tuned initialization, though it hinders the reward optimization. To tackle the trade-off between KL and reward, in this paper we introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP). WARP merges policies in the weight space at three distinct stages. First, it uses the exponential moving average of the policy as a dynamic anchor in the KL regularization. Second, it applies spherical interpolation to merge independently fine-tuned policies into a new enhanced one. Third, it linearly interpolates between this…
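
The weight-space operations named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each policy is a toy dictionary of flattened parameter lists, that the spherical interpolation acts on each policy's offset from a shared initialization, one parameter at a time, and that the EMA rate `mu` and interpolation coefficient `t` are free hyperparameters. The names `Weights`, `lerp`, `ema_update`, and `slerp` are hypothetical. Because the abstract is cut off before naming the endpoints of the third stage, `lerp` takes both endpoints as arguments.

```python
import math
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def lerp(a: Weights, b: Weights, t: float) -> Weights:
    """Parameter-wise linear interpolation, (1 - t) * a + t * b."""
    return {k: [(1 - t) * x + t * y for x, y in zip(a[k], b[k])] for k in a}


def ema_update(anchor: Weights, policy: Weights, mu: float) -> Weights:
    """Move the KL anchor a fraction mu toward the current policy."""
    return lerp(anchor, policy, mu)


def slerp(init: Weights, a: Weights, b: Weights, t: float = 0.5) -> Weights:
    """Spherically interpolate the offsets (a - init) and (b - init).

    Assumption: both policies share the initialization `init`, and the
    interpolation is done separately for each named parameter.
    """
    merged: Weights = {}
    for k in init:
        da = [x - x0 for x, x0 in zip(a[k], init[k])]
        db = [x - x0 for x, x0 in zip(b[k], init[k])]
        na = math.sqrt(sum(v * v for v in da))
        nb = math.sqrt(sum(v * v for v in db))
        ca, cb = 1.0 - t, t  # linear fallback for degenerate angles
        if na > 0.0 and nb > 0.0:
            cos = sum(u * v for u, v in zip(da, db)) / (na * nb)
            omega = math.acos(max(-1.0, min(1.0, cos)))
            if math.sin(omega) > 1e-8:
                ca = math.sin((1.0 - t) * omega) / math.sin(omega)
                cb = math.sin(t * omega) / math.sin(omega)
        merged[k] = [x0 + ca * u + cb * v for x0, u, v in zip(init[k], da, db)]
    return merged


init = {"w": [0.0, 0.0]}
policy_a = {"w": [1.0, 0.0]}  # two hypothetical fine-tuned policies
policy_b = {"w": [0.0, 1.0]}

anchor = ema_update(init, policy_a, mu=0.1)      # anchor drifts toward the policy
merged = slerp(init, policy_a, policy_b, t=0.5)  # spherical merge of the two offsets
averaged = lerp(policy_a, policy_b, 0.5)         # plain parameter-wise average
```

In this toy example the two offsets are orthogonal unit vectors, so the spherical merge keeps unit norm while the plain parameter-wise average shrinks it to about 0.71.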

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

- The paper is clearly written and easy to understand. Using an exponential moving average of the base policy in RLHF should be less conservative than using a fixed target.
- The empirical analysis is robust, featuring comparisons against state-of-the-art models and demonstrating WARP's effectiveness in balancing KL regularization and reward optimization.
- The paper considers a wide variety of tasks to demonstrate potential improvements.

Weaknesses

- Application of existing techniques to RLHF: The novelty is limited; the techniques used in the paper were almost all proposed earlier in the deep learning or reinforcement learning literature. For example, the EMA anchor uses ideas from trust-region reinforcement learning algorithms such as TRPO [1]; SLERP uses ideas from model merging by weight averaging [2,3]; LITI uses ideas from WiSE-FT [4]. Given previous work that already applies similar ideas to reward modeling in RLHF [5], the novelty is further weak

Reviewer 02 · Rating 5 · Confidence 3

Strengths

- This paper is well-written and easy to follow. Extensive related work is provided to help readers understand the research problem.
- The proposed methods seem to be simple yet effective in balancing the trade-off between KL divergence and reward.
- Extensive numerical results are provided to justify the proposed method.

Weaknesses

- The main drawback of this paper is that techniques like exponential moving average and SLERP are well-known, which renders the technical novelty insufficient.
- Despite extensive numerical evaluation, it is unclear why model averaging helps balance the trade-off between KL divergence and reward. According to the reviewer’s understanding, achieving this trade-off effectively requires the policy to explore intelligently and update gradually during initialization. However, it is unclear how this

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. This work has a good presentation and is easy to understand.
2. The method is shown clearly. The intuition is given and the ablation study for each stage is given.
3. Many experimental results are given to support the method.

Weaknesses

1. I think it would be better to add more of an introduction to weight averaging. Since I am not very familiar with it, I wonder whether the method averages each parameter individually or uses some other technique.
2. I hope to get some intuition for why WARP can reach a Pareto front, which seems non-trivial to me.

Code & Models

Repositories

zokost/warp_implementation
pytorch
