Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model

Wenhong Zhu; Zhiwei He; Xiaofeng Wang; Pengfei Liu; Rui Wang

arXiv:2410.18640 · cs.CL · March 7, 2025

TL;DR

This paper introduces Weak-to-Strong Preference Optimization (WSPO), a method that transfers alignment behavior from a weaker language model to a stronger one, improving the stronger model's results on Arena-Hard, AlpacaEval 2, and MT-bench.

Contribution

The paper proposes WSPO, which aligns a strong model by learning the distribution differences of a weak model before and after its alignment, and reports gains on Arena-Hard, AlpacaEval 2, and MT-bench.

Findings

01

WSPO improves Qwen2-7B-Instruct's win rate on Arena-Hard from 39.70 to 49.60.

02

Achieves a length-controlled win rate of 47.04 on AlpacaEval 2.

03

Scores 7.33 on MT-bench, outperforming previous methods.

Abstract

Aligning language models (LMs) with human preferences has become a key area of research, enabling these models to better meet diverse user needs. Inspired by weak-to-strong generalization, where a strong LM fine-tuned on labels generated by a weaker model can consistently outperform its weak supervisor, we extend this idea to model alignment. In this work, we observe that the alignment behavior of weaker models can be effectively transferred to stronger models and can even exhibit an amplification effect. Based on this insight, we propose a method called Weak-to-Strong Preference Optimization (WSPO), which achieves strong model alignment by learning the distribution differences before and after the alignment of the weak model. Experiments demonstrate that WSPO delivers outstanding performance, improving the win rate of Qwen2-7B-Instruct on Arena-Hard from 39.70 to 49.60, achieving a…
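The abstract describes WSPO only at a high level, so the following is a minimal sketch of one possible reading of "learning the distribution differences before and after the alignment of the weak model", not the paper's stated objective. It assumes that the difference is measured as a sequence-level log-probability ratio, that the strong model's own ratio is regressed onto the weak model's ratio with a squared error, and that a scaling factor `gamma` is applied. The function names, the `gamma` parameter, and the squared-error form are all hypothetical choices made for illustration.

```python
from typing import Sequence


def log_ratio(logp_after: Sequence[float], logp_before: Sequence[float]) -> float:
    """Sequence-level log ratio: sum of per-token log-prob differences."""
    if len(logp_after) != len(logp_before):
        raise ValueError("both models must score the same tokens")
    return sum(a - b for a, b in zip(logp_after, logp_before))


def weak_to_strong_loss(
    strong_policy: Sequence[float],  # per-token log-probs, strong model being trained
    strong_ref: Sequence[float],     # per-token log-probs, frozen strong model
    weak_aligned: Sequence[float],   # per-token log-probs, weak model after alignment
    weak_ref: Sequence[float],       # per-token log-probs, weak model before alignment
    gamma: float = 1.0,              # hypothetical scaling factor (an assumption)
) -> float:
    """Squared gap between the strong and weak log-ratio shifts (assumed form)."""
    target = log_ratio(weak_aligned, weak_ref)                # shift the weak model learned
    current = gamma * log_ratio(strong_policy, strong_ref)    # shift the strong model has so far
    return (current - target) ** 2


# Toy log-probabilities for a two-token response (illustrative values only).
weak_before = [-1.5, -2.5]
weak_after = [-1.0, -2.0]
strong_before = [-1.0, -1.5]

loss_untrained = weak_to_strong_loss(strong_before, strong_before, weak_after, weak_before)
loss_matched = weak_to_strong_loss([-0.5, -1.0], strong_before, weak_after, weak_before)
```

In this toy setup the loss is positive while the strong model still equals its reference and drops to zero once its log-ratio shift matches the weak model's. In practice the per-token log-probabilities would come from model forward passes rather than hand-written lists.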

Taxonomy

Topics: Multi-Criteria Decision Making