Xwin-LM: Strong and Scalable Alignment Practice for LLMs
Bolin Ni, JingCheng Hu, Yixuan Wei, Houwen Peng, Zheng Zhang, Gaofeng Meng, Han Hu

TL;DR
Xwin-LM introduces a comprehensive, scalable suite of alignment techniques for large language models, combining supervised finetuning, reward modeling, rejection sampling, and preference optimization to enhance model alignment and performance.
Contribution
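The paper presents an integrated alignment pipeline for LLMs that chains supervised finetuning, reward modeling, rejection sampling finetuning, and direct preference optimization, supported by two large-scale preference datasets (Xwin-Pair and Xwin-Set), and reports consistent improvements on AlpacaEval and MT-bench.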
Findings
Consistent performance improvements on AlpacaEval and MT-bench
Effective scaling of reward models up to 70B parameters
Successful application of DPO for further model optimization
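As an illustration of the DPO step named in the findings, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It is not taken from the Xwin-LM code: the function names, the argument names, and the default beta=0.1 are assumptions made for this example.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response
    under either the trained policy or the frozen reference model.
    """
    # Implicit reward of each response: how far the policy has moved
    # away from the reference model on that response.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-logits)
```

The loss equals log 2 when the policy matches the reference model, and it falls as the policy assigns relatively more probability to the chosen response than the reference does.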
Abstract
In this work, we present Xwin-LM, a comprehensive suite of alignment methodologies for large language models (LLMs). This suite encompasses several key techniques, including supervised finetuning (SFT), reward modeling (RM), rejection sampling finetuning (RS), and direct preference optimization (DPO). The key components are as follows: (1) Xwin-LM-SFT, models initially finetuned with high-quality instruction data; (2) Xwin-Pair, a large-scale, multi-turn preference dataset meticulously annotated using GPT-4; (3) Xwin-RM, reward models trained on Xwin-Pair, developed at scales of 7B, 13B, and 70B parameters; (4) Xwin-Set, a multiwise preference dataset in which each prompt is linked to 64 unique responses generated by Xwin-LM-SFT and scored by Xwin-RM; (5) Xwin-LM-RS, models finetuned with the highest-scoring responses from Xwin-Set; (6) Xwin-LM-DPO, models further optimized on Xwin-Set…
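The role of Xwin-Set described in the abstract can be sketched as follows: given several reward-scored responses per prompt, keep the highest-scoring one for rejection sampling finetuning. The data layout and function names here are hypothetical, the toy example uses three responses per prompt rather than 64, and pairing the highest-scoring response against the lowest-scoring one for DPO is an assumption of this sketch, since the abstract does not say how pairs are formed.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: prompt -> list of (response, reward_score).
ScoredSet = Dict[str, List[Tuple[str, float]]]


def select_best(scored: ScoredSet) -> Dict[str, str]:
    """Keep the highest-scoring response per prompt (best-of-n)."""
    return {
        prompt: max(candidates, key=lambda c: c[1])[0]
        for prompt, candidates in scored.items()
        if candidates
    }


def build_preference_pairs(scored: ScoredSet) -> List[Tuple[str, str, str]]:
    """Return (prompt, chosen, rejected) triples.

    Assumption: chosen is the top-scored response and rejected is the
    bottom-scored one; prompts with fewer than two responses are skipped.
    """
    pairs = []
    for prompt, candidates in scored.items():
        if len(candidates) < 2:
            continue
        ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
        pairs.append((prompt, ranked[0][0], ranked[-1][0]))
    return pairs


# Toy stand-in for reward-model output; scores are invented.
toy_set: ScoredSet = {
    "Explain DPO in one sentence.": [
        ("response A", 0.2),
        ("response B", 1.7),
        ("response C", -0.5),
    ],
}

sft_targets = select_best(toy_set)
dpo_pairs = build_preference_pairs(toy_set)
```

In this sketch, sft_targets would feed a rejection sampling finetuning run and dpo_pairs a DPO run; a real pipeline would also need the reward model call that produces the scores.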
Taxonomy
Topics: Digital Rights Management and Security
Methods: Direct Preference Optimization
