BPO: Staying Close to the Behavior LLM Creates Better Online LLM Alignment

Wenda Xu; Jiachen Li; William Yang Wang; Lei Li

arXiv:2406.12168·cs.LG·October 23, 2024

TL;DR

This paper introduces online Preference Optimization in proximity to the Behavior LLM (BPO), a method that improves large language model alignment by keeping the learned LLM close to the behavior LLM that collects the training samples, leading to significant performance gains across various tasks.

Contribution

The paper proposes BPO, a novel online direct alignment from preferences (DAP) algorithm that emphasizes constructing a proper trust region around the behavior LLM to enhance LLM alignment, outperforming traditional offline methods.

Findings

01. BPO significantly improves performance on multiple tasks.

02. Online BPO increases win rates against human references.

03. Integrating BPO with existing DAP methods yields consistent gains.

Abstract

Direct alignment from preferences (DAP) has emerged as a promising paradigm for aligning large language models (LLMs) to human desiderata from pre-collected, offline preference datasets. While recent studies indicate that existing offline DAP methods can directly benefit from online training samples, we highlight the need to develop specific online DAP algorithms to fully harness the power of online training. Specifically, we identify that the learned LLM should adhere to the proximity of the behavior LLM, which collects the training samples. To this end, we propose online Preference Optimization in proximity to the Behavior LLM (BPO), emphasizing the importance of constructing a proper trust region for LLM alignment. We conduct extensive experiments to validate the effectiveness and applicability of our approach by integrating it with various DAP methods, resulting in significant…
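The proximity idea in the abstract can be illustrated with a small sketch. It assumes a DPO-style pairwise loss, which is one common DAP objective, and it assumes that the reference terms are taken from the behavior LLM, meaning the model that generated the sampled responses. The function names, the `beta` value, and the log-probabilities used below are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_style_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    behavior_chosen_logp: float,
    behavior_rejected_logp: float,
    beta: float = 0.1,  # hypothetical strength of the proximity term
) -> float:
    """DPO-style loss with the behavior LLM as reference (an assumption).

    Each argument is the summed token log-probability of a full response
    under the learned policy or under the behavior LLM.
    """
    # Log-ratios measure how far the learned policy has moved away from
    # the behavior LLM on the preferred and dispreferred responses.
    chosen_ratio = policy_chosen_logp - behavior_chosen_logp
    rejected_ratio = policy_rejected_logp - behavior_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return -log_sigmoid(margin)


# When the policy equals the behavior LLM, the margin is zero.
loss_at_behavior = dpo_style_loss(-5.0, -7.0, -5.0, -7.0)

# Raising the chosen response and lowering the rejected one reduces the loss.
loss_after_update = dpo_style_loss(-4.0, -8.0, -5.0, -7.0)
```

In this sketch, an offline setup would keep the reference terms fixed to an initial model, while the online reading refreshes them from whichever model produced the current batch of samples. How often the behavior LLM is updated is a design choice that the sketch leaves open.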

Code & Models

Repositories

xu1998hz/bpo (PyTorch, official)

Taxonomy

Topics: Text and Document Classification Technologies · Data Mining Algorithms and Applications · Recommender Systems and Techniques