BPO: Staying Close to the Behavior LLM Creates Better Online LLM Alignment
Wenda Xu, Jiachen Li, William Yang Wang, Lei Li

TL;DR
This paper introduces online Preference Optimization in proximity to the Behavior LLM (BPO), an online alignment method that keeps the learned LLM close to the behavior LLM that collects the training samples. The authors report significant performance gains across various tasks.
Contribution
The paper proposes BPO, a novel online algorithm for direct alignment from preferences (DAP). BPO emphasizes constructing a proper trust region around the behavior LLM to improve LLM alignment, and is reported to outperform traditional offline methods.
Findings
BPO significantly improves performance on multiple tasks.
Online BPO increases win rates against human references.
Integrating BPO with existing DAP methods yields consistent gains.
Abstract
Direct alignment from preferences (DAP) has emerged as a promising paradigm for aligning large language models (LLMs) to human desiderata from pre-collected, offline preference datasets. While recent studies indicate that existing offline DAP methods can directly benefit from online training samples, we highlight the need to develop specific online DAP algorithms to fully harness the power of online training. Specifically, we identify that the learned LLM should adhere to the proximity of the behavior LLM, which collects the training samples. To this end, we propose online Preference Optimization in proximity to the Behavior LLM (BPO), emphasizing the importance of constructing a proper trust region for LLM alignment. We conduct extensive experiments to validate the effectiveness and applicability of our approach by integrating it with various DAP methods, resulting in significant…
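The core idea in the abstract, keeping the learned LLM near the behavior LLM that collected the samples, can be illustrated with a minimal sketch. It assumes a DPO-style pairwise loss as the underlying DAP objective and simply takes the reference log-probabilities from the behavior LLM. The abstract does not state the exact loss, so the function name, the argument names, and the `beta` value are hypothetical, and this is an illustration rather than the authors' implementation.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dap_loss_near_behavior(logp_chosen, logp_rejected,
                           behavior_logp_chosen, behavior_logp_rejected,
                           beta=0.1):
    """DPO-style pairwise loss with the behavior LLM as the reference.

    Assumption (not from the source): the trust region is expressed by
    measuring log-ratios against the behavior LLM's log-probabilities.
    All inputs are sequence-level log-probabilities (floats).
    """
    # Log-ratio of the learned policy to the behavior policy per response.
    chosen_ratio = logp_chosen - behavior_logp_chosen
    rejected_ratio = logp_rejected - behavior_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


# When the learned policy matches the behavior policy, the margin is zero.
loss_at_behavior = dap_loss_near_behavior(-12.0, -15.0, -12.0, -15.0)

# Raising the chosen response's likelihood relative to the behavior policy
# lowers the loss.
loss_after_update = dap_loss_near_behavior(-10.0, -16.0, -12.0, -15.0)
```

In this sketch the loss is log 2 when the learned and behavior policies agree, and it decreases as the learned policy prefers the chosen response more strongly than the behavior policy does. In an online setting, the behavior log-probabilities would presumably come from whichever model generated the current batch of samples.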
Taxonomy
Topics: Text and Document Classification Technologies · Data Mining Algorithms and Applications · Recommender Systems and Techniques
