SGPO: Self-Generated Preference Optimization based on Self-Improver
Hyeonji Lee, Daejin Jo, Seohwan Yun, Sungwoong Kim

TL;DR
SGPO is an on-policy self-improving framework for aligning large language models with human preferences. It removes the reliance on external preference data and improves response quality through self-generated feedback.
Contribution
The paper presents SGPO, a unified on-policy self-improvement method that refines responses and generates preference data internally, advancing alignment without external datasets.
Findings
SGPO outperforms DPO and baseline methods on AlpacaEval 2.0.
The self-improver effectively enhances response quality.
No external preference data is needed for training.
Abstract
Large language models (LLMs), despite their extensive pretraining on diverse datasets, require effective alignment to human preferences for practical and reliable deployment. Conventional alignment methods typically employ off-policy learning and depend on human-annotated datasets, which limits their broad applicability and introduces distribution shift issues during training. To address these challenges, we propose Self-Generated Preference Optimization based on Self-Improver (SGPO), an innovative alignment framework that leverages an on-policy self-improving mechanism. Specifically, the improver refines responses from a policy model to self-generate preference data for direct preference optimization (DPO) of the policy model. Here, the improver and policy are unified into a single model, and in order to generate higher-quality preference data, this self-improver learns to make…
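The preference-generation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The helper `build_preference_pair`, the `improve` callable, and the choice to treat the refined response as "chosen" and the original policy response as "rejected" are assumptions made for illustration; the loss is the standard DPO objective computed on scalar sequence log-probabilities, with `beta=0.1` as an arbitrary default.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probs."""
    # Log-ratio of policy to reference model for each response.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) = log(1 + exp(-x)), evaluated in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


def build_preference_pair(prompt, policy_response, improve):
    """Hypothetical helper: pair a policy response with its refinement.

    `improve` stands in for the self-improver (assumed signature:
    improve(prompt, response) -> refined response). Labeling the refined
    response as preferred is an assumption of this sketch.
    """
    refined = improve(prompt, policy_response)
    return {"prompt": prompt, "chosen": refined, "rejected": policy_response}
```

In a real training loop the log-probabilities would come from the policy and a frozen reference model, and the same model would play the role of `improve`, since the abstract states that the improver and policy are unified into a single model.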
