SGPO: Self-Generated Preference Optimization based on Self-Improver

Hyeonji Lee; Daejin Jo; Seohwan Yun; Sungwoong Kim

arXiv:2507.20181·cs.CL·July 29, 2025

TL;DR

SGPO is an on-policy self-improving framework for aligning large language models with human preferences. A single model acts as both policy and improver and generates its own preference data, removing the reliance on external preference datasets.

Contribution

The paper presents SGPO, an on-policy self-improvement method in which a single unified model refines its own responses and generates the preference data used for DPO, so alignment requires no external preference dataset.

Findings

01

SGPO outperforms DPO and baseline methods on AlpacaEval 2.0.

02

Self-improver effectively enhances response quality.

03

No external preference data needed for training.

Abstract

Large language models (LLMs), despite their extensive pretraining on diverse datasets, require effective alignment to human preferences for practical and reliable deployment. Conventional alignment methods typically employ off-policy learning and depend on human-annotated datasets, which limits their broad applicability and introduces distribution shift issues during training. To address these challenges, we propose Self-Generated Preference Optimization based on Self-Improver (SGPO), an innovative alignment framework that leverages an on-policy self-improving mechanism. Specifically, the improver refines responses from a policy model to self-generate preference data for direct preference optimization (DPO) of the policy model. Here, the improver and policy are unified into a single model, and in order to generate higher-quality preference data, this self-improver learns to make…
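The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract is truncated, so the pairing rule (refined response as "chosen", original draft as "rejected") is an assumption, and the names `PreferencePair`, `build_pairs`, `generate`, `improve`, and the default `beta=0.1` are hypothetical. The `dpo_loss` function is the standard per-example DPO objective computed from sequence log-probabilities.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # assumption: the improver's refined response
    rejected: str  # assumption: the policy's original draft


def build_pairs(
    prompts: List[str],
    generate: Callable[[str], str],      # hypothetical: policy sampling
    improve: Callable[[str, str], str],  # hypothetical: same model acting as improver
) -> List[PreferencePair]:
    """Self-generate preference pairs on-policy (assumed pairing rule)."""
    pairs = []
    for prompt in prompts:
        draft = generate(prompt)
        refined = improve(prompt, draft)
        # Skip prompts where the improver returned the draft unchanged,
        # since identical responses carry no preference signal.
        if refined != draft:
            pairs.append(PreferencePair(prompt, chosen=refined, rejected=draft))
    return pairs


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed default; not taken from the paper
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable softplus(-margin), equal to -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In a real training loop, `generate` and `improve` would both call the same model with different prompts, and the log-probabilities passed to `dpo_loss` would come from the current policy and a frozen reference copy of it.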
