Self-Evolution Fine-Tuning for Policy Optimization
Ruijun Chen, Jiehao Liang, Shiping Gao, Fanqi Wan, Xiaojun Quan

TL;DR
This paper introduces self-evolution fine-tuning (SEFT), a novel method for aligning large language models that uses unannotated data and an adaptive reviser to improve responses without requiring costly annotations.
Contribution
SEFT eliminates the need for annotated samples in policy optimization by using an adaptive reviser and unannotated data, improving stability and efficiency over existing methods.
Findings
SEFT outperforms traditional fine-tuning and RLHF on benchmarks.
SEFT effectively leverages unlimited unannotated data.
SEFT maintains high response quality with reduced annotation effort.
Abstract
The alignment of large language models (LLMs) is crucial not only for unlocking their potential in specific tasks but also for ensuring that responses meet human expectations and adhere to safety and ethical principles. Current alignment methodologies face considerable challenges. For instance, supervised fine-tuning (SFT) requires extensive, high-quality annotated samples, while reinforcement learning from human feedback (RLHF) is complex and often unstable. In this paper, we introduce self-evolution fine-tuning (SEFT) for policy optimization, with the aim of eliminating the need for annotated samples while retaining the stability and efficiency of SFT. SEFT first trains an adaptive reviser to elevate low-quality responses while maintaining high-quality ones. The reviser then gradually guides the policy's optimization by fine-tuning it with enhanced responses. One of the prominent…
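The two-stage flow described in the abstract (a reviser improves the policy's own responses to unannotated prompts, and the policy is then fine-tuned on the revised responses) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the names `build_sft_pairs`, `toy_policy`, and `toy_reviser` are hypothetical, and the toy functions stand in for real language models.

```python
from typing import Callable, List, Tuple

# Hypothetical type aliases: a policy maps a prompt to a response,
# a reviser maps (prompt, response) to a possibly improved response.
Policy = Callable[[str], str]
Reviser = Callable[[str, str], str]


def build_sft_pairs(
    prompts: List[str], policy: Policy, reviser: Reviser
) -> List[Tuple[str, str]]:
    """Collect (prompt, revised response) pairs from unannotated prompts.

    The resulting pairs would serve as targets for an ordinary supervised
    fine-tuning step on the policy (not shown here).
    """
    pairs = []
    for prompt in prompts:
        draft = policy(prompt)             # the policy's own response
        revised = reviser(prompt, draft)   # reviser improves it or keeps it
        pairs.append((prompt, revised))
    return pairs


# Toy stand-ins (assumptions for illustration, not real models).
def toy_policy(prompt: str) -> str:
    return "draft: " + prompt.lower()


def toy_reviser(prompt: str, response: str) -> str:
    # Rewrite responses marked as drafts; leave all others untouched,
    # mirroring "elevate low-quality responses while maintaining
    # high-quality ones".
    if response.startswith("draft: "):
        return "revised: " + response[len("draft: "):]
    return response


if __name__ == "__main__":
    for prompt, target in build_sft_pairs(["What is SFT?"], toy_policy, toy_reviser):
        print(prompt, "->", target)
```

In practice both callables would wrap model inference, and the loop would presumably be repeated over several rounds so that the reviser guides the policy gradually, as the abstract suggests.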
Taxonomy
Topics: Simulation Techniques and Applications
Methods: Shrink and Fine-Tune
