Self-Evolutionary Large Language Models through Uncertainty-Enhanced Preference Optimization
Jianing Wang, Yang Zhou, Xiaocheng Zhang, Mengjiao Bao, Peng Yan

TL;DR
This paper introduces UPO, a framework that enhances large language model self-evolution by reducing noisy preference data through uncertainty estimation, leading to more reliable feedback and improved optimization performance.
Contribution
The paper proposes an uncertainty-enhanced preference optimization method that uses Monte Carlo dropout in a Bayesian neural network to estimate the uncertainty of preference pairs, mitigating noise and bias in iterative LLM training.
Findings
Significant performance improvements over existing methods.
Effective reduction of noisy preference data.
Enhanced robustness in LLM self-evolution.
Abstract
Iterative preference optimization has recently become one of the de facto training paradigms for large language models (LLMs), but its performance is still underwhelming because too much noisy preference data is produced in the loop. To combat this issue, we present an Uncertainty-enhanced Preference Optimization (UPO) framework that makes the LLM self-evolve with reliable feedback. The key idea is to mitigate the noisy preference data derived from the current policy and reward models by performing pair-wise uncertainty estimation and judiciously sampling reliable feedback. To reach this goal, we introduce an estimator model that incorporates Monte Carlo (MC) dropout in a Bayesian neural network (BNN) to perform uncertainty estimation for the preference data derived from the LLM policy. Compared to the existing methods that directly filter generated responses…
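The abstract describes pair-wise uncertainty estimation with MC dropout, followed by sampling of reliable feedback. The sketch below illustrates that general idea and is not the authors' implementation. The toy two-layer estimator, the function names `mc_dropout_preference` and `select_reliable`, the feature encoding, and both thresholds are all assumptions made for illustration. Dropout stays active at inference, several stochastic forward passes are run per preference pair, and the variance across passes serves as the uncertainty score.

```python
import numpy as np

def mc_dropout_preference(features, w1, w2, n_passes=20, drop_rate=0.1, seed=0):
    """Mean preference probability and variance over stochastic forward passes.

    features: (n_pairs, d) array, a hypothetical encoding of each
              (prompt, chosen, rejected) triple.
    w1, w2:   weights of a toy two-layer estimator, shapes (d, h) and (h,).
    """
    rng = np.random.default_rng(seed)
    probs = np.empty((n_passes, features.shape[0]))
    for t in range(n_passes):
        hidden = np.maximum(features @ w1, 0.0)        # ReLU activation
        mask = rng.random(hidden.shape) >= drop_rate   # dropout kept ON at inference
        hidden = hidden * mask / (1.0 - drop_rate)     # inverted-dropout rescaling
        logits = hidden @ w2
        probs[t] = 1.0 / (1.0 + np.exp(-logits))       # P(chosen preferred over rejected)
    return probs.mean(axis=0), probs.var(axis=0)

def select_reliable(mean_prob, variance, max_variance=0.01, min_confidence=0.6):
    """Indices of pairs that are both confident and low-uncertainty.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    keep = (variance <= max_variance) & (mean_prob >= min_confidence)
    return np.flatnonzero(keep)
```

In an iterative loop, only the pairs returned by `select_reliable` would be passed to the next round of preference optimization. The paper's actual estimator architecture and sampling rule may differ from this thresholding scheme.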
Taxonomy
Topics: Topic Modeling
Methods: Dropout
