$i$REPO: $i$mplicit Reward Pairwise Difference based Empirical Preference Optimization
Long Tan Le, Han Shu, Tung-Anh Nguyen, Choong Seon Hong, Nguyen H. Tran

TL;DR
The paper introduces $i$REPO, a novel alignment framework for large language models that uses implicit reward difference regression with self-generated data, improving alignment and outperforming existing methods.
Contribution
The paper proposes $i$REPO, a preference optimization method that aligns LLMs by regressing on implicit reward pairwise differences, and supports it with theoretical guarantees.
Findings
$i$REPO outperforms baseline preference optimization methods.
Self-alignment is effective when driven by self-generated responses and AI annotator logits.
The paper provides theoretical guarantees for optimality and a practical performance-gap analysis.
Abstract
While astonishingly capable, Large Language Models (LLMs) can sometimes produce outputs that deviate from human expectations. Such deviations necessitate an alignment phase to prevent the dissemination of untruthful, toxic, or biased information. Traditional alignment methods based on reinforcement learning often struggle with instability, whereas preference optimization methods are limited by their overfitting to pre-collected hard-label datasets. In this paper, we propose a novel LLM alignment framework named $i$REPO, which utilizes $i$mplicit Reward pairwise difference regression for Empirical Preference Optimization. In particular, $i$REPO employs self-generated datasets labeled by empirical human (or AI annotator) preference to iteratively refine the aligned policy through a novel regression-based loss function. Furthermore, we introduce an innovative algorithm backed by…
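The abstract does not spell out the regression-based loss, so the following is only a minimal sketch of what regressing on implicit reward pairwise differences could look like. It assumes a DPO-style implicit reward, beta * log(pi_theta(y|x) / pi_ref(y|x)), and assumes the regression target is the logit of the empirical preference rate among annotators. All function names, dictionary keys, and the values of `beta` and `eps` are hypothetical and are not taken from the paper.

```python
import math


def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """Assumed DPO-style implicit reward: beta * log(pi_theta / pi_ref)."""
    return beta * (logp_policy - logp_ref)


def empirical_logit(votes_for_first, total_votes, eps=1e-3):
    """Logit of the fraction of annotators preferring the first response.

    The fraction is clipped to [eps, 1 - eps] so unanimous votes stay finite.
    """
    p = votes_for_first / total_votes
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def pairwise_regression_loss(pairs, beta=0.1):
    """Mean squared error between the implicit reward difference of each
    response pair and the empirical preference logit (an assumed target)."""
    total = 0.0
    for pair in pairs:
        r1 = implicit_reward(pair["logp_policy_1"], pair["logp_ref_1"], beta)
        r2 = implicit_reward(pair["logp_policy_2"], pair["logp_ref_2"], beta)
        target = empirical_logit(pair["votes_1"], pair["votes_total"])
        total += ((r1 - r2) - target) ** 2
    return total / len(pairs)


# Hypothetical example: one prompt, two self-generated responses,
# ten annotator votes of which nine prefer the first response.
example = [{
    "logp_policy_1": -12.0, "logp_ref_1": -12.0,
    "logp_policy_2": -15.0, "logp_ref_2": -15.0,
    "votes_1": 9, "votes_total": 10,
}]
loss = pairwise_regression_loss(example)
```

In this sketch the soft target (a vote fraction rather than a hard 0/1 label) is what distinguishes the loss from training on pre-collected hard-label preferences; whether the paper's loss takes exactly this form is not confirmed by the text above.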
Taxonomy
Topics: Multi-Criteria Decision Making
