Just Say What You Want: Only-prompting Self-rewarding Online Preference Optimization

Ruijie Xu; Zhihan Liu; Yongfei Liu; Shipeng Yan; Zhaoran Wang; Zhi Zhang; Xuming He

arXiv:2409.17534·cs.AI·October 15, 2024

Open Access

TL;DR

This paper introduces an online RLHF method that builds preference data through prompting alone, so it does not depend on a separate judgment model. It also generates harder negatives in the later stages of training to help the model capture subtle human preferences.

Contribution

The paper proposes an only-prompting self-rewarding online algorithm that generates preference datasets without relying on judgment capabilities, and it applies fine-grained control over the optimality gap between positive and negative examples.

Findings

01. Achieved 34.5% win rate on AlpacaEval 2.0

02. Significantly improved performance of base models

03. Demonstrated effectiveness on Mistral-7B and Mistral-Instruct-7B

Abstract

We address the challenge of online Reinforcement Learning from Human Feedback (RLHF) with a focus on self-rewarding alignment methods. In online RLHF, obtaining feedback requires interaction with the environment, which can be costly when using additional reward models or the GPT-4 API. Current self-rewarding approaches rely heavily on the discriminator's judgment capabilities, which are effective for large-scale models but challenging to transfer to smaller ones. To address these limitations, we propose a novel, only-prompting self-rewarding online algorithm that generates preference datasets without relying on judgment capabilities. Additionally, we employ fine-grained arithmetic control over the optimality gap between positive and negative examples, generating more hard negatives in the later stages of training to help the model better capture subtle human preferences. Finally, we…
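The idea of shrinking the gap between positive and negative examples as training proceeds can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the linear schedule, the prompt wording, and the names `gap_schedule`, `build_pair`, `collect_pairs`, and `PreferencePair` are all hypothetical, and `generate` stands in for whatever text-generation call is available.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    gap: float  # target quality gap used when the pair was built


def gap_schedule(step: int, total_steps: int,
                 start_gap: float = 1.0, end_gap: float = 0.2) -> float:
    """Hypothetical linear schedule: wide gap early, narrow gap (harder negatives) late."""
    if total_steps <= 1:
        return end_gap
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return (1.0 - frac) * start_gap + frac * end_gap


def build_pair(prompt: str, generate: Callable[[str], str], gap: float) -> PreferencePair:
    """Build one pair by prompting only; no judge model scores the outputs."""
    # Assumed prompt templates; the paper's actual prompts are not shown in this page.
    chosen = generate(f"Answer as helpfully as you can.\n\n{prompt}")
    rejected = generate(
        "Answer, but fall short of your best answer by "
        f"{gap:.2f} on a 0-1 scale.\n\n{prompt}"
    )
    return PreferencePair(prompt, chosen, rejected, gap)


def collect_pairs(prompts: List[str], generate: Callable[[str], str],
                  total_steps: int) -> List[PreferencePair]:
    """Collect pairs round by round, with the gap shrinking each round."""
    pairs: List[PreferencePair] = []
    for step in range(total_steps):
        gap = gap_schedule(step, total_steps)
        for prompt in prompts:
            pairs.append(build_pair(prompt, generate, gap))
    return pairs
```

In this sketch the preference label comes from which prompt produced each response, not from a discriminator. A real pipeline would pass a model's sampling function as `generate` and feed the resulting pairs to a preference-optimization step.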

Taxonomy

Topics: Auction Theory and Applications · Optimization and Search Problems · Consumer Market Behavior and Pricing

Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Layer Normalization · Dropout · Dense Connections