Uncertainty-Penalized Reinforcement Learning from Human Feedback with Diverse Reward LoRA Ensembles

Yuanzhao Zhai; Han Zhang; Yu Lei; Yue Yu; Kele Xu; Dawei Feng; Bo Ding; Huaimin Wang

arXiv:2401.00243 · cs.LG · January 2, 2024 · 1 cites

Open Access

TL;DR

This paper introduces UP-RLHF, a method that uses uncertainty regularization with diverse reward LoRA ensembles to improve reinforcement learning from human feedback, reducing overoptimization and enhancing alignment of language models.

Contribution

It proposes a novel uncertainty-penalized RLHF framework with diverse reward LoRA ensembles for better reward uncertainty quantification and overoptimization mitigation.

Findings

01 Diverse reward LoRA ensembles effectively quantify reward uncertainty.

02 Uncertainty regularization reduces overoptimization in RLHF.

03 UP-RLHF improves alignment performance on human preference datasets.

Abstract

Reinforcement learning from human feedback (RLHF) emerges as a promising paradigm for aligning large language models (LLMs). However, a notable challenge in RLHF is overoptimization, where beyond a certain threshold, the pursuit of higher rewards leads to a decline in human preferences. In this paper, we observe the weakness of KL regularization which is commonly employed in existing RLHF methods to address overoptimization. To mitigate this limitation, we scrutinize the RLHF objective in the offline dataset and propose uncertainty-penalized RLHF (UP-RLHF), which incorporates uncertainty regularization during RL-finetuning. To enhance the uncertainty quantification abilities for reward models, we first propose a diverse low-rank adaptation (LoRA) ensemble by maximizing the nuclear norm of LoRA matrix concatenations. Then we optimize policy models utilizing penalized rewards, determined…
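
The two ingredients named in the abstract can be illustrated with a minimal NumPy sketch. It rests on assumptions that are not taken from the paper: the function names `nuclear_norm_of_concat` and `penalized_reward` are hypothetical, the LoRA matrices are concatenated column-wise, and the penalty is taken to be the ensemble standard deviation scaled by a coefficient. The abstract is truncated before it states how the penalized reward is determined, so the second function shows only one common way to penalize ensemble disagreement, not necessarily the authors' formula.

```python
import numpy as np


def nuclear_norm_of_concat(lora_mats):
    """Nuclear norm (sum of singular values) of column-wise concatenated matrices.

    Assumption: each ensemble member contributes one 2-D LoRA matrix with the
    same number of rows. A larger value means the members span more distinct
    directions, which is the intuition behind maximizing this quantity.
    """
    stacked = np.concatenate(lora_mats, axis=1)
    return float(np.linalg.norm(stacked, ord="nuc"))


def penalized_reward(member_rewards, penalty_coef=1.0):
    """Ensemble mean reward minus a scaled disagreement term.

    Assumption: `member_rewards` has shape (num_members, num_samples) and the
    uncertainty estimate is the standard deviation across members.
    """
    rewards = np.asarray(member_rewards, dtype=float)
    return rewards.mean(axis=0) - penalty_coef * rewards.std(axis=0)


# Two toy "LoRA matrices" with a single column each.
a = np.array([[1.0], [0.0]])
b = np.array([[0.0], [1.0]])

identical = nuclear_norm_of_concat([a, a])  # members point the same way
diverse = nuclear_norm_of_concat([a, b])    # members are orthogonal

# Two ensemble members scoring two samples; they disagree only on the second.
scores = penalized_reward([[1.0, 1.0], [1.0, 3.0]], penalty_coef=1.0)
```

In this toy case the orthogonal pair has a larger nuclear norm than the identical pair, and the sample on which the members disagree has its mean reward reduced, while the sample they agree on is left unchanged.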

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Speech Recognition and Synthesis