Uncertainty-Penalized Reinforcement Learning from Human Feedback with Diverse Reward LoRA Ensembles
Yuanzhao Zhai, Han Zhang, Yu Lei, Yue Yu, Kele Xu, Dawei Feng, Bo Ding, Huaimin Wang

TL;DR
This paper introduces UP-RLHF, a method that uses uncertainty regularization with diverse reward LoRA ensembles to improve reinforcement learning from human feedback, reducing overoptimization and enhancing alignment of language models.
Contribution
It proposes a novel uncertainty-penalized RLHF framework with diverse reward LoRA ensembles for better reward uncertainty quantification and overoptimization mitigation.
Findings
Diverse reward LoRA ensembles effectively quantify reward uncertainty.
Uncertainty regularization reduces overoptimization in RLHF.
UP-RLHF improves alignment performance on human preference datasets.
Abstract
Reinforcement learning from human feedback (RLHF) emerges as a promising paradigm for aligning large language models (LLMs). However, a notable challenge in RLHF is overoptimization, where beyond a certain threshold, the pursuit of higher rewards leads to a decline in human preferences. In this paper, we observe the weakness of KL regularization which is commonly employed in existing RLHF methods to address overoptimization. To mitigate this limitation, we scrutinize the RLHF objective in the offline dataset and propose uncertainty-penalized RLHF (UP-RLHF), which incorporates uncertainty regularization during RL-finetuning. To enhance the uncertainty quantification abilities for reward models, we first propose a diverse low-rank adaptation (LoRA) ensemble by maximizing the nuclear norm of LoRA matrix concatenations. Then we optimize policy models utilizing penalized rewards, determined…
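The abstract is cut off before it says how the penalized rewards are determined, so the following is only an illustrative sketch of the general idea of uncertainty penalization, not the paper's formula. It assumes the reward for a response is the ensemble mean minus a coefficient times the ensemble standard deviation. The function name `penalized_reward` and the parameter `penalty_coef` are hypothetical.

```python
from statistics import mean, pstdev


def penalized_reward(ensemble_rewards, penalty_coef=1.0):
    """Assumed form: ensemble mean minus penalty_coef * ensemble std.

    ensemble_rewards: one scalar reward per ensemble member for the
    same (prompt, response) pair. The subtraction of a scaled standard
    deviation is an assumption made for illustration.
    """
    if not ensemble_rewards:
        raise ValueError("need at least one ensemble member")
    avg = mean(ensemble_rewards)
    # Disagreement between ensemble members serves as the uncertainty proxy.
    uncertainty = pstdev(ensemble_rewards)
    return avg - penalty_coef * uncertainty


# Same mean reward, different levels of ensemble disagreement.
agree = penalized_reward([0.9, 1.0, 1.1], penalty_coef=1.0)
disagree = penalized_reward([0.0, 1.0, 2.0], penalty_coef=1.0)
```

Under this assumption, a response on which the ensemble members disagree receives a lower penalized reward than one with the same mean on which they agree, which discourages the policy from exploiting regions where the reward estimate is unreliable.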
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Speech Recognition and Synthesis
