COPR: Continual Human Preference Learning via Optimal Policy Regularization

Han Zhang; Lin Gui; Yu Lei; Yuanzhao Zhai; Yehong Zhang; Yulan He; Hui Wang; Yue Yu; Kam-Fai Wong; Bin Liang; Ruifeng Xu

arXiv:2402.14228 · cs.LG · December 24, 2024 · 1 citation

Open Access

TL;DR

This paper introduces COPR, a novel continual learning method for reinforcement learning from human feedback that prevents catastrophic forgetting by regularizing policies based on optimal policy theory, improving alignment with human preferences.

Contribution

COPR is the first method to integrate optimal policy regularization with continual learning for RLHF, effectively maintaining historical preferences and enhancing model alignment.

Findings

01. COPR outperforms strong continual learning baselines in both reward-based and human evaluations.

02. COPR demonstrates robustness across various settings and model architectures.

03. A formal proof of learnability supports the theoretical foundation of COPR.

Abstract

Reinforcement Learning from Human Feedback (RLHF) is commonly used to improve the alignment of Large Language Models (LLMs) with human preferences. Because human preferences evolve, continual alignment is more important and more practical than traditional static alignment. Nevertheless, making RLHF compatible with Continual Learning (CL) is challenging due to its complex process. Meanwhile, directly learning new human preferences may lead to Catastrophic Forgetting (CF) of historical preferences, resulting in unhelpful or harmful outputs. To overcome these challenges, we propose the Continual Optimal Policy Regularization (COPR) method, which draws inspiration from optimal policy theory. COPR uses a sampling distribution both as a demonstration and as a regularization constraint for CL. It adopts the Lagrangian Duality (LD) method to dynamically regularize the…
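
The abstract mentions optimal policy theory and a sampling distribution without giving formulas here. As a rough illustration only, the sketch below uses the standard closed-form optimum of KL-regularized RLHF, pi*(y|x) proportional to pi_ref(y|x) * exp(r(x, y) / beta), normalized over a small set of sampled responses. That this matches the paper's exact formulation is an assumption; the function name, the toy numbers, and the beta value are hypothetical and not taken from the paper.

```python
import math

def optimal_policy_over_samples(ref_logprobs, rewards, beta=1.0):
    """Normalize pi_ref(y|x) * exp(r(x, y) / beta) over a finite set of samples.

    Assumption: this is the textbook KL-regularized optimum restricted to
    sampled responses, not necessarily the exact distribution used by COPR.
    """
    # Work in log space: log pi_ref(y|x) + r(x, y) / beta
    logits = [lp + r / beta for lp, r in zip(ref_logprobs, rewards)]
    # Subtract the max before exponentiating for numerical stability
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]

# Toy example: three candidate responses for one prompt (made-up values)
ref_logprobs = [math.log(0.5), math.log(0.3), math.log(0.2)]
rewards = [0.0, 1.0, 2.0]
probs = optimal_policy_over_samples(ref_logprobs, rewards, beta=1.0)
```

With equal rewards the function returns the reference distribution unchanged; larger reward gaps or a smaller beta shift probability mass toward the higher-reward responses.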

Taxonomy

Topics: Machine Learning and Data Classification · Human Pose and Action Recognition · Text and Document Classification Technologies

Methods: Linear Layer · Dropout · Dense Connections · Label Smoothing · Adam · Attention Is All You Need · Softmax · Multi-Head Attention · Layer Normalization · Residual Connection