COPR: Continual Human Preference Learning via Optimal Policy Regularization
Han Zhang, Lin Gui, Yu Lei, Yuanzhao Zhai, Yehong Zhang, Yulan He, Hui Wang, Yue Yu, Kam-Fai Wong, Bin Liang, Ruifeng Xu

TL;DR
This paper introduces COPR, a novel continual learning method for reinforcement learning from human feedback that prevents catastrophic forgetting by regularizing policies based on optimal policy theory, improving alignment with human preferences.
Contribution
COPR is the first method to integrate optimal policy regularization with continual learning for RLHF, effectively maintaining historical preferences and enhancing model alignment.
Findings
COPR outperforms strong continual learning baselines in reward and human evaluations.
COPR demonstrates robustness across various settings and model architectures.
Formal proof of learnability supports the theoretical foundation of COPR.
Abstract
Reinforcement Learning from Human Feedback (RLHF) is commonly utilized to improve the alignment of Large Language Models (LLMs) with human preferences. Given the evolving nature of human preferences, continual alignment becomes more crucial and practical in comparison to traditional static alignment. Nevertheless, making RLHF compatible with Continual Learning (CL) is challenging due to its complex process. Meanwhile, directly learning new human preferences may lead to Catastrophic Forgetting (CF) of historical preferences, resulting in helpless or harmful outputs. To overcome these challenges, we propose the Continual Optimal Policy Regularization (COPR) method, which draws inspiration from the optimal policy theory. COPR utilizes a sampling distribution as a demonstration and regularization constraints for CL. It adopts the Lagrangian Duality (LD) method to dynamically regularize the…
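The abstract refers to optimal policy theory and to a sampling distribution without spelling either out. As a rough illustration only, the sketch below uses the standard closed form for KL-regularized reward maximization, in which the optimal policy is proportional to pi_ref(y|x) * exp(r(x, y) / beta), and normalizes it over a finite set of candidate responses. The function name `sampling_distribution`, its arguments, and the temperature `beta` are assumptions made for this example; the paper's exact formulation may differ.

```python
import math


def sampling_distribution(ref_logprobs, rewards, beta=1.0):
    """Normalize pi_ref(y|x) * exp(r(x, y) / beta) over a finite candidate set.

    ref_logprobs: log-probabilities of each candidate under a reference policy.
    rewards: scalar reward for each candidate (hypothetical inputs).
    beta: KL-regularization temperature (assumed hyperparameter, must be > 0).
    """
    if len(ref_logprobs) != len(rewards):
        raise ValueError("ref_logprobs and rewards must have the same length")
    if not ref_logprobs:
        raise ValueError("at least one candidate is required")
    if beta <= 0:
        raise ValueError("beta must be positive")

    # Work in log space: log pi_ref + r / beta.
    logits = [lp + r / beta for lp, r in zip(ref_logprobs, rewards)]

    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Two candidates that the reference policy rates equally; the first has the higher reward.
probs = sampling_distribution([math.log(0.5), math.log(0.5)], [1.0, 0.0], beta=1.0)
```

With equal reference probabilities, the candidate with the higher reward receives the larger share of the distribution, and a larger `beta` pulls the result back toward the reference policy.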
Taxonomy
Topics: Machine Learning and Data Classification · Human Pose and Action Recognition · Text and Document Classification Technologies
Methods: Linear Layer · Dropout · Dense Connections · Label Smoothing · Adam · Attention Is All You Need · Softmax · Multi-Head Attention · Layer Normalization · Residual Connection
