SeRA: Self-Reviewing and Alignment of Large Language Models using Implicit Reward Margins

Jongwoo Ko; Saket Dingliwal; Bhavana Ganesh; Sailik Sengupta; Sravan Bodapati; Aram Galstyan

arXiv:2410.09362 · cs.LG · October 15, 2024

Open Access

TL;DR

SeRA is a novel method that improves large language model alignment by using implicit reward margins for sample selection and preference bootstrapping, reducing overfitting and enhancing training efficiency.

Contribution

SeRA introduces a cost-effective approach combining implicit reward margins and preference bootstrapping to enhance DAA-based LLM alignment.

Findings

01 SeRA improves alignment accuracy on offline datasets.

02 SeRA reduces overfitting to spurious correlations.

03 SeRA demonstrates effectiveness across multiple instruction-following tasks.

Abstract

Direct alignment algorithms (DAAs), such as direct preference optimization (DPO), have become popular alternatives to Reinforcement Learning from Human Feedback (RLHF) due to their simplicity, efficiency, and stability. However, the preferences used in DAAs are usually collected before alignment training begins and remain unchanged (off-policy). This can lead to two problems, where the policy model (1) picks up on spurious correlations in the dataset (as opposed to learning the intended alignment expressed in the human preference labels), and (2) overfits to feedback on off-policy trajectories that are less likely to be generated by an updated policy model. To address these issues, we introduce Self-Reviewing and Alignment (SeRA), a cost-efficient and effective method that can be readily combined with existing DAAs. SeRA comprises two components: (1) sample selection…
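
To make the idea of an implicit reward margin concrete, here is a minimal sketch in Python. It uses the standard DPO implicit reward, beta * (log pi_theta(y|x) - log pi_ref(y|x)), and takes the margin as the reward of the chosen response minus the reward of the rejected one. The names `PreferencePair`, `implicit_reward_margin`, `select_by_margin`, and the `keep_fraction` parameter are hypothetical, and the rule of keeping the highest-margin pairs is an assumption made for illustration; the abstract as shown here does not spell out the selection rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    """One preference example with precomputed sequence log-probabilities.

    Hypothetical container: the log-probabilities are assumed to have been
    computed beforehand by the policy and reference models.
    """
    prompt: str
    chosen_logp_policy: float
    chosen_logp_ref: float
    rejected_logp_policy: float
    rejected_logp_ref: float


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    # DPO implicit reward: beta * log(pi_theta(y|x) / pi_ref(y|x)).
    return beta * (logp_policy - logp_ref)


def implicit_reward_margin(pair: PreferencePair, beta: float = 0.1) -> float:
    # Margin = implicit reward of the chosen response minus the rejected one.
    chosen = implicit_reward(pair.chosen_logp_policy, pair.chosen_logp_ref, beta)
    rejected = implicit_reward(pair.rejected_logp_policy, pair.rejected_logp_ref, beta)
    return chosen - rejected


def select_by_margin(
    pairs: List[PreferencePair], keep_fraction: float = 0.5, beta: float = 0.1
) -> List[PreferencePair]:
    # Assumed selection rule (not taken from the paper): rank pairs by margin
    # and keep the top fraction, always keeping at least one pair.
    ranked = sorted(pairs, key=lambda p: implicit_reward_margin(p, beta), reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]


if __name__ == "__main__":
    data = [
        PreferencePair("a", -1.0, -2.0, -3.0, -2.0),
        PreferencePair("b", -2.0, -2.0, -2.0, -2.0),
        PreferencePair("c", -3.0, -2.0, -1.0, -2.0),
        PreferencePair("d", -1.0, -2.0, -2.0, -2.0),
    ]
    for pair in select_by_margin(data, keep_fraction=0.5, beta=0.5):
        print(pair.prompt, implicit_reward_margin(pair, beta=0.5))
```

Because the margin depends only on log-probabilities from the policy and the reference model, a filter of this kind needs no separate reward model, which is one way to read the "cost-efficient" claim above.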

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques