HPS: Hard Preference Sampling for Human Preference Alignment
Xiandong Zou, Wanyu Lin, Yuchen Li, Pan Zhou

TL;DR
HPS is a training framework for aligning large language models with human preferences. It concentrates on hard dispreferred responses, those closely resembling preferred ones, and uses single-sample Monte Carlo sampling to cut computational cost while improving safety and controllability.
Contribution
HPS introduces a novel loss and sampling strategy that prioritizes hard dispreferred responses, improving efficiency and safety in preference alignment for LLMs.
Findings
HPS achieves BLEU and reward scores comparable to those of existing methods.
HPS significantly increases reward margins, which reduces harmful content.
HPS reduces computational overhead compared to traditional Plackett-Luce (PL) methods.
Abstract
Aligning Large Language Model (LLM) responses with human preferences is vital for building safe and controllable AI systems. While preference optimization methods based on Plackett-Luce (PL) and Bradley-Terry (BT) models have shown promise, they face challenges such as poor handling of harmful content, inefficient use of dispreferred responses, and, specifically for PL, high computational costs. To address these issues, we propose Hard Preference Sampling (HPS), a novel framework for robust and efficient human preference alignment. HPS introduces a training loss that prioritizes the most preferred response while rejecting all dispreferred and harmful ones. It emphasizes "hard" dispreferred responses -- those closely resembling preferred ones -- to enhance the model's rejection capabilities. By leveraging a single-sample Monte Carlo sampling strategy, HPS reduces computational overhead…
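The ideas in the abstract can be illustrated with a small sketch. The Bradley-Terry and Plackett-Luce losses below are the standard textbook forms. The hardness weighting and the single-sample step are assumptions made for illustration only: the truncated abstract does not give the HPS loss, so `hardness_weights`, `single_sample_loss`, and the `gamma` parameter are hypothetical names and should not be read as the authors' formulation.

```python
import math
import random


def bt_loss(s_pref, s_disp):
    """Bradley-Terry negative log-likelihood for one (preferred, dispreferred) pair."""
    return math.log1p(math.exp(-(s_pref - s_disp)))


def pl_loss(scores):
    """Plackett-Luce negative log-likelihood of a full ranking.

    `scores` is ordered from most to least preferred. Every ranking position
    needs its own normalizer over the responses that remain, which is where
    the extra cost relative to a single pairwise comparison comes from.
    """
    loss = 0.0
    for k in range(len(scores) - 1):
        tail = scores[k:]
        m = max(tail)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in tail))
        loss += log_z - scores[k]
    return loss


def hardness_weights(disp_rewards, gamma=1.0):
    """ASSUMPTION: softmax over gamma * reward, so dispreferred responses
    that score closer to the preferred one ("hard" ones) get more weight."""
    m = max(disp_rewards)
    exps = [math.exp(gamma * (r - m)) for r in disp_rewards]
    z = sum(exps)
    return [e / z for e in exps]


def expected_weighted_loss(s_pref, disp_scores, disp_rewards, gamma=1.0):
    """Hardness-weighted average of pairwise losses over all dispreferred responses."""
    weights = hardness_weights(disp_rewards, gamma)
    return sum(w * bt_loss(s_pref, s) for w, s in zip(weights, disp_scores))


def single_sample_loss(s_pref, disp_scores, disp_rewards, gamma=1.0, rng=random):
    """ASSUMPTION: draw ONE dispreferred response from the hardness weights and
    score only that pair; its expectation equals expected_weighted_loss."""
    weights = hardness_weights(disp_rewards, gamma)
    idx = rng.choices(range(len(disp_scores)), weights=weights, k=1)[0]
    return bt_loss(s_pref, disp_scores[idx]), idx


if __name__ == "__main__":
    s_pref = 2.0                   # model score of the preferred response
    disp_scores = [0.5, 1.8, 1.0]  # model scores of dispreferred responses
    disp_rewards = [0.1, 0.9, 0.4]  # hypothetical reward-model values
    print(hardness_weights(disp_rewards, gamma=2.0))
    print(single_sample_loss(s_pref, disp_scores, disp_rewards, gamma=2.0))
```

In this sketch a larger `gamma` concentrates sampling on the hardest dispreferred response, while `gamma = 0` reduces to uniform sampling. With only two responses, `pl_loss` coincides with `bt_loss`, which is a quick sanity check on both functions.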
Taxonomy
Topics: Sensory Analysis and Statistical Methods
