Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models
Wenxuan Zhang, Philip H.S. Torr, Mohamed Elhoseiny, Adel Bibi

TL;DR
This paper introduces Bi-Factorial Preference Optimization (BFPO), a supervised learning framework that balances safety and helpfulness in language models. It outperforms existing methods while requiring less than 10% of the resources used by traditional methods.
Contribution
BFPO re-parameterizes the joint optimization of safety and helpfulness into a single supervised learning task, which reduces resource costs and improves the balance between the two objectives during language model fine-tuning.
Findings
BFPO outperforms existing approaches in safety and helpfulness.
BFPO achieves safety levels comparable to human-labor-intensive methods.
BFPO requires less than 10% of the resources used by traditional methods.
Abstract
Fine-tuning large language models (LLMs) on human preferences, typically through reinforcement learning from human feedback (RLHF), has proven successful in enhancing their capabilities. However, ensuring the safety of LLMs during fine-tuning remains a critical concern, and mitigating the potential conflicts in safety and helpfulness is costly in RLHF. To address this issue, we propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO), which re-parameterizes a joint RLHF objective of both safety and helpfulness into a single supervised learning objective. In supervised optimization, a labeling function is used to capture the global preferences ranking to balance both safety and helpfulness. To evaluate BFPO, we develop a benchmark that includes comprehensive discriminative and generative tasks for helpfulness and harmlessness. The results indicate that…
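To make the idea of a labeling function concrete, here is a minimal sketch of how a pairwise supervised objective could fold safety into a helpfulness preference. It is an illustration under stated assumptions, not the formulation from the paper: the names bifactorial_label and pairwise_loss, the weight alpha, the temperature tau, and the specific target values are all hypothetical, and the squared-error form is borrowed from IPO-style preference losses rather than taken from BFPO.

```python
def bifactorial_label(safe_w: bool, safe_l: bool, alpha: float = 0.5) -> float:
    """Hypothetical labeling function (an assumption, not the paper's).

    `w` is the response preferred for helpfulness, `l` the other one.
    Returns a signed target margin: positive means "prefer w".
    The helpfulness preference contributes (1 - alpha); the safety
    difference contributes up to +/- 2 * alpha, so an unsafe but more
    helpful response can end up ranked below a safe one.
    """
    s_w = 1.0 if safe_w else 0.0
    s_l = 1.0 if safe_l else 0.0
    return (1.0 - alpha) + 2.0 * alpha * (s_w - s_l)


def pairwise_loss(logp_w: float, logp_l: float,
                  ref_logp_w: float, ref_logp_l: float,
                  target: float, tau: float = 1.0) -> float:
    """IPO-style squared loss (a swapped-in technique, assumed here).

    Regresses the policy-vs-reference log-ratio margin between the two
    responses onto the target margin, scaled by 1 / tau.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return (margin - target / tau) ** 2


if __name__ == "__main__":
    # Enumerate the four safety combinations for one preference pair.
    for safe_w in (True, False):
        for safe_l in (True, False):
            t = bifactorial_label(safe_w, safe_l)
            print(f"safe_w={safe_w}, safe_l={safe_l}, target={t}")
```

In this sketch, the single scalar target encodes a global ranking over both factors, so training needs only one supervised loss over preference pairs instead of separate reward and safety models.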
Taxonomy
Topics: Natural Language Processing Techniques
