Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models

Wenxuan Zhang; Philip H.S. Torr; Mohamed Elhoseiny; Adel Bibi

arXiv:2408.15313 · cs.AI · April 9, 2025

TL;DR

This paper introduces Bi-Factorial Preference Optimization (BFPO), a supervised learning framework that balances safety and helpfulness in language models and outperforms existing methods while using fewer resources.

Contribution

BFPO re-parameterizes joint safety-helpfulness optimization into a single supervised learning task, reducing resource costs and improving balance in language model fine-tuning.

Findings

01. BFPO outperforms existing approaches in safety and helpfulness.

02. BFPO achieves safety levels comparable to human-labor-intensive methods.

03. BFPO requires less than 10% of the resources used by traditional methods.

Abstract

Fine-tuning large language models (LLMs) on human preferences, typically through reinforcement learning from human feedback (RLHF), has proven successful in enhancing their capabilities. However, ensuring the safety of LLMs during fine-tuning remains a critical concern, and mitigating the potential conflicts in safety and helpfulness is costly in RLHF. To address this issue, we propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO), which re-parameterizes a joint RLHF objective of both safety and helpfulness into a single supervised learning objective. In supervised optimization, a labeling function is used to capture the global preferences ranking to balance both safety and helpfulness. To evaluate BFPO, we develop a benchmark that includes comprehensive discriminative and generative tasks for helpfulness and harmlessness. The results indicate that…
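
The idea of folding safety and helpfulness into one supervised target can be illustrated with a small sketch. This is not the paper's objective, which this page does not state. `toy_label`, its `alpha` weight, the squared-error form of `supervised_loss`, and the `tau` scale are all hypothetical choices made only for illustration. The sketch assumes each training pair carries a helpfulness preference and a binary safety flag per response, and that the model exposes a log-probability ratio against a reference model for each response.

```python
def toy_label(prefers_a: bool, a_safe: bool, b_safe: bool, alpha: float = 0.5) -> float:
    """Hypothetical labeling function (an assumption, not the paper's).

    Returns a target margin in [-1, 1] for response a over response b.
    Safety decides the sign when exactly one response is safe; otherwise
    the helpfulness preference decides, down-weighted by alpha when both
    responses are unsafe.
    """
    helpful = 1.0 if prefers_a else -1.0
    if a_safe and not b_safe:
        return 1.0
    if b_safe and not a_safe:
        return -1.0
    if a_safe and b_safe:
        return helpful
    return alpha * helpful  # both responses unsafe


def supervised_loss(logratio_a: float, logratio_b: float,
                    target: float, tau: float = 1.0) -> float:
    """Squared error between the scaled log-ratio margin and the target label.

    logratio_x stands for log(pi(x) / pi_ref(x)) for response x; tau is an
    assumed scaling hyperparameter.
    """
    margin = logratio_a - logratio_b
    return (tau * margin - target) ** 2


# Annotators preferred response a for helpfulness, but only response b is safe.
target = toy_label(prefers_a=True, a_safe=False, b_safe=True)
loss = supervised_loss(logratio_a=0.5, logratio_b=-0.5, target=target)
```

In this toy setup a single regression target per pair carries both signals, so no separate reward model or reinforcement learning loop is needed; how BFPO actually defines its labeling function and loss should be taken from the paper itself.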

Taxonomy

Topics: Natural Language Processing Techniques