BiasDPO: Mitigating Bias in Language Models through Direct Preference Optimization
Ahmed Allam

TL;DR
This paper presents BiasDPO, a framework that uses Direct Preference Optimization to reduce biases in language models, improving ethical language generation and outperforming baseline models on bias benchmarks.
Contribution
Introduces BiasDPO, a bias mitigation method based on Direct Preference Optimization, together with a new manually designed dataset for training LLMs to recognize and correct biases.
Findings
Significant reduction in biased outputs in the Microsoft Phi-2 model.
Outperforms the baseline model on almost all bias benchmarks, and also outperforms other open-source models.
Public release of BiasDPO dataset for further research.
Abstract
Large Language Models (LLMs) have become pivotal in advancing natural language processing, yet their potential to perpetuate biases poses significant concerns. This paper introduces a new framework employing Direct Preference Optimization (DPO) to mitigate gender, racial, and religious biases in LLM-generated English text. By developing a loss function that favors less biased over biased completions, our approach cultivates a preference for respectful and non-discriminatory language in LLMs. We also contribute a manually designed dataset for training LLMs to recognize and correct biases. This dataset encompasses a diverse range of prompts paired with both biased and unbiased completions. Implementing this approach on the Microsoft Phi-2 model, we demonstrate substantial reductions in biased outputs as our model outperforms the baseline model on almost all bias benchmarks. Our model also…
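The abstract describes a loss that favors the less biased completion over the biased one. As a rough illustration, the sketch below computes the standard DPO objective for a single preference pair, with the unbiased completion treated as "chosen" and the biased one as "rejected". It is not taken from the paper's code: the function name `dpo_loss`, the value `beta=0.1`, and the log-probability inputs are assumptions made for this example, and the summed sequence log-probabilities would in practice come from the policy being trained and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (prompt, chosen, rejected) triple.

    Inputs are summed log-probabilities of each completion given the prompt.
    `beta` (assumed value) controls how far the policy may drift from the
    reference model.
    """
    # How much more the policy likes each completion than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Positive margin means the policy prefers the unbiased completion.
    margin = beta * (chosen_logratio - rejected_logratio)

    # Loss is -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities, for illustration only.
prefers_unbiased = dpo_loss(-1.0, -5.0, -2.0, -2.0)
prefers_biased = dpo_loss(-5.0, -1.0, -2.0, -2.0)
print(prefers_unbiased < prefers_biased)
```

The loss shrinks as the policy assigns relatively more probability to the unbiased completion than the reference model does, which is the preference the abstract describes. When the policy and reference agree exactly, the margin is zero and the loss equals log 2.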
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
