De-amplifying Bias from Differential Privacy in Language Model Fine-tuning
Sanjari Srivastava, Piotr Mardziel, Zhikhun Zhang, Archana Ahlawat, Anupam Datta, John C Mitchell

TL;DR
This paper investigates how differential privacy during fine-tuning of large language models amplifies social biases, and proposes combining counterfactual data augmentation with DP to mitigate bias while preserving privacy.
Contribution
The paper shows that DP amplifies biases in LLM fine-tuning and demonstrates that CDA can reduce this bias amplification, allowing models to be trained privately with less added bias.
Findings
DP amplifies gender, racial, and religious bias in LLMs
CDA mitigates bias amplification caused by DP
Combining DP and CDA maintains fairness and privacy
Abstract
Fairness and privacy are two important values machine learning (ML) practitioners often seek to operationalize in models. Fairness aims to reduce model bias for social/demographic sub-groups. Privacy via differential privacy (DP) mechanisms, on the other hand, limits the impact of any individual's training data on the resulting model. The trade-offs between privacy and fairness goals of trustworthy ML pose a challenge to those wishing to address both. We show that DP amplifies gender, racial, and religious bias when fine-tuning large language models (LLMs), producing models more biased than ones fine-tuned without DP. We find the cause of the amplification to be a disparity in convergence of gradients across sub-groups. Through the case of binary gender bias, we demonstrate that Counterfactual Data Augmentation (CDA), a known method for addressing bias, also mitigates bias amplification…
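The abstract names Counterfactual Data Augmentation for binary gender bias but this page does not describe the paper's exact procedure. As a rough illustration of the general idea only, the following minimal sketch adds a gender-swapped copy of each sentence that contains a gendered word. The word-pair list and the function names `counterfactual` and `augment` are hypothetical and chosen for this example; they are not taken from the paper.

```python
import re

# Hypothetical, deliberately tiny lexicon of unambiguous word pairs.
# Real CDA lexicons are much larger and must handle ambiguous cases
# such as "her" (which can map to "him" or "his"); those are omitted here.
GENDER_PAIRS = [
    ("he", "she"),
    ("man", "woman"),
    ("father", "mother"),
    ("son", "daughter"),
]

# Build a symmetric lookup so each word maps to its counterpart.
SWAP = {}
for a, b in GENDER_PAIRS:
    SWAP[a] = b
    SWAP[b] = a

# Match any lexicon word as a whole word, ignoring case.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in SWAP) + r")\b",
    re.IGNORECASE,
)


def counterfactual(text):
    """Return text with each lexicon word replaced by its counterpart."""
    def repl(match):
        word = match.group(0)
        swapped = SWAP[word.lower()]
        # Preserve a leading capital letter (e.g. at sentence start).
        if word[0].isupper():
            swapped = swapped.capitalize()
        return swapped
    return PATTERN.sub(repl, text)


def augment(corpus):
    """Keep every sentence and add its counterfactual when it differs."""
    augmented = []
    for sentence in corpus:
        augmented.append(sentence)
        swapped = counterfactual(sentence)
        if swapped != sentence:
            augmented.append(swapped)
    return augmented
```

In a setup like the one the abstract describes, the augmented corpus would then be used as the fine-tuning data, with DP applied during training as usual; how the two are combined in the paper itself is not specified on this page.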
Taxonomy
Topics: Privacy-Preserving Technologies in Data
