Towards Transfer Unlearning: Empirical Evidence of Cross-Domain Bias Mitigation
Huimin Lu, Masaru Isonuma, Junichiro Mori, Ichiro Sakata

TL;DR
This paper introduces a novel unlearning-based debiasing method for large language models that reduces biases and toxicity, with evidence that debiasing one bias type can transfer to mitigate others across domains.
Contribution
It proposes a mask language modeling unlearning technique to selectively forget biased content and demonstrates cross-domain bias mitigation effects.
Findings
Effective bias reduction while preserving language modeling quality
Unlearning one bias can help mitigate other biases across domains
Potential for improved debiasing strategies through transfer unlearning
Abstract
Large language models (LLMs) often inherit biases from vast amounts of training corpora. Traditional debiasing methods, while effective to some extent, do not completely eliminate memorized biases and toxicity in LLMs. In this paper, we study an unlearning-based approach to debiasing in LLMs by performing gradient ascent on hate speech against minority groups, i.e., minimizing the likelihood of biased or toxic content. Specifically, we propose a mask language modeling unlearning technique, which unlearns the harmful part of the text. This method enables LLMs to selectively forget and disassociate from biased and harmful content. Experimental results demonstrate the effectiveness of our approach in diminishing bias while maintaining the language modeling abilities. Surprisingly, the results also unveil an unexpected potential for cross-domain transfer unlearning: debiasing in one bias…
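A minimal sketch of the objective described above, assuming that "unlearning the harmful part of the text" means applying gradient ascent only at token positions flagged as harmful. The function names, the `harmful_mask` input, and the way harmful tokens are flagged are hypothetical; this summary does not specify the paper's actual masking scheme or training setup.

```python
import math


def token_log_probs(logits, targets):
    """Log-probability of each target token under a softmax over its logits."""
    out = []
    for row, t in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        out.append(row[t] - log_z)
    return out


def masked_unlearning_loss(logits, targets, harmful_mask):
    """Mean log-likelihood over the positions flagged as harmful.

    Minimizing this value by gradient descent is equivalent to gradient
    ascent on the usual negative log-likelihood, restricted to the flagged
    tokens. Unflagged tokens contribute nothing, so they are not unlearned.
    `harmful_mask` is a hypothetical 0/1 list, one entry per token.
    """
    log_probs = token_log_probs(logits, targets)
    selected = [lp for lp, m in zip(log_probs, harmful_mask) if m]
    if not selected:
        return 0.0  # nothing flagged, nothing to unlearn
    return sum(selected) / len(selected)


# Toy example: two positions, vocabulary of size 2, only the second is flagged.
example_loss = masked_unlearning_loss(
    logits=[[0.0, 0.0], [0.0, 0.0]],
    targets=[0, 1],
    harmful_mask=[0, 1],
)
```

In a real training loop this scalar would be computed from model outputs in an autodiff framework and minimized, which pushes down the likelihood of the flagged tokens while leaving the rest of the sequence out of the update.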
