MBIAS: Mitigating Bias in Large Language Models While Retaining Context
Shaina Raza, Ananya Raval, Veronica Chatrath

TL;DR
MBIAS is a fine-tuning framework for large language models that effectively reduces bias and toxicity while maintaining contextual accuracy, using a custom safety-focused dataset and human-in-the-loop evaluation.
Contribution
Introduces MBIAS, a novel instruction fine-tuning approach with a specialized dataset to mitigate bias and toxicity in LLMs without losing contextual information.
Findings
Over 30% reduction in bias and toxicity in standard evaluations
More than 90% reduction in bias across diverse demographic tests
Provides datasets and models for community use and reproducibility
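The percentages above are easiest to read as relative reductions in a bias or toxicity score, although this summary does not say how they were computed. A minimal sketch of that reading, in which the function name `relative_reduction` and both scores are hypothetical and not taken from the paper:

```python
def relative_reduction(baseline: float, mitigated: float) -> float:
    """Return the percentage drop from a baseline score to a mitigated score."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (baseline - mitigated) / baseline * 100.0


# Hypothetical scores for illustration only; they are not results from the paper.
baseline_score = 0.50   # assumed mean toxicity score before mitigation
mitigated_score = 0.35  # assumed mean toxicity score after mitigation

print(f"{relative_reduction(baseline_score, mitigated_score):.1f}% reduction")
```

Under this reading, a reduction of "over 30%" means the mitigated score is below 70% of the baseline score, not that the score fell by 30 percentage points.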
Abstract
The deployment of Large Language Models (LLMs) in diverse applications necessitates an assurance of safety without compromising the contextual integrity of the generated content. Traditional approaches, including safety-specific fine-tuning or adversarial testing, often yield safe outputs at the expense of contextual meaning. This can result in a diminished capacity to handle nuanced aspects of bias and toxicity, such as underrepresentation or negative portrayals across various demographics. To address these challenges, we introduce MBIAS, an LLM framework carefully instruction fine-tuned on a custom dataset designed specifically for safety interventions. MBIAS is designed to significantly reduce biases and toxic elements in LLM outputs while preserving the main information. This work also details our further use of LLMs: as annotator under human supervision and as evaluator of…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
