Biased or Flawed? Mitigating Stereotypes in Generative Language Models by Addressing Task-Specific Flaws
Akshita Jha, Sanchit Kabra, Chandan K. Reddy

TL;DR
This paper presents a method that reduces stereotypes in generative language models by addressing comprehension failures through instruction-tuning, achieving a reduction of over 60% in stereotypical outputs without explicit debiasing.
Contribution
The paper introduces a targeted stereotype mitigation framework that disentangles bias from comprehension errors, and demonstrates its effectiveness across multiple models and bias dimensions.
Findings
Over 60% reduction in stereotypical outputs
Effective across multiple bias categories
Maintains model utility while reducing bias
Abstract
Recent studies have shown that generative language models often reflect and amplify societal biases in their outputs. However, these studies frequently conflate observed biases with other task-specific shortcomings, such as comprehension failure. For example, when a model misinterprets a text and produces a response that reinforces a stereotype, it becomes difficult to determine whether the issue arises from inherent bias or from a misunderstanding of the given content. In this paper, we conduct a multi-faceted evaluation that distinctly disentangles bias from flaws within the reading comprehension task. We propose a targeted stereotype mitigation framework that implicitly mitigates observed stereotypes in generative models through instruction-tuning on general-purpose datasets. We reduce stereotypical outputs by over 60% across multiple dimensions -- including nationality, age, gender,…
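The idea of separating stereotypical answers from other comprehension errors can be made concrete with a small sketch. This is an illustration only, not the authors' evaluation code: the record fields (`prediction`, `gold`, `stereotyped`), the three category names, and the toy data are all hypothetical, and the paper's actual metric may be defined differently.

```python
from collections import Counter

def categorize(prediction, gold, stereotyped):
    """Label one answer as correct, stereotypical, or some other error.

    Hypothetical scheme: an answer that matches the gold label is correct,
    a wrong answer that matches the stereotyped option counts as
    stereotypical, and any other wrong answer is a plain comprehension error.
    """
    if prediction == gold:
        return "correct"
    if prediction == stereotyped:
        return "stereotypical"
    return "other_error"

def stereotype_rate(records):
    """Fraction of all answers that fall in the stereotypical category."""
    counts = Counter(
        categorize(r["prediction"], r["gold"], r["stereotyped"]) for r in records
    )
    total = sum(counts.values())
    return counts["stereotypical"] / total if total else 0.0

def relative_reduction(before, after):
    """Relative drop in a rate, e.g. 0.5 means the rate was halved."""
    if before == 0:
        raise ValueError("baseline rate is zero; reduction is undefined")
    return (before - after) / before

def make_records(predictions, gold="unknown", stereotyped="A"):
    """Build toy records that share one gold answer and one stereotyped option."""
    return [
        {"prediction": p, "gold": gold, "stereotyped": stereotyped}
        for p in predictions
    ]

# Made-up answers for illustration; these are not results from the paper.
before = make_records(["A", "A", "unknown", "B"])
after = make_records(["A", "unknown", "unknown", "B"])

reduction = relative_reduction(stereotype_rate(before), stereotype_rate(after))
print(f"relative reduction in stereotypical outputs: {reduction:.0%}")
```

Keeping `other_error` as its own category is the point of the sketch: a wrong answer that does not pick the stereotyped option is counted as a comprehension failure rather than as bias, so the two effects are not conflated in a single error rate.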
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
