KLAAD: Refining Attention Mechanisms to Reduce Societal Bias in Generative Language Models

Seorin Kim; Dongyoung Lee; Jaejin Lee

arXiv:2507.19962 · cs.CL · July 29, 2025

TL;DR

KLAAD is an attention-based debiasing method for large language models. It aligns attention distributions between stereotypical and anti-stereotypical sentence pairs to reduce societal bias, without directly modifying model weights and with minimal impact on language modeling quality.

Contribution

The paper introduces an attention alignment framework trained with a composite loss (Cross-Entropy, KL divergence, and Triplet) to mitigate bias while preserving language modeling performance.

Findings

01. Improves bias mitigation on the BBQ and BOLD benchmarks.

02. Preserves language modeling quality, with only minimal impact reported.

03. Offers a principled, attention-level approach to bias mitigation.

Abstract

Large language models (LLMs) often exhibit societal biases in their outputs, prompting ethical concerns regarding fairness and harm. In this work, we propose KLAAD (KL-Attention Alignment Debiasing), an attention-based debiasing framework that implicitly aligns attention distributions between stereotypical and anti-stereotypical sentence pairs without directly modifying model weights. KLAAD introduces a composite training objective combining Cross-Entropy, KL divergence, and Triplet losses, guiding the model to consistently attend across biased and unbiased contexts while preserving fluency and coherence. Experimental evaluation of KLAAD demonstrates improved bias mitigation on both the BBQ and BOLD benchmarks, with minimal impact on language modeling quality. The results indicate that attention-level alignment offers a principled solution for mitigating bias in generative language…
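The composite objective named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the loss weights, the direction of the KL term, how attention is extracted, or which representations form the triplet, so the names `w_kl`, `w_triplet`, `attn_stereo`, `attn_anti`, and the anchor/positive/negative roles below are all hypothetical choices made for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Standard language-modeling loss for one token: -log p(target)."""
    return -math.log(softmax(logits)[target_index])

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two attention distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss: pull anchor toward positive, away from negative."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def composite_loss(logits, target_index, attn_stereo, attn_anti,
                   anchor, positive, negative, w_kl=1.0, w_triplet=1.0):
    """Weighted sum of the three terms. The weights are assumptions, not paper values."""
    ce = cross_entropy(logits, target_index)
    # Assumed direction: stereotypical attention measured against anti-stereotypical.
    kl = kl_divergence(attn_stereo, attn_anti)
    tri = triplet_loss(anchor, positive, negative)
    return ce + w_kl * kl + w_triplet * tri

if __name__ == "__main__":
    # Toy attention rows for a stereotypical / anti-stereotypical sentence pair.
    attn_stereo = softmax([2.0, 0.5, 0.1])
    attn_anti = softmax([0.1, 0.5, 2.0])
    loss = composite_loss(
        logits=[1.0, 0.2, -0.3], target_index=0,
        attn_stereo=attn_stereo, attn_anti=attn_anti,
        anchor=[0.0, 0.0], positive=[0.1, 0.0], negative=[0.2, 0.1],
    )
    print(f"composite loss: {loss:.4f}")
```

In this sketch the KL term vanishes when the two attention distributions are identical, so the penalty only applies when the model attends differently across the paired sentences. The Cross-Entropy term keeps the usual language-modeling signal in the objective.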
