Beyond Hidden-Layer Manipulation: Semantically-Aware Logit Interventions for Debiasing LLMs
Wei Xia

TL;DR
This paper introduces two zero-shot, logits-layer debiasing methods for large language models, Static and Dynamic. It reports that semantically aware logit interventions are stable and effective, and that they outperform hidden-layer approaches at reducing bias with minimal fluency loss.
Contribution
The paper proposes static and dynamic logits-layer debiasing methods that are zero-shot, outperform hidden-layer approaches, and are stable for aligned large language models.
Findings
Dynamic method reduces bias by up to 70%
Logits intervention outperforms hidden-layer approaches
Semantic-aware logits intervention maintains fluency and stability
Abstract
We propose Static and Dynamic, two zero-shot logits-layer debiasing methods. Dynamic reduces bias by up to 70% with minimal fluency loss. Logits-layer intervention outperforms hidden-layer approaches. We show that semantically aware logits intervention is stable and effective for debiasing aligned LLMs.
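The abstract does not say how Static and Dynamic compute their adjustments, so the following is only a generic sketch of what intervening at the logits layer can look like, not the paper's method. It assumes a fixed penalty on a hand-picked set of token ids for the static variant and a per-token penalty scaled by a bias score in [0, 1] for the dynamic variant. The function names, the `penalty` and `strength` parameters, and the `bias_scores` input are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def static_logit_intervention(logits, flagged_token_ids, penalty=2.0):
    """Hypothetical static variant: subtract one fixed penalty from the
    logits of a predetermined set of token ids (an assumption, not the
    paper's definition)."""
    adjusted = list(logits)
    for token_id in flagged_token_ids:
        adjusted[token_id] -= penalty
    return adjusted


def dynamic_logit_intervention(logits, bias_scores, strength=2.0):
    """Hypothetical dynamic variant: scale the penalty per token by a
    context-dependent bias score in [0, 1]. How such a score would be
    derived (for example from a semantic similarity measure) is assumed
    here and not taken from the source."""
    return [x - strength * s for x, s in zip(logits, bias_scores)]


# Toy vocabulary of four tokens; token 0 plays the role of a biased completion.
logits = [2.0, 1.0, 0.5, 0.1]
static_out = static_logit_intervention(logits, flagged_token_ids=[0])
dynamic_out = dynamic_logit_intervention(logits, bias_scores=[1.0, 0.0, 0.0, 0.0])

p_before = softmax(logits)
p_after = softmax(static_out)
```

In this toy setup the probability of token 0 falls after the adjustment while the remaining tokens keep their relative order, which illustrates why an intervention applied only at the output distribution can leave the rest of the model's computation untouched.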
