Beyond Hidden-Layer Manipulation: Semantically-Aware Logit Interventions for Debiasing LLMs

Wei Xia

arXiv:2510.23650·cs.LG·October 29, 2025

TL;DR

This paper introduces two zero-shot logits-layer debiasing methods for large language models. It reports that semantic-aware logit interventions are stable and effective, and that they outperform hidden-layer approaches at reducing bias with minimal fluency loss.

Contribution

The paper proposes static and dynamic logits-layer debiasing methods that are zero-shot, outperform hidden-layer approaches, and are stable for aligned large language models.

Findings

01. Dynamic method reduces bias by up to 70%

02. Logits intervention outperforms hidden-layer approaches

03. Semantic-aware logits intervention maintains fluency and stability

Abstract

We propose Static and Dynamic, two zero-shot logits-layer debiasing methods. Dynamic reduces bias by up to 70% with minimal fluency loss. Logits intervention outperforms hidden-layer approaches. We show that semantic-aware logits intervention is stable and effective for debiasing aligned LLMs.
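To make the term "logits-layer intervention" concrete, here is a minimal generic sketch. It is not the paper's Static or Dynamic method, since the abstract does not describe how either works. It only illustrates the general idea of adjusting next-token logits before the softmax instead of editing hidden states. The function name `penalize_logits`, the flagged token index, and the penalty value are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def penalize_logits(logits, flagged_ids, penalty):
    """Subtract a fixed penalty from the logits of flagged token ids.

    Hypothetical helper: a generic logit-level adjustment, not the
    paper's Static or Dynamic method.
    """
    flagged = set(flagged_ids)
    return [x - penalty if i in flagged else x for i, x in enumerate(logits)]


# Toy vocabulary of four tokens; index 2 stands in for a flagged token.
logits = [2.0, 1.0, 3.0, 0.5]
adjusted = penalize_logits(logits, flagged_ids=[2], penalty=4.0)

before = softmax(logits)
after = softmax(adjusted)

# The flagged token loses probability mass; the other tokens keep their
# relative ordering because their logits are untouched.
print(f"flagged-token probability before: {before[2]:.3f}")
print(f"flagged-token probability after:  {after[2]:.3f}")
```

Because the adjustment is applied only at the output layer, the model's internal representations are unchanged. That is the general contrast with the hidden-layer approaches the abstract compares against.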
