Discovering Bias in Latent Space: An Unsupervised Debiasing Approach

Dyah Adila; Shuai Zhang; Boran Han; Yuyang Wang

arXiv:2406.03631 · cs.LG · June 7, 2024

TL;DR

This paper introduces SteerFair, an unsupervised method to identify and steer away from biases in model representations, significantly reducing performance variance and improving accuracy in prompt-based tasks without labeled data.

Contribution

SteerFair is a novel unsupervised approach that detects and mitigates biases in a model's internal representations by steering activations. Although it uses no labels, it outperforms supervised baselines trained with 100 labeled samples.

Findings

1. Reduces performance variance across prompt modifications
2. Surpasses the accuracy of supervised baselines that use 100 labeled samples
3. Matches the performance of supervised baselines that use 500 labeled samples

Abstract

The question-answering (QA) capabilities of foundation models are highly sensitive to prompt variations, rendering their performance susceptible to superficial, non-meaning-altering changes. This vulnerability often stems from the model's preference or bias towards specific input characteristics, such as option position or superficial image features in multi-modal settings. We propose to rectify this bias directly in the model's internal representation. Our approach, SteerFair, finds the bias direction in the model's representation space and steers activation values away from it during inference. Specifically, we exploit the observation that bias often adheres to simple association rules, such as the spurious association between the first option and correctness likelihood. Next, we construct demonstrations of these rules from unlabeled samples and use them to identify the bias…
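The steering idea in the abstract (find a bias direction, then push activations away from it at inference) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract is cut off before it explains how the bias direction is identified, so this sketch assumes a simple difference of means: the mean activation on demonstrations of a bias rule (for instance, the first-option rule) minus the mean activation on contrasting inputs. It also assumes that "steering away" means subtracting each hidden state's component along that direction. The helpers `bias_direction` and `steer_away` and the strength parameter `alpha` are hypothetical, and the synthetic arrays stand in for real model activations.

```python
import numpy as np


def bias_direction(acts_rule: np.ndarray, acts_contrast: np.ndarray) -> np.ndarray:
    """Hypothetical helper: unit vector from the mean contrast activation to the
    mean activation of bias-rule demonstrations.

    Both inputs have shape (n_samples, hidden_dim). Difference of means is an
    assumed choice for illustration, not necessarily what SteerFair does.
    """
    diff = acts_rule.mean(axis=0) - acts_contrast.mean(axis=0)
    return diff / np.linalg.norm(diff)


def steer_away(h: np.ndarray, direction: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Hypothetical helper: subtract alpha times each hidden state's component
    along the unit-norm bias direction.

    h has shape (..., hidden_dim). alpha=1.0 removes the component entirely;
    values between 0 and 1 only dampen it.
    """
    coeff = h @ direction  # projection coefficients, shape (...)
    return h - alpha * np.multiply.outer(coeff, direction)


# Usage with synthetic activations (stand-ins for real model hidden states).
rng = np.random.default_rng(0)
hidden_dim = 8
true_bias = np.zeros(hidden_dim)
true_bias[0] = 3.0  # pretend the bias lives mostly along the first axis
acts_rule = rng.normal(size=(64, hidden_dim)) + true_bias
acts_contrast = rng.normal(size=(64, hidden_dim))

d = bias_direction(acts_rule, acts_contrast)
h = rng.normal(size=(4, hidden_dim)) + true_bias
h_steered = steer_away(h, d)
print(np.round(h @ d, 3))          # projections onto the bias direction before steering
print(np.round(h_steered @ d, 3))  # approximately zero after steering with alpha=1.0
```

Projection removal is used here because it leaves every component orthogonal to the estimated direction untouched. In a real model, an edit like this would be applied to hidden states at chosen layers during the forward pass. Which layers SteerFair edits, and how strongly, is not stated in the excerpt above.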

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems