Discovering Bias in Latent Space: An Unsupervised Debiasing Approach
Dyah Adila, Shuai Zhang, Boran Han, Yuyang Wang

TL;DR
This paper introduces SteerFair, an unsupervised method to identify and steer away from biases in model representations, significantly reducing performance variance and improving accuracy in prompt-based tasks without labeled data.
Contribution
SteerFair is an unsupervised approach that detects and mitigates internal model biases by steering activations. It requires no labeled samples, yet it outperforms a supervised baseline that uses 100 labels.
Findings
Reduces performance variance across prompt modifications
Surpasses the accuracy of a supervised baseline that uses 100 labels
Matches the performance of a supervised baseline that uses 500 labels
Abstract
The question-answering (QA) capabilities of foundation models are highly sensitive to prompt variations, rendering their performance susceptible to superficial, non-meaning-altering changes. This vulnerability often stems from the model's preference or bias towards specific input characteristics, such as option position or superficial image features in multi-modal settings. We propose to rectify this bias directly in the model's internal representation. Our approach, SteerFair, finds the bias direction in the model's representation space and steers activation values away from it during inference. Specifically, we exploit the observation that bias often adheres to simple association rules, such as the spurious association between the first option and correctness likelihood. Next, we construct demonstrations of these rules from unlabeled samples and use them to identify the bias…
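The two steps named in the abstract, finding a bias direction and steering activations away from it, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the abstract is cut off before it says how the direction is identified, so the sketch uses the top principal component of the demonstration activations as one plausible choice, and synthetic random vectors stand in for real model activations. The names `bias_direction`, `steer_away`, and `strength` are hypothetical and are not taken from the paper.

```python
import numpy as np


def bias_direction(demo_activations: np.ndarray) -> np.ndarray:
    """Estimate a unit-norm bias direction from demonstration activations.

    Assumption: the direction is the top principal component of the
    mean-centered activations (rows = demonstrations, columns = features).
    """
    centered = demo_activations - demo_activations.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def steer_away(activation: np.ndarray, direction: np.ndarray,
               strength: float = 1.0) -> np.ndarray:
    """Subtract the component of `activation` that lies along `direction`.

    With strength=1.0 the component is removed entirely; smaller values
    remove only part of it. Works for one vector or a batch of vectors.
    """
    activation = np.asarray(activation, dtype=float)
    coeff = activation @ direction  # projection coefficient(s)
    return activation - strength * np.expand_dims(coeff, -1) * direction


# Synthetic stand-in for activations collected on bias demonstrations:
# most of the variance lies along one hidden direction (`true_dir`).
rng = np.random.default_rng(0)
dim = 8
true_dir = np.zeros(dim)
true_dir[0] = 1.0
demos = (0.1 * rng.normal(size=(200, dim))
         + np.outer(3.0 * rng.normal(size=200), true_dir))

direction = bias_direction(demos)
steered = steer_away(demos, direction)
```

After steering with `strength=1.0`, the activations have no remaining component along the estimated direction. In a real model the same subtraction would be applied to intermediate activations at inference time; which layers or heads to modify is not stated in the visible part of the abstract.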
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems
