Unsupervised Concept Vector Extraction for Bias Control in LLMs
Hannah Cyberey, Yangfeng Ji, David Evans

TL;DR
This paper introduces an unsupervised method for extracting and manipulating concept representations in large language models, and uses it to reduce gender and racial bias without labeled data.
Contribution
The paper proposes an unsupervised, probability-weighted technique for extracting concept vectors and a projection-based steering method for mitigating bias in LLMs.
Findings
Effective gender bias mitigation demonstrated
Method generalizes to racial bias
Code available for replication
Abstract
Large language models (LLMs) are known to perpetuate stereotypes and exhibit biases. Various strategies have been proposed to mitigate these biases, but most work studies biases as a black-box problem without considering how concepts are represented within the model. We adapt techniques from representation engineering to study how the concept of "gender" is represented within LLMs. We introduce a new method that extracts concept representations via probability weighting without labeled data and efficiently selects a steering vector for measuring and manipulating the model's representation. We develop a projection-based method that enables precise steering of model predictions, demonstrate its effectiveness in mitigating gender bias in LLMs, and show that it also generalizes to racial bias. Our code is available at: https://github.com/hannahxchen/gender-bias-steering
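The two ideas named in the abstract, probability-weighted extraction and projection-based steering, can be illustrated with a toy sketch. This is not the authors' implementation (see their repository for that). It assumes, purely for illustration, that each input comes with a hidden-state vector and a signed score derived from the model's own output probabilities, that the concept vector is the difference of the two score-weighted means, and that steering overwrites the activation's component along that vector. The function names `extract_concept_vector` and `steer`, and the toy numbers, are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def weighted_mean(vectors, weights):
    """Weighted average of vectors; weights are non-negative with a positive sum."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * vec[i] for vec, w in zip(vectors, weights)) / total
            for i in range(dim)]


def extract_concept_vector(activations, scores):
    """Assumed recipe: difference of score-weighted means, scaled to unit length.

    scores[i] > 0 means the model's output leans toward one pole of the
    concept, scores[i] < 0 toward the other. The scores come from model
    probabilities, so no human labels are needed. Requires at least one
    positive and one negative score.
    """
    pos = [(h, s) for h, s in zip(activations, scores) if s > 0]
    neg = [(h, -s) for h, s in zip(activations, scores) if s < 0]
    mean_pos = weighted_mean([h for h, _ in pos], [w for _, w in pos])
    mean_neg = weighted_mean([h for h, _ in neg], [w for _, w in neg])
    diff = [a - b for a, b in zip(mean_pos, mean_neg)]
    norm = math.sqrt(dot(diff, diff))
    return [x / norm for x in diff]


def steer(h, v, target=0.0):
    """Replace the component of h along the unit vector v with `target`."""
    coeff = dot(h, v)  # scalar projection of h onto v
    return [x - coeff * vi + target * vi for x, vi in zip(h, v)]


# Toy data: two 2-dimensional "activations" with opposite-signed scores.
activations = [[2.0, 1.0], [-2.0, 1.0]]
scores = [0.8, -0.6]

v = extract_concept_vector(activations, scores)
steered = steer([3.0, 5.0], v)               # neutralize the concept direction
shifted = steer([3.0, 5.0], v, target=1.5)   # move to a chosen projection
```

Under these assumptions, the scalar projection `dot(h, v)` doubles as a measurement of how strongly an activation expresses the concept, and `steer` changes only that component while leaving the orthogonal part of the activation untouched.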
Taxonomy
Topics: Topic Modeling · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)
