Unsupervised Concept Vector Extraction for Bias Control in LLMs

Hannah Cyberey; Yangfeng Ji; David Evans

arXiv:2502.19721 · cs.CL · September 19, 2025

TL;DR

This paper introduces an unsupervised method for extracting and manipulating concept representations in large language models, reducing gender and racial bias without labeled data.

Contribution

The paper proposes an unsupervised technique that extracts concept vectors via probability weighting, together with a projection-based steering method for mitigating bias in LLMs.

Findings

01 Effective gender bias mitigation demonstrated

02 Method generalizes to racial bias

03 Code available for replication

Abstract

Large language models (LLMs) are known to perpetuate stereotypes and exhibit biases. Various strategies have been proposed to mitigate these biases, but most work studies biases as a black-box problem without considering how concepts are represented within the model. We adapt techniques from representation engineering to study how the concept of "gender" is represented within LLMs. We introduce a new method that extracts concept representations via probability weighting without labeled data and efficiently selects a steering vector for measuring and manipulating the model's representation. We develop a projection-based method that enables precise steering of model predictions, demonstrate its effectiveness in mitigating gender bias in LLMs, and show that it also generalizes to racial bias. Our code is available at: https://github.com/hannahxchen/gender-bias-steering
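
The abstract names two ingredients, probability-weighted extraction and projection-based steering, without giving formulas. The sketch below is one plausible reading and not the authors' implementation: it assumes the concept direction is the normalized difference between two probability-weighted means of hidden activations, and that steering sets an activation's scalar projection onto that direction to a chosen target. The function names, the toy activations, and the probabilities `p_a` and `p_b` are all hypothetical.

```python
from math import sqrt


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def weighted_mean(vectors, weights):
    """Mean of `vectors`, each weighted by the matching entry of `weights`."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[d] for v, w in zip(vectors, weights)) / total
            for d in range(dim)]


def concept_direction(activations, p_a, p_b):
    """Unit vector pointing from the p_b-weighted mean to the p_a-weighted mean.

    ASSUMPTION: p_a[i] and p_b[i] are the model's own probabilities of
    producing concept-A vs. concept-B outputs for prompt i, so no labels
    are needed. The paper's exact weighting scheme may differ.
    """
    mu_a = weighted_mean(activations, p_a)
    mu_b = weighted_mean(activations, p_b)
    diff = [a - b for a, b in zip(mu_a, mu_b)]
    norm = sqrt(dot(diff, diff))
    if norm == 0:
        raise ValueError("weighted means coincide; no direction to extract")
    return [d / norm for d in diff]


def steer(h, direction, target=0.0):
    """Shift `h` along `direction` so its scalar projection equals `target`.

    ASSUMPTION: `direction` has unit length. target=0.0 removes the
    component along the concept direction entirely.
    """
    coeff = dot(h, direction)
    return [x + (target - coeff) * u for x, u in zip(h, direction)]


# Toy 2-D activations for four prompts (hypothetical values).
activations = [[2.0, 1.0], [-2.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
p_a = [0.9, 0.1, 0.8, 0.2]  # hypothetical probability of concept-A output
p_b = [0.1, 0.9, 0.2, 0.8]  # hypothetical probability of concept-B output

direction = concept_direction(activations, p_a, p_b)
steered = steer(activations[0], direction, target=0.0)
shifted = steer(activations[0], direction, target=0.5)
```

In this toy setup, the projection of `steered` onto `direction` is zero up to floating-point error, while the component orthogonal to the direction is left unchanged. In a real model the same arithmetic would be applied to hidden states at a chosen layer, which the linked repository handles in PyTorch.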

Code & Models

Repositories

hannahxchen/gender-bias-steering (PyTorch, official)

Videos

Unsupervised Concept Vector Extraction for Bias Control in LLMs (Underline)

Taxonomy

Topics: Topic Modeling · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)