Augmenting Bias Detection in LLMs Using Topological Data Analysis
Keshav Varadarajan, Tananun Songdechakraiwut

TL;DR
This paper introduces a topological data analysis method to identify specific attention heads in GPT-2 responsible for biases towards certain identity groups, aiding in targeted bias mitigation.
Contribution
It presents a novel application of topological data analysis to pinpoint bias-contributing heads in large language models, enhancing interpretability.
Findings
Biases are concentrated in specific attention heads.
The method can identify heads responsible for particular group biases.
The method could be extended to support bias mitigation strategies.
Abstract
Recently, many bias detection methods have been proposed to determine the level of bias a large language model captures. However, tests to identify which parts of a large language model are responsible for bias towards specific groups remain underdeveloped. In this study, we present a method using topological data analysis to identify which heads in GPT-2 contribute to the misrepresentation of identity groups present in the StereoSet dataset. We find that biases for particular categories, such as gender or profession, are concentrated in attention heads that act as hot spots. The metric we propose can also be used to determine which heads capture bias for a specific group within a bias category, and future work could extend this method to help de-bias large language models.
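The abstract does not spell out the topological metric, so the following is only a rough sketch of how a per-head topological comparison might look, not the authors' method. It assumes each head's attention matrix is treated as a weighted graph over tokens, summarizes it by its 0-dimensional persistence (the edge weights of the maximum spanning tree, at which connected components merge), and ranks heads by how much that summary differs between a stereotypical sentence and its anti-stereotypical counterpart. All function names, the `(layer, head)` keys, and the toy matrices are hypothetical.

```python
from itertools import combinations


def mst_death_weights(attn):
    """Sorted maximum-spanning-tree edge weights of a symmetrized attention graph.

    In a filtration that adds edges from strongest to weakest, these are the
    weights at which connected components merge (0-dimensional persistence).
    """
    n = len(attn)
    # Assumption: undirected edge weight = larger of the two directed attentions.
    edges = sorted(
        ((max(attn[i][j], attn[j][i]), i, j) for i, j in combinations(range(n), 2)),
        reverse=True,
    )
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    deaths = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # this edge merges two components
            parent[ri] = rj
            deaths.append(w)
    return sorted(deaths)


def barcode_distance(attn_a, attn_b):
    """Euclidean distance between the sorted merge weights of two graphs."""
    da, db = mst_death_weights(attn_a), mst_death_weights(attn_b)
    if len(da) != len(db):
        raise ValueError("this sketch assumes both sentences have the same token count")
    return sum((x - y) ** 2 for x, y in zip(da, db)) ** 0.5


def rank_heads(attn_stereo, attn_anti):
    """Order head keys from largest to smallest topological difference."""
    scores = {h: barcode_distance(attn_stereo[h], attn_anti[h]) for h in attn_stereo}
    return sorted(scores, key=scores.get, reverse=True)


# Toy causal attention matrices for two hypothetical heads, keyed by (layer, head).
same = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.3, 0.3, 0.4]]
stereo = {(0, 0): same, (0, 1): [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.1, 0.8, 0.1]]}
anti = {(0, 0): same, (0, 1): same}

ranking = rank_heads(stereo, anti)
print(ranking)  # head (0, 1) ranks first: only its attention graph changes
```

In practice the matrices would come from GPT-2's attention outputs on StereoSet sentence pairs, and scores would be aggregated over many pairs per bias category. Those steps are omitted here because the summary does not describe them.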
Taxonomy
Topics: Topological and Geometric Data Analysis · Advanced Graph Neural Networks · Machine Learning in Healthcare
