TL;DR
This paper introduces GECO, a gender-controlled text dataset, and GECOBench, an evaluation framework built on it, to quantify gender bias in explanations generated by XAI methods for NLP models and to show how fine-tuning affects that bias.
Contribution
It presents a novel gender-controlled dataset and benchmark for evaluating explanation bias in language models, and analyzes the impact of fine-tuning on bias reduction in feature attributions.
Findings
Fine-tuning reduces explanation bias in XAI methods.
Explanation performance improves with more fine-tuned layers.
GECOBench enables objective evaluation of bias in explanations.
Abstract
Large pre-trained language models have become a crucial backbone for many downstream tasks in natural language processing (NLP). Because they are trained on a plethora of data containing a variety of biases, such as gender biases, they have been shown to inherit such biases in their weights, potentially affecting their prediction behavior. However, it is unclear to what extent these biases also affect feature attributions generated by applying "explainable artificial intelligence" (XAI) techniques, possibly in unfavorable ways. To systematically study this question, we create a gender-controlled text dataset, GECO, in which the alteration of grammatical gender forms induces class-specific words and provides ground truth feature attributions for gender classification tasks. This enables an objective evaluation of the correctness of XAI methods. We apply this dataset to the…
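The idea of deriving ground truth attributions from altered gender forms can be sketched in a few lines. The following is a minimal illustration only, not the GECO construction pipeline: the word-pair table, the function names make_pair and top_k_precision, and the top-k precision score are all assumptions introduced here, and the paper may use different metrics.

```python
# Hypothetical sketch: the word pairs, function names, and metric below are
# illustrative assumptions, not taken from the GECO/GECOBench code.

# Assumed mapping from male to female grammatical forms.
GENDER_PAIRS = {"he": "she", "his": "her", "him": "her", "father": "mother"}


def make_pair(tokens):
    """Return the male sentence, its female counterpart, and a 0/1 mask.

    The mask marks the tokens that differ between the two versions; these
    are the only class-specific words, so they act as ground truth for a
    gender classification task.
    """
    female = [GENDER_PAIRS.get(t.lower(), t) for t in tokens]
    mask = [int(m != f) for m, f in zip(tokens, female)]
    return tokens, female, mask


def top_k_precision(attributions, mask):
    """Share of the k highest-magnitude attributions on ground truth tokens.

    k is the number of ground truth tokens in the sentence.
    """
    k = sum(mask)
    if k == 0:
        return 0.0
    ranked = sorted(
        range(len(attributions)),
        key=lambda i: abs(attributions[i]),
        reverse=True,
    )[:k]
    return sum(mask[i] for i in ranked) / k


male, female, mask = make_pair(["he", "loves", "his", "dog"])
# Attribution scores would come from an XAI method; these are made up.
score = top_k_precision([0.9, 0.1, -0.7, 0.2], mask)
```

Because the two versions of a sentence differ only in the swapped tokens, an explanation that places its largest attributions elsewhere can be scored as incorrect without any human judgment.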
Taxonomy
Methods: Generalized ELBO with Constrained Optimization
