Your fairness may vary: Pretrained language model fairness in toxic text classification
Ioana Baldini, Dennis Wei, Karthikeyan Natesan Ramamurthy, Mikhail Yurochkin, Moninder Singh

TL;DR
This paper shows that pretrained language models used for toxic text classification vary widely in fairness, that model size explains little of this variation, and that post-processing methods can improve fairness without retraining.
Contribution
It reveals the variability of fairness in pretrained language models across different sizes and initializations, and adapts post-processing fairness techniques from tabular data to NLP models.
Findings
Fairness varies more than accuracy with training data size and initialization.
Model size explains little of the fairness variation.
Post-processing methods improve fairness without retraining.
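To make the last finding concrete, the sketch below shows one generic way a post-processing step can change a fairness measure without touching the model: it keeps the classifier's scores fixed and only changes the decision threshold per group. This is an illustration under stated assumptions, not the paper's method; this summary does not name the post-processing techniques the authors adapt. The toy scores, labels, group names, thresholds, and all function names are hypothetical, and the false positive rate gap is just one possible fairness measure.

```python
from typing import Dict, List, Sequence


def false_positive_rate(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Fraction of non-toxic examples (label 0) that were flagged as toxic."""
    flagged = [p for y, p in zip(labels, preds) if y == 0]
    return sum(flagged) / len(flagged) if flagged else 0.0


def fpr_gap(labels: Sequence[int], preds: Sequence[int], groups: Sequence[str]) -> float:
    """Largest difference in false positive rate between any two groups."""
    rates = []
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        rates.append(false_positive_rate([labels[i] for i in idx],
                                         [preds[i] for i in idx]))
    return max(rates) - min(rates)


def predict(scores: Sequence[float], groups: Sequence[str],
            thresholds: Dict[str, float]) -> List[int]:
    """Turn fixed model scores into labels using a per-group threshold."""
    return [int(s >= thresholds[g]) for s, g in zip(scores, groups)]


# Hypothetical toy data: toxicity scores from an already trained classifier.
scores = [0.9, 0.6, 0.2, 0.1, 0.8, 0.55, 0.65, 0.3]
labels = [1, 0, 0, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

# Baseline: one shared threshold for everyone.
baseline_preds = predict(scores, groups, {"a": 0.5, "b": 0.5})
gap_before = fpr_gap(labels, baseline_preds, groups)

# Post-processing: raise the threshold for group "b" only (assumed values).
adjusted_preds = predict(scores, groups, {"a": 0.5, "b": 0.6})
gap_after = fpr_gap(labels, adjusted_preds, groups)

print(f"FPR gap before: {gap_before:.3f}, after: {gap_after:.3f}")
```

On this toy data the gap in false positive rates shrinks while the underlying scores stay the same, which is the sense in which post-processing avoids retraining. Note that this particular sketch assumes group membership is available at prediction time, which may not hold in practice.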
Abstract
The popularity of pretrained language models in natural language processing systems calls for a careful evaluation of such models in down-stream tasks, which have a higher potential for societal impact. The evaluation of such systems usually focuses on accuracy measures. Our findings in this paper call for attention to be paid to fairness measures as well. Through the analysis of more than a dozen pretrained language models of varying sizes on two toxic text classification tasks (English), we demonstrate that focusing on accuracy measures alone can lead to models with wide variation in fairness characteristics. Specifically, we observe that fairness can vary even more than accuracy with increasing training data size and different random initializations. At the same time, we find that little of the fairness variation is explained by model size, despite claims in the literature. To…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection
