An Empirical Investigation of Learning from Biased Toxicity Labels
Neel Nanda, Jonathan Uesato, Sven Gowal

TL;DR
This paper investigates how to effectively train toxicity prediction models using a small set of high-quality human labels and a large, biased synthetic dataset, balancing accuracy and fairness.
Contribution
It introduces and evaluates training strategies that leverage both small high-quality and large biased datasets for toxicity prediction, highlighting trade-offs between accuracy and fairness.
Findings
Initial training on all of the data, followed by fine-tuning on the clean data, yields the highest AUC.
No single strategy optimizes all fairness metrics.
Different strategies balance accuracy and fairness differently.
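The findings above compare models by AUC and by fairness metrics. As a rough illustration only, the sketch below computes AUC from scratch and then restricts it to comments that mention an identity group. The restricted version is one common style of fairness check for toxicity classifiers, and it is an assumption here: this summary does not say which fairness metrics the paper uses. The names `auc`, `subgroup_auc`, and `in_group` are hypothetical.

```python
import numpy as np

def auc(labels, scores):
    """Rank-based AUC: chance that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a correct ranking.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels]
    neg = scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC needs at least one positive and one negative example")
    # All positive/negative pairs at once; fine for small illustrative inputs.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def subgroup_auc(labels, scores, in_group):
    """AUC restricted to examples flagged as mentioning an identity group.

    Assumption: `in_group` is a boolean mask supplied by the caller.
    """
    mask = np.asarray(in_group, dtype=bool)
    return auc(np.asarray(labels)[mask], np.asarray(scores)[mask])

# Toy data (invented for illustration): 1 = toxic, 0 = non-toxic.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.7, 0.8, 0.1]
in_group = [False, False, True, True, False, False]

overall = auc(labels, scores)                   # about 0.889 (8 of 9 pairs ranked correctly)
group = subgroup_auc(labels, scores, in_group)  # 0.0 (the one in-group pair is misranked)
```

A model can rank well overall while ranking poorly inside a subgroup, which is one way a single accuracy figure can hide the kind of trade-off the findings describe.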
Abstract
Collecting annotations from human raters often results in a trade-off between the quantity of labels one wishes to gather and the quality of these labels. As such, it is often only possible to gather a small amount of high-quality labels. In this paper, we study how different training strategies can leverage a small dataset of human-annotated labels and a large but noisy dataset of synthetically generated labels (which exhibit bias against identity groups) for predicting toxicity of online comments. We evaluate the accuracy and fairness properties of these approaches, and trade-offs between the two. While we find that initial training on all of the data and fine-tuning on clean data produces models with the highest AUC, we find that no single strategy performs best across all fairness metrics.
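The strategy the abstract reports as giving the highest AUC (initial training on all of the data, then fine-tuning on the clean data) can be sketched as a two-stage loop. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a bias-free logistic regression in NumPy as a stand-in model, and the function names, learning rates, and step counts are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, w=None, lr=0.1, steps=500):
    """Full-batch gradient descent on the logistic loss.

    If `w` is given, training continues from a copy of those weights,
    which is what makes the second stage a fine-tuning step.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1]) if w is None else np.array(w, dtype=float)
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * grad
    return w

def pretrain_then_finetune(X_noisy, y_noisy, X_clean, y_clean):
    """Stage 1: train on noisy + clean data. Stage 2: fine-tune on clean only."""
    X_all = np.vstack([X_noisy, X_clean])
    y_all = np.concatenate([y_noisy, y_clean])
    w = train_logreg(X_all, y_all)
    # Assumed hyperparameters: a smaller learning rate and fewer steps, so the
    # small clean set adjusts the pre-trained weights rather than replacing them.
    return train_logreg(X_clean, y_clean, w=w, lr=0.01, steps=100)
```

In this sketch, `X_noisy, y_noisy` would hold the large synthetically labeled set and `X_clean, y_clean` the small human-annotated set. Other strategies (for example, training on the clean data alone) can be expressed by calling `train_logreg` directly on the relevant subset.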
Taxonomy
Topics: Machine Learning and Data Classification · Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection
