Fight Fire with Fire: Fine-tuning Hate Detectors using Large Samples of Generated Hate Speech
Tomer Wullach, Amir Adler, Einat Minkov

TL;DR
This paper improves hate speech detection by generating synthetic hate speech with GPT and using it to fine-tune large pretrained language models, which yields better generalization within and across datasets.
Contribution
The study demonstrates that synthetic data generated by GPT can enhance hate speech classifiers more effectively than additional out-of-domain, and sometimes also within-domain, human-labeled data.
Findings
Synthetic data improves model generalization significantly and consistently, both within and across data distributions.
Generated hate speech is more effective than out-of-domain human-labeled data.
Fine-tuning with generated data benefits multiple pretrained models.
Abstract
Automatic hate speech detection is hampered by the scarcity of labeled datasets, leading to poor generalization. We employ pretrained language models (LMs) to alleviate this data bottleneck. We utilize the GPT LM for generating large amounts of synthetic hate speech sequences from available labeled examples, and leverage the generated data in fine-tuning large pretrained LMs on hate detection. An empirical study using the models of BERT, RoBERTa and ALBERT shows that this approach improves generalization significantly and consistently within and across data distributions. In fact, we find that generating relevant labeled hate speech sequences is preferable to using out-of-domain, and sometimes also within-domain, human-labeled examples.
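The data flow described in the abstract (labeled examples seed a generator, and the generated sequences are added to the training set before fine-tuning) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `toy_generator` stand-in for a GPT sampler, the placeholder texts, and the assumption that each generated sequence inherits the label of the example it was generated from are all hypothetical.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (text, label); assumed convention: 1 = hate, 0 = non-hate


def augment_with_generated(
    labeled: List[Example],
    generate: Callable[[str, int], List[str]],
    per_seed: int = 2,
    seed: int = 0,
) -> List[Example]:
    """Return the labeled examples plus generated ones.

    Assumption: a generated sequence inherits the label of its seed example.
    """
    seen = {text for text, _ in labeled}
    augmented = list(labeled)
    for text, label in labeled:
        for candidate in generate(text, per_seed):
            candidate = candidate.strip()
            # Skip empty outputs and exact duplicates of existing texts.
            if candidate and candidate not in seen:
                seen.add(candidate)
                augmented.append((candidate, label))
    # Shuffle deterministically so generated and original examples are mixed.
    random.Random(seed).shuffle(augmented)
    return augmented


def toy_generator(prompt: str, n: int) -> List[str]:
    """Hypothetical stand-in for a GPT sampler; returns placeholder variants."""
    return [f"{prompt} [variant {i}]" for i in range(n)]


train = [("placeholder offensive post", 1), ("placeholder neutral post", 0)]
augmented = augment_with_generated(train, toy_generator, per_seed=2)
```

In practice, `toy_generator` would be replaced by sampling from a generative LM, and `augmented` would be passed to whatever fine-tuning routine is used for the detector (BERT, RoBERTa, or ALBERT in the study).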
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Cosine Annealing · Linear Warmup With Linear Decay · Byte Pair Encoding · Weight Decay · Discriminative Fine-Tuning · Adam · Residual Connection · LAMB
