Hate Speech Detection in Limited Data Contexts using Synthetic Data Generation
Aman Khullar, Daniel Nkemelu, Cuong V. Nguyen, Michael L. Best

TL;DR
This paper introduces a synthetic data generation approach to improve hate speech detection in low-resource languages, enabling effective model training with limited data by transferring hate targets from high-resource language examples.
Contribution
The paper presents three novel methods for synthesizing hate speech data in low-resource languages using high-resource language examples, enhancing hate speech detection where data is scarce.
Findings
Synthetic data improves model performance in low-resource languages.
Models trained on synthetic data can outperform those trained on limited real data.
The approach enables hate speech detection in languages with minimal existing data.
Abstract
A growing body of work has focused on text classification methods for detecting the increasing amount of hate speech posted online. This progress has been limited to a select number of highly resourced languages, causing detection systems to either underperform or not exist in limited data contexts. This is largely due to a lack of training data, which is expensive to collect and curate in these settings. In this work, we propose a data augmentation approach that addresses the lack of data for online hate speech detection in limited data contexts using synthetic data generation techniques. Given a handful of hate speech examples in a high-resource language such as English, we present three methods to synthesize new examples of hate speech data in a target language that retains the hate sentiment in the original examples but transfers the hate targets. We apply our…
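The idea of keeping the sentiment of an example while swapping its target can be illustrated with a minimal sketch. This is not one of the paper's three methods, which this summary does not spell out; it is a plain string-substitution toy that also skips the translation into the target language. The function names (`transfer_target`, `synthesize`) and the neutral placeholder tokens (`GROUP_A`, `GROUP_X`, `GROUP_Y`) are hypothetical and chosen only for illustration.

```python
import re


def transfer_target(sentence, source_targets, new_target):
    """Replace every mention of a source-context target with new_target.

    Assumption: targets are given as plain word lists; the paper may
    identify them differently.
    """
    # Longest first, so multi-word targets match before their substrings.
    ordered = sorted(source_targets, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    # A function replacement avoids backslash interpretation in new_target.
    return pattern.sub(lambda match: new_target, sentence)


def synthesize(examples, source_targets, new_targets):
    """Build one synthetic example per (sentence, new target) pair.

    Sentences that mention no known source target are skipped, because
    there is nothing to transfer.
    """
    synthetic = []
    for sentence in examples:
        for target in new_targets:
            candidate = transfer_target(sentence, source_targets, target)
            if candidate != sentence:
                synthetic.append(candidate)
    return synthetic


# Neutral placeholders stand in for real target mentions.
seed_examples = ["I cannot stand GROUP_A people", "no target mentioned here"]
generated = synthesize(seed_examples, ["GROUP_A"], ["GROUP_X", "GROUP_Y"])
```

With the placeholder inputs above, only the first seed sentence yields synthetic examples, one per new target; the second is dropped because it contains no known target mention.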
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Internet Traffic Analysis and Secure E-voting
