Enhanced Offensive Language Detection Through Data Augmentation
Ruibo Liu, Guangxuan Xu, Soroush Vosoughi

TL;DR
This paper introduces Dager, a data augmentation method using GPT-2 to improve offensive language detection on imbalanced datasets, significantly boosting classifier performance across multiple models.
Contribution
The paper presents Dager, a novel generation-based data augmentation technique that enhances offensive language detection, especially in low-resource and imbalanced datasets, demonstrating classifier-agnostic effectiveness.
Findings
Dager increases the F1 score by 11% when training on only 1% of the data.
Generated data maintains label integrity effectively.
Universal improvement across different classifiers.
Abstract
Detecting offensive language on social media is an important task. The ICWSM-2020 Data Challenge Task 2 is aimed at identifying offensive content using a crowd-sourced dataset containing 100k labelled tweets. The dataset, however, suffers from class imbalance, where certain labels are extremely rare compared with other classes (e.g., the hateful class is only 5% of the data). In this work, we present Dager (Data Augmenter), a generation-based data augmentation method that improves the performance of classification on imbalanced and low-resource data such as the offensive language dataset. Dager extracts the lexical features of a given class, and uses these features to guide the generation of a conditional generator built on GPT-2. The generated text can then be added to the training set as augmentation data. We show that applying Dager can increase the F1 score of the data challenge by…
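The pipeline described in the abstract (extract class-specific lexical features, generate text guided by them, add the generated text to the training set) can be illustrated with a minimal Python sketch. The abstract does not say how the lexical features are computed, so the smoothed frequency ratio used here is an assumption, not the authors' method. The names `tokenize`, `class_keywords`, and `augment` are hypothetical, and the GPT-2 generation step is left out: `generated` stands in for whatever text a conditional generator would return.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def class_keywords(texts, labels, target, top_k=3, smoothing=1.0):
    """Rank words by how much more frequent they are in `target` than elsewhere.

    Assumption: a smoothed relative-frequency ratio stands in for the
    lexical feature extraction mentioned in the abstract.
    """
    in_counts, out_counts = Counter(), Counter()
    for text, label in zip(texts, labels):
        (in_counts if label == target else out_counts).update(tokenize(text))

    vocab_size = len(set(in_counts) | set(out_counts))
    in_total = sum(in_counts.values()) + smoothing * vocab_size
    out_total = sum(out_counts.values()) + smoothing * vocab_size

    def ratio(word):
        p_in = (in_counts[word] + smoothing) / in_total
        p_out = (out_counts[word] + smoothing) / out_total
        return p_in / p_out

    # Highest ratio first; ties are broken alphabetically for determinism.
    return sorted(in_counts, key=lambda w: (-ratio(w), w))[:top_k]


def augment(texts, labels, generated, target):
    """Append generated samples to the training set under the target label."""
    return texts + generated, labels + [target] * len(generated)


# Toy data for illustration only; not drawn from the challenge dataset.
texts = [
    "you are awful and stupid",
    "have a nice day",
    "stupid awful take",
    "nice weather today",
]
labels = ["offensive", "normal", "offensive", "normal"]

keywords = class_keywords(texts, labels, "offensive", top_k=2)
# `generated` is a placeholder for output from a keyword-guided generator.
generated = ["what an awful, stupid idea"]
aug_texts, aug_labels = augment(texts, labels, generated, "offensive")
```

In this sketch the keywords would be passed to the generator as guidance, and the augmented lists would replace the original training set when fitting any classifier, which is consistent with the classifier-agnostic claim above.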
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Spam and Phishing Detection · Internet Traffic Analysis and Secure E-voting
Methods: Linear Layer · Cosine Annealing · Linear Warmup With Cosine Annealing · Byte Pair Encoding · Discriminative Fine-Tuning · Linear Warmup With Linear Decay · WordPiece · Residual Connection · Multi-Head Attention · Adam
