Enhanced Offensive Language Detection Through Data Augmentation

Ruibo Liu; Guangxuan Xu; Soroush Vosoughi

arXiv:2012.02954·cs.CL·December 8, 2020·5 cites

TL;DR

This paper introduces Dager, a data augmentation method using GPT-2 to improve offensive language detection on imbalanced datasets, significantly boosting classifier performance across multiple models.

Contribution

The paper presents Dager, a novel generation-based data augmentation technique that enhances offensive language detection, especially in low-resource and imbalanced datasets, demonstrating classifier-agnostic effectiveness.

Findings

01

Dager increases the F1 score by 11% when only 1% of the data is used for training.

02

Generated text keeps the label of the class it was generated for.

03

The improvement holds across different classifiers.

Abstract

Detecting offensive language on social media is an important task. The ICWSM-2020 Data Challenge Task 2 is aimed at identifying offensive content using a crowd-sourced dataset containing 100k labelled tweets. The dataset, however, suffers from class imbalance, where certain labels are extremely rare compared with other classes (e.g., the hateful class is only 5% of the data). In this work, we present Dager (Data Augmenter), a generation-based data augmentation method that improves the performance of classification on imbalanced and low-resource data such as the offensive language dataset. Dager extracts the lexical features of a given class, and uses these features to guide the generation of a conditional generator built on GPT-2. The generated text can then be added to the training set as augmentation data. We show that applying Dager can increase the F1 score of the data challenge by…
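The abstract does not say how Dager extracts a class's lexical features, so the following is only a minimal sketch of one plausible reading: rank tokens by how much more frequent they are in the target class than in the rest of the data. The function names (`tokenize`, `class_keywords`, `class_distribution`), the smoothed log-ratio score, and the toy sentences are all assumptions made for illustration; none of them come from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase word tokenizer; a simplifying assumption, not the paper's preprocessing.
    return re.findall(r"[a-z']+", text.lower())


def class_distribution(labels):
    # Fraction of examples per label, useful for spotting class imbalance.
    total = len(labels)
    return {label: count / total for label, count in Counter(labels).items()}


def class_keywords(texts, labels, target, top_k=5):
    # Hypothetical feature extractor: tokens that are over-represented in `target`.
    in_class, out_class = Counter(), Counter()
    for text, label in zip(texts, labels):
        (in_class if label == target else out_class).update(tokenize(text))
    n_in, n_out = sum(in_class.values()), sum(out_class.values())
    vocab_size = len(set(in_class) | set(out_class))

    def score(token):
        # Add-one smoothed log-ratio of relative frequencies (an assumed scoring rule).
        p_in = (in_class[token] + 1) / (n_in + vocab_size)
        p_out = (out_class[token] + 1) / (n_out + vocab_size)
        return math.log(p_in / p_out)

    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(in_class, key=lambda token: (-score(token), token))
    return ranked[:top_k]


# Toy data with placeholder wording; real inputs would be labelled tweets.
texts = [
    "you are awful awful",
    "have a nice day",
    "nice weather today",
    "awful person you are",
]
labels = ["hateful", "normal", "normal", "hateful"]

print(class_distribution(labels))
print(class_keywords(texts, labels, "hateful", top_k=1))
```

In a pipeline of the kind the abstract describes, keywords like these would then condition a GPT-2-based generator, and the generated text would be appended to the training set under the target label. That generation step is not sketched here because the abstract gives no details about it.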

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Spam and Phishing Detection · Internet Traffic Analysis and Secure E-voting

Methods: Linear Layer · Cosine Annealing · Linear Warmup With Cosine Annealing · Byte Pair Encoding · Discriminative Fine-Tuning · Linear Warmup With Linear Decay · WordPiece · Residual Connection · Multi-Head Attention · Adam