Noisy Self-Training with Data Augmentations for Offensive and Hate Speech Detection Tasks

João A. Leite; Carolina Scarton; Diego F. Silva

arXiv:2307.16609 · cs.CL · August 1, 2023

TL;DR

This paper evaluates default self-training and noisy self-training with data augmentation for offensive and hate speech detection. Self-training improves performance, while the noisy variants can decrease it.

Contribution

It evaluates default and noisy self-training with three textual data augmentation techniques across five pre-trained BERT models for offensive and hate speech detection.

Findings

01

Self-training improves F1-macro scores by up to 1.5%.

02

Noisy self-training with textual data augmentations decreases performance relative to default self-training.

03

The gains from self-training are consistent across model sizes.
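
The findings above are reported in F1-macro, the unweighted mean of the per-class F1 scores. Because it weights the rare offensive class as heavily as the frequent non-offensive class, it suits imbalanced data of the kind the abstract describes. The following is a minimal sketch of the metric itself, not the authors' evaluation code; the function name `f1_macro` and the toy labels are assumptions made for illustration.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (made up): 0 = non-offensive, 1 = offensive
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(f1_macro(y_true, y_pred))  # class 0 F1 = 0.75, class 1 F1 = 0.5 -> 0.625
```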

Abstract

Online social media is rife with offensive and hateful comments, prompting the need for their automatic detection given the sheer amount of posts created every second. Creating high-quality human-labelled datasets for this task is difficult and costly, especially because non-offensive posts are significantly more frequent than offensive ones. However, unlabelled data is abundant, easier, and cheaper to obtain. In this scenario, self-training methods, using weakly-labelled examples to increase the amount of training data, can be employed. Recent "noisy" self-training approaches incorporate data augmentation techniques to ensure prediction consistency and increase robustness against noisy data and adversarial attacks. In this paper, we experiment with default and noisy self-training using three different textual data augmentation techniques across five different pre-trained BERT…
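
The self-training loop described in the abstract can be sketched roughly as follows. This is a generic illustration under stated assumptions, not the authors' pipeline: a toy word-count classifier stands in for a fine-tuned BERT model, and the confidence threshold, the number of rounds, and the `word_dropout` noise function are all hypothetical choices. In the noisy variant, the teacher labels the clean text and the student trains on an augmented copy.

```python
import random
from collections import Counter


def train(examples):
    """Toy classifier: per-class word counts (stand-in for fine-tuning BERT)."""
    counts = {0: Counter(), 1: Counter()}
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts


def predict_proba(model, text):
    """P(offensive) from add-one-smoothed word-count votes."""
    words = text.lower().split()
    pos = sum(model[1][w] for w in words) + 1
    neg = sum(model[0][w] for w in words) + 1
    return pos / (pos + neg)


def word_dropout(text, rng, p=0.2):
    """Hypothetical noise function: randomly delete words."""
    kept = [w for w in text.split() if rng.random() > p]
    return " ".join(kept) if kept else text


def self_train(labelled, unlabelled, rounds=2, threshold=0.7, augment=None):
    """Default self-training; pass `augment` for the noisy variant."""
    model = train(labelled)  # teacher: trained on human labels only
    pseudo = []
    for _ in range(rounds):
        pseudo = []
        for text in unlabelled:
            p = predict_proba(model, text)  # teacher predicts on clean text
            if p >= threshold or p <= 1 - threshold:  # keep confident predictions
                student_text = augment(text) if augment else text
                pseudo.append((student_text, int(p >= 0.5)))
        model = train(labelled + pseudo)  # student becomes the next teacher
    return model, pseudo


# Made-up toy data: 1 = offensive, 0 = non-offensive
labelled = [("you are an idiot", 1), ("idiot people everywhere", 1),
            ("have a nice day", 0), ("what a nice photo", 0)]
unlabelled = ["such an idiot", "nice day today", "hello there"]

model, pseudo = self_train(labelled, unlabelled)
rng = random.Random(0)
noisy_model, noisy_pseudo = self_train(
    labelled, unlabelled, augment=lambda t: word_dropout(t, rng))
```

In this toy run, "hello there" receives no pseudo-label because the teacher's probability for it is exactly 0.5, below the confidence threshold in both directions.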

Code & Models

Repositories

jaugusto97/offense-self-training
PyTorch · Official

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning

Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · Linear Layer · Dropout · WordPiece · Adam · Attention Dropout · Linear Warmup With Linear Decay