A Comprehensive Study on NLP Data Augmentation for Hate Speech Detection: Legacy Methods, BERT, and LLMs

Md Saroar Jahan; Mourad Oussalah; Djamila Romaissa Beddia; Jhuma kabir; Mim; Nabil Arhab

arXiv:2404.00303·cs.CL·April 2, 2024·2 cites

Open Access

TL;DR

This study evaluates various NLP data augmentation techniques for hate speech detection, highlighting the effectiveness of BERT-based filtering and GPT-3, and providing insights into their impact on model performance and label integrity.

Contribution

It introduces a BERT-based cosine similarity filtration method to reduce label alteration and compares it with traditional and LLM-based augmentation techniques in hate speech detection.
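The filtration idea can be illustrated with a minimal sketch: embed the original sentence and each augmented candidate, then keep only candidates whose cosine similarity to the original meets a threshold. This is not the authors' implementation. The names `toy_embed` and `filter_augmented`, the bag-of-words embedding (a hypothetical stand-in for contextual BERT embeddings), and the threshold value are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for a BERT sentence embedding.

    Uses lowercase bag-of-words counts so the sketch runs without a model.
    In practice this would be replaced by a contextual encoder.
    """
    return dict(Counter(text.lower().split()))


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_augmented(
    original: str,
    candidates: List[str],
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.8,  # assumed value, not taken from the paper
) -> List[str]:
    """Keep only candidates that stay close to the original sentence."""
    reference = embed(original)
    return [
        candidate
        for candidate in candidates
        if cosine_similarity(reference, embed(candidate)) >= threshold
    ]


if __name__ == "__main__":
    kept = filter_augmented(
        "you people are awful",
        ["you people are terrible", "you folks are lovely friends"],
        threshold=0.7,
    )
    print(kept)
```

Passing a different `embed` callable, for example one that wraps a pretrained encoder, leaves the filtering logic unchanged; only the similarity scores, and therefore a sensible threshold, would differ.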

Findings

1. BERT-based filtration reduces label alteration to 0.05%.

2. GPT-3 augmentation increases data size sevenfold and improves F1 score.

3. Traditional methods like back-translation have low label change rates (0.3-1.5%).

Abstract

The surge of interest in data augmentation within the realm of NLP has been driven by the need to address challenges posed by hate speech domains, the dynamic nature of social media vocabulary, and the demands of large-scale neural networks requiring extensive training data. However, the prevalent use of lexical substitution in data augmentation has raised concerns, as it may inadvertently alter the intended meaning, thereby impacting the efficacy of supervised machine learning models. In pursuit of suitable data augmentation methods, this study explores both established legacy approaches and contemporary practices such as Large Language Models (LLMs), including GPT, in hate speech detection. Additionally, we propose an optimized utilization of BERT-based encoder models with contextual cosine similarity filtration, exposing significant limitations in prior synonym substitution methods.…

Taxonomy

Topics: Hate Speech and Cyberbullying Detection

Methods: Attention Is All You Need · Linear Layer · Dropout · Layer Normalization · Multi-Head Attention · Weight Decay · Adam