Data Expansion using Back Translation and Paraphrasing for Hate Speech Detection

Djamila Romaissa Beddiar; Md Saroar Jahan; Mourad Oussalah

arXiv:2106.04681·cs.CL·June 10, 2021

TL;DR

This paper introduces a data augmentation pipeline that combines back translation and paraphrasing to improve hate speech detection models. It is evaluated on five datasets and reports improved classification performance.

Contribution

It presents a novel deep learning data augmentation method using back translation and paraphrasing to enhance hate speech classification accuracy.

Findings

01. Improved classification performance on five datasets.

02. Back translation and paraphrasing increase data diversity.

03. Comparison shows effectiveness over existing methods.

Abstract

With the proliferation of user-generated content on social media platforms, establishing mechanisms to automatically identify toxic and abusive content has become a prime concern for regulators, researchers, and society. Keeping the balance between freedom of speech and respect for each other's dignity is a major concern of social media platform regulators. Although automatic detection of offensive content using deep learning approaches seems to provide encouraging results, training deep learning-based models requires large amounts of high-quality labeled data, which is often missing. In this regard, we present in this paper a new deep learning-based method that fuses a Back Translation method and a Paraphrasing technique for data augmentation. Our pipeline investigates different word-embedding-based architectures for classification of hate speech. The back translation technique relies on an…
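To make the back translation idea concrete, the following is a minimal sketch of how such an augmentation step could look. It is not the authors' implementation: the abstract is truncated before it names the translation model, so the translators here are hypothetical toy dictionaries (`toy_en_to_fr`, `toy_fr_to_en`) standing in for a real machine translation system, and the function names `back_translate` and `augment` are illustrative. The sketch also assumes that a back-translated sentence keeps the label of its source sentence.

```python
from typing import Callable, List, Tuple

# A translator is any function mapping a sentence to a sentence.
Translator = Callable[[str], str]


def back_translate(text: str, to_pivot: Translator, from_pivot: Translator) -> str:
    """Translate text into a pivot language and back into the source language."""
    return from_pivot(to_pivot(text))


def augment(
    samples: List[Tuple[str, int]],
    to_pivot: Translator,
    from_pivot: Translator,
) -> List[Tuple[str, int]]:
    """Return the original samples plus any new back-translated variants.

    Assumption: the variant inherits the label of the sentence it came from.
    Variants identical to an already known sentence are skipped.
    """
    augmented = list(samples)
    seen = {text for text, _ in samples}
    for text, label in samples:
        candidate = back_translate(text, to_pivot, from_pivot)
        if candidate not in seen:
            seen.add(candidate)
            augmented.append((candidate, label))
    return augmented


# Hypothetical toy translators: word-level dictionaries used only so the
# example runs without an external model. A real pipeline would call a
# machine translation system here.
_EN_TO_FR = {"awful": "affreux", "people": "gens"}
_FR_TO_EN = {"affreux": "dreadful", "gens": "people"}


def toy_en_to_fr(text: str) -> str:
    return " ".join(_EN_TO_FR.get(word, word) for word in text.split())


def toy_fr_to_en(text: str) -> str:
    return " ".join(_FR_TO_EN.get(word, word) for word in text.split())


if __name__ == "__main__":
    data = [("awful people", 1), ("hello", 0)]
    for text, label in augment(data, toy_en_to_fr, toy_fr_to_en):
        print(label, text)
```

Because the round trip through the pivot language rarely returns the exact input, each labeled sentence can yield a differently worded variant, which is the source of the added data diversity. Sentences that come back unchanged add nothing and are dropped by the `seen` check.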

Taxonomy

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory