Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment

Atharva Kulkarni; Sarah Masud; Vikram Goyal; Tanmoy Chakraborty

arXiv:2306.01105 · cs.CL · June 16, 2023 · 1 citation

TL;DR

This paper introduces GOTHate, a large, diverse, code-mixed hate speech dataset from Twitter, and proposes HEN-mBERT, a multilingual model leveraging endogenous signals to improve hate speech detection in real-world scenarios.

Contribution

The paper presents GOTHate, a novel, neutrally-seeded, multilingual hate speech dataset, and introduces HEN-mBERT, a modular model that incorporates endogenous signals for enhanced detection.

Findings

01. GOTHate is more challenging to classify than existing datasets.

02. Adding endogenous signals improves hate speech detection accuracy.

03. HEN-mBERT outperforms baseline models by 2.5% in macro-F1 and 5% in hate class F1.
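
The findings report two metrics, macro-F1 and hate class F1. As a rough illustration of how the two differ, the sketch below computes per-class F1 and their unweighted mean on a toy example. The label names and predictions are invented for illustration only; they are not GOTHate data, and this is not the authors' evaluation code.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, treating each label one-vs-rest."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical toy labels (assumption): a majority "neutral" class
# and two minority classes.
y_true = ["hate", "hate", "neutral", "neutral", "neutral", "neutral",
          "offensive", "offensive"]
y_pred = ["neutral", "hate", "neutral", "neutral", "neutral", "neutral",
          "offensive", "neutral"]

scores = per_class_f1(y_true, y_pred)
overall = macro_f1(y_true, y_pred)
print(f"hate-class F1: {scores['hate']:.3f}")
print(f"macro-F1:      {overall:.3f}")
```

Because macro-F1 averages the classes without weighting by frequency, a gain on a minority class such as hate moves the overall score more than it would under accuracy, which may be why both numbers are reported side by side.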

Abstract

Social media is awash with hateful content, much of which is often veiled with linguistic and topical diversity. The benchmark datasets used for hate speech detection do not account for such divagation as they are predominantly compiled using hate lexicons. However, capturing hate signals becomes challenging in neutrally-seeded malicious content. Thus, designing models and datasets that mimic the real-world variability of hate warrants further investigation. To this end, we present GOTHate, a large-scale code-mixed crowdsourced dataset of around 51k posts for hate speech detection from Twitter. GOTHate is neutrally seeded, encompassing different languages and topics. We conduct detailed comparisons of GOTHate with the existing hate speech datasets, highlighting its novelty. We benchmark it with 10 recent baselines. Our extensive empirical and benchmarking experiments suggest that…

Code & Models

Repositories

lcs2-iiitd/gothate
PyTorch · Official

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Social Media and Politics · Internet Traffic Analysis and Secure E-voting