Empirical Evaluation of Public HateSpeech Datasets

Sadar Jaf; Basel Barakat

arXiv:2407.12018·cs.CL·July 18, 2024

TL;DR

This paper empirically evaluates public hate speech datasets, revealing their limitations and offering insights for training more accurate hate speech detection models.

Contribution

It offers a comprehensive analysis of existing datasets, highlighting their weaknesses and guiding future improvements for hate speech classification.

Findings

01. Current datasets have significant limitations that reduce model accuracy

02. Statistical analyses reveal specific dataset weaknesses

03. The paper offers recommendations for developing better hate speech datasets

Abstract

Despite the extensive communication benefits offered by social media platforms, numerous challenges must be addressed to ensure user safety. One of the most significant risks faced by users on these platforms is targeted hate speech. Social media platforms are widely utilised for generating datasets employed in training and evaluating machine learning algorithms for hate speech detection. However, existing public datasets exhibit numerous limitations, hindering the effective training of these algorithms and leading to inaccurate hate speech classification. This study provides a comprehensive empirical evaluation of several public datasets commonly used in automated hate speech classification. Through rigorous analysis, we present compelling evidence highlighting the limitations of current hate speech datasets. Additionally, we conduct a range of statistical analyses to elucidate the…

Taxonomy

Topics: Hate Speech and Cyberbullying Detection