On the Challenges of Building Datasets for Hate Speech Detection

Vitthal Bhandari

arXiv:2309.02912 · cs.CL · September 7, 2023 · 1 citation

Open Access

TL;DR

This paper examines why reliable hate speech datasets are hard to build, chiefly because the task is highly subjective, and proposes a seven-dimension framework for the dataset creation pipeline, illustrated with the example of hate speech towards sexual minorities.

Contribution

It introduces a holistic framework for hate speech dataset creation, addressing key challenges and guiding practitioners to develop more reliable and generalizable datasets.

Findings

01 Identifies key issues in hate speech dataset creation

02 Proposes a seven-dimension framework for data collection

03 Provides best practices for future dataset development

Abstract

Detection of hate speech has been formulated as a standalone application of NLP and different approaches have been adopted for identifying the target groups, obtaining raw data, defining the labeling process, choosing the detection algorithm, and evaluating the performance in the desired setting. However, unlike other downstream tasks, hate speech suffers from the lack of large-sized, carefully curated, generalizable datasets owing to the highly subjective nature of the task. In this paper, we first analyze the issues surrounding hate speech detection through a data-centric lens. We then outline a holistic framework to encapsulate the data creation pipeline across seven broad dimensions by taking the specific example of hate speech towards sexual minorities. We posit that practitioners would benefit from following this framework as a form of best practice when creating hate speech…

Taxonomy

Topics: Hate Speech and Cyberbullying Detection