On the Challenges of Building Datasets for Hate Speech Detection
Vitthal Bhandari

TL;DR
This paper examines why reliable hate speech datasets are difficult to create, highlights the subjectivity of the task, and proposes a comprehensive framework for improving dataset quality and consistency in hate speech detection, using hate speech towards sexual minorities as its working example.
Contribution
The paper introduces a holistic framework for hate speech dataset creation that addresses key challenges and guides practitioners towards more reliable and generalizable datasets.
Findings
Identifies key issues in hate speech dataset creation
Proposes a seven-dimension framework for data collection
Provides best practices for future dataset development
Abstract
Detection of hate speech has been formulated as a standalone application of NLP and different approaches have been adopted for identifying the target groups, obtaining raw data, defining the labeling process, choosing the detection algorithm, and evaluating the performance in the desired setting. However, unlike other downstream tasks, hate speech suffers from the lack of large-sized, carefully curated, generalizable datasets owing to the highly subjective nature of the task. In this paper, we first analyze the issues surrounding hate speech detection through a data-centric lens. We then outline a holistic framework to encapsulate the data creation pipeline across seven broad dimensions by taking the specific example of hate speech towards sexual minorities. We posit that practitioners would benefit from following this framework as a form of best practice when creating hate speech…
Taxonomy
Topics: Hate Speech and Cyberbullying Detection
