HS-BAN: A Benchmark Dataset of Social Media Comments for Hate Speech Detection in Bangla
Nauros Romim, Mosahed Ahmed, Md Saiful Islam, Arnab Sen Sharma, Hriteshwar Talukder, Mohammad Ruhul Amin

TL;DR
This paper introduces HS-BAN, a Bangla hate speech dataset of more than 50,000 labeled comments, and develops a neural network benchmark system for hate speech detection that reaches an 86.78% F1-score.
Contribution
The paper presents HS-BAN, a large annotated dataset for Bangla hate speech detection, and establishes a benchmark using neural networks with informal word embeddings.
Findings
Informal text-trained embeddings outperform formal text embeddings.
A Bi-LSTM with FastText embeddings trained on informal text achieves an 86.78% F1-score.
Dataset and benchmark system are publicly available.
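Since the headline result is reported as an F1-score, a minimal sketch of how that metric is computed for a binary hate / non-hate task may help. It assumes the score is taken for the hate class (label 1); this summary does not say which averaging the paper uses. The function name `binary_f1` and the toy labels are hypothetical and are not taken from the dataset.

```python
from typing import Sequence


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 for the `positive` class: harmonic mean of precision and recall."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so report 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels invented for illustration only (1 = hate, 0 = non-hate).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
score = binary_f1(y_true, y_pred)
print(f"F1 = {score:.4f}")
```

Because roughly 40% of the comments are labeled hate, F1 on the hate class is more informative here than raw accuracy, which a classifier could inflate by favoring the majority non-hate class.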
Abstract
In this paper, we present HS-BAN, a binary-class hate speech (HS) dataset in the Bangla language consisting of more than 50,000 labeled comments, of which 40.17% are hate speech and the rest are non-hate speech. While preparing the dataset, a strict and detailed annotation guideline was followed to reduce human annotation bias. The HS dataset was also preprocessed linguistically to extract the different types of slang that people currently write using symbols, acronyms, or alternative spellings. These slang words were further categorized into traditional and non-traditional slang lists, which are included in the results of this paper. We explored traditional linguistic features and neural network-based methods to develop a benchmark system for hate speech detection in the Bangla language. Our experimental results show that existing word embedding models trained with informal texts perform better than those trained with…
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Internet Traffic Analysis and Secure E-voting
Methods: Network On Network · fastText
