BengaliSent140: A Large-Scale Bengali Binary Sentiment Dataset for Hate and Non-Hate Speech Classification
Akif Islam, Sujan Kumar Roy, Md. Ekramul Hamid

TL;DR
BengaliSent140 is a large, diverse, and publicly available Bengali sentiment dataset with 139,792 samples, designed to facilitate hate speech classification and improve deep learning model training in Bengali NLP.
Contribution
This work introduces BengaliSent140, the first large-scale, multi-source Bengali sentiment dataset with harmonized binary labels, enabling better model training and benchmarking.
Findings
Near-balanced class distribution with 68,548 hate and 71,244 not-hate samples
Demonstrated baseline performance for hate speech classification
Broader linguistic coverage than existing datasets
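The class counts above can be checked against the reported total of 139,792 samples with a few lines of arithmetic. This is only an illustrative sketch; the dictionary keys `hate` and `not_hate` are names chosen here, not identifiers taken from the dataset.

```python
# Class counts as reported in the findings above.
# The key names are illustrative assumptions, not the dataset's own label names.
counts = {"hate": 68_548, "not_hate": 71_244}

# Total corpus size implied by the two class counts.
total = sum(counts.values())

# Fraction of the corpus that falls in each class.
shares = {label: n / total for label, n in counts.items()}

# Ratio of the larger class to the smaller one (1.0 would be perfectly balanced).
imbalance_ratio = max(counts.values()) / min(counts.values())

print(f"total samples: {total}")
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"majority/minority ratio: {imbalance_ratio:.3f}")
```

The two counts sum to the reported total, and each class holds roughly half of the corpus, so the split is close to even without being exactly balanced.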
Abstract
Sentiment analysis for the Bengali language has attracted increasing research interest in recent years. However, progress remains constrained by the scarcity of large-scale and diverse annotated datasets. Although several Bengali sentiment and hate speech datasets are publicly available, most are limited in size or confined to a single domain, such as social media comments. Consequently, these resources are often insufficient for training modern deep learning-based models, which require large volumes of heterogeneous data to learn robust and generalizable representations. In this work, we introduce BengaliSent140, a large-scale Bengali binary sentiment dataset constructed by consolidating seven existing Bengali text datasets into a unified corpus. To ensure consistency across sources, heterogeneous annotation schemes are systematically harmonized into a binary sentiment formulation with…
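The harmonization step described in the abstract, mapping each source's own annotation scheme onto a single binary label, might look roughly like the sketch below. The abstract does not list the seven sources or their label sets, so the source names (`source_a`, `source_b`, `source_c`), the raw label vocabularies, and the helpers `harmonize` and `build_corpus` are all hypothetical. The choice to drop unmapped labels is likewise an assumption, not the authors' documented rule.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical per-source label vocabularies (assumption: the real sources
# and their schemes are not given in this summary).
# Target scheme: 1 = hate, 0 = not-hate.
LABEL_MAP = {
    "source_a": {"hate": 1, "non-hate": 0},               # string labels
    "source_b": {"1": 1, "0": 0},                         # integer labels
    "source_c": {"religious": 1, "political": 1, "none": 0},  # multi-class
}


def harmonize(source: str, raw_label: object) -> Optional[int]:
    """Map a raw source label to the binary scheme, or None if unmapped."""
    key = str(raw_label).strip().lower()
    return LABEL_MAP[source].get(key)


def build_corpus(
    records: Iterable[Tuple[str, str, object]]
) -> List[Tuple[str, int]]:
    """Consolidate (source, text, raw_label) records into (text, label) pairs.

    Records whose label has no binary equivalent are skipped (an assumption).
    """
    corpus = []
    for source, text, raw_label in records:
        label = harmonize(source, raw_label)
        if label is not None:
            corpus.append((text, label))
    return corpus


# Toy records with placeholder text, not samples from the dataset.
records = [
    ("source_a", "t1", "Hate"),
    ("source_b", "t2", 0),
    ("source_c", "t3", "religious"),
    ("source_c", "t4", "unknown"),  # no mapping, so it is dropped
]
merged = build_corpus(records)
print(merged)
```

Keeping the mapping in one explicit table per source makes every harmonization decision auditable, which matters when labels from differently annotated corpora are merged.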
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Sentiment Analysis and Opinion Mining · Spam and Phishing Detection
