BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis
Sadia Alam, Md Farhan Ishmam, Navid Hasin Alvee, Md Shahnewaz Siddique, Md Azam Hossain, Abu Raihan Mostofa Kamal

TL;DR
This paper introduces BnSentMix, a large, diverse Bengali-English code-mixed sentiment analysis dataset with 20,000 samples and 4 sentiment labels, and evaluates 14 baseline methods, including novel transformer encoders, reaching an overall accuracy of 69.8%.
Contribution
The paper provides the first large-scale, diverse sentiment dataset for Bengali-English code-mixed data and proposes baseline methods, including novel transformers, for sentiment analysis.
Findings
The baselines reached an overall accuracy of 69.8% and an F1 score of 69.1%.
Samples are drawn from Facebook, YouTube, and e-commerce sites so that the data reflects realistic code-mixed scenarios.
Performance varies across sentiment labels and text types.
Abstract
The widespread availability of code-mixed data can provide valuable insights into low-resource languages like Bengali, which have limited datasets. Sentiment analysis has been a fundamental text classification task across several languages for code-mixed data. However, there has yet to be a large-scale and diverse sentiment analysis dataset on code-mixed Bengali. We address this limitation by introducing BnSentMix, a sentiment analysis dataset on code-mixed Bengali consisting of 20,000 samples with 4 sentiment labels from Facebook, YouTube, and e-commerce sites. We ensure diversity in data sources to replicate realistic code-mixed scenarios. Additionally, we propose 14 baseline methods including novel transformer encoders further pre-trained on code-mixed Bengali-English, achieving an overall accuracy of 69.8% and an F1 score of 69.1% on sentiment classification tasks. Detailed analyses…
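As a rough illustration of the two metrics the abstract reports, the sketch below computes accuracy and an F1 score for a 4-label sentiment classifier. It is not taken from the paper: the label names in `LABELS` are hypothetical (the abstract only states that there are 4 sentiment labels), and macro-averaging is an assumption, since the abstract does not say how the F1 score is averaged.

```python
# Hypothetical label names; the abstract only says there are 4 sentiment labels.
LABELS = ["positive", "negative", "neutral", "mixed"]


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the gold label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-label F1 scores (macro-averaging is assumed)."""
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            # A label that is never predicted correctly contributes an F1 of 0.
            f1_scores.append(0.0)
    return sum(f1_scores) / len(labels)


if __name__ == "__main__":
    # Toy gold labels and predictions, invented for illustration only.
    gold = ["positive", "negative", "neutral", "mixed", "positive"]
    pred = ["positive", "negative", "mixed", "mixed", "negative"]
    print(f"accuracy = {accuracy(gold, pred):.3f}")
    print(f"macro F1 = {macro_f1(gold, pred):.3f}")
```

Macro-averaging weights every label equally, which makes weak performance on a rare label visible. A weighted average would instead track the label distribution, so the two can differ noticeably on imbalanced data.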
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Topic Modeling · Hate Speech and Cyberbullying Detection
