SentMix-3L: A Bangla-English-Hindi Code-Mixed Dataset for Sentiment Analysis
Md Nishat Raihan, Dhiman Goswami, Antara Mahmud, Antonios Anastasopoulos, Marcos Zampieri

TL;DR
SentMix-3L is a new sentiment analysis dataset of text code-mixed between Bangla, English, and Hindi. On it, zero-shot prompting with GPT-3.5 outperforms the transformer-based models evaluated.
Contribution
Introduces SentMix-3L, the first code-mixed dataset for sentiment analysis involving three languages, and evaluates model performance on it.
Findings
GPT-3.5 zero-shot outperforms transformer-based models
SentMix-3L enables multi-language code-mixed sentiment analysis
Comprehensive evaluation of models on a novel three-language dataset
Abstract
Code-mixing is a well-studied linguistic phenomenon in which two or more languages are mixed in text or speech. Several datasets have been built with the goal of training computational models for code-mixing. Although it is very common to observe code-mixing with multiple languages, most available datasets contain code-mixing between only two languages. In this paper, we introduce SentMix-3L, a novel dataset for sentiment analysis containing code-mixed data between three languages: Bangla, English, and Hindi. We carry out a comprehensive evaluation using SentMix-3L. We show that zero-shot prompting with GPT-3.5 outperforms all transformer-based models on SentMix-3L.
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Digital Communication and Language
Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Byte Pair Encoding · Weight Decay · Softmax · Dense Connections · Linear Warmup With Cosine Annealing
