Code-Mix Sentiment Analysis on Hinglish Tweets
Aashi Garg, Aneshya Das, Arshi Arya, Anushka Goyal, Aditi

TL;DR
This paper presents a specialized sentiment analysis framework for Hinglish tweets, utilizing mBERT with subword tokenization to improve accuracy in understanding code-mixed social media content.
Contribution
It introduces a fine-tuned mBERT model with subword tokenization tailored for Hinglish, addressing the challenges of code-mixed language in sentiment analysis.
Findings
Achieved high accuracy in Hinglish sentiment classification
Established a new benchmark for multilingual NLP in low-resource settings
Provided a production-ready AI tool for brand sentiment tracking
Abstract
The effectiveness of brand monitoring in India is increasingly challenged by the rise of Hinglish, a hybrid of Hindi and English, used widely in user-generated content on platforms like Twitter. Traditional Natural Language Processing (NLP) models, built for monolingual data, often fail to interpret the syntactic and semantic complexity of this code-mixed language, resulting in inaccurate sentiment analysis and misleading market insights. To address this gap, we propose a high-performance sentiment classification framework specifically designed for Hinglish tweets. Our approach fine-tunes mBERT (Multilingual BERT), leveraging its multilingual capabilities to better understand the linguistic diversity of Indian social media. A key component of our methodology is the use of subword tokenization, which enables the model to effectively manage spelling variations, slang, and…
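To illustrate why subword tokenization helps with spelling variation, here is a minimal sketch of WordPiece-style greedy longest-match-first splitting, the general scheme used by BERT-family tokenizers. It is not the authors' implementation: the vocabulary `TOY_VOCAB`, the helper names `wordpiece` and `tokenize`, and the example words are all hypothetical, and mBERT's real vocabulary is far larger and would split these words differently.

```python
# Toy WordPiece-style tokenizer: greedy longest-match-first over a tiny,
# hypothetical vocabulary (an assumption for illustration, not mBERT's vocab).
TOY_VOCAB = {"bah", "##ut", "##ot", "##u", "acc", "##ha", "movie", "[UNK]"}


def wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Split one lowercase word into subword pieces.

    Non-initial pieces carry the '##' continuation prefix. If no piece
    matches at some position, the whole word maps to the unknown token.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def tokenize(text):
    """Lowercase, split on whitespace, then apply wordpiece to each word."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word))
    return tokens


if __name__ == "__main__":
    # Three romanized spellings of the same Hindi word share the piece "bah".
    for spelling in ("bahut", "bahot", "bahuut"):
        print(spelling, wordpiece(spelling))
    print(tokenize("Bahut accha movie"))
```

In this toy setup the variant spellings all begin with the same piece, so a model sees a shared unit rather than three unrelated out-of-vocabulary words. That is the intuition behind using subword units for noisy, romanized text.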
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Spam and Phishing Detection · Hate Speech and Cyberbullying Detection
