Code-Mix Sentiment Analysis on Hinglish Tweets

Aashi Garg; Aneshya Das; Arshi Arya; Anushka Goyal; Aditi

arXiv:2601.05091·cs.CL·January 9, 2026

Open Access

TL;DR

This paper presents a specialized sentiment analysis framework for Hinglish tweets, utilizing mBERT with subword tokenization to improve accuracy in understanding code-mixed social media content.

Contribution

It introduces a fine-tuned mBERT model with subword tokenization tailored for Hinglish, addressing the challenges of code-mixed language in sentiment analysis.

Findings

01 Achieved high accuracy in Hinglish sentiment classification

02 Established a new benchmark for multilingual NLP in low-resource settings

03 Provided a production-ready AI tool for brand sentiment tracking

Abstract

The effectiveness of brand monitoring in India is increasingly challenged by the rise of Hinglish, a hybrid of Hindi and English that is widely used in user-generated content on platforms like Twitter. Traditional Natural Language Processing (NLP) models, built for monolingual data, often fail to interpret the syntactic and semantic complexity of this code-mixed language, resulting in inaccurate sentiment analysis and misleading market insights. To address this gap, we propose a high-performance sentiment classification framework specifically designed for Hinglish tweets. Our approach fine-tunes mBERT (Multilingual BERT), leveraging its multilingual capabilities to better understand the linguistic diversity of Indian social media. A key component of our methodology is the use of subword tokenization, which enables the model to effectively manage spelling variations, slang, and…
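To illustrate why subword tokenization helps with Romanized spelling variants, here is a minimal sketch of greedy longest-match-first splitting in the style of WordPiece, the tokenization scheme used by BERT-family models. It is not the authors' implementation: the function names and the tiny `TOY_VOCAB` are hypothetical and chosen only for illustration, whereas the real mBERT vocabulary is learned from data and is far larger.

```python
from typing import List, Set

# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
TOY_VOCAB: Set[str] = {
    "movie", "hai", "bah", "##ut", "##ot", "acch", "ach", "##a",
}


def wordpiece_split(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one word into subword pieces by greedy longest-match-first."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def tokenize(text: str, vocab: Set[str] = TOY_VOCAB) -> List[str]:
    """Lowercase, split on whitespace, then split each word into pieces."""
    tokens: List[str] = []
    for word in text.lower().split():
        tokens.extend(wordpiece_split(word, vocab))
    return tokens


if __name__ == "__main__":
    # Two common spellings of the same Hindi word share the piece "bah".
    print(tokenize("bahut"))
    print(tokenize("bahot"))
    print(tokenize("Movie bahut accha hai"))
```

In this toy setup the variants "bahut" and "bahot" both begin with the shared piece "bah", so a model sees overlapping input for differently spelled forms of the same word instead of two unrelated out-of-vocabulary tokens.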

Taxonomy

Topics: Sentiment Analysis and Opinion Mining · Spam and Phishing Detection · Hate Speech and Cyberbullying Detection