EnTaCs: Analyzing the Relationship Between Sentiment and Language Choice in English-Tamil Code-Switching
Paul Bontempo

TL;DR
This study explores how sentiment affects language choice in English-Tamil code-switching, using machine learning to analyze YouTube comments and revealing significant correlations between emotion and language switching behavior.
Contribution
It introduces a novel analysis linking sentiment to language switching patterns in code-switched text using a fine-tuned language identification model.
Findings
Positive utterances have higher English proportion (34.3%) than negative ones (24.8%)
Mixed-sentiment utterances show the highest language switch frequency when controlling for utterance length
The authors attribute the effect of emotion on language choice to socio-linguistic associations of prestige and identity
Abstract
This paper investigates the relationship between utterance sentiment and language choice in English-Tamil code-switched text, using methods from machine learning and statistical modelling. We apply a fine-tuned XLM-RoBERTa model for token-level language identification on 35,650 romanized YouTube comments from the DravidianCodeMix dataset, producing per-utterance measurements of English proportion and language switch frequency. Linear regression analysis reveals that positive utterances exhibit significantly greater English proportion (34.3%) than negative utterances (24.8%), and mixed-sentiment utterances show the highest language switch frequency when controlling for utterance length. These findings support the hypothesis that emotional content demonstrably influences language choice in multilingual code-switching settings, due to socio-linguistic associations of prestige and identity…
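The two per-utterance measurements named in the abstract can be illustrated with a minimal sketch. It assumes that token-level language identification has already produced one tag per token, and that the tags are the strings "en", "ta", and "other"; these labels, the function names, and the choice to ignore "other" tokens are assumptions made for illustration, and the paper's exact definitions (for example, how switch frequency is normalized) may differ.

```python
from typing import List

# Hypothetical tag set: "en" (English), "ta" (Tamil), "other" (names, symbols, etc.).
LANG_TAGS = ("en", "ta")


def english_proportion(tags: List[str]) -> float:
    """Share of English tokens among the tokens tagged as English or Tamil."""
    lang = [t for t in tags if t in LANG_TAGS]
    if not lang:
        return 0.0  # no language-bearing tokens in this utterance
    return sum(t == "en" for t in lang) / len(lang)


def switch_count(tags: List[str]) -> int:
    """Number of adjacent token pairs whose language differs (after dropping 'other')."""
    lang = [t for t in tags if t in LANG_TAGS]
    return sum(a != b for a, b in zip(lang, lang[1:]))


# Toy tag sequence, invented for illustration only (not taken from the dataset).
example_tags = ["ta", "ta", "en", "en", "ta", "other"]
proportion = english_proportion(example_tags)
switches = switch_count(example_tags)
```

In a regression like the one the abstract describes, values such as these would serve as the dependent variables, with sentiment label and utterance length as predictors.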
