Neither Here Nor There: Cross-Lingual Representation Dynamics of Code-Mixed Text in Multilingual Encoders
Debajyoti Mazumder, Divyansh Pathak, Prashant Kodali, Jasabanta Patro

TL;DR
This paper investigates how multilingual encoders represent code-mixed Hindi-English text, revealing an English-dominant processing bias and proposing a new alignment method that improves cross-lingual understanding and downstream task performance.
Contribution
The paper introduces a trilingual post-training alignment objective that yields more balanced code-mixed representations and improves cross-lingual task accuracy.
Findings
Standard multilingual encoders align English and Hindi well, but code-mixed inputs remain only loosely connected to either language.
Continued pre-training improves English-code-mixed alignment but reduces English-Hindi alignment.
The proposed alignment method yields better downstream task performance.
Abstract
Multilingual encoder-based language models are widely adopted for code-mixed analysis tasks, yet we know surprisingly little about how they represent code-mixed inputs internally, or whether those representations meaningfully connect to the constituent languages being mixed. Using Hindi-English as a case study, we construct a unified trilingual corpus of parallel English, Hindi (Devanagari), and Romanized code-mixed sentences, and probe cross-lingual representation alignment across standard multilingual encoders and their code-mixed adapted variants via CKA, token-level saliency, and entropy-based uncertainty analysis. We find that while standard models align English and Hindi well, code-mixed inputs remain loosely connected to either language, and that continued pre-training on code-mixed data improves English-code-mixed alignment at the cost of English-Hindi alignment.…
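The abstract names CKA (centered kernel alignment) as one of its alignment probes but does not say which variant or which pooling the authors use. As a rough illustration only, the sketch below computes the widely used linear CKA between two sets of sentence representations for the same parallel sentences. The random arrays stand in for encoder outputs, and the names `linear_cka`, `en`, `hi`, and `cm` are hypothetical, not taken from the paper.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two representation matrices.

    x has shape (n_sentences, d1) and y has shape (n_sentences, d2);
    row i of x and row i of y must describe the same parallel sentence.
    """
    # Center every feature dimension across the sentences.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by
    # the Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Placeholder data (an assumption): 64 parallel sentences, 16-dim vectors.
rng = np.random.default_rng(0)
en = rng.normal(size=(64, 16))                   # stand-in for English
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
hi = en @ rotation                               # same geometry, rotated basis
cm = rng.normal(size=(64, 16))                   # unrelated stand-in for code-mixed

cka_en_hi = linear_cka(en, hi)  # rotation-invariant, so essentially 1.0
cka_en_cm = linear_cka(en, cm)  # independent random features score lower
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either representation, which is why the rotated copy scores essentially 1.0 while the unrelated matrix scores lower. In an actual probe, the placeholder arrays would be replaced by pooled hidden states from a given encoder layer for each of the three language variants.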
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · Computational and Text Analysis Methods
