Neither Here Nor There: Cross-Lingual Representation Dynamics of Code-Mixed Text in Multilingual Encoders

Debajyoti Mazumder; Divyansh Pathak; Prashant Kodali; Jasabanta Patro

arXiv:2603.19771 · cs.CL · March 23, 2026

TL;DR

This paper investigates how multilingual encoders represent code-mixed Hindi-English text, revealing an English-dominant processing bias and proposing a new alignment method that improves cross-lingual understanding and downstream task performance.

Contribution

The paper introduces a trilingual post-training alignment objective that yields more balanced code-mixed representations and improves cross-lingual task accuracy.

Findings

01. Standard models align English and Hindi well but not code-mixed inputs.

02. Continued pre-training improves English-code-mixed alignment but reduces English-Hindi alignment.

03. The proposed alignment method yields better downstream task performance.

Abstract

Multilingual encoder-based language models are widely adopted for code-mixed analysis tasks, yet we know surprisingly little about how they represent code-mixed inputs internally - or whether those representations meaningfully connect to the constituent languages being mixed. Using Hindi-English as a case study, we construct a unified trilingual corpus of parallel English, Hindi (Devanagari), and Romanized code-mixed sentences, and probe cross-lingual representation alignment across standard multilingual encoders and their code-mixed adapted variants via CKA, token-level saliency, and entropy-based uncertainty analysis. We find that while standard models align English and Hindi well, code-mixed inputs remain loosely connected to either language - and that continued pre-training on code-mixed data improves English-code-mixed alignment at the cost of English-Hindi alignment.…
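
The abstract names CKA (centered kernel alignment) as one of its alignment probes but does not say which variant is used. As a rough illustration only, the sketch below computes linear CKA between two sets of sentence representations. The function name `linear_cka`, the array names `en`, `hi`, and `cm`, and the synthetic data are all assumptions made for this example; they do not come from the paper, and real inputs would be pooled encoder activations for parallel English, Hindi, and code-mixed sentences.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two (n_samples, n_features) representation matrices.

    Rows must correspond to the same (parallel) sentences in both matrices.
    Assumption: the linear-kernel variant; the paper may use a different one.
    """
    # Center each feature column so the score ignores mean offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 measures shared structure between the two spaces.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    # Normalize by each space's self-similarity.
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Synthetic stand-ins for pooled sentence embeddings (hypothetical data).
rng = np.random.default_rng(0)
en = rng.normal(size=(200, 16))                    # "English" representations
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
hi = 2.0 * en @ rotation                           # rotated, rescaled copy of en
cm = rng.normal(size=(200, 16))                    # unrelated representations

cka_en_hi = linear_cka(en, hi)  # near 1: invariant to rotation and scaling
cka_en_cm = linear_cka(en, cm)  # much lower: no shared structure
print(f"en-hi: {cka_en_hi:.3f}, en-cm: {cka_en_cm:.3f}")
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either space, which makes it a common choice for comparing layers or models whose coordinate systems are not directly comparable. In a setup like the one the abstract describes, a score of this kind would presumably be computed per layer for each language pair (English-Hindi, English-code-mixed, Hindi-code-mixed).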

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · Computational and Text Analysis Methods