Code Mixologist: A Practitioner's Guide to Building Code-Mixed LLMs
Himanshu Gupta, Pratik Jayarao, Chaitanya Dwivedi, Neeraj Varshney

TL;DR
This paper reviews challenges and solutions for enabling large language models to handle code-mixed language data, providing a comprehensive taxonomy, evaluation critique, and practical recommendations.
Contribution
It introduces a unifying taxonomy for code-mixing and code-switching (CSW) research in LLMs, reviews modeling and evaluation approaches, and offers actionable guidance for building and assessing CSW-capable models.
Findings
Current models struggle with grammaticality, factuality, and safety in CSW contexts.
Evaluation practices are unstable and lack reproducibility.
Existing benchmarks have limited linguistic coverage and an English-centric bias.
Abstract
Code-mixing and code-switching (CSW) remain challenging phenomena for large language models (LLMs). Despite recent advances in multilingual modeling, LLMs often struggle in mixed-language settings, exhibiting systematic degradation in grammaticality, factuality, and safety behavior. This work provides a comprehensive overview of CSW research in modern large language model settings. We introduce a unifying taxonomy that organizes prior work along dimensions of data, modeling, and evaluation, and we distill these findings into a practical playbook of actionable recommendations for building, adapting, and evaluating CSW-capable LLMs. We review modeling approaches ranging from CSW-tailored pre-training and task-specific post-training to prompting strategies and in-context learning. We analyze current evaluation practices, highlighting sources of instability and limited reproducibility, and…
