Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text
Amr Mohamed, Yang Zhang, Michalis Vazirgiannis, Guokan Shang

TL;DR
This paper systematically evaluates how Large Language Models understand code-switched text, revealing that language mixing often degrades comprehension but can be mitigated through fine-tuning.
Contribution
It introduces a framework for assessing LLM performance on code-switched data and compares prompting versus fine-tuning strategies for improving understanding.
Findings
Degradation occurs when foreign tokens disrupt English text.
Embedding English into other languages can improve comprehension.
Fine-tuning helps mitigate performance degradation.
Abstract
Code-switching (CSW) is the act of alternating between two or more languages within a single discourse. This phenomenon is widespread in multilingual communities, and increasingly prevalent in online content, where users naturally mix languages in everyday communication. As a result, Large Language Models (LLMs), now central to content processing and generation, are frequently exposed to code-switched inputs. Given their widespread use, it is crucial to understand how LLMs process and reason about such mixed-language text. This paper presents a systematic evaluation of LLM comprehension under code-switching by generating CSW variants of established reasoning and comprehension benchmarks. While degradation is evident when foreign tokens disrupt English text, even under linguistic constraints, embedding English into other languages often improves…
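To make the idea of a "CSW variant" of a benchmark item concrete, here is a minimal sketch. It is not the paper's pipeline, which this summary does not describe. It assumes a simple word-substitution scheme over a tiny, hypothetical English-to-French lexicon (`TOY_LEXICON`); the helper names `code_switch` and `accuracy_drop` are likewise illustrative only.

```python
import random

# Hypothetical toy lexicon (assumption): a real setup would need a proper
# bilingual resource or translation model, plus any linguistic constraints.
TOY_LEXICON = {
    "cat": "chat",
    "house": "maison",
    "book": "livre",
    "water": "eau",
}


def code_switch(sentence, lexicon, rate=1.0, seed=0):
    """Swap lexicon words in an English sentence for their foreign equivalents.

    Each eligible word is replaced with probability `rate`; punctuation
    attached to a word is preserved.
    """
    rng = random.Random(seed)
    out = []
    for token in sentence.split():
        core = token.strip(".,!?;:")  # word without surrounding punctuation
        key = core.lower()
        if key in lexicon and rng.random() < rate:
            out.append(token.replace(core, lexicon[key]))
        else:
            out.append(token)
    return " ".join(out)


def accuracy_drop(acc_original, acc_csw):
    """Difference in accuracy between the original and the CSW variant."""
    return acc_original - acc_csw


print(code_switch("The cat reads a book.", TOY_LEXICON))
```

Under these assumptions, one would run the same model on the original and the code-switched items and compare scores with something like `accuracy_drop`; a positive value would indicate degradation on the mixed-language input.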
Taxonomy
Topics: Natural Language Processing Techniques
