Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text

Amr Mohamed; Yang Zhang; Michalis Vazirgiannis; Guokan Shang

arXiv:2506.14012·cs.CL·June 18, 2025



TL;DR

This paper systematically evaluates how Large Language Models understand code-switched text, revealing that language mixing often degrades comprehension and that fine-tuning can mitigate the loss.

Contribution

It introduces a framework for assessing LLM performance on code-switched data and compares prompting versus fine-tuning strategies for improving understanding.

Findings

01

Degradation occurs when foreign tokens disrupt English text.

02

Embedding English into other languages can improve comprehension.

03

Fine-tuning helps mitigate performance degradation.

Abstract

Code-switching (CSW) is the act of alternating between two or more languages within a single discourse. This phenomenon is widespread in multilingual communities, and increasingly prevalent in online content, where users naturally mix languages in everyday communication. As a result, Large Language Models (LLMs), now central to content processing and generation, are frequently exposed to code-switched inputs. Given their widespread use, it is crucial to understand how LLMs process and reason about such mixed-language text. This paper presents a systematic evaluation of LLM comprehension under code-switching by generating CSW variants of established reasoning and comprehension benchmarks. While degradation is evident when foreign tokens disrupt English text – even under linguistic constraints – embedding English into other languages often improves…
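To make the idea of a "CSW variant" of a benchmark item concrete, here is a minimal sketch of word-level substitution. It is not the paper's generation procedure, which this page does not describe. The lexicon `TOY_LEXICON`, the function `code_switch`, and the `ratio` and `seed` parameters are hypothetical names introduced for illustration only.

```python
import random

# Hypothetical toy English-to-French lexicon; illustrative only,
# not taken from the paper or its repository.
TOY_LEXICON = {
    "cat": "chat",
    "house": "maison",
    "water": "eau",
    "book": "livre",
}


def code_switch(sentence, lexicon, ratio=0.5, seed=0):
    """Replace a fraction of the lexicon-covered words with foreign tokens.

    Words are split on whitespace, so a word with attached punctuation
    (e.g. "cat.") is not matched in this simplified sketch.
    """
    rng = random.Random(seed)  # seeded so the variant is reproducible
    words = sentence.split()
    # Positions of words that have an entry in the lexicon.
    candidates = [i for i, w in enumerate(words) if w.lower() in lexicon]
    # Number of candidate positions to switch.
    k = round(len(candidates) * ratio)
    chosen = set(rng.sample(candidates, k))
    return " ".join(
        lexicon[w.lower()] if i in chosen else w
        for i, w in enumerate(words)
    )


if __name__ == "__main__":
    print(code_switch("The cat drinks water", TOY_LEXICON, ratio=1.0))
```

With `ratio=1.0` every covered word is swapped, and with `ratio=0.0` the sentence is returned unchanged. A realistic pipeline would also need to handle punctuation, morphology, and the linguistic constraints the abstract mentions, none of which this sketch attempts.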


Code & Models

Repositories

amr-mohamedd/lost-in-the-mix
Official


Taxonomy

Topics: Natural Language Processing Techniques