Can Code-Switched Texts Activate a Knowledge Switch in LLMs? A Case Study on English-Korean Code-Switching
Seoyeon Kim, Huiseo Kim, Chanjun Park, Jinyoung Yeo, Dongha Lee

TL;DR
This paper explores whether code-switching in multilingual texts can activate language-specific knowledge in large language models, especially for low-resource languages, through a new English-Korean question-answering dataset and analysis.
Contribution
It introduces EnKoQA, a synthetic English-Korean code-switching question-answering dataset, and analyzes how code-switching activates knowledge in LLMs on low-resource language tasks.
Findings
Code-switching faithfully activates knowledge in LLMs, especially in language-specific domains.
Multilingual LLMs show improved reasoning with code-switched inputs over monolingual English texts.
Code-switching shows potential to enhance low-resource language understanding in LLMs.
Abstract
Recent large language models (LLMs) demonstrate multilingual abilities, yet they remain English-centric due to the dominance of English in training corpora. The limited resources available for low-resource languages remain a crucial challenge. Code-switching (CS), a phenomenon in which multilingual speakers alternate between languages within a discourse, can convey subtle cultural and linguistic nuances that would otherwise be lost in translation, and it elicits language-specific knowledge in human communication. In light of this, we investigate whether code-switching can activate knowledge, that is, identify and leverage it for reasoning, when LLMs solve low-resource language tasks. To facilitate the research, we first present EnKoQA, a synthetic English-Korean CS question-answering dataset. We provide a comprehensive analysis of a variety of multilingual LLMs by subdividing the activation process into knowledge identification…
Taxonomy
Topics: Translation Studies and Practices
