Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers

Qingcheng Zeng; Yuheng Lu; Zeqi Zhou; Heli Qi; Puxuan Yu; Fuheng Zhao; Hitomi Yanaka; Weihao Xuan; Naoto Yokoya

arXiv:2604.17632 · cs.IR · April 21, 2026

TL;DR

This paper studies code-switching in information retrieval. It introduces new benchmarks and shows that current models suffer significant performance drops on mixed-language queries, which points to a need for specialized solutions.

Contribution

The paper presents two benchmarks for code-switching IR, CSR-L and CS-MTEB, and analyzes the limitations of existing models and techniques in handling mixed-language queries.

Findings

01. Code-switching causes a performance decline of up to 27% in IR tasks.

02. Embedding divergence explains the failure of current models on code-switched text.

03. Vocabulary expansion alone cannot fully address code-switching challenges.
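
To make the embedding-divergence finding concrete, here is a minimal sketch of one way such divergence could be measured: the mean cosine distance between the embedding of each monolingual query and the embedding of its code-switched counterpart. This is an illustrative assumption, not the metric the paper reports. The function names and the toy two-dimensional vectors are hypothetical, and in practice the vectors would come from whichever retriever is being evaluated.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_pairwise_divergence(mono_embs, cs_embs):
    """Mean cosine distance (1 - similarity) over aligned query pairs.

    mono_embs[i] and cs_embs[i] are assumed to embed the same query in its
    monolingual and code-switched form (hypothetical setup).
    """
    if not mono_embs or len(mono_embs) != len(cs_embs):
        raise ValueError("need two non-empty, equally long lists of embeddings")
    distances = [1.0 - cosine_similarity(m, c) for m, c in zip(mono_embs, cs_embs)]
    return sum(distances) / len(distances)


# Toy vectors standing in for real query embeddings (illustrative only).
mono = [[1.0, 0.0], [0.0, 1.0]]
code_switched = [[1.0, 0.0], [1.0, 1.0]]
divergence = mean_pairwise_divergence(mono, code_switched)
print(f"mean cosine distance: {divergence:.4f}")
```

A value near 0 would mean the code-switched queries land close to their monolingual counterparts in the embedding space; larger values indicate the kind of drift that the finding associates with retrieval failures.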

Abstract

Code-switching is a pervasive linguistic phenomenon in global communication, yet modern information retrieval systems remain predominantly designed for, and evaluated within, monolingual contexts. To bridge this critical disconnect, we present a holistic study dedicated to code-switching IR. We introduce CSR-L (Code-Switching Retrieval benchmark-Lite), constructing a dataset via human annotation to capture the authentic naturalness of mixed-language queries. Our evaluation across statistical, dense, and late-interaction paradigms reveals that code-switching acts as a fundamental performance bottleneck, degrading the effectiveness of even robust multilingual models. We demonstrate that this failure stems from substantial divergence in the embedding space between pure and code-switched text. Scaling this investigation, we propose CS-MTEB, a comprehensive benchmark covering 11 diverse…

Code & Models

Repositories

paddler2022/Code-Switching-Information-Retrieval (GitHub)
