TL;DR
This paper studies code-switching in information retrieval. It introduces new benchmarks and shows that current models lose substantial performance on mixed-language queries, which points to a need for specialized solutions.
Contribution
The paper presents two benchmarks for code-switching IR, CSR-L and CS-MTEB, and analyzes where existing models and techniques fall short on mixed-language queries.
Findings
Code-switching reduces performance on IR tasks by up to 27%.
Divergence in the embedding space between pure and code-switched text explains why current models fail on code-switched queries.
Vocabulary expansion alone cannot fully address code-switching challenges.
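The embedding-divergence finding can be made concrete with a small sketch. The summary does not say how the paper measures divergence, so this example assumes cosine distance (1 minus cosine similarity) between the embedding of a monolingual query and that of its code-switched counterpart. The `embed` function is a hypothetical stand-in that counts character trigrams; it is not the paper's encoder, and the example queries are invented for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an encoder: character-trigram counts.

    Assumption for illustration only; a real study would call a
    multilingual dense encoder here instead.
    """
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def divergence(text_a: str, text_b: str) -> float:
    """Cosine distance between the embeddings of two texts (assumed metric)."""
    return 1.0 - cosine_similarity(embed(text_a), embed(text_b))


# Hypothetical query pair: same intent, second one switches into Spanish.
pure = "where can I buy cheap train tickets"
code_switched = "where can I buy cheap billetes de tren"

print("pure vs pure:         ", divergence(pure, pure))
print("pure vs code-switched:", divergence(pure, code_switched))
```

Swapping `embed` for a real encoder and averaging `divergence` over paired queries would give a rough picture of how far code-switched text drifts from its monolingual form in that model's embedding space.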
Abstract
Code-switching is a pervasive linguistic phenomenon in global communication, yet modern information retrieval systems remain predominantly designed for, and evaluated within, monolingual contexts. To bridge this critical disconnect, we present a holistic study dedicated to code-switching IR. We introduce CSR-L (Code-Switching Retrieval benchmark-Lite), constructing a dataset via human annotation to capture the authentic naturalness of mixed-language queries. Our evaluation across statistical, dense, and late-interaction paradigms reveals that code-switching acts as a fundamental performance bottleneck, degrading the effectiveness of even robust multilingual models. We demonstrate that this failure stems from substantial divergence in the embedding space between pure and code-switched text. Scaling this investigation, we propose CS-MTEB, a comprehensive benchmark covering 11 diverse…
