The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora
Chen Amiraz, Yaroslav Fyodorov, Elad Haramaty, Zohar Karnin, Liane Lewin-Eytan

TL;DR
This paper investigates retrieval biases in cross-lingual RAG systems over Arabic-English corpora, revealing retrieval as a key bottleneck and proposing strategies to improve multilingual retrieval performance in real-world scenarios.
Contribution
It provides a systematic analysis of multilingual retrieval challenges in domain-specific RAG, introduces new benchmarks, and proposes simple strategies to mitigate retrieval biases across languages.
Findings
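Retrieval performance drops when the user query and the supporting document are in different languages.
These failures stem mainly from the retriever's difficulty in ranking documents across languages, that is, in scoring Arabic and English candidates against each other for the same query.
Simple strategies, such as enforcing equal retrieval from each language or translating the query, improve performance.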
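
To make the "equal retrieval" idea concrete, here is a minimal Python sketch of a language-balanced top-k selection. It is an illustration only, not the authors' implementation: the `Hit` record, the `balanced_top_k` function, the language codes, the scores, and the choice to fill leftover slots by global score are all assumptions made for this example.

```python
from typing import NamedTuple, Sequence


class Hit(NamedTuple):
    """Hypothetical retrieval result (not from the paper)."""
    doc_id: str
    lang: str     # language tag of the document, e.g. "ar" or "en"
    score: float  # retriever similarity; higher means more relevant


def balanced_top_k(hits: Sequence[Hit], k: int,
                   langs: Sequence[str] = ("ar", "en")) -> list[Hit]:
    """Return up to k hits, reserving an equal share of slots per language."""
    per_lang = k // len(langs)
    selected: list[Hit] = []
    for lang in langs:
        # Rank candidates within one language only, so scores are never
        # compared across languages at this stage.
        ranked = sorted((h for h in hits if h.lang == lang),
                        key=lambda h: h.score, reverse=True)
        selected.extend(ranked[:per_lang])
    # Assumption: leftover slots (odd k, or a language with too few
    # candidates) are filled with the best remaining hits by global score.
    chosen = {h.doc_id for h in selected}
    rest = sorted((h for h in hits if h.doc_id not in chosen),
                  key=lambda h: h.score, reverse=True)
    selected.extend(rest[:k - len(selected)])
    return sorted(selected, key=lambda h: h.score, reverse=True)


# Made-up scores in which English documents outrank Arabic ones.
hits = [
    Hit("en1", "en", 0.90), Hit("en2", "en", 0.85), Hit("en3", "en", 0.80),
    Hit("ar1", "ar", 0.60), Hit("ar2", "ar", 0.55),
]
top4 = [h.doc_id for h in balanced_top_k(hits, k=4)]
print(top4)
```

With these made-up scores, a plain top-4 by score would keep three English documents and one Arabic one, whereas the balanced selection keeps two from each language. The quota only guarantees that both languages reach the generator; it does not by itself fix how the retriever scores cross-lingual pairs.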
Abstract
Cross-lingual retrieval-augmented generation (RAG) is a critical capability for retrieving and generating answers across languages. Prior work in this context has mostly focused on generation and relied on benchmarks derived from open-domain sources, most notably Wikipedia. In such settings, retrieval challenges often remain hidden due to language imbalances, overlap with pretraining data, and memorized content. To address this gap, we study Arabic-English RAG in a domain-specific setting using benchmarks derived from real-world corporate datasets. Our benchmarks include all combinations of languages for the user query and the supporting document, drawn independently and uniformly at random. This enables a systematic study of multilingual retrieval behavior. Our findings reveal that retrieval is a critical bottleneck in cross-lingual domain-specific scenarios, with substantial…
