Lost in Transliteration: Bridging the Script Gap in Neural IR
Andreas Chari, Iadh Ounis, Sean MacAvaney

TL;DR
This paper investigates the performance gap in neural IR systems when handling transliterated queries and proposes a fine-tuning approach on mixed native and Latinized texts to improve cross-script retrieval accuracy.
Contribution
It demonstrates that training on a mixture of native and transliterated scripts significantly enhances neural IR models' ability to handle transliterated queries effectively.
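As a rough illustration of what a mixed-script training set could look like, the sketch below Latinizes a random share of queries in a list of (query, passage) pairs. The character map, the `transliterate` and `mix_scripts` helpers, and the `ratio` parameter are all hypothetical and chosen for illustration. They are not the transliteration scheme or training recipe used by the authors, and real Greeklish is far less regular than a one-to-one character map.

```python
import random
import unicodedata

# Toy Greek-to-Latin character map. This is an illustrative assumption,
# not a standard and not the scheme used in the paper.
GREEK_TO_LATIN = {
    "α": "a", "β": "v", "γ": "g", "δ": "d", "ε": "e", "ζ": "z",
    "η": "i", "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m",
    "ν": "n", "ξ": "x", "ο": "o", "π": "p", "ρ": "r", "σ": "s",
    "ς": "s", "τ": "t", "υ": "y", "φ": "f", "χ": "ch", "ψ": "ps",
    "ω": "o",
}


def strip_accents(text):
    """Remove combining marks (e.g. the Greek tonos) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def transliterate(text):
    """Lowercase, strip accents, and map Greek letters to Latin ones.

    Characters without an entry in the map are passed through unchanged.
    """
    plain = strip_accents(text.lower())
    return "".join(GREEK_TO_LATIN.get(ch, ch) for ch in plain)


def mix_scripts(pairs, ratio=0.5, seed=0):
    """Return a copy of `pairs` with roughly `ratio` of the queries Latinized.

    Only queries are transliterated here; that choice is an assumption
    of this sketch. Passages are left in their native script.
    """
    rng = random.Random(seed)
    mixed = []
    for query, passage in pairs:
        if rng.random() < ratio:
            query = transliterate(query)
        mixed.append((query, passage))
    return mixed
```

Calling `mix_scripts(train_pairs, ratio=0.5)` on a hypothetical `train_pairs` list would yield data in which roughly half the queries appear in Latinized form, which could then be passed to whatever fine-tuning loop the retrieval model uses.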
Findings
Models fine-tuned on mixed scripts perform nearly as well on transliterated queries as on the same queries written in their native script.
Current models' performance drops sharply with transliterated queries, highlighting the script gap.
Transliterations can cause loss of query nuances, indicating the need for further research.
Abstract
Most human languages use scripts other than the Latin alphabet. Search users in these languages often formulate their information needs in a transliterated, usually Latinized, form for ease of typing. For example, Greek speakers might use Greeklish, and Arabic speakers might use Arabizi. This paper shows that current search systems, including those that use multilingual dense embeddings such as BGE-M3, do not generalise to this setting, and their performance rapidly deteriorates when exposed to transliterated queries. This creates a "script gap" between the performance of the same queries when written in their native or transliterated form. We explore whether adapting the popular "translate-train" paradigm to transliterations can enhance the robustness of multilingual Information Retrieval (IR) methods and bridge the gap between native and transliterated scripts. By exploring…
