Lost in Transliteration: Bridging the Script Gap in Neural IR

Andreas Chari; Iadh Ounis; Sean MacAvaney

arXiv:2505.08411·cs.IR·May 14, 2025

TL;DR

This paper investigates the performance gap in neural IR systems when handling transliterated queries and proposes a fine-tuning approach on mixed native and Latinized texts to improve cross-script retrieval accuracy.

Contribution

The paper demonstrates that fine-tuning on a mixture of native-script and transliterated text significantly improves neural IR models' ability to handle transliterated queries.
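
The mixed-script training idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' pipeline: the character map `GREEK_TO_LATIN` is a toy, one-to-one Greeklish scheme (real Greeklish is far less regular), the helper names are hypothetical, and only queries are Latinized while passages stay in their native script.

```python
import unicodedata

# Toy Greek -> Latin map (hypothetical and simplified; real users vary widely).
GREEK_TO_LATIN = {
    "α": "a", "β": "v", "γ": "g", "δ": "d", "ε": "e", "ζ": "z", "η": "i",
    "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m", "ν": "n", "ξ": "x",
    "ο": "o", "π": "p", "ρ": "r", "σ": "s", "ς": "s", "τ": "t", "υ": "y",
    "φ": "f", "χ": "ch", "ψ": "ps", "ω": "o",
}


def strip_accents(text):
    """Decompose characters and drop combining marks (e.g. the Greek tonos)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def transliterate(text):
    """Latinize Greek text character by character; other characters pass through."""
    base = strip_accents(text.lower())
    return "".join(GREEK_TO_LATIN.get(ch, ch) for ch in base)


def build_mixed_training_set(pairs):
    """For each (query, passage) pair, keep the native query and add a Latinized copy.

    Assumption: a 1:1 native/transliterated mix; the source does not state the ratio.
    """
    mixed = []
    for query, passage in pairs:
        mixed.append((query, passage))
        mixed.append((transliterate(query), passage))
    return mixed


pairs = [("καιρός αθήνα", "doc1")]
mixed = build_mixed_training_set(pairs)
```

The resulting pairs could then be fed to whatever fine-tuning loop the retriever already uses, so that the model sees the same relevance signal under both scripts.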

Findings

1. Models fine-tuned on mixed scripts perform nearly as well on transliterated queries as on native scripts.

2. Current models' performance drops sharply with transliterated queries, highlighting the script gap.

3. Transliterations can cause loss of query nuances, indicating the need for further research.
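
One way to quantify the script gap described in these findings is to score the same queries twice, once in native script and once transliterated, and subtract. The sketch below assumes mean reciprocal rank as the metric and uses made-up run data; the function names and the choice of metric are illustrative assumptions, not the paper's evaluation setup.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs, qrels):
    """Average reciprocal rank over all judged queries."""
    total = sum(reciprocal_rank(runs.get(qid, []), rel) for qid, rel in qrels.items())
    return total / len(qrels)


def script_gap(native_runs, translit_runs, qrels):
    """Native-script score minus transliterated score; positive means a gap."""
    return mean_reciprocal_rank(native_runs, qrels) - mean_reciprocal_rank(translit_runs, qrels)


# Hypothetical relevance judgments and rankings for two queries.
qrels = {"q1": {"d1"}, "q2": {"d4"}}
native_runs = {"q1": ["d1", "d2"], "q2": ["d4", "d3"]}
translit_runs = {"q1": ["d2", "d1"], "q2": ["d3", "d5"]}

gap = script_gap(native_runs, translit_runs, qrels)
```

A gap near zero would indicate that a model treats both forms of a query alike; a large positive value indicates the degradation the findings describe.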

Abstract

Most human languages use scripts other than the Latin alphabet. Search users in these languages often formulate their information needs in a transliterated (usually Latinized) form for ease of typing. For example, Greek speakers might use Greeklish, and Arabic speakers might use Arabizi. This paper shows that current search systems, including those that use multilingual dense embeddings such as BGE-M3, do not generalise to this setting, and their performance rapidly deteriorates when exposed to transliterated queries. This creates a "script gap" between the performance of the same queries when written in their native or transliterated form. We explore whether adapting the popular "translate-train" paradigm to transliterations can enhance the robustness of multilingual Information Retrieval (IR) methods and bridge the gap between native and transliterated scripts. By exploring…
