TL;DR
This paper demonstrates that pre-trained multilingual language models allow a retrieval system trained on English collections to be applied zero-shot to non-English queries and documents, significantly outperforming unsupervised retrieval techniques for Arabic, Mandarin Chinese, and Spanish.
Contribution
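It introduces a zero-shot transfer approach to non-English retrieval: a ranking model built on a pre-trained multilingual language model is trained on English collections and then applied to languages it never saw during training, addressing the scarcity of training data for non-English IR.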
Findings
Significant improvements over unsupervised methods for Arabic, Chinese, and Spanish.
Augmenting the English training collection with some target-language examples can sometimes improve performance.
Zero-shot transfer is viable for non-English retrieval without language-specific training data.
Abstract
While billions of non-English-speaking users rely on search engines every day, the problem of ad-hoc information retrieval is rarely studied for non-English languages. This is primarily due to a lack of datasets that are suitable for training ranking algorithms. In this paper, we tackle the lack of data by leveraging pre-trained multilingual language models to transfer a retrieval system trained on English collections to non-English queries and documents. Our models are evaluated in a zero-shot setting, meaning that we use them to predict relevance scores for query-document pairs in languages never seen during training. Our results show that the proposed approach can significantly outperform unsupervised retrieval techniques for Arabic, Mandarin Chinese, and Spanish. We also show that augmenting the English training collection with some examples from the target language can sometimes improve…
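The zero-shot setup described in the abstract can be pictured as a reranking loop: a scorer assigns a relevance score to each query-document pair, and the candidates are sorted by that score. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation. The names `Scorer`, `rerank`, `average_precision`, and `toy_overlap_score` are hypothetical, and the choice of average precision as the metric is an assumption. The token-overlap scorer is only a stand-in so the example runs without a model; in the setting the abstract describes, its place would be taken by a multilingual language model trained on English relevance data.

```python
from typing import Callable, List, Sequence, Set

# Hypothetical scorer type: maps a (query, document) pair to a relevance score.
Scorer = Callable[[str, str], float]


def rerank(query: str, candidates: Sequence[str], score: Scorer) -> List[str]:
    """Order candidate documents by descending relevance score for the query."""
    return sorted(candidates, key=lambda doc: score(query, doc), reverse=True)


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Average precision of one ranking against a set of relevant documents."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant) if relevant else 0.0


def toy_overlap_score(query: str, doc: str) -> float:
    """Placeholder scorer (fraction of query tokens found in the document).

    This is NOT a multilingual model; it only makes the sketch runnable.
    """
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0


if __name__ == "__main__":
    # Illustrative Spanish query and documents (made up for this sketch).
    query = "vacunas contra la gripe"
    docs = [
        "recetas de cocina mexicana",
        "las vacunas contra la gripe estacional",
        "historia de la gripe española",
    ]
    ranking = rerank(query, docs, toy_overlap_score)
    print(ranking)
    print(average_precision(ranking, {docs[1]}))
```

Because `rerank` takes the scorer as an argument, an unsupervised baseline and a transferred model can be compared on the same candidates by swapping only the `score` function.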
