On Cross-Lingual Retrieval with Multilingual Text Encoders
Robert Litschko, Ivan Vulić, Simone Paolo Ponzetto, Goran Glavaš

TL;DR
This paper systematically evaluates multilingual text encoders for cross-lingual retrieval, revealing their strengths and limitations at document and sentence levels, and proposes localized relevance matching to improve performance.
Contribution
It provides a comprehensive empirical analysis of multilingual encoders for cross-lingual retrieval and introduces localized relevance matching for better document-level performance.
Findings
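In unsupervised document-level retrieval, pretrained multilingual encoders on average fail to significantly outperform earlier models based on cross-lingual word embeddings (CLWEs).
State-of-the-art sentence-level retrieval is achieved by encoders further specialized, in a supervised fashion, for sentence understanding tasks, not by vanilla encoders.
Supervised re-ranking offers limited improvements unless the re-ranker is fine-tuned in-domain.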
Abstract
In this work we present a systematic empirical study focused on the suitability of the state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks across a number of diverse language pairs. We first treat these models as multilingual text encoders and benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR. In contrast to supervised language understanding, our results indicate that for unsupervised document-level CLIR -- a setup with no relevance judgments for IR-specific fine-tuning -- pretrained multilingual encoders on average fail to significantly outperform earlier models based on CLWEs. For sentence-level retrieval, we do obtain state-of-the-art performance: the peak scores, however, are met by multilingual encoders that have been further specialized, in a supervised fashion, for sentence understanding tasks, rather…
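The difference between scoring a whole document and the localized relevance matching mentioned in the summary can be illustrated with a small sketch. Everything below is an assumption made for illustration, not the authors' implementation: `toy_encode` is a hypothetical stand-in for a multilingual encoder (it carries no cross-lingual signal), and the window size, stride, and max aggregation are illustrative choices that this page does not specify.

```python
import hashlib
import numpy as np

DIM = 64  # embedding size of the toy encoder (arbitrary choice)

def toy_encode(text: str) -> np.ndarray:
    """Hypothetical stand-in for a multilingual encoder: averages
    deterministic pseudo-random token vectors and L2-normalizes."""
    vecs = []
    for tok in text.lower().split():
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        seed = int.from_bytes(digest[:4], "little")
        vecs.append(np.random.default_rng(seed).standard_normal(DIM))
    if not vecs:
        return np.zeros(DIM)
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)

def windows(tokens, size, stride):
    """Overlapping token windows; a short document yields one window."""
    if len(tokens) <= size:
        return [tokens]
    starts = list(range(0, len(tokens) - size + 1, stride))
    if starts[-1] != len(tokens) - size:
        starts.append(len(tokens) - size)  # make sure the tail is covered
    return [tokens[s:s + size] for s in starts]

def global_score(query, doc, encode=toy_encode):
    """Cosine similarity between the query and the whole document."""
    return float(encode(query) @ encode(doc))

def localized_score(query, doc, size=8, stride=4, encode=toy_encode):
    """Assumed variant: score each window separately, keep the best one."""
    q = encode(query)
    parts = windows(doc.split(), size, stride)
    return max(float(q @ encode(" ".join(p))) for p in parts)

def rank(query, docs, scorer):
    """Indices of docs sorted by descending score; no relevance labels used."""
    scores = [scorer(query, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

In this sketch a document containing a short relevant passage surrounded by unrelated text gets a higher localized score than global score, because the single-vector encoding of the full document dilutes the passage. With a real encoder, `encode` would be replaced by the model's embedding function.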
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
Methods: Balanced Selection
