On Cross-Lingual Retrieval with Multilingual Text Encoders

Robert Litschko; Ivan Vulić; Simone Paolo Ponzetto; Goran Glavaš

arXiv:2112.11031·cs.CL·December 22, 2021

TL;DR

This paper systematically evaluates multilingual text encoders for cross-lingual retrieval, revealing their strengths and limitations at document and sentence levels, and proposes localized relevance matching to improve performance.

Contribution

It provides a comprehensive empirical analysis of multilingual encoders for cross-lingual retrieval and introduces localized relevance matching for better document-level performance.

Findings

01

In unsupervised document-level retrieval, pretrained multilingual encoders on average fail to significantly outperform earlier models based on cross-lingual word embeddings (CLWEs).

02

State-of-the-art sentence retrieval is achieved by specialized supervised models, not vanilla encoders.

03

Supervised re-ranking yields limited improvements unless the re-ranker is fine-tuned in-domain.

Abstract

In this work we present a systematic empirical study focused on the suitability of the state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks across a number of diverse language pairs. We first treat these models as multilingual text encoders and benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR. In contrast to supervised language understanding, our results indicate that for unsupervised document-level CLIR -- a setup with no relevance judgments for IR-specific fine-tuning -- pretrained multilingual encoders on average fail to significantly outperform earlier models based on CLWEs. For sentence-level retrieval, we do obtain state-of-the-art performance: the peak scores, however, are met by multilingual encoders that have been further specialized, in a supervised fashion, for sentence understanding tasks, rather…
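The summary above mentions localized relevance matching for document-level retrieval. As a rough illustration of the general idea only, the sketch below scores a document by its best-matching window rather than by a single whole-document embedding. Everything in it is an assumption made for illustration: `toy_encode` is a hypothetical hashed bag-of-words stand-in for a multilingual encoder (it has no cross-lingual ability), and the window size, stride, and max aggregation are illustrative choices, not the paper's reported configuration.

```python
import math
import re
import zlib


def toy_encode(text, dim=64):
    """Hypothetical stand-in for a multilingual encoder: hashed bag-of-words."""
    vec = [0.0] * dim
    for tok in re.findall(r"\w+", text.lower()):
        # crc32 is deterministic across runs, unlike the built-in hash().
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    return vec


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def sliding_windows(tokens, size, stride):
    """Split tokens into overlapping windows, always covering the tail."""
    if len(tokens) <= size:
        return [tokens]
    starts = list(range(0, len(tokens) - size + 1, stride))
    if starts[-1] != len(tokens) - size:
        starts.append(len(tokens) - size)
    return [tokens[s:s + size] for s in starts]


def localized_score(query, document, encode=toy_encode, size=8, stride=4):
    """Score a document by its best-matching window (max aggregation)."""
    q = encode(query)
    windows = sliding_windows(document.split(), size, stride)
    return max(cosine(q, encode(" ".join(w))) for w in windows)


def rank(query, docs, **kwargs):
    """Return document indices sorted by descending localized score."""
    scored = [(localized_score(query, d, **kwargs), i) for i, d in enumerate(docs)]
    return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))]
```

With a real encoder, `toy_encode` would be replaced by a function that maps text in any supported language to a vector. The point of the windowing is that a short relevant passage buried in a long document is not diluted by the surrounding text, which is what happens when the whole document is encoded at once.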

Code & Models

Repositories

rlitschk/EncoderCLIR
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications

Methods: Balanced Selection