Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval

Robert Litschko; Ivan Vulić; Simone Paolo Ponzetto; Goran Glavaš

arXiv:2101.08370 · cs.CL · January 22, 2021

TL;DR

This paper systematically evaluates pretrained multilingual text encoders for unsupervised cross-lingual retrieval. It finds that they do not outperform CLWE-based models at the document level, while specialized encoder variants reach state-of-the-art performance at the sentence level.

Contribution

The paper provides a systematic empirical analysis of how well multilingual encoders perform on unsupervised cross-lingual document and sentence retrieval across a large number of language pairs.

Findings

01 Pretrained encoders do not outperform CLWEs in unsupervised document retrieval.

02 State-of-the-art performance is achievable in sentence retrieval with specialized encoder variants.

03 Off-the-shelf encoders are less effective than fine-tuned variants for sentence-level tasks.

Abstract

Pretrained multilingual text encoders based on neural Transformer architectures, such as multilingual BERT (mBERT) and XLM, have achieved strong performance on a myriad of language understanding tasks. Consequently, they have been adopted as a go-to paradigm for multilingual and cross-lingual representation learning and transfer, rendering cross-lingual word embeddings (CLWEs) effectively obsolete. However, questions remain as to what extent this finding generalizes 1) to unsupervised settings and 2) to ad-hoc cross-lingual IR (CLIR) tasks. Therefore, in this work we present a systematic empirical study focused on the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks across a large number of language pairs. In contrast to supervised language understanding, our results indicate that for unsupervised document-level CLIR -- a…
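To make the unsupervised retrieval setting concrete, here is a minimal sketch of how a query in one language could be matched against documents in another using only encoder output vectors, with no relevance labels. It is an illustration under stated assumptions, not the authors' pipeline: the toy 3-dimensional vectors stand in for token representations from a multilingual encoder, and the mean pooling step, the helper names (`mean_pool`, `cosine`, `rank_documents`), and the document IDs are all hypothetical.

```python
import math


def mean_pool(token_vectors):
    """Average token vectors into one fixed-size vector (assumed pooling choice)."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted by descending similarity to the query."""
    scores = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for encoder outputs (hypothetical values, not real embeddings).
query_tokens = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
doc_tokens = {
    "doc_de_1": [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]],
    "doc_de_2": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.2]],
}

query_vec = mean_pool(query_tokens)
doc_vecs = {doc_id: mean_pool(tokens) for doc_id, tokens in doc_tokens.items()}
ranking = rank_documents(query_vec, doc_vecs)

for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Because the ranking depends only on vector similarity in a shared multilingual space, the same routine would apply unchanged to CLWE-based document vectors, which is what makes the two families of models directly comparable in this setting.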

Code & Models

Repositories

rlitschk/EncoderCLIR
PyTorch · Official

Taxonomy

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · WordPiece · Residual Connection · Dense Connections · Layer Normalization · Attention Is All You Need · Byte Pair Encoding · Label Smoothing