Transfer Learning Approaches for Building Cross-Language Dense Retrieval Models

Suraj Nair; Eugene Yang; Dawn Lawrie; Kevin Duh; Paul McNamee; Kenton Murray; James Mayfield; Douglas W. Oard

arXiv:2201.08471 · cs.IR · January 27, 2022

TL;DR

This paper presents ColBERT-X, a cross-language dense retrieval model leveraging XLM-R, trained via zero-shot and translate-train methods, significantly outperforming traditional lexical baselines in multilingual document ranking tasks.

Contribution

It introduces ColBERT-X, a novel cross-language dense retrieval model using transformer encoders, and demonstrates effective training strategies for multilingual information retrieval.

Findings

01. Significant improvements over lexical CLIR baselines.
02. Effective zero-shot and translate-train training methods.
03. Statistically significant results across multiple languages.

Abstract

The advent of transformer-based models such as BERT has led to the rise of neural ranking models. These models have improved the effectiveness of retrieval systems well beyond that of lexical term matching models such as BM25. While monolingual retrieval tasks have benefited from large-scale training collections such as MS MARCO and advances in neural architectures, cross-language retrieval tasks have fallen behind these advancements. This paper introduces ColBERT-X, a generalization of the ColBERT multi-representation dense retrieval model that uses the XLM-RoBERTa (XLM-R) encoder to support cross-language information retrieval (CLIR). ColBERT-X can be trained in two ways. In zero-shot training, the system is trained on the English MS MARCO collection, relying on the XLM-R encoder for cross-language mappings. In translate-train, the system is trained on the MS MARCO English queries…
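The "multi-representation" idea in the abstract can be made concrete with a small sketch. ColBERT-style models keep one vector per token and score a document by late interaction: each query token vector is matched to its most similar document token vector, and those maxima are summed (often called MaxSim). The abstract does not spell out the scoring function, so the code below follows the original ColBERT formulation as an assumption about ColBERT-X. The function names `l2_normalize` and `maxsim_score` and the 2-dimensional toy vectors are hypothetical; in a real system the vectors would come from the XLM-R encoder.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score (assumed, ColBERT-style): for each query token
    vector, take its best cosine match among the document token vectors,
    then sum those maxima over all query tokens."""
    q = [l2_normalize(v) for v in query_vecs]
    d = [l2_normalize(v) for v in doc_vecs]
    return sum(max(sum(a * b for a, b in zip(qv, dv)) for dv in d) for qv in q)

# Toy token embeddings, made up for illustration (not encoder output).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]  # close match for both query tokens
doc_b = [[1.0, 1.0]]                          # only a partial match for each

# Rank the two documents by descending MaxSim score.
ranked = sorted(
    {"doc_a": doc_a, "doc_b": doc_b}.items(),
    key=lambda kv: maxsim_score(query, kv[1]),
    reverse=True,
)
print([name for name, _ in ranked])
```

Because the query and document vectors only meet at scoring time, document token vectors can be computed and indexed ahead of time. In the cross-language setting the query and document tokens are in different languages, so the quality of the matches depends on how well the multilingual encoder aligns them.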

Code & Models

Repositories

hltcoe/colbert-x
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies

Methods: Attention Is All You Need · XLM-R · Linear Layer · Layer Normalization · Dense Connections · Linear Warmup With Linear Decay · Softmax · Multi-Head Attention · Weight Decay · Adam