ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval
Antoine Louis, Vageesh Saxena, Gijs van Dijck, Gerasimos Spanakis

TL;DR
ColBERT-XM is a modular multi-vector retrieval model for zero-shot multilingual information retrieval. Trained on the data of a single high-resource language, it reduces both the need for labeled data and energy consumption while remaining competitive across diverse languages.
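To make "multi-vector" concrete, here is a minimal sketch of ColBERT-style late interaction (MaxSim) scoring: each query token embedding is matched to its most similar document token embedding, and those maxima are summed. That ColBERT-XM scores documents this way is an assumption inferred from its name, not something stated in this summary. The function names and toy vectors are hypothetical, and no trained encoder is involved.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score between two sets of token embeddings.

    For every query token vector, take the maximum cosine similarity over all
    document token vectors, then sum those maxima over the query tokens.
    """
    if not doc_vecs:
        return 0.0  # an empty document matches nothing
    q = [l2_normalize(v) for v in query_vecs]
    d = [l2_normalize(v) for v in doc_vecs]
    score = 0.0
    for qv in q:
        score += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return score


# Toy 2-dimensional "token embeddings" (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # covers both query tokens
doc_b = [[2.0, 0.0]]              # covers only the first query token

ranked = sorted(
    [("doc_a", maxsim_score(query, doc_a)), ("doc_b", maxsim_score(query, doc_b))],
    key=lambda pair: pair[1],
    reverse=True,
)
```

In this toy setting `doc_a` outranks `doc_b` because it contains a close match for every query token, whereas a single-vector model would have to compress each text into one embedding before comparing.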
Contribution
The paper introduces a novel modular dense retrieval architecture that transfers knowledge learned from a single high-resource language to low-resource ones without additional language-specific fine-tuning.
Findings
Achieves competitive results in zero-shot multilingual retrieval tasks.
Highly data-efficient and adaptable to out-of-distribution data.
Reduces energy consumption and carbon emissions compared to existing models.
Abstract
State-of-the-art neural retrievers predominantly focus on high-resource languages like English, which impedes their adoption in retrieval scenarios involving other languages. Current approaches circumvent the lack of high-quality labeled data in non-English languages by leveraging multilingual pretrained language models capable of cross-lingual transfer. However, these models require substantial task-specific fine-tuning across multiple languages, often perform poorly in languages with minimal representation in the pretraining corpus, and struggle to incorporate new languages after the pretraining phase. In this work, we present a novel modular dense retrieval model that learns from the rich data of a single high-resource language and effectively zero-shot transfers to a wide array of languages, thereby eliminating the need for language-specific labeled data. Our model, ColBERT-XM,…
Taxonomy
Topics: Information Retrieval and Search Behavior · Text and Document Classification Technologies · Topic Modeling
