MS-Shift: An Analysis of MS MARCO Distribution Shifts on Neural Retrieval

Simon Lupart; Thibault Formal; Stéphane Clinchant

arXiv:2205.02870 · cs.IR · January 26, 2023 · 1 citation

TL;DR

This paper investigates how neural retrieval models based on BERT perform under explicit distribution shifts within MS MARCO, revealing varying robustness among different approaches and providing resources for future benchmarking.

Contribution

The paper introduces three controlled query-based distribution shifts within MS MARCO (query-semantic, query-intent, query-length) and analyzes their impact on sparse, dense, and late-interaction neural retrievers, as well as a monoBERT re-ranker.

Findings

01 Dense models are most affected by distribution shifts.

02 Performance drops correlate with vocabulary and representation dissimilarity.

03 MS MARCO query subsets are released for benchmarking zero-shot transfer.

Abstract

Pre-trained Language Models have recently emerged in Information Retrieval as providing the backbone of a new generation of neural systems that outperform traditional methods on a variety of tasks. However, it is still unclear to what extent such approaches generalize in zero-shot conditions. The recent BEIR benchmark provides partial answers to this question by comparing models on datasets and tasks that differ from the training conditions. We aim to address the same question by comparing models under more explicit distribution shifts. To this end, we build three query-based distribution shifts within MS MARCO (query-semantic, query-intent, query-length), which are used to evaluate the three main families of neural retrievers based on BERT: sparse, dense, and late-interaction -- as well as a monoBERT re-ranker. We further analyse the performance drops between the train and test query…
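
As a rough illustration of the query-length shift and of the vocabulary dissimilarity mentioned in the findings, the sketch below splits a list of queries by token count and measures how far apart the two resulting vocabularies are. Everything in it is an assumption made for illustration: the function names, the whitespace tokenization, the length threshold, the toy queries, and the use of Jaccard distance are not taken from the paper, whose exact procedure is not described on this page.

```python
from typing import Iterable


def split_by_length(queries: list[str], max_train_len: int = 5) -> tuple[list[str], list[str]]:
    """Hypothetical query-length shift: short queries form the train side,
    longer ones the test side. Length is a whitespace token count."""
    train = [q for q in queries if len(q.split()) <= max_train_len]
    test = [q for q in queries if len(q.split()) > max_train_len]
    return train, test


def vocabulary(queries: Iterable[str]) -> set[str]:
    """Set of lowercased whitespace tokens over all queries."""
    return {tok for q in queries for tok in q.lower().split()}


def jaccard_dissimilarity(a: set[str], b: set[str]) -> float:
    """1 - |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Toy queries, invented for this example (not MS MARCO data).
queries = [
    "what is bm25",
    "define sparse retrieval",
    "how does late interaction scoring work in colbert models",
    "what is the difference between dense and sparse neural retrieval",
]

train_queries, test_queries = split_by_length(queries, max_train_len=5)
shift = jaccard_dissimilarity(vocabulary(train_queries), vocabulary(test_queries))
print(f"train={len(train_queries)} test={len(test_queries)} vocab dissimilarity={shift:.3f}")
```

A higher dissimilarity value means the train and test queries share fewer terms. In an analysis of this kind, that value would be compared against the drop in retrieval effectiveness between the two sides of the split.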

Code & Models

Repositories

naver/ms-marco-shift (Official)

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications