BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

Nandan Thakur; Nils Reimers; Andreas Rücklé; Abhishek Srivastava; Iryna Gurevych

arXiv:2104.08663·cs.IR·October 22, 2021·25 cites

TL;DR

BEIR is a comprehensive benchmark comprising 18 diverse datasets designed to evaluate the zero-shot generalization of various information retrieval models across different domains and tasks.

Contribution

The paper introduces BEIR, a heterogeneous benchmark for evaluating IR models' out-of-distribution generalization, including a diverse set of datasets and systematic evaluation of multiple retrieval architectures.

Findings

01. BM25 is a robust baseline.

02. Re-ranking and late-interaction models perform best but are computationally expensive.

03. Dense and sparse models are efficient but underperform, indicating room for improvement.
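
For readers unfamiliar with the lexical baseline named in the first finding, the following is a minimal sketch of Okapi BM25 scoring. It is illustrative only and is not the implementation evaluated in the paper. The function name `bm25_scores`, the whitespace tokenization, the toy corpus, the smoothed IDF variant, and the parameter values `k1=1.2` and `b=0.75` are all assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against a tokenized `query`.

    Hypothetical helper for illustration; k1 and b are commonly used
    defaults, not values taken from the paper.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF variant that stays positive for frequent terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (assumed data), tokenized by whitespace.
corpus = [
    "bm25 is a lexical ranking function".split(),
    "dense retrievers embed queries and documents".split(),
    "re-ranking models rescore a candidate list".split(),
]
query = "lexical ranking".split()

scores = bm25_scores(query, corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(best, scores)
```

Because scoring depends only on term overlap and corpus statistics, a function like this needs no training data, which is one plausible reason a lexical baseline transfers across domains in a zero-shot setting.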

Abstract

Existing neural information retrieval (IR) models have often been studied in homogeneous and narrow settings, which has considerably limited insights into their out-of-distribution (OOD) generalization capabilities. To address this, and to facilitate researchers to broadly evaluate the effectiveness of their models, we introduce Benchmarking-IR (BEIR), a robust and heterogeneous evaluation benchmark for information retrieval. We leverage a careful selection of 18 publicly available datasets from diverse text retrieval tasks and domains and evaluate 10 state-of-the-art retrieval systems including lexical, sparse, dense, late-interaction and re-ranking architectures on the BEIR benchmark. Our results show BM25 is a robust baseline and re-ranking and late-interaction-based models on average achieve the best zero-shot performances, however, at high computational costs. In contrast, dense…

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications