BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, Iryna Gurevych

TL;DR
BEIR is a comprehensive benchmark comprising 18 diverse datasets designed to evaluate the zero-shot generalization of various information retrieval models across different domains and tasks.
Contribution
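The paper introduces BEIR, a heterogeneous benchmark for evaluating the out-of-distribution generalization of IR models. It combines 18 publicly available datasets from diverse retrieval tasks and domains with a systematic evaluation of 10 retrieval systems spanning lexical, sparse, dense, late-interaction, and re-ranking architectures.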
Findings
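BM25 is a robust baseline in the zero-shot setting.
Re-ranking and late-interaction models achieve the best average zero-shot performance, but at high computational cost.
Dense and sparse retrieval models are computationally more efficient but often underperform the other approaches, leaving considerable room to improve their generalization.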
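To make the lexical baseline in the first finding concrete, here is a minimal sketch of BM25 scoring in plain Python. It is an illustration only, not the configuration used in the paper: the function name `bm25_scores`, the whitespace tokenization, the toy documents, and the parameter values `k1=1.5` and `b=0.75` are all assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against a tokenized `query`.

    k1 and b are common default values, assumed here for illustration.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF variant that never goes negative.
            idf = math.log(
                1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5)
            )
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (hypothetical), tokenized by whitespace.
docs = [
    "bm25 is a lexical ranking function".split(),
    "dense retrievers embed queries and documents".split(),
    "late interaction models compare token embeddings".split(),
]
query = "lexical ranking".split()

scores = bm25_scores(query, docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Because BM25 depends only on term statistics of the target corpus, it needs no training data, which is one plausible reason it transfers well across the benchmark's domains.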
Abstract
Existing neural information retrieval (IR) models have often been studied in homogeneous and narrow settings, which has considerably limited insights into their out-of-distribution (OOD) generalization capabilities. To address this, and to facilitate researchers to broadly evaluate the effectiveness of their models, we introduce Benchmarking-IR (BEIR), a robust and heterogeneous evaluation benchmark for information retrieval. We leverage a careful selection of 18 publicly available datasets from diverse text retrieval tasks and domains and evaluate 10 state-of-the-art retrieval systems including lexical, sparse, dense, late-interaction and re-ranking architectures on the BEIR benchmark. Our results show BM25 is a robust baseline and re-ranking and late-interaction-based models on average achieve the best zero-shot performances, however, at high computational costs. In contrast, dense…
Code & Models
- 🤗 intfloat/multilingual-e5-large · model · 4.5M dl · ♡ 1166
- 🤗 intfloat/multilingual-e5-small · model · 3.9M dl · ♡ 296
- 🤗 Alibaba-NLP/gte-multilingual-base · model · 914k dl · ♡ 353
- 🤗 intfloat/multilingual-e5-base · model · 2.5M dl · ♡ 344
- 🤗 intfloat/e5-base-v2 · model · 1.6M dl · ♡ 154
- 🤗 intfloat/multilingual-e5-large-instruct · model · 1.3M dl · ♡ 609
- 🤗 BeIR/sparta-msmarco-distilbert-base-v1 · model · 23 dl · ♡ 4
- 🤗 doc2query/S2ORC-t5-base-v1 · model · 4 dl · ♡ 4
- 🤗 doc2query/all-t5-base-v1 · model · 193 dl · ♡ 10
- 🤗 doc2query/all-with_prefix-t5-base-v1 · model · 1.9k dl · ♡ 10
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
