Resources and Evaluations for Multi-Distribution Dense Information Retrieval
Soumya Chatterjee, Omar Khattab, Simran Arora

TL;DR
This paper introduces the multi-distribution information retrieval problem, creates benchmarks from existing datasets, and proposes simple budget-allocation methods that significantly improve retrieval recall across multiple domains.
Contribution
It defines the novel multi-distribution IR problem, designs new benchmarks, and presents simple, effective methods for allocating retrieval resources across diverse collections.
Findings
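The proposed allocation methods improve Recall@100 by 3.8+ points on average and by up to 8.0 points across the datasets.
Allocating the fixed top-k retrieval budget across domains prevents the known domains from consuming most of it.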
Benchmarks are publicly available for future research.
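For readers unfamiliar with the metric, the following is a minimal sketch of how Recall@k is commonly computed for a single query. The function name `recall_at_k` and its arguments are illustrative assumptions, not taken from the paper, whose exact evaluation protocol (for example, how scores are averaged over queries) may differ.

```python
def recall_at_k(ranked_ids, gold_ids, k=100):
    """Fraction of gold passages that appear among the top-k retrieved ids.

    ranked_ids: passage ids ordered from most to least relevant (assumed input).
    gold_ids:   ids of the passages judged relevant for the query (assumed input).
    """
    gold = set(gold_ids)
    if not gold:
        raise ValueError("gold_ids must be non-empty")
    # Count gold passages that made it into the first k retrieved results.
    hits = gold.intersection(ranked_ids[:k])
    return len(hits) / len(gold)
```

Under this reading, an improvement of 8 points would correspond to a change such as 0.62 to 0.70 in the averaged value, expressed as a percentage.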
Abstract
We introduce and define the novel problem of multi-distribution information retrieval (IR), where, given a query, systems need to retrieve passages from within multiple collections, each drawn from a different distribution. Some of these collections and distributions might not be available at training time. To evaluate methods for multi-distribution retrieval, we design three benchmarks for this task from existing single-distribution datasets, namely, a dataset based on question answering and two based on entity matching. We propose simple methods for this task which allocate the fixed retrieval budget (top-k passages) strategically across domains to prevent the known domains from consuming most of the budget. We show that our methods lead to improvements of 3.8+ points on average and up to 8.0 points in Recall@100 across the datasets and that improvements are consistent when fine-tuning…
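The budget-allocation idea in the abstract can be illustrated with a small sketch. It contrasts a pooled top-k baseline with an even per-domain split of the budget. This is only one simple allocation strategy under stated assumptions and is not claimed to be the authors' exact method; the function names, the `(passage_id, score)` input format, and the even split are all hypothetical choices made for illustration.

```python
import heapq


def merged_top_k(scored_by_domain, k):
    """Baseline: pool candidates from every domain and keep the k best scores.

    scored_by_domain maps a domain name to a list of (passage_id, score) pairs
    (an assumed input format).
    """
    pooled = [
        (score, pid)
        for hits in scored_by_domain.values()
        for pid, score in hits
    ]
    return [pid for _, pid in heapq.nlargest(k, pooled)]


def per_domain_top_k(scored_by_domain, k):
    """Split the budget k evenly across domains (illustrative strategy).

    Any remainder goes to the first domains in iteration order. If a domain
    has fewer candidates than its quota, the leftover budget is left unused.
    """
    domains = list(scored_by_domain)
    if not domains:
        return []
    base, extra = divmod(k, len(domains))
    selected = []
    for i, domain in enumerate(domains):
        quota = base + (1 if i < extra else 0)
        top = heapq.nlargest(quota, scored_by_domain[domain], key=lambda h: h[1])
        selected.extend(pid for pid, _ in top)
    return selected
```

If a retriever trained on one domain systematically scores that domain's passages higher, the pooled baseline can return results from that domain only, whereas the per-domain split guarantees each collection a share of the top-k slots.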
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Text and Document Classification Technologies
Methods: Balanced Selection
