DAPR: A Benchmark on Document-Aware Passage Retrieval
Kexin Wang, Nils Reimers, Iryna Gurevych

TL;DR
This paper introduces Document-Aware Passage Retrieval (DAPR), a new benchmark for retrieving relevant passages within long documents, highlighting the limitations of current methods and providing a platform for future research.
Contribution
The paper proposes the DAPR benchmark, analyzes errors in existing retrieval methods, and evaluates hybrid and contextualized approaches on this new task.
Findings
Hybrid retrieval fails on hard, context-dependent queries.
Contextualized passage representations improve hard query performance.
Current methods perform poorly overall on document-aware retrieval.
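To illustrate what a contextualized passage representation could look like in its simplest form, the sketch below prefixes each passage with the title of its parent document before it is handed to an encoder. This is an assumption made for illustration only: the `Document` class, the `contextualize` helper, and the separator are hypothetical names, and the summary above does not say which kind of document context the benchmark's methods actually use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Document:
    """Hypothetical container: a titled document split into passages."""
    title: str
    passages: List[str]


def contextualize(doc: Document, sep: str = " | ") -> List[str]:
    """Prefix every passage with its document title (illustrative choice).

    The resulting strings would then be encoded by a passage retriever
    in place of the bare passages.
    """
    return [f"{doc.title}{sep}{passage}" for passage in doc.passages]


doc = Document(
    title="Venice",
    passages=["The city is built on a group of islands."],
)
print(contextualize(doc))
```

A passage such as "The city is built on a group of islands." never names the city, so a query about Venice can only match it once some document-level signal is attached.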
Abstract
Neural retrieval work has so far focused on ranking short texts and struggles with long documents. There are many cases where users want to find a relevant passage within a long document from a huge corpus, e.g. Wikipedia articles or research papers. We propose and name this task Document-Aware Passage Retrieval (DAPR). Analyzing the errors of state-of-the-art (SoTA) passage retrievers, we find that most errors (53.5%) are due to missing document context. This drives us to build a benchmark for this task that includes multiple datasets from heterogeneous domains. In the experiments, we extend the SoTA passage retrievers with document context via (1) hybrid retrieval with BM25 and (2) contextualized passage representations, which inform the passage representation with document context. We find that although hybrid retrieval performs the strongest on the mixture…
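The hybrid retrieval idea in the abstract can be sketched as a score fusion. The example below is a minimal sketch under stated assumptions, not the authors' implementation: it assumes passage scores from a neural retriever and BM25 scores computed per document are already available, min-max normalizes each, and mixes them with a convex combination. The function names, the `alpha` weight, and the choice of normalization are all hypothetical.

```python
from typing import Dict


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant score map becomes all zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 0.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def hybrid_scores(
    passage_scores: Dict[str, float],   # passage id -> neural retriever score
    doc_bm25_scores: Dict[str, float],  # document id -> BM25 score (assumed precomputed)
    passage_to_doc: Dict[str, str],     # passage id -> parent document id
    alpha: float = 0.5,                 # hypothetical mixing weight
) -> Dict[str, float]:
    """Convex combination of a passage score and its document's BM25 score."""
    p_norm = min_max_normalize(passage_scores)
    d_norm = min_max_normalize(doc_bm25_scores)
    return {
        pid: alpha * p_norm[pid]
        + (1.0 - alpha) * d_norm.get(passage_to_doc[pid], 0.0)
        for pid in passage_scores
    }


fused = hybrid_scores(
    passage_scores={"p1": 4.0, "p2": 3.0, "p3": 0.0},
    doc_bm25_scores={"d1": 2.0, "d2": 10.0},
    passage_to_doc={"p1": "d1", "p2": "d2", "p3": "d2"},
)
print(sorted(fused, key=fused.get, reverse=True))
```

In this toy input, passage `p2` overtakes `p1` after fusion because its parent document matches the query much better under BM25, which is the kind of document-level signal a passage-only retriever cannot see.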
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Handwritten Text Recognition Techniques
