DAPR: A Benchmark on Document-Aware Passage Retrieval

Kexin Wang; Nils Reimers; Iryna Gurevych

arXiv:2305.13915 · cs.IR · June 11, 2024 · 1 citation

TL;DR

This paper introduces Document-Aware Passage Retrieval (DAPR), a new benchmark for retrieving relevant passages within long documents, highlighting the limitations of current methods and providing a platform for future research.

Contribution

The paper proposes the DAPR benchmark, analyzes errors in existing retrieval methods, and evaluates hybrid and contextualized approaches on this new task.

Findings

1. Hybrid retrieval fails on hard, context-dependent queries.
2. Contextualized passage representations improve hard query performance.
3. Current methods perform poorly overall on document-aware retrieval.

Abstract

Neural retrieval work so far has focused on ranking short texts and struggles with long documents. In many cases, users want to find a relevant passage within a long document from a huge corpus, e.g. Wikipedia articles or research papers. We propose and name this task Document-Aware Passage Retrieval (DAPR). Analyzing the errors of state-of-the-art (SoTA) passage retrievers, we find that most errors (53.5%) are due to missing document context. This drives us to build a benchmark for this task that includes multiple datasets from heterogeneous domains. In the experiments, we extend the SoTA passage retrievers with document context via (1) hybrid retrieval with BM25 and (2) contextualized passage representations, which inform the passage representation with document context. We find that although hybrid retrieval performs the strongest on the mixture…
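
The abstract names hybrid retrieval with BM25 as one way to bring document context into passage ranking, but it does not say how the scores are combined. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes passage-level scores from a neural retriever and document-level BM25 scores are already computed, min-max normalizes each set, and fuses them with a convex combination. The function names, the `weight` parameter, and the toy scores are hypothetical.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; all-equal inputs map to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 0.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def hybrid_scores(
    passage_scores: Dict[str, float],   # assumed: neural passage-level scores
    doc_scores: Dict[str, float],       # assumed: BM25 document-level scores
    passage_to_doc: Dict[str, str],     # maps each passage id to its document id
    weight: float = 0.5,                # hypothetical fusion weight
) -> Dict[str, float]:
    """Fuse each passage's score with the score of its parent document."""
    p_norm = min_max(passage_scores)
    d_norm = min_max(doc_scores)
    fused = {}
    for pid, p_score in p_norm.items():
        # Documents without a BM25 score contribute nothing.
        d_score = d_norm.get(passage_to_doc[pid], 0.0)
        fused[pid] = weight * p_score + (1.0 - weight) * d_score
    return fused


def rank(fused: Dict[str, float]) -> List[str]:
    """Return passage ids sorted by fused score, best first."""
    return sorted(fused, key=fused.get, reverse=True)


# Toy example: the passage from d2 scores slightly lower on its own,
# but its document matches the query much better under BM25.
passage_scores = {"d1#0": 0.9, "d1#1": 0.2, "d2#0": 0.8}
doc_scores = {"d1": 1.0, "d2": 5.0}
passage_to_doc = {"d1#0": "d1", "d1#1": "d1", "d2#0": "d2"}

ranking = rank(hybrid_scores(passage_scores, doc_scores, passage_to_doc))
```

In this toy case the document-level signal moves the passage from d2 ahead of the top neural hit from d1, which is the kind of correction document context is meant to provide. Other fusion schemes (for example rank-based fusion) would fit the same interface.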

Code & Models

Models

kwang2049/long-coref
model · 5 downloads · 3 likes

Datasets

UKPLab/dapr
dataset · 924 downloads

Videos

DAPR: A Benchmark on Document-Aware Passage Retrieval · Underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Handwritten Text Recognition Techniques