Dense X Retrieval: What Retrieval Granularity Should We Use?

Tong Chen; Hongwei Wang; Sihao Chen; Wenhao Yu; Kaixin Ma; Xinran Zhao; Hongming Zhang; Dong Yu

arXiv:2312.06648·cs.CL·October 7, 2024·5 cites

TL;DR

This paper investigates how the choice of retrieval units affects dense retrieval performance in NLP, introducing propositions as a novel fine-grained unit that improves retrieval and downstream task results.

Contribution

It introduces propositions as a new retrieval unit and empirically demonstrates their superiority over passages for dense retrieval tasks.

Findings

01 Propositions outperform passages in retrieval accuracy.

02 Fine-grained retrieval units enhance downstream QA performance.

03 Indexing by propositions improves retrieval efficiency.

Abstract

Dense retrieval has become a prominent method to obtain relevant context or world knowledge in open-domain NLP tasks. When we use a learned dense retriever on a retrieval corpus at inference time, an often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g., document, passage, or sentence. We discover that the retrieval unit choice significantly impacts the performance of both retrieval and downstream tasks. Distinct from the typical approach of using passages or sentences, we introduce a novel retrieval unit, proposition, for dense retrieval. Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid and presented in a concise, self-contained natural language format. We conduct an empirical comparison of different retrieval granularities. Our experiments reveal that indexing a corpus by fine-grained units such as…
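To make the notion of a retrieval unit concrete, here is a toy sketch that indexes the same text at passage and at sentence granularity and runs one query against each index. It is not the paper's method: bag-of-words cosine similarity stands in for a learned dense retriever, the two passages and the query are invented for illustration, and the helper names (`split_units`, `build_index`, `search`) are hypothetical. Proposition-level units are left out because producing them requires a separate model.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def split_units(passage, granularity):
    """Split one passage into retrieval units at the requested granularity."""
    if granularity == "passage":
        return [passage]
    if granularity == "sentence":
        # Naive split on whitespace that follows sentence-ending punctuation.
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]
    raise ValueError(f"unsupported granularity: {granularity}")


def build_index(passages, granularity):
    """Return a list of (passage_id, unit_text, bag_of_words) entries."""
    index = []
    for pid, passage in enumerate(passages):
        for unit in split_units(passage, granularity):
            index.append((pid, unit, Counter(tokenize(unit))))
    return index


def search(index, query, k=1):
    """Rank indexed units by similarity to the query and return the top k."""
    q = Counter(tokenize(query))
    ranked = sorted(index, key=lambda entry: cosine(q, entry[2]), reverse=True)
    return [(pid, unit) for pid, unit, _ in ranked[:k]]


# Invented example corpus and query (assumptions, not data from the paper).
passages = [
    "The Leaning Tower of Pisa is a bell tower in Italy. "
    "The tower leans about four degrees. "
    "Construction of the tower began in 1173.",
    "The Eiffel Tower is an iron tower in Paris. It was completed in 1889.",
]
query = "When did construction of the tower begin?"

passage_index = build_index(passages, "passage")
sentence_index = build_index(passages, "sentence")

passage_hit = search(passage_index, query)[0]
sentence_hit = search(sentence_index, query)[0]

print("passage unit :", passage_hit)
print("sentence unit:", sentence_hit)
```

Both indexes point to the same source passage, but the sentence-level hit contains only the text that answers the query, while the passage-level hit also carries the unrelated sentences around it. That difference in how much irrelevant text accompanies the answer is the design choice the abstract refers to.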

Code & Models

Models

chentong00/propositionizer-wiki-flan-t5-large
model · 656 dl · ♡ 49

Videos

Dense X Retrieval: What Retrieval Granularity Should We Use?· underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems