Doc2Query--: When Less is More

Mitko Gospodinov; Sean MacAvaney; Craig Macdonald

arXiv:2301.03266·cs.IR·February 28, 2023·1 cites

TL;DR

This paper investigates the hallucination problem in Doc2Query, a document expansion technique, and proposes filtering methods that enhance retrieval effectiveness, reduce index size, and improve efficiency.

Contribution

It introduces a relevance-based filtering approach that removes poor-quality generated queries before indexing, mitigating hallucination in Doc2Query while improving retrieval effectiveness and reducing index size.

Findings

01. Filtering with a relevance model improves the retrieval effectiveness of Doc2Query by up to 16%.

02. Filtering reduces mean query execution time by 23%.

03. Filtering reduces index size by 33%.

Abstract

Doc2Query -- the process of expanding the content of a document before indexing using a sequence-to-sequence model -- has emerged as a prominent technique for improving the first-stage retrieval effectiveness of search engines. However, sequence-to-sequence models are known to be prone to "hallucinating" content that is not present in the source text. We argue that Doc2Query is indeed prone to hallucination, which ultimately harms retrieval effectiveness and inflates the index size. In this work, we explore techniques for filtering out these harmful queries prior to indexing. We find that using a relevance model to remove poor-quality queries can improve the retrieval effectiveness of Doc2Query by up to 16%, while simultaneously reducing mean query execution time by 23% and cutting the index size by 33%. We release the code, data, and a live demonstration to facilitate reproduction and…
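The filtering idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `overlap_score` is a toy token-overlap stand-in for a relevance model, and the function names and the `threshold` value are hypothetical choices made for this example.

```python
from typing import Callable, List


def overlap_score(query: str, document: str) -> float:
    """Toy stand-in for a relevance model (an assumption for this sketch):
    the fraction of query tokens that also appear in the document."""
    query_tokens = query.lower().split()
    doc_tokens = set(document.lower().split())
    if not query_tokens:
        return 0.0
    return sum(tok in doc_tokens for tok in query_tokens) / len(query_tokens)


def filter_queries(
    document: str,
    queries: List[str],
    score_fn: Callable[[str, str], float] = overlap_score,
    threshold: float = 0.5,  # hypothetical cut-off, not a value from the paper
) -> List[str]:
    """Keep only generated queries whose score meets the threshold."""
    return [q for q in queries if score_fn(q, document) >= threshold]


def expand_document(
    document: str,
    queries: List[str],
    score_fn: Callable[[str, str], float] = overlap_score,
    threshold: float = 0.5,
) -> str:
    """Append the surviving queries to the document text before indexing."""
    kept = filter_queries(document, queries, score_fn, threshold)
    return " ".join([document] + kept)


doc = "the terrier is a small dog breed"
generated = ["small dog breed", "how tall is the eiffel tower", "terrier dog"]
print(filter_queries(doc, generated))
print(expand_document(doc, generated))
```

In this sketch the off-topic query is dropped because few of its tokens occur in the document, so it never reaches the index. A learned relevance model would take the place of `overlap_score`; only the score-then-threshold shape is meant to carry over.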

Code & Models

Repositories

terrierteam/pyterrier_doc2query · PyTorch · Official

Taxonomy

Topics: Information Retrieval and Search Behavior · Web Data Mining and Analysis · Topic Modeling