Doc2Query--: When Less is More
Mitko Gospodinov, Sean MacAvaney, Craig Macdonald

TL;DR
This paper investigates the hallucination problem in Doc2Query, a document expansion technique, and proposes filtering methods that enhance retrieval effectiveness, reduce index size, and improve efficiency.
Contribution
It introduces a relevance-based filter that removes hallucinated, poor-quality generated queries before indexing, which improves retrieval effectiveness while shrinking the index and speeding up query execution.
Findings
Filtering generated queries with a relevance model improves the retrieval effectiveness of Doc2Query by up to 16%.
Filtering reduces mean query execution time by 23%.
Filtering cuts the index size by 33%.
Abstract
Doc2Query -- the process of expanding the content of a document before indexing using a sequence-to-sequence model -- has emerged as a prominent technique for improving the first-stage retrieval effectiveness of search engines. However, sequence-to-sequence models are known to be prone to "hallucinating" content that is not present in the source text. We argue that Doc2Query is indeed prone to hallucination, which ultimately harms retrieval effectiveness and inflates the index size. In this work, we explore techniques for filtering out these harmful queries prior to indexing. We find that using a relevance model to remove poor-quality queries can improve the retrieval effectiveness of Doc2Query by up to 16%, while simultaneously reducing mean query execution time by 23% and cutting the index size by 33%. We release the code, data, and a live demonstration to facilitate reproduction and…
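The filtering step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: `toy_relevance` is a hypothetical token-overlap stand-in for a real relevance model, and the function names and the threshold value are assumptions chosen only for the example.

```python
from typing import Callable, List


def toy_relevance(query: str, document: str) -> float:
    """Hypothetical stand-in for a relevance model (assumption, not the paper's scorer).

    Returns the fraction of query tokens that also occur in the document.
    """
    q_tokens = query.lower().split()
    d_tokens = set(document.lower().split())
    if not q_tokens:
        return 0.0
    return sum(t in d_tokens for t in q_tokens) / len(q_tokens)


def filter_queries(
    document: str,
    queries: List[str],
    scorer: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Keep only generated queries whose relevance score meets the threshold."""
    return [q for q in queries if scorer(q, document) >= threshold]


def expand_document(
    document: str,
    queries: List[str],
    scorer: Callable[[str, str], float],
    threshold: float,
) -> str:
    """Append the surviving queries to the document text before indexing."""
    kept = filter_queries(document, queries, scorer, threshold)
    return " ".join([document] + kept)


if __name__ == "__main__":
    doc = "the moon orbits the earth every 27 days"
    generated = [
        "how long does the moon orbit the earth",  # grounded in the document
        "who landed on mars first",                # hallucinated content
    ]
    # The threshold of 0.5 is arbitrary and used only for illustration.
    print(expand_document(doc, generated, toy_relevance, threshold=0.5))
```

In practice the scorer would be a trained relevance model rather than token overlap, and the threshold would be tuned on held-out data. Dropping low-scoring queries before indexing is what reduces the amount of text added to each document, and so the index size.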
Taxonomy
Topics: Information Retrieval and Search Behavior · Web Data Mining and Analysis · Topic Modeling
