Query2doc: Query Expansion with Large Language Models
Liang Wang, Nan Yang, Furu Wei

TL;DR
This paper presents query2doc, a simple query expansion method that few-shot prompts a large language model to generate a pseudo-document and expands the query with it, improving both sparse and dense retrieval without any fine-tuning.
Contribution
It introduces a novel query expansion approach leveraging LLM-generated pseudo-documents, enhancing retrieval accuracy across multiple datasets without additional training.
Findings
Boosts BM25 performance by 3% to 15% on MS-MARCO and TREC DL datasets.
Improves dense retriever results in both in-domain and out-of-domain scenarios.
Operates effectively without any model fine-tuning.
Abstract
This paper introduces a simple yet effective query expansion approach, denoted as query2doc, to improve both sparse and dense retrieval systems. The proposed method first generates pseudo-documents by few-shot prompting large language models (LLMs), and then expands the query with generated pseudo-documents. LLMs are trained on web-scale text corpora and are adept at knowledge memorization. The pseudo-documents from LLMs often contain highly relevant information that can aid in query disambiguation and guide the retrievers. Experimental results demonstrate that query2doc boosts the performance of BM25 by 3% to 15% on ad-hoc IR datasets, such as MS-MARCO and TREC DL, without any model fine-tuning. Furthermore, our method also benefits state-of-the-art dense retrievers in terms of both in-domain and out-of-domain results.
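The two steps described in the abstract (few-shot prompting an LLM for a pseudo-document, then expanding the query with it) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the `generate` callable standing in for an LLM client, and the `query_repeats` weighting knob are all hypothetical names introduced here.

```python
from typing import Callable, List, Sequence, Tuple


def build_prompt(query: str, examples: Sequence[Tuple[str, str]]) -> str:
    """Build a few-shot prompt from (query, passage) demonstration pairs.

    The instruction wording is an assumption made for illustration.
    """
    lines: List[str] = ["Write a passage that answers the given query:", ""]
    for example_query, example_passage in examples:
        lines += [f"Query: {example_query}", f"Passage: {example_passage}", ""]
    lines += [f"Query: {query}", "Passage:"]
    return "\n".join(lines)


def expand_query(
    query: str,
    generate: Callable[[str], str],
    examples: Sequence[Tuple[str, str]],
    query_repeats: int = 1,
) -> str:
    """Return the query concatenated with an LLM-generated pseudo-document.

    `generate` is a hypothetical stand-in for any LLM call (prompt -> text).
    `query_repeats` is an assumed knob that repeats the original query so its
    terms are not drowned out by the longer pseudo-document in a
    term-frequency retriever such as BM25.
    """
    pseudo_doc = generate(build_prompt(query, examples)).strip()
    return " ".join([query] * query_repeats + [pseudo_doc])


# Usage with a stub in place of a real LLM, so the sketch runs offline.
examples = [("who wrote hamlet", "Hamlet is a tragedy written by William Shakespeare.")]


def fake_llm(prompt: str) -> str:
    return " Zebras are African equines with striped coats. "


prompt = build_prompt("what do zebras look like", examples)
expanded = expand_query("what do zebras look like", fake_llm, examples, query_repeats=2)
```

The expanded string would then be passed to the retriever in place of the original query; no retriever or LLM weights are changed, which matches the abstract's claim that the method needs no fine-tuning.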
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Web Data Mining and Analysis
