Query2doc: Query Expansion with Large Language Models

Liang Wang; Nan Yang; Furu Wei

arXiv:2303.07678 · cs.IR · October 12, 2023 · 6 cites

TL;DR

This paper presents query2doc, a simple query expansion method that few-shot prompts a large language model to generate pseudo-documents and appends them to the query. It improves BM25 by 3% to 15% on MS-MARCO and TREC DL without any model fine-tuning, and it also benefits dense retrievers.

Contribution

It introduces a query expansion approach that uses LLM-generated pseudo-documents to disambiguate queries and guide the retriever, improving retrieval accuracy on MS-MARCO and TREC DL; the BM25 gains require no model fine-tuning.

Findings

01 Boosts BM25 performance by 3% to 15% on the MS-MARCO and TREC DL datasets.

02 Improves state-of-the-art dense retrievers in both in-domain and out-of-domain settings.

03 Achieves the BM25 gains without any model fine-tuning.

Abstract

This paper introduces a simple yet effective query expansion approach, denoted as query2doc, to improve both sparse and dense retrieval systems. The proposed method first generates pseudo-documents by few-shot prompting large language models (LLMs), and then expands the query with generated pseudo-documents. LLMs are trained on web-scale text corpora and are adept at knowledge memorization. The pseudo-documents from LLMs often contain highly relevant information that can aid in query disambiguation and guide the retrievers. Experimental results demonstrate that query2doc boosts the performance of BM25 by 3% to 15% on ad-hoc IR datasets, such as MS-MARCO and TREC DL, without any model fine-tuning. Furthermore, our method also benefits state-of-the-art dense retrievers in terms of both in-domain and out-of-domain results.
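The two steps described in the abstract (few-shot prompting an LLM for a pseudo-document, then expanding the query with it) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the function names `build_few_shot_prompt` and `expand_query`, the `generate` callable standing in for an LLM, and the `query_repeats` and `separator` parameters are all assumptions made for the example.

```python
from typing import Callable, Sequence, Tuple


def build_few_shot_prompt(query: str, examples: Sequence[Tuple[str, str]]) -> str:
    """Build a few-shot prompt from (query, passage) example pairs.

    The instruction text and layout are hypothetical; the abstract only says
    that pseudo-documents come from few-shot prompting an LLM.
    """
    parts = ["Write a passage that answers the given query:", ""]
    for ex_query, ex_passage in examples:
        parts.append(f"Query: {ex_query}")
        parts.append(f"Passage: {ex_passage}")
        parts.append("")
    # The final query is left open so the LLM completes the passage.
    parts.append(f"Query: {query}")
    parts.append("Passage:")
    return "\n".join(parts)


def expand_query(
    query: str,
    examples: Sequence[Tuple[str, str]],
    generate: Callable[[str], str],
    query_repeats: int = 1,
    separator: str = " ",
) -> str:
    """Return the query concatenated with an LLM-generated pseudo-document.

    `generate` is any function mapping a prompt string to generated text
    (an assumed stand-in for a real LLM client). `query_repeats` is an
    assumed knob that repeats the original query so its terms are not
    drowned out by the longer pseudo-document in term-based scoring.
    """
    prompt = build_few_shot_prompt(query, examples)
    pseudo_doc = generate(prompt).strip()
    return separator.join([query] * query_repeats + [pseudo_doc])


if __name__ == "__main__":
    # Stand-in for an LLM call, used only so the sketch runs offline.
    def fake_llm(prompt: str) -> str:
        return "Zebras are African equines with black and white striped coats."

    demo_examples = [
        ("what is bm25", "BM25 is a ranking function used by search engines."),
    ]
    print(expand_query("what do zebras look like", demo_examples, fake_llm))
```

The expanded string would then be passed to the retriever in place of the original query. For a dense retriever, one might set `separator` to the model's separator token instead of a space; that choice is likewise an assumption here.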

Code & Models

Datasets

intfloat/query2doc_msmarco
dataset · 195 dl

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Web Data Mining and Analysis