Embedding-based Zero-shot Retrieval through Query Generation
Davis Liang, Peng Xu, Siamak Shakeri, Cicero Nogueira dos Santos, Ramesh Nallapati, Zhiheng Huang, Bing Xiang

TL;DR
This paper introduces a synthetic data generation method, based on query generation, for training embedding-based zero-shot passage retrieval models. The resulting model significantly outperforms BM25 on 5 of the 6 datasets tested.
Contribution
The work presents a new approach for generating synthetic training data to enhance neural retrieval models, enabling effective zero-shot retrieval without extensive labeled datasets.
Findings
Outperforms BM25 on 5 out of 6 datasets
Average Recall@1 improvement of 2.45 points over BM25
Synthetic data can sometimes surpass real data for training
Abstract
Passage retrieval addresses the problem of locating relevant passages, usually from a large corpus, given a query. In practice, lexical term-matching algorithms like BM25 are popular choices for retrieval owing to their efficiency. However, term-based matching algorithms often miss relevant passages that have no lexical overlap with the query and cannot be finetuned to downstream datasets. In this work, we consider the embedding-based two-tower architecture as our neural retrieval model. Since labeled data can be scarce and because neural retrieval models require vast amounts of data to train, we propose a novel method for generating synthetic training data for retrieval. Our system produces remarkable results, significantly outperforming BM25 on 5 out of 6 datasets tested, by an average of 2.45 points for Recall@1. In some cases, our model trained on synthetic data can even outperform…
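The two ideas the abstract relies on, scoring passages against a query in a shared embedding space and measuring Recall@1, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not state the similarity function (a dot product is used here), the vectors are hand-written stand-ins for the outputs of the query and passage encoders, and the function names `retrieve_top1` and `recall_at_1` are hypothetical, not taken from the paper.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_top1(query_vec: Sequence[float],
                  passage_vecs: List[Sequence[float]]) -> int:
    """Return the index of the highest-scoring passage for one query.

    Assumption: relevance is the dot product between the query embedding
    and each passage embedding (one tower per side).
    """
    scores = [dot(query_vec, p) for p in passage_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


def recall_at_1(query_vecs: List[Sequence[float]],
                passage_vecs: List[Sequence[float]],
                gold: List[int]) -> float:
    """Fraction of queries whose top-ranked passage is the gold passage.

    Assumption: each query has exactly one relevant passage.
    """
    hits = sum(
        1 for q, g in zip(query_vecs, gold)
        if retrieve_top1(q, passage_vecs) == g
    )
    return hits / len(gold)


# Toy, made-up embeddings standing in for encoder outputs.
passages = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
queries = [[0.9, 0.1], [0.1, 0.9], [0.3, 0.1]]
gold = [0, 1, 2]  # index of the relevant passage for each query

score = recall_at_1(queries, passages, gold)
print(f"Recall@1 = {score:.3f}")
```

In this toy setup the third query ranks passage 0 above its gold passage 2, so two of the three queries count as hits. In a zero-shot setting such as the one the abstract describes, the encoders producing these vectors would be trained on synthetic query-passage pairs rather than labeled data.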
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
