GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval
Kexin Wang, Nandan Thakur, Nils Reimers, Iryna Gurevych

TL;DR
This paper introduces Generative Pseudo Labeling (GPL), an unsupervised domain adaptation method for dense retrieval that improves performance on specialized datasets with less target domain data.
Contribution
GPL combines a query generator with pseudo labeling from a cross-encoder to enhance dense retrieval across domains without requiring labeled data.
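As a rough illustration of the pseudo-labeling idea, the sketch below scores a (query, positive, negative) triple with a teacher and trains a student to match the teacher's score margin. It assumes a MarginMSE-style objective, which the summary above does not spell out (the name of the listed `msmarco-distilbert-margin-mse` model hints at it). The names `toy_teacher`, `pseudo_label`, and `margin_mse` are hypothetical. The word-overlap teacher is only a stand-in for a real cross-encoder, and the hand-written vectors are a stand-in for dense retriever embeddings.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity, standing in for the dense retriever's score."""
    return sum(x * y for x, y in zip(a, b))


def toy_teacher(query: str, passage: str) -> float:
    """Stand-in for a cross-encoder (assumption): counts shared lowercase tokens."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


def pseudo_label(teacher: Callable[[str, str], float],
                 query: str, positive: str, negative: str) -> float:
    """Soft label for a triple: teacher score margin between positive and negative."""
    return teacher(query, positive) - teacher(query, negative)


def margin_mse(student_margins: Sequence[float],
               teacher_margins: Sequence[float]) -> float:
    """Mean squared error between student and teacher margins."""
    if len(student_margins) != len(teacher_margins):
        raise ValueError("margin lists must have the same length")
    squared = [(s - t) ** 2 for s, t in zip(student_margins, teacher_margins)]
    return sum(squared) / len(squared)


# A generated query with one positive and one mined negative passage (toy data).
query = "what is dense retrieval"
positive = "dense retrieval encodes text as vectors"
negative = "the weather is sunny"

label = pseudo_label(toy_teacher, query, positive, negative)

# Hand-written toy embeddings for the student (assumption, not real model output).
q_vec, pos_vec, neg_vec = [1.0, 0.0], [0.5, 0.0], [0.0, 1.0]
student_margin = dot(q_vec, pos_vec) - dot(q_vec, neg_vec)

loss = margin_mse([student_margin], [label])
print(label, student_margin, loss)
```

Because the label is a continuous margin rather than a hard relevant/irrelevant tag, a poorly generated query or a mined "negative" that is actually relevant yields a small or negative margin instead of a wrong binary label.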
Findings
On six domain-specialized datasets, GPL outperforms an out-of-the-box state-of-the-art dense retrieval approach by up to 9.3 points nDCG@10.
GPL is more robust and requires less unlabeled data from the target domain.
Combining GPL with TSDAE yields an additional 1.4 points nDCG@10 improvement.
Abstract
Dense retrieval approaches can overcome the lexical gap and lead to significantly improved search results. However, they require large amounts of training data which is not available for most domains. As shown in previous work (Thakur et al., 2021b), the performance of dense retrievers severely degrades under a domain shift. This limits the usage of dense retrieval approaches to only a few domains with large training datasets. In this paper, we propose the novel unsupervised domain adaptation method Generative Pseudo Labeling (GPL), which combines a query generator with pseudo labeling from a cross-encoder. On six representative domain-specialized datasets, we find the proposed GPL can outperform an out-of-the-box state-of-the-art dense retrieval approach by up to 9.3 points nDCG@10. GPL requires less (unlabeled) data from the target domain and is more robust in its training than…
Code & Models
- 🤗 GPL/msmarco-distilbert-margin-mse · model · 8 dl · ♡ 1
- 🤗 doc2query/msmarco-german-mt5-base-v1 · model · 12 dl · ♡ 6
- 🤗 doc2query/msmarco-arabic-mt5-base-v1 · model · 42 dl · ♡ 2
- 🤗 doc2query/msmarco-chinese-mt5-base-v1 · model · 13 dl · ♡ 14
- 🤗 doc2query/msmarco-dutch-mt5-base-v1 · model · 2 dl · ♡ 2
- 🤗 doc2query/msmarco-french-mt5-base-v1 · model · 15 dl · ♡ 4
- 🤗 doc2query/msmarco-hindi-mt5-base-v1 · model · 2 dl · ♡ 1
- 🤗 doc2query/msmarco-indonesian-mt5-base-v1 · model · 3 dl · ♡ 2
- 🤗 doc2query/msmarco-italian-mt5-base-v1 · model · 2 dl · ♡ 1
- 🤗 doc2query/msmarco-japanese-mt5-base-v1 · model · 44 dl · ♡ 5
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Attention Is All You Need · Softmax · TSDAE
