Unsupervised Domain Adaptation for Sparse Retrieval by Filling Vocabulary and Word Frequency Gaps
Hiroki Iida, Naoaki Okazaki

TL;DR
This paper introduces an unsupervised domain adaptation technique for sparse retrieval models like SPLADE, addressing vocabulary and word frequency gaps to improve out-of-domain information retrieval performance.
Contribution
It proposes a novel method combining vocabulary expansion, continual pretraining, and inverse document frequency weighting to enhance SPLADE's domain adaptation without supervision.
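The vocabulary gap can be illustrated with a small sketch. This is not the authors' procedure: the function `vocabulary_gap`, its whitespace tokenization, and the `top_k` and `min_count` parameters are hypothetical choices made for illustration. It only shows the idea of finding words that are frequent in a target-domain corpus but absent from the base vocabulary, which are the natural candidates for vocabulary expansion before continual pretraining.

```python
from collections import Counter


def vocabulary_gap(target_corpus, base_vocab, top_k=3, min_count=2):
    """Return frequent target-domain words missing from the base vocabulary.

    Assumption: plain whitespace tokenization and a raw frequency cutoff;
    the paper summary does not specify how candidates are selected.
    """
    counts = Counter(
        word for doc in target_corpus for word in doc.lower().split()
    )
    candidates = [
        word
        for word, count in counts.most_common()
        if word not in base_vocab and count >= min_count
    ]
    return candidates[:top_k]


# Toy base vocabulary and target-domain corpus (both made up).
base_vocab = {"the", "of", "patients", "with", "in", "and", "treatment"}
target_corpus = [
    "treatment of patients with thrombocytopenia",
    "thrombocytopenia in patients with sepsis",
    "sepsis and thrombocytopenia",
]

new_tokens = vocabulary_gap(target_corpus, base_vocab)
print(new_tokens)
```

In a real pipeline the selected tokens would be added to the model's tokenizer and embedding matrix, and the new embeddings would then be trained during continual pretraining on the target corpus.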
Findings
Outperforms existing unsupervised domain adaptation methods
Achieves state-of-the-art results when combined with BM25
Effectively handles vocabulary and frequency gaps in target domains
Abstract
IR models using a pretrained language model significantly outperform lexical approaches like BM25. In particular, SPLADE, which encodes texts to sparse vectors, is an effective model for practical use because it shows robustness to out-of-domain datasets. However, SPLADE still struggles with exact matching of low-frequency words in training data. In addition, domain shifts in vocabulary and word frequencies deteriorate the IR performance of SPLADE. Because supervision data are scarce in the target domain, addressing the domain shifts without supervision data is necessary. This paper proposes an unsupervised domain adaptation method by filling vocabulary and word-frequency gaps. First, we expand a vocabulary and execute continual pretraining with a masked language model on a corpus of the target domain. Then, we multiply SPLADE-encoded sparse vectors by inverse document frequency weights…
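The IDF-weighting step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the smoothed IDF formula, the choice to count document frequency from nonzero vector entries, and the choice to weight only the document side are all assumptions, and the toy vectors stand in for real SPLADE outputs.

```python
import numpy as np


def idf_weights(doc_vectors: np.ndarray) -> np.ndarray:
    """Smoothed IDF per vocabulary term.

    Assumption: document frequency is the number of documents whose
    sparse vector has a nonzero weight for the term.
    """
    n_docs = doc_vectors.shape[0]
    df = np.count_nonzero(doc_vectors, axis=0)
    return np.log((1 + n_docs) / (1 + df)) + 1.0


def apply_idf(vectors: np.ndarray, idf: np.ndarray) -> np.ndarray:
    """Element-wise reweighting; zero entries stay zero, so sparsity is kept."""
    return vectors * idf


# Toy sparse vectors over a 4-term vocabulary (rows are documents).
docs = np.array([
    [1.2, 0.0, 0.5, 0.0],
    [0.8, 0.0, 0.0, 0.0],
    [0.9, 0.7, 0.0, 0.0],
])
query = np.array([1.0, 1.0, 0.0, 0.0])

idf = idf_weights(docs)
weighted_docs = apply_idf(docs, idf)

# Dot-product scoring, as is usual for sparse retrieval.
scores = weighted_docs @ query
print(scores)
```

Term 0 appears in every document, so its weight is left unchanged, while the rarer term 1 is boosted. This is the intended effect: matches on words that are rare in the target domain contribute more to the score.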
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
