Unsupervised Domain Adaptation for Sparse Retrieval by Filling Vocabulary and Word Frequency Gaps

Hiroki Iida; Naoaki Okazaki

arXiv:2211.03988 · cs.CL · November 11, 2022 · 1 cites

TL;DR

This paper introduces an unsupervised domain adaptation technique for sparse retrieval models like SPLADE, addressing vocabulary and word frequency gaps to improve out-of-domain information retrieval performance.

Contribution

It proposes a novel method combining vocabulary expansion, continual pretraining, and inverse document frequency weighting to enhance SPLADE's domain adaptation without supervision.

Findings

01 Outperforms existing unsupervised domain adaptation methods

02 Achieves state-of-the-art results when combined with BM25

03 Effectively handles vocabulary and frequency gaps in target domains

Abstract

IR models using a pretrained language model significantly outperform lexical approaches like BM25. In particular, SPLADE, which encodes texts to sparse vectors, is an effective model for practical use because it shows robustness to out-of-domain datasets. However, SPLADE still struggles with exact matching of low-frequency words in training data. In addition, domain shifts in vocabulary and word frequencies deteriorate the IR performance of SPLADE. Because supervision data are scarce in the target domain, addressing the domain shifts without supervision data is necessary. This paper proposes an unsupervised domain adaptation method by filling vocabulary and word-frequency gaps. First, we expand a vocabulary and execute continual pretraining with a masked language model on a corpus of the target domain. Then, we multiply SPLADE-encoded sparse vectors by inverse document frequency weights…
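The IDF reweighting step named in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the smoothed IDF formula, the function names `idf_weights` and `reweight`, the default weight for unseen terms, and the choice to reweight only the document side are all assumptions made for illustration, and the sparse vectors are hand-written stand-ins for SPLADE outputs.

```python
import math
from collections import Counter


def idf_weights(tokenized_docs):
    """Return a smoothed IDF per term, computed on the target-domain corpus.

    Assumed variant: log((N + 1) / (df + 1)) + 1. The paper may use a
    different IDF formula; the abstract does not specify one.
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each term once per document
    return {
        term: math.log((n_docs + 1) / (df + 1)) + 1.0
        for term, df in doc_freq.items()
    }


def reweight(sparse_vec, idf, default=1.0):
    """Multiply each term weight by its IDF.

    Terms missing from the IDF table keep their weight (default=1.0),
    which is a hypothetical choice, not taken from the paper.
    """
    return {term: w * idf.get(term, default) for term, w in sparse_vec.items()}


def dot(query_vec, doc_vec):
    """Sparse dot product over the terms shared by both vectors."""
    return sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())


# Toy target-domain corpus, already tokenized.
corpus = [
    ["the", "stent", "was", "placed"],
    ["the", "patient", "was", "stable"],
    ["the", "dose", "was", "reduced"],
]
idf = idf_weights(corpus)

# Stand-in for a SPLADE-encoded document vector (term -> weight).
doc_vec = {"the": 2.0, "stent": 0.5}
reweighted = reweight(doc_vec, idf)

# Stand-in for a SPLADE-encoded query vector.
query_vec = {"stent": 1.0}
score = dot(query_vec, reweighted)
```

In this toy corpus, "the" appears in every document and keeps its original weight, while the rarer "stent" is boosted, which is the intended effect of emphasizing low-frequency target-domain terms at matching time.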

Code & Models

Repositories

meshidenn/cai
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning