The Pre-Training Study of Expanded-SPLADE Models on Web Document Titles
Hiun Kim, Tae Kwan Lee, Taeryun Won

TL;DR
This study investigates how different pre-training datasets and options affect the performance of Expanded-SPLADE models in web document retrieval, highlighting the impact of dataset choice and pruning on effectiveness and efficiency.
Contribution
It provides empirical insights into the effects of pre-training data and settings on SPLADE-style models for neural information retrieval.
Findings
Models pre-trained on general corpora with higher learning rates perform better after fine-tuning.
Strict pruning increases retrieval cost and variance in postings list length.
Repeating the pre-training dataset has minimal impact on retrieval effectiveness.
Abstract
Masked Language Modeling (MLM) pre-training is one of the primary ways to initialize Neural Information Retrieval (IR) models prior to retrieval fine-tuning. However, studies show that MLM pre-trained models have limited readiness and transfer learning issues when fine-tuned into Neural Bi-Encoder models. This paper studies the effect of different pre-training datasets and pre-training options on MLM pre-trained models for retrieval fine-tuning. The study focuses on the SPLADE-style model, which also uses the MLM layer at fine-tuning time. More specifically, we experiment with Expanded-SPLADE (ESPLADE) models, a specific instance of SPLADE models, using in-house web document titles as datasets. We conduct pre-training, fine-tuning, and evaluation with optional test-time pruning of sparse vectors. Our observations are three-fold: First, fine-tuned models of higher…
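As a rough illustration of what "using the MLM layer at fine-tuning time" and "test-time pruning of sparse vectors" can look like, the sketch below turns per-token MLM logits into one sparse term-weight vector using the standard SPLADE max-pooled log-saturation, then keeps only the k heaviest terms. The top-k rule, the function names, and the toy four-term vocabulary are assumptions made for illustration; the abstract does not say which pruning rule the authors use, and ESPLADE's expanded vocabulary is not modeled here.

```python
import math


def splade_weights(logits):
    """Pool per-token MLM logits into one weight per vocabulary term.

    logits: list of rows (one per input token), each a list of vocabulary logits.
    Uses the SPLADE-style pooling w_j = max_i log(1 + ReLU(logit_ij)).
    """
    vocab_size = len(logits[0])
    return [
        max(math.log1p(max(row[j], 0.0)) for row in logits)
        for j in range(vocab_size)
    ]


def prune_top_k(weights, k):
    """Hypothetical test-time pruning: keep the k largest nonzero weights.

    Returns a sparse {term_id: weight} mapping. Fewer kept terms means the
    document appears in fewer postings lists.
    """
    nonzero = [(j, w) for j, w in enumerate(weights) if w > 0.0]
    nonzero.sort(key=lambda pair: pair[1], reverse=True)
    return dict(nonzero[:k])


def sparse_dot(query, document):
    """Relevance score: dot product over the terms both sparse vectors share."""
    return sum(w * document[t] for t, w in query.items() if t in document)


# Toy example: 2 tokens, 4-term vocabulary (values are made up).
toy_logits = [
    [1.0, -2.0, 0.5, 3.0],
    [0.0, 4.0, -1.0, 2.0],
]
doc_vector = prune_top_k(splade_weights(toy_logits), k=2)
```

Lowering k shortens the sparse vector, which is the effectiveness versus retrieval-cost trade-off that the summary's findings on pruning refer to.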
