The Pre-Training Study of Expanded-SPLADE Models on Web Document Titles

Hiun Kim; Tae Kwan Lee; Taeryun Won

arXiv:2605.01407·cs.IR·May 5, 2026

TL;DR

This study investigates how different pre-training datasets and options affect the performance of Expanded-SPLADE models in web document retrieval, highlighting the impact of dataset choice and pruning on effectiveness and efficiency.

Contribution

It provides empirical insights into the effects of pre-training data and settings on SPLADE-style models for neural information retrieval.

Findings

1. Models pre-trained on general corpora with higher learning rates perform better after fine-tuning.

2. Strict pruning increases retrieval cost and variance in postings list length.

3. Repeating the pre-training dataset has minimal impact on retrieval effectiveness.

Abstract

Masked Language Modeling (MLM) pre-training is one of the primary ways to initialize Neural Information Retrieval (IR) models prior to retrieval fine-tuning. However, studies show that MLM pre-trained models have limited readiness for fine-tuning into Neural Bi-Encoder models and suffer from transfer learning issues. This paper studies how different pre-training datasets and pre-training options affect MLM pre-trained models used for retrieval fine-tuning. The study focuses on SPLADE-style models, which also use the MLM layer at fine-tuning time. More specifically, we experiment with Expanded-SPLADE (ESPLADE) models, a specific instance of SPLADE models, using in-house web document titles as datasets. We conduct pre-training, fine-tuning, and evaluation with optional test-time pruning of sparse vectors. Our observations are three-fold: First, fine-tuned models of higher…
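
To make the two mechanisms named in the abstract concrete, here is a minimal sketch of how a SPLADE-style model can turn MLM-layer logits into a sparse term vector and then prune it at test time. It uses the commonly published SPLADE pooling, a max over token positions of log(1 + ReLU(logit)). The exact ESPLADE formulation and the paper's pruning rule are not given in this summary, so the top-k rule, the function names `splade_pool` and `prune_top_k`, and the toy logits are all assumptions made for illustration.

```python
import math


def splade_pool(logits):
    """Max-pool log(1 + ReLU(logit)) over token positions.

    logits: list of rows, one row per token position, each row holding
    one MLM logit per vocabulary term.
    Returns one non-negative weight per vocabulary term.
    """
    vocab_size = len(logits[0])
    return [
        max(math.log1p(max(row[j], 0.0)) for row in logits)
        for j in range(vocab_size)
    ]


def prune_top_k(weights, k):
    """Keep the k largest non-zero weights as a {term_id: weight} dict.

    Top-k is an assumed pruning rule, not necessarily the paper's.
    """
    nonzero = [(j, w) for j, w in enumerate(weights) if w > 0.0]
    nonzero.sort(key=lambda item: item[1], reverse=True)
    return dict(nonzero[:k])


# Toy input (made-up values): 2 token positions, vocabulary of 4 terms.
toy_logits = [
    [2.0, -1.0, 0.5, 0.0],
    [0.0, 3.0, -0.2, 0.0],
]
weights = splade_pool(toy_logits)
sparse = prune_top_k(weights, k=2)
```

In this sketch, a smaller `k` leaves fewer terms per document and therefore fewer postings entries for that document. How pruning affects overall retrieval cost in the paper's setting is an empirical question that the sketch does not address.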
