Exploring the Representation Power of SPLADE Models
Joel Mackenzie, Shengyao Zhuang, Guido Zuccon

TL;DR
This paper investigates how well SPLADE encodes ranking signals in sparse document representations, and finds that it remains effective even when its vocabulary is restricted to stopwords or random words.
Contribution
It provides empirical evidence that SPLADE can encode useful ranking signals beyond traditional lexical features, expanding understanding of its representation power.
Findings
SPLADE encodes signals even with stopwords or random words.
Constrained vocabularies do not significantly reduce SPLADE's effectiveness.
SPLADE's representations capture more than just lexical matching.
Abstract
The SPLADE (SParse Lexical AnD Expansion) model is a highly effective approach to learned sparse retrieval, where documents are represented by term impact scores derived from large language models. During training, SPLADE applies regularization to ensure postings lists are kept sparse -- with the aim of mimicking the properties of natural term distributions -- allowing efficient and effective lexical matching and ranking. However, we hypothesize that SPLADE may encode additional signals into common postings lists to further improve effectiveness. To explore this idea, we perform a number of empirical analyses where we re-train SPLADE with different, controlled vocabularies and measure how effective it is at ranking passages. Our findings suggest that SPLADE can effectively encode useful ranking signals in documents even when the vocabulary is constrained to terms that are not…
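As a rough illustration of the retrieval setting the abstract describes, the sketch below stores per-term impact scores in postings lists and ranks documents by summing the impacts of the terms they share with the query. It is a minimal toy, not the authors' implementation: the function names `build_index` and `score`, the document ids, and all weights are hypothetical, and a real SPLADE model would produce the weights from a trained encoder.

```python
from collections import defaultdict


def build_index(docs):
    """Build an inverted index mapping term -> list of (doc_id, impact) postings."""
    index = defaultdict(list)
    for doc_id, term_weights in docs.items():
        for term, impact in term_weights.items():
            # Terms with zero impact get no posting, which keeps lists sparse.
            if impact > 0:
                index[term].append((doc_id, impact))
    return index


def score(index, query):
    """Rank documents by the sum of query_weight * impact over shared terms."""
    scores = defaultdict(float)
    for term, q_weight in query.items():
        for doc_id, impact in index.get(term, []):
            scores[doc_id] += q_weight * impact
    # Highest score first; ties broken by doc_id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy, made-up impact scores (assumptions for illustration, not from the paper).
docs = {
    "d1": {"sparse": 2.0, "retrieval": 1.5, "the": 0.5},
    "d2": {"dense": 1.0, "retrieval": 1.0},
    "d3": {"cooking": 2.0, "the": 0.0},
}
query = {"sparse": 1.0, "retrieval": 1.0}

index = build_index(docs)
ranking = score(index, query)
print(ranking)
```

In this toy index the stopword "the" carries a nonzero impact for one document. If a query also weighted "the", that posting would contribute to the score like any other term, which is the kind of signal in common postings lists that the paper hypothesizes SPLADE may exploit.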
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
