Investigating the Scalability of Approximate Sparse Retrieval Algorithms to Massive Datasets
Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini, and Leonardo Venuta

TL;DR
This paper evaluates how approximate sparse retrieval algorithms, specifically Seismic and graph-based solutions, scale to a massive collection of 138 million passages, examining their efficiency and effectiveness at that scale.
Contribution
It provides the first large-scale analysis of sparse retrieval algorithms on a dataset of more than one hundred million documents, comparing different approaches and assessing their practical performance.
Findings
Seismic and graph-based algorithms scale effectively to 138M passages.
Indexing time and retrieval efficiency are analyzed at large scale.
Sparse retrieval methods remain effective on massive datasets.
Abstract
Learned sparse text embeddings have gained popularity due to their effectiveness in top-k retrieval and inherent interpretability. Their distributional idiosyncrasies, however, have long hindered their use in real-world retrieval systems. That changed with the recent development of approximate algorithms that leverage the distributional properties of sparse embeddings to speed up retrieval. Nonetheless, in much of the existing literature, evaluation has been limited to datasets with only a few million documents such as MSMARCO. It remains unclear how these systems behave on much larger datasets and what challenges lurk in larger scales. To bridge that gap, we investigate the behavior of state-of-the-art retrieval algorithms on massive datasets. We compare and contrast the recently-proposed Seismic and graph-based solutions adapted from dense retrieval. We extensively evaluate Splade…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Machine Learning and ELM
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
