Investigating the Scalability of Approximate Sparse Retrieval Algorithms to Massive Datasets

Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini, and Leonardo Venuta

arXiv:2501.11628·cs.IR·January 22, 2025

TL;DR

This paper evaluates how approximate sparse retrieval algorithms, specifically Seismic and graph-based solutions, scale to massive datasets such as a collection of 138 million passages, examining their efficiency and effectiveness at that scale.

Contribution

It provides the first large-scale analysis of sparse retrieval algorithms on a collection exceeding one hundred million documents, comparing different approaches and assessing their practical performance.

Findings

01 Seismic and graph-based algorithms scale effectively to 138M passages.

02 Indexing time and retrieval efficiency are analyzed at large scale.

03 Sparse retrieval methods remain effective on massive datasets.

Abstract

Learned sparse text embeddings have gained popularity due to their effectiveness in top-k retrieval and inherent interpretability. Their distributional idiosyncrasies, however, have long hindered their use in real-world retrieval systems. That changed with the recent development of approximate algorithms that leverage the distributional properties of sparse embeddings to speed up retrieval. Nonetheless, in much of the existing literature, evaluation has been limited to datasets with only a few million documents such as MSMARCO. It remains unclear how these systems behave on much larger datasets and what challenges lurk in larger scales. To bridge that gap, we investigate the behavior of state-of-the-art retrieval algorithms on massive datasets. We compare and contrast the recently-proposed Seismic and graph-based solutions adapted from dense retrieval. We extensively evaluate Splade…
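To make the top-k retrieval setting in the abstract concrete, the following is a minimal sketch of exact inner-product search over sparse vectors using an inverted index. It is the brute-force baseline that approximate methods aim to speed up, not Seismic or any graph-based algorithm from the paper. The function names, the dictionary-based vector representation, and the toy data are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term id to a list of (doc_id, weight) postings.

    `docs` is a list of sparse vectors, each a dict {term_id: weight}.
    This representation is hypothetical, chosen for readability.
    """
    index = defaultdict(list)
    for doc_id, vec in enumerate(docs):
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index


def exact_top_k(index, query, k):
    """Score every document sharing a term with the query by inner product
    and return the k highest-scoring (doc_id, score) pairs."""
    scores = defaultdict(float)
    for term, q_weight in query.items():
        # Only the posting lists of the query's nonzero terms are traversed.
        for doc_id, d_weight in index.get(term, ()):
            scores[doc_id] += q_weight * d_weight
    # Highest score first; ties are broken by the smaller doc id.
    return heapq.nsmallest(k, scores.items(), key=lambda item: (-item[1], item[0]))


# Toy collection of three sparse documents (illustrative values only).
docs = [
    {0: 1.0, 3: 0.5},
    {1: 2.0, 3: 1.5},
    {0: 0.5, 2: 1.5},
]
index = build_inverted_index(docs)
query = {0: 1.0, 3: 2.0}
top = exact_top_k(index, query, k=2)
```

The cost of this exhaustive traversal grows with the length of the posting lists touched by the query, which is one reason exact search becomes expensive on collections with over a hundred million passages and approximate algorithms become attractive.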

Code & Models

Repositories

tuskanny/seismic (Official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Machine Learning and ELM

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings