Large-Scale Data Parallelization of Product Quantization and Inverted Indexing Using Dask

Ashley N. Abraham, Andrew Strelzoff, Haley R. Dozier, Althea C. Henslee, and Mark A. Chappell

arXiv:2604.21645 · cs.LG · April 24, 2026

TL;DR

This paper presents a scalable Python-based approach for large-scale approximate nearest neighbor search using product quantization, inverted indexing, and Dask to reduce computational costs without sacrificing accuracy.

Contribution

It introduces a novel method combining PQ, inverted indexing, and Dask for efficient large-scale data clustering and search in Python.

Findings

01. Reduces computational expense for large-scale, high-dimensional data

02. Maintains accuracy while scaling to large datasets

03. Demonstrates effective parallelization with Dask

Abstract

Large-scale Nearest Neighbor (NN) search, though widely used in similarity search, remains limited by the computational cost of processing large-scale data. To reduce that cost, Approximate Nearest Neighbor (ANN) search is often used in applications that can rely on an approximation rather than an exact similarity search. Product Quantization (PQ) is a memory-efficient ANN method that is effective for clustering datasets of all sizes. Clustering large-scale, high-dimensional data is computationally expensive in both memory and execution time. This work focuses on a unique way to divide and conquer large-scale data in Python using PQ, inverted indexing, and Dask, combining the results without compromising accuracy and reducing computational requirements to the level required when using…
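
The divide-and-conquer idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (kmeans, train_pq, encode_chunk, build_index, search), the parameter values, and the choice to encode raw vectors rather than residuals are all assumptions made for illustration. The per-chunk loop runs serially here to keep the example dependency-light; each iteration is independent, which is the kind of task a scheduler such as Dask could run in parallel.

```python
import numpy as np


def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means (illustrative helper); returns (k, d) centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = nearest(x, centroids)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[j] = members.mean(axis=0)
    return centroids


def nearest(x, centroids):
    """Index of the closest centroid (squared Euclidean) for each row of x."""
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def train_pq(x, m, k):
    """Train one k-word codebook per subspace; returns shape (m, k, d // m)."""
    subs = np.split(x, m, axis=1)  # requires d to be divisible by m
    return np.stack([kmeans(s, k, seed=i) for i, s in enumerate(subs)])


def encode_chunk(chunk, codebooks):
    """Encode one chunk of vectors into (n, m) uint8 PQ codes."""
    subs = np.split(chunk, codebooks.shape[0], axis=1)
    cols = [nearest(s, codebooks[i]) for i, s in enumerate(subs)]
    return np.stack(cols, axis=1).astype(np.uint8)


def build_index(data, n_chunks=4, m=4, k=16, n_cells=8):
    """Build a coarse inverted index plus PQ codes, one chunk at a time."""
    coarse = kmeans(data, n_cells)
    codebooks = train_pq(data, m, k)
    inverted = {c: [] for c in range(n_cells)}
    codes, offset = [], 0
    # Each chunk is processed independently; results are merged afterwards.
    for chunk in np.array_split(data, n_chunks):
        cells = nearest(chunk, coarse)
        codes.append(encode_chunk(chunk, codebooks))
        for local_id, cell in enumerate(cells):
            inverted[int(cell)].append(offset + local_id)
        offset += len(chunk)
    return coarse, codebooks, inverted, np.concatenate(codes)


def search(query, coarse, codebooks, inverted, codes, n_probe=2, top=5):
    """Approximate search: probe the nearest cells, rank candidates by PQ distance."""
    m = codebooks.shape[0]
    q_subs = np.split(query, m)
    # Lookup table of squared distances: query sub-vector vs. each codeword, (m, k).
    table = np.stack([((codebooks[i] - q_subs[i]) ** 2).sum(axis=1) for i in range(m)])
    probe = np.argsort(((coarse - query) ** 2).sum(axis=1))[:n_probe]
    cand = np.array([i for c in probe for i in inverted[int(c)]], dtype=np.intp)
    approx = table[np.arange(m), codes[cand].astype(np.intp)].sum(axis=1)
    order = np.argsort(approx)[:top]
    return cand[order], approx[order]


rng = np.random.default_rng(0)
data = rng.normal(size=(400, 16)).astype(np.float32)
coarse, codebooks, inverted, codes = build_index(data)
ids, dists = search(data[0], coarse, codebooks, inverted, codes)
print(ids, dists)
```

Because the coarse centroids and codebooks are fixed before the chunk loop, encoding chunk by chunk yields the same codes as encoding the whole array at once, so splitting the data changes how the work is scheduled and not the result of the encoding step.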
