Lossless Compression of Vector IDs for Approximate Nearest Neighbor Search
Daniel Severo, Giuseppe Ottaviano, Matthew Muckley, Karen Ullrich, Matthijs Douze

TL;DR
This paper presents lossless compression techniques for vector IDs in approximate nearest neighbor search indexes, significantly reducing storage size without affecting search accuracy or speed, especially on large-scale datasets.
Contribution
It introduces novel lossless compression schemes for vector IDs in index structures, achieving up to 7x compression and reducing index size by 30% on billion-scale datasets.
Findings
Achieved up to 7x compression of vector IDs.
Reduced index size by 30% on billion-scale datasets.
Lossless compression of quantized vector codes in some cases.
Abstract
Approximate nearest neighbor search for vectors relies on indexes that are most often accessed from RAM. Therefore, storage is the factor limiting the size of the database that can be served from a machine. Lossy vector compression, i.e., embedding quantization, has been applied extensively to reduce the size of indexes. However, for inverted file and graph-based indices, auxiliary data such as vector ids and links (edges) can represent most of the storage cost. We introduce and evaluate lossless compression schemes for these cases. These approaches are based on asymmetric numeral systems or wavelet trees that exploit the fact that the ordering of ids is irrelevant within the data structures. In some settings, we are able to compress the vector ids by a factor 7, with no impact on accuracy or search runtime. On billion-scale datasets, this results in a reduction of 30% of the index…
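The abstract's key observation is that the order of ids inside an inverted list or a graph node's neighbor list carries no information. The Python sketch below illustrates that idea under stated assumptions. It is not the paper's asymmetric numeral systems or wavelet tree scheme; it swaps in a much simpler sorted-gap varint codec, and all function names are hypothetical. It compares the cost of storing ids as an ordered sequence with the information-theoretic cost of an unordered set, then round-trips a small id set. It assumes ids are distinct, non-negative integers.

```python
import math


def ordered_bits(n_ids: int, universe: int) -> int:
    """Bits to store n_ids ids as a sequence of fixed-width integers."""
    return n_ids * math.ceil(math.log2(universe))


def set_bits(n_ids: int, universe: int) -> float:
    """Lower bound in bits for an unordered set: log2(C(universe, n_ids))."""
    return math.log2(math.comb(universe, n_ids))


def encode_id_set(ids) -> bytes:
    """Sort the ids, then store the gaps between neighbors as varints.

    Sorting discards the original order, which is acceptable only because
    the order is irrelevant to the data structure.
    """
    out = bytearray()
    prev = 0
    for x in sorted(ids):
        gap = x - prev
        prev = x
        # Little-endian base-128 varint: the high bit marks "more bytes follow".
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_id_set(buf: bytes) -> list:
    """Inverse of encode_id_set; returns the ids in ascending order."""
    ids = []
    prev = 0
    cur = 0
    shift = 0
    for b in buf:
        cur |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            prev += cur
            ids.append(prev)
            cur = 0
            shift = 0
    return ids


ids = [900, 3, 517, 42]
blob = encode_id_set(ids)
restored = decode_id_set(blob)
```

The round trip returns the ids in sorted order rather than the input order, which is exactly the freedom the abstract describes. An entropy coder such as the ones named in the abstract would aim to get closer to the `set_bits` bound than this byte-aligned sketch does.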
Taxonomy
Topics: Data Management and Algorithms · Face and Expression Recognition · Advanced Image and Video Retrieval Techniques
