Similarity preserving compressions of high dimensional sparse data
Raghav Kulkarni, Rameshwar Pratap

TL;DR
This paper introduces an efficient compression scheme for high-dimensional sparse data that preserves similarity measures such as Hamming distance and Inner Product. The compressed length depends only on the sparsity of the data, and the scheme is suited to streaming and real-valued data.
Contribution
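It presents a compression scheme that maps binary vectors to binary vectors while preserving Hamming distance and Inner Product simultaneously, with a compressed length that is independent of the data dimension, and extends the scheme to real-valued data for multiple similarity measures.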
Findings
Compression length depends only on sparsity, not dimension.
Scheme works in the streaming setting for one-shot similarity computation.
Generalizes to real-valued data for Euclidean distance and Inner Product.
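
The findings above can be illustrated with a minimal sketch. This summary does not spell out the construction, so the code below assumes one simple binary-to-binary mapping: each coordinate is assigned to a random bucket, and each compressed bit is the parity of the input bits in its bucket. The names `make_bucket_map`, `compress`, `hamming`, and `inner_product`, and the choice of `sparsity ** 2` buckets, are illustrative assumptions rather than details taken from the paper.

```python
import random

def make_bucket_map(dim, n_buckets, seed=0):
    """Assign each of the `dim` coordinates to a random bucket (hypothetical helper)."""
    rng = random.Random(seed)
    return [rng.randrange(n_buckets) for _ in range(dim)]

def compress(support, bucket_map, n_buckets):
    """Compress a sparse binary vector given as the indices of its 1-bits.

    Each output bit is the parity (XOR) of the input bits in its bucket.
    """
    sketch = [0] * n_buckets
    for i in support:
        sketch[bucket_map[i]] ^= 1  # one bit flip per nonzero coordinate
    return sketch

def hamming(a, b):
    """Hamming distance between two equal-length binary lists."""
    return sum(x != y for x, y in zip(a, b))

def inner_product(a, b):
    """Inner product of two equal-length binary lists."""
    return sum(x & y for x, y in zip(a, b))

# Usage: the sketch length is chosen from the sparsity alone, not from dim.
dim, sparsity = 100_000, 20
n_buckets = sparsity ** 2  # assumed sketch length for this illustration
bucket_map = make_bucket_map(dim, n_buckets, seed=1)

rng = random.Random(2)
u = set(rng.sample(range(dim), sparsity))
v = set(rng.sample(range(dim), sparsity))
cu = compress(u, bucket_map, n_buckets)
cv = compress(v, bucket_map, n_buckets)

print("Hamming: original", len(u ^ v), "compressed", hamming(cu, cv))
print("Inner product: original", len(u & v), "compressed", inner_product(cu, cv))
```

Under this assumed mapping, the compressed Hamming distance can never exceed the original one, and the two agree whenever no two differing coordinates share a bucket. Because `compress` touches one sketch bit per nonzero coordinate, it can also be applied to coordinates that arrive one at a time.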
Abstract
The rise of the internet has resulted in an explosion of data consisting of millions of articles, images, songs, and videos. Most of this data is high dimensional and sparse. The need to perform an efficient search for similar objects in such high dimensional big datasets is becoming increasingly common. Even with the rapid growth in computing power, the brute-force search for such a task is impractical and at times impossible. Therefore, it is quite natural to investigate techniques that compress the dimension of the dataset while preserving the similarity between data objects. In this work, we propose an efficient compression scheme mapping binary vectors into binary vectors and simultaneously preserving Hamming distance and Inner Product. The length of our compression depends only on the sparsity and is independent of the dimension of the data. Moreover, our schemes provide one-shot…
Taxonomy
Topics: Data Management and Algorithms · Advanced Image and Video Retrieval Techniques · Algorithms and Data Compression
