Efficient Compression Technique for Sparse Sets
Rameshwar Pratap, Ishan Sohony, Raghav Kulkarni

TL;DR
This paper proposes an efficient compression method for sparse sets that preserves Jaccard similarity, requires fewer computational resources than existing techniques, and is validated through theoretical analysis and experiments.
Contribution
The paper shows that a known compression technique preserves Jaccard similarity while running faster and requiring less randomness than state-of-the-art methods.
Findings
Achieves accuracy similar to min-wise permutation
Reduces compression time significantly
Uses less randomness in the compression process
Abstract
Recent technological advancements have led to the generation of huge amounts of data over the web, such as text, image, audio and video. Most of this data is high dimensional and sparse, for example, the bag-of-words representation used for representing text. An efficient search for similar data points is often needed in applications such as clustering, nearest neighbour search, ranking and indexing. Even though there have been significant increases in computational power, a simple brute-force similarity search on such datasets is inefficient and at times impossible. Thus, it is desirable to get a compressed representation which preserves the similarity between data points. In this work, we consider the data points as sets and use Jaccard similarity as the similarity measure. Compression techniques are generally evaluated on the following parameters: 1) Randomness required…
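For readers unfamiliar with the similarity measure, the sketch below computes the exact Jaccard similarity of two sets and then estimates it from short signatures built with min-wise permutations, the baseline named in the Findings. It is an illustration only and is not the compression scheme proposed in the paper. The function names, the universe size, the number of permutations `k`, and the seed are all assumptions chosen for the example.

```python
import random


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def make_permutations(universe_size, k, seed=0):
    """Draw k random permutations of {0, ..., universe_size - 1}."""
    rng = random.Random(seed)
    perms = []
    for _ in range(k):
        p = list(range(universe_size))
        rng.shuffle(p)
        perms.append(p)
    return perms


def minhash_signature(s, perms):
    """For each permutation, keep the smallest permuted value over the set.

    Assumes s is non-empty and every element lies in the universe.
    """
    return [min(perm[x] for x in s) for perm in perms]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Two sparse sets drawn from a universe of 1000 elements (illustrative values).
A = set(range(0, 30))
B = set(range(10, 40))
perms = make_permutations(universe_size=1000, k=500)
exact = jaccard(A, B)
approx = estimate_jaccard(minhash_signature(A, perms), minhash_signature(B, perms))
print(exact, approx)
```

Each signature position agrees with probability equal to the Jaccard similarity, so the estimate tightens as `k` grows. Storing `k` full permutations of the universe is what makes this baseline expensive in randomness, which is one of the evaluation parameters the abstract lists.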
