Efficient Compression Technique for Sparse Sets

Rameshwar Pratap; Ishan Sohony; Raghav Kulkarni

arXiv:1708.04799·cs.IT·August 17, 2017

TL;DR

This paper proposes an efficient set compression method that preserves Jaccard similarity while using fewer computational resources than existing techniques, and validates its effectiveness through theoretical analysis and experiments.

Contribution

It demonstrates that a known compression technique effectively preserves Jaccard similarity, while running faster and requiring less randomness than state-of-the-art methods.

Findings

1. Achieves accuracy similar to min-wise permutation
2. Reduces compression time significantly
3. Uses less randomness in the compression process
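
For context on the baseline named in the findings, the following is a minimal sketch of min-wise permutation. It is not the compression method studied in the paper. The function names, the `universe_size` and `num_perm` parameters, and the use of explicit shuffled permutations are assumptions made for illustration; practical implementations usually replace the permutations with hash functions.

```python
import random

def minwise_signature(s, universe_size, num_perm, seed=0):
    """Compress a non-empty set of ints in [0, universe_size) to num_perm values.

    Each entry is the smallest rank any element of s receives under one
    random permutation of the universe. Sets that are to be compared must
    be compressed with the same seed so they share the same permutations.
    """
    rng = random.Random(seed)
    signature = []
    for _ in range(num_perm):
        perm = list(range(universe_size))
        rng.shuffle(perm)  # one fresh random permutation of the universe
        signature.append(min(perm[x] for x in s))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

a = {1, 5, 9, 20}
b = {1, 5, 9, 33}
sig_a = minwise_signature(a, universe_size=50, num_perm=200)
sig_b = minwise_signature(b, universe_size=50, num_perm=200)
print(estimate_jaccard(sig_a, sig_b))  # an estimate of the true similarity 3/5
```

Each permutation costs time and random bits proportional to the universe size in this sketch, which illustrates why compression time and randomness are the resources the findings compare.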

Abstract

Recent technological advancements have led to the generation of huge amounts of data over the web, such as text, image, audio and video. Most of this data is high dimensional and sparse, for example, the bag-of-words representation used for representing text. Often, an efficient search for similar data points needs to be performed in many applications like clustering, nearest neighbour search, ranking and indexing. Even though there have been significant increases in computational power, a simple brute-force similarity search on such datasets is inefficient and at times impossible. Thus, it is desirable to get a compressed representation which preserves the similarity between data points. In this work, we consider the data points as sets and use Jaccard similarity as the similarity measure. Compression techniques are generally evaluated on the following parameters: 1) Randomness required…
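
To make the similarity measure concrete, here is a small sketch of Jaccard similarity on two texts treated as sets of words, in the spirit of the bag-of-words example above. The function name, the sample sentences, and the convention of returning 1.0 for two empty sets are assumptions for illustration and do not come from the paper.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(union)

# Each text becomes the set of distinct words it contains.
doc1 = "the quick brown fox".split()
doc2 = "the lazy brown dog".split()

# The texts share 2 words ("the", "brown") out of 6 distinct words overall.
print(jaccard_similarity(doc1, doc2))
```

Computing this exactly for every pair in a large, high-dimensional collection is the brute-force search the abstract describes as inefficient, which motivates comparing compressed representations instead.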
