Efficient Sketching Algorithm for Sparse Binary Data
Rameshwar Pratap, Debajyoti Bera, Karthik Revanuru

TL;DR
This paper introduces BinSketch, a fast and simple sketching algorithm for high-dimensional sparse binary data that preserves multiple similarity measures on a single sketch, enabling efficient data analysis tasks with accuracy comparable to existing methods.
Contribution
The paper proposes BinSketch, a novel sketching algorithm that efficiently reduces the dimensionality of sparse binary datasets while preserving key similarity measures, supported by theoretical analysis and practical validation.
Findings
BinSketch achieves similar accuracy to state-of-the-art methods.
It offers significant speedup in dimensionality reduction.
The algorithm is simple and easy to implement.
Abstract
Recent advances in the WWW, IoT, social networks, e-commerce, etc. have generated a large volume of data. These datasets are mostly high-dimensional and sparse. Many fundamental subroutines of common data analytic tasks such as clustering, classification, ranking, nearest neighbour search, etc. scale poorly with the dimension of the dataset. In this work, we address this problem and propose a sketching (alternatively, dimensionality reduction) algorithm, BinSketch (Binary Data Sketch), for sparse binary datasets. BinSketch preserves the binary version of the dataset after sketching and maintains estimates for multiple similarity measures such as Jaccard, Cosine, Inner-Product similarities, and Hamming distance, on the same sketch. We present a theoretical analysis of our algorithm and complement it with extensive experimentation on several real-world…
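The four measures named in the abstract have simple closed forms on binary data. The following is a minimal sketch, assuming each sparse binary vector is stored as a Python set of its nonzero indices. The names `make_mapping` and `or_bucket_sketch` are hypothetical, and the bucket-OR compression shown is only a generic illustration of a dimensionality reduction whose output stays binary; it should not be read as the paper's exact construction, and no similarity estimator for the compressed vectors is given here.

```python
import math
import random

# Assumption: a sparse binary vector is a set of the indices that hold a 1.

def inner_product(a, b):
    # Number of positions where both vectors are 1.
    return len(a & b)

def hamming(a, b):
    # Number of positions where the vectors differ.
    return len(a ^ b)

def jaccard(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def cosine(a, b):
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def make_mapping(d, n, seed=0):
    # Hypothetical helper: assign each of d dimensions to one of n buckets.
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(d)]

def or_bucket_sketch(a, mapping):
    # Illustrative only: a bucket is 1 if any dimension mapped to it is 1,
    # so the compressed vector is again binary.
    return {mapping[i] for i in a}

a = {1, 4, 7, 9}
b = {4, 7, 8}
print(inner_product(a, b))  # 2
print(hamming(a, b))        # 3
print(jaccard(a, b))        # 0.4
print(cosine(a, b))

mapping = make_mapping(d=1000, n=16, seed=1)
print(sorted(or_bucket_sketch(a, mapping)))
```

The same `mapping` has to be shared by every vector in the dataset so that the compressed vectors remain comparable with one another.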
