Efficient Sketching Algorithm for Sparse Binary Data

Rameshwar Pratap; Debajyoti Bera; Karthik Revanuru

arXiv:1910.04658·cs.IR·October 11, 2019

TL;DR

This paper introduces BinSketch, a fast and simple sketching algorithm for high-dimensional sparse binary data. A single sketch preserves several similarity measures, enabling efficient data analysis with accuracy comparable to existing methods.

Contribution

The paper proposes BinSketch, a sketching algorithm that reduces the dimensionality of sparse binary datasets while preserving key similarity measures (Jaccard, cosine, inner product, and Hamming distance), supported by theoretical analysis and experiments on real-world datasets.

Findings

01

BinSketch achieves similar accuracy to state-of-the-art methods.

02

It offers significant speedup in dimensionality reduction.

03

The algorithm is simple and easy to implement.

Abstract

Recent advances in the WWW, IoT, social networks, e-commerce, etc. have generated a large volume of data. These datasets are mostly high-dimensional and sparse. Many fundamental subroutines of common data analytic tasks such as clustering, classification, ranking, nearest neighbour search, etc. scale poorly with the dimension of the dataset. In this work, we address this problem and propose a sketching (alternatively, dimensionality reduction) algorithm, BinSketch (Binary Data Sketch), for sparse binary datasets. BinSketch preserves the binary version of the dataset after sketching and maintains estimates for multiple similarity measures such as Jaccard, Cosine, Inner-Product similarities, and Hamming distance, on the same sketch. We present a theoretical analysis of our algorithm and complement it with extensive experimentation on several real-world…
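
To make the four measures named in the abstract concrete, the following Python sketch computes them for sparse binary vectors stored as sets of nonzero indices. It also shows one simple way a binary-to-binary sketch could be built: each original coordinate is assigned to a random bucket, and a sketch bit is the OR of the bits in its bucket. That construction is an assumption for illustration only. This page does not spell out the paper's construction or its similarity estimators, and the names `make_bucket_map` and `or_sketch` are hypothetical.

```python
import math
import random


def inner_product(a, b):
    """Number of coordinates where both binary vectors are 1."""
    return len(a & b)


def hamming(a, b):
    """Number of coordinates where the two binary vectors differ."""
    return len(a ^ b)


def jaccard(a, b):
    """|a AND b| / |a OR b|; defined as 0.0 when both vectors are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def cosine(a, b):
    """|a AND b| / sqrt(|a| * |b|); defined as 0.0 if either vector is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def make_bucket_map(d, n, seed=0):
    """Assumed construction: send each of d coordinates to one of n buckets."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(d)]


def or_sketch(a, bucket_map):
    """Assumed construction: a sketch bit is 1 if any coordinate in its bucket is 1.

    The result is again a sparse binary vector (a set of bucket indices).
    """
    return {bucket_map[i] for i in a}


if __name__ == "__main__":
    d, n = 10_000, 256          # original and sketch dimensions (illustrative)
    bucket_map = make_bucket_map(d, n, seed=42)

    rng = random.Random(7)
    u = set(rng.sample(range(d), 40))                        # sparse binary vector
    v = set(list(u)[:25]) | set(rng.sample(range(d), 15))    # overlaps with u

    su, sv = or_sketch(u, bucket_map), or_sketch(v, bucket_map)
    print("original:", inner_product(u, v), hamming(u, v), jaccard(u, v), cosine(u, v))
    print("sketch:  ", inner_product(su, sv), hamming(su, sv), jaccard(su, sv), cosine(su, sv))
```

The values computed directly on the sketches are only rough proxies, since bucket collisions distort them. Correcting for those collisions is the job of estimators such as the ones the paper analyses, which are not reproduced here.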
