Efficient Binary Embedding of Categorical Data using BinSketch
Bhisham Dev Verma, Rameshwar Pratap, Debajyoti Bera

TL;DR
This paper introduces Cabin, a fast binary sketching method for high-dimensional categorical data, and Cham, a companion estimator that approximates Hamming distances between the original vectors from their sketches alone. The approach targets sparse datasets and comes with theoretical guarantees and empirical validation.
Contribution
The paper proposes a novel binary sketching algorithm Cabin and a distance estimation method Cham, with theoretical analysis and practical validation for high-dimensional sparse categorical data.
Findings
Significantly faster than existing methods.
Maintains high accuracy in distance estimation.
Validated on several high-dimensional real-world datasets, including one with over a million dimensions.
Abstract
In this work, we present a dimensionality reduction algorithm, a.k.a. sketching, for categorical datasets. Our proposed sketching algorithm Cabin constructs low-dimensional binary sketches from high-dimensional categorical vectors, and our distance estimation algorithm Cham computes a close approximation of the Hamming distance between any two original vectors only from their sketches. The minimum dimension of the sketches required by Cham to ensure a good estimation theoretically depends only on the sparsity of the data points, making it useful for many real-life scenarios involving sparse datasets. We present a rigorous theoretical analysis of our approach and supplement it with extensive experiments on several high-dimensional real-world datasets, including one with over a million dimensions. We show that the Cabin and Cham duo is a significantly fast and accurate approach for tasks…
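The abstract does not spell out how the sketches are built or how distances are recovered from them. Purely as an illustration of the general idea, the Python sketch below assumes a two-step construction that is not confirmed by the text above: each category of each feature is first mapped to a random bit (with 0 reserved for "feature absent"), and the resulting binary vector is then compressed by hashing features to sketch positions and OR-ing collisions. The estimator inverts the expected bucket occupancy and doubles the result, since under this assumption each differing position survives the random binarization with probability 1/2. All function names and parameters (`make_maps`, `sketch`, `estimate_hamming`, `sketch_dim`) are hypothetical and should not be read as the authors' Cabin or Cham.

```python
import math
import random


def make_maps(dim, num_categories, sketch_dim, seed=0):
    """Draw the random maps shared by every data point (hypothetical helper)."""
    rng = random.Random(seed)
    # psi[i][c] is a random bit for category c of feature i.
    # Category 0 means "feature absent" and always maps to 0.
    psi = [[0] + [rng.randint(0, 1) for _ in range(num_categories)]
           for _ in range(dim)]
    # pi[i] is the sketch position that feature i is hashed to.
    pi = [rng.randrange(sketch_dim) for _ in range(dim)]
    return psi, pi


def sketch(vec, psi, pi, sketch_dim):
    """Binary sketch of a sparse categorical vector {feature: category >= 1}."""
    bits = [0] * sketch_dim
    for feature, category in vec.items():
        if psi[feature][category]:
            bits[pi[feature]] = 1  # OR together all features landing here
    return bits


def _estimate_count(ones, sketch_dim):
    """Invert E[ones] = N * (1 - (1 - 1/N)^n) to recover n from the popcount."""
    if ones >= sketch_dim:
        raise ValueError("sketch is saturated; use a larger sketch_dim")
    return math.log(1 - ones / sketch_dim) / math.log(1 - 1 / sketch_dim)


def estimate_hamming(sketch_a, sketch_b):
    """Estimate the categorical Hamming distance from two sketches only."""
    n = len(sketch_a)
    count_a = _estimate_count(sum(sketch_a), n)
    count_b = _estimate_count(sum(sketch_b), n)
    union = _estimate_count(sum(a | b for a, b in zip(sketch_a, sketch_b)), n)
    # Symmetric difference of the binarized vectors.
    binary_distance = 2 * union - count_a - count_b
    # Assumption: binarization keeps each difference with probability 1/2.
    return 2 * binary_distance


# Toy usage: two sparse vectors over 1000 features with 4 categories each.
psi, pi = make_maps(dim=1000, num_categories=4, sketch_dim=256, seed=1)
u = {3: 1, 40: 2, 512: 4}
v = {3: 1, 40: 3, 777: 2}
print(estimate_hamming(sketch(u, psi, pi, 256), sketch(v, psi, pi, 256)))
```

In this toy setup the sketch length is chosen relative to the number of non-missing features rather than the ambient dimension, which mirrors the abstract's point that the required sketch size depends on sparsity. With only a handful of non-missing features a single estimate is very noisy, so the printed value should not be expected to match the true distance.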
Taxonomy
Topics: Video Surveillance and Tracking Methods · Face and Expression Recognition · Visual Attention and Saliency Detection
