Dimensionality Reduction for Categorical Data
Debajyoti Bera, Rameshwar Pratap, Bhisham Dev Verma

TL;DR
This paper introduces FSketch, a novel method for compressing sparse categorical data into low-dimensional discrete sketches that preserve pairwise Hamming distances, enabling efficient data mining without significant loss of accuracy.
Contribution
FSketch provides a single-pass, efficient sketching algorithm for sparse categorical data that guarantees approximate Hamming distance preservation, improving speed and accuracy over existing methods.
Findings
FSketch is significantly faster than related algorithms.
It performs well on RMSE, clustering, and similarity search evaluations.
The method's effectiveness is validated on real-world datasets.
Abstract
Categorical attributes are those that can take a discrete set of values, e.g., colours. This work is about compressing vectors over categorical attributes to low-dimension discrete vectors. Current hash-based methods for this kind of compression do not provide any guarantee on the Hamming distances between the compressed representations. Here we present FSketch, which creates sketches for sparse categorical data, together with an estimator that recovers the pairwise Hamming distances among the uncompressed data from their sketches alone. We claim that these sketches can be used in the usual data mining tasks in place of the original data without compromising the quality of the task. For that, we ensure that the sketches are also categorical and sparse, and that the Hamming distance estimates are reasonably precise. Both the sketch construction and the…
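To make the setting concrete, here is a minimal Python sketch of compressing a sparse categorical vector into a shorter categorical vector and comparing Hamming distances before and after. It is a generic random-bucket construction written for illustration, not the FSketch algorithm or its estimator as specified in the paper. The names `hamming` and `make_sketcher`, the parameters `n`, `d`, `p`, and `seed`, and the encoding of a missing attribute as 0 are all assumptions.

```python
import random


def hamming(x, y):
    """Count the positions at which two equal-length vectors differ."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(x, y))


def make_sketcher(n, d, p, seed=0):
    """Return a function mapping length-n vectors to length-d vectors over {0, ..., p-1}.

    Illustrative assumption: every attribute is assigned one random bucket and
    one random nonzero weight, and each bucket stores a weighted sum modulo p.
    """
    rng = random.Random(seed)
    bucket = [rng.randrange(d) for _ in range(n)]
    weight = [rng.randrange(1, p) for _ in range(n)]

    def sketch(x):
        s = [0] * d
        for i, value in enumerate(x):
            if value:  # 0 encodes a missing attribute, so sparse inputs are cheap
                s[bucket[i]] = (s[bucket[i]] + weight[i] * value) % p
        return s

    return sketch


# Two sparse categorical vectors that differ in exactly two attributes (7 and 11).
x = [0] * 20
x[3], x[7], x[15] = 2, 5, 1
y = list(x)
y[7], y[11] = 4, 3

sketch = make_sketcher(n=20, d=8, p=7, seed=1)
sx, sy = sketch(x), sketch(y)
print("original distance:", hamming(x, y))
print("sketch distance:  ", hamming(sx, sy))
```

In this toy construction a bucket can differ only if at least one attribute mapped to it differs, so the distance between two sketches never exceeds the distance between the original vectors. Collisions can make it smaller, which is why a separate estimator is needed to recover the original distance from the sketches.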
