CountSketches, Feature Hashing and the Median of Three
Kasper Green Larsen, Rasmus Pagh, Jakub Tětek

TL;DR
This paper gives a new analysis of CountSketch showing that taking the median of several independent estimates (t > 1) improves the variance bound itself, not only the failure probability. The result has implications for feature hashing and for practical implementations that use a small, constant t.
Contribution
The paper introduces a novel variance analysis of CountSketch, revealing improved bounds and connecting median-based estimators to feature hashing reliability.
Findings
Variance improves to O(min{||v||_1^2/s^2, ||v||_2^2/s}) for t > 1
Median of estimates reduces failure probability exponentially in t
Experimental results support theoretical variance improvements
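To make the first finding concrete, the two terms inside the min can be compared directly. The following is a minimal sketch, not code from the paper: the function names are hypothetical, and the constant hidden by the O() is ignored. It only illustrates that the ||v||_1^2/s^2 term becomes the smaller one once s exceeds ||v||_1^2/||v||_2^2.

```python
def variance_bound_terms(v, s):
    """Return (||v||_1^2 / s^2, ||v||_2^2 / s) for a vector v and s buckets per row."""
    l1_sq = sum(abs(x) for x in v) ** 2
    l2_sq = sum(x * x for x in v)
    return l1_sq / s**2, l2_sq / s


def l1_term_wins(v, s):
    """True when the ||v||_1^2 / s^2 term is strictly the smaller of the two."""
    l1_term, l2_term = variance_bound_terms(v, s)
    return l1_term < l2_term


# Illustrative vector: ||v||_1^2 = 16 and ||v||_2^2 = 4, so the crossover is at s = 4.
v = [1.0, 1.0, 1.0, 1.0]
for s in (2, 4, 8):
    print(s, variance_bound_terms(v, s), l1_term_wins(v, s))
```

For this vector the two terms are equal at s = 4, and the s^-2 term is the smaller one for any larger s.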
Abstract
In this paper, we revisit the classic CountSketch method, which is a sparse, random projection that transforms a (high-dimensional) Euclidean vector v to a vector of dimension (2t-1)s, where t, s > 0 are integer parameters. It is known that even for t = 1, a CountSketch allows estimating coordinates of v with variance bounded by ||v||_2^2/s. For t > 1, the estimator takes the median of 2t-1 independent estimates, and the probability that the estimate is off by more than 2||v||_2/sqrt(s) is exponentially small in t. This suggests choosing t to be logarithmic in a desired inverse failure probability. However, implementations of CountSketch often use a small, constant t. Previous work only predicts a constant factor improvement in this setting. Our main contribution is a new analysis of Count-Sketch, showing an improvement in variance to…
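The structure described in the abstract can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the class name `CountSketch`, its method names, and the use of fully random lookup tables in place of hash functions are all assumptions made for the example. It keeps 2t-1 rows of s counters each and answers a coordinate query with the median of the per-row estimates.

```python
import random
from statistics import median


class CountSketch:
    """Toy CountSketch: 2t-1 rows of s counters, i.e. (2t-1)*s numbers in total."""

    def __init__(self, d, s, t, seed=0):
        rng = random.Random(seed)
        self.s = s
        self.rows = 2 * t - 1
        # Assumption: random lookup tables stand in for the bucket and sign hash functions.
        self.bucket = [[rng.randrange(s) for _ in range(d)] for _ in range(self.rows)]
        self.sign = [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(self.rows)]
        self.table = [[0.0] * s for _ in range(self.rows)]

    def update(self, i, delta):
        """Add delta to coordinate i of the sketched vector."""
        for r in range(self.rows):
            self.table[r][self.bucket[r][i]] += self.sign[r][i] * delta

    def estimate(self, i):
        """Median over rows of the signed counter that coordinate i hashes to."""
        return median(
            self.sign[r][i] * self.table[r][self.bucket[r][i]]
            for r in range(self.rows)
        )


# t = 2 gives 2t-1 = 3 rows, the "median of three" setting.
sketch = CountSketch(d=1000, s=64, t=2)
for i, value in [(3, 10.0), (17, -4.0), (256, 2.5)]:
    sketch.update(i, value)
print(sketch.estimate(3))
```

With t = 1 the same class reduces to a single row, so the estimate is one signed counter with no median step. Comparing the empirical spread of `estimate` for t = 1 and t = 2 over many seeds is one way to explore the variance behaviour the abstract describes.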
Taxonomy
Topics: Data Management and Algorithms · Algorithms and Data Compression · Machine Learning and Algorithms
