CountSketches, Feature Hashing and the Median of Three

Kasper Green Larsen; Rasmus Pagh; Jakub Tětek

arXiv:2102.02193·cs.DS·February 4, 2021

TL;DR

This paper gives a new analysis of CountSketch showing that, even for a small constant $t > 1$, taking the median of $2t - 1$ estimates improves the variance to $O(\min\{\|v\|_1^2/s^2, \|v\|_2^2/s\})$, with implications for feature hashing and practical implementations.

Contribution

The paper introduces a novel variance analysis of CountSketch, revealing improved bounds and connecting median-based estimators to feature hashing reliability.

Findings

01 The variance improves to $O(\min\{\|v\|_1^2/s^2, \|v\|_2^2/s\})$ for $t > 1$.

02 Taking the median of the estimates reduces the failure probability exponentially in $t$.

03 Experimental results support the theoretical variance improvements.
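As a small worked step, the two terms inside the minimum of the first finding can be compared directly. The symbol $n$ for the dimension of $v$ is introduced here for illustration and is not taken from the summary above; $v \neq 0$ is assumed.

```latex
\frac{\|v\|_1^2}{s^2} \le \frac{\|v\|_2^2}{s}
\iff
s \ge \frac{\|v\|_1^2}{\|v\|_2^2},
\qquad
1 \le \frac{\|v\|_1^2}{\|v\|_2^2} \le n .
```

The range on the right follows from $\|v\|_2 \le \|v\|_1 \le \sqrt{n}\,\|v\|_2$. For the all-ones vector the ratio equals $n$, so the $\|v\|_1^2/s^2$ term is the smaller one once $s \ge n$; for a vector with a single nonzero coordinate the ratio equals $1$, so that term is never the larger one for any $s \ge 1$.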

Abstract

In this paper, we revisit the classic CountSketch method, which is a sparse, random projection that transforms a (high-dimensional) Euclidean vector $v$ to a vector of dimension $(2t - 1)s$, where $t, s > 0$ are integer parameters. It is known that even for $t = 1$, a CountSketch allows estimating coordinates of $v$ with variance bounded by $\|v\|_2^2 / s$. For $t > 1$, the estimator takes the median of $2t - 1$ independent estimates, and the probability that the estimate is off by more than $2\|v\|_2 / \sqrt{s}$ is exponentially small in $t$. This suggests choosing $t$ to be logarithmic in a desired inverse failure probability. However, implementations of CountSketch often use a small, constant $t$. Previous work only predicts a constant factor improvement in this setting. Our main contribution is a new analysis of CountSketch, showing an improvement in variance to…
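The construction described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `CountSketch`, its method names, and the use of fully random lookup tables in place of real hash functions are all choices made here for clarity. It keeps $2t - 1$ rows of $s$ buckets each and estimates a coordinate as the median of the per-row estimates.

```python
import random
from statistics import median


class CountSketch:
    """Toy CountSketch with 2t - 1 rows of s buckets (illustrative only)."""

    def __init__(self, dim, s, t, seed=0):
        rng = random.Random(seed)
        self.rows = 2 * t - 1
        self.s = s
        # Assumption: fully random lookup tables stand in for the hash
        # functions; a real implementation would use compact hash families.
        self.bucket = [[rng.randrange(s) for _ in range(dim)]
                       for _ in range(self.rows)]
        self.sign = [[rng.choice((-1, 1)) for _ in range(dim)]
                     for _ in range(self.rows)]
        self.table = [[0.0] * s for _ in range(self.rows)]

    def update(self, j, delta):
        """Add delta to coordinate j of the sketched vector."""
        for i in range(self.rows):
            self.table[i][self.bucket[i][j]] += self.sign[i][j] * delta

    def sketch(self, v):
        """Add every nonzero coordinate of v to the sketch."""
        for j, x in enumerate(v):
            if x:
                self.update(j, x)
        return self

    def estimate(self, j):
        """Median of the 2t - 1 signed bucket values for coordinate j."""
        return median(
            self.sign[i][j] * self.table[i][self.bucket[i][j]]
            for i in range(self.rows)
        )


# Usage: t = 2 gives the "median of three" estimator from the title.
v = [1.0] * 100
v[7] = 50.0
cs = CountSketch(dim=len(v), s=32, t=2, seed=1).sketch(v)
print(cs.estimate(7))
```

With `t=1` the sketch has a single row and `estimate` returns that row's estimate unchanged, which is the setting Feature Hashing corresponds to; larger `t` trades more space for a median over more rows.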

Taxonomy

Topics: Data Management and Algorithms · Algorithms and Data Compression · Machine Learning and Algorithms