Sign-Full Random Projections

Ping Li

arXiv:1805.00533 · stat.ME · May 3, 2018

TL;DR

This paper introduces 'sign-full' random projections that improve cosine similarity estimation over traditional 1-bit methods, especially at high similarity levels, by using expectation-based estimators and normalization techniques.

Contribution

It develops estimators of cosine similarity that combine the sign of one projection with the full value of the other, improving accuracy over sign-only ("sign-sign") methods, and shows that normalizing the full-valued projection improves them further.

Findings

1. Estimated cosine similarity has lower variance with sign-full projections.

2. Normalized estimators outperform sign-sign projections at high similarity.

3. At high similarity, the variance is about 40% of that of the sign-sign estimator.

Abstract

The method of 1-bit ("sign-sign") random projections has been a popular tool for efficient search and machine learning on large datasets. Given two $D$-dim data vectors $u, v \in \mathbb{R}^{D}$, one can generate $x = \sum_{i=1}^{D} u_{i} r_{i}$ and $y = \sum_{i=1}^{D} v_{i} r_{i}$, where $r_{i} \sim N(0, 1)$ iid. The "collision probability" is $\Pr(\mathrm{sgn}(x) = \mathrm{sgn}(y)) = 1 - \frac{\cos^{-1} \rho}{\pi}$, where $\rho = \rho(u, v)$ is the cosine similarity. We develop "sign-full" random projections by estimating $\rho$ from (e.g.,) the expectation $E(\mathrm{sgn}(x)\, y) = \sqrt{\frac{2}{\pi}}\, \rho$, which can be further substantially improved by normalizing $y$. For nonnegative data, we recommend an interesting estimator based on $E(y_{-} 1_{x \geq 0} + y_{+} 1_{x < 0})$ and its normalized version. The recommended estimator almost matches the accuracy of the (computationally expensive) maximum likelihood…
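The two basic estimators described in the abstract can be sketched in a few lines of NumPy. This is an illustrative simulation, not the paper's implementation: the function name `estimate_rho` and its parameters are hypothetical, and both vectors are assumed to be scaled to unit norm so that $x$ and $y$ are standard normal with correlation $\rho$. Under that assumption, writing $y = \rho x + \sqrt{1-\rho^{2}}\, z$ with $z$ independent of $x$ gives $E(\mathrm{sgn}(x)\, y) = \rho\, E|x| = \sqrt{2/\pi}\, \rho$, which is the identity the sign-full estimate inverts. The normalized and nonnegative-data estimators are not covered here.

```python
import numpy as np

def estimate_rho(u, v, k, rng):
    """Return (sign-sign, sign-full) estimates of cosine similarity
    from k Gaussian random projections. Hypothetical helper for illustration."""
    # Assumption: scale both vectors to unit norm so x, y ~ N(0, 1).
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    # One column of iid N(0, 1) entries per projection.
    R = rng.standard_normal((u.shape[0], k))
    x = u @ R
    y = v @ R
    # Sign-sign: invert the collision probability P = 1 - acos(rho) / pi.
    p_hat = np.mean(np.sign(x) == np.sign(y))
    rho_sign_sign = np.cos(np.pi * (1.0 - p_hat))
    # Sign-full: invert E[sgn(x) * y] = sqrt(2 / pi) * rho.
    rho_sign_full = np.sqrt(np.pi / 2.0) * np.mean(np.sign(x) * y)
    return rho_sign_sign, rho_sign_full

# Usage: compare both estimates with the exact cosine similarity.
rng = np.random.default_rng(0)
u = rng.standard_normal(20)
v = u + 0.5 * rng.standard_normal(20)
rho_true = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
rho_ss, rho_sf = estimate_rho(u, v, k=10_000, rng=rng)
print(f"true={rho_true:.3f}  sign-sign={rho_ss:.3f}  sign-full={rho_sf:.3f}")
```

Only the sign of $x$ is used in the sign-full estimate, so one side of the comparison can still be stored in a single bit per projection.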

Taxonomy

Topics: Bayesian Methods and Mixture Models · Statistical Methods and Inference · Random Matrices and Applications