OPORP: One Permutation + One Random Projection

Ping Li; Xiaoyun Li

arXiv:2302.03505 · stat.ML · May 24, 2023

TL;DR

OPORP applies one permutation and one random projection to the data vectors, then uses fixed-length binning and normalization to estimate cosine similarity in embedding-based retrieval with lower variance than previous methods.

Contribution

This paper proposes OPORP, a data reduction technique that combines a permutation, a random projection, and normalization to improve the accuracy of cosine similarity estimation for high-dimensional embeddings.

Findings

01. Variance reduction through normalization and binning

02. Exact recovery of VSRP with repeated OPORP

03. Improved cosine similarity estimation accuracy

Abstract

Consider two $D$-dimensional data vectors (e.g., embeddings): $u, v$. In many embedding-based retrieval (EBR) applications where the vectors are generated from trained models, $D = 256 \sim 1024$ is common. In this paper, OPORP (one permutation + one random projection) uses a variant of the "count-sketch" type of data structures for achieving data reduction/compression. With OPORP, we first apply a permutation on the data vectors. A random vector $r$ is generated i.i.d. with moments: $E(r_{i}) = 0, E(r_{i}^{2}) = 1, E(r_{i}^{3}) = 0, E(r_{i}^{4}) = s$. We multiply (as dot product) $r$ with all permuted data vectors. Then we break the $D$ columns into $k$ equal-length bins and aggregate (i.e., sum) the values in each bin to obtain $k$ samples from each data vector. One crucial step is to normalize the $k$ samples to the unit $l_{2}$ norm. We show that the estimation variance is essentially: $(s-1)A +…
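The steps described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `oporp_sketch` is hypothetical, $r$ is drawn from the Rademacher distribution (entries $\pm 1$, which satisfies the stated moments with $s = 1$), $k$ is assumed to divide $D$, and the cosine is estimated as the inner product of the two normalized sketches, which is our reading of the normalization step.

```python
import numpy as np

def oporp_sketch(X, k, rng):
    """Reduce each row of X (shape (n, D)) to k normalized samples.

    Hypothetical helper illustrating: permute, multiply by r, bin, sum, normalize.
    """
    X = np.asarray(X, dtype=float)
    n, D = X.shape
    if D % k != 0:
        raise ValueError("this sketch assumes k divides D (equal-length bins)")
    perm = rng.permutation(D)                 # one permutation, shared by all vectors
    r = rng.choice([-1.0, 1.0], size=D)       # assumption: Rademacher r, so s = 1
    Z = X[:, perm] * r                        # elementwise product with permuted data
    S = Z.reshape(n, k, D // k).sum(axis=2)   # sum within each of the k bins
    norms = np.linalg.norm(S, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                   # leave all-zero sketches unchanged
    return S / norms                          # normalize to unit l2 norm

rng = np.random.default_rng(0)
D, k = 1024, 256
u = rng.standard_normal(D)
v = 0.9 * u + np.sqrt(1 - 0.9**2) * rng.standard_normal(D)  # correlated with u

sketches = oporp_sketch(np.stack([u, v]), k, rng)
est = float(sketches[0] @ sketches[1])        # assumed estimator: inner product
true_cos = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
print(f"true cosine = {true_cos:.4f}, estimate from k={k} samples = {est:.4f}")
```

Both vectors must pass through the same permutation and the same $r$, which is why they are stacked into one array before the call. With $k = D$ every bin holds a single coordinate, so the sketch is a signed, permuted copy of the input and the estimate equals the exact cosine.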

Taxonomy

Topics: Bayesian Methods and Mixture Models · Machine Learning and Algorithms · Face and Expression Recognition