OPORP: One Permutation + One Random Projection
Ping Li, Xiaoyun Li

TL;DR
OPORP applies one permutation and one random projection to each data vector, sums the projected values within fixed-length bins, and normalizes the resulting samples. The normalization and the fixed-length binning both reduce the variance of cosine similarity estimates in embedding-based retrieval compared to previous methods.
Contribution
This paper proposes OPORP, a data reduction technique that combines a single permutation, a single random projection, fixed-length binning, and normalization to improve the accuracy of cosine similarity estimation for high-dimensional embeddings.
Findings
Variance reduction through normalization and binning
Exact recovery of very sparse random projections (VSRP) by repeating OPORP
Improved cosine similarity estimation accuracy
Abstract
Consider two $D$-dimensional data vectors (e.g., embeddings): $u, v$. In many embedding-based retrieval (EBR) applications where the vectors are generated from trained models, $D = 256 \sim 1024$ are common. In this paper, OPORP (one permutation + one random projection) uses a variant of the "count-sketch" type of data structures for achieving data reduction/compression. With OPORP, we first apply a permutation on the data vectors. A random vector $r$ is generated i.i.d. with moments: $E(r_i) = 0$, $E(r_i^2) = 1$, $E(r_i^3) = 0$, $E(r_i^4) = s$. We multiply (as dot product) $r$ with all permuted data vectors. Then we break the $D$ columns into $k$ equal-length bins and aggregate (i.e., sum) the values in each bin to obtain $k$ samples from each data vector. One crucial step is to normalize the $k$ samples to the unit $l_2$ norm. We show that the estimation variance is essentially: $(s-1)A +…
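The steps described in the abstract (permute, multiply by a random vector, sum within equal-length bins, normalize) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. The function names `oporp_sketch` and `estimate_cosine` are hypothetical, D denotes the vector length and k the number of bins, k is assumed to divide D, and the random vector is drawn with entries of ±1 (a distribution with mean 0, second moment 1, third moment 0, and fourth moment 1), chosen here only for illustration.

```python
import numpy as np

def oporp_sketch(x, k, perm, r):
    """Hypothetical helper: reduce a length-D vector to k normalized samples."""
    x = np.asarray(x, dtype=float)
    D = x.shape[0]
    if D % k != 0:
        raise ValueError("this sketch assumes k divides D")
    # Permute the coordinates, then multiply entrywise by the random vector.
    projected = x[perm] * r
    # Break the D entries into k equal-length bins and sum within each bin.
    samples = projected.reshape(k, D // k).sum(axis=1)
    # Normalize the k samples to unit l2 norm.
    norm = np.linalg.norm(samples)
    if norm == 0.0:
        raise ValueError("all-zero sketch cannot be normalized")
    return samples / norm

def estimate_cosine(u, v, k, seed=0):
    """Estimate cos(u, v) from sketches sharing one permutation and one r."""
    rng = np.random.default_rng(seed)
    D = len(u)
    perm = rng.permutation(D)            # the one permutation
    r = rng.choice([-1.0, 1.0], size=D)  # assumption: +/-1 entries
    su = oporp_sketch(u, k, perm, r)
    sv = oporp_sketch(v, k, perm, r)
    return float(su @ sv)

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    u = rng.standard_normal(512)
    v = u + 0.5 * rng.standard_normal(512)
    exact = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    print("exact cosine:   ", exact)
    print("OPORP estimate: ", estimate_cosine(u, v, k=64))
```

Both vectors must be sketched with the same permutation and the same random vector, otherwise the inner product of the sketches carries no information about the original similarity. With k equal to D every bin holds a single coordinate, so under the ±1 assumption the estimate coincides with the exact cosine; smaller k trades accuracy for a shorter sketch.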
Taxonomy
Topics: Bayesian Methods and Mixture Models · Machine Learning and Algorithms · Face and Expression Recognition
