Dimension Independent Similarity Computation
Reza Bosagh Zadeh, Ashish Goel

TL;DR
This paper introduces algorithms for efficiently computing all pairwise similarities between high-dimensional sparse vectors. After the initial pass that reads the data, the cost is independent of dimension, which makes the algorithms suitable for large-scale data such as social networks.
Contribution
The paper presents a suite of dimension-independent algorithms for various similarity measures, optimized for MapReduce and validated on Twitter data.
Findings
Algorithms are provably independent of dimension
Validated at large scale on Twitter data
Algorithms are deployed in production at Twitter
Abstract
We present a suite of algorithms for Dimension Independent Similarity Computation (DISCO) to compute all pairwise similarities between very high dimensional sparse vectors. All of our results are provably independent of dimension, meaning apart from the initial cost of trivially reading in the data, all subsequent operations are independent of the dimension, thus the dimension can be very large. We study Cosine, Dice, Overlap, and the Jaccard similarity measures. For Jaccard similarity we include an improved version of MinHash. Our results are geared toward the MapReduce framework. We empirically validate our theorems at large scale using data from the social networking site Twitter. At time of writing, our algorithms are live in production at twitter.com.
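For readers unfamiliar with the four measures named in the abstract, the following is a minimal sketch of their standard textbook definitions for binary sparse vectors, where each vector is represented as the set of its nonzero coordinates. It computes the measures exactly for a single pair and is not the paper's DISCO sampling scheme or its MinHash variant. The function name `pairwise_similarities` and the choice to return 0.0 for empty inputs are assumptions made for illustration.

```python
import math


def pairwise_similarities(a, b):
    """Exact set-based similarities between two binary sparse vectors.

    `a` and `b` are iterables of the nonzero coordinate indices.
    Hypothetical helper for illustration; not taken from the paper.
    """
    a, b = set(a), set(b)
    if not a or not b:
        # Assumption: treat similarity with an empty vector as 0.
        return {"cosine": 0.0, "dice": 0.0, "overlap": 0.0, "jaccard": 0.0}

    inter = len(a & b)  # number of shared nonzero coordinates
    return {
        "cosine": inter / math.sqrt(len(a) * len(b)),
        "dice": 2 * inter / (len(a) + len(b)),
        "overlap": inter / min(len(a), len(b)),
        "jaccard": inter / len(a | b),
    }


if __name__ == "__main__":
    # Two toy vectors sharing coordinates 3 and 4.
    print(pairwise_similarities({1, 2, 3, 4}, {3, 4, 5}))
```

Every measure depends only on the nonzero entries of the two vectors, never on the total number of coordinates, which is why sparse set representations are a natural fit for very high dimensional data.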
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning and Algorithms · Advanced Image and Video Retrieval Techniques
