Dimension Independent Similarity Computation

Reza Bosagh Zadeh; Ashish Goel

arXiv:1206.2082 · cs.DS · May 24, 2013 · 52 cites

Open Access

TL;DR

This paper introduces algorithms for computing all pairwise similarities between very high-dimensional sparse vectors. Apart from the initial cost of reading in the data, the cost of every operation is independent of the dimension, which suits large-scale data such as social networks.

Contribution

The paper presents a suite of dimension-independent algorithms for various similarity measures, optimized for MapReduce and validated on Twitter data.

Findings

01 Algorithms are provably independent of dimension

02 Validated at large scale on Twitter data

03 Algorithms are deployed in production at Twitter

Abstract

We present a suite of algorithms for Dimension Independent Similarity Computation (DISCO) to compute all pairwise similarities between very high dimensional sparse vectors. All of our results are provably independent of dimension, meaning that, apart from the initial cost of trivially reading in the data, all subsequent operations are independent of the dimension; thus the dimension can be very large. We study the Cosine, Dice, Overlap, and Jaccard similarity measures. For Jaccard similarity we include an improved version of MinHash. Our results are geared toward the MapReduce framework. We empirically validate our theorems at large scale using data from the social networking site Twitter. At time of writing, our algorithms are live in production at twitter.com.
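
For readers unfamiliar with the four measures named in the abstract, the following is a minimal sketch of their standard textbook definitions for binary sparse vectors. It assumes each vector is represented as a Python set of its nonzero indices; that representation, the function names, and the sample data are illustrative choices, not taken from the paper. It computes exact similarities for a single pair and does not reproduce the paper's DISCO algorithms, its MapReduce formulation, or its improved MinHash.

```python
from math import sqrt

# Assumption: a binary sparse vector is stored as the set of its nonzero
# indices, so the cost of each function depends only on the number of
# nonzeros, not on the dimension of the vector space.

def cosine(a, b):
    """Cosine similarity: |A & B| / sqrt(|A| * |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def dice(a, b):
    """Dice similarity: 2 * |A & B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def overlap(a, b):
    """Overlap similarity: |A & B| / min(|A|, |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def jaccard(a, b):
    """Jaccard similarity: |A & B| / |A | B|."""
    union_size = len(a | b)
    if union_size == 0:
        return 0.0
    return len(a & b) / union_size

# Hypothetical example: two sparse vectors in a very high-dimensional space.
x = {1, 5, 9, 42}
y = {5, 9, 1_000_000}

for name, fn in [("cosine", cosine), ("dice", dice),
                 ("overlap", overlap), ("jaccard", jaccard)]:
    print(f"{name}: {fn(x, y):.4f}")
```

Because only the nonzero indices are stored, the index 1_000_000 costs no more to handle than the index 5, which gives some intuition for why sparse similarity computation need not depend on the dimension.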

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Machine Learning and Algorithms · Advanced Image and Video Retrieval Techniques