A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets

Pinghui Wang; Yiyan Qi; Yuanming Zhang; Qiaozhu Zhai; Chenxu Wang; John C.S. Lui; Xiaohong Guan

arXiv:1905.08977·cs.DS·May 23, 2019


TL;DR

This paper introduces MaxLogHash, a memory-efficient sketching method for accurately estimating high set similarities in streaming data, outperforming existing techniques in memory usage while maintaining accuracy.

Contribution

We propose MaxLogHash, a novel sketch method that significantly reduces memory requirements for estimating high similarities in streaming sets, addressing limitations of prior compressed MinHash variants.

Findings

01. MaxLogHash is about 5 times more memory efficient than MinHash.

02. It achieves comparable accuracy with smaller register sizes.

03. Experimental results confirm its effectiveness on various datasets.

Abstract

Estimating set similarity and detecting highly similar sets are fundamental problems in areas such as databases, machine learning, and information retrieval. MinHash is a well-known technique for approximating Jaccard similarity of sets and has been successfully used for many applications such as similarity search and large scale learning. Its two compressed versions, b-bit MinHash and Odd Sketch, can significantly reduce the memory usage of the original MinHash method, especially for estimating high similarities (i.e., similarities around 1). Although MinHash can be applied to static sets as well as streaming sets, of which elements are given in a streaming fashion and cardinality is unknown or even infinite, unfortunately, b-bit MinHash and Odd Sketch fail to deal with streaming data. To solve this problem, we design a memory efficient sketch method, MaxLogHash, to accurately estimate…


Equations

$$J_{u_{1},u_{2}}^{(t)}=\frac{|\cap^{(t)}(u_{1},u_{2})|}{|\cup^{(t)}(u_{1},u_{2})|},$$

$$J_{A,B}=\frac{|A\cap B|}{|A\cup B|}=P(\min(\pi(A))=\min(\pi(B))),$$

$$\hat{J}_{A,B}=\frac{\sum_{i=1}^{k}\mathbf{1}(\min(\pi_{i}(A))=\min(\pi_{i}(B)))}{k},$$

$$\text{Var}(\hat{J}_{A,B})=\frac{J_{A,B}(1-J_{A,B})}{k}.$$

$$S_{A}=(\min(\pi_{1}(A)),\ldots,\min(\pi_{k}(A))),$$

$$S_{A}^{(b)}=(\text{min}^{(b)}(\pi_{1}(A)),\ldots,\text{min}^{(b)}(\pi_{k}(A))).$$

$$\hat{J}_{A,B}^{(b)}=\frac{\sum_{i=1}^{k}\mathbf{1}(\text{min}^{(b)}(\pi_{i}(A))=\text{min}^{(b)}(\pi_{i}(B)))-k/2^{b}}{k(1-1/2^{b})}.$$

$$\text{Var}(\hat{J}_{A,B}^{(b)})=\frac{1-J_{A,B}}{k}\left(J_{A,B}+\frac{1}{2^{b}-1}\right).$$

$$S_{A}^{\text{(Odd)}}[j]=\bigoplus_{i=1,\ldots,k}\mathbf{1}(h(i,\min(\pi_{i}(A)))=j),\quad 1\leq j\leq z.$$

$$\hat{J}_{A,B}^{\text{(Odd)}}=1+\frac{z}{4k}\ln\left(1-\frac{2\sum_{j=1}^{z}S_{A}^{\text{(Odd)}}[j]\oplus S_{B}^{\text{(Odd)}}[j]}{z}\right).$$

$$\min(\pi(A\cup\{v\}))=\min(\min(\pi(A)),\pi(v)).$$

$$\text{min}^{(b)}(\pi(A\cup\{v\}))\neq\min(\text{min}^{(b)}(\pi(A)),\pi^{(b)}(v)).$$

$$\text{MaxLog}(h(A))=\max_{v\in A}r(v)=\max_{v\in A}\lfloor-\log_{2}h(v)\rfloor.$$

$$P(r(v)=j)=\frac{1}{2^{j+1}},\quad P(r(v)<j)=1-\frac{1}{2^{j}},\quad j\in\{0,1,2,\ldots\}.$$

$$P\left(\text{MaxLog}(h(A))\leq 2^{\lceil\log_{2}\log_{2}|I|\rceil}-1\right)=\left(1-\frac{1}{2^{2^{\lceil\log_{2}\log_{2}|I|\rceil}}}\right)^{|A|}.$$

$$\text{MaxLog}(h(A\cup\{v\}))=\max(\text{MaxLog}(h(A)),\lfloor-\log_{2}h(v)\rfloor).$$

$$\gamma=P(\text{MaxLog}(h(A))\neq\text{MaxLog}(h(B)))=\sum_{j=1}^{+\infty}\frac{|A\setminus B|}{2^{j+1}}\left(1-\frac{1}{2^{j+1}}\right)^{|A\setminus B|-1}\left(1-\frac{1}{2^{j}}\right)^{|B|}+\sum_{j=1}^{+\infty}\frac{|B\setminus A|}{2^{j+1}}\left(1-\frac{1}{2^{j+1}}\right)^{|B\setminus A|-1}\left(1-\frac{1}{2^{j}}\right)^{|A|}.$$

$$\mathbb{E}(\gamma)=\frac{\sum_{i=1}^{k}\mathbf{1}(\text{MaxLog}(h_{i}(A))\neq\text{MaxLog}(h_{i}(B)))}{k},$$

$$P(\text{MaxLog}(h(A))\neq\text{MaxLog}(h(B))\wedge\delta_{A,B}=1)\approx 0.7213(1-J_{A,B}),$$

$$\chi_{u_{1},u_{2}}[i]=\mathbf{1}(m_{u_{1}}[i]\neq m_{u_{2}}[i]),\quad i=1,\ldots,k,$$

$$\psi_{u_{1},u_{2}}[i]=\begin{cases}s_{u_{1}}[i],&m_{u_{1}}[i]>m_{u_{2}}[i]\\ s_{u_{2}}[i],&m_{u_{1}}[i]<m_{u_{2}}[i]\\ -1,&m_{u_{1}}[i]=m_{u_{2}}[i].\end{cases}$$

$$P(\delta_{u_{1},u_{2}}[i]=1)=\alpha_{|\cup(u_{1},u_{2})|}(1-J_{u_{1},u_{2}}),\quad i=1,\ldots,k,$$

$$\Delta(u_{1},u_{2})=(I_{u_{1}}\setminus I_{u_{2}})\cup(I_{u_{2}}\setminus I_{u_{1}})=\cup(u_{1},u_{2})\setminus\cap(u_{1},u_{2}).$$

$$P(\delta_{u_{1},u_{2}}[i]=1\wedge r^{*}=j)=\sum_{w\in\Delta(u_{1},u_{2})}P(r_{i}(w)=j)\prod_{v\in\cup(u_{1},u_{2})\setminus\{w\}}P(r_{i}(v)<j)=\frac{|\Delta(u_{1},u_{2})|}{2^{j+1}}\left(1-\frac{1}{2^{j}}\right)^{|\cup(u_{1},u_{2})|-1}.$$

$$P(\delta_{u_{1},u_{2}}[i]=1)=\sum_{j=0}^{+\infty}P(\delta_{u_{1},u_{2}}[i]=1\wedge r^{*}=j)=\sum_{j=0}^{+\infty}\sum_{w\in\Delta(u_{1},u_{2})}P(r_{i}(w)=j)\prod_{v\in\cup(u_{1},u_{2})\setminus\{w\}}P(r_{i}(v)<j)=\sum_{j=0}^{+\infty}\frac{|\Delta(u_{1},u_{2})|}{2^{j+1}}\left(1-\frac{1}{2^{j}}\right)^{|\cup(u_{1},u_{2})|-1}=\sum_{j=1}^{+\infty}\frac{|\Delta(u_{1},u_{2})|}{|\cup(u_{1},u_{2})|}\cdot\frac{|\cup(u_{1},u_{2})|}{2^{j+1}}\left(1-\frac{1}{2^{j}}\right)^{|\cup(u_{1},u_{2})|-1}=\alpha_{|\cup(u_{1},u_{2})|}(1-J_{u_{1},u_{2}}),$$

$$\mathbb{E}(\hat{k})=k\alpha_{|\cup(u_{1},u_{2})|}(1-J_{u_{1},u_{2}}).$$

$$J_{u_{1},u_{2}}=1-\frac{\mathbb{E}(\hat{k})}{k\alpha_{|\cup(u_{1},u_{2})|}}.$$

$$\alpha_{n}=\frac{n}{2}\sum_{j=1}^{+\infty}\frac{1}{2^{j}}\left(1-\frac{1}{2^{j}}\right)^{n-1}=\frac{n}{2}\sum_{j=1}^{+\infty}\frac{1}{2^{j}}\sum_{l=0}^{n-1}\binom{n-1}{l}\left(-\frac{1}{2^{j}}\right)^{n-l-1}=\frac{n}{2}\sum_{l=0}^{n-1}(-1)^{n-l-1}\binom{n-1}{l}\sum_{j=1}^{+\infty}\frac{1}{2^{j(n-l)}}=\frac{n}{2}\sum_{l=0}^{n-1}(-1)^{n-l-1}\binom{n-1}{l}\frac{1}{2^{n-l}-1}.$$

$$\hat{J}_{u_{1},u_{2}}=1-\frac{\hat{k}}{k\alpha}.$$

$$\mathbb{E}(\hat{J}_{u_{1},u_{2}})-J_{u_{1},u_{2}}=(1-\beta_{|\cup(u_{1},u_{2})|})(1-J_{u_{1},u_{2}}),$$


Full text

A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets

Pinghui Wang (1,2,⋆), Yiyan Qi (1,⋆), Yuanming Zhang (1), Qiaozhu Zhai (1), Chenxu Wang (2,∗), John C.S. Lui (3), and Xiaohong Guan (2,1,4,∗)

(1) NSKEYLAB, Xi’an Jiaotong University, Xi’an, China

(2) Shenzhen Research Institute of Xi’an Jiaotong University, Shenzhen, China

(3) The Chinese University of Hong Kong, Hong Kong

(4) Department of Automation and NLIST Lab, Tsinghua University, Beijing, China



Abstract.

Estimating set similarity and detecting highly similar sets are fundamental problems in areas such as databases, machine learning, and information retrieval. MinHash is a well-known technique for approximating the Jaccard similarity of sets and has been successfully used in many applications such as similarity search and large-scale learning. Its two compressed versions, $b$-bit MinHash and Odd Sketch, can significantly reduce the memory usage of the original MinHash method, especially for estimating high similarities (i.e., similarities around 1). Although MinHash can be applied to static sets as well as streaming sets, whose elements arrive in a streaming fashion and whose cardinality is unknown or even infinite, $b$-bit MinHash and Odd Sketch unfortunately fail to deal with streaming data. To solve this problem, we design a memory-efficient sketch method, MaxLogHash, to accurately estimate Jaccard similarities in streaming sets. Compared to MinHash, our method uses smaller registers (each register consists of less than 7 bits) to build a compact sketch for each set. We also provide a simple yet accurate estimator for inferring Jaccard similarity from MaxLogHash sketches. In addition, we derive formulas for bounding the estimation error and determine the smallest necessary memory usage (i.e., the number of registers used for a MaxLogHash sketch) for the desired accuracy. We conduct experiments on a variety of datasets, and the results show that MaxLogHash is about 5 times more memory efficient than MinHash with the same accuracy and computational cost for estimating high similarities.

Keywords: Streaming algorithms; Sketch; Jaccard coefficient similarity

⋆Pinghui Wang and Yiyan Qi contributed equally to this work.

∗Corresponding Author.

Published in: The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’19), August 4–8, 2019, Anchorage, AK, USA. DOI: 10.1145/3292500.3330825. ISBN: 978-1-4503-6201-6/19/08.

CCS concepts: Mathematics of computing → Probabilistic algorithms; Information systems → Similarity measures; Theory of computation → Sketching and sampling.

1. Introduction

Data streams are ubiquitous. Examples range from financial transactions to Internet of Things (IoT) data, network traffic, call logs, trajectory logs, etc. Because these applications involve massive volumes of data, it is prohibitive to collect entire data streams, especially when computational and storage resources are limited (Li2018Approximate). Therefore, it is important to develop memory-efficient methods such as sampling and sketching techniques for mining large streaming data.

Many datasets can be viewed as collections of sets, and computing set similarities is fundamental for a variety of applications in areas such as databases, machine learning, and information retrieval. For example, one can view each mobile device’s trajectory as a set in which each element is a tuple of a time $t$ and the physical location of the device at time $t$. Mining devices with similar trajectories is then useful for identifying friends or devices belonging to the same person. Other examples are datasets encountered in computer networks, mobile phone networks, and online social networks (OSNs), where learning user similarities from the sets of users’ visited websites, connected phone numbers, and friends on OSNs is fundamental for applications such as link prediction and friendship recommendation.

One of the most popular set similarity measures is the Jaccard similarity coefficient, which is defined as $\frac{|A\cap B|}{|A\cup B|}$ for two sets $A$ and $B$. To handle large sets, MinHash (or minwise hashing) (Broder2000) is a powerful set similarity estimation technique, which uses an array of $k$ registers to build a sketch for each set. Its accuracy depends only on the value of $k$ and the Jaccard similarity of the two sets of interest, and is independent of the sizes of the two sets. MinHash has been successfully used in a variety of applications, such as similarity search (BroderSEQUENCES1997), compressing social networks (ChierichettiKDD2009), advertising diversification (GollapudiWWW2009), large-scale learning (LiNIPS2011), and web spam detection (UrvoyTOW2008). Many of these applications focus on estimating similarity values close to 1. Take similar-document search in a sufficiently large corpus as an example. There may be thousands of documents similar to the query document, so the goal is not just to find similar documents but also to return a short ranked list (e.g., the top 10) of the most similar ones. Such an application needs methods that are both accurate and memory-efficient for estimating high similarities. To this end, two compressed MinHash methods, $b$-bit MinHash (PingWWW2010) and Odd Sketch (MitzenmacherWWW14), were proposed in the past few years to reduce the memory usage of the original MinHash by dozens of times while providing comparable estimation accuracy, especially for large similarity values. However, we observe that these two methods fail to handle data streams (the details are given in Section 3).

To address this challenge, Yu and Weber (YuArxiv2017) recently developed HyperMinHash. A HyperMinHash sketch consists of $k$ registers, where each register has two parts: an FM (Flajolet-Martin) sketch (Flajolet1985) and a $b$-bit string. The $b$-bit string is computed from the fingerprints (i.e., hash values) of the set elements that are mapped to the register. Given the HyperMinHash sketches of two sets $A$ and $B$, HyperMinHash first estimates $|A\cup B|$ and then infers the Jaccard similarity of $A$ and $B$ from the number of collisions of $b$-bit strings given $|A\cup B|$. Later in our experiments, we demonstrate that HyperMinHash not only exhibits a large bias for high similarities but is also computationally expensive when estimating similarities, which results in a large estimation error and a long delay when querying highly similar sets. More importantly, it is difficult to analyze the estimation bias and variance of HyperMinHash analytically. These quantities are of great value in practice: they can be used to bound an estimate’s error and to determine the smallest necessary sampling budget (i.e., $k$) for a desired accuracy.

In this paper, we develop a novel memory-efficient method, MaxLogHash, to estimate Jaccard similarities in streaming sets. Similar to MinHash, MaxLogHash uses a list of $k$ registers to build a compact sketch for each set. Unlike MinHash, which uses a 64-bit (resp. 32-bit) register to store the minimum hash value of 64-bit (resp. 32-bit) set elements, MaxLogHash uses only a 7-bit (resp. 6-bit) register to approximately record the logarithm of the minimum hash value, which results in a 9-fold (resp. 5-fold) reduction in memory usage. Another attractive property is that the MaxLogHash sketch can be computed incrementally, so MaxLogHash is able to handle streaming sets. Given the MaxLogHash sketches of any two sets, we provide a simple yet accurate estimator for their Jaccard similarity and derive exact formulas for bounding the estimation error. We conduct experiments on a variety of publicly available datasets, and the results show that MaxLogHash reduces the memory required by MinHash fivefold while achieving the same accuracy and computational cost.

The rest of this paper is organized as follows. The problem formulation is presented in Section 2. Section 3 introduces preliminaries used in this paper. Section 4 presents our method MaxLogHash. The performance evaluation and testing results are presented in Section 5. Section 6 summarizes related work. Concluding remarks then follow.

2. Problem Formulation

For ease of reading and comprehension, we say that each set belongs to a user and that the elements of the set are items (e.g., products) that the user connects to. Let $U$ denote the set of users and $I$ denote the set of all items. Let $\Pi=e^{(1)}e^{(2)}\cdots e^{(t)}\cdots$ denote the user-item stream of interest, where $e^{(t)}=(u^{(t)},i^{(t)})$ is the element of $\Pi$ occurring at discrete time $t>0$, and $u^{(t)}\in U$ and $i^{(t)}\in I$ are the element’s user and item, representing a connection from user $u^{(t)}$ to item $i^{(t)}$. We assume that $\Pi$ has no duplicate user-item pairs, that is, $e^{(i)}\neq e^{(j)}$ when $i\neq j$. (Footnote 1: Duplicated user-item pairs can be easily checked and filtered using fast and memory-efficient techniques such as Bloom filters (BloomACMCommun1970).) Let $I_{u}^{(t)}\subset I$ be the item set of user $u\in U$, which consists of the items that user $u$ connects to up to and including time $t$. Let $\cup^{(t)}(u_{1},u_{2})$ denote the union of the two sets $I_{u_{1}}^{(t)}$ and $I_{u_{2}}^{(t)}$, that is, $\cup^{(t)}(u_{1},u_{2})=I_{u_{1}}^{(t)}\cup I_{u_{2}}^{(t)}$. Similarly, we define the intersection of the two sets $I_{u_{1}}^{(t)}$ and $I_{u_{2}}^{(t)}$ as $\cap^{(t)}(u_{1},u_{2})=I_{u_{1}}^{(t)}\cap I_{u_{2}}^{(t)}$. Then, the Jaccard similarity of sets $I_{u_{1}}^{(t)}$ and $I_{u_{2}}^{(t)}$ is defined as

$$J_{u_{1},u_{2}}^{(t)}=\frac{|\cap^{(t)}(u_{1},u_{2})|}{|\cup^{(t)}(u_{1},u_{2})|},$$

which reflects the similarity between users $u_{1}$ and $u_{2}$. In this paper, we aim to develop a fast and accurate method to estimate $J_{u_{1},u_{2}}^{(t)}$ for any two users $u_{1}$ and $u_{2}$ over time, and to detect pairs of highly similar users. When no confusion arises, we omit the superscript $(t)$ to ease exposition.
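As a point of reference for the sketches that follow, the definition above can be evaluated exactly by storing every item set, which is precisely the memory cost that sketching tries to avoid. The following is a minimal Python sketch under the assumption that the stream is an iterable of `(user, item)` pairs; the function name `exact_jaccard_over_stream` and the convention of returning 0.0 for an empty union are illustrative choices, not part of the paper.

```python
from collections import defaultdict


def exact_jaccard_over_stream(stream, u1, u2):
    """Yield (t, J^{(t)}_{u1,u2}) after each stream element e^{(t)} = (user, item)."""
    items = defaultdict(set)  # I_u^{(t)} for every user seen so far
    for t, (user, item) in enumerate(stream, start=1):
        items[user].add(item)
        union = items[u1] | items[u2]
        inter = items[u1] & items[u2]
        # Assumption: similarity is reported as 0.0 while both sets are empty.
        yield t, (len(inter) / len(union) if union else 0.0)


stream = [("a", 1), ("b", 1), ("a", 2), ("b", 3), ("b", 2)]
history = list(exact_jaccard_over_stream(stream, "a", "b"))
print(history[-1])  # at t = 5: |{1, 2}| / |{1, 2, 3}| = 2/3
```

Memory here grows with the total number of distinct user-item pairs, which motivates the fixed-size sketches discussed next.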

3. Preliminaries

In this section, we first introduce MinHash (Broder2000). Then, we describe two state-of-the-art memory-efficient methods, $b$-bit MinHash (PingWWW2010) and Odd Sketch (MitzenmacherWWW14), that decrease the memory usage of the original MinHash method. Finally, we demonstrate that both $b$-bit MinHash and Odd Sketch fail to handle streaming sets.

3.1. MinHash

Let $\pi$ be a random permutation (or hash function) from elements in $I$ to elements in $I$, i.e., a hash function that maps integers in $I$ to distinct integers in $I$ at random. (Footnote 2: MinHash assumes no hash collisions.) Broder et al. (Broder2000) observed that the Jaccard similarity of two sets $A,B\subseteq I$ equals

$$J_{A,B}=\frac{|A\cap B|}{|A\cup B|}=P(\min(\pi(A))=\min(\pi(B))),$$

where $\pi(A)=\{\pi(w):w\in A\}$ . Therefore, MinHash uses a sequence of $k$ independent permutations $\pi_{1},\ldots,\pi_{k}$ and estimates $J_{A,B}$ as

$$\hat{J}_{A,B}=\frac{\sum_{i=1}^{k}\mathbf{1}(\min(\pi_{i}(A))=\min(\pi_{i}(B)))}{k},$$

where $\mathbf{1}(\mathbb{P})$ is an indicator function that equals 1 when the predicate $\mathbb{P}$ is true and 0 otherwise. Note that $\hat{J}_{A,B}$ is an unbiased estimator for $J_{A,B}$ , i.e., $\mathbb{E}(\hat{J}_{A,B})=J_{A,B}$ , and its variance is

$$\text{Var}(\hat{J}_{A,B})=\frac{J_{A,B}(1-J_{A,B})}{k}.$$

Therefore, instead of storing a set $A$ in memory, one can compute and store its MinHash sketch $S_{A}$ , i.e.,

$$S_{A}=(\min(\pi_{1}(A)),\ldots,\min(\pi_{k}(A))),$$

which reduces the memory usage when $|A|>k$ . The Jaccard similarity of any two sets can be accurately and efficiently estimated based on their MinHash sketches.
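To make the MinHash sketch and its estimator concrete, here is a small Python illustration. It assumes items are non-negative integers below $2^{64}$ and uses a seeded `hashlib.blake2b` digest as a stand-in for the random permutations $\pi_{1},\ldots,\pi_{k}$; the helper names `h64`, `minhash_sketch`, and `minhash_estimate` are hypothetical and not taken from the paper's code.

```python
import hashlib


def h64(seed, x):
    """Seeded 64-bit hash of an integer item, standing in for permutation pi_seed."""
    digest = hashlib.blake2b(x.to_bytes(8, "little"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(items, k):
    """S_A = (min pi_1(A), ..., min pi_k(A)); `items` must be non-empty."""
    items = list(items)
    return [min(h64(i, x) for x in items) for i in range(k)]


def minhash_estimate(sketch_a, sketch_b):
    """Fraction of registers on which the two sketches agree."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def minhash_variance(j, k):
    """Var of the estimator: J (1 - J) / k."""
    return j * (1 - j) / k


A = range(0, 400)
B = range(20, 420)  # |A ∩ B| = 380, |A ∪ B| = 420, so J is about 0.905
k = 256
estimate = minhash_estimate(minhash_sketch(A, k), minhash_sketch(B, k))
print(estimate, minhash_variance(380 / 420, k) ** 0.5)
```

The second printed value is the theoretical standard deviation, which gives a rough sense of how far the estimate may drift from the true similarity for this $k$.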

3.2. b-bit MinHash

Li and König (PingWWW2010) proposed $b$-bit MinHash to further reduce the memory usage. $b$-bit MinHash reduces the memory required for storing a MinHash sketch $S_{A}$ from $32k$ or $64k$ bits to $bk$ bits. (Footnote 3: A 32- or 64-bit register is used to store each $\min(\pi_{i}(A))$, $i=1,\ldots,k$.) The basic idea behind $b$-bit MinHash is that identical hash values have the same lowest $b$ bits, while two different hash values have the same lowest $b$ bits only with a small probability $1/2^{b}$. Formally, let $\min^{(b)}(\pi(A))$ denote the lowest $b$ bits of the value of $\min(\pi(A))$ for a permutation $\pi$. Define the $b$-bit MinHash sketch of set $A$ as

$$S_{A}^{(b)}=(\text{min}^{(b)}(\pi_{1}(A)),\ldots,\text{min}^{(b)}(\pi_{k}(A))).$$

To mine set similarities, Li and König (PingWWW2010) first compute $S_{A}$ for each set $A$ and then store its $b$-bit MinHash sketch $S_{A}^{(b)}$. Finally, the Jaccard similarity $J_{A,B}$ is estimated as

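$$\hat{J}_{A,B}^{(b)}=\frac{\sum_{i=1}^{k}\mathbf{1}(\text{min}^{(b)}(\pi_{i}(A))=\text{min}^{(b)}(\pi_{i}(B)))-k/2^{b}}{k(1-1/2^{b})}.$$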

$\hat{J}_{A,B}^{(b)}$ is also an unbiased estimator for $J_{A,B}$ , and its variance is

$$\text{Var}(\hat{J}_{A,B}^{(b)})=\frac{1-J_{A,B}}{k}\left(J_{A,B}+\frac{1}{2^{b}-1}\right).$$
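The $b$-bit estimator can be illustrated in a few lines of Python. This is only a sketch under assumptions: items are integers below $2^{64}$, a seeded `hashlib.blake2b` digest stands in for the permutations, and the names `bbit_sketch`, `bbit_estimate`, and `bbit_variance` are hypothetical. Note that the full MinHash sketch is built first and truncated afterwards, which is how the method is described above.

```python
import hashlib


def h64(seed, x):
    """Seeded 64-bit hash standing in for permutation pi_seed (an assumption)."""
    digest = hashlib.blake2b(x.to_bytes(8, "little"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(items, k):
    items = list(items)
    return [min(h64(i, x) for x in items) for i in range(k)]


def bbit_sketch(full_sketch, b):
    """Keep only the lowest b bits of every MinHash register."""
    mask = (1 << b) - 1
    return [value & mask for value in full_sketch]


def bbit_estimate(sketch_a, sketch_b, b):
    """Subtract the expected number of accidental b-bit matches, then rescale."""
    k = len(sketch_a)
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    collision = 1 / 2 ** b
    return (matches - k * collision) / (k * (1 - collision))


def bbit_variance(j, k, b):
    return (1 - j) / k * (j + 1 / (2 ** b - 1))


A, B, k, b = range(0, 400), range(20, 420), 256, 2
sa = bbit_sketch(minhash_sketch(A, k), b)
sb = bbit_sketch(minhash_sketch(B, k), b)
estimate = bbit_estimate(sa, sb, b)
print(estimate, bbit_variance(380 / 420, k, b) ** 0.5)
```

With $b=2$ each register costs 2 bits instead of 64, at the price of a somewhat larger variance for the same $k$.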

3.3. Odd Sketch

Mitzenmacher et al. (MitzenmacherWWW14) developed Odd Sketch, which is more memory efficient than $b$-bit MinHash when mining sets of high similarity. Odd Sketch uses a hash function $h$ that maps each tuple $(i,\min(\pi_{i}(A)))$, $i=1,\ldots,k$, to an integer in $\{1,\ldots,z\}$ at random. For a set $A$, its odd sketch $S_{A}^{\text{(Odd)}}$ consists of $z$ bits. Function $h$ maps the tuples $(1,\min(\pi_{1}(A))),\ldots,(k,\min(\pi_{k}(A)))$ into the $z$ bits of $S_{A}^{\text{(Odd)}}$ at random, and $S_{A}^{\text{(Odd)}}[j]$, $1\leq j\leq z$, is the parity of the number of tuples that are mapped to the $j^{\text{th}}$ bit of $S_{A}^{\text{(Odd)}}$. Formally, $S_{A}^{\text{(Odd)}}[j]$ is computed as

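$$S_{A}^{\text{(Odd)}}[j]=\bigoplus_{i=1,\ldots,k}\mathbf{1}(h(i,\min(\pi_{i}(A)))=j),\quad 1\leq j\leq z.$$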

The Jaccard similarity $J_{A,B}$ is then estimated as

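$$\hat{J}_{A,B}^{\text{(Odd)}}=1+\frac{z}{4k}\ln\left(1-\frac{2\sum_{j=1}^{z}S_{A}^{\text{(Odd)}}[j]\oplus S_{B}^{\text{(Odd)}}[j]}{z}\right).$$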

Mitzenmacher et al. demonstrate that $\hat{J}_{A,B}^{\text{(Odd)}}$ is more accurate than $\hat{J}_{A,B}^{(b)}$ under the same memory usage (refer to (MitzenmacherWWW14) for details of the error analysis of $\hat{J}_{A,B}^{\text{(Odd)}}$).
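The Odd Sketch construction and estimator described in this subsection might look as follows in Python. This is a hedged sketch, not the reference implementation: `hashlib.blake2b` with different `person` tags stands in for both the permutations and the tuple hash $h$, bit positions are 0-based, and raising an error when the sketch is saturated is an illustrative choice.

```python
import hashlib
import math


def h64(seed, x, tag=b"pi"):
    """Seeded 64-bit hash; `tag` separates the permutation hash from the tuple hash."""
    digest = hashlib.blake2b(x.to_bytes(8, "little"), digest_size=8,
                             salt=seed.to_bytes(8, "little"), person=tag).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(items, k):
    items = list(items)
    return [min(h64(i, x) for x in items) for i in range(k)]


def odd_sketch(full_sketch, z):
    """Map every tuple (i, min pi_i(A)) to one of z bits and keep only the parity."""
    bits = [0] * z
    for i, value in enumerate(full_sketch):
        j = h64(i, value, tag=b"odd") % z
        bits[j] ^= 1
    return bits


def odd_estimate(bits_a, bits_b, k):
    z = len(bits_a)
    differing = sum(a ^ b for a, b in zip(bits_a, bits_b))
    if 2 * differing >= z:
        # Assumption: treat a saturated sketch as an error rather than clamping.
        raise ValueError("too many differing bits for this z")
    return 1 + z / (4 * k) * math.log(1 - 2 * differing / z)


A, B, k, z = range(0, 400), range(20, 420), 256, 1024
oa = odd_sketch(minhash_sketch(A, k), z)
ob = odd_sketch(minhash_sketch(B, k), z)
estimate = odd_estimate(oa, ob, k)
print(estimate)  # true similarity is 380 / 420
```

As with $b$-bit MinHash, the odd sketch is derived from a complete MinHash sketch, which matters for the streaming discussion that follows.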

3.4. Discussion

MinHash can be directly applied to streaming data because the MinHash sketch can be computed incrementally. That is, one can compute the MinHash sketch of set $A\cup\{v\}$ from the MinHash sketch of set $A$ as

$$\min(\pi(A\cup\{v\}))=\min(\min(\pi(A)),\pi(v)).$$

The variants $b$-bit MinHash and Odd Sketch, in contrast, cannot be used to handle streaming sets. Let $\pi^{(b)}(v)$ denote the lowest $b$ bits of $\pi(v)$. Then, one can easily show that

$$\text{min}^{(b)}(\pi(A\cup\{v\}))\neq\min(\text{min}^{(b)}(\pi(A)),\pi^{(b)}(v)).$$

This shows that computing $\text{min}^{(b)}(\pi(A\cup\{v\}))$ requires the hash value $\pi(w)$ of each $w\in A\cup\{v\}$. In addition, $\min^{(b)}(\pi(A))$ cannot be approximated by $\min_{w\in A}\pi^{(b)}(w)$, which can be computed incrementally, because $\min_{w\in A}\pi^{(b)}(w)$ equals 0 with high probability when $|A|\gg 2^{b}$. Similarly, the odd sketch of a set cannot be computed incrementally. Therefore, both $b$-bit MinHash and Odd Sketch fail to deal with streaming sets.
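The contrast drawn in this discussion can be checked with a short Python experiment. The snippet assumes a seeded `hashlib.blake2b` digest as the permutation stand-in (`h64` and `low_bits` are illustrative names). It shows that a full MinHash register can be maintained one item at a time, gives a two-value counterexample for truncated registers, and shows the collapse of $\min_{w\in A}\pi^{(b)}(w)$ toward 0.

```python
import hashlib


def h64(seed, x):
    """Seeded 64-bit hash standing in for permutation pi_seed (an assumption)."""
    digest = hashlib.blake2b(x.to_bytes(8, "little"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def low_bits(x, b):
    """Lowest b bits of x."""
    return x & ((1 << b) - 1)


# 1) A full MinHash register can be updated incrementally.
A = list(range(10_000))
register = h64(0, A[0])
for v in A[1:]:
    register = min(register, h64(0, v))
assert register == min(h64(0, v) for v in A)

# 2) Counterexample for b = 2 with hash values 22 (0b10110) and 15 (0b01111).
x, y, b = 0b10110, 0b01111, 2
true_register = low_bits(min(x, y), b)               # lowest 2 bits of 15
naive_update = min(low_bits(x, b), low_bits(y, b))   # min of the truncated values
print(true_register, naive_update)  # 3 2

# 3) The minimum over truncated hashes collapses once |A| >> 2^b.
b = 4
naive_min = min(low_bits(h64(0, v), b) for v in A)
print(naive_min)  # 0 with overwhelming probability for 10,000 items and b = 4
```

The second case is the key point: once only the lowest $b$ bits are kept, the information needed to decide which element is the true minimum is gone.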

4. Our Method

4.1. Basic Idea

Let $h$ be a function that maps any element $v$ in $I$ to a random number in the range $(0,1)$, i.e., $h(v)\sim Uniform(0,1)$. Define the log-rank of $v$ with respect to hash function $h$ as $r(v)\leftarrow\lfloor-\log_{2}h(v)\rfloor$. We compute and store

$$\text{MaxLog}(h(A))=\max_{v\in A}r(v)=\max_{v\in A}\lfloor-\log_{2}h(v)\rfloor.$$

Let us now develop a simple yet accurate method to estimate Jaccard similarity of streaming sets based on the following properties of function $\text{MaxLog}(h(A))$ .

Observation 1. $\text{MaxLog}(h(A))$ can be represented by an integer of no more than $\lceil\log_{2}\log_{2}|I|\rceil$ bits with a high probability. For each $v\in I$ , we have $h(v)\sim Uniform(0,1)$ , and thus $r(v)\sim Geometric(1/2),$ supported on the set $\{0,1,2,\ldots\}$ , that is,

$$P(r(v)=j)=\frac{1}{2^{j+1}},\quad P(r(v)<j)=1-\frac{1}{2^{j}},\quad j\in\{0,1,2,\ldots\}.$$

Then, one can easily find that

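$$P\left(\text{MaxLog}(h(A))\leq 2^{\lceil\log_{2}\log_{2}|I|\rceil}-1\right)=\left(1-\frac{1}{2^{2^{\lceil\log_{2}\log_{2}|I|\rceil}}}\right)^{|A|}.$$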

For example, when $A\subseteq\{1,\ldots,2^{64}\}$ and $|A|\leq 2^{54}$ , we only require 6 bits to store $\text{MaxLog}(h(A))$ with probability at least 0.999.
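The numerical example in Observation 1 can be reproduced directly from the probability $\left(1-2^{-m}\right)^{|A|}$ with $m=2^{\lceil\log_{2}\log_{2}|I|\rceil}$. The Python snippet below is a small check; the function names are illustrative, and `log1p` is used only to keep the computation numerically stable for very small $2^{-m}$.

```python
import math


def register_bits(universe_log2):
    """ceil(log2 log2 |I|), given log2 |I| (e.g. 64 for 64-bit items)."""
    return math.ceil(math.log2(universe_log2))


def prob_no_overflow(universe_log2, set_size):
    """P(MaxLog(h(A)) fits in the register) = (1 - 2^-m)^{|A|}, m = 2^bits."""
    m = 2 ** register_bits(universe_log2)
    return math.exp(set_size * math.log1p(-(2.0 ** -m)))


print(register_bits(64))              # 6
print(prob_no_overflow(64, 2 ** 54))  # about 0.99902
```

For 64-bit items this gives 6 bits and a no-overflow probability just above 0.999 for $|A|=2^{54}$, matching the statement above.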

Observation 2. $\text{MaxLog}(h(A))$ can be computed incrementally. This is because

$$\text{MaxLog}(h(A\cup\{v\}))=\max(\text{MaxLog}(h(A)),\lfloor-\log_{2}h(v)\rfloor).$$

Observation 3. $J_{A,B}$ can be easily estimated from $\text{MaxLog}(h(A))$ and $\text{MaxLog}(h(B))$ with a little additional information. We find that

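$$\gamma=P(\text{MaxLog}(h(A))\neq\text{MaxLog}(h(B)))=\sum_{j=1}^{+\infty}\frac{|A\setminus B|}{2^{j+1}}\left(1-\frac{1}{2^{j+1}}\right)^{|A\setminus B|-1}\left(1-\frac{1}{2^{j}}\right)^{|B|}+\sum_{j=1}^{+\infty}\frac{|B\setminus A|}{2^{j+1}}\left(1-\frac{1}{2^{j+1}}\right)^{|B\setminus A|-1}\left(1-\frac{1}{2^{j}}\right)^{|A|}.$$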

Due to limited space, we omit the details of how $\gamma$ is derived. Similar to MinHash, we have $P(\max(h(A))\neq\max(h(B)))=1-J_{A,B}$. Therefore, we have $\gamma<1-J_{A,B}$. Although $\gamma$ can be estimated similarly to MinHash using $k$ hash functions $h_{1},\ldots,h_{k}$, that is,

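$$\mathbb{E}(\gamma)=\frac{\sum_{i=1}^{k}\mathbf{1}(\text{MaxLog}(h_{i}(A))\neq\text{MaxLog}(h_{i}(B)))}{k},$$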

unfortunately, it is difficult to compute $J_{A,B}$ from $\gamma$ . To solve this problem, we observe

$$P(\text{MaxLog}(h(A))\neq\text{MaxLog}(h(B))\wedge\delta_{A,B}=1)\approx 0.7213(1-J_{A,B}),$$

where $\delta_{A,B}=1$ indicates that there exists one and only one element in $A\cup B$ of which log-rank equals $\text{MaxLog}(h(A\cup B))$ .

Based on the above three observations, we propose to incrementally and accurately estimate the value of $P(\text{MaxLog}(h(A))\neq\text{MaxLog}(h(B))\wedge\delta_{A,B}=1)$ using $k$ hash functions $h_{1},\ldots,h_{k}$ . Then, we easily infer the value of $J_{A,B}$ .

4.2. Data Structure

The MaxLogHash sketch of a user $u$, i.e., $S_{u}$, consists of $k$ bit-strings, where each bit-string $S_{u}[i]$, $1\leq i\leq k$, has two components, $s_{u}[i]$ and $m_{u}[i]$, i.e., $S_{u}[i]=s_{u}[i]\parallel m_{u}[i]$. At any time $t$, $m_{u}[i]$ records the maximum hash value of the items in $I_{u}^{(t)}$ with respect to hash function $r_{i}(\cdot)=\lfloor-\log_{2}h_{i}(\cdot)\rfloor$, i.e., $m_{u}[i]=\max_{w\in I_{u}^{(t)}}r_{i}(w)$, where $I_{u}^{(t)}$ refers to the set of items that user $u$ connected to up to and including time $t$. The component $s_{u}[i]$ consists of 1 bit, and its value indicates whether there exists one and only one item $w\in I_{u}$ such that $r_{i}(w)=m_{u}[i]$. As mentioned above, we can use $\lceil\log_{2}\log_{2}|I|\rceil$ bits to record the value of $m_{u}[i]$ with a high probability (very close to 1). When $m_{u}[i]\geq 2^{\lceil\log_{2}\log_{2}|I|\rceil}$, we use a hash table to record the tuples $(u,i,m_{u}[i])$ for all users.
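One possible way to lay out a register $S_{u}[i]=s_{u}[i]\parallel m_{u}[i]$ in 7 bits, with a shared overflow table for rare large values, is sketched below in Python. The text does not specify how a register signals that its value lives in the hash table, so keeping a saturated $m$ field in that case is purely an assumption of this example, as are the names `store` and `load`.

```python
M_BITS = 6            # ceil(log2 log2 |I|) when items are 64-bit
LIMIT = 1 << M_BITS   # m values 0 .. 63 fit into the register itself


def store(registers, overflow, user, i, s, m):
    """Write s || m into register i; spill m to the overflow table if it is too large."""
    if m < LIMIT:
        registers[i] = (s << M_BITS) | m
    else:
        # Assumption: keep a saturated m field and put the true value in the table.
        registers[i] = (s << M_BITS) | (LIMIT - 1)
        overflow[(user, i)] = m


def load(registers, overflow, user, i):
    """Return (s, m), preferring the overflow table entry when one exists."""
    s = registers[i] >> M_BITS
    m = overflow.get((user, i), registers[i] & (LIMIT - 1))
    return s, m


registers, overflow = [0] * 4, {}
store(registers, overflow, "u", 0, 1, 5)    # fits: 1 || 000101
store(registers, overflow, "u", 1, 0, 70)   # too large: goes to the table
print(registers[:2], overflow)              # [69, 63] {('u', 1): 70}
print(load(registers, overflow, "u", 0), load(registers, overflow, "u", 1))
```

Because overflow is extremely unlikely for realistic set sizes, the table is expected to stay almost empty and the per-register cost stays at 7 bits.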

4.3. Update Procedure

For each user $u\in U$, when it first connects with an item $w$ in stream $\Pi$, we initialize the MaxLogHash sketch of user $u$ as $S_{u}[i]=1\parallel r_{i}(w)$, $i=1,\ldots,k$, where $r_{i}(w)=\lfloor-\log_{2}h_{i}(w)\rfloor$. That is, we set indicator $s_{u}[i]=1$ and register $m_{u}[i]=r_{i}(w)$. For any other item $v$ that user $u$ connects to after the first item $w$, i.e., a user-item pair $(u,v)$ occurring on stream $\Pi$ after the user-item pair $(u,w)$, we update the sketch as follows. We first compute the log-rank of item $v$, i.e., $r_{i}(v)=\lfloor-\log_{2}h_{i}(v)\rfloor$, $i=1,\ldots,k$. When $r_{i}(v)$ is smaller than $m_{u}[i]$, we perform no further operation for the pair $(u,v)$. When $r_{i}(v)=m_{u}[i]$, at least two items in $I_{u}$ have the log-rank value $m_{u}[i]$, so we simply set $s_{u}[i]=0$. When $r_{i}(v)>m_{u}[i]$, we set $S_{u}[i]=1\parallel r_{i}(v)$.
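The update procedure above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: a seeded `hashlib.blake2b` digest stands in for $h_{i}$, the log-rank is taken as the number of leading zero bits of the 64-bit hash (which has the same geometric distribution as $\lfloor-\log_{2}h_{i}(v)\rfloor$ up to a cap at 64), and an empty register is represented by $m=-1$ so that the first item triggers the initialization case.

```python
import hashlib
import random


def log_rank(i, v):
    """Stand-in for r_i(v): leading zeros of a seeded 64-bit hash of item v."""
    digest = hashlib.blake2b(v.to_bytes(8, "little"), digest_size=8,
                             salt=i.to_bytes(8, "little")).digest()
    return 64 - int.from_bytes(digest, "little").bit_length()


class MaxLogHash:
    """k registers, each holding an indicator bit s[i] and a max log-rank m[i]."""

    def __init__(self, k):
        self.m = [-1] * k  # -1 marks an empty register (assumption of this sketch)
        self.s = [0] * k

    def add(self, v):
        """Process one item; duplicates are assumed to be filtered beforehand."""
        for i in range(len(self.m)):
            r = log_rank(i, v)
            if r > self.m[i]:       # new strict maximum: S[i] = 1 || r
                self.m[i], self.s[i] = r, 1
            elif r == self.m[i]:    # a second item reaches the maximum
                self.s[i] = 0
            # r < m[i]: nothing to do


items = list(range(500))
forward = MaxLogHash(64)
for v in items:
    forward.add(v)

random.Random(7).shuffle(items)
shuffled = MaxLogHash(64)
for v in items:
    shuffled.add(v)

print(forward.m == shuffled.m and forward.s == shuffled.s)  # True
```

The final check illustrates that the sketch depends only on the set of items and not on their arrival order, which is what makes it suitable for streams.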

4.4. Jaccard Similarity Estimation

Define variables

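$$\chi_{u_{1},u_{2}}[i]=\mathbf{1}(m_{u_{1}}[i]\neq m_{u_{2}}[i]),\quad i=1,\ldots,k,$$

$$\psi_{u_{1},u_{2}}[i]=\begin{cases}s_{u_{1}}[i],&m_{u_{1}}[i]>m_{u_{2}}[i]\\ s_{u_{2}}[i],&m_{u_{1}}[i]<m_{u_{2}}[i]\\ -1,&m_{u_{1}}[i]=m_{u_{2}}[i].\end{cases}$$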

Let $\delta_{u_{1},u_{2}}[i]=\mathbf{1}(\chi_{u_{1},u_{2}}[i]=1)\mathbf{1}(\psi_{u_{1},u_{2}}[i]=1)$ . Note that $\delta_{u_{1},u_{2}}[i]=1$ indicates that there exists one and only one element in set $\cup(u_{1},u_{2})$ of which log-rank equals $\max_{w\in\cup(u_{1},u_{2})}r_{i}(w)$ with respect to function $r_{i}$ . Then, we have the following theorem.

Theorem 1.

For non-empty sets $I_{u_{1}}$ and $I_{u_{2}}$ , we have $P(\delta_{u_{1},u_{2}}[i]=1)=0$ , $i=1,\ldots,k$ , when $|\cup(u_{1},u_{2})|=1$ . Otherwise, we have

$$P(\delta_{u_{1},u_{2}}[i]=1)=\alpha_{|\cup(u_{1},u_{2})|}(1-J_{u_{1},u_{2}}),\quad i=1,\ldots,k,$$

where $\alpha_{n}=n\sum_{j=1}^{+\infty}\frac{1}{2^{j+1}}\left(1-\frac{1}{2^{j}}\right)^{n-1},\quad n\geq 2.$

Proof. Let $r^{*}$ be the maximum log-rank of all items in $\cup(u_{1},u_{2})$. When two items $w$ and $v$ in $I_{u_{1}}$ or $I_{u_{2}}$ have the log-rank value $r^{*}$, we easily find that $\psi_{u_{1},u_{2}}[i]=0$. When only one item $w$ in $I_{u_{1}}$ and only one item $v$ in $I_{u_{2}}$ have the log-rank value $r^{*}$, we easily find that $\chi_{u_{1},u_{2}}[i]=0$. Let

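$$\Delta(u_{1},u_{2})=(I_{u_{1}}\setminus I_{u_{2}})\cup(I_{u_{2}}\setminus I_{u_{1}})=\cup(u_{1},u_{2})\setminus\cap(u_{1},u_{2}).$$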

Then, we find that event $\chi_{u_{1},u_{2}}[i]=1\wedge\psi_{u_{1},u_{2}}[i]=1$ happens (i.e., $\delta_{u_{1},u_{2}}[i]=1$ ) only when one item $w$ in $\Delta(u_{1},u_{2})$ has a log-rank value larger than all items in $\cup(u_{1},u_{2})\setminus\{w\}$ . For any item $v\in I$ , we have $h_{i}(v)\sim Uniform(0,1)$ and so $r_{i}(v)\sim Geometric(1/2)$ , supported on the set $\{0,1,2,\ldots\}$ . Based on the above observations, when $|\cup(u_{1},u_{2})|\geq 2$ , we have

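$$P(\delta_{u_{1},u_{2}}[i]=1\wedge r^{*}=j)=\sum_{w\in\Delta(u_{1},u_{2})}P(r_{i}(w)=j)\prod_{v\in\cup(u_{1},u_{2})\setminus\{w\}}P(r_{i}(v)<j)=\frac{|\Delta(u_{1},u_{2})|}{2^{j+1}}\left(1-\frac{1}{2^{j}}\right)^{|\cup(u_{1},u_{2})|-1}.$$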

Therefore, we have

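$$P(\delta_{u_{1},u_{2}}[i]=1)=\sum_{j=0}^{+\infty}P(\delta_{u_{1},u_{2}}[i]=1\wedge r^{*}=j)=\sum_{j=0}^{+\infty}\frac{|\Delta(u_{1},u_{2})|}{2^{j+1}}\left(1-\frac{1}{2^{j}}\right)^{|\cup(u_{1},u_{2})|-1}=\sum_{j=1}^{+\infty}\frac{|\Delta(u_{1},u_{2})|}{|\cup(u_{1},u_{2})|}\cdot\frac{|\cup(u_{1},u_{2})|}{2^{j+1}}\left(1-\frac{1}{2^{j}}\right)^{|\cup(u_{1},u_{2})|-1}=\alpha_{|\cup(u_{1},u_{2})|}(1-J_{u_{1},u_{2}}),$$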

where the last equation holds because $|\Delta(u_{1},u_{2})|=|\cup(u_{1},u_{2})|-|\cap(u_{1},u_{2})|$ . $\square$

Define variable $\hat{k}=\sum_{i=1}^{k}\mathbf{1}(\delta_{u_{1},u_{2}}[i]=1).$ From Theorem 1, the expectation of $\hat{k}$ is computed as

$$\mathbb{E}(\hat{k})=k\alpha_{|\cup(u_{1},u_{2})|}(1-J_{u_{1},u_{2}}).$$

Therefore, we have

$$J_{u_{1},u_{2}}=1-\frac{\mathbb{E}(\hat{k})}{k\alpha_{|\cup(u_{1},u_{2})|}}.$$

Note that the cardinality of set $\cup(u_{1},u_{2})$ (i.e. $|\cup(u_{1},u_{2})|$ ) is unknown. To solve this challenge, we find that

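$$\alpha_{n}=\frac{n}{2}\sum_{j=1}^{+\infty}\frac{1}{2^{j}}\left(1-\frac{1}{2^{j}}\right)^{n-1}=\frac{n}{2}\sum_{j=1}^{+\infty}\frac{1}{2^{j}}\sum_{l=0}^{n-1}\binom{n-1}{l}\left(-\frac{1}{2^{j}}\right)^{n-l-1}=\frac{n}{2}\sum_{l=0}^{n-1}(-1)^{n-l-1}\binom{n-1}{l}\sum_{j=1}^{+\infty}\frac{1}{2^{j(n-l)}}=\frac{n}{2}\sum_{l=0}^{n-1}(-1)^{n-l-1}\binom{n-1}{l}\frac{1}{2^{n-l}-1}.$$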
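The behaviour of $\alpha_{n}$ can be explored numerically. The Python snippet below evaluates both the series from Theorem 1 and the finite alternating sum obtained by binomial expansion; the function names are illustrative. The closed form is computed with exact fractions because the alternating sum cancels badly in floating point for large $n$, so it is only used here for small $n$.

```python
from fractions import Fraction
from math import comb


def alpha_series(n, terms=200):
    """alpha_n = n * sum_{j>=1} 2^-(j+1) * (1 - 2^-j)^(n-1), truncated after `terms`."""
    return n * sum(2.0 ** -(j + 1) * (1 - 2.0 ** -j) ** (n - 1)
                   for j in range(1, terms + 1))


def alpha_closed(n):
    """Finite alternating sum for alpha_n, evaluated exactly."""
    total = sum(Fraction((-1) ** (n - l - 1) * comb(n - 1, l), 2 ** (n - l) - 1)
                for l in range(n))
    return Fraction(n, 2) * total


print(alpha_closed(2), alpha_closed(3))  # 2/3 5/7
for n in (2, 3, 10, 100, 1000):
    print(n, round(alpha_series(n), 4))
```

The values start at $2/3$ for $n=2$ and $5/7$ for $n=3$ and quickly settle near the constant $0.7213$ used by the estimator.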

Figure 1 shows the value of $\alpha_{n}$ for $n=2,3,\ldots$. We easily find that $\alpha_{n}\approx\alpha=0.7213$ when $n\geq 2$. Therefore, we estimate $J_{u_{1},u_{2}}$ as

$$\hat{J}_{u_{1},u_{2}}=1-\frac{\hat{k}}{k\alpha}.$$
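Putting the definitions of $\chi$, $\psi$, $\delta$, and $\hat{k}$ together, the estimator can be sketched in Python as below. The hash stand-in (`hashlib.blake2b` with the register index as salt, log-rank as the number of leading zero bits) and the helper names `build_sketch` and `estimate_jaccard` are assumptions of this example and not the paper's implementation.

```python
import hashlib

ALPHA = 0.7213  # the constant alpha used by the estimator


def log_rank(i, v):
    """Stand-in for r_i(v): leading zeros of a seeded 64-bit hash of item v."""
    digest = hashlib.blake2b(v.to_bytes(8, "little"), digest_size=8,
                             salt=i.to_bytes(8, "little")).digest()
    return 64 - int.from_bytes(digest, "little").bit_length()


def build_sketch(items, k):
    """Return (m, s): max log-ranks and uniqueness indicators for each register."""
    m, s = [-1] * k, [0] * k
    for v in items:
        for i in range(k):
            r = log_rank(i, v)
            if r > m[i]:
                m[i], s[i] = r, 1
            elif r == m[i]:
                s[i] = 0
    return m, s


def estimate_jaccard(sketch_1, sketch_2, alpha=ALPHA):
    """J_hat = 1 - k_hat / (k * alpha), where k_hat counts registers with delta = 1."""
    (m1, s1), (m2, s2) = sketch_1, sketch_2
    k = len(m1)
    k_hat = 0
    for i in range(k):
        if m1[i] > m2[i]:
            k_hat += s1[i]   # psi = s_{u1}[i]
        elif m1[i] < m2[i]:
            k_hat += s2[i]   # psi = s_{u2}[i]
        # equal registers give chi = 0 and contribute nothing
    return 1 - k_hat / (k * alpha)


A, B, k = range(0, 400), range(20, 420), 256
estimate = estimate_jaccard(build_sketch(A, k), build_sketch(B, k))
print(estimate)  # true similarity is 380 / 420
```

Estimation touches each of the $k$ registers once, so its cost is comparable to that of the MinHash estimator.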

4.5. Error Analysis

The error of our method MaxLogHash is shown in the following theorem.

Theorem 2.

For any users $u_{1},u_{2}\in U$ , we have

$$\mathbb{E}(\hat{J}_{u_{1},u_{2}})-J_{u_{1},u_{2}}=(1-\beta_{|\cup(u_{1},u_{2})|})(1-J_{u_{1},u_{2}}),$$

where $\beta_{n}=\frac{\alpha_{n}}{\alpha}$ . The variance of $\hat{J}_{u_{1},u_{2}}$ is computed as

[TABLE]

When $|\cup(u_{1},u_{2})|\geq 3$ , we have $|\beta_{|\cup(u_{1},u_{2})|}-1|\leq 0.01$ , and so $\mathbb{E}(\hat{J}_{u_{1},u_{2}})\approx J_{u_{1},u_{2}}$ and $\text{Var}(\hat{J}_{u_{1},u_{2}})\approx\frac{(1-J_{u_{1},u_{2}})(J_{u_{1},u_{2}}+0.3864)}{k}$ .

Proof. From equation (1), we easily have

[TABLE]

To derive $\text{Var}(\hat{J}_{u_{1},u_{2}})$ , we first compute

[TABLE]

Then, we have

[TABLE]

From the definition of $\hat{J}_{u_{1},u_{2}}$ , we have

[TABLE]

Then, we easily obtain a closed-form formula for $\text{Var}(\hat{J}_{u_{1},u_{2}})$ from equation (4.5). $\square$
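The approximate variance stated in Theorem 2 can be turned into a rule of thumb for choosing $k$. The Python sketch below assumes $\beta_{n}\approx 1$, in which case $\hat{k}$ behaves like a binomial count with success probability $\alpha(1-J)$ and the variance simplifies to $(1-J)(J+1/\alpha-1)/k$, with $1/\alpha-1\approx 0.3864$. The function names are illustrative, and the MinHash variance is included only for side-by-side comparison.

```python
import math

ALPHA = 0.7213


def maxlog_variance(j, k, alpha=ALPHA):
    """Approximate Var(J_hat) for MaxLogHash, valid when beta_n is close to 1."""
    return (1 - j) * (j + 1 / alpha - 1) / k


def minhash_variance(j, k):
    """Var(J_hat) for plain MinHash."""
    return j * (1 - j) / k


def registers_needed(j, target_std, alpha=ALPHA):
    """Smallest k whose approximate standard deviation is at most target_std."""
    return math.ceil((1 - j) * (j + 1 / alpha - 1) / target_std ** 2)


print(registers_needed(0.9, 0.01))  # 1287
print(maxlog_variance(0.9, 1287) ** 0.5, minhash_variance(0.9, 1287) ** 0.5)
```

For the same $k$ the MaxLogHash variance is somewhat larger than that of MinHash, so the memory advantage comes from the much smaller registers rather than from needing fewer of them.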

4.6. Reduce Processing Complexity

Inspired by OPH (one permutation hashing) (Linips2012, ), which significantly reduces the time complexity of MinHash for processing each element in the set, we can use a hash function that splits the items in $I_{u}$ into $k$ registers at random. Each register $S_{u}[i]$ , $1\leq i\leq k$ , records $\text{MaxLog}(h(\{v:v\in I_{u}\wedge h(v)=i\}))$ together with the value of the indicator $s_{u}[i]$ , as in the regular MaxLogHash method. We name this extension MaxLogOPH. MaxLogOPH reduces the time complexity of processing each item from $O(k)$ to $O(1)$ . When $|\cup(u_{1},u_{2})|\gg k$ , our experiments demonstrate that MaxLogOPH is comparable to MaxLogHash in terms of accuracy.
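The following is a minimal sketch of how such a single-hash variant could look; it is not the authors' implementation. Assumptions made here: BLAKE2b stands in for the hash functions, the log-rank is taken as the number of leading zero bits of a 64-bit hash (which is Geometric(1/2) on $\{0,1,2,\ldots\}$), each distinct item arrives at most once, and the estimator has the form $1-\hat{k}/(k\alpha)$. The class and function names are hypothetical.

```python
import hashlib

ALPHA = 0.7213  # constant quoted in Section 4


def _hash64(item: int, salt: bytes) -> int:
    """64-bit hash of an item; BLAKE2b is an arbitrary choice for this sketch."""
    digest = hashlib.blake2b(str(item).encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


class MaxLogOPHSketch:
    """Hypothetical single-hash variant: each item updates exactly one register."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.max_rank = [-1] * k   # largest log-rank seen per register (-1 = empty)
        self.unique = [False] * k  # True if that maximum was reached by one item only

    def update(self, item: int) -> None:
        """O(1) update; assumes each distinct item arrives at most once."""
        i = _hash64(item, b"bucket") % self.k
        # Leading zero bits of a 64-bit hash: Geometric(1/2) on {0, 1, 2, ...}.
        rank = 64 - _hash64(item, b"rank").bit_length()
        if rank > self.max_rank[i]:
            self.max_rank[i], self.unique[i] = rank, True
        elif rank == self.max_rank[i]:
            self.unique[i] = False


def estimate_jaccard(a: MaxLogOPHSketch, b: MaxLogOPHSketch) -> float:
    """Count registers whose strictly larger side holds a unique maximum."""
    k_hat = 0
    for i in range(a.k):
        if a.max_rank[i] > b.max_rank[i] and a.unique[i]:
            k_hat += 1
        elif b.max_rank[i] > a.max_rank[i] and b.unique[i]:
            k_hat += 1
    return 1.0 - k_hat / (a.k * ALPHA)


if __name__ == "__main__":
    sa, sb = MaxLogOPHSketch(256), MaxLogOPHSketch(256)
    for v in range(0, 10_000):
        sa.update(v)
    for v in range(500, 10_500):
        sb.update(v)
    print(estimate_jaccard(sa, sb))  # true Jaccard similarity is 9500 / 10500
```

A register counts towards $\hat{k}$ only when one side is strictly larger and its maximum is unique, which corresponds to the event that a single item of the symmetric difference has the largest log-rank in the union.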

5. Evaluation

The algorithms are implemented in Python, and run on a computer with a Quad-Core Intel(R) Xeon(R) E3-1226 v3 3.30GHz processor. To support the reproducibility of the experimental results, we make our source code publicly available (http://nskeylab.xjtu.edu.cn/dataset/phwang/code/MaxLog.zip).

5.1. Datasets

For simplicity, we assume that elements in sets are 32-bit numbers, i.e., $I=\{0,1,\ldots,2^{32}-1\}$ . We evaluate the performance of our method MaxLogHash on a variety of datasets.

Synthetic datasets. Our synthetic datasets consist of set-pairs $A$ and $B$ with various cardinalities and Jaccard similarities. We conduct our experiments on the following two different settings:

$\bullet$ Balanced set-pairs (i.e., $|A|=|B|$ ). We set $|A|=|B|=n$ and vary $J_{A,B}$ in $\{0.80,0.81,...,1.00\}$ . Specifically, we generate set $A$ by randomly selecting $n$ different numbers from $I$ and generate set $B$ by randomly selecting $|A\cap B|=\frac{2J_{A,B}|A|}{1+J_{A,B}}$ different numbers from set $A$ and $n-|A\cap B|$ different numbers from set $I\setminus A$ . The intersection size follows from $J_{A,B}=\frac{|A\cap B|}{2n-|A\cap B|}$ . In our experiments, we set $n=10,000$ by default.

$\bullet$ Unbalanced set-pairs (i.e., $|A|\neq|B|$ ). We set $|A|=n$ and $|B|=J_{A,B}n$ , where we vary $J_{A,B}\in\{0.80,0.81,...,0.99\}$ . Specifically, we generate set $A$ by randomly selecting $n$ different numbers from $I$ and generate set $B$ by selecting $J_{A,B}n$ different elements from $A$ .
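The two generation procedures can be sketched as follows. This is an illustrative reconstruction, not the released code. For the balanced case the intersection size $c$ is derived from $J=c/(2n-c)$, which gives $c=2Jn/(1+J)$; rounding to an integer is our choice, and the function names are hypothetical.

```python
import random

UNIVERSE = 2 ** 32  # items are 32-bit numbers


def balanced_pair(n: int, jaccard: float, rng: random.Random):
    """Return sets A, B with |A| = |B| = n and Jaccard similarity close to `jaccard`."""
    common = round(2 * jaccard * n / (1 + jaccard))  # from J = c / (2n - c)
    a = rng.sample(range(UNIVERSE), n)
    a_set = set(a)
    b = set(rng.sample(a, common))
    while len(b) < n:  # fill B with items outside A (rejection sampling)
        v = rng.randrange(UNIVERSE)
        if v not in a_set:
            b.add(v)
    return a_set, b


def unbalanced_pair(n: int, jaccard: float, rng: random.Random):
    """Return A with |A| = n and a subset B of A with |B| = round(jaccard * n)."""
    a = rng.sample(range(UNIVERSE), n)
    b = set(rng.sample(a, round(jaccard * n)))
    return set(a), b


def exact_jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity, used to verify the generated pair."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    rng = random.Random(0)
    a, b = balanced_pair(10_000, 0.9, rng)
    print(len(a), len(b), exact_jaccard(a, b))
```

Because the intersection size is rounded, the realized similarity of a balanced pair differs slightly from the target, so it is worth computing the exact value when reporting errors.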

Real-world datasets. Similar to (MitzenmacherWWW14, ), we evaluate the performance of our method on the detection of item-pairs (e.g., pairs of products) that always appear together in the same records (e.g., transactions). We conduct experiments on two real-world datasets (http://fimi.ua.ac.be/data/): MUSHROOM and CONNECT, which are also used in (MitzenmacherWWW14, ). We generate a stream of item-record pairs for each dataset, where a record can be viewed as a transaction and items in the same record can be viewed as products bought together. For each record $x$ in the dataset of interest and every item $w$ in $x$ , we append an element $(w,x)$ to the stream of item-record pairs. In summary, MUSHROOM and CONNECT have $8,124$ and $67,557$ records, $119$ and $127$ distinct items, and $186,852$ and $2,904,951$ item-record pairs, respectively.
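The stream construction can be illustrated with a small sketch. It assumes that a record is identified by its position in the dataset and that records are given as lists of items; the helper names and the toy records are hypothetical and not part of the datasets above.

```python
from collections import defaultdict
from itertools import combinations


def item_record_stream(records):
    """Yield one (item, record_id) pair for every item of every record."""
    for record_id, items in enumerate(records):
        for item in items:
            yield item, record_id


def record_sets(stream):
    """Group the stream by item: item -> set of ids of records containing it."""
    sets = defaultdict(set)
    for item, record_id in stream:
        sets[item].add(record_id)
    return sets


def exact_pairwise_jaccard(sets):
    """Exact Jaccard similarity of the record-sets of every unordered item pair."""
    return {
        (u, v): len(sets[u] & sets[v]) / len(sets[u] | sets[v])
        for u, v in combinations(sorted(sets), 2)
    }


if __name__ == "__main__":
    toy_records = [["bread", "milk"], ["bread", "milk", "eggs"], ["eggs"]]
    sims = exact_pairwise_jaccard(record_sets(item_record_stream(toy_records)))
    print(sims[("bread", "milk")])  # bread and milk share both of their records
```

In the streaming setting the exact record-sets would be replaced by one sketch per item, updated once per arriving pair; the exact version here only serves as ground truth.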

5.2. Baselines

Our methods use $k$ $6$ -bit registers to build a sketch for each set. We compare our methods with the following state-of-the-art methods:

$\bullet$ MinHash (Broder2000, ). MinHash builds a sketch for each set. A MinHash sketch consists of $k$ 32-bit registers.

$\bullet$ HyperLogLog (FlajoletAOFA07, ). A HyperLogLog sketch consists of $k$ 5-bit registers, and is originally designed for estimating a set’s cardinality. One can easily obtain a HyperLogLog sketch of $A\cup B$ by merging the HyperLogLog sketches of sets $A$ and $B$ and then use the sketch to estimate $|A\cup B|$ . Therefore, HyperLogLog can also be used to estimate $J_{A,B}$ by approximating $\frac{|A|+|B|-|A\cup B|}{|A\cup B|}$ .

$\bullet$ HyperMinHash (YuArxiv2017, ). A HyperMinHash sketch consists of $k$ $q$ -bit registers and $k$ $r$ -bit registers. The first $k$ $q$ -bit registers can be viewed as a HyperLogLog sketch. To guarantee the performance for large sets (including up to $2^{32}$ elements), we set $q=5$ .
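The inclusion-exclusion step used with the HyperLogLog baseline is simple enough to write out. The sketch below takes cardinality estimates as plain numbers and does not implement HyperLogLog itself; the function name and the example figures are hypothetical.

```python
def jaccard_from_cardinalities(card_a: float, card_b: float, card_union: float) -> float:
    """Inclusion-exclusion: J = (|A| + |B| - |A u B|) / |A u B|."""
    return (card_a + card_b - card_union) / card_union


if __name__ == "__main__":
    # Exact cardinalities: |A| = |B| = 1000 and |A u B| = 1100, so |A n B| = 900.
    exact = jaccard_from_cardinalities(1000, 1000, 1100)
    # The same pair when only the union cardinality is overestimated by 2%.
    perturbed = jaccard_from_cardinalities(1000, 1000, 1100 * 1.02)
    print(exact, perturbed)
```

The union estimate enters both the numerator and the denominator, so its error is amplified in the resulting similarity, which is one reason this baseline is sensitive to the accuracy of the cardinality estimator.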

5.3. Metrics

We evaluate both efficiency and effectiveness of our methods in comparison with the above baseline methods. For efficiency, we evaluate the running time of all methods. Specifically, we study the time for updating each set element and the time for estimating set similarities. The update time determines the maximum throughput that a method can handle, and the estimation time determines the delay in querying the similarity of set-pairs. For effectiveness, we evaluate the error of the estimate $\hat{J}$ with respect to its true value $J$ using two metrics, bias and root mean square error (RMSE), i.e., $\text{Bias}(\hat{J})=\mathbb{E}(\hat{J})-J$ and $\text{RMSE}(\hat{J})=\sqrt{\mathbb{E}((\hat{J}-J)^{2})}$ . Our experimental results are empirically computed from $1,000$ independent runs by default. We further evaluate our method on the detection of association rules, and use precision and recall to evaluate the performance.
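The two error metrics can be computed empirically from repeated runs as in the following sketch, where the expectations are replaced by sample means over the runs. The function names are hypothetical.

```python
import math


def bias(estimates, true_value: float) -> float:
    """Empirical bias: mean of the estimates minus the true value."""
    return sum(estimates) / len(estimates) - true_value


def rmse(estimates, true_value: float) -> float:
    """Empirical root mean square error of the estimates."""
    return math.sqrt(sum((e - true_value) ** 2 for e in estimates) / len(estimates))


if __name__ == "__main__":
    runs = [0.8, 1.0]  # two toy estimates of a true similarity of 0.9
    print(bias(runs, 0.9), rmse(runs, 0.9))
```

An estimator can have near-zero bias and still a large RMSE, as in the toy example where the two errors cancel in the mean, so both metrics are needed.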

5.4. Accuracy of Similarity Estimation

MaxLogHash vs MinHash and HyperMinHash. From Figures 2 (a)-(d), we see that our method MaxLogHash gives comparable results to MinHash and HyperMinHash with $r=4$ . Specifically, the RMSEs of these three methods differ by less than $0.006$ and continually decrease as the similarity increases. The RMSE of HyperMinHash with $r=1$ significantly increases as $J_{A,B}$ increases. This large estimation error occurs because HyperMinHash exhibits a large estimation bias. Figures 2 (e)-(h) show the bias of our method MaxLogHash in comparison with MinHash and HyperMinHash. We see that the empirical biases of MaxLogHash and MinHash are both very small and no systematic biases can be observed. However, HyperMinHash with $r=1$ shows a significant bias, and the magnitude of its bias increases as the similarity value increases. To be more specific, its bias changes from $-0.06$ to $-0.089$ when the similarity increases from $0.80$ to $0.99$ . One can increase $r$ to reduce the bias of HyperMinHash. However, HyperMinHash with a large $r$ requires more memory space. For example, HyperMinHash with $r=4$ has comparable accuracy but requires $1.5$ times as much memory space as our method MaxLogHash. Compared with MinHash, MaxLogHash gives a $5.4$ times reduction in memory usage while achieving a similar estimation accuracy. Later in Section 5.6, we show that our method MaxLogHash has a computational cost similar to MinHash, but is several orders of magnitude faster than HyperMinHash when estimating set similarities.

MaxLogHash vs HyperLogLog. To make a fair comparison, we allocate the same amount of memory space, $m$ bits, to each of MaxLogHash and HyperLogLog. As discussed in Section 4, an attractive property of our method MaxLogHash is that its estimation error is almost independent of the cardinalities of sets $A$ and $B$ , which does not hold for HyperLogLog. Figure 3 shows the RMSEs of MaxLogHash and HyperLogLog on sets of different sizes. We see that the RMSE of our method MaxLogHash is almost constant. Figures 3 (a) and (b) show that the performance of HyperLogLog suddenly degrades when $m=2^{9}$ and the cardinalities of $A$ and $B$ are around $200$ , because HyperLogLog uses two different estimators for cardinalities within two different ranges (FlajoletAOFA07, ). As a result, our method MaxLogHash decreases the RMSE of HyperLogLog by up to $36\%$ . Similarly, as shown in Figures 3 (c) and (d), the RMSE of our method MaxLogHash is about 2.5 times smaller than that of HyperLogLog when $m=2^{10}$ and the cardinalities of $A$ and $B$ are around $500$ .

MaxLogHash vs MaxLogOPH. As discussed in Section 4.6, the estimation error of MaxLogOPH is comparable to that of MaxLogHash when $k$ is far smaller than the cardinalities of the two sets of interest. We compare MaxLogOPH with MaxLogHash on sets with increasing cardinalities to provide some insights. As shown in Figure 4, MaxLogOPH exhibits relatively large estimation errors for small cardinalities. When $k=128$ and the cardinality increases to $200$ (about $2k$ ), we see that MaxLogOPH achieves similar accuracy to MaxLogHash. As shown later in Section 5.6, MaxLogOPH significantly accelerates the speed of updating elements compared with MaxLogHash.

5.5. Accuracy of Association Rule Learning

In this experiment, we evaluate the performance of our method MaxLogHash, MinHash, and HyperMinHash on the detection of items (e.g., products) that almost always appear together in the same records (e.g., transactions). We conduct the experiments on real-world datasets: MUSHROOM and CONNECT. We first estimate all pairwise similarities among items’ record-sets, and retrieve every pair of record-sets with similarity $J>J_{0}$ . As discussed previously (results in Figure 3), HyperLogLog is not robust, because it exhibits large estimation errors for sets of particular sizes. Therefore, in what follows we compare our method MaxLogHash only with MinHash and HyperMinHash. As shown in Figure 5, MaxLogHash gives comparable precision and recall to MinHash and HyperMinHash with $r=4$ . We note that MaxLogHash gives up to $5.4$ and $2.4$ times reduction in memory usage in comparison with MinHash and HyperMinHash respectively.
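The retrieval step and its evaluation can be sketched as follows. The code assumes that true and estimated similarities are available as dictionaries keyed by item pair, and it uses the convention (our choice) that precision or recall is 1.0 when the corresponding set is empty. The function name and the toy values are hypothetical.

```python
def precision_recall(true_sim: dict, est_sim: dict, threshold: float):
    """Precision and recall of the pairs retrieved with estimated similarity > threshold."""
    relevant = {pair for pair, j in true_sim.items() if j > threshold}
    retrieved = {pair for pair, j in est_sim.items() if j > threshold}
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall


if __name__ == "__main__":
    true_sim = {("a", "b"): 0.95, ("a", "c"): 0.50, ("b", "c"): 0.92}
    est_sim = {("a", "b"): 0.93, ("a", "c"): 0.91, ("b", "c"): 0.85}
    print(precision_recall(true_sim, est_sim, 0.9))
```

In the toy example one of the two retrieved pairs is truly above the threshold and one of the two truly similar pairs is retrieved, so an overestimate costs precision and an underestimate costs recall.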

5.6. Efficiency

We further evaluate the efficiency of our method MaxLogHash and its extension MaxLogOPH in comparison with MinHash, HyperMinHash, and HyperLogLog. Specifically, we present the time for updating each incoming element and the time for computing Jaccard similarity. We conduct experiments on synthetic balanced datasets. We omit the similar results for real-world datasets and synthetic unbalanced datasets. Figure 6 (a) shows that the update time of MaxLogOPH and HyperLogLog is almost constant and that our method outperforms the other baselines. The update time of HyperMinHash is almost independent of its parameter $r$ , and thus we only plot the curve for $r=1$ . Specifically, MaxLogOPH is about $2$ and $420$ times faster than HyperMinHash and MinHash, respectively. Figure 6 (b) shows that our methods MaxLogHash and MaxLogOPH have estimation time similar to MinHash, while they are about $10$ times faster than HyperLogLog and 4 to 5 orders of magnitude faster than HyperMinHash.

6. Related Work

Jaccard similarity estimation for static sets. Broder et al. (Broder2000, ) proposed the first sketch method MinHash to compute the Jaccard similarity of sets, which builds a sketch consisting of $k$ registers for each set. To reduce the amount of memory space required for MinHash, (PingWWW2010, ; MitzenmacherWWW14, ) developed methods $b$ -bit MinHash and Odd Sketch, which are dozens of times more memory efficient than the original MinHash. The basic idea behind $b$ -bit MinHash and Odd Sketch is to use probabilistic methods such as sampling and bitmap sketching to build a compact digest for each set’s MinHash sketch. Recently, several methods (Linips2012, ; ShrivastavaUAI2014, ; ShrivastavaICML2014, ; ShrivastavaICML2017, ) were proposed to reduce the time complexity of processing each element in a set from $O(k)$ to $O(1)$ .

Weighted similarity estimation for static vectors. SimHash (or, sign normal random projections) (CharikarSTOC2002, ) was developed for approximating angle similarity (i.e., cosine similarity) of weighted vectors. CWS (Manasse2010, ; HaeuplerMT2014, ), ICWS (IoffeICDM2010, ), 0-bit CWS (LiKDD2015, ), CCWS (WuICDM2016, ), Weighted MinHash (ShrivastavaNIPS2016, ), PCWS (WuWWW2017, ), and BagMinHash (Ertl2018, ) were developed for approximating generalized Jaccard similarity of weighted vectors (the Jaccard similarity between two positive real-valued vectors $\vec{x}=(x_{1},x_{2},\ldots,x_{p})$ and $\vec{y}=(y_{1},y_{2},\ldots,y_{p})$ is defined as $J(\vec{x},\vec{y})=\frac{\sum_{1\leq j\leq p}\min(x_{j},y_{j})}{\sum_{1\leq j\leq p}\max(x_{j},y_{j})}$ ), and Datar et al. (DatarSOCG2004, ) developed an LSH method using $p$ -stable distribution for estimating $l_{p}$ distance for weighted vectors, where $0<p\leq 2$ . Campagna and Pagh (CampagnaKAIS2012, ) developed a biased sampling method for estimating a variety of set similarity measures beyond Jaccard similarity.

Similarity estimation for data streams. The above weighted similarity estimation methods fail to deal with streaming weighted vectors, where the elements of the vectors arrive in a streaming fashion. To solve this problem, Kutzkov et al. (KutzkovCIKM2015, ) extended AMS sketch (AlonSTOC1996, ) for the estimation of cosine similarity and Pearson correlation in streaming weighted vectors. Yang et al. (YangICDM2017, ) developed a streaming method HistoSketch for approximating Jaccard similarity with concept drift. Set intersection cardinality (i.e., the number of common elements in two sets) is also a popular metric for evaluating the similarity in sets. A variety of sketch methods such as LPC (Whang1990, ), FM (Flajolet1985, ), LogLog (Durand2003, ), HyperLogLog (FlajoletAOFA07, ), HLL-TailCut+ (XiaoZC17, ), and MinCount (GiroireDAMNew2009, ) were proposed to estimate the stream cardinality (i.e., the number of distinct elements in the stream), and can be easily extended to estimate $|A\cup B|$ by merging the sketches of sets $A$ and $B$ . Then, one can approximate $|A\cap B|$ because $|A\cap B|=|A|+|B|-|A\cup B|$ . To further improve the estimation accuracy, Cohen et al. (CohenKDD2017, ) developed a method combining MinHash and HyperLogLog to estimate set intersection cardinalities. Our experiments reveal that these sketch methods have large errors when first estimating $|A\cap B|$ and $|A\cup B|$ , and then approximating the Jaccard similarity $J_{A,B}$ . As mentioned in Section 3, MinHash can be easily extended to handle streaming sets, but its two compressed versions, $b$ -bit MinHash and Odd Sketch, fail to handle data streams. To solve this problem, Yu and Weber (YuArxiv2017, ) developed a method, HyperMinHash, which can be viewed as a combination of HyperLogLog and $b$ -bit MinHash. HyperMinHash consists of $k$ registers, where each register has two parts, an FM sketch and a $b$ -bit string. The $b$ -bit string is computed based on the fingerprints (i.e., hash values) of set elements that map to the register. HyperMinHash first estimates $|A\cup B|$ and then infers the Jaccard similarity of sets $A$ and $B$ from the number of collisions of $b$ -bit strings given $|A\cup B|$ . Our experiments demonstrate that HyperMinHash exhibits a large bias for high similarities and that it is several orders of magnitude slower than our methods when estimating the similarity.

7. Conclusions and Future Work

We develop a memory-efficient sketch method, MaxLogHash, to estimate the similarity of two sets given in a streaming fashion. We provide a simple yet accurate estimator for Jaccard similarity, and derive exact formulas for the estimator’s bias and variance. Experimental results demonstrate that MaxLogHash reduces the amount of memory required by MinHash by a factor of around 5 at the same desired accuracy and computational cost. Compared with our method MaxLogHash, the state-of-the-art method HyperMinHash exhibits a larger estimation bias and its estimation time is 4 to 5 orders of magnitude larger. Although HyperLogLog can be extended to estimate Jaccard similarity, its estimation error (resp. estimation time) is about 2.5 times (resp. 10 times) larger than that of our methods. In the future, we plan to extend MaxLogHash to weighted streaming vectors and fully dynamic streaming sets that include both set element insertions and deletions.

Acknowledgment

The research presented in this paper is supported in part by National Key R&D Program of China (2018YFC0830500), National Natural Science Foundation of China (U1736205, 61603290), Shenzhen Basic Research Grant (JCYJ20170816100819428), Natural Science Basic Research Plan in Shaanxi Province of China (2019JM-159). The work of John C.S. Lui is supported in part by the GRF R4032-18.

Bibliography

[1] Kaiyu Li and Guoliang Li. Approximate query processing: What is new and where to go? Data Science and Engineering, 2018.
[2] Andrei Z Broder, Moses Charikar, Alan M Frieze, and Michael Mitzenmacher. Min-wise independent permutations. J. Comput. Syst. Sci., 60(3):630–659, June 2000.
[3] A. Broder. On the resemblance and containment of documents. In SEQUENCES, 1997.
[4] Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, Michael Mitzenmacher, Alessandro Panconesi, and Prabhakar Raghavan. On compressing social networks. In KDD, pages 219–228, 2009.
[5] Sreenivas Gollapudi and Aneesh Sharma. An axiomatic approach for result diversification. In WWW, pages 381–390.
[6] Ping Li, Anshumali Shrivastava, Joshua L. Moore, and Arnd Christian König. Hashing algorithms for large-scale learning. In NIPS, pages 2672–2680, 2011.
[7] Tanguy Urvoy, Emmanuel Chauveau, Pascal Filoche, and Thomas Lavergne. Tracking web spam with html style similarities. ACM Trans. Web, 2(1), March 2008.
[8] Ping Li and Arnd Christian König. b-bit minwise hashing. In WWW, pages 671–680, 2010.