PRSim: Sublinear Time SimRank Computation on Large Power-Law Graphs

Zhewei Wei; Xiaodong He; Xiaokui Xiao; Sibo Wang; Yu Liu; Xiaoyong Du; Ji-Rong Wen

arXiv:1905.02354·cs.DS·May 8, 2019


TL;DR

PRSim introduces a sublinear time algorithm for single-source SimRank queries on large power-law graphs, enabling efficient real-time similarity computations with high accuracy and small index size.

Contribution

The paper presents PRSim, a novel algorithm that exploits graph structure to achieve sublinear query time for SimRank, with theoretical guarantees and superior empirical performance.

Findings

01

PRSim achieves sublinear query time on power-law graphs.

02

PRSim outperforms existing algorithms in speed and accuracy.

03

The empirical analysis confirms the theoretical advantages of PRSim.

Abstract

SimRank is a classic measure of the similarities of nodes in a graph. Given a node $u$ in graph $G = (V, E)$, a single-source SimRank query returns the SimRank similarities $s(u, v)$ between node $u$ and each node $v \in V$. This type of queries has numerous applications in web search and social networks analysis, such as link prediction, web mining, and spam detection. Existing methods for single-source SimRank queries, however, incur query cost at least linear to the number of nodes $n$, which renders them inapplicable for real-time and interactive analysis. This paper proposes PRSim, an algorithm that exploits the structure of graphs to efficiently answer single-source SimRank queries. PRSim uses an index of size $O(m)$, where $m$ is the number of edges in the graph, and guarantees a query time that depends on the reverse PageRank distribution of the input…


Tables

Table 1. Comparison of single-source SimRank algorithms with $\varepsilon$ additive error and $1-\delta$ success probability.

Algorithm	Query Time	Query Time (Power-Law Graphs)	Space Overhead	Preprocessing Time
PRSim	$O\left(\frac{n\log\frac{n}{\delta}}{\varepsilon^{2}}\cdot\sum_{w\in V}\pi(w)^{2}\right)$	$O\left(\log\frac{n}{\delta}/\varepsilon^{2}\right)$ for $\gamma>2$; $O\left(\log\frac{n}{\delta}\cdot\log n/\varepsilon^{2}\right)$ for $\gamma=2$; $O\left(\min\left\{n^{\frac{1}{\gamma}}/\varepsilon^{2-\frac{1}{\gamma}},\ n^{\frac{2}{\gamma}-1}/\varepsilon^{2}\right\}\right)$ for $1<\gamma<2$	$O(\min\{n/\varepsilon, m\})$	$O(m/\varepsilon)$
TSF (SLX15, )	$O\left(n\log\frac{n}{\delta}/\varepsilon^{2}\right)$	-	$O\left(n\log\frac{n}{\delta}/\varepsilon^{2}\right)$	$O\left(n\log\frac{n}{\delta}/\varepsilon^{2}\right)$
READS (jiang2017reads, )	$O\left(n\log\frac{n}{\delta}/\varepsilon^{2}\right)$	-	$O\left(n\log\frac{n}{\delta}/\varepsilon^{2}\right)$	$O\left(n\log\frac{n}{\delta}/\varepsilon^{2}\right)$
ProbeSim (liu2017probesim, )	$O\left(n\log\frac{n}{\delta}/\varepsilon^{2}\right)$	-	0	0
SLING (TX16, )	$O(n/\varepsilon)$	-	$O(n/\varepsilon)$	$O\left(m/\varepsilon+n\log\frac{n}{\delta}/\varepsilon^{2}\right)$

Table 2. Table of notations.

Notation	Description
$n, m$	the numbers of nodes and edges in $G$
$\mathcal{I}(v), \mathcal{O}(v)$	the sets of in-neighbors and out-neighbors of a node $v$
$d_{out}(v)$, $d_{in}(v)$	the out-degree and in-degree of node $v$
$s(u,v)$	the SimRank similarity of nodes $u$ and $v$
$\hat{s}(u,v)$	an estimation of $s(u,v)$
$c$	the decay factor of SimRank
$\varepsilon$	the maximum absolute error allowed in SimRank computation
$\pi(w)$	the reverse PageRank of node $w$
$\pi(u,w), \pi_{\ell}(u,w)$	the RPPR and $\ell$-hop RPPR values of $w$ with respect to $u$
$\hat{\pi}(u,w), \hat{\pi}_{\ell}(u,w)$	estimators of $\pi(u,w)$ and $\pi_{\ell}(u,w)$
$r_{\ell}(v,w)$, $\psi_{\ell}(v,w)$	the residue and reserve of $v$ at level $\ell$ from $w$ in the backward search

Table 3. Data Sets.

Data Set	Type	$n$	$m$
DBLP-Author (DB)	undirected	5,425,963	17,298,033
LiveJournal (LJ)	directed	4,847,571	68,993,773
It-2004 (IT)	directed	41,291,594	1,150,725,436
Twitter (TW)	directed	41,652,230	1,468,365,182
UK-Union (UK)	directed	133,633,040	5,507,679,822

Equations

$$s(u,v)=\begin{cases}1, & \text{if } u=v;\\ \dfrac{c}{|\mathcal{I}(u)|\cdot|\mathcal{I}(v)|}\sum_{u^{\prime}\in\mathcal{I}(u)}\sum_{v^{\prime}\in\mathcal{I}(v)}s(u^{\prime},v^{\prime}), & \text{otherwise.}\end{cases}$$

$$|\hat{s}(u,v)-s(u,v)|\leq\varepsilon$$

$$\mathrm{E}[Cost]=\begin{cases}O\left(\frac{1}{\varepsilon^{2}}\log\frac{n}{\delta}\right), & \text{for }\gamma>2;\\ O\left(\frac{1}{\varepsilon^{2}}\log\frac{n}{\delta}\cdot\log n\right), & \text{for }\gamma=2;\\ O\left(\min\left\{\frac{n^{1/\gamma}}{\varepsilon^{2-1/\gamma}},\ \frac{n^{2/\gamma-1}}{\varepsilon^{2}}\right\}\right), & \text{for }1<\gamma<2,\end{cases}$$

$$\pi_{\ell+1}(y,w)=\sum_{x\in\mathcal{I}(y)}\frac{\sqrt{c}}{d_{in}(y)}\,\pi_{\ell}(x,w).$$

$$\sum_{\ell=0}^{\infty}\sum_{u\in V}\pi_{\ell}(u,w)=n\pi(w).$$

$$s(u,v)=\sum_{\ell=0}^{\infty}\sum_{w\in V}h_{\ell}(u,w)\,h_{\ell}(v,w)\,\eta(w).$$

$$s(u,v)=\frac{1}{(1-\sqrt{c})^{2}}\sum_{\ell=0}^{\infty}\sum_{w\in V}\pi_{\ell}(u,w)\,\pi_{\ell}(v,w)\,\eta(w).$$

$$s_{I}(u,v)=\frac{1}{(1-\sqrt{c})^{2}}\sum_{\ell=0}^{\infty}\sum_{j=1}^{j_{0}}\pi_{\ell}(u,w_{j})\,\pi_{\ell}(v,w_{j})\,\eta(w_{j}),$$

$$s_{B}(u,v)=\frac{1}{(1-\sqrt{c})^{2}}\sum_{\ell=0}^{\infty}\sum_{j=j_{0}+1}^{n}\pi_{\ell}(u,w_{j})\,\pi_{\ell}(v,w_{j})\,\eta(w_{j}).$$

$$\Pr\left[|\hat{s}_{I}(u,v)-s_{I}(u,v)|>\frac{\varepsilon}{2}\right]\leq\frac{\delta}{2n}.$$

$$\Pr\left[|\hat{s}_{B}(u,v)-s_{B}(u,v)|>\frac{\varepsilon}{2}\right]\leq\frac{\delta}{2n}.$$

$$\Pr\left[|\hat{s}(u,v)-s(u,v)|>\varepsilon\right]\leq\frac{\delta}{2n}+\frac{\delta}{2n}=\frac{\delta}{n}.$$

$$\mathrm{E}[C_{I}]=O\left(\min\left\{\frac{n}{\varepsilon}\sum_{j=1}^{c_{1}/\varepsilon}\pi(w_{j}),\ \frac{n}{\varepsilon^{2}}\sum_{j=1}^{j_{0}}\pi(w_{j})^{2}\right\}\right).$$

$$\mathrm{E}[C]=O\left(\frac{n\log\frac{n}{\delta}}{\varepsilon^{2}}\cdot\sum_{w\in V}\pi(w)^{2}\right).$$

$$\pi(w_{j})=\kappa\cdot j^{-\beta}/n^{1-\beta}=\kappa\cdot j^{-\frac{1}{\gamma}}/n^{1-\frac{1}{\gamma}},$$

$$\mathrm{E}[C]=\begin{cases}O\left(\frac{1}{\varepsilon^{2}}\log\frac{n}{\delta}\right), & \text{for }\gamma>2;\\ O\left(\frac{1}{\varepsilon^{2}}\log\frac{n}{\delta}\log n\right), & \text{for }\gamma=2;\\ O\left(\min\left\{\frac{n^{1/\gamma}}{\varepsilon^{2-1/\gamma}},\ \frac{n^{2/\gamma-1}}{\varepsilon^{2}}\right\}\right), & \text{for }1<\gamma<2.\end{cases}$$

$$S=(cA^{\top}SA)\lor I,$$

$$S=cA^{\top}SA+(1-c)\cdot I.$$

$$\mathit{AvgError@k}=\frac{1}{k}\sum_{1\leq i\leq k}|\hat{s}(u,v_{i})-s(u,v_{i})|.$$

$$\sum_{k=i+1}^{j}k^{-\alpha}=\begin{cases}O(j^{1-\alpha}), & \text{for }\alpha<1;\\ O(\log j-\log i), & \text{for }\alpha=1;\\ O\left(i^{1-\alpha}\right), & \text{for }\alpha>1.\end{cases}$$

$$\hat{\pi}_{i+1}(y,w)=\sum_{x\in A}R_{i}(x)\frac{\hat{\pi}_{i}(x,w)}{d_{in}(y)}+\sum_{x\in B}R_{i}(x)Z_{i}(x,y)(1-\sqrt{c}).$$

$$\mathrm{E}\left[\hat{\pi}_{i+1}(y,w)\mid\hat{\pi}_{i}(x,w),x\in V\right]=\sum_{x\in A}\mathrm{E}[R_{i}(x)]\frac{\hat{\pi}_{i}(x,w)}{d_{in}(y)}+\sum_{x\in B}\mathrm{E}[R_{i}(x)Z_{i}(x,y)](1-\sqrt{c}).$$

$$\mathrm{E}[Z_{i}(x,y)]=\Pr\left[r<\frac{\hat{\pi}_{i}(x,w)}{d_{in}(y)(1-\sqrt{c})}\right]=\frac{\hat{\pi}_{i}(x,w)}{d_{in}(y)(1-\sqrt{c})}.$$

$$\mathrm{E}\left[\hat{\pi}_{i+1}(y,w)\mid\hat{\pi}_{i}(x,w),x\in V\right]=\sum_{x\in A}\frac{\sqrt{c}\,\hat{\pi}_{i}(x,w)}{d_{in}(y)}+\sum_{x\in B}\frac{\sqrt{c}\,\hat{\pi}_{i}(x,w)(1-\sqrt{c})}{d_{in}(y)(1-\sqrt{c})}=\sum_{x\in\mathcal{I}(y)}\frac{\sqrt{c}\,\hat{\pi}_{i}(x,w)}{d_{in}(y)}.$$

$$\mathrm{E}\left[\sum_{i=0}^{\infty}\sum_{x\in V}cost_{i}(x)\right]=\frac{1}{1-\sqrt{c}}\sum_{i=0}^{\infty}\sum_{x\in V}\pi_{i}(x,w)=O(n\pi(w)).$$

$$\mathrm{E}\left[\hat{\pi}_{i+1}(y,w)^{2}\mid\hat{\pi}_{i}(x,w),x\in V\right]=\mathrm{E}\left(\sum_{x\in A}R_{i}(x)\frac{\hat{\pi}_{i}(x,w)}{d_{in}(y)}+\sum_{x\in B}R_{i}(x)Z_{i}(x,y)(1-\sqrt{c})\right)^{2}.$$

$$\mathrm{E}\left[\hat{\pi}_{i+1}(y,w)^{2}\mid\hat{\pi}_{i}(x,w),x\in V\right]=X_{1}+X_{2}+X_{3}+X_{4}+X_{5}$$

$$=\sum_{x\in A}\mathrm{E}[R_{i}(x)^{2}]\frac{\hat{\pi}_{i}(x,w)^{2}}{d_{in}(y)^{2}}+\sum_{x\in B}\mathrm{E}[R_{i}(x)^{2}Z_{i}(x,y)^{2}](1-\sqrt{c})^{2}$$

$$+\sum_{x_{1}\neq x_{2}\in A}\mathrm{E}[R_{i}(x_{1})R_{i}(x_{2})]\frac{\hat{\pi}_{i}(x_{1},w)\,\hat{\pi}_{i}(x_{2},w)}{d_{in}(y)^{2}}$$

$$+\sum_{x_{1}\neq x_{2}\in B}\mathrm{E}[R_{i}(x_{1})Z_{i}(x_{1},y)R_{i}(x_{2})Z_{i}(x_{2},y)]\cdot(1-\sqrt{c})^{2}$$


Full text

PRSim: Sublinear Time SimRank Computation on Large Power-Law Graphs

[Technical Report]

Zhewei Wei, School of Information, DEKE MOE, Renmin University of China

Xiaodong He, 4Paradigm Inc., Beijing, China

Xiaokui Xiao, School of Computing, National University of Singapore

Sibo Wang, The Chinese University of Hong Kong

Yu Liu, Peking University

Xiaoyong Du and Ji-Rong Wen, Renmin University of China

(2019)

Abstract.

SimRank is a classic measure of the similarities of nodes in a graph. Given a node $u$ in graph $G=(V,E)$, a single-source SimRank query returns the SimRank similarities $s(u,v)$ between node $u$ and each node $v\in V$. This type of query has numerous applications in web search and social network analysis, such as link prediction, web mining, and spam detection. Existing methods for single-source SimRank queries, however, incur a query cost at least linear in the number of nodes $n$, which renders them inapplicable for real-time and interactive analysis.

This paper proposes PRSim, an algorithm that exploits the structure of graphs to efficiently answer single-source SimRank queries. PRSim uses an index of size $O(m)$ , where $m$ is the number of edges in the graph, and guarantees a query time that depends on the reverse PageRank distribution of the input graph. In particular, we prove that PRSim runs in sub-linear time if the degree distribution of the input graph follows the power-law distribution, a property possessed by many real-world graphs. Based on the theoretical analysis, we show that the empirical query time of all existing SimRank algorithms also depends on the reverse PageRank distribution of the graph. Finally, we present the first experimental study that evaluates the absolute errors of various SimRank algorithms on large graphs, and we show that PRSim outperforms the state of the art in terms of query time, accuracy, index size, and scalability.

SimRank; Power-Law Graphs; Personalized PageRank

Published in: 2019 International Conference on Management of Data (SIGMOD '19), June 30-July 5, 2019, Amsterdam, Netherlands. Copyright: ACM. DOI: 10.1145/3299869.3319873. ISBN: 978-1-4503-5643-5/19/06. CCS concepts: Mathematics of computing, Graph algorithms; Information systems, Data mining.

1. Introduction

Measuring similarities and proximities of nodes in the graph is a classic task in graph analytics. Several link-based similarity measures have been proposed, including Personalized PageRank (page1999pagerank, ), Simfusion (xi2005simfusion, ), P-rank (zhao2009p, ) and Panther (zhang2015panther, ). Among them, SimRank (JW02, ), proposed by Jeh and Widom, is regarded as one of the most influential similarity measures, and has been adopted in numerous applications such as web mining (Jin11, ), social network analysis (NK07, ), and spam detection (SH11, ). Given a graph $G=(V,E)$ , the SimRank similarity of nodes $u$ and $v$ , denoted as $s(u,v)$ , is defined as

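$$s(u,v)=\begin{cases}1, & \text{if } u=v;\\ \dfrac{c}{|\mathcal{I}(u)|\cdot|\mathcal{I}(v)|}\sum_{u^{\prime}\in\mathcal{I}(u)}\sum_{v^{\prime}\in\mathcal{I}(v)}s(u^{\prime},v^{\prime}), & \text{otherwise,}\end{cases}$$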

where $\mathcal{I}(u)$ denotes the set of in-neighbors of $u$, and $c\in(0,1)$ is a decay factor typically set to 0.6 or 0.8 (JW02, ; LVGT10, ). This formulation is based on two intuitive statements: (1) two objects are similar if they are referenced by similar objects, and (2) an object is most similar to itself. Due to its recursive nature, SimRank computation is a non-trivial problem and has been extensively studied for more than a decade. Existing work mostly considers three types of SimRank queries: (1) Single-pair queries, which ask for the SimRank similarity between two given nodes $u$ and $v$; (2) All-pair queries, which ask for the SimRank similarity between every pair of nodes $u$ and $v$; (3) Single-source queries, which ask for the SimRank similarity between a given node $u$ and every node in the graph. All-pair queries require storing $O(n^{2})$ node pairs, and thus are infeasible for large graphs. Meanwhile, single-source queries have become the focus of recent research (KMK14, ; MKK14, ; TX16, ; FRCS05, ; LeeLY12, ; LiFL15, ; SLX15, ; YuM15b, ; LiFL15, ; jiang2017reads, ; liu2017probesim, ), due to their connections to recommendation applications. In this paper, we aim to answer approximate single-source SimRank queries, defined as follows:

Definition 1.1 (Approximate Single-Source Queries).

Given a node $u$ in a directed graph $G$ and an absolute error threshold $\varepsilon$ , an approximate single-source SimRank query returns an estimated value $\hat{s}(u,v)$ for each node $v$ in $G$ , such that

$$|\hat{s}(u,v)-s(u,v)|\leq\varepsilon$$

holds for any $v$ with at least $1-\delta$ probability. $\square$
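As a point of reference for the definitions above, the recursive SimRank formula can be evaluated exactly on a toy graph by fixed-point iteration. The sketch below is not part of PRSim. It assumes a dictionary `in_nbrs` of in-neighbor lists (a hypothetical input format), treats any pair involving a node without in-neighbors as having similarity 0, and needs $O(n^{2})$ memory, so it is only useful as ground truth for checking the $\varepsilon$ guarantee of Definition 1.1 on small inputs.

```python
def simrank_naive(in_nbrs, c=0.6, iters=50):
    """Fixed-point iteration of the recursive SimRank definition (toy graphs only)."""
    nodes = list(in_nbrs)
    s = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iters):
        nxt = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    nxt[(u, v)] = 1.0
                elif not in_nbrs[u] or not in_nbrs[v]:
                    # Assumed convention: no in-neighbors means similarity 0.
                    nxt[(u, v)] = 0.0
                else:
                    total = sum(s[(a, b)] for a in in_nbrs[u] for b in in_nbrs[v])
                    nxt[(u, v)] = c * total / (len(in_nbrs[u]) * len(in_nbrs[v]))
        s = nxt
    return s


def max_abs_error(estimate, exact, u):
    """Largest |s_hat(u, v) - s(u, v)| over all v; Definition 1.1 asks for <= eps."""
    return max(abs(estimate.get(v, 0.0) - exact[(x, v)]) for (x, v) in exact if x == u)


# Hypothetical toy graph with edges a->b, a->c, b->d, c->d, stored as in-neighbor lists.
toy = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
scores = simrank_naive(toy, c=0.6)
print(scores[("b", "c")])  # b and c share their only in-neighbor a
```

In this toy graph, `b` and `c` are each referenced only by `a`, so their similarity is exactly the decay factor $c$.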

Power-law graphs. It has been experimentally observed that most real-world networks are scale-free and follow a power-law degree distribution. In particular, let $P_{o}(k)$ and $P_{i}(k)$ denote the fraction of nodes in the graph having out-degree and in-degree at least $k$, respectively. Then, on a power-law graph, $P_{o}(k)$ and $P_{i}(k)$ satisfy that $P_{o}(k)\sim k^{-\gamma}$ and $P_{i}(k)\sim k^{-\gamma^{\prime}}$ (BollobasBCR03, ), where $\gamma$ and $\gamma^{\prime}$ are the (cumulative) power-law exponents that usually take values from $1$ to $2$. Recent work has demonstrated that by exploiting this fact, we can improve the asymptotic bounds for various graph algorithms such as triangle counting (brach2016algorithmic, ), transitive closure (brach2016algorithmic, ), perfect matching (brach2016algorithmic, ), PageRank computation (lofgren2015personalized, ; wei2018topppr, ) and maximum independent set (liu2015towards, ).
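To make the cumulative distribution $P_{o}(k)$ concrete, the following minimal sketch computes the fraction of nodes with out-degree at least $k$ from an edge list. The function name and the toy edge list are illustrative assumptions; estimating $\gamma$ on a real graph would additionally require fitting the tail of this curve, which is not shown.

```python
from collections import Counter


def cumulative_out_degree_fraction(edges, nodes):
    """Return {k: fraction of nodes with out-degree >= k} for k = 1..max out-degree."""
    out_deg = Counter(src for src, _dst in edges)
    n = len(nodes)
    max_deg = max(out_deg.values(), default=0)
    return {
        k: sum(1 for v in nodes if out_deg[v] >= k) / n
        for k in range(1, max_deg + 1)
    }


# Hypothetical toy edge list of (source, destination) pairs; node "h" acts as a hub.
edges = [("h", "a"), ("h", "b"), ("h", "c"), ("a", "h")]
nodes = ["h", "a", "b", "c"]
P_o = cumulative_out_degree_fraction(edges, nodes)
print(P_o)
```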

Motivations. Since many graph algorithms can benefit from the structure of real-world graphs, a natural question is: Can we do the same for SimRank algorithms? On one hand, we are interested in designing a more efficient SimRank algorithm by exploiting the structure of the graphs, since existing work for SimRank computation (KMK14, ; MKK14, ; TX16, ; FRCS05, ; LeeLY12, ; LiFL15, ; SLX15, ; YuM15b, ; LiFL15, ; jiang2017reads, ; liu2017probesim, ) has missed this opportunity for optimization.

On the other hand, we are also interested in analyzing how the graph structure affects the performance of existing SimRank algorithms. More precisely, it has been observed in previous work (zhangexperimental, ) that the performance of existing SimRank algorithms may vary dramatically on graphs with similar numbers of nodes and edges. A typical example is the Twitter (TW) and IT-2004 (IT) data sets, both of which have around 40 million nodes and 1 billion edges. However, as shown in (zhangexperimental, ) and in our experiments, the query times of most SimRank algorithms are significantly smaller on IT-2004 than on Twitter. Based on this phenomenon, (zhangexperimental, ) suggests that Twitter (TW) is “locally dense” and IT-2004 (IT) is “locally sparse”. However, it is still desirable to obtain a quantifiable measure that describes the hardness of each graph in terms of SimRank computation. Finally, since obtaining ground truth for single-source SimRank queries requires $n^{2}$ space, which is infeasible for large graphs, most existing work only evaluates the accuracy of the algorithms on small graphs. The only exception is recent work (liu2017probesim, ), which evaluates precision for approximate top-$k$ queries on graphs with billions of edges using the idea of pooling. However, there is no prior experimental study that evaluates absolute error for single-source queries on large graphs.

Our contributions. This paper studies the approximate single-source SimRank queries, and makes the following contributions.

(1) We propose PRSim, an algorithm that leverages the graph structure to efficiently answer approximate single-source SimRank queries. The query time complexity of PRSim is related to the reverse PageRank of the input graph $G$, which is defined as the PageRank of the graph $G^{\prime}$ constructed by reversing the direction of each edge in $G$. Let $\pi(w)$ denote the reverse PageRank of node $w$, and $\sum_{w\in V}\pi(w)^{2}$ denote the second moment of the reverse PageRanks. The average expected query cost for PRSim on worst-case graphs is bounded by $O\left({n\log{n\over\delta}\over\varepsilon^{2}}\cdot\sum_{w\in V}\pi(w)^{2}\right)$. By the fact that $\sum_{w\in V}\pi(w)^{2}\leq\left(\sum_{w\in V}\pi(w)\right)^{2}=1$, the bound of PRSim is never worse than that of the random walk based algorithms (ProbeSim, TSF, and READS) on worst-case graphs. Furthermore, PRSim uses an index of size $O(m)$, which significantly improves the scalability of the algorithm. See Table 1 for the theoretical comparison between our algorithm and the state of the art.

On the other hand, we show that on power-law graphs, the second moment $\sum_{w\in V}\pi(w)^{2}$ is an asymptotic variable that is close to [math], which means PRSim actually achieves sub-linear query cost on real-world graphs. More precisely, let $\gamma$ denote the cumulative power-law exponent of the out-degree distribution. We show that the average expected query cost for PRSim on power-law graphs is bounded by:

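$$\mathrm{E}[Cost]=\begin{cases}O\left(\frac{1}{\varepsilon^{2}}\log\frac{n}{\delta}\right), & \text{for }\gamma>2;\\ O\left(\frac{1}{\varepsilon^{2}}\log\frac{n}{\delta}\cdot\log n\right), & \text{for }\gamma=2;\\ O\left(\min\left\{\frac{n^{1/\gamma}}{\varepsilon^{2-1/\gamma}},\ \frac{n^{2/\gamma-1}}{\varepsilon^{2}}\right\}\right), & \text{for }1<\gamma<2,\end{cases}$$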

for ${1\over n^{\Omega(1)}}<\varepsilon<1$ and $\delta>{1\over n^{\Omega(1)}}$. To understand this complexity, we first note that when $\gamma\geq 2$, our bounds depend on $n$ only through $\log n$ factors, which is significantly better than the corresponding bounds of all previous SimRank algorithms. For $1<\gamma<2$, since $\varepsilon>{1\over n}$, we have ${n^{1\over\gamma}\over\varepsilon^{2-{1\over\gamma}}}\leq{n\over\varepsilon}$. This implies that PRSim also outperforms SLING on power-law graphs. To the best of our knowledge, this is the first sublinear algorithm for single-source SimRank queries on power-law graphs.
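The comparison with SLING for $1<\gamma<2$ can be checked in one line. The following worked step is a sketch that uses only the stated assumptions $\varepsilon>1/n$ and $\gamma>1$:

```latex
\frac{n^{1/\gamma}}{\varepsilon^{2-1/\gamma}}
  = \frac{n}{\varepsilon}\cdot\frac{1}{(n\varepsilon)^{1-1/\gamma}}
  \le \frac{n}{\varepsilon},
  \qquad\text{since } n\varepsilon > 1 \text{ and } 1-\tfrac{1}{\gamma} > 0 .
```

The factor $(n\varepsilon)^{-(1-1/\gamma)}$ is the gain over the $O(n/\varepsilon)$ query time listed for SLING in Table 1; it shrinks toward $1$ as $\gamma$ approaches $1$.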

(2) To achieve the desired query cost in Table 1, we design several novel techniques for computing SimRank and Personalized PageRank (PPR). First, we propose an algorithm that estimates the last meeting probabilities (TX16, ) (see Section 2 for the definition) for ALL nodes in $O(\log{n\over\delta}/\varepsilon^{2})$ time. This improves the $O(n\log{n\over\delta}/\varepsilon^{2})$ bounds in (TX16, ) by an order of $O(n)$ and is the key to achieving sub-linearity. Second, we propose an index scheme which performs the backward search (lofgren2015personalized, ) algorithm only on a number $j_{0}$ of hub nodes. The parameter $j_{0}$ enables us to manipulate the tradeoffs between index size and query time, which improves the scalability of our algorithm. Finally, we design Variance Bounded Backward Walk, an algorithm that estimates the Personalized PageRank values to a given target node $w$ with additive error $\varepsilon$ in $O(n\pi(w)\log{n\over\delta}/\varepsilon^{2})$ time, where $\pi(w)$ is the reverse PageRank of node $w$. Since the average value of $\pi(w)$ is $1/n$, this significantly improves the $O(n\log{n\over\delta}/\varepsilon^{2})$ time complexity of the Randomized Probe algorithm (liu2017probesim, ), and is the key to the relation between the time complexity and the reverse PageRank distribution. We also note that the Variance Bounded Backward Walk algorithm actually improves the time complexity of state-of-the-art PPR algorithms to target nodes for dense graphs (wang2018efficient, ), and may be of independent interest.

(3) Based on the time complexity of PRSim, we conduct experiments to confirm that the hardness of SimRank queries is indeed inversely related to the out-degree power-law exponent $\gamma$ of the graph. This observation provides a quantifiable measure for the concept of locally dense and locally sparse networks introduced in (zhangexperimental, ). In particular, the out-degree distribution of IT-2004 is significantly more skewed than that of Twitter (see Figure 1), which explains the performance discrepancy of existing SimRank algorithms on these two datasets. We also conduct a large set of experiments that evaluate PRSim against the state of the art on benchmark data sets. In particular, our experiments include the first empirical study on the tradeoffs between absolute error and query cost for single-source SimRank algorithms on graphs with billions of edges. Our empirical study shows that PRSim outperforms the state of the art in terms of query time, accuracy, index size, and scalability.

2. Preliminaries

Table 2 shows the notations that are frequently used in the remainder of the paper.

$\boldsymbol{\sqrt{c}}$-walk and Reverse PageRank. We unify the definition of SimRank and reverse PageRank under the notation of $\sqrt{c}$-walk. Let $G=(V,E)$ be a directed graph with $n$ nodes and $m$ edges. Given a source node $u\in V$ and a decay factor $c$, a reverse $\sqrt{c}$-discounted random walk (or $\sqrt{c}$-walk in short) from $u$ is a traversal of $G$ that starts from $u$ and, at each step, either (i) terminates at the current node with $1-\sqrt{c}$ probability, or (ii) proceeds to a randomly selected in-neighbor of the current node with $\sqrt{c}$ probability. We define the reverse PageRank $\pi(w)$ of a node $w$ to be the probability that a $\sqrt{c}$-walk from a uniformly chosen source node terminates at $w$. It is easy to see that the reverse PageRank of a node $w$ in the original graph $G$ equals the PageRank of $w$ in the reverse graph $G^{\prime}$ constructed by reversing the direction of each edge in $G$.

Given a source node $u$ and a target node $w$, we further define the reverse Personalized PageRank (RPPR) $\pi(u,w)$ of $w$ with respect to $u$ to be the probability that a $\sqrt{c}$-walk from $u$ terminates at $w$. Again, the reverse Personalized PageRank on the original graph $G$ equals the Personalized PageRank on the reverse graph $G^{\prime}$. Since the RPPR values from a given source node $u$ form a probability distribution, we have $\sum_{w\in V}\pi(u,w)=1.$ Meanwhile, since the reverse PageRank $\pi(w)$ is equal to the probability that a $\sqrt{c}$-walk from a random source node terminates at $w$, we have $\sum_{u\in V}\pi(u,w)=n\pi(w).$
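The $\sqrt{c}$-walk and the reverse PageRank can be illustrated with a small simulation. The sketch below is an illustrative Monte Carlo estimator rather than the paper's procedure: it assumes that every node has at least one in-neighbor, spreads the same number of walks over every source node instead of sampling sources at random, and uses hypothetical names (`sqrt_c_walk`, `reverse_pagerank_mc`) and a hypothetical toy graph.

```python
import math
import random


def sqrt_c_walk(in_nbrs, start, c, rng):
    """Run one reverse sqrt(c)-walk; return (terminal node, number of steps taken)."""
    move_prob = math.sqrt(c)
    node, steps = start, 0
    # At each step: stop with probability 1 - sqrt(c), else move to a random in-neighbor.
    while rng.random() < move_prob:
        node = rng.choice(in_nbrs[node])  # assumes in_nbrs[node] is non-empty
        steps += 1
    return node, steps


def reverse_pagerank_mc(in_nbrs, c=0.6, walks_per_node=5000, seed=0):
    """Estimate pi(w) as the fraction of walks (equally many per source) that end at w."""
    rng = random.Random(seed)
    counts = dict.fromkeys(in_nbrs, 0)
    for u in in_nbrs:
        for _ in range(walks_per_node):
            w, _steps = sqrt_c_walk(in_nbrs, u, c, rng)
            counts[w] += 1
    total = walks_per_node * len(in_nbrs)
    return {w: counts[w] / total for w in in_nbrs}


# Hypothetical toy graph as in-neighbor lists: h points to a, b, c and a points back to h.
toy = {"h": ["a"], "a": ["h"], "b": ["h"], "c": ["h"]}
pi = reverse_pagerank_mc(toy)
print(pi)
```

Because `h` is the only in-neighbor of three nodes, most reverse walks pass through it, so it receives the largest reverse PageRank in this toy graph.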

$\boldsymbol{\ell}$-Hop RPPR. In this paper, we will mainly use a variant of Personalized PageRank called $\ell$-hop Reverse Personalized PageRank ($\ell$-hop RPPR). Given a source node $u$, the $\ell$-hop RPPR $\pi_{\ell}(u,w)$ of node $w$ with respect to $u$ is the probability that a reverse $\sqrt{c}$-walk from $u$ terminates at node $w$ in exactly $\ell$ steps. By the definition of $\ell$-hop RPPR, we have

$$\pi_{\ell+1}(y,w)=\sum_{x\in\mathcal{I}(y)}\frac{\sqrt{c}}{d_{in}(y)}\,\pi_{\ell}(x,w).$$

On the other hand, it is easy to see that RPPR $\pi(u,w)$ can be expressed as the sum of $\ell$ -hop RPPR, that is, $\sum_{\ell=0}^{\infty}\pi_{\ell}(u,w)=\pi(u,w).$ Thus, we have $\sum_{\ell=0}^{\infty}\sum_{w\in V}\pi_{\ell}(u,w)=1,$ and

$$\sum_{\ell=0}^{\infty}\sum_{u\in V}\pi_{\ell}(u,w)=n\pi(w).$$
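The two summation identities for $\ell$-hop RPPR can be verified numerically on a toy graph. The sketch below evaluates $\pi_{\ell}(y,w)$ level by level, using the recurrence that follows from the walk definition: a walk from $y$ moves to each in-neighbor $x$ with probability $\sqrt{c}/d_{in}(y)$ and must then terminate at $w$ after exactly $\ell$ more steps. It assumes every node has at least one in-neighbor and truncates at a hypothetical `max_level`, so the sums match only up to a tail of order $(\sqrt{c})^{\ell}$.

```python
import math


def l_hop_rppr_to_target(in_nbrs, w, c=0.6, max_level=80):
    """Return levels, where levels[l][y] = pi_l(y, w) for l = 0..max_level."""
    move_prob = math.sqrt(c)
    nodes = list(in_nbrs)
    # Level 0: the walk stops immediately, so it ends at w only if it starts at w.
    levels = [{y: (1.0 - move_prob) if y == w else 0.0 for y in nodes}]
    for _ in range(max_level):
        prev = levels[-1]
        # pi_{l+1}(y, w) = sum over in-neighbors x of y of sqrt(c) / d_in(y) * pi_l(x, w)
        levels.append({
            y: sum(move_prob / len(in_nbrs[y]) * prev[x] for x in in_nbrs[y])
            for y in nodes
        })
    return levels


# Hypothetical toy graph as in-neighbor lists (no node without in-neighbors).
toy = {"h": ["a"], "a": ["h"], "b": ["h"], "c": ["h"]}
levels_h = l_hop_rppr_to_target(toy, "h")
# Summing over all levels and all sources gives n * pi(h), up to truncation.
n_pi_h = sum(sum(level.values()) for level in levels_h)
print(n_pi_h / len(toy))
```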

SimRank, $\boldsymbol{\sqrt{c}}$ -walk, and hitting probability. It is shown in (TX16, ) that the SimRank similarity $s(u,v)$ between two different nodes $u$ and $v$ can also be formulated using $\sqrt{c}$ -walks. Given two distinct nodes $u$ and $v$ , we start a $\sqrt{c}$ -walk from each node. If the two $\sqrt{c}$ -walks visit the same node after exactly $i$ steps, we say the two $\sqrt{c}$ -walks meet at step $i$ . (TX16, ) shows that $s(u,v)$ is equal to the probability that the two $\sqrt{c}$ -walks meet.
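The meeting-probability interpretation translates directly into a Monte Carlo estimator. The following sketch simulates the two $\sqrt{c}$-walks in lockstep and reports the fraction of trials in which they meet. It is a plain illustration of the interpretation above, not one of the algorithms compared in this paper; the function names, the sample count, and the convention that a walk stops at a node without in-neighbors are assumptions.

```python
import math
import random


def walks_meet(in_nbrs, u, v, c, rng):
    """Simulate two sqrt(c)-walks in lockstep; True if they share a node at some step i >= 1."""
    move_prob = math.sqrt(c)
    a, b = u, v
    while True:
        # Each walk independently continues with probability sqrt(c).
        if rng.random() >= move_prob or rng.random() >= move_prob:
            return False  # one walk stopped, so no later meeting is possible
        if not in_nbrs[a] or not in_nbrs[b]:
            return False  # assumed convention: a walk with nowhere to go stops
        a, b = rng.choice(in_nbrs[a]), rng.choice(in_nbrs[b])
        if a == b:
            return True


def simrank_mc(in_nbrs, u, v, c=0.6, samples=20000, seed=0):
    """Estimate s(u, v) for distinct u, v as the empirical meeting frequency."""
    rng = random.Random(seed)
    return sum(walks_meet(in_nbrs, u, v, c, rng) for _ in range(samples)) / samples


# Hypothetical toy graph with edges a->b, a->c, b->d, c->d, stored as in-neighbor lists.
toy = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
estimate_bc = simrank_mc(toy, "b", "c")
print(estimate_bc)
```

Here the walks from `b` and `c` meet at `a` exactly when both take a first step, which happens with probability $\sqrt{c}\cdot\sqrt{c}=c$, so the estimate should be close to the decay factor.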

Moreover, (TX16, ) proposes SLING, an algorithm that uses the following formula to estimate SimRank values:

$$s(u,v)=\sum_{\ell=0}^{\infty}\sum_{w\in V}h_{\ell}(u,w)\,h_{\ell}(v,w)\,\eta(w).\qquad(5)$$

Here $h_{\ell}(u,w)$ denotes the hitting probability that a $\sqrt{c}$-walk from node $u$ visits $w$ at its $\ell$-th step, and $\eta(w)$ is a parameter that characterizes the last-meeting probability:

Definition 2.1 (Last-meeting probability).

The last-meeting probability $\eta(w)$ for node $w$ is the probability that two $\sqrt{c}$-walks from $w$ do not meet at step $i$ for any $i\geq 1$.

SLING precomputes $h_{\ell}(u,w)$ and $\eta(w)$ with an additive error up to $\varepsilon$, and stores them in the index. Given a query node $u$, it retrieves all levels $\ell$ and nodes $w$ such that $h_{\ell}(u,w)>\varepsilon$. For each $(\ell,w)$ pair, SLING retrieves $\eta(w)$ and all nodes $v$ with $h_{\ell}(v,w)>\varepsilon$, and estimates $s(u,v)$ with Equation (5).

There are two major issues with SLING. First, storing all $h_{\ell}(u,w)$ with additive error up to $\varepsilon$ takes $O(n/\varepsilon)$ space, which can be significantly larger than the graph size for reasonable choices of $\varepsilon$ . Second, approximating $\eta(w)$ for each $w\in V$ requires sampling a large number of random walks from each node in the graph, which makes the preprocessing time infeasible on very large graphs. Our algorithm overcomes these two drawbacks by (1) providing an index size that is at most the size of the graph, and (2) designing an algorithm that estimates $\eta(w)$ on-the-fly, using only $O(\log n/\varepsilon^{2})$ time.

3. PRSim algorithm

In this section, we present PRSim, an index-based algorithm that exploits the graph structure to efficiently answer approximate single-source SimRank queries. We first provide the estimating formula that relates SimRank and $\ell$ -hop RPPR.

3.1. SimRank and $\ell$ -hop RPPR

The relation between SimRank and reverse Personalized PageRank can be directly derived from Equation (5). Observe that the $\ell$-hop RPPR $\pi_{\ell}(u,w)$ equals the hitting probability $h_{\ell}(u,w)$ multiplied by the termination probability $\alpha=1-\sqrt{c}$, and we have

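$$s(u,v)=\frac{1}{(1-\sqrt{c})^{2}}\sum_{\ell=0}^{\infty}\sum_{w\in V}\pi_{\ell}(u,w)\,\pi_{\ell}(v,w)\,\eta(w).\qquad(6)$$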

There are two reasons for using $\ell$ -hop RPPR over hitting probability. Firstly, we have $\sum_{\ell=0}^{\infty}\sum_{w\in V}\pi_{\ell}(u,w)=1$ . As we will show later, this is critical for estimating $\eta(w)$ in $O(\log{n\over\delta}/\varepsilon^{2})$ time. Secondly, we have $\sum_{\ell=0}^{\infty}\sum_{u\in V}\pi_{\ell}(u,w)=n\pi(w)$ . This property relates SimRank with the reverse PageRank, and thus is essential for achieving sublinear query time.

Recall that given a source node $u$, our goal is to estimate SimRank values $s(u,v)$ with additive error $\varepsilon$ for any node $v\in V$. By Equation (6), we can decompose the query process into three subroutines: 1) Given a source node $u$, compute the $\ell$-hop RPPR values $\pi_{\ell}(u,w)$ for all nodes $w\in V$; 2) Compute the last-meeting probabilities $\eta(w)$ for each $w\in V$; 3) For any node $v\in V$, compute the $\ell$-hop RPPR values $\pi_{\ell}(v,w)$ to any target node $w$. For the first task, we can employ a simple Monte Carlo algorithm which generates a number $n_{r}=O(\log{n\over\delta}/\varepsilon^{2})$ of $\sqrt{c}$-walks from $u$ and uses the proportion of $\sqrt{c}$-walks that terminate at $w$ in exactly $\ell$ steps to approximate $\pi_{\ell}(u,w)$. This algorithm runs in $O(\log{n\over\delta}/\varepsilon^{2})$ time, so we will focus on the remaining two tasks.
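The Monte Carlo step for the first subroutine can be sketched as follows. The constant hidden in $n_{r}=O(\log{n\over\delta}/\varepsilon^{2})$ is not specified here, so the `const` parameter below is a hypothetical tuning knob, and the function names and toy graph are illustrative assumptions. The sketch also assumes every node has at least one in-neighbor.

```python
import math
import random
from collections import Counter


def num_walks(n, eps, delta, const=1.0):
    """n_r = const * log(n / delta) / eps^2, where const is a hypothetical constant."""
    return math.ceil(const * math.log(n / delta) / eps ** 2)


def l_hop_rppr_from_source(in_nbrs, u, c, n_r, seed=0):
    """Estimate pi_l(u, w) as the fraction of walks from u that end at w after exactly l steps."""
    rng = random.Random(seed)
    move_prob = math.sqrt(c)
    hits = Counter()
    for _ in range(n_r):
        node, steps = u, 0
        while rng.random() < move_prob:
            node = rng.choice(in_nbrs[node])  # assumes non-empty in-neighbor lists
            steps += 1
        hits[(steps, node)] += 1
    # Keys are (level, node) pairs; only pairs that were actually hit are stored.
    return {key: count / n_r for key, count in hits.items()}


# Hypothetical toy graph as in-neighbor lists.
toy = {"h": ["a"], "a": ["h"], "b": ["h"], "c": ["h"]}
n_r = num_walks(n=len(toy), eps=0.05, delta=0.01)
est = l_hop_rppr_from_source(toy, "b", c=0.6, n_r=n_r)
print(n_r, est[(0, "b")])
```

Storing only the $(\ell,w)$ pairs that were hit keeps the output size bounded by the number of walks, independent of $n$.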

3.2. Computing Last Meeting Probability

The first challenge is how to estimate $\eta(w)$ for each $w\in V$ efficiently. SLING (TX16, ) generates $n_{r}=\Theta\left({\log{n\over\delta}\over\varepsilon^{2}}\right)$ pairs of $\sqrt{c}$-walks for each $w\in V$, and obtains an approximation to $\eta(w)$ with error $\varepsilon$ for each $w\in V$. However, this solution leads to a preprocessing time of $O\left({n\log{n\over\delta}\over\varepsilon^{2}}\right)$, and thus is not feasible if we need a small error $\varepsilon$ on large graphs.

Our first key insight is that, instead of estimating the $\ell$-hop RPPR $\pi_{\ell}(u,w)$ and last meeting probability $\eta(w)$ separately, we can estimate their product $\eta(w)\pi_{\ell}(u,w)$ in the query phase, using only $n_{r}=\Theta\left({\log{n\over\delta}\over\varepsilon^{2}}\right)$ samples. More precisely, we observe that $\eta(w)\pi_{\ell}(u,w)$ is the probability that a $\sqrt{c}$-walk from $u$ terminates at $w$ in $\ell$ steps, and then, two independent $\sqrt{c}$-walks from $w$ do not meet. Therefore, we can generate a $\sqrt{c}$-walk $\mathcal{W}(u)$ from $u$, and then two $\sqrt{c}$-walks $\mathcal{W}_{1}(w)$ and $\mathcal{W}_{2}(w)$ from the node $w$ where $\mathcal{W}(u)$ terminates. If $\mathcal{W}_{1}(w)$ and $\mathcal{W}_{2}(w)$ do not meet, we set the estimator $\widehat{\eta\pi}_{\ell}(u,w)=1$ (and to $0$ otherwise). This way we obtain an unbiased estimator for each $\eta(w)\pi_{\ell}(u,w)$, $w\in V$ and $\ell=0,\ldots,\infty$. We also note that the summation $\sum_{w\in V}\sum_{\ell=0}^{\infty}\eta(w)\pi_{\ell}(u,w)\leq\sum_{w\in V}\sum_{\ell=0}^{\infty}\pi_{\ell}(u,w)=1$, which means we can use Chernoff bound A.1 to estimate $\eta(w)\pi_{\ell}(u,w)$ with additive error $\varepsilon$ for any $w\in V,\ell\geq 0$ with only $n_{r}=\Theta\left({\log{n\over\delta}\over\varepsilon^{2}}\right)$ samples.
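The joint estimator described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`step`, `never_meet`, `eta_pi_estimates`) are hypothetical, a walk is assumed to stop at a node without in-neighbors, and the number of samples is passed in directly rather than derived from $\varepsilon$ and $\delta$.

```python
import math
import random
from collections import Counter


def step(in_nbrs, node, move_prob, rng):
    """Advance a sqrt(c)-walk by one step; return the next node, or None if the walk stops."""
    if rng.random() >= move_prob or not in_nbrs[node]:
        return None
    return rng.choice(in_nbrs[node])


def never_meet(in_nbrs, w, move_prob, rng):
    """True if two independent sqrt(c)-walks from w share no node at any step i >= 1."""
    a = b = w
    while True:
        a, b = step(in_nbrs, a, move_prob, rng), step(in_nbrs, b, move_prob, rng)
        if a is None or b is None:
            return True
        if a == b:
            return False


def eta_pi_estimates(in_nbrs, u, c, n_r, seed=0):
    """Estimate eta(w) * pi_l(u, w) for every (l, w) reached by one of n_r sampled walks."""
    rng = random.Random(seed)
    move_prob = math.sqrt(c)
    est = Counter()
    for _ in range(n_r):
        # Walk W(u): record where it terminates and after how many steps.
        w, level = u, 0
        while True:
            nxt = step(in_nbrs, w, move_prob, rng)
            if nxt is None:
                break
            w, level = nxt, level + 1
        # Walks W1(w), W2(w): the sample counts only if they never meet.
        if never_meet(in_nbrs, w, move_prob, rng):
            est[(level, w)] += 1.0 / n_r
    return dict(est)


# Hypothetical toy graph in which every node has exactly one in-neighbor.
toy = {"h": ["a"], "a": ["h"], "b": ["h"], "c": ["h"]}
est = eta_pi_estimates(toy, "b", c=0.6, n_r=20000)
print(sum(est.values()))
```

In this toy graph every node has a single in-neighbor, so two walks from any $w$ meet exactly when both take a first step. Hence $\eta(w)=1-c$ for all $w$, and the estimates should sum to roughly $1-c=0.4$.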

3.3. Precomputing RPPR to Hub Nodes

Given a target node $w$, computing the $\ell$-hop RPPR $\pi_{\ell}(v,w)$ for every node $v\in V$ is time-consuming, especially when $w$ is a hub node with many out-neighbors. Therefore, we use an index to help reduce the cost. SLING (TX16, ) proposes the following approach: for each (source) node $v$, we precompute $\pi_{\ell}(v,w)$ for any $w\in V$ and put $\pi_{\ell}(v,w)$ into an inverted list, so we can efficiently retrieve $\pi_{\ell}(v,w),v\in V$ for a given target node $w$. This approach, however, essentially builds an index for every target node $w\in V$ and results in an index of size $O\left({n\over\varepsilon}\right)$, which is usually significantly larger than the graph size $m$ for reasonably small $\varepsilon$.

To reduce the index size, we propose to build an index only for hub nodes. In particular, we identify the $j_{0}$ nodes with the largest reverse PageRanks as hub nodes, where $j_{0}$ is a user-specified parameter. We then perform the backward search (lofgren2015personalized, ) algorithm on each hub node $w$ to precompute $\pi_{\ell}(v,w)$ for any $v\in V$ and any $\ell>0$. The definition of hub nodes is based on two intuitions. First, recall that the reverse PageRank of node $w$ is the probability that a $\sqrt{c}$-walk from a random node $u$ terminates at $w$. Therefore, a hub node $w$ is more likely to be visited in a single-source SimRank query on $u$. Second, since $\sum_{\ell=0}^{\infty}\sum_{v\in V}\pi_{\ell}(v,w)=n\pi(w)$, a hub node will also have more $(\ell,v)$-tuples with $\pi_{\ell}(v,w)>\varepsilon$, which makes it more difficult to compute $\pi_{\ell}(v,w)$ on the fly. Therefore, precomputing $\pi_{\ell}(v,w)$ for the nodes $w$ with the highest reverse PageRank reduces the query cost most effectively. We also note that we can choose the value of $j_{0}$ to balance the query time, index size and preprocessing time. For ease of presentation, we select $j_{0}$ such that the index size is bounded by $O(m)$ in this section.

Algorithm 1 illustrates the pseudocode for the preprocessing algorithm. For reasons we shall see later, for each node $x$ with out-neighbor set $\mathcal{O}(x)=\{y_{1},\ldots,y_{d}\}$, we store the adjacency list of $x$ in a way such that $d_{in}(y_{1})\leq\ldots\leq d_{in}(y_{d})$. To sort the adjacency list of each node in total $O(m+n)$ time, we first construct a tuple $(x,y,d_{in}(y))$ for each edge $(x,y)\in E$. Then we employ the counting sort algorithm to sort the $m$ tuples $(x,y,d_{in}(y))$ according to the ascending order of $d_{in}(y)$. Since $d_{in}(y)$ is an integer in range $[0,n]$, the counting sort algorithm runs in time $O(m+n)$. We then scan the $m$ sorted tuples and, for each tuple $(x,y,d_{in}(y))$, we append $y$ to the end of $x$'s out-adjacency list. This algorithm sorts the out-adjacency list of each node in $O(m+n)$ time (Lines 1-4). We then calculate the reverse PageRank of each node $w\in V$, and retrieve the $j_{0}$ nodes with the largest reverse PageRank as the hub nodes (Line 5). For each hub node $w$, we use backward search (lofgren2015personalized, ) to compute an estimator $\psi_{\ell}(v,w)$ for the $\ell$-hop RPPR $\pi_{\ell}(v,w)$, for each $\ell=0,\ldots,\infty$ and $v\in V$. More precisely, we first set the residue $r_{\ell}(v,w)=0$ and the reserve $\psi_{\ell}(v,w)=0$ for each node $v$ and each $\ell=0,\ldots,\infty$. Then, we set $r_{0}(w,w)=1$ and the residue threshold $r_{max}={(1-\sqrt{c})^{2}\varepsilon\over 12}$ (Lines 6-8). Note that we choose the constant $(1-\sqrt{c})^{2}$ to compensate for the denominator $(1-\sqrt{c})^{2}$ in equation (6), and the constant $12$ so that the various errors sum up to at most $\varepsilon$. Starting from level $0$, we traverse from $w$, following the out-going edges of each node (Line 9). On visiting a node $v$ at level $\ell$, we check if $v$'s residue $r_{\ell}(v,w)$ is larger than the threshold $r_{max}$. If so, for each out-neighbor $z$ of $v$, we increase the residue $r_{\ell+1}(z,w)$ of $z$ at level $\ell+1$ by $\sqrt{c}\cdot\frac{r_{\ell}(v,w)}{d_{in}(z)}$ (Lines 10-12). Next, we increase $\psi_{\ell}(v,w)$, $v$'s backward reserve at level $\ell$, by $(1-\sqrt{c})r_{\ell}(v,w)$, so that the reserve of $w$ at level $0$ agrees with $\pi_{0}(w,w)=1-\sqrt{c}$ (Line 13). After that, we reset $v$'s backward residue $r_{\ell}(v,w)$ to $0$ (Line 14). After all nodes $v$ with residue $r_{\ell}(v,w)>r_{max}$ are processed, we append tuples $(v,\psi_{\ell}(v,w))$ to a list $L_{\ell}(w)$ for each $v$ with reserve $\psi_{\ell}(v,w)>r_{max}$ (Lines 15-17). Note that for each node $w$ and each level $\ell$ with at least one $\psi_{\ell}(v,w)>r_{max}$, we store all tuples $(v,\psi_{\ell}(v,w))$ with $\psi_{\ell}(v,w)>\varepsilon$ in a list $L_{\ell}(w)$, so we can quickly retrieve them given $w$ and $\ell$ in the query phase. The following lemma can be directly derived from (lofgren2015personalized, ).
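The adjacency-list sorting step (Lines 1-4) can be illustrated with a short Python sketch. It is a minimal version under two assumptions: nodes are numbered $0,\ldots,n-1$, and the graph has no parallel edges, so every in-degree lies in $[0,n]$. The function name `sort_out_adjacency` is hypothetical.

```python
def sort_out_adjacency(n, edges):
    """Return (out_adj, d_in), where out_adj[x] lists the out-neighbors of x
    in ascending order of in-degree. Runs in O(m + n) time."""
    d_in = [0] * n
    for _, y in edges:
        d_in[y] += 1
    # Counting sort: bucket k holds the edges (x, y) with d_in(y) == k.
    buckets = [[] for _ in range(n + 1)]
    for x, y in edges:
        buckets[d_in[y]].append((x, y))
    out_adj = [[] for _ in range(n)]
    for bucket in buckets:            # buckets are visited in ascending in-degree
        for x, y in bucket:
            out_adj[x].append(y)      # appending keeps every list sorted
    return out_adj, d_in
```

Because the edges are scanned in globally sorted order, every per-node list comes out sorted without any per-node sorting call.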

Lemma 3.1 ((lofgren2015personalized, )).

For any hub node $w$, any $v\in V$ and $\ell\geq 0$, Algorithm 1 ensures $|\psi_{\ell}(v,w)-\pi_{\ell}(v,w)|<r_{max}={(1-\sqrt{c})^{2}\varepsilon\over 12}$.

We have the following lemma that bounds the space usage and running time of Algorithm 1 on worst-case graphs.

Lemma 3.2.

The size of the index generated by Algorithm 1 is bounded by $O\left({n\over\varepsilon}\sum_{j=1}^{j_{0}}\pi(w_{j})\right)$ . The preprocessing time is bounded by $O\left(m\over\varepsilon\right)$ .

We set $j_{0}$ so that $O\left({n\over\varepsilon}\sum_{j=1}^{j_{0}}\pi(w_{j})\right)=O(m)$ in the theoretical analysis of PRSim, for ease of presentation. Note that if the largest reverse PageRank $\pi(w_{1})$ satisfies $\pi(w_{1})>\varepsilon m/n$ , we need to set $j_{0}=0$ , in which case PRSim becomes an index-free algorithm. However, in practice, we can manipulate $j_{0}$ to get a tradeoff between the index size and query cost.

3.4. Sampling RPPR to Non-Hub Nodes

The third key component of our method is a sampling-based algorithm that efficiently computes $\ell$-hop RPPR values to non-hub target nodes (i.e., nodes that have small reverse PageRank values and thus are not in the index). Given a node $w$, the goal is to provide an unbiased estimator $\hat{\pi}_{\ell}(v,w)$ for $\pi_{\ell}(v,w)$ for each $v\in V$ and any $\ell\geq 0$. Once we obtain such a sampler, we can estimate each $\pi_{\ell}(v,w)$ with additive error $\varepsilon$ using $\log{n\over\delta}/\varepsilon^{2}$ samples. ProbeSim (liu2017probesim, ) provides such a sampler by employing a Randomized Probe algorithm, which runs in $O(n)$ time for a single sample. This time complexity, however, is unacceptable if we want sublinear query time.

In this section, we propose an algorithm that achieves the following goals: 1) given a node $w$, the algorithm provides an unbiased estimator $\hat{\pi}_{\ell}(v,w)$ for $\pi_{\ell}(v,w)$, for each $v\in V$ and any $\ell\geq 0$; 2) the algorithm runs in $O(n\pi(w))$ expected time; 3) the variance of $\hat{\pi}_{i}(v,w)$ is bounded, so we can use Chebyshev's inequality to bound the error, and the Median Trick to boost the success probability. Regarding the second goal, note that $n\pi(w)=\sum_{i=0}^{\infty}\sum_{v\in V}\pi_{i}(v,w)$ is the expected output size and consequently the minimum cost for generating unbiased estimators $\hat{\pi}_{i}(v,w)$ for $i=0,\ldots,\infty$, $v\in V$.

Simple Backward Walk with Unbounded Variance. For ease of exposition, we first present a simple Backward Walk that achieves the first two goals. The pseudocode is illustrated by Algorithm 2. Given a node $w$ and a level $\ell$, this algorithm gives an unbiased estimator $\hat{\pi}_{\ell}(v,w)$ for each $v\in V$. We first initialize $\hat{\pi}_{0}(w,w)=1-\sqrt{c}$ and $\hat{\pi}_{\ell}(x,w)=0$ for other $\ell$ or $x\in V$ (Lines 1-2). Then, we iterate $i$ from $0$ to $\ell-1$ (Line 3). At level $i$, for each $x\in V$ with non-zero $\hat{\pi}_{i}(x,w)$, we generate a random number $r$ from $(0,1)$ (Lines 4-5), and scan the out-neighbors of $x$ until we encounter the first node $y$ with $d_{in}(y)>{\sqrt{c}\over r}$. Recall that in the preprocessing phase, we sort the out-adjacency list of $x$ so that nodes in $\mathcal{O}(x)$ are ordered according to their in-degrees (see Algorithm 1). Therefore, we only have to visit the nodes with $d_{in}(y)\leq{\sqrt{c}\over r}$, which form a prefix of $\mathcal{O}(x)$. For each out-neighbor $y$ of $x$ with $d_{in}(y)\leq{\sqrt{c}\over r}$, we add $\hat{\pi}_{i}(x,w)$ to $\hat{\pi}_{i+1}(y,w)$ (Lines 6-7). Finally, after level $\ell-1$ is processed, we return each non-zero $\hat{\pi}_{\ell}(v,w)$ as the estimator for $\pi_{\ell}(v,w)$ (Line 8).
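The simple Backward Walk just described can be sketched as follows. This is a hedged Python illustration of the prose, not the paper's Algorithm 2 verbatim: it assumes nodes are integers, that `out_adj[x]` is already sorted by ascending in-degree, and it draws $r$ from $(0,1]$ instead of $(0,1)$, which differs only on a set of probability zero. The function name `simple_backward_walk` is hypothetical.

```python
import random


def simple_backward_walk(out_adj, d_in, w, level, c, rng):
    """One run of the simple Backward Walk.

    Returns {v: estimate of pi_level(v, w)} with zero entries omitted.
    out_adj[x] must list the out-neighbors of x in ascending in-degree order.
    """
    sqrt_c = c ** 0.5
    current = {w: 1.0 - sqrt_c}              # estimates at level 0
    for _ in range(level):
        nxt = {}
        for x, value in current.items():
            r = 1.0 - rng.random()           # uniform in (0, 1]
            for y in out_adj[x]:
                if d_in[y] * r > sqrt_c:     # first y with d_in(y) > sqrt(c) / r
                    break                    # all later neighbors fail as well
                nxt[y] = nxt.get(y, 0.0) + value
        current = nxt
    return current
```

A neighbor $y$ passes the test with probability $\min\{1,\sqrt{c}/d_{in}(y)\}=\sqrt{c}/d_{in}(y)$, which is the factor used in the unbiasedness argument.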

We can use a simple induction to prove the unbiasedness of Algorithm 2. For the base case, we have $\mathrm{E}[\hat{\pi}_{0}(w,w)]=1-\sqrt{c}=\pi_{0}(w,w)$ . Assume that $\mathrm{E}[\hat{\pi}_{i}(x,w)]=\pi_{i}(x,w)$ for any $x\in V$ . For a node $y$ at level $i+1$ , each $\hat{\pi}_{i}(x,w),x\in\mathcal{I}(y)$ is added to $\hat{\pi}_{i+1}(y,w)$ with probability ${\sqrt{c}\over d_{in}(y)}$ , and thus $\mathrm{E}[\hat{\pi}_{i+1}(y,w)]=\sum_{x\in\mathcal{I}(y)}{\sqrt{c}\over d_{in}(y)}\mathrm{E}[\hat{\pi}_{i}(x,w)]$ . Therefore, we have $\mathrm{E}[\hat{\pi}_{i+1}(y,w)]=\sum_{x\in\mathcal{I}(y)}{\sqrt{c}\over d_{in}(y)}\pi_{i}(x,w)=\pi_{i+1}(y,w).$ To analyze the running time, note that the cost for computing $\hat{\pi}_{i}(x,w)$ is bounded by the number of times that $\hat{\pi}_{i}(x,w)$ is incremented. Since each increment adds at least $(1-\sqrt{c})$ to $\hat{\pi}_{i}(x,w)$ , this cost is bounded by ${\hat{\pi}_{i}(x,w)\over 1-\sqrt{c}}$ . Summing over $i=0,\ldots,\infty$ and $x\in V$ , and using equation (4), the total cost is at most $O(n\pi(w))$ .

Unfortunately, the estimator $\hat{\pi}_{\ell}(v,w)$ returned by Algorithm 2 can be unbounded, since we may sum up all estimators from level $i$ to form an estimator of level $i+1$. To make things worse, it is even unclear whether $\hat{\pi}_{\ell}(v,w)$ has bounded variance. This means that $\hat{\pi}_{\ell}(v,w)$ may not be sub-gaussian or sub-exponential, and thus we are unable to apply concentration inequalities to bound the error.

Variance Bounded Backward Walk. To overcome the drawback of the simple Backward Walk, we propose the Variance Bounded Backward Walk algorithm, which achieves bounded variance without sacrificing the $O(n\pi(w))$ query bound or the unbiasedness guarantee. Algorithm 3 illustrates the pseudocode of the Variance Bounded Backward Walk algorithm. We set $\hat{\pi}_{0}(w,w)=1-\sqrt{c}$ and $\hat{\pi}_{\ell}(x,w)=0$ for other $\ell$ or $x\in V$ (Lines 1-2). Then we iterate $i$ from $0$ to $\ell-1$ (Line 3). At level $i$, for each $x\in V$ with non-zero $\hat{\pi}_{i}(x,w)$, we first generate a random number $r_{0}$ so that we can stop the process at $x$ with probability $1-\sqrt{c}$ (Lines 4-5). With probability $\sqrt{c}$, we first scan through the out-neighbors of $x$ until we encounter the first node $y$ with $d_{in}(y)>{\hat{\pi}_{i}(x,w)\over 1-\sqrt{c}}$. For each out-neighbor $y$ with $d_{in}(y)\leq{\hat{\pi}_{i}(x,w)\over 1-\sqrt{c}}$, we increase $\hat{\pi}_{i+1}(y,w)$ by ${\hat{\pi}_{i}(x,w)\over d_{in}(y)}$ (Lines 6-7). Then, we choose a random number $r$ from $(0,1)$ (Line 8), and continue to scan the out-neighbors of $x$ until we encounter the first node $y$ with $d_{in}(y)>{\hat{\pi}_{i}(x,w)\over r(1-\sqrt{c})}$. Again, we only visit a prefix of $\mathcal{O}(x)$, as the nodes in $\mathcal{O}(x)$ are ordered according to their in-degrees. For each of these remaining out-neighbors $y$ of $x$ with $d_{in}(y)\leq{\hat{\pi}_{i}(x,w)\over r(1-\sqrt{c})}$, we increment $\hat{\pi}_{i+1}(y,w)$ by $1-\sqrt{c}$ (Lines 9-10). After $\ell$ levels are processed, we return all non-zero $\hat{\pi}_{\ell}(v,w)$ as estimators for $\pi_{\ell}(v,w)$ (Line 11).
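The Variance Bounded Backward Walk can be sketched in the same style. This is a minimal Python reading of the description above, not the paper's Algorithm 3 verbatim. It assumes integer node ids and an out-adjacency list sorted by ascending in-degree, draws $r$ from $(0,1]$, draws it before the scan (which does not change the distribution), and writes both kinds of increments to level $i+1$, as the unbiasedness argument requires. The function name is hypothetical.

```python
import random


def variance_bounded_backward_walk(out_adj, d_in, w, level, c, rng):
    """One run; returns {v: estimate of pi_level(v, w)} with zeros omitted."""
    sqrt_c = c ** 0.5
    stop = 1.0 - sqrt_c
    current = {w: stop}                       # estimate at level 0
    for _ in range(level):
        nxt = {}
        for x, value in current.items():
            if rng.random() >= sqrt_c:        # stop at x with probability 1 - sqrt(c)
                continue
            r = 1.0 - rng.random()            # uniform in (0, 1]
            for y in out_adj[x]:
                if d_in[y] * stop <= value:
                    # Low in-degree: deterministic share of the estimate.
                    nxt[y] = nxt.get(y, 0.0) + value / d_in[y]
                elif d_in[y] * stop * r <= value:
                    # Higher in-degree: fixed increment 1 - sqrt(c), taken with
                    # probability value / (d_in(y) * (1 - sqrt(c))).
                    nxt[y] = nxt.get(y, 0.0) + stop
                else:
                    break                     # sorted list: nothing further qualifies
        current = nxt
    return current
```

In both branches the expected contribution of $x$ to $y$ is $\sqrt{c}\,\hat{\pi}_{i}(x,w)/d_{in}(y)$, while a single increment in the random branch never exceeds $1-\sqrt{c}$.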

Analysis. We prove three properties of the Variance Bounded Backward Walk algorithm. First, the algorithm gives an unbiased estimator $\hat{\pi}_{i}(v,w)$ for $\pi_{i}(v,w)$ for each $v\in V$ and $i\leq\ell$. In particular, we have the following lemma.

Lemma 3.3.

Consider a node $v$ on a target level $\ell$, and let $\hat{\pi}_{\ell}(v,w)$ be an estimator provided by Algorithm 3. We have $\mathrm{E}[\hat{\pi}_{\ell}(v,w)]=\pi_{\ell}(v,w)$.

Next, we show that the running time of Algorithm 3 on node $w$ is proportional to its reverse PageRank $\pi(w)$ . In particular, we have the following lemma.

Lemma 3.4.

The complexity of Algorithm 3 on node $w$ , regardless of the target level $\ell$ , is bounded by $O(n\pi(w))$ .

Note that $n\pi(w)=\sum_{i=0}^{\infty}\sum_{v\in V}\pi_{i}(v,w)$, which implies that the minimum number of operations to return an unbiased estimator $\hat{\pi}_{i}(v,w)$ for each $\pi_{i}(v,w)$ is $\Omega(n\pi(w))$. This essentially means that Algorithm 3 achieves optimal sampling complexity for this task.

Finally, we note that although the estimator $\hat{\pi}_{\ell}(v,w)$ is unbiased, it may be unbounded on certain graphs. To see this, consider a graph that has $n+2$ nodes $w,v,x_{1},\ldots,x_{n}$ . For each $i=1,\ldots,n$ , there is an edge from $w$ to $x_{i}$ and an edge from $x_{i}$ to $v$ . Suppose we run Algorithm 2 on node $w$ with target level $\ell=2$ . The algorithm first sets $\hat{\pi}_{0}(w,w)=1-\sqrt{c}$ . For each $i=1,\ldots,n$ , the algorithm sets $\hat{\pi}_{1}(x_{i},w)=1-\sqrt{c}$ with probability $\sqrt{c}$ . This means there are approximately $\sqrt{c}$ fraction of $x_{i}$ ’s with $\hat{\pi}_{1}(x_{i},w)=1-\sqrt{c}$ . Finally, for each $i=1,\ldots,n$ and $\hat{\pi}_{1}(x_{i},w)=1-\sqrt{c}$ , the algorithm increments $\hat{\pi}_{2}(v,w)$ by $1-\sqrt{c}$ with probability ${1\over n}$ . This implies that in the worst-case, all $\hat{\pi}_{1}(x_{i},w)=1-\sqrt{c}$ for $i=1,\ldots,n$ , and $\hat{\pi}_{2}(v,w)$ can be as large as $(1-\sqrt{c})n$ .

Fortunately, we can bound the variance of Algorithm 3, which enables us to use the Median Trick to boost accuracy. The following lemma states that the variance of $\hat{\pi}_{\ell}(v,w)$ is bounded by $\pi_{\ell}(v,w)$ , the actual value of the $\ell$ -hop RPPR.

Lemma 3.5.

For any level $\ell\geq 0$ and node $v\in V$ , we have $\mathrm{Var}\left[\hat{\pi}_{\ell}(v,w)\right]\leq\mathrm{E}\left[\hat{\pi}_{\ell}(v,w)^{2}\right]\leq\pi_{\ell}(v,w).$

3.5. Putting Things Together

Based on the definition of hub nodes, we divide the SimRank value $s(u,v)$ of nodes $u$ and $v$ into two terms $s(u,v)=s_{I}(u,v)+s_{B}(u,v),$ where

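$s_{I}(u,v)={1\over(1-\sqrt{c})^{2}}\sum_{\ell=0}^{\infty}\sum_{j=1}^{j_{0}}\pi_{\ell}(u,w_{j})\,\pi_{\ell}(v,w_{j})\,\eta(w_{j})$ (7)

and

$s_{B}(u,v)={1\over(1-\sqrt{c})^{2}}\sum_{\ell=0}^{\infty}\sum_{j=j_{0}+1}^{n}\pi_{\ell}(u,w_{j})\,\pi_{\ell}(v,w_{j})\,\eta(w_{j})$ (8)

Here $w_{1},\ldots,w_{j_{0}}$ denote the hub nodes and $w_{j_{0}+1},\ldots,w_{n}$ the remaining nodes.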

The PRSim algorithm uses the precomputed index to generate an estimator $\hat{s}_{I}(u,v)$ for $s_{I}(u,v)$, and uses backward walks to generate an estimator $\hat{s}_{B}(u,v)$ for $s_{B}(u,v)$.

Algorithm 4 shows the pseudocode of the query algorithm for PRSim. Given a source node $u$ on a directed graph $G=(V,E)$, a decay factor $c$ and an error parameter $\varepsilon$, the algorithm returns an estimator $\hat{s}(u,v)$ for each $v\in V$. We set the constant $c_{1}={12\over(1-\sqrt{c})^{2}}$, the number of samples in a round to $d_{r}={c_{1}\over\varepsilon^{2}}$, the number of rounds to $f_{r}=3\log{n\over\delta}$, and the total sample number to $n_{r}=d_{r}f_{r}=\Theta\left({\log{n\over\delta}\over\varepsilon^{2}}\right)$ (Line 1). Note that for the constant $c_{1}$, we choose $(1-\sqrt{c})^{2}$ to compensate for the denominator $(1-\sqrt{c})^{2}$ in equation (6), and $12$ so that the various errors sum up to at most $\varepsilon$. We choose the value of $d_{r}$ according to Chernoff bound A.1, and the value of $f_{r}$ according to the Median Trick A.3. Then we initialize the estimators $\hat{s}(u,v)$, $\hat{s}_{I}(u,v)$, $\hat{s}_{B}(u,v)$ and $\hat{s}_{B}^{i}(u,v)$ to be $0$ for $v\in V$ and $i=1,\ldots,f_{r}$ (Line 2). We also set $\widehat{\eta\pi}_{\ell}(u,w)$, the estimator for $\eta(w)\cdot\pi_{\ell}(u,w)$, to be $0$ for $w\in V$ and $\ell=0,\ldots,\infty$ (Line 3). Note that in order to achieve sublinear query time, we can use hash maps to store only the non-zero entries in $\hat{s}$, $\hat{s}_{B}$, $\hat{s}_{I}$, $\hat{s}_{B}^{i}$ and $\widehat{\eta\pi}$.
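To make the parameter choices on Line 1 concrete, the following Python sketch computes $c_{1}$, $d_{r}$, $f_{r}$ and $n_{r}$ for given inputs. The text leaves the logarithm base and the rounding unspecified; the sketch assumes the natural logarithm and rounds up, and the function name `prsim_sample_sizes` is hypothetical.

```python
import math


def prsim_sample_sizes(n, c, eps, delta):
    """Sample-size parameters from Line 1 of the query algorithm.

    Assumptions (not fixed by the text): natural logarithm, ceiling rounding.
    """
    c1 = 12.0 / (1.0 - math.sqrt(c)) ** 2       # constant c_1
    d_r = math.ceil(c1 / eps ** 2)              # samples per round
    f_r = math.ceil(3.0 * math.log(n / delta))  # number of rounds
    return {"c1": c1, "d_r": d_r, "f_r": f_r, "n_r": d_r * f_r}
```

For example, `prsim_sample_sizes(n=10**6, c=0.6, eps=0.1, delta=0.01)` gives 56 rounds under these assumptions; the per-round sample count grows as $1/\varepsilon^{2}$.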

For each $i$ from $1$ to $f_{r}$ and $j$ from $1$ to $d_{r}$, we sample a $\sqrt{c}$-walk $\mathcal{W}(u)$ from $u$ (Lines 4-6). If $\mathcal{W}(u)$ terminates at node $w$ in $\ell$ steps, we further sample a pair of $\sqrt{c}$-walks $\mathcal{W}_{1}(w)$ and $\mathcal{W}_{2}(w)$ from $w$ (Line 8). Recall that the probability that the two $\sqrt{c}$-walks do not meet is exactly $\eta(w)$. If this event happens, we increase the estimator $\widehat{\eta\pi}_{\ell}(u,w)$ by ${1\over n_{r}}$ (Lines 9-10). If $w$ is not stored in the index, we estimate $\pi_{\ell}(v,w)$ for each $v\in V$ with Algorithm 3, and increase the $i$-th estimator $\hat{s}_{B}^{i}(u,v)$ by ${\hat{\pi}_{\ell}(v,w)\over(1-\sqrt{c})^{2}d_{r}}$ for each $v\in V$ (Lines 11-13). After $n_{r}=d_{r}\cdot f_{r}$ samples are processed, we return $\hat{s}_{B}(u,v)=\textrm{Median}_{1\leq i\leq f_{r}}\hat{s}_{B}^{i}(u,v)$ as an estimator for $s_{B}(u,v)$ (Lines 14-15). Again, to ensure sublinear query time, we only compute the median for a node $v$ if there is at least one non-zero $\hat{s}_{B}^{i}(u,v)$ for some $1\leq i\leq f_{r}$. Finally, for each $(w,\ell)$-tuple with $\widehat{\eta\pi}_{\ell}(u,w)>{\varepsilon\over c_{1}}$ and $w$ in the index, we retrieve the reserve $\psi_{\ell}(v,w)$ for each $v\in V$ from the index, and increase $\hat{s}_{I}(u,v)$ by ${\widehat{\eta\pi}_{\ell}(u,w)\cdot\psi_{\ell}(v,w)\over(1-\sqrt{c})^{2}}$ (Lines 16-18). We return all non-zero $\hat{s}(u,v)=\hat{s}_{I}(u,v)+\hat{s}_{B}(u,v)$ as the estimator for $s(u,v)$, for $v\in V$ (Line 19).
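The median step over sparse per-round estimators can be sketched as below. This is an illustrative Python fragment, assuming each round's estimator $\hat{s}_{B}^{i}(u,\cdot)$ is kept as a dictionary with zero entries omitted; the name `median_of_rounds` is hypothetical.

```python
from statistics import median


def median_of_rounds(round_estimates):
    """Combine f_r sparse round estimators {v: value} into their per-node median.

    Only nodes that are non-zero in at least one round are examined, so the
    cost depends on the number of touched nodes and not on n.
    """
    touched = set()
    for est in round_estimates:
        touched.update(est)
    result = {}
    for v in touched:
        # A node missing from a round has estimate 0 in that round.
        m = median(est.get(v, 0.0) for est in round_estimates)
        if m != 0.0:
            result[v] = m
    return result
```

A node that is non-zero in fewer than half of the rounds gets median $0$ and is dropped from the output.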

Error Analysis. We now analyze the overall error bounds of the PRSim algorithm. Recall that given a source node $u$ and a target node $v$, $s(u,v)=s_{I}(u,v)+s_{B}(u,v)$, where $s_{I}(u,v)$ and $s_{B}(u,v)$ are defined by equations (7) and (8), respectively. Algorithm 4 uses the index to generate an estimator $\hat{s}_{I}(u,v)$ for each $s_{I}(u,v),v\in V$, and uses backward walks to generate an estimator $\hat{s}_{B}(u,v)$ for each $s_{B}(u,v),v\in V$. We have the following two lemmas that bound the errors of the two approximations.

Lemma 3.6.

Given a source node $u$ , for any $v\in V$ , Algorithm 4 provides an estimator $\hat{s}_{I}(u,v)$ for $s_{I}(u,v)$ such that:

[TABLE]

Lemma 3.7.

Given a source node $u$ , for any $v\in V$ , Algorithm 4 provides an estimator $\hat{s}_{B}(u,v)$ for $s_{B}(u,v)$ such that:

[TABLE]

Combining Lemmas 3.6 and 3.7, it follows that

[TABLE]

Applying the union bound over the $n$ nodes yields Theorem 3.8.

Theorem 3.8.

PRSim answers single-source SimRank queries with additive error $\varepsilon$ with probability at least $1-\delta$ .

Query Time Analysis for Worst-Case Graphs. We first analyze the query time of the PRSim algorithm on worst-case graphs. Given a node $u\in V$, let $C(u)$ denote the query cost of PRSim on $u$, and let $C={1\over n}\sum_{u\in V}C(u)$ denote the average query cost. We divide $C(u)$ into three terms: $C(u)=C_{F}(u)+C_{I}(u)+C_{B}(u),$ where $C_{F}(u)$ denotes the cost of computing $\widehat{\eta\pi}_{\ell}(u,w)$ from source node $u$, $C_{I}(u)$ denotes the cost of retrieving the reserves $\psi_{\ell}(v,w)$ from the index, and $C_{B}(u)$ denotes the cost of estimating $\hat{\pi}_{\ell}(v,w)$ with backward walks. Let $C_{F}={1\over n}\sum_{u\in V}C_{F}(u)$, $C_{I}={1\over n}\sum_{u\in V}C_{I}(u)$ and $C_{B}={1\over n}\sum_{u\in V}C_{B}(u)$ denote the averages of $C_{F}(u)$, $C_{I}(u)$ and $C_{B}(u)$, respectively. We can express the expected average query cost of Algorithm 4 as $\mathrm{E}[C]=\mathrm{E}[C_{F}]+\mathrm{E}[C_{I}]+\mathrm{E}[C_{B}].$

For $\mathrm{E}[C_{F}]$ , recall that we generate a number $n_{r}=\Theta\left({\log{n\over\delta}\over\varepsilon^{2}}\right)$ of $\sqrt{c}$ -walks to estimate $\widehat{\eta\pi}_{\ell}(u,w)$ . Since each $\sqrt{c}$ -walk takes constant time, we have $C_{F}(u)=O\left({\log{n\over\delta}\over\varepsilon^{2}}\right)$ , and $\mathrm{E}[C_{F}]=O\left({\log{n\over\delta}\over\varepsilon^{2}}\right)$ . We have the following lemmas for $\mathrm{E}[C_{I}]$ and $\mathrm{E}[C_{B}]$ .

Lemma 3.9.

Let $c_{1}={12\over(1-\sqrt{c})^{2}}$ and $C_{I}$ denote the average cost for querying the index. We have

[TABLE]

Lemma 3.10.

Let $C_{B}$ denote the average cost for performing Variance Bounded Backward Walks. We have $\mathrm{E}[C_{B}]=O\left({n\log{n\over\delta}\over\varepsilon^{2}}\sum_{j=j_{0}+1}^{n}\pi(w_{j})^{2}\right).$

By Lemma 3.9, we have $\mathrm{E}[C_{I}]\leq O\left({n\log{n\over\delta}\over\varepsilon^{2}}\sum_{j=j_{0}+1}^{n}\pi(w_{j})^{2}\right)$. Combining this with Lemma 3.10 yields Theorem 3.11.

Theorem 3.11.

Suppose the query node $u$ is uniformly chosen from $V$ . The expected query cost of PRSim on worst-case graphs is bounded by

[TABLE]

Query Time Analysis for Power-Law Graphs. Recall that on a power-law graph, the fractions $P_{o}(k)$ and $P_{i}(k)$ of nodes with out- and in-degree at least $k$ satisfy $P_{o}(k)\sim k^{-\gamma}$ and $P_{i}(k)\sim k^{-\gamma^{\prime}}$ (BollobasBCR03, ), where $\gamma$ and $\gamma^{\prime}$ are the cumulative power-law exponents, which usually take values from $1$ to $3$. It is shown in (BahmaniCG10, ; lofgren2015personalized, ; wei2018topppr, ) that the PageRank of a power-law graph also follows a power law with the same exponent $\gamma^{\prime}$ as the in-degree distribution. Thus, the reverse PageRank follows the same power-law distribution as the out-degree distribution. In particular, let $P_{\pi}(x)$ denote the fraction of nodes with reverse PageRank value at least $x$; then $P_{\pi}(x)\sim x^{-\gamma}.$

Now consider the following alternative statement of the above power-law distribution: let $w_{1},\ldots,w_{n}$ denote the nodes in the graph sorted in descending order of their reverse PageRank values, that is, $\pi(w_{1})\geq\pi(w_{2})\geq\ldots\geq\pi(w_{n})$. We have that the $j$-th largest reverse PageRank value $\pi(w_{j})$ is proportional to $j^{-\beta}$. Here $\beta$ is the power-law exponent, which takes a value in $(0,1)$. This assumption has been widely adopted in the literature on PageRank computation (BahmaniCG10, ; lofgren2015personalized, ; wei2018topppr, ). To understand the relation between the two exponents $\gamma$ and $\beta$, note that there are $j$ nodes with reverse PageRank value at least $x={\kappa j^{-\beta}\over n^{1-\beta}}$, and thus we have $j\sim\left({j^{-\beta}\over n^{1-\beta}}\right)^{-\gamma}\sim j^{\beta\cdot\gamma}.$ It follows that $\beta={1\over\gamma}$. Therefore, for power-law graphs, we have

$\pi(w_{j})=\kappa\cdot{j^{-{1\over\gamma}}\over n^{1-{1\over\gamma}}}$ (12)

where $\kappa$ is a normalization constant such that $\kappa\sum_{j=1}^{n}{j^{-{1\over\gamma}}\over n^{1-{1\over\gamma}}}=1$ .

Combining equation (12) and Lemma 3.2, the index size is bounded by $O\left({n\over\varepsilon}\sum_{j=1}^{j_{0}}{j^{-{1\over\gamma}}\over n^{1-{1\over\gamma}}}\right)=O\left({n\over\varepsilon}\cdot{j_{0}^{1-{1\over\gamma}}\over n^{1-{1\over\gamma}}}\right)=O\left({n^{1\over\gamma}j_{0}^{1-{1\over\gamma}}\over\varepsilon}\right).$ Here we use the property of the Riemann zeta function (see Lemma A.4). By setting $j_{0}=n(\varepsilon\bar{d})^{\gamma\over\gamma-1}$, the index size is bounded by $O\left({n^{1\over\gamma}n^{1-{1\over\gamma}}\varepsilon\bar{d}\over\varepsilon}\right)=O(n\bar{d})=O(m).$ Plugging $\pi(w_{j})=\kappa\cdot{j^{-{1\over\gamma}}\over n^{1-{1\over\gamma}}}$ and $j_{0}=n(\varepsilon\bar{d})^{\gamma\over\gamma-1}$ into Lemma 3.9 and Lemma 3.10, we obtain the following theorem.
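The arithmetic behind the choice of $j_{0}$ can be checked numerically. The Python sketch below evaluates $j_{0}=n(\varepsilon\bar{d})^{\gamma/(\gamma-1)}$ and the leading term $n^{1/\gamma}j_{0}^{1-1/\gamma}/\varepsilon$ of the index-size bound, dropping $\kappa$ and the constants hidden in the $O(\cdot)$ notation. The function name and the example values are illustrative assumptions; the formula is meaningful when $\gamma>1$ and $\varepsilon\bar{d}\leq 1$, so that $j_{0}\leq n$.

```python
def hub_count_and_index_size(n, avg_deg, eps, gamma):
    """Return (j0, leading term of the index-size bound) for a power-law graph.

    Constants hidden by the O() notation and the normalization kappa are dropped.
    """
    j0 = n * (eps * avg_deg) ** (gamma / (gamma - 1.0))
    index_size = n ** (1.0 / gamma) * j0 ** (1.0 - 1.0 / gamma) / eps
    return j0, index_size
```

With $n=10^{6}$, $\bar{d}=10$, $\varepsilon=0.01$ and $\gamma=2$, this gives $j_{0}=10^{4}$ hub nodes and a leading term of $10^{7}=n\bar{d}=m$, matching the $O(m)$ bound.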

Theorem 3.12.

Assume that the out-degree distribution of the graph follows power-law distribution with exponent $\gamma\geq 1$ , and let $\varepsilon\geq\log^{\gamma-1\over 2-\gamma}n/(n^{\gamma-1\over\gamma}\bar{d}^{2-\gamma})$ , $\delta>1/n^{\Omega(1)}$ . Suppose the query node $u$ is uniformly chosen from $V$ . By setting $j_{0}=n(\varepsilon\bar{d})^{\gamma\over\gamma-1}$ , the expected cost of Algorithm 4 is bounded by

[TABLE]

The size of the index generated by Algorithm 1 is bounded by $O(m)$ . The preprocessing time is bounded by $O\left(m\over\varepsilon\right)$ .

Dynamic Graphs. Our algorithm is able to support dynamic graphs where edges may be inserted or deleted. Recall that PRSim generates the index by performing the backward search algorithm. It is shown in (ZhangLG16, ) that the results of the backward search to a randomly selected target node $w$ can be maintained with cost $O(k+{\bar{d}\over\varepsilon})$ , where $k$ is the total number of insertions/deletions. Since our index stores the results of the backward search for $j_{0}$ target nodes, it can process $k$ insertions/deletions in $O(kj_{0}+{m\over\varepsilon})$ time. Therefore, the per-update-cost for processing $k$ updates is bounded by $O(j_{0}+{m\over\varepsilon k})$ . However, a thorough investigation of this issue is beyond the scope of our paper.

4. Related Work

In what follows, we briefly review some of the state-of-the-art solutions for SimRank computation. We exclude SLING (TX16, ), which we have discussed in Section 2.

Monte Carlo and READS. Based on the $\sqrt{c}$-walk interpretation, we can use the following Monte Carlo algorithm (FRCS05, ; TX16, ) to estimate the SimRank value $s(u,v)$: we generate $n_{r}$ pairs of $\sqrt{c}$-walks from $u$ and $v$, and use the fraction of pairs that meet as an estimate of $s(u,v)$. Using a concentration inequality, one can show that by setting $n_{r}=\Theta\left({\log{n\over\delta}\over\varepsilon^{2}}\right)$, the Monte Carlo algorithm estimates $s(u,v)$ with an additive error $\varepsilon$ with probability at least $1-\delta$. For a single-source query on node $u$, we can generate $n_{r}$ walks from each node $v\in V$ and estimate $s(u,v)$ with additive error $\varepsilon$. The query cost is $O\left({n\log{n\over\delta}\over\varepsilon^{2}}\right)$, which is inefficient on large graphs.
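For reference, the single-pair Monte Carlo estimator can be sketched as follows. This Python fragment is an illustration under assumptions, not code from the cited works: a $\sqrt{c}$-walk stops with probability $1-\sqrt{c}$ per step and otherwise moves to a uniformly random in-neighbor, and a pair of walks meets if both are at the same node at the same step (including step $0$, so that $s(u,u)=1$). The function names are hypothetical.

```python
import random


def sample_walk(in_nbrs, start, sqrt_c, rng):
    """Return the node sequence of one sqrt(c)-walk (assumed walk model)."""
    path = [start]
    while in_nbrs[path[-1]] and rng.random() < sqrt_c:
        path.append(rng.choice(in_nbrs[path[-1]]))
    return path


def monte_carlo_simrank(in_nbrs, u, v, c, n_r, seed=0):
    """Estimate s(u, v) as the fraction of n_r walk pairs that meet."""
    rng = random.Random(seed)
    sqrt_c = c ** 0.5
    met = 0
    for _ in range(n_r):
        pu = sample_walk(in_nbrs, u, sqrt_c, rng)
        pv = sample_walk(in_nbrs, v, sqrt_c, rng)
        # zip stops at the shorter walk, so only steps where both are alive count.
        if any(a == b for a, b in zip(pu, pv)):
            met += 1
    return met / n_r
```

Answering a single-source query this way means calling the estimator once per node $v$, which is the source of the factor $n$ in the query cost.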

A recent work proposes the READS algorithm (jiang2017reads, ) based on the Monte Carlo approach. READS pre-computes the $\sqrt{c}$ -walks from each node, and compresses the $\sqrt{c}$ -walks by merging them into trees. Given a query node $u$ , READS retrieves the $\sqrt{c}$ -walks starting from $u$ , finds all $\sqrt{c}$ -walks that meet with $u$ ’s $\sqrt{c}$ -walks, and then updates the SimRank estimator for each $v$ related to these $\sqrt{c}$ -walks. Several optimization techniques were adopted to improve the query efficiency of READS. The major issue of READS is that it requires generating and storing a large number of $\sqrt{c}$ -walks from each node in the preprocessing phase. The query cost also remains $O(n\log{n\over\delta}/\varepsilon^{2})$ , which is the same as that of the classic Monte Carlo algorithm.

ProbeSim. ProbeSim (liu2017probesim, ) is an index-free algorithm that computes single-source and top-$k$ SimRank queries on large graphs. Given a query node $u$, the ProbeSim algorithm samples a $\sqrt{c}$-walk $\mathcal{W}(u)$ from $u$. For a node $w$ visited by $\mathcal{W}(u)$ at the $\ell$-th step, the algorithm performs a Probe procedure that computes the probability of a $\sqrt{c}$-walk from each node $v$ visiting $w$ at the $\ell$-th step. To rule out the possibility that a pair of $\sqrt{c}$-walks meets multiple times, the Probe algorithm avoids the nodes previously visited by $\mathcal{W}(u)$. It is shown in (liu2017probesim, ) that the ProbeSim algorithm gives an unbiased estimator for the SimRank values $s(u,v),v\in V$. Therefore, by repeating the sampling procedure $O(\log{n\over\delta}/\varepsilon^{2})$ times, ProbeSim answers single-source SimRank queries with additive error $\varepsilon$ with probability at least $1-\delta$.

There are two subtle problems with ProbeSim. First, to avoid multiple meeting nodes, the Probe from node $w$ has to avoid the nodes on $\mathcal{W}(u)$ , which means it is impossible to pre-compute the Probe results to speed up the query time. Second, as we will show later, the probability that a node $w$ in the graph is visited by the $\sqrt{c}$ -walk from $u$ is proportional to $\pi(w)$ , the reverse PageRank of $w$ . On the other hand, the complexity of the Probe algorithm on $w$ is also proportional to $\pi(w)$ . This essentially means it is likely that a hub node with high reverse PageRank value is visited by the $\sqrt{c}$ -walk from $u$ , and it will incur significant cost in the Probe phase. Finally, the algorithm also requires $O(n\log{n\over\delta}/\varepsilon^{2})$ query cost to answer a single-source query.

TSF. TSF (SLX15, ) is a two-stage random-walk sampling algorithm for single-source and top-$k$ SimRank queries on dynamic graphs. Given a parameter $R_{g}$, TSF starts by building $R_{g}$ one-way graphs as an index structure. Each one-way graph is constructed by uniformly sampling one in-neighbor from each vertex's incoming edges. The one-way graphs are then used to simulate random walks during query processing. To achieve high efficiency, TSF allows two $\sqrt{c}$-walks to meet multiple times, and thus overestimates the actual SimRank values. Furthermore, TSF assumes that no random walk contains a cycle, which does not hold in practice.

Other Related Work. Power method (JW02, ) is the classic algorithm that computes all-pair SimRank similarities for a given graph. Let $S$ be the SimRank matrix such that $S_{ij}=s(i,j)$ , and $A$ be the transition matrix of $G$ . Power method recursively computes the SimRank Matrix $S$ using the following formula (KMK14, )

$S=(cA^{\top}SA)\vee I$ (14)

where $\vee$ is the element-wise maximum operator. Several follow-up works (LVGT10, ; YZL12, ; YuJulie15gauging, ) improve the power method in terms of either efficiency or accuracy. However, these methods still incur $O(n^{2})$ space overheads, as there are $O(n^{2})$ pairs of nodes in the graph. A recent work (wang2018efficientsimrank, ) reduces the cost to $O(NNZ)$, where $NNZ$ is the number of node pairs with large SimRank similarities. However, as shown in (wang2018efficientsimrank, ), a constant fraction of the $O(n^{2})$ node pairs still have large SimRank similarities, so the worst-case complexity remains $O(n^{2})$.

Motivated by the difficulty of dealing with the element-wise maximum operator $\vee$ in Equation (14), some existing works (FNSO13, ; He10, ; Yu13, ; Li10, ; Yu14, ; YuM15b, ; KMK14, ) consider the following alternative formula for SimRank:

$S=cA^{\top}SA+(1-c)I$ (15)

However, it is shown that the similarities calculated by this formula are different from SimRank (KMK14, ).

For single-source queries, Fogaras and Rácz (FRCS05, ) propose a Monte Carlo algorithm that uses random walks to approximate SimRank values. Maehara et al. (MKK14, ) propose an index structure for top- $k$ SimRank queries, but it relies on heuristic assumptions about $G$ , and hence, does not provide any worst-case error guarantee. Li et al. (LiFL15, ) propose a distributed version of the Monte Carlo approach in (FRCS05, ), but it achieves scalability at the cost of significant computation resources. Finally, there is existing work on variants of SimRank (AMC08, ; FR05, ; YuM15a, ; ZhaoHS09, ) and on various graph applications (bhuiyan2018representing, ; ye2018using, ; lee2018evaluations, ), but the proposed solutions are inapplicable for top- $k$ and single-source SimRank queries.

5. Experiments

This section experimentally evaluates the proposed solutions against the state of the art. All experiments are conducted on a machine with a Xeon(R) CPU [email protected] CPU and 196GB memory.

5.1. Experimental Settings

Methods. We compare PRSim against five SimRank algorithms: READS (jiang2017reads, ), SLING (TX16, ), TSF (SLX15, ), ProbeSim (liu2017probesim, ) and TopSim (LeeLY12, ). As mentioned in Section 4, READS, SLING and TSF are the state-of-the-art index-based methods, and ProbeSim and TopSim are the state-of-the-art index-free methods.

Ground Truth for single-pair queries. Given a pair of nodes $u$ and $v$, we use the Monte Carlo algorithm to estimate $s(u,v)$ with high precision, and then use the result as the ground truth for $s(u,v)$. In particular, we set the parameters of the Monte Carlo algorithm such that it incurs an error less than $0.00001$ with confidence over $99.999\%$.

Pooling. We extend the pooling idea (liu2017probesim, ) to evaluate the effectiveness of the single-source algorithms on large graphs. Given a source node $u$ , we run each single-source algorithm, order the nodes according to their estimated SimRank values, and retrieve the top- $k$ nodes. We merge the top- $k$ nodes returned by each algorithm, remove the duplicates, and put them into a pool. As such, if we were to evaluate $\ell$ algorithms, then the pool size is between $k$ and $\ell k$ . For each node $v$ in the pool, we obtain the ground truth of $s(u,v)$ using the Monte Carlo algorithm, and retrieve $V_{k}=\{v_{1},\ldots,v_{k}\}$ , namely, the $k$ nodes with the highest SimRank values from the pool.
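The pooling procedure can be sketched as follows. This is a minimal illustration under stated assumptions: each algorithm's output is represented as a hypothetical dictionary from node to estimated SimRank value, and `ground_truth` is any callable that returns the reference value of $s(u,v)$ for a pooled node.

```python
def top_k(scores, k):
    """Return the k nodes with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def build_pool(estimates_by_algorithm, k):
    """Merge the top-k nodes of every algorithm and drop duplicates."""
    pool = set()
    for scores in estimates_by_algorithm.values():
        pool.update(top_k(scores, k))
    return pool


def pooled_top_k(pool, ground_truth, k):
    """Return V_k: the k pooled nodes with the highest ground-truth values."""
    return sorted(pool, key=ground_truth, reverse=True)[:k]


# Hypothetical estimates from two algorithms for one source node.
estimates = {
    "alg1": {"a": 0.9, "b": 0.5, "c": 0.1},
    "alg2": {"b": 0.8, "c": 0.7, "a": 0.1},
}
pool = build_pool(estimates, k=2)
truth = {"a": 0.5, "b": 0.6, "c": 0.2}
v_k = pooled_top_k(pool, truth.get, k=2)
```

With $\ell=2$ algorithms and $k=2$ , the pool here has three nodes, which lies between $k$ and $\ell k$ as stated above.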

Metrics. To evaluate the absolute error of single-source SimRank algorithms, we calculate the average absolute error for approximating $s(u,v_{i})$ for each $v_{i}$ in the pool. More precisely, for each $v_{i}\in V_{k}$ returned by the pool, let $\hat{s}(u,v_{i})$ be the estimator for $s(u,v_{i})$ returned by the algorithm to be evaluated. We set

$$AvgError@k={1\over k}\sum_{i=1}^{k}\left|\hat{s}(u,v_{i})-s(u,v_{i})\right|.$$

To evaluate the algorithms’ abilities to return the top- $k$ results, we use $V_{k}=\{v_{1},\ldots,v_{k}\}$ as the ground truth for the top- $k$ nodes. Note that these nodes are the best possible results that can be returned by any of the algorithms to be evaluated. Let $V_{k}^{\prime}=\{v_{1}^{\prime},\ldots,v_{k}^{\prime}\}$ denote the top- $k$ node set returned by the algorithm to be evaluated. Note that Precision@k evaluates how many correct (or best possible) nodes are included in $V_{k}^{\prime}$ .
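Both metrics are simple to compute once the pooled ground truth is available. The sketch below assumes that AvgError@k averages the absolute error over the ground-truth top- $k$ set $V_{k}$ and that Precision@k is the fraction $|V_{k}\cap V_{k}^{\prime}|/k$ ; the dictionary-based inputs and function names are hypothetical.

```python
def avg_error_at_k(truth, estimate, k):
    """Mean absolute error over the k pooled nodes with highest true value."""
    v_k = sorted(truth, key=truth.get, reverse=True)[:k]
    # Nodes the algorithm did not return are treated as estimated at 0.
    return sum(abs(estimate.get(v, 0.0) - truth[v]) for v in v_k) / len(v_k)


def precision_at_k(truth, estimate, k):
    """Fraction of the algorithm's top-k nodes that belong to V_k."""
    v_k = set(sorted(truth, key=truth.get, reverse=True)[:k])
    v_k_prime = set(sorted(estimate, key=estimate.get, reverse=True)[:k])
    return len(v_k & v_k_prime) / k


# Hypothetical values for a single source node.
truth = {"a": 0.5, "b": 0.4, "c": 0.1}
estimate = {"a": 0.45, "b": 0.1, "c": 0.3}
err = avg_error_at_k(truth, estimate, k=2)
prec = precision_at_k(truth, estimate, k=2)
```

In this toy case the algorithm ranks `c` above `b`, so only one of its two returned nodes is in $V_{2}$ .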

5.2. Experiments on Real-World Graphs

We evaluate the tradeoffs between accuracy and complexity for each algorithm on real-world graphs. We use $5$ data sets, as shown in Table 3. All data sets are obtained from public sources (SNAP, ; LWA, ).

Parameters. SLING (TX16, ) has a parameter $\varepsilon_{a}$ , the upper bound on the absolute error. We vary $\varepsilon_{a}$ in $\{0.5,0.1,0.05,0.01,0.005\}$ , where $\varepsilon_{a}=0.05$ is the default value in (TX16, ). TSF has two parameters $R_{g}$ and $R_{q}$ , where $R_{g}$ is the number of one-way graphs stored in the index, and $R_{q}$ is the number of times each one-way graph is reused in the query stage. We vary $(R_{g},R_{q})$ in $\{(10,2),(100,20),(200,30),(300,40),(600,80)\}$ , where (SLX15, ) sets $(R_{g},R_{q})=(300,40)$ by default. TopSim has four internal parameters $T$ , $h$ , $\eta$ and $H$ , where $T$ is the depth of the random walks, $1/h$ is the minimal degree threshold used to identify a high degree node, $\eta$ is the similarity threshold for trimming a random walk, and $H$ is the number of random walks to be expanded at each level. We fix $H$ and $\eta$ to their default values $100$ and $0.001$ , and vary $(T,1/h)$ in $\{(1,10),(3,100),(3,1000),(3,10000),(4,10000)\}$ . Note that (LeeLY12, ) sets $(T,1/h)=(3,100)$ by default. The READS paper (jiang2017reads, ) proposed three algorithms: READS, READS-D, and READS-Rq. We only include the static version of READS in our experiments, as it is the fastest among the three (jiang2017reads, ). READS has two parameters $r$ and $t$ , where $r$ is the number of $\sqrt{c}$ -walks generated for each node in the preprocessing stage and $t$ is the maximum depth of the $\sqrt{c}$ -walks. We vary $(r,t)$ in $\{(10,2),(50,5),(100,10),(500,10),(1000,20)\}$ , where $(r,t)=(100,10)$ is the default setting in (jiang2017reads, ). For ProbeSim (liu2017probesim, ), we vary the error parameter $\varepsilon_{a}$ in $\{0.5,0.1,0.05,0.01,0.005\}$ , where $\varepsilon_{a}=0.1$ is the default setting in (liu2017probesim, ). For PRSim, we vary $\varepsilon$ in $\{0.5,0.1,0.05,0.01,0.005\}$ . We also set $j_{0}$ to $\sqrt{n}$ so that the index size of PRSim increases with $1/\varepsilon$ .
We fix the failure probability $\delta=0.0001$ unless otherwise specified. We set the decay factor $c$ of SimRank to 0.6, following previous work (MKK14, ; Yu13, ; LVGT10, ; YuM15a, ; YuM15b, ).

Experimental results. On each data set, we issue 100 single-source queries and 100 top- $50$ queries for each algorithm and each parameter set, and record the averages of the query time, index sizes, preprocessing time, AvgError@50 and Precision@50. For each algorithm and each dataset, we omit a parameter set if it runs out of 196GB memory or takes over 10 hours to finish queries or preprocessing on that data set.

Figures 2 and 3 show the tradeoffs between AvgError@50 and the query time and the tradeoffs between Precision@50 and the query time, respectively. The overall observation is that PRSim outperforms all competitors by achieving lower error and higher precision with less query time on all datasets. Most notably, on the TW dataset, PRSim achieves a Precision@50 of $92\%$ using a query time of $5$ seconds, while the closest competitor, ProbeSim, achieves a precision of around $75\%$ using over 50 seconds. Furthermore, on the 5-billion-edge UK data set, PRSim is one of only two index-based algorithms that are able to finish preprocessing and queries, which demonstrates the scalability of our algorithm. We also note that the query times of SLING and READS are not sensitive to the choice of parameters. This is as expected, since the majority of their query cost is spent on reading the index, which is a cache-friendly task. After observing the skewed trend of READS on DB in Figure 2, we evaluate an extra parameter set $(r,t)=(5000,20)$ to see if READS can outperform PRSim in terms of the tradeoff between query time and error, given significantly more indexing space. The result shows that PRSim still achieves better accuracy with less query time.

Figures 4 and 5 show the tradeoffs between AvgError@50 and the index size and the tradeoffs between AvgError@50 and the preprocessing time, respectively. Again, our algorithm manages to outperform all index-based algorithms (SLING, TSF, READS) by achieving a lower error with a smaller index and less preprocessing time. In particular, on the DB dataset, our algorithm is able to achieve an average error of $10^{-3}$ using an index of size 200MB, while the closest competitor READS needs 100GB.

5.3. Experiments on Synthetic Data Sets

We now evaluate PRSim and the competitors with fixed parameters on synthetic datasets with varying network structure and sizes. We set $\varepsilon_{a}=0.25$ for SLING, $R_{g}=300$ and $R_{q}=40$ for TSF, $T=3$ , $1/h=100$ , $\eta=0.001$ , and $H=100$ for TopSim, $\varepsilon_{a}=0.25$ for ProbeSim, $r=100$ and $t=10$ for READS, and $\varepsilon=0.25$ for PRSim. We fix the failure probability $\delta=0.001$ unless otherwise specified. On each data set, we issue 100 single-source queries with each algorithm to be evaluated, and report the corresponding measures.

Hardness of SimRank computation and degree distributions. We first investigate the relation between the hardness of SimRank computation and degree distributions. We generate a set of undirected power-law graphs with various power-law exponents using the hyperbolic graph generator (aldecoa2015hyperbolic, ). In particular, we fix the number of nodes $n$ to be $100,000$ and the average degree $\bar{d}$ to be $10$ , and vary the degree power-law exponent $\gamma$ from $1$ to $9$ . Figure 6(a) reports the average query time of each algorithm. Recall that the theoretical analysis of PRSim suggests that its query time increases with $1/\gamma$ . Figure 6(a) concurs with this analysis. In fact, we observe that the query time of all algorithms follows a trend similar to the function $y=1/\gamma$ on the log-log plot: the query time decreases as we increase $\gamma$ from $1$ to $4$ , and becomes stable after $\gamma>4$ . Based on this observation and on the theoretical analysis for PRSim, we make the following conjecture:

Conjecture 1.

The hardness of SimRank computation is correlated to the reciprocal of the power-law exponent $\gamma$ of the out-degree distribution.

Scalability analysis. To evaluate the scalability of our algorithm, we generate synthetic power-law graphs by fixing the exponent $\gamma=3$ and average degree $\bar{d}=10$ , and vary the graph size $n$ from $10^{4}$ to $10^{7}$ . Figure 6(b) shows the running time of PRSim on these graphs. The results show that the running time of PRSim forms a concave curve in a log-log plot, which is consistent with the sublinearity of PRSim.
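One way to quantify such a plot is the least-squares slope on log-log axes: a slope below $1$ over the measured range is consistent with sublinear growth. The sketch below is a generic fitting helper with a hypothetical name; the sample curve is a synthetic $\sqrt{n}$ function chosen for illustration, not a measurement from this paper.

```python
import math


def loglog_slope(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Graph sizes matching the range used in the scalability experiment.
sizes = [10 ** 4, 10 ** 5, 10 ** 6, 10 ** 7]
# Synthetic running times that grow like sqrt(n); purely illustrative.
synthetic_times = [n ** 0.5 for n in sizes]
slope = loglog_slope(sizes, synthetic_times)
```

For the synthetic curve the fitted slope is $0.5$ , while a running time proportional to $n$ would give a slope of $1$ .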

Experiments on non-power-law Graphs. We generate random graphs using the Erdős and Rényi (ER) model, where we assign an edge to each node pair with a user-specified probability $p$ . We fix the number of nodes to $n=10,000$ and set the value of $p$ so that the average degree $\bar{d}$ of each graph varies from $5$ to $10,000$ . Figure 7 shows the query time of each algorithm on these synthetic graphs. We observe that the query performance of ProbeSim degrades dramatically as we increase $\bar{d}$ . On the other hand, PRSim is able to answer queries on very dense graphs efficiently. We attribute this quality to the fact that the Randomized Probe algorithm in ProbeSim always goes through all out-neighbors of a target node, while our Variance Bounded Backward Walk algorithm only needs to visit a fraction of the out-neighbors.
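The relation between $p$ and the target average degree can be made concrete with a small generator. This is a sketch that assumes an undirected simple graph, for which the expected degree is $p(n-1)$ and hence $p=\bar{d}/(n-1)$ ; the function name and the quadratic-time edge loop are illustrative and only suitable for small $n$ .

```python
import random


def er_graph(n, avg_degree, seed=0):
    """Sample an undirected ER graph whose expected average degree is avg_degree."""
    rng = random.Random(seed)
    p = avg_degree / (n - 1)  # each node has n - 1 potential neighbors
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            # Each unordered node pair becomes an edge independently.
            if rng.random() < p:
                edges.append((u, v))
    return edges


edges = er_graph(n=400, avg_degree=10)
observed_avg_degree = 2 * len(edges) / 400
```

The observed average degree fluctuates around the target because the edge count is a sum of independent coin flips.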

6. Conclusions

This paper presents PRSim, an algorithm for single-source SimRank queries. PRSim connects the time complexity of SimRank computation with the distribution of the reverse PageRank, and achieves sublinear query time on power-law graphs with small index size. Our experiments show that the algorithm significantly outperforms the existing methods in terms of query time, accuracy, index size and scalability.

7. ACKNOWLEDGEMENTS

This research was supported in part by National Natural Science Foundation of China (No. 61832017 and No. 61732014), by MOE, Singapore under grant MOE2015-T2-2-069, and by NUS, Singapore under an SUG. Sibo Wang was supported by CUHK Direct Grant No. 4055114. He was also supported by the CUHK University Startup Grant No. 4930911 and No. 5501570.

Appendix A Inequalities

A.1. Chernoff Bound

Lemma A.1 (Chernoff Bound (ChungL06, )).

For a set $\{x_{i}\}$ ( $i\in[1,n_{r}]$ ) of i.i.d. random variables with mean $\mu$ and $x_{i}\in[0,1]$ , $\Pr\left[\left|{1\over n_{r}}\sum_{i=1}^{n_{r}}x_{i}-\mu\right|\geq\varepsilon\right]\leq\exp\left(-\dfrac{n_{r}\cdot\varepsilon^{2}}{\frac{2}{3}\varepsilon+2\mu}\right).$
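Solving the bound for $n_{r}$ gives a sample count that suffices for additive error $\varepsilon$ with failure probability at most $\delta$ . The sketch below takes the inequality exactly as stated in Lemma A.1 and assumes the natural logarithm; the helper names are hypothetical.

```python
import math


def chernoff_tail(n_r, eps, mu):
    """Right-hand side of the bound in Lemma A.1."""
    return math.exp(-n_r * eps ** 2 / (2.0 * eps / 3.0 + 2.0 * mu))


def chernoff_samples(eps, delta, mu=1.0):
    """Smallest n_r for which the bound in Lemma A.1 is at most delta."""
    return math.ceil((2.0 * eps / 3.0 + 2.0 * mu) * math.log(1.0 / delta) / eps ** 2)


# mu = 1 is the worst case for variables in [0, 1].
n_r = chernoff_samples(eps=0.1, delta=0.01, mu=1.0)
```

The required sample count scales as $\log{1\over\delta}/\varepsilon^{2}$ when $\mu$ is treated as a constant, which matches the form of $n_{r}$ used in the proof of Lemma 3.6.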

A.2. Chebyshev’s Inequality

Lemma A.2 (Chebyshev’s inequality).

Let $X$ be a random variable, then $\Pr\left[\left|X-E[X]\right|\geq\varepsilon\right]\leq{\mathrm{Var}[X]\over\varepsilon^{2}}.$

A.3. Median Trick

Lemma A.3 ((charikar2002finding, )).

Let $X_{1},\ldots,X_{k}$ be $k\geq 3\log{1\over\delta}$ i.i.d. random variables, such that $\Pr\left[\left|X_{i}-E[X_{i}]\right|\geq\varepsilon\right]\leq{1\over 3}$ . Let $X=\textrm{Median}_{1\leq i\leq k}X_{i}$ , then $\Pr\left[\left|X-E[X]\right|\geq\varepsilon\right]\leq\delta$ .
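The median trick is typically combined with averaging: samples are split into groups, each group mean serves as one $X_{i}$ , and the median of the group means is returned. A minimal sketch, assuming the logarithm in the lemma is natural; both helper names are hypothetical.

```python
import math
import statistics


def median_trick_repetitions(delta):
    """Number of repetitions k = ceil(3 * ln(1 / delta)) from Lemma A.3."""
    return math.ceil(3.0 * math.log(1.0 / delta))


def median_of_means(samples, num_groups):
    """Split samples into equal groups, average each, and take the median."""
    group_size = len(samples) // num_groups
    means = []
    for g in range(num_groups):
        group = samples[g * group_size:(g + 1) * group_size]
        means.append(sum(group) / len(group))
    return statistics.median(means)


# One outlier (100) corrupts a single group mean but not the median.
samples = [1, 1, 1, 100, 2, 2, 2, 2, 3, 3, 3, 3]
robust_estimate = median_of_means(samples, num_groups=3)
```

The outlier shifts one group mean to $25.75$ , yet the median of the three group means stays at $3$ , which is why the failure probability decays exponentially in the number of groups.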

A.4. Partial sum of Riemann zeta function

Lemma A.4.

The partial sum of Riemann zeta function satisfies the following property:

[TABLE]
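The growth regimes that the proof of Theorem 3.11 relies on can be checked numerically. The sketch below assumes the standard behaviour of the partial sums $\sum_{j=1}^{n}j^{-s}$ : they grow like $n^{1-s}$ for $s<1$ , like $\log n$ for $s=1$ , and stay bounded for $s>1$ . The function name is hypothetical.

```python
import math


def partial_zeta(n, s):
    """Partial sum of the Riemann zeta function: sum of j^(-s) for j = 1..n."""
    return sum(j ** -s for j in range(1, n + 1))


n = 10 ** 5
# s < 1: the sum is within a constant factor of n^(1 - s).
ratio_half = partial_zeta(n, 0.5) / n ** 0.5
# s = 1: the sum is within a constant factor of log n.
ratio_one = partial_zeta(n, 1.0) / math.log(n)
# s > 1: the sum is bounded by a constant (here zeta(2) = pi^2 / 6).
bounded = partial_zeta(n, 2.0)
```

In the proof of Theorem 3.11 the same three regimes appear with $s=2\beta$ , which is why the case analysis splits at $\beta=1/2$ .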

Appendix B Proofs

B.1. Proof of Lemma 3.2

Proof.

Let $w_{1},\ldots,w_{n}$ be the nodes of the graph sorted in descending order of the reverse PageRank value $\pi(w_{j})$ . Let $size(w_{j})$ denote the index size for node $w_{j}$ . Then, $size=\sum_{j=1}^{j_{0}}size(w_{j})$ is the total size of the index. For each $w_{j}$ , recall that Algorithm 1 uses backward search to find each node $x$ and level $\ell$ with $\ell$ -hop RPPR $\pi_{\ell}(x,w_{j})\geq\varepsilon$ , and records the tuple $(x,\ell,\pi_{\ell}(x,w_{j}))$ . Hence, the space usage $size(w_{j})$ is bounded by the total number of pairs $(x,\ell)$ with $\ell$ -hop RPPR $\pi_{\ell}(x,w_{j})\geq\varepsilon$ , i.e., $size(w_{j})\leq\sum_{\ell=0}^{\infty}\sum_{x\in V}I(\pi_{\ell}(x,w_{j})\geq\varepsilon),$ where $I(\pi_{\ell}(x,w_{j})\geq\varepsilon)$ is an indicator function that equals $1$ if $\pi_{\ell}(x,w_{j})\geq\varepsilon$ and $0$ otherwise. We observe that $I(\pi_{\ell}(x,w_{j})\geq\varepsilon)\leq{\pi_{\ell}(x,w_{j})\over\varepsilon}$ , and thus $size(w_{j})\leq\sum_{\ell=0}^{\infty}\sum_{x\in V}{\pi_{\ell}(x,w_{j})\over\varepsilon}={n\pi(w_{j})\over\varepsilon}.$ ∎
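The final equality in the proof can be unpacked step by step. The derivation below is a sketch that assumes the identities $\sum_{\ell=0}^{\infty}\pi_{\ell}(x,w_{j})=\pi(x,w_{j})$ and $\sum_{x\in V}\pi(x,w_{j})=n\pi(w_{j})$ , which are the same identities invoked in the proofs of Lemma 3.9 and Lemma 3.10.

```latex
\begin{align*}
size(w_j)
  &\le \sum_{\ell=0}^{\infty}\sum_{x\in V} I\left(\pi_{\ell}(x,w_j)\ge\varepsilon\right)
   \le \frac{1}{\varepsilon}\sum_{x\in V}\sum_{\ell=0}^{\infty}\pi_{\ell}(x,w_j) \\
  &= \frac{1}{\varepsilon}\sum_{x\in V}\pi(x,w_j)
   = \frac{n\,\pi(w_j)}{\varepsilon}.
\end{align*}
```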

B.2. Proof of Lemma 3.3

Notations. We begin by defining two types of random variables. Consider a node $y$ at level $i+1$ and a node $x\in\mathcal{I}(y)$ . For ease of presentation, we let $A$ denote the set of $x\in\mathcal{I}(y)$ such that $\hat{\pi}_{i}(x,w)>d_{in}(y)(1-\sqrt{c})$ and $B$ denote the set of $x\in\mathcal{I}(y)$ such that $\hat{\pi}_{i}(x,w)\leq d_{in}(y)(1-\sqrt{c})$ . We use $R_{i}(x)$ to denote the indicator random variable of the event that the random number $r_{0}<\sqrt{c}$ . For each $x\in B$ , we define the random variable $Z_{i}(x,y)=1$ if the random number $r\leq{\hat{\pi}_{i}(x,w)\over d_{in}(y)(1-\sqrt{c})}$ , and $Z_{i}(x,y)=0$ otherwise. Recall that for a node $x\in A$ , we increment $\hat{\pi}_{i+1}(y,w)$ by ${\hat{\pi}_{i}(x,w)\over d_{in}(y)}$ if and only if $R_{i}(x)=1$ ; for a node $x\in B$ , we increment $\hat{\pi}_{i+1}(y,w)$ by $1-\sqrt{c}$ if and only if $R_{i}(x)=1$ and $Z_{i}(x,y)=1$ . We can express $\hat{\pi}_{i+1}(y,w)$ as

$$\hat{\pi}_{i+1}(y,w)=\sum_{x\in A}{\hat{\pi}_{i}(x,w)\over d_{in}(y)}R_{i}(x)+\sum_{x\in B}(1-\sqrt{c})R_{i}(x)Z_{i}(x,y).\qquad(17)$$

Proof of Lemma 3.3.

We prove the lemma by induction. For the base case, we have $\hat{\pi}_{0}(w,w)=1-\sqrt{c}=\pi_{0}(w,w)$ . Assume that $\mathrm{E}[\hat{\pi}_{i}(x,w)]=\pi_{i}(x,w)$ for any $x\in V$ . For a node $y\in V$ , we will show that $\mathrm{E}[\hat{\pi}_{i+1}(y,w)]=\pi_{i+1}(y,w)$ . Conditioning on $\hat{\pi}_{i}(x,w)$ in Equation (17), it follows that

$$E\left[\hat{\pi}_{i+1}(y,w)\mid\hat{\pi}_{i}\right]=\sum_{x\in A}{\hat{\pi}_{i}(x,w)\over d_{in}(y)}E[R_{i}(x)]+\sum_{x\in B}(1-\sqrt{c})E[R_{i}(x)Z_{i}(x,y)].$$

We have $E[R_{i}(x)]=\Pr[r_{0}\leq\sqrt{c}]=\sqrt{c}$ and

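$$E[Z_{i}(x,y)]=\Pr\left[r\leq{\hat{\pi}_{i}(x,w)\over d_{in}(y)(1-\sqrt{c})}\right]={\hat{\pi}_{i}(x,w)\over d_{in}(y)(1-\sqrt{c})}.$$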

Since $R_{i}(x)$ and $Z_{i}(x,y)$ are independent random variables, we have $E[R_{i}(x)Z_{i}(x,y)]={\sqrt{c}\hat{\pi}_{i}(x,w)\over d_{in}(y)(1-\sqrt{c})}$ . It follows that

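$$E\left[\hat{\pi}_{i+1}(y,w)\mid\hat{\pi}_{i}\right]=\sum_{x\in A}{\sqrt{c}\hat{\pi}_{i}(x,w)\over d_{in}(y)}+\sum_{x\in B}(1-\sqrt{c})\cdot{\sqrt{c}\hat{\pi}_{i}(x,w)\over d_{in}(y)(1-\sqrt{c})}=\sum_{x\in\mathcal{I}(y)}{\sqrt{c}\hat{\pi}_{i}(x,w)\over d_{in}(y)}.$$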

By the induction hypothesis, we have $E[\hat{\pi}_{i}(x,w)]=\pi_{i}(x,w)$ for $x\in\mathcal{I}(y)$ , and thus $E[\hat{\pi}_{i+1}(y,w)]=\sum_{x\in\mathcal{I}(y)}{\sqrt{c}\pi_{i}(x,w)\over d_{in}(y)}=\pi_{i+1}(y,w)$ , which proves the lemma. ∎
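The unbiasedness argument can be checked empirically for a single node $y$ . The sketch below simulates one level of the update rule described in the Notations paragraph, given fixed level- $i$ estimates for the in-neighbors of $y$ , and compares the sample mean with $\sum_{x\in\mathcal{I}(y)}{\sqrt{c}\hat{\pi}_{i}(x,w)\over d_{in}(y)}$ . The function names and the input values are hypothetical, and the code is an illustration of the rule rather than a reproduction of Algorithm 3.

```python
import random


def sample_next_level(pi_hat_in, c, rng):
    """Draw one sample of pi_hat_{i+1}(y, w) from y's in-neighbor estimates."""
    sqrt_c = c ** 0.5
    d_in = len(pi_hat_in)
    threshold = d_in * (1.0 - sqrt_c)
    total = 0.0
    for value in pi_hat_in:
        if rng.random() >= sqrt_c:  # R_i(x) = 0: no increment from x
            continue
        if value > threshold:  # x in A: deterministic increment
            total += value / d_in
        elif rng.random() <= value / threshold:  # x in B and Z_i(x, y) = 1
            total += 1.0 - sqrt_c
    return total


def expected_next_level(pi_hat_in, c):
    """Conditional expectation derived in the proof of Lemma 3.3."""
    return (c ** 0.5) * sum(pi_hat_in) / len(pi_hat_in)


rng = random.Random(0)
pi_hat_in = [0.9, 0.1]  # illustrative values: the first lands in A, the second in B
trials = 100000
sample_mean = sum(sample_next_level(pi_hat_in, 0.6, rng) for _ in range(trials)) / trials
target = expected_next_level(pi_hat_in, 0.6)
```

With $c=0.6$ and $d_{in}(y)=2$ the threshold is about $0.45$ , so the example exercises both branches of the rule, and the sample mean should approach the target as the number of trials grows.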

B.3. Proof of Lemma 3.4

Proof.

Let $cost_{i+1}(y)$ denote the number of times that $\hat{\pi}_{i+1}(y,w)$ gets incremented at level $i+1$ . Note that the total cost is bounded by $\sum_{i=0}^{\ell}\sum_{x\in V}cost_{i}(x)$ . A key observation is that each increment performed by Algorithm 3 adds at least $1-\sqrt{c}$ to $\hat{\pi}_{i+1}(y,w)$ . To see this, note that Algorithm 3 increments $\hat{\pi}_{i+1}(y,w)$ by ${\hat{\pi}_{i}(x,w)\over d_{in}(y)}$ only if $d_{in}(y)<{\hat{\pi}_{i}(x,w)\over 1-\sqrt{c}}$ , or equivalently ${\hat{\pi}_{i}(x,w)\over d_{in}(y)}>1-\sqrt{c}$ ; every other increment adds exactly $1-\sqrt{c}$ . Therefore the number of times that $\hat{\pi}_{i+1}(y,w)$ gets incremented is at most ${\hat{\pi}_{i+1}(y,w)\over 1-\sqrt{c}}$ , which equals ${\pi_{i+1}(y,w)\over 1-\sqrt{c}}$ in expectation by Lemma 3.3, and thus the expected total cost is bounded by

[TABLE]

This proves the lemma. ∎

B.4. Proof of Lemma 3.5

Proof.

We will prove $\mathrm{E}[\hat{\pi}_{\ell}(x,w)^{2}]\leq\pi_{\ell}(x,w)$ by induction. For the base case, we have $E[\hat{\pi}_{0}(w,w)^{2}]=(1-\sqrt{c})^{2}\leq\pi_{0}(w,w).$ Assume that $\mathrm{E}[\hat{\pi}_{i}(x,w)^{2}]\leq\pi_{i}(x,w)$ for any $x\in V$ . For a node $y\in V$ , we will show that $\mathrm{E}[\hat{\pi}_{i+1}(y,w)^{2}]\leq\pi_{i+1}(y,w)$ . Conditioning on $\hat{\pi}_{i}(x,w)$ for all $x\in V$ , we have

[TABLE]

We expand equation (18) into 5 terms:

[TABLE]

We use $X_{1},X_{2},X_{3},X_{4}$ and $X_{5}$ to denote these 5 terms, and calculate them individually. Since $E\left[R_{i}(x)^{2}\right]=E\left[R_{i}(x)\right]=\sqrt{c}$ , we have $X_{1}=\sum_{x\in A}{\sqrt{c}\hat{\pi}_{i}(x,w)^{2}\over d_{in}(y)^{2}}.$ Using the induction hypothesis, we have $\mathrm{E}[\hat{\pi}_{i}(x,w)^{2}]\leq\pi_{i}(x,w)$ , and thus

[TABLE]

where $S_{A}=\sum_{x\in A}{\sqrt{c}\pi_{i}(x,w)\over d_{in}(y)}$ . Since $\mathrm{E}[\hat{\pi}_{i}(x,w)]=\pi_{i}(x,w)$ , and $E\left[R_{i}(x)^{2}Z_{i}(x,y)^{2}\right]={\sqrt{c}\hat{\pi}_{i}(x,w)\over d_{in}(y)(1-\sqrt{c})}$ , we have

[TABLE]

Here we define $S_{B}=\sum_{x\in B}{\sqrt{c}\pi_{i}(x,w)\over d_{in}(y)}$ . Note that $S_{A}+S_{B}=\sum_{x\in\mathcal{I}(y)}{\sqrt{c}\pi_{i}(x,w)\over d_{in}(y)}=\pi_{i+1}(y,w).$

By the independence of $R_{i}(x_{1}),Z_{i}(x_{1},y),R_{i}(x_{2}),Z_{i}(x_{2},y)$ for $x_{1}\neq x_{2}$ , we have $X_{3}=\sum_{x_{1}\neq x_{2}\in A}{c\hat{\pi}_{i}(x_{1},w)\hat{\pi}_{i}(x_{2},w)\over d_{in}(y)^{2}}$ , $X_{4}=\sum_{x_{1}\neq x_{2}\in B}{c\hat{\pi}_{i}(x_{1},w)\hat{\pi}_{i}(x_{2},w)\over d_{in}(y)^{2}}$ , $X_{5}=\sum_{x_{1}\in A,x_{2}\in B}{c\hat{\pi}_{i}(x_{1},w)\hat{\pi}_{i}(x_{2},w)\over d_{in}(y)^{2}}$ . Therefore, $X_{3}+X_{4}+X_{5}$ can be expressed as

[TABLE]

Using the inequality $\hat{\pi}_{i}(x_{1},w)\hat{\pi}_{i}(x_{2},w)\leq{1\over 2}\hat{\pi}_{i}(x_{1},w)^{2}+{1\over 2}\hat{\pi}_{i}(x_{2},w)^{2}$ , we have

[TABLE]

The last equation is due to the fact that each $\hat{\pi}_{i}(x,w)^{2}$ appears exactly $d_{in}(y)-1$ times in the summation. By the induction hypothesis that $\mathrm{E}[\hat{\pi}_{i}(x,w)^{2}]\leq\pi_{i}(x,w)$ , we have

[TABLE]

Combining Equations (19)-(21), it follows that

[TABLE]

And the lemma follows. ∎

B.5. Proof of Lemma 3.6

Proof.

Recall that for $s_{I}(u,v)$ , we have the estimator

[TABLE]

where $\widehat{\eta\pi}^{\prime}_{\ell}(u,w_{j})=\widehat{\eta\pi}_{\ell}(u,w_{j})$ if $\widehat{\eta\pi}_{\ell}(u,w_{j})>{(1-\sqrt{c})^{2}\varepsilon\over 12}$ and $\widehat{\eta\pi}^{\prime}_{\ell}(u,w_{j})=0$ otherwise. Here $\widehat{\eta\pi}_{\ell}(u,w_{j})$ is an estimator for $\eta(w_{j})\pi_{\ell}(u,w_{j})$ computed by the Monte Carlo approach, and $\psi_{\ell}(v,w_{j})$ is the reserve computed by $\ell$ -hop backward search. To bound the error of $\hat{s}_{I}(u,v)$ , we further define

[TABLE]

and

[TABLE]

First, we claim that $\hat{s}_{I}(u,v)$ and $\hat{s}^{1}_{I}(u,v)$ differ by at most ${\varepsilon\over 6}$ . More precisely, observe that $\widehat{\eta\pi}^{\prime}_{\ell}(u,w)$ and $\widehat{\eta\pi}_{\ell}(u,w)$ differ by at most ${(1-\sqrt{c})^{2}\varepsilon\over 6}$ , and thus

[TABLE]

For the last inequality, we use the fact that the reserve $\psi_{\ell}(v,w_{j})$ is at most $\pi_{\ell}(v,w_{j})$ , and thus $\sum_{\ell=0}^{\infty}\sum_{j=1}^{n}\psi_{\ell}(v,w_{j})\leq\sum_{\ell=0}^{\infty}\sum_{j=1}^{n}\pi_{\ell}(v,w_{j})=1.$

Next, we show that $\hat{s}^{1}_{I}(u,v)$ and $\hat{s}^{2}_{I}(u,v)$ differ by at most ${\varepsilon\over 6}$ . To see this, note that by the property of backward search, we have $\left|\pi_{\ell}(v,w_{j})-\psi_{\ell}(v,w_{j})\right|\leq 2r_{max}={(1-\sqrt{c})^{2}\varepsilon\over 6}$ for a node $w_{j}$ in the index. It follows that

[TABLE]

For the last inequality, recall that Algorithm 4 increments $\widehat{\eta\pi}$ at most $n_{r}$ times, and each increment is ${1\over n_{r}}$ .

Finally, we show that $\hat{s}^{2}_{I}(u,v)$ approximates $s_{I}(u,v)$ with error ${\varepsilon\over 4}$ with the target probability. Following the definition of $\widehat{\eta\pi}_{\ell}(u,w)$ , we use a slightly different approach to construct $\hat{s}^{2}_{I}(u,v)$ . For the $i$ -th iteration, we sample a node $w_{j}$ and a level $\ell$ with probability $\eta(w_{j})\pi_{\ell}(u,w_{j})$ , and set $X_{i}$ to be ${\pi_{\ell}(v,w_{j})\over(1-\sqrt{c})^{2}}$ . It can be verified that $\hat{s}^{2}_{I}(u,v)={1\over n_{r}}\sum_{i=1}^{n_{r}}X_{i}$ . For each $X_{i}$ ,

[TABLE]

and $X_{i}\leq\max_{\ell,v}\left\{{\pi_{\ell}(v,w_{j})\over(1-\sqrt{c})^{2}}\right\}\leq{1\over(1-\sqrt{c})^{2}}$ . Since $n_{r}=\Theta(\log{n\over\delta}/\varepsilon^{2})$ , by Chernoff bound,

[TABLE]

Combining Equations (22)-(24), we prove the lemma. ∎

B.6. Proof of Lemma 3.7

Proof.

Consider a single $\sqrt{c}$ -walk from $u$ . Recall that Algorithm 4 first samples a node-level pair $(w_{j},\ell)$ with probability $\pi_{\ell}(u,w_{j})\eta(w_{j})$ . If $j>j_{0}$ , it performs a backward walk to generate an unbiased estimator $\hat{\pi}_{\ell}(v,w_{j})$ for each $v\in V$ , and sets the estimator $\hat{s}_{B}(u,v)$ to be ${\hat{\pi}_{\ell}(v,w_{j})\over(1-\sqrt{c})^{2}}$ . It follows that

[TABLE]

We can bound the variance $\mathrm{Var}\left[\hat{s}_{B}(u,v)\right]\leq E\left[\hat{s}_{B}(u,v)^{2}\right]$ by

[TABLE]

Lemma 3.5 implies that $E\left[\hat{\pi}_{\ell}(v,w_{j})^{2}\right]\leq\pi_{\ell}(v,w_{j})$ , and

[TABLE]

Recall that for a fixed $i$ with $1\leq i\leq f_{r}$ , Algorithm 4 repeats the above sampling process $d_{r}$ times and uses the mean over the $d_{r}={12\over(1-\sqrt{c})^{2}\varepsilon^{2}}$ samples, denoted $\hat{s}_{B}^{i}(u,v)$ , as an estimator for $s_{B}(u,v)$ . It follows that

[TABLE]

By Chebyshev’s inequality, we have

[TABLE]

Finally, Algorithm 4 uses $\hat{s}_{B}(u,v)=\textrm{Median}_{1\leq i\leq f_{r}}\hat{s}_{B}^{i}(u,v)$ as the estimator for $s_{B}(u,v)$ . By setting $f_{r}=3\log{n\over\delta}$ and applying the Median Trick (see Lemma A.3), we have

[TABLE]

and the lemma follows. ∎

B.7. Proof of Lemma 3.9

Proof.

Fix the source node $u$ and consider a node $w_{j}$ and a level $\ell$ . Recall that we retrieve all nodes $v$ with $\psi_{\ell}(v,w_{j})$ from the index if and only if 1) $w_{j}$ is in the index, that is, $j\leq j_{0}$ , and 2) $\widehat{\eta\pi}_{\ell}(u,w_{j})\geq{(1-\sqrt{c})^{2}\varepsilon\over 8}={\varepsilon\over c_{1}}$ . Let $size_{\ell}(w_{j})=\Theta\left({n\pi_{\ell}(w_{j})\over\varepsilon}\right)$ denote the upper bound for the index size of $w_{j}$ at level $\ell$ , and $size(w_{j})=\sum_{\ell=0}^{\infty}size_{\ell}(w_{j})=\Theta\left({n\pi(w_{j})\over\varepsilon}\right)$ denote the upper bound for the index size of $w_{j}$ . We further define $\widehat{\eta\pi}(u,w_{j})=\sum_{\ell=0}^{\infty}\widehat{\eta\pi}_{\ell}(u,w_{j})$ . Note that $\widehat{\eta\pi}(u,w_{j})$ is an unbiased estimator for $\sum_{\ell=0}^{\infty}\eta(w_{j})\pi_{\ell}(u,w_{j})=\eta(w_{j})\pi(u,w_{j})$ . We can bound $C_{I}(u)$ as

[TABLE]

where $I\left(\widehat{\eta\pi}_{\ell}(u,w_{j})>{\varepsilon\over c_{1}}\right)$ equals $1$ if $\widehat{\eta\pi}_{\ell}(u,w_{j})>{\varepsilon\over c_{1}}$ and equals $0$ otherwise. Since $\widehat{\eta\pi}_{\ell}(u,w_{j})\leq\widehat{\eta\pi}(u,w_{j})$ , we have $I\left(\widehat{\eta\pi}_{\ell}(u,w_{j})>{\varepsilon\over c_{1}}\right)\leq I\left(\widehat{\eta\pi}(u,w_{j})>{\varepsilon\over c_{1}}\right)$ , and thus

[TABLE]

We now use two different approaches to bound $C_{I}(u)$ . First, observe that for a given $u$ , we have $\sum_{j=1}^{j_{0}}\widehat{\eta\pi}(u,w_{j})\leq 1$ , which implies that there are at most ${c_{1}\over\varepsilon}$ nodes $w_{j}$ with $\widehat{\eta\pi}(u,w_{j})\geq{\varepsilon\over c_{1}}$ . Since $size(w_{1})\geq\ldots\geq size(w_{j_{0}})$ , we can choose $\pi(u,w_{1})\geq\varepsilon,\ldots,\pi(u,w_{c_{1}\over\varepsilon})\geq\varepsilon$ to maximize the query cost $C_{I}(u)$ . It follows that $C_{I}(u)\leq\sum_{j=1}^{c_{1}\over\varepsilon}size(w_{j})\leq O\left(\sum_{j=1}^{c_{1}\over\varepsilon}{n\pi(w_{j})\over\varepsilon}\right)$ , which proves the first part of the lemma.

For the second part, note that $I\left(\widehat{\eta\pi}(u,w_{j})>{\varepsilon\over c_{1}}\right)$ is bounded by ${\widehat{\eta\pi}(u,w_{j})\over{\varepsilon/c_{1}}}$ . It follows that

[TABLE]

Here we use the fact that $\widehat{\eta\pi}(u,w_{j})$ is an unbiased estimator for $\eta(w_{j})\pi(u,w_{j})$ and that $\eta(w_{j})\leq 1$ . Taking average over all nodes $u\in V$ , we have

[TABLE]

By $size(w_{j})=O\left({n\pi(w_{j})\over\varepsilon}\right)$ , we have $C_{I}=O\left({n\over\varepsilon^{2}}\sum_{j=1}^{j_{0}}\pi(w_{j})^{2}\right),$ and the lemma follows. ∎

B.8. Proof of Lemma 3.10

Proof.

We bound $C_{B}={1\over n}\sum_{u\in V}C_{B}(u)$ , the average query cost for estimating $\hat{\pi}_{\ell}(v,w)$ for each node $w$ that is not in the index. Given a source node $u$ , for each node $w_{j}$ with $j>j_{0}$ , recall that we perform $\pi_{\ell}(u,w_{j})n_{r}$ backward walks on $w_{j}$ to estimate $\hat{\pi}_{\ell}(v,w_{j})$ for $v\in V$ . By Lemma 3.4, the cost of a single backward walk on $w_{j}$ , regardless of the level $\ell$ , can be bounded by $O(n\pi(w_{j}))$ . Ignoring the big-Oh notation, we have

[TABLE]

Taking average over all nodes $u\in V$ , we have

[TABLE]

The last equation is due to $\sum_{u\in V}\pi(u,w)=n\pi(w)$ . ∎

B.9. Proof of Theorem 3.11

Proof.

We use $\beta=1/\gamma$ to simplify the proof. Ignoring the big-Oh notation in Lemma 3.9, we have $\mathrm{E}[C_{I}]\leq{n\over\varepsilon}\sum_{j=1}^{c_{1}\over\varepsilon}\pi(w_{j})$ and $\mathrm{E}[C_{I}]\leq{n\over\varepsilon^{2}}\sum_{j=1}^{j_{0}}\pi(w_{j})^{2}$ . Plugging $\pi(w_{j})={\kappa j^{-\beta}\over n^{1-\beta}}$ into ${n\over\varepsilon}\sum_{j=1}^{c_{1}\over\varepsilon}\pi(w_{j})$ , we have

[TABLE]

Plugging $\pi(w_{j})={\kappa j^{-\beta}\over n^{1-\beta}}$ into ${n\over\varepsilon^{2}}\sum_{j=1}^{j_{0}}\pi(w_{j})^{2}$ , it follows that

[TABLE]

For $\beta<1/2$ , we have $\sum_{j=1}^{j_{0}}j^{-2\beta}=O(j_{0}^{1-2\beta})=O(n^{1-2\beta}),$ and thus $\mathrm{E}[C_{I}]=O\left({n^{2\beta-1}\over\varepsilon^{2}}\cdot n^{1-2\beta}\right)=O\left({1\over\varepsilon^{2}}\right)$ . For $\beta=1/2$ , we have $\sum_{j=1}^{j_{0}}j^{-2\beta}=O(\log j_{0})$ . Since $\log j_{0}\leq\log n$ and $n^{2\beta-1}=1$ , we have $\mathrm{E}[C_{I}]=O\left({n^{2\beta-1}\over\varepsilon^{2}}\cdot\log j_{0}\right)=O\left({\log n\over\varepsilon^{2}}\right)$ . For $\beta>1/2$ , we have $\sum_{j=1}^{j_{0}}j^{-2\beta}=O(1)$ and consequently $\mathrm{E}[C_{I}]=O\left({n^{2\beta-1}\over\varepsilon^{2}}\right)$ . Combining Equation (26) and the above analysis, we have the following equation:

[TABLE]

By Lemma 3.10 and the assumption $\pi(w_{j})={\kappa j^{-\beta}\over n^{1-\beta}}$ , we have $\mathrm{E}[C_{B}]=O\left({c_{1}n^{2\beta-1}\log n\over\varepsilon^{2}}\sum_{j=j_{0}+1}^{n}j^{-2\beta}\right).$ For $\beta<1/2$ , we have $\sum_{j=j_{0}+1}^{n}j^{-2\beta}=O(n^{1-2\beta})$ . Thus

[TABLE]

For $\beta=1/2$ , we have $\sum_{j=j_{0}+1}^{n}j^{-2\beta}=O(\log n)$ , and thus $\mathrm{E}[C_{B}]=O\left({\log n\log{n\over\delta}\over\varepsilon^{2}}\right)$ . For $\beta>1/2$ , we have $\sum_{j=j_{0}+1}^{n}j^{-2\beta}=O(j_{0}^{1-2\beta})$ . Plugging in $j_{0}\leq n\left(\varepsilon\bar{d}\right)^{1\over 1-\beta}$ , it follows that

[TABLE]

By $\varepsilon\geq{\log^{1-\beta\over 2\beta-1}n/n^{1-\beta}\bar{d}^{2\beta-1\over\beta}}$ and $\delta>{1\over n^{\Omega(1)}}$ , it follows that ${\log{n\over\delta}/\varepsilon^{1\over 1-\beta}\bar{d}^{2\beta-1\over 1-\beta}}\leq{n^{2\beta-1}\over\varepsilon^{2}}$ and ${\log n/\varepsilon^{1\over 1-\beta}\bar{d}^{2\beta-1\over 1-\beta}}\leq{n^{\beta}\over\varepsilon^{2-\beta}}$ , and thus $\mathrm{E}[C_{B}]$ is bounded by $O\left(\min\left\{{n^{2\beta-1}\over\varepsilon^{2}},{n^{\beta}\over\varepsilon^{2-\beta}}\right\}\right)$ for $\beta>1/2$ . In summary, we have

[TABLE]

Combining $C_{F}$ , $C_{I}$ , $C_{B}$ and $\beta=1/\gamma$ , the theorem follows. ∎

Bibliography

[1] http://snap.stanford.edu/data/index.html
[2] http://law.di.unimi.it/datasets.php
[3] Rodrigo Aldecoa, Chiara Orsini, and Dmitri Krioukov. Hyperbolic graph generator. Computer Physics Communications, 196:492–496, 2015.
[4] Ioannis Antonellis, Hector Garcia Molina, and Chi Chao Chang. Simrank++: query rewriting through link analysis of the click graph. PVLDB, 1(1):408–421, 2008.
[5] Bahman Bahmani, Abdur Chowdhury, and Ashish Goel. Fast incremental and personalized pagerank. PVLDB, 4(3):173–184, 2010.
[6] Mansurul Bhuiyan and Mohammad Al Hasan. Representing graphs as bag of vertices and partitions for graph classification. Data Science and Engineering, 3(2):150–165, 2018.
[7] Béla Bollobás, Christian Borgs, Jennifer T. Chayes, and Oliver Riordan. Directed scale-free graphs. In SODA, pages 132–139, 2003.
[8] Pawel Brach, Marek Cygan, Jakub Łącki, and Piotr Sankowski. Algorithmic complexity of power law networks. In SODA, pages 1306–1325. Society for Industrial and Applied Mathematics, 2016.
