On the role of clustering in Personalized PageRank estimation

Daniel Vial; Vijay Subramanian

arXiv:1706.01091·cs.SI·February 10, 2020

TL;DR

This paper explores how the clustering structure of graphs can be exploited to improve the efficiency of estimating Personalized PageRank for many node pairs, especially in large networks and distributed computing environments.

Contribution

It introduces an enhanced PPR estimation algorithm and demonstrates how leveraging clustering reduces computational complexity in both centralized and distributed settings.

Findings

01. Joint estimation is easier with high clustering.

02. Clustering-aware task assignment improves distributed computation efficiency.

03. Enhanced Bidirectional-PPR improves multi-pair estimation performance.

Tables

Table 1. Summary of notation

Notation	Defined in	Description
$π, π_{s}$	Section 2	Global PageRank vector, PPR vector for $s \in V$
$p^{s}, r^{s}$	Section 4	Output vectors of Algorithm 1 (forward DP); satisfy (5)
$σ_{s}$	Section 4	Starting distribution for walks in FW-BW-MCMC; $σ_{s} = r^{s} / {‖ r^{s} ‖}_{1}$
$r_{\max}^{s}$	Section 4	Upper bound on ${‖ r^{s} ‖}_{1}$ at conclusion of Algorithm 1
$p^{t}, r^{t}$	Section 4	Output vectors of Algorithm 2 (backward DP); satisfy (6)
$r_{\max}^{t}$	Section 4	Upper bound on ${‖ r^{t} ‖}_{\infty}$ at conclusion of Algorithm 2
$Σ_{S}$	Section 5.1.1	Matrix with rows $\{σ_{s}\}_{s \in S}$ (subscript $S$ omitted at times)
${‖ Σ_{S} ‖}_{\infty, 1}$	Section 5.1.1	Source clustering quantity; see (12)
$c_{T}$	Section 5.1.2	Target clustering quantity; see (22)
$srank (A)$	Section 5.2	Stable rank of matrix $A$ ; $srank (A) = {({‖ A ‖}_{F} / {‖ A ‖}_{2})}^{2}$
$P_{S}, R_{S}, P_{T}, R_{T}$	Section 5.2	Matrices containing $\{p^{s}\}_{s \in S}$ , $\{r^{s}\}_{s \in S}$ , $\{p^{t}\}_{t \in T}$ , $\{r^{t}\}_{t \in T}$ , resp.
$Π (S, T)$	Section 5.2	Matrix containing $\{π_{s} (t)\}_{s \in S, t \in T}$
$σ_{avg}, σ_{\max}$	Section 5.2	Starting distributions for walks in matrix estimators; see (27)
$Φ (U)$	Section 6.1.1	Conductance of $U \subset V$ ; see (30)
$σ_{S}$	Section 7	Vector with $σ_{S} (v) = \max_{s \in S} σ_{s} (v)$

Table 2. Dataset details

Dataset	Description	$n$	$m$
com-Amazon	Amazon co-purchasing	334863	925872
com-dblp	Scientific co-authorship	317080	1049866
roadNet-PA	Roads in Pennsylvania	1087532	1541514
Slashdot	Friendships on technology news site	71307	912381
web-BerkStan	berkeley.edu, stanford.edu web graph	334857	4523232
web-Google	Partial web crawl	434818	3419124
Wiki-Talk	Friendships among Wikipedia editors	111881	1477893

Table 3. Experiment parameters and single pair performance

Graph	Algorithm	${\tilde{r}}_{\max}^{s}$ $\times 10^{3}$	$r_{\max}^{t}$ $\times 10^{3}$	$δ$	$c$	DP time (ms)	MC time (ms)	Error
Direct-ER	FW-BW-MCMC-Prac	$1.8$	$3.8$	$1 / n$	$7$	10.61	7.63	0.075
Direct-ER	Bidirectional-PPR	N/A	$1.6$	$1 / n$	$12$	6.94	7.52	0.072
Direct-SBM	FW-BW-MCMC-Prac	$1$	$4$	$1 / n$	$7$	15.43	7.08	0.052
Direct-SBM	Bidirectional-PPR	N/A	$3$	$1 / n$	$10$	10.19	12.01	0.061
com-amazon	FW-BW-MCMC-Prac	$3.6$	$18.2$	$10 / n$	$12$	22.55	22.54	0.12
com-amazon	Bidirectional-PPR	N/A	$7.4$	$10 / n$	$13$	22.13	22.21	0.11
com-dblp	FW-BW-MCMC-Prac	$2.9$	$14.3$	$10 / n$	$13$	20.27	20.31	0.12
com-dblp	Bidirectional-PPR	N/A	$6$	$10 / n$	$15$	20.03	19.65	0.11
roadNet-PA	FW-BW-MCMC-Prac	$15.1$	$34.8$	$10 / n$	$6$	55.04	56.58	0.11
roadNet-PA	Bidirectional-PPR	N/A	$12.8$	$10 / n$	$6$	53.19	55.96	0.10
Slashdot	FW-BW-MCMC-Prac	$2$	$12.2$	$10 / n$	$7$	3.08	3.38	0.10
Slashdot	Bidirectional-PPR	N/A	$4.2$	$10 / n$	$17$	3.30	4.03	0.11
web-BerkStan	FW-BW-MCMC-Prac	$6.9$	$23$	$10 / n$	$3$	11.13	11.02	0.12
web-BerkStan	Bidirectional-PPR	N/A	$11.6$	$10 / n$	$3$	8.40	8.42	0.12
web-Google	FW-BW-MCMC-Prac	$4.5$	$17.6$	$10 / n$	$8$	23.33	22.83	0.11
web-Google	Bidirectional-PPR	N/A	$6.7$	$10 / n$	$11$	26.07	22.29	0.11
WikiTalk	FW-BW-MCMC-Prac	$2.3$	$7.5$	$10 / n$	$8$	4.40	3.99	0.11
WikiTalk	Bidirectional-PPR	N/A	$2.9$	$10 / n$	$20$	5.84	5.10	0.11

Equations

$$\pi(v)=\lim_{N\to\infty}\frac{1}{N}\sum_{i=0}^{N-1}\mathbbm{1}_{\{X_{i}=v\}}\ \forall\ v\in V.$$

$$\pi_{\sigma}=\alpha\sigma^{T}(I-(1-\alpha)P)^{-1}=\alpha\sigma^{T}\sum_{i=0}^{\infty}(1-\alpha)^{i}P^{i}.$$

$$\pi_{\sigma}(v)=\mathbb{P}[Y_{L}=v\mid Y_{0}\sim\sigma]\ \forall\ v\in V,$$

$$\pi_{\sigma}(v)=\sum_{s\in V}\sigma(s)\pi_{s}(v)\ \forall\ v\in V.$$

$$\pi_{s}(u)=p^{s}(u)+\sum_{w\in V}r^{s}(w)\pi_{w}(u)\ \forall\ u\in V.$$

$$\pi_{v}(t)=p^{t}(v)+\sum_{w\in V}\pi_{v}(w)r^{t}(w)\ \forall\ v\in V.$$

$$\pi_{s}(t)=p^{t}(s)+\langle p^{s},r^{t}\rangle+\sum_{w,w^{\prime}\in V}r^{s}(w)\pi_{w}(w^{\prime})r^{t}(w^{\prime}),$$

$$\|r^{s}\|_{1}\sum_{w^{\prime}\in V}\sum_{w\in V}\sigma_{s}(w)\pi_{w}(w^{\prime})r^{t}(w^{\prime})=\|r^{s}\|_{1}\sum_{w^{\prime}\in V}\pi_{\sigma_{s}}(w^{\prime})r^{t}(w^{\prime})=\|r^{s}\|_{1}\,\mathbb{E}_{U\sim\pi_{\sigma_{s}}}[r^{t}(U)].$$

$$\hat{\pi}_{s}(t)=p^{t}(s)+\langle p^{s},r^{t}\rangle+\frac{\|r^{s}\|_{1}}{w}\sum_{v\in V:X_{s}^{(w)}(v)>0}\ \sum_{i=1}^{X_{s}^{(w)}(v)}r^{t}(U_{i}^{v}).$$

$$w>\frac{3\log\left(2\sum_{s\in S,v\in V}\mathbbm{1}_{\{\sigma_{s}(v)>0\}}/p_{\textrm{fail}}\right)}{\epsilon^{2}\min_{s\in S,v\in V:\sigma_{s}(v)>0}\sigma_{s}(v)}.$$

$$\sum_{v\in V}X^{(w)}(v)-w\sum_{v\in V}\max_{s\in S}\sigma_{s}(v)\leq\epsilon w\sum_{v\in V}\max_{s\in S}\sigma_{s}(v).$$

$$\|\Sigma\|_{\infty,1}=\sum_{v\in V}\max_{s\in S}\sigma_{s}(v)$$

$$\|A\|_{p,q}=\left(\sum_{j}\left(\sum_{i}|A(i,j)|^{p}\right)^{q/p}\right)^{1/q}.$$

$$d_{\textrm{out}}(v)=|\{u\in V_{n}:v\to u\in E_{n}\}|,\quad d_{\textrm{out}}^{-}(v)=|\{u\in V_{n}\setminus V_{n,i(v)}:v\to u\in E_{n}\}|\ \forall\ v\in V_{n}.$$

$$\lim_{n\to\infty}\mathbb{P}(\|\Sigma_{S_{n}}\|_{\infty,1}\leq Cq_{n}n)=1,$$

$$\lim_{n\to\infty}\mathbb{P}(\|\Sigma_{S_{n}}\|_{\infty,1}\leq C\log n/\log\log n)=1,$$

$$\lim_{n\to\infty}\mathbb{P}(\|\Sigma_{S_{n}}\|_{\infty,1}\geq(1-\delta)n)=1,$$

$$p^{t_{2}}\leftarrow p^{t_{2}}+r^{t_{2}}(t_{1})p^{t_{1}},\quad r^{t_{2}}\leftarrow r^{t_{2}}+r^{t_{2}}(t_{1})(r^{t_{1}}-e_{t_{1}}).$$

$$\begin{aligned}&p^{t_{2}}(s)+r^{t_{2}}(t_{1})p^{t_{1}}(s)+\sum_{u\in V}\pi_{s}(u)\left(r^{t_{2}}(u)+r^{t_{2}}(t_{1})(r^{t_{1}}(u)-e_{t_{1}}(u))\right)\\&\quad=p^{t_{2}}(s)+\sum_{u\in V}\pi_{s}(u)r^{t_{2}}(u)+r^{t_{2}}(t_{1})\left(\left(p^{t_{1}}(s)+\sum_{u\in V}\pi_{s}(u)r^{t_{1}}(u)\right)-\pi_{s}(t_{1})\right)\\&\quad=\pi_{s}(t_{2})+r^{t_{2}}(t_{1})(\pi_{s}(t_{1})-\pi_{s}(t_{1}))=\pi_{s}(t_{2}),\end{aligned}$$

$$c_{T}=\sum_{i=1}^{|T|}\left|\{j\in\{1,2,\ldots,i-1\}:\pi_{t_{j}}(t_{i})>r_{\max}^{t}\}\right|,$$

$$P_{S}\in\mathbb{R}^{n\times l}\textrm{ s.t. }P_{S}(i,j)=p^{s_{j}}(i),\quad R_{S}\in\mathbb{R}^{n\times l}\textrm{ s.t. }R_{S}(i,j)=r^{s_{j}}(i),$$

$$P_{T}\in\mathbb{R}^{n\times l}\textrm{ s.t. }P_{T}(i,j)=p^{t_{j}}(i),\quad R_{T}\in\mathbb{R}^{n\times l}\textrm{ s.t. }R_{T}(i,j)=r^{t_{j}}(i).$$

$$\Pi(S,T)=P_{T}(S,:)+P_{S}^{T}R_{T}+R_{S}^{T}\Pi R_{T}.$$

$$R_{S}^{T}\Pi R_{T}=R_{S}^{T}\textrm{diag}(1/\sigma)\,\textrm{diag}(\sigma)\,\Pi R_{T}.$$

$$\sigma_{\textrm{avg}}(i)=\frac{1}{l}\sum_{s\in S}\sigma_{s}(i),\quad\sigma_{\max}(i)=\frac{1}{\|\Sigma\|_{\infty,1}}\max_{s\in S}\sigma_{s}(i),$$

$$w\geq l^{2}\,\textrm{srank}(\Pi(S,T))\log(2l/p_{\textrm{fail}})\,r_{\max}^{s}r_{\max}^{t}(6+4\epsilon)/(3\epsilon^{2}).$$

$$w\geq l^{3/2}\|\Sigma\|_{\infty,1}\log(2l/p_{\textrm{fail}})\,r_{\max}^{s}r_{\max}^{t}(6+4\epsilon)/(3\epsilon^{2}).$$

$$\Phi(U)=\frac{\sum_{i\in U,j\notin U}A_{ij}}{\min\left\{\sum_{u\in U}d_{\textrm{out}}(u),\sum_{u\notin U}d_{\textrm{out}}(u)\right\}}.$$

$$\max_{i\in\{1,\dots,k\}}\|\Sigma_{S_{i}}\|_{\infty,1}.$$

$$d(s,S^{\prime})=\sum_{v\in V}\max\{\sigma_{s}(v)-\sigma_{S^{\prime}}(v),0\}.$$

Code & Models

Repositories

danielvial/clusteringPpr (official)

Full text

On the role of clustering in Personalized PageRank estimation

Daniel Vial

University of Michigan

and

Vijay Subramanian

University of Michigan

Abstract.

Personalized PageRank (PPR) is a measure of the importance of a node from the perspective of another (we call these nodes the target and the source, respectively). PPR has been used in many applications, such as offering a Twitter user (the source) recommendations of who to follow (targets deemed important by PPR); additionally, PPR has been used in graph-theoretic problems such as community detection. However, computing PPR is infeasible for large networks like Twitter, so efficient estimation algorithms are necessary.

In this work, we analyze the relationship between PPR estimation complexity and clustering. First, we devise algorithms to estimate PPR for many source/target pairs. In particular, we propose an enhanced version of the existing single pair estimator Bidirectional-PPR that is more useful as a primitive for many pair estimation. We then show that the common underlying graph can be leveraged to efficiently and jointly estimate PPR for many pairs, rather than treating each pair separately using the primitive algorithm. Next, we show the complexity of our joint estimation scheme relates closely to the degree of clustering among the sources and targets at hand, indicating that estimating PPR for many pairs is easier when clustering occurs. Finally, we consider estimating PPR when several machines are available for parallel computation, devising a method that leverages our clustering findings, specifically the quantities computed in situ, to assign tasks to machines in a manner that reduces computation time. This demonstrates that the relationship between complexity and clustering has important consequences in a practical distributed setting.

1. Introduction

Many systems, ranging from social networks to financial markets to the human brain, can be represented as graphs. When analyzing such systems, questions regarding the importance of nodes and the relationships between them arise. Which nodes are most influential, globally and locally? From the perspective of a given node, which nodes are important, and which nodes consider the given node to be important? How can these notions be quantified?

PageRank and Personalized PageRank (PPR) can help answer such questions. PageRank is a measure of the importance or centrality of a target node; PPR “personalizes” this measure to the perspective of a source node. Proposed to rank Internet search results (Page et al., 1999), and later to personalize these rankings (Haveliwala, 2002), PageRank and PPR have been used in applications such as recommending Twitter followers (Gupta et al., 2013) and YouTube videos (Baluja et al., 2008), as well as “beyond the web” (Gleich, 2015), in fields such as bioinformatics (Morrison et al., 2005; Freschi, 2007). In graph theory, PPR has been used as a primitive for tasks such as detecting communities near a seed node (Andersen et al., 2006) and assessing similarity between graphs (Koutra et al., 2013).

The widespread use of PageRank and PPR can be attributed to the notion of relational “importance” they convey, as well as the simplicity of the model from which they are derived. However, the scale of modern networks often makes them difficult (or impossible) to compute. As such, strategies for efficient estimation of PageRank and PPR are necessary.

Our contributions: In this work, we analyze the relationship between clustering and PPR estimation complexity. In particular, we devise algorithms to estimate PPR for many source/target pairs, and we show that the complexity of these methods decreases with increased clustering among the sources and targets at hand. To demonstrate the consequences of our findings, we consider a distributed setting in which this relationship between complexity and clustering can be leveraged to design more efficient algorithms. More specifically, our contributions are as follows:

(1) In Section 4, we propose an enhancement of Bidirectional-PPR (Lofgren et al., 2016), the state-of-the-art PPR estimator for a single source/target pair. As the name suggests, Bidirectional-PPR estimates PPR in two stages: random walks forward from the source node and dynamic programming (DP) backward from the target node. Our algorithm, called FW-BW-MCMC, adds a DP stage forward from the source that allows it to serve as a primitive in the many pair setting. In Appendix A, we establish similar guarantees to those for Bidirectional-PPR.

(2) In Section 5, we use FW-BW-MCMC as a primitive to estimate PPR for many pairs, proposing methods that accelerate the naive scheme of separately sampling random walks for each source and separately running DP for each target. For the sources, we show the forward DP allows random walk samples to be shared, decreasing the number of samples required. For the targets, we define a new iterative update for the backward DP, which eliminates repeated computations that may occur when treating each target separately. Using these ideas, we devise an algorithm with accuracy guarantees on each scalar estimate, and two algorithms with accuracy guarantees on the matrix containing all estimates. Across a diverse set of graphs, our methods are roughly 1.1 to 9.3 times faster than baseline methods (Fig. 1(a)).

(3) We show analytically in Section 5 and empirically in Section 6 that the accelerations offered by our algorithms are most significant when the sources and targets are each clustered together in the graph, i.e. PPR estimation is “easier” when clustering occurs. For example, our algorithms typically accelerate baseline methods by factors of 3-4 when clustering occurs (Fig. 1(a)). More specifically, we prove the number of random walks for the sources and the number of DP iterations for the targets scale with quantities that describe clustering among the sources and targets, respectively; we find empirically that these clustering quantities scale with a more traditional clustering quantity, conductance (Fig. 1(b)). Also, while these clustering quantities are difficult to analyze in general, we provide analytical results for the stochastic block model, the prototypical model for networks with community structure.

(4) Finally, in Section 7, we demonstrate an application of our results, showing that our findings can be used to devise efficient algorithms when estimating PPR in a distributed setting. Specifically, we show that quantities computed during the forward DP can be used to predict the random walk sampling time for different assignments of tasks to machines, and we propose a natural but heuristic method to compute an assignment that (locally) minimizes this time. At a high level, our method “learns” the clustering structure present at runtime; empirically, this learning is quite successful, in the sense that our method performs nearly as well as an oracle method that knows the clustering structure a priori (Fig. 1(c)).

The remainder of the paper is organized as follows. We begin with preliminaries and related work in Sections 2 and 3, respectively. Sections 4-7 follow the outline above. We close in Section 8.

2. Preliminaries

We begin with some preliminary definitions. Let $G=(V,E)$ be a directed graph, and define $n=|V|,m=|E|$ . For $v\in V$ , let $N_{\textrm{out}}(v)=\{u\in V:v\rightarrow u\in E\}$ denote $v$ ’s outgoing neighbors, and let $d_{\textrm{out}}(v)=|N_{\textrm{out}}(v)|$ denote the out-degree of $v$ . For simplicity, we assume $d_{\textrm{out}}(v)>0$ $\forall\ v\in V$ . Similarly define $N_{\textrm{in}}(v)$ and $d_{\textrm{in}}(v)$ as $v$ ’s incoming neighbors and in-degree. Finally, let $A$ denote the adjacency matrix of $G$ , let $D$ be the diagonal matrix with $D(v,v)=d_{\textrm{out}}(v)$ , and define $P=D^{-1}A$ .

PageRank (Page et al., 1999) is the stationary distribution $\{\pi(v)\}_{v\in V}$ of the Markov chain with transition matrix $(1-\alpha)P+\alpha\frac{1}{n}1_{n}1_{n}^{T}$ , where $\alpha\in(0,1)$ and $1_{n}$ denotes the all ones vector of length $n$ . In words, this chain is a random walk on $G$ for which, with probability $\alpha$ at each step, the random walker “jumps” to a node chosen uniformly at random, rather than following an outgoing edge. Let us denote this chain by $\{X_{i}\}_{i\in\mathbb{N}}$ . Clearly, $\{X_{i}\}_{i\in\mathbb{N}}$ is irreducible and aperiodic, as the probability of moving from $u$ to $v$ in one step is at least $\frac{\alpha}{n}>0$ for all $u,v\in V$ . Assuming $V$ is finite, positive recurrence follows, so $\{\pi(v)\}_{v\in V}$ exists, is unique, and satisfies

$$\pi(v)=\lim_{N\to\infty}\frac{1}{N}\sum_{i=0}^{N-1}\mathbbm{1}_{\{X_{i}=v\}}\ \forall\ v\in V.\tag{1}$$

It is by (1) that PageRank gives a measure of “importance”: we consider $v$ important when $\pi(v)$ is large, which occurs when $v$ is visited often on the chain $\{X_{i}\}_{i\in\mathbb{N}}$ .
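As a concrete illustration of this definition, the following minimal sketch computes global PageRank on a toy graph by repeatedly applying the transition matrix $(1-\alpha)P+\alpha\frac{1}{n}1_{n}1_{n}^{T}$ to a row vector. The four-node graph, the value of $\alpha$, and the function names are assumptions chosen for illustration; they do not come from the paper.

```python
ALPHA = 0.15  # jump probability (illustrative value, not from the paper)
GRAPH = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}  # toy directed graph, every d_out > 0


def pagerank_step(pi, graph, alpha):
    """One update pi <- pi [(1 - alpha) P + (alpha / n) 1 1^T] for a row vector pi."""
    n = len(graph)
    new = [alpha * sum(pi) / n] * n  # mass redistributed by the uniform jump
    for v, out in graph.items():
        share = (1 - alpha) * pi[v] / len(out)  # mass sent along each outgoing edge
        for u in out:
            new[u] += share
    return new


def global_pagerank(graph, alpha, iters=300):
    """Power iteration started from the uniform distribution."""
    n = len(graph)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = pagerank_step(pi, graph, alpha)
    return pi


pi = global_pagerank(GRAPH, ALPHA)
```

In this toy graph, nodes 0 and 2 each receive mass from two in-neighbors, so they end up with the largest PageRank values, matching the intuition that frequently visited nodes are “important”.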

We will henceforth refer to PageRank as global PageRank to distinguish it from the generalization to PPR. Formally, PPR is the stationary distribution $\{\pi_{\sigma}(v)\}_{v\in V}$ of the Markov chain with transition matrix $P_{\sigma}=(1-\alpha)P+\alpha 1_{n}\sigma^{T}$ . Here $\sigma$ is a nonnegative vector that sums to 1; hence, it yields a distribution on the jump locations, generalizing the uniform jumps of global PageRank. This gives $\pi_{\sigma}$ an interpretation like (1), while also accounting for the preference on jump locations given by $\sigma$ .

There are two important mathematical viewpoints of PPR that serve as the foundation for many estimation techniques; we will make use of both in the sections that follow. The first viewpoint is linear algebraic. Here we let $\pi_{\sigma}$ denote the stationary distribution as a row vector. By global balance, $\pi_{\sigma}$ satisfies $\pi_{\sigma}=\pi_{\sigma}P_{\sigma}$ ; solving for $\pi_{\sigma}$ (and assuming $\pi_{\sigma}$ is normalized to sum to 1) gives

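$$\pi_{\sigma}=\alpha\sigma^{T}(I-(1-\alpha)P)^{-1}=\alpha\sigma^{T}\sum_{i=0}^{\infty}(1-\alpha)^{i}P^{i}.\tag{2}$$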

Note this immediately suggests estimating PPR by computing the first $k$ terms of the summation. The second PPR viewpoint is probabilistic. Denoting by $\{Y_{i}\}_{i\in\mathbb{N}}$ the Markov chain with transition matrix $P$ , and letting $L\sim\textrm{geometric}(\alpha)$ , (Athreya and Stenflo, 2003) shows

$$\pi_{\sigma}(v)=\mathbb{P}[Y_{L}=v\mid Y_{0}\sim\sigma]\ \forall\ v\in V,\tag{3}$$

which again suggests estimation techniques; in this case, using Markov chain Monte Carlo (MCMC). We note this viewpoint is closely related to exact sampling (Propp and Wilson, 1996) and Doeblin chains (Athreya and Stenflo, 2003; Meyn and Tweedie, 2012).
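The two viewpoints can be compared numerically. The sketch below is illustrative only: the toy graph, the value of $\alpha$, and the function names `ppr_series` and `ppr_monte_carlo` are assumptions. It evaluates the first $k$ terms of the series in (2), and separately estimates the same vector by sampling walks whose length $L$ satisfies $\mathbb{P}[L=i]=\alpha(1-\alpha)^{i}$, the convention consistent with (2).

```python
import random

ALPHA = 0.2  # jump probability (illustrative value)
GRAPH = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}  # toy directed graph, every d_out > 0


def ppr_series(graph, sigma, alpha, k):
    """First k terms of (2): alpha * sigma^T * sum_{i<k} (1 - alpha)^i P^i."""
    n = len(graph)
    term = list(sigma)      # row vector sigma^T P^i, starting at i = 0
    out = [0.0] * n
    weight = alpha          # alpha * (1 - alpha)^i
    for _ in range(k):
        for v in range(n):
            out[v] += weight * term[v]
        nxt = [0.0] * n
        for v, nbrs in graph.items():  # multiply the row vector by P
            for u in nbrs:
                nxt[u] += term[v] / len(nbrs)
        term = nxt
        weight *= 1 - alpha
    return out


def ppr_monte_carlo(graph, sigma, alpha, walks, rng):
    """Estimate (3) by the endpoint frequencies of geometric-length walks, Y_0 ~ sigma."""
    n = len(graph)
    counts = [0] * n
    nodes = list(range(n))
    for _ in range(walks):
        v = rng.choices(nodes, weights=sigma)[0]
        while rng.random() >= alpha:   # continue with probability 1 - alpha
            v = rng.choice(graph[v])
        counts[v] += 1
    return [c / walks for c in counts]


sigma = [1.0, 0.0, 0.0, 0.0]  # sigma = e_s with s = 0
series = ppr_series(GRAPH, sigma, ALPHA, k=200)
sampled = ppr_monte_carlo(GRAPH, sigma, ALPHA, walks=50000, rng=random.Random(0))
```

The series converges geometrically at rate $1-\alpha$, while the Monte Carlo error decays only like the inverse square root of the number of walks; this trade-off is one reason to combine the two approaches.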

Often, we let $\sigma=e_{s}$ for some $s\in V$ , where $e_{s}\in\{0,1\}^{n}$ satisfies $e_{s}(v)=\mathbbm{1}_{\{v=s\}}$ . In such a case, we denote PPR as $\{\pi_{s}(v)\}_{v\in V}$ and the transition matrix as $P_{s}$ . In fact, using (2), one can show

$$\pi_{\sigma}(v)=\sum_{s\in V}\sigma(s)\pi_{s}(v)\ \forall\ v\in V.\tag{4}$$

Due to (4), many PPR estimation algorithms focus on estimating $\pi_{s}(v)$ , from which extensions to estimating $\pi_{\sigma}(v)$ naturally follow; we focus on this as well throughout the paper.
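The linearity in (4) is easy to check numerically. The following sketch, with an assumed toy graph, an assumed $\sigma$, and a hypothetical helper `ppr`, computes $\pi_{\sigma}$ directly and as the $\sigma$-weighted mixture of the single-source vectors $\pi_{s}$.

```python
ALPHA = 0.2  # jump probability (illustrative value)
GRAPH = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}  # toy directed graph, every d_out > 0


def ppr(graph, sigma, alpha, iters=300):
    """PPR vector via the fixed point pi = alpha * sigma^T + (1 - alpha) * pi P."""
    n = len(graph)
    pi = list(sigma)
    for _ in range(iters):
        nxt = [alpha * sigma[v] for v in range(n)]
        for v, nbrs in graph.items():
            for u in nbrs:
                nxt[u] += (1 - alpha) * pi[v] / len(nbrs)
        pi = nxt
    return pi


n = len(GRAPH)
sigma = [0.5, 0.3, 0.2, 0.0]  # an arbitrary distribution on jump locations
direct = ppr(GRAPH, sigma, ALPHA)

# Single-source vectors pi_s, i.e. sigma = e_s for each s.
rows = [ppr(GRAPH, [1.0 if v == s else 0.0 for v in range(n)], ALPHA) for s in range(n)]
mixture = [sum(sigma[s] * rows[s][v] for s in range(n)) for v in range(n)]
```

Because of this identity, an estimator for $\pi_{s}(v)$ extends to general $\sigma$ by averaging over sources drawn according to $\sigma$.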

Finally, we argue that existence and uniqueness of PPR are not a concern. Indeed, the PPR Markov chain is aperiodic (since $P_{s}(s,s)\geq\alpha>0$ ), so to guarantee existence and uniqueness, we need only verify irreducibility. For this, let $V_{s}=\{v\in V:\exists\textrm{ a path from }s\textrm{ to }v\textrm{ in }G\}$ , so that $\forall\ u,v\in V_{s}$ , $\exists\ i\in\mathbb{N}$ s.t. $P_{s}^{i}(u,v)>0$ (we can jump from $u$ to $s$ , then reach $v$ from $s$ ). Hence, if $P_{s}$ is not irreducible, we can define a modified chain with states $V_{s}$ that is irreducible and obtain the stationary distribution $\pi_{s}$ for this chain. We can then set $\pi_{s}(v)=0\ \forall\ v\notin V_{s}$ ; intuitively, if $s$ cannot reach $v$ , $v$ is not “important” to $s$ , so its PPR should be zero. Given this simple fix, we assume existence and uniqueness of PPR.

3. Related Work

Before proceeding, we discuss some existing PPR estimation algorithms. Broadly speaking, these can be organized hierarchically: first, those that estimate the entire PPR matrix $\{\pi_{s}(t)\}_{s\in V,t\in V}$ ; second, those that estimate a single row $\{\pi_{s}(t)\}_{t\in V}$ or column $\{\pi_{s}(t)\}_{s\in V}$ of this matrix, or its column sums (i.e. global PageRank); and third, those that estimate a single entry $\pi_{s}(t)$ .

At the first level, several algorithms have been proposed to accelerate the power iteration or matrix inversion in (2). To accelerate the power iteration, (Jeh and Widom, 2003) provides a decomposition that allows a single row of the PPR matrix to be estimated using previously-computed rows; hence, this yields a procedure of first computing a small number of rows and then using these to estimate other rows. To obtain less costly matrix inversions, several works, e.g. (Tong et al., 2008; Shin et al., 2015), have leveraged structural assumptions of the graph at hand. For example, Tong et al. in (Tong et al., 2008) propose a decomposition of $P$ into a block diagonal matrix $P_{1}$ and $P_{2}:=P-P_{1}$ ; for graphs like social networks, $P_{2}$ can be extremely sparse. From the probabilistic viewpoint (3), (Fogaras et al., 2005) gives an algorithm to estimate any entry of the PPR matrix at runtime using a precomputed database of random walk samples.

At the second level, algorithms include the dynamic programming methods in (Andersen et al., 2006) and (Andersen et al., 2008) that estimate a row and a column of the PPR matrix, respectively; both can be viewed as localized versions of the power iteration in (2). The algorithm in (Andersen et al., 2006) yields $l_{1}$ and $l_{\infty}$ error guarantees on the row estimate with complexity $O(m)$ , while (Andersen et al., 2008) gives an $l_{\infty}$ guarantee on the column estimate with complexity $O(m)$ . We make use of these algorithms in our methods and will discuss them in more detail in Section 4. We also note the approach in (Andersen et al., 2006; Andersen et al., 2008) is closely related to work by Lee and co-authors (Lee et al., 2013, 2014a, 2014b) that focuses on estimation of the stationary distribution of countable state-space Markov chains, as well as estimation in the context of general linear systems. From the probabilistic viewpoint, an important work is (Avrachenkov et al., 2007), which analyzes Monte Carlo methods for global PageRank estimation, based on both the final step of sampled random walks (as given by (3)) and the number of visits along the entire walk. In (Avrachenkov et al., 2007), it is shown that a single walk from each node (i.e. $n$ walks total) suffices to obtain estimates with small relative error for nodes with high global PageRank. Another work in this category is (Borgs et al., 2014), which uses random walk-based methods to detect all nodes with global PageRank exceeding $n^{-\delta},\delta\in(0,1)$ with complexity sublinear in $n$ . (Borgs et al., 2014) also contains an algorithm to estimate a row of the PPR matrix with each estimate satisfying a multiplicative plus additive error guarantee; the complexity is linear in $n$ (if the error tolerance is set to match (Lofgren et al., 2016)).

At the third level, the aforementioned Bidirectional-PPR algorithm from (Lofgren et al., 2016) combines existing dynamic programming and Monte Carlo methods to estimate a single PPR value with worst-case and average-case complexity $O(n)$ and $O(\sqrt{m})$ , respectively. From an accuracy perspective, this algorithm achieves a relative error bound for PPR values exceeding $1/n$ , and an absolute error bound otherwise. We discuss this algorithm in more detail in Section 4.

In the context of this body of work, we will consider estimation of a small set of PPR values, $\{\pi_{s}(t)\}_{s\in S,t\in T}$ for some $S,T\subset V$ . While we do not precisely quantify “small”, we implicitly assume $|S|\approx|T|=o(\sqrt{m})$ . In this setting, the existing methods described above can be applied in two ways. First, using methods such as the power iteration or the dynamic programming schemes (i.e. the first two levels of the above hierarchy), one can estimate entire rows and/or columns of the PPR matrix and then discard unwanted estimates. Such approaches typically have complexity $O(|S|m)$ or $O(|T|m)$ (depending on exactly which approach is used). Second, one can run the single pair estimator Bidirectional-PPR separately for each pair $(s,t)\in S\times T$ . This approach has typical complexity $O(|S||T|\sqrt{m})$ . When $|S|\approx|T|=o(\sqrt{m})$ , the second approach is more efficient. Hence, we will treat this approach as a baseline for comparison to our methods. Our primary contribution is to show that this baseline can be accelerated by exploiting clustering among $S$ and $T$ to estimate PPR values jointly, rather than running Bidirectional-PPR separately for each $(s,t)\in S\times T$ .
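To make the comparison of the two baseline costs concrete, the sketch below evaluates the leading terms $|S|m$ and $|S||T|\sqrt{m}$ for a hypothetical graph size. Constants and logarithmic factors are ignored, and the numbers are illustrative assumptions rather than measurements from the paper.

```python
import math


def row_column_cost(num_sources, m):
    """Leading term of estimating |S| full rows (or columns): |S| * m."""
    return num_sources * m


def pairwise_cost(num_sources, num_targets, m):
    """Leading term of running a single pair estimator on every pair: |S| |T| sqrt(m)."""
    return num_sources * num_targets * math.sqrt(m)


m = 1_000_000  # hypothetical number of edges, so sqrt(m) = 1000

small = (row_column_cost(100, m), pairwise_cost(100, 100, m))       # |S| = |T| = 100
equal = (row_column_cost(1000, m), pairwise_cost(1000, 1000, m))    # |S| = |T| = sqrt(m)
large = (row_column_cost(5000, m), pairwise_cost(5000, 5000, m))    # |S| = |T| > sqrt(m)
```

With $|S|=|T|=k$ , the pairwise cost $k^{2}\sqrt{m}$ is smaller than $km$ exactly when $k<\sqrt{m}$ , which is the regime assumed in the text.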

4. Single node pair estimation

We begin by proposing an enhancement of Bidirectional-PPR (Lofgren et al., 2016), the state-of-the-art single pair PPR estimator; we will introduce our algorithm and then describe Bidirectional-PPR as a special case. As mentioned in Section 3, the idea behind these estimators is to combine dynamic programming (DP) and Markov chain Monte Carlo (MCMC) to estimate $\pi_{s}(t)$ for some $s,t\in V$ . Our algorithm uses two DP stages and one MCMC stage. We will refer to these stages as the forward DP, backward DP, and MCMC stages; hence, we call our estimator FW-BW-MCMC. It is depicted pictorially in Fig. 2 and defined formally in Algorithm 3. Before proceeding, we briefly describe each stage.

The forward DP stage is Algorithm 1. This is nearly identical to the Approximate-PageRank algorithm of (Andersen et al., 2006), so we use the same name here; however, we change the termination criterion from $\|D^{-1}r^{s}\|_{\infty}\leq r_{\max}^{s}$ to $\|r^{s}\|_{1}\leq r_{\max}^{s}$ , where $r^{s}_{\max}\in(0,1)$ is an input to the algorithm (we describe our motivation for this change shortly). The algorithm takes as input $s\in V$ and produces $p^{s},r^{s}\in\mathbb{R}_{+}^{n}$ , shown in (Andersen et al., 2006) to satisfy the invariant (5) at each iteration.

$$\pi_{s}(u)=p^{s}(u)+\sum_{w\in V}r^{s}(w)\pi_{w}(u)\ \forall\ u\in V.\tag{5}$$

As mentioned in Section 3, Algorithm 1 can be viewed as a “localized” power iteration. At a high level, it computes elements of the matrices in (2) corresponding to high probability paths from $s$ to $u$ (the $p^{s}(u)$ term) while tracking the error from “uncomputed” paths (the $\sum_{w\in V}r^{s}(w)\pi_{w}(u)$ term). These high probability paths are shown as blue edges in Fig. 2.
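A forward DP of this kind can be sketched as follows. The push rule used here is a standard non-lazy one and is an assumption, as are the toy graph, the parameter values, and the rule of always pushing from the node with the largest residual; the paper's Algorithm 1 may differ in these details. The sketch stops once $\|r^{s}\|_{1}\leq r_{\max}^{s}$ and then checks invariant (5) against a reference PPR computation.

```python
ALPHA = 0.2  # jump probability (illustrative value)
GRAPH = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}  # toy directed graph, every d_out > 0


def ppr(graph, s, alpha, iters=300):
    """Reference pi_s via the fixed point pi = alpha * e_s^T + (1 - alpha) * pi P."""
    n = len(graph)
    pi = [0.0] * n
    pi[s] = 1.0
    for _ in range(iters):
        nxt = [0.0] * n
        nxt[s] = alpha
        for v, nbrs in graph.items():
            for u in nbrs:
                nxt[u] += (1 - alpha) * pi[v] / len(nbrs)
        pi = nxt
    return pi


def forward_push(graph, s, alpha, r_max_s):
    """Push residual mass along outgoing edges until ||r||_1 <= r_max_s."""
    n = len(graph)
    p, r = [0.0] * n, [0.0] * n
    r[s] = 1.0
    while sum(r) > r_max_s:
        v = max(range(n), key=lambda x: r[x])  # largest residual first (assumed rule)
        mass, r[v] = r[v], 0.0
        p[v] += alpha * mass                    # settled part of the mass
        for u in graph[v]:                      # the rest moves one step forward
            r[u] += (1 - alpha) * mass / len(graph[v])
    return p, r


p_s, r_s = forward_push(GRAPH, 0, ALPHA, r_max_s=0.1)
rows = [ppr(GRAPH, w, ALPHA) for w in GRAPH]
# Largest violation of invariant (5) over all u; should vanish up to rounding.
gap = max(abs(rows[0][u] - p_s[u] - sum(r_s[w] * rows[w][u] for w in GRAPH))
          for u in GRAPH)
```

Each push preserves (5) because $\pi_{v}=\alpha e_{v}^{T}+(1-\alpha)\sum_{u}P(v,u)\pi_{u}$ , which follows from (2); total mass is also conserved, so $\|p^{s}\|_{1}+\|r^{s}\|_{1}=1$ throughout.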

The backward DP stage is Approximate-Contributions (Algorithm 2, from (Andersen et al., 2008)), which can be viewed as the dual of Algorithm 1: while Algorithm 1 works forwards (along outgoing edges), Algorithm 2 works backwards (along incoming edges). In (Andersen et al., 2008), it is shown that Algorithm 2 maintains the invariant (6), which can be interpreted similarly to (5). This stage is shown in red in Fig. 2.

$$\pi_{v}(t)=p^{t}(v)+\sum_{w\in V}\pi_{v}(w)r^{t}(w)\ \forall\ v\in V.\tag{6}$$
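The backward DP can be sketched in the same style. As before, this is a minimal illustration under stated assumptions: the push rule is a standard one for this kind of backward DP, the toy graph and parameters are hypothetical, and the paper's Algorithm 2 may differ in details. The loop stops once $\|r^{t}\|_{\infty}\leq r_{\max}^{t}$ , and invariant (6) is then checked for every $v$ .

```python
ALPHA = 0.2  # jump probability (illustrative value)
GRAPH = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}  # toy directed graph, every d_out > 0


def ppr(graph, s, alpha, iters=300):
    """Reference pi_s via the fixed point pi = alpha * e_s^T + (1 - alpha) * pi P."""
    n = len(graph)
    pi = [0.0] * n
    pi[s] = 1.0
    for _ in range(iters):
        nxt = [0.0] * n
        nxt[s] = alpha
        for v, nbrs in graph.items():
            for u in nbrs:
                nxt[u] += (1 - alpha) * pi[v] / len(nbrs)
        pi = nxt
    return pi


def backward_push(graph, t, alpha, r_max_t):
    """Push residual mass along incoming edges until ||r||_inf <= r_max_t."""
    n = len(graph)
    in_nbrs = {v: [] for v in graph}
    for u, nbrs in graph.items():
        for v in nbrs:
            in_nbrs[v].append(u)
    p, r = [0.0] * n, [0.0] * n
    r[t] = 1.0
    while max(r) > r_max_t:
        v = max(range(n), key=lambda x: r[x])
        mass, r[v] = r[v], 0.0
        p[v] += alpha * mass
        for u in in_nbrs[v]:                    # weight by P(u, v) = 1 / d_out(u)
            r[u] += (1 - alpha) * mass / len(graph[u])
    return p, r


t = 3
p_t, r_t = backward_push(GRAPH, t, ALPHA, r_max_t=0.05)
rows = [ppr(GRAPH, v, ALPHA) for v in GRAPH]
# Largest violation of invariant (6) over all v; should vanish up to rounding.
gap = max(abs(rows[v][t] - p_t[v] - sum(rows[v][w] * r_t[w] for w in GRAPH))
          for v in GRAPH)
```

Note that $p^{t}$ and $r^{t}$ are indexed by the potential sources $v$ , so a single backward DP from $t$ yields information about $\pi_{v}(t)$ for every $v$ at once.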

To motivate the MCMC stage, first observe that combining (5) and (6) with $u=t$ and $v=s$ gives

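$$\pi_{s}(t)=p^{t}(s)+\langle p^{s},r^{t}\rangle+\sum_{w,w^{\prime}\in V}r^{s}(w)\pi_{w}(w^{\prime})r^{t}(w^{\prime}),\tag{7}$$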

and so, after running the DP stages, only the third term in (7) is unknown. The goal of the MCMC stage is to estimate this term. Towards this end, let $\sigma_{s}=r^{s}/\|r^{s}\|_{1}$ and use (4) to write this term as

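$$\|r^{s}\|_{1}\sum_{w^{\prime}\in V}\sum_{w\in V}\sigma_{s}(w)\pi_{w}(w^{\prime})r^{t}(w^{\prime})=\|r^{s}\|_{1}\sum_{w^{\prime}\in V}\pi_{\sigma_{s}}(w^{\prime})r^{t}(w^{\prime})=\|r^{s}\|_{1}\,\mathbb{E}_{U\sim\pi_{\sigma_{s}}}[r^{t}(U)].\tag{8}$$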

Leveraging the probabilistic PPR interpretation (3), we can then estimate this term by sampling random walks. More specifically, we first sample a starting node from $\sigma_{s}$ (blue nodes in Fig. 2), and we then sample a random walk beginning at the starting node (black edges in Fig. 2). This process of sampling random walks is the MCMC stage of our algorithm.
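Putting the three stages together gives the following end-to-end sketch. It is a simplified illustration, not the paper's Algorithm 3: the push rules, the toy graph, the parameter values, and the function name `fw_bw_mcmc` are all assumptions, and each walk samples its own starting node rather than using any sample-sharing scheme.

```python
import random

ALPHA = 0.2  # jump probability (illustrative value)
GRAPH = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}  # toy directed graph, every d_out > 0


def ppr(graph, s, alpha, iters=300):
    """Reference pi_s via the fixed point pi = alpha * e_s^T + (1 - alpha) * pi P."""
    pi = [1.0 if v == s else 0.0 for v in graph]
    for _ in range(iters):
        nxt = [alpha if v == s else 0.0 for v in graph]
        for v, nbrs in graph.items():
            for u in nbrs:
                nxt[u] += (1 - alpha) * pi[v] / len(nbrs)
        pi = nxt
    return pi


def forward_push(graph, s, alpha, r_max_s):
    """Forward DP: push along outgoing edges until ||r||_1 <= r_max_s."""
    n = len(graph)
    p, r = [0.0] * n, [0.0] * n
    r[s] = 1.0
    while sum(r) > r_max_s:
        v = max(range(n), key=lambda x: r[x])
        mass, r[v] = r[v], 0.0
        p[v] += alpha * mass
        for u in graph[v]:
            r[u] += (1 - alpha) * mass / len(graph[v])
    return p, r


def backward_push(graph, t, alpha, r_max_t):
    """Backward DP: push along incoming edges until ||r||_inf <= r_max_t."""
    n = len(graph)
    in_nbrs = {v: [u for u in graph if v in graph[u]] for v in graph}
    p, r = [0.0] * n, [0.0] * n
    r[t] = 1.0
    while max(r) > r_max_t:
        v = max(range(n), key=lambda x: r[x])
        mass, r[v] = r[v], 0.0
        p[v] += alpha * mass
        for u in in_nbrs[v]:
            r[u] += (1 - alpha) * mass / len(graph[u])
    return p, r


def fw_bw_mcmc(graph, s, t, alpha, r_max_s, r_max_t, walks, rng):
    """Known terms of (7) plus a Monte Carlo estimate of the third term via (8)."""
    p_s, r_s = forward_push(graph, s, alpha, r_max_s)
    p_t, r_t = backward_push(graph, t, alpha, r_max_t)
    known = p_t[s] + sum(a * b for a, b in zip(p_s, r_t))
    nodes = list(range(len(graph)))
    total = 0.0
    for _ in range(walks):
        v = rng.choices(nodes, weights=r_s)[0]  # start ~ sigma_s = r_s / ||r_s||_1
        while rng.random() >= alpha:            # geometric(alpha) walk length
            v = rng.choice(graph[v])
        total += r_t[v]                         # sample of r^t(U), U ~ pi_{sigma_s}
    return known + sum(r_s) * total / walks


exact = ppr(GRAPH, 0, ALPHA)[3]
estimate = fw_bw_mcmc(GRAPH, 0, 3, ALPHA, 0.5, 0.05, 20000, random.Random(1))
# With r_max_s = 1 the forward DP performs no pushes, so every walk starts at s.
bidirectional = fw_bw_mcmc(GRAPH, 0, 3, ALPHA, 1.0, 0.05, 20000, random.Random(2))
```

Since each sample $r^{t}(U)$ is at most $r_{\max}^{t}$ and is scaled by $\|r^{s}\|_{1}\leq r_{\max}^{s}$ , shrinking either threshold reduces the variance of the Monte Carlo term at the price of more DP work.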

As mentioned above, the forward DP stage uses the termination criterion $\|r^{s}\|_{1}\leq r_{\max}^{s}$ , rather than the $\|D^{-1}r^{s}\|_{\infty}\leq r_{\max}^{s}$ criterion used in (Andersen et al., 2006). This is because we will require a uniform bound on $\{\|r^{s}\|_{1}\}_{s\in S}$ when proving results pertaining to a set of sources $S$ in later sections. However, this bound is not needed in practice, where we can instead use $\|D^{-1}r^{s}\|_{\infty}\leq r_{\max}^{s}$ termination. We call this variant of our algorithm FW-BW-MCMC-Practical; see Algorithm 8 in Appendix D for a formal definition.

Having defined FW-BW-MCMC, we can describe the existing algorithm Bidirectional-PPR, which (in brief) operates as follows: run the backward DP from $t$ , take $v=s$ in (6), and estimate the unknown term $\mathbb{E}_{U\sim\pi_{s}}[r^{t}(U)]$ by sampling random walks from $s$ . We observe this is a special case of FW-BW-MCMC; specifically, the case $r^{s}_{\max}=1$ . We reemphasize that walks are sampled from $\nu\sim\sigma_{s}$ in FW-BW-MCMC and from $s$ in Bidirectional-PPR, which will be an important distinction later.

In the next sections, we will propose many pair estimators that use either Bidirectional-PPR or our enhancement as a primitive. We will show that using our enhancement as the primitive offers runtime accelerations not possible when using Bidirectional-PPR. Implicit in this discussion will be an understanding that using either primitive yields similar performance when these accelerations are ignored (so that using our enhancement as the primitive offers better performance when the accelerations are accounted for). In particular, we can prove the following results (as single pair estimation is not our focus, we defer formal statements and proofs to Appendix A):

(1) FW-BW-MCMC, FW-BW-MCMC-Practical, and Bidirectional-PPR offer the same accuracy guarantee (except for mild differences in assumptions).

(2) FW-BW-MCMC and Bidirectional-PPR have worst-case complexity $O(n)$ .

(3) FW-BW-MCMC-Practical and Bidirectional-PPR have average-case complexity $O(\sqrt{m})$ .

5. Many node pair estimation

In this section, we consider the problem of estimating PPR for many node pairs, $\{\pi_{s}(t)\}_{s\in S,t\in T}$ for some $S,T\subset V$ . We consider two variants of this problem. First, in Section 5.1, we view $\{\pi_{s}(t)\}_{s\in S,t\in T}$ as a set of scalars, each of which we aim to accurately estimate. Second, in Section 5.2, we view $\{\pi_{s}(t)\}_{s\in S,t\in T}$ as a matrix, which we aim to approximate accurately in the operator norm. For both variants, we propose algorithms that accelerate existing approaches, and we show the accelerations scale with quantities that can be interpreted as clustering measures of $S$ and $T$ . In addition to these algorithms, we briefly discuss variants that use precomputation in Section 5.3.

5.1. Scalar estimation viewpoint

Given $S,T\subset V$ , a natural approach to estimate $\pi_{s}(t)\ \forall\ (s,t)\in S\times T$ is to use single pair estimators from Section 4 as primitives. In particular, we could use either of the following approaches:

• Run forward DP and sample random walks from $\nu\sim\sigma_{s}$ for each $s\in S$ . Run backward DP from each $t\in T$ . Compute estimates as in FW-BW-MCMC.

• Sample random walks from each $s\in S$ . Run backward DP from each $t\in T$ . Compute estimates as in Bidirectional-PPR.

As argued in Appendix A, the primitives FW-BW-MCMC and Bidirectional-PPR are roughly equivalent in terms of complexity and accuracy; hence, both approaches have similar complexity. However, in Section 5.1.1, we show the source stage of the first approach (forward DP and random walks) can be accelerated in a way not possible for the second approach. Further, in Section 5.1.2, we show the target stage (backward DP) can be accelerated as well. Hence, using primitive method FW-BW-MCMC and the accelerations of Sections 5.1.1-5.1.2, we can more efficiently estimate $\{\pi_{s}(t)\}_{s\in S,t\in T}$ .

5.1.1. Source stage acceleration

To accelerate the source stage, we define a unified MCMC stage for a set of sources $S$ . At a high level, this scheme allows us to share walks across multiple $s\in S$ , thereby decreasing the total number of walks required. We motivate the scheme pictorially in Fig. 3(a), for the simple case $S=\{s_{1},s_{2}\}$ . Here blue and red depict $\sigma_{s_{1}}$ and $\sigma_{s_{2}}$ values, i.e. blue and red nodes are the starting nodes of random walks used in the $\pi_{s_{1}}$ and $\pi_{s_{2}}$ estimates, respectively. Observe several nodes have nonzero $\sigma_{s_{1}}$ and $\sigma_{s_{2}}$ values. The unified MCMC stage allows us to use random walks sampled from such nodes towards both estimates ( $\pi_{s_{1}}$ and $\pi_{s_{2}}$ ).

To define the unified MCMC stage, we first define an equivalent MCMC stage for a single source. Recall that in Algorithm 3 we sample each of $w$ random walks in two stages: first, we sample starting node $\nu_{s}\sim\sigma_{s}$ , and second, we sample a walk starting at $\nu_{s}$ . Equivalently, we can first sample starting nodes $\{\nu_{s}^{(i)}\}_{i=1}^{w}$ i.i.d. from $\sigma_{s}$ , and then sample $X^{(w)}_{s}(v):=\sum_{i=1}^{w}\mathbbm{1}_{\{\nu_{s}^{(i)}=v\}}$ walks starting at $v$ , for each $v\in V$ . With this in mind, the unified MCMC stage proceeds as follows. First, for each $s\in S$ we sample starting nodes $\{\nu_{s}^{(i)}\}_{i=1}^{w}$ i.i.d. from $\sigma_{s}$ (as in the single source case), and we define $X^{(w)}_{s}(v)$ as above. Next, we sample $X^{(w)}(v):=\max_{s\in S}X^{(w)}_{s}(v)$ walks starting at each $v\in V$ . Letting $U_{i}^{v}$ denote the endpoint of the $i$ -th walk from $v$ , we then estimate $\pi_{s}(t)$ as

[TABLE]

The final term in (9) is an unbiased estimate of $\mathbb{E}_{U\sim\pi_{\sigma_{s}}}[r^{t}(U)]$ using $\sum_{v\in V}X_{s}^{(w)}(v)=w$ random walks, so the accuracy guarantee of Algorithm 3 holds. To analyze the complexity of this scheme, first note we must sample $\sum_{v\in V}X^{(w)}(v)$ walks in total. We can then prove the following.

Theorem 5.1.

Fix $\epsilon,p_{\textrm{fail}}\in(0,1)$ . Assume $w$ , the number of random walks required for each $s\in S$ in the unified MCMC stage from Section 5.1.1, satisfies

[TABLE]

Then with probability at least $1-p_{\textrm{fail}}$ , the total number of walks $\sum_{v\in V}X^{(w)}(v)$ sampled satisfies

[TABLE]

Proof.

See Appendix E. ∎

Before proceeding, we offer several remarks on this result:

• A lower bound on $w$ is given by (35) in Theorem A.1 to guarantee accuracy of each estimate. Thus, if $w$ exceeds both (10) and (35), we obtain guarantees for both scalar accuracy and complexity of the walks. (In general, though, it is unclear which of (10) and (35) is larger.)

• In the worst case, the denominator on the right side of (10) may be quite small, so the assumption on $w$ in Theorem 5.1 may be restrictive. However, this only means that the concentration in (11) may not provably occur, not that the scheme will necessarily have poor performance. Furthermore, we find that this concentration essentially occurs for the values of $w$ used in practice; see, e.g., the leftmost plot in Fig. 5 and the left two plots in Fig. 8.

• We will denote the matrix with rows $\{\sigma_{s}\}_{s\in S}$ by $\Sigma$ (or by $\Sigma_{S}$ , if we wish to emphasize the sources $S$ at hand) and will write the bound in (11) as

[TABLE]

Here we have used the notation of the $L_{p,q}$ matrix norm, defined for a matrix $A$ as

[TABLE]

From Theorem 5.1, we expect to sample approximately $w\|\Sigma\|_{\infty,1}$ walks when $w$ is large. It is easily verified that $\|\Sigma\|_{\infty,1}\in[1,|S|]$ , so our approach requires $w|S|$ random walks in the worst case, but only $w$ in the best case. In contrast, if we use Bidirectional-PPR as a primitive for many pair estimation, the unified MCMC stage is not possible (all random walks used to estimate $\pi_{s}$ begin at $s$ , so sharing walks is not possible), and $w|S|$ walks are always required. In short, FW-BW-MCMC with the unified MCMC stage accelerates the source stage of our many pair estimation approach.
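The walk-sharing count and the clustering quantity can be illustrated with a short sketch. It assumes that $\|\Sigma\|_{\infty,1}$ amounts to summing, over nodes $v$, the largest $\sigma_{s}(v)$ across sources, which is consistent with the range $[1,|S|]$ stated above; the function names and the dictionary representation of each $\sigma_{s}$ are hypothetical.

```python
import random
from collections import Counter


def start_counts(sigma, w, rng):
    """X_s(v): how many of w i.i.d. draws from sigma land on each node v."""
    nodes = list(sigma)
    weights = [sigma[v] for v in nodes]
    return Counter(rng.choices(nodes, weights=weights, k=w))


def unified_walk_counts(sigmas, w, rng):
    """Per-source counts X_s(v) and shared counts X(v) = max_s X_s(v)."""
    per_source = {s: start_counts(sig, w, rng) for s, sig in sigmas.items()}
    shared = Counter()
    for counts in per_source.values():
        for v, c in counts.items():
            shared[v] = max(shared[v], c)
    return per_source, shared


def sigma_inf_one_norm(sigmas):
    """Assumed form of the clustering quantity: sum_v max_s sigma_s(v)."""
    nodes = set().union(*sigmas.values())
    return sum(max(sig.get(v, 0.0) for sig in sigmas.values()) for v in nodes)
```

Only `sum(shared.values())` walks are sampled in total; for source $s$, the first $X_{s}^{(w)}(v)$ walks from each $v$ enter its estimate. Two sources with identical $\sigma_{s}$ give a norm of 1, while two sources with disjoint supports give a norm of 2 and no sharing.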

Unfortunately, it is difficult to quantify the degree of this acceleration in general, in part because $\|\Sigma_{S}\|_{\infty,1}$ depends on the forward DP, which itself is difficult to analyze. However, in Section 6, we offer empirical evidence that $\|\Sigma_{S}\|_{\infty,1}$ scales with the conductance of $S$ , a common measure of the clustering of $S$ in the underlying graph (see (30) for a formal definition). Furthermore, as will be discussed next, this quantity provably scales with clustering for the stochastic block model, a common model for networks with community structure. In short, when $S$ is clustered in the graph, $\|\Sigma_{S}\|_{\infty,1}$ is typically small, and estimating PPR for many sources is easier.

We now turn to our result for the stochastic block model. We consider the special case for which $n$ is a perfect square and the graph is composed of $\sqrt{n}$ communities, each containing $\sqrt{n}$ nodes. (This allows us to compare the extremes of choosing $\sqrt{n}$ sources from the same community or from distinct communities; however, the analysis can be modified for other cases.) More specifically, we define $V_{n,i}=\{1+(i-1)\sqrt{n},\ldots,i\sqrt{n}\}$ and set $V_{n}=\cup_{i=1}^{\sqrt{n}}V_{n,i}$ ; we will view each $V_{n,i}$ as a community. For $v\in V_{n}$ , we denote by $i(v)$ the unique $i\in\{1,\ldots,\sqrt{n}\}$ satisfying $v\in V_{n,i}$ , i.e. $i(v)$ is the community that $v$ belongs to. We then construct a graph $G_{n}=(V_{n},E_{n})$ as follows: for any $u,v\in V_{n}$ s.t. $u\neq v$ , edge $u\rightarrow v$ is present with probability $p_{n}$ if $i(u)=i(v)$ (i.e. if $u,v$ are in the same community), and is present with probability $q_{n}$ if $i(u)\neq i(v)$ (i.e. if $u,v$ are in different communities), independent of other edges. We also define

[TABLE]

In words, $d_{\textrm{out}}(v)$ is $v$ ’s out-degree (as before, though here it is a random variable), and $d_{\textrm{out}}^{-}(v)$ is the number of edges pointing from $v$ to other communities.
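The construction above can be sketched as follows. This is an illustrative generator, not the code used for the paper's experiments: nodes are 0-indexed here (so community $i(v)$ is `v // k` with $k=\sqrt{n}$), and the names `directed_sbm` and `degree_stats` are hypothetical.

```python
import math
import random


def directed_sbm(n, p, q, rng):
    """Directed SBM with sqrt(n) communities of sqrt(n) nodes each.

    Edge u -> v (u != v) is present independently with probability p if
    u and v share a community, and q otherwise.
    """
    k = math.isqrt(n)
    if k * k != n:
        raise ValueError("n must be a perfect square")
    out_nbrs = {u: [] for u in range(n)}
    for u in range(n):
        for v in range(n):
            if u == v:
                continue
            prob = p if u // k == v // k else q
            if rng.random() < prob:
                out_nbrs[u].append(v)
    return out_nbrs, k


def degree_stats(out_nbrs, k, v):
    """Return (d_out(v), number of out-edges from v to other communities)."""
    d_out = len(out_nbrs[v])
    d_cross = sum(1 for u in out_nbrs[v] if u // k != v // k)
    return d_out, d_cross
```

The double loop costs $O(n^{2})$ coin flips, which is adequate for small illustrative graphs but would need a sparse sampler at scale.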

Our analysis will assume $p_{n}=p$ is a constant and $q_{n}=o(1/\sqrt{n})$ . In this case, $\mathbb{E}[d_{\textrm{out}}(v)]=\Theta(\sqrt{n})$ (i.e. the graph is dense) and $\mathbb{E}[d_{\textrm{out}}^{-}(v)]=o(\sqrt{n})$ (i.e. nodes prefer to connect to their own community). Also, we assume the forward DP is run for at most $o(\sqrt{n})$ iterations. Since all nodes have out-degree $\Theta(\sqrt{n})$ with high probability (see proof of Theorem 5.2), this means we dedicate at most $o(n)$ complexity to the forward DP. This is consistent with the fact that our algorithm has average-case complexity $O(\sqrt{m})$ , since $\sqrt{m}=n^{3/4}$ when all out-degrees are $\Theta(\sqrt{n})$ . Hence, this assumption on the number of iterations is minor. Under these assumptions, we can prove the following:

Theorem 5.2.

Let $\{G_{n}=(V_{n},E_{n})\}_{n\in\mathbb{N}:\sqrt{n}\in\mathbb{N}}$ be the sequence of stochastic block models described above, with $p_{n}=p$ for some constant $p\in(0,1)$ and $q_{n}=o(1/\sqrt{n})$ . Assume we run the forward DP for at least one iteration, but at most $o(\sqrt{n})$ iterations. Then the following hold:

• For each $n$ , let $S_{n}=V_{n,i}$ for some $i\in\{1,\ldots,\sqrt{n}\}$ , i.e. all sources belong to the same community. If $q_{n}=\Omega(\log n/n)$ (i.e. cross-community connections are dense), then for some constant $C>0$ ,

[TABLE]

i.e. $\|\Sigma_{S_{n}}\|_{\infty,1}=O(q_{n}n)=o(\sqrt{n})$ with high probability. If instead $q_{n}=\Theta(1/n)$ (i.e. cross-community connections are sparse), then for some constant $C>0$ ,

[TABLE]

i.e. $\|\Sigma_{S_{n}}\|_{\infty,1}=O(\log n/\log\log n)$ with high probability.

• For each $n$ , let $S_{n}\subset V_{n}$ with $|S_{n}|=\sqrt{n}$ and $i(s)\neq i(s^{\prime})\ \forall\ s,s^{\prime}\in S_{n}$ s.t. $s\neq s^{\prime}$ , i.e. each source belongs to a distinct community. Then for any constant $\delta\in(0,1)$ ,

[TABLE]

i.e. $\|\Sigma_{S_{n}}\|_{\infty,1}\in[(1-\delta)\sqrt{n},\sqrt{n}]$ with high probability.

Proof.

See Appendix H. ∎

5.1.2. Target stage acceleration

Our next goal is to accelerate the target stage of the many pair estimation approach. For this, we propose a unified DP stage that avoids repeated computations that may occur when running backward DP for each target separately. We motivate our approach in the simple case $T=\{t_{1},t_{2}\}$ . Assume that $p^{t_{1}},r^{t_{1}}$ have been computed by Algorithm 2, and that $p^{t_{2}},r^{t_{2}}$ are currently being computed. If $r^{t_{2}}(t_{1})>r^{t}_{\max}$ at some iteration, we can use the alternate update rule given by (18) (rather than that given in Algorithm 2).

[TABLE]

When $p^{t_{2}},r^{t_{2}}$ are updated via (18), the invariant (6) is maintained. Indeed, for any $s\in V$ ,

[TABLE]

where in (20) we simply rearranged terms and in (21) we assume $p^{t_{1}},r^{t_{1}}$ and $p^{t_{2}},r^{t_{2}}$ satisfy (6).

We can interpret the update rule (18) as follows. As discussed in Section 4, we view Algorithm 2 as a method of traversing paths to $t$ and computing the probability of these paths. For the update in Algorithm 2, specific paths are extended by a single step at each iteration; we call this update Extend. In contrast, the alternate update rule (18) extends paths by (potentially) many steps in an iteration; specifically, it concatenates paths to $t_{1}$ with paths from $t_{1}$ to $t_{2}$ to obtain paths to $t_{2}$ . We call this update Merge to highlight that longer paths are combined to obtain new paths.

The utility of Merge is that the probability of paths to $t_{2}$ through $t_{1}$ need not be recomputed one step at a time via Extend. This is depicted in Fig. 3(b): red paths are computed via Extend during $t_{2}$ DP; blue paths, having already been computed via Extend during $t_{1}$ DP, are used to compute longer paths in a single iteration via Merge during $t_{2}$ DP. In contrast, blue paths would be recomputed one step at a time via Extend during $t_{2}$ DP, if separate DP was used. In short, Merge may allow Algorithm 2 to terminate in fewer iterations. This is made more specific in Proposition 5.3.

Proposition 5.3.

Suppose $T=\{t_{1},t_{2}\}$ and $\pi_{t_{1}}(t_{2})>r^{t}_{\max}$ . If we run Algorithm 2 for $t_{2}$ and use Merge at iterations for which $v^{*}=t_{1}$ , the algorithm terminates in at most $\frac{n\pi(t_{2})}{\alpha r^{t}_{\max}}-\frac{(\|p^{t_{1}}\|_{1}-\alpha)}{\alpha}$ iterations. If Merge is not used, the number of iterations for termination is at most $\frac{n\pi(t_{2})}{\alpha r^{t}_{\max}}$ .

Proof.

See Appendix F. ∎

From Algorithm 2, $\|p^{t_{1}}\|_{1}\geq\alpha$ . Hence, Proposition 5.3 allows us to tighten the iteration bound by $\frac{(\|p^{t_{1}}\|_{1}-\alpha)}{\alpha}\geq 0$ (with equality if and only if the algorithm terminates in a single iteration for $t_{1}$ ). In the more general case, the number of iterations we save roughly scales with the quantity

[TABLE]

assuming the nodes in $T$ are chosen in order $\{t_{1},t_{2},\dotsc,t_{|T|}\}$ . We note the choice of this order has a clear impact on performance, but optimizing it at runtime is difficult; we discuss this more in Appendix I. See Algorithm 4 for our many target algorithm.
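The interplay of Extend and Merge can be sketched as below. This is not Algorithm 4 verbatim. The Extend step is the standard backward push. The Merge step is an update derived here by substituting the invariant for $t_{1}$ into the invariant for $t_{2}$: with $c=r^{t_{2}}(t_{1})$, set $r^{t_{2}}(t_{1})$ to zero and then add $c\,p^{t_{1}}$ to $p^{t_{2}}$ and $c\,r^{t_{1}}$ to $r^{t_{2}}$. It stands in for (18) as an assumption. The names `backward_dp`, `in_nbrs`, `out_deg`, and `done` are hypothetical.

```python
def backward_dp(in_nbrs, out_deg, t, alpha, r_max, done=None):
    """Backward DP for target t, using Merge at already-processed targets.

    done: dict mapping a finished target to its (p, r) dicts.
    Returns (p, r, number of Extend updates, number of Merge updates).
    """
    done = done or {}
    p, r = {}, {t: 1.0}
    extends = merges = 0
    while True:
        # Any node whose residual exceeds r_max may be processed next.
        v = next((u for u, x in r.items() if x > r_max), None)
        if v is None:
            return p, r, extends, merges
        mass = r.pop(v)
        if v in done:
            # Merge: reuse the stored vectors of the finished target v.
            p_v, r_v = done[v]
            for u, x in p_v.items():
                p[u] = p.get(u, 0.0) + mass * x
            for u, x in r_v.items():
                r[u] = r.get(u, 0.0) + mass * x
            merges += 1
        else:
            # Extend: push the residual one step along incoming edges.
            p[v] = p.get(v, 0.0) + alpha * mass
            for u in in_nbrs[v]:
                r[u] = r.get(u, 0.0) + (1.0 - alpha) * mass / out_deg[u]
            extends += 1
```

A many-target run would call this once per target in the chosen order, adding each result to `done` before moving on. Since both variants stop with all residuals at most $r^{t}_{\max}$, their $p^{t_{2}}$ vectors agree entrywise to within $r^{t}_{\max}$ under the invariant.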

We next offer a clustering interpretation of the quantity $c_{T}$ . For this, note $\pi_{t_{j}}(t_{i})>r^{t}_{\max}$ is a notion of “closeness” between $t_{i}$ and $t_{j}$ ; hence, $c_{T}$ is a notion of clustering of the set $T$ , and our analysis suggests estimating PPR for many targets is easier when the targets are clustered. Note that, while the source clustering quantity $\|\Sigma\|_{\infty,1}$ from Section 5.1.1 is smaller when clustering among sources is more significant, the target clustering quantity $c_{T}$ is larger when clustering among targets is more significant; in Section 6, we show $-c_{T}$ scales with the conductance of $T$ in practice.

5.2. Matrix approximation viewpoint

For the second variant of the many pair estimation problem, we view $\{\pi_{s}(t)\}_{s\in S,t\in T}$ as a matrix that we aim to accurately approximate. For simplicity, we assume $|S|=|T|=l$ , and we denote these sets $S=\{s_{i}\}_{i=1}^{l},T=\{t_{i}\}_{i=1}^{l}$ . We also assume $V=\{1,2,...,n\}$ , and we let $\Pi$ denote the matrix of dimension $n\times n$ whose $(i,j)$ -th element is $\pi_{i}(j)$ . In this notation, we seek an estimate $\hat{\Pi}(S,T)$ of $\Pi(S,T)$ that minimizes $\|\hat{\Pi}(S,T)-\Pi(S,T)\|_{2}$ , where for a matrix $A$ , $A(I,J)$ denotes the submatrix of $A$ containing rows $I$ and columns $J$ , and where $\|A\|_{2}=\max_{\|x\|_{2}=1}\|Ax\|_{2}$ is the operator norm.

Before proceeding, we introduce additional notation used in this section. Similar to the $A(I,J)$ notation, $A(I,:)$ and $A(:,J)$ are the submatrices with rows $I$ and all columns, and all rows and columns $J$ , respectively. For a vector $x$ , $x(I)$ is the vector with elements $I$ ; when $x$ has nonzero entries, $\textrm{diag}(x)$ and $\textrm{diag}(1/x)$ are the diagonal matrices whose $i$ -th diagonal elements are $x(i)$ and $1/x(i)$ , respectively. Finally, we will encounter stable rank, which for a matrix $A$ is defined as $\textrm{srank}(A)=(\|A\|_{F}/\|A\|_{2})^{2}$ , where $\|\cdot\|_{F}=\|\cdot\|_{2,2}$ is the Frobenius norm, with $\|\cdot\|_{2,2}$ defined as in (13). It is straightforward to verify $1\leq\textrm{srank}(A)\leq\textrm{rank}(A)$ by writing $\|A\|_{F}^{2}$ and $\|A\|_{2}^{2}$ in terms of the singular values of $A$ (see, for example, Section 2.1.15 of (Tropp, 2015)).
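For completeness, the stable rank bounds can be written out in one line. Here $\sigma_{1}(A)\geq\cdots\geq\sigma_{k}(A)>0$ denote the nonzero singular values of a nonzero matrix $A$ with $k=\textrm{rank}(A)$ ; this notation is introduced only for this derivation and is unrelated to the distributions $\sigma_{s}$ .

```latex
\textrm{srank}(A)
  =\frac{\|A\|_{F}^{2}}{\|A\|_{2}^{2}}
  =\frac{\sum_{i=1}^{k}\sigma_{i}(A)^{2}}{\sigma_{1}(A)^{2}}
  =1+\sum_{i=2}^{k}\frac{\sigma_{i}(A)^{2}}{\sigma_{1}(A)^{2}}
  \in[1,k].
```

Each ratio in the last sum lies in $(0,1]$ , so the lower bound is attained when $A$ has rank one and the upper bound when all nonzero singular values are equal.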

With this notation in mind, we define the following matrices:

[TABLE]

Here $p^{s_{j}},r^{s_{j}}$ and $p^{t_{j}},r^{t_{j}}$ are assumed to have been computed via Algorithms 1 and 4, respectively. We may then collect the invariant (7) for each $(s_{i},t_{j})$ pair in matrix form as

[TABLE]

Observe only $R_{S}^{\mathsf{T}}\Pi R_{T}$ is unknown in (25). Hence, we consider estimation of this term. To this end, let $\sigma$ be any $n$ -length vector satisfying $\sigma(i)>0\ \forall\ i\in\{1,2,...,n\}$ and $\sum_{i=1}^{n}\sigma(i)=1$ ; note we may view $\sigma$ as a distribution on $V$ . We then rewrite the unknown term in (25) as

[TABLE]

Using (26), we can obtain unbiased estimates of $R_{S}^{\mathsf{T}}\Pi R_{T}$ as follows. Let $\{\mu_{i}\}_{i=1}^{w}$ be i.i.d. samples from $\sigma$ . For $i\in\{1,2,...,w\}$ , let $\nu_{i}\sim\pi_{\mu_{i}}$ independently (where we sample from $\pi_{\mu_{i}}$ using a random walk, as given by (3)), and let $X_{i}=R_{S}^{\mathsf{T}}\textrm{diag}(1/\sigma)e_{\mu_{i}}e_{\nu_{i}}^{\mathsf{T}}R_{T}$ . It is straightforward to see $\mathbb{E}[e_{\mu_{i}}e_{\nu_{i}}^{\mathsf{T}}]=\textrm{diag}(\sigma)\Pi$ ; hence, $\mathbb{E}[X_{i}]=R_{S}^{\mathsf{T}}\Pi R_{T}$ . We may then estimate $\Pi(S,T)$ as $\hat{\Pi}(S,T)=P_{T}(S,:)+P_{S}^{\mathsf{T}}R_{T}+\frac{1}{w}\sum_{i=1}^{w}X_{i}$ .
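A single sample $X_{i}$ and the resulting average can be sketched as follows. The sketch assumes the columns of $R_{S}$ and $R_{T}$ are the residual vectors $r^{s_{j}}$ and $r^{t_{j}}$ (the defining display is not reproduced here), stores each vector as a dictionary, and draws walk endpoints by stopping with probability $\alpha$ before each step; all function names are hypothetical.

```python
import random


def walk_endpoint(out_nbrs, start, alpha, rng):
    """Endpoint of a walk from `start` that stops w.p. alpha before each step."""
    v = start
    while rng.random() >= alpha:
        v = rng.choice(out_nbrs[v])
    return v


def outer_sample(r_sources, r_targets, sigma, mu, nu):
    """X_i for start node mu and endpoint nu, as a nested list.

    Entry (a, b) equals r^{s_a}(mu) / sigma(mu) * r^{t_b}(nu).
    """
    return [[r_s.get(mu, 0.0) / sigma[mu] * r_t.get(nu, 0.0)
             for r_t in r_targets] for r_s in r_sources]


def estimate_unknown_term(out_nbrs, r_sources, r_targets, sigma, alpha, w, rng):
    """Average of w samples X_i, estimating the unknown term in (25)."""
    nodes = list(sigma)
    weights = [sigma[v] for v in nodes]
    acc = [[0.0] * len(r_targets) for _ in r_sources]
    for mu in rng.choices(nodes, weights=weights, k=w):
        nu = walk_endpoint(out_nbrs, mu, alpha, rng)
        x = outer_sample(r_sources, r_targets, sigma, mu, nu)
        for a, row in enumerate(x):
            for b, val in enumerate(row):
                acc[a][b] += val / w
    return acc
```

The division by `sigma[mu]` is the importance weight that makes each sample unbiased, which is why the construction requires $\sigma(i)>0$ wherever some $r^{s}(i)$ is nonzero.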

We will consider two forms of $\sigma$ for this approach. Specifically, let us define

[TABLE]

where $\sigma_{s}=r^{s}/\|r^{s}\|_{1}$ as before. Observe that when $\sigma\in\{\sigma_{\textrm{avg}},\sigma_{\max}\}$ , the assumption $\sum_{i=1}^{n}\sigma(i)=1$ is satisfied. Furthermore, we argue that the assumption $\sigma(i)>0$ is without loss of generality in these cases. Indeed, suppose $\sigma(j)=0$ for some $j$ and $\sigma(i)>0$ for $i\neq j$ . Then $\mathbb{P}[\mu_{i}=j]=0$ by definition, and by (27), $r^{s}(j)=0\ \forall\ s\in S$ . It is then readily verified that $R_{S}(V\setminus\{j\},:)^{\mathsf{T}}\textrm{diag}(1/\sigma(V\setminus\{j\}))e_{\mu_{i}}e_{\nu_{i}}^{\mathsf{T}}R_{T}$ is an unbiased estimate of $R_{S}^{\mathsf{T}}\Pi R_{T}$ . Given this simple fix, we assume $\sigma(i)>0$ moving forward.

To summarize, we have proposed the matrix approximation scheme formally defined in Algorithm 5. Theorem 5.4 provides a guarantee for the accuracy of this scheme.

Theorem 5.4.

Fix $\epsilon>0$ . If $\sigma=\sigma_{\textrm{avg}}$ in Algorithm 5, assume the number of walks $w$ satisfies

[TABLE]

If instead $\sigma=\sigma_{\max}$ in Algorithm 5, assume $w$ satisfies

[TABLE]

Then for both choices of $\sigma$ , and with probability at least $1-p_{\textrm{fail}}$ , Algorithm 5 returns an estimate $\hat{\Pi}(S,T)$ satisfying $\|\Pi(S,T)-\hat{\Pi}(S,T)\|_{2}\leq\epsilon\max\{\|\Pi(S,T)\|_{2},1\}$ .

Proof.

See Appendix G. ∎

We note that, neglecting common factors, Theorem 5.4 states $w$ scales with $l^{2}$ and $l^{3/2}$ in the best case for $\sigma_{\textrm{avg}}$ and $\sigma_{\max}$ , respectively; in the worst case, $w$ scales with $l^{5/2}$ for both approaches. In the next section, we compare $\sqrt{l\ \textrm{srank}(\Pi(S,T))}$ with $\|\Sigma\|_{\infty,1}$ empirically to compare the “typical” case.

Next, we observe Theorem 5.4 shows that, as in the scalar estimation viewpoint of Section 5.1, PPR matrix approximation is easier when clustering occurs. This is because, when $\sigma=\sigma_{\max}$ , complexity scales with $\|\Sigma\|_{\infty,1}$ (which we have argued is a measure of clustering of $S$ ); when $\sigma=\sigma_{\textrm{avg}}$ , complexity scales with $\textrm{srank}(\Pi(S,T))$ , a measure of matrix dimensionality. Additionally, stable rank differs from the clustering quantities introduced thus far in that it takes into account both $S$ and $T$ (unlike $\|\Sigma\|_{\infty,1}$ , which only accounts for $S$ , or $c_{T}$ , which only accounts for $T$ ).

Finally, we comment on a difference for the choices of $\sigma$ . In particular, when $\sigma=\sigma_{\max}$ , one can set $w$ proportional to $\|\Sigma\|_{\infty,1}$ before sampling random walks, leveraging clustering at runtime to increase efficiency. In contrast, when $\sigma=\sigma_{\textrm{avg}}$ , the scaling factor in the $w$ lower bound is the unknown quantity $\textrm{srank}(\Pi(S,T))$ . However, we propose using $\textrm{srank}(P_{T}(S,:)+P_{S}^{\mathsf{T}}R_{T})$ (known at runtime) as a surrogate for $\textrm{srank}(\Pi(S,T))$ . In Section 6, we show empirically that using this surrogate yields performance similar to using $\textrm{srank}(\Pi(S,T))$ .

5.3. Precomputation variants

While we have thus far assumed all computations are done online, one can also consider variants for which some computations are done offline, with the results stored for later use. In fact, in Section 4 of (Lofgren et al., 2016), the authors propose several such algorithms for the case of one source $s\in V$ and many targets $T\subset V$ , using Bidirectional-PPR as a primitive. Each of these variants proceeds as follows. For the offline stage, Approx-Contributions is run for every $t\in V$ , and the vectors $\{p^{t},r^{t}\}_{t\in V}$ are stored. For the online stage, random walks are sampled from $s$ , and $\{\pi_{s}(t)\}_{t\in T}$ are estimated using the endpoints of these walks and $\{p^{t},r^{t}\}_{t\in T}$ . As mentioned, several such algorithms are proposed; these only differ in how the vectors are stored and how the walks and vectors are combined to generate estimates. In particular, the basic framework of running Approx-Contributions offline and sampling walks from $s$ online is used in all of the precomputation algorithms from (Lofgren et al., 2016).

Analogous to our extension of Bidirectional-PPR from the single pair case to the many pairs case, we can extend these precomputation algorithms from the single source case to the many sources case. Specifically, we can modify each of these algorithms in two ways (but otherwise leave them unchanged). First, we can modify the offline stage by also precomputing and storing $\{p^{s},r^{s}\}_{s\in V}$ via Approx-PageRank. Second, we can modify the online stage by sampling walks using the precomputed vectors $\{r^{s}\}_{s\in S}$ and the walk sharing scheme from Section 5.1.1.

To assess the performance of this approach, we compare against the naive extension of (Lofgren et al., 2016)’s precomputation algorithms to the case $|S|>1$ ; namely, leaving the offline stage unchanged and sampling walks separately from each $s\in S$ online. Clearly, our approach requires more storage (due to running Approx-PageRank offline); however, this storage will be roughly double that of the naive extension and thus will not increase the order of the space complexity. On the other hand, our approach will accelerate the online stage of this naive extension, since fewer random walks will typically be sampled. Specifically, per Section 5.1.1, we expect to sample $w\|\Sigma\|_{\infty,1}$ walks instead of $w|S|$ walks; as discussed previously, the former quantity can be much smaller if $S$ is clustered.

We also note that Algorithm 4 can be used to compute $\{p^{t},r^{t}\}_{t\in V}$ offline, though this is a minor point, since offline computational complexity is generally not a concern. However, this raises another point. When precomputation is not allowed, our source and target accelerations are both used at runtime; when precomputation is allowed, only our source acceleration is used at runtime. Hence, the runtime savings of our schemes may be less significant in the precomputation setting. In spite of this, we believe the savings will still be considerable in general. This belief follows from the fact that, in our experiments, the source acceleration is generally at least as significant as the target acceleration. For example, Fig. 4 shows that the number of random walks sampled grows more slowly in $|S|$ than the number of DP iterations grows in $|T|$ . Additionally, Fig. 7 shows that for fixed $|S|,|T|$ , walk savings and DP iteration savings are comparable across a wide range of graphs.

6. Experiments

In this section, we demonstrate the empirical performance of our algorithms and the role of clustering in their performance. We conduct experiments using both synthetic and real graphs. On the synthetic side, we use a directed Erdős-Rényi graph and directed stochastic block model (referred to hereafter as Direct-ER and Direct-SBM, respectively), each with $n=2\times 10^{3}$ and $\mathbb{E}[m]=2\times 10^{4}$ . For the real datasets, we use a set of graphs from the Stanford Network Analysis Platform (Leskovec and Krevl, [n. d.]) including social networks (Slashdot, Wiki-Talk), partial web crawls (web-BerkStan, web-Google), co-purchasing and co-authoring graphs (com-amazon, com-dblp), and a road network (roadNet-PA). In addition to the diverse domains of these datasets, they differ in terms of sparsity (in order of magnitude, each has $10^{6}$ edges, but the number of nodes ranges from $10^{4}$ to $10^{6}$ ), so we believe our empirical results are robust across different graph structures. We also note that error bars depict standard deviation across experimental trials, while for scatter plots without error bars, each dot represents a single trial. For further experimental documentation, we point the reader to Appendix J. In particular, Table 3 in Appendix J documents algorithmic parameters used. We chose these parameters so the primitive algorithms FW-BW-MCMC and Bidirectional-PPR yield similar accuracy ( $\approx 10\%$ relative error) while balancing runtime between the DP and MCMC stages of the algorithm in the single pair case. Note the analysis in Appendix A shows that balancing runtime in this manner minimizes overall complexity; hence, for both algorithms, our chosen parameters optimize runtime subject to an accuracy constraint, providing a fair comparison. Finally, the implementation of our algorithms is available at https://github.com/danielvial/clusteringPpr.

6.1. Synthetic data

6.1.1. Scalar estimation

We first compare FW-BW-MCMC with Bidirectional-PPR when computing $\pi_{s}(t)\ \forall\ (s,t)\in S\times T$ as $|S|$ and $|T|$ grow on Direct-ER. More specifically, for FW-BW-MCMC we use the $\|D^{-1}r^{s}\|_{\infty}\leq r_{\max}^{s}$ forward DP scheme as in FW-BW-MCMC-Practical, sample walks using the scheme from Section 5.1.1, and use Algorithm 4 for backward DP; for Bidirectional-PPR, we sample walks separately from each $s\in S$ and run backward DP separately for each $t\in T$ . Results are shown in Fig. 4. Note the number of random walks sampled and number of backward DP iterations grow more slowly with $|S|=|T|$ using FW-BW-MCMC, due to the accelerations proposed in Sections 5.1.1 and 5.1.2, respectively. As a result, runtime grows more slowly using FW-BW-MCMC. In Fig. 4, we also show the clustering quantities (12) and (22). We observe the source clustering quantity $\|\Sigma\|_{\infty,1}$ has a concave shape, which corresponds to the apparent sublinear growth of random walks as $|S|$ grows. Additionally, the target clustering quantity $c_{T}$ has a convex shape; since backward DP iteration savings scale with $c_{T}$ , we expect DP iterations to correspondingly “flatten”, which indeed occurs. These observations empirically validate the key insights of Section 5.1: namely, that the estimation schemes proposed have complexities that scale with the identified clustering quantities $\|\Sigma\|_{\infty,1}$ and $c_{T}$ . We also plot $\textrm{srank}(\Pi(S,T))$ on the runtime plot; note it appears to flatten along with runtime as $|S|,|T|$ grow. Finally, these plots remain qualitatively similar as $n$ grows, while the improvement of our scheme over the existing one increases; see Appendix J.

Next, to further examine the effect of clustering, we use Direct-SBM. We fix $|S|=|T|=100$ and sample $S$ and $T$ from decreasingly clustered sets via the following scheme: we first sample $S,T$ from a single community, we then sample $S,T$ from two communities, etc., until we sample $S,T$ from the entire graph, allowing us to observe a wide range of clustering. As in the previous experiment, we are interested in how algorithmic performance relates to $\|\Sigma\|_{\infty,1}$ and $c_{T}$ . Here, we also compare these quantities to a clustering measure commonly used in the graph theory literature (see e.g. the aforementioned (Andersen et al., 2006)), conductance, defined for $U\subset V$ as

[TABLE]

In Fig. 5, we observe fewer random walks are sampled when $\Phi(S)$ is small (when $S$ is significantly clustered); similarly, the backward DP converges in fewer iterations when $\Phi(T)$ is small (when $T$ is significantly clustered). Furthermore, Fig. 5 shows that $\|\Sigma\|_{\infty,1}$ grows with $\Phi(S)$ and $-c_{T}$ grows with $\Phi(T)$ . In short, our identified clustering quantities behave similarly to conductance. In the runtime plot, we again show $\textrm{srank}(\Pi(S,T))$ as a measure of overall complexity; this quantity (roughly) grows with the average conductance $\frac{1}{2}(\Phi(S)+\Phi(T))$ , as does runtime.
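For readers who want to compute a conductance value on a small graph, the sketch below uses one common convention for directed graphs: the number of edges leaving $U$ divided by the smaller of the out-degree volumes of $U$ and $V\setminus U$. This convention is an assumption and may differ in normalization from the definition in (30); the function name is hypothetical.

```python
def conductance(out_nbrs, subset):
    """Edges leaving `subset` over min(out-volume inside, out-volume outside)."""
    inside = set(subset)
    cut = sum(1 for u in inside for v in out_nbrs[u] if v not in inside)
    vol_in = sum(len(out_nbrs[u]) for u in inside)
    vol_out = sum(len(nbrs) for u, nbrs in out_nbrs.items() if u not in inside)
    denom = min(vol_in, vol_out)
    if denom == 0:
        raise ValueError("conductance undefined: one side has zero volume")
    return cut / denom
```

Under this convention a tightly knit set has a value near 0, while a set whose edges almost all leave it has a value near 1.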

6.1.2. Matrix approximation

We now document performance of our matrix approximation scheme (Algorithm 5) using Direct-SBM and the $S,T$ sampling strategy from the previous experiment. We compare three cases: $\sigma=\sigma_{\max}$ with $w\propto\|\Sigma\|_{\infty,1}$ , $\sigma=\sigma_{\textrm{avg}}$ with $w\propto\sqrt{l\ \textrm{srank}(\Pi(S,T))}$ , and $\sigma=\sigma_{\textrm{avg}}$ with $w\propto\sqrt{l\ \textrm{srank}(P_{T}(S,:)+P_{S}^{\mathsf{T}}R_{T})}$ . These cases are motivated by Theorem 5.4, which states that the sample requirements for $\sigma=\sigma_{\max}$ and $\sigma=\sigma_{\textrm{avg}}$ are $\|\Sigma\|_{\infty,1}$ and $\sqrt{l\ \textrm{srank}(\Pi(S,T))}$ , respectively (neglecting common factors); additionally, since $\textrm{srank}(\Pi(S,T))$ is unknown in practice, we proposed using $\textrm{srank}(P_{T}(S,:)+P_{S}^{\mathsf{T}}R_{T})$ as a surrogate in the discussion following the theorem. Results are shown in Fig. 6. Observe that for all three cases, fewer walks are sampled when $S$ and $T$ are clustered (i.e. when $\frac{1}{2}(\Phi(S)+\Phi(T))$ is small); nevertheless, error remains roughly constant (in fact, when clustering is present, error is somewhat lower despite fewer walks being sampled). Further, we observe $\sigma_{\max}$ and $\sigma_{\textrm{avg}}$ have similar performance, in terms of complexity and accuracy. Finally, we note the results for the $\textrm{srank}(\Pi(S,T))$ and $\textrm{srank}(P_{T}(S,:)+P_{S}^{\mathsf{T}}R_{T})$ cases are quite similar, suggesting that $\textrm{srank}(P_{T}(S,:)+P_{S}^{\mathsf{T}}R_{T})$ is an appropriate surrogate for $\textrm{srank}(\Pi(S,T))$ .

6.2. Real data

6.2.1. Scalar estimation

We next compare FW-BW-MCMC with Bidirectional-PPR as in Section 6.1.1, but here using real datasets. We fix $|S|=|T|=1000$ and randomly sample $S,T$ using two different schemes: sampling uniformly among all nodes and using an algorithm described in Appendix J to build clustered subsets of nodes; we find these schemes typically give conductance values $\approx 0.99$ and $\approx 0.5$ , respectively, allowing us to observe two degrees of clustering. In Fig. 7, we show random walk count, DP iteration count, and runtime for our method relative to the corresponding values using Bidirectional-PPR. Averaging across the diverse set of graphs considered, our method is approximately 1.4 times faster in the uniform case and 2.9 times faster in the clustered case, highlighting the efficiency of our algorithms and the impact of clustering on their performance. Additionally, we note our method is at least twice as fast for all datasets in the clustered case. For the same experiment, we also show random walk count (normalized to $w$ ) and the number of Merge updates (i.e. the number of DP iterations saved when compared to existing methods) in Fig. 8. From Theorem 5.1 and Proposition 5.3, we expect these quantities to scale linearly with the identified clustering quantities $\|\Sigma\|_{\infty,1}$ and $c_{T}$ , respectively; from Fig. 8, we observe this scaling roughly occurs. This verifies our analysis empirically on real datasets.

6.2.2. Matrix approximation

Finally, we test our matrix approximation scheme (Algorithm 5) on real graphs. Here we also compare to a baseline method that does not leverage clustering among targets and sources. In particular, we run backward DP separately for each target, rather than using the accelerated scheme as in Algorithm 5. Additionally, the baseline method uses no forward DP, i.e. we set $r^{s}_{\max}=1$ in Algorithm 5, so that $p^{s}=0,r^{s}=e_{s}\ \forall\ s\in S$ . Note that, in this case, both the $\sigma_{\max}$ and $\sigma_{\textrm{avg}}$ schemes reduce to sampling $\mu_{i}\sim S$ uniformly, sampling $\nu_{i}\sim\pi_{\mu_{i}}$ using a random walk, and estimating $\Pi(S,T)$ as $\hat{\Pi}(S,T)=P_{T}(S,:)+\frac{1}{w}\sum_{i=1}^{w}X_{i}$ , where $X_{i}=[e_{s_{1}}\ e_{s_{2}}\ \cdots\ e_{s_{l}}]^{\mathsf{T}}e_{\mu_{i}}e_{\nu_{i}}^{\mathsf{T}}R_{T}$ is an unbiased estimate of $\Pi(S,:)R_{T}$ . We reemphasize that walks are not shared among sources for this baseline scheme, i.e. clustering among sources is not leveraged to improve performance. For the baseline scheme, we set $w\propto l$ , and we compare performance to the $\sigma_{\max}$ scheme with $w\propto\|\Sigma\|_{\infty,1}$ and the $\sigma_{\textrm{avg}}$ scheme with $w\propto\sqrt{l\ \textrm{srank}(P_{T}(S,:)+P_{S}^{\mathsf{T}}R_{T})}$ . Results are shown in Fig. 9, with quantities shown for the $\sigma_{\max}$ and $\sigma_{\textrm{avg}}$ schemes relative to the baseline scheme. Averaging across datasets, the $\sigma_{\max}$ and $\sigma_{\textrm{avg}}$ schemes are over twice as fast as the baseline scheme when $S,T$ are chosen uniformly and 3.4 times faster when $S,T$ are clustered; additionally, the accuracy of both schemes is comparable to the baseline across datasets (and slightly better on average). We also note both our schemes are at least twice as fast as the baseline for all graphs in the clustered case.
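The step of sampling a node from a PPR distribution using a random walk can be illustrated with a short Python sketch. It assumes the standard random-walk characterization of PPR (a walk from the source that stops at each step with probability $\alpha$ ends at a node distributed according to the source's PPR vector) and that every node has at least one out-neighbor. The names `sample_ppr_endpoint`, `estimate_ppr_row`, and the adjacency-list format `adj` are hypothetical and are not taken from the paper's implementation.

```python
import random


def sample_ppr_endpoint(adj, s, alpha, rng):
    """Return the endpoint of one alpha-terminated random walk started at s.

    adj maps each node to a non-empty list of out-neighbors (an assumption
    of this sketch; dangling nodes are not handled).
    """
    v = s
    # At each step, stop with probability alpha; otherwise move to a
    # uniformly random out-neighbor of the current node.
    while rng.random() >= alpha:
        v = rng.choice(adj[v])
    return v


def estimate_ppr_row(adj, s, alpha, num_walks, seed=0):
    """Monte Carlo estimate of the PPR vector of s from walk endpoints."""
    rng = random.Random(seed)
    counts = {v: 0 for v in adj}
    for _ in range(num_walks):
        counts[sample_ppr_endpoint(adj, s, alpha, rng)] += 1
    return {v: c / num_walks for v, c in counts.items()}


# Toy directed 3-cycle: 0 -> 1 -> 2 -> 0.
toy_adj = {0: [1], 1: [2], 2: [0]}
toy_estimate = estimate_ppr_row(toy_adj, s=0, alpha=0.5, num_walks=20000)
```

On this toy cycle with $\alpha=0.5$, the exact PPR values from node 0 are $4/7$, $2/7$, and $1/7$, so the empirical frequencies should land close to these. Each walk takes $(1-\alpha)/\alpha$ steps in expectation, which matches the $\frac{1}{\alpha}$ per-walk cost used in the complexity analysis.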

7. Application: distributed random walk sampling

Thus far, our key finding has been that PPR estimation complexity scales with quantities that describe clustering among sources and/or targets. In this section, we demonstrate one application of these findings; namely, that these findings can be used to efficiently estimate $\{\pi_{s}\}_{s\in S}$ online when several machines are available and when offline storage is permitted. More specifically, we consider a natural distributed computational setting with the following features:

• $k$ machines are available for parallel computation and a central machine is available to facilitate the parallel computation (for simplicity, we assume $k\in\{|S|,|S|/2,|S|/3,\ldots\}$ )

• $\{p^{t},r^{t}\}_{t\in V}$ have been precomputed via Algorithm 2 and are stored offline

Using the existing method Bidirectional-PPR as a primitive, a baseline strategy for this estimation task is as follows: arbitrarily partition $S$ into $k$ subsets of size $|S|/k$ , use the $i$ -th machine to sample random walks from each source $s$ belonging to the $i$ -th subset, and estimate $\pi_{s}$ using the endpoints of walks from $s$ and $\{p^{t},r^{t}\}_{t\in V}$ (as in the primitive method Bidirectional-PPR).

Our goal is to devise a strategy more efficient than this baseline. In particular, we propose the following approach. First, we arbitrarily partition $S$ into $k$ subsets of size $|S|/k$ , and we use the $i$ -th machine to run forward DP (Algorithm 1) for each source $s$ belonging to the $i$ -th subset. Second, we use the central machine to construct another partition $\{S_{i}\}_{i=1}^{k}$ of $S$ , in a manner we discuss shortly. Third, we use the $i$ -th machine to run the accelerated source stage from Section 5.1.1 for the subset of sources $S_{i}$ . Finally, we estimate $\pi_{s}$ as in the primitive method FW-BW-MCMC.

It remains to specify how to construct the partition $\{S_{i}\}_{i=1}^{k}$ . For this, we turn to Theorem 5.1 and the results of Section 6, which indicate that the number of random walks sampled on the $i$ -th machine scales with $\|\Sigma_{S_{i}}\|_{\infty,1}$ , where $\Sigma_{S_{i}}$ is the matrix with rows $\{\sigma_{s}\}_{s\in S_{i}}$ . Hence, as the random walk stage in our approach runs in parallel across $i$ , the runtime of this stage scales with

[TABLE]

Our goal is thus to construct the partition $\{S_{i}\}_{i=1}^{k}$ so as to minimize (31). However, as this is a combinatorial optimization problem, we devise a heuristic method to approximate the solution. To simplify the discussion of this method, we introduce some notation. For $S^{\prime}\subset S$ , let $\sigma_{S^{\prime}}$ be s.t. $\sigma_{S^{\prime}}(v)=\max_{s^{\prime}\in S^{\prime}}\sigma_{s^{\prime}}(v)\ \forall\ v\in V$ ; note $\|\Sigma_{S^{\prime}}\|_{\infty,1}=\|\sigma_{S^{\prime}}\|_{1}$ . For $S^{\prime}\subset S$ and $s\in S\setminus S^{\prime}$ , let

[TABLE]

It is straightforward to show (33), i.e. (32) gives the increase in $\|\sigma_{S^{\prime}}\|_{1}$ if $s$ is added to $S^{\prime}$ .

[TABLE]

With this notation in place, we may restate the objective function (31) as

[TABLE]

Our heuristic method to approximate the minimizer of (34) proceeds as follows. First, we assign one node to each $S_{i}$ , $i\in\{1,\ldots,k\}$ , using an initialization method similar to $k$ -means++ (Arthur and Vassilvitskii, 2007): we choose the $i$ -th of these nodes with probability proportional to its distance from the first $(i-1)$ of them, in hopes of choosing initial nodes with $\sigma_{s}$ vectors far apart. Next, we iteratively assign the remaining $|S|-k$ nodes to some $S_{j}$ . In particular, we assign $s$ to $S_{j}$ such that $d(s,S_{j})+\|\sigma_{S_{j}}\|_{1}$ is minimal; from (33), this can be viewed as minimizing the increase in the objective function (34) incurred by assigning $s$ to some $S_{j}$ . This heuristic method is formally defined in Algorithm 6.
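The greedy assignment described above can be sketched in Python as follows. This is an illustrative sketch and not a transcription of Algorithm 6. It assumes that $d(s,S^{\prime})$ is the increase in $\|\sigma_{S^{\prime}}\|_{1}$ when $s$ joins $S^{\prime}$ (as stated in the text), that the seeding "distance" is this same quantity measured against the seeds chosen so far, and that the first seed is drawn uniformly. The function names and the dense array input are hypothetical.

```python
import numpy as np


def added_mass(sigma_s, sigma_group):
    """d(s, S'): increase in ||sigma_{S'}||_1 if s is added to S'."""
    return np.maximum(sigma_s - sigma_group, 0.0).sum()


def greedy_partition(sigmas, k, seed=0):
    """Partition sources into k groups, greedily keeping max_i ||sigma_{S_i}||_1 small.

    sigmas: array of shape (num_sources, n); row s is the vector sigma_s.
    Returns the groups (lists of source indices) and the objective value.
    """
    rng = np.random.default_rng(seed)
    num_sources = sigmas.shape[0]

    # Seeding in the spirit of k-means++ (assumed form): each new seed is
    # drawn with probability proportional to its distance from earlier seeds.
    seeds = [int(rng.integers(num_sources))]
    chosen_max = sigmas[seeds[0]].copy()
    while len(seeds) < k:
        dist = np.array([added_mass(sigmas[s], chosen_max)
                         for s in range(num_sources)])
        dist[seeds] = 0.0
        if dist.sum() > 0:
            probs = dist / dist.sum()
        else:
            # All remaining sources are at distance zero: fall back to uniform.
            probs = np.ones(num_sources)
            probs[seeds] = 0.0
            probs /= probs.sum()
        nxt = int(rng.choice(num_sources, p=probs))
        seeds.append(nxt)
        chosen_max = np.maximum(chosen_max, sigmas[nxt])

    groups = [[s] for s in seeds]
    group_max = [sigmas[s].copy() for s in seeds]  # entrywise max per group

    # Assign each remaining source to the group minimizing
    # d(s, S_j) + ||sigma_{S_j}||_1.
    for s in range(num_sources):
        if s in seeds:
            continue
        costs = [added_mass(sigmas[s], g) + g.sum() for g in group_max]
        j = int(np.argmin(costs))
        groups[j].append(s)
        group_max[j] = np.maximum(group_max[j], sigmas[s])

    objective = max(g.sum() for g in group_max)
    return groups, objective
```

For example, if two sources share one $\sigma$ vector and two others share a second vector with disjoint support, the sketch places each pair on its own machine and the objective equals the $\ell_1$ norm of a single vector.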

We now empirically compare our approach with the baseline scheme. For this experiment, we set $S=\cup_{i=1}^{k}\tilde{S}_{i}$ , where each $\tilde{S}_{i}$ is a clustered subset of nodes constructed as in Section 6 (with $k=10$ and $|\tilde{S}_{i}|=100\ \forall\ i$ ). This yields a set of sources $S$ that is not highly clustered itself, but that contains $k$ subsets that are densely connected internally and sparsely connected to other subsets. In addition to comparing to the baseline, we also test the performance of an “oracle” scheme, which knows the clustering information of the input set $S$ . More specifically, the oracle scheme proceeds in the same manner as our scheme, except instead of using Algorithm 6 to construct the partition $\{S_{i}\}_{i=1}^{k}$ , it simply sets $S_{i}=\tilde{S}_{i}\ \forall\ i\in\{1,2,\ldots,k\}$ . Put differently, while the heuristic scheme attempts to learn an assignment of sources to machines for which each machine is assigned a clustered set of sources (in the sense that (34) is minimal), the oracle scheme knows such an assignment a priori.

Results for this experiment are shown in Fig. 10, using the set of real graphs from Section 6. Averaging across graphs, the oracle and heuristic methods are roughly 1.8 and 2.2 times faster than the baseline scheme, respectively (left). (Here total runtime is computed as maximum walk sampling time across machines for the baseline; sum of maximum forward DP time and maximum walk time for the oracle; and sum of maximum forward DP time, maximum walk time, and time to run Algorithm 6 for the heuristic.) Additionally, both methods sample approximately $\frac{1}{4}$ of the random walks sampled by the baseline scheme, across graphs (middle). Finally, the heuristic method typically produces a partition $\{S_{i}\}_{i=1}^{k}$ of $S$ with objective function value (34) similar to that produced by the oracle method (right). Interestingly, the heuristic outperforms the oracle for several datasets. This suggests that the cluster information known by the oracle does not necessarily produce an optimal assignment of sources to machines; rather, the source clustering quantity $\|\sigma_{S_{i}}\|_{1}$ identified in Section 5.1.1 may be what truly dictates performance.

Before closing, we offer several remarks on this distributed setting. First, while we focused on the scalar estimation scheme from Section 5.1.1, the framework extends to the $\sigma_{\max}$ matrix approximation scheme from Section 5.2. In particular, using the latter scheme in this setting would also involve construction of a partition so as to minimize (34), per Theorem 5.4. For this reason, we expect the performance of this scheme to be similar to Fig. 10. Second, we note that using the $\sigma_{\textrm{avg}}$ matrix approximation scheme in this setting requires a partition that minimizes a different objective function. In Appendix K, we present an algorithm to construct such a partition, as well as empirical results describing performance (in short, our scheme performs similarly to the oracle and noticeably outperforms the baseline, as in Fig. 10). Third, we find in practice that our heuristic partitioning schemes naturally balance the number of sources assigned to each machine (see Appendix K). Such balance is crucial to the performance of our scheme. This is because we require $\|\sigma_{S_{i}}\|_{1}=O(|S|/k)\ \forall\ i$ to perform as well as the baseline, which may in turn require an extreme degree of clustering if the partition is unbalanced (for example, if $|S_{i}|=O(|S|)$ for some $i$ ). It is worth noting that we also tried to partition $\{\sigma_{s}\}_{s\in S}$ using $k$ -means++ (an off-the-shelf vector partitioning algorithm), but this led to highly unbalanced assignments (and thus poor performance). Finally, we note that one limitation of our scheme is that, if $|S|,|T|=\Theta(n)$ , Algorithm 6 essentially partitions the entire graph and thus may be slower than directly estimating PPR. However, we recall from Section 3 that our focus is $|S|,|T|=o(\sqrt{m})=o(n)$ , so this is not a concern. Indeed, for the Fig. 10 experiment, Algorithm 6 accounted for only 12% of runtime (averaged across graphs).

8. Conclusions

In this work, we analyzed the relationship between PPR estimation complexity and clustering by devising estimation algorithms for many node pairs and showing the complexity of these methods scales with quantities interpretable as clustering measures. To demonstrate the utility of these findings, we considered a distributed setting for which the clustering quantities computed in situ could be leveraged to reduce computation time. We believe this setting and the algorithms we designed for it are just one example of how our findings can be used to accelerate PPR estimation; hence, an avenue for future work would be to further explore applications of our results.

Appendix A Analysis of FW-BW-MCMC and comparison to Bidirectional-PPR

Here we state and prove the guarantees that were stated informally at the end of Section 4. We include the corresponding results for Bidirectional-PPR for comparison. We first present the accuracy guarantee, Theorem A.1. The idea is to bound relative error when $\pi_{s}(t)\geq\delta$ and to bound absolute error when $\pi_{s}(t)<\delta$ . The authors of (Lofgren et al., 2016) suggest choosing $\delta=O(\frac{1}{n})$ . This choice dictates that we desire the relative bound when $t$ ’s PPR exceeds the value it would have under a uniform distribution over all nodes, which suggests that $t$ is “significant” to $s$ in this case. The proof applies the Chernoff bound to a variety of cases; this approach is similar to that of (Lofgren et al., 2016), but we must address cases that do not arise in that work.

Theorem A.1.

Fix minimum PPR threshold $\delta$ , relative error tolerance $\epsilon$ , and failure probability $p_{\textrm{fail}}$ . For FW-BW-MCMC, assume the following hold:

[TABLE]

For Bidirectional-PPR, assume the following hold:

[TABLE]

Then the estimate $\hat{\pi}_{s}(t)$ produced by either algorithm satisfies the following with probability $\geq 1-p_{\textrm{fail}}$ :

[TABLE]

Proof.

See (Lofgren et al., 2016) for Bidirectional-PPR; see Appendix B for FW-BW-MCMC. ∎

From Theorem A.1, FW-BW-MCMC offers the same accuracy as Bidirectional-PPR. However, our assumptions on $\epsilon$ and $c$ are stronger than those required for Bidirectional-PPR. The first assumption is mild, since $\frac{1}{\sqrt{2e}}\approx 0.43$ and we typically desire a tighter relative error bound. The second affects complexity and will be discussed next. Note also that our guarantee holds for any $r^{t}_{\max}\in(0,1)$ , while proving the theorem for Bidirectional-PPR requires a lower bound on $r^{t}_{\max}$ .

Next, we have a worst-case complexity result in Theorem A.2 (by worst case, we mean the result holds when the algorithm is run for any $s,t\in V$ ). The idea is to choose $r^{s}_{\max},r^{t}_{\max}$ to balance the complexity of the DP and MCMC stages of the algorithm. The result requires the additional assumption $m\delta<\log(1/p_{\textrm{fail}})/\epsilon^{2}$ , which guarantees that these $r^{s}_{\max},r^{t}_{\max}$ values lie in $(0,1)$ . Note that with $\delta=O(\frac{1}{n})$ , this implies $m=O(n)$ , i.e. nodes have constant degrees as $n$ grows.

Theorem A.2.

Fix minimum PPR threshold $\delta$ , relative error tolerance $\epsilon$ , and failure probability $p_{\textrm{fail}}$ . Assume (35)-(36) hold and $m\delta<\log(1/p_{\textrm{fail}})/\epsilon^{2}$ . Then setting $r^{s}_{\max}=r^{t}_{\max}=\frac{m^{1/3}\delta^{1/3}\epsilon^{7/9}}{(\log(1/p_{\textrm{fail}}))^{1/3}}$ in FW-BW-MCMC yields minimal complexity $O\left(\frac{m^{2/3}(\log(1/p_{\textrm{fail}}))^{1/3}}{\alpha\epsilon^{7/9}\delta^{1/3}}\right)$ . Furthermore, setting $r^{t}_{\max}=\frac{\sqrt{m\delta}\epsilon}{\sqrt{\log(1/p_{\textrm{fail}})}}$ in Bidirectional-PPR yields minimal complexity $O\left(\frac{\sqrt{m\log(1/p_{\textrm{fail}})}}{\alpha\epsilon\sqrt{\delta}}\right)$ .

Proof.

See Appendix C. ∎

Note that, with $\delta=O(\frac{1}{n})$ , so that $m=O(n)$ , both algorithms have complexity linear in $n$ , while FW-BW-MCMC has strictly better dependence on the parameters $p_{\textrm{fail}}$ and $\epsilon$ .
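To make the comparison in Theorem A.2 concrete, the following Python sketch evaluates the stated parameter choices and complexity expressions for one set of inputs. The big-O expressions are evaluated with all hidden constants set to one, which is an assumption of this sketch, so only the ratio between the two methods is meaningful. The function names and the numerical inputs are illustrative.

```python
import math


def fw_bw_mcmc_setting(m, delta, eps, p_fail, alpha):
    """r_max choice and complexity expression for FW-BW-MCMC (Theorem A.2)."""
    log_term = math.log(1 / p_fail)
    r_max = (m * delta) ** (1 / 3) * eps ** (7 / 9) / log_term ** (1 / 3)
    cost = (m ** (2 / 3) * log_term ** (1 / 3)
            / (alpha * eps ** (7 / 9) * delta ** (1 / 3)))
    return r_max, cost


def bidirectional_ppr_setting(m, delta, eps, p_fail, alpha):
    """r_max choice and complexity expression for Bidirectional-PPR (Theorem A.2)."""
    log_term = math.log(1 / p_fail)
    r_max = math.sqrt(m * delta) * eps / math.sqrt(log_term)
    cost = math.sqrt(m * log_term) / (alpha * eps * math.sqrt(delta))
    return r_max, cost


# Illustrative inputs (not from the paper's experiments).
n, m = 10 ** 6, 10 ** 7
delta, eps, p_fail, alpha = 1 / n, 0.1, 0.01, 0.2

# The theorem's extra assumption, which keeps both r_max values in (0, 1).
assert m * delta < math.log(1 / p_fail) / eps ** 2

r_fw, cost_fw = fw_bw_mcmc_setting(m, delta, eps, p_fail, alpha)
r_bi, cost_bi = bidirectional_ppr_setting(m, delta, eps, p_fail, alpha)
ratio = cost_fw / cost_bi
```

Dividing the two expressions gives a ratio of $(m\delta/\log(1/p_{\textrm{fail}}))^{1/6}\epsilon^{2/9}$ , which is below one whenever the assumption $m\delta<\log(1/p_{\textrm{fail}})/\epsilon^{2}$ holds.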

Finally, we present an average-case complexity result for FW-BW-MCMC-Practical (Algorithm 8), which changes the termination criterion in the forward DP to $\|D^{-1}r^{s}\|_{\infty}\leq r_{\max}^{s}$ .

Theorem A.3.

For any $s\in V$ and for $t\sim V$ uniformly, FW-BW-MCMC-Practical produces an estimate satisfying the accuracy guarantee of Theorem A.1 and has complexity $O\left(\frac{\sqrt{m\log(1/p_{\textrm{fail}})}}{\sqrt{n\delta}\alpha\epsilon^{7/6}}\right)$ .

Proof.

See Appendix D. ∎

With $\delta=O(\frac{1}{n})$ , this establishes the $O(\sqrt{m})$ average case complexity claimed at the end of Section 4. The guarantee for Bidirectional-PPR in (Lofgren et al., 2016) has $\epsilon$ instead of $\epsilon^{7/6}$ but is otherwise identical.

Appendix B Proof of Theorem A.1

We will use the following result from (Dubhashi and Panconesi, 2009).

Theorem B.1.

(from Theorem 1.1 in (Dubhashi and Panconesi, 2009)) Let $\{Z_{i}\}$ be a set of independent random variables with $Z_{i}\in[0,1]\ \forall\ i$ , and let $Z=\sum_{i}Z_{i}$ . Then for any $\eta\in(0,1)$ and any $d>2e\mathbb{E}[Z]$ ,

[TABLE]

To begin the proof, we define $Y_{i}=X_{i}/r_{\max}^{t}$ and $Y=\sum_{i=1}^{w}Y_{i}$ , where $X_{i}$ is from Algorithm 3. Observe the $Y_{i}$ ’s are independent and $Y_{i}\in[0,1]$ (by the terminating condition of Algorithm 2), so Theorem B.1 applies for appropriate choices of $\eta$ and $d$ . We also observe that (40) holds, which follows by linearity and $w=\frac{cr^{s}_{\max}r^{t}_{\max}}{\delta}$ in the statement of the theorem.

[TABLE]

We now turn to the case $\pi_{s}(t)\geq\delta$ , for which we aim to show $\mathbb{P}[|\hat{\pi}_{s}(t)-\pi_{s}(t)|>\epsilon\pi_{s}(t)]<p_{\textrm{fail}}\ \forall\ \epsilon\in(0,\frac{1}{\sqrt{2e}})$ . We will examine three sub-cases. The first two sub-cases depend on the constant $k:=(\frac{\epsilon}{2e})^{1/3}$ (we motivate the choice of this constant at the conclusion of the proof). We also observe the following, which follows from the assumption $c>\frac{3(2e)^{1/3}\log(2/p_{\textrm{fail}})}{\epsilon^{7/3}}$ :

[TABLE]

For the first sub-case, assume $\mathbb{E}[Y]\geq kc$ . Then we have the following:

[TABLE]

Here the first inequality holds by definition of $\hat{\pi}_{s}(t)$ in Algorithm 3 and the invariant (7); the equality holds by (40) and the definition of $Y$ ; the second inequality uses Theorem B.1 (note $\epsilon<\frac{1}{\sqrt{2e}}<1$ , so (38) applies); and the final two inequalities hold by $\mathbb{E}[Y]\geq kc$ and (41), respectively.

For the second sub-case, assume $\mathbb{E}[Y]\in[\frac{\epsilon c}{2e},kc)$ . First, observe that by (40), the assumption $\mathbb{E}[Y]<kc$ , and the Algorithm 1 terminating condition,

[TABLE]

and so $\pi_{s}(t)\geq\|r^{s}\|_{1}\mathbb{E}[X_{i}]+(1-k)\delta$ (else, $\pi_{s}(t)<\delta$ by (7), a contradiction). We then write:

[TABLE]

Here the first inequality and first equality follow by arguments similar to those in Case 1; the second inequality is by the Algorithm 1 terminating condition and $w=\frac{cr^{s}_{\max}r^{t}_{\max}}{\delta}$ ; the second equality simply multiplies and divides by $k$ ; the third inequality holds by assumption $\mathbb{E}[Y]\in[\frac{\epsilon c}{2e},kc)$ ; the fourth inequality holds by Theorem B.1 (note $\frac{\epsilon}{k}=\epsilon^{2/3}(2e)^{1/3}<1$ by assumption $\epsilon<\frac{1}{\sqrt{2e}}$ , so (38) applies); the fifth inequality follows from $\mathbb{E}[Y]\in[\frac{\epsilon c}{2e},kc)$ ; and the final inequality holds by (41). Note we have assumed $1-k>0$ in the third and fifth inequalities; this follows from $\epsilon<\frac{1}{\sqrt{2e}}$ .

For the third and final sub-case, assume $\mathbb{E}[Y]<\frac{\epsilon c}{2e}$ . We have the following:

[TABLE]

Here the first three equalities and first inequality follow by arguments similar to those in previous cases; the penultimate inequality holds since $\{|Y-\mathbb{E}[Y]|>\epsilon c\}\subset\{Y>\epsilon c\}$ when $Y\geq\mathbb{E}[Y]$ , whereas $\{|Y-\mathbb{E}[Y]|>\epsilon c\}\subset\{\mathbb{E}[Y]>\epsilon c\}\subset\{2e\mathbb{E}[Y]>\epsilon c\}=\emptyset$ when $Y<\mathbb{E}[Y]$ ; and the final inequality holds by Theorem B.1 (note $\epsilon c>2e\mathbb{E}[Y]$ by assumption, so (39) applies). Next, we observe

[TABLE]

where the first two inequalities hold by $c>\frac{3(2e)^{1/3}\log(2/p_{\textrm{fail}})}{\epsilon^{7/3}}$ and $\epsilon<\frac{1}{\sqrt{2e}}$ , and the final inequality holds since $e<4\Rightarrow\log_{2}(e)<2\Rightarrow\frac{6e}{\log_{2}(e)}>\frac{3e}{2}>1$ . Combining (50) and (51) completes Case 3.

Finally, we note the bounds in Cases 1 and 2 (the two sub-cases that depend on $k$ ) grow with decreasing and increasing $k$ , respectively. Hence, our choice of $k=(\frac{\epsilon}{2e})^{1/3}$ comes from equating the two to minimize failure probability.
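The constants appearing in this part of the proof are easy to check numerically. The Python sketch below computes $k$ , the ratio $\epsilon/k$ , and the lower bound on $c$ for given $\epsilon$ and $p_{\textrm{fail}}$ , and converts a choice of $c$ into a walk count via $w=\frac{cr^{s}_{\max}r^{t}_{\max}}{\delta}$ . Rounding $w$ up to an integer is an assumption of the sketch, and the function names are hypothetical.

```python
import math


def proof_constants(eps, p_fail):
    """Return (k, eps / k, lower bound on c) as used in the proof of Theorem A.1."""
    if not 0 < eps < 1 / math.sqrt(2 * math.e):
        raise ValueError("the proof assumes 0 < eps < 1/sqrt(2e)")
    k = (eps / (2 * math.e)) ** (1 / 3)
    c_lower = 3 * (2 * math.e) ** (1 / 3) * math.log(2 / p_fail) / eps ** (7 / 3)
    return k, eps / k, c_lower


def walk_count(c, r_s_max, r_t_max, delta):
    """w = c * r_s_max * r_t_max / delta, rounded up (rounding is an assumption)."""
    return math.ceil(c * r_s_max * r_t_max / delta)


k, eps_over_k, c_lower = proof_constants(eps=0.1, p_fail=0.01)
```

For any admissible $\epsilon$ , the returned values satisfy $0<k<1$ and $\epsilon/k=\epsilon^{2/3}(2e)^{1/3}<1$ , which are the two facts the second sub-case relies on.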

We now turn to the case $\pi_{s}(t)<\delta$ . Observe that by $\pi_{s}(t)<\delta$ and the invariant (7), $\|r^{s}\|_{1}\mathbb{E}[X_{i}]<\delta$ . By (40), this implies $2e\mathbb{E}[Y]<\frac{2ew\delta}{r^{t}_{\max}\|r^{s}\|_{1}}=:b$ . Then

[TABLE]

Here the equalities follow similar steps as previous cases, the first inequality holds by the same argument in the Case 3 analysis, and the final inequality holds by Theorem B.1 (note (39) applies since $b>2e\mathbb{E}[Y]$ ). We also observe

[TABLE]

where the first inequality is by the Algorithm 1 terminating condition, the second inequality holds since $2e>1>\epsilon$ , and the third inequality follows from (51); the equalities are by definition. Finally, we combine (53) and (54) to complete the proof.

Appendix C Proof of Theorem A.2

The complexity of Algorithm 3 is the total complexity of Algorithm 2, Algorithm 1, and the random walks. Below, we show Algorithms 2 and 1 have complexity $\frac{m}{\alpha r^{t}_{\max}}$ and $\frac{m}{\alpha r^{s}_{\max}}$ , respectively (using arguments from (Andersen et al., 2008) and (Andersen et al., 2006)). Furthermore, the complexity of the random walk stage is $O(\frac{r^{s}_{\max}r^{t}_{\max}\log(1/p_{\textrm{fail}})}{\alpha\delta\epsilon^{7/3}})$ , where $\frac{1}{\alpha}$ is the expected complexity of sampling a single random walk, and where the remaining factors give the number of walks required (recall in the statement of the theorem we assume (35) holds). Hence, the complexity of Algorithm 3 is $O(C(r^{s}_{\max},r^{t}_{\max})/\alpha)$ , where

[TABLE]

We now aim to choose $r^{s}_{\max},r^{t}_{\max}$ to minimize $O(C(r^{s}_{\max},r^{t}_{\max})/\alpha)$ , or equivalently, to minimize $C(r^{s}_{\max},r^{t}_{\max})$ . For this, we let $K=\frac{\log(1/p_{\textrm{fail}})}{\delta\epsilon^{7/3}}>0$ and note $\tfrac{\partial C}{\partial r^{s}_{\max}}=Kr^{t}_{\max}-\tfrac{m}{(r^{s}_{\max})^{2}}=0$ if and only if $(r^{s}_{\max})^{2}r^{t}_{\max}=\tfrac{m}{K}$ , and similarly, $\frac{\partial C}{\partial r^{t}_{\max}}=0$ if and only if $(r^{t}_{\max})^{2}r^{s}_{\max}=\frac{m}{K}$ ; hence, $((\frac{m}{K})^{1/3},(\frac{m}{K})^{1/3})$ is a stationary point of $C(r^{s}_{\max},r^{t}_{\max})$ . To verify this is a minimizer, we observe

[TABLE]

from which it follows that the Hessian of $C$ evaluated at $r^{s}_{\max}=r^{t}_{\max}=(\frac{m}{K})^{1/3}$ is $K(I+11^{\mathsf{T}})$ . This is positive definite, since for any vector $z\neq 0$ ,

[TABLE]

To summarize, we have shown $r^{s}_{\max}=r^{t}_{\max}=(\frac{m}{K})^{1/3}$ minimizes $C(r^{s}_{\max},r^{t}_{\max})$ and hence minimizes the complexity of Algorithm 3. This establishes that the choice of $r^{s}_{\max},r^{t}_{\max}$ in the statement of the theorem minimizes complexity. Finally, substituting $r^{s}_{\max}=r^{t}_{\max}=(\frac{m}{K})^{1/3}$ into (55) and dividing by $\alpha$ gives the complexity expression given in the theorem. Following the same approach establishes the Bidirectional-PPR complexity bound given in the theorem.
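The stationary-point argument can be checked numerically. The Python sketch below assumes $C(r^{s}_{\max},r^{t}_{\max})=Kr^{s}_{\max}r^{t}_{\max}+\frac{m}{r^{s}_{\max}}+\frac{m}{r^{t}_{\max}}$ , which is the form implied by the partial derivatives quoted in this appendix (the display defining $C$ is not reproduced here, so this form is an inference). It evaluates the gradient and Hessian at $r^{s}_{\max}=r^{t}_{\max}=(\frac{m}{K})^{1/3}$ for small illustrative values of $m$ and $K$ .

```python
import numpy as np


def cost(r_s, r_t, m, K):
    """Assumed form of C: walk-stage term plus the two DP terms."""
    return K * r_s * r_t + m / r_s + m / r_t


def gradient(r_s, r_t, m, K):
    return np.array([K * r_t - m / r_s ** 2,
                     K * r_s - m / r_t ** 2])


def hessian(r_s, r_t, m, K):
    return np.array([[2 * m / r_s ** 3, K],
                     [K, 2 * m / r_t ** 3]])


# Illustrative values chosen so that (m / K)^(1/3) = 0.5 lies in (0, 1).
m_val, K_val = 1.0, 8.0
r_star = (m_val / K_val) ** (1 / 3)

grad_at_star = gradient(r_star, r_star, m_val, K_val)
hess_at_star = hessian(r_star, r_star, m_val, K_val)
eigenvalues = np.linalg.eigvalsh(hess_at_star)
min_cost = cost(r_star, r_star, m_val, K_val)
```

Under this assumed form, the gradient vanishes at the stationary point, the Hessian equals $K(I+11^{\mathsf{T}})$ with eigenvalues $K$ and $3K$ , and the minimal value is $3m^{2/3}K^{1/3}$ , in agreement with the complexity expression of Theorem A.2 after dividing by $\alpha$ .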

We return to bound the complexities of Algorithms 2 and 1. For Algorithm 2, we use an argument from (Andersen et al., 2008). First, let $v\in V$ . From Algorithm 2, $p^{t}(v)$ increases by at least $\alpha r^{t}_{\max}$ at each iteration for which $v^{*}=v$ . By the invariant (6), $p^{t}(v)\leq\pi_{v}(t)$ . Taken together, $v^{*}=v$ for at most $\frac{\pi_{v}(t)}{\alpha r^{t}_{\max}}$ iterations. Furthermore, the complexity of each iteration for which $v^{*}=v$ is $d_{\textrm{in}}(v)$ . Hence, the complexity of all iterations for which $v^{*}=v$ is bounded by $d_{\textrm{in}}(v)\frac{\pi_{v}(t)}{\alpha r^{t}_{\max}}$ . Finally, the complexity of Algorithm 2 can be bounded by summing over all $v\in V$ , i.e. $\sum_{v\in V}d_{\textrm{in}}(v)\frac{\pi_{v}(t)}{\alpha r^{t}_{\max}}\leq\frac{1}{\alpha r^{t}_{\max}}\sum_{v\in V}d_{\textrm{in}}(v)=\frac{m}{\alpha r^{t}_{\max}}$ .
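For readers who want to see the counting argument in action, the following Python sketch implements a backward push of the kind analyzed above. It is written from the standard description of backward DP in the cited literature and is not a transcription of Algorithm 2. In particular, the dense transition matrix `P`, the choice of $v^{*}$ as the node with the largest residual, and the name `backward_push` are assumptions of this sketch.

```python
import numpy as np


def exact_ppr(P, alpha):
    """Exact PPR matrix: row s is pi_s, i.e. alpha * (I - (1 - alpha) P)^{-1}."""
    n = P.shape[0]
    return alpha * np.linalg.inv(np.eye(n) - (1 - alpha) * P)


def backward_push(P, t, alpha, r_max):
    """Backward DP sketch for target t on a row-stochastic matrix P.

    Returns (p, r, iterations) with max(r) <= r_max at termination.
    """
    n = P.shape[0]
    p = np.zeros(n)
    r = np.zeros(n)
    r[t] = 1.0
    iterations = 0
    while r.max() > r_max:
        v = int(np.argmax(r))          # node v* with the largest residual
        r_v = r[v]
        p[v] += alpha * r_v            # estimate at v grows by alpha * r(v)
        r[v] = 0.0
        # Push the remaining mass to the in-neighbors u of v,
        # weighted by the transition probability P[u, v].
        r += (1 - alpha) * r_v * P[:, v]
        iterations += 1
    return p, r, iterations


# Toy graph: 0 -> {1, 2}, 1 -> 2, 2 -> {0, 3}, 3 -> 0.
P_toy = np.array([[0.0, 0.5, 0.5, 0.0],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.5, 0.0, 0.0, 0.5],
                  [1.0, 0.0, 0.0, 0.0]])
alpha_toy, t_toy, r_max_toy = 0.2, 2, 0.01
p_t, r_t, num_iter = backward_push(P_toy, t_toy, alpha_toy, r_max_toy)
Pi_toy = exact_ppr(P_toy, alpha_toy)
```

In this sketch, each push at $v$ raises the estimate at $v$ by more than $\alpha r^{t}_{\max}$ while the estimate never exceeds the true PPR value, which is exactly the mechanism behind the iteration bound derived above.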

We next turn to Algorithm 1. As mentioned in the main text, Algorithm 1 changes the termination criterion from that of the algorithm originally defined in (Andersen et al., 2006); for clarity, we include the original definition in Algorithm 7. Here we use tilde marks to distinguish quantities from those in Algorithm 1, and we explicitly indicate iteration number $k$ to improve clarity of the arguments to follow. Besides these notational changes, the only difference between Algorithms 1 and 7 is the termination criterion.

With this notation in place, the complexity of Algorithm 7 can be bounded as follows (using arguments from (Andersen et al., 2006)). First, observe that for any iteration $k$ ,

[TABLE]

where the first equality holds via the iterative update in Algorithm 7. Next, let $k^{*}$ be the iteration at which Algorithm 7 terminates. Then the complexity of the algorithm is $\sum_{k=1}^{k^{*}}d_{\textrm{out}}(v_{k})$ , and

[TABLE]

where the first inequality holds since $\tilde{r}_{\max}^{s}<\|D^{-1}\tilde{r}^{s}_{k}\|_{\infty}=\frac{\tilde{r}^{s}_{k-1}(v_{k})}{d_{\textrm{out}}(v_{k})}$ for $k\leq k^{*}$ (i.e. for each $k$ until the algorithm terminates), the second equality holds by the previous display, and the final inequality holds since $\|\tilde{r}^{s}_{0}\|_{1}=\|e_{s}\|_{1}=1$ and $\|\tilde{r}^{s}_{k^{*}}\|_{1}\geq 0$ (the remaining steps are straightforward).

Using this, we can bound the complexity of Algorithm 1. First, observe that in Algorithm 1,

[TABLE]

and so to guarantee termination of Algorithm 1 (i.e. to ensure $\|r^{s}\|_{1}\leq r^{s}_{\max}$ ), it suffices to guarantee $\|D^{-1}r^{s}\|_{\infty}\leq\frac{r^{s}_{\max}}{m}$ . But from the analysis of Algorithm 7, the complexity required to ensure $\|D^{-1}r^{s}\|_{\infty}\leq\frac{r^{s}_{\max}}{m}$ is $\frac{m}{\alpha r^{s}_{\max}}$ ; hence, the complexity of Algorithm 1 is bounded by $\frac{m}{\alpha r^{s}_{\max}}$ as well.
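The work bound for the forward DP can likewise be illustrated with a small Python sketch. It follows the standard forward push with the $\|D^{-1}r^{s}\|_{\infty}$ termination rule and is not a transcription of Algorithm 1 or Algorithm 7. The dense matrix `P`, the selection of the node maximizing $r^{s}(v)/d_{\textrm{out}}(v)$ , and the name `forward_push` are assumptions of this sketch.

```python
import numpy as np


def exact_ppr(P, alpha):
    """Exact PPR matrix: row s is pi_s, i.e. alpha * (I - (1 - alpha) P)^{-1}."""
    n = P.shape[0]
    return alpha * np.linalg.inv(np.eye(n) - (1 - alpha) * P)


def forward_push(P, out_deg, s, alpha, r_max):
    """Forward DP sketch for source s; stops once max_v r(v)/d_out(v) <= r_max.

    Returns (p, r, work), where work sums d_out(v) over the pushed nodes.
    """
    n = P.shape[0]
    p = np.zeros(n)
    r = np.zeros(n)
    r[s] = 1.0
    work = 0
    while (r / out_deg).max() > r_max:
        v = int(np.argmax(r / out_deg))
        r_v = r[v]
        p[v] += alpha * r_v
        r[v] = 0.0
        # Spread the remaining mass over the out-neighbors of v, so that
        # ||r||_1 decreases by exactly alpha * r(v).
        r += (1 - alpha) * r_v * P[v, :]
        work += int(out_deg[v])
    return p, r, work


# Toy graph: 0 -> {1, 2}, 1 -> 2, 2 -> {0, 3}, 3 -> 0.
P_fw = np.array([[0.0, 0.5, 0.5, 0.0],
                 [0.0, 0.0, 1.0, 0.0],
                 [0.5, 0.0, 0.0, 0.5],
                 [1.0, 0.0, 0.0, 0.0]])
deg_fw = np.array([2, 1, 2, 1])
alpha_fw, s_fw, r_max_fw = 0.2, 0, 0.01
p_s, r_s, work_fw = forward_push(P_fw, deg_fw, s_fw, alpha_fw, r_max_fw)
Pi_fw = exact_ppr(P_fw, alpha_fw)
```

The recorded `work` stays below $\frac{1}{\alpha r_{\max}}$ for the threshold passed in, which is the bound obtained from the analysis of Algorithm 7. Running the sketch with threshold $\frac{r^{s}_{\max}}{m}$ then yields the $\frac{m}{\alpha r^{s}_{\max}}$ bound for Algorithm 1.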

Appendix D Practical version of FW-BW-MCMC

In this appendix, we define and analyze a modified version of FW-BW-MCMC that is more useful in practice. Before proceeding to the formal definition and analysis, we first motivate the practical algorithm. First, suppose for an instance of FW-BW-MCMC we have already run the backward DP (Algorithm 2) and we are currently running the forward DP (Algorithm 1). Though FW-BW-MCMC dictates we run the forward DP until $\|r^{s}\|_{1}<r^{s}_{\max}$ for some predefined $r^{s}_{\max}$ , we could instead terminate the forward DP (even if $\|r^{s}\|_{1}>r^{s}_{\max}$ ) and proceed to the random walks. In other words, we dynamically change $r^{s}_{\max}$ from the predefined value to the current value of $\|r^{s}\|_{1}$ . Then, if the number of walks sampled is $w=c\|r^{s}\|_{1}r^{t}_{\max}/\delta$ , where

[TABLE]

the proof of Theorem A.1 goes through, i.e. the accuracy guarantee is achieved. Furthermore, this argument holds at any iteration of the forward DP. In other words, we can terminate the forward DP at any iteration and achieve the accuracy guarantee, as long as we scale $w$ with the $\|r^{s}\|_{1}$ value obtained at termination. From this observation, we aim to terminate the forward DP at the “optimal” iteration, i.e. the iteration for which the overall complexity of the algorithm is minimized.

Towards determining this optimal iteration, let $C_{FDP}$ denote the complexity of the forward DP until the current iteration, and define $C_{MCMC}=\frac{3(2e)^{1/3}r^{t}_{\max}\log(2/p_{\textrm{fail}})}{\alpha\delta\epsilon^{7/3}}$ , so that $\|r^{s}\|_{1}C_{MCMC}$ gives the complexity of the MCMC stage (since $c\|r^{s}\|_{1}r^{t}_{\max}/\delta$ walks are sampled, each in expected time $\frac{1}{\alpha}$ , with $c$ satisfying (62)). Then, if we terminate the forward DP at the current iteration, the combined complexity of forward DP and MCMC stages will be $C_{FDP}+\|r^{s}\|_{1}C_{MCMC}$ . Suppose instead that we decide to run one more iteration, i.e. to terminate the forward DP at the next iteration. Then, by Algorithm 1, the next iteration will have complexity $d_{\textrm{out}}(v^{*})$ . Furthermore, by (58) in Appendix C, $\|r^{s}\|_{1}$ will decrease by $\alpha r^{s}(v^{*})$ at the next iteration. Hence, if we run one more iteration, the combined complexity of forward DP and MCMC will be $\left(C_{FDP}+d_{\textrm{out}}(v^{*})\right)+\left(\|r^{s}\|_{1}-\alpha r^{s}(v^{*})\right)C_{MCMC}$ . Now clearly, we should terminate the forward DP if and only if the resulting complexity is less than the complexity resulting from running another iteration. Hence, from the previous argument, we should terminate if and only if

[TABLE]

In other words, to optimize the tradeoff between forward DP and MCMC complexity, we should run the forward DP until $\|D^{-1}r^{s}\|_{\infty}$ falls below the threshold in (63). This motivates the practical version of FW-BW-MCMC, given in Algorithm 8. Algorithm 8 changes two aspects of FW-BW-MCMC. First, it replaces Algorithm 1 with Algorithm 7 (which uses $\|D^{-1}\tilde{r}^{s}\|_{\infty}$ termination, as suggested by (63)). Second, it scales the number of random walks sampled with $\|\tilde{r}^{s}\|_{1}$ , as discussed above.
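The stopping rule motivated above amounts to comparing two cost estimates at each iteration. The Python sketch below writes that comparison out directly from the two combined-complexity expressions in the text; the function and argument names are hypothetical, and the sketch is not a transcription of Algorithm 8.

```python
import math


def mcmc_unit_cost(r_t_max, delta, eps, p_fail, alpha):
    """C_MCMC: MCMC-stage complexity per unit of ||r^s||_1."""
    return (3 * (2 * math.e) ** (1 / 3) * r_t_max * math.log(2 / p_fail)
            / (alpha * delta * eps ** (7 / 3)))


def should_stop_forward_dp(c_fdp, r_l1, r_vstar, d_out_vstar, alpha, c_mcmc):
    """Return True if stopping now is cheaper than running one more iteration.

    c_fdp: forward DP complexity so far; r_l1: current ||r^s||_1;
    r_vstar, d_out_vstar: residual and out-degree of the next node v*.
    """
    cost_if_stop = c_fdp + r_l1 * c_mcmc
    cost_if_continue = (c_fdp + d_out_vstar) + (r_l1 - alpha * r_vstar) * c_mcmc
    return cost_if_stop < cost_if_continue


# Illustrative numbers: with alpha = 0.2 and C_MCMC = 100, one more push pays
# off only when r(v*) / d_out(v*) is at least 1 / (alpha * C_MCMC) = 0.05.
stop_large_residual = should_stop_forward_dp(10.0, 0.5, 0.2, 2, 0.2, 100.0)
stop_small_residual = should_stop_forward_dp(10.0, 0.5, 0.02, 2, 0.2, 100.0)
```

Rearranging the comparison shows that it depends only on $r^{s}(v^{*})/d_{\textrm{out}}(v^{*})$ relative to $\frac{1}{\alpha C_{MCMC}}$ , which is why a $\|D^{-1}r^{s}\|_{\infty}$ termination rule is the natural one here.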

We can now establish accuracy and average-case complexity guarantees for Algorithm 8.

Theorem D.1.

Fix minimum PPR threshold $\delta$ , relative error tolerance $\epsilon$ , failure probability $p_{\textrm{fail}}$ . Let

[TABLE]

Then the estimate $\hat{\pi}_{s}(t)$ produced by Algorithm 8 satisfies (37) with probability $\geq 1-p_{\textrm{fail}}$ .

Proof.

As discussed above, the proof of Theorem A.1 goes through to establish this result. ∎

Theorem D.2.

Fix minimum PPR threshold $\delta$ , relative error tolerance $\epsilon$ , and failure probability $p_{\textrm{fail}}$ . Assume (64) holds. Then for any $s\in V$ and for $t\sim V$ uniformly, setting $\tilde{r}^{s}_{\max}=\frac{\delta\epsilon^{7/3}}{r^{t}_{\max}\log(1/p_{\textrm{fail}})}$ , $r^{t}_{\max}=\frac{\sqrt{m\delta}\epsilon^{7/6}}{\sqrt{n\log(1/p_{\textrm{fail}})}}$ in Algorithm 8 yields complexity $O\left(\frac{\sqrt{m\log(1/p_{\textrm{fail}})}}{\sqrt{n\delta}\alpha\epsilon^{7/6}}\right)$ .

Proof.

First, for the complexity of the backward DP (Algorithm 2), we use the result from (Lofgren and Goel, 2013), which we include for completeness. Recall from Appendix C that the complexity of Algorithm 2 for any $t\in V$ is bounded by $\textstyle{\sum}_{v\in V}d_{\textrm{in}}(v)\tfrac{\pi_{v}(t)}{\alpha r^{t}_{\max}}$ . Hence, for $t\sim V$ uniformly, the expected complexity is

[TABLE]

since $\sum_{t\in V}\pi_{v}(t)=1$ by definition. Next, we consider the complexity of the forward DP (Algorithm 7). From Appendix C, for any $s\in V$ we have complexity $\frac{1}{\alpha\tilde{r}^{s}_{\max}}=\frac{r^{t}_{\max}\log(1/p_{\textrm{fail}})}{\alpha\delta\epsilon^{7/3}}$ . Finally, for the MCMC stage, we sample $w\|\tilde{r}^{s}\|_{1}\leq w$ walks, where $w=cr^{t}_{\max}/\delta$ with $c$ satisfying (64). Each walk is sampled in average time $\frac{1}{\alpha}$ . Therefore, the MCMC stage complexity is $O(\tfrac{r^{t}_{\max}\log(1/p_{\textrm{fail}})}{\alpha\delta\epsilon^{7/3}})$ . Thus, the overall complexity of Algorithm 8 is bounded by

[TABLE]

Substituting $r^{t}_{\max}$ given in the statement of the theorem yields the desired complexity bound. Further, viewing (66) as a function of $r^{t}_{\max}$ , it is straightforward to verify this $r^{t}_{\max}$ is the global minimizer. ∎
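As a sanity check on the parameter choice in Theorem D.2, the Python sketch below evaluates the average-case backward DP cost $\frac{m}{n\alpha r^{t}_{\max}}$ and the forward DP cost $\frac{1}{\alpha\tilde{r}^{s}_{\max}}$ at the stated values and compares both with the claimed bound. Hidden constants are set to one, which is an assumption of this sketch, and the function names and inputs are illustrative.

```python
import math


def theorem_d2_parameters(n, m, delta, eps, p_fail):
    """Return (r_t_max, r_s_max_tilde) as specified in Theorem D.2."""
    log_term = math.log(1 / p_fail)
    r_t_max = math.sqrt(m * delta) * eps ** (7 / 6) / math.sqrt(n * log_term)
    r_s_max_tilde = delta * eps ** (7 / 3) / (r_t_max * log_term)
    return r_t_max, r_s_max_tilde


def stage_costs(n, m, delta, eps, p_fail, alpha):
    """Backward DP (average over uniform t) and forward DP cost expressions."""
    r_t_max, r_s_max_tilde = theorem_d2_parameters(n, m, delta, eps, p_fail)
    backward = m / (n * alpha * r_t_max)
    forward = 1 / (alpha * r_s_max_tilde)
    return backward, forward


def claimed_bound(n, m, delta, eps, p_fail, alpha):
    log_term = math.log(1 / p_fail)
    return math.sqrt(m * log_term) / (math.sqrt(n * delta) * alpha * eps ** (7 / 6))


# Illustrative inputs (not from the paper's experiments).
args = dict(n=10 ** 6, m=10 ** 7, delta=1e-6, eps=0.1, p_fail=0.01, alpha=0.2)
backward_cost, forward_cost = stage_costs(**args)
bound = claimed_bound(**args)
```

At the stated $r^{t}_{\max}$ , the backward and forward terms coincide and each equals the bound in the theorem, reflecting the usual balancing of stage costs.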

Appendix E Proof of Theorem 5.1

We first observe

[TABLE]

where the second inequality uses the fact that $X^{(w)}_{s}(v)\sim\textrm{Binomial}(w,\sigma_{s}(v))$ (hence, $X^{(w)}_{s}(v)=0$ when $\sigma_{s}(v)=0$ ). Again using this fact, we have by (38) from Theorem B.1 in Appendix B,

[TABLE]

Combining (68) and (69), we obtain

[TABLE]

where the final inequality holds by the bound on $w$ in the statement of the theorem. For the lower tail, following the same steps used to obtain (72) gives

[TABLE]

Finally, by the union bound, (72) and (73) together establish the theorem.

Appendix F Proof of Proposition 5.3

First, assume Merge is used at each iteration for which $v^{*}=t_{1}$ . By Algorithm 2, $\|p^{t_{2}}\|_{1}$ increases by at least $\alpha r^{t}_{\max}$ at each iteration for which $v^{*}\neq t_{1}$ . By (18), $\|p^{t_{2}}\|_{1}$ increases by at least $\|p^{t_{1}}\|_{1}r^{t}_{\max}$ at each iteration for which $v^{*}=t_{1}$ . Let us define $I_{1}$ as the number of iterations for which $v^{*}\neq t_{1}$ , $I_{2}$ as the number of iterations for which $v^{*}=t_{1}$ , and $I=I_{1}+I_{2}$ as the total number of iterations. Since $\|p^{t_{2}}\|_{1}=0$ at the start of Algorithm 2 and $\|p^{t_{2}}\|_{1}\leq n\pi(t_{2})$ by the invariant (6), we have

[TABLE]

Now at termination of Algorithm 2, $\|r^{t_{2}}\|_{\infty}\leq r^{t}_{\max}$ , so by the invariant (6), $\pi_{t_{1}}(t_{2})\leq p^{t_{2}}(t_{1})+r^{t}_{\max}$ at termination. Therefore, if $\pi_{t_{1}}(t_{2})>r^{t}_{\max}$ , $p^{t_{2}}(t_{1})>0$ at termination, which can only occur if $v^{*}=t_{1}$ at some iteration. Hence, $\pi_{t_{1}}(t_{2})>r^{t}_{\max}\Rightarrow I_{2}\geq 1$ . Finally, from Algorithm 2, $\|p^{t_{1}}\|_{1}\geq\alpha$ . Substituting into (74) gives $I\leq\tfrac{n\pi(t_{2})}{\alpha r^{t}_{\max}}-\tfrac{(\|p^{t_{1}}\|_{1}-\alpha)}{\alpha}$ .

If instead Merge is not used, $\|p^{t_{2}}\|_{1}$ increases by at least $\alpha r^{t}_{\max}$ at every iteration. Hence, the same argument as above establishes that the total number of iterations is bounded by $\frac{n\pi(t_{2})}{\alpha r^{t}_{\max}}$ .

Appendix G Proof of Theorem 5.4

We will use Corollary 6.2.1 from (Tropp, 2015), which (applied to our setting) states the following. Assume $\{X_{i}\}_{i=1}^{w}$ are independent random matrices satisfying $\mathbb{E}[X_{i}]=R_{S}^{\mathsf{T}}\Pi R_{T}$ . Let $M$ be s.t. $\|X_{i}\|_{2}\leq M$ a.s., and let $m_{2}(X_{i})=\max\{\|\mathbb{E}[X_{i}X_{i}^{\mathsf{T}}]\|_{2},\|\mathbb{E}[X_{i}^{\mathsf{T}}X_{i}]\|_{2}\}$ . Then $\forall\ \eta>0$ ,

[TABLE]

We have verified the independence and $\mathbb{E}[X_{i}]=R_{S}^{\mathsf{T}}\Pi R_{T}$ assumptions in the main text. Furthermore, from (25) and Algorithm 5, $\Pi(S,T)-\hat{\Pi}(S,T)=R_{S}^{\mathsf{T}}\Pi R_{T}-\frac{1}{w}\sum_{i=1}^{w}X_{i}$ . We may therefore write

[TABLE]

where we have also used the inequalities $\max\{\|\Pi(S,T)\|_{2},1\}\geq 1$ , $\max\{\|\Pi(S,T)\|_{2},1\}\geq\|\Pi(S,T)\|_{2}$ .

Now to prove the theorem, we aim to find $M$ s.t. $\|X_{i}\|_{2}\leq M$ a.s. and to compute $m_{2}(X_{i})$ such that (77) is bounded by $p_{\textrm{fail}}$ , in each of the following cases:

[TABLE]

We begin with Case 1. By Lemma G.1, we may take $M=l^{3/2}r^{s}_{\max}r^{t}_{\max}$ , and by Lemma G.2, we have $m_{2}(X_{i})\leq l^{2}r^{s}_{\max}r^{t}_{\max}\|\Pi(S,T)\|_{F}$ . We can then write

[TABLE]

where the equality holds by the definition of srank, the penultimate inequality holds since $l,\textrm{srank}(\Pi(S,T))\geq 1$ , and the final inequality is by (78). Substituting (81) into (77) establishes the desired result.

For Case 2, we take $M=l^{3/2}\|\Sigma\|_{\infty,1}r^{s}_{\max}r^{t}_{\max}$ (Lemma G.1), and by Lemma G.2 we have

[TABLE]

We then obtain

[TABLE]

where the second inequality is a standard norm equivalence inequality (for $A\in\mathbb{R}^{l\times l}$ , $\|A\|_{\infty},\|A\|_{1}\leq\sqrt{l}\|A\|_{2}$ ), and the third inequality is by (79). Substituting (83) into (77) completes the proof.
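The norm equivalence used here can be checked numerically. The snippet below is a small sanity check rather than a proof; the helper name `norm_equivalence_gap` and the random test matrices are illustrative assumptions, not part of the paper.

```python
import numpy as np

def norm_equivalence_gap(A):
    """Return sqrt(l) * ||A||_2 - max(||A||_inf, ||A||_1) for a square l x l matrix."""
    l = A.shape[0]
    spectral = np.linalg.norm(A, 2)        # largest singular value
    max_row_sum = np.linalg.norm(A, np.inf)
    max_col_sum = np.linalg.norm(A, 1)
    return np.sqrt(l) * spectral - max(max_row_sum, max_col_sum)

# The inequality predicts a nonnegative gap for every matrix.
rng = np.random.default_rng(0)
gaps = [norm_equivalence_gap(rng.random((5, 5))) for _ in range(100)]
```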

Lemma G.1.

If $\sigma=\sigma_{\textrm{avg}}$ , $\|X_{i}\|_{2}\leq l^{3/2}r^{s}_{\max}r^{t}_{\max}$ a.s.; if $\sigma=\sigma_{\max}$ , $\|X_{i}\|_{2}\leq l^{3/2}\|\Sigma\|_{\infty,1}r^{s}_{\max}r^{t}_{\max}$ a.s.

Proof.

Observe $X_{i}=a_{i}b_{i}^{\mathsf{T}}$ , where $a_{i},b_{i}\in\mathbb{R}^{l}$ with $a_{i}(j)=r^{s_{j}}(\mu_{i})/\sigma(\mu_{i}),b_{i}(j)=r^{t_{j}}(\nu_{i})$ . $X_{i}$ has rank 1, and we may write its singular value decomposition as

[TABLE]

so the nonzero singular value of $X_{i}$ is $\|a_{i}\|_{2}\|b_{i}\|_{2}$ . Using the well-known fact that a matrix’s 2-norm equals its largest singular value, $\|X_{i}\|_{2}=\|a_{i}\|_{2}\|b_{i}\|_{2}$ , so we seek bounds on $\|a_{i}\|_{2}$ and $\|b_{i}\|_{2}$ .
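As a quick numerical illustration of this rank-one fact, the following sketch builds $ab^{\mathsf{T}}$ from two arbitrary nonnegative vectors (chosen here purely for illustration) and compares its singular values with $\|a\|_{2}\|b\|_{2}$ .

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.random(6)   # stands in for a_i
b = rng.random(6)   # stands in for b_i
X = np.outer(a, b)  # rank-one matrix a b^T

# Singular values are returned in descending order.
singular_values = np.linalg.svd(X, compute_uv=False)
expected_top = np.linalg.norm(a) * np.linalg.norm(b)
```

Only the first singular value should be (numerically) nonzero, and it should match `expected_top`.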

First, we assume $\sigma=\sigma_{\textrm{avg}}$ . Then we can write the following:

[TABLE]

Here the first equality holds by definition (27), the first inequality uses the terminating condition of Algorithm 1 ( $\|r^{s}\|_{1}\leq r^{s}_{\max}$ ), the second inequality is by nonnegativity, and the second equality is by definition of $a_{i}$ . We conclude $\|a_{i}\|_{2}\leq lr^{s}_{\max}$ . To bound $\|b_{i}\|_{2}$ , we have

[TABLE]

where we have used a well-known vector norm inequality and the terminating condition of Algorithm 2 ( $\|r^{t}\|_{\infty}\leq r^{t}_{\max}$ ). Hence, $\|X_{i}\|_{2}\leq l^{3/2}r^{s}_{\max}r^{t}_{\max}$ follows.

Next, we assume $\sigma=\sigma_{\max}$ . We have

[TABLE]

which is justified similarly to (85). Combining with (86) gives $\|X_{i}\|_{2}\leq l^{3/2}\|\Sigma\|_{\infty,1}r^{s}_{\max}r^{t}_{\max}$ . ∎

Lemma G.2.

If $\sigma=\sigma_{\textrm{avg}}$ , then $m_{2}(X_{i})\leq l^{2}r^{s}_{\max}r^{t}_{\max}\|\Pi(S,T)\|_{F}$ ; if instead $\sigma=\sigma_{\max}$ , then $m_{2}(X_{i})\leq l\|\Sigma\|_{\infty,1}r^{s}_{\max}r^{t}_{\max}\max\{\|\Pi(S,T)\|_{\infty},\|\Pi(S,T)\|_{1}\}$ .

Proof.

We first assume $\sigma=\sigma_{\textrm{avg}}$ . Using Jensen’s inequality, and since $X_{i}=a_{i}b_{i}^{\mathsf{T}}$ , we have $\|\mathbb{E}[X_{i}X_{i}^{\mathsf{T}}]\|_{2}\leq\mathbb{E}[\|X_{i}X_{i}^{\mathsf{T}}\|_{2}]=\mathbb{E}[\|a_{i}\|_{2}^{2}\|b_{i}\|_{2}^{2}]$ ; similarly, $\|\mathbb{E}[X_{i}^{\mathsf{T}}X_{i}]\|_{2}\leq\mathbb{E}[\|a_{i}\|_{2}^{2}\|b_{i}\|_{2}^{2}]$ . Thus,

[TABLE]

where the second inequality uses the terminating condition of Algorithm 2 ( $r^{t}(v)\leq r^{t}_{\max}$ ) and the nonnegativity of $r^{s}(u)$ , the third follows from (85), and the fourth uses the invariant (7). Finally, letting $vec(\Pi(S,T))$ denote the $l^{2}$ -length vector with entries $\{\pi_{s}(t)\}_{s\in S,t\in T}$ , we have

[TABLE]

where the first equality is by nonnegativity, the inequality is a standard norm inequality, and the second equality is by definition of the Frobenius norm. Substituting into (90) establishes the result.

We next assume $\sigma=\sigma_{\max}$ and bound $\|\mathbb{E}[X_{i}X_{i}^{\mathsf{T}}]\|_{2}$ . We observe that by definition,

[TABLE]

Letting $1_{l}$ denote the all ones vector of length $l$ , we also have

[TABLE]

Now since $\mathbb{E}[X_{i}X_{i}^{\mathsf{T}}]$ is symmetric, its 2-norm is its largest eigenvalue; since it is nonnegative, the Perron-Frobenius Theorem states this eigenvalue is bounded by its maximum row sum. Therefore,

[TABLE]

where (94) uses the row sums derived in (93), (95) uses (87) from the proof of Lemma G.1 and the terminating condition of Algorithm 2 ( $\|r^{t}\|_{\infty}\leq r^{t}_{\max}$ ), and (96) uses the invariant (7). We can use the same idea to bound $\|\mathbb{E}[X_{i}^{\mathsf{T}}X_{i}]\|_{2}$ . The steps to obtain the expression analogous to (94) follow the same approach so we omit them. We then have

[TABLE]

where (97) is immediate, (98) uses (87) from the proof of Lemma G.1 and the terminating condition of Algorithm 2 ( $\|r^{t}\|_{\infty}\leq r^{t}_{\max}$ ), and (99) uses the invariant (7). We conclude from (96) and (99) that

[TABLE]
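The row-sum bound invoked in the $\sigma_{\max}$ case of the proof above (the largest eigenvalue of a symmetric nonnegative matrix is at most its maximum row sum) can be sanity-checked numerically. The matrix below is an arbitrary symmetric nonnegative example of the form $BB^{\mathsf{T}}$ , not the actual $\mathbb{E}[X_{i}X_{i}^{\mathsf{T}}]$ from the paper.

```python
import numpy as np

def top_eig_and_max_row_sum(M):
    """Return (largest eigenvalue, maximum row sum) of a symmetric matrix M."""
    eigenvalues = np.linalg.eigvalsh(M)  # ascending order for symmetric input
    return float(eigenvalues[-1]), float(M.sum(axis=1).max())

rng = np.random.default_rng(2)
B = rng.random((5, 8))   # nonnegative entries
M = B @ B.T              # symmetric and entrywise nonnegative
top_eig, max_row_sum = top_eig_and_max_row_sum(M)
```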

Appendix H Proof of Theorem 5.2

The theorem relies on two key lemmas. The first of these (Lemma H.1) shows that the out-degrees in our stochastic block model concentrate, in the sense that these degrees are all close to $p\sqrt{n}$ with high probability. Lemma H.1 also bounds the maximal number of outgoing edges pointing to other communities, i.e. $\max_{v\in V_{n}}d_{\textrm{out}}^{-}(v)$ . The proof, deferred to Appendix H.1, is a modified version of a standard result for similar random graph families (such as the Erdős-Rényi model).

Lemma H.1.

Let $\{G_{n}=(V_{n},E_{n})\}_{n\in\mathbb{N}:\sqrt{n}\in\mathbb{N}}$ be the sequence of stochastic block models defined in Section 5.1.1, with $p_{n}=p$ for some constant $p\in(0,1)$ . For $\epsilon,C>0$ , define the following events:

[TABLE]

Then the following hold:

• If $q_{n}=o(1/\sqrt{n})$ , then for any constant $\epsilon>0$ , $\lim_{n\rightarrow\infty}\mathbb{P}(\mathcal{E}_{n,\epsilon})=1$ .

• If $q_{n}=\Omega(\log n/n)$ , then for some constant $C>0$ , $\lim_{n\rightarrow\infty}\mathbb{P}(\mathcal{F}_{n,C})=1$ .

• If $q_{n}=\Theta(1/n)$ , then for some constant $C>0$ , $\lim_{n\rightarrow\infty}\mathbb{P}(\mathcal{G}_{n,C})=1$ .

Proof.

See Appendix H.1. ∎

The second key lemma (Lemma H.2) contains useful bounds regarding the vector $\sigma^{s}_{k}=r^{s}_{k}/\|r^{s}_{k}\|_{1}$ , where $r^{s}_{k}$ is the $r^{s}$ vector in the $k$ -th iteration of Algorithm 1. (Here and moving forward, we explicitly denote the current iteration of Algorithm 1 via subscripts, as in Algorithm 7 from Appendix C.) In fact, these bounds hold more generally than will be required for the theorem; namely, we formulate the lemma for any deterministic graph on $n$ nodes for which the out-degree condition $\mathcal{E}_{n,\epsilon}$ holds. The proof is somewhat tedious and is deferred to Appendix H.2.

Lemma H.2.

Let $G_{n}=(V_{n}=\{1,\ldots,n\},E_{n})$ be a deterministic graph satisfying

[TABLE]

for some $p,\epsilon\in(0,1)$ , and let $k\in\{1,\ldots,\lceil(1-\epsilon)^{2}p\sqrt{n}(1-\alpha)/(2e)\rceil\}$ . Then for any $s\in V_{n}$ ,

[TABLE]

and for any $S_{n}\subset V_{n}$ s.t. $s\in S_{n}$ ,

[TABLE]

Proof.

See Appendix H.2. ∎

We now turn to the proof of the theorem. First, suppose all sources belong to the same community, and consider the sub-case $q_{n}=o(1/\sqrt{n}),q_{n}=\Omega(\log n/n)$ . Then for any $\epsilon\in(0,1)$ , Lemma H.2 implies that any realization of $G_{n}$ satisfying $\mathcal{E}_{n,\epsilon}$ also satisfies

[TABLE]

Recall $\alpha,\epsilon,p$ are constants and $|S_{n}|=\sqrt{n},k=o(\sqrt{n})$ in the statement of the theorem. Hence, for some $C^{\prime\prime}>0$ and all $n$ large, any realization of $G_{n}$ satisfying $\mathcal{E}_{n,\epsilon}$ also satisfies

[TABLE]

Now let $C^{\prime}>0,C=C^{\prime}C^{\prime\prime}$ . Then for $n$ large, any realization satisfying $\mathcal{E}_{n,\epsilon}$ and $\mathcal{F}_{n,C^{\prime}}$ also satisfies

[TABLE]

In other words, we have shown that for some $C>0$ and any $C^{\prime}>0$ ,

[TABLE]

Finally, for $C^{\prime}$ satisfying the second statement of Lemma H.1, we obtain

[TABLE]

In the sub-case $q_{n}=\Theta(1/n)$ , a similar argument implies that for some $C,C^{\prime}>0$ ,

[TABLE]

We next consider the case for which all sources belong to different communities, i.e. $S_{n}=\{\sqrt{n},2\sqrt{n},\ldots,n\}$ (which is without loss of generality by symmetry). Then clearly

[TABLE]

Furthermore, for any $\epsilon\in(0,1)$ , Lemma H.2 implies that any realization satisfying $\mathcal{E}_{n,\epsilon}$ satisfies

[TABLE]

Now suppose $q_{n}=o(1/\sqrt{n}),q_{n}=\Omega(\log n/n)$ , and let $\delta\in(0,1)$ be a constant. Then for $C>0$ and $n$ sufficiently large, any realization satisfying $\mathcal{E}_{n,\epsilon}$ and $\mathcal{F}_{n,C}$ will also satisfy

[TABLE]

where we again used the fact that $\alpha,\epsilon,p$ are constant and $k=o(\sqrt{n})$ . The same argument holds for all summands in the summation over $i$ in (115). It follows that, for appropriate choice of $C>0$ ,

[TABLE]

A similar approach establishes the desired result in the case $q_{n}=\Theta(1/n)$ .

Note that (perhaps surprisingly) the only feature of the stochastic block model used above was the degree concentration of Lemma H.1. In other words, we considered the number of edges for each node, while ignoring how exactly these edges were connected. Consequently, the same analysis can be used to obtain results for sequences of deterministic graphs $\{G_{n}=(V_{n},E_{n})\}_{n\in\mathbb{N}:\sqrt{n}\in\mathbb{N}}$ . For example, if such a sequence satisfies $\mathcal{E}_{n,\epsilon},\mathcal{G}_{n,C}$ for some constants $\epsilon,C$ and for all $n$ large, the analysis above implies $\|\Sigma_{S_{n}}\|_{\infty,1}=O(\log n/\log\log n)$ when $\sqrt{n}$ sources belong to the same community, whereas $\|\Sigma_{S_{n}}\|_{\infty,1}=\Omega(\sqrt{n})$ when $\sqrt{n}$ sources belong to different communities.

H.1. Proof of Lemma H.1

For the first statement, we begin by showing $d_{\textrm{out}}(1)$ concentrates around $p\sqrt{n}$ ; we will then use the union bound to establish the lemma. Towards this end, first note that since edges from node $1$ to each $v\in\{2,\ldots,\sqrt{n}\}$ are present with probability $p$ , and since edges from node $1$ to each $v\in\{\sqrt{n}+1,\ldots,n\}$ are present with probability $q_{n}$ , we have

[TABLE]

Next, since $q_{n}=o(1/\sqrt{n})$ and $p$ is constant by assumption, we have for $n$ sufficiently large,

[TABLE]

Thus, combining the previous two lines, we obtain (for such $n$ ),

[TABLE]

We can then use monotonicity and (38) from Appendix B to obtain

[TABLE]

where we also used $\mathbb{E}[d_{\textrm{out}}(1)]\geq p\sqrt{n}$ by (119). Using the same argument for the lower tail, and then using the union bound, we thus obtain

[TABLE]

Finally, by this bound, the fact that $\{d_{\textrm{out}}(v)\}_{v\in V}$ are identically-distributed, and the union bound,

[TABLE]

which, by the law of complements, completes the proof of the first statement.
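The concentration claimed in the first statement can be illustrated by simulation. The sketch below samples out-degrees directly from the two binomial components described above (within-community edges with probability $p$ , cross-community edges with probability $q_{n}$ ). The specific values of $n$ , $p$ , $q_{n}$ , the number of sampled nodes, and the helper name `sample_out_degrees` are illustrative assumptions.

```python
import math
import numpy as np

def sample_out_degrees(n, p, q, num_nodes, seed=0):
    """Sample out-degrees for num_nodes nodes of an SBM with sqrt(n) communities of size sqrt(n)."""
    rng = np.random.default_rng(seed)
    m = math.isqrt(n)                                   # community size sqrt(n)
    within = rng.binomial(m - 1, p, size=num_nodes)     # edges inside own community
    across = rng.binomial(n - m, q, size=num_nodes)     # edges to other communities
    return within + across, across

n, p = 1_000_000, 0.5
q = math.log(n) / n          # a choice that is o(1/sqrt(n)) and Omega(log n / n)
d_out, d_out_minus = sample_out_degrees(n, p, q, num_nodes=2000)
ratio = d_out / (p * math.sqrt(n))   # close to 1 when degrees concentrate
```

With these values, every sampled ratio falls well inside $(0.8,1.2)$ , in line with the event $\mathcal{E}_{n,\epsilon}$ holding for a moderate $\epsilon$ .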

For the second statement, we similarly begin with a tail bound for $d_{\textrm{out}}^{-}(1)$ . First note that, since $q_{n}=\Omega(\log n/n)$ , we can find $C^{\prime}>0$ such that for all $n$ sufficiently large,

[TABLE]

Now let $C>\max\{2e,2/(C^{\prime}\log 2)\}$ . Then clearly

[TABLE]

Hence, we can use (39) from Appendix B to obtain

[TABLE]

By the union bound argument used above, we then have

[TABLE]

Also, by our choice of $C$ and for $n$ sufficiently large (so that $q_{n}n>C^{\prime}\log n$ ),

[TABLE]

Combining the previous two inequalities then yields, for $n$ sufficiently large,

[TABLE]

from which the second statement clearly follows.

For the third statement, we again derive a tail bound for $d_{\textrm{out}}^{-}(1)$ and invoke the union bound, but the tail bound requires a slightly different approach. First, for any $M\in\{1,\ldots,\lfloor n-\sqrt{n}\rfloor\}$ , the event $\{d_{\textrm{out}}^{-}(1)\geq M\}$ means that node $1$ has outgoing edges to $M$ nodes in other communities, so

[TABLE]

where the first inequality is the union bound, the second equality holds by definition of our stochastic block model, and the second inequality is immediate. Now by assumption $q_{n}=\Theta(1/n)$ , we can find $C_{1}$ such that $q_{n}n\leq C_{1}$ for $n$ sufficiently large; combined with the standard binomial coefficient approximation ${n\choose M}\leq(\frac{ne}{M})^{M}$ , we can further bound the above as

[TABLE]

for all $n$ large (we also defined $C_{2}=C_{1}e$ ). Thus, by the union bound and the fact that $\{d_{\textrm{out}}^{-}(v)\}_{v\in V}$ are identically-distributed, we obtain for all $n$ large and any constant $C>0$ ,

[TABLE]

Next, we note

[TABLE]

Choosing any $C\geq 1$ clearly implies

[TABLE]

Also, since $C_{2}>0$ is a constant, we have for all $n$ large (for example)

[TABLE]

Combining the previous four lines, we then obtain, for all $n$ large,

[TABLE]

so that choosing any $C>2$ establishes the third statement.
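The binomial coefficient approximation ${n\choose M}\leq(\frac{ne}{M})^{M}$ used in the proof of the third statement is easy to verify for small cases. The check below compares logarithms to avoid floating-point overflow; the function name and the tested ranges are illustrative.

```python
import math

def binom_bound_holds(n, M):
    """Check C(n, M) <= (n e / M)^M by comparing natural logarithms."""
    log_binom = math.log(math.comb(n, M))        # math.log accepts large ints
    log_bound = M * (1.0 + math.log(n / M))      # log of (n e / M)^M
    return log_binom <= log_bound

all_hold = all(
    binom_bound_holds(n, M)
    for n in (10, 100, 1000)
    for M in range(1, n + 1)
)
```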

H.2. Proof of Lemma H.2

We begin with another lemma, which in fact holds for any underlying graph $G$ .

Lemma H.3.

For any graph $G=(V,E)$ , any source node $s\in V$ , and any iteration $k\in\{1,\ldots,d_{\textrm{out}}(s)\}$ of Algorithm 1,

[TABLE]

Proof.

For the lower bound, first note $r_{1}^{s}(v)=(1-\alpha)/d_{\textrm{out}}(s)\ \forall\ v\in N_{\textrm{out}}(s)$ . Furthermore, for each such $v$ , $r_{k}^{s}(v)$ is non-decreasing in $k$ for $k<k_{v}$ , where $k_{v}$ is the first iteration $k$ for which $v^{*}_{k}=v$ . Also, since $v_{1}^{*}=s$ , we must have $k_{v}\geq d_{\textrm{out}}(s)+1$ for some $v\in N_{\textrm{out}}(s)$ . Hence, for any $k\in\{1,\ldots,d_{\textrm{out}}(s)\}$ , we can find some $v\in N_{\textrm{out}}(s)$ for which $k_{v}>k$ , which implies $r_{k}^{s}(v)\geq r_{1}^{s}(v)=(1-\alpha)/d_{\textrm{out}}(s)$ . Since also $d_{\textrm{out}}(s)\leq\max_{v\in V}d_{\textrm{out}}(v)$ , the lower bound follows.

For the upper bound, we use induction. For the base of induction, simply note

[TABLE]

Now assuming the upper bound holds for $k-1$ , we have for any $v\in V$ ,

[TABLE]

where the first inequality uses the iterative update rule in Algorithm 1, the second is immediate, the third uses the inductive hypothesis, and the fourth uses the standard inequality $1+x\leq e^{x}$ . ∎
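To make the quantities $r^{s}_{k}$ and $v^{*}_{k}$ concrete, here is a minimal sketch of a forward push iteration. It is an assumption-laden illustration, not a transcription of Algorithm 1: it assumes $r^{s}_{0}=e_{s}$ , that $v^{*}_{k}$ is the node with the largest current residual, that every node has at least one out-neighbor and no self-loops, and that the loop stops once $\|r^{s}\|_{1}\leq r^{s}_{\max}$ . The function name `forward_push` and the toy graph are hypothetical.

```python
def forward_push(out_nbrs, s, alpha, r_max):
    """Push residual mass until ||r||_1 <= r_max; returns (p, r, iterations)."""
    n = len(out_nbrs)
    p = [0.0] * n
    r = [0.0] * n
    r[s] = 1.0                                        # r_0 = e_s
    iterations = 0
    while sum(r) > r_max:
        v = max(range(n), key=r.__getitem__)          # assumed choice of v*
        mass = r[v]
        r[v] = 0.0
        p[v] += alpha * mass                          # alpha fraction is settled at v
        share = (1.0 - alpha) * mass / len(out_nbrs[v])
        for u in out_nbrs[v]:                         # remainder spreads to out-neighbors
            r[u] += share
        iterations += 1
    return p, r, iterations

# Toy 4-node graph with out-degree 2 everywhere (hypothetical example).
out_nbrs = [[1, 2], [2, 3], [3, 0], [0, 1]]
p_one, r_one, iters_one = forward_push(out_nbrs, s=0, alpha=0.2, r_max=0.85)
p_full, r_full, iters_full = forward_push(out_nbrs, s=0, alpha=0.2, r_max=0.01)
```

After the first iteration, each out-neighbor of $s$ holds residual $(1-\alpha)/d_{\textrm{out}}(s)$ , matching the value of $r_{1}^{s}(v)$ used in the proof of the lower bound.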

We next state and prove a corollary of Lemma H.3, which translates the $r^{s}_{k}$ bounds from Lemma H.3 to bounds regarding $\sigma^{s}_{k}$ (the actual vector of interest in the theorem).

Corollary H.4.

Let $G_{n}=(V_{n}=\{1,\ldots,n\},E_{n})$ be a graph satisfying

[TABLE]

for some $p,\epsilon\in(0,1)$ . Then for any $k\in\{1,\ldots,\lfloor(1-\epsilon)p\sqrt{n}\rfloor\}$ and any $s\in V_{n}$ ,

[TABLE]

Proof.

Fix $k\in\{1,\ldots,\lfloor(1-\epsilon)p\sqrt{n}\rfloor\}$ and $s\in V_{n}$ . Then $k<d_{\textrm{out}}(s)$ by (145) and the choice of $k$ . We can then use the assumption (145), Lemma H.3, and the choice of $k$ to obtain

[TABLE]

Finally, $\epsilon,p,\alpha\in(0,1)$ yields the first pair of inequalities. Next, by definition of $v_{k+1}^{*}$ in Algorithm 1,

[TABLE]

On the other hand, by the assumption (145), and since $\epsilon\in(0,1)$ ,

[TABLE]

Combining (151) and (152), and using the first pair of inequalities, yields the second pair of inequalities. For the third pair of inequalities, we first assume $k>1$ and use (58) from Appendix C to obtain

[TABLE]

where we also used $r^{s}_{0}=e_{s},v^{*}_{1}=s$ by Algorithm 1. We can then use the second pair of inequalities to obtain the third pair of inequalities. If instead $k=1$ , we immediately have $\|r^{s}_{k}\|_{1}=1-\alpha$ , which is precisely the third pair of inequalities in the case $k=1$ . ∎

We can now prove Lemma H.2. For the first bound, note the assumptions of Lemma H.2 are stronger than those of Corollary H.4, so we can use Corollary H.4 to obtain

[TABLE]

Using the trivial inequalities $1-\epsilon<1,k-1<k$ in the final expression then yields the first upper bound. (Note the assumed upper bound on $k$ ensures the denominator is non-negative.)

For the second bound, let $S_{n}\subset V_{n}$ be a set containing $s$ . We begin by showing

[TABLE]

To prove (155), we use induction. For $k=1$ , the $r^{s}$ update in Algorithm 1 implies

[TABLE]

which, using the assumption (103) and $\alpha,\epsilon,p\in(0,1)$ , can clearly be bounded as

[TABLE]

which establishes (155) when $k=1$ . Now assume (155) holds for $k-1$ and consider two cases:

(1)

$v_{k}^{*}\notin S_{n}$ : We can write the $r^{s}$ update in Algorithm 1 as

[TABLE]

where $\mathbbm{1}_{A}$ is the indicator function of the event $A$ . This clearly implies

[TABLE]

from which the inductive hypothesis completes the proof, since the upper bound in (155) increases with $k$ .

(2)

$v_{k}^{*}\in S_{n}$ : Again using (158), we observe

[TABLE]

(Note the final term in (158) does not appear in (160), since $\sum_{v\in V_{n}\setminus S_{n}}\mathbbm{1}_{\{v=v_{k}^{*}\}}=0$ when $v_{k}^{*}\in S_{n}$ .) For the second summand in (160), we use the second upper bound from Corollary H.4, the assumption (103), $\alpha\in(0,1)$ , and the assumed case $v_{k}^{*}\in S_{n}$ to obtain

[TABLE]

Substituting into (160) and using the inductive hypothesis yields

[TABLE]

which completes the proof.

Combining (155) with the lower bound for $\|r^{s}_{k}\|_{1}$ from Corollary H.4 gives

[TABLE]

from which the trivial bound $k-1<k$ completes the proof.

Appendix I Choosing order of targets in Algorithm 4

As mentioned at the end of Section 5.1.2, the performance of Algorithm 4 can depend significantly on the order in which the targets $t_{1},t_{2},\ldots,t_{|T|}$ are chosen. For instance, suppose there exists $t^{*}\in T$ such that $\pi_{t^{*}}(t^{\prime})>r^{t}_{\max}\ \forall\ t^{\prime}\in T$ , but $\pi_{t}(t^{\prime})\leq r^{t}_{\max}\ \forall\ t\in T\setminus\{t^{*}\},t^{\prime}\in T$ . Then choosing $t_{1}=t^{*}$ implies $c_{T}=|T|-1$ , while choosing $t_{|T|}=t^{*}$ implies $c_{T}=0$ . More generally, the algorithm is most efficient when any $t$ satisfying $\pi_{t}(t^{\prime})>r^{t}_{\max}$ for many $t^{\prime}\in T$ is chosen “early” in the algorithm, i.e. $t_{i}=t$ for small $i$ . However, because $\pi_{t}(t^{\prime})$ is unknown, optimizing the order $t_{1},t_{2},\ldots,t_{|T|}$ at runtime is difficult. A possible workaround is to use $p^{t^{\prime}}(t)$ as a proxy for $\pi_{t}(t^{\prime})$ , since $p^{t^{\prime}}(t)\in[\pi_{t}(t^{\prime})-r^{t}_{\max},\pi_{t}(t^{\prime})]$ by the invariant (6). Unfortunately, even this proxy is difficult to utilize at runtime. This is because we would like to choose $t_{i}$ such that $\pi_{t_{j}}(t_{i})$ is large for many $j<i$ , but the proxy $p^{t_{i}}(t_{j})$ of $\pi_{t_{j}}(t_{i})$ is only known after choosing $t_{i}$ . (Loosely speaking, we have a “chicken and egg” scenario.) Hence, we do not suspect there is a provably optimal method, or even a simple but suboptimal heuristic, for choosing the order of targets at runtime.

Appendix J Details on Section 6 experiments

Datasets: Direct-ER is a directed Erdős-Rényi graph with parameters $n=2000,p=0.005$ (edge $v\rightarrow u$ is present with probability $p$ , independent of other edges, $\forall\ v,u\in V,v\neq u$ ). Direct-SBM is a directed stochastic block model; there are $n=2000$ nodes partitioned into $k=20$ disjoint communities, each of size $\frac{n}{k}=100$ ; directed edges occur with probability $9/(\frac{n}{k}-1)$ between distinct nodes in the same community and with probability $1/(n-\frac{n}{k})$ between nodes in different communities (so that each node has nine neighbors in its own community and one neighbor in another community, in expectation, yielding a highly modular graph). The real graphs used are available from the Stanford Network Analysis Platform (SNAP) (Leskovec and Krevl, [n. d.]); see Table 2 for further details.
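The two synthetic graphs can be generated along the following lines. This is a sketch based only on the parameters stated above; the function names, the dense boolean adjacency representation, and the random seeds are assumptions made for illustration and need not match the code used for the experiments.

```python
import numpy as np

def directed_er(n=2000, p=0.005, seed=0):
    """Directed Erdos-Renyi graph: edge v -> u present w.p. p for all v != u."""
    rng = np.random.default_rng(seed)
    adj = rng.random((n, n)) < p
    np.fill_diagonal(adj, False)          # no self-loops
    return adj

def directed_sbm(n=2000, k=20, seed=0):
    """Directed SBM with k equal communities and the edge probabilities from the text."""
    rng = np.random.default_rng(seed)
    size = n // k                          # community size n/k
    community = np.arange(n) // size
    same = community[:, None] == community[None, :]
    prob = np.where(same, 9.0 / (size - 1), 1.0 / (n - size))
    adj = rng.random((n, n)) < prob
    np.fill_diagonal(adj, False)
    return adj, community

er = directed_er()
sbm, community = directed_sbm()
same_community = community[:, None] == community[None, :]
mean_within = (sbm & same_community).sum(axis=1).mean()   # about 9 in expectation
mean_across = (sbm & ~same_community).sum(axis=1).mean()  # about 1 in expectation
```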

Parameters: For the scalar estimation experiments in Sections 6.1.1 and 6.2.1, we use the algorithmic parameters shown in Table 3. More specifically, FW-BW-MCMC uses Algorithm 7 for forward DP with parameter $\tilde{r}^{s}_{\max}$ and samples $w\|\tilde{r}^{s}\|_{1}$ random walk starting node locations for each source $s$ (as in Algorithm 8), uses the walk sharing scheme from Section 5.1.1 to sample walks jointly across $S$ , and uses Algorithm 4 with parameter $r^{t}_{\max}$ for the targets; for Bidirectional-PPR, we sample $w$ walks separately for each source and run Algorithm 2 separately for each target. In practice, we find that $w$ given by the accuracy guarantee (Theorem A.1) is overly pessimistic, so we instead set $w=\frac{cr^{t}_{\max}}{\delta}$ for both methods, with $c$ given in the table. For the matrix experiments in Sections 6.1.2 and 6.2.2, we use the same $\tilde{r}^{s}_{\max}$ and $r^{t}_{\max}$ values. Furthermore, we set $w=l\tfrac{cr^{t}_{\max}}{\delta}$ , $w=\|\Sigma\|_{\infty,1}\tfrac{cr^{t}_{\max}}{\delta}$ , and $w=\sqrt{l\ \textrm{srank}(P_{T}(S,:)+P_{S}^{\mathsf{T}}R_{T})}\tfrac{cr^{t}_{\max}}{\delta}$ for the baseline, $\sigma_{\max}$ , and $\sigma_{\textrm{avg}}$ schemes, respectively.

Single pair performance: The parameters in Table 3 were chosen so the primitives FW-BW-MCMC-Practical and Bidirectional-PPR offer similar accuracy in the single pair case and balance runtime between dynamic programming (DP) and Monte Carlo (MC). To demonstrate this, we show statistics in Table 3. We obtained the statistics by averaging across $10^{3}$ trials of the following procedure. First, we sample $t\in V$ uniformly. Next, we sample a “significant” source $s$ (i.e. $s$ satisfying $\pi_{s}(t)>\delta$ ) and an “insignificant” source $s^{\prime}$ (i.e. $s^{\prime}$ satisfying $\pi_{s^{\prime}}(t)<\delta$ ). Since Theorem A.1 bounds relative and absolute error for significant and insignificant pairs, respectively, we compute relative and absolute error for the $\pi_{s}(t)$ and $\pi_{s^{\prime}}(t)$ estimates, respectively. (We do not report absolute error statistics as no insignificant estimate violated the absolute error guarantee.) For real datasets, we cannot compute $\pi_{s}(t)$ to test error performance; instead, we run Algorithm 2 with $r^{t}_{\max}$ replaced by $\eta=\frac{1}{n}$ , denote the output $p^{t}_{\eta},r^{t}_{\eta}$ , and bound relative error for significant pairs as

[TABLE]

where we have used $p_{\eta}^{t}(s)\in[\pi_{s}(t)-\|r^{t}_{\eta}\|_{\infty},\pi_{s}(t)]$ (which holds by (6)), $\|r^{t}_{\eta}\|_{\infty}<\eta=\frac{1}{n}$ (which holds by Algorithm 2), and $p_{\eta}^{t}(s)\geq\delta=\frac{10}{n}$ (which holds by choice of $s,t$ ). In the same manner, we can bound absolute error for insignificant pairs as $|\hat{\pi}_{s}(t)-\pi_{s}(t)|\leq|\hat{\pi}_{s}(t)-p_{\eta}^{t}(s)|+\frac{1}{n}$ . (Note we choose significant pairs as those $(s,t)$ satisfying $p^{t}_{\eta}(s)\geq\delta$ , since then $\pi_{s}(t)\geq\delta$ by (6); similarly, we choose insignificant pairs as those $(s^{\prime},t)$ satisfying $p^{t}_{\eta}(s^{\prime})<\delta-\eta$ , since then $\pi_{s^{\prime}}(t)<\delta$ by (6).)
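A computable error bound of this kind can be sketched as follows. The function below uses only the facts listed above ( $p^{t}_{\eta}(s)\leq\pi_{s}(t)\leq p^{t}_{\eta}(s)+\eta$ and $p^{t}_{\eta}(s)>0$ ) together with the triangle inequality; it is one bound consistent with those facts and may differ in form from the displayed bound in the paper. The function name and the numbers in the example are hypothetical.

```python
def relative_error_upper_bound(pi_hat, p_eta, eta):
    """Upper bound on |pi_hat - pi| / pi, given p_eta <= pi <= p_eta + eta and p_eta > 0.

    Triangle inequality: |pi_hat - pi| <= |pi_hat - p_eta| + eta, and pi >= p_eta.
    """
    return (abs(pi_hat - p_eta) + eta) / p_eta

# Hypothetical numbers with n = 2000, eta = 1/n, and a significant pair (p_eta >= 10/n).
n = 2000
eta = 1.0 / n
p_eta = 12.0 / n
pi_hat = 13.0 / n
bound = relative_error_upper_bound(pi_hat, p_eta, eta)

# Any true value in [p_eta, p_eta + eta] has relative error below the bound.
pi_true = p_eta + eta / 2
actual = abs(pi_hat - pi_true) / pi_true
```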

Additional Erdős-Rényi results: We also ran the first experiment from Section 6.1.1 for Erdős-Rényi graphs with $n\in\{4000,8000\}$ , each with edge formation probability $10/n$ . For FW-BW-MCMC, we used parameters $(\tilde{r}_{\max}^{s},r^{t}_{\max})=(1.5,3.5)\times 10^{-3}$ when $n=4000$ and $(\tilde{r}_{\max}^{s},r^{t}_{\max})=(1.2,3.2)\times 10^{-3}$ when $n=8000$ (choosing smaller parameters for larger $n$ gave more balanced runtime than using the $n=2000$ parameters from Table 3). Similarly, for Bidirectional-PPR, we used $r^{t}_{\max}=1.1\times 10^{-3}$ when $n=4000$ and $r^{t}_{\max}=0.8\times 10^{-3}$ when $n=8000$ . As in Table 3, we ensured these parameters gave similar accuracy for both algorithms. Results are shown in Fig. 11. As mentioned in Section 6.1.1, the plots are qualitatively similar across $n$ ; however, they improve slightly as $n$ grows. For instance, in the extreme case $|S|=|T|=n/2$ , FW-BW-MCMC-Prac was (on average) 2.9, 4.5, and 5.8 times faster than Bidirectional-PPR for $n=2000$ , $n=4000$ , and $n=8000$ , respectively.

Building clustered subsets: As mentioned in Section 6.2, we use a simple algorithm to randomly construct clustered subsets of nodes for experiments; Algorithm 9 provides a formal definition.

Appendix K Additional experiments for distributed setting

K.1. Matrix approximation, $\sigma_{\textrm{avg}}$ approach

In this section, we describe a scheme to use the $\sigma_{\textrm{avg}}$ variant of Algorithm 5 in the distributed setting from Section 7. Our scheme is quite similar to that defined in Section 7 and proceeds as follows. First, we arbitrarily partition $S$ into $k$ subsets of size $|S|/k$ , and we use the $i$ -th machine to run forward DP (Algorithm 1) for each source $s$ belonging to the $i$ -th subset. Next, we create another partition $\{S_{i}\}_{i=1}^{k}$ of $S$ and use the $i$ -th machine to sample random walks for $S_{i}$ using the $\sigma_{\textrm{avg}}$ variant of Algorithm 5. Finally, we construct the estimate $\hat{\Pi}(S,T)$ of $\Pi(S,T)$ as in Algorithm 5.

It remains to specify the construction of $\{S_{i}\}_{i=1}^{k}$ . For this, we first use the output $p^{s}$ of Algorithm 1 to define $\textrm{surr}_{s}=P_{T}(s,:)+(p^{s})^{\mathsf{T}}R_{T}$ for each $s\in S$ ; here $P_{T}$ and $R_{T}$ are the matrices with columns $\{p^{t}\}_{t\in T}$ and $\{r^{t}\}_{t\in T}$ , respectively (with each $(p^{t},r^{t})$ computed offline via Algorithm 2 as in Section 7). Note that $\textrm{surr}_{s}$ is a row of the surrogate matrix $P_{T}(S,:)+P_{S}^{\mathsf{T}}R_{T}$ discussed at the conclusion of Section 5.2. For $S^{\prime}\subset S$ , we also define $\textrm{surr}_{S^{\prime}}$ to be the matrix with rows $\{\textrm{surr}_{s}\}_{s\in S^{\prime}}$ . Now, as in Section 6.2.2, the number of walks sampled on the $i$ -th machine will be set proportional to $\sqrt{|S_{i}|\textrm{srank}(\textrm{surr}_{S_{i}})}$ ; hence, our goal is to construct $\{S_{i}\}_{i=1}^{k}$ so as to minimize

[TABLE]

To approximate the solution of this minimization problem, we consider a heuristic method defined in Algorithm 10. Note this is similar to Algorithm 6 in Section 7: first, we assign one source to each $S_{i}$ , while attempting to choose these $s$ with $\textrm{surr}_{s}$ vectors far apart; next, we iteratively assign the remaining $|S|-k$ nodes to some $S_{i}$ , while attempting to minimize the cost of this assignment. In light of (166), we here define the cost of assigning $s$ to $S_{i}$ as $\tilde{d}(s,S_{i})=\sqrt{(|S_{i}|+1)\textrm{srank}(\textrm{surr}_{S_{i}\cup\{s\}})}$ .
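The assignment cost $\tilde{d}(s,S_{i})$ can be computed directly from singular values, as in the sketch below. It assumes srank denotes the stable rank $\|A\|_{F}^{2}/\|A\|_{2}^{2}$ (consistent with how srank is bounded via the Frobenius and spectral norms elsewhere in this appendix); the function names and toy rows are illustrative, not taken from Algorithm 10.

```python
import numpy as np

def srank(A):
    """Stable rank ||A||_F^2 / ||A||_2^2 (assumed definition of srank)."""
    s = np.linalg.svd(A, compute_uv=False)     # descending singular values
    return float((s ** 2).sum() / s[0] ** 2)

def assignment_cost(surr_rows_in_Si, surr_s):
    """Cost sqrt((|S_i| + 1) * srank(surr_{S_i u {s}})) of adding source s to S_i."""
    stacked = np.vstack([surr_rows_in_Si, surr_s])   # append the candidate row
    return float(np.sqrt(stacked.shape[0] * srank(stacked)))

# Toy example: a parallel row keeps the stable rank at 1, an orthogonal row raises it.
current = np.array([[1.0, 0.0, 0.0]])
cost_parallel = assignment_cost(current, np.array([2.0, 0.0, 0.0]))    # sqrt(2 * 1)
cost_orthogonal = assignment_cost(current, np.array([0.0, 1.0, 0.0]))  # sqrt(2 * 2)
```

The toy example reflects the intent of the heuristic: sources whose $\textrm{surr}_{s}$ rows are nearly parallel to those already in $S_{i}$ are cheaper to add.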

Unfortunately, Algorithm 10 requires the singular value decomposition (SVD) of $\textrm{surr}_{S_{j}\cup\{s\}}$ to be computed, so that $\tilde{d}(s,S_{j})$ can be computed in the second for loop of Algorithm 10. (In contrast, computing $d(s,S_{j})$ in the $\sigma_{\max}$ partitioning scheme, Algorithm 6, only requires subtracting one vector from another.) Hence, we also propose an alternative partitioning method that avoids this SVD computation. This method is based on two observations. First, we have

[TABLE]

where the first equality is a well-known result, the inequality follows from the Perron-Frobenius Theorem, and the remaining equalities are straightforward. Second, by definition of $\|\cdot\|_{F}$ , we have

[TABLE]

Combining these observations, we obtain

[TABLE]

This expression allows us to estimate $\tilde{d}(s,S_{j})$ more efficiently than it can be computed exactly. In Algorithm 11, we give a partitioning scheme that leverages this insight. Note that the computation of $\hat{d}(s,S_{j})$ in Algorithm 11 can be performed as

[TABLE]

i.e. the terms $\sum_{s^{\prime}\in S_{j}}\|\textrm{surr}_{s^{\prime}}\|_{2}^{2}$ and $\sum_{s^{\prime}\in S_{j}}\textrm{surr}_{s^{\prime}}(t)\|\textrm{surr}_{s^{\prime}}\|_{1}$ in (171) have already been computed as $x_{j}$ and $y_{j}(t)$ when $\hat{d}(s,S_{j})$ is computed; furthermore, $x_{j}$ and $y_{j}(t)$ are updated (rather than being computed in full) each time some $s$ is added to $S_{j}$ (last line of Algorithm 11).
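The incremental bookkeeping described above can be sketched as follows. The exact form of $\hat{d}(s,S_{j})$ is an assumption here: the code takes it to be $\sqrt{(|S_{j}|+1)\,x/\max_{t}y(t)}$ with $x$ and $y(t)$ updated to include $s$ , which follows from bounding $\|A\|_{2}^{2}$ by the maximum row sum of $A^{\mathsf{T}}A$ for a nonnegative matrix $A$ . The class and method names are hypothetical.

```python
import numpy as np

class SubsetStats:
    """Running sums x_j and y_j(t) for one subset S_j (rows assumed nonnegative)."""

    def __init__(self, num_targets):
        self.size = 0
        self.x = 0.0                       # sum of ||surr_s'||_2^2 over s' in S_j
        self.y = np.zeros(num_targets)     # sum of surr_s'(t) * ||surr_s'||_1

    def cost_if_added(self, surr_s):
        """SVD-free estimate of the cost of adding s, without modifying the state."""
        x_new = self.x + surr_s @ surr_s
        y_new = self.y + surr_s * surr_s.sum()
        return float(np.sqrt((self.size + 1) * x_new / y_new.max()))

    def add(self, surr_s):
        """Update x_j and y_j(t) in O(|T|) time when s joins S_j."""
        self.x += surr_s @ surr_s
        self.y += surr_s * surr_s.sum()
        self.size += 1

# Compare the estimate with the exact SVD-based cost on random nonnegative rows.
rng = np.random.default_rng(3)
rows = rng.random((6, 4))
stats = SubsetStats(num_targets=4)
for row in rows[:-1]:
    stats.add(row)
d_hat = stats.cost_if_added(rows[-1])
sing = np.linalg.svd(rows, compute_uv=False)
d_tilde = float(np.sqrt(rows.shape[0] * (sing ** 2).sum() / sing[0] ** 2))
```

Under these assumptions `d_hat` never exceeds `d_tilde`, since the row-sum bound can only overestimate $\|A\|_{2}^{2}$ .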

In Fig. 12, we present empirical results for the $\sigma_{\textrm{avg}}$ matrix approximation scheme in the distributed setting. In particular, we show results for the scheme described above with the partition $\{S_{i}\}_{i=1}^{k}$ constructed via Algorithm 10 (“Heuristic” in Fig. 12) and via Algorithm 11 (“Alt Heuristic” in Fig. 12). For both schemes, we show the maximum forward DP and random walk sampling time across machines, the maximum number of walks sampled across machines, and the value of the objective function (166). The first two quantities are shown relative to the respective quantities for a baseline scheme, which arbitrarily partitions $S$ into subsets of size $|S|/k$ and uses the $i$ -th machine to run the baseline matrix approximation scheme from Section 6.2.2 for the $i$ -th subset (recall no forward DP is used for this baseline scheme, i.e. walks are not shared across sources). For this experiment, we let $S=\cup_{i=1}^{k}\tilde{S}_{i}$ , where $k=10$ and each $\tilde{S}_{i}$ is a clustered subset satisfying $|\tilde{S}_{i}|=100$ ; we also compare to an oracle scheme that sets $S_{i}=\tilde{S}_{i}$ (as in Section 7). In general, Fig. 12 conveys the same message as Fig. 10 in Section 7: our methods perform similarly to the oracle method and noticeably outperform the baseline. Here we also note that the heuristic outperforms the oracle across graphs, while the oracle in turn outperforms the alternative heuristic. Nevertheless, the alternative heuristic offers performance similar to that of the other schemes, while avoiding the SVD computation of the heuristic (which we expect would become prohibitively costly as $S$ grows).

K.2. Other results for source partitioning schemes

As discussed at the conclusion of Section 7, it is crucial that our source partitioning schemes (Algorithms 6, 10, and 11) balance the number of sources assigned to each machine. To see why, note that the baseline schemes have objective function value $|S|/k$ ; hence, if some machine $i$ is assigned $O(|S|)$ sources using our schemes, we may only outperform the baseline when clustering is extreme. Luckily, we find that the partitions are typically quite balanced in practice, despite the lack of explicit balance constraints in Algorithms 6, 10, and 11. To demonstrate this, we show the maximum and minimum number of sources assigned to machines for the three partitioning schemes in Fig. 13. Averaged across graphs, Algorithms 6, 10 and 11 typically produce partitions with $|S_{i}|\in[85,122]$ , $|S_{i}|\in[55,188]$ , and $|S_{i}|\in[75,134]$ , respectively (the red line shows $|S|/k=100$ , i.e. a perfectly balanced partition). We also note that, while Algorithm 10 typically produces the least balanced partition, its overall performance is similar to that for Algorithm 11 (see Fig. 12), which we have argued is more useful in practice for large $S$ .

