Randomized Gossiping with Effective Resistance Weights: Performance Guarantees and Applications

Bugra Can; Saeed Soori; Necdet Serhat Aybat; Maryam Mehri Dehnavi; Mert Gurbuzbalaban

arXiv:1907.13110·math.OC·October 19, 2021·IEEE Trans. Control. Netw. Syst.

TL;DR

This paper introduces a randomized gossiping method using effective resistance weights that enhances distributed averaging and optimization performance by leveraging network structure insights.

Contribution

It proposes a novel ER-based weighting scheme for gossip algorithms, improving convergence times and efficiency in distributed consensus and optimization tasks.

Findings

01. ER weights reduce averaging time compared to uniform weights

02. Numerical experiments confirm improved communication efficiency

03. ER gossiping enhances performance of distributed optimization algorithms

Abstract

The effective resistance between a pair of nodes in a weighted undirected graph is defined as the potential difference induced when a unit current is injected at one node and extracted from the other, treating edge weights as the conductance values of edges. The effective resistance is a key quantity of interest in many applications, e.g., solving linear systems, Markov Chains, and continuous-time averaging networks. We consider effective resistances (ER) in the context of designing randomized gossiping methods for the consensus problem, where the aim is to compute the average of node values in a distributed manner through iteratively computing weighted averages among randomly chosen neighbors. We show that employing ER weights improves the averaging time corresponding to the traditional choice of uniform weights; the amount of improvement depends on the network structure. We illustrate…

Tables

Table 1: FMMC vs ER on the barbell graph.

Graph	Method	Comm. per node (stage-i)	Comm. per node (stage-ii)
$K_{5}-K_{5}$	ER	$2.9\times 10^{3}$	81
$K_{5}-K_{5}$	FMMC	$1.28\times 10^{5}$	65
$K_{10}-K_{10}$	ER	$8.4\times 10^{4}$	198
$K_{10}-K_{10}$	FMMC	$3.93\times 10^{5}$	130
$K_{20}-K_{20}$	ER	$2.6\times 10^{6}$	433
$K_{20}-K_{20}$	FMMC	$6.4\times 10^{6}$	251
$K_{25}-K_{25}$	ER	$7.9\times 10^{6}$	566
$K_{25}-K_{25}$	FMMC	$>10^{7}$	287

Table 2: FMMC vs ER on the small-world graph.

Graph	Method	Comm. per node (stage-i)	Comm. per node (stage-ii)
$n=5$	ER	6.4	41
$n=5$	FMMC	41075.2	84
$n=10$	ER	16.8	130
$n=10$	FMMC	$>10^{6}$	143
$n=20$	ER	19.20	315
$n=20$	FMMC	$>10^{6}$	370
$n=25$	ER	20.00	403
$n=25$	FMMC	$>10^{6}$	512

Table 3: Comparison of the mean and standard deviation of the number of wake-ups required by the ER, Metropolis, and uniform asynchronous gossiping algorithms, based on 250 runs.

Graph Topology	ER-Kac	ER-Ex	Uniform	Metropolis
Small-world, $n = 100$	530 $(\pm 28.92)$	527 $(\pm 29.29)$	801 $(\pm 219.22)$	623 $(\pm 43.10)$
$K_{20}-K_{20}$	1577 $(\pm 563.62)$	1525 $(\pm 505.93)$	8661 $(\pm 3460.71)$	8341 $(\pm 3328.85)$
SBM(100,2,0.5,0.01)	749 $(\pm 306.76)$	721 $(\pm 268.02)$	1213 $(\pm 363.60)$	914 $(\pm 422.66)$
SBM(100,2,0.9,0.01)	835 $(\pm 436.50)$	814 $(\pm 439.41)$	1459 $(\pm 643.22)$	1248 $(\pm 960.33)$
SBM(120,3,0.9,0.01)	1022 $(\pm 383.75)$	1003 $(\pm 349.30)$	2079 $(\pm 604.95)$	1541 $(\pm 762.64)$
SBM(120,3,0.9,0.05)	646 $(\pm 59.98)$	639 $(\pm 65.43)$	1185 $(\pm 158.95)$	674 $(\pm 69.33)$

Table 4: Comparison of spectral gaps $\Delta_{r},\Delta_{m}$, and $\Delta_{f}$ and CPU times (in seconds) required to compute the communication $(p_{i|j})$ and wake-up probabilities $(p_{i})$ on barbell graphs $K_{\tilde{n}}-K_{\tilde{n}}$.

$\tilde{n}$	$\log(1/\Delta_{r})$	$\log(1/\Delta_{m})$	$\log(1/\Delta_{f})$	CPU Time ER (in secs)	CPU Time FQG (in secs)
20	5.914	7.076	4.407	$\leq 0.01$	2.50
24	6.292	7.599	4.605	$\leq 0.01$	3.10
28	6.610	8.043	4.771	$\leq 0.01$	3.48
32	6.884	8.429	4.913	$\leq 0.01$	6.47
36	7.125	8.771	5.037	$\leq 0.01$	8.63
40	7.340	9.078	5.147	$\leq 0.01$	12.33
44	7.534	9.357	5.247	$\leq 0.01$	13.45
52	7.873	9.846	5.421	$\leq 0.01$	21.79
72	8.532	10.803	5.756	$\leq 0.01$	106.22

Table 5: The comparison of spectral gaps $\Delta_{r},\Delta_{m}$, and $\Delta_{f}$ of the iteration matrices and CPU times (in secs) required to compute $P^{r}$ and $P^{f}$ on $SBM(n,3,0.9,0.01)$.

$n$	$\log(1/\Delta_{r})$	$\log(1/\Delta_{m})$	$\log(1/\Delta_{f})$	CPU Time ER (in secs)	CPU Time FQG (in secs)
22	6.490	7.150	5.457	$\leq 0.01$	3.02
24	6.276	7.356	5.072	$\leq 0.01$	4.09
30	6.743	8.049	5.236	$\leq 0.01$	4.02
32	7.104	8.328	5.573	$\leq 0.01$	6.98
36	6.498	7.524	5.302	$\leq 0.01$	6.74
38	6.268	6.959	4.974	$\leq 0.01$	8.58
40	6.613	7.475	5.264	$\leq 0.01$	8.97
44	6.621	7.420	5.265	$\leq 0.01$	11.04
48	6.904	7.814	5.361	$\leq 0.01$	13.58
50	6.969	7.947	5.505	$\leq 0.01$	15.05
52	7.166	8.249	5.566	$\leq 0.01$	16.93
54	6.934	7.840	5.474	$\leq 0.01$	21.39

Table 6: Comparison of $\log(2\Phi(\overline{W}_{P}))/\log(1-\lambda_{n-1}(\overline{W}_{P}))$ and $\log(\Phi^{2}(\overline{W}_{P}))/\log(1-\lambda_{n-1}(\overline{W}_{P}))$ of ER-based gossiping and classical gossiping on the ($c$-barbell) $c-K_{\tilde{n}}$ graph with $c=10$.

$\tilde{n}$	$a_{r}^{l b}$	$a_{r}^{u b}$	$a_{u}^{l b}$	$a_{u}^{u b}$
10	0.802	1.735	0.865	1.848
16	0.818	1.756	0.883	1.873
18	0.822	1.761	0.887	1.878
20	0.825	1.765	0.891	1.882
22	0.828	1.769	0.893	1.886
28	0.834	1.778	0.900	1.894
30	0.836	1.780	0.901	1.896
36	0.841	1.786	0.905	1.900
38	0.842	1.788	0.906	1.902
44	0.845	1.793	0.909	1.905
46	0.846	1.794	0.910	1.906
48	0.847	1.795	0.911	1.907
50	0.848	1.797	0.912	1.908
100	0.862	1.815	0.923	1.920
500	0.886	1.847	0.939	1.938
1000	0.894	1.858	0.944	1.943

Equations

$$R_{ij}=\mathcal{L}^{\dagger}_{ii}+\mathcal{L}^{\dagger}_{jj}-2\mathcal{L}^{\dagger}_{ij},\quad\forall(i,j)\in\mathcal{E}.$$

$$P_{ii}\triangleq 0;\quad P_{ij}\triangleq p_{i}\,p_{j|i},\ \forall j\in\mathcal{N}_{i},$$

$$P_{ij}\triangleq 0,\ \forall j\in\mathcal{N}\setminus\mathcal{N}_{i},$$

$$T_{ave}(\varepsilon,P)\triangleq\sup_{y^{0}\in\mathbb{R}^{n}\setminus\{0\}}\inf\left\{k:\ \mathbb{P}\left(\frac{\|y^{k}-\bar{y}\mathbf{1}\|}{\|y^{0}\|}\geq\varepsilon\right)\leq\varepsilon\right\},$$

$$y^{k+1}=W^{(i,j)}y^{k}\quad\text{where}\quad W^{(i,j)}\triangleq\mathbf{I}-\frac{(e_{i}-e_{j})(e_{i}-e_{j})^{\top}}{2}.$$

$$\overline{W}_{P}\triangleq\mathbb{E}_{P}\left[W^{(i,j)}\right]=\sum_{i,j\in\mathcal{N}}P_{ij}W^{(i,j)},$$

$$0.5\,\frac{\log(\varepsilon^{-1})}{\log\left([\lambda_{n-1}(\overline{W}_{P})]^{-1}\right)}$$

$$P_{ij}^{u}=p_{i}^{u}\,p_{j|i}^{u}=\frac{1}{n}\frac{1}{d_{i}},$$

$$P_{i^{*}j^{*}}^{u}=P_{j^{*}i^{*}}^{u}=\frac{1}{n}\frac{1}{d_{i^{*}}}=\frac{2}{n^{2}}.$$

$$P_{ij}^{r}=p_{i}^{r}\,p_{j|i}^{r}=\frac{R_{ij}}{2\sum_{(i,j)\in\mathcal{E}}R_{ij}}=\frac{R_{ij}}{2(n-1)}=P_{ji}^{r},$$

$$P_{i^{*}j^{*}}^{r}=P_{j^{*}i^{*}}^{r}=\frac{R_{i^{*}j^{*}}}{2(n-1)}=\frac{1}{2(n-1)},$$

$$\Theta\left(c^{2}\tilde{n}^{3}\log(1/\varepsilon)\right)\leq T_{ave}(\varepsilon,P^{u})\leq\Theta\left(c^{4}\tilde{n}^{6}\log(1/\varepsilon)\right),$$

$$\Theta\left(c^{2}\tilde{n}^{2}\log(1/\varepsilon)\right)\leq T_{ave}(\varepsilon,P^{r})\leq\Theta\left(c^{4}\tilde{n}^{4}\log(1/\varepsilon)\right).$$

$$T_{ave}(\varepsilon,P^{r})=\Theta(1/n)\,T_{ave}(\varepsilon,P^{u}).$$

$$T_{ave}(\varepsilon,P^{r})=\mathcal{O}(\mathcal{D}n^{3})\log(\varepsilon^{-1}).$$

$$\overline{W}_{P}=\mathbf{I}-\frac{1}{2}D+\frac{1}{2}(P+P^{\top}),$$

$$\overline{W}_{P^{u}}=\mathbf{I}-\frac{1}{2}D^{u}+\frac{P^{u}+(P^{u})^{\top}}{2},\quad\overline{W}_{P^{r}}=\mathbf{I}-\frac{1}{2}D^{r}+P^{r},$$

$$[\overline{W}_{P^{u}}]_{i^{*}j^{*}}=\frac{2}{n^{2}},\quad[\overline{W}_{P^{r}}]_{i^{*}j^{*}}=\frac{1}{2(n-1)}.$$

$$\Phi(W)=\min_{S\subset\mathcal{N}:\ S,S^{c}\neq\emptyset}\frac{\sum_{i\in S,j\in S^{c}}\pi_{i}W_{ij}}{\min\{\pi(S),\pi(S^{c})\}}$$

$$1-2\Phi(W)\leq\lambda_{n-1}(W)\leq 1-\Phi^{2}(W),$$

$$\Phi(\overline{W}_{P^{u}})=\frac{c_{*}}{c\tilde{n}^{3}},\quad\Phi(\overline{W}_{P^{r}})=\frac{c_{*}}{2\tilde{n}(c\tilde{n}-1)}.$$

$$-\log(1-\Phi^{2}(W))\leq\log(\lambda_{n-1}^{-1}(W))\leq-\log(1-2\Phi(W)).$$

$$\lambda_{n-1}(\overline{W}_{P^{r}})=1-\Theta\left(\frac{1}{n^{2}}\right),\quad\lambda_{n-1}(\overline{W}_{P^{u}})=1-\Theta\left(\frac{1}{n^{3}}\right).$$

$$T_{mix}(\varepsilon,W)\triangleq\inf_{k\geq 0}\left\{\sup_{p\geq 0:\ \|p\|_{1}=1}\|(W^{k})^{\top}p-\pi\|_{TV}\leq\varepsilon\right\},$$

$$T_{mix}\left(\tfrac{1}{8},\overline{W}_{P^{r}}\right)\leq 8\max_{i,j\in\{1,\dots,n\}}H_{\overline{W}_{P^{r}}}(i\to j)+1\leq 8\mathcal{D}n^{3}.$$

$$T_{mix}\left(\tfrac{1}{8},\overline{W}_{P^{r}}\right)\geq\left(\frac{1}{1-\lambda_{n-1}(\overline{W}_{P^{r}})}-1\right)\log(4).$$

$$M_{ij}\triangleq\begin{cases}\frac{1}{\max(d_{i},d_{j})}&\text{if }(i,j)\in\mathcal{E},\\ 1-\sum_{j\in\mathcal{N}_{i}\setminus i}\frac{1}{\max(d_{i},d_{j})}&\text{if }i=j,\\ 0&\text{else}.\end{cases}$$

$$\max_{i,j\in\{1,2,\dots,n\}}H_{M}(i\to j)\leq 12n^{2},\quad\lambda_{n-1}(M)\leq 1-\frac{1}{71n^{2}}.$$

$$f_{i}(x)\triangleq\frac{1}{2n}\|x\|^{2}+\frac{1}{N_{s}}\sum_{\ell=1}^{N_{s}}\log\left(1+\exp(-b_{i\ell}a_{i\ell}^{\top}x)\right),$$

$$\Phi_{S}(W)\triangleq\frac{1}{\pi(S)}\sum_{i\in S,j\in S^{c}}\pi_{i}W_{ij}.$$

$$G_{1}\triangleq\text{the left-most clique of the }c\text{-barbell graph}.$$

Taxonomy

Topics: Distributed Control Multi-Agent Systems · Complex Network Analysis Techniques · Opinion Dynamics and Social Influence

Full text

Randomized Gossiping with Effective Resistance Weights: Performance Guarantees and Applications

Bugra Can*

Management Sciences and Information Systems

Rutgers Business School

Saeed Soori

Department of Computer Sciences

University of Toronto

Necdet Serhat Aybat

Industrial and Manufacturing Engineering Department

Penn State University

Maryam Mehri Dehnavi

Department of Computer Sciences

University of Toronto

Mert Gürbüzbalaban

Management Sciences and Information Systems

Rutgers Business School

* Bugra Can and Mert Gürbüzbalaban acknowledge support from the Office of Naval Research Award Number N00014-21-1-2244, and the grants National Science Foundation (NSF) CCF-1814888, NSF DMS-2053485, NSF DMS-1723085.

Abstract

The effective resistance between a pair of nodes in a weighted undirected graph is defined as the potential difference induced when a unit current is injected at one node and extracted from the other, treating edge weights as the conductance values of edges. The effective resistance is a key quantity of interest in many applications, e.g., solving linear systems, Markov Chains, and continuous-time averaging networks. We consider effective resistances (ER) in the context of designing randomized gossiping methods for the consensus problem, where the aim is to compute the average of node values in a distributed manner through iteratively computing weighted averages among randomly chosen neighbours. For barbell graphs, we prove that choosing wake-up and communication probabilities proportional to ER weights improves the averaging time corresponding to the traditional choice of uniform weights. For $c$-barbell graphs, we show that ER weights admit lower and upper bounds on the averaging time that improve upon the lower and upper bounds available for uniform weights. Furthermore, for graphs with a small diameter, we show that ER weights can improve upon the existing bounds for Metropolis weights by a constant factor under some assumptions. We illustrate these results through numerical experiments where we showcase the efficiency of our approach on several graph topologies including barbell graphs, small-world graphs, and stochastic block models. We also present an application of ER gossiping to distributed optimization: we numerically verify that using ER gossiping within the EXTRA and DPGA-W methods improves their practical performance in terms of communication efficiency.

Keywords Distributed algorithms/control $\cdot$ networks of autonomous agents $\cdot$ optimization $\cdot$ randomized gossiping algorithms

1 Introduction

Let $\mathcal{G}=(\mathcal{N},\mathcal{E},w)$ be an undirected, weighted and connected graph defined by the set of nodes (agents) $\mathcal{N}=\{1,\ldots,n\}$, the set of edges $\mathcal{E}\subseteq\mathcal{N}\times\mathcal{N}$, and the edge weights $w_{ij}>0$ for $(i,j)\in\mathcal{E}$. Since $\mathcal{G}$ is undirected, we assume that both $(i,j)$ and $(j,i)$ refer to the same edge when it exists, and for all $(i,j)\in\mathcal{E}$, we set $w_{ji}=w_{ij}$. Identifying the weighted graph $\mathcal{G}$ as an electrical network in which each edge $(i,j)$ corresponds to a branch of conductance $w_{ij}$, the effective resistance $R_{ij}$ between a pair of nodes $i$ and $j$ is defined as the voltage potential difference induced between them when a unit current is injected at $i$ and extracted at $j$. The effective resistance (ER), also known as the resistance distance, is a key quantity of interest to compute in many applications and algorithmic questions over graphs. It defines a metric on the graph providing bounds on its conductance [1, 2]. Furthermore, it is closely associated with the hitting and commute times of a random walk on the graph $\mathcal{G}$ when the probability of a transition from $i$ to $j\in\mathcal{N}_{i}$ is $w_{ij}/\sum_{j^{\prime}\in\mathcal{N}_{i}}w_{ij^{\prime}}$, where $\mathcal{N}_{i}\triangleq\{j\in\mathcal{N}:\ w_{ij}>0\}$ denotes the set of neighboring nodes of $i\in\mathcal{N}$. (Footnote 1: The hitting time is the expected number of steps of a random walk starting from $i$ until it first visits $j$. The commute time $C_{ij}$ is the expected number of steps required to go from $i$ to $j$ and from $j$ to $i$ back again.) Therefore, it arises naturally for studying random walks over graphs and their mixing time properties [3, 4, 5], spectral approximation of graphs [6], and continuous-time averaging networks including consensus problems in distributed optimization [3].

There exist centralized algorithms for computing or approximating effective resistances accurately which require global communication beyond local information exchange among the neighboring agents [7, 8, 9, 6, 10]. The references [7, 8, 9] develop key techniques for computing the effective resistances explicitly on specific network types. In particular, [8] addresses a class of graphs which are underlying networks of some symmetric association schemes whereas [7] considers two-dimensional resistor networks. The reference [9] provides an algorithm for the calculation of the resistance between two arbitrary nodes in a distance-regular network and also provides analytical formulas. The works [6, 10] are based on computing or approximating the entries of the pseudoinverse $\mathcal{L}^{\dagger}$ of the graph Laplacian matrix $\mathcal{L}$, based on the identity [6]

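$$R_{ij}=\mathcal{L}^{\dagger}_{ii}+\mathcal{L}^{\dagger}_{jj}-2\mathcal{L}^{\dagger}_{ij},\quad\forall(i,j)\in\mathcal{E}.\tag{1}$$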
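
As a small illustration of the centralized route, the identity $R_{ij}=\mathcal{L}^{\dagger}_{ii}+\mathcal{L}^{\dagger}_{jj}-2\mathcal{L}^{\dagger}_{ij}$ can be evaluated directly with a dense pseudoinverse. The following is only a sketch, not the method of [6] or [10]: it assumes NumPy, a connected graph given as a symmetric conductance matrix, and the hypothetical helper name `effective_resistances`.

```python
import numpy as np

def effective_resistances(weights):
    """Return R with R[i, j] = L+_ii + L+_jj - 2 * L+_ij.

    `weights` is a symmetric (n, n) array of nonnegative edge conductances
    with zero diagonal; the graph is assumed to be connected.
    """
    w = np.asarray(weights, dtype=float)
    laplacian = np.diag(w.sum(axis=1)) - w      # weighted graph Laplacian
    l_pinv = np.linalg.pinv(laplacian)          # Moore-Penrose pseudoinverse
    diag = np.diag(l_pinv)
    return diag[:, None] + diag[None, :] - 2.0 * l_pinv

# Path graph 0 - 1 - 2 with unit conductances (two unit resistors in series).
path = np.array([[0.0, 1.0, 0.0],
                 [1.0, 0.0, 1.0],
                 [0.0, 1.0, 0.0]])
R = effective_resistances(path)
```

On this path graph the resistances add in series, so `R[0, 1]` is 1 and `R[0, 2]` is 2 up to rounding. The dense pseudoinverse needs the whole Laplacian in one place, which is exactly the global information a distributed method has to avoid.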

However, such centralized algorithms are impractical or infeasible for several key applications in multi-agent systems. For example, randomized gossiping algorithms for averaging the node values across the whole network use only local communications between random neighbors (see [11, 12, 13]). This motivates the use of distributed algorithms for computing effective resistances which only rely on the information exchange among immediate neighbors. In these applications, communication among the agents is typically the bottleneck compared to the complexity of local computations of the agents; thus, it is crucial to develop distributed algorithms that are efficient in terms of the total number of communications required. To the best of the authors’ knowledge, the first attempt at computing effective resistances in a decentralized way and also the first ER-based randomized gossiping algorithms appeared in [14]. The latter are asynchronous gossiping algorithms in which each agent’s wake-up and communication probabilities are chosen proportional to ER weights (see Section 2 for details). Aybat and Gürbüzbalaban have shown in [14] that effective resistance (ER) weights can be computed at each agent locally with an efficient distributed algorithm, Distributed Randomized Kaczmarz (D-RK). Our paper is motivated by the numerical evidence presented in [14] that using ER weights has the potential to improve the performance of randomized gossiping algorithms on specific graphs. Since no rigorous performance guarantees for the use of ER weights were provided in [14], here we focus on establishing the missing theoretical results that match the outstanding empirical behavior.

Contributions. First, in this paper, we provide theoretical guarantees on the ER-based randomized gossiping algorithms proposed in [14] for the consensus problem, where the objective is to compute the average of node values over a network in a decentralized manner [12]. A standard approach for solving the consensus problem is randomized uniform gossiping, where each node keeps a local estimate of the average of node values and has an equal (uniform) probability of being activated to communicate with a randomly chosen neighbour to update its local estimate. However, this approach treats all the edges uniformly and can be slow in practice. To overcome this problem, ER-based randomized gossiping algorithms were proposed in [14] without any theoretical guarantees; in these algorithms, the edges are activated with non-uniform probabilities that are proportional to their effective resistances.

Our theoretical results presented in Section 3 (see Results 1, 2, and 3) explain the superior empirical behaviour of ER-based gossiping over uniform gossiping observed in [14]. Briefly, we bound the time required to compute an inexact average using analysis based on conductance and spectral properties of the underlying weighted communication graph, and compare the bounds we obtain for the ER and uniform gossiping methods. We show that the averaging time with ER weights is $\Theta(n)$ times faster than that of uniform gossiping on a barbell graph, where $n$ is the number of agents. Furthermore, we also prove that for connected graphs with a small diameter, the averaging time with resistance weights can be faster than known performance bounds for the averaging time with gossiping based on Metropolis weights by a constant factor (see Remark 12). We also provide numerical experiments on several graph topologies which illustrate the performance improvements that can be obtained with ER-based gossiping. In our experiments, the effective resistances are first computed with the normalized D-RK algorithm of [14] and then used for ER-based gossiping. Our theoretical and numerical results show that ER weights are especially useful in the presence of “bottleneck edges” or clusters giving a graph cut leading to small graph conductance values.

On a different note, Aybat and Gürbüzbalaban [14] introduced two alternative methods to compute ER weights in a decentralized manner, D-RK and normalized D-RK, both converging linearly. In our experiments in Section 5, we adopt the normalized D-RK, having proved that its convergence rate is better than that of D-RK; this resolves a conjecture raised in [14] (see the Supplementary Material).

Second, we consider the consensus optimization problem, where the agents connected on a network aim to collaboratively solve the optimization problem $\min_{x\in\mathbb{R}^{p}}f(x)\triangleq\sum_{i=1}^{n}f_{i}(x)$, where $f_{i}(x):\mathbb{R}^{p}\to\mathbb{R}$ is a cost function only available to agent (node) $i$. This formulation covers a number of key problems in supervised learning, including distributed regression and logistic regression or, more generally, distributed empirical risk minimization problems [15, 16]. The consensus iterations are a building block of many existing state-of-the-art distributed consensus optimization algorithms such as the EXTRA and the distributed proximal gradient (DPGA-W) [17] algorithms. We show through numerical experiments that our framework based on effective resistances can improve the performance of the EXTRA and DPGA-W algorithms for consensus optimization in terms of the total number of communications required. We believe our framework has far-reaching potential for improving the communication efficiency of many other distributed algorithms including distributed subgradient and ADMM methods, and this will be the subject of future work.

Related work. For consensus problems, there are some alternative methods to accelerate the commonly used consensus protocols. The approach in [18] is a synchronous algorithm combining Metropolis weights with a momentum averaging scheme. There are other approaches based on momentum averaging [19, 20, 21], min-sum splitting [22], and Chebyshev acceleration [23, 24, 25] to accelerate the convergence speed of the consensus methods. This paper is orthogonal to the momentum averaging-based approaches in the sense that it can be used in combination with the aforementioned momentum-based acceleration schemes; we refer the reader to the Supplementary Material for the details. There are also works that provide lower bounds on the distributed averaging time on a graph [12, 26, 27, 28]. In particular, it follows from these lower bounds that for the two-dimensional grid, even the best gossiping weights will not lead to an accelerated performance compared to baseline approaches. Indeed, for special graphs such as the two-dimensional grid, the cycle graph or the line graph, ER weights will be similar to uniform weights due to the symmetries in the graph structure, and consequently ER weights will not improve the performance compared to uniform weights. However, for graphs with asymmetries involving clusters or bottleneck edges along which the graph cut has low conductance, based on our numerical and theoretical results, we expect ER weights to lead to improved performance.

Outline. In Section 2, we give a brief overview of randomized gossiping including uniform and ER-based gossiping methods. In Section 3, we state our main contributions. In Section 4, we provide detailed arguments establishing the main results stated in Section 3. In Section 5, we provide numerical experiments illustrating that using ER weights can improve the performance of EXTRA and DPGA-W algorithms for consensus optimization. In Section 6, we give some concluding remarks. Finally, we present some of the proofs and supporting results in Appendix A–B.

Notation. Let $|S|$ denote the cardinality of a set $S$, $\lfloor\cdot\rfloor$ denote the floor function and $\mathbb{Z}_{+}$ be the set of nonnegative integers. We define $d_{i}\triangleq|\mathcal{N}_{i}|$ as the degree of $i\in\mathcal{N}$, and $m\triangleq|\mathcal{E}|$. Throughout the paper, $\mathcal{L}\in\mathbb{R}^{|\mathcal{N}|\times|\mathcal{N}|}$ denotes the weighted Laplacian of $\mathcal{G}$, i.e., $\mathcal{L}_{ii}=\sum_{j\in\mathcal{N}_{i}}w_{ij}$, $\mathcal{L}_{ij}=-w_{ij}$ if $j\in\mathcal{N}_{i}$, and $\mathcal{L}_{ij}=0$ otherwise. The diameter of a graph is $\mathcal{D}\triangleq\max_{i,j\in\mathcal{N}}d(i,j)$, where $d(i,j)$ is the length of the shortest path on the graph between nodes $i$ and $j$. The set $\mathbb{S}^{n}$ denotes the set of $n\times n$ real symmetric matrices. We use the notation $Z=[z_{i}]_{i=1}^{n}$ where the $z_{i}$’s are either the columns or rows of the matrix $Z$ depending on the context. $\mathbf{1}$ is the column vector with all entries equal to 1, and $\mathbf{I}$ is the identity matrix. We let $\|x\|_{p}$ denote the $L_{p}$ norm of a vector $x$ for $p\geq 1$, and let $\|A\|_{F}$ denote the Frobenius norm of a matrix $A$. A square matrix $A$ is doubly stochastic if all of its entries are non-negative and all its rows and columns sum up to 1. We say that a square matrix $A$ is weakly diagonally dominant if its diagonal entries $A_{ii}$ satisfy the inequality $|A_{ii}|\geq\sum_{j\neq i}|A_{ij}|$ for every $i$. Let $f$ and $g$ be real-valued functions defined over positive integers. We say $f(n)=\mathcal{O}(g(n))$ if $f$ is bounded above by $g$ asymptotically, i.e., there exist constants $k_{1}>0$ and $n_{0}\in\mathbb{Z}_{+}$ such that $f(n)\leq k_{1}\cdot g(n)$ for all $n>n_{0}$. Similarly, we say $f(n)=\Omega(g(n))$ if there exist constants $k_{2}>0$ and $n_{0}\in\mathbb{Z}_{+}$ such that $f(n)\geq k_{2}\,g(n)$ for every $n>n_{0}$; and we say $f(n)=\Theta(g(n))$ if $f(n)=\Omega(g(n))$ and $f(n)=\mathcal{O}(g(n))$. Finally, $\log(x)$ denotes the natural logarithm of $x$, and $e_{i}$ is the $i$-th standard basis vector in $\mathbb{R}^{n}$ for $i=1,2,\dots,n$.

2 Preliminaries

2.1 Randomized gossiping

Here we give an overview of randomized gossiping methods for the consensus problem. These methods can compute the average of node values over a network in an asynchronous and decentralized manner; for details, see [12, 28].

Let $y^{0}\in\mathbb{R}^{n}$ be a vector such that the $i$-th component $y_{i}^{0}$ represents the initial value at node $i\in\mathcal{N}$. The aim of the randomized gossiping algorithms is to have each node compute the average $\bar{y}\triangleq\sum_{i=1}^{n}y_{i}^{0}/n$ in a decentralized manner through an iterative procedure. At every iteration $k\in\mathbb{Z}_{+}$, each node $i\in\mathcal{N}$ possesses a local estimate $y_{i}^{k}$ of the average to be computed and communicates only with randomly selected neighbors to update its estimate. The setup is that each node $i\in\mathcal{N}$ has an exponential clock ticking with rate $r_{i}>0$, where the time between two ticks is exponentially distributed and independent of other nodes’ clocks. A node wakes up when its clock ticks. Since all the clocks are independent, if a node wakes up at time $t_{k}\geq 0$, it is node $i$ with probability (w.p.) $p_{i}\triangleq r_{i}/\sum_{j\in\mathcal{N}}r_{j}$. Given that node $i$ wakes up at time $t_{k}$, it picks one of its neighbors $j\in\mathcal{N}_{i}$ to communicate with, w.p. $p_{j|i}\in(0,1)$, where the conditional probabilities $\{p_{j|i}\}_{j\in\mathcal{N}_{i}}$ are design parameters satisfying $\sum_{j\in\mathcal{N}_{i}}p_{j|i}=1$. When either $i$ wakes up and picks $j\in\mathcal{N}_{i}$ or vice versa, we say the edge $(i,j)$ is activated. Once the edge $(i,j)$ is activated, nodes $i$ and $j$ exchange their local variables $y_{i}^{k}$ and $y_{j}^{k}$ at time $t_{k}$ and both compute the average $(y_{i}^{k}+y_{j}^{k})/2$. This is illustrated in Algorithm 1 below, which admits an asynchronous implementation; see, e.g., [12].
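
The wake-up, neighbor selection and pairwise averaging steps described above can be simulated in a few lines. This is a minimal sketch rather than the paper's Algorithm 1: the exponential clocks are replaced by drawing the waking node directly from $\{p_{i}\}$, and the names `gossip`, `neighbors`, `p_wake` and `p_cond` are hypothetical.

```python
import numpy as np

def gossip(y0, neighbors, p_wake, p_cond, num_iters, seed=0):
    """Simulate randomized pairwise gossiping.

    neighbors[i] : list of neighbors of node i
    p_wake[i]    : probability p_i that node i is the one waking up
    p_cond[i]    : probabilities p_{j|i}, aligned with neighbors[i]
    """
    rng = np.random.default_rng(seed)
    y = np.array(y0, dtype=float)
    n = len(y)
    for _ in range(num_iters):
        i = rng.choice(n, p=p_wake)                # node whose clock ticks
        j = rng.choice(neighbors[i], p=p_cond[i])  # neighbor picked by node i
        y[i] = y[j] = 0.5 * (y[i] + y[j])          # both keep the pairwise average
    return y

# Cycle on 4 nodes with uniform wake-up and communication probabilities.
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
p_wake = [0.25, 0.25, 0.25, 0.25]
p_cond = {i: [0.5, 0.5] for i in range(4)}
y0 = [1.0, 5.0, -2.0, 8.0]
y_final = gossip(y0, neighbors, p_wake, p_cond, num_iters=500)
```

Each update replaces two entries by their mean, so the sum of the entries is preserved and the iterates can only converge to $\bar{y}\mathbf{1}$; here $\bar{y}=3$.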

Assuming there are no self-loops, for each $i\in\mathcal{N}$, let

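$$P_{ii}\triangleq 0;\quad P_{ij}\triangleq p_{i}\,p_{j|i},\ \forall j\in\mathcal{N}_{i};\quad P_{ij}\triangleq 0,\ \forall j\in\mathcal{N}\setminus\mathcal{N}_{i},\tag{2}$$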

where $P_{ij}$ is the (unconditional) probability that the edge $(i,j)$ is activated by the node $i$. By definition, we have $\sum_{ij}P_{ij}\triangleq\sum_{i\in\mathcal{N}}\sum_{j\in\mathcal{N}}P_{ij}=1$. Let $\mathcal{A}(P)$ denote an asynchronous gossiping algorithm characterized by a probability matrix $P$ as in (2) for some set of probabilities $\{p_{i}\}_{i\in\mathcal{N}}$ and $\{p_{j|i}\}_{j\in\mathcal{N}_{i}}$ for $i\in\mathcal{N}$. The performance of $\mathcal{A}(P)$ is typically measured by the $\varepsilon$-averaging time, defined for any $\varepsilon>0$ as:

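$$T_{ave}(\varepsilon,P)\triangleq\sup_{y^{0}\in\mathbb{R}^{n}\setminus\{0\}}\inf\left\{k:\ \mathbb{P}\left(\frac{\|y^{k}-\bar{y}\mathbf{1}\|}{\|y^{0}\|}\geq\varepsilon\right)\leq\varepsilon\right\},$$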

see, e.g., [12]. If the edge $(i,j)$ is activated by node $i$, then we can write the update in Step 1 of Algorithm 1 as

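$$y^{k+1}=W^{(i,j)}y^{k}\quad\text{where}\quad W^{(i,j)}\triangleq\mathbf{I}-\frac{(e_{i}-e_{j})(e_{i}-e_{j})^{\top}}{2}.$$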

We also define

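$$\overline{W}_{P}\triangleq\mathbb{E}_{P}\left[W^{(i,j)}\right]=\sum_{i,j\in\mathcal{N}}P_{ij}W^{(i,j)},\tag{4}$$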

which is the expected value of the random iteration matrix $W^{(i,j)}$ with respect to the distribution defined over $i\in\mathcal{N}$ and $j\in\mathcal{N}_{i}$. The following theorem from [12] shows that the second largest eigenvalue of $\overline{W}_{P}$ determines the $\varepsilon$-averaging time.

Theorem 1 ([12, Theorem 3]).

For a given $\mathcal{A}(P)$, the symmetric matrix $\overline{W}_{P}$ defined in (4) satisfies

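$$\frac{0.5\log(\varepsilon^{-1})}{\log\left([\lambda_{n-1}(\overline{W}_{P})]^{-1}\right)}\leq T_{ave}(\varepsilon,P)\leq\frac{3\log(\varepsilon^{-1})}{\log\left([\lambda_{n-1}(\overline{W}_{P})]^{-1}\right)},$$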

where $\lambda_{n-1}(\overline{W}_{P})$ is the second largest eigenvalue of $\overline{W}_{P}$.

This result makes the connection between the convergence time of an asynchronous gossiping algorithm $\mathcal{A}(P)$ and the spectrum of the expected iteration matrix $\overline{W}_{P}$. It is therefore of interest to design $P$ through carefully choosing the probabilities $\{p_{i}\}_{i\in\mathcal{N}}$ and $\{p_{j|i}\}_{j\in\mathcal{N}_{i}}$ for $i\in\mathcal{N}$ in order to get the best performance, i.e., the smallest $\varepsilon$-averaging time.
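
To see how the design of $P$ enters the spectrum, it may help to expand the expectation $\overline{W}_{P}=\sum_{i,j\in\mathcal{N}}P_{ij}W^{(i,j)}$ with $W^{(i,j)}=\mathbf{I}-\frac{1}{2}(e_{i}-e_{j})(e_{i}-e_{j})^{\top}$, using $\sum_{i,j}P_{ij}=1$. In the short derivation below, the diagonal matrix $D$ is defined for illustration as the matrix that makes the identity hold; it is an assumption that this matches the paper's own definition of $D$.

```latex
\begin{aligned}
\overline{W}_{P}
&= \sum_{i,j\in\mathcal{N}} P_{ij}\Big(\mathbf{I}-\tfrac{1}{2}(e_{i}-e_{j})(e_{i}-e_{j})^{\top}\Big) \\
&= \mathbf{I}-\tfrac{1}{2}\sum_{i,j\in\mathcal{N}} P_{ij}\big(e_{i}e_{i}^{\top}+e_{j}e_{j}^{\top}\big)
   +\tfrac{1}{2}\sum_{i,j\in\mathcal{N}} P_{ij}\big(e_{i}e_{j}^{\top}+e_{j}e_{i}^{\top}\big) \\
&= \mathbf{I}-\tfrac{1}{2}D+\tfrac{1}{2}\big(P+P^{\top}\big),
\qquad D_{ii}\triangleq\sum_{j\in\mathcal{N}}\big(P_{ij}+P_{ji}\big),\quad D_{ij}\triangleq 0\ \text{for } i\neq j.
\end{aligned}
```

With this $D$, every row of $\overline{W}_{P}$ sums to $1-\frac{1}{2}D_{ii}+\frac{1}{2}\sum_{j}(P_{ij}+P_{ji})=1$, and $\overline{W}_{P}$ is symmetric even when $P$ is not, so the eigenvalue $\lambda_{n-1}(\overline{W}_{P})$ depends on $P$ only through the symmetric part $P+P^{\top}$.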

In this paper, we consider two different randomized gossiping algorithms: uniform gossiping and ER gossiping, which differ in how the probabilities $\{p_{i}\}_{i\in\mathcal{N}}$ and $\{p_{j|i}\}_{j\in\mathcal{N}_{i}}$ for $i\in\mathcal{N}$ are selected. In particular, based on Theorem 1, we will study the second largest eigenvalue of the expected iteration matrix $\overline{W}_{P}$ corresponding to these two algorithms and compare their $\varepsilon$-averaging times.
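
Such a comparison can also be made numerically for any given $P$. The sketch below forms $\overline{W}_{P}=\sum_{i,j}P_{ij}W^{(i,j)}$ directly from its definition, extracts the second largest eigenvalue, and evaluates the two-sided bound of [12, Theorem 3] with constants 0.5 and 3. The function names are hypothetical, NumPy is assumed, and the entries of `P` are assumed to sum to 1.

```python
import numpy as np

def expected_gossip_matrix(P):
    """W_bar = sum_{i,j} P_ij * (I - (e_i - e_j)(e_i - e_j)^T / 2)."""
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    eye = np.eye(n)
    w_bar = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if P[i, j] > 0:
                d = eye[i] - eye[j]
                w_bar += P[i, j] * (eye - 0.5 * np.outer(d, d))
    return w_bar

def averaging_time_bounds(P, eps):
    """Return (lambda_{n-1}, lower bound, upper bound) on T_ave(eps, P)."""
    eigenvalues = np.linalg.eigvalsh(expected_gossip_matrix(P))  # ascending order
    lam = eigenvalues[-2]                                        # second largest
    scale = np.log(1.0 / eps) / np.log(1.0 / lam)
    return lam, 0.5 * scale, 3.0 * scale

# Uniform gossiping on the 4-cycle: P_ij = (1/n) * (1/d_i) = 1/8 on each edge.
cycle = np.array([[0, 1, 0, 1],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [1, 0, 1, 0]], dtype=float)
P_cycle = cycle / 8.0
lam, lower, upper = averaging_time_bounds(P_cycle, eps=0.01)
```

For this small example $\overline{W}_{P}=\frac{3}{4}\mathbf{I}+\frac{1}{8}A$, where $A$ is the adjacency matrix of the cycle with eigenvalues $2,0,0,-2$, so `lam` equals $0.75$ up to rounding.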

2.2 Randomized uniform gossiping

In the randomized uniform gossiping, each node $i$ wakes up with equal probability $p_{i}^{u}=\frac{1}{n}$ , i.e., using uniform clock rates $r_{i}=r>0$ for $i\in\mathcal{N}$ . The superscript $u$ stands for the uniform choice of clock rates. Then, node $i$ picks the edge $(i,j)$ with conditional probability $p_{j|i}^{u}=\frac{1}{d_{i}}$ for $j\in\mathcal{N}_{i}$ ; thus,

$$P^{u}_{ij}=p_{i}^{u}\,p_{j|i}^{u}=\frac{1}{n\,d_{i}}\quad\text{for }j\in\mathcal{N}_{i};$$

see, e.g., [29, 30]. One drawback of this approach is that it can be quite slow over graphs with a high bottleneck ratio [31] where, intuitively speaking, some “bottleneck edges” limit the spread of information over the underlying graph. A classical example of a graph with a high bottleneck ratio is the barbell graph. Barbell graphs are frequently studied within the consensus problem literature as they constitute a worst-case example in terms of both the mixing properties of random walks [4, Section 5] and the performance of distributed averaging algorithms (see, e.g., [3, 32]).

Barbell graphs consist of two complete subgraphs connected by a single edge (see Figure 1). Let $K_{\tilde{n}}$ denote a complete graph with ${\tilde{n}}$ nodes; we denote a barbell graph with $n=2{\tilde{n}}$ nodes by $K_{\tilde{n}}-K_{\tilde{n}}$. Let $(i^{*},j^{*})$ be the edge that connects the two complete subgraphs, which we refer to as the bottleneck edge. This is the only edge that allows node values to propagate between the two complete subgraphs; therefore, how frequently it is sampled is a key factor in determining the averaging time.

The probability of sampling the bottleneck edge $(i^{*},j^{*})$ with uniform weights can be computed explicitly:

$$P^{u}_{i^{*}j^{*}}=\frac{1}{n\,d_{i^{*}}}=\frac{1}{n\tilde{n}}=\frac{2}{n^{2}}.\tag{5}$$

This implies that it takes $\Theta(n^{2})$ iterations in expectation to activate this edge, which is the underlying reason why the randomized uniform gossiping iterates converge slowly when $n$ is large on the barbell graph. The effect of bottleneck edges on the performance of gossiping algorithms has been recently studied experimentally by Aybat and Gürbüzbalaban [14] on different topologies including the barbell and small-world graphs. The authors proposed ER gossiping where the edges are sampled with non-uniform probabilities proportional to effective resistances $\{R_{ij}\}_{(i,j)\in\mathcal{E}}$ and the numerical experiments in [14] showed that this can lead to significant performance improvement over graphs with bottleneck edges, such as barbell graphs. We next describe this method.
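The uniform sampling probability of the bottleneck edge can be checked on a small instance. The sketch below builds $K_{\tilde{n}}-K_{\tilde{n}}$ and evaluates $P^{u}_{ij}=\frac{1}{n d_{i}}$ exactly; the helper names and the node labeling (cliques $\{0,\dots,\tilde{n}-1\}$ and $\{\tilde{n},\dots,2\tilde{n}-1\}$, bottleneck edge between nodes $\tilde{n}-1$ and $\tilde{n}$) are assumptions made for this example.

```python
from fractions import Fraction

def barbell_neighbors(n_tilde):
    """Adjacency lists of the barbell graph; the bottleneck edge joins
    i* = n_tilde - 1 and j* = n_tilde."""
    n = 2 * n_tilde
    nbrs = {i: [] for i in range(n)}
    for block in (range(n_tilde), range(n_tilde, n)):
        for i in block:
            nbrs[i] = [j for j in block if j != i]
    nbrs[n_tilde - 1].append(n_tilde)
    nbrs[n_tilde].append(n_tilde - 1)
    return nbrs

def uniform_probabilities(nbrs):
    """P^u_ij = p_i^u * p_{j|i}^u = (1/n) * (1/d_i) for each neighbor j of i."""
    n = len(nbrs)
    return {(i, j): Fraction(1, n * len(nbrs[i])) for i in nbrs for j in nbrs[i]}

n_tilde = 5
n = 2 * n_tilde
P_u = uniform_probabilities(barbell_neighbors(n_tilde))
i_star, j_star = n_tilde - 1, n_tilde
p_bottleneck = P_u[(i_star, j_star)]  # 1/(n * n_tilde) = 2/n^2, since d_{i*} = n_tilde
```

Because the degree of $i^{*}$ is $\tilde{n}=n/2$, the probability decays quadratically in $n$, which is the source of the $\Theta(n^{2})$ expected waiting time.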

2.3 Effective-resistance (ER) gossiping

In the ER gossiping, each $i\in\mathcal{N}$ wakes up with probability $p_{i}^{r}=\frac{\sum_{j\in\mathcal{N}_{i}}R_{ij}}{2\sum_{(i,j)\in\mathcal{E}}R_{ij}}$ , i.e., setting clock rate $r_{i}=\sum_{j\in\mathcal{N}_{i}}R_{ij}$ for $i\in\mathcal{N}$ , and node $i$ picks $(i,j)$ with conditional probability $p_{j|i}^{r}=\frac{R_{ij}}{\sum_{j\in\mathcal{N}_{i}}R_{ij}}$ for all $j\in\mathcal{N}_{i}$ ; thus, ER gossiping corresponds to the unconditional probabilities

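$$P^{r}_{ij}=p_{i}^{r}\,p_{j|i}^{r}=\frac{R_{ij}}{2\sum_{(i,j)\in\mathcal{E}}R_{ij}}=\frac{R_{ij}}{2(n-1)}$$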

for all $(i,j)\in\mathcal{E}$ where the third equality follows from Foster’s Theorem which says that $\sum_{(i,j)\in\mathcal{E}}R_{ij}=(n-1)$ – see, e.g., [33]. This choice of sampling probabilities can lead to bottleneck edges being more frequently sampled. We illustrate this fact on the barbell graph ( $K_{\tilde{n}}-K_{\tilde{n}}$ ): Note that the unconditional probability of sampling the bottleneck edge $(i^{*},j^{*})$ is given explicitly as

$$P^{r}_{i^{*}j^{*}}=\frac{R_{i^{*}j^{*}}}{2(n-1)}=\frac{1}{2(n-1)},\tag{6}$$

where $n=2{\tilde{n}}$ and we used the fact that $R_{i^{*}j^{*}}=1$ (see the proof of Lemma 16 for the derivation of (6)). Hence, comparing (5) and (6), we see that ER weights allow sampling of the bottleneck edge $(i^{*},j^{*})$ more frequently, by a factor of $\Theta(n)$ , than the uniform gossiping on $K_{\tilde{n}}-K_{\tilde{n}}$ . Intuitively speaking, this is the reason why ER gossiping can be efficient on barbell graphs. Numerical experiments provided in [14] support this intuition where ER gossiping outperforms uniform gossiping over an unweighted barbell graph as well as small-world graphs, which are random graphs that arise frequently in real-world applications such as social networks.
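As a rough illustration of the ER sampling probabilities, the following sketch computes the effective resistances of a small unweighted barbell graph, checks Foster's theorem, and evaluates the probability of the bottleneck edge. The function name `effective_resistances`, the node labeling, and the grounded-Laplacian approach (exact Gauss-Jordan inversion) are choices made for this example; they are not the distributed procedure used in the paper.

```python
from fractions import Fraction

def effective_resistances(n, edges):
    """R_ij for every edge of an unweighted connected graph (unit conductances).

    Grounds node n-1, inverts the reduced Laplacian exactly, and uses
    R_ij = G_ii + G_jj - 2 G_ij, where G is zero on the grounded node."""
    m = n - 1
    L = [[Fraction(0)] * m for _ in range(m)]
    for i, j in edges:
        for a, b in ((i, j), (j, i)):
            if a < m:
                L[a][a] += 1
                if b < m:
                    L[a][b] -= 1
    # Reduce the augmented matrix [L | I] to [I | L^{-1}].
    A = [row + [Fraction(int(r == c)) for c in range(m)] for r, row in enumerate(L)]
    for col in range(m):
        piv = next(r for r in range(col, m) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        pivot = A[col][col]
        A[col] = [x / pivot for x in A[col]]
        for r in range(m):
            if r != col and A[r][col] != 0:
                f = A[r][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    G = [[Fraction(0)] * n for _ in range(n)]
    for r in range(m):
        for c in range(m):
            G[r][c] = A[r][m + c]
    return {(i, j): G[i][i] + G[j][j] - 2 * G[i][j] for i, j in edges}

# Barbell graph K_4 - K_4 (n = 8); the bottleneck edge is (3, 4).
n_tilde = 4
n = 2 * n_tilde
edges = [(i, j) for blk in (range(n_tilde), range(n_tilde, n)) for i in blk for j in blk if i < j]
edges.append((n_tilde - 1, n_tilde))
R = effective_resistances(n, edges)
foster_sum = sum(R.values())                         # Foster's theorem: equals n - 1
p_bottleneck = R[(n_tilde - 1, n_tilde)] / (2 * (n - 1))  # P^r for the bottleneck edge
```

On this instance the bottleneck edge has resistance 1 while every clique edge has resistance $2/\tilde{n}$, so ER sampling shifts probability mass toward the bottleneck.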

Despite the empirical success of ER gossiping in practice, theoretical results supporting its practical performance have been lacking in the literature. The purpose of this paper is to provide rigorous convergence guarantees for ER gossiping algorithms on certain network topologies (see Section 3 for our main results’ statements and Section 4 for the proofs) and to present further numerical evidence that ER gossiping, beyond distributed averaging, can also improve the practical performance of distributed methods for consensus optimization (Section 5). Indeed, in our analysis, we consider connected graphs characterized by their diameter $\mathcal{D}\in\mathbb{Z}_{+}$ , barbell graphs and $c$ -barbell graphs which are generalizations of barbell graphs. More specifically, a $c$ -barbell graph ( $K_{{\tilde{n}}}^{c}$ ) for $c\geq 2$ is a path of $c$ equal-sized complete graphs ( $K_{{\tilde{n}}}$ ) [34], e.g., see Figure 2 for $K_{4}^{c}$ . In the special case, when $c=2$ , a $c$ -barbell graph is equivalent to the barbell graph. We show that for these graphs, ER gossiping has provably better convergence properties than uniform gossiping in terms of $\varepsilon$ -averaging times. Precise results will be stated in the next section.

3 Main Results

In this section, we state our main theoretical results: we provide performance bounds for the ER gossiping in terms of $\epsilon$ -averaging time $T_{ave}(\varepsilon,P^{r})$ . Our results highlight the performance improvements obtained with this approach.

Our first result concerns $c$-barbell graphs, where we focus on the $\varepsilon$-averaging times of the uniform and ER gossiping algorithms. To the best of our knowledge, for $c$-barbell graphs, a closed-form expression for the second largest eigenvalue of $\overline{W}_{P}$ is not available; therefore, in our analysis we estimate this eigenvalue using graph conductance techniques (see Section 4.1 for details), which leads to the following lower and upper bounds on the $\varepsilon$-averaging times.

Result 1.

Given $\epsilon>0$ and $\tilde{n},c\in\mathbb{Z}_{+}$ such that $c\geq 2$, the asynchronous randomized gossiping algorithms $\mathcal{A}(P^{u})$ and $\mathcal{A}(P^{r})$ on a $c$-barbell graph with $n=\tilde{n}c$ nodes satisfy

[TABLE]

These bounds from Result 1 for the $c$-barbell graph show that, for any given precision $\epsilon>0$, using effective resistances one can improve the upper and lower bounds on the averaging times by a factor of $\Theta(n)$ and $\Theta(n^{2})$, respectively. In the Supplementary Material, we also compare the averaging times $T_{ave}(\varepsilon,P^{r})$ and $T_{ave}(\varepsilon,P^{u})$ numerically by computing the second-largest eigenvalues $\lambda_{n-1}(\overline{W}_{P^{r}})$ and $\lambda_{n-1}(\overline{W}_{P^{u}})$ and invoking Theorem 1. These numerical results are in line with Result 1, showing that effective resistances improve upon uniform weights in the sense that the averaging time for the effective resistances scales better with the number of nodes $\tilde{n}$.

The next result shows that for the case of barbell graphs (when $c=2$ ) the ER gossiping is in fact faster by a factor of $\Theta(n)$ . The proof idea is based on computing the eigenvalues of $\overline{W}_{P^{r}}$ and $\overline{W}_{P^{u}}$ explicitly via exploiting symmetry group properties of barbell graphs and showing that the lower bounds in (7)–(8) are attained for $c=2$ .

Result 2.

Given $\epsilon>0$ and $n\in\mathbb{Z}_{+}$ , let $n=2{\tilde{n}}$ . The $\varepsilon$ -averaging times of asynchronous gossiping algorithms $\mathcal{A}(P^{r})$ and $\mathcal{A}(P^{u})$ on barbell graph $K_{{\tilde{n}}}-K_{{\tilde{n}}}$ satisfy the equality:

[TABLE]

A natural question is whether it is possible to further improve the ER gossiping bounds for barbell graphs; in the next result, we show that this is not possible as long as the matrix $P$ is symmetric, so ER gossiping is optimal within this class. Finally, we also obtain $\varepsilon$-averaging bounds for a more general class of connected graphs depending on their diameters.

Result 3.

Given $\epsilon>0$ and $n\in\mathbb{Z}_{+}$ , let $n=2{\tilde{n}}$ . Among all the gossiping algorithms $\mathcal{A}(P)$ with a symmetric $P$ on the barbell graph, $K_{\tilde{n}}-K_{\tilde{n}}$ , randomized ER gossiping leads to $T_{ave}(\varepsilon,P^{r})=\Theta(n^{2}\log(1/\varepsilon))$ , which is optimal with respect to $\varepsilon$ and $n$ , and cannot be improved.

In a more general setting, let $\mathcal{G}$ be a connected graph with diameter $\mathcal{D}\in\mathbb{Z}_{+}$ . The $\varepsilon$ -averaging time of $\mathcal{A}(P^{r})$ satisfies

[TABLE]

Remark 2.

The $\varepsilon$-averaging time of randomized gossiping with lazy Metropolis weights (see (15) and the paragraph after it) on any graph is $\mathcal{O}(n^{3}\log(1/\varepsilon))$; for the barbell graph, Metropolis weights perform similarly to uniform weights: both require $\Theta(n^{3}\log(1/\varepsilon))$ time, which ER gossiping improves to $\Theta(n^{2}\log(1/\varepsilon))$.

Remark 3.

If the diameter $\mathcal{D}\leq 11$ , our bounds for ER gossiping improve upon that of the randomized gossiping with lazy Metropolis weights by a (small) constant factor (see Remark 12). Note $\mathcal{D}=3$ for barbell graphs and $\mathcal{D}\leq 11$ is also reasonable for mid-size small-world graphs which are random graphs that arise frequently in real-world applications [35]. For instance, Cont et al. [35] show that the diameter $\mathcal{D}$ of the randomized community-based small-world graphs admits $2\log(n)$ upper bound almost surely; hence, for these graphs $\mathcal{D}\leq 11$ almost surely for $n\leq 240$ . Indeed, we empirically observe that randomly generated small-world graphs with parameters $n=\{5k:k=1,\ldots,5\}$ and $m=\lfloor 0.2(n^{2}-n)\rfloor$ using the methodology described in the numerical experiments in Section 5.1 satisfy $\mathcal{D}\leq 5$ on average over $10^{4}$ independent and identically distributed (i.i.d.) samples.
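As a quick check of the threshold $n\leq 240$ above, assuming $\log$ denotes the natural logarithm:

$$2\log(240)\approx 2\times 5.481\approx 10.96\leq 11,$$

so the almost-sure bound $\mathcal{D}\leq 2\log(n)$ gives $\mathcal{D}\leq 11$ for every $n\leq 240$.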

4 Proofs of Main Results

In order for both uniform and ER gossiping methods to have the same expected number of node wake-ups in a given time period, one should have $r_{i}=r=2(n-1)/n$ for $i\in\mathcal{N}$ within the uniform gossiping model –recall that $r_{i}=\sum_{j\in\mathcal{N}_{i}}R_{ij}$ for $i\in\mathcal{N}$ for ER gossiping; hence, the rate of both Poisson processes will be the same, i.e., $\sum_{i\in\mathcal{N}}r_{i}=2(n-1)$ . We note that the number of clock ticks $k\in\mathbb{Z}_{+}$ can be converted to absolute time easily with standard arguments (simply dividing $k$ by $\sum_{i\in\mathcal{N}}r_{i}$ to get the expected time of the $k$ -th tick), e.g., see [12, Lemma 1]. This allows us to use the number of iterations (clock ticks) to compare asynchronous algorithms.

It can be easily verified that for a given $\mathcal{A}(P)$ , the expected iteration matrix defined in (4) satisfies

$$\overline{W}_{P}=I-\frac{1}{2}D+\frac{P+P^{\top}}{2},\tag{9}$$

where $D$ is a diagonal matrix with $i$-th entry $D_{i}\triangleq\sum_{j\in\mathcal{N}_{i}}(P_{ij}+P_{ji})$. Note that $W_{ij}$ defined in Section 2.1 is a doubly stochastic, non-negative and weakly diagonally dominant matrix for all $i\in\mathcal{N}$ and $j\in\mathcal{N}_{i}$; therefore, $\overline{W}_{P}$, which is a convex combination of the $W_{ij}$ matrices, is also a doubly stochastic, non-negative and weakly diagonally dominant matrix. It then follows from Gershgorin’s Disc Theorem (see, e.g., [36]) that all the eigenvalues of $\overline{W}_{P}$ are non-negative. Moreover, since $\overline{W}_{P}$ is a non-negative doubly stochastic matrix, its largest eigenvalue is $\lambda_{n}(\overline{W}_{P})=1$. Plugging in $P^{u}$ and $P^{r}$ for $P$ in this identity, respectively, leads immediately to the following result.

Lemma 4.

The matrices $\overline{W}_{P^{r}}=\mathbb{E}_{P^{r}}[W_{ij}]$ and $\overline{W}_{P^{u}}=\mathbb{E}_{P^{u}}[W_{ij}]$ satisfy the following identities:

[TABLE]

where $D^{u}$ and $D^{r}$ are diagonal matrices satisfying $[D^{u}]_{ii}\triangleq\sum_{j\in\mathcal{N}_{i}}(P^{u}_{ij}+P^{u}_{ji})$ , $[D^{r}]_{ii}=\frac{1}{(n-1)}R_{i}$ where $R_{i}\triangleq\sum_{j\in\mathcal{N}_{i}}R_{ij}$ .
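The structure of the expected iteration matrix can be verified numerically on a toy example. The sketch below assumes the pairwise-averaging matrix $W^{(i,j)}=I-\frac{1}{2}(e_{i}-e_{j})(e_{i}-e_{j})^{\top}$ and the closed form $\overline{W}_{P}=I-\frac{1}{2}D+\frac{1}{2}(P+P^{\top})$ with $D_{i}=\sum_{j}(P_{ij}+P_{ji})$, as in [12]; the function names and the 3-node path graph are hypothetical.

```python
from fractions import Fraction

def pairwise_matrix(n, i, j):
    """W^{(i,j)} = I - (e_i - e_j)(e_i - e_j)^T / 2: nodes i and j average their values."""
    W = [[Fraction(int(r == c)) for c in range(n)] for r in range(n)]
    half = Fraction(1, 2)
    W[i][i] = W[j][j] = half
    W[i][j] = W[j][i] = half
    return W

def expected_matrix_by_sum(n, P):
    """Direct expectation: sum over activated pairs (i, j) of P_ij * W^{(i,j)}."""
    Wbar = [[Fraction(0)] * n for _ in range(n)]
    for (i, j), p in P.items():
        Wij = pairwise_matrix(n, i, j)
        for r in range(n):
            for c in range(n):
                Wbar[r][c] += p * Wij[r][c]
    return Wbar

def expected_matrix_closed_form(n, P):
    """Closed form I - D/2 + (P + P^T)/2 with D_i = sum_j (P_ij + P_ji)."""
    Wbar = [[Fraction(int(r == c)) for c in range(n)] for r in range(n)]
    for (i, j), p in P.items():
        Wbar[i][j] += p / 2   # contribution of P_ij to (P + P^T)/2
        Wbar[j][i] += p / 2
        Wbar[i][i] -= p / 2   # contribution of P_ij to D_i / 2
        Wbar[j][j] -= p / 2   # contribution of P_ij to D_j / 2
    return Wbar

# Hypothetical example: path graph 0-1-2 with uniform gossiping, P^u_ij = 1/(n d_i).
P = {(0, 1): Fraction(1, 3), (1, 0): Fraction(1, 6),
     (1, 2): Fraction(1, 6), (2, 1): Fraction(1, 3)}
W_sum = expected_matrix_by_sum(3, P)
W_closed = expected_matrix_closed_form(3, P)
```

Both constructions agree, and the resulting matrix is symmetric with unit row sums even though $P^{u}$ itself is not symmetric.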

Recall the definition of $T_{ave}(\varepsilon,P)$ given in (3), i.e., $\varepsilon$ -averaging time of an asynchronous gossiping algorithm $\mathcal{A}(P)$ characterized by a probability matrix $P$ . According to Theorem 1, to compare uniform and ER gossiping methods introduced in Section 2, it is sufficient to estimate the second largest eigenvalues of $\overline{W}_{P^{r}}$ and $\overline{W}_{P^{u}}$ and compare them. In the rest of this section, we discuss estimating the second largest eigenvalues of $\overline{W}_{P^{r}}$ and $\overline{W}_{P^{u}}$ based on the notions of graph conductance and hitting times when the eigenvalues are not readily available in closed form. We will also discuss some examples for which we can explicitly compute the eigenvalues.

It is worth emphasizing that since the matrices $\overline{W}_{P^{r}}$ and $\overline{W}_{P^{u}}$ are symmetric and doubly stochastic, they can both be viewed as the probability transition matrix of a reversible Markov Chain on the graph $\mathcal{G}$ , both with a uniform stationary distribution. We saw that depending on the type of randomized gossiping, the sampling probabilities of the bottleneck edge can differ significantly –by a factor of $\Theta(n)$ on barbell graphs implied by (5) and (6). A similar effect can also be observed for the Markov chains defined by the transition probability matrices $\overline{W}_{P^{u}}$ and $\overline{W}_{P^{r}}$ . In fact, by an explicit computation based on Lemma 4 (see Lemma 16 for details), we get

[TABLE]

That is, the probability of moving from one complete subgraph to the other is significantly larger (by a factor of $\Theta(n)$ ) for the Markov chain corresponding to $\overline{W}_{P^{r}}$ than that of the chain with $\overline{W}_{P^{u}}$ . Intuitively speaking, this fact allows the ER-based chain to traverse between the complete subgraphs faster when $n$ is large, leading to faster averaging over the nodes. This will be formalized and proven in the next subsection, where we study gossiping algorithms over barbell and $c$ -barbell graphs.

4.1 Proof of Result 1 via conductance-based analysis

Probability transition matrices on graphs have been well studied; in particular, there are combinatorial techniques to bound their eigenvalues based on graph conductance [4], as well as algebraic techniques that allow one to compute all the eigenvalues explicitly by exploiting the symmetry groups of a graph [37], as we shall discuss in Section 4.2.

The notion of graph conductance is tied to a transition matrix $W$ over a graph which corresponds to a reversible Markov chain admitting an arbitrary stationary distribution $\pi$ . It can be viewed as a measure of how hard it is for the Markov chain to go from a subgraph to its complement in the worst case.

The notion of graph conductance allows us to provide bounds on the mixing time of the corresponding Markov chain as we discuss below.

Definition 5 (Conductance).

Let $W$ be the transition matrix of a reversible Markov chain (that is, $\pi_{i}W_{ij}=\pi_{j}W_{ji}$ for all $i,j\in\mathcal{N}$) on the graph $\mathcal{G}$ with a stationary distribution $\pi=\{\pi_{i}\}_{i=1}^{n}$. The conductance $\Phi$ is defined as

[TABLE]

where $\pi(S)\triangleq\sum_{i\in S}\pi_{i}$ .

Given a transition matrix $W$ , the relation between conductance $\Phi(W)$ and the second largest eigenvalue $\lambda_{n-1}(W)$ is well-known and given by the Cheeger inequalities:

[TABLE]

see, e.g., [38, Proposition 6]. Therefore, larger conductance leads to faster averaging, i.e., shorter $T_{ave}(\varepsilon,P)$, in light of Theorem 1. In particular, we can get lower and upper bounds on the averaging time for both uniform and ER gossiping methods using Cheeger’s inequality. We study the performance bounds for these gossiping algorithms over $c$-barbell graphs; our next result shows a $\Theta(n)$ improvement in the conductance of the effective resistance-based transition matrix $\overline{W}_{P^{r}}$ compared to the uniform one $\overline{W}_{P^{u}}$ on a $c$-barbell graph with $n=c{\tilde{n}}$ nodes.

Proposition 6.

Given $\tilde{n},c\in\mathbb{Z}_{+}$ such that $c\geq 2$, consider the two Markov chains on the $c$-barbell graph with $n={\tilde{n}}c$ nodes defined by the transition matrices $\overline{W}_{P^{u}}$ and $\overline{W}_{P^{r}}$. Let $c_{*}=\big(\lfloor\frac{c}{2}\rfloor\big)^{-1}$. The conductance values are given by

[TABLE]

Remark 7.

Since a barbell graph $K_{\tilde{n}}-K_{\tilde{n}}$ is a special case of a c-barbell graph with $c=2$ and $n=2{\tilde{n}}$ , Proposition 6 implies that $\Phi(\overline{W}_{P^{u}})=\frac{4}{n^{3}}$ and $\Phi(\overline{W}_{P^{r}})=\frac{1}{n(n-1)}$ .
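The two values in Remark 7 can be reproduced by brute force on a small barbell graph. The sketch below assumes the usual form of the conductance for a uniform stationary distribution, $\Phi=\min_{S:\,\pi(S)\leq 1/2}\frac{\sum_{i\in S,j\notin S}\pi_{i}W_{ij}}{\pi(S)}$, and hard-codes the barbell effective resistances ($2/\tilde{n}$ inside a clique, $1$ on the bottleneck edge); the function names and node labeling are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def barbell_wbar(n_tilde, scheme):
    """Expected iteration matrix on the barbell graph with off-diagonal entries (P_ij + P_ji)/2.

    scheme "u": P_ij = 1/(n d_i);  scheme "r": P_ij = R_ij / (2(n-1))."""
    n = 2 * n_tilde
    i_star, j_star = n_tilde - 1, n_tilde
    deg = [n_tilde - 1] * n
    deg[i_star] = deg[j_star] = n_tilde
    W = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            same_clique = (i < n_tilde) == (j < n_tilde)
            bridge = {i, j} == {i_star, j_star}
            if i == j or not (same_clique or bridge):
                continue
            if scheme == "u":
                W[i][j] = (Fraction(1, n * deg[i]) + Fraction(1, n * deg[j])) / 2
            else:
                R = Fraction(1) if bridge else Fraction(2, n_tilde)
                W[i][j] = R / (2 * (n - 1))
    for i in range(n):
        W[i][i] = 1 - sum(W[i])  # diagonal is still zero here, so this completes the row sum to 1
    return W

def conductance(W):
    """Minimum over nonempty S with |S| <= n/2 of (cut weight) / |S|, i.e. the
    conductance for the uniform stationary distribution pi_i = 1/n."""
    n = len(W)
    best = None
    for size in range(1, n // 2 + 1):
        for S in combinations(range(n), size):
            inside = set(S)
            cut = sum(W[i][j] for i in S for j in range(n) if j not in inside)
            ratio = cut / size
            if best is None or ratio < best:
                best = ratio
    return best

n_tilde = 4
n = 2 * n_tilde
phi_u = conductance(barbell_wbar(n_tilde, "u"))
phi_r = conductance(barbell_wbar(n_tilde, "r"))
```

The enumeration is exponential in $n$ and is only meant for small sanity checks; in both cases the minimizing set is one of the two cliques.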

Given the transition matrix $W$ , by taking the logarithm of the Cheeger inequalities in (11), for $\Phi(W)\leq 1/2$ , we obtain

[TABLE]

Then, choosing $W=\overline{W}_{P^{u}}$ and $W=\overline{W}_{P^{r}}$ above, applying Theorem 1 and Proposition 6, and noting that $-\log(1-x)\approx x$ for $x$ close to 0, leads to the lower and upper bounds on the averaging time of the uniform and ER gossiping algorithms shown in Result 1 of our main results section (Section 3). In the Supplementary Material, we also study the tightness of our conductance bounds (13) numerically on $c$-barbell graphs to show that our bounds are reasonable. In particular, we observe that our lower bounds get tighter as the number of nodes, $n$, increases on $c$-barbell graphs.

Although this analysis is also applicable to other graphs with low conductance, it does not typically lead to tight estimates, i.e., the lower and upper bounds do not match in terms of their dependency on $n$ . In the next section, we show that for the case of barbell graphs, we get tight estimates on the averaging time by computing the eigenvalues of the averaging matrices $\overline{W}_{P^{r}}$ and $\overline{W}_{P^{u}}$ explicitly. More precisely, we will show in Proposition 9 that the lower bounds in (7)–(8) are tight for $c=2$ in the sense that $T_{ave}(\varepsilon,P^{u})=\Theta(n^{3})$ and $T_{ave}(\varepsilon,P^{r})=\Theta(n^{2})$ and the effective resistance-based averaging is faster by a factor of $\Theta(n)$ which will imply Result 2.

4.2 Proof of Result 2 via spectral analysis

Eigenvalues of probability transition matrices defined on barbell graphs have been studied in the literature. Consider the edge-weighted barbell graph $K_{\tilde{n}}-K_{\tilde{n}}$ with $n=2{\tilde{n}}$ nodes, where $w=[w_{ij}]_{(i,j)\in\mathcal{E}}$ is the vector of edge weights with positive entries. Suppose each node has a self-loop (see, e.g., Figure 3). Let $(i^{*},j^{*})$ be the edge that connects the two complete subgraphs. The result [37, Prop. 5.1] gives an explicit formula for the eigenvalues of a probability transition matrix $W$ with transition probabilities proportional to edge weights, i.e., $W_{ij}=w_{ij}/\sum_{j\in\mathcal{N}_{i}}w_{ij}$, where the $w_{ij}$ satisfy the following assumptions: $w_{i^{*}i^{*}}=w_{j^{*}j^{*}}=0$, $w_{i^{*}j^{*}}=A$, $w_{i^{*}j}=w_{j^{*}i}=B$ for all $j\in\mathcal{N}_{i^{*}}\setminus\{j^{*}\}$ and $i\in\mathcal{N}_{j^{*}}\setminus\{i^{*}\}$, $w_{ij}=C$ for all $(i,j)$ in each $K_{\tilde{n}}$ such that $i\neq j$ and $i,j\notin\{i^{*},j^{*}\}$, and $w_{ii}=D$ for $i\in\mathcal{N}\setminus\{i^{*},j^{*}\}$ for some $A,B,C,D>0$. Note that we cannot immediately use this result to compute the eigenvalues of the transition matrices $\overline{W}_{P^{r}}$ and $\overline{W}_{P^{u}}$ defined in Lemma 4, because all the diagonal entries of $\overline{W}_{P^{r}}$ and $\overline{W}_{P^{u}}$ are strictly positive, which violates the $w_{i^{*}i^{*}}=w_{j^{*}j^{*}}=0$ assumption of [37, Prop. 5.1]. In Proposition 8, we adapt [37, Prop. 5.1] to our setting with some minor modifications to allow $w_{i^{*}i^{*}}=w_{j^{*}j^{*}}=G$ for any $G>0$ so that it becomes applicable to $\overline{W}_{P^{r}}$ and $\overline{W}_{P^{u}}$. The proof of Proposition 8, provided in the Supplementary Material, is similar to the proof of [37, Prop. 5.1] and is based on exploiting the symmetry properties of the weighted barbell graph described above (illustrated in Figure 3 for $\tilde{n}=4$).

Proposition 8 (Generalization of Proposition 5.1 in [37]).

Consider the edge-weighted barbell graph $K_{\tilde{n}}-K_{\tilde{n}}$ with $n=2{\tilde{n}}$ nodes. Let $(i^{*},j^{*})$ be the edge that connects the two complete subgraphs. Assume that weights are of the form $w_{i^{*}i^{*}}=w_{j^{*}j^{*}}=G$ , $w_{i^{*}j^{*}}=A$ , $w_{i^{*}j}=w_{j^{*}i}=B$ for all $j\in\mathcal{N}_{i^{*}}\setminus\{j^{*}\}$ and $i\in\mathcal{N}_{j^{*}}\setminus\{i^{*}\}$ , $w_{ij}=C$ for all $(i,j)$ in each $K_{\tilde{n}}$ such that $i\neq j$ and $i,j\notin\{i^{*},j^{*}\}$ , and $w_{ii}=D$ for $i\in\mathcal{N}\setminus\{i^{*},j^{*}\}$ for some $A,B,C,D,G>0$ . Consider the transition matrix $W$ associated to this graph with entries $W_{ij}=w_{ij}/\sum_{j\in\mathcal{N}_{i}}w_{ij}$ , then the eigenvalues of $W$ are

• $\lambda_{a}\triangleq 1$ with multiplicity one,

• $\lambda_{b}\triangleq-1+\frac{A+G}{A+G+E}+\frac{F}{F+B}$ with multiplicity one,

• $\lambda_{c}\triangleq\frac{D-C}{B+F}$ with multiplicity $n-4$,

• $\lambda_{\pm}\triangleq\frac{1}{2}\Big(\frac{F}{B+F}+\frac{G-A}{A+E+G}\,\pm\,\sqrt{S}\Big)$,

where $E\triangleq({\tilde{n}}-1)B$, $F\triangleq D+({\tilde{n}}-2)C$ and $S\triangleq\big(\frac{F}{B+F}+\frac{G-A}{A+E+G}\big)^{2}-\frac{4(FG-BE-AF)}{(B+F)(A+E+G)}$.
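A simple consistency check on the eigenvalue formulas of Proposition 8 is that, counted with multiplicity, they must sum to the trace of $W$. The sketch below verifies this identity in exact arithmetic for one arbitrary choice of weights; the function names and the sample values of $A,B,C,D,G$ are hypothetical and only serve as an illustration.

```python
from fractions import Fraction

def barbell_transition_trace(n_tilde, A, B, C, D, G):
    """Trace of W with W_ij = w_ij / (row sum of w) for the weight pattern of Proposition 8."""
    E = (n_tilde - 1) * B               # total weight from i* (or j*) to the rest of its clique
    F = D + (n_tilde - 2) * C           # self-loop plus weights to the other non-bottleneck nodes
    special = Fraction(G, A + E + G)    # W_ii for i in {i*, j*}
    regular = Fraction(D, B + F)        # W_ii for every other node
    return 2 * special + (2 * n_tilde - 2) * regular

def eigenvalue_sum(n_tilde, A, B, C, D, G):
    """Sum of the eigenvalues listed in Proposition 8, counted with multiplicity.
    The square roots cancel in lambda_+ + lambda_-, so the sum is rational."""
    n = 2 * n_tilde
    E = (n_tilde - 1) * B
    F = D + (n_tilde - 2) * C
    lam_a = Fraction(1)
    lam_b = -1 + Fraction(A + G, A + G + E) + Fraction(F, F + B)
    lam_c = Fraction(D - C, B + F)
    lam_plus_minus = Fraction(F, B + F) + Fraction(G - A, A + E + G)
    return lam_a + lam_b + (n - 4) * lam_c + lam_plus_minus

# Arbitrary positive integer weights for K_4 - K_4.
trace = barbell_transition_trace(4, 3, 2, 1, 5, 7)
total = eigenvalue_sum(4, 3, 2, 1, 5, 7)
```

This checks only the sum of the eigenvalues, not each eigenvalue individually, so it is a necessary condition rather than a proof of the formulas.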

Based on this result, in Proposition 9 we characterize the second largest eigenvalues of the transition matrices $\overline{W}_{P^{r}}$ and $\overline{W}_{P^{u}}$; the proof can be found in the appendix.

Proposition 9.

Consider Markov chains on the barbell graph $K_{{\tilde{n}}}-K_{{\tilde{n}}}$ with transition matrices $\overline{W}_{P^{r}}$ and $\overline{W}_{P^{u}}$ . The second largest eigenvalues of these matrices are given by

[TABLE]

Result 2 follows as a direct consequence of Proposition 9 and Theorem 1. Thus, we establish that averaging with resistance weights is faster by a factor of $\Theta(n)$ on a barbell graph.
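The comparison can also be explored numerically on a small barbell graph. The sketch below estimates $\lambda_{n-1}$ of both expected iteration matrices by power iteration on the subspace orthogonal to the all-ones vector, and then forms the quantity $\log(1/\varepsilon)/\log(1/\lambda_{n-1})$, which we assume controls $T_{ave}$ up to constant factors as in [12]. The function names, the hard-coded barbell resistances ($2/\tilde{n}$ inside a clique, $1$ on the bottleneck edge), and the choices $\tilde{n}=4$ and $\varepsilon=10^{-3}$ are assumptions made for this example.

```python
import math

def barbell_wbar(n_tilde, scheme):
    """Expected iteration matrix on the barbell graph ("u": uniform, "r": ER weights)."""
    n = 2 * n_tilde
    i_star, j_star = n_tilde - 1, n_tilde
    deg = [n_tilde - 1] * n
    deg[i_star] = deg[j_star] = n_tilde
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            bridge = {i, j} == {i_star, j_star}
            if i == j or not ((i < n_tilde) == (j < n_tilde) or bridge):
                continue
            if scheme == "u":
                W[i][j] = 0.5 * (1.0 / (n * deg[i]) + 1.0 / (n * deg[j]))
            else:
                W[i][j] = (1.0 if bridge else 2.0 / n_tilde) / (2 * (n - 1))
    for i in range(n):
        W[i][i] = 1.0 - sum(W[i])
    return W

def second_largest_eigenvalue(W, iters=5000):
    """Power iteration restricted to the complement of the all-ones vector.
    Valid here because W is symmetric, doubly stochastic and positive semidefinite."""
    n = len(W)
    x = [float(i) for i in range(n)]
    lam = 0.0
    for _ in range(iters):
        mean = sum(x) / n
        x = [v - mean for v in x]                    # remove the eigenvalue-1 component
        norm = math.sqrt(sum(v * v for v in x))
        x = [v / norm for v in x]
        y = [sum(W[r][c] * x[c] for c in range(n)) for r in range(n)]
        lam = sum(a * b for a, b in zip(x, y))       # Rayleigh quotient of the unit vector x
        x = y
    return lam

lam_u = second_largest_eigenvalue(barbell_wbar(4, "u"))
lam_r = second_largest_eigenvalue(barbell_wbar(4, "r"))
eps = 1e-3
scale_u = math.log(1 / eps) / math.log(1 / lam_u)
scale_r = math.log(1 / eps) / math.log(1 / lam_r)
```

Increasing `n_tilde` in this sketch gives a rough numerical view of how the gap between the two schemes grows with $n$; plain power iteration converges more slowly as the eigenvalues cluster near 1, so larger instances need more iterations.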

4.3 Proof of Result 3 via hitting and mixing times

Before giving a formal definition of the $\varepsilon$ -mixing time, we introduce the total variation (TV) distance between two probability measures $p$ and $q$ defined on the set of nodes $\mathcal{N}=\{1,2,\dots,n\}$ . TV distance between $p$ and $q$ is defined as $\|p-q\|_{TV}\triangleq\|p-q\|_{1}/2.$ Given a Markov chain $\mathcal{M}$ with a probability transition matrix $W$ and stationary distribution $\pi$ , $\varepsilon$ -mixing time is a measure of how many iterations are needed for the probability distribution of the chain to be $\varepsilon$ -close to the stationary distribution in the TV distance. A related notion is the hitting time which is a measure of how fast the Markov chain travels between any two nodes.

Definition 10.

(Mixing time and hitting times) Given $\varepsilon>0$ and a Markov chain with probability transition matrix $W$ and stationary distribution $\pi$ , the $\varepsilon$ -mixing time is defined as

[TABLE]

and the hitting time $H_{W}(i\to j)$ is the expected number of steps until the Markov chain reaches $j$ starting from $i$ .

Mixing times and averaging times are closely related. In fact, given a probability transition matrix $W$, it is known that $T_{ave}(\varepsilon,W)$ and $T_{mix}(\varepsilon,\tilde{W})$ admit the same bounds up to $n\log n$ factors [12, Theorem 7] for $\tilde{W}=\frac{I+W}{2}$. (Note that [12, Theorem 7] uses absolute time, whereas we use the number of node wake-ups to define the $\epsilon$-averaging and $\epsilon$-mixing times; therefore, we multiplied the $\log(n)$ factor in [12, Theorem 7] by $\sum_{i\in\mathcal{N}}r_{i}=2(n-1)$ to convert absolute times to numbers of node wake-ups.) Hence, designing algorithms with a smaller mixing time often leads to better algorithms for distributed averaging (see also [28]). It is also known that the mixing time is closely related to hitting times [39, Theorem 1.1].

Next, we show the first part of Result 3, i.e., that $T_{ave}(\varepsilon,P^{r})=\Theta(n^{2}\log(1/\varepsilon))$ is optimal among all $\mathcal{A}(P)$ with a symmetric $P$. Note that the symmetry of $P$ implies that it is doubly stochastic. For large $n$ and doubly stochastic $P$, by [12, Corollary 1], we have $T_{ave}(\varepsilon,P)=\Theta\left(\frac{n\log(1/\varepsilon)}{1-\lambda_{n-1}(P)}\right)$. On the other hand, Roch proved in [26, Section 3.3.1] that any symmetric doubly stochastic matrix $P$ on the barbell graph with $n$ nodes satisfies the bound $\frac{1}{1-\lambda_{n-1}(P)}=\Omega(n)$. Inserting this estimate into the expression for the averaging time, we obtain $T_{ave}(\varepsilon,P)=\Omega\left(n^{2}\log(1/\varepsilon)\right)$ for any $\mathcal{A}(P)$ with symmetric $P$ on barbell graphs.

We conclude that the averaging time of ER-based gossiping on the barbell graph, which satisfies $T_{ave}(\varepsilon,P^{r})=\Theta(n^{2}\log(1/\varepsilon))$ by Proposition 9 and Theorem 1, is optimal with respect to its dependence on $n$ and $\varepsilon$ among all symmetric choices of the matrix $P$.

Next, given any connected graph $\mathcal{G}$, we obtain a bound on the second largest eigenvalue of $\overline{W}_{P^{r}}$ and show that the averaging time with effective resistance weights satisfies $T_{ave}(\varepsilon,P^{r})=\mathcal{O}\left(\mathcal{D}n^{3}\log(1/\varepsilon)\right)$, where $\mathcal{D}$ is the diameter of the graph.

Theorem 11.

Let $\mathcal{G}$ be a graph with diameter $\mathcal{D}$ . The second largest eigenvalue of $\overline{W}_{P^{r}}$ satisfies $\lambda_{n-1}(\overline{W}_{P^{r}})\leq 1-\frac{1}{6\mathcal{D}n^{3}}.$

Proof.

It follows from our discussion in Section 4 that $\overline{W}_{P^{r}}$ is non-negative and doubly stochastic (see the paragraph before Lemma 4). Therefore, for analysis purposes, we can interpret $\overline{W}_{P^{r}}$ as the transition matrix of a Markov chain $\mathcal{M}$ whose stationary distribution $\pi$ is the uniform distribution. Our analysis is based on relating the eigenvalues of the matrix $\overline{W}_{P^{r}}$ to the hitting times of the Markov chain $\mathcal{M}$, where we follow the proof technique of [40, Lemma 2.1]. By Lemma 17 from the appendix, we get $H_{\overline{W}_{P^{r}}}(i\to j)\leq n\frac{2(n-1)}{R_{ij}}$ if $j\in\mathcal{N}_{i}.$ For any graph, it is also known that $\min_{i,j}R_{ij}\geq\frac{2}{n}.$ (This follows directly from Rayleigh’s monotonicity rule [5], which says that if an edge is removed from a graph, effective resistance on any edge can only increase. Therefore, the complete graph, where $R_{ij}=2/n$, provides a lower bound for $R_{ij}$; see also [41].) Therefore, for any neighbors $i$ and $j$, $H_{\overline{W}_{P^{r}}}(i\to j)\leq n^{2}(n-1).$ For any two vertices $i$ and $j$ with $i\neq j$, not necessarily neighbors, let $v_{0}(=i),v_{1},\dots,v_{\ell}(=j)$ be the shortest path connecting $i$ and $j$. Then, by the subadditivity property of hitting times, for any $i,j\in\mathcal{N}$, we obtain $H_{\overline{W}_{P^{r}}}(i\to j)\leq\ell n^{2}(n-1)\leq\mathcal{D}n^{2}(n-1)$. It follows from an analysis similar to [42] that

[TABLE]

From [42, eqn. (12.12)], we also have

[TABLE]

Combining this with the estimate (14) implies directly $\lambda_{n-1}(\overline{W}_{P^{r}})\leq 1-\frac{1}{6\mathcal{D}n^{3}},$ which proves the claim. ∎

Metropolis vs ER gossiping: Given a connected $\mathcal{G}=(\mathcal{N},\mathcal{E})$ , suppose there are no self-loops, i.e., $(i,i)\not\in\mathcal{E}$ for $i\in\mathcal{N}$ . Uniform weights $p^{u}_{j|i}=\frac{1}{d_{i}}$ can result in slow mixing on some graphs such as the barbell graph (see Proposition 9) or other graphs like lollipop graphs [4] which have both high degree and low degree nodes together. A popular alternative to uniform weights $\{p^{u}_{j|i}\}_{j\in\mathcal{N}_{i}}$ for $i\in\mathcal{N}$ is the Metropolis weights defined as

[TABLE]

Let $M$ denote the matrix whose entries are the Metropolis weights $M_{ij}$. The weights determined by the matrix ${\widetilde{M}}\triangleq\frac{I+M}{2}$, referred to as the lazy version of the Metropolis weights, are also popular in distributed optimization practice [40]. The matrix $\widetilde{M}$ is symmetric and positive semi-definite, unlike the matrix $M$, which may have negative eigenvalues close to $-1$ (this can be problematic for the convergence of distributed algorithms; see, e.g., [43]). Combined with uniform wake-up of nodes, this leads to the following wake-up probabilities for the Metropolis weights-based system: ${P_{ij}^{{\widetilde{m}}}}\triangleq\frac{1}{n}\widetilde{M}_{ij},$ and the associated matrix $\overline{W}_{P^{{\widetilde{m}}}}\triangleq\mathbb{E}_{P^{{\widetilde{m}}}}[{W^{(i,j)}}]=\sum_{ij}P_{ij}^{\widetilde{m}}{W^{(i,j)}}.$ In particular, for any connected graph $\mathcal{G}=(\mathcal{N},\mathcal{E})$ with $n$ nodes, we have the following guarantees from [40, Lemma 2.1] on the lazy Metropolis weights:

[TABLE]

By (9), we also have $\overline{W}_{P^{\widetilde{m}}}=(1-\frac{1}{n})I+\frac{1}{n}\widetilde{M}.$ Therefore, from (16), we get the bound $\lambda_{n-1}(\overline{W}_{P^{{\widetilde{m}}}})\leq 1-\frac{1}{71n^{3}}$ for any connected graph $\mathcal{G}$. We then conclude from Theorem 1 that the $\varepsilon$-averaging time of Metropolis weights-based gossiping on any graph is $\mathcal{O}(n^{3}\log(1/\varepsilon))$, again using the fact that $-\log(1-x)\approx x$ for $x$ close to $0$. That said, for barbell graphs, Metropolis weights perform similarly to uniform weights; both require $\Theta(n^{3}\log(1/\varepsilon))$ time, which is improved by the effective resistance-based weights to $\Theta(n^{2}\log(1/\varepsilon))$. This completes the proof of Result 3.

Remark 12.

Comparing the inequalities $\lambda_{n-1}(\overline{W}_{P^{r}})\leq 1-\frac{1}{6\mathcal{D}n^{3}}$ and $\lambda_{n-1}(\overline{W}_{P^{{\widetilde{m}}}})\leq 1-\frac{1}{71n^{3}}$ , we see that for $\mathcal{D}\leq 11$ , the upper bound on $\lambda_{n-1}(\overline{W}_{P^{r}})$ will be smaller than the upper bound for $\lambda_{n-1}(\overline{W}_{P^{{\widetilde{m}}}})$ . Therefore, performance bounds obtained on the $\varepsilon$ -averaging time through Theorem 1 for ER weights will be better than those of Metropolis weights by a (small) constant factor for $\mathcal{D}\leq 11$ .
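Spelled out, the comparison reduces to an inequality between the two denominators:

$$1-\frac{1}{6\mathcal{D}n^{3}}\leq 1-\frac{1}{71n^{3}}\iff 6\mathcal{D}\leq 71\iff\mathcal{D}\leq 11,$$

where the last equivalence uses that $\mathcal{D}$ is an integer ($6\cdot 11=66\leq 71<72=6\cdot 12$).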

5 Numerical Experiments

In this section, we demonstrate the benefits of using effective resistances for solving the consensus problem and also within DPGA-W [17] and EXTRA [43] algorithms for consensus optimization.

5.1 Consensus exploiting effective resistances

Gossiping algorithms have been studied extensively and there have been a number of approaches [28, 44, 45, 46, 47, 48, 49]. In light of Theorem 1, among all the algorithms $\mathcal{A}(P)$ with a symmetric $P$, the matrix $P^{\text{opt}}$ that minimizes the second largest eigenvalue, i.e., $\lambda_{n-1}(\overline{W}_{P})$, is the fastest. The gossiping algorithm $\mathcal{A}(P^{\text{opt}})$ with the optimal choice of the probability matrix $P^{\text{opt}}$ is called the Fastest Mixing Markov Chain (FMMC) in the literature [27]. In [12], Boyd et al. propose a distributed subgradient method to compute the matrix $P^{\text{opt}}$. This method requires a decaying step size and computation of the subgradient of the objective $\lambda_{n-1}(\overline{W}_{P})$ with respect to the decision variable $P$ at every iteration, which itself requires solving a consensus problem at every iteration. This can be expensive in practice in terms of the average number of communications required per node, and its convergence to $P^{\text{opt}}$ can be slow, with at most a sublinear convergence rate [12]. In contrast, ER probabilities $P^{r}$ are optimal for some graphs (such as the barbell graph, see Result 3) and can be computed efficiently with the normalized D-RK algorithm (see the Supplementary Material), which admits linear convergence guarantees. Therefore, ER weights can serve as a computationally efficient alternative to optimal weights for consensus. To illustrate this point, we compare the communication requirements per node for ER gossiping and FMMC on barbell and small-world graphs. This comparison consists of two stages: $(i)$ a pre-computation stage (where the probability matrices $P^{r}$ and $P^{\text{opt}}$ are computed up to a given tolerance) and $(ii)$ an asynchronous consensus stage (where we run ER and FMMC with the probability matrices $P^{r}$ and $P^{\text{opt}}$ obtained from the previous stage to solve a consensus problem).

First, we implement the subgradient method of [12] with the decaying step size $\alpha_{k}=R/k$, where $R$ is tuned to the graph to achieve the best performance, and we stop the computation of the FMMC matrix at step $k$ if the iterate $P_{k}$ satisfies $\frac{\|P_{k}-P^{\text{opt}}\|_{F}}{\|P^{\text{opt}}\|_{F}}\leq\epsilon_{1}$, where $\epsilon_{1}$ is the given precision level. (Footnote 6: The optimal probability matrix $P^{\text{opt}}$, which serves as a baseline in the stopping criterion, is estimated accurately by solving the semi-definite program (SDP) [12, eqn. (53)] directly using the CVX software [50] with a centralized method; the computations required to solve this SDP are not counted as part of the communication cost we report for FMMC in Tables 1-2.) Similarly, we compute $\mathcal{L}^{\dagger}$ for ER and stop the normalized D-RK algorithm when the iterate $X^{k}$ at step $k$ satisfies $\frac{\|X^{k}-\mathcal{L}^{\dagger}\|_{F}}{\|\mathcal{L}^{\dagger}\|_{F}}\leq\epsilon_{1}$. Since the distributed subgradient method of [12] is based on synchronous computations, we also implemented the normalized D-RK algorithm with synchronous computations for fairness of comparison. We define a communication for a node as a contact with one of its neighbors, either to compute an average of their state vectors or to update the matrix $P_{k}$ at any iteration.

We compared the two algorithms based on their communication performance in stage-$i$ and stage-$ii$. In particular, for stage-$i$ we considered the number of communications required per node to obtain the matrix $P_{k}$ for ER and FMMC; for stage-$ii$, we generated 1000 instances of $y_{i}^{0}$ to start consensus and compared the average number of communications per node required to achieve $y^{k}$ satisfying $\frac{\|y^{k}-\bar{y}\|}{\|\bar{y}\|}\leq\epsilon_{2}$, where $\epsilon_{2}$ is the tolerance level.

For the barbell graph, the initial state vector $y_{i}^{0}$ for consensus is sampled from the normal distribution $\textbf{N}(500,10)$ if $i\in\mathcal{N}_{L}$ and from $\textbf{N}(-500,10)$ if $i\in\mathcal{N}_{R}$ where tolerance levels are set to be $\epsilon_{1}=\epsilon_{2}=0.01$ . We also compare ER and FMMC on small-world graphs while the number of nodes $n$ is varied with an edge density $\frac{2m}{n^{2}-n}\approx 0.4$ where $m$ is the total number of edges. On small-world graphs we generated $1000$ instances of $y_{i}^{0}$ drawn from $\textbf{N}(0,100)$ and stopped algorithms whenever tolerance levels $\epsilon_{1}=\epsilon_{2}=0.05$ are obtained or the number of communications per node exceeded $10^{6}$ .

Results for both graph types are reported in Tables 1 and 2, in which we compare the average communication per node in the pre-computation (stage-$i$) and in the consensus computation (stage-$ii$), where results are averaged over 1000 runs. On the barbell graph, we observe that FMMC requires fewer communications at the second (consensus) stage, as expected (since FMMC is based on the optimal matrix $P^{\text{opt}}$), but in terms of total communications (stage-$i$ + stage-$ii$) ER outperforms FMMC. In the case of small-world graphs, the computation of $P^{\text{opt}}$ exceeded the maximum communication limit, which caused FMMC to perform worse than ER in stage-$ii$ (since the stage-$i$ solution is then no longer a precise approximation of $P^{\text{opt}}$). Overall, ER performs better than FMMC in terms of total communications for both graph types.

In addition to FMMC, we consider Metropolis weights-based gossiping and the fastest quantum gossiping (FQG) proposed by Jafarizadeh in [51]. In the Metropolis-based gossiping approach, each node $i$ wakes up with uniform probability (i.e., $p_{i}^{m}=\frac{1}{n}$) and communicates with one of its neighbors $j\in\mathcal{N}_{i}\setminus\{i\}$ with probability $p_{j|i}^{m}=\frac{1}{\max\{d_{i},d_{j}\}}$. The FQG, on the other hand, calculates the wake-up probabilities ($p^{f}_{i}$) and the conditional communication probabilities ($p^{f}_{j|i}$) at each agent $i\in\mathcal{N}$ by solving an SDP. This SDP is designed to optimize the spectral gap of the expected iteration matrix. The method proposed in [51] for solving this SDP is a centralized algorithm; therefore, we compare these methods in terms of the time required to compute the probability matrices $P^{r}$ and $P^{f}$ in a centralized manner. The entries of these matrices are computed according to $[P^{f}]_{ij}:=p^{f}_{i}p_{j|i}^{f}$ and $[P^{r}]_{ij}:=p^{r}_{i}p_{j|i}^{r}$. For the Metropolis weights, the probability matrix $P^{m}$ does not require any pre-computation time, as it is based only on the degrees of the nodes and is assumed to be known. We also introduce the spectral gaps $\Delta_{r}:=1-\lambda_{n-1}(\overline{W}_{P^{r}})$, $\Delta_{f}:=1-\lambda_{n-1}(\overline{W}_{P^{f}})$, and $\Delta_{m}:=1-\lambda_{n-1}(\overline{W}_{P^{m}})$ of the corresponding expected iteration matrices.
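The spectral gaps defined above can be evaluated numerically on a small barbell graph. The following is a minimal sketch, not the code used for the tables. It assumes that the expected iteration matrix $\overline{W}_{P}$ has off-diagonal entries $(P_{ij}+P_{ji})/2$ and rows summing to one (which is consistent with the entries listed in the proof of Lemma 14), and that $P^{r}_{ij}$ is proportional to $R_{ij}$. The helper names `barbell_adjacency`, `effective_resistances`, `probability_matrices` and `spectral_gap` are hypothetical. FQG is omitted because it requires an SDP solver.

```python
import numpy as np


def barbell_adjacency(n_tilde):
    """Adjacency matrix of K_n~ - K_n~: two cliques joined by one bridge edge."""
    n = 2 * n_tilde
    A = np.zeros((n, n))
    A[:n_tilde, :n_tilde] = 1.0
    A[n_tilde:, n_tilde:] = 1.0
    np.fill_diagonal(A, 0.0)
    A[n_tilde - 1, n_tilde] = A[n_tilde, n_tilde - 1] = 1.0  # bridge (i*, j*)
    return A


def effective_resistances(A):
    """R_ij on the edges, computed from the pseudo-inverse of the Laplacian."""
    lap_pinv = np.linalg.pinv(np.diag(A.sum(axis=1)) - A)
    d = np.diag(lap_pinv)
    R = d[:, None] + d[None, :] - 2.0 * lap_pinv
    return R * (A > 0)  # zero out non-edges


def probability_matrices(A):
    """Uniform, Metropolis and ER probability matrices with entries p_i * p_{j|i}."""
    n = A.shape[0]
    deg = A.sum(axis=1)
    edge = (A > 0).astype(float)
    P_u = edge / (n * deg[:, None])                # p_i = 1/n, p_{j|i} = 1/d_i
    P_m = edge / (n * np.maximum.outer(deg, deg))  # p_{j|i} = 1/max(d_i, d_j)
    R = effective_resistances(A)
    P_r = R / R.sum()                              # assumption: P_ij proportional to R_ij
    return P_u, P_m, P_r


def spectral_gap(P):
    """1 - lambda_{n-1} of the (assumed) expected iteration matrix built from P."""
    S = 0.5 * (P + P.T)
    W_bar = S + np.diag(1.0 - S.sum(axis=1))       # rows sum to one
    eigs = np.linalg.eigvalsh(W_bar)               # ascending order
    return 1.0 - eigs[-2]
```

For example, `spectral_gap(probability_matrices(barbell_adjacency(5))[2])` returns the ER gap $\Delta_{r}$ on $K_{5}-K_{5}$, which can be compared with the uniform and Metropolis gaps obtained from the other two matrices.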

In our next set of experiments, we consider barbell graphs and random graphs generated with the stochastic block model (SBM). The stochastic block model $\mbox{SBM}(n,k,p,q)$, also known as the planted partition model [52, 53], consists of $n$ nodes and $k$ clusters, where each node in a cluster is connected to any other node in the same cluster with probability $p$, whereas nodes that are not in the same cluster are connected with probability $q$.
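A graph from $\mbox{SBM}(n,k,p,q)$ can be sampled as in the sketch below. The round-robin assignment of nodes to near-equal clusters is an assumption (the text does not specify cluster sizes), the function name `sbm_adjacency` is hypothetical, and the sampled graph is not guaranteed to be connected.

```python
import numpy as np


def sbm_adjacency(n, num_clusters, p, q, rng):
    """Sample a symmetric 0/1 adjacency matrix from SBM(n, k, p, q).

    Assumption: node v belongs to cluster v mod k, so clusters have
    near-equal sizes. Connectivity of the result is not checked.
    """
    labels = np.arange(n) % num_clusters
    same_cluster = labels[:, None] == labels[None, :]
    edge_prob = np.where(same_cluster, p, q)
    # Draw each unordered pair once (strict upper triangle), then symmetrize.
    upper = np.triu(rng.random((n, n)) < edge_prob, 1)
    A = (upper | upper.T).astype(float)
    return A, labels
```

Calling `sbm_adjacency(30, 3, 0.9, 0.1, np.random.default_rng(0))` would give one instance with the parameters $(k,p,q)=(3,0.9,0.1)$ used for Table 5.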

We summarize our results for the barbell graph in Table 4 and for $\mbox{SBM}(n,3,0.9,0.1)$ in Table 5. A gossiping algorithm is faster if its spectral gap is larger. We observe that $\log(1/\Delta_{r})$ is smaller than $\log(1/\Delta_{m})$ and larger than $\log(1/\Delta_{f})$ as $n$ is varied; therefore, $\Delta_{r}$ is larger than $\Delta_{m}$ and smaller than $\Delta_{f}$, and we conclude that ER is faster than Metropolis and slower than FQG. This is expected, as effective-resistance weights are not optimized to increase the spectral gap, whereas FQG weights are targeted to optimize it. However, when we look at the CPU time required to compute effective resistances and FQG weights, reported in the columns titled “CPU Time ER” and “CPU Time FQG”, we observe that effective resistance weights can be computed faster, as this only requires a matrix inversion, whereas the FQG algorithm requires solving a semi-definite program (SDP). (Footnote 7: We used the SDP solver SeDuMi in the software CVX to compute FQG weights and the matrix inversion function in Matlab to compute ER weights in a centralized manner.) The advantage of using the ER weights is that they are faster to compute, and this effect becomes more pronounced for larger graphs. Moreover, ER weights can be computed efficiently (with a linear convergence rate) and asynchronously in the decentralized setting using the randomized Kaczmarz algorithm. We note that solving SDPs in the decentralized setting is possible with subgradient methods, as discussed in [12], but these are typically much slower, as they admit at most sublinear rates. This fact is also reflected in our results in Table 1 and Table 2 for the FMMC method, which used a subgradient method to compute the weights. To summarize, the Metropolis weights require no pre-computation, but they are the slowest in terms of the spectral gap. ER is faster than Metropolis but slower than FQG; its advantage is that computing ER weights requires less CPU time. ER weights can also be computed efficiently in the decentralized setting with a linearly convergent algorithm.
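To give a feel for how a Kaczmarz-type iteration recovers an effective resistance, the sketch below applies the generic (centralized) randomized Kaczmarz method to the consistent system $\mathcal{L}x=e_{i}-e_{j}$. Started from zero, the iterates stay in the row space of $\mathcal{L}$ and approach $\mathcal{L}^{\dagger}(e_{i}-e_{j})$, so that $R_{ij}=x_{i}-x_{j}$. This is an illustration only and is not the normalized D-RK algorithm of the Supplementary Material; the function names are hypothetical.

```python
import numpy as np


def barbell_laplacian(n_tilde):
    """Unweighted Laplacian of the barbell graph K_n~ - K_n~."""
    n = 2 * n_tilde
    A = np.zeros((n, n))
    A[:n_tilde, :n_tilde] = 1.0
    A[n_tilde:, n_tilde:] = 1.0
    np.fill_diagonal(A, 0.0)
    A[n_tilde - 1, n_tilde] = A[n_tilde, n_tilde - 1] = 1.0
    return np.diag(A.sum(axis=1)) - A


def kaczmarz_effective_resistance(lap, i, j, num_iters, rng):
    """Estimate R_ij by randomized Kaczmarz on lap @ x = e_i - e_j."""
    n = lap.shape[0]
    b = np.zeros(n)
    b[i], b[j] = 1.0, -1.0
    row_norms_sq = np.sum(lap * lap, axis=1)
    # Rows are sampled with probability proportional to their squared norm.
    rows = rng.choice(n, size=num_iters, p=row_norms_sq / row_norms_sq.sum())
    x = np.zeros(n)  # zero start keeps the iterates in the row space
    for r in rows:
        # Project x onto the hyperplane {z : lap[r] @ z = b[r]}.
        x += (b[r] - lap[r] @ x) / row_norms_sq[r] * lap[r]
    return x[i] - x[j]
```

On $K_{3}-K_{3}$, the estimate for the bridge edge approaches $1$ and the estimate for an edge inside a clique approaches $2/3$, matching the values derived in the proof of Lemma 16.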

Lastly, we compare the performance of ER-based asynchronous gossiping with Metropolis weights-based asynchronous gossiping (Metropolis) and classical asynchronous gossiping (Uniform) on small-world graphs, barbell graphs, and random graphs generated with the stochastic block model $\mbox{SBM}(n,k,p,q)$. We considered two types of ER-based gossiping algorithms: (i) the first algorithm, ER-Ex, uses the exact effective resistance probabilities, computed from the pseudo-inverse of the Laplacian matrix (with a centralized approach based on standard matrix inversion techniques); (ii) the second algorithm, ER-Kac, is based on the effective-resistance weights approximated by the decentralized Kaczmarz method.

In the experiments, each node $i$ possesses an initial vector $y_{i}^{(0)}\in\mathbb{R}^{5}$, and the goal is to approximate the node average $\bar{y}=\frac{1}{n}\sum_{i=1}^{n}y_{i}^{(0)}$. We draw the data $y_{i}^{(0)}$ randomly according to the standard multivariate normal distribution with zero mean and unit covariance matrix. In each trial, we record the number of wake-ups required to obtain the relative accuracy $\frac{\sum_{i=1}^{n}\|y_{i}^{(k)}-\bar{y}\|^{2}}{n\|\bar{y}\|^{2}}\leq\varepsilon$. We set $\varepsilon=5\times 10^{-6}$ and generated 250 independent runs. We calculated the average and the standard deviation of the number of wake-ups among these 250 runs on SBM, barbell, and small-world graphs. We present our results in Table 3 (see Figure 4 for the details of these graphs).
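A single trial of this experiment can be sketched as follows. The simulation assumes the standard pairwise-averaging model: at each wake-up, the ordered pair $(i,j)$ is drawn with probability $P_{ij}=p_{i}p_{j|i}$ and both nodes replace their vectors by the average of the two; any probability mass missing from $P$ is treated as an idle wake-up. The names `gossip_wakeups`, `uniform_probabilities` and `barbell_adjacency` are hypothetical.

```python
import numpy as np


def barbell_adjacency(n_tilde):
    """Adjacency matrix of the barbell graph K_n~ - K_n~."""
    n = 2 * n_tilde
    A = np.zeros((n, n))
    A[:n_tilde, :n_tilde] = 1.0
    A[n_tilde:, n_tilde:] = 1.0
    np.fill_diagonal(A, 0.0)
    A[n_tilde - 1, n_tilde] = A[n_tilde, n_tilde - 1] = 1.0
    return A


def uniform_probabilities(A):
    """P_ij = (1/n) * (1/d_i) on edges: uniform wake-up, uniform neighbor choice."""
    n = A.shape[0]
    return (A > 0) / (n * A.sum(axis=1)[:, None])


def gossip_wakeups(P, y0, eps, rng, max_wakeups=10**6):
    """Count wake-ups until sum_i ||y_i - ybar||^2 <= eps * n * ||ybar||^2.

    Returns (number of wake-ups, final states), or (None, states) if the
    accuracy is not reached within max_wakeups.
    """
    n = P.shape[0]
    y = np.array(y0, dtype=float)
    y_bar = y.mean(axis=0)
    threshold = eps * (n * np.sum(y_bar ** 2))
    idle = max(0.0, 1.0 - P.sum())           # leftover mass: nobody communicates
    probs = np.append(P.ravel(), idle)
    probs = probs / probs.sum()
    for k in range(max_wakeups + 1):
        if np.sum((y - y_bar) ** 2) <= threshold:
            return k, y
        idx = rng.choice(n * n + 1, p=probs)
        if idx < n * n:
            i, j = divmod(idx, n)
            y[i] = y[j] = 0.5 * (y[i] + y[j])  # pairwise averaging
    return None, y
```

Repeating such a trial over independent draws of the initial vectors and averaging the returned counts gives the kind of statistics reported in Table 3; replacing `uniform_probabilities` by ER or Metropolis probabilities changes only the matrix `P`.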

We observe that effective-resistance based algorithms (ER-Kac and ER-Ex) improve clearly upon the uniform (uniform weights-based) gossiping on all of the graph types in terms of both average wake-ups required and the standard deviation of the wake-ups. When we compare ER-Kac and ER-Ex with Metropolis weights, we observe that ER-Kac and ER-Ex are more efficient compared to Metropolis weights in the sense that they require a smaller number of wake-ups on average with a smaller standard deviation. The improvement is more pronounced for the graphs in Fig. 4(b)–4(e). These experimental results illustrate the effectiveness of our approach compared to existing approaches on a number of random graph topologies that can arise in practice.

5.2 Effective resistance-based DPGA-W and EXTRA

We implemented our ER-based communication framework within two state-of-the-art distributed algorithms, DPGA-W [17] and EXTRA [43], to solve regularized logistic regression problems over a barbell graph $K_{\tilde{n}}-K_{\tilde{n}}$ with $n=2\tilde{n}$ nodes. We solve $\min_{x\in\mathbb{R}^{p}}\sum_{i=1}^{n}f_{i}(x)$ with

[TABLE]

where $N_{s}$ is the number of samples at each node, and $\{(a_{i\ell},b_{i\ell})\}_{\ell=1}^{N_{s}}\subset\mathbb{R}^{p}\times\{-1,1\}$ for $i\in\mathcal{N}$ denote the set of feature vectors and corresponding labels. We let $p=20$ and $N_{s}=5$. For each $n\in\{20,40\}$ and $\sigma\in\{1,2\}$, we randomly generated $20$ i.i.d. instances of the problem in (17) by sampling $a_{i\ell}\sim\textbf{N}(\mathbf{1},\sigma^{2}\mathbf{I})$ independently from the normal distribution and setting $b_{i\ell}=-1$ if $1/(1+e^{-a_{i\ell}^{\top}\mathbf{1}})\leq 0.55$ and $b_{i\ell}=+1$ otherwise. Both algorithms are terminated after $10^{4}$ iterations. As a benchmark, we also solved each instance of (17) using MOSEK [54] within CVX [50]. We initialized the iterates by sampling each of the $p$ components uniformly from the interval $[500,510]$ for nodes in one $K_{\tilde{n}}$, and from $[-500,-490]$ for nodes in the other $K_{\tilde{n}}$. The results for $n=20$ and $n=40$ are displayed in Fig. 5 and Fig. 6, respectively. We plotted the relative suboptimality $\left\|\mathbf{x}^{k}-\mathbf{x}^{*}\right\|/\left\|\mathbf{x}^{*}\right\|$, the function value sequence $\sum_{i\in\mathcal{N}}f_{i}(x_{i}^{k})$ for the range $[0,10^{5}]$, and the consensus violation $\left\|\mathbf{x}^{k}-\bar{\mathbf{x}}^{k}\right\|/\sqrt{n}$, where $k$ denotes the (synchronous) communication round counter (in each communication round, neighboring nodes communicate with each other synchronously once) and $\mathbf{x}^{k}=[x_{i}^{k}]_{i\in\mathcal{N}}$ denotes the $k$th iterate; moreover, $\bar{\mathbf{x}}^{k}=\mathbf{1}\otimes\bar{x}^{k}$, $\bar{x}^{k}=\sum_{i\in\mathcal{N}}x_{i}^{k}/n$, $\mathbf{x}^{*}\triangleq\mathbf{1}\otimes x^{*}$, and $x^{*}$ is the minimizer of (17).
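The data-generation rule described above can be written compactly as in the sketch below. It only covers the sampling of the features $a_{i\ell}$ and the labels $b_{i\ell}$; the local objectives $f_{i}$ in (17) are not reproduced here. The function name `generate_node_data` and the array layout (nodes, samples, features) are assumptions made for illustration.

```python
import numpy as np


def generate_node_data(n, num_samples, p, sigma, rng):
    """Sample features a_il ~ N(1, sigma^2 I) and labels b_il in {-1, +1}.

    Returns a with shape (n, num_samples, p) and b with shape (n, num_samples).
    """
    a = 1.0 + sigma * rng.standard_normal((n, num_samples, p))
    # Logistic score 1 / (1 + exp(-a^T 1)), where a^T 1 is the sum of the features.
    score = 1.0 / (1.0 + np.exp(-a.sum(axis=2)))
    b = np.where(score <= 0.55, -1, 1)
    return a, b
```

With the values in the text, one instance for $n=20$ and $\sigma=1$ would be `generate_node_data(20, 5, 20, 1.0, np.random.default_rng(0))`.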

Both DPGA-W and EXTRA use a communication matrix $W$ that encodes the network topology. (Footnote 8: In DPGA-W, the stepsize parameter $\gamma_{i}$ is set to $1/\left\|\omega_{i}\right\|$ for $i\in\mathcal{N}$; see [17].) DPGA-W uses node-specific step-sizes initialized at $\approx 1/L_{i}$ for $i\in\mathcal{N}$, where $L_{i}$ denotes the Lipschitz constant of $\nabla f_{i}$, and we adopted the adaptive step-size strategy described in [17, Sec. III.D]. For EXTRA, we choose the constant step-size, common to all nodes, suggested in [43], i.e., we set the step size to $2\lambda_{\min}(\tilde{W})/\max_{i\in\mathcal{N}}{L_{i}}$, where $\tilde{W}=(\mathbf{I}+W)/2$.

For both algorithms, we compared two choices of $W$: $W^{u}$ based on uniform edge weights, and $W^{r}$ based on effective resistances. In DPGA-W, the graph Laplacian is adopted for uniform weights, i.e., $W^{u}=W^{u,\text{DPGA-W}}\triangleq\mathcal{L}$, while for the ER-based weights, we set $W^{r}=W^{r,\text{DPGA-W}}$, where $W^{r,\text{DPGA-W}}_{ii}\triangleq\sum_{j\in\mathcal{N}_{i}}R_{ij}$ for $i\in\mathcal{N}$, $W^{r,\text{DPGA-W}}_{ij}=-R_{ij}$ for $(i,j)\in\mathcal{E}$, and $W^{r,\text{DPGA-W}}_{ij}=0$ otherwise. For EXTRA, $W^{u,\text{EXTRA}}=\mathbf{I}-\mathcal{L}/\tau$ with $\tau=\lambda_{\max}(\mathcal{L})/2+\varepsilon$, where $\lambda_{\max}$ denotes the largest eigenvalue; on the other hand, $W^{r,\text{EXTRA}}=\mathbf{I}-W^{r,\text{DPGA-W}}/\tau$ with $\tau=\lambda_{\max}(W^{r,\text{DPGA-W}})/2+\varepsilon$. In both cases $\varepsilon=0.01$.
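The four communication matrices described above can be assembled as in the following sketch for a barbell graph. The effective resistances are computed centrally from the pseudo-inverse of the Laplacian, and the dictionary keys and function names are hypothetical.

```python
import numpy as np


def barbell_adjacency(n_tilde):
    """Adjacency matrix of the barbell graph K_n~ - K_n~."""
    n = 2 * n_tilde
    A = np.zeros((n, n))
    A[:n_tilde, :n_tilde] = 1.0
    A[n_tilde:, n_tilde:] = 1.0
    np.fill_diagonal(A, 0.0)
    A[n_tilde - 1, n_tilde] = A[n_tilde, n_tilde - 1] = 1.0
    return A


def effective_resistances(A):
    """R_ij on the edges, from the pseudo-inverse of the graph Laplacian."""
    lap_pinv = np.linalg.pinv(np.diag(A.sum(axis=1)) - A)
    d = np.diag(lap_pinv)
    return (d[:, None] + d[None, :] - 2.0 * lap_pinv) * (A > 0)


def communication_matrices(A, eps=0.01):
    """Uniform and ER-based matrices W for DPGA-W and EXTRA."""
    n = A.shape[0]
    R = effective_resistances(A)
    dpga_u = np.diag(A.sum(axis=1)) - A   # graph Laplacian
    dpga_r = np.diag(R.sum(axis=1)) - R   # ER-weighted Laplacian

    def to_extra(W_dpga):
        tau = np.linalg.eigvalsh(W_dpga)[-1] / 2.0 + eps
        return np.eye(n) - W_dpga / tau

    return {"dpga_u": dpga_u, "dpga_r": dpga_r,
            "extra_u": to_extra(dpga_u), "extra_r": to_extra(dpga_r)}
```

Because $\tau>\lambda_{\max}/2$, the eigenvalues of both EXTRA matrices lie in $(-1,1]$, and their rows sum to one since the rows of a Laplacian sum to zero.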

Figures 5 and 6 compare the performance of DPGA-W and EXTRA with effective resistance and uniform weights in terms of suboptimality, convergence in function values, and consensus violation for the barbell graphs $K_{10}-K_{10}$ and $K_{20}-K_{20}$, respectively; the reported results are averages over the 20 problem instances. The subfigures on the left of Figures 5 and 6 are for noise level $\sigma=1$, whereas those on the right are for $\sigma=2$. In Figures 5 and 6, we observe that using ER weights consistently and significantly improves upon the uniform weights for both EXTRA and DPGA-W on the logistic regression problem in terms of suboptimality, function values, and consensus violation. We also observe that with noisier data, DPGA-W is typically faster than EXTRA in terms of function values and suboptimality. This is because, when the noise level $\sigma$ gets larger, the local Lipschitz constants $L_{i}$ of the nodes exhibit higher variability, and DPGA-W adapts to this variability since it uses a node-specific step size adapted to $L_{i}$, whereas EXTRA uses a constant step size that is the same for all nodes. On the other hand, in terms of consensus violation, we see that EXTRA with ER weights typically outperforms DPGA-W with ER weights.

6 Conclusions

We obtained a number of theoretical guarantees for ER gossiping algorithms for the consensus problem for $c$ -barbell graphs and barbell graphs, and for arbitrary graphs depending on their diameter. The results fill a gap between the theory and practice of these methods. We also showed that these methods are effective for solving the consensus problem in practice over barbell graphs and small-world graphs. We provided numerical experiments demonstrating that using ER gossiping within EXTRA and DPGA-W methods improves their practical performance in terms of communication efficiency.

Acknowledgments

Bugra Can and Mert Gürbüzbalaban’s research was supported by the Office of Naval Research Award Number N00014-21-1-2244, and by the National Science Foundation (NSF) grants CCF-1814888 and DMS-2053485. N. Serhat Aybat’s research is supported by the grants NSF CMMI-1635106 and ARO W911NF-17-1-0298.

Appendix A Proof of Propositions 6 and 9

Proof of Proposition 6.

The proof is based on finding the subset $S$ of the vertex set of $c$ -barbell graph that determines the conductance, i.e. that solves the minimization problem (10). First, for any given $\mathcal{G}=(\mathcal{N},\mathcal{E})$ , the conductance of a subset $S\subset\mathcal{N}$ with respect to the probability transition matrix $W$ is defined as

[TABLE]

Notice that the definition (10) implies that we have $\Phi(W)=\min_{{S\subset\mathcal{N}:}\pi(S)\in(0,1/2]}\Phi_{S}(W)$. (Footnote 9: This follows after straightforward computations based on the fact that the Markov chain with transition matrix $W$ and stationary distribution $\pi$ is a reversible Markov chain, i.e., $\pi(S)\Phi_{S}(W)=\pi(S^{c})\Phi_{S^{c}}(W)$ for any $S$ with $\pi(S)\in(0,1)$.) With a slight abuse of notation, for a subgraph $\mathcal{H}_{0}$ with a vertex set $S_{0}$, we define $\Phi_{\mathcal{H}_{0}}(W)\triangleq\Phi_{S_{0}}(W)$. We say that a vertex set $S\subset\mathcal{N}$ on the graph $\mathcal{G}=(\mathcal{N},\mathcal{E},w)$ is a one-cut set if its complement $\mathcal{N}\setminus S$ is a connected subgraph of $\mathcal{G}$. Similarly, we define a two-cut set $S_{2}\subset\mathcal{N}$ to be a set whose complement $\mathcal{N}\setminus S_{2}$ consists of two disjoint non-empty connected subgraphs $\mathcal{H}_{1}$ and $\mathcal{H}_{2}$ of $\mathcal{G}$. We define

[TABLE]

For $c_{0}\in[2,c]$ , we also define

[TABLE]

Note that matrices $W_{P^{u}}$ and $W_{P^{r}}$ are symmetric and Markov chains with these transition matrices have the uniform distribution as a stationary distribution. Therefore, Lemmas 13 and 14 provided in Appendix B imply that a set $S$ with minimal conductance should be a one-cut set and has to be given by the vertices of a subgraph $\mathcal{G}_{c_{0}}$ for some $c_{0}\in[1,c]$ for both $W_{P^{u}}$ and $W_{P^{r}}$ . The conductance of one-cut subgraphs with respect to these transition matrices can be computed explicitly (see Proof of Lemma 14 for details):

[TABLE]

Both of the expressions at (21) are minimized for the choice of $c_{0}=\lfloor\frac{c}{2}\rfloor$ . Therefore, the minimal conductance is attained for the subgraph $\mathcal{G}_{\lfloor\frac{c}{2}\rfloor}$ . Plugging $c_{0}=\lfloor\frac{c}{2}\rfloor$ into the expressions above yields the graph conductance values at (12). The bounds (7) and (8) follow from Theorem 1 and inequalities (13). ∎
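For a small instance, the claim that the minimum conductance is attained at a block of complete subgraphs can be checked by brute force. The sketch below does this for the barbell graph ($c=2$) with ER weights. It assumes the uniform-stationary-distribution form $\Phi_{S}(W)=\frac{1}{|S|}\sum_{i\in S,\,j\in S^{c}}W_{ij}$ that appears in the proof of Lemma 14, and it uses the off-diagonal entries $R_{ij}/(2(n-1))$ for $\overline{W}_{P^{r}}$, which agree with the entries listed there. All function names are hypothetical.

```python
from itertools import combinations

import numpy as np


def barbell_adjacency(n_tilde):
    """Adjacency matrix of the barbell graph K_n~ - K_n~."""
    n = 2 * n_tilde
    A = np.zeros((n, n))
    A[:n_tilde, :n_tilde] = 1.0
    A[n_tilde:, n_tilde:] = 1.0
    np.fill_diagonal(A, 0.0)
    A[n_tilde - 1, n_tilde] = A[n_tilde, n_tilde - 1] = 1.0
    return A


def expected_er_matrix(A):
    """Off-diagonal part of the expected ER matrix: R_ij / (2 (n - 1)) on edges."""
    n = A.shape[0]
    lap_pinv = np.linalg.pinv(np.diag(A.sum(axis=1)) - A)
    d = np.diag(lap_pinv)
    R = (d[:, None] + d[None, :] - 2.0 * lap_pinv) * (A > 0)
    return R / (2.0 * (n - 1))


def conductance(W, S):
    """Phi_S(W) for a uniform stationary distribution."""
    S = list(S)
    S_c = [v for v in range(W.shape[0]) if v not in S]
    return W[np.ix_(S, S_c)].sum() / len(S)


def min_conductance(W):
    """Exhaustive search over all vertex sets S with 1 <= |S| <= n/2."""
    n = W.shape[0]
    best_set, best_val = None, np.inf
    for size in range(1, n // 2 + 1):
        for S in combinations(range(n), size):
            val = conductance(W, S)
            if val < best_val:
                best_set, best_val = S, val
    return best_set, best_val
```

On $K_{3}-K_{3}$ the search returns one of the two complete subgraphs, with conductance $\frac{1}{\tilde{n}}\cdot\frac{1}{2(n-1)}=\frac{1}{30}$. The search is exponential in $n$ and is only meant for very small graphs.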

Proof of Proposition 9.

It follows from Corollary 15 and Lemma 16 in Appendix B that the second largest eigenvalues of $\bar{W}_{P^{u}}$ and $\bar{W}_{P^{r}}$ are given by: $\lambda_{n-1}(\overline{W}_{P^{u}})=1-\frac{8}{n^{2}(n-2)}+\Theta(\frac{1}{n^{4}})$ and $\lambda_{n-1}(\overline{W}_{P^{r}})=1-\frac{1}{n(n-1)}-\Theta(\frac{1}{n^{3}})$ . This implies directly $\lambda_{n-1}(\overline{W}_{P^{r}})=1-\Theta(\frac{1}{n^{2}})$ and $\lambda_{n-1}(\overline{W}_{P^{u}})=1-\Theta(\frac{1}{n^{3}})$ , which completes the proof. ∎

Appendix B Supporting Results

Lemma 13.

Consider a reversible Markov chain on a $c$ -barbell graph with a uniform stationary distribution. Let $\mathcal{H}_{0}$ be a subgraph of $\mathcal{G}$ whose vertex set is a non-empty two-cut set $\mathcal{S}_{0}$ satisfying $|\mathcal{S}_{0}|\leq\frac{|\mathcal{N}|}{2}$ . Then, there exists another subgraph $\tilde{\mathcal{H}}_{0}$ of $\mathcal{G}$ such that $\Phi_{\tilde{\mathcal{H}}_{0}}(W)<\Phi_{\mathcal{H}_{0}}(W)$ .

Proof.

Let $C_{1}$ and $C_{2}$ be the vertex sets of two disjoint non-empty connected subgraphs within $\mathcal{N}\setminus\mathcal{S}_{0}$ satisfying $\mathcal{N}=C_{1}\cup\mathcal{S}_{0}\cup C_{2}$ . Note that $C_{1}\cap C_{2}=\emptyset$ implies either $|C_{1}\cup\mathcal{S}_{0}|\leq\frac{|\mathcal{N}|}{2}$ or $|C_{2}|\leq\frac{|\mathcal{N}|}{2}$ . Using the fact that the transition matrix $W$ of a reversible Markov chain with a uniform stationary distribution is symmetric, the definition (18) implies $\Phi_{C_{1}\cup\mathcal{S}_{0}}(W)=\Phi_{C_{2}}(W)$ . Without loss of generality, choose $\tilde{\mathcal{H}}_{0}$ to be the subgraph with vertices $\tilde{\mathcal{S}_{0}}=C_{1}\cup\mathcal{S}_{0}$ with $|C_{1}\cup\mathcal{S}_{0}|\leq\frac{|\mathcal{N}|}{2}$ (otherwise, pick the subgraph with vertex set $C_{2}$ instead), then

[TABLE]

which proves Lemma 13. ∎

Lemma 14.

Consider a Markov chain on a $c$ -barbell graph with a probability transition matrix $W$ . If $W=\overline{W}_{P^{u}}$ or $W=\overline{W}_{P^{r}}$ , then for any subgraph $\mathcal{H}_{0}$ having a one-cut vertex set $\mathcal{S}_{0}$ , there exists a subgraph $\mathcal{G}_{c_{0}}$ for some $c_{0}\in[1,c]$ such that $\Phi_{\mathcal{G}_{c_{0}}}(W)\leq\Phi_{\mathcal{H}_{0}}(W)$ where $\mathcal{G}_{c_{0}}$ is defined by (19) and (20).

Proof.

For any subgraph $\mathcal{H}_{0}$ having a one-cut vertex set $\mathcal{S}_{0}$, we can always find a subgraph $\mathcal{G}_{c_{0}}$ with vertex set $\mathcal{V}_{c_{0}}$ for some $c_{0}\in[1,c]$ such that either $\mathcal{V}_{c_{0}-1}\subset\mathcal{S}_{0}\subset\mathcal{V}_{c_{0}}$ or $\mathcal{V}_{c_{0}-1}\subset S^{c}_{0}\subset\mathcal{V}_{c_{0}}$ (with the convention that $\mathcal{G}_{c_{0}}$ is a singleton graph with a vertex set $\mathcal{V}_{0}$ consisting of a single node). Let $\mathcal{H}_{0}^{c}$ be the subgraph with vertex set $S^{c}_{0}$. Since $\Phi_{\mathcal{H}_{0}}(W)=\Phi_{\mathcal{H}^{c}_{0}}(W)$ for both $W=\overline{W}_{P^{r}}$ and $W=\overline{W}_{P^{u}}$, without loss of generality we can assume that $\mathcal{H}_{0}$ satisfies the property $\mathcal{V}_{c_{0}-1}\subset\mathcal{S}_{0}\subset\mathcal{V}_{c_{0}}$ (otherwise, we can replace $\mathcal{H}_{0}$ with $\mathcal{H}_{0}^{c}$ in the proof below). It follows after a straightforward computation (similar to the proof technique of Lemma 16) that the transition probability matrices $\overline{W}_{P^{u}}$ and $\overline{W}_{P^{r}}$ on $c-K_{\tilde{n}}$ admit the explicit formulas $[\overline{W}_{P^{u}}]_{i^{*}j^{*}}=\frac{1}{c{\tilde{n}}^{2}}$, $[\overline{W}_{P^{u}}]_{i^{*}j}=\frac{1}{2c{\tilde{n}}^{2}}\Big(\frac{2{\tilde{n}}-1}{{\tilde{n}}-1}\Big)$, $[\overline{W}_{P^{u}}]_{ij}=\frac{1}{c{\tilde{n}}({\tilde{n}}-1)}$, whereas $[\overline{W}_{P^{r}}]_{i^{*}j^{*}}=\frac{1}{2(c{\tilde{n}}-1)}$, $[\overline{W}_{P^{r}}]_{i^{*}j}=\frac{1}{{\tilde{n}}(c{\tilde{n}}-1)}$, $[\overline{W}_{P^{r}}]_{ij}=\frac{1}{{\tilde{n}}(c{\tilde{n}}-1)}$, where $i^{*}$ and $j^{*}$ denote two adjacent nodes belonging to different complete subgraphs of $c-K_{{\tilde{n}}}$, i.e., those with degree ${\tilde{n}}$, and $(i,j)\in\mathcal{E}$ or $(i^{*},j)\in\mathcal{E}$ such that $i$ and $j$ denote nodes in $c-K_{{\tilde{n}}}$ with degree ${\tilde{n}}-1$. Note that $[\overline{W}_{P^{r}}]_{i^{*}j^{*}}$ is greater than $[\overline{W}_{P^{u}}]_{i^{*}j^{*}}$, as in the $K_{\tilde{n}}-K_{\tilde{n}}$ case. Hence, for $W=\overline{W}_{P^{u}}$,

[TABLE]

In the case of $W=\overline{W}_{P^{r}}$, let $\mathcal{P}_{0}\subset\mathcal{S}_{0}$ be the subset of nodes in the subgraph $K_{{\tilde{n}}}$ that contains nodes from both $\mathcal{S}_{0}$ and $\mathcal{S}_{0}^{c}$; if no such $K_{{\tilde{n}}}$ exists, then $\mathcal{S}_{0}$ corresponds to a subgraph $\mathcal{G}_{c_{0}}$ for some $c_{0}\in[1,c]$. Now consider the former case, and let us denote $m_{0}\triangleq|\mathcal{P}_{0}|<{\tilde{n}}$. The number of edges between $\mathcal{P}_{0}$ and $\mathcal{S}_{0}^{c}$ is given by $m_{0}({\tilde{n}}-m_{0})$. This is due to the fact that each node in $\mathcal{P}_{0}$ has exactly $({\tilde{n}}-m_{0})$ edges that connect $\mathcal{S}_{0}$ to its complement. We also have $m_{0}({\tilde{n}}-m_{0})\geq\frac{{\tilde{n}}}{2}$ for ${\tilde{n}}\geq 2$. This yields $\Phi_{\mathcal{H}_{0}}(\overline{W}_{P^{r}})=\frac{1}{|\mathcal{S}_{0}|}\sum_{i\in\mathcal{S}_{0},\,j\in\mathcal{S}_{0}^{c}}[\overline{W}_{P^{r}}]_{ij}\geq\frac{1}{|\mathcal{S}_{0}|}\frac{m_{0}({\tilde{n}}-m_{0})}{{\tilde{n}}(c{\tilde{n}}-1)}\geq\frac{1}{c_{0}{\tilde{n}}}\frac{1}{2(c{\tilde{n}}-1)}=\Phi_{\mathcal{G}_{c_{0}}}(\overline{W}_{P^{r}})$. ∎

Corollary 15.

Under the setting of Proposition 8, assume that the weight matrix $w$ is normalized, i.e., $\sum_{j=1}^{n}w_{ij}=1$ for all $i\in\mathcal{N}$ . Then $W=w$ is a doubly stochastic matrix and the eigenvalues of $W$ become

• $\lambda_{a}=1$ with multiplicity one,

• $\lambda_{b}=-1+(A+G)+F$ with multiplicity one,

• $\lambda_{c}=D-C$ with multiplicity $2{\tilde{n}}-4$,

• $\lambda_{\pm}=\frac{1}{2}\Big(F+G-A\,\pm\,\sqrt{S}\Big)$,

where $A,B,C,D,E,F,G$ and $S$ are as in Proposition 8. Moreover, $\lambda_{+}$ satisfies

[TABLE]

and is the second largest eigenvalue, i.e. $\lambda_{n-1}(W)=\lambda_{+}$ .

Proof.

Since $w$ is normalized, Proposition 8 applies with $A+G+E=1$ and $B+F=1$ . Thus eigenvalues simplify to the forms given in the statement. Note that $\sqrt{S}=\sqrt{(F+G-A)^{2}-4(FG-BE-AF)}=\sqrt{(F-G+A)^{2}+4BE}\geq 0$ . Therefore, $\lambda_{+}$ satisfies (22). $\lambda_{a}=1$ is the unique largest eigenvalue since $W$ is stochastic. It remains to show that $\lambda_{+}$ is the second largest eigenvalue. Using (22), we can write $\lambda_{+}\geq\frac{1}{2}\big{(}F+G-A+|F-G+A|\big{)}.$ There are two cases: $F\geq(G-A)$ or $F<(G-A)$ . In both cases, we observe $\lambda_{+}\geq F\geq 0$ . Since $A+G+E=1$ , we also have $A+G-1=-E\leq 0$ . Therefore $\lambda_{b}=F-E\leq F\leq\lambda_{+}.$ Furthermore, $\lambda_{c}=D-C\leq F=D+({\tilde{n}}-2)C$ since $C\geq 0$ ; therefore $\lambda_{c}\leq F\leq\lambda_{+}$ . Finally, $\lambda_{+}\geq 0$ since $S\geq 0$ . Thus, $\lambda_{+}$ is non-negative and is the second largest eigenvalue. ∎

Lemma 16.

Consider the setting of Proposition 8:

$(i)$ If $W=\overline{W}_{P^{u}}$, then Proposition 8 applies with $A=A^{u}$, $B=B^{u}$, $C=C^{u}$, $D=D^{u}$ and $G=G^{u}$, where

[TABLE]

The second largest eigenvalue of $\overline{W}_{P^{u}}$ is given by $\lambda_{n-1}(\bar{W}_{P^{u}})=1-\frac{n^{2}+n-8}{2n^{2}(n-2)}+\frac{1}{8}\sqrt{S_{n}^{u}}=1-\frac{8}{n^{2}(n-2)}+\Theta(\frac{1}{n^{4}})$ , for $S_{n}^{u}=\frac{4n^{3}+24n^{2}-156n+192}{(0.5n-1)^{2}n^{3}}$ .

$(ii)$ If $W=\overline{W}_{P^{r}}$, then Proposition 8 applies with $A=A^{r}$, $B=B^{r}$, $C=C^{r}$, $D=D^{r}$ and $G=G^{r}$, where

[TABLE]

Moreover, the second largest eigenvalue of $\overline{W}_{P^{r}}$ is given by $\lambda_{n-1}(\overline{W}_{P^{r}})=1-\frac{1}{(n-1)}+\frac{1}{2}\sqrt{S_{n}^{r}}=1-\frac{1}{n(n-1)}-\Theta(\frac{1}{n^{3}})$, for $S_{n}^{r}=\frac{4n-8}{n(n-1)^{2}}$.

Proof of Lemma 16.

We first compute the entries of both matrices $P^{u}$ and $P^{r}$ explicitly for the barbell graph (i.e., $K_{\tilde{n}}-K_{\tilde{n}}$). The former can be found directly from the degrees of the nodes: $P^{u}_{ij}=\frac{1}{2{\tilde{n}}({\tilde{n}}-1)}$ if $i\notin\{i^{*},j^{*}\}$, and $P^{u}_{ij}=\frac{1}{2{\tilde{n}}^{2}}$ if $i\in\{i^{*},j^{*}\}$. Calculating $P^{r}$ requires us to find the effective resistances on the graph. The following definition of resistance allows us to calculate them using Cayley’s formula for complete graphs,

[TABLE]

A complete graph with $\tilde{n}$ vertices has ${\tilde{n}}^{{\tilde{n}}-2}$ spanning trees; therefore, the barbell graph has ${\tilde{n}}^{{\tilde{n}}-2}\times{\tilde{n}}^{{\tilde{n}}-2}={\tilde{n}}^{2{\tilde{n}}-4}$ spanning trees. Let $K$ be the number of spanning trees of the complete graph that contain a given edge; counting edge occurrences over all spanning trees gives $K\times\binom{{\tilde{n}}}{2}={\tilde{n}}^{{\tilde{n}}-2}({\tilde{n}}-1)$, so we have $K=2{\tilde{n}}^{{\tilde{n}}-3}$. This implies that the number of spanning trees of the barbell graph containing a given edge inside one of the complete subgraphs is $2{\tilde{n}}^{2{\tilde{n}}-5}$, whereas every spanning tree contains the edge $(i^{*},j^{*})$, so the corresponding number is ${\tilde{n}}^{2{\tilde{n}}-4}$. This implies $R_{ij}=1$ if $(i,j)\in\{(i^{*},j^{*}),(j^{*},i^{*})\}$, and $R_{ij}=\frac{2}{{\tilde{n}}}$ otherwise. Once we have explicit characterizations of $P^{u}$ and $P^{r}$, using Lemma 4 we can compute the entries of $\overline{W}_{P^{u}}$ and $\overline{W}_{P^{r}}$ to be as given in $(i)$ and $(ii)$. The second largest eigenvalues of $\overline{W}_{P^{u}}$ and $\overline{W}_{P^{r}}$ then follow from Corollary 15. ∎
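The spanning-tree counts, the resulting effective resistances, and the closed-form eigenvalue in part $(ii)$ can be cross-checked numerically. The sketch below counts spanning trees with the matrix-tree theorem (a cofactor of the Laplacian), computes $R_{ij}$ both as the fraction of spanning trees containing the edge and from the Laplacian pseudo-inverse, and compares $\lambda_{n-1}(\overline{W}_{P^{r}})$ with the stated formula. It assumes off-diagonal entries $R_{ij}/(2(n-1))$ for $\overline{W}_{P^{r}}$, as in the proof of Lemma 14 with $c=2$. The function names are hypothetical.

```python
import numpy as np


def barbell_adjacency(n_tilde):
    """Adjacency matrix of the barbell graph K_n~ - K_n~."""
    n = 2 * n_tilde
    A = np.zeros((n, n))
    A[:n_tilde, :n_tilde] = 1.0
    A[n_tilde:, n_tilde:] = 1.0
    np.fill_diagonal(A, 0.0)
    A[n_tilde - 1, n_tilde] = A[n_tilde, n_tilde - 1] = 1.0
    return A


def num_spanning_trees(A):
    """Matrix-tree theorem: determinant of the Laplacian with one row/column removed."""
    lap = np.diag(A.sum(axis=1)) - A
    return np.linalg.det(lap[1:, 1:])


def resistance_by_tree_count(A, i, j):
    """R_ij = (# spanning trees containing edge (i, j)) / (# spanning trees)."""
    total = num_spanning_trees(A)
    A_minus = A.copy()
    A_minus[i, j] = A_minus[j, i] = 0.0
    return (total - num_spanning_trees(A_minus)) / total


def er_matrix(A):
    """R_ij on the edges, from the pseudo-inverse of the Laplacian."""
    lap_pinv = np.linalg.pinv(np.diag(A.sum(axis=1)) - A)
    d = np.diag(lap_pinv)
    return (d[:, None] + d[None, :] - 2.0 * lap_pinv) * (A > 0)


def second_eigenvalue_er(n_tilde):
    """Numerical and closed-form lambda_{n-1} of the expected ER matrix."""
    A = barbell_adjacency(n_tilde)
    n = 2 * n_tilde
    S = er_matrix(A) / (2.0 * (n - 1))
    W_bar = S + np.diag(1.0 - S.sum(axis=1))
    numeric = np.linalg.eigvalsh(W_bar)[-2]
    closed = 1.0 - 1.0 / (n - 1) + 0.5 * np.sqrt((4.0 * n - 8.0) / (n * (n - 1) ** 2))
    return numeric, closed
```

For $\tilde{n}=4$, the count is ${\tilde{n}}^{2{\tilde{n}}-4}=256$ up to floating-point error, the bridge has $R=1$, and edges inside a clique have $R=2/\tilde{n}=0.5$.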

Lemma 17.

[55, Eqn. (2.2)] Let $W$ be the transition matrix of a Markov chain with stationary distribution $\pi$. Let $j$ be a neighbor of $i$, i.e., $j\in\mathcal{N}_{i}$; then $H_{W}(i\to j)\leq(\pi_{j}W_{ji})^{-1}.$

Discussions on The Momentum-Based Acceleration Methods and ER-based Gossiping

In the literature, there have been two main approaches to improve the performance of gossiping algorithms: (i) improving the communication weights, and (ii) modifying the averaging scheme, e.g., adding a momentum term. The ER-based approach belongs to the first category, whereas the papers [19, 20, 21] belong to the second category and propose alternative averaging techniques based on a momentum term. In momentum-based approaches, the next iterate $y_{i}^{k+1}$ at node $i\in\mathcal{N}$ depends not only on the current iterate $y_{i}^{k}$ but also on the previous iterate $y_{i}^{k-1}$ as well as on $\{y_{j}^{k},\,y_{j}^{k-1}\}_{j\in\mathcal{N}_{i}}$, i.e., the current and previous iterates of the neighbors of node $i$ (see, for example, [19]).

In the following discussion, we illustrate the benefits of momentum-based approaches and how they can be used together with effective resistance weights to improve performance. For the sake of simplicity of the argument, we consider the case when the updates are synchronous. In this case, if $y_{i}^{k}{\in\mathbb{R}}$ denotes the local estimate of the global average, $\frac{1}{n}\mathbf{1}^{\top}y^{0}$ , at node $i$ in iteration $k\geq 0$ , where $\mathbf{1}$ denotes the vector of ones, gossiping algorithms consist of updates of the form:

$$y^{k+1}=Wy^{k},\quad k\geq 0,\tag{23}$$

starting from the initial point $y^{0}{\in\mathbb{R}^{n}}$ , where $W$ is a doubly stochastic matrix. A common choice for the mixing matrix $W$ is

$$W=\mathbf{I}-\alpha L,\tag{24}$$

where $L=[L_{ij}]_{i,j\in\mathcal{N}}$ is a symmetric weighted Laplacian matrix and $\alpha>0$ is a scalar satisfying $\alpha<2/\|L\|$ (see, e.g., [43, Section 2.4]). For each $i\in\mathcal{N}$, $L_{ij}<0$ for all $j\in\mathcal{N}_{i}$, where $\mathcal{N}_{i}$ is the set of neighbors of the node $i\in\mathcal{N}$; $L_{ij}=0$ if $j\not\in\mathcal{N}_{i}\cup\{i\}$, and $L_{ii}=-\sum_{j\in\mathcal{N}_{i}}L_{ij}>0$. Different choices of the matrix $L$ give different algorithms. For example, uniform gossiping corresponds to the choice $L=L^{u}\in\mathbb{R}^{n\times n}$ (Footnote 10: instead of $L_{ij}=\frac{1}{d_{i}}$, for uniform gossip we set it as in (25) so that $L$ becomes symmetric) such that

[TABLE]

Similarly, we can study gossiping based on the ER-based weights in synchronous setting by considering the choice $L=L^{r}\in\mathbb{R}^{n\times n}$ such that

[TABLE]

where $R_{ij}$ is the effective resistance on the edge $(i,j){\in\mathcal{E}}$ , and $R_{i}:=\sum_{j\in\mathcal{N}_{i}}R_{ij}$ for all $i\in\mathcal{N}$ .

Gossiping algorithms with weighted Laplacian matrix $L$ are related to first-order, i.e., gradient-based, optimization algorithms. To illustrate this point further, consider the following convex quadratic optimization problem:

[TABLE]

where $y_{*}:=\bar{y}\textbf{1}\in\mathbb{R}^{n}$ and $\bar{y}=\frac{1}{n}\sum_{i=1}^{n}y_{i}^{0}\in\mathbb{R}$ is the global average that we want to compute. Noting that $Ly_{*}=\bar{y}L\textbf{1}=0$, the updates given in (23) with the choice of $W$ as in (24) can be viewed as applying a gradient descent update with step size $\alpha>0$ to the quadratic optimization problem in (27). From the standard theory of gradient descent, it is well known that $y^{k}$ converges to $y_{*}$ linearly at a rate $\rho(L):=1-\alpha\lambda_{\min}^{+}(L)\in[0,1)$ for $\alpha<2/\|L\|$, where $\lambda_{\min}^{+}(L)$ denotes the minimum positive eigenvalue of $L$, i.e., the second smallest eigenvalue for connected graphs. Therefore, we get the following non-asymptotic convergence:

$$\|y^{k}-y_{*}\|\leq\rho(L)^{k}\,\|y^{0}-y_{*}\|.$$

If we set the stepsize as $\alpha=\frac{1}{\lambda_{\max}(L)}$, where $\lambda_{\max}(L)=\|L\|$ denotes the largest eigenvalue of $L$, we get $\rho(L)=1-\frac{1}{\kappa(L)}$, where $\kappa(L):=\frac{\lambda_{\max}(L)}{\lambda_{\min}^{+}(L)}$ is called the condition number. When the condition number $\kappa(L)$ is very large, the convergence can be slow. Adding a momentum term is a technique to improve the convergence rate of gradient descent methods with respect to its dependency on the condition number. For example, Polyak’s heavy-ball (HB) method applied to the objective (27) consists of the iterations

$$y^{k+1}=y^{k}-\alpha Ly^{k}+\beta(y^{k}-y^{k-1}), \qquad (28)$$

where the last term $\beta(y^{k}-y^{k-1})$ is referred to as the momentum term and $\beta$ is called the momentum parameter (see, e.g., [19]). The convergence rate of the heavy-ball (HB) method on quadratic objectives of the form (27) has been well studied in the literature, and it can be shown that the heavy-ball iterations (28) converge to the consensus vector $y_{*}$ with the asymptotic linear convergence rate

[TABLE]

for a specific choice of the stepsize $\alpha$, provided that $\beta$ is tuned properly as a function of the eigenvalue $\lambda^{+}_{\min}(L)$ [19]. Achieving this rate with the choice of $\beta$ in [19] would require estimating $\lambda_{\min}^{+}(L)$. That being said, for ill-conditioned problems, i.e., when the condition number $\kappa(L)$ is sufficiently large, we observe that HB converges faster, i.e., $\rho_{HB}(L)<\rho(L)$. For example, for the barbell graph, with an analysis similar to that in Proposition 8, we can characterize the eigenvalues of the weighted graph Laplacians $L^{u}$ and $L^{r}$ that correspond to the uniform weights and ER-based weights given in (25) and (26), respectively, and obtain

[TABLE]

Therefore, without momentum averaging (when $\beta=0$ ), we obtain the convergence rates

[TABLE]

for uniform weights and ER-based weights, respectively. On the other hand, for the HB method, we obtain the rates

[TABLE]

We observe that the fastest rate is obtained by using the HB method on the quadratic problem in (27) defined by the weighted Laplacian corresponding to the ER weights, i.e., $\rho_{HB}^{r}$ is the fastest rate in terms of its dependence on $n$. This shows that ER weights can be used together with momentum averaging techniques. Specifically, from (29), we observe that the effective-resistance-based approach yields a better-conditioned Laplacian than uniform weights, and that further improvement can be achieved by employing momentum averaging. In other words, ER weights are needed to improve the conditioning of the weighted Laplacian matrix, and momentum-based approaches can be used on top of this to get further performance improvement. Besides the HB method, Nesterov’s accelerated gradient method is an alternative momentum-averaging-based technique that also yields similar accelerated convergence rates.
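The effect of momentum can be sketched numerically as follows. The example compares plain gradient descent ($\beta=0$, $\alpha=1/\lambda_{\max}(L)$) with heavy-ball iterations on a small barbell graph. Several ingredients are assumptions made for illustration: the unweighted Laplacian is used instead of $L^{u}$ or $L^{r}$, the heavy-ball parameters follow the classical tuning for quadratics, $\alpha=4/(\sqrt{\lambda_{\max}}+\sqrt{\lambda_{\min}^{+}})^{2}$ and $\beta=\big((\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)\big)^{2}$, which may differ from the exact choice in [19], and the names `barbell_laplacian` and `run_heavy_ball` are hypothetical.

```python
import numpy as np

def barbell_laplacian(m):
    """Unweighted Laplacian of two K_m cliques joined by a single bridge edge."""
    n = 2 * m
    A = np.zeros((n, n))
    A[:m, :m] = 1.0
    A[m:, m:] = 1.0
    np.fill_diagonal(A, 0.0)
    A[m - 1, m] = A[m, m - 1] = 1.0       # bridge between the two cliques
    return np.diag(A.sum(axis=1)) - A

def run_heavy_ball(L, y0, alpha, beta, num_iters):
    """y^{k+1} = y^k - alpha L y^k + beta (y^k - y^{k-1}), started with y^{-1} = y^0."""
    y_prev, y = y0.copy(), y0.copy()
    for _ in range(num_iters):
        y_next = y - alpha * (L @ y) + beta * (y - y_prev)
        y_prev, y = y, y_next
    return y

m = 5
L = barbell_laplacian(m)
eigs = np.linalg.eigvalsh(L)
mu, lam_max = eigs[1], eigs[-1]           # smallest positive and largest eigenvalue
kappa = lam_max / mu                      # condition number

y0 = np.concatenate([np.ones(m), np.zeros(m)])   # one clique holds 1, the other 0
y_star = np.full(2 * m, y0.mean())

# Gradient descent is the special case beta = 0.
y_gd = run_heavy_ball(L, y0, 1.0 / lam_max, 0.0, 100)

# Classical heavy-ball tuning for quadratics (an assumption, see the lead-in).
alpha_hb = 4.0 / (np.sqrt(lam_max) + np.sqrt(mu)) ** 2
beta_hb = ((np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)) ** 2
y_hb = run_heavy_ball(L, y0, alpha_hb, beta_hb, 100)

err_gd = np.linalg.norm(y_gd - y_star)
err_hb = np.linalg.norm(y_hb - y_star)
print("kappa =", kappa, "GD error =", err_gd, "HB error =", err_hb)
```

Starting with $y^{-1}=y^{0}$ keeps the average of the iterates unchanged, so both runs converge to the same consensus vector and only the speed differs.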

The discussion above was for the synchronous setup; the asynchronous setup can be analyzed similarly. (Footnote 11: In the asynchronous setup, at every iteration, node $i$ contacts a randomly chosen neighbor to update its decision variable rather than contacting all of its neighbors. In the case of the barbell graph, each node has $\Theta(n)$ neighbors and therefore needs on average $\Theta(n)$ iterations to contact all of them. Consequently, more iterations are required to converge than in the synchronous setup.) With an analysis similar to the one above, it can be shown that ER weights on barbell graphs lead to $\mathbb{E}\|y^{k}-y_{*}\|^{2}\leq\left[\rho_{async}^{r}\right]^{2k}\mathbb{E}\|y^{0}-y_{*}\|^{2}$, where $\rho^{r}_{async}=1-\Theta(\frac{1}{n^{3}})$, instead of $\rho^{r}=1-\Theta(\frac{1}{n^{2}})$ obtained above in (30). The rate $\rho^{r}_{async}$ also follows directly from Proposition 9.
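A minimal simulation of the asynchronous setup might look as follows. It assumes the standard pairwise-averaging form of randomized gossip, in which a uniformly random node wakes up, picks a neighbor uniformly at random, and both replace their values by the pair average. The uniform neighbor selection, the barbell size, and the names `barbell_neighbors` and `pairwise_gossip` are choices made for this sketch; ER-based selection probabilities would replace the uniform draw of the neighbor.

```python
import numpy as np

def barbell_neighbors(m):
    """Neighbor lists of two K_m cliques joined by the edge (m-1, m)."""
    nbrs = []
    for i in range(2 * m):
        block = range(m) if i < m else range(m, 2 * m)
        nbrs.append([j for j in block if j != i])
    nbrs[m - 1].append(m)
    nbrs[m].append(m - 1)
    return nbrs

def pairwise_gossip(neighbors, y0, num_iters, rng):
    """One random node wakes up per tick and averages with one random neighbor."""
    y = np.array(y0, dtype=float)
    n = len(y)
    for _ in range(num_iters):
        i = rng.integers(n)                               # node that wakes up
        j = neighbors[i][rng.integers(len(neighbors[i]))] # uniformly chosen neighbor
        y[i] = y[j] = 0.5 * (y[i] + y[j])                 # pairwise average
    return y

m = 5
neighbors = barbell_neighbors(m)
y0 = np.concatenate([np.ones(m), np.zeros(m)])
rng = np.random.default_rng(0)
y = pairwise_gossip(neighbors, y0, 5000, rng)

err0 = np.linalg.norm(y0 - y0.mean())
err = np.linalg.norm(y - y0.mean())
print("initial error =", err0, "error after 5000 ticks =", err)
```

Each pairwise average preserves the sum of the node values and cannot increase the distance to the consensus vector, which is why only the rate, not the limit, depends on how neighbors are selected.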

Further Discussions on Our Conductance Bounds and Averaging Time with Effective Resistances

We recall that the averaging time $T_{ave}(\varepsilon,P)$ with an expected iteration matrix $\bar{W}_{P}$ satisfies

[TABLE]

where $\lambda_{n-1}(\cdot)$ denotes the second-largest eigenvalue. Therefore, comparing effective-resistance (ER) weights with uniform weights amounts to comparing the second-largest eigenvalues $\lambda_{n-1}(\overline{W}_{P^{r}})$ and $\lambda_{n-1}(\overline{W}_{P^{u}})$, where $\overline{W}_{P^{r}}$ and $\overline{W}_{P^{u}}$ are the expected iteration matrices defined using ER and uniform weights, respectively. For barbell graphs (which correspond to the special case of $c$-barbell graphs with $c=2$), our analysis is tight, as we have developed an explicit formula for the second-largest eigenvalue of the matrix $\overline{W}_{P^{r}}$ as well as that of $\overline{W}_{P^{u}}$. However, for $c$-barbell graphs with $c>2$, the second-largest eigenvalues of the gossiping matrices $\overline{W}_{P^{r}}$ and $\overline{W}_{P^{u}}$ are not explicitly known. Therefore, in our paper, we resorted to conductance bounds, a common technique in the literature for obtaining lower and upper bounds on the second-largest eigenvalue $\lambda_{n-1}(\overline{W}_{P})$, and consequently on the spectral gap $\Delta:=1-\lambda_{n-1}(\overline{W}_{P})$, through the Cheeger inequalities. Based on this approach, we obtain the following lower and upper bounds for the spectral gaps $\Delta_{r}:=1-\lambda_{n-1}(\overline{W}_{P^{r}})$ and $\Delta_{u}:=1-\lambda_{n-1}(\overline{W}_{P^{u}})$ that correspond to ER and uniform weights, respectively:

[TABLE]

where $\Phi(\overline{W}_{P})$ denotes the graph conductance as defined in the paper for the reversible Markov chain corresponding to transition probability matrix $\overline{W}_{P}$ .
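To make the conductance-based approach concrete, the sketch below computes the conductance of a small chain by brute force and compares it with the spectral gap. It relies on assumptions that may differ from the paper's exact statement: the standard definition $\Phi=\min_{\pi(S)\leq 1/2}\sum_{i\in S,j\notin S}\pi_{i}P_{ij}/\pi(S)$ for a reversible chain with stationary distribution $\pi$, the classical Cheeger inequalities $\Phi^{2}/2\leq 1-\lambda_{n-1}\leq 2\Phi$, and a hypothetical symmetric chain $P=I-L/(d_{\max}+1)$ on a barbell graph rather than $\overline{W}_{P^{r}}$ or $\overline{W}_{P^{u}}$.

```python
import itertools
import numpy as np

def barbell_laplacian(m):
    """Unweighted Laplacian of two K_m cliques joined by a single bridge edge."""
    n = 2 * m
    A = np.zeros((n, n))
    A[:m, :m] = 1.0
    A[m:, m:] = 1.0
    np.fill_diagonal(A, 0.0)
    A[m - 1, m] = A[m, m - 1] = 1.0
    return np.diag(A.sum(axis=1)) - A

def conductance(P):
    """Brute-force conductance of a symmetric stochastic matrix P.

    For symmetric P the stationary distribution is uniform, so pi_i = 1/n."""
    n = P.shape[0]
    best = np.inf
    for size in range(1, n // 2 + 1):                 # subsets with pi(S) <= 1/2
        for S in itertools.combinations(range(n), size):
            S = list(S)
            Sc = [j for j in range(n) if j not in S]
            flow = P[np.ix_(S, Sc)].sum() / n         # sum of pi_i * P_ij across the cut
            best = min(best, flow / (size / n))       # divide by pi(S)
    return best

m = 4
L = barbell_laplacian(m)
d_max = L.diagonal().max()
P = np.eye(2 * m) - L / (d_max + 1.0)                 # symmetric, nonnegative, rows sum to 1

phi = conductance(P)
gap = 1.0 - np.linalg.eigvalsh(P)[-2]                 # spectral gap 1 - lambda_{n-1}(P)
print("Phi =", phi, "gap =", gap, "Cheeger interval =", (phi ** 2 / 2, 2 * phi))
```

The brute-force search visits every subset and is therefore only feasible for very small graphs; it is meant to illustrate the quantities involved, not to replace the analytical bounds.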

To illustrate the tightness of our bounds, we consider the approximation ratio, i.e., the ratio of these bounds on a logarithmic scale:

[TABLE]

We define $a_{u}^{lb}$ and $a_{u}^{ub}$ similarly for the uniform weights.

The closer the ratios ${a}_{r}^{lb}$ and ${a}_{r}^{ub}$ are to 1, the better the approximation quality. In Table 6, we illustrate the tightness of our bounds for $c$-barbell graphs $(c-K_{\tilde{n}})$, which consist of $c$ cliques with $\tilde{n}$ nodes each, by reporting ${a}_{r}^{lb}$ and ${a}_{r}^{ub}$. We also display the ratios ${a}_{u}^{lb}$ and ${a}_{u}^{ub}$ for uniform weights, which are computed similarly. The results illustrate that all the ratios lie in a reasonable range (in the interval $[0.80,1.95]$), with the lower bounds being tighter than the upper bounds. These results show that conductance-based analysis leads to useful approximations. In particular, we can see that the lower bounds become tighter ($a_{r}^{lb}$ increases) as the number of nodes in the graph increases.

As an additional experiment, we also computed the eigenvalues of $\overline{W}_{P^{r}}$ and $\overline{W}_{P^{u}}$ with the standard eigenvalue solver in Matlab 2021a (using the function eig with default settings). Using the second largest eigenvalues of $\overline{W}_{P^{r}}$ and $\overline{W}_{P^{u}}$ , we compute the times $T(\overline{W}_{P^{r}})$ and $T(\overline{W}_{P^{u}})$ required for both approaches. From (31), we see that

[TABLE]

In Figure 7, we plot the ratio on the right-hand side for the $c$-barbell graph, denoted as $c-K_{\tilde{n}}$. For several fixed values of $c$, we vary $\tilde{n}$ and observe that the ratio $\frac{\log([\lambda_{n-1}(\overline{W}_{P^{r}})])}{\log([\lambda_{n-1}(\overline{W}_{P^{u}})])}$ is always larger than 1 and grows as $\tilde{n}$ increases. This shows that ER weights admit better (smaller) averaging times, especially for large networks; i.e., the performance gain becomes more and more significant as the number of nodes $\tilde{n}$ per clique increases. In light of these experiments, we can conclude that ER weights are superior to uniform weights from a numerical perspective as well.
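A small-scale version of this experiment can be sketched in Python (NumPy here, rather than the Matlab setup mentioned above). The sketch rests on assumptions that are not restated in this section: the expected iteration matrix is taken in the standard form $\overline{W}_{P}=I-\frac{1}{2n}D+\frac{1}{2n}(P+P^{\top})$ with $D_{ii}=\sum_{j}(P_{ij}+P_{ji})$ for randomized pairwise gossip, uniform gossip uses $P_{ij}=1/d_{i}$, ER gossip uses $P_{ij}=R_{ij}/R_{i}$, and effective resistances are computed from the pseudoinverse of the unweighted Laplacian. All function names are hypothetical.

```python
import numpy as np

def barbell_adjacency(m):
    """Adjacency matrix of two K_m cliques joined by the edge (m-1, m)."""
    n = 2 * m
    A = np.zeros((n, n))
    A[:m, :m] = 1.0
    A[m:, m:] = 1.0
    np.fill_diagonal(A, 0.0)
    A[m - 1, m] = A[m, m - 1] = 1.0
    return A

def effective_resistances(A):
    """R_ij = Lp_ii + Lp_jj - 2 Lp_ij with Lp the Laplacian pseudoinverse (edges only)."""
    L = np.diag(A.sum(axis=1)) - A
    Lp = np.linalg.pinv(L)
    d = np.diag(Lp)
    R = d[:, None] + d[None, :] - 2.0 * Lp
    return R * (A > 0)                    # zero out non-edges

def expected_gossip_matrix(P):
    """Assumed form: W = I - (D - (P + P^T)) / (2n), D = diag of row sums of P + P^T."""
    n = P.shape[0]
    Wsum = P + P.T
    D = np.diag(Wsum.sum(axis=1))
    return np.eye(n) - (D - Wsum) / (2.0 * n)

def second_largest_eig(W):
    return np.linalg.eigvalsh(W)[-2]      # W is symmetric

m = 5
A = barbell_adjacency(m)
P_u = A / A.sum(axis=1, keepdims=True)    # uniform neighbor selection
R = effective_resistances(A)
P_r = R / R.sum(axis=1, keepdims=True)    # ER-proportional neighbor selection

lam_u = second_largest_eig(expected_gossip_matrix(P_u))
lam_r = second_largest_eig(expected_gossip_matrix(P_r))
ratio = np.log(lam_r) / np.log(lam_u)
print("lambda_u =", lam_u, "lambda_r =", lam_r, "log ratio =", ratio)
```

In this example the bridge edge has effective resistance 1 while every edge inside a clique has resistance $2/\tilde{n}$, so ER-based selection makes the bridge nodes use the bottleneck edge more often than uniform selection does.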

Normalized D-RK Algorithm

The D-RK method for computing the effective resistances in a decentralized way and its normalized version, which we call normalized D-RK, were introduced in [14], where the authors show that these methods converge linearly with rates

[TABLE]

respectively where $\lambda_{\min}^{+}(\cdot)$ denotes the smallest positive eigenvalue and $S$ is a normalization matrix defined as

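$$S:=\operatorname{diag}(s_{1},\ldots,s_{n}),\qquad s_{i}:=\sum_{j\in\mathcal{N}_{i}\cup\{i\}}\mathcal{L}_{ij}^{2}. \qquad (34)$$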

Based on numerical evidence, it was conjectured in [14] that normalized D-RK is faster than D-RK, i.e., $\rho_{S}\leq\rho$. We first provide a technical result; the proposition that follows then proves this conjecture.

Lemma 18.

The Laplacian $\mathcal{L}$ has the following property: $\frac{1}{n^{2}}\sum_{i=1}^{n}\frac{1}{s_{i}}\geq\frac{1}{||\mathcal{L}||_{F}^{2}}$ , where $s_{i}$ is defined by (34).

Proof.

Note that $||\mathcal{L}||_{F}^{2}=\sum_{i=1}^{n}\sum_{j=1}^{n}\mathcal{L}_{ij}^{2}=\sum_{i=1}^{n}\sum_{j\in\mathcal{N}_{i}\cup\{i\}}\mathcal{L}_{ij}^{2}=\sum_{i=1}^{n}s_{i},$ where we used the fact that $\mathcal{L}_{ij}=0$ for all $i\neq j$ with $(i,j)\notin\mathcal{E}$. Applying the arithmetic-harmonic mean inequality to the sequence $\{s_{i}\}_{i\in\{1,\ldots,n\}}$, we obtain $\frac{1}{n}||\mathcal{L}||_{F}^{2}=\frac{1}{n}\sum_{i=1}^{n}s_{i}\geq n\Big[\sum_{i=1}^{n}\frac{1}{s_{i}}\Big]^{-1}$. Since both sides are positive, taking reciprocals gives $\frac{1}{n}\sum_{i=1}^{n}\frac{1}{s_{i}}\geq\frac{n}{||\mathcal{L}||_{F}^{2}}$, and we conclude by multiplying both sides by $1/n$.∎
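The inequality of Lemma 18 is easy to check numerically. The sketch below does so on a star graph with five nodes, taking $s_{i}$ to be the squared Euclidean norm of the $i$-th row of $\mathcal{L}$, as used in the proof above. The star graph and the names `star_laplacian` and `lemma18_sides` are illustrative assumptions.

```python
import numpy as np

def star_laplacian(n):
    """Unweighted Laplacian of a star graph: node 0 is connected to all others."""
    A = np.zeros((n, n))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    return np.diag(A.sum(axis=1)) - A

def lemma18_sides(L):
    """Return (1/n^2) * sum_i 1/s_i and 1/||L||_F^2, with s_i the squared row norms."""
    n = L.shape[0]
    s = (L ** 2).sum(axis=1)
    lhs = (1.0 / s).sum() / n ** 2
    rhs = 1.0 / np.linalg.norm(L, "fro") ** 2
    return lhs, rhs

L = star_laplacian(5)
lhs, rhs = lemma18_sides(L)
print("lhs =", lhs, "rhs =", rhs)
```

For this graph the center has $s_{1}=4^{2}+4=20$ and each leaf has $s_{i}=2$, so the left-hand side equals $(1/20+4/2)/25=0.082$ and the right-hand side equals $1/28$.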

We are now ready to prove the conjecture.

Proposition 19.

For $S$ defined by (34), the following inequality holds: $\frac{1}{n}\lambda_{\min}^{+}(\mathcal{L}S^{-1}\mathcal{L})\geq\left(\frac{\lambda_{\min}^{+}(\mathcal{L})}{\|\mathcal{L}\|_{F}}\right)^{2}$ . Then, it follows that $\rho_{S}\leq\rho$ where $\rho$ and $\rho_{S}$ are defined by (33).

Proof.

Since $\mathcal{L}$ and $S$ are symmetric matrices, so are $\mathcal{L}^{2}$ and $S^{-1}$. Let $\{\lambda_{i}(\mathcal{L})\}_{i=1}^{n}$ and $\{\lambda_{i}(\mathcal{S})\}_{i=1}^{n}$ denote the eigenvalues of these matrices sorted in increasing order, i.e., $\lambda_{n}$ is the largest eigenvalue and $\lambda_{1}$ is the smallest one. By the eigenvalue interlacing result in [56, Chapter 2, Eq. (2.0.7)], we obtain (Footnote 12: we set $l=n$ and $i_{t}=2$ for $t=1,\ldots,l$ in Eq. (2.0.7) in [56])

[TABLE]

where all the matrices have non-negative real eigenvalues, as both $\mathcal{L}$ and $S$ are symmetric with non-negative eigenvalues. Clearly, $\lambda_{2}(\mathcal{L}^{2})=\lambda_{2}(\mathcal{L})^{2}>\lambda_{1}(\mathcal{L}^{2})=0$. Furthermore, the eigenvalues of $\mathcal{L}^{2}\mathcal{S}^{-1}$ and $\mathcal{L}\mathcal{S}^{-1}\mathcal{L}$ are the same. (Footnote 13: If $u$ is an eigenvector of the latter matrix corresponding to a nonzero eigenvalue $\lambda$, then $\mathcal{L}u$ is a right eigenvector of the former matrix with the same eigenvalue; similarly, if $u$ is a right eigenvector of $\mathcal{L}^{2}\mathcal{S}^{-1}$ corresponding to a nonzero eigenvalue $\lambda$, then $\mathcal{L}\mathcal{S}^{-1}u$ is an eigenvector of $\mathcal{L}\mathcal{S}^{-1}\mathcal{L}$ with the same eigenvalue.) Therefore, since $\mathcal{L}\mathcal{S}^{-1}\mathcal{L}$ is positive semidefinite with $\lambda_{1}(\mathcal{L}\mathcal{S}^{-1}\mathcal{L})=0$, we also have

[TABLE]

Moreover, $S$ is a diagonal matrix with diagonal entries $S_{ii}=s_{i}$ ; therefore, eigenvalues of $S$ are given by $s_{i}$ with $i=1,2,\dots,n$ . Hence (35) is equivalent to

[TABLE]

where the inequalities follow from Lemma 18 and the fact that $\lambda_{2}(\mathcal{L})=\lambda_{\min}^{+}(\mathcal{L})>0$ due to $\mathcal{G}$ being a connected graph, where $\lambda_{\min}^{+}(\cdot)$ denotes the smallest positive eigenvalue. From (36) and (37), we conclude that $\lambda_{2}(\mathcal{L}^{2}\mathcal{S}^{-1})$ is the smallest positive eigenvalue of $\mathcal{L}^{2}\mathcal{S}^{-1}$ , i.e.,

[TABLE]

Finally, using the fact that the eigenvalues of $\mathcal{L}^{2}\mathcal{S}^{-1}$ and $\mathcal{L}\mathcal{S}^{-1}\mathcal{L}$ are the same once again, we get $\lambda_{\min}^{+}(\mathcal{L}\mathcal{S}^{-1}\mathcal{L})=\lambda_{\min}^{+}(\mathcal{L}^{2}\mathcal{S}^{-1})$ . Combining this with (37) and (38) leads to

[TABLE]

which directly implies $\rho_{S}\leq\rho$ . This completes the proof. ∎
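As a numerical sanity check of Proposition 19, the following sketch evaluates both sides of the inequality on a star graph with five nodes and also confirms that $\mathcal{L}^{2}S^{-1}$ and $\mathcal{L}S^{-1}\mathcal{L}$ have the same spectrum. It assumes that $S$ is the diagonal matrix of squared row norms of $\mathcal{L}$, consistent with the proof of Lemma 18; the example graph, the tolerance used to detect positive eigenvalues, and the helper names are choices made for this illustration.

```python
import numpy as np

def star_laplacian(n):
    """Unweighted Laplacian of a star graph: node 0 is connected to all others."""
    A = np.zeros((n, n))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    return np.diag(A.sum(axis=1)) - A

def smallest_positive_eig(M, tol=1e-9):
    """Smallest eigenvalue of a symmetric matrix M that exceeds tol."""
    vals = np.linalg.eigvalsh(M)
    return vals[vals > tol].min()

L = star_laplacian(5)
n = L.shape[0]
s = (L ** 2).sum(axis=1)                  # s_i = squared norm of row i of L
S_inv = np.diag(1.0 / s)
M = L @ S_inv @ L                         # the matrix L S^{-1} L

lhs = smallest_positive_eig(M) / n
rhs = (smallest_positive_eig(L) / np.linalg.norm(L, "fro")) ** 2

# L^2 S^{-1} is not symmetric, but it is similar to a symmetric matrix,
# so its eigenvalues are real up to rounding.
spec_sym = np.sort(np.linalg.eigvalsh(M))
spec_prod = np.sort(np.linalg.eigvals(L @ L @ S_inv).real)
print("lhs =", lhs, "rhs =", rhs)
```

For this star graph the nonzero eigenvalues of $\mathcal{L}S^{-1}\mathcal{L}$ are $1/2$ (with multiplicity three) and $7/2$, so the left-hand side is $0.1$ while the right-hand side is $1/28$.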

Proof of Proposition 8

The proof follows by adapting the proof of [37, Proposition 5.1] to our setting with minor modifications. It is based on exploiting the symmetry group properties of the barbell graph with algebraic techniques. We first give relevant background material below before going into the details of the proof.

Background Material

Consider a weighted graph $\mathcal{G}=(\mathcal{N},\mathcal{E},w)$ . A permutation $p:\mathcal{N}\rightarrow\mathcal{N}$ is a mapping that rearranges the vertices, i.e. it is a bijection from the node set $\mathcal{N}$ to itself. We consider a permutation group $H$ , which is a group whose elements are permutations of $\mathcal{N}$ and whose group operation is the composition of permutations in $H$ . By the group property, if two permutations $s_{1},s_{2}\in H$ , then the composition $s_{1}s_{2}\in H$ and in particular the identity permutation $e$ which maps all the elements of $\mathcal{N}$ to itself is also contained in $H$ . The group that contains all the $n!$ permutations defined on $\mathcal{N}$ is denoted as $S_{n}$ .

The direct product $H_{1}\times H_{2}$ of two groups $H_{1},H_{2}$ is the group whose elements are those of the Cartesian product of $H_{1}$ and $H_{2}$, equipped with elementwise composition; i.e., $(h_{1},h_{2})\in H_{1}\times H_{2}$ if and only if $h_{1}\in H_{1}$ and $h_{2}\in H_{2}$, and for $(h_{1},h_{2}),(\tilde{h}_{1},\tilde{h}_{2})\in H_{1}\times H_{2}$ the composition operation $\cdot$ is defined as $(h_{1},h_{2})\cdot(\tilde{h}_{1},\tilde{h}_{2})=(h_{1}\tilde{h}_{1},h_{2}\tilde{h}_{2})$. A subgroup $M$ of a group $H$ is normal if for all $h\in H$ and $m\in M$ we have $hmh^{-1}\in M$. The semidirect product $H_{1}\ltimes H_{2}$ of two groups $H_{1}$ and $H_{2}$ is the group that consists of elements $h=h_{1}h_{2}$ with $h_{1}\in H_{1}$ and $h_{2}\in H_{2}$, where the subgroup $H_{1}$ is normal in $H_{1}\ltimes H_{2}$ and $H_{1}\cap H_{2}=\{e\}$. The orbit $O_{i}$ of an element $i\in\mathcal{N}$ under a permutation group $H$ is the set $O_{i}\triangleq\{v\in\mathcal{N}\,|\,\exists s\in H\;\;\text{s.t.}\;\;s(v)=i\}$. In other words, the orbit of node $i$ is the set of vertices that can be mapped to $i$ by an element of the permutation group $H$. This definition creates an equivalence relation $\sim$ on $\mathcal{N}$: for $i,j\in\mathcal{N}$, we say $i\sim j$ if $O_{i}=O_{j}$. In particular, the equivalence classes form a partition of $\mathcal{N}$.

A permutation $s$ is called an automorphism of the weighted graph $\mathcal{G}$ if the weight matrix $w$ is invariant under $s$, i.e., if $w(i,j)=w(s(i),s(j))$ for all $i,j\in\mathcal{N}$. From this definition, an automorphism $s$ also satisfies $W(i,j)=W(s(i),s(j))$, where $W(i,j)=w(i,j)/\sum_{k\in\mathcal{N}_{i}}w(i,k)$ is the transition probability. We are interested in such permutations, which preserve the structure of $w$ and therefore of $W$. The group of all automorphisms with the operation of composition of permutations is called the automorphism group of the graph and is denoted by $\mbox{Aut}(\mathcal{G})$. Let $S$ be a subgroup of $\mbox{Aut}(\mathcal{G})$ and consider the orbits $\{O_{i}\}_{i\in\mathcal{N}}$ under the permutation group $S$, which partition the set $\mathcal{N}$. We define the orbit graph to be the graph whose vertices are the equivalence classes $O_{i}$ for $i\in\mathcal{N}$, and we consider an induced Markov chain on the orbit graph with transition probabilities defined as

$$W_{S}(O_{i},O_{j}):=\sum_{j^{\prime}\in O_{j}}W(i,j^{\prime}). \qquad (39)$$

This Markov chain is also called the orbit chain. It can be shown that the definition of the weights $W_{S}$ above does not depend on the choice of the element $i$ from the set $O_{i}$ (see e.g. [37]).
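The orbit construction can be illustrated with a short Python sketch. It computes the orbits generated by a few transpositions on a barbell graph, builds the orbit chain, and checks that the eigenvalues of the orbit chain also appear among the eigenvalues of $W$. The sketch assumes the usual orbit-chain definition $W_{S}(O_{i},O_{j})=\sum_{j^{\prime}\in O_{j}}W(i,j^{\prime})$ for a representative $i\in O_{i}$, uses the simple random walk $W=D^{-1}A$ instead of the weights of Proposition 8, and introduces hypothetical helper names.

```python
import numpy as np

def transposition(n, a, b):
    """Permutation of {0,...,n-1}, stored as a list, that swaps a and b."""
    p = list(range(n))
    p[a], p[b] = p[b], p[a]
    return p

def orbits_from_generators(n, generators):
    """Orbits under the group generated by the given permutations (union-find)."""
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for p in generators:
        for v in range(n):
            parent[find(v)] = find(p[v])    # v and p(v) lie in the same orbit
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return list(groups.values())

def orbit_chain(W, orbits):
    """W_S(O_a, O_b) = sum of W(i, j') over j' in O_b, for a representative i of O_a."""
    k = len(orbits)
    WS = np.zeros((k, k))
    for a, Oa in enumerate(orbits):
        i = Oa[0]
        for b, Ob in enumerate(orbits):
            WS[a, b] = W[i, Ob].sum()
    return WS

# Barbell graph K_4 - K_4 with bridge nodes 3 and 4, simple random walk.
m = 4
n = 2 * m
A = np.zeros((n, n))
A[:m, :m] = 1.0
A[m:, m:] = 1.0
np.fill_diagonal(A, 0.0)
A[m - 1, m] = A[m, m - 1] = 1.0
W = A / A.sum(axis=1, keepdims=True)

# Generators of S_3 x S_3 acting on the non-bridge nodes of each clique.
gens = [transposition(n, 0, 1), transposition(n, 1, 2),
        transposition(n, 5, 6), transposition(n, 6, 7)]
orbits = orbits_from_generators(n, gens)
WS = orbit_chain(W, orbits)

eig_W = np.linalg.eigvals(W).real
eig_WS = np.linalg.eigvals(WS).real
print("orbits:", orbits)
print("orbit-chain eigenvalues:", np.sort(eig_WS))
```

Here the $8\times 8$ chain is reduced to a $4\times 4$ orbit chain, which is the mechanism that allows eigenvalues of $W$ to be read off from much smaller matrices.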

Proof

First, we consider the automorphism group of the barbell graph $K_{{\tilde{n}}}-K_{{\tilde{n}}}$ with edge weights given by Proposition 8. Consider the nodes $i_{*}$ and $j_{*}$ that connect the complete subgraphs of the barbell graph and, without loss of generality, assume that we enumerate the nodes so that $i_{*}={\tilde{n}}$, $j_{*}={\tilde{n}}+1$, any node $i<{\tilde{n}}$ is on the complete subgraph on the left-hand side, and any node $j>{\tilde{n}}+1$ is on the complete subgraph on the right-hand side. We see from the symmetry structure of $W$ that if we take any two nodes from a complete subgraph (other than $i_{*}$ and $j_{*}$) and permute them, this is an automorphism. Similarly, swapping the two complete subgraphs is an automorphism; i.e., the permutation $C_{2}:\mathcal{N}\rightarrow\mathcal{N}$ that maps $i\overset{C_{2}}{\mapsto}-i~{}\mod(n+1)$ is an automorphism. It follows from these observations that the automorphism group of $K_{{\tilde{n}}}-K_{{\tilde{n}}}$ is the group $C_{2}\ltimes(S_{{\tilde{n}}-1}\times S_{{\tilde{n}}-1})$ (see also [37] for more details). It is known that for any subgroup $S$ of the automorphism group, every eigenvalue of the transition matrix $W_{S}$ defined by (39) is also an eigenvalue of the transition matrix $W$ (see, e.g., [37, Section 3]). Note that the square matrix $W_{S}$ has dimension $n_{S}\times n_{S}$ where $n_{S}\leq n$, so the set of eigenvalues of $W_{S}$ is a subset of the set of all eigenvalues of $W$. We are going to use this result to prove Proposition 8. Next, we consider the eigenvalues of the transition matrices $W_{S}$ of the orbit chains under subgroups $S$ of $C_{2}\ltimes(S_{{\tilde{n}}-1}\times S_{{\tilde{n}}-1})$:

a) The orbit chain under $C_{2}\ltimes(S_{{\tilde{n}}-1}\times S_{{\tilde{n}}-1})$ (Figure 8) has the transition matrix $\begin{bmatrix}\frac{A+G}{A+G+E}&\frac{E}{A+G+E}\\ \frac{E}{(n-1)F+E}&\frac{(n-1)F}{(n-1)F+E}\end{bmatrix}$. Since $\lambda_{a}=1$ is an eigenvalue of this matrix and the trace of a matrix equals the sum of its eigenvalues, it follows that the other eigenvalue is $\lambda_{b}=-1+\frac{A+G}{A+G+E}+\frac{F}{F+B}$.

b) Consider the orbit chain under $C_{2}$ illustrated on the left panel of Figure 9.

This orbit graph has two orbits under the permutation group $S_{{\tilde{n}}-1}$: one of them contains only one node (the node with a self-loop of weight $(A+G)$), and the other orbit contains the remaining ${\tilde{n}}-1$ nodes. Notice that the latter orbit has ${\tilde{n}}-1$ identical elements; therefore, the permutation group $C_{2}\ltimes(S_{{\tilde{n}}-2}\times S_{{\tilde{n}}-2})$ fixes one of the nodes having a loop with weight $D$ and permutes the remaining ${\tilde{n}}-2$ nodes among themselves without affecting the orbit with one node. Therefore, by [37, Theorem 3.1], the eigenvalues of the transition matrix $W^{\prime}$ of the orbit graph obtained by the permutation group $S=C_{2}\ltimes(S_{{\tilde{n}}-2}\times S_{{\tilde{n}}-2})$ (illustrated on the right panel of Figure 9) are also eigenvalues of the transition matrix $W$. The transition matrix $W^{\prime}$ is $3\times 3$ with three eigenvalues, including $\lambda_{a}$ and $\lambda_{b}$, which we have already found in part (a). The third eigenvalue $\lambda_{c}$ can be computed from the transition matrix $W^{\prime}$ of the orbit chain under $C_{2}\ltimes(S_{{\tilde{n}}-2}\times S_{{\tilde{n}}-2})$:

[TABLE]

where we use $*$ to denote the entries of this matrix that will not be relevant to our discussion. In particular, the eigenvalues of this matrix will be $\lambda_{a}$ , $\lambda_{b}$ and $\lambda_{c}$ ; the latter will be an eigenvalue of $W$ with multiplicity $2{\tilde{n}}-4$ . Again, using the fact that the trace of a matrix is equal to the sum of its eigenvalues, we obtain

[TABLE]

c) Lastly, the orbit chain under $(S_{{\tilde{n}}-1}\times S_{{\tilde{n}}-1})$ consists of four orbits: the $({\tilde{n}}-1)$ points in each of the left and right complete graphs, and the vertices $i_{*}$ and $j_{*}$, as illustrated in Figure 10.

This orbit chain has the transition matrix of the form

[TABLE]

After a straightforward computation, it can be checked that this matrix has the eigenvalues $1$, $\lambda_{+}$, $\lambda_{-}$, and $-1+\frac{A+G}{A+E+G}+\frac{F}{B+F}$, where

[TABLE]

and $S=\bigg{(}\frac{F}{B+F}+\frac{G-A}{A+E+G}\bigg{)}^{2}-\frac{4(FG-BE-AF)}{(B+F)(A+E+G)}$ .

Remark 20.

Boyd et al. [37] studied the case $W_{i_{*}i_{*}}=0=W_{j_{*}j_{*}}$, where similar orbit chains and graphs arise. The proof of Proposition 8 given here is a minor modification of the original proof of Boyd et al. [37, Proposition 2.2] and extends it to the more general case where $W_{i_{*}i_{*}}$ or $W_{j_{*}j_{*}}$ can be strictly positive.

Bibliography

[1] D. J. Klein. Resistance-distance sum rules. Croatica chemica acta, 75(2):633–649, 2002.
[2] D. J. Klein and M. Randić. Resistance distance. Journal of Mathematical Chemistry, 12(1):81–95, 1993.
[3] A. Ghosh, S. Boyd, and A. Saberi. Minimizing effective resistance of a graph. SIAM review, 50(1):37–66, 2008.
[4] D. Aldous and J. A. Fill. Reversible Markov chains and random walks on graphs, 2014. Unfinished monograph, available at: http://www.stat.berkeley.edu/~aldous/RWG/book.html.
[5] P. G. Doyle and J. L. Snell. Random walks and electric networks. Mathematical Association of America, 1984.
[6] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. SIAM Journal on Computing, 40(6):1913–1926, 2011.
[7] Rajat Chandra Mishra and Himadri Barman. Effective resistances of two-dimensional resistor networks. European Journal of Physics, 42(1):015205, Dec 2020.
[8] M. A. Jafarizadeh, R. Sufiani, and S. Jafarizadeh. Calculating effective resistances on underlying networks of association schemes. Journal of Mathematical Physics, 49(7):073303, Jul 2008.
