Localization and Approximations for Distributed Non-convex Optimization

Hsu Kao; Vijay Subramanian

arXiv:1706.02599·math.OC·June 22, 2021·J. Optim. Theory Appl.


TL;DR

This paper extends distributed non-convex optimization methods by reducing communication complexity and relaxing gradient assumptions, demonstrating improved resource allocation in multi-cell networks with enhanced stability.

Contribution

It generalizes existing frameworks for non-convex distributed optimization by reducing communication and relaxing gradient bounds, with practical applications in network resource allocation.

Findings

1. Reduced communication complexity in separable variable cases.

2. Relaxed gradient assumptions via proximal approximations.

3. Improved stability and performance in resource allocation simulations.



Tables

Table 1: Storage and communication required in Section 4.1.

Node	Before localization	After localization
1	$𝐱^{1}$ , $𝐲^{1}$ , ${\tilde{π}}^{1}$ , $𝐱^{2}$ , $𝐲^{2}$ , ${\tilde{π}}^{2}$ , $𝐱^{3}$ , $𝐲^{3}$ , ${\tilde{π}}^{3}$ to 2, 4	$𝐱^{1}$ , $𝐲^{1}$ , ${\tilde{π}}^{1}$ to 4, $𝐱^{2}$ , $𝐲^{2}$ , ${\tilde{π}}^{2}$ to 2
2	$𝐱^{1}$ , $𝐲^{1}$ , ${\tilde{π}}^{1}$ , $𝐱^{2}$ , $𝐲^{2}$ , ${\tilde{π}}^{2}$ , $𝐱^{3}$ , $𝐲^{3}$ , ${\tilde{π}}^{3}$ to 1, 3	$𝐱^{2}$ , $𝐲^{2}$ , ${\tilde{π}}^{2}$ to 1, $𝐱^{3}$ , $𝐲^{3}$ , ${\tilde{π}}^{3}$ to 3
3	$𝐱^{1}$ , $𝐲^{1}$ , ${\tilde{π}}^{1}$ , $𝐱^{2}$ , $𝐲^{2}$ , ${\tilde{π}}^{2}$ , $𝐱^{3}$ , $𝐲^{3}$ , ${\tilde{π}}^{3}$ to 2, 4	$𝐱^{3}$ , $𝐲^{3}$ , ${\tilde{π}}^{3}$ to 2
4	$𝐱^{1}$ , $𝐲^{1}$ , ${\tilde{π}}^{1}$ , $𝐱^{2}$ , $𝐲^{2}$ , ${\tilde{π}}^{2}$ , $𝐱^{3}$ , $𝐲^{3}$ , ${\tilde{π}}^{3}$ to 1, 3, 5	$𝐱^{1}$ , $𝐲^{1}$ , ${\tilde{π}}^{1}$ to 1, 5
5	$𝐱^{1}$ , $𝐲^{1}$ , ${\tilde{π}}^{1}$ , $𝐱^{2}$ , $𝐲^{2}$ , ${\tilde{π}}^{2}$ , $𝐱^{3}$ , $𝐲^{3}$ , ${\tilde{π}}^{3}$ to 4	$𝐱^{1}$ , $𝐲^{1}$ , ${\tilde{π}}^{1}$ to 4

Table 2: Fraction of utilized channels and scheduled users per BS.

Channels/Users	0	1	2	3	4	5
LXLP-RM Channels	0.3447	0.6316	0.0237	0	-	-
LXLP-RM Users	0.3447	0.5395	0.1158	0	0	0
LXLP-PL Channels	0.3763	0.6105	0.0132	0	-	-
LXLP-PL Users	0.3763	0	0	0	0.0079	0.6158
SC Channels	0	0	0	1	-	-
SC Users	0	0	0	0	0	1

Table 3: Total utilities and average numbers of iterations required.

Algorithm	LXGP-RM	LXLP-RM	LXLP-PL	SC
Utilities
$η = 1$	89.61	89.73	86.98	19.62
$η = 0.5$	182.7	182.5	182.1	86.10
Iterations required
$η = 1$	280.3	176.9	176.2	2
$η = 0.5$	273.6	216.9	142.9	2

Equations

\min_{\mathbf{x}\in\mathcal{K}}\; U(\mathbf{x}) = F(\mathbf{x}) + G(\mathbf{x}) = \sum_{i=1}^{I} f_{i}(\mathbf{x}) + G(\mathbf{x}),

\min_{\mathbf{x}_{j}\in\mathcal{K}}\; \sum_{i=1}^{I} f_{i}(\mathbf{x}_{j}) + G(\mathbf{x}_{j})

\mathbf{x}_{1} = \mathbf{x}_{2} = \dots = \mathbf{x}_{I}.

\tilde{\mathbf{x}}_{i}^{*}(\mathbf{x}_{i}[n],\tilde{\boldsymbol{\pi}}_{i}[n]) = \operatorname*{arg\,min}_{\mathbf{x}_{i}\in\prod_{k\in\mathcal{S}_{i}}\mathcal{K}_{k}} \Big[\tilde{f}_{i,n}^{*}(\mathbf{x}_{i};\mathbf{x}_{i}[n]) + \sum_{k\in\mathcal{S}_{i}}\tilde{\boldsymbol{\pi}}_{i}^{k}[n]^{T}(\mathbf{x}_{i}^{k}-\mathbf{x}_{i}^{k}[n]) + G(\mathbf{x}_{i}^{c})\Big],

\operatorname*{arg\,min}_{\mathbf{x}_{5}\in\mathcal{K}}\; \tilde{f}_{5}(\mathbf{x}_{5};\mathbf{x}_{5}[n]) + \tilde{\boldsymbol{\pi}}_{5}[n]^{T}(\mathbf{x}_{5}-\mathbf{x}_{5}[n])

\operatorname*{arg\,min}_{(\mathbf{x}_{5}^{4},\mathbf{x}_{5}^{5})\in\mathcal{K}_{4}\times\mathcal{K}_{5}}\; \tilde{f}_{5}(\mathbf{x}_{5}^{4},\mathbf{x}_{5}^{5};\mathbf{x}_{5}^{4}[n],\mathbf{x}_{5}^{5}[n]) + \tilde{\boldsymbol{\pi}}_{5}^{4}[n]^{T}(\mathbf{x}_{5}^{4}-\mathbf{x}_{5}^{4}[n]) + \tilde{\boldsymbol{\pi}}_{5}^{5}[n]^{T}(\mathbf{x}_{5}^{5}-\mathbf{x}_{5}^{5}[n]).

f_{t,s}(x)=\sup_{z}\inf_{y}\Big\{f(y)+\frac{1}{2t}\|z-y\|^{2}-\frac{1}{2s}\|x-z\|^{2}\Big\},

f^{*}_{i,n}(x)=\sup_{z}\inf_{y}\Big\{f_{i}(y)+\frac{L_{i,n}}{4}\|z-y\|^{2}-\frac{L_{i,n}}{2}\|x-z\|^{2}\Big\},

\left\{\begin{array}{lll}\lim_{n\rightarrow\infty}\alpha[n]\frac{(L^{\max}_{n})^{5}}{(\tau^{\min}_{n})^{3}}=0&\Rightarrow&\beta-5\lambda-3\delta>0,\\ \sum_{n=0}^{\infty}(L^{\max}_{n})^{3}\left(\frac{\alpha[n]}{\tau^{\min}_{n}}\right)^{2}<\infty&\Rightarrow&2\beta-3\lambda-2\delta>1,\\ \sum_{n=0}^{\infty}\tau^{\min}_{n}\alpha[n]=\infty&\Rightarrow&\beta+\delta\leq 1,\\ \lim_{n\rightarrow\infty}L^{\min}_{n}=\infty&\Rightarrow&\lambda>0,\\ \sum_{n=0}^{\infty}\alpha[n]L^{\max}_{n}\epsilon[n]<\infty&\Rightarrow&\beta+\gamma-\lambda>1.\end{array}\right.

\nabla f(x,y) = \frac{1}{2} - \frac{x}{1-(x^{2}+y^{2})} - \frac{y}{1-(x^{2}+y^{2})}

\text{(P1)}\qquad\max_{p_{BK},\,x_{BI(B)K}}

\sum_{i\in I_{b}} x_{bik} \leq 1 \quad \forall b\in B,\ k\in K

\sum_{k\in K} p_{bk} \leq P_{b} \quad \forall b\in B

0 \leq x_{bik} \leq 1 \quad \forall b\in B,\ k\in K,\ i\in I_{b}

0 \leq p_{bk} \quad \forall b\in B,\ k\in K,

\text{(P2)}\qquad\max_{p^{B}_{BK},x_{BI(B)K}}\sum_{b\in B}\sum_{i\in I_{b}}w_{i}\sum_{k\in K}x_{bik}\log\Big(1+\frac{\Gamma^{b}_{bik}}{\sigma^{2}+\bar{\Gamma}^{b}_{bik}}\Big)

\text{(P3)}\quad\max_{p^{B}_{BK},x_{BI(B)K}}\sum_{b\in B}\sum_{i\in I_{b}}w_{i}\sum_{k\in K}\left[x_{bik}\log\left(\sigma^{2}+\frac{\Gamma^{b}_{bik}+\bar{\Gamma}^{b}_{bik}}{x_{bik}}\right)-x_{bik}\log\big(\sigma^{2}+\bar{\Gamma}^{b}_{bik}\big)\right],

\tilde{f}_{b}(p^{b}_{BK},x_{bI(b)K};\bar{p}^{b}_{BK},\bar{x}_{bI(b)K}) + \tilde{\pi}^{b}_{BK}\cdot(p^{b}_{BK}-\bar{p}^{b}_{BK})

f_{b}+\frac{\tau_{b}}{2}\Big[\sum_{i,k}(x_{bik}-\bar{x}_{bik})^{2}+\sum_{b^{\prime}\in B,k}(p^{b}_{b^{\prime}k}-\bar{p}^{b}_{b^{\prime}k})^{2}\Big]

+\sum_{i,k}\frac{\partial f_{b}}{\partial x_{bik}}\cdot(x_{bik}-\bar{x}_{bik}) + \sum_{b^{\prime}\in Nb(b),k}\frac{\partial f_{b}}{\partial p^{b}_{b^{\prime}k}}\cdot(p^{b}_{b^{\prime}k}-\bar{p}^{b}_{b^{\prime}k}),

f^{*}_{b\cup,n}=-\sum_{i,k}w_{i}(x_{bik}+e[n])\log\Big(\sigma^{2}+\frac{\Gamma_{bik}+\bar{\Gamma}_{bik}}{x_{bik}+e[n]}\Big).

W_{ij}=\left\{\begin{array}{cl}0&\text{if }j\not\in N(i)\\ \frac{1}{\bar{d}}&\text{if }j\in N(i)\text{ and }i\neq j\\ \frac{\bar{d}-d_{i}}{\bar{d}}&\text{if }j\in N(i)\text{ and }i=j\end{array}\right.,

\min\ \big\|\mathbf{W}-\mathbf{1}_{|B|}\mathbf{1}_{|B|}^{T}/|B|\big\|_{2}\quad\text{subject to}\quad\mathbf{W}\in\Omega.

\mathbf{W}=\begin{bmatrix}1/2&1/4&1/4\\ 1/4&1/2&1/4\\ 1/4&1/4&1/2\end{bmatrix},

U(R_{T})=\sum_{i\in I(B)}U_{i}(R_{i,T}),

U_{i}(\mathbf{R}_{i,t})=\left\{\begin{array}{cl}\frac{c_{i}}{\eta}(R_{i,t})^{\eta},&\eta\leq 1,\eta\neq 0,\\ c_{i}\log(R_{i,t}),&\eta=0,\end{array}\right.

\max_{r_{t}\in R(e_{t})}\sum_{i}c_{i}(R_{i,t})^{\eta-1}r_{i,t}.

\max_{p_{bK},\,x_{bI_{b}K}}\sum_{i\in I_{b}}w_{i}\sum_{k\in K}x_{bik}\log\Big(1+\frac{p_{bk}g_{bik}}{\sigma^{2}x_{bik}}\Big),

\hat{\mathbf{x}}_{i,n}^{\mathcal{S}_{i}}(\tilde{\mathbf{x}}) = \operatorname*{arg\,min}_{\mathbf{x}^{\mathcal{S}_{i}}}\; \tilde{f}_{i,n}^{*}(\mathbf{x}^{\mathcal{S}_{i}};\tilde{\mathbf{x}}^{\mathcal{S}_{i}}) + \boldsymbol{\pi}_{i}^{\mathcal{S}_{i}}(\tilde{\mathbf{x}})^{T}(\mathbf{x}^{\mathcal{S}_{i}}-\tilde{\mathbf{x}}^{\mathcal{S}_{i}}) + G(\mathbf{x}^{c})\quad\forall i\in\mathcal{N},

(\hat{\mathbf{x}}_{i,n}(\mathbf{z})-\mathbf{z})^{T}\nabla F(\mathbf{z}) + G(\hat{\mathbf{x}}_{i,n}(\mathbf{z})) - G(\mathbf{z}) \leq -\tau_{n}^{\min}\|\hat{\mathbf{x}}_{i,n}(\mathbf{z})-\mathbf{z}\|^{2},

\left\|\mathbf{P}[n,l]-\frac{1}{I}\mathbf{1}\mathbf{1}^{T}\right\|\leq c_{0}\rho^{n-l+1},\quad\forall n\geq l

\sum_{m\in\mathcal{S}_{i}}(\mathbf{x}_{i}^{m}[n]-\tilde{\mathbf{x}}_{i}^{m}[n])^{T}\big[\nabla_{\mathbf{x}_{i}^{m}}\tilde{f}_{i,n}^{*}(\tilde{\mathbf{x}}_{i}^{\mathcal{S}_{i}}[n];\mathbf{x}_{i}^{\mathcal{S}_{i}}[n])+\tilde{\boldsymbol{\pi}}_{i}^{m}[n]\big] + (\mathbf{x}_{i}^{c}[n]-\tilde{\mathbf{x}}_{i}^{c}[n])^{T}\partial G(\tilde{\mathbf{x}}_{i}^{c}[n]) \geq 0.

\tilde{\boldsymbol{\pi}}_{i}^{m}[n] = I_{m}\cdot\mathbf{y}_{i}^{m}[n] - \nabla_{\mathbf{x}_{i}^{m}}\tilde{f}_{i,n}^{*}(\mathbf{x}_{i}^{\mathcal{S}_{i}}[n];\mathbf{x}_{i}^{\mathcal{S}_{i}}[n]).


Full text


Submitted to the editors on June 14, 2021.

Hsu Kao, Electrical and Computer Engineering, University of Michigan, Ann Arbor, MI.

Vijay Subramanian, Electrical and Computer Engineering, University of Michigan, Ann Arbor, MI.

Abstract

Distributed optimization has many applications in communication networks, sensor networks, signal processing, machine learning, and artificial intelligence. Methods for distributed convex optimization are widely investigated, while those for non-convex objectives are not well understood. One of the first non-convex distributed optimization frameworks over an arbitrary interaction graph was proposed by Di Lorenzo and Scutari [IEEE Trans. on Signal and Information Processing over Networks, 2 (2016), pp. 120-136], which iteratively applies a combination of local optimization with convex approximations and local averaging. We generalize the existing results in two ways. When the decision variables are separable such that there is partial dependency in the objectives, we reduce the communication complexity of the algorithm so that nodes only keep and communicate local variables instead of the whole vector of variables. In addition, we relax the assumption that the objectives’ gradients are bounded and Lipschitz by means of successive proximal approximations. Having developed the methodology, we then discuss many ways to apply our algorithmic framework to resource allocation problems in multi-cellular networks, where the two generalizations are found useful and practical. Simulation results show the superiority of our resource allocation algorithms over naive single-cell methods; furthermore, our approximation framework leads to algorithms that are numerically more stable.

Keywords: Distributed optimization, non-convex optimization, localization, proximal approximation, resource allocation.

AMS subject classifications: 90C26, 90C90

1 Introduction

Distributed computation has received great attention in response to the overwhelming need in applications [30, 13]. In all of these applications, multiple agents or devices, each with its own local data, want to perform a joint computational task that is either impractical or infeasible to centralize. Reasons for this choice include high communication costs, the availability of large amounts of information, the unavailability of centralized processing, or simply harnessing the efficiency of parallelization [30, 13]. Some specific examples are as follows: communication networks, e.g., scheduling and allocation in a multi-cell setting; sensor networks, e.g., remote parameter estimation with sensor data [30]; and statistical machine learning with large datasets. In all of these cases the underlying computational task can be formulated as an optimization problem, with the objective being utilities, loss functions, etc. [13]. Such motivations have led to a burgeoning of research in distributed optimization.

Distributed optimization methods with convex objective functions have been well investigated in the literature. Many proposed methods fall into the category of gradient or subgradient methods [26], which take gradient descent steps at each node and then average the results. Another class of methods utilizes a dual-decomposition idea, such as the Alternating Direction Method of Multipliers (ADMM) [10]. In contrast, distributed non-convex optimization has received much less attention. One of the first provably convergent algorithms for non-convex objectives using a fully distributed scheme, called NEXT, over a network with arbitrary graphical structure was introduced in [24]. The main idea in [24] is to perform local optimization by finding surrogate convex functions based on the current iterate and utilizing successive convex approximations of the non-convex objective, and then to enforce consensus across the network so that a global objective can be solved in a distributed manner. Other papers on this topic assume much stronger conditions, such as the existence of a central controller to align the outputs in each step [15], or even a complete graph network (interaction) structure [22]. Given this, we adopt the framework of NEXT [24] in our work. Here we consider the scenario where the network that specifies the communication structure is given; the reader is referred to [19] for how to decide the network structure actively, with minimizing the energy consumption for delay-constrained singular value decomposition computation as a motivating example.

There are some fundamental issues with NEXT [24], though, which we address in our work. First, the algorithm requires each node to store and update the entire vector of decision variables, irrespective of the underlying dimension or structure; this is also an issue more broadly for most distributed optimization algorithms. In certain applications, the decision variables might be the ensemble of sets of control parameters at each node, which could be of a significant dimension themselves: e.g., multiple platoons of automated cars with a local controller for each team, or cellular base-stations each with many connected devices. Directly using NEXT would necessitate greatly increased storage at each node and also high-rate and low-latency communication between all nodes, which is impractical for a large network. In the illustrative examples above, the decision variables typically can be decomposed into blocks with a sparse interconnection between different blocks. Such a block structure could be used to reduce the storage and communication requirements. This, however, has not received much attention in the distributed optimization literature. The block coordinate descent method for centralized optimization is studied in [4], wherein gradient descent is effectively carried out one block at a time. When the objective is the sum of separable functions, convergence is shown for extensions of ADMM when the number of blocks is two [4], but this no longer holds when there are three blocks [11]. A distributed optimization scheme for variables with block structure is proposed in [14], but only the convex part of the objective is decomposable. An optimization problem where the separable variables of agents are coupled through a convex social cost function is studied in [36], where the author uses a dual method to decouple the variables and leverage the separability; the goal is to show that the duality gap vanishes when the number of agents grows.
In our work we address this lacuna and present an algorithm that exploits the underlying block structure of the decision variables through a process which we call localization when the objective is the sum of separable non-convex functions. Our idea of localization is similar to [37], which exploits the sparsity of the constraints in convex feasibility problems (CFPs) to reduce the memory and communication needed. As we will explain in Section 4.3, not only is our framework more general, but it also works for non-convex problems.

A second issue with the approach in [24] is that the objective in many applications may not have the required smoothness. For example, a common assumption, as in [24], is that the gradients are Lipschitz and bounded, but this may not hold; we will demonstrate this explicitly with the motivating application. In centralized convex optimization, apart from subgradient methods, proximal methods are used for non-smooth functions [29]: e.g., one substitute is the Moreau envelope, which is smooth and preserves the minimizers. We will develop a general scheme for continuous objective functions with non-Lipschitz unbounded gradients that takes any sequence of smooth approximations as input, such as the Moreau envelope.

Motivating Example

There is a trend toward increased access-point deployment density, coupled with the increasing usage of high-speed connections as well as fiber for backhaul [39, 2], in modern wireless networks. Hence, it is feasible to envisage coordination between neighboring base-sites, based on high-rate and low-latency communications, to implement distributed optimization methods for resource allocation. Even this only allows communication of locally relevant decision variables, and rules out communicating network-wide decision variables, as would be the case if one used the algorithm in [24]. We will use the resource allocation problem, solved via the Network Utility Maximization framework [28] with some specific use cases described in [16, 17], as our motivating example. In the one-shot weighted sum-rate maximization problem derived from the decomposition of network utility across time instances in a time-varying channel environment, we have to jointly decide the power transmitted by base stations (BSs) on each channel (power control), as well as the resource blocks (RBs) on which a BS should transmit data to its users (scheduling); jointly this is termed resource allocation. This problem is hard because the objective is non-convex and the constraints are knapsack-like; the latter are usually handled by relaxing to real-valued variables and then rounding. We will follow the same approach, and concentrate on obtaining (locally) optimal solutions to the relaxed problem. While this problem for the single cell is well characterized [16], the solution to even the multi-cell power control problem with interference impacts remains unresolved [12].
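To make the non-convexity claim concrete, the following minimal sketch evaluates a weighted sum-rate for two mutually interfering links and checks a midpoint inequality that any concave objective would satisfy. The two-link setup, the gain matrix `g`, the noise power `sigma2`, and the interference-as-noise rate formula are illustrative assumptions, not the paper's system model.

```python
import math


def sum_rate(p, g, sigma2, w=(1.0, 1.0)):
    """Weighted sum-rate of two links that treat interference as noise.

    p[b]    : transmit power of link b (hypothetical toy model)
    g[a][b] : channel gain from transmitter a to receiver b
    sigma2  : noise power
    """
    total = 0.0
    for b in range(2):
        other = 1 - b
        signal = p[b] * g[b][b]
        interference = p[other] * g[other][b]
        total += w[b] * math.log(1.0 + signal / (sigma2 + interference))
    return total


g = [[1.0, 1.0], [1.0, 1.0]]   # strong cross-interference (assumed values)
sigma2 = 0.1

rate_a = sum_rate((1.0, 0.0), g, sigma2)     # only link 0 transmits
rate_b = sum_rate((0.0, 1.0), g, sigma2)     # only link 1 transmits
rate_mid = sum_rate((0.5, 0.5), g, sigma2)   # midpoint of the two power vectors

# A concave objective would satisfy rate_mid >= (rate_a + rate_b) / 2.
concavity_violated = rate_mid < 0.5 * (rate_a + rate_b)
print(rate_a, rate_b, rate_mid, concavity_violated)
```

In this toy instance sharing the channel is worse than either link transmitting alone, so the sum-rate is not concave in the powers, which is the kind of structure that rules out standard convex distributed methods.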

Given the difficulty of solving the multi-cell resource allocation problem, existing methods in the literature use heuristic approaches such as decomposition of the problem followed by greedy algorithms that lead to sub-optimal solutions [35], or make strong assumptions on the network interference graph [40]. Some algorithms are centralized [38], which is an impractical assumption in a realistic scenario. Prior work [33] also utilizes interference prices to solve the multi-cell power control problem, where each BS-user pair maximizes its own utility minus the sum of marginal “costs” to all other users with increase in its power. This literature only considers power control and not scheduling. Also, though having extensions in multi-channel settings, only one receiver is considered under each transmitter, and users need to sequentially broadcast their interference prices to ensure convergence in the multiple-input single-output case. In [31], a distributed power control and scheduling algorithm is proposed; $\epsilon$ -optimality is established for the grid network with a $K$ -hop interference model.

In this paper, we generalize the results in [24] by resolving the two issues mentioned earlier. For the first issue, we exploit the separability of the decision variables and the objectives’ partial dependencies to reduce the storage and communication needs. Although all components of the decision variable are entangled via each node’s objective, we show that each component can be maintained and optimized within a local network – this is different from directly applying NEXT to the setting, and its convergence is not obvious from [24]. Secondly, inspired by the proximal method, we use a series of functions with Lipschitz gradients to approximate the original objective, and significantly relax the smoothness assumptions made in [24]. We show that as long as the series of approximation functions approach the original function slowly enough, we are still guaranteed to obtain stationary solutions in many situations. While following the same steps as NEXT when the gradients are Lipschitz, our algorithm appears to have superior numerical stability over NEXT with non-Lipschitz gradients. We establish convergence for our algorithm with the proposed two generalizations under the condition of no unbounded gradients on the boundary. We then apply the results and algorithms to the multi-cell resource allocation problem in many different ways, which gives numerous algorithms with provable convergence to locally optimal solutions. Last but not least, we give a stochastic approximation interpretation of NEXT in Appendix D, where we provide an alternative proof of NEXT and discuss its relation to [6].

The rest of the paper is organized as follows. We describe the distributed optimization problem setup and the idea of NEXT in Section 2. Then we present our generalized algorithm with localization and proximal approximations in Section 3. In Section 4, we discuss the effect of localization, as well as address the practicability issues of our algorithm. We give examples where gradients are unbounded such that NEXT fails to converge to the correct solution while our algorithm does. We describe the application to the multi-cell resource allocation problem in Section 5. We discuss simulation results for the approximation functions and resource allocation application in Section 6, and conclude in Section 7.

2 Preliminaries

In this section, we give the system model of distributed non-convex optimization and assumptions. The bulk of the setup directly follows from [24]. Consider a network $\mathcal{N}=\{1,\dots,I\}$ that consists of $I$ nodes. We aim to solve an optimization problem of the form

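$$\min_{\mathbf{x}\in\mathcal{K}}\; U(\mathbf{x}) = F(\mathbf{x}) + G(\mathbf{x}) = \sum_{i=1}^{I} f_{i}(\mathbf{x}) + G(\mathbf{x}), \tag{1}$$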

where all $f_{i}$ ’s are $C^{1}$ smooth but can be non-convex, and $G$ is convex but may be non-smooth. The goal is to let these nodes cooperatively solve the problem in a distributed fashion. Therefore, each $j\in\mathcal{N}$ maintains a copy of the entire decision variable $\mathbf{x}$ , referred to as $\mathbf{x}_{j}$ . Then Eq. 1 is equivalent to solving the optimization problem

$$\min_{\mathbf{x}_{j}\in\mathcal{K}}\; \sum_{i=1}^{I} f_{i}(\mathbf{x}_{j}) + G(\mathbf{x}_{j}) \tag{2}$$

at each $j\in\mathcal{N}$ subject to the constraint that all nodes agree on their optimal choices, i.e., we enforce

$$\mathbf{x}_{1} = \mathbf{x}_{2} = \dots = \mathbf{x}_{I}. \tag{3}$$

In the context of distributed optimization, node $i$ only has the information of $f_{i}$ . It would require communication between the nodes to solve the problem in Eq. 2-Eq. 3.

Below are the standard assumptions on the objective functions and the constraint set.

**Assumption A**

(A1) The set $\mathcal{K}\subseteq\mathbb{R}^{d}$ is closed and convex;

(A2) $G$ is convex with bounded subgradient $L_{G}$ for all $\mathbf{x}\in\mathcal{K}$ ;

(A3) $U$ is coercive, that is, $\lim_{\mathbf{x}\in\mathcal{K},|\mathbf{x}|\rightarrow\infty}U(\mathbf{x})=\infty$ ; based on this we can effectively assume that $\mathcal{K}$ is compact;

(A4) $f_{i}$ ’s have bounded gradients, i.e. $\exists\enspace B$ s.t. $\|\nabla f_{i}(\mathbf{x})\|<B$ for all $\mathbf{x}\in\mathcal{K}$ .

The set of nodes $\mathcal{N}$ along with a set of undirected edges $\mathcal{E}$ forms a graph $\mathcal{G}=(\mathcal{N},\mathcal{E})$. This graph captures how communications take place: nodes $i$ and $j$ can only communicate if $(i,j)\in\mathcal{E}$. $\mathcal{G}$ is assumed to be connected to foster communication between the nodes; otherwise, the problem is generally unsolvable. (Footnote 1: All results can be trivially extended to directed time-varying graphs: see [24] for NEXT and the Appendix for our algorithm. For ease of presentation we adopt the above settings.)

Our methods follow the solution scheme of NEXT. In the NEXT algorithm, each node performs a local convex optimization, and then some information will be exchanged in the network. At a high level, the first step is the “descent step” and the second is the “consensus step;” the two steps are iteratively applied to obtain the solution [24]. In the first step of time $n$ , the node $i$ solves a convex approximation of the whole objective function by convexizing its own objective function $f_{i}$ parametrized by the current iterate $\mathbf{x}_{i}[n]$ to be a strongly convex surrogate function $\tilde{f}_{i}(\bullet;\mathbf{x}_{i}[n])$ , while linearizing the sum of other nodes’ objective functions $\sum_{j\neq i}f_{j}$ . We assume the surrogate function satisfies the following assumption:

**Assumption F**

(F1) $\tilde{f}_{i}(\bullet;\mathbf{x})$ is convex;

(F2) $\nabla\tilde{f}_{i}(\mathbf{x};\mathbf{x})=\nabla f_{i}(\mathbf{x})$ for all $\mathbf{x}\in\mathcal{K}$ ;

(F3) Either $\tilde{f}_{i}(\bullet;\mathbf{x})$ ’s are coercive for all $\mathbf{x}\in\mathcal{K}$ and $i\in\mathcal{N}$ or $G(\bullet)$ is coercive.
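As an illustration of Assumption F, the sketch below builds one common kind of surrogate, a linearization plus a proximal term, and checks (F1) and (F2) numerically. The local objective `f`, its gradient, the expansion point `x_bar`, and the weight `tau` are hypothetical choices made for this example; the paper allows any surrogate satisfying (F1)-(F3).

```python
import numpy as np


def f(x):
    """Hypothetical smooth non-convex local objective (not from the paper)."""
    return np.sin(x[0]) * x[1] - 0.5 * x[0] ** 2


def grad_f(x):
    """Analytic gradient of the toy objective f."""
    return np.array([np.cos(x[0]) * x[1] - x[0], np.sin(x[0])])


def make_surrogate(x_bar, tau):
    """Proximal-linear surrogate around x_bar:

    f_tilde(x; x_bar) = f(x_bar) + grad_f(x_bar)^T (x - x_bar)
                        + (tau / 2) * ||x - x_bar||^2
    """
    f0, g0 = f(x_bar), grad_f(x_bar)

    def f_tilde(x):
        d = x - x_bar
        return f0 + g0 @ d + 0.5 * tau * (d @ d)

    def grad_f_tilde(x):
        return g0 + tau * (x - x_bar)

    return f_tilde, grad_f_tilde


x_bar = np.array([0.3, -1.2])
f_tilde, grad_f_tilde = make_surrogate(x_bar, tau=2.0)

# (F2): the surrogate gradient matches the true gradient at the expansion point.
f2_holds = np.allclose(grad_f_tilde(x_bar), grad_f(x_bar))

# (F1): midpoint convexity on random pairs (a sanity check, not a proof).
rng = np.random.default_rng(0)
pairs = rng.normal(size=(100, 2, 2))
f1_holds = all(
    f_tilde(0.5 * (u + v)) <= 0.5 * (f_tilde(u) + f_tilde(v)) + 1e-12
    for u, v in pairs
)
print(f2_holds, f1_holds)
```

Because this surrogate is a quadratic with positive curvature `tau`, it is also coercive, so the first alternative in (F3) holds for it as well.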

The result established is convergence to the stationary solutions, whose definition is given as follows.

Definition 2.1.

*A point $\mathbf{x}^{*}$ is a stationary solution of Problem Eq. 1 if a subgradient $g\in\partial G(\mathbf{x}^{*})$ exists such that $(\nabla F(\mathbf{x}^{*})+g)^{T}(\mathbf{y}-\mathbf{x}^{*})\geq 0$ for all $\mathbf{y}\in\mathcal{K}$.*
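The stationarity condition can be checked directly when $\mathcal{K}$ is a box and $G\equiv 0$, since the minimum of a linear function over a box is attained coordinate-wise at an endpoint. The sketch below does this for the toy non-convex objective $F(\mathbf{x})=-\|\mathbf{x}\|^{2}$ on $[-1,1]$; the objective, the box, and the helper name `stationarity_gap` are assumptions made for illustration only.

```python
import numpy as np


def stationarity_gap(grad, x, lower, upper):
    """Return min over y in the box [lower, upper] of grad^T (y - x).

    The value is never positive when x lies in the box (take y = x), and
    x is stationary in the sense of Definition 2.1 (with G = 0) exactly
    when the value is zero.
    """
    return float(np.sum(np.minimum(grad * (lower - x), grad * (upper - x))))


def grad_F(x):
    """Gradient of the toy objective F(x) = -||x||^2."""
    return -2.0 * x


lower, upper = np.array([-1.0]), np.array([1.0])

gaps = {}
for x0 in (-1.0, 0.0, 0.5, 1.0):
    x = np.array([x0])
    gaps[x0] = stationarity_gap(grad_F(x), x, lower, upper)

for x0, gap in gaps.items():
    print(f"x = {x0:+.1f}  gap = {gap:+.2f}  stationary = {gap == 0.0}")
```

Both endpoints and the interior point $0$ pass the test, while $0.5$ does not; this shows that stationary solutions of a non-convex problem include boundary points and local maximizers of $F$, not only global minimizers.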

3 Main Result

In this section, we introduce our two generalizations of NEXT, localization and approximations, which are the key contributions of this paper. We will first describe our settings for the generalizations, and then give the revised algorithm and the convergence theorem.

3.1 Localization Setting and Assumptions

Consider the setup where there are $M$ local dependency sets $\mathcal{N}_{1},\dots,\mathcal{N}_{M}$, and the local objective function of node $i$, i.e. $f_{i}$, only depends on a common variable $\mathbf{x}^{c}$ and the local variables $\mathbf{x}^{m}$ whenever $i\in\mathcal{N}_{m}$. To be more specific, the decision variables can be split into $M+1$ parts $\mathbf{x}\triangleq(\mathbf{x}^{1},\dots,\mathbf{x}^{M},\mathbf{x}^{c})$ where $M$ is an arbitrary positive integer. For all $m\in[M]$ (which stands for $1,\dots,M$), define the local dependency set $\mathcal{N}_{m}\triangleq\{i:f_{i}\text{ is a function of }\mathbf{x}^{m}\}\subseteq\mathcal{N}$. Denote the sizes of $\mathcal{N}_{1},\dots,\mathcal{N}_{M}$ as $I_{1},\dots,I_{M}$. We can think of $\mathcal{N}$ itself as the $(M+1)$-th local dependency set $\mathcal{N}_{M+1}$, and every $f_{i}$ may depend on $\mathbf{x}^{c}\triangleq\mathbf{x}^{M+1}$. We adopt the convention that whenever $M+1$ appears in either a superscript or a subscript, it means $c$ or anything associated with the original network $\mathcal{G}$; this includes $I_{M+1}\triangleq|\mathcal{N}|=I$. In the other direction, define the dependent part $\mathcal{S}_{i}\triangleq\{m:f_{i}\text{ is a function of }\mathbf{x}^{m},m\in[M+1]\}$. Note that for all $i$, $\mathcal{S}_{i}$ contains $M+1$. Also, when $\mathcal{S}_{i}$ appears in the superscript of a variable, for example $\mathbf{x}$, it means the vector concatenated from all $\mathbf{x}^{k}$’s such that $k\in\mathcal{S}_{i}$, i.e. $\mathbf{x}^{\mathcal{S}_{i}}\triangleq(\mathbf{x}^{k})_{k\in\mathcal{S}_{i}}$. Concrete examples of local dependency sets are provided in Section 4.1. Furthermore, we have the following assumptions:

**Assumption L**

(L1) Besides the fact that $f_{i}$ depends on $\mathbf{x}^{m}$ only if $i\in\mathcal{N}_{m}$ for $m\in[M+1]$, $G$ also only depends on $\mathbf{x}^{c}$;

(L2) The set $\mathcal{K}$ is separable, i.e. it is the direct product of $(M+1)$ convex sets in proper subspaces $\mathcal{K}=\mathcal{K}_{1}\times\dots\times\mathcal{K}_{M+1}\subset\mathbb{R}^{d}$ such that $\mathbf{x}^{m}\in\mathcal{K}_{m}\subset\mathbb{R}^{d_{m}}$ for all $m\in[M+1]$ if and only if $\mathbf{x}\triangleq(\mathbf{x}^{1},\dots,\mathbf{x}^{M+1})\in\mathcal{K}$ ;

(L3) The local network $\mathcal{G}_{m}=(\mathcal{N}_{m},\mathcal{E}_{m})$ with $\mathcal{E}_{m}=\{(i,j)\in\mathcal{E}:i\in\mathcal{N}_{m}\text{ and }j\in\mathcal{N}_{m}\}$ is connected for all $m\in[M]$ (Footnote 2: We add nodes to get connectedness if it does not hold; more on this in Section 4.2. We already assume the connectedness of $\mathcal{G}_{M+1}$ in Section 2.);

(L4) For all $m\in[M+1]$ there is a matrix $\mathbf{W}^{m}$ associated with $\mathcal{N}_{m}$ – each entry is non-zero if and only if there is a corresponding edge in $\mathcal{E}_{m}$ , and all non-zero entries must be greater than or equal to some fixed $\vartheta>0$ . Equivalently, for all rows $i\not\in\mathcal{N}_{m}$ and all columns $j\not\in\mathcal{N}_{m}$ , we have $(\mathbf{W}^{m})_{i,:}=0$ and $(\mathbf{W}^{m})_{:,j}=0$ . In addition, $\mathbf{W}^{m}$ is doubly-stochastic after deleting these zero rows and columns. $\mathbf{W}^{M+1}$ does not contain zero row or column, and corresponds to the $\mathbf{W}$ defined in NEXT.
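One common way to obtain a weight matrix with the properties listed in (L4) is the Metropolis rule. The sketch below is an illustration under that assumption, not a construction taken from this paper: the helper name `metropolis_weights` and the three-node local network are hypothetical, and self-loops are treated as edges so that diagonal entries may be non-zero.

```python
def metropolis_weights(nodes, edges):
    """Build a symmetric, doubly-stochastic weight matrix on a local network.

    nodes: node ids in N_m; edges: undirected pairs with both ends in N_m.
    Returns a dict-of-dicts W with W[i][j] > 0 only if i == j or (i, j) is an edge.
    """
    nbrs = {i: set() for i in nodes}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    W = {i: {j: 0.0 for j in nodes} for i in nodes}
    for i in nodes:
        for j in nbrs[i]:
            # Symmetric in (i, j), so column sums equal row sums.
            W[i][j] = 1.0 / (1 + max(len(nbrs[i]), len(nbrs[j])))
        # Remaining mass goes on the diagonal; it is at least 1 / (1 + degree).
        W[i][i] = 1.0 - sum(W[i][j] for j in nbrs[i])
    return W

nodes = [1, 4, 5]                 # hypothetical local dependency set
edges = [(1, 4), (4, 5)]          # hypothetical local edges (a path)
W = metropolis_weights(nodes, edges)
row_sums = [sum(W[i].values()) for i in nodes]
col_sums = [sum(W[i][j] for i in nodes) for j in nodes]
smallest_positive = min(w for i in nodes for w in W[i].values() if w > 0)
```

Here `smallest_positive` plays the role of the lower bound $\vartheta$ ; for a fixed finite graph it is bounded away from zero.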

Note that $\mathcal{N}_{m}$ ’s do not have to be disjoint and form a partition of $\mathcal{N}$ . Having $\mathcal{N}_{m}\cap\mathcal{N}_{l}\neq\emptyset$ for some $m\neq l$ is allowed. Without loss of generality we can assume they form a covering of $\mathcal{N}$ ; i.e., $\bigcup_{m=1}^{M+1}\mathcal{N}_{m}=\mathcal{N}$ . When there exists a part $\mathbf{x}^{c}$ on which all $f_{i}$ ’s and $G$ depend, then $\mathcal{N}_{M+1}$ is $\mathcal{N}$ itself; otherwise, nodes outside the union do not have any cross-dependence with all the rest and can be optimized themselves.

3.2 Proximal Approximations Setting and Assumptions

In contrast to the common setting in the optimization literature, we consider a scenario where $\nabla f_{i}$ is not Lipschitz continuous for some $i$ . Our idea to relax this Lipschitz assumption is to use a sequence of functions whose gradients are Lipschitz continuous to approximate $f_{i}$ . This is commonly known as the proximal approximation method in the literature of convex optimization, except that our objective is now non-convex.

To be more specific, we want to find a series of functions $\{f^{*}_{i,n}\}_{n\geq 1}$ such that $\nabla f^{*}_{i,n}$ is globally Lipschitz continuous with constant $L_{i,n}$ and that as $n\rightarrow\infty$ we have $f^{*}_{i,n}\rightarrow f_{i}$ pointwise, or even better – uniformly. Then at iteration $n$ we can use the well-behaved $f^{*}_{i,n}$ instead of $f_{i}$ . We will see that as long as the schedule of $\{L_{i,n}\}_{n\geq 1}$ satisfies certain conditions, we can still have convergence to optimality.

The following assumption is the key feature of $f^{*}_{i,n}$ that our algorithm needs for convergence to optimality.

Assumption N:

(N1) $\nabla f^{*}_{i,n}$ is globally Lipschitz continuous with constant $L_{i,n}$ ;

(N2) $\lim_{n\rightarrow\infty}f^{*}_{i,n}\rightarrow f_{i}$ uniformly, and $\lim_{n\rightarrow\infty}\nabla f^{*}_{i,n}\rightarrow\nabla f_{i}$ pointwise.

We will also need a surrogate function of $f^{*}_{i,n}$ , denoted as $\tilde{f}^{*}_{i,n}$ . These surrogate functions should satisfy Assumption F´ similar to Assumption F given as follows.

**Assumption F´**

(F1´) $\tilde{f}^{*}_{i,n}(\bullet;\mathbf{x})$ is uniformly strongly convex with constant $\tau_{i,n}>0$ ;

(F2´) $\nabla\tilde{f}^{*}_{i,n}(\mathbf{x};\mathbf{x})=\nabla f^{*}_{i,n}(\mathbf{x})$ for all $\mathbf{x}\in\mathcal{K}$ ;

(F4´) $\nabla\tilde{f}^{*}_{i,n}(\mathbf{x};\bullet)$ is uniformly Lipschitz continuous with constant $L_{i,n}$ .

As [24] does not have approximation functions, they assume Assumption F´ for the surrogate function of $f_{i}$ , i.e. $\tilde{f}_{i}$ , with fixed $\tau_{i}>0$ and $L_{i}<\infty$ . Here our assumptions are more general than NEXT in that we do not require the strong convexity of $\tilde{f}_{i}(\bullet;\mathbf{x})$ and the Lipschitz continuity of $\nabla\tilde{f}_{i}(\mathbf{x};\bullet)$ . In particular, we can have $\lim_{n\rightarrow\infty}\tau_{i,n}=0$ and $\lim_{n\rightarrow\infty}L_{i,n}=\infty$ as long as the schedules of $\{\tau_{i,n}\}_{n}$ and $\{L_{i,n}\}_{n}$ satisfy certain conditions. We do have an additional Assumption F3. However, that assumption is implied if the $\tilde{f}_{i}(\bullet;\mathbf{x})$ ’s are strongly convex; thus, we do not lose any generality from [24]. Also for simplicity, we assume $L_{i,n}$ is a non-decreasing sequence. We denote $F^{*}_{n}=\sum_{i=1}^{I}f^{*}_{i,n}$ .

3.3 Localized Proximal Inexact NEXT and Main Convergence Theorem

Our localized proximal inexact version of NEXT, which requires less storage and communication, and allows unbounded non-Lipschitz objective gradients, is presented in Algorithm 1. All operations that contain index $i$ , i.e. Lines 5, 6, 7, 9, 10, and 11, are performed for all $i\in\mathcal{N}$ . Also, except for Line 5, the operations have a superscript $k$ , and are performed for all $k\in\mathcal{S}_{i}$ . In Line 5,

[TABLE]

with $\mathbf{x}_{i}=(\mathbf{x}^{k}_{i})_{k\in\mathcal{S}_{i}}=\mathbf{x}^{\mathcal{S}_{i}}_{i}$ and $\tilde{\mathbf{\pi}}_{i}=\{\tilde{\mathbf{\pi}}^{k}_{i}:k\in\mathcal{S}_{i}\}=\tilde{\mathbf{\pi}}^{\mathcal{S}_{i}}_{i}$ . Note that Line 6 along with Line 5 means that one solves the optimization problem in Eq. 4 with accuracy $\epsilon_{i}[n]$ . Also denote $\nabla f^{*}_{i,n}[n]=\nabla f^{*}_{i,n}(\mathbf{x}_{i}[n])$ . The output of the algorithm is the concatenation of $[\bar{\mathbf{x}}^{m}]_{m\in[M+1]}$ , where $\bar{\mathbf{x}}^{m}\triangleq\frac{1}{I_{m}}\sum_{i\in\mathcal{N}_{m}}\mathbf{x}^{m}_{i}$ .

NEXT is a special case of Algorithm 1. First, NEXT would either be the case where $I_{m}=I$ for all $m\in[M]$ , or the equivalent case where $\mathbf{x}$ consists of one part $\mathbf{x}^{c}$ only. In the NEXT algorithm, node $i$ keeps the whole $\mathbf{x}$ , and communicates the whole $\mathbf{x}$ as well with all of its neighbors; the same also applies to the variables $\mathbf{y}$ and $\tilde{\mathbf{\pi}}$ . In contrast, under our localization setting, it turns out that node $i$ only has to maintain $\mathbf{x}^{\mathcal{S}_{i}}$ (also $\mathbf{y}^{\mathcal{S}_{i}}$ and $\tilde{\mathbf{\pi}}^{\mathcal{S}_{i}}$ ) in Algorithm 1; moreover, it only communicates $\mathbf{x}^{\mathcal{S}_{i}\cap\mathcal{S}_{j}}$ (also $\mathbf{y}^{\mathcal{S}_{i}\cap\mathcal{S}_{j}}$ ) with its neighbor $j$ . This could mean substantial savings in memory and communication, especially when the dependency structure is sparse. Section 4.1 provides an example that explicitly specifies the storage and communication needed before and after localization. Second, NEXT corresponds to the case $f^{*}_{i,n}=f_{i}$ , $L_{i,n}=L_{i}<\infty$ , and $\tau_{i,n}=\tau_{i}>0$ for all $i$ and $n$ . Algorithm 1 also accommodates a bigger class of objective functions and supports a more flexible choice of the schedules of $\{L_{i,n}\}$ and $\{\tau_{i,n}\}$ .

Remark.

*The insight of the variables $\mathbf{y}$ and $\tilde{\mathbf{\pi}}$ can be found in [24]. In short, node $i$ needs the information of $\sum_{j\neq i}\nabla f_{j}$ at the current iterate $\mathbf{x}_{i}[n]$ when linearizing others’ objectives. For this purpose, node $i$ tracks the average of gradients $\frac{1}{I}\sum_{j}\nabla f_{j}$ using $\mathbf{y}$ , keeps its neighbors updated, and obtains $\tilde{\mathbf{\pi}}_{i}$ , an approximation of $\sum_{j\neq i}\nabla f_{j}(\mathbf{x}_{i}[n])$ , by subtracting its own gradient from $\mathbf{y}$ . In our localized algorithm, only nodes in $\mathcal{N}_{m}$ participate in the decision of $\mathbf{x}^{m}$ and the tracking of $\mathbf{y}^{m}$ and $\tilde{\mathbf{\pi}}^{m}$ , which requires some subtle treatment. *
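The tracking idea in this remark can be illustrated with a generic dynamic-average-tracking update. The sketch below assumes the update form commonly used in the gradient-tracking literature (mix the neighbors' $\mathbf{y}$ , then add the change in the local gradient) with scalar variables and a fixed weight matrix; it does not reproduce the exact lines of Algorithm 1, and the function name, weights, and gradient values are hypothetical.

```python
def tracking_step(W, y, g_old, g_new):
    """One tracking update: mix neighbours' y, then add the local gradient change."""
    I = len(y)
    y_next = [sum(W[i][j] * y[j] for j in range(I)) + g_new[i] - g_old[i]
              for i in range(I)]
    # pi_i estimates the sum of the *other* nodes' gradients.
    pi_next = [I * y_next[i] - g_new[i] for i in range(I)]
    return y_next, pi_next

# Doubly-stochastic weights on a 3-node path 0 - 1 - 2 (hypothetical values).
W = [[0.75, 0.25, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.25, 0.75]]

# Hypothetical local gradients of the three nodes at iterations n = 0, 1, 2.
grads = [[1.0, 2.0, 6.0], [2.0, 0.0, 4.0], [3.0, 3.0, 3.0]]

y = list(grads[0])                      # initialise y_i with the local gradient
for g_old, g_new in zip(grads, grads[1:]):
    y, pi = tracking_step(W, y, g_old, g_new)

mean_y = sum(y) / len(y)
mean_g = sum(grads[-1]) / len(grads[-1])
```

Because the weights are doubly stochastic, the network average of `y` equals the network average of the current gradients at every iteration, which is what makes `pi` a usable estimate once the nodes' values of `y` agree.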

Remark.

*Note that Algorithm 1 is not the result of a direct application of NEXT in the localization setting given in Section 3.1. Suppose the connectedness assumption holds and $m^{\prime}$ is such that $i\not\in\mathcal{N}_{m^{\prime}}$ . Unlike in NEXT, node $i$ no longer keeps the gradient trace $\tilde{\mathbf{\pi}}^{m^{\prime}}_{i}$ in Algorithm 1, nor does the variable $\mathbf{x}^{m^{\prime}}_{i}$ appear in the objective of its local optimization. The fact that convergence can still be obtained with Algorithm 1 is not obvious from [24].*

The following theorem generalizes the convergence to stationary solutions result of NEXT when gradients are bounded, specifying restrictions on the schedules of $\{L_{i,n}\}$ , $\{\tau_{i,n}\}$ . When gradients are unbounded, stricter constraints are needed.

Theorem 3.1.

For all $m\in[M+1]$ , let $\{\mathbf{x}^{m}[n]\}_{n}\triangleq\{(\mathbf{x}^{m}_{i}[n])_{i\in\mathcal{N}_{m}}\}_{n}$ be the sequences generated by Algorithm 1, $\{\bar{\mathbf{x}}^{m}[n]\}_{n}\triangleq\left\{\frac{1}{I_{m}}\sum_{i\in\mathcal{N}_{m}}\mathbf{x}^{m}_{i}[n]\right\}_{n}$ be their averages, and $\{\bar{\mathbf{x}}[n]\}_{n}=\{(\bar{\mathbf{x}}^{1}[n],\dots,\bar{\mathbf{x}}^{M}[n],\bar{\mathbf{x}}^{c}[n])\}_{n}$ be the ensemble of averages. Let $L^{\max}_{n}=\max_{i}L_{i,n}$ , $\tau^{\min}_{n}=\min_{i}\tau_{i,n}$ , $\epsilon[n]=\min_{i,k}\epsilon^{k}_{i}[n]$ , $\zeta_{n}=\max_{\mathbf{x}\in\mathcal{K}}|F^{*}_{n}(\mathbf{x})-F^{*}_{n-1}(\mathbf{x})|$ , $\eta_{i,n}=\max_{\mathbf{x}\in\mathcal{K}}\|\nabla f^{*}_{i,n}(\mathbf{x})-\nabla f^{*}_{i,n-1}(\mathbf{x})\|$ , and $\eta^{\max}_{n}=\max_{i}\eta_{i,n}$ .

(a) Suppose Assumptions A, F, L, N, and F´ hold. (Footnote: Note that Assumption N1 requires $\lim_{n\rightarrow\infty}L^{\min}_{n}=\infty$ where $L^{\min}_{n}=\min_{i}L_{i,n}$ with $f^{*}_{i,n}$ given as Eq. 6. For any choice of $f^{*}_{i,n}$ other than the double Moreau envelope function [21], Assumption N1 requires $\lim_{n\rightarrow\infty}L_{i,n}=\infty$ for any $i$ such that $\nabla f_{i}$ is non-Lipschitz continuous.) Also, $\alpha[n]\in(0,1]$ is such that $\sum_{n=0}^{\infty}(L^{\max}_{n})^{3}\left(\frac{\alpha[n]}{\tau^{\min}_{n}}\right)^{2}<\infty$ , $\sum_{n=0}^{\infty}\tau^{\min}_{n}\alpha[n]=\infty$ , $\sum_{n=0}^{\infty}\alpha[n]L^{\max}_{n}\epsilon[n]<\infty$ . Then (1) all sequences $\{\mathbf{x}^{m}_{i}[n]\}_{n}\enspace\forall\enspace m\in[M+1]$ asymptotically agree, i.e., $\lim_{n\rightarrow\infty}\|\mathbf{x}^{m}_{i}[n]-\bar{\mathbf{x}}^{m}[n]\|=0\quad\forall\enspace i\in\mathcal{N}_{m}$ ; (2) $\{\bar{\mathbf{x}}[n]\}_{n}$ is bounded, and its limit points are stationary solutions of the original problem.

(b) Suppose we do not assume Assumption A4, but we have $\lim_{n\rightarrow\infty}\alpha[n]\frac{(L^{\max}_{n})^{5}}{(\tau^{\min}_{n})^{3}}=0$ , $\lim_{n\rightarrow\infty}\frac{\eta^{\max}_{n}}{\tau^{\min}_{n}}=0$ , $\sum_{n=0}^{\infty}\frac{\alpha[n]L^{\max}_{n}\eta^{\max}_{n}}{\tau^{\min}_{n}}<\infty$ , $\sum_{n=0}^{\infty}\zeta_{n}<\infty$ , and $\lim_{n\rightarrow\infty}\nabla F^{*}_{n}\rightarrow\nabla F$ . Then we still have result (1) and the first part of (2). When the limit points lie in the interior of $\mathcal{K}$ , or $\nabla F$ is bounded on those limit points, they will be stationary solutions.

(c) Continuing (b), if a limit point $\bar{\mathbf{x}}^{\infty}$ lies on the boundary of $\mathcal{K}$ and $\|\nabla F(\bar{\mathbf{x}}^{\infty})\|=\infty$ , then the definition of stationary solution does not apply, and $\bar{\mathbf{x}}^{\infty}$ could be a local minimum, or a point that is not a local minimum.

Proof 3.2.

*See Appendix B. *

Remark.

*Note that the conditions in (a) imply $\lim_{n\rightarrow\infty}\alpha[n]\frac{(L^{\max}_{n})^{3}}{(\tau^{\min}_{n})^{3}}=0$ . In (b) we apply the stricter $\lim_{n\rightarrow\infty}\alpha[n]\frac{(L^{\max}_{n})^{5}}{(\tau^{\min}_{n})^{3}}=0$ as well as other constraints. *

We can see that although all parts of the variable $\mathbf{x}$ are coupled through all the objectives $f_{i}$ ’s, the decision on $\mathbf{x}^{m}$ is actually dictated by the nodes in $\mathcal{N}_{m}$ . We give a class of approximation functions that satisfies Assumption N, namely the Lasry-Lions envelope or double envelope [21], in Section 4.4. Note that the convergence of Algorithm 1 only requires that the Lipschitz constants of $\{\nabla f^{*}_{i,n}\}_{n\geq 1}$ follow certain schedules; $\{f^{*}_{i,n}\}_{n\geq 1}$ can be any sequence of functions that approaches $f_{i}$ in the limit, and does not need to be a double envelope. We will give many examples of such sequences in Section 4.6 and Section 5.4. Also, an example of schedules of the parameters $\{L^{\max}_{n}\}_{n}$ , $\{\tau^{\min}_{n}\}_{n}$ , $\{\alpha[n]\}_{n}$ , and $\{\epsilon[n]\}_{n}$ that satisfy the related conditions in Theorem 3.1 (a) and (b) using $p$ -series is given in Section 4.5. We will discuss the unbounded gradient issue in full detail in Section 4.6, study different examples there, and show the numerical stability of our algorithm for these examples in Section 6.1.

4 Discussions

In this section, we discuss some details of our proposed algorithm. Two localization examples that compare the effect of localization are given in Section 4.1. We comment on how to exploit localization when Assumption L3 is violated in Section 4.2. Comparison to the localization algorithm proposed by Hu et al. in [37] is given in Section 4.3. Section 4.4 and Section 4.5 summarize the property of double envelope that we will use and provide the $p$ -series example for the parameters $(L,\tau,\alpha,\epsilon)$ , respectively.

4.1 Examples of Localization

In this subsection we give a few examples to illustrate the concept and effectiveness of localization.

Example.

Consider the communication graph given in Fig. 1 (a). In this example we consider the most common situation, where the dependency is directly given by the communication graph. Specifically, every node $i$ has a corresponding variable part $\mathbf{x}^{i}$ in the whole variable tuple $\mathbf{x}$ . Moreover, the objective at node $i$ only depends on its own variable $\mathbf{x}^{i}$ and its neighbors’ variables $\{\mathbf{x}^{j}:j\in N(i)\}$ . In other words, the local dependency set $\mathcal{N}_{i}$ corresponding to $\mathbf{x}^{i}$ is equal to $Nb(i):=N(i)\cup\{i\}$ . $G$ is assumed to be [math]. This situation also arises in the resource allocation application we discuss in Section 5.

For this communication graph, this situation will be the following. The variable $\mathbf{x}=(\mathbf{x}^{1},\mathbf{x}^{2},\mathbf{x}^{3},\mathbf{x}^{4},\mathbf{x}^{5})$ can be split into five parts, and we have $f_{1}=f_{1}(\mathbf{x}^{1},\mathbf{x}^{2},\mathbf{x}^{4})$ ( $f_{1}$ only depends on $\mathbf{x}^{1}$ , $\mathbf{x}^{2}$ , and $\mathbf{x}^{4}$ ), $f_{2}=f_{2}(\mathbf{x}^{1},\mathbf{x}^{2},\mathbf{x}^{3})$ , $f_{3}=f_{3}(\mathbf{x}^{2},\mathbf{x}^{3},\mathbf{x}^{4})$ , $f_{4}=f_{4}(\mathbf{x}^{1},\mathbf{x}^{3},\mathbf{x}^{4},\mathbf{x}^{5})$ , and $f_{5}=f_{5}(\mathbf{x}^{4},\mathbf{x}^{5})$ . The local dependency sets for this example are given in Fig. 1 (b). Before localization, node 1 needs to store all of $\mathbf{x}^{1}$ to $\mathbf{x}^{5}$ and communicates this information to its neighbors $\{2,4\}$ . In contrast, after localization node 1 only keeps $\mathbf{x}^{1}$ , $\mathbf{x}^{2}$ , and $\mathbf{x}^{4}$ , and exchanges some of this information with $\{2,4\}$ . In addition, node 1 does not maintain the information of $\mathbf{y}^{3}$ , $\mathbf{y}^{5}$ , or $\tilde{\mathbf{\pi}}^{3}$ , $\tilde{\mathbf{\pi}}^{5}$ , either. Before localization, NEXT performs

[TABLE]

in the local optimization step at node 5, while after localization Algorithm 1 performs the following

[TABLE]

The first terms of the two operations are actually the same. Indeed, since $f_{5}$ only depends on $\mathbf{x}^{4}$ and $\mathbf{x}^{5}$ , we can definitely choose a surrogate function that only depends on the local storage of $\mathbf{x}^{4}$ and $\mathbf{x}^{5}$ . The difference is that node 5 does not have to store, say, $\tilde{\mathbf{\pi}}_{5}^{1}$ , and optimize $\mathbf{x}_{5}^{1}$ based on $\tilde{\mathbf{\pi}}_{5}^{1}$ . The variable $\tilde{\mathbf{\pi}}_{5}^{1}$ is asymptotically tracking $\sum_{j\neq 5}\nabla_{\mathbf{x}^{1}}f_{j}$ . Our localization result says that since node 5 does not have any preference on $\mathbf{x}^{1}$ , it only follows others’ decision of $\mathbf{x}^{1}$ through $\tilde{\mathbf{\pi}}_{5}^{1}$ . Hence, it is unnecessary for node 5 to maintain $\mathbf{x}_{5}^{1}$ and $\tilde{\mathbf{\pi}}_{5}^{1}$ – it can just take other nodes’ decision at the end, even though $\mathbf{x}^{1}$ and $\mathbf{x}^{5}$ are coupled through, say, $f_{4}$ . This saves nodes from unnecessary memory storage and communication in the presence of a sparse dependency structure, which is crucial when the network is large.

Example.

We consider the same communication network but a different dependency structure: $f_{1}=f_{1}(\mathbf{x}^{1},\mathbf{x}^{2})$ , $f_{2}=f_{2}(\mathbf{x}^{2},\mathbf{x}^{3})$ , $f_{3}=f_{3}(\mathbf{x}^{3})$ , $f_{4}=f_{4}(\mathbf{x}^{1})$ , and $f_{5}=f_{5}(\mathbf{x}^{1})$ . There are three local dependency sets in this example as shown in Fig. 1 (c). The variable parts that are stored at each node and the communication required with other nodes for each part are given in Table 1. Naturally, before localization every node keeps all parts and communicates them with all of its neighbors. In contrast, the storage and communication required are greatly reduced after localization. Notice that even though node 3 is directly linked with node 4, they do not communicate as there is no common part they both depend on.
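The bookkeeping in this example can be reproduced in a few lines. The sketch below computes the local dependency sets $\mathcal{N}_{m}$ and, for each communication edge, the parts $\mathcal{S}_{i}\cap\mathcal{S}_{j}$ that are exchanged after localization. The edge list is an assumption inferred from the neighbor structure of the first example (Fig. 1 (a) itself is not reproduced here), no common part $\mathbf{x}^{c}$ is modeled, and the variable names are hypothetical.

```python
# Which variable parts each objective f_i depends on (second example).
depends_on = {1: {1, 2}, 2: {2, 3}, 3: {3}, 4: {1}, 5: {1}}

# Communication edges; assumed from the neighbour structure of the first example.
edges = [(1, 2), (1, 4), (2, 3), (3, 4), (4, 5)]

# Local dependency sets N_m = {i : f_i depends on x^m}.
parts = sorted(set().union(*depends_on.values()))
N = {m: {i for i, S in depends_on.items() if m in S} for m in parts}

# Parts exchanged on each edge after localization: S_i intersected with S_j.
exchanged = {(i, j): depends_on[i] & depends_on[j] for i, j in edges}

for m in parts:
    print(f"N_{m} = {sorted(N[m])}")
for (i, j), shared in exchanged.items():
    print(f"edge ({i},{j}) exchanges parts {sorted(shared)}")
```

Under these assumptions the edge between nodes 3 and 4 carries no variable parts, in line with the observation that these two nodes do not communicate.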

4.2 Relaxing Assumption L3

Our requirement of connectedness of $\mathcal{G}_{m}$ is not a strong assumption. If any $\mathcal{G}_{m}$ is not connected, we can always “add” nodes as “relays” into the local dependency set. For example, consider the second example in Section 4.1 with $f_{3}$ revised to be a constant and $f_{4}$ revised as $f_{4}=f_{4}(\mathbf{x}^{1},\mathbf{x}^{3})$ . Then $\mathcal{G}_{3}=(\{2,4\},\emptyset)$ is not connected. We can add node 3 into $\mathcal{G}_{3}$ by making $f_{3}=f_{3}(\mathbf{x}^{3})$ where the dependency is actually trivial. Node 3 can thus relay the information of $\mathbf{x}^{3}$ for nodes 2 and 4. Alternatively, node 1 can also do the relay job. Hence, we assume without loss of generality that the nodes have been added by some algorithm that might depend on the network structure and communication requirements so that every $\mathcal{G}_{m}$ is connected.
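The relay argument amounts to a connectivity check on the subgraph induced by a local dependency set. A minimal sketch follows; the edge list is again an assumption inferred from the first example of Section 4.1, and the helper `is_connected` is hypothetical.

```python
def is_connected(nodes, edges):
    """Check connectivity of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    if not nodes:
        return True
    adj = {i: set() for i in nodes}
    for i, j in edges:
        if i in nodes and j in nodes:      # keep only edges inside the local set
            adj[i].add(j)
            adj[j].add(i)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:                           # depth-first search from `start`
        for nb in adj[stack.pop()]:
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return seen == nodes

edges = [(1, 2), (1, 4), (2, 3), (3, 4), (4, 5)]   # assumed communication graph
before = is_connected({2, 4}, edges)         # N_3 = {2, 4}, no relay
with_node3 = is_connected({2, 3, 4}, edges)  # node 3 added as a relay
with_node1 = is_connected({1, 2, 4}, edges)  # node 1 added as a relay
```

Either relay choice restores connectedness in this graph; which one is preferable would depend on the storage and communication cost at the relay node.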

4.3 Comparison to Hu et al.

In [37], a similar idea of localization for convex feasibility problems (CFPs) is proposed, where they also exploit the sparsity of the constraints to reduce the storage and communication required. Our framework is different from theirs in two aspects. First, in their framework each node $i$ owns a part of the variable tuple $\mathbf{x}^{i}$ , whose corresponding “dependency network graph” $(\mathcal{N}_{i},\{(i,j):j\in\mathcal{N}_{i}\})$ must be a subgraph of $i$ ’s local graph $(N(i),\{(j,k)\in\mathcal{E}:j\in N(i)\text{ and }k\in N(i)\})$ where $N(i)$ is $i$ ’s neighbors in the communication graph $\mathcal{G}$ . On the other hand, in our framework each part of the variable tuple $\mathbf{x}^{m}$ does not have a specific relation with the nodes, and the local dependency set $\mathcal{N}_{m}$ is only required to be a connected component in $\mathcal{G}$ , which is more general in the sense that their framework is a sub-case of ours. Second, their dependency is built into the constraint sets, while ours is based on the dependency of the objective functions. This difference arises from the nature of CFPs and optimization problems. Namely, we focus on finding the optimal solution of optimization problems while they aim to find a solution lying in the intersection of a collection of sets. Although one would be able to solve convex optimization with CFP algorithms [9], it is still unclear whether this could be generalized to the case of non-convex optimization.

4.4 An Example of Approximation Functions Satisfying Assumption N

The Lasry-Lions envelope or double envelope [21][32] is a class of approximation functions that serves this purpose. We use this to illustrate the feasibility of our approach but also point out that any such sequence of functions satisfying the assumptions and conditions can be used instead.

Definition 4.1.

The double envelope, or Lasry-Lions envelope [21][32], of a function $f$ is defined by

[TABLE]

*where $0<s<t<\infty$ . *

Fact 1.

*If $f$ is lower bounded, then $\nabla f_{t,s}(\cdot)$ is Lipschitz continuous with constant $\max\left\{\frac{1}{s},\frac{1}{t-s}\right\}$ . *

Fact 2.

*$f_{t,s}\rightarrow f$ pointwise as $s,t\rightarrow 0$ . If further we have $f$ being uniformly continuous, then $f_{t,s}\rightarrow f$ uniformly as $s,t\rightarrow 0$ . Furthermore, $\nabla f_{t,s}\rightarrow\nabla f$ pointwise as $s,t\rightarrow 0$ .*

Proof 4.2.

*See [3]. *
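For orientation, the Lasry-Lions regularization is usually written in the literature as an infimal convolution followed by a supremal one. The display below is that standard form, stated under the assumption that it agrees with Definition 4.1 up to notation; the dummy variables $\mathbf{y}$ , $\mathbf{z}$ and the intermediate symbol $e_{t}f$ are introduced here only for illustration.

```latex
e_{t}f(\mathbf{y}) = \inf_{\mathbf{z}} \left\{ f(\mathbf{z}) + \frac{1}{2t}\|\mathbf{z}-\mathbf{y}\|^{2} \right\},
\qquad
f_{t,s}(\mathbf{x}) = \sup_{\mathbf{y}} \left\{ e_{t}f(\mathbf{y}) - \frac{1}{2s}\|\mathbf{y}-\mathbf{x}\|^{2} \right\},
\qquad 0 < s < t < \infty .
```

In this form the inner step is a Moreau envelope with parameter $t$ , and the outer step is a sup-convolution with the smaller parameter $s$ , which is why both $\frac{1}{s}$ and $\frac{1}{t-s}$ appear in the Lipschitz constant of Fact 1.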

Now it is clear that if we define

[TABLE]

then we have $\nabla f^{*}_{i,n}$ being globally Lipschitz continuous with constant $L_{i,n}$ . Since $U$ is coercive (Assumption A3), we can restrict our attention to some compact set in $\mathcal{K}$ , where $f_{i}$ is uniformly continuous. Then over this set, we will have $\lim_{n\rightarrow\infty}f^{*}_{i,n}\rightarrow f_{i}$ uniformly as well.

4.5 An Example of Sequences Satisfying the Conditions Using $p$ -series

Examples of the tuples $(\alpha[n],\epsilon[n],L^{\max}_{n},L^{\min}_{n},\tau^{\min}_{n})$ satisfying the conditions of Theorem 3.1 exist with all schedules in $p$ -series form. Assume $\alpha[n]=\alpha_{0}n^{-\beta}$ , $\epsilon[n]=\epsilon_{0}n^{-\gamma}$ , $L^{\max}_{n}=L^{\min}_{n}=L_{0}n^{\lambda}$ , and $\tau^{\min}_{n}=\tau_{0}n^{-\delta}$ for some positive constants $\alpha_{0}$ , $\epsilon_{0}$ , $L_{0}$ , and $\tau_{0}$ . Then the constraints on the parameters are

[TABLE]

A possible tuple $(\beta,\gamma,\lambda,\delta)$ satisfying the above constraints is $(0.9,0.2,0.05,0.1)$ . If the $\nabla f_{i}$ ’s are Lipschitz continuous and the $\tau_{i}>0$ are constant for all $i$ , then $\lambda=\delta=0$ and the above requirements reduce to $0.5<\beta\leq 1$ and $\gamma>1-\beta$ as in [24].
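A candidate exponent tuple can be checked mechanically. The sketch below derives inequalities directly from the series conditions of Theorem 3.1 (a) and from the limit condition on $\alpha[n](L^{\max}_{n})^{5}/(\tau^{\min}_{n})^{3}$ in (b), using the fact that $\sum_{n}n^{-p}$ converges if and only if $p>1$ . It is an illustration under the $p$ -series forms assumed above; the conditions in (b) that involve $\eta^{\max}_{n}$ and $\zeta_{n}$ depend on the chosen approximation functions and are not covered. The function name is hypothetical.

```python
from fractions import Fraction

def p_series_conditions(beta, gamma, lam, delta):
    """Exponent tests for alpha ~ n^-beta, eps ~ n^-gamma, L ~ n^lam, tau ~ n^-delta."""
    # Exact rationals avoid floating-point issues at boundary cases such as beta + delta = 1.
    b, g, l, d = (Fraction(str(v)) for v in (beta, gamma, lam, delta))
    return {
        # sum L^3 (alpha/tau)^2 ~ sum n^(3l - 2b + 2d) must converge
        "sum L^3 (alpha/tau)^2 finite": 2 * b - 3 * l - 2 * d > 1,
        # sum tau * alpha ~ sum n^-(b + d) must diverge
        "sum tau*alpha infinite": b + d <= 1,
        # sum alpha * L * eps ~ sum n^-(b + g - l) must converge
        "sum alpha*L*eps finite": b + g - l > 1,
        # alpha L^5 / tau^3 ~ n^(5l + 3d - b) must vanish
        "alpha L^5 / tau^3 vanishes": b - 5 * l - 3 * d > 0,
    }

checks = p_series_conditions(0.9, 0.2, 0.05, 0.1)
lipschitz_case = p_series_conditions(0.6, 0.5, 0, 0)   # lambda = delta = 0
too_slow = p_series_conditions(0.5, 0.6, 0, 0)          # beta = 0.5 is excluded
```

With $\lambda=\delta=0$ the first three tests reduce to $0.5<\beta\leq 1$ and $\gamma>1-\beta$ , matching the degenerate case mentioned above.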

4.6 Examples of Objectives with Unbounded Gradients

When it comes to non-Lipschitz gradients, a first example might be functions with unbounded gradients, but this need not be the only case; consider, for example, $x\sqrt{x}$ on $[0,1]$ . Its derivative is $\frac{3\sqrt{x}}{2}$ , which is obviously bounded on $[0,1]$ but actually not Lipschitz continuous on $[0,1]$ . Though convergence is not established for this case in [24], NEXT still works well numerically in this kind of simple example. Next we turn to more challenging examples with unbounded gradients.
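Returning briefly to the $x\sqrt{x}$ example, the failure of Lipschitz continuity can be seen from difference quotients of the derivative at the origin. The following is a small numerical illustration; the function name `g` and the step sizes are chosen here for convenience.

```python
import math

def g(x):
    """Derivative of x * sqrt(x) on [0, 1]."""
    return 1.5 * math.sqrt(x)

# |g(h) - g(0)| / h = 1.5 / sqrt(h) grows without bound as h -> 0,
# so no finite Lipschitz constant works, although g stays within [0, 1.5].
steps = (1e-2, 1e-4, 1e-6)
quotients = [(g(h) - g(0.0)) / h for h in steps]
largest_value = max(g(k / 100.0) for k in range(101))
```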

Example.

[Interior local optimum] Consider $\mathcal{N}=\{1,2,3\}$ in a triangle network, with $f_{1}(x)=x^{2}-(\ln 2)x$ , $f_{2}(x)=x\ln\left(\frac{8}{x^{2}}\right)$ , and $f_{3}(x)=-x\ln\left(\frac{4}{x^{2}}\right)$ , where $x\in\mathcal{K}=[-1,2]$ . There is no $G$ . The unique stationary solution is $x^{*}=0$ as $F(x)=x^{2}$ . The derivatives for the node objectives are $f_{1}^{\prime}(x)=2x-(\ln 2)$ , $f_{2}^{\prime}(x)=\ln\left(\frac{8}{x^{2}}\right)-2$ , and $f_{3}^{\prime}(x)=-\ln\left(\frac{4}{x^{2}}\right)+2$ . Obviously the derivatives are unbounded at $x=0$ for $f_{2}^{\prime}(x)$ and $f_{3}^{\prime}(x)$ , and this example is thus not covered in the theory developed in [24].

On the other hand, we can simply choose the following approximation functions: $f^{*}_{2,n}(x)=x\ln\left(\frac{8}{x^{2}+1/p(n)}\right)$ and $f^{*}_{3,n}(x)=-x\ln\left(\frac{4}{x^{2}+1/p(n)}\right)$ with derivatives being ${f^{*}_{2,n}}^{\prime}(x)=\ln\left(\frac{8}{x^{2}+1/p(n)}\right)-\frac{2x^{2}}{x^{2}+1/p(n)}$ and ${f^{*}_{3,n}}^{\prime}(x)=-\ln\left(\frac{4}{x^{2}+1/p(n)}\right)+\frac{2x^{2}}{x^{2}+1/p(n)}$ . We can choose $f^{*}_{1,n}(x)=f_{1}(x)$ throughout. One may check that $L_{1,n}=O(1)$ and $L_{2,n}=L_{3,n}=O(p(n)^{1/2})$ . The conditions in Theorem 3.1 (b) are checked in the following.

•

$(\alpha[n],L^{\max}_{n},\tau^{\min}_{n})$ series conditions: suppose we choose $\alpha[n]=\alpha_{0}n^{-0.7}$ and $\tau^{\min}_{n}=\tau>0$ , since $L^{\max}_{n}=O(p(n)^{1/2})$ , if we further choose $p(n)=n^{0.2}$ then it is evident that all conditions are satisfied.

•

$\sum_{n=0}^{\infty}\zeta_{n}<\infty$ : we actually have $F^{*}_{n}(x)=F(x)$ for all $x$ and hence $\zeta_{n}=0\enspace\forall\enspace n$ .

•

$\lim_{n\rightarrow\infty}\frac{\eta^{\max}_{n}}{\tau^{\min}_{n}}=0$ : the maximum differences of derivatives for node 2 and 3 always occur at $x=0$ , and $\nabla f^{*}_{2,n}(0)=\nabla f^{*}_{3,n}(0)=O(\log p(n))=O(\log n)$ for the choice $p(n)=n^{0.2}$ . The condition holds because $\log n-\log(n-1)=O(1/n)\rightarrow 0$ .

•

$\sum_{n=0}^{\infty}\frac{\alpha[n]L^{\max}_{n}\eta^{\max}_{n}}{\tau^{\min}_{n}}<\infty$ : notice that for our choices of $\alpha[n]=\alpha_{0}n^{-0.7}$ and $p(n)=n^{0.2}$ the summed term equals $O(n^{-0.7+0.2/2-1})$ and thus is summable.

•

$\lim_{n\rightarrow\infty}\nabla F^{*}_{n}\rightarrow\nabla F$ : notice $\nabla F^{*}_{n}(x)=\nabla F(x)$ for all $x$ .

As we have $\nabla F$ finite in $\mathcal{K}$ , by Theorem 3.1 it is guaranteed that our algorithm converges to the unique stationary solution $x=0$ , which is also a global minimum in this example. Note that this minimum lies in the interior of $\mathcal{K}$ .
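Two of the claims used in this example are easy to confirm numerically: the smoothing terms of nodes 2 and 3 cancel in the sum, so $F^{*}_{n}(x)=x^{2}$ for every $n$ , and the derivative of $f^{*}_{2,n}$ at $x=0$ equals $\ln(8p(n))$ , which grows like $\log n$ . The sketch below uses $p(n)=n^{0.2}$ as in the text; the function names and sample points are hypothetical.

```python
import math

LN2 = math.log(2.0)

def f1(x):
    return x * x - LN2 * x

def f2_star(x, p):
    return x * math.log(8.0 / (x * x + 1.0 / p))

def f3_star(x, p):
    return -x * math.log(4.0 / (x * x + 1.0 / p))

def F_star(x, p):
    return f1(x) + f2_star(x, p) + f3_star(x, p)

def df2_star(x, p):
    """Derivative of f2_star with respect to x."""
    c = 1.0 / p
    return math.log(8.0 / (x * x + c)) - 2.0 * x * x / (x * x + c)

def p_of_n(n):
    return n ** 0.2

# Largest deviation of F*_n from x^2 over a few iterations and points in K = [-1, 2].
gap = max(abs(F_star(x, p_of_n(n)) - x * x)
          for n in (1, 10, 1000) for x in (-1.0, -0.3, 0.0, 0.7, 2.0))

# Derivative of f*_{2,n} at the origin for increasing n.
slopes = [df2_star(0.0, p_of_n(n)) for n in (1, 10, 1000)]
```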

Example.

[Boundary local optimum] Consider a one node network, and the objective function is $f(x,y)=\sqrt{1-(x^{2}+y^{2})}+\frac{x}{2}$ on the region inside the unit circle $\mathcal{K}=\{(x,y):x^{2}+y^{2}\leq 1\}$ , so that the graph of $f(x,y)$ , namely $(x,y,f(x,y))$ , is the upper half of the unit sphere lifted in the direction of positive $x$ -axis. For the upper half of the unit sphere, the set of global minima is the unit circle; now that we tilt the sphere, the unique global minimum is $(-1,0)$ . There is no stationary solution inside the unit circle. Since the gradient

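$$\nabla f(x,y)=\left(\frac{-x}{\sqrt{1-(x^{2}+y^{2})}}+\frac{1}{2},\ \frac{-y}{\sqrt{1-(x^{2}+y^{2})}}\right)$$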

is unbounded on the unit circle, the definition of stationary solution fails there. This example again clearly lies outside the theory of [24].

We consider the approximation function $f^{*}_{n}(x,y)=\sqrt{1-(x^{2}+y^{2})+1/p(n)}+\frac{x}{2}$ . Its gradient can also be obtained by changing the $1$ ’s in the denominators of the original gradient to $1+1/p(n)$ . We have $L_{n}=O(p(n)^{3/2})$ .

•

$(\alpha[n],L^{\max}_{n},\tau^{\min}_{n})$ series conditions: choose $\alpha[n]=\alpha_{0}n^{-0.8}$ and $\tau^{\min}_{n}=\tau>0$ , and $p(n)=n^{0.1}$ .

•

$\sum_{n=0}^{\infty}\zeta_{n}<\infty$ : notice that $F^{*}_{n}\downarrow F$ monotonically. Since the largest decrease from $F^{*}_{n-1}$ to $F^{*}_{n}$ always occurs on the unit circle, the sum $\sum_{n=0}^{\infty}\zeta_{n}$ is simply $F^{*}_{1}-F$ evaluated on the unit circle, which is $\sqrt{2}$ .

•

$\lim_{n\rightarrow\infty}\frac{\eta^{\max}_{n}}{\tau^{\min}_{n}}=0$ : roughly $\|\nabla f^{*}_{n}\|=O(\sqrt{p(n)})=O(n^{0.05})$ for the choice of $p(n)=n^{0.1}$ . The condition is true because $n^{0.05}-(n-1)^{0.05}=O(n^{-0.95})\rightarrow 0$ .

•

$\sum_{n=0}^{\infty}\frac{\alpha[n]L^{\max}_{n}\eta^{\max}_{n}}{\tau^{\min}_{n}}<\infty$ : the term decays at the rate $O(n^{-0.8+0.15-0.95})$ , which is summable.

•

$\lim_{n\rightarrow\infty}\nabla F^{*}_{n}\rightarrow\nabla F$ : this one is obvious as we only have one node.

Our method will not converge to any point in $int(\mathcal{K})$ ; otherwise, by part (b) of Theorem 3.1 we know it must be a stationary solution, but there is no stationary solution inside the unit circle. Thus, our method will converge to some point in $bd(\mathcal{K})$ ; since $\nabla F$ is infinite on the unit circle, this falls into the case of Theorem 3.1 (c), and the point is not guaranteed to be a local minimum. Even so, we find in Section 6.1 that for a wide range of initialization of $x$ , our algorithm can converge to $(-1,0)$ while NEXT does not. Unfortunately, since the objective is symmetric with respect to $x$ -axis and the unique global maximum is $(\frac{1}{\sqrt{5}},0)$ , if we start with any point to the right of the maximum on the $x$ -axis, the process will inevitably approach $(1,0)$ in the limit.
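The basin structure described above follows from the sign of the $x$ -derivative along the $x$ -axis: a descent step moves left where the derivative is positive and right where it is negative. The sketch below evaluates that derivative for the exact objective and, on the boundary, for the smoothed objective with $p(n)=n^{0.1}$ at $n=10$ ; the function name and the sample points are hypothetical.

```python
import math

def dfdx_on_axis(x, p=None):
    """x-derivative of sqrt(1 - x^2 + 1/p) + x/2 along y = 0.

    p=None means the exact objective f (no smoothing term).
    """
    c = 0.0 if p is None else 1.0 / p
    return -x / math.sqrt(1.0 - x * x + c) + 0.5

x_max = 1.0 / math.sqrt(5.0)
at_max = dfdx_on_axis(x_max)                    # zero: the maximum on the axis
left = dfdx_on_axis(0.0)                        # positive: descent moves toward (-1, 0)
right = dfdx_on_axis(0.9)                       # negative: descent moves toward (1, 0)
at_boundary = dfdx_on_axis(1.0, p=10.0 ** 0.1)  # finite only because of smoothing
```

The exact derivative is undefined at $x=\pm 1$ , which is why the last evaluation uses the smoothed objective.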

We will see in Section 6.1 that NEXT fails numerically in the above examples while our method works much better.

5 Application to Resource Allocation

We now describe how to apply our algorithmic framework to wireless resource allocation, and along the way also describe how the two issues that motivated our generalizations arise.

5.1 Problem Formulation

We consider an OFDMA wireless cellular network, where a set of base stations (BSs) $B$ transmit downlink data to users through the set of channels or resource blocks (RBs) $K$ . For a BS $b\in B$ , $I_{b}$ denotes the set of users associated with it, which is a fixed input. The transmitted power of BS $b$ in channel $k$ is denoted by $p_{bk}$ , and the maximum total power transmitted by BS $b$ is limited to $P_{b}$ . The allocation variable of BS $b$ to user $i$ in channel $k$ is denoted by $x_{bik}$ , with gain $g_{bik}$ : $x_{bik}=1$ means $b$ transmits to $i$ in the RB $k$ , and $x_{bik}=0$ otherwise. For all users, we also introduce a scheduling weight, $w_{i}$ for user $i$ . Finally, $\sigma^{2}$ is the variance of the independent zero mean additive white Gaussian noise (AWGN) for all BSs. We assume that BS $b$ only possesses the information of $\{g_{b^{\prime}ik}:b^{\prime}\in B,i\in I_{b},k\in K\}$ . In other words, BS $b$ can only compute the weighted-sum rate of the users associated with itself (knowing the powers of the other base stations). This is a reasonable assumption, as each user equipment (UE) reports its measured channel gains to the BS it is associated with, whereas the channel gains of UEs served by other BSs are unknown.

When the BS $b$ transmits a non-zero power in channel $k$ , it interferes with all the other transmissions in channel $k$ . However, owing to propagation-based loss, the powers of nearby BSs will dominate the whole interference term. Hence, with the definition that $N(b)$ is the neighboring BSs of BS $b$ , we can neglect all the interference from $b^{\prime}\not\in N(b)$ to $b$ . This is a modeling assumption that is reasonably accurate in practice, and will be in force henceforth. For ease of exposition we assume that neighbor relation is mutual, i.e. $b^{\prime}\in N(b)$ if and only if $b\in N(b^{\prime})$ . In terms of the interference graph, where nodes are BSs and edges only exist between BSs that interfere with each other, we reduce a complete graph to an undirected and connected one. Our work can be trivially extended to the directed case assuming that strong connectivity holds.

We consider a one-shot weighted sum-rate maximization problem, subject to the allocation limit constraint, the power limit constraint, non-negative power constraint, and the fact that $x_{bik}$ is either $0$ or $1$ ; we will justify the weighted sum-rate maximization problem in the next section. To make the overall constraint set convex, we relax the integer constraints on $x_{bik}$ as in [16, 17]. In future work we will study appropriate integer rounding schemes.

The joint power control and scheduling problem (P1) is then formalized as:

[TABLE]

where $\Gamma_{bik}=p_{bk}g_{bik}$ and $\bar{\Gamma}_{bik}=\sum_{b^{\prime}\in N(b)}p_{b^{\prime}k}g_{b^{\prime}ik}$ are the signal and interference for user $i$ in channel $k$ . We use $p_{BK}$ to refer to the collection of the variables $p_{bk}\enspace\forall\enspace b\in B,k\in K$ ; also, $x_{BI(B)K}$ can be viewed in a similar way, where $I(B)\triangleq\bigcup_{b\in B}I_{b}$ . This is a shorthand for easy referencing.

As we will solve the problem in a distributed manner, we let each BS maintain the decision variables $p_{BK}$ . Denote the copy of $p_{b^{\prime}k}$ at BS $b$ by $p^{b}_{b^{\prime}k}$ for all $b^{\prime}\in B,k\in K$ . The idea is to perform the optimization at each BS, and then enforce consensus of the decision variables among all BSs, transforming (P1) into (P2) given in the following:

[TABLE]

where $\Gamma^{b}_{bik}=p^{b}_{bk}g_{bik}$ and $\bar{\Gamma}^{b}_{bik}=\sum_{b^{\prime}\in N(b)}p^{b}_{b^{\prime}k}g_{b^{\prime}ik}$ . The set of constraints includes all the constraints in Eq. 7 now with $p^{b}_{b^{\prime}k}$ and the second and fourth constraints hold for all copies at $b\in B$ , and an additional constraint that the consensus is reached $p^{b_{1}}_{b^{\prime}k}=p^{b_{2}}_{b^{\prime}k}\quad\forall\enspace b_{1},b_{2},b^{\prime}\in B,k\in K$ .

We can split the $\log(\cdot)$ term in the objective into two parts, $x_{bik}\log(\sigma^{2}+\Gamma^{b}_{bik}+\bar{\Gamma}^{b}_{bik})$ and $-x_{bik}\log(\sigma^{2}+\bar{\Gamma}^{b}_{bik})$ , and then modify the former to be $x_{bik}\log(\sigma^{2}+\frac{\Gamma^{b}_{bik}+\bar{\Gamma}^{b}_{bik}}{x_{bik}})$ as in [16, 17], which is jointly strictly concave in $x_{bik}$ and $p^{b}_{Bk}$ . We define the modified function to be 0 when $x_{bik}=0$ so that it is continuous. Then (P2) becomes the following:

[TABLE]

subject to the same set of constraints as in Eq. 8. Note that (P2) and (P3) are the same if $x_{bik}$ ’s are restricted to be integers, that is, $x_{bik}\in\{0,1\}$ .
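The equivalence at integer points can be checked numerically. The sketch below compares the modified term $x\log(\sigma^{2}+s/x)$, extended by $0$ at $x=0$, with the original term $x\log(\sigma^{2}+s)$, where $s$ stands for $\Gamma^{b}_{bik}+\bar{\Gamma}^{b}_{bik}$. The function names and the numerical values are illustrative assumptions.

```python
import math

def perspective_log(x, s, sigma2):
    """x * log(sigma2 + s / x), extended by 0 at x = 0 (s = Gamma + Gamma_bar >= 0)."""
    if x == 0.0:
        return 0.0
    return x * math.log(sigma2 + s / x)

def original_log(x, s, sigma2):
    """The unmodified term x * log(sigma2 + s) from (P2)."""
    return x * math.log(sigma2 + s)

sigma2, s = 0.01, 3.0  # hypothetical noise power and total received power
gap_at_0 = perspective_log(0.0, s, sigma2) - original_log(0.0, s, sigma2)
gap_at_1 = perspective_log(1.0, s, sigma2) - original_log(1.0, s, sigma2)
gap_at_half = perspective_log(0.5, s, sigma2) - original_log(0.5, s, sigma2)
```

The two terms agree at $x\in\{0,1\}$ and differ at fractional $x$, so the objectives of (P2) and (P3) coincide only on integer allocations.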

We will now reiterate the two issues we identified earlier with existing distributed optimization approaches but in the specific context of Eq. 9. If we take the approach in [24], then every BS would need to keep a copy of all the decision variables, both $p_{BK}$ and $x_{BI(B)K}$ , and perform consensus on them and also any relevant gradient terms. This is simply impractical and forces the localization idea. We also note that Eq. 9 contains functions of the form $x\log\big{(}a+(p+p^{\prime})/x\big{)}-x\log(a+p^{\prime})$ that are smooth but where the gradients are not Lipschitz; in particular $x\log(a+(p+p^{\prime})/x)$ has some terms of its gradient becoming infinite when $x\downarrow 0$ . These functions clearly fall outside the framework of [24], and force approaches like our proximal approximations scheme. In the following, we apply the developed distributed optimization framework to the problem in Eq. 7-Eq. 9. In the problem, the set of BSs in the cellular network $B$ corresponds to $\mathcal{N}$ in the framework, a BS $b$ corresponds to a node $i$ , and the set of edges $\mathcal{E}$ in the framework is the one-tier interference graph here. There are many different ways to apply our framework to the resource allocation, which we discuss in detail next.
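To make the non-Lipschitz behavior concrete, note that for $x>0$ and $s=p+p^{\prime}$ the partial derivative of $x\log(a+s/x)$ with respect to $x$ is $\log(a+s/x)-\frac{s/x}{a+s/x}$, whose first term diverges as $x\downarrow 0$ while the second stays below $1$. The following sketch evaluates this derivative along a decreasing sequence of $x$; the values of $a$ and $s$ are illustrative assumptions.

```python
import math

def d_dx(x, a, s):
    """Partial derivative in x of x * log(a + s / x), for x > 0 and s = p + p'."""
    ratio = s / x
    return math.log(a + ratio) - ratio / (a + ratio)

a, s = 0.01, 1.0  # hypothetical noise level and total power
# Evaluate at x = 1e-1, 1e-2, ..., 1e-8.
derivs = [d_dx(10.0 ** (-e), a, s) for e in range(1, 9)]
```

The values in `derivs` grow without bound as $x$ shrinks, so no single Lipschitz constant for the gradient can hold on the whole feasible set.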

5.2 Direct Method

In this method, we directly let $f_{b}$ be the negative of the weighted sum-rate of BS $b$ : $f_{b}=-\sum_{i,k}w_{i}x_{bik}\log\left(1+\frac{\Gamma_{bik}}{\sigma^{2}+\bar{\Gamma}_{bik}}\right)$ , and $G=0$ . In the first version of this method, every BS $b$ keeps the powers of all BSs as decision variables, that is, we have $p^{b}_{b^{\prime}k}$ for all $b,b^{\prime}\in B$ . However, instead of keeping $x^{b}_{b^{\prime}ik}$ for all $b^{\prime}\in B$ as decision variables at BS $b$ , as an exact application of NEXT would require, we let every BS $b$ keep only its own $x^{b}_{bik}$ , which we denote by $x_{bik}$ for short. As we see from Section 3.3, this suffices because BS $b$ dictates the decision of $x_{bik}$ .

If we reconsider the problem from the perspective of our localization framework in Section 3.1, there are $2|B|$ local dependency sets: $|B|$ local dependency sets $\{b\}$ for all $b\in B$ , which correspond to the variables $x_{bI_{b}K}$ , and $|B|$ local dependency sets $Nb(b)\triangleq N(b)\cup\{b\}$ for all $b\in B$ , which correspond to the variables $p_{bK}$ . In the first version of the direct method, referred to as the Localized X Globalized P-diRect Method (LXGP-RM) algorithm, only $x_{BI(B)K}$ follows the localization framework; $p_{BK}$ is still globalized in the sense that every BS keeps a copy of the whole variable.

We only describe the algorithm in words here; for the pseudocode see Appendix B. The algorithm basically proceeds as Algorithm 1 with the common variable $p_{BK}$ and $|B|$ local variables $x_{bI_{b}K}\enspace\forall\enspace b$ (and without approximation functions since the objectives have Lipschitz gradients). At BS $b$ we use $r^{b}_{b^{\prime}k}$ to track $\frac{1}{|B|}\sum_{b^{\prime\prime}\in B}\frac{\partial f_{b^{\prime\prime}}}{\partial p_{b^{\prime}k}}$ and $\tilde{\pi}^{b}_{b^{\prime}k}$ to track $\sum_{b^{\prime\prime}\neq b\in B}\frac{\partial f_{b^{\prime\prime}}}{\partial p_{b^{\prime}k}}$ , which correspond to $\mathbf{y}$ and $\mathbf{\pi}$ in Algorithm 1. In each iteration, we let $\alpha[n]=\frac{\alpha_{0}}{(n+1)^{\beta}}$ , and BS $b$ performs the minimization of

[TABLE]

with respect to the variables $p^{b}_{BK}$ and $x_{bI(b)K}$ , where $\bar{p}^{b}_{BK}$ , $\bar{x}_{bI(b)K}$ , and $\tilde{\pi}^{b}_{BK}$ denote the current iterates of the corresponding variables. The surrogate function $\tilde{f}_{b}(\mathbf{p}^{b},\mathbf{x}_{b};\bar{\mathbf{p}}^{b},\bar{\mathbf{x}}_{b})$ is chosen as

[TABLE]

where $f_{b}$ and $\frac{\partial f_{b}}{\partial p^{b}_{b^{\prime}k}}$ are functions of $(\bar{\mathbf{p}}^{b},\bar{\mathbf{x}}_{b})$ , but $\frac{\partial f_{b}}{\partial x_{bik}}$ is just a function of $\bar{\mathbf{p}}^{b}$ . The quadratic term in Eq. 11 maintains the strict convexity of the surrogate. Finally, we use a universal doubly stochastic matrix $W$ to average $p_{BK}$ and $r_{BK}$ so that they reach consensus. We remark that, since the objective in Eq. 10 has terms of at most second order and our constraints are linear, the minimization can be solved efficiently using quadratic programming (QP) with the coefficient matrices of the quadratic term being positive-semidefinite [20].

The second version of the direct method, termed the Localized X Localized P-diRect Method (LXLP-RM) algorithm, makes better use of our localization idea than LXGP-RM: for all $b\in B$ we now only maintain the variables $p^{b}_{b^{\prime}k}$ for $b^{\prime}\in Nb(b)$ in BS $b$ , as the BSs in $Nb(b^{\prime})$ dictate the decision of $p_{b^{\prime}k}$ ; the variables $x_{BI(B)K}$ still follow the localization framework just as before in LXGP-RM. The main change from LXGP-RM is that the index set $B$ of the variable tuple is now replaced by $Nb(b)$ , as BS $b$ no longer keeps the variable $p^{b}_{b^{\prime}k}$ for $b^{\prime}\not\in Nb(b)$ . As a result, the steps regarding the weighted sum for reaching consensus need to be modified. We introduce a matrix $W(b)$ for each BS $b$ . The matrix $W(b)$ specifies the weighting on the local dependency set $Nb(b)$ associated with the variable $p_{bK}$ , that is, on the subgraph $\mathcal{G}(b)=(Nb(b),\mathcal{E}(b))$ where $\mathcal{E}(b)=\{(i,j)\in\mathcal{E}:i,j\in Nb(b)\}$ if the whole network is $\mathcal{G}=(B,\mathcal{E})$ . Its $i$ -th row $W_{i:}(b)$ is zero if and only if $i\not\in Nb(b)$ , and its $j$ -th column $W_{:j}(b)$ is zero if and only if $j\not\in Nb(b)$ . After deleting all these zero rows and columns, it becomes a doubly-stochastic matrix, as described in Assumption L4.

The second version of the direct method still follows the framework of Algorithm 1, with $2|B|$ local variables $p_{bK},x_{bI_{b}K}\enspace\forall\enspace b$ and no common variable. Now at BS $b$ we use $r^{b}_{b^{\prime}k}$ to track $\tfrac{1}{|Nb(b^{\prime})|}\sum_{b^{\prime\prime}\in Nb(b^{\prime})}\tfrac{\partial f_{b^{\prime\prime}}}{\partial p_{b^{\prime}k}}$ , and $\tilde{\pi}^{b}_{b^{\prime}k}$ to track $\sum_{b^{\prime\prime}\neq b\in Nb(b^{\prime})}\tfrac{\partial f_{b^{\prime\prime}}}{\partial p_{b^{\prime}k}}$ , where $b^{\prime}\in Nb(b)$ . Also, the surrogate $\tilde{f}_{b}$ is changed slightly so that the quadratic term of $p^{b}_{b^{\prime}k}$ now only sums over $b^{\prime}\in Nb(b)$ instead of $b^{\prime}\in B$ .

We finally remark that one can also consider a GXGP-RM algorithm (basically NEXT from [24]) where copies of both the power and allocation variables are maintained at each node, or even a GXLP-RM algorithm where localization is performed only for the power variables. Note that only the fully localized scheme, i.e. LXLP-RM, will be scalable in practice. However, we will evaluate its performance relative to the other schemes.

5.3 Decomposed Method

In (P3), there is a part of the objective that is concave (or convex after negation), and the optimization of this part should be easy. The algorithm might run faster if we properly exploit this fact. To achieve this goal, let us assume that the channel gains $g_{bik}$ ’s are known to all BSs. Then we could apply the framework in Section 3 by letting $f_{b}=\sum_{i,k}w_{i}x_{bik}\log(\sigma^{2}+\bar{\Gamma}_{bik})$ and $G=-\sum_{b,i,k}w_{i}x_{bik}\log\big{(}\sigma^{2}+\tfrac{\Gamma_{bik}+\bar{\Gamma}_{bik}}{x_{bik}}\big{)}$ .

As $G$ is in general a function of not only $p_{bk}$ but also $x_{bik}$ for all $b\in B$ and we need to optimize $G$ at every BS, this method does not allow any localization. In other words, the tuple consisting of all decision variables is the common variable $\mathbf{x}^{c}$ in Algorithm 1 itself, and there is no local dependency set besides $\mathcal{N}$ itself. At BS $b$ we need to maintain $p^{b}_{b^{\prime}k}$ as well as $x^{b}_{b^{\prime}ik}$ $\forall\enspace b^{\prime}\in B$ . Note that with the derivatives of $f_{b}$ being Lipschitz continuous and no localization, this method is a direct application of [24].

The algorithm, which we call the Globalized X Globalized P-deComposed Method (GXGP-CM), is also largely the same as LXGP-RM, except that now we have to optimize $G$ as well, and we need to maintain and update $x^{b}_{b^{\prime}ik}$ . The algorithm and the surrogate function also need minor revisions (see Appendix B).

5.4 Partially Linearized Method

While the decomposed method enjoys the benefit of using the intrinsic convex part in the objective, it is impractical since it requires every BS to know all channel gains. We could instead put the convex part in $f_{b}$ as well, and take advantage of it by not linearizing it when forming the surrogate function. Specifically, we let $f_{b}=f_{b\cup}+f_{b\cap}$ where $f_{b\cup}=-\sum_{i,k}w_{i}x_{bik}\log\left(\sigma^{2}+\frac{\Gamma_{bik}+\bar{\Gamma}_{bik}}{x_{bik}}\right)$ and $f_{b\cap}=\sum_{i,k}w_{i}x_{bik}\log\left(\sigma^{2}+\bar{\Gamma}_{bik}\right)$ . Then we can choose the surrogate function $\tilde{f}_{b}(\mathbf{p}^{b},\mathbf{x}_{b};\bar{\mathbf{p}}^{b},\bar{\mathbf{x}}_{b})$ as $f_{b\cup}(\mathbf{p}^{b},\mathbf{x}_{b})+\tilde{f}_{b\cap}(\mathbf{p}^{b},\mathbf{x}_{b};\bar{\mathbf{p}}^{b},\bar{\mathbf{x}}_{b})$ , where $\tilde{f}_{b\cap}$ has exactly the same form as in Eq. 11 (for the LXGP case), i.e., linearized at the current iterate $(\bar{\mathbf{p}}^{b},\bar{\mathbf{x}}_{b})$ plus the quadratic terms.

With $\nabla f_{b}$ not being Lipschitz continuous, we have to apply the approximation functions detailed in Section 3.2 to guarantee convergence. We may choose $f^{*}_{b,n}=f^{*}_{b\cup,n}+f_{b\cap}$ where

[TABLE]

One can easily show that $\nabla f^{*}_{b,n}$ is Lipschitz continuous with constant $L_{b,n}$ equal to the reciprocal of $e[n]$ . We can then choose a schedule of $e[n]\rightarrow 0$ according to Theorem 3.1 and Section 4.5. We refer to this method as the Partially Linearized (PL) algorithm, which could be LXLP or any of the other combinations. Note that we do not have a guarantee of convergence to a stationary solution in this case because the objective function has an unbounded gradient on the boundary.

5.5 Consensus Scheme

Let $\mathcal{G}=(B,\mathcal{E})$ be the BS network we are considering, and let $d_{i}$ be the degree of BS $i$ . The choice of $W$ must meet the following two criteria to conform to Assumption L4: (1) it must be doubly-stochastic; (2) $W_{ij}\geq 0$ is non-zero if and only if $(i,j)\in\mathcal{E}$ . We denote the set of $W$ ’s that satisfy these criteria by $\Omega$ , which is a subset of $\mathbb{R}_{+}^{|B|\times|B|}$ . We choose $W$ as follows:

[TABLE]

where $\bar{d}=\max_{i}d_{i}+1$ . It is easy to verify that this choice of $W$ is row-stochastic. Since $W$ is symmetric, it is then also column-stochastic. By definition of $\bar{d}$ , we will have $W_{ii}>0$ . By construction, $W_{ij}>0$ if $(i,j)\in\mathcal{E},\enspace i\neq j$ .

In the LXLP-RM algorithm we need a $W(b)$ for every BS $b$ . Let $\mathcal{E}(b)=\{(i,j)\in\mathcal{E}:i,j\in Nb(b)\}$ . Then we choose $W(b)$ as described above, with $\mathcal{G}$ replaced by $\mathcal{G}(b)=(Nb(b),\mathcal{E}(b))$ .
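The construction of $W$ and $W(b)$ can be sketched as follows. Since the displayed definition of $W$ is not reproduced here, the sketch assumes the common max-degree rule $W_{ij}=1/\bar{d}$ for $(i,j)\in\mathcal{E}$, $i\neq j$, and $W_{ii}=1-d_{i}/\bar{d}$, which matches the stated properties (symmetric, row-stochastic, $W_{ii}>0$) but may differ from the paper's exact choice. The function names and the path graph are hypothetical.

```python
def max_degree_weights(nodes, edges):
    """Symmetric doubly-stochastic weights on an undirected graph.

    Assumed rule (not taken from the paper's display): W_ij = 1/d_bar on
    edges, W_ii = 1 - d_i/d_bar, with d_bar = max_i d_i + 1.
    """
    nbrs = {v: set() for v in nodes}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    d_bar = max(len(nbrs[v]) for v in nodes) + 1
    W = {i: {j: 0.0 for j in nodes} for i in nodes}
    for i in nodes:
        for j in nbrs[i]:
            W[i][j] = 1.0 / d_bar
        W[i][i] = 1.0 - len(nbrs[i]) / d_bar
    return W

def local_weights(b, nodes, edges):
    """W(b): the same rule applied to the subgraph induced by Nb(b) = N(b) + {b}."""
    closed = {b} | {j for i, j in edges if i == b} | {i for i, j in edges if j == b}
    sub_nodes = [v for v in nodes if v in closed]
    sub_edges = [(i, j) for i, j in edges if i in closed and j in closed]
    return max_degree_weights(sub_nodes, sub_edges)

# Hypothetical 4-BS path graph 0 - 1 - 2 - 3.
nodes = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3)]
W = max_degree_weights(nodes, edges)
W1 = local_weights(1, nodes, edges)   # lives on Nb(1) = {0, 1, 2}
```

Here `local_weights` returns the matrix with the zero rows and columns already deleted, i.e. indexed only by the BSs in $Nb(b)$.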

With symmetric weights $W=W^{T}$ , [23] suggests that the best convergence speed is obtained with the solution of

[TABLE]

For symmetric graphs, the optimized result is $W_{ij}=\frac{1}{d_{i}}$ when $(i,j)\in\mathcal{E}$ and $W_{ij}=0$ otherwise, which is exactly our choice.

6 Simulation Results

In Section 6.1 we compare the performance of Algorithm 1 with that of NEXT on the examples described in Section 4.6, which have unbounded gradients at either an interior point or a boundary point. In Section 6.2 we compare our algorithms with a single-cell scheduling and resource allocation method.

6.1 Approximation Functions

Fig. 2 depicts how the local decision variables in Section 4.6 change for NEXT and Algorithm 1 within $500$ iterations. We start with initial value of $(x_{1}[0],x_{2}[0],x_{3}[0])=(-1,-0.5,-0.25)$ . We choose the following parameters: $\tau=0.1$ (independent of $i$ and $n$ ), $\alpha[n]=0.98n^{-0.7}$ , $p(n)=10n^{0.2}$ ,

[TABLE]

and the surrogate functions are chosen to be a direct linearization plus the quadratic regularization term (see Eq. 11 for an example of this kind of choice). In Fig. 2 (a) we can see that NEXT oscillates and is not numerically stable for this example. Whenever the iterates go near the global minimum at $x=0$ , they jump to values far away. This is because the gradients $\nabla f_{2}$ and $\nabla f_{3}$ are infinite at that point, even though $\nabla F$ is zero there. The tracking of $\bar{x}$ by $(x_{1},x_{2},x_{3})$ and of $\bar{y}$ by $(y_{1},y_{2},y_{3})$ takes place on the fast time-scale (see Appendix D) and actually happens quite fast, as can be seen in the figure. However, a slight mismatch between $(x_{1},x_{2},x_{3})$ and $(y_{1},y_{2},y_{3})$ is sufficient to cause very large $\{\tilde{\mathbf{\pi}}_{i}[n]+\nabla f_{i}[n]\}_{i}$ , which is supposed to be very small when $\bar{x}\approx 0$ , driving the $x_{i}$ ’s to the boundary. If two of them jump to $2$ and one of them to $-1$ , then in the next few iterations they jump up; if two jump to $-1$ and one to $2$ , they jump down.

We cannot rule out theoretically that NEXT converges as $n\rightarrow\infty$ , but we observe that it is still oscillating when $n$ is as large as $10^{4}$ . In contrast, Algorithm 1 essentially converges to the global minimum $x=0$ within $150$ iterations in Fig. 2 (b). In Fig. 2 (c), we show the case when $f^{*}_{2,n}(x)=x\ln\left(\frac{8}{x^{2}+1.1/p(n)}\right)$ , where every condition is satisfied except $\lim_{n\rightarrow\infty}\nabla F^{*}_{n}=\nabla F$ ; one can check that there will be an additional $\ln(1/1.1)$ term. From the figure we see that not only does it exhibit oscillating behavior, but it also seems to converge to a wrong point other than $x=0$ .

The convergence behaviors of NEXT and Algorithm 1 for Section 4.6 are shown in Fig. 3 for two different initializations. The parameters are $\tau=0.05$ , $\alpha[n]=0.98n^{-0.85}$ , and $p(n)=10n^{0.1}$ . The surrogate function is again a direct linearization plus a quadratic regularizer. The 2-D iterates for both algorithms are plotted from red to blue, with NEXT shown as circles and Algorithm 1 as squares. In Fig. 3 (a), we start with $(0.5,0.5)$ , and both algorithms are executed for $5000$ iterations (the plotted dots are downsampled). We see that while our algorithm is converging to the global minimum at $\mathbf{x}=(-1,0)$ , NEXT is “converging” to some point $(-0.2499,0.9683)$ near the boundary. Due to the unbounded gradient near the boundary, the $\frac{1}{2}$ term in $\partial_{x}f(x,y)$ is completely dominated by the remaining term $(-x,-y)/\sqrt{1-(x^{2}+y^{2})}$ , which directs the iterate to descend only in the radial direction. Again we cannot rule out that NEXT converges to $(-1,0)$ if we run it forever; however, it does not visit the region $x<-0.25$ after $10^{5}$ iterations. By slowly changing the objective, we are able to escape this dominance and obtain the correct solution.

In Fig. 3 (b) we start from $\mathbf{x}=(0.6,0)$ , which is to the right of the global maximum $(\frac{1}{\sqrt{5}},0)$ , and run for $100$ iterations; both algorithms converge to $(1,0)$ . Since we start on the $x$ -axis and the gradients always point toward $(1,0)$ , without any perturbation there is no way for a gradient-descent-like method to leave the $x$ -axis. The example in Section 4.6 falls into case (c) of Theorem 3.1, where the algorithm converges to some point on the boundary and $\nabla F$ is not bounded. The definition of stationary solution does not apply, and NEXT fails to converge to the global minimum for both $\mathbf{x}[0]=(0.5,0.5)$ and $\mathbf{x}[0]=(0.6,0)$ , while Algorithm 1 succeeds for $\mathbf{x}[0]=(0.5,0.5)$ but also fails for $\mathbf{x}[0]=(0.6,0)$ .

6.2 The Resource Allocation Application

We adopt the framework of the network utility maximization problem as in [16, 17] where we maximize

[TABLE]

where $U_{i}(\cdot)$ is given by

[TABLE]

$R_{i,t}$ is the average throughput of user $i$ up to time $t$ , $\eta\leq 1$ is the fairness parameter, and $c_{i}$ is a QoS weight. The gradient-based scheduling approach [27] leads to solving the optimization problem given below at each time instance

[TABLE]

This is exactly the one-shot optimization problem we consider in Eq. 7, where $w_{i}=c_{i}(R_{i,t})^{\eta-1}$ is the weight of user $i$ , $r_{i,t}$ is the rate of user $i$ given by the Shannon capacity, and $\mathcal{R}(e_{t})$ is the capacity region, which depends on the current channel state $e_{t}$ and constrains the choice of $r_{i,t}$ in the same way as the constraint set in Eq. 7.
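As an illustration of how the per-slot weights $w_{i}=c_{i}(R_{i,t})^{\eta-1}$ depend on the fairness parameter, the following sketch computes them for a few users. The function name and the throughput values are hypothetical; only the weight formula comes from the text.

```python
def scheduling_weights(avg_throughput, qos, eta):
    """w_i = c_i * R_{i,t}^(eta - 1) for each user i (requires R_{i,t} > 0)."""
    return [c * (r ** (eta - 1.0)) for r, c in zip(avg_throughput, qos)]

R = [4.0, 1.0, 0.25]      # hypothetical average throughputs
c = [1.0, 1.0, 1.0]       # identical QoS weights

w_throughput = scheduling_weights(R, c, eta=1.0)   # all weights equal c_i
w_fair = scheduling_weights(R, c, eta=0.5)         # favors low-throughput users
```

With $\eta=1$ the weights do not depend on past throughput, which corresponds to maximizing total throughput, while $\eta<1$ gives larger weights to users that have been served less.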

Now consider the problem in Eq. 7. A naive solution would be to disregard the interference and solve the resource allocation and scheduling for each cell separately. The optimization for a single cell is well studied in the literature, e.g. in [16]. Specifically, neglecting the interference, for a BS $b$ we can solve

[TABLE]

subject to the constraints. This is a convex problem, and can be solved with existing methods in convex optimization. We call this method the Single-Cell No-Iteration (SC-NI) algorithm.

A refinement of the SC-NI algorithm is to update the interference terms after the first optimization for each cell. We then optimize for each cell again while treating the powers of neighboring BSs as constants, and repeat until convergence. We call this the Single-Cell (SC) algorithm.

We adopt the 19-cell wrap-around model from [18] as the network scenario used in our simulations. Furthermore, each UE associates with exactly one BS and each BS has five UEs associated with it. Suppose a UE is served by a BS. Then there is a signal link between the UE and the BS, while all the neighbors of the BS cause interference to the UE. The time horizon $T$ is chosen to be $20$ . Instead of choosing random locations for the UEs and calculating the path loss, the channel gains are drawn directly from a Rayleigh distribution, with parameter $1$ for an associated BS-UE pair and $0.5$ for interference. The channel gains are assumed to be independent among all links in a scheduling instance and also across all scheduling instances. We use identical QoS weights ( $c_{i}=1$ ). Other parameters include: $|K|=3$ , $\sigma^{2}=0.01$ , $\alpha_{0}=0.99$ , $\beta=0.53$ , and $P_{b}=10\enspace\forall\enspace b$ . For simplicity we treat all scheduling terms $x_{bik}$ as real numbers and use the locally optimal results to compute utilities. In future work we will include integer rounding procedures in the simulations. The entire process is simulated only once, as the multiple time slots already provide an averaging effect.
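For reference, the step-size rule $\alpha[n]=\alpha_{0}/(n+1)^{\beta}$ from Section 5.2 with the values $\alpha_{0}=0.99$ and $\beta=0.53$ listed above can be tabulated as in the sketch below. The function name and the horizon of $1000$ iterations are illustrative assumptions.

```python
def step_size(n, alpha0=0.99, beta=0.53):
    """alpha[n] = alpha0 / (n + 1)^beta."""
    return alpha0 / (n + 1) ** beta

# First 1000 step sizes of the schedule (horizon chosen only for illustration).
alphas = [step_size(n) for n in range(1000)]
```

Because $\beta\leq 1$ the sum of the step sizes diverges, and because $2\beta>1$ the sum of their squares converges, which is the usual diminishing-step behavior for schemes of this type; the precise requirements are those stated in Theorem 3.1.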

Fig. 4 depicts the CDFs of user throughput of algorithms LXGP-RM, LXLP-RM, LXLP-PL, and SC for $\eta=1$ (maximum total throughput) and $\eta=0.5$ , respectively. When $\eta=0.5$ , we can observe that the RM and PL methods stochastically dominate the SC algorithm, and this is also nearly the case when $\eta=1$ . In fact, the RM and PL methods yield a roughly 4-fold average throughput gain over SC method. This is not surprising, as we simulate a rich interference environment, and the new methods can coordinate the scheduled UEs and transmission powers of nearby BSs, while the SC method does not. The RM methods show similar performance as we use the same termination criteria. They are a little bit different from the PL method possibly due to optimizing different objective functions (only equivalent before integer relaxation).

Fig. 5 illustrates the power and SINR distributions of the same four algorithms for $\eta=1$ . The three proposed methods choose one of three values for the power: the maximum, half of the maximum, or zero. This corresponds to assigning either one, two, or no blocks; the two blocks can be assigned to one UE or two. In contrast, the SC method schedules UEs on all three blocks, with the power on each block around $\tfrac{10}{3}$ . At each time instance the three new methods give up serving some subcarriers and users in exchange for boosting the SINR of the scheduled UEs on the chosen blocks. On the other hand, the SC method tries to serve everyone, and ends up with lower power and increased interference. The details of the scheduling decisions are in Table 2.

Table 3 compares the performance of the four algorithms with $\eta=0.5$ and $\eta=1$ . The three coordination-based methods outperform the SC method significantly in this problem instance. The LXGP method always requires more iterations to converge than the fully-localized methods like the LXLP family. When $\eta=0.5$ , LXLP-PL converges much faster than RM methods.

7 Conclusion

In this paper, we generalized existing distributed non-convex optimization methods in two directions. First, we reduced the algorithm storage and communication complexity by exploiting a decomposable structure of the problem, and obtained a localized algorithm. Second, we relaxed the requirement of Lipschitz continuous gradients with a series of slowly-changing approximations. We then applied the developed algorithmic framework in different ways to generate distributed algorithms for the multi-cell resource allocation problem; the flexibility of implementing our framework is also a contribution. We compared these algorithms with the single-cell algorithm via simulation and showed the potential gains of using the distributed optimization methods developed.

Appendix A Generalizations to time-varying graphs

In this appendix we provide less strict assumptions that accommodate the case where the underlying graph is time-varying and directed. The proof of our result is based on this more general setting. The purpose of these assumptions remains the same: a distribution on the set of nodes converges to the uniform distribution exponentially fast under repeated application of the $W$ matrix, which is captured in Lemma B.2.

At each time slot $n$ , the set of nodes $\mathcal{N}$ , along with a set of time-variant directed edges $\mathcal{E}[n]$ , forms a directed graph $\mathcal{G}[n]=(\mathcal{N},\mathcal{E}[n])$ . Node $j$ can only send a message to node $i$ in time slot $n$ if $j\rightarrow i\in\mathcal{E}[n]$ .

**Assumption L´**

(L3´) $\mathcal{G}_{m}[n]$ is $B_{m}$ -strongly connected for all $m\in[M+1]$ , where $\mathcal{G}_{m}[n]=(\mathcal{N}_{m},\mathcal{E}_{m}[n]=\{i\rightarrow j\in\mathcal{E}[n]:i\in\mathcal{N}_{m}\text{ or }j\in\mathcal{N}_{m}\})$ , i.e. $(\mathcal{N}_{m},\bigcup_{n=kB_{m}}^{(k+1)B_{m}-1}\mathcal{E}_{m}[n])$ is strongly connected for all $k\geq 0$ ;

(L4´) For all $m\in[M+1]$ there is a matrix $\mathbf{W}^{m}[n]$ associated with $\mathcal{N}_{m}$ – each entry is non-zero if and only if there is a corresponding edge in $\mathcal{E}_{m}[n]$ , and all positive entries must be greater than or equal to some fixed $\vartheta>0$ . As before, $\mathbf{W}^{m}[n]$ is doubly-stochastic after deleting the zero rows and columns whose indices are not in $\mathcal{E}_{m}[n]$ .
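The $B$-strong connectivity condition in (L3´) can be checked on a finite record of edge sets, as in the sketch below: the union of the edges over each window of $B$ consecutive slots must form a strongly connected graph. The function names and the cyclic example are hypothetical, and the check covers only the complete windows of a finite sequence, whereas the assumption is stated for all $k\geq 0$.

```python
def is_strongly_connected(nodes, edges):
    """True if every node reaches every other node along directed edges."""
    def reachable(start, adj):
        seen, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for v in adj.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    nodes = list(nodes)
    if not nodes:
        return True
    fwd, bwd = {}, {}
    for u, v in edges:
        fwd.setdefault(u, []).append(v)
        bwd.setdefault(v, []).append(u)
    root = nodes[0]
    # Strongly connected iff root reaches all nodes and all nodes reach root.
    return reachable(root, fwd) >= set(nodes) and reachable(root, bwd) >= set(nodes)

def is_B_strongly_connected(nodes, edge_sequence, B):
    """Check each full window [kB, (k+1)B - 1] of a finite edge sequence."""
    windows = len(edge_sequence) // B
    return all(
        is_strongly_connected(
            nodes, {e for slot in edge_sequence[k * B:(k + 1) * B] for e in slot}
        )
        for k in range(windows)
    )

# Hypothetical 3-node example: each slot carries one edge of a directed cycle.
nodes = [0, 1, 2]
slots = [[(0, 1)], [(1, 2)], [(2, 0)]] * 2
```

In this example no single slot is strongly connected, but every window of three slots is, so the sequence satisfies the condition with $B=3$ and fails it with $B=1$.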

Appendix B Proof of the Main Result

In this appendix we prove Theorem 3.1. We start with an intuitive roadmap of the entire proof. Our proof follows the main structure of [24]. The convergence of NEXT basically consists of two parts. On the faster time scale there is the “consensus convergence” of the local variable iterates $\mathbf{x}_{i}$ and local gradient iterates $\mathbf{y}_{i}$ to the average iterate $\bar{\mathbf{x}}$ and $\overline{\nabla f}(\bar{\mathbf{x}})\triangleq\sum_{i\in\mathcal{N}}\nabla f_{i}(\bar{\mathbf{x}})$ (the average gradient evaluated at the average iterate), respectively. On the slower time scale there are the fixed point iterations of $\hat{\mathbf{x}}_{i}(\bullet)$ (defined in Eq. 45), which we call “path convergence”; for details of this viewpoint and its connection to two time-scale stochastic approximation see Appendix D. It is shown in [24] that the fixed points of $\hat{\mathbf{x}}_{i}(\bullet)$ coincide with the stationary solutions. The goal is thus to prove that the iterates converge to the fixed points of $\hat{\mathbf{x}}_{i}(\bullet)$ and that the iterates of all nodes asymptotically agree. Our contributions are introducing a partial dependency structure with a localization scheme and, in addition, successively approximating objective functions whose gradients may be non-Lipschitz. The latter makes the proof significantly harder, as the gradients of the objective functions may now be unbounded.

The core idea of NEXT, as in most primal algorithms, is to perform the following steps iteratively.

• The local optimization step for each node $i$ is to find the value of $\hat{\mathbf{x}}_{i}$ at its iterate $\mathbf{x}_{i}$ . This $\hat{\mathbf{x}}_{i}$ maps the iterate to the optimum of an approximated strongly convex version of the overall objective function $U$ , which involves using a strongly convex surrogate for $i$ ’s own objective function, and a linearized approximation of the other nodes’ objective functions. This step is similar to doing gradient descent in a much more efficient way by taking advantage of the surrogates, and corresponds to the “path convergence” mentioned above.

• The consensus step is where every node communicates its iterate to its neighbors and also takes an average of its neighbors’ iterates. We refer to the $\mathbf{x}_{i}$ ’s asymptotically agreeing on their average $\bar{\mathbf{x}}$ as the “ $x$ -consensus convergence.”
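The consensus step in isolation can be sketched as repeated multiplication by a doubly-stochastic weight matrix. The fragment below omits the local optimization step entirely, and the weight matrix and initial values are hypothetical.

```python
def consensus_step(x, W):
    """One averaging round: x_i <- sum_j W[i][j] * x_j."""
    n = len(x)
    return [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]

# Hypothetical doubly-stochastic weights on a 3-node path graph.
W = [
    [2 / 3, 1 / 3, 0.0],
    [1 / 3, 1 / 3, 1 / 3],
    [0.0, 1 / 3, 2 / 3],
]
x = [-1.0, -0.5, -0.25]
mean0 = sum(x) / len(x)
for _ in range(60):
    x = consensus_step(x, W)
spread = max(x) - min(x)   # disagreement among the nodes after 60 rounds
```

Because the columns of `W` sum to one, the average of the entries of `x` is preserved at every round, and the disagreement `spread` shrinks geometrically toward zero.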

To make the algorithm more practical and fully decentralized, the original Inexact NEXT [24] and our Localized Proximal Inexact NEXT add multiple layers of approximations.

(1) We mentioned that node $i$ linearizes other nodes’ gradients with $\pi_{i}\triangleq\sum_{j\in\mathcal{N}\setminus\{i\}}\nabla f_{j}(\mathbf{x}_{i})$ . However, node $i$ only has the information of $\nabla f_{i}(\mathbf{x}_{i})$ . It hence keeps a variable $\mathbf{y}_{i}$ that tracks the average gradient $\frac{1}{I}\sum_{j\in\mathcal{N}}\nabla f_{j}$ so that node $i$ can approximate $\pi_{i}$ by $\tilde{\mathbf{\pi}}_{i}$ with the knowledge of $\nabla f_{i}(\mathbf{x}_{i})$ and $\mathbf{y}_{i}$ as in Algorithm 1. The convergence of $\mathbf{y}_{i}$ to the average gradient is then the “ $y$ -consensus convergence.” This convergence is also achieved through a gossip-type consensus scheme similar to that used for $\mathbf{x}$ .

(2) With the approximation scheme using the $y$ variable, the fixed point iteration we actually use in Algorithm 1 for the local optimization step is $\tilde{\mathbf{x}}_{i}$ defined in (4), which is similar to $\hat{\mathbf{x}}_{i}$ but uses $\tilde{\mathbf{\pi}}_{i}$ as the linearization constant. This makes the algorithm viable in practice in a fully decentralized scenario. To compare the behaviors of $\hat{\mathbf{x}}_{i}(\bar{\mathbf{x}})$ and $\tilde{\mathbf{x}}_{i}=\tilde{\mathbf{x}}_{i}^{*}(\mathbf{x}_{i},\tilde{\mathbf{\pi}}_{i})$ , we construct a new “averaging system” assuming that the $\mathbf{x}_{i}$ ’s have already converged to $\bar{\mathbf{x}}$ ; this includes $\mathbf{y}_{i}^{av}$ , the average gradient evaluated at $\bar{\mathbf{x}}$ , and $\tilde{\mathbf{x}}_{i}^{av}=\tilde{\mathbf{x}}^{*}(\bar{\mathbf{x}},\tilde{\mathbf{\pi}}_{i}^{av})$ , the local optimization map where $\tilde{\mathbf{\pi}}_{i}^{av}$ computed from $\mathbf{y}_{i}^{av}$ is used as the linearization constant. Note that this “averaging system” is constructed purely for analysis purposes.

(3) We use a sequence of functions $\{\tilde{f}_{i,n}^{*}\}$ to approximate $f_{i}$ , so that the local optimization map with ideal linearization becomes $\hat{\mathbf{x}}_{i,n}$ , and the map with the $\mathbf{y}$ -approximation becomes $\tilde{\mathbf{x}}_{i}^{*}$ , in contrast to just $\tilde{\mathbf{x}}_{i}$ in NEXT. This is the approximation we add on top of what was done in Inexact NEXT [24].

(4) The inexactness of the algorithm means that $\mathbf{x}_{i}^{inx}$ is chosen within an $\epsilon_{i}$ range of $\tilde{\mathbf{x}}_{i}$ , which leads to another source of approximation. Because of the function approximation we use, with the relaxed constraints on the schedules of $\{L_{i,n}\}$ and $\{\tau_{i,n}\}$ , the difference between $\mathbf{x}_{i}^{inx}$ and $\mathbf{x}_{i}$ can potentially grow if the error propagates, which also makes the proof harder than in the case of Inexact NEXT, where the difference $\|\mathbf{x}_{i}^{inx}-\tilde{\mathbf{x}}_{i}\|$ is bounded.
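The role of $\mathbf{y}_{i}$ and $\tilde{\mathbf{\pi}}_{i}$ in item (1) can be illustrated with a small gradient-tracking sketch. It assumes the standard update used in NEXT-type schemes, $\mathbf{y}_{i}[n+1]=\sum_{j}W_{ij}\mathbf{y}_{j}[n]+\nabla f_{i}[n+1]-\nabla f_{i}[n]$ and $\tilde{\mathbf{\pi}}_{i}=I\,\mathbf{y}_{i}-\nabla f_{i}$; the exact form in Algorithm 1 is not shown in this section and may differ. The quadratic local objectives, the weights, and the initial values are hypothetical, and the local optimization step is replaced by pure consensus on $\mathbf{x}$.

```python
def grad(i, x, a, c):
    """Gradient of the hypothetical local objective f_i(x) = 0.5 * a_i * (x - c_i)^2."""
    return a[i] * (x - c[i])

def mix(v, W):
    """Gossip averaging: v_i <- sum_j W[i][j] * v_j."""
    n = len(v)
    return [sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]

W = [[2 / 3, 1 / 3, 0.0], [1 / 3, 1 / 3, 1 / 3], [0.0, 1 / 3, 2 / 3]]
a, c = [1.0, 2.0, 3.0], [0.0, 1.0, -1.0]
I = 3

x = [-1.0, -0.5, -0.25]
g_old = [grad(i, x[i], a, c) for i in range(I)]
y = list(g_old)                      # y_i[0] = grad f_i(x_i[0])

for _ in range(80):
    x = mix(x, W)                    # consensus only; no local optimization here
    g_new = [grad(i, x[i], a, c) for i in range(I)]
    y_mixed = mix(y, W)
    # Assumed tracking update: average neighbors' y, then add the gradient change.
    y = [y_mixed[i] + g_new[i] - g_old[i] for i in range(I)]
    g_old = g_new

pi_tilde = [I * y[i] - g_old[i] for i in range(I)]
pi_exact = [sum(grad(j, x[i], a, c) for j in range(I) if j != i) for i in range(I)]
```

Under this update the average of the $\mathbf{y}_{i}$ equals the average of the current local gradients at every iteration, and once the $\mathbf{x}_{i}$ agree, `pi_tilde` matches the exact sum of the other nodes' gradients in `pi_exact`.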

The proof of Theorem 3.1 consists of six parts. The theorem basically makes two claims: the nodes’ iterates asymptotically agree, and they converge to one of the optima. The former will be a by-product as we aim to prove the latter. In the first part of the proof, we summarize the notation that will be used in the proof in Section B.1, and then describe a few results and one key proposition in Section B.2. Proposition B.1, a variant of Proposition 5 in [24], shows Lipschitz properties of $\hat{\mathbf{x}}_{i,n}$ . Lemma B.2 and 3 describe the main machinery we use to show the “ $x$ -consensus convergence” and “ $y$ -consensus convergence,” that is, the geometric convergence of the product of doubly stochastic matrices to the all-ones matrix. The results Lemma E.1, Lemma E.2, Lemma E.3, and Technical Assumption T are technical results regarding series or summations that arise in the analysis. The key proposition is Proposition B.3, which constitutes the core of the proof.

We prove Proposition B.3 (a) in the second part Section B.3, which says the difference between $\mathbf{x}_{i}^{inx}$ and $\mathbf{x}_{i}$ cannot grow beyond a certain rate, and the tools used are the definition of minimization (4), the strong convexity of $\tilde{f}_{i,n}^{*}$ , and the $y$ -consensus convergence. The difference is decomposed into $\|\mathbf{x}_{i}^{inx}-\tilde{\mathbf{x}}_{i}\|$ and $\|\tilde{\mathbf{x}}_{i}-\mathbf{x}_{i}\|$ , where the former is bounded by $\epsilon_{i}[n]$ by definition, and the latter is bounded by $O\left(\frac{L_{i,n}}{\tau_{i,n}}\right)$ using a mathematical induction argument.

Part (b) of Proposition B.3 establishes asymptotic consensus on $\mathbf{x}$ among nodes as the third part of the proof Section B.4. This part involves exploiting Lemma B.2, Lemma E.1, 3, Part (a) of Proposition B.3, and series convergence. The generalization to multiple local dependency sets is also taken care of in Part (a) and (b) of Proposition B.3 using simple inequalities regarding multi-dimensional spaces.

In the fourth part contained in Section B.5, Part (c) of Proposition B.3 is proved, which shows that the locally-optimized result using the “ $y$ approximation” $\tilde{\mathbf{x}}_{i}^{av}$ converges to the locally-optimized result with ideal linearization $\hat{\mathbf{x}}_{i,n}(\bar{\mathbf{x}})$ evaluated at $\bar{\mathbf{x}}$ . In addition to applying the definitions of the maps, it is intuitive that the “ $y$ -consensus convergence” in the “averaging system” and hence Lemma B.2 play a crucial role in the proof; Part (a) of Proposition B.3 and Technical Assumption T which comes from Lemma E.3 and the conditions of Theorem 3.1 are also used.

We prove Part (d) of Proposition B.3 in the fifth part Section B.6. Part (d) claims the actual locally-optimized result in the algorithm $\tilde{\mathbf{x}}_{i}$ converges to the locally-optimized result using the “ $y$ approximation” $\tilde{\mathbf{x}}_{i}^{av}$ . The underlying reason of the convergence is the “ $x$ -consensus convergence.” Several previous results are all used in this proof, including all of Parts (a), (b), and (c), Lemma E.1, Lemma E.3, and Technical Assumption T.

Section B.7 then combines Parts (a), (c), and (d), convexity of $G$ , and Lemma E.2 to show that $\bar{\mathbf{x}}$ converges to $\hat{\mathbf{x}}_{i,\infty}(\bar{\mathbf{x}})$ . With $\tau_{n}$ going to $0$ , $\hat{\mathbf{x}}_{i,\infty}$ is no longer a function but a correspondence; variational analysis is thus introduced to deal with the minimizers of correspondences in the first case of Section B.7. Finally, the second case of Section B.7 handles an interior limit point with unbounded gradient, using convexity and series convergence. The generalization to a boundary point with unbounded gradient is left as an open question.

Compared with the proof of NEXT, the generalization to multiple local dependency sets is not technically hard; it requires only some simple inequalities, as shown in Parts (a) and (b) of Proposition B.3. For the second generalization, we replace the Lipschitz constant $L$ and the strong convexity constant $\tau$ by sequences $\{L_{n}\}$ and $\{\tau_{n}\}$ , which may now grow to infinity and decrease to zero, respectively. This makes the proof significantly harder. The conditions in Theorem 3.1, the Technical Assumption T stated below, and Lemma E.3 are designed so that all the series involving $\{L_{n}\}$ and $\{\tau_{n}\}$ still converge. The unbounded gradient issue and the correspondence nature of $\hat{\mathbf{x}}_{i,\infty}$ also make our scheme much trickier to analyze than NEXT.

B.1 Notations

We first fix the notation used in the proof. All notation is defined for all $i$ , $m$ , and $n$ , whenever applicable.

Original variables

•

$\mathbf{x}^{m}[n]=(\mathbb{I}\{i\in{\mathcal{N}_{m}}\}\mathbf{x}^{m}_{i}[n])_{i\in\mathcal{N}}=[\mathbb{I}\{1\in\mathcal{N}_{m}\}\mathbf{x}^{m}_{1}[n]^{T}\enspace\cdots\enspace\mathbb{I}\{I\in\mathcal{N}_{m}\}\mathbf{x}^{m}_{I}[n]^{T}]^{T}$ : the concatenation of the part $m$ decision variables from all nodes in ${\mathcal{N}_{m}}$ , zero-padded for nodes not in ${\mathcal{N}_{m}}$ ; when the context is clear, we also use $\mathbf{x}^{m}[n]$ to refer to the unpadded version $(\mathbf{x}^{m}_{i}[n])_{i\in{\mathcal{N}_{m}}}$ , i.e. the vector containing only $\mathbf{x}^{m}_{i}[n]$ for $i$ in ${\mathcal{N}_{m}}$ ; the notation $(v_{i})_{i\in\mathcal{S}}$ , which denotes the concatenation of all vectors $v_{i}$ whose index $i$ is in the set $\mathcal{S}$ , is used throughout this appendix

•

$\mathbf{y}^{m}[n]=(\mathbb{I}\{i\in{\mathcal{N}_{m}}\}\mathbf{y}^{m}_{i}[n])_{i\in\mathcal{N}}$ : the concatenation of part $m$ of $\mathbf{y}$ , which tracks the average (among nodes) gradients of $\nabla_{\mathbf{x}^{m}}f_{i}$ from the nodes in ${\mathcal{N}_{m}}$ in the algorithm

•

$\mathbf{r}^{m}[n]=(\nabla_{\mathbf{x}^{m}}f^{*}_{i,n}[n])_{i\in\mathcal{N}}$ : the concatenation of ground truth gradient, with $\nabla_{\mathbf{x}^{m}}f^{*}_{i,n}[n]=\nabla_{\mathbf{x}^{m}}f^{*}_{i,n}(\mathbf{x}_{i}[n])$ ; adding $\mathbb{I}\{i\in{\mathcal{N}_{m}}\}$ is unnecessary as the gradient would be zero for those nodes not depending on $\mathbf{x}^{m}$

•

$\Delta\mathbf{r}^{m}[l,n]=(\Delta\mathbf{r}^{m}_{i}[l,n])_{i\in\mathcal{N}}$ : the gradient difference, with $\Delta\mathbf{r}^{m}_{i}[l,n]=\nabla_{\mathbf{x}^{m}}f^{*}_{i,l}[l]-\nabla_{\mathbf{x}^{m}}f^{*}_{i,n}[n]$ ( $l\leq n$ )

•

$\tilde{\mathbf{\pi}}_{i}[n]$ : see Line 11 of Algorithm 1

•

$\tilde{\mathbf{x}}_{i}[n]=\tilde{\mathbf{x}}^{*}_{i}(\mathbf{x}_{i}[n],\tilde{\mathbf{\pi}}_{i}[n])$ : see Line 5 of Algorithm 1 and Eq. 4

Average variables

•

$\bar{\mathbf{x}}^{m}[n]=\frac{1}{I_{m}}\sum_{i\in\mathcal{N}_{m}}\mathbf{x}^{m}_{i}[n]$ : average of decision variable

•

$\bar{\mathbf{y}}^{m}[n]=\frac{1}{I_{m}}\sum_{i\in\mathcal{N}_{m}}\mathbf{y}^{m}_{i}[n]$ : average of gradient tracking variable

•

$\bar{\mathbf{r}}^{m}[n]=\frac{1}{I_{m}}\sum_{i\in\mathcal{N}_{m}}\nabla_{\mathbf{x}^{m}}f^{*}_{i,n}[n]$ : average of ground truth gradient

•

$\Delta\bar{\mathbf{r}}^{m}[l,n]=\frac{1}{I_{m}}\sum_{i\in\mathcal{N}_{m}}\Delta\mathbf{r}^{m}_{i}[l,n]$ : average of gradient difference

Tracking system using average variables

•

$\nabla f^{*,av}_{i,n}[n]=\nabla f^{*}_{i,n}(\bar{\mathbf{x}}[n])$

•

$\mathbf{r}^{m,av}[n]=(\nabla_{\mathbf{x}^{m}}f^{*,av}_{i,n}[n])_{i\in\mathcal{N}}$ : the concatenation of ground truth gradient evaluated at average decision variable

•

$\Delta\mathbf{r}^{m,av}[l,n]=(\Delta\mathbf{r}^{m,av}_{i}[l,n])_{i\in\mathcal{N}}$ : the gradient difference evaluated at average decision variable, with $\Delta\mathbf{r}^{m,av}_{i}[l,n]=\nabla_{\mathbf{x}^{m}}f^{*,av}_{i,l}[l]-\nabla_{\mathbf{x}^{m}}f^{*,av}_{i,n}[n]$

•

$\mathbf{y}^{m,av}_{i}[n+1]=\sum_{j}w^{m}_{ij}[n]\mathbf{y}^{m,av}_{j}[n]+\Delta\mathbf{r}^{m,av}_{i}[n+1,n]$ : tracking of the average gradient evaluated at the average decision variable, with $\mathbf{y}^{m,av}_{i}[0]=\nabla_{\mathbf{x}^{m}}f^{*,av}_{i,0}[0]$ ; concatenating $\mathbf{y}^{m,av}_{i}[n]$ for $i\in\mathcal{N}$ gives $\mathbf{y}^{m,av}[n]=(\mathbb{I}\{i\in{\mathcal{N}_{m}}\}\mathbf{y}^{m,av}_{i}[n])_{i\in\mathcal{N}}$

•

$\tilde{\mathbf{\pi}}^{m,av}_{i}[n]=I_{m}\mathbf{y}^{m,av}_{i}[n]-\nabla_{\mathbf{x}^{m}}f^{*,av}_{i,n}[n]$

•

$\tilde{\mathbf{x}}^{av}_{i}[n]=\tilde{\mathbf{x}}^{*}_{i}(\bar{\mathbf{x}}_{i}[n],\tilde{\mathbf{\pi}}^{av}_{i}[n])$ : optimized result evaluated at average decision variable and average tracking system

•

$\bar{\mathbf{r}}^{m,av}[n]=\frac{1}{I_{m}}\sum_{i\in\mathcal{N}_{m}}\nabla_{\mathbf{x}^{m}}f^{*,av}_{i,n}[n]$ : average of ground truth gradient evaluated at average decision variable
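The tracking recursion in this list has a useful invariant: with doubly stochastic weights, the node average of the tracking variable equals the node average of the gradients at every step. The following is a minimal sketch of that invariant, assuming scalar variables, a fixed weight matrix instead of the time-varying $\mathbf{W}^{m}[n]$ , and made-up gradient values; the names `track`, `W`, and `grads` are illustrative and do not come from the paper.

```python
import math

def track(W, grads_over_time):
    """Run y[n+1] = W y[n] + (g[n+1] - g[n]) with y[0] = g[0], one scalar per node."""
    y = list(grads_over_time[0])
    history = [y]
    for g_prev, g_next in zip(grads_over_time, grads_over_time[1:]):
        size = len(y)
        # Consensus (mixing) step with the doubly stochastic weights.
        mixed = [sum(W[i][j] * y[j] for j in range(size)) for i in range(size)]
        # Correction by the local gradient difference Delta r_i[n+1, n].
        y = [mixed[i] + g_next[i] - g_prev[i] for i in range(size)]
        history.append(y)
    return history

# Hypothetical 3-node doubly stochastic weight matrix (rows and columns sum to 1).
W = [[0.50, 0.25, 0.25],
     [0.25, 0.50, 0.25],
     [0.25, 0.25, 0.50]]

# Hypothetical local gradients: node i reports sin(0.3 n + i) at time n.
grads = [[math.sin(0.3 * n + i) for i in range(3)] for n in range(20)]

ys = track(W, grads)
# Largest gap between the average of y[n] and the average gradient over all n.
max_gap = max(abs(sum(y) / 3 - sum(g) / 3) for y, g in zip(ys, grads))
print(max_gap)
```

The gap stays at floating-point rounding level because each column of `W` sums to one, so the mixing step leaves the sum over nodes unchanged.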

Doubly stochastic matrices

•

$\mathbf{P}^{m}[n,l]=\mathbf{W}^{m}[n]\mathbf{W}^{m}[n-1]\cdots\mathbf{W}^{m}[l]\quad n\geq l$

•

$\hat{\mathbf{W}}^{m}[n]=\mathbf{W}^{m}[n]\otimes I_{d_{m}}$

•

$\hat{\mathbf{P}}^{m}[n,l]=\hat{\mathbf{W}}^{m}[n]\hat{\mathbf{W}}^{m}[n-1]\cdots\hat{\mathbf{W}}^{m}[l]=\mathbf{P}^{m}[n,l]\otimes I_{d_{m}}\quad n\geq l$

•

$J^{m}=\frac{1}{I_{m}}\mathbf{1}_{\mathcal{N}_{m}}\mathbf{1}_{\mathcal{N}_{m}}^{T}\otimes\mathbf{I}_{d_{m}}$ , where $\mathbf{1}_{\mathcal{N}_{m}}=\{\mathbb{I}\{i\in{\mathcal{N}_{m}}\}:i\in\mathcal{N}\}$ , and $\mathbf{I}$ is the identity matrix

•

$J^{m}_{\perp}=\mathbf{I}_{d_{m}I_{m}}-J^{m}$ , where $\mathbf{I}_{d_{m}I_{m}}$ is the identity matrix of dimension $d_{m}I_{m}$
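The matrix notation above can be checked on a small example. The sketch below works with the unpadded version (only nodes in $\mathcal{N}_{m}$ ), uses two hand-picked doubly stochastic matrices, and verifies numerically that the lifted product equals the lift of the product and that $J^{m}J^{m}_{\perp}=0$ . The sizes, the matrices, and all helper names are assumptions made for illustration.

```python
def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def kron(A, B):
    """Kronecker product: block (i, j) of the result is A[i][j] * B."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

I_m, d_m = 3, 2  # hypothetical: 3 nodes in N_m, each holding a block of dimension 2
W0 = [[0.5, 0.5, 0.0], [0.5, 0.25, 0.25], [0.0, 0.25, 0.75]]  # plays the role of W[0]
W1 = [[0.6, 0.2, 0.2], [0.2, 0.6, 0.2], [0.2, 0.2, 0.6]]      # plays the role of W[1]
Id = identity(d_m)

# P[1, 0] = W[1] W[0]; its lift should equal the product of the lifted factors.
P = matmul(W1, W0)
lift_gap = max_abs_diff(matmul(kron(W1, Id), kron(W0, Id)), kron(P, Id))

# J averages across nodes; J_perp removes that average.
J = kron([[1.0 / I_m] * I_m for _ in range(I_m)], Id)
full_identity = identity(I_m * d_m)
J_perp = [[full_identity[i][j] - J[i][j] for j in range(I_m * d_m)]
          for i in range(I_m * d_m)]
zero = [[0.0] * (I_m * d_m) for _ in range(I_m * d_m)]
proj_gap = max_abs_diff(matmul(J, J_perp), zero)
print(lift_gap, proj_gap)
```

Both gaps are at rounding level, which reflects the mixed-product property of the Kronecker product and the fact that $J^{m}$ is a projection.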

B.2 Key Propositions

The next proposition is a variant of Proposition 5 in [24].

Proposition B.1.

Let $\mathbf{\pi}^{\mathcal{S}_{i}}_{i}(\tilde{\mathbf{x}})=\sum_{j\neq i}\nabla_{\mathbf{x}^{\mathcal{S}_{i}}}f^{*}_{j,n}(\tilde{\mathbf{x}})=\left(\sum_{j\in{\mathcal{N}_{m}},j\neq i}\nabla_{\mathbf{x}^{m}}f^{*}_{j,n}(\tilde{\mathbf{x}}^{\mathcal{N}_{m}})\right)_{m\in{\mathcal{S}_{i}}}$ be the concatenation of $\sum_{j\in{\mathcal{N}_{m}},j\neq i}\nabla_{\mathbf{x}^{m}}f^{*}_{j,n}(\tilde{\mathbf{x}}^{\mathcal{N}_{m}})$ for all $m$ in $\mathcal{S}_{i}$ . Define the mapping $\hat{\mathbf{x}}^{\mathcal{S}_{i}}_{i,n}(\cdot):\mathcal{K}\rightarrow\mathcal{K}_{\mathcal{S}_{i}}=\Pi_{m\in{\mathcal{S}_{i}}}\mathcal{K}_{m}$ by

[TABLE]

and the mapping $\hat{\mathbf{x}}_{i,n}(\cdot):\mathcal{K}\rightarrow\mathcal{K}$ by $\hat{\mathbf{x}}_{i,n}(\tilde{\mathbf{x}})=\left(\hat{\mathbf{x}}^{\mathcal{S}_{i}}_{i,n}(\tilde{\mathbf{x}}),\tilde{\mathbf{x}}^{\mathcal{N}\setminus{\mathcal{S}_{i}}}\right)$ , that is, preserving everything in the $\mathcal{K}_{\mathcal{N}\setminus{\mathcal{S}_{i}}}$ subspace unchanged while mapping with $\hat{\mathbf{x}}^{\mathcal{S}_{i}}_{i,n}(\cdot)$ in the ${\mathcal{S}_{i}}$ subspace. Then, under Assumptions A, F, and N, the mapping $\hat{\mathbf{x}}_{i,n}(\cdot)$ has the following properties:

(a)

$\forall\enspace\mathbf{z}\in\mathcal{K}$ and $i\in\mathcal{N}$ ,

[TABLE]

where $F(\mathbf{x})=\sum_{i=1}^{I}f_{i}(\mathbf{x})$ . Here we use $G(\mathbf{x})$ and $G(\mathbf{x}^{c})$ interchangeably as they are the same thing.

(b)

$\hat{\mathbf{x}}_{i,n}(\cdot)$ is Lipschitz continuous, i.e. $\|\hat{\mathbf{x}}_{i,n}(\mathbf{w})-\hat{\mathbf{x}}_{i,n}(\mathbf{z})\|\leq L_{i,n}\|\mathbf{w}-\mathbf{z}\|\quad\forall\enspace\mathbf{w},\mathbf{z}\in\mathcal{K}$ for $i\in\mathcal{N}$ . (Footnote 4: Note that this holds for $\hat{\mathbf{x}}^{\mathcal{S}_{i}}_{i,n}$ as well because the elements in the $\mathcal{K}_{\mathcal{N}\setminus{\mathcal{S}_{i}}}$ subspace just cancel each other out.)

The only change is to replace $\tilde{f}_{i}$ in [24] with $\tilde{f}^{*}_{i,n}$ . Although the equations are written in localized form, this does not change the argument.

Lemma B.2.

Define $\mathbf{P}[n,l]\triangleq\mathbf{W}[n]\mathbf{W}[n-1]\cdots\mathbf{W}[l]$ . Then under Assumption L´ (double stochasticity and lower-bounded entries for edges),

[TABLE]

for some $c_{0}>0$ and $\rho\in(0,1)$ .

Strictly speaking, the above is for $\mathbf{W}^{c}=\mathbf{W}^{M+1}$ , the matrix used in Assumption L´ for averaging over the entire network. For $\mathbf{W}^{m}$ with $m\in[M]$ , the lemma also holds after we delete the zero rows and columns. In this case the product converges to $\frac{1}{I_{m}}\mathbf{1}\mathbf{1}^{T}$ , where $\mathbf{1}$ is of the proper dimension. From now on we take $\rho$ to be the largest geometric convergence factor among all $\mathbf{W}^{m}$ , $m\in[M+1]$ .
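The geometric convergence of the matrix products can be observed numerically. The sketch below alternates two hand-picked doubly stochastic matrices and records the largest entrywise distance between $\mathbf{P}[n,0]$ and $\frac{1}{I}\mathbf{1}\mathbf{1}^{T}$ . The matrices, the horizon of 30 steps, and the choice of the max-entry distance are assumptions for illustration; the sketch does not estimate the constants $c_{0}$ and $\rho$ of the lemma.

```python
def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def consensus_errors(factors):
    """Max-entry distance between P[n,0] = W[n] ... W[0] and (1/I) 1 1^T, for each n."""
    size = len(factors[0])
    target = 1.0 / size
    P = factors[0]
    errors = [max(abs(v - target) for row in P for v in row)]
    for W in factors[1:]:
        P = matmul(W, P)  # left-multiply by the newest factor
        errors.append(max(abs(v - target) for row in P for v in row))
    return errors

# Two hypothetical doubly stochastic matrices on a connected 3-node graph.
W_A = [[0.5, 0.5, 0.0], [0.5, 0.25, 0.25], [0.0, 0.25, 0.75]]
W_B = [[0.6, 0.2, 0.2], [0.2, 0.6, 0.2], [0.2, 0.2, 0.6]]

factors = [W_A if n % 2 == 0 else W_B for n in range(30)]
errs = consensus_errors(factors)
print(errs[0], errs[-1])
```

The error never increases from one step to the next, since each new factor replaces every entry of the deviation by a convex combination of entries in the same column.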

Before proving Theorem 3.1, we will first prove the following proposition.

Proposition B.3.

Let $\{\mathbf{x}^{m}[n]\}_{n}\triangleq\{(\mathbf{x}^{m}_{i}[n])_{i\in\mathcal{N}_{m}}\}_{n}$ and $\{\bar{\mathbf{x}}^{m}[n]\}_{n}\triangleq\left\{\frac{1}{I_{m}}\sum_{i\in\mathcal{N}_{m}}\mathbf{x}^{m}_{i}[n]\right\}_{n}$ , $m\in[M+1]$ , be the sequences generated by Algorithm 1 in the setting of Theorem 3.1. Then the following hold:

(a)

For all $n$ , $\|\mathbf{x}^{m,inx}_{i}[n]-\mathbf{x}^{m}_{i}[n]\|\leq\frac{c^{m}L_{i,n}}{\tau_{i,n}}\quad\forall\enspace i\in\mathcal{N}_{m},m\in[M+1]$ .

(b)

$\lim_{n\rightarrow\infty}\|\mathbf{x}^{m}_{i}[n]-\bar{\mathbf{x}}^{m}[n]\|=0$ , $\sum_{n=1}^{\infty}\alpha[n]\|\mathbf{x}^{m}_{i}[n]-\bar{\mathbf{x}}^{m}[n]\|<\infty$ , $\sum_{n=1}^{\infty}\|\mathbf{x}^{m}_{i}[n]-\bar{\mathbf{x}}^{m}[n]\|^{2}<\infty\quad\forall\enspace i\in\mathcal{N}_{m},m\in[M+1]$ .

(c)

$\lim_{n\rightarrow\infty}\|\tilde{\mathbf{x}}^{av}_{i}[n]-\hat{\mathbf{x}}^{\mathcal{S}_{i}}_{i,n}(\bar{\mathbf{x}}[n])\|=0$ , $\sum_{n=1}^{\infty}\alpha[n]L^{\max}_{n}\|\tilde{\mathbf{x}}^{av}_{i}[n]-\hat{\mathbf{x}}^{\mathcal{S}_{i}}_{i,n}(\bar{\mathbf{x}}[n])\|<\infty\quad\forall\enspace i\in\mathcal{N}$ .

(d)

$\lim_{n\rightarrow\infty}\|\tilde{\mathbf{x}}^{m}_{i}[n]-\tilde{\mathbf{x}}^{m,av}_{i}[n]\|=0$ , $\sum_{n=1}^{\infty}\alpha[n]L^{\max}_{n}\|\tilde{\mathbf{x}}^{m}_{i}[n]-\tilde{\mathbf{x}}^{m,av}_{i}[n]\|<\infty\quad\forall\enspace i\in\mathcal{N}_{m},m\in[M+1]$ .

We will use the following assumption frequently when proving Proposition B.3. These are the exact technical inequalities used in the proof; all of them are implied by the conditions of Theorem 3.1, as we show below.

Technical Assumption T

(T1) $\lim_{n\rightarrow\infty}\rho^{n}\frac{L^{\max}_{n}}{\tau^{\min}_{n}}=0$ ;

(T2) $\lim_{n\rightarrow\infty}\frac{L^{\max}_{n}}{\tau^{\min}_{n}}\sum_{l=0}^{n-1}\rho^{n-l}\frac{\alpha[l](L^{\max}_{l})^{2}}{\tau^{\min}_{l}}=0$ ;

(T3) $\lim_{n\rightarrow\infty}\frac{1}{\tau^{\min}_{n}}\sum_{l=0}^{n-1}\rho^{n-l}\|\nabla f^{*}_{i,l}(\mathbf{x})-\nabla f^{*}_{i,l-1}(\mathbf{x})\|=0$ for all $\mathbf{x}$ and $i$ ;

(T4) $\sum_{n=1}^{\infty}\rho^{n}\frac{L^{\max}_{n}}{\tau^{\min}_{n}}<\infty$ ;

(T5) $\sum_{n=1}^{\infty}\frac{\alpha[n]L^{\max}_{n}}{\tau^{\min}_{n}}\sum_{l=0}^{n-1}\rho^{n-l}\frac{\alpha[l](L^{\max}_{l})^{2}}{\tau^{\min}_{l}}<\infty$ ;

(T6) $\sum_{n=1}^{\infty}\frac{\alpha[n]L^{\max}_{n}}{\tau^{\min}_{n}}\sum_{l=0}^{n-1}\rho^{n-l}\|\nabla f^{*}_{i,l}(\mathbf{x})-\nabla f^{*}_{i,l-1}(\mathbf{x})\|<\infty$ for all $\mathbf{x}$ and $i$ .

Proof B.4.

(T1): it should be evident from the conditions of Theorem 3.1 that none of the parameters can grow or decay at an exponential rate. We consider the setting where $\alpha[n]$ and $\tau^{\min}_{n}$ go to zero while $L^{\max}_{n}$ goes to infinity. The condition $\sum_{n=0}^{\infty}(L^{\max}_{n})^{3}\left(\frac{\alpha[n]}{\tau^{\min}_{n}}\right)^{2}<\infty$ implies that if either $L^{\max}_{n}$ grows exponentially or $\tau^{\min}_{n}$ decays exponentially, then $\alpha[n]$ must also decay exponentially. But then $\sum_{n=0}^{\infty}\tau^{\min}_{n}\alpha[n]=\infty$ would be impossible.

(T2): recall the conditions of Theorem 3.1 imply $\lim_{n\rightarrow\infty}\alpha[n]\frac{(L^{\max}_{n})^{3}}{(\tau^{\min}_{n})^{3}}=0$ , then apply the first part of Lemma E.3.

(T3): from $\lim_{n\rightarrow\infty}\frac{\eta^{\max}_{n}}{\tau^{\min}_{n}}=0$ and the first part of Lemma E.3.

(T4): again the parameters are not growing at an exponential rate.

(T5): from $\sum_{n=0}^{\infty}(L^{\max}_{n})^{3}\left(\frac{\alpha[n]}{\tau^{\min}_{n}}\right)^{2}<\infty$ and the second part of Lemma E.3.

(T6): from $\sum_{n=0}^{\infty}\frac{\alpha[n]L^{\max}_{n}\eta^{\max}_{n}}{\tau^{\min}_{n}}<\infty$ and the second part of Lemma E.3.
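To get a feel for which parameter schedules are admissible, one can restrict attention to polynomial rates $\alpha[n]=n^{-a}$ , $\tau^{\min}_{n}=n^{-b}$ , $L^{\max}_{n}=n^{c}$ . This family is an assumption made here for illustration, not a choice taken from the paper. Under it, each condition quoted in this appendix reduces to arithmetic on exponents, using the facts that $\sum_{n}n^{p}$ converges exactly when $p<-1$ and that $n^{p}\rightarrow 0$ exactly when $p<0$ . The sketch covers only the four conditions quoted in this appendix that involve $\alpha$ , $\tau$ , and $L$ ; conditions involving $\eta^{\max}_{n}$ are not checked, and the function name is hypothetical.

```python
def check_polynomial_rates(a, b, c):
    """Exponent checks for alpha[n] = n**-a, tau[n] = n**-b, L[n] = n**c (a, b, c >= 0)."""
    return {
        # sum L^3 (alpha/tau)^2 behaves like sum n^(3c - 2(a - b))
        "sum L^3 (alpha/tau)^2 < inf": 3 * c - 2 * (a - b) < -1,
        # sum tau * alpha behaves like sum n^(-(a + b)); it diverges iff the exponent is >= -1
        "sum tau * alpha = inf": -(a + b) >= -1,
        # alpha (L/tau)^3 behaves like n^(-a + 3(b + c))
        "alpha (L/tau)^3 -> 0": -a + 3 * (b + c) < 0,
        # alpha L^5 / tau^3 behaves like n^(-a + 5c + 3b)
        "alpha L^5 / tau^3 -> 0": -a + 5 * c + 3 * b < 0,
    }

# A hypothetical admissible schedule, and one whose step size decays too slowly.
rates = check_polynomial_rates(a=0.8, b=0.1, c=0.05)
bad = check_polynomial_rates(a=0.6, b=0.3, c=0.1)
print(rates)
print(bad)
```

For any polynomial schedule, (T1) and (T4) hold automatically, because $\rho^{n}n^{b+c}$ is summable for every $\rho\in(0,1)$ . This matches the remark in the proof that the parameters cannot vary exponentially.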

B.3 Proof of Proposition B.3 (a)

Consider a local dependency set ${\mathcal{N}_{m}}$ and any node $i\in{\mathcal{N}_{m}}$ . By the definition of $\mathbf{x}^{\mathcal{S}_{i}}_{i}$ defined in the minimization of Eq. 4, we have

[TABLE]

From Line 11 of Algorithm 1 and (F2´), we have

[TABLE]

Substitute this result into Eq. 20 and rearrange the terms to get

[TABLE]

We have omitted all the time indices in the preceding inequality since all the variables have the same time index $[n]$ .

Suppose that $\|\mathbf{y}^{m}_{i}\|$ is bounded by $l_{m}L_{i,n}$ for all $m\in{\mathcal{S}_{i}}$ . Then the preceding inequality is of the form

[TABLE]

which implies that all $\|\mathbf{x}^{m}_{i}-\tilde{\mathbf{x}}^{m}_{i}\|$ ’s are bounded by $\frac{\sum_{m\in{\mathcal{S}_{i}}}l_{m}L_{i,n}}{\tau_{i,n}}$ . (Footnote 5: Note that $\mathbf{x}^{c}$ refers to $\mathbf{x}^{M+1}$ and $M+1$ is in ${\mathcal{S}_{i}}$ as well if the part exists. Therefore, the second term $L_{G}\|\mathbf{x}^{c}_{i}-\tilde{\mathbf{x}}^{c}_{i}\|$ can be put into the summation in the first term.) This is due to the following argument: if $\{x_{i}\},\{l_{i}\}$ are non-negative and $\sum_{i}x^{2}_{i}\leq\sum_{i}l_{i}x_{i}$ , then $\max\{x_{i}\}\leq\sum_{i}l_{i}$ . Otherwise, without loss of generality we can assume $x_{1}=\max\{x_{i}\}$ and hence $\sum_{i}l_{i}<x_{1}$ , so that the following holds

[TABLE]

which is a contradiction. Thus, with Eq. 23, we get

[TABLE]

where $c^{m}$ is some constant independent of $n$ and $i$ . This proves the claim. It only remains to show that $\|\mathbf{y}^{m}_{i}\|$ is actually bounded by $l_{m}L_{i,n}$ .
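The elementary inequality used above (non-negative $x_{i},l_{i}$ with $\sum_{i}x_{i}^{2}\leq\sum_{i}l_{i}x_{i}$ imply $\max\{x_{i}\}\leq\sum_{i}l_{i}$ ) can be sanity-checked by random sampling. The sketch below is only a numerical check under an arbitrary sampling distribution, not a proof; the function name, the dimension, and the sampling range are assumptions.

```python
import random

def implication_holds(x, l):
    """True unless the premise sum x_i^2 <= sum l_i x_i holds while max x_i > sum l_i."""
    premise = sum(v * v for v in x) <= sum(a * v for a, v in zip(l, x))
    conclusion = max(x) <= sum(l)
    return (not premise) or conclusion

rng = random.Random(0)  # fixed seed so the run is reproducible
trials = [([rng.uniform(0, 2) for _ in range(4)], [rng.uniform(0, 2) for _ in range(4)])
          for _ in range(10000)]

# Count samples that satisfy the premise, so the check is not vacuous.
premise_hits = sum(
    sum(v * v for v in x) <= sum(a * v for a, v in zip(l, x)) for x, l in trials
)
violations = sum(not implication_holds(x, l) for x, l in trials)
print(premise_hits, violations)
```

No violations are expected, in line with the contradiction argument: if $x_{1}=\max\{x_{i}\}>\sum_{i}l_{i}$ , then $\sum_{i}x_{i}^{2}\geq x_{1}^{2}>x_{1}\sum_{i}l_{i}\geq\sum_{i}l_{i}x_{i}$ .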

We use mathematical induction to finish the proof. The statement is that

[TABLE]

holds for all $n$ . We have already shown that the latter implies the former. The base case is obvious as we initialize $\mathbf{y}^{m}_{i}[0]$ to be $\nabla_{\mathbf{x}^{m}}f^{*}_{i,0}[0]$ , which is assumed to be Lipschitz continuous. For the induction step, we assume the statement is true for $n-1$ and prove that the latter part $\enspace\|\mathbf{y}^{m}_{i}[n]\|\leq l_{m}L_{i,n}$ holds for $n$ .

By the definition of $\mathbf{y}$ , we have

[TABLE]

where

[TABLE]

To reach the last line we utilize the triangle inequality and the Lipschitz continuity of $\nabla f^{*}_{i,n}$ . For the first term,

[TABLE]

Using the induction hypothesis of $\Delta\mathbf{x}$ , we obtain

[TABLE]

Therefore, we finally obtain

[TABLE]

using the induction hypothesis of $\Delta\mathbf{y}$ and the fact that $\frac{\alpha[n-1]L_{i,n-1}}{\tau_{i,n-1}}$ also goes to zero as $n\rightarrow\infty$ , which is implied by the conditions of Theorem 3.1.

B.4 Proof of Proposition B.3 (b)

We only prove the case $\mathbf{x}^{c}=\mathbf{x}^{M+1}$ to avoid carrying the index $m$ throughout. The proof of the claims for general $\mathbf{x}^{m}$ is exactly the same after replacing $\mathbf{x}^{c},\hat{\mathbf{W}},\mathbf{P},\mathbf{r}^{c},J,J_{\perp},\mathbf{1}_{I},I$ by $\mathbf{x}^{m},\hat{\mathbf{W}}^{m},\mathbf{P}^{m},\mathbf{r}^{m},J^{m},J^{m}_{\perp},\mathbf{1}_{\mathcal{N}_{m}},I_{m}$ .

(i)

[TABLE]

Notice that with 3 (d) and (e), the difference of $\mathbf{x}^{c}[n]$ and $\mathbf{1}_{I}\otimes\bar{\mathbf{x}}^{c}[n]$ which is $\mathbf{x}^{c}_{\perp}[n]$ can be expressed as a linear combination of $\mathbf{x}^{c}_{\perp}[n-1]$ and $\Delta\mathbf{x}^{c,inx}[n-1]$ . We can thus expand $\mathbf{x}^{c}_{\perp}[n-1]$ iteratively as follows:

[TABLE]

where the last equation resulted from 3 (b). From Proposition B.3 (a) we know

[TABLE]

for some constants $c_{1}$ and $c_{2}$ . Consequently, we get

[TABLE]

by first utilizing the triangle inequality, and then using Eq. 29, Lemma B.2, and finally Lemma E.1 (a). Remark that $\lim_{n\rightarrow\infty}\alpha[n]\left(\frac{L^{\max}_{n}}{\tau^{\min}_{n}}\right)^{3}=0$ implies $\lim_{n\rightarrow\infty}\frac{\alpha[n]L^{\max}_{n}}{\tau^{\min}_{n}}=0$ , which we use in Eq. 30 as the condition of Lemma E.1 (a).

(ii)

[TABLE]

The bound for the last term comes from Lemma E.1 (b).

(iii)

[TABLE]

The bound for the first term is immediate. The double summation is bounded by the second equality of Lemma E.1 (b) with $(\lambda,\beta[k],\nu[l])$ being $(\rho,\rho^{k},\frac{\alpha[l-1]L^{\max}_{l-1}}{\tau^{\min}_{l-1}})$ . The condition $\sum_{n=1}^{\infty}(L^{\max}_{n})^{3}\left(\frac{\alpha[n]}{\tau^{\min}_{n}}\right)^{2}<\infty$ of Theorem 3.1 guarantees that $\sum_{n=1}^{\infty}\left(\frac{\alpha[n]L^{\max}_{n}}{\tau^{\min}_{n}}\right)^{2}<\infty$ . The inequality for the triple summation follows from

[TABLE]

where the last inequality is due to the first equality of Lemma E.1 (b). Again, the convergence of $\sum_{n=1}^{\infty}\left(\frac{\alpha[n]L^{\max}_{n}}{\tau^{\min}_{n}}\right)^{2}$ is implied by the convergence of $\sum_{n=1}^{\infty}(L^{\max}_{n})^{3}\left(\frac{\alpha[n]}{\tau^{\min}_{n}}\right)^{2}$ .

B.5 Proof of Proposition B.3 (c)

We exploit the optimality of $\tilde{\mathbf{x}}^{av}_{i}$ and (F1´) and (A2) to get

[TABLE]

and the optimality of $\bar{\mathbf{x}}$ (for the mapping of $\hat{\mathbf{x}}^{\mathcal{S}_{i}}_{i,n}(\bar{\mathbf{x}})$ ) leads to

[TABLE]

$\mathbf{0}^{{\mathcal{S}_{i}}\setminus\{c\}}$ is an all zero vector in the subspace $\mathcal{K}_{{\mathcal{S}_{i}}\setminus\{c\}}$ . It should be clear that $\hat{\mathbf{x}}^{c}_{i,n}(\bar{\mathbf{x}})$ refers to the component of $\hat{\mathbf{x}}_{i,n}(\bar{\mathbf{x}})$ in the subspace $\mathcal{K}_{c}$ . Then

[TABLE]

From LABEL:40,

[TABLE]

So far, the context has been clear enough to let us drop the time index $[n]$ . Again, we only focus on the case $m=M+1$ ; that is, we prove that $\frac{1}{\tau_{i,n}}\|\mathbf{y}^{c,av}-\mathbf{1}_{I}\otimes\bar{\mathbf{r}}^{c,av}\|$ goes to zero. We calculate

[TABLE]

and

[TABLE]

Similar to LABEL:31-2 we have

[TABLE]

Following the technique used in Eq. 28 to Eq. 30, and combining Eq. 34 and Eq. 35 with Lemma B.2 and then LABEL:41, we have

[TABLE]

In the last equation, we also have $\lim_{n\rightarrow\infty}\frac{\alpha[n](L^{\max}_{n})^{2}}{(\tau^{\min}_{n})^{2}}=0$ and $\lim_{n\rightarrow\infty}\frac{1}{\tau^{\min}_{n}}\|\nabla f^{*}_{i,n}(\bar{\mathbf{x}}[n])-\nabla f^{*}_{i,n-1}(\bar{\mathbf{x}}[n])\|=0$ implied by the conditions of Theorem 3.1. For the second part of the claim, we can equivalently prove

[TABLE]

This is true because

[TABLE]

All the terms are finite for the following reasons. The first term is finite by (T4): multiplying by $\alpha[n]$ , which goes to zero, keeps the series bounded. The second term is finite by (T5). The third term is finite by a condition of Theorem 3.1. The fourth term is finite by (T6). The last term is also finite by a condition of Theorem 3.1.

B.6 Proof of Proposition B.3 (d)

Recall we have $\tilde{\mathbf{x}}_{i}[n]=\underset{\mathbf{x}_{i}\in\mathcal{K}_{\mathcal{S}_{i}}}{\arg\min}\enspace\tilde{U}_{i,n}(\mathbf{x}_{i};\mathbf{x}_{i}[n],\tilde{\mathbf{\pi}}_{i}[n])$ and $\tilde{\mathbf{x}}^{av}_{i}[n]=\underset{\mathbf{x}_{i}\in\mathcal{K}_{\mathcal{S}_{i}}}{\arg\min}\enspace\tilde{U}_{i,n}(\mathbf{x}_{i};\bar{\mathbf{x}}_{i}[n],\tilde{\mathbf{\pi}}^{av}_{i}[n])$ , where

[TABLE]

These along with (F1´) and (A2) lead to the following:

[TABLE]

and

[TABLE]

As we did in LABEL:40,

[TABLE]

Hence,

[TABLE]

Since $\|\tilde{\mathbf{x}}^{m}_{i}-\tilde{\mathbf{x}}^{m,av}_{i}\|$ is not larger than $\left[\sum_{m\in{\mathcal{S}_{i}}}\left\|\tilde{\mathbf{x}}^{m}_{i}-\tilde{\mathbf{x}}^{m,av}_{i}\right\|^{2}\right]^{1/2}$ , LABEL:45 implies the former goes to zero as $n$ goes to infinity if we can show that all terms on the RHS do so. The first term does go to zero, as we showed in part (b) (combining Eq. 30, Lemma E.1 (a), and the fact that $\lim_{n\rightarrow\infty}\alpha[n]\left(\frac{L^{\max}_{n}}{\tau^{\min}_{n}}\right)^{2}=0$ ). The following shows that this property holds for the remaining two terms as well. As before, we omit the time index $[n]$ above since the context is clear.

We have

$$\begin{aligned}\frac{1}{\tau_{i,n}}\|\mathbf{1}_{\mathcal{N}_{m}}\otimes(\bar{\mathbf{r}}^{m}[n]-\bar{\mathbf{r}}^{m,av}[n])\|&\leq\frac{1}{\tau_{i,n}}\sum_{i\in{\mathcal{N}_{m}}}\left\|\nabla_{\mathbf{x}^{m}}f^{*}_{i,n}(\mathbf{x}_{i}[n])-\nabla_{\mathbf{x}^{m}}f^{*}_{i,n}(\bar{\mathbf{x}}^{\mathcal{S}_{i}}[n])\right\|\\&\leq\sum_{i\in{\mathcal{N}_{m}}}\frac{L_{i,n}}{\tau_{i,n}}\|\mathbf{x}_{i}[n]-\bar{\mathbf{x}}^{\mathcal{S}_{i}}[n]\|&&\text{(N1)}\\&\xrightarrow{n\rightarrow\infty}0&&\text{(Proposition B.3 (b))},\end{aligned}\tag{40}$$

and

[TABLE]

That terms of the form $\frac{L_{n}}{\tau_{n}}\|\mathbf{x}_{\perp}[n]\|$ converge to zero follows from Eq. 30 together with (T1) and (T2). In the second line, from Eq. 34, Eq. 35, and Section B.5 we know that $\|\mathbf{y}^{m,av}-\mathbf{1}_{\mathcal{N}_{m}}\otimes\bar{\mathbf{r}}^{m,av}\|$ can be represented as a sum of $\Delta\mathbf{r}^{m,av}[l,l-1]$ ’s; using the same method $\|\mathbf{y}^{m}-\mathbf{1}_{\mathcal{N}_{m}}\otimes\bar{\mathbf{r}}^{m}\|$ can also be represented as a sum of $\Delta\mathbf{r}^{m}[l,l-1]$ ’s, which we omit here. In the last inequality one can alternatively use Eq. 27 to bound $\Delta\mathbf{r}^{m}[l,l-1]$ and $\Delta\mathbf{r}^{m,av}[l,l-1]$ , which is simpler and sufficient for our purposes.

For the second part of the claim,

[TABLE]

For the first term, use Eq. 30, (T4), and (T5). The second term is finite due to (T4) with the additional factor $\alpha[n]$ . The terms in the fourth line are handled just like the first term. The terms in the fifth line converge by the conditions of Theorem 3.1. The terms in the second line are of the type $\sum_{n}\frac{\alpha[n]L_{n}}{\tau_{n}}\sum_{l}\rho^{n-l}L_{l}\|\mathbf{x}_{\perp}[l]\|$ ; from Eq. 30 and (T4) one can show that $\sum_{n}\frac{\alpha[n]L^{2}_{n}}{\tau_{n}}\|\mathbf{x}_{\perp}[n]\|$ converges, and the convergence of these terms then follows from the second part of Lemma E.3. The terms in the third line converge because of (T6).

B.7 Proof of Theorem 3.1

Denote $F^{*}_{n}=\sum_{i\in\mathcal{N}}f^{*}_{i,n}$ . By the descent lemma,

[TABLE]

By the convexity of $G$ (A2),

[TABLE]

Then using Proposition B.1 (a) and the fact that $G$ has bounded subgradients,

[TABLE]

where $\diamondsuit[n]$ stands for the expression $\sum_{m}\sum_{i\in\mathcal{N}_{m}}\left\|\hat{\mathbf{x}}^{m}_{i}(\bar{\mathbf{x}}[n])-\bar{\mathbf{x}}^{m}_{i}[n]\right\|^{2}$ . Combining Section B.7, LABEL:50 and (N1) with Cauchy-Schwarz inequality as well as triangle inequality, we get

[TABLE]

From the triangle inequality, Proposition B.3 (a) and 3 (e),

[TABLE]

Substitute these expressions back into (LABEL:51) and rearrange the terms to get

[TABLE]

We now exploit Lemma E.2 with $Y[n]=U^{*}_{n}(\bar{\mathbf{x}}[n])$ , $X[n]=\tau^{\min}_{n}\alpha[n]\diamondsuit[n]$ and

[TABLE]

Since $U(\bar{\mathbf{x}}[n])$ is coercive ((A3)), $Y[n]\not\rightarrow-\infty$ ; on the other hand, from Proposition B.3 (c), (d), and the assumption of the Theorem, $\sum_{n=1}^{\infty}Z[n]<\infty$ . Thus, by Lemma E.2 $\{U^{*}_{n}(\bar{\mathbf{x}}[n])\}$ converges to a finite value and $\sum_{n=1}^{\infty}\tau^{\min}_{n}\alpha[n]\diamondsuit[n]$ converges as well, which means

[TABLE]

This in turn implies

[TABLE]

At this point the localization is no longer an issue, and we will use the generalized definition of $\hat{\mathbf{x}}_{i,n}(\bar{\mathbf{x}}[n])\in\mathcal{K}$ so that we have $\lim_{n\rightarrow\infty}\left\|\hat{\mathbf{x}}_{i,n}(\bar{\mathbf{x}}[n])-\bar{\mathbf{x}}[n]\right\|=0$ for all $i\in\mathcal{N}$ .

Since $\{\bar{\mathbf{x}}[n]\}$ is bounded, which follows from the convergence of $\{U^{*}_{n}(\bar{\mathbf{x}}[n])\}$ , the sequence has a limit point $\bar{\mathbf{x}}^{\infty}\in\mathcal{K}$ . We assume $\bar{\mathbf{x}}[n]\rightarrow\bar{\mathbf{x}}^{\infty}$ . If this is not the case, one can pass to a subsequence $\bar{\mathbf{x}}[n_{k}]$ indexed by $k$ such that $\bar{\mathbf{x}}[n_{k}]\rightarrow\bar{\mathbf{x}}^{\infty}$ as $k\rightarrow\infty$ . We distinguish three cases: (1) bounded gradient ( $\exists\enspace B\text{ s.t. }\|\nabla f_{i}(\mathbf{x})\|<B\enspace\forall\enspace i,\mathbf{x}$ ), (2) unbounded gradient and interior point ( $\bar{\mathbf{x}}^{\infty}\in int(\mathcal{K})$ ), and (3) unbounded gradient and boundary point ( $\bar{\mathbf{x}}^{\infty}\in bd(\mathcal{K})$ ).

(1) bounded gradient: Recall the map defined in Proposition B.1

[TABLE]

This map is converging to the following map

[TABLE]

which might be multi-valued since we do not require $\tilde{f}_{i}$ to be strongly convex. The latter map is well-defined everywhere only with bounded gradient. Otherwise $\mathbf{\pi}_{i}$ could be infinite; moreover, when $\nabla f_{i}(\mathbf{x})=\infty$ and $\mathbf{x}\in int(\mathcal{K})$ , it is not possible to have simultaneously $\nabla\tilde{f}_{i}(\mathbf{x};\mathbf{x})=\nabla f_{i}(\mathbf{x})$ , $\tilde{f}_{i}$ defined everywhere, and $\tilde{f}_{i}$ convex. Thus, the analysis for this case does not work for the other two cases.

Now consider the two maps evaluated at $\bar{\mathbf{x}}[n]$ and $\bar{\mathbf{x}}^{\infty}$ respectively, $\hat{\mathbf{x}}_{i,n}(\bar{\mathbf{x}}[n])$ , the minimizer of $\tilde{U}_{i,n}(\bullet;\bar{\mathbf{x}}[n])\triangleq\psi_{n}$ , and $\hat{\mathbf{x}}_{i}(\bar{\mathbf{x}}^{\infty})$ , the set of minimizers of $\tilde{U}_{i}(\bullet;\bar{\mathbf{x}}^{\infty})\triangleq\psi$ . We have the following two properties.

•

$\{\psi_{n}\}$ is eventually level-bounded, i.e. $\forall\enspace\alpha\in\mathbb{R}$ , $\bigcup_{n\in N,N\in\mathcal{N}_{\infty}}lev_{\leq\alpha}\psi_{n}$ is bounded. Refer to [32], p. 8, p. 109, and p. 123 for the definitions of the notations. This is ensured by Assumption F3, i.e. either $\tilde{f}_{i}(\bullet;\mathbf{x})$ is coercive $\forall\enspace\mathbf{x},i$ or $G(\bullet)$ is coercive.

•

$\psi_{n}\overset{e}{\rightarrow}\psi$ , i.e. $\psi_{n}$ epi-converges to $\psi$ . See [32], p. 241 for the definition. This is due to $\{\tilde{U}_{i,n}\}$ and $\tilde{U}_{i}$ being continuous and $\lim_{n\rightarrow\infty}\tilde{U}_{i,n}=\tilde{U}_{i}$ , then by [32] Theorem 7.2, p. 241 we have $\psi_{n}\overset{e}{\rightarrow}\psi$ .

By [32] Theorem 7.33, p. 266, with these two properties, we then have

[TABLE]

Proposition 5(b) in [24] states that a fixed point of $\hat{\mathbf{x}}_{i}$ is also a stationary solution of the original optimization problem; this is proved in Proposition 8(b) of [14]. Things change slightly here because the minimizer defining $\hat{\mathbf{x}}_{i}$ may not be unique. However, that proof does not exploit any strong convexity property. Hence, $\bar{\mathbf{x}}^{\infty}$ is still a stationary solution.

(2) unbounded gradient and interior point: Effectively we want to show

[TABLE]

but we can no longer argue anything with $\hat{\mathbf{x}}_{i}$ . Only for the following we will write $\bar{\mathbf{x}}_{n}$ instead of $\bar{\mathbf{x}}[n]$ for simplicity. From the optimality condition of $\hat{\mathbf{x}}_{i,n}(\bar{\mathbf{x}}_{n})$ , we have that for all $\mathbf{z}\in\mathcal{K}$ ,

[TABLE]

where $\nabla\tilde{f}^{*}_{i,n}(\bar{\mathbf{x}}_{n};\bar{\mathbf{x}}_{n})+\sum_{j\neq i}\nabla f^{*}_{j,n}(\bar{\mathbf{x}}_{n})$ is just $\nabla F^{*}_{n}(\bar{\mathbf{x}}_{n})$ . The terms in the second bracket are bounded as follows

[TABLE]

where the second inequality is due to the Lipschitz continuities of $\nabla\tilde{f}^{*}_{i,n}(\mathbf{x};\bullet)$ and $\nabla f^{*}_{i,n}(\bullet)$ . Since we assume $\sum_{n}(L^{\max}_{n})^{3}\left(\frac{\alpha[n]}{\tau^{\min}_{n}}\right)^{2}<\infty$ in the condition and get $\sum_{n}\alpha[n]\tau^{\min}_{n}\|\hat{\mathbf{x}}_{i,n}(\bar{\mathbf{x}}_{n})-\bar{\mathbf{x}}_{n}\|^{2}<\infty$ , it must be that $\|\hat{\mathbf{x}}_{i,n}(\bar{\mathbf{x}}_{n})-\bar{\mathbf{x}}_{n}\|^{2}=O\left(\alpha[n]\frac{(L^{\max}_{n})^{3}}{(\tau^{\min}_{n})^{3}}\right)$ . Hence, with the conditions of $\lim_{n\rightarrow\infty}\alpha[n]\frac{(L^{\max}_{n})^{5}}{(\tau^{\min}_{n})^{3}}=0$ and $\lim_{n\rightarrow\infty}\nabla F^{*}_{n}=\nabla F$ , taking $n\rightarrow\infty$ in Eq. 47 yields exactly Eq. 46. Moreover, $\bar{\mathbf{x}}^{\infty}$ must be a point such that $\nabla F(\bar{\mathbf{x}}^{\infty})<\infty$ : otherwise, since $\bar{\mathbf{x}}^{\infty}$ is an interior point, there would exist a descent direction.

(3) unbounded gradient and boundary point: We can consider two subcases.

•

$\nabla F(\bar{\mathbf{x}}^{\infty})<\infty$ : we can use the same argument in case (2) to show that $\bar{\mathbf{x}}^{\infty}$ is a stationary solution. If we have $\|\nabla f_{i}(\bar{\mathbf{x}}^{\infty})\|<B\enspace\forall\enspace i$ , we can also use the same argument in case (1) confined to a small neighborhood of $\bar{\mathbf{x}}^{\infty}$ .

•

$\nabla F(\bar{\mathbf{x}}^{\infty})=\infty$ : the definition of stationary solution fails here and we can only turn to the definition of local minimum. However, both NEXT and our algorithm can numerically converge to a point which is not a local minimum.

Appendix C Pseudo Code of the Algorithms

In this appendix we give the complete pseudo code of the algorithms in Section 5 for the reader’s reference. Every $\bar{n}$ stands for $n+1$ , for compactness.

C.1 LXGP-RM

The LXGP-RM algorithm is given as follows:

In the algorithm, $p^{b}_{BK}$ is the collection of the variables $p^{b}_{b^{\prime}k}\enspace\forall\enspace b^{\prime}\in B,k\in K$ , and so are the other quantities $x_{bI(b)K}$ , $q^{b}_{BK}$ , etc. The collection of variables $p^{b}_{BK}$ can be viewed as a $B\times K$ matrix, or a vector of $BK$ dimensions. $\mathbf{1}_{BK}$ is a $B\times K$ matrix (or vector) consisting of all 1’s. $\circ$ denotes the element-wise product, also known as Hadamard product or Schur product.

C.2 LXLP-RM

Using the same notations as in LXGP-RM, the LXLP-RM algorithm is given as follows:

In line 10, the notation means that $p^{b}_{b^{\prime\prime}K}[n+1]$ is updated as $\sum_{b^{\prime}\in Nb(b)}W_{bb^{\prime}}(b^{\prime\prime})q^{b^{\prime}}_{b^{\prime\prime}K}[n]$ for all $b^{\prime\prime}\in Nb(b)$ ; lines 11 and 12 are read the same way. In line 12, $d_{b^{\prime\prime}}$ denotes the degree of $b^{\prime\prime}$ , i.e. $d_{b^{\prime\prime}}=|N(b^{\prime\prime})|$ . If $\tilde{\pi}^{b}_{Nb(b)K}$ is viewed as an $Nb(b)\times K$ matrix, then $(d_{Nb(b)}+\mathbf{1}_{Nb(b)})\mathbf{1}_{K}^{T}$ consists of $|Nb(b)|$ rows; each row has $|K|$ elements, and every element in the $b^{\prime\prime}$ -th row is $d_{b^{\prime\prime}}+1=|Nb(b^{\prime\prime})|$ .
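The shape of the matrix $(d_{Nb(b)}+\mathbf{1}_{Nb(b)})\mathbf{1}_{K}^{T}$ can be made concrete with a small sketch. The degrees and the number of columns below are made-up values, and the helper names are hypothetical. Applying the matrix through an element-wise product is an assumption made here by analogy with the scaling $I_{m}\mathbf{y}^{m,av}_{i}[n]$ in Appendix B.1; the sketch does not reproduce line 12 of the pseudo code.

```python
def degree_plus_one_matrix(degrees, num_k):
    """One row per node b'' in Nb(b); every entry of row b'' equals d_{b''} + 1."""
    return [[d + 1] * num_k for d in degrees]

def hadamard(A, B):
    """Element-wise (Hadamard) product of two equally sized matrices."""
    return [[a * b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]

degrees = [2, 3, 1]  # hypothetical degrees of the nodes in Nb(b)
scale = degree_plus_one_matrix(degrees, num_k=4)

# Hypothetical tracking values, viewed as an |Nb(b)| x |K| matrix.
y = [[0.5] * 4 for _ in degrees]
scaled = hadamard(scale, y)
print(scale)
print(scaled)
```

Each row of `scale` repeats a single number, so the element-wise product multiplies the whole row for $b^{\prime\prime}$ by $|Nb(b^{\prime\prime})|$ .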

C.3 GXGP-CM

The GXGP-CM algorithm is given as follows:

Appendix D A Stochastic Approximation Viewpoint

In this appendix, we review NEXT from the viewpoint of stochastic approximation. We provide an alternative proof of Theorem 4 in [24] using results from stochastic approximation in Section D.1. As we will see, NEXT can be seen as a two time-scale process, with the faster $\mathbf{y}$ tracking the total gradient, the slower $\mathbf{x}$ tracking the fixed point iteration of $\hat{\mathbf{x}}(\bullet)$ , and a repeated projection onto the consensus plane. From this viewpoint and the fact that the local optimization in Eq. 4 (in NEXT version, see Equation (8), [24]) can be solved by the projected gradient descent method described in Section D.2, we can interleave each “descent” as another time-scale of the algorithm. We relate the result with another distributed non-convex optimization method proposed in [6].

D.1 Alternative Proof of NEXT

Substituting the definitions of $\mathbf{z}$ and $\tilde{\mathbf{\pi}}$ into Inexact NEXT (Algorithm 2, [24]), we can rewrite each iteration of the algorithm in two steps:

[TABLE]

where $\|\mathbf{e}_{i}[n]\|\leq\epsilon_{i}[n]\enspace\forall\enspace i$ , and $\tilde{\mathbf{x}}_{i}(\mathbf{x}_{i}[n],\mathbf{y}_{i}[n])$ is given by (8) in [24] with $\tilde{\mathbf{\pi}}_{i}$ substituted by $\mathbf{y}_{i}$ using (S.3) (c) of Algorithm 1 in [24]. By letting $\mathbf{u}_{i}[n]=\mathbf{y}_{i}[n]-\nabla f_{i}(\mathbf{x}_{i}[n])$ , Eq. 48 can be rewritten as

[TABLE]

where $\beta[n]=1$ . Since $\alpha[n]\rightarrow 0$ , it is evident that $\alpha[n]=o(\beta[n])$ . As a result, Eq. 49 and Eq. 50 together form a two time-scale stochastic approximation algorithm [7], where $\mathbf{u}_{i}$ or $\mathbf{y}_{i}$ evolves on a faster, natural time-scale with constant step sizes, and $\mathbf{x}_{i}$ evolves on a slower, algorithmic time-scale with shrinking step sizes.

To analyze this process, we begin with the fact that the fast variable $\mathbf{u}$ or $\mathbf{y}$ views the slow variable $\mathbf{x}$ as quasi-static, i.e. we can treat $\mathbf{x}$ as constant in Eq. 50. Denote by $\mathbf{u}$ the ensemble of the $\mathbf{u}_{i}$ ’s, i.e. $\mathbf{u}=\begin{bmatrix}\mathbf{u}_{1}^{T}&\cdots&\mathbf{u}_{I}^{T}\end{bmatrix}^{T}$ , by $\nabla f$ the ensemble of the $\nabla f_{i}$ ’s, and similarly for $\mathbf{x}$ , $\mathbf{y}$ , etc. Then the iterates of $\mathbf{u}$ will asymptotically track the following ordinary differential equation (ODE)

[TABLE]

where $\mathbf{I}$ is the identity matrix, $d$ denotes the dimension of $\mathbf{x}_{i}$ ’s, and $\otimes$ means the Kronecker product. Then Eq. 51 would become

[TABLE]

Lemma D.1.

*We have $\lim_{t\rightarrow\infty}\mathbf{y}(t)=\overline{\nabla f}(\mathbf{x})\otimes\mathbf{1}_{I}$ .*

Proof D.2.

We have

[TABLE]

*where $\mathbf{1}$ is the all-ones vector and $\overline{\nabla f}=\frac{1}{I}\sum_{i=1}^{I}\nabla f_{i}$ . The second equality follows from the fact that the two matrices commute, the third from $e^{\mathbf{A}\otimes\mathbf{I}+\mathbf{I}\otimes\mathbf{B}}=e^{\mathbf{A}}\otimes e^{\mathbf{B}}$ , and the fourth from the fact that $\lim_{t\rightarrow\infty}W^{t}=\frac{1}{I}(\mathbf{1}_{I}\mathbf{1}_{I}^{T})$ .*
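For completeness, the exponential identity used in the third equality can be checked in two lines. This is a standard argument, sketched here for square matrices $\mathbf{A}$ and $\mathbf{B}$ with identity matrices of compatible sizes; it is not reproduced from [24].

```latex
\begin{align*}
(\mathbf{A}\otimes\mathbf{I})(\mathbf{I}\otimes\mathbf{B})
  &= \mathbf{A}\otimes\mathbf{B}
   = (\mathbf{I}\otimes\mathbf{B})(\mathbf{A}\otimes\mathbf{I}), \\
e^{\mathbf{A}\otimes\mathbf{I}+\mathbf{I}\otimes\mathbf{B}}
  &= e^{\mathbf{A}\otimes\mathbf{I}}\, e^{\mathbf{I}\otimes\mathbf{B}}
   = (e^{\mathbf{A}}\otimes\mathbf{I})(\mathbf{I}\otimes e^{\mathbf{B}})
   = e^{\mathbf{A}}\otimes e^{\mathbf{B}}.
\end{align*}
```

The first line is the mixed-product property and shows that the two summands commute, which allows the exponential of the sum to be split. The second line uses $(\mathbf{A}\otimes\mathbf{I})^{k}=\mathbf{A}^{k}\otimes\mathbf{I}$ term by term in the power series.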

We see that $\mathbf{y}(t)$ indeed goes to the unique globally asymptotically stable equilibrium, where every component of $\mathbf{y}$ , i.e. each $\mathbf{y}_{i}$ , equals the average of the gradients as desired.
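A discrete-time analogue of this behavior can be checked numerically. The sketch below assumes a doubly stochastic weight matrix on a 4-node ring and arbitrary toy gradients at a frozen $\mathbf{x}$ ; it also assumes that, once $\mathbf{x}$ is frozen, the tracking recursion reduces to repeated averaging $\mathbf{y}\leftarrow W\mathbf{y}$ . It illustrates the limit rather than reproducing the ODE argument.

```python
import numpy as np

I, d = 4, 2  # hypothetical number of nodes and variable dimension

# Doubly stochastic weights on a 4-node ring (self weight 1/2, neighbors 1/4).
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])

# Consensus limit: W^t approaches (1/I) * 1 1^T.
W_limit = np.linalg.matrix_power(W, 60)

# Toy local gradients at a frozen x: row i holds grad f_i(x) (arbitrary values).
grads = np.array([[1.0, -2.0], [3.0, 0.0], [-1.0, 4.0], [5.0, 2.0]])
avg_grad = grads.mean(axis=0)

# With x frozen the gradient increments vanish, so tracking reduces to y <- W y.
y = grads.copy()
for _ in range(60):
    y = W @ y
```

After the loop every row of `y` agrees with `avg_grad` up to numerical precision, mirroring the statement that each $\mathbf{y}_{i}$ converges to the average gradient.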

Next, from the perspective of the slow variable $\mathbf{x}$ , the fast variable $\mathbf{y}$ has already reached its equilibrium $\bar{\mathbf{y}}(\mathbf{x})$ . That is to say, in Eq. 49, $\tilde{\mathbf{x}}_{j}(\mathbf{x}_{j}[n],\mathbf{y}_{j}[n])$ can be seen as $\tilde{\mathbf{x}}_{j}(\mathbf{x}_{j}[n],\overline{\nabla f}(\mathbf{x}_{j}[n]))$ , which is exactly $\hat{\mathbf{x}}_{j}(\mathbf{x}_{j}[n])$ , so that Eq. 49 becomes

[TABLE]

As stated in [25], this recursive relation is again a two time-scale stochastic approximation in disguise, with a fast averaging process and a slow learning process. In fact, the averaging process runs on the same natural time-scale as $\mathbf{y}$ . From [25] we know that the iterates of $\mathbf{x}$ will reach consensus $\mathbf{x}[n]=\begin{bmatrix}\mathbf{x}_{c}[n]^{T}&\cdots&\mathbf{x}_{c}[n]^{T}\end{bmatrix}^{T}$ , while each of its components $\mathbf{x}_{c}[n]\in\mathbb{R}^{d}$ tracks the ODE

[TABLE]

as $\mathbf{1}_{I}/I$ is the unique stationary distribution resulting from $W$ .

Note that since the $\hat{\mathbf{x}}_{i}$ ’s are Lipschitz continuous ([24], Prop. 5a), this ODE is well-posed. For now we assume that $G$ is differentiable, to avoid dealing with a trickier non-differentiable Lyapunov function. We take the whole objective itself as the Lyapunov function, $V(\mathbf{x}_{c})=U(\mathbf{x}_{c})=F(\mathbf{x}_{c})+G(\mathbf{x}_{c})$ . Then

[TABLE]

for some positive constant $c_{\tau}$ . The first inequality is established as in [24], Prop. 5b. By LaSalle’s invariance principle, the iterates converge to the set of equilibria $\{\mathbf{x}_{c}:\frac{1}{I}\sum_{i=1}^{I}\hat{\mathbf{x}}_{i}(\mathbf{x}_{c})=\mathbf{x}_{c}\}$ ([8], p. 57 and p. 118), which is the set of stationary solutions of the original optimization problem ([24], Prop. 2).

We note that the conditions for applying [7] and [25] are either established in [24] or implied by the assumptions of Theorem 4 in [24]. Specifically, the boundedness of $\mathbf{x}$ follows from the recursive relation Eq. 49 and Proposition 9 (a) in [24], while $\sum_{n}\alpha[n]=\sum_{n}\beta[n]=\infty$ , $\sum_{n}\alpha[n]^{2}<\infty$ , and $\sup\sum_{n}\alpha[n]\mathbf{e}_{i}[n]<\infty$ are assumed in Theorem 4 in [24]. Note that we do not need $\sum_{n}\beta[n]^{2}<\infty$ , as there is no noise in the recursion of $\mathbf{y}$ . Also, we have deterministic convergence rather than almost sure convergence, since our noise term $\mathbf{e}_{i}[n]$ is deterministically bounded rather than a martingale difference sequence.

D.2 A Remark on Using One-Step Gradient Descent

Solving the local optimization in Equation (8) in [24] may be costly. Instead, from the stochastic approximation viewpoint, we can solve it iteratively as well using a time-scale faster than $\alpha[n]$ . Once again assume that $G$ is continuously differentiable to avoid working with subgradients. The optimization problem in Equation (8) in [24] belongs to the class of constrained convex optimization problems, and can be solved by projected gradient descent:

[TABLE]

where $\gamma[n]$ is some time-scale faster than $\alpha[n]$ and $P_{\mathcal{K}}$ is the Euclidean projection onto the set $\mathcal{K}$

[TABLE]

Consider using the natural time-scale $\gamma[n]=1$ . The idea, borrowed from stochastic approximation, is essentially to blend Eq. 57 into the original algorithm [34]. Namely, at $\mathbf{x}_{i}[n]$ , instead of running Eq. 57 infinitely many times to solve (8) in [24] exactly, we run only a single gradient descent step of Eq. 57 from $\mathbf{x}_{i}[n]$ :

[TABLE]

and then use $\tilde{\mathbf{x}}^{\prime}_{i}[n]$ instead of $\tilde{\mathbf{x}}_{i}[n]$ in (S.2) (a) of Algorithm 1 in [24]. As $\mathbf{x}_{i}[n]$ converges, so does the coupling process of Eq. 57.
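The contrast between solving the inner problem to convergence and taking a single projected step can be sketched as follows. The box-shaped set $\mathcal{K}$ , the quadratic toy surrogate, and the step size 0.5 (chosen instead of the natural time-scale 1 so that the two outputs visibly differ) are all assumptions for illustration; the function names are hypothetical and do not come from [24].

```python
import numpy as np

def project_box(x, lower, upper):
    """Euclidean projection onto the box K = [lower, upper]^d (assumed set)."""
    return np.clip(x, lower, upper)

def projected_gradient_step(x, grad, gamma, lower, upper):
    """One step x <- P_K[x - gamma * grad(x)]."""
    return project_box(x - gamma * grad(x), lower, upper)

# Toy strongly convex surrogate 0.5 * ||x - c||^2, whose gradient is x - c.
c = np.array([1.5, -0.3, 0.4])
grad = lambda x: x - c

x0 = np.zeros(3)

# Inner loop run to (numerical) convergence ...
x_inner = x0.copy()
for _ in range(200):
    x_inner = projected_gradient_step(x_inner, grad, 0.5, 0.0, 1.0)

# ... versus the single step used when the inner solve is interleaved.
x_one_step = projected_gradient_step(x0, grad, 0.5, 0.0, 1.0)
```

Here `x_inner` reaches the constrained minimizer, which for this toy problem is the projection of `c` onto the box, while `x_one_step` only moves part of the way. The interleaved scheme relies on the outer iteration to close that gap over time.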

From the previous subsection we know that, from the perspective of the slow variable $\mathbf{x}$ , $\mathbf{y}_{j}$ can be seen as $\overline{\nabla f}(\mathbf{x}_{j})$ , and thus $\tilde{\mathbf{\pi}}_{i}[n]$ as $\sum_{j\neq i}\nabla f_{j}(\mathbf{x}_{i}[n])$ . Combining this with Eq. 58 and Algorithm 2 in [24] yields

[TABLE]

Note that Eq. 59 contains two projections. It is a simple gradient descent step followed by a projection onto the consensus plane $\mathcal{C}:=\{\mathbf{X}=[\mathbf{x}_{1}^{T}\enspace\cdots\enspace\mathbf{x}_{I}^{T}]^{T}\in\mathbb{R}^{dI}:\mathbf{x}_{1}=\cdots=\mathbf{x}_{I}\}$ , similar to Example 3 in [25], and then a further projection onto $\mathcal{K}$ . As in Section D.1, the iterates of $\mathbf{x}$ will reach consensus $\mathbf{x}[n]=\begin{bmatrix}\mathbf{x}_{c}[n]^{T}&\cdots&\mathbf{x}_{c}[n]^{T}\end{bmatrix}^{T}$ , while each of its components $\mathbf{x}_{c}[n]\in\mathbb{R}^{d}$ tracks the ODE

[TABLE]

called the projected gradient flow. From Proposition 5 of [6] (also [34], p. 10), $U$ works as a Lyapunov function for the set of its stationary solutions, and then by Theorem 2 of [6] (also [34], Proposition 9 and Remark 10) the iterates converge almost surely to the set of stationary solutions.⁶ ⁷ ⁸

⁶ Again, “almost surely” is unnecessary here, since our noise is deterministically bounded.

⁷ If in general we consider $\dot{\mathbf{x}}_{c}(t)=\mathcal{P}_{\mathcal{K}}^{e}[h(\mathbf{x}_{c}(t))]$ , a Lyapunov function may not exist, and we are only guaranteed convergence to a “nonempty compact connected internally chain transitive invariant set” in $\mathcal{K}\cap\mathcal{C}$ [25]. A detailed tutorial covering the difference between these sets can be found in [5].

⁸ Coerciveness of $U$ is required to apply the LaSalle invariance principle. Without it, $U$ has to be analytic to make the set of local minima and the set of Lyapunov stable points of the gradient flow equal to each other [1].

In Eq. 59, the matrix $\mathbf{W}$ only needs to be column stochastic instead of doubly stochastic. There exists some distribution $\{\zeta_{i}\}_{i=1}^{I}$ to which $\mathbf{W}$ converges, and the ODE will be minimizing $\sum_{i=1}^{I}\zeta_{i}U=U$ . It immediately follows that with a doubly stochastic $\mathbf{W}$ (which induces $\{\frac{1}{I}\}_{i=1}^{I}$ ), we can reduce $\nabla U(\mathbf{x}_{j}[n])$ in Eq. 59 to $\nabla f_{j}(\mathbf{x}_{j}[n])+\nabla G(\mathbf{x}_{j}[n])$ and still minimize $\sum_{i=1}^{I}\frac{1}{I}(f_{i}+G)=U$ . This implies that we can simply let the gradient of each node spread through the gossip, and no longer have to track $\mathbf{y}$ and $\tilde{\mathbf{\pi}}$ . This is exactly what is done in [6] (with $G=0$ ).
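The effect of letting local gradients spread through the gossip can be seen on a toy problem. The sketch below is written in the spirit of gossip-based gradient descent and is not the algorithm of [6]: it assumes scalar quadratic objectives $f_{i}(x)=\frac{1}{2}(x-c_{i})^{2}$ , $G=0$ , no projection, no noise, a step size $\alpha[n]=1/(n+1)$ , and a descend-then-average ordering.

```python
import numpy as np

# Doubly stochastic gossip matrix on a 4-node ring.
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])

# Toy local objectives f_i(x) = 0.5 * (x - c_i)^2 with G = 0, scalar x.
c = np.array([1.0, 2.0, 3.0, 6.0])
x = np.zeros(4)  # x[i] is node i's local copy of the decision variable

for n in range(5000):
    alpha = 1.0 / (n + 1)             # diminishing step size
    local_grad = x - c                # node i evaluates only its own gradient
    x = W @ (x - alpha * local_grad)  # local descent, then gossip averaging

target = c.mean()  # minimizer of (1/I) * sum_i f_i
```

Although no node ever evaluates the full gradient, all local copies end up close to `target`, since the doubly stochastic averaging weights every local objective by $\frac{1}{I}$ .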

The two papers [24] and [6] have their own strengths. The method in [6] solves the whole problem with less computation because it avoids the costly local optimization, while NEXT not only allows a non-differentiable $G$ but also converges within fewer iterations. The latter is useful in delay-sensitive applications with abundant computing resources.

Appendix E Technical Lemmas

This appendix collects some technical lemmas used in the proofs. Some of them are from [24].

Fact 3.

For all $m$ we have the following.

(a) $J^{m}_{\perp}\hat{\mathbf{W}}^{m}[n]=J^{m}_{\perp}\hat{\mathbf{W}}^{m}[n]J^{m}_{\perp}=\hat{\mathbf{W}}^{m}[n]-\frac{1}{I_{m}}\mathbf{1}_{\mathcal{N}_{m}}\mathbf{1}_{\mathcal{N}_{m}}^{T}\otimes\mathbf{I}_{d_{m}}$ , where we also have

[TABLE]

We use the equality $(A\otimes B)\cdot(C\otimes D)=(A\cdot C)\otimes(B\cdot D)$ in showing the above equation.

(b) $J^{m}_{\perp}\hat{\mathbf{W}}^{m}[n]J^{m}_{\perp}\hat{\mathbf{W}}^{m}[n-1]\cdots J^{m}_{\perp}\hat{\mathbf{W}}^{m}[l]=J^{m}_{\perp}\hat{\mathbf{P}}^{m}[n,l]=\left(\mathbf{P}^{m}[n,l]-\frac{1}{I_{m}}\mathbf{1}_{\mathcal{N}_{m}}\mathbf{1}_{\mathcal{N}_{m}}^{T}\right)\otimes\mathbf{I}_{d_{m}}$ .

(c) $\bar{\mathbf{q}}^{m}\triangleq\frac{1}{I_{m}}\sum_{i\in{\mathcal{N}_{m}}}q_{i}=\frac{\mathbf{1}_{\mathcal{N}_{m}}^{T}\otimes\mathbf{I}_{d_{m}}}{I_{m}}\mathbf{q}$ , where $\mathbf{q}=[q_{1}^{T}\enspace\cdots\enspace q_{I}^{T}]^{T}$ and $q_{1},\dots,q_{I}$ are all arbitrary in $\mathbb{R}^{d_{m}}$ .

(d) $\mathbf{x}^{m}[n]=\hat{\mathbf{W}}^{m}[n-1]\mathbf{x}^{m}[n-1]+\alpha[n-1]\hat{\mathbf{W}}^{m}[n-1]\Delta\mathbf{x}^{m,inx}[n-1]$ , where $\Delta\mathbf{x}^{m,inx}[n]=\left(\mathbb{I}\{i\in{\mathcal{N}_{m}}\}(\mathbf{x}^{m,inx}_{i}[n]-\mathbf{x}^{m}_{i}[n])\right)_{i\in\mathcal{N}}$ . This simply follows from Lines 7 and 9 of Algorithm 1.

(e) $\bar{\mathbf{x}}^{m}[n]=\bar{\mathbf{x}}^{m}[n-1]+\frac{\alpha[n-1]}{I_{m}}\left(\mathbf{1}_{\mathcal{N}_{m}}^{T}\otimes\mathbf{I}_{d_{m}}\right)\Delta\mathbf{x}^{m,inx}[n-1]$ . This follows from applying (c) to (d).
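The mixed-product equality and the shape of part (a) can be checked numerically. The sketch below works under simplifying assumptions that are not taken from the proofs: every node belongs to $\mathcal{N}_{m}$ , the weight matrix is a fixed doubly stochastic $W$ with $\hat{\mathbf{W}}=W\otimes\mathbf{I}$ , and $J_{\perp}$ is the identity minus the averaging operator. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
A, C = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
Bm, D = rng.standard_normal((2, 2)), rng.standard_normal((2, 2))

# Mixed-product property of the Kronecker product.
lhs = np.kron(A, Bm) @ np.kron(C, D)
rhs = np.kron(A @ C, Bm @ D)

# Toy instance of part (a) with every node in N_m (I_m = 3) and d_m = 2.
I_m, d_m = 3, 2
W = np.array([[0.6, 0.2, 0.2],
              [0.2, 0.6, 0.2],
              [0.2, 0.2, 0.6]])                # doubly stochastic
ones = np.ones((I_m, 1))
W_hat = np.kron(W, np.eye(d_m))
J = np.kron(ones @ ones.T / I_m, np.eye(d_m))  # averaging operator
J_perp = np.eye(I_m * d_m) - J                 # removes the consensus component
```

With these definitions, `J_perp @ W_hat` and `J_perp @ W_hat @ J_perp` both coincide with `W_hat - J`, because a doubly stochastic $W$ leaves the all-ones vector fixed from both sides.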

Lemma E.1.

Let $0<\lambda<1$ , and let $\{\beta[n]\}$ and $\{\nu[n]\}$ be two positive scalar sequences. Then

(a) If $\lim_{n\rightarrow\infty}\beta[n]=0$ , then $\lim_{n\rightarrow\infty}\sum_{l=1}^{n}\lambda^{n-l}\beta[l]=0$ .

(b) If further we have $\sum_{n=1}^{\infty}\beta^{2}[n]<\infty$ and $\sum_{n=1}^{\infty}\nu^{2}[n]<\infty$ , then $\lim_{n\rightarrow\infty}\sum_{k=1}^{n}\sum_{l=1}^{k}\lambda^{k-l}\beta^{2}[l]<\infty$ and $\lim_{n\rightarrow\infty}\sum_{k=1}^{n}\sum_{l=1}^{k}\lambda^{k-l}\beta[k]\nu[l]<\infty$ .
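A quick numerical check makes both parts concrete. The sketch below picks $\lambda=0.5$ and $\beta[n]=1/n$ as an illustrative choice (not one prescribed by the lemma) and evaluates the geometric convolution through the recursion $s[n]=\lambda s[n-1]+\beta[n]$ . The helper name is hypothetical, and a finite computation can only illustrate the limits, not prove them.

```python
def geometric_convolution(beta, lam):
    """Return s[n] = sum_{l=1}^{n} lam**(n-l) * beta[l] for n = 1..len(beta)."""
    sums, running = [], 0.0
    for b in beta:
        running = lam * running + b  # s[n] = lam * s[n-1] + beta[n]
        sums.append(running)
    return sums

lam = 0.5
beta = [1.0 / n for n in range(1, 2001)]  # beta[n] -> 0 and is square-summable

# Part (a): the convolution itself becomes small as n grows.
s = geometric_convolution(beta, lam)

# Part (b), first claim: partial sum of sum_k sum_l lam**(k-l) * beta[l]**2.
s_sq = geometric_convolution([b * b for b in beta], lam)
double_sum = sum(s_sq)
```

Swapping the order of summation shows that `double_sum` is bounded by $\frac{1}{1-\lambda}\sum_{l}\beta^{2}[l]$ , which is the mechanism behind part (b).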

Lemma E.2.

*Let $\{Y[n]\}$ , $\{X[n]\}$ , and $\{Z[n]\}$ be three sequences of numbers such that $X[n]\geq 0$ for all $n$ . If $Y[n+1]\leq Y[n]-X[n]+Z[n]$ for all $n$ and $\sum_{n=1}^{\infty}Z[n]<\infty$ , then either $Y[n]\rightarrow-\infty$ or $\{Y[n]\}$ converges to a finite value and $\sum_{n=1}^{\infty}X[n]<\infty$ .*

Lemma E.3.

*Let $0<\lambda<1$ , and let $\{\beta[n]\}$ and $\{\nu[n]\}$ be two positive scalar sequences such that $\beta[n]\rightarrow 0$ , $\nu[n]\rightarrow\infty$ , and $\beta[n]\nu[n]\rightarrow 0$ . If further there exist $1>\tilde{\lambda}>\lambda$ and $N$ such that $\frac{\beta[n]}{\beta[l]}\geq\tilde{\lambda}^{n-l}$ for all $n\geq l\geq N$ , then $\lim_{n\rightarrow\infty}\nu[n]\sum_{l=1}^{n}\lambda^{n-l}\beta[l]=0$ . Moreover, if $\beta[n]\nu[n]$ is summable, then so is $\nu[n]\sum_{l=1}^{n}\lambda^{n-l}\beta[l]$ .*

Proof E.4.

The proof of the first part of the claim is straightforward.

[TABLE]

*The second term goes to zero by the condition. Note that the condition $\frac{\beta[n]}{\beta[l]}\geq\tilde{\lambda}^{n-l}$ says that $\beta[n]$ cannot decay to zero faster than at an exponential rate. Thus, $\beta[n]\nu[n]\rightarrow 0$ implies that $\lambda^{n}\nu[n]\rightarrow 0$ as well. For the second part of the claim, just sum Eq. 61 over $n$ .*

Bibliography

[1] P.-A. Absil and K. Kurdyka, On the stable equilibrium points of gradient systems, Systems and Control Letters, 55 (2006), pp. 573–577.
[2] M. Alleven, 5G, LTE-A to drive long-term growth in backhaul market: IHS. Available at http://www.fiercewireless.com/tech/5g-lte-a-to-drive-long-term-growth-backhaul-market-ihs, 2016.
[3] H. Attouch and D. Azé, Approximation and regularization of arbitrary functions in Hilbert spaces by the Lasry-Lions method, Annales de l’IHP Analyse non linéaire, 10 (1993), pp. 289–312.
[4] A. Beck and L. Tetruashvili, On the convergence of block coordinate descent type methods, SIAM Journal on Optimization, 23 (2013), pp. 2037–2060.
[5] M. Benaïm, J. Hofbauer, and S. Sorin, Stochastic approximations and differential inclusions, SIAM Journal on Control and Optimization, 44 (2005), pp. 328–348.
[6] P. Bianchi and J. Jakubowicz, Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization, IEEE Transactions on Automatic Control, 58 (2013), pp. 391–405.
[7] V. S. Borkar, Stochastic approximation with two time scales, Systems & Control Letters, 29 (1997), pp. 291–294.
[8] V. S. Borkar, Stochastic approximation: A dynamical systems viewpoint, Hindustan Book Agency, and Cambridge University Press, New Delhi, India, and Cambridge, UK, 2008.

