Asynchronous Incremental Stochastic Dual Descent Algorithm for Network Resource Allocation

Amrit S. Bedi and Ketan Rajawat

arXiv:1702.08290·math.OC·May 9, 2018·IEEE Trans. Signal Process.

TL;DR

This paper introduces an asynchronous incremental stochastic dual descent algorithm for network resource allocation, enabling efficient, distributed optimization in heterogeneous networks with delayed gradient updates.

Contribution

It presents a novel asynchronous incremental dual descent algorithm that handles delayed stochastic gradients, suitable for energy-constrained and heterogeneous network environments.

Findings

1. Proves dual convergence under diminishing step sizes, and convergence to within O(ϵ) of the dual optimum under a constant step size ϵ.

2. Shows near-optimality of the resource allocation policy with constant step size.

3. Demonstrates effectiveness through a multi-cell coordinated beamforming application.

Equations

$\mathsf{P}:=\max\ \sum_{i=1}^{K}f^{i}({\mathbf{x}}^{i})$

$\text{s.t.}\ \sum_{i=1}^{K}\Big({\mathbf{u}}^{i}({\mathbf{x}}^{i})+{\mathbb{E}}\big[{\mathbf{v}}^{i}({\mathbf{h}}^{i},{\mathbf{p}}^{i}_{{\mathbf{h}}^{i}})\big]\Big)\succeq 0$

${\mathbf{x}}^{i}\in{\mathcal{X}}^{i},\ {\mathbf{p}}^{i}\in{\mathcal{P}}^{i}$

$\sum_{i=1}^{K}U(r^{i})$

$\max_{r^{i},p^{i}}\ \sum_{i=1}^{K}U(r^{i})$

$\text{s.t.}\ {\mathbb{E}}\Big[\sum_{i=1}^{K}\tfrac{1}{2}\log(1+h^{i}p^{i}_{h^{i}})\Big]$

${\mathbb{E}}\Big[\sum_{i=1}^{K}p^{i}_{h^{i}}\Big]$

$r^{i}\in[r_{\min},r_{\max}],\ p^{i}\in\mathcal{P}^{i}$

$L({\bm{\lambda}},{\mathbf{X}},{\mathbf{P}})$

$D({\bm{\lambda}})=:\sum_{i=1}^{K}D^{i}({\bm{\lambda}})$

$\mathsf{D}=\min_{{\bm{\lambda}}\in{\mathbb{R}}^{d}_{+}}\sum_{i=1}^{K}D^{i}({\bm{\lambda}})$

$\{{\mathbf{x}}_{t}^{i}({\bm{\lambda}}_{t}),{\mathbf{p}}_{t}^{i}({\bm{\lambda}}_{t})\}$

${\bm{\lambda}}_{t+1}=\Big[{\bm{\lambda}}_{t}-\epsilon\sum_{i=1}^{K}{\mathbf{g}}_{t}^{i}({\mathbf{p}}_{t}^{i}({\bm{\lambda}}_{t}),{\mathbf{x}}^{i}_{t}({\bm{\lambda}}_{t}))\Big]^{+}$

$\{{\mathbf{x}}_{t}^{i}({\bm{\lambda}}_{t-\pi_{i}(t)}),{\mathbf{p}}_{t}^{i}({\bm{\lambda}}_{t-\pi_{i}(t)})\}:=\operatorname*{arg\,max}_{{\mathbf{x}}\in{\mathcal{X}}^{i},\,\mathring{{\mathbf{p}}}\in\Pi_{t}^{i}}f^{i}({\mathbf{x}})+\langle{\bm{\lambda}}_{t-\pi_{i}(t)},{\mathbf{g}}_{t}^{i}(\mathring{{\mathbf{p}}},{\mathbf{x}})\rangle$

${\bm{\lambda}}_{t+1}=\Big[{\bm{\lambda}}_{t}-\epsilon\sum_{i=1}^{K}{\mathbf{g}}^{i}_{t-\delta_{i}(t)}({\bm{\lambda}}_{t-\tau_{i}(t)})\Big]^{+}$

${\mathbf{g}}^{i}_{t-\delta_{i}(t)}({\bm{\lambda}}_{t-\tau_{i}(t)}):={\mathbf{g}}^{i}_{t-\delta_{i}(t)}\big({\mathbf{p}}^{i}_{t-\delta_{i}(t)}({\bm{\lambda}}_{t-\tau_{i}(t)}),{\mathbf{x}}^{i}_{t-\delta_{i}(t)}({\bm{\lambda}}_{t-\tau_{i}(t)})\big)$

$({\mathbf{x}}_{t}^{i},{\mathbf{p}}_{t}^{i}):=\operatorname*{arg\,max}_{{\mathbf{x}}\in{\mathcal{X}}^{i},\,\mathring{{\mathbf{p}}}\in\Pi_{t}^{i}}f^{i}({\mathbf{x}})+\langle{\bm{\lambda}}^{i-1}_{t-\pi_{i}(t)},{\mathbf{g}}_{t}^{i}(\mathring{{\mathbf{p}}},{\mathbf{x}})\rangle$

$\mathsf{D}=\min_{{\bm{\lambda}}\in\Lambda}\sum_{i=1}^{K}D^{i}({\bm{\lambda}})$

${\mathbb{E}}\big[\|{\mathbf{g}}_{t}^{i}({\bm{\lambda}})\|^{2}\big]\leq V_{i}^{2}$

$\liminf_{t\to\infty}{\mathbb{E}}[D({\bm{\lambda}}_{t})]\leq\mathsf{D}+\mathcal{O}(\epsilon)$

$\liminf_{t\to\infty}{\mathbb{E}}[D({\bm{\lambda}}_{t})]=\mathsf{D}$

$\|\nabla D^{i}({\bm{\lambda}})-\nabla D^{i}({\bm{\lambda}}^{\prime})\|\leq L_{i}\|{\bm{\lambda}}-{\bm{\lambda}}^{\prime}\|$

$\liminf_{t\to\infty}\Big[\sum_{i=1}^{K}{\mathbb{E}}[D^{i}({\bm{\lambda}}_{t})]\Big]=\mathsf{D}$

$\sum_{t=1}^{T}\sum_{i=1}^{K}2\epsilon_{t}{\mathbb{E}}[D^{i}({\bm{\lambda}}_{t})]-\mathsf{D}$

Full text

Asynchronous Incremental Stochastic Dual Descent Algorithm for Network Resource Allocation

Amrit Singh Bedi, Student Member, IEEE, and Ketan Rajawat, Member, IEEE

The authors are with the Department of Electrical Engineering, IIT Kanpur, Kanpur (UP), India 208016 (email: amritbd, [email protected]).

Abstract

Stochastic network optimization problems entail finding resource allocation policies that are optimal on average but must be designed in an online fashion. Such problems are ubiquitous in communication networks, where resources such as energy and bandwidth are divided among nodes to satisfy certain long-term objectives. This paper proposes an asynchronous incremental dual descent resource allocation algorithm that utilizes delayed stochastic gradients for carrying out its updates. The proposed algorithm is well-suited to heterogeneous networks as it allows the computationally-challenged or energy-starved nodes to, at times, postpone the updates. The asymptotic analysis of the proposed algorithm is carried out, establishing dual convergence under both constant and diminishing step sizes. It is also shown that with constant step size, the proposed resource allocation policy is asymptotically near-optimal. An application involving multi-cell coordinated beamforming is detailed, demonstrating the usefulness of the proposed algorithm.

Index Terms:

Stochastic subgradient, resource allocation, asynchronous algorithm, incremental algorithm.

I Introduction

Recent years have witnessed unprecedented growth in the complexity and bandwidth requirements of network services. The resulting stress on the network infrastructure has motivated network designers to move away from simpler or modular architectures and towards optimum ones. To make sure that resources such as bandwidth and energy are allocated efficiently, optimum designs advocate cooperation between the network nodes [1, 2]. This paper considers the problem of cooperative network resource allocation that arises in wireless communication networks [3], smart grid systems [4], and in the context of scheduling [5]. Of particular interest is the stochastic resource allocation problem, where the goal is to find an allocation policy that is asymptotically optimal [6, 7]. Although such problems are infinite dimensional in nature, they can be solved in an online fashion via stochastic dual descent methods, allowing real-time resource allocation that is also asymptotically near-optimal [8, 9, 10, 11, 12].

Heterogeneous networks are common to a number of applications where the energy availability, computational capability, or the mode of operation of the nodes is not the same across the network. Key requirements for heterogeneous network protocols include scalability, robustness, and tolerance to delays and packet losses. Towards this end, a number of distributed algorithms have been proposed in the literature [13, 14, 15, 16, 17, 18, 19, 20]. By eliminating the need for a fusion center, the distributed algorithms operate with reduced communication overhead, and render the network resilient to single-point failures.

Most distributed algorithms still place stringent communication and computational requirements on the network nodes. For instance, the dual stochastic gradient methods entail multiple updates and message exchanges per time slot, and cannot handle missed or delayed updates. In heterogeneous networks, such delays are often unavoidable, arising due to poor channel conditions, traffic congestion, or limited processing power at certain nodes. This paper proposes a distributed asynchronous stochastic resource allocation algorithm that tolerates such delays. The next subsection outlines the main contributions of this paper.

I-A Contributions and organization

The stochastic resource allocation problem is formulated as a constrained optimization problem where the goal is to maximize a network-wide utility function. The allocated resources at the different nodes in the network are coupled through constraint functions that involve expectations with respect to a random network state. Specifically, the aim is to find an allocation policy that satisfies the constraints on average. The distribution of the state variables is not known, so the optimization problem does not admit an offline solution. Instead, the idea is to observe the instances of the state variables over time, and allocate resources in an online manner. It is well-known that stochastic dual descent algorithms yield viable online algorithms for such problems [9].

Within the heterogeneous network setting considered here, the focus is on distributed algorithms that can tolerate communication and processing delays [21, 15, 22]. Different from the state-of-the-art algorithms that utilize the standard stochastic gradient methods [8, 22, 23], we develop two variants of the asynchronous dual descent algorithm that allow some of the nodes in the network to temporarily “fall behind” in the event of low energy availability, unusually large processing delay, node shutdown, or channel impairments. The first asynchronous variant utilizes a fusion center to collect the possibly delayed gradients from various nodes and carry out the updates (cf. Sec. III-A). The second variant eliminates the need for the fusion center, and instead utilizes the fully distributed and incremental stochastic gradient descent algorithm, where the nodes carry out updates in a round-robin fashion and pass messages along a cycle (cf. Sec. III-B). As in the first variant, the use of stale gradients for primal and dual updates allows the algorithm to be run on two different clocks, one corresponding to the local resource allocation and tuned to the changing random network state, and the other dictated by the message passing protocol. The key feature of the proposed algorithm is the possibility for the second clock to slow down temporarily and wait for slower nodes to catch up. The proposed algorithm thus allows timely resource allocation, while tolerating occasional delays in message passing.
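The idea of a fusion center that updates with possibly stale gradients can be illustrated on a toy problem. The following is a minimal sketch, not the algorithm analyzed in the paper: it assumes scalar quadratic per-node functions $\frac{1}{2}(\lambda-c_{i})^{2}$ in place of the dual functions, delays drawn uniformly from a bounded range, and hypothetical names (`delayed_dual_descent`, `max_delay`, `noise_std`) chosen only for illustration.

```python
import random

def delayed_dual_descent(c, T=2000, eps=0.02, max_delay=3, noise_std=0.0, seed=0):
    """Projected descent on sum_i 0.5 * (lam - c[i])**2 over lam >= 0, where node i's
    gradient is evaluated at a stale iterate that is at most max_delay steps old."""
    rng = random.Random(seed)
    history = [0.0]                      # history[t] holds the iterate at step t
    for t in range(T):
        g_sum = 0.0
        for c_i in c:
            delay = rng.randint(0, max_delay)      # hypothetical bounded random delay
            stale = history[max(t - delay, 0)]     # iterate last seen by this node
            g_sum += (stale - c_i) + rng.gauss(0.0, noise_std)  # noisy stale gradient
        # The fusion center sums the collected gradients and projects onto lam >= 0.
        history.append(max(history[-1] - eps * g_sum, 0.0))
    return history

trajectory = delayed_dual_descent([1.0, 2.0, 3.0, 4.0])
print(trajectory[-1])
```

With `noise_std=0.0` and the small step size used here, the last iterate should settle close to the minimizer of the toy objective, which is the average of `c`. Raising `max_delay` or `eps` in this sketch shows how staleness and step size interact.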

The asymptotic performance of the proposed algorithm is studied under certain regularity conditions on the problem structure and bounded delays. In particular, the asymptotic performance of the asynchronous incremental stochastic subgradient descent (AIS-SD) algorithm is characterized under both diminishing and constant step-sizes. The overall structure of the proof is based on the convergence results in the incremental stochastic subgradient descent algorithm of [13] and the asynchronous incremental subgradient method of [14]. Specific to the resource allocation problem at hand, the asymptotic near-optimality and almost sure feasibility of the primal allocation policy are established for the case of constant step sizes. When applied to resource allocation problems, the proposed algorithm is called asynchronous incremental stochastic dual descent (AIS-DD). It is remarked that since the proposed algorithms utilize stochastic subgradient descent, their computational complexity is also comparable to other distributed stochastic algorithms [13, 22, 24, 15, 16, 17, 18, 19, 20, 25]. The calculation of the subgradient is the most computationally expensive step, and like other first-order algorithms, must be carried out at every time slot.

Finally, the stochastic coordinated multi-cell beamforming problem is formulated and solved via the proposed algorithm. Detailed simulations are carried out to demonstrate the usefulness of the proposed algorithm in delay-prone and distributed environments. Summarizing, the main contributions of the paper include (a) the AIS-SD algorithm and its convergence; (b) primal near-optimality and feasibility results for the allocated resources using AIS-DD; and (c) demonstration of the proposed algorithm on a practical stochastic coordinated multi-cell beamforming problem.

The rest of the paper is organized as follows. Sec. I-B provides an outline of the related literature. Sec. II describes the problem formulation and recapitulates the known results. Sec. III details the proposed algorithm. Sec. IV lists the required assumptions, and provides the primal and dual convergence results. Sec. V formulates the stochastic version of the coordinated beamforming problem along with the relevant simulation results. Finally, Sec. VI concludes the paper.

I-B Related work

Resource allocation problems have been well-studied in the context of cross-layer optimization in networks [26]. Popular tools for solving stochastic resource allocation problems include the backpressure algorithm [3] and variants of the stochastic dual descent method [23, 9]. However, most of these works only consider synchronous algorithms, and the effect of communication delays has not been examined in detail. An exception is the asynchronous subgradient method proposed in [21], where delayed subgradients were utilized for resource allocation. The present work extends the algorithm in [21] by allowing delayed stochastic subgradients. Additionally, the proposed algorithm is also incremental, and is therefore applicable to a wider variety of problems.

Depending on the mode of communication among the nodes, distributed algorithms can be broadly classified into three categories, namely, diffusion, consensus, and incremental [27]. Of these, the incremental update rule generally incurs the least amount of message passing overhead [28], and is of interest in the present context. The incremental subgradient descent and its variants have been widely applied to large-scale problems, and generally exhibit faster convergence than the traditional steepest descent algorithm and its variants [29].

The stochastic gradient and subgradient algorithms are well-known within the machine learning and signal processing communities [30, 31, 22]. The incremental stochastic subgradient method, with cyclic, random, and Markov incremental variants, was first proposed in [13]. The asymptotic analysis of the dual problem in the present work follows the same general outline as that of the cyclic incremental algorithm in [13], with additional modifications introduced to handle asynchrony. It is emphasized that these modifications are not straightforward, since the delayed stochastic subgradient is not generally a descent direction on average. The present work also allows delays in both the primal and dual update steps, and establishes asymptotic near-optimality and feasibility of the primal allocation policies. Finally, saddle point algorithms have recently been applied to unconstrained [32] or proximity-constrained [33] network optimization problems, but do not readily generalize to the general constrained optimization problem considered here.

Asynchronous algorithms have also been considered within the Markov decision process framework [34], though the setup there is quite different and does not apply to the problem at hand. On the other hand, asynchronous first order methods have attracted significant interest from the machine learning community [22, 24]. For problems where the exact subgradient is available at each node, the asynchronous alternating direction method of multipliers (ADMM) has been well-studied [15, 16, 17]. The present work considers stochastic algorithms, and thus differs considerably in terms of both analysis and the final results. Even among algorithms utilizing stochastic subgradients, the definition of asynchrony varies across different works. One way to model asynchrony is to allow each node to carry out its update according to a local Poisson clock. This approach is followed in [18, 19, 20], all of which consider various consensus-based distributed subgradient algorithms. The asynchronous adaptive algorithms in [35] also subscribe to the same philosophy, with decoupled node-updates due to communication errors, changing topology, and node failures. The incremental algorithm considered here is very different in terms of operation and analysis.

On the other hand, asynchronous operation can be modeled via delayed gradients or subgradients utilized for the updates. A consensus-based stochastic algorithm that utilizes randomly delayed stochastic gradients is proposed in [25]. Along similar lines, asynchronous saddle point algorithms for network problems with edge-based constraints have recently been proposed [36, 37]. Finally, for the unconstrained variants of the problem, a non-parametric approach has been proposed in [38]. Different from these works, the network resource allocation framework considered here allows generic convex constraints. Further, the incremental algorithm developed here handles stale subgradients while incurring significantly lower communication overheads. Asynchronous variants of the classical or averaged stochastic gradient methods have been proposed in [39, 22, 40, 41]. The generic problem of interest there is the minimization of a sum of private functions at various nodes. Further, a network with star topology is considered, with updates being carried out using delayed gradients collected at the fusion center. Different from these works, the proposed algorithm is incremental, does not require a fusion center, and is therefore more relevant to the network resource allocation problem at hand. Unlike these works, the present work also avoids making any assumptions on the compactness of the domain of the dual optimization problem. Before concluding, it is remarked that this work develops convergence results that hold on average. Stronger results, where convergence is established in an almost sure sense, require a more involved analysis, and are not pursued here.

The notation used in this paper is as follows. Scalars are represented by small letters, vectors by small boldface letters, and constants by capital letters. The index $t$ is used for the time or iteration index. The inner product between vectors $\boldsymbol{a}$ and $\boldsymbol{b}$ is denoted by $\langle\boldsymbol{a},\boldsymbol{b}\rangle$. For a vector ${\mathbf{x}}$, projection onto the non-negative orthant is denoted by $[{\mathbf{x}}]^{+}$. The expectation operation is denoted by ${\mathbb{E}}$. The notation $\nabla$ is used for the gradient and $\partial$ for the subgradient. The Euclidean norm is denoted by $\left\|\cdot\right\|$.

II Problem formulation

II-A Problem statement

This section details the stochastic resource allocation problem at hand for a network with $K$ nodes. The stochastic component of the problem is captured through the random network state, comprising the random vectors ${\mathbf{h}}^{i}\in{\mathbb{R}}^{q}$ for each node $i\in\{1,\ldots,K\}$, with unknown distributions. The overall problem is formulated as follows.

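$$\mathsf{P}:=\max\ \sum_{i=1}^{K}f^{i}({\mathbf{x}}^{i})\tag{1a}$$

$$\text{s.t.}\quad\sum_{i=1}^{K}\Big({\mathbf{u}}^{i}({\mathbf{x}}^{i})+{\mathbb{E}}\big[{\mathbf{v}}^{i}({\mathbf{h}}^{i},{\mathbf{p}}^{i}_{{\mathbf{h}}^{i}})\big]\Big)\succeq 0\tag{1b}$$

$${\mathbf{x}}^{i}\in{\mathcal{X}}^{i},\ {\mathbf{p}}^{i}\in{\mathcal{P}}^{i}\tag{1c}$$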

where the expectation in (1b) is with respect to the random vector ${\mathbf{h}}^{i}$ and $\mathsf{P}$ is finite. The optimization variables in (1) include the resource allocation variables $\{{\mathbf{x}}^{i}\in\mathbb{R}^{n}\}_{i=1}^{K}$ and the policy functions $\{{\mathbf{p}}^{i}:{\mathbb{R}}^{q}\rightarrow{\mathbb{R}}^{p}\}_{i=1}^{K}$, under the constraints (1b)-(1c). Note that the constraints in (1b) are required to be satisfied on average, whereas those in (1c) must be satisfied instantaneously. The functions $f^{i}:{\mathbb{R}}^{n}\rightarrow{\mathbb{R}}$ are assumed to be concave, and the sets ${\mathcal{X}}^{i}\subseteq{\mathbb{R}}^{n}$ convex and compact. The constraint function at node $i$ is vector-valued, and is given by ${\mathbf{u}}^{i}({\mathbf{x}}^{i}):=[u^{i}_{1}({\mathbf{x}}^{i})\cdots u^{i}_{d}({\mathbf{x}}^{i})]^{T}$, where $\{u_{k}^{i}:{\mathbb{R}}^{n}\rightarrow{\mathbb{R}}\}_{k=1}^{d}$ are concave functions. On the other hand, no such restriction is imposed upon the vector-valued function ${\mathbf{v}}^{i}:{\mathbb{R}}^{p}\times{\mathbb{R}}^{q}\rightarrow{\mathbb{R}}^{d}$ and the compact sets of functions $\{\mathcal{P}^{i}\}_{i=1}^{K}$. Of course, the overall problem still needs to adhere to certain regularity conditions (see Sec. IV), such as Slater’s constraint qualification and Lipschitz continuity of the gradient function; see (A1)-(A7).

Since the distribution of ${\mathbf{h}}^{i}$ is also not known in advance, it is generally not possible to solve for $\mathsf{P}$ in an offline manner. Therefore, an online algorithm is sought that solves the problem ‘on the fly’ as the independent identically distributed (i.i.d.) random variables $\{{\mathbf{h}}_{t}^{i}\}_{t\in{\mathbb{N}}}$ are realized and observed. For brevity, we denote ${\mathbf{p}}_{t}^{i}:={\mathbf{p}}_{{\mathbf{h}}^{i}_{t}}^{i}$ and ${\mathbf{g}}_{t}^{i}({\mathbf{p}}_{t}^{i},{\mathbf{x}}^{i}):={\mathbf{u}}^{i}({\mathbf{x}}^{i})+{\mathbf{v}}^{i}({\mathbf{h}}_{t}^{i},{\mathbf{p}}^{i}_{{\mathbf{h}}^{i}_{t}})$. Therefore, it is possible to write (1b) equivalently as ${\mathbb{E}}\left[{\sum\limits_{i=1}^{K}}{\mathbf{g}}_{t}^{i}({\mathbf{p}}_{t}^{i},{\mathbf{x}}^{i})\right]\succeq 0$. The algorithm outputs a sequence of vector pairs $\{{\mathbf{x}}_{t}^{i},{{\mathbf{p}}}_{t}^{i}\}_{t}$ that are used for allocating resources in a timely manner. Towards this end, the stochastic dual descent algorithm has been proposed in [9], which is provably convergent and yields allocations that are almost surely near-optimal.

In the present paper, the focus is on networked systems where both the allocations $({\mathbf{x}}^{i},{\mathbf{p}}^{i})$ and the functions $f^{i}$ and ${\mathbf{g}}^{i}_{t}$ are private to each node $i$. Likewise, the random variable ${\mathbf{h}}_{t}^{i}$ is also observed and estimated locally at each node $i$. In other words, while the nodes can exchange dual variables and numerical values of the gradients, they may not be willing to reveal the full functional form of the objective or constraint functions and other locally estimated quantities, owing to privacy and security concerns. Such privacy-preserving cooperation is common for many secure multi-agent systems [17, 42, 43]. To this end, the nodes may be arranged in a star topology, and utilize a centralized controller for collecting and distributing various algorithm iterates. Alternatively, a ring topology may be used, allowing a fully distributed implementation, where the exchanges occur only between immediate neighbors. In order to clarify the problem formulation considered in (1), the following simple example is considered.

Example 1. Consider the problem of network utility maximization over a wireless network consisting of $K$ nodes. The aim is to maximize the network-wide utility given by

$$\sum_{i=1}^{K}U(r^{i})\tag{2}$$

where $U(\cdot)$ is a concave function that quantifies the utility obtained by the node $i$ upon achieving a rate $r^{i}\in[r_{\min},r_{\max}]$ . The channel is assumed to be time-varying, and for each channel realization $h^{i}$ , node $i$ allocates the power $p^{i}_{h^{i}}$ , achieving the instantaneous rate of $\log(1+h^{i}p^{i}_{h^{i}})$ , where the noise power is assumed to be one. The goal is to maximize the utility in (2) subject to constraints on the average rate and the average power consumption, and the full problem can be written as (cf. (1)):

[TABLE]

It is remarked that $\mathcal{P}^{i}$ is a set of functions $p^{i}:{\mathbb{R}}\rightarrow{\mathbb{R}}$ , while $p^{i}_{h^{i}}$ is a random variable that depends on $h^{i}$ . That is, the optimization variables in (3) include the rates $r^{i}$ and the power allocation functions $p^{i}$ .
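To make the distinction between the policy $p^{i}$ and the random variable $p^{i}_{h^{i}}$ concrete, the following minimal sketch evaluates a power allocation policy on sampled channel gains and estimates the average rate and average power by sample means. The threshold policy, its parameters, and the function names are hypothetical choices made for illustration only; the rate expression $\log(1+h^{i}p^{i}_{h^{i}})$ is the one stated in the text of Example 1.

```python
import math

def threshold_policy(h, h_min=1.0, p_on=1.0):
    """Hypothetical policy p(h): transmit at power p_on only when the gain h is at least h_min."""
    return p_on if h >= h_min else 0.0

def average_rate_and_power(h_samples, policy):
    """Sample-mean estimates of E[log(1 + h * p_h)] and E[p_h] for a single node."""
    powers = [policy(h) for h in h_samples]      # p_h for each channel realization h
    rates = [math.log(1.0 + h * p) for h, p in zip(h_samples, powers)]
    n = len(h_samples)
    return sum(rates) / n, sum(powers) / n

# Three channel realizations; the policy stays silent on the weakest one.
avg_rate, avg_power = average_rate_and_power([0.5, 1.0, 3.0], threshold_policy)
print(avg_rate, avg_power)
```

The policy is a function of the channel gain, while each entry of `powers` is one realization of the random power it induces; the average constraints in the example act on the expectations that these sample means approximate.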

II-B Existing approaches and challenges

We begin by explicating the desirable features of an algorithm that seeks to solve (1). Specifically, it is required that any such algorithm meets the following requirements.

F1. The algorithm should allow nodes to “fall behind” temporarily, e.g., under poor channel conditions and intermittent transmission failures.

F2. The algorithm should allow a distributed implementation, that is, without requiring a star-topology or a fusion center (FC).

These features are particularly important for large and heterogeneous networks where delays may be unavoidable and designating an FC may be impractical. Put differently, (F1) requires the algorithm to handle the inevitable delays that may occur due to temporarily poor channel conditions or noise. Complementarily, (F2) is an architectural requirement that must be kept in mind when choosing or designing the algorithm.

Since the number of constraints in (1b) is finite, the problem is more tractable in the dual domain. To this end, introducing a dual variable ${\bm{\lambda}}\in\mathbb{R}^{d}_{+}$ corresponding to the constraint in (1b), the stochastic (sub-)gradient descent method was proposed for solving such problems in [9]. The Lagrangian of (1) is given by

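$$L({\bm{\lambda}},{\mathbf{X}},{\mathbf{P}})=\sum_{i=1}^{K}\Big(f^{i}({\mathbf{x}}^{i})+\big\langle{\bm{\lambda}},{\mathbf{u}}^{i}({\mathbf{x}}^{i})+{\mathbb{E}}\big[{\mathbf{v}}^{i}({\mathbf{h}}^{i},{\mathbf{p}}^{i}_{{\mathbf{h}}^{i}})\big]\big\rangle\Big)\tag{4}$$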

where ${\mathbf{X}}$ and ${\mathbf{P}}$ collect the primal optimization variables $\{{\mathbf{x}}^{i}\}_{i=1}^{K}$ and $\{{\mathbf{p}}^{i}\}_{i=1}^{K}$ respectively. Next, the dual function is obtained by maximizing $L$ with respect to ${\mathbf{X}}$ and ${\mathbf{P}}$ . Since the Lagrangian is expressed as a sum of $K$ terms, each depending on a different set of variables, the maximization operation is separable and the dual function takes the following form:

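$$D({\bm{\lambda}})=\max_{{\mathbf{x}}^{i}\in{\mathcal{X}}^{i},\,{\mathbf{p}}^{i}\in{\mathcal{P}}^{i}}L({\bm{\lambda}},{\mathbf{X}},{\mathbf{P}})=:\sum_{i=1}^{K}D^{i}({\bm{\lambda}}).\tag{5}$$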

The dual problem is given by

$$\mathsf{D}=\min_{{\bm{\lambda}}\in{\mathbb{R}}^{d}_{+}}\sum_{i=1}^{K}D^{i}({\bm{\lambda}}).\tag{6}$$

While for general problems it only holds that $\mathsf{D}\geq\mathsf{P}$, the stochastic resource allocation problem considered here has a zero duality gap, i.e., $\mathsf{P}=\mathsf{D}$ [8, Thm. 1]. The result utilizes Lyapunov’s convexity theorem and holds under strict feasibility (Slater’s condition), bounded subgradients, and a continuous cumulative distribution function of ${\mathbf{h}}^{i}$ for each $i$. It is remarked that similar results are well-known in economics [44], wireless communications [12, 8], and control theory [45].
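The general inequality $\mathsf{D}\geq\mathsf{P}$ follows from a short argument, sketched below. The sketch assumes that the Lagrangian of (1) has the standard form in which ${\bm{\lambda}}$ multiplies the left-hand side of (1b); this expansion is written out here as an assumption rather than quoted from the paper.

```latex
% Weak duality for (1): take any \lambda \succeq 0 and any (X, P) feasible for (1b)-(1c).
\begin{align*}
\sum_{i=1}^{K} f^{i}(\mathbf{x}^{i})
  &\leq \sum_{i=1}^{K} f^{i}(\mathbf{x}^{i})
     + \Big\langle \boldsymbol{\lambda},\ \sum_{i=1}^{K}\big(\mathbf{u}^{i}(\mathbf{x}^{i})
     + \mathbb{E}[\mathbf{v}^{i}(\mathbf{h}^{i},\mathbf{p}^{i}_{\mathbf{h}^{i}})]\big)\Big\rangle
     && \text{(both arguments of the inner product are } \succeq 0\text{)} \\
  &= L(\boldsymbol{\lambda},\mathbf{X},\mathbf{P})
   \leq \max_{\mathbf{X},\mathbf{P}} L(\boldsymbol{\lambda},\mathbf{X},\mathbf{P})
   = D(\boldsymbol{\lambda}).
\end{align*}
% Maximizing the left-hand side over feasible (X, P) and minimizing the right-hand side
% over \lambda \succeq 0 gives \mathsf{P} \leq \mathsf{D}.
```

The zero duality gap result cited above is the stronger statement that this inequality holds with equality for the problem at hand.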

The result on zero duality gap legitimizes the dual descent approach, since the dual problem is always convex, and the resultant dual solution can be used for primal recovery. To this end, similar problems in various contexts have been solved via the classical dual descent algorithm [9, 46, 21, 8, 47], wherein the primal updates utilize various sampling techniques. It is remarked however that from a practical perspective, solving the dual problem alone is not sufficient, since online allocation of power or rate variables necessitates determining the primal optimum variables $\{{{\mathbf{x}}^{i}}^{\star},{{\mathbf{p}}^{i}}^{\star}\}$ . In the present case, since ${\mathbf{p}}^{i}$ is infinite dimensional, primal recovery and consequently, online resource allocation, is not straightforward.

Since the distribution of ${\mathbf{h}}^{i}$ is not known in advance, solving (6) via classical first or second order descent methods requires a costly Monte Carlo sampling step [23]. Instead, the use of stochastic subgradient descent has been proposed in [9, 48], which takes the following form for $t\geq 1$ ,

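D1. Primal updates: At time $t$, node $i$ observes or estimates ${\mathbf{h}}_{t}^{i}$, and allocates the resources in accordance with:

$$\{{\mathbf{x}}_{t}^{i}({\bm{\lambda}}_{t}),{\mathbf{p}}_{t}^{i}({\bm{\lambda}}_{t})\}:=\operatorname*{arg\,max}_{{\mathbf{x}}\in{\mathcal{X}}^{i},\,\mathring{{\mathbf{p}}}\in\Pi_{t}^{i}}f^{i}({\mathbf{x}})+\langle{\bm{\lambda}}_{t},{\mathbf{g}}_{t}^{i}(\mathring{{\mathbf{p}}},{\mathbf{x}})\rangle.\tag{7}$$

D2. Dual update: The dual updates at time $t$ take the form:

$${\bm{\lambda}}_{t+1}=\Big[{\bm{\lambda}}_{t}-\epsilon\sum_{i=1}^{K}{\mathbf{g}}_{t}^{i}({\mathbf{p}}_{t}^{i}({\bm{\lambda}}_{t}),{\mathbf{x}}^{i}_{t}({\bm{\lambda}}_{t}))\Big]^{+}.\tag{8}$$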

Here, $\Pi_{t}^{i}:=\{{\mathbf{p}}^{i}_{{\mathbf{h}}^{i}_{t}}\in{\mathbb{R}}^{p}|{\mathbf{p}}^{i}\in{\mathcal{P}}^{i}\}$ is the set of all legitimate values of the vector ${\mathbf{p}}^{i}_{{\mathbf{h}}^{i}_{t}}$ . The term ${\mathbf{g}}_{t}^{i}({\mathbf{p}}_{t}^{i}({\boldsymbol{{\bm{\lambda}}}}_{t}),{\mathbf{x}}^{i}_{t}({\boldsymbol{{\bm{\lambda}}}}_{t}))$ is a stochastic gradient of the dual function $D^{i}({\bm{\lambda}})$ at ${\boldsymbol{{\bm{\lambda}}}}={\boldsymbol{{\bm{\lambda}}}}_{t}$ . Further for notational brevity, ${\mathbf{g}}_{t}^{i}({\boldsymbol{{\bm{\lambda}}}}):={\mathbf{g}}_{t}^{i}({\mathbf{p}}_{t}^{i}({\boldsymbol{{\bm{\lambda}}}}),{\mathbf{x}}^{i}_{t}({\boldsymbol{{\bm{\lambda}}}}))$ is used throughout the paper. Recall that for a given ${\boldsymbol{{\bm{\lambda}}}}$ , ${\mathbf{g}}_{t}^{i}({\boldsymbol{{\bm{\lambda}}}})$ is stochastic and depends on the random variable ${\mathbf{h}}^{i}_{t}$ , as discussed in Sec. II-A. The algorithm is initialized with an arbitrary ${\boldsymbol{{\bm{\lambda}}}}_{1}$ and the resulting allocations are asymptotically near optimal and feasible. A constant step-size stochastic gradient descent algorithm is utilized in the dual domain, which not only allows recovery of optimal primal variables via averaging, but also bestows it the ability to handle small changes in the network topology or other problem parameters. The algorithm can be implemented in a distributed fashion in a network with star-topology, with the help of a fusion center (FC). Within the FC-based implementation, the primal iterates are calculated and used locally at each node $i$ . At the end of each time slot, the node $i$ communicates the gradient component ${{\mathbf{g}}^{i}_{t}({\bm{\lambda}}_{t})}$ to the FC, which carries out the dual update (8) and broadcasts ${\boldsymbol{{\bm{\lambda}}}}_{t+1}$ to all the nodes in the network. 
Summarizing, the stochastic algorithm is preferred over its deterministic counterpart since it does not require Monte Carlo iterations, yields asymptotically near-optimal resource allocations, and is provably convergent if the stochastic process $\{{\mathbf{h}}^{i}_{t}\}$ is stationary.
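The synchronous, FC-based procedure described above can be sketched on a toy instance. The code below is a minimal illustration under stated assumptions, not the paper's implementation: each node has a hypothetical utility $f(x)=\log(1+x)$, the two average constraints are $\sum_{i}{\mathbb{E}}[\log(1+h^{i}p^{i})-x^{i}]\geq 0$ and $\sum_{i}{\mathbb{E}}[\texttt{budget}-p^{i}]\geq 0$, the channel gains are exponentially distributed, and all names and constants (`X_MAX`, `P_MAX`, `budget`) are invented for the example. For this toy problem the per-node primal maximization has a closed form, which the sketch uses.

```python
import numpy as np

X_MAX, P_MAX = 5.0, 10.0  # hypothetical box constraints on rate and power

def primal_update(lam, h):
    """Closed-form argmax of f(x) + <lam, g(p, x)> for the toy node problem,
    with f(x) = log(1 + x) and g(p, x) = [log(1 + h p) - x, budget - p]."""
    lam_rate, lam_pow = lam
    # x maximizes log(1 + x) - lam_rate * x over [0, X_MAX]
    x = X_MAX if lam_rate <= 0 else float(np.clip(1.0 / lam_rate - 1.0, 0.0, X_MAX))
    # p maximizes lam_rate * log(1 + h p) - lam_pow * p over [0, P_MAX]
    if lam_pow <= 0:
        p = P_MAX if lam_rate > 0 else 0.0
    else:
        p = float(np.clip(lam_rate / lam_pow - 1.0 / h, 0.0, P_MAX))
    return x, p

def stochastic_dual_descent(K=4, T=2000, eps=0.01, budget=0.5, seed=0):
    """Synchronous primal and dual updates: every node uses the current dual
    iterate, and a fusion center sums the K stochastic gradients."""
    rng = np.random.default_rng(seed)
    lam = np.ones(2)                                # arbitrary initial dual iterate
    xs, ps = np.zeros((T, K)), np.zeros((T, K))
    lams = np.zeros((T, 2))
    for t in range(T):
        h = 0.1 + rng.exponential(1.0, size=K)      # i.i.d. gains, kept away from zero
        g_sum = np.zeros(2)
        for i in range(K):
            x, p = primal_update(lam, h[i])         # primal update at node i
            xs[t, i], ps[t, i] = x, p
            g_sum += np.array([np.log1p(h[i] * p) - x, budget - p])
        lam = np.maximum(lam - eps * g_sum, 0.0)    # projected dual update at the FC
        lams[t] = lam
    return xs, ps, lams

xs, ps, lams = stochastic_dual_descent()
print("final dual iterate:", lams[-1])
print("running average of x per node:", xs.mean(axis=0))
print("average power per node:", ps.mean(axis=0))
```

The running average `xs.mean(axis=0)` corresponds to the averaged primal iterate, while the power allocations `ps` are used directly at each time slot rather than averaged.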

It is remarked that since (1) is infinite dimensional, full primal recovery is generally not possible using such dual methods. Existing algorithms only allow partial primal recovery, as will also be possible via Theorem 2. Specifically, it is well-known that while the running average $\frac{1}{T}\sum_{t=1}^{T}{\mathbf{x}}^{i}_{t}$ can be viewed as the approximate version of the primal optimum ${{\mathbf{x}}^{i}}^{\star}$ , no such interpretation exists for the infinite-dimensional variable ${\mathbf{p}}^{i}$ . For instance, the running average of ${\mathbf{p}}^{i}_{t}$ cannot be meaningfully related to the corresponding optimum ${{\mathbf{p}}^{i}}^{\star}$ [9, 48]. Nevertheless, the resource allocation carried out using the primal iterates $\{{\mathbf{x}}^{i}_{t}({\boldsymbol{{\bm{\lambda}}}}_{t}),{\mathbf{p}}^{i}_{t}({\boldsymbol{{\bm{\lambda}}}}_{t})\}$ still ensures near-optimality and asymptotic feasibility (cf. Theorem 2).

In view of the desiderata (F1)-(F2), observe that a network implementation of (7)-(8) is still impractical since it is synchronous and FC-based, and thus has relatively stringent communication requirements. In particular, the algorithm necessitates that each node exchanges messages (i.e. ${\mathbf{g}}^{i}_{t}({\bm{\lambda}}_{t})$ & ${\boldsymbol{{\bm{\lambda}}}}_{t}$ ) with the FC at every time-slot, thereby incurring a large communications cost. Since the updates (7)-(8) must occur before the network state changes, the nodes must synchronize and cooperate in order to meet these deadline constraints, ultimately increasing message passing overhead and consuming more energy. Further, nodes in large networks are often heterogeneous, and may not always be able to transmit the gradients within the stipulated time. Finally, if the nodes are not deployed in a star-topology around the FC, the need for multi-hop communications further increases the delays, results in heterogeneous energy consumption, and increases protocol overhead. In all such cases, the FC must wait for the updates to arrive from all the nodes, possibly requiring all the nodes to skip resource allocation for one or more time slots, and resulting in a suboptimal asymptotic objective value.

III Proposed Algorithm

This section details the proposed stochastic dual descent algorithm that incorporates the features (F1)-(F2) in its design. To begin with, Sec. III-A describes the asynchronous variant that tolerates delayed gradients while still yielding near-optimal resource allocation. Next, Sec. III-B details the more general AIS-DD algorithm that is amenable to a distributed implementation.

III-A Asynchronous stochastic dual descent

The asynchronous stochastic dual descent algorithm addresses (F1), and proceeds as follows for all $t\geq 1$ :

Primal update: At each time $t$ , node $i$ solves

[TABLE]

for all $1\leq i\leq K$ , and some finite delay $\pi_{i}(t)\geq 0$ .

Dual update: The dual update at time $t$ is given by

[TABLE]

where the stale gradient, evaluated at time $t-\delta_{i}(t)$ , is given by

[TABLE]

where the total delay is denoted by $\tau_{i}(t):=\pi_{i}(t)+\delta_{i}(t)$ and $\pi_{i}(t),~{}\delta_{i}(t)\geq 0$ .

Different from (7), the resource allocation in (9) utilizes an old dual variable, ${\boldsymbol{{\bm{\lambda}}}}_{t-\pi_{i}(t)}$ . Further, the dual update is also carried out using an old gradient ${\mathbf{g}}_{t-\delta_{i}(t)}^{i}({\boldsymbol{{\bm{\lambda}}}}_{t-\tau_{i}(t)})$ . The two modes of asynchrony introduced in (9)-(10) allow the primal and dual updates to be carried out at different time scales. In other words, while the resource allocation at each node still occurs at every time slot, the rate at which the dual variables and the gradients are exchanged may be different. In order to highlight the asynchronous nature of the algorithm, the implementation of (9)-(10) is now described from the perspective of the FC and that of node $i$ , in Algorithms 1 and 2, respectively.

Observe that in Algorithms 1 and 2, some steps are ‘optional,’ which in the present case, means that they can, at times, be skipped. These steps are however still required to be carried out ‘often enough’, so that the total delay $\tau_{i}(t)$ is bounded for each node $i$ ; cf. (A4) in Sec. IV-A. Nevertheless, the optional steps in these algorithms allow the dual updates to occur at a different rate. For instance, as long as each packet is correctly time-stamped, the dual updates at the FC may occur as and when the gradients become available, instead of following a fixed schedule.

The ability to postpone or skip transmissions is important in the context of large heterogeneous networks. For instance, transmissions from the nodes to the FC often require a multiple access protocol, inter-node coordination, and energy budgeting at each node. Consequently, energy-constrained nodes may extend their lifetime simply by scheduling their transmissions once every few time slots. Similarly, energy harvesting nodes may only transmit when sufficient energy is available, choosing to stay silent in times of energy paucity. The slower nodes may even skip the gradient calculation, as long as the resources are allocated in time. Finally, the communication between the nodes and the FC may also incur delays, arising from queueing, processing, or retransmission at various layers in the protocol stack. The flexibility of carrying out updates with stale information makes the network tolerant to such delays.
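The FC-based asynchronous updates of this subsection can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes a scalar dual variable, a hypothetical per-node utility $-x^{2}/2$ with constraint function $x-h$ (so that the primal step has the closed form $x=\lambda$), and uniformly random delays. The names `async_dual_descent`, `demand_means`, `max_pi`, and `max_delta` are illustrative only. For this toy problem the dual optimum is the average of the demand means.

```python
import random

def async_dual_descent(demand_means, eps=0.05, T=2000,
                       max_pi=2, max_delta=2, seed=0):
    """Toy asynchronous stochastic dual descent with a fusion center (FC).

    Assumed toy model (not from the paper): node i maximizes
    -x^2/2 + lam * (x - h), so its allocation is x = lam, and the
    stochastic gradient of the dual function is x - h.
    """
    rng = random.Random(seed)
    K = len(demand_means)
    lam = [1.0]    # lam[t]: dual variable held by the FC at slot t
    grads = []     # grads[t][i]: gradient formed by node i at slot t
    for t in range(T):
        row = []
        for i in range(K):
            h = rng.gauss(demand_means[i], 1.0)   # random state h_t^i
            pi = rng.randint(0, min(max_pi, t))   # age of the dual variable at node i
            x = lam[t - pi]                       # primal step with an old dual variable
            row.append(x - h)                     # stochastic gradient component
        grads.append(row)
        # FC dual update: each node contributes a possibly stale gradient.
        total = 0.0
        for i in range(K):
            delta = rng.randint(0, min(max_delta, t))
            total += grads[t - delta][i]
        lam.append(max(0.0, lam[t] - eps * total))  # projection onto lam >= 0
    return lam
```

In this sketch the total staleness of each gradient is at most `max_pi + max_delta` slots, which plays the role of the delay bound in (A4).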

III-B Asynchronous Incremental Stochastic Dual Descent

This subsection details an incremental version of the asynchronous algorithm introduced in Sec. III-A, that obviates the need for an FC and is thus endowed with both (F1) and (F2). The AIS-DD algorithm allows each node to perform the partial dual update itself, while passing messages to nodes along a cycle. Specifically, for a network with a ring topology, such that node $i$ passes dual variable ${\boldsymbol{{\bm{\lambda}}}}_{t}^{i}$ to node $i+1$ and so on, the primal and dual updates take the following form.

**Primal update:** At time $t$ , node $i$ solves

[TABLE]

**Dual update:** At time $t$ , the dual update at node $i$ takes the form

[TABLE]

where,

[TABLE]

and ${\boldsymbol{{\bm{\lambda}}}}_{t}^{0}$ is read as ${\boldsymbol{{\bm{\lambda}}}}_{t-1}^{K}$ and ${\boldsymbol{{\bm{\lambda}}}}_{t}={\boldsymbol{{\bm{\lambda}}}}_{t}^{0}$ will be used to evaluate the performance of the asynchronous incremental algorithms. A key feature of the AIS-DD algorithm is that the message passing and the dual updates occur in parallel with the resource allocation, as shown in Fig. 1. The full implementation details are provided in Algorithm 3.

Here, the two optional steps may be repeated as long as the received ${\boldsymbol{{\bm{\lambda}}}}_{t^{\prime}}^{i-1}$ is still old, that is, $t^{\prime}\leq t$ . As in Sec. III-A, the nodes are allowed to halt the updates temporarily, as long as they “catch up,” eventually. In other words, the updates for time $t^{\prime}$ must be carried out before time $t^{\prime}+\tau$ so as to ensure that $\tau_{i}(t)\leq\tau$ for all $t$ . Interestingly, although resources are allocated at every time slot, the network may or may not carry out one or more message passing rounds per time-slot. It is remarked that the update in (13) must still be performed once at every node for each time index $t^{\prime}$ . Equivalently, the algorithm runs on two ‘clocks,’ one dictating the resource allocation and synchronous with the changes in the network state, and the other governed by the rate at which messages get passed around the network. In the next section, we will establish that such an algorithm still converges, as long as the difference between the two clocks is bounded. In summary, the AIS-DD algorithm has all the benefits of the asynchronous dual descent algorithm of (9)-(10), while allowing a distributed implementation.

As with classical incremental algorithms, the nodes must communicate along a ring topology. Strictly speaking, the message passing overhead is minimized if the updates occur along a Hamilton cycle [28]. Even when the network does not admit a Hamilton cycle, an approximate cycle can be found using a random walk protocol [49] or the protocol described in [28, Sec. VII]. It is remarked that such a route need only be found once, at the start of the algorithm.
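The incremental message passing along a ring can be sketched in the same spirit. The code below is a minimal illustration under stated assumptions, not the paper's Algorithm 3: it reuses a hypothetical scalar toy model in which the stochastic gradient at node $i$ is $\lambda-h^{i}$, draws the delays uniformly at random subject to $0\leq\delta_{i}(t)\leq\tau_{i}(t)\leq\tau$, and evaluates each stale gradient at an old copy of the variable received from the previous node. The names `ais_dd_ring` and `demand_means` are illustrative.

```python
import random

def ais_dd_ring(demand_means, eps=0.05, T=3000, tau=3, seed=0):
    """Toy asynchronous incremental dual descent along the ring 1, 2, ..., K.

    lam[t][i] stores lambda_t^i for i = 0..K, with lambda_t^0 = lambda_{t-1}^K.
    Assumed toy gradient (not from the paper): g_t^i(lam) = lam - h_t^i.
    """
    rng = random.Random(seed)
    K = len(demand_means)
    lam, h = [], []
    for t in range(T):
        h.append([rng.gauss(m, 1.0) for m in demand_means])
        start = 1.0 if t == 0 else lam[t - 1][K]
        lam.append([start])
        for i in range(1, K + 1):
            tau_i = rng.randint(0, min(tau, t))    # total delay, bounded by tau
            delta_i = rng.randint(0, tau_i)        # age of the random state
            stale_lam = lam[t - tau_i][i - 1]      # old variable from node i-1
            g = stale_lam - h[t - delta_i][i - 1]  # stale stochastic gradient
            lam[t].append(max(0.0, lam[t][i - 1] - eps * g))
    return [row[0] for row in lam]                 # sequence lambda_t = lambda_t^0
```

Each node only needs the variable handed over by its predecessor, so no FC appears anywhere in the loop; with a constant step size the iterates settle in a neighborhood of the toy optimum rather than converging exactly.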

IV Convergence results

This section provides the convergence results for the AIS-SD and AIS-DD algorithms. We begin by developing and analyzing the convergence of the AIS-SD algorithm (cf. Theorem 1). It is emphasized that the AIS-SD algorithm is general-purpose, and can be used to minimize any sum of functions in an incremental and asynchronous manner. Subsequently, the asynchronous incremental stochastic gradient descent algorithm is applied to (1) in the dual domain, and a primal-averaging method is proposed that yields asymptotically near-optimal allocations (cf. Theorem 2). We begin by stating the assumptions and briefly reviewing some of the known results (Sec. IV-A). The results for the dual case are outlined in Sec. IV-B, while the near-optimality of the resource allocation is established in Sec. IV-C.

IV-A Assumptions and known results

This subsection begins with the discussion of the following general optimization problem:

[TABLE]

where, ${\bm{\lambda}}$ is the optimization variable, ${\boldsymbol{\Lambda}}\subseteq\mathbb{R}^{d}$ is a non-empty, closed, and convex set, the optimal value $\mathsf{D}$ is finite, and the objective function separates into node-specific cost functions $D^{i}$ . The goal is to solve (15) using only the stochastic subgradients ${\mathbf{g}}_{t}^{i}({\bm{\lambda}})$ of $D^{i}({\boldsymbol{{\bm{\lambda}}}})$ . It is emphasized that the general results presented in this subsection do not require $D^{i}$ to be differentiable. As in (1), ${\mathbf{g}}_{t}^{i}({\bm{\lambda}})$ is stochastic due to its dependence on the random variable ${\mathbf{h}}_{t}^{i}$ that is first observed at node $i$ at time $t$ . Besides the network resource allocation problem considered here, (15) also arises in the context of machine learning [50] and distributed parameter estimation [51]. Before describing the known results related to (15), the necessary assumptions are first stated.

A1.

**Non-expansive projection mapping.** The projection mapping $P_{{\boldsymbol{\Lambda}}}\left[\cdot\right]$ satisfies $\left\|P_{{\boldsymbol{\Lambda}}}\left[{\mathbf{x}}\right]-P_{{\boldsymbol{\Lambda}}}\left[{\mathbf{y}}\right]\right\|\leq\left\|{\mathbf{x}}-{\mathbf{y}}\right\|$ for all ${\mathbf{x}}$ and ${\mathbf{y}}$ .

A2.

**Zero-mean time-invariant error.** Given ${\boldsymbol{{\bm{\lambda}}}}$ , the averaged subgradient function satisfies $\partial D^{i}({\boldsymbol{{\bm{\lambda}}}})={\mathbb{E}}[{\mathbf{g}}_{t}^{i}({\boldsymbol{{\bm{\lambda}}}})]$ .

A3.

**Bounded moments.** Given ${\boldsymbol{{\bm{\lambda}}}}\in{\boldsymbol{\Lambda}}$ , the second moment of ${{\mathbf{g}}_{t}^{i}({\bm{\lambda}})}$ is bounded as follows:

[TABLE]
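As a quick sanity check of (A1), the following sketch tests the non-expansive property numerically for the projection onto the non-negative orthant, which is the set ${\boldsymbol{\Lambda}}$ in the dual problem. The helper names `project_nonneg` and `check_nonexpansive` are hypothetical, and a random test of this kind only illustrates the inequality; it does not prove it.

```python
import math
import random

def project_nonneg(x):
    """Euclidean projection of a vector onto the non-negative orthant."""
    return [max(0.0, v) for v in x]

def dist(x, y):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def check_nonexpansive(trials=1000, dim=5, seed=0):
    """Return True if ||P[x] - P[y]|| <= ||x - y|| on all random pairs."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(-5.0, 5.0) for _ in range(dim)]
        y = [rng.uniform(-5.0, 5.0) for _ in range(dim)]
        if dist(project_nonneg(x), project_nonneg(y)) > dist(x, y) + 1e-12:
            return False
    return True
```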

These assumptions are not very restrictive, and hold for most real-world resource allocation problems. A stochastic incremental algorithm for solving (15) was first proposed in [13]. Given a network with ring topology, the updates in [13] take the form

[TABLE]

where ${\boldsymbol{{\bm{\lambda}}}}_{t}^{0}$ is read as ${\boldsymbol{{\bm{\lambda}}}}_{t-1}^{K}$ . It was shown in [13], that under (A1)-(A3), the iterates ${\boldsymbol{{\bm{\lambda}}}}^{i}_{t}$ are asymptotically near optimal in the following sense

[TABLE]

where ${\boldsymbol{{\bm{\lambda}}}}_{t}={\boldsymbol{{\bm{\lambda}}}}_{t}^{0}={\boldsymbol{{\bm{\lambda}}}}_{t-1}^{K}$ . Further, for the case when the step size is diminishing, i.e. ${\epsilon}_{t}$ satisfies $\lim\limits_{T\rightarrow\infty}\sum\limits_{t=1}^{T}{\epsilon}_{t}=\infty$ and $\lim\limits_{T\rightarrow\infty}\sum\limits_{t=1}^{T}{\epsilon}_{t}^{2}<\infty$ , it holds that

[TABLE]

This paper provides the corresponding results for the asynchronous case, where the subgradient in (17) is replaced by an older copy ${\mathbf{g}}_{t-\delta_{i}(t)}^{i}({\bm{\lambda}}_{t-\tau_{i}(t)}^{i-1})$ , that is, the stochastic subgradient of $D^{i}({\bm{\lambda}})$ that depends on the random variable ${\mathbf{h}}_{t-\delta_{i}(t)}^{i}$ and is evaluated at ${\boldsymbol{{\bm{\lambda}}}}={\bm{\lambda}}_{t-\tau_{i}(t)}^{i-1}$ . The delays satisfy $\tau_{i}(t)\geq\delta_{i}(t)\geq 0$ and for the special case of no delay, the stochastic subgradient simplifies to ${\mathbf{g}}_{t}^{i}({\bm{\lambda}}_{t}^{i-1})$ as in (17). The following additional assumption regarding the delays $\delta_{i}(t)$ and $\tau_{i}(t)$ is stated.

A4.

**Bounded delay.** For each $1\leq i\leq K$ and $t\geq 1$ , it holds that $0\leq\delta_{i}(t)\leq\tau_{i}(t)\leq\tau<\infty$ .

The boundedness assumption on the delay in (A4) allows us to develop convergence results that hold in the worst case, and has been widely used in the context of asynchronous algorithms [52]. It is remarked that an alternative assumption, made in [22], allows the delays $\delta_{i}(t)$ and $\tau_{i}(t)$ to be random variables with unbounded supports but finite means, but is not pursued here. Even with bounded delays, the extension to the asynchronous case is not straightforward, since the old stochastic subgradients are not necessarily descent directions on an average. Indeed, the resulting subgradient error at time $t$ , defined as

[TABLE]

is neither zero-mean nor i.i.d. In other words, the asynchronous algorithm cannot simply be considered as a special case of the inexact subgradient method.

It is worth pointing out that there is a subtle difference between the definition of the delayed stochastic gradient considered here, and those considered in [39, 22, 40]. Specifically, the delayed gradient in these works takes the form ${\mathbf{g}}_{t}^{i}({\bm{\lambda}}_{t-\tau_{i}(t)}^{i-1})$ instead of the one in (20). As a result, given ${\bm{\lambda}}_{t-\tau_{i}(t)}^{i-1}$ , the gradient error at time $t$ in these papers is indeed zero mean and i.i.d., an assumption that simplifies the analysis to a certain extent. It is also remarked that the definition of the delayed stochastic gradients in [25] is however similar to that considered here. Different from these works, the dual convergence results developed here consider subgradients instead of gradients, and are therefore applicable to a wider range of problems.

Within the context of network resource allocation, it is also important to study the (near-)optimality of the allocations $\{{\mathbf{x}}^{i}_{t},{\mathbf{p}}^{i}_{t}\}$ . Towards this end, some additional assumptions are first stated.

A5.

**Non-atomic probability density function:** The random variables $\{{\mathbf{h}}_{t}^{i}\}_{i=1}^{K}$ have non-atomic probability density functions (pdf).

A6.

**Slater’s condition:** There exists a strictly feasible ${(\tilde{{\mathbf{p}}}^{i},\tilde{{\mathbf{x}}}^{i})}$ , i.e., ${\mathbb{E}}\left[{\sum\limits_{i=1}^{K}}{\mathbf{g}}_{t}^{i}(\tilde{{\mathbf{p}}}_{t}^{i},\tilde{{\mathbf{x}}}^{i})\right]>0$ .

A7.

**Lipschitz continuous gradients.** Given ${\boldsymbol{{\bm{\lambda}}}}$ , ${\boldsymbol{{\bm{\lambda}}}}^{\prime}\in{\boldsymbol{\Lambda}}$ , there exists $L_{i}<\infty$ such that

[TABLE]

In (A5), for $\{{\mathbf{h}}_{t}^{i}\}_{i=1}^{K}$ to have a non-atomic pdf, it should not have any point masses or delta functions. Note that this requirement is not restrictive for most applications arising in wireless communications; see e.g. [9]. Slater’s condition is a standard assumption that ensures that $\mathsf{P}=\mathsf{D}$ and consequently, since $\mathsf{P}$ is finite, so is $\mathsf{D}$ . The Lipschitz condition in (A7) is however restrictive, since it requires the dual functions $D^{i}({\boldsymbol{{\bm{\lambda}}}})$ to be differentiable with respect to ${\boldsymbol{{\bm{\lambda}}}}$ . In other words, with (A7), ${\mathbf{g}}_{t}^{i}({\boldsymbol{{\bm{\lambda}}}})$ is a stochastic gradient, not a subgradient. It is remarked however that (A7) always holds if $f^{i}({\mathbf{x}}^{i})$ is strongly convex. Moreover, it is generally possible to enforce (A7) artificially by adding a strongly convex regularizer (such as $\theta\left\|{\mathbf{x}}^{i}\right\|_{2}^{2}$ ) to the primal objective [53]. Note however that (A5)-(A7) will not be utilized while establishing the dual convergence results.
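To see why a quadratic regularizer yields (A7), consider a scalar toy example that is not taken from the paper: maximize $-\theta x^{2}$ subject to $x-c\geq 0$ , with $\theta>0$ and a hypothetical constant $c$ . The Lagrangian $-\theta x^{2}+\lambda(x-c)$ is maximized at $x(\lambda)=\lambda/(2\theta)$ , which gives

$$D(\lambda)=\frac{\lambda^{2}}{4\theta}-\lambda c,\qquad \nabla D(\lambda)=\frac{\lambda}{2\theta}-c=x(\lambda)-c,\qquad \left|\nabla D(\lambda)-\nabla D(\lambda^{\prime})\right|=\frac{1}{2\theta}\left|\lambda-\lambda^{\prime}\right|.$$

In this example the dual function is differentiable, its gradient coincides with the constraint function evaluated at the maximizer, and the Lipschitz constant is $1/(2\theta)$ , so a larger regularization weight $\theta$ gives a smoother dual function.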

The incremental or asynchronous gradient methods have thus far never been applied to the problem of network resource allocation. For the classical stochastic dual descent method [cf. (7)-(8)], it is known that under (A1)-(A3) and (A6), the average resource allocations $\bar{{\mathbf{x}}}^{i}:={\frac{1}{T}\sum\limits_{t=1}^{T}}{\mathbf{x}}^{i}_{t}$ are asymptotically feasible and near-optimal [9].
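The average $\bar{{\mathbf{x}}}^{i}$ need not be computed by storing all past allocations. A minimal sketch of a streaming running average is given below; the function name `running_average` and the list-of-lists data layout are assumptions made for illustration.

```python
def running_average(iterates):
    """Return the running averages (1/T) * sum_{t<=T} x_t for T = 1, 2, ...

    `iterates` is a sequence of equal-length vectors (lists of floats).
    Only the current average is kept in memory between steps.
    """
    avg = None
    out = []
    for t, x in enumerate(iterates, start=1):
        if avg is None:
            avg = list(x)
        else:
            # Incremental form: avg_t = avg_{t-1} + (x_t - avg_{t-1}) / t
            avg = [a + (xi - a) / t for a, xi in zip(avg, x)]
        out.append(list(avg))
    return out
```

Each node can apply this update locally to its own allocations, since the average in question involves only ${\mathbf{x}}^{i}_{t}$ .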

IV-B Convergence of the AIS-SD algorithm

This subsection provides the convergence results for the AIS-SD algorithm, applied to (15). For the general case, the updates take the following form:

[TABLE]

where $\epsilon_{t}$ is the step-size, ${\mathbf{g}}_{t}^{i}({\boldsymbol{{\bm{\lambda}}}})$ is a stochastic subgradient of $D^{i}({\boldsymbol{{\bm{\lambda}}}})$ and ${\boldsymbol{{\bm{\lambda}}}}^{0}_{t}$ is read as ${\boldsymbol{{\bm{\lambda}}}}^{K}_{t-1}$ . Since the dual problem (6) is simply a special case of (15), the results developed here also apply to the iterates $\{{\boldsymbol{{\bm{\lambda}}}}_{t}^{i}\}$ generated by Algorithm 1. In order to keep the discussion generic, the results are presented for both, diminishing and constant step sizes.

Theorem 1.

The following results apply to the iterates generated by (22) with ${\boldsymbol{{\bm{\lambda}}}}_{t}={\boldsymbol{{\bm{\lambda}}}}_{t}^{0}$ under (A1)-(A4).

(a)

*Diminishing step-size:* If the positive sequence $\{{\epsilon}_{t}\}$ satisfies $\lim\limits_{T\rightarrow\infty}\sum\limits_{t=1}^{T}{\epsilon}_{t}=\infty$ and $\lim\limits_{T\rightarrow\infty}\sum\limits_{t=1}^{T}{\epsilon}_{t}^{2}<\infty$ , then it holds that

[TABLE]

(b)

*Error bound for constant step size:* For ${\epsilon}_{t}={\epsilon}>0$ , and any arbitrary scalar $\eta>0$ , it holds that

[TABLE]

where $T\leq B^{2}_{0}/{\epsilon}\eta$ . Here, $\tau$ is the maximum delay as defined in (A4), $C(\tau):=C_{1}+(C_{2}+\tau C_{2}^{\prime})$ , $C_{1}=KV^{2}$ , ${C_{2}:=2KV^{2}\frac{K-1}{2}}$ , ${C_{2}^{\prime}:=4K^{2}V^{2}}$ , and $B_{0}$ is such that $\left\|{{\boldsymbol{{\bm{\lambda}}}}_{1}}-{\boldsymbol{{\bm{\lambda}}}^{\star}}\right\|\leq B_{0}$ .
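The constants in Theorem 1(b) are simple to evaluate. The sketch below computes $C(\tau)$ and the iteration bound $B_{0}^{2}/({\epsilon}\eta)$ exactly as defined above; it assumes that $V$ denotes the moment bound from (A3), and the function names are hypothetical.

```python
def delay_constant(K, V, tau):
    """C(tau) = C1 + (C2 + tau * C2'), with the constants of Theorem 1."""
    c1 = K * V ** 2
    c2 = 2 * K * V ** 2 * (K - 1) / 2
    c2_prime = 4 * K ** 2 * V ** 2
    return c1 + (c2 + tau * c2_prime)

def max_iterations(B0, eps, eta):
    """Upper bound B0^2 / (eps * eta) on T from Theorem 1(b)."""
    return B0 ** 2 / (eps * eta)
```

For example, with $K=2$ and $V=1$ the constant grows from $C(0)=4$ to $C(1)=20$ , which shows that $C(\tau)$ is linear in the delay bound with slope $C_{2}^{\prime}=4K^{2}V^{2}$ .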

A popular choice for the diminishing step-size parameter ${\epsilon}_{t}$ required in Theorem 1(a) is ${\epsilon}_{t}=t^{-\alpha}$ for $\alpha\in(1/2,1)$ . For this case, the objective function in (15) converges exactly to the dual optimum. On the other hand, with a constant step size ${\epsilon}$ , the minimum objective value comes to within an $O({\epsilon})$ -sized ball around the optimum as $T\rightarrow\infty$ . More precisely, the result in (24) provides an upper bound on the number of iterations required to come $\eta$ -close to this ball. Different from the results in [13], the size of the ball now depends on the maximum delay $\tau$ , quantifying the worst-case impact of using delayed subgradients.
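The two step-size conditions can be checked numerically for ${\epsilon}_{t}=t^{-\alpha}$ . The sketch below computes partial sums of ${\epsilon}_{t}$ and ${\epsilon}_{t}^{2}$ ; the function name `partial_sums` is hypothetical, and finite partial sums can only suggest, not prove, divergence or convergence.

```python
def partial_sums(alpha, N):
    """Return (sum of eps_t, sum of eps_t^2) for eps_t = t^(-alpha), t = 1..N."""
    s1 = 0.0
    s2 = 0.0
    for t in range(1, N + 1):
        eps = t ** (-alpha)
        s1 += eps
        s2 += eps * eps
    return s1, s2
```

With $\alpha=0.75$ , the first sum keeps growing roughly like $4N^{1/4}$ , while the second sum is $\sum_{t}t^{-1.5}$ and stays below its limit of about $2.61$ .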

The proof of Theorem 1 follows the same overall structure as in [13], with appropriate modifications introduced to handle the asynchrony. To begin with, the following intermediate lemma splits a function related to the optimality gap in Theorem 1 into three different terms, and develops bounds on each. The proof of the following lemma is provided in Appendix A.

Lemma 1.

Under (A1)-(A4), the iterates generated by (22) with ${\boldsymbol{{\bm{\lambda}}}}_{t}={\boldsymbol{{\bm{\lambda}}}}_{t}^{0}$ satisfy the following bounds:

[TABLE]

where,

[TABLE]

where $\left\|{\boldsymbol{{\bm{\lambda}}}}_{1}^{0}-{\boldsymbol{{\bm{\lambda}}}^{\star}}\right\|\leq B_{0}$ . Note that ${\boldsymbol{{\bm{\lambda}}}}_{1}^{0}={\boldsymbol{{\bm{\lambda}}}}_{1}$ .

Having developed the necessary bounds, the proof of Theorem 1 is presented next.

Proof:

For the positive sequence $\{{\epsilon}_{t}\}$ , it holds that

[TABLE]

Substituting the bounds obtained in Lemma 1, and noting that it always holds that ${\epsilon}_{t}\leq{\epsilon}_{t-\tau}$ for all $\tau\geq 0$ , we obtain

[TABLE]

where, $C_{1}:=KV^{2}$ , $C_{2}^{\prime}:=4K^{2}V^{2}$ and $C_{2}:=2KV^{2}\frac{K-1}{2}$ . Note that in (IV-B), we have used the notation ${\epsilon}_{[t-\tau]_{+}}:={\epsilon}_{1}$ for all $t\leq\tau$ . Next, for the case when ${\epsilon}_{t}$ is diminishing, and satisfies ${\lim\limits_{T\rightarrow\infty}\sum\limits_{t=1}^{T}{\epsilon}_{t}=\infty}$ and $\lim\limits_{T\rightarrow\infty}\sum\limits_{t=1}^{T}{\epsilon}_{t}^{2}<\infty$ , the numerator of the bound on the right stays bounded, while the denominator grows to infinity. Consequently, taking the limit of $T\rightarrow\infty$ on both sides of (IV-B), the required result in (23) follows. Observe that when the step size is constant, the bound in (IV-B) can be written as

[TABLE]

where $C(\tau)$ is as defined in Theorem 1. In the limit as $T\rightarrow\infty$ , the bound becomes

[TABLE]

which is the asymptotic version of the result in (24).

The rate result in (24) builds upon a similar result from [54, Prop. 3.3]. Intuitively, $\mathbb{E}\left[D({\bm{\lambda}}_{t})\right]$ continues to decrease as long as it is significantly larger than $\mathsf{D}$ . The rest of the proof characterizes the resulting decrement rigorously and subsequently invokes the monotone convergence theorem in order to establish that $\mathbb{E}\left[D({\bm{\lambda}}_{t})\right]$ must eventually come close to $\mathsf{D}$ . Given arbitrary $\eta>0$ and recalling that ${\bm{\lambda}}_{t}:={\bm{\lambda}}_{t}^{0}$ , define the sequence

[TABLE]

Alternatively, $\mathring{{\bm{\lambda}}}_{t}$ is the same as ${\bm{\lambda}}_{t}$ until ${\bm{\lambda}}_{t}$ enters the level set $L$ defined as

[TABLE]

and $\mathring{{\bm{\lambda}}}_{t}$ terminates at ${\bm{\lambda}}^{\star}$ . From (69) and Lemma 1, we have for constant step size ${\epsilon}_{t}={\epsilon}$ that

[TABLE]

where $\mathring{{\bm{\lambda}}}_{t+1}:=\mathring{{\bm{\lambda}}}_{t}^{K}$ and $\mathring{{\bm{\lambda}}}_{t}=\mathring{{\bm{\lambda}}}_{t}^{0}$ . Next define

[TABLE]

so that (33) can be written as

[TABLE]

From the monotone convergence theorem, we have that $\sum\limits_{t=1}^{\infty}z_{t}<\infty$ , implying that there exists $T<\infty$ , such that $z_{t}=0$ for all $t\geq T$ . Observe that for the case when $\mathring{{\bm{\lambda}}}_{t}\notin L$ , it holds that

[TABLE]

Consequently, it follows from (35) that

[TABLE]

Since the term on the left is non-negative, we have that $B_{0}\geq\sum\limits_{t=1}^{T}z_{t}\geq T{\epsilon}\eta$ , yielding the required bound on $T$ . ∎

IV-C Primal near optimality and feasibility

The AIS-SD algorithm of Sec. IV-B, when applied to solve the dual problem in (6), is referred to as the AIS-DD algorithm. In order to ensure that the results developed thus far continue to apply to the dual problem, assumptions (A5)-(A7) are also required. As mentioned earlier, for the primal problem, ${\boldsymbol{\Lambda}}$ is simply the non-negative orthant implying that $\mathsf{D}$ is finite. This subsection establishes the average near-optimality of the AIS-DD algorithm in (12)-(13). Note that Theorem 1 does not imply that the allocations $\{{\mathbf{x}}_{t}^{i},{\mathbf{p}}_{t}^{i}\}$ converge. Instead, the results will make use of the ergodic limit variable

[TABLE]

for each $1\leq i\leq K$ . The main theorem for this subsection is presented next.

Theorem 2.

Under (A1)-(A7) and for constant step size ${\epsilon}>0$ , the iterates generated by (12)-(13) satisfy:

A.

Primal near optimality

[TABLE]

where,

[TABLE]

B.

Asymptotic feasibility

[TABLE]

Intuitively, the resource allocations in (12) are near-optimal, with optimality gap depending on the step size ${\epsilon}$ and the delay bound $\tau$ . Further, the allocations are almost surely asymptotically feasible, regardless of the delay bound or the step size. As in Sec. IV-B, the proof of Theorem 2 proceeds by first splitting the optimality gap into three terms and developing bounds on each. The required results are summarized into the following intermediate lemmas, whose proofs are deferred to Appendices B and C, respectively.

Lemma 2.

Under (A1)-(A6), the iterates ${\boldsymbol{{\bm{\lambda}}}}_{t}^{i}$ obtained from (13) are bounded on an average, i.e., there exists $B<\infty$ such that ${\mathbb{E}}\left\|{\boldsymbol{{\bm{\lambda}}}}_{t}\right\|\leq B$ for all $t\geq 1$ .

Lemma 3.

Under (A1)-(A7), the iterates generated by (12)-(13) satisfy the following bounds:

[TABLE]

where,

[TABLE]

Having established the intermediate results, the proof of Theorem 2 is now presented.

Proof of Theorem 2.

The primal near-optimality can be established directly from Lemma 3. Specifically, summing the bounds for $I_{2}$ , $I_{3}$ , and $I_{4}$ , and taking the limit as $T\rightarrow\infty$ , the bound in (41) follows.

In order to establish (42), observe that for any $t\geq 1$ and $1\leq i\leq K$ , it holds that

[TABLE]

where the inequality holds element-wise. Summing both sides over all $1\leq t\leq T$ and $1\leq i\leq K$ , and rearranging, it follows that

[TABLE]

Finally, since ${\boldsymbol{{\bm{\lambda}}}}_{1}^{K}\succeq 0$ , taking expectations on both sides, it follows that

[TABLE]

where (45) holds due to Lemma 2. In other words, given any $\alpha>0$ , there exists $t_{0}\in\mathbb{N}$ such that for all $T\geq t_{0}$ ,

[TABLE]

Taking the limit as $T\rightarrow\infty$ , the result in (42) follows. ∎

V Application to co-ordinated beamforming

This section considers the co-ordinated downlink beamforming problem in wireless communication networks. The usefulness of the proposed stochastic incremental algorithm is demonstrated by applying it to the beamforming problem and solving it in a distributed and online fashion. Simulations are carried out to confirm that the performance of the proposed algorithm is close to that of the centralized algorithm.

V-A Problem formulation

Consider a multi-cell multi-user wireless network with $B$ base stations and $U$ users. Each user $j\in\{1,\ldots,U\}$ is associated with a single base station $b(j)\in\{1,\ldots,B\}$ , and the set of users associated with a base station $i$ is denoted by ${\mathcal{U}}_{i}:=\{j|b(j)=i\}$ . For the sake of consistency, this section will utilize indices $i$ and $m$ for base stations, and indices $j$ , $k$ , and $n$ for users, with the additional restriction that $b(j)=b(k)=i$ and $b(n)=m$ . Within the downlink scenario considered here, user $j$ can only receive data symbols $s_{j}\in\mathbb{C}$ from its associated base station $b(j)$ . The signals transmitted by the base station $i$ intended for other users $k\in{\mathcal{U}}_{i}\setminus\{j\}$ , as well as the signals transmitted by other base stations $m\neq i$ constitute, respectively, the intra-cell and inter-cell interference at user $j$ . The base station $i$ , equipped with $N_{i}$ transmit antennas, utilizes the transmit beamforming vector ${\mathbf{w}}_{j}\in\mathbb{C}^{N_{i}\times 1}$ for each of its associated users $j\in{\mathcal{U}}_{i}$ . Consequently, the received signal at user $j$ is given by

[TABLE]

where ${\mathbf{h}}_{ij}$ denotes the complex channel gain vector between base station $i$ and user $j$ , and $e_{j}$ is the zero mean, complex Gaussian random variable with variance $\sigma^{2}$ that models the noise at user $j$ . Assuming $s_{k}$ to be independent, zero-mean, and with unit variance, the expression for the signal-to-interference-plus-noise ratio (SINR) at user $j$ is given by

[TABLE]

where $i=b(j)$ is the associated base station.
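The SINR computation can be sketched in a few lines. The code below assumes the standard form in which the desired signal power is $|{\mathbf{h}}_{ij}^{H}{\mathbf{w}}_{j}|^{2}$ and every other beamformer ${\mathbf{w}}_{n}$ contributes $|{\mathbf{h}}_{b(n)j}^{H}{\mathbf{w}}_{n}|^{2}$ of interference, covering both the intra-cell and inter-cell terms. The use of the Hermitian inner product and the data layout (`H[m][j]` for the channel from base station `m` to user `j`, `W[n]` for the beamformer of user `n`, `bs_of[n]` for $b(n)$ ) are assumptions made for illustration.

```python
def inner(h, w):
    """Hermitian inner product h^H w of two complex vectors given as lists."""
    return sum(hk.conjugate() * wk for hk, wk in zip(h, w))

def sinr(j, bs_of, H, W, sigma2):
    """SINR at user j for unit-variance, independent data symbols.

    bs_of[n] : serving base station b(n) of user n
    H[m][j]  : channel vector from base station m to user j
    W[n]     : beamforming vector used by bs_of[n] for user n
    sigma2   : noise variance at the user
    """
    i = bs_of[j]
    signal = abs(inner(H[i][j], W[j])) ** 2
    interference = 0.0
    for n in range(len(W)):
        if n == j:
            continue
        m = bs_of[n]  # m == i gives intra-cell, m != i gives inter-cell interference
        interference += abs(inner(H[m][j], W[n])) ** 2
    return signal / (interference + sigma2)
```

For instance, with two single-antenna base stations serving one user each, a desired link gain of $1$ with beamformer $2$ and a cross link gain of $0.5j$ with beamformer $1$ give an SINR of $4/(0.25+1)=3.2$ at unit noise variance.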

Within the classical co-ordinated beamforming framework, the goal is to design the beamformers $\{{\mathbf{w}}_{j}\}_{j=1}^{U}$ so as to minimize the transmit power, while meeting the SINR constraints at each user. The required optimization problem becomes [55]

[TABLE]

where ${\gamma_{j}}$ is a pre-specified quality-of-service (QoS) threshold for user $j$ . While the beamforming vectors resulting from (48) are optimal, the centralized nature of the optimization problem renders it impractical for application to real networks. For instance, the solution proposed in [55] requires the estimated channel gains $\{{\mathbf{h}}_{ij}\}$ to be collected at a centralized location, where (48) is solved via an iterative algorithm. In practice however, the entire parameter exchange and the algorithm must complete within a fraction of the coherence time of the channel, lest the designed beamformer becomes obsolete. Such a solution is therefore difficult to implement, not robust to node or link failures, and not scalable to large networks.

Observe that the modified version of (48) can be written as

[TABLE]

where $i=b(j)$ . Note that the constraints in (49b) and (49c) ensure that the SINR still meets the required threshold $\gamma_{j}$ . This is because the feasible set of (49) is a subset of that of (48), so any solution of (49) is also feasible for (48). Next, the use of primal or dual decomposition techniques can yield a distributed algorithm for (49). Nevertheless, such distributed algorithms also suffer from the limitations mentioned earlier, since the optimum beamforming vectors are required at every time slot.

On the other hand, within the uncoordinated beamforming framework, the optimization variable $I_{j}$ in (49) is replaced with a pre-specified threshold $\rho$ . This renders (49) separable at each base station, allowing beamforming vectors to be designed in parallel. However, the resulting beamformers are suboptimal, and may even render the problem infeasible if $\rho$ is too small or too large.

[TABLE]

A compromise is possible within the stochastic optimization framework by requiring the bound in (49c) to only be satisfied on an average. Note that this amounts to relaxing the optimization problem (49) since the SINR constraint is no longer binding at every time slot. The overall stochastic optimization problem can be expressed as

[TABLE]

where $i=b(j)$ . Different from (48) or (49), the stochastic optimization problem (51) involves finding policies ${\mathbf{w}}_{j}(t)$ and ${I_{j}(t)}$ , which are not necessarily optimal for every time slot $t$ , but only on an average. Specifically, the intercell interference is bounded on an average [cf. (51c)], but also instantaneously [cf. (51d)], so as to limit the worst case SINR. The problem in (51) can be readily implemented using the proposed distributed and asynchronous stochastic dual descent algorithm. In contrast to (49), the stochastic algorithm is not required to converge at every time slot, and allows cooperation over heterogeneous nodes.

V-B Solution to optimization problem

The AIS-DD algorithm proposed in Sec. III-B can now be applied to solve (51). To this end, associate dual variables ${\lambda_{j}}$ for all users ${\{j\in\left[1,U\right]\}}$ , and observe that the primal variables at node $i$ include $\{{\mathbf{w}}_{j}\}_{j\in{\mathcal{U}}_{i}}$ and ${\{I_{j}\}_{j\in{\mathcal{U}}_{i}}}$ . Departing from the notational convention used thus far, the subscript in $\lambda_{j}$ is used for indexing the users, while time dependence is indicated by $\lambda_{j}(t)$ . Proceeding as in Sec. III-B, and recalling that the indices $j$ and $n$ are such that $i=b(j)\neq b(n)=m$ , the operation at node $i$ is summarized in Algo. 4. Observe that such an implementation entails allocating resources prior to the dual updates, and thus results in the delay of at least one, i.e., $\pi_{i}(t)\geq 1$ , compared to the synchronous version. Conversely, the dual updates occur as and when they are passed around, without creating a bottleneck for the resource allocation. For the sake of simplicity, it is assumed that the dual updates occur along the route $1$ , $2$ , $\ldots$ , $K$ .

Next, simulations are carried out to demonstrate the applicability of the stochastic algorithm to the beamforming problem at hand. For the simulations, we consider a system with $B=10$ and $U=10$ , with one user per cell. Each of the base stations has ten antennas ( $N_{i}=10$ ), while the other algorithm parameters are ${\epsilon}=0.5$ , $\sigma^{2}=1$ , $\rho=1.65$ , and ${\gamma_{j}=10\ \text{dB}\ \text{for all}\ j}$ . In order to keep the simulations realistic, we assume that the delays in the dual updates arise from random events such as node and link failures. For the centralized algorithm, a random subset of four out of ten nodes is selected to transmit their current gradients to the FC at every time slot. Since the FC utilizes old gradients for the other nodes, it results in an average delay of 5.4 time slots. Similarly, for the incremental algorithm, it is assumed that at every time slot, five to fifteen dual update steps (cf. (8)) occur, resulting in an average delay of 5.4 time slots. For instance, if at any time slot, only 8 nodes update, it will result in a delay $\pi_{i}(t)=2$ at the remaining two nodes, where the dual update will occur at the start of the next time slot. The delay may increase further if fewer than 10 nodes update for consecutive time slots and conversely, may decrease if more than 10 updates occur per time slot. Fig. 2 shows the running average of the primal objective function as a function of time using Monte Carlo simulations. For comparison, the performance of the classical centralized stochastic gradient method [cf. (7)-(8)], assuming perfect message passing, is also shown. As evident, the performance loss due to the delays in the availability of the dual variables is minimal.

In order to motivate the stochastic formulation over the deterministic one, Fig. 3 also compares the average transmit power and SINR achieved for the various cases and for different values of the parameter $\rho$ . As expected, the distributed deterministic algorithm performs poorly since it forces the SINR bound to be a constant that does not depend on the channel. By design, the worst case SINR is bounded below by one at every time slot in both the deterministic formulations. Interestingly, the worst case SINR achieved for the relaxed stochastic formulation is also close to one on an average. In return, the stochastic algorithm yields an average transmit power that is equal to or below that obtained by the centralized deterministic formulation. In other words, it is always possible to artificially raise $\gamma$ to a value that is slightly higher than one, so as to obtain an average SINR above $10$ dB, while still getting near-optimal average transmit power.

Next, we study the effect of delay on the rate of convergence of the AIS-DD algorithm. For this case, a simple system with $B=10$ and $U=10$ , and constant delays at all the nodes is considered. The base stations have ten antennas each ( $N_{i}=10$ ) and the other algorithm parameters are ${\epsilon}=0.2$ , $\sigma^{2}=1$ , and $\rho=1.65$ . Fig. 4 shows the evolution of the primal objective function for various delay values. As expected, the convergence is slower if both $\pi_{i}(t)$ and $\delta_{i}(t)$ are consistently larger. Interestingly however, a small increase in the delays amounts to only a marginal loss in performance.

Finally, in order to demonstrate the scalability of the proposed algorithm, Fig. 5 shows an example run for a system with $B=50$ nodes and $U=50$ . The base stations have ten antennas each ( $N_{i}=10$ ), while the other algorithm parameters are ${{\epsilon}=0.5}$ , $\sigma^{2}=1$ , and $\rho=5$ . The delay is generated in a similar manner as for the earlier simulations. It can be observed that even when the number of nodes is large, the difference between the performance of the synchronous and asynchronous algorithms remains relatively small.

VI Conclusion

This paper considers a constrained stochastic resource allocation problem over a heterogeneous network. An asynchronous incremental stochastic dual descent method is proposed for solving the same. The proposed algorithm utilizes delayed gradients for carrying out the updates, resulting in an attractive feature that allows nodes to skip or postpone some updates. The convergence of the proposed algorithm is established for both constant and diminishing step sizes. Further, it is shown that the resource allocations arising from the proposed algorithm are also asymptotically near-optimal. A novel multi-cell coordinated beamforming problem is formulated within the stochastic framework considered here, and solved via the proposed algorithm. Simulation results reveal that the impact of using stale stochastic gradients is minimal.

Appendix A Proof of Lemma 1

A-1 Preliminaries

Before deriving the required bounds, some preliminary results are first obtained. Recall that the quantity ${\mathbf{g}}_{\upsilon}^{i}({\bm{\lambda}})$ denotes the stochastic (sub-)gradient of $D^{i}({\bm{\lambda}})$ at time $t=\upsilon$ and evaluated at ${\bm{\lambda}}$ . Within the context of the dual descent algorithm, we also have that ${\mathbf{g}}_{\upsilon}^{i}({\bm{\lambda}}):={\mathbf{g}}_{\upsilon}^{i}({\mathbf{p}}^{i}_{\upsilon}({\bm{\lambda}}),{\mathbf{x}}_{\upsilon}^{i}({\bm{\lambda}}))$ for $\upsilon\geq 1$ and all $1\leq i\leq K$ . In particular the updates in (22) use $\upsilon=t-\delta_{i}(t)$ and ${\bm{\lambda}}={\bm{\lambda}}_{t-\tau_{i}(t)}^{i-1}$ , where $\tau_{i}(t)=\delta_{i}(t)+\pi_{i}(t)$ . It can be seen that for the special case of the synchronous algorithm, we have that $\tau_{i}(t)=\delta_{i}(t)=0$ and the stochastic (sub-)gradient is written as ${\mathbf{g}}_{t}^{i}({\bm{\lambda}}_{t}^{i-1})$ .
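The update pattern described above can be sketched on a toy problem. The following is a minimal illustration under stated assumptions, not the paper's implementation: the dual variable is a scalar, the local dual functions are taken to be $D^{i}(\lambda)=\tfrac{1}{2}(\lambda-c_{i})^{2}$ , the stochastic gradient is the true gradient plus Gaussian noise, a single constant total delay `tau` stands in for $\tau_{i}(t)$ , and $\Lambda=[0,\infty)$ . The function name and all parameters are hypothetical.

```python
import random


def delayed_incremental_descent(c, eps, tau, T, noise_std=0.1, seed=0):
    """Toy incremental projected descent with stale iterates.

    Assumptions (not from the paper): scalar dual variable,
    D^i(lam) = 0.5 * (lam - c[i])**2, constant total delay tau,
    and projection onto [0, inf).
    """
    rng = random.Random(seed)
    K = len(c)
    lam = 0.0
    history = []  # history[s][i] stores lambda_s^i for i = 0..K
    for t in range(T):
        row = [lam]  # lambda_t^0
        for i in range(1, K + 1):
            s = max(t - tau, 0)  # slot of the stale iterate
            # Gradient is evaluated at lambda_{t - tau}^{i-1}.
            stale = row[i - 1] if s == t else history[s][i - 1]
            grad = (stale - c[i - 1]) + rng.gauss(0.0, noise_std)
            # Projected step applied to the current iterate lambda_t^{i-1}.
            lam = max(0.0, lam - eps * grad)
            row.append(lam)
        history.append(row)
    return lam


# Usage: with tau = 0 the sketch reduces to the synchronous incremental method.
final = delayed_incremental_descent([1.0, 2.0, 3.0, 4.0, 5.0], eps=0.01, tau=3, T=3000)
print(final)
```

For this toy objective the minimizer of $\sum_{i}D^{i}$ is the mean of the $c_{i}$ , so the final iterate is expected to settle in a small neighborhood of that value when the step size is small and constant.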

For the sake of convenience, let us denote ${\mathbf{g}}_{t-\delta_{i}(t)}^{i}:={\mathbf{g}}_{t-\delta_{i}(t)}^{i}({\boldsymbol{{\bm{\lambda}}}}_{t-\tau_{i}(t)}^{i-1})$ . First, we establish that the distance between the iterates ${\boldsymbol{{\bm{\lambda}}}}_{t}^{i}$ and ${\boldsymbol{{\bm{\lambda}}}}_{{\ell}}^{i}$ is bounded by a term that is proportional to the step size. From the updates in (22), it holds for all $t$ and $1\leq i\leq K$ that

[TABLE]

where the inequalities in (54) follow from (A1) and (A3). The bound ${\mathbb{E}}\left\|{\mathbf{g}}_{t-\delta_{i}(t)}^{i}\right\|\leq V_{i}$ follows from (A3) and Jensen’s inequality, which implies that ${\mathbb{E}}\left\|{\mathbf{g}}_{t-\delta_{i}(t)}^{i}\right\|\leq\sqrt{{\mathbb{E}}\left\|{\mathbf{g}}_{t-\delta_{i}(t)}^{i}\right\|^{2}}$ . Given $1\leq i,j\leq K$ and $t\geq{\ell}\geq 1$ , it follows that

[TABLE]

where (55) is obtained by substituting $V=\max_{i}V_{i}$ and (56) follows since ${\epsilon}_{{\ell}}\geq{\epsilon}_{s}$ for all ${\ell}\leq s\leq t$ . The result in (57) holds from the inequality $(i-j)\leq|i-j|$ . Further, for $t\geq 1$ and $1\leq i\leq K$ , let ${\mathcal{F}}^{i}_{t}$ be the $\sigma$ -algebra generated by the random variables

[TABLE]

where ${\mathbf{h}}_{t}^{i}$ denotes the random state variables observed at node $i$ at time $t$ . With this definition, it holds that

[TABLE]

since ${\mathbf{g}}_{t-\delta_{i}(t)}^{i}$ may depend on ${\boldsymbol{{\bm{\lambda}}}}_{s}^{i-1}$ only for $s\leq{t-\tau_{i}(t)}$ .
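The Jensen step used earlier in these preliminaries, ${\mathbb{E}}\left\|{\mathbf{g}}\right\|\leq\sqrt{{\mathbb{E}}\left\|{\mathbf{g}}\right\|^{2}}$ , can be checked numerically. The sketch below uses Gaussian vectors purely as a stand-in for the stochastic gradients; the distribution, dimension, and function name are assumptions made for illustration.

```python
import math
import random


def jensen_gap(samples):
    """Return (E||g||, sqrt(E||g||^2)) for the empirical distribution of samples."""
    norms = [math.sqrt(sum(x * x for x in g)) for g in samples]
    mean_norm = sum(norms) / len(norms)
    root_mean_sq = math.sqrt(sum(n * n for n in norms) / len(norms))
    return mean_norm, root_mean_sq


rng = random.Random(0)
# Hypothetical stand-in for stochastic gradients: 4-dimensional Gaussian vectors.
samples = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5000)]
mean_norm, root_mean_sq = jensen_gap(samples)
print(mean_norm <= root_mean_sq)
```

The inequality holds for any distribution with a finite second moment, so a second-moment bound of the kind in (A3) directly yields a first-moment bound.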

Proof:

The proof is organized into two parts. Subsection A-2 develops a bound on the optimality gap in (25), in terms of $I_{1}$ . Subsequently, Subsection A-3 develops the required bound on $I_{1}$ .

A-2 Bound on the optimality gap

An upper bound on $\left\|{\boldsymbol{{\bm{\lambda}}}}_{t}^{K}-{\boldsymbol{{\bm{\lambda}}}^{\star}}\right\|^{2}$ is developed by making use of the form of the updates in (22) for all nodes $1\leq i\leq K$ . The bound follows from the use of the triangle inequality and the moment bounds in (A3). Further, the bounded delay assumption (A4) enters through the use of (57).

Observe from the updates in (22) that

[TABLE]

where (60) follows from (A1), and the term $2{\epsilon}_{t}\langle{\mathbf{g}}^{i}_{t-\delta_{i}(t)},{\boldsymbol{{\bm{\lambda}}}}^{i-1}_{t-\tau_{i}(t)}\rangle$ has been added and subtracted to obtain (61). Taking expectations on both sides and summing over all $1\leq i\leq K$ and $1\leq t\leq T$ , we obtain

[TABLE]

where $I_{1}$ is as defined in Lemma 1. Deferring the bound on $I_{1}$ to Subsection A-3, the last term in (62) is analyzed first. In particular, it holds from (59) that

[TABLE]

Further, since the functions $D^{i}({\boldsymbol{{\bm{\lambda}}}})$ are convex, it holds that

[TABLE]

where (64)-(65) follow from the first order convexity condition for $D^{i}$ and (66) follows from the use of triangle inequality, and the fact that given any ${\boldsymbol{{\bm{\lambda}}}}\in{\boldsymbol{\Lambda}}$ ,

[TABLE]

For the last term in (66), taking expectation and utilizing the result in (57), it follows that

[TABLE]

for $1\leq i\leq K$ . Finally, substituting (66), (68) in (62), and using (A3), it follows that

[TABLE]

Since the left-hand side is non-negative and $\left\|{\boldsymbol{{\bm{\lambda}}}}_{1}^{0}-{\boldsymbol{{\bm{\lambda}}}^{\star}}\right\|\leq B_{0}$ , the first part of Lemma 1 is obtained simply by rearranging the terms in (69)

[TABLE]

where the first inequality in (70) follows since $\tau_{i}(t)\leq\tau$ , and ${\epsilon}_{t}$ is a non-increasing sequence. Finally, the second inequality in (70) follows from substituting $V_{i}\leq V$ for all $1\leq i\leq K$ , and $I_{0}$ is as defined in Lemma 1.

A-3 Bound on $I_{1}$

In order to derive a bound on $I_{1}$ , we make use of the Cauchy-Schwarz inequality as follows:

[TABLE]

where (71) follows from (57). Consequently,

[TABLE]

where (72) utilizes the bounds $\tau_{i}(t)\leq\tau$ . Finally, substituting $V=\max_{i}V_{i}$ , we obtain ${I_{1}\leq 2\tau KV^{2}{\sum_{t=1}^{T}\sum_{i=1}^{K}}{\epsilon}_{t}{\epsilon}_{t-\tau_{i}(t)}}$ which is the required bound. ∎
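As an illustrative specialization that is not part of the original proof, consider a constant step size ${\epsilon}_{t}={\epsilon}$ . Each of the $TK$ summands then equals ${\epsilon}^{2}$ , and the bound on $I_{1}$ reduces to a multiple of the constant $\bar{V}=2K^{2}V^{2}$ introduced in Appendix B:

```latex
I_{1} \leq 2\tau K V^{2} \sum_{t=1}^{T}\sum_{i=1}^{K} {\epsilon}^{2}
      = 2\tau K^{2} V^{2} T {\epsilon}^{2}
      = {\epsilon}^{2}\,\tau\,\bar{V}\,T,
\qquad \bar{V} := 2K^{2}V^{2}.
```

Normalizing by ${\epsilon}T$ leaves ${\epsilon}\tau\bar{V}$ , which has the same form as the delay-dependent term in the thresholds of Cases 1 and 2 in Appendix B.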

Appendix B Proof of Lemma 2

Here, we establish that the dual iterates always stay bounded, thanks to Slater’s condition in (A6). The proof begins with establishing an upper bound on the per-iteration increase in the value of ${\mathbb{E}}\left\|{\boldsymbol{{\bm{\lambda}}}}_{t}^{0}-{\boldsymbol{{\bm{\lambda}}}^{\star}}\right\|^{2}$ , and subsequently utilizes an induction argument to derive the following bound for all $t\geq 1$ :

[TABLE]

where $\theta$ and $C$ are positive constants, $\tilde{V}=V^{2}K(K-1)$ , $\bar{V}=2K^{2}V^{2}$ , and $\{\bar{{\mathbf{x}}}^{i}\}$ is a Slater point of (1). Since $\left\|{\bm{\lambda}}^{\star}\right\|$ is bounded, the right hand side of (B) serves as the bound on ${\mathbb{E}}\left\|{\bm{\lambda}}_{t}\right\|$ .

Proof:

In order to prove (B), we will instead establish a more general result that takes the form:

[TABLE]

where, recall that ${\bm{\lambda}}_{t}^{0}={\bm{\lambda}}_{t}$ and ${\bm{\lambda}}_{1}^{0}={\bm{\lambda}}_{1}$ . The desired result in (B) will follow by applying the triangle inequality to (B). The proof of (B) follows via induction. It can be seen that the inequality in (B) holds trivially for the base case of $t=1$ . As part of the inductive hypothesis, assume that (B) holds for $t$ where $t\geq 1$ . It remains to show that it also holds for $t+1$ . We split the argument into the following two cases.

**Case 1.** ${\mathbb{E}}D({{\boldsymbol{{\bm{\lambda}}}}_{t}^{0}})>\mathsf{D}+{\epsilon}{{\tilde{V}}/2}+{{\epsilon}\tau\bar{V}}$ : In this case, it holds that ${\mathbb{E}}\left\|{\boldsymbol{{\bm{\lambda}}}}_{t+1}^{0}-{\boldsymbol{{\bm{\lambda}}}^{\star}}\right\|^{2}\leq{\mathbb{E}}\left\|{{\boldsymbol{{\bm{\lambda}}}}_{t}^{0}}-{\boldsymbol{{\bm{\lambda}}}^{\star}}\right\|^{2}$ . Consequently, the induction hypothesis for time $t$ implies that (B) also holds for time $t+1$ .

**Case 2.** ${\mathbb{E}}D({{\boldsymbol{{\bm{\lambda}}}}_{t}^{0}})\leq{\mathsf{D}+{\epsilon}{{\tilde{V}}/2}+{{\epsilon}\tau\bar{V}}}$ : Recall that the dual function in (II-B) is defined as

[TABLE]

where $\{\tilde{{\mathbf{x}}}^{i},\{\tilde{{\mathbf{p}}}_{t}^{i}\}_{t\geq 1}\}_{i=1}^{K}$ is a strictly feasible (Slater) solution to (1). From (A6), such a strictly feasible solution exists and satisfies ${\mathbb{E}}\left[{\mathbf{g}}_{t}^{i}(\tilde{{\mathbf{p}}}_{t}^{i},\tilde{{\mathbf{x}}}^{i})\right]>C>0$ . Substituting into (B), and rearranging, we obtain

[TABLE]

Since ${{\boldsymbol{{\bm{\lambda}}}}_{t}^{0}}\succeq 0$ , it follows from equivalence of norms $\left\|{{\boldsymbol{{\bm{\lambda}}}}_{t}^{0}}\right\|\leq\theta\left\|{{\boldsymbol{{\bm{\lambda}}}}_{t}^{0}}\right\|_{1}=\theta\langle\boldsymbol{1},{{\boldsymbol{{\bm{\lambda}}}}_{t}^{0}}\rangle$ . Therefore, taking expectations in (76) yields

[TABLE]

where the assumption for Case 2 has been used in (78). Finally, the use of triangle inequality and the bound in (57) yields

[TABLE]

which, together with (78), yields (B) for $t+1$ . Therefore, by mathematical induction, the inequality in (B) holds for all $t\geq 1$ . Finally, using (B) and the triangle inequality, we obtain the result in (B) since ${\boldsymbol{{\bm{\lambda}}}}_{t}={\boldsymbol{{\bm{\lambda}}}}_{t}^{0}$ . ∎
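The norm-equivalence step used in Case 2, $\left\|{\bm{\lambda}}\right\|\leq\theta\left\|{\bm{\lambda}}\right\|_{1}=\theta\langle\boldsymbol{1},{\bm{\lambda}}\rangle$ for ${\bm{\lambda}}\succeq 0$ , can be checked numerically. The sketch below assumes the Euclidean norm, for which $\theta=1$ suffices; the proof itself only requires some positive constant $\theta$ . The helper names are hypothetical.

```python
import math
import random


def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def l1_via_inner_product(v):
    """For a componentwise nonnegative v, ||v||_1 equals <1, v>."""
    return sum(v)


def check_norm_equivalence(num_trials=1000, dim=10, seed=0, theta=1.0):
    """Check ||lam||_2 <= theta * <1, lam> on random nonnegative vectors."""
    rng = random.Random(seed)
    for _ in range(num_trials):
        lam = [rng.uniform(0.0, 5.0) for _ in range(dim)]
        if l2_norm(lam) > theta * l1_via_inner_product(lam) + 1e-12:
            return False
    return True


print(check_norm_equivalence())
```

Nonnegativity of the dual iterates is what allows the $\ell_{1}$ norm to be written as a plain inner product with the all-ones vector, which is the form needed when taking expectations.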

Appendix C Proof of Lemma 3

Proof:

The proof establishes a lower bound for the running average of the primal objective function, calculated at the primal iterates. The lower bound depends upon the dual optimal value, dual initialization, and the maximum delay bound $\tau$ . For ease of exposition, the proof begins with re-arranging the optimality gap in the form required by Lemma 3 and subsequently analyzing the resulting terms. The full proof is split into various parts that develop separate bounds on the terms $I_{2}$ , $I_{3}$ , and $I_{4}$ . Since (A7) is required to establish Lemma 3, the dual function $D$ has to be differentiable.

Since the functions $f^{i}$ are concave, the expected value of the primal objective can be written as

[TABLE]

Consider the following expression, where we simply add and subtract $D^{i}({{\boldsymbol{{\bm{\lambda}}}}_{t}})$ as follows

[TABLE]

where (83) follows since $\mathsf{D}={\sum\limits_{i=1}^{K}}D^{i}({\boldsymbol{{\bm{\lambda}}}^{\star}})\leq{\sum\limits_{i=1}^{K}}D^{i}({\boldsymbol{{\bm{\lambda}}}})$ for all ${\boldsymbol{{\bm{\lambda}}}}\in{\boldsymbol{\Lambda}}$ . Taking the expectation on both sides of (83) and substituting the result into (82), we obtain

[TABLE]

where $I_{2}$ and $I_{3}$ are as defined in Lemma 3. The rest of the proof proceeds simply by developing bounds on $I_{2}$ and $I_{3}$ .

C-1 Bound on $I_{2}$

The bound on $I_{2}$ follows simply from the moment bounds in (A3) and the Cauchy-Schwarz inequality. We begin with the following observation. Since the functions $D^{i}$ are convex, it holds that

[TABLE]

where (86) uses the Cauchy-Schwarz inequality, while (87) uses (57) and (67). Therefore, substituting (87) into the expression for $I_{2}$ and rearranging, we obtain $I_{2}\leq{{\epsilon}{V}\sum_{i=1}^{K}(i-1)V_{i}}.$ Finally, the required bound in Lemma 3 is obtained by substituting $V=\max_{i}V_{i}$ .
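The final substitution can be spelled out as follows. This is only the arithmetic implied by $V_{i}\leq V$ , written in terms of the constant $\tilde{V}=V^{2}K(K-1)$ defined in Appendix B:

```latex
I_{2} \leq {\epsilon} V \sum_{i=1}^{K} (i-1) V_{i}
      \leq {\epsilon} V^{2} \sum_{i=1}^{K} (i-1)
      = {\epsilon} V^{2}\,\frac{K(K-1)}{2}
      = \frac{{\epsilon}\tilde{V}}{2}.
```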

C-2 Bound on $I_{3}$

The bound on $I_{3}$ follows from setting aside the error due to asynchrony $I_{4}$ , and developing a bound on the remaining terms by telescopically summing the bounds on $\left\|{\bm{\lambda}}_{t}^{i}\right\|$ over all $1\leq i\leq K$ and $1\leq t\leq T$ .

Since $\mathbf{0}\in\Lambda$ is a feasible dual solution, using the form of the updates in (13) and expanding as in (60), it follows that

[TABLE]

Adding the term $2{\epsilon}\langle{\boldsymbol{{\bm{\lambda}}}}_{t}^{i-1},\nabla D^{i}({\boldsymbol{{\bm{\lambda}}}}_{t}^{i-1})\rangle$ on both sides, and rearranging, we obtain

[TABLE]

where ${\mathbf{e}}_{t,\delta_{i}(t)}^{i}$ is as defined in (20). Summing over $i=1,\ldots,K$ and $t=1,\cdots,T$ , taking expectation, and utilizing (A3), it follows that

[TABLE]

where $I_{4}$ is as defined in Lemma 3 and (90) uses $V=\max_{i}V_{i}$ .

C-3 Bound on $I_{4}$

The term $I_{4}$ collects the error from the terms that arise due to asynchrony. A bound on $I_{4}$ is developed from the use of the delay bound assumption in (A4). Adding and subtracting $\langle{\boldsymbol{{\bm{\lambda}}}}_{t}^{i-1},\nabla D^{i}({\boldsymbol{{\bm{\lambda}}}}_{t-\tau_{i}(t)}^{i-1})\rangle$ to each summand of $I_{4}$ , we obtain

[TABLE]

Of these, the first term in (C-3) can be bounded using the bound in Lemma 2 and the Cauchy-Schwarz inequality, by observing that

[TABLE]

where (92) follows from (A7) and (93) from the bound developed in (57).

For the second term in (C-3), recalling the definition of ${\mathcal{F}}_{t-\delta_{i}(t)}^{i-1}$ from Appendix A, observe that although $\mathbb{E}\left[{\boldsymbol{{\bm{\lambda}}}}_{t}^{i}\mid{\mathcal{F}}_{t-\delta_{i}(t)}^{i-1}\right]\neq{\boldsymbol{{\bm{\lambda}}}}_{t}^{i}$ , there exists some $\kappa_{i}(t)\leq t$ such that

[TABLE]

Indeed, observe that $\kappa_{i}(t)\geq t-\delta_{i}(t)$ since ${\boldsymbol{{\bm{\lambda}}}}_{t-\tau_{i}(t)}^{i-1}$ only depends on random variables contained in ${\mathcal{F}}_{t-\delta_{i}(t)}^{i-1}$ . The subsequent bounds hold for any $\kappa_{i}(t)$ that satisfies (94), including for the worst case when $\kappa_{i}(t)=t-\delta_{i}(t)$ . It follows that

[TABLE]

From (59) and (94), it follows that the first summand in (95) is zero. The second summand can be bounded by using the Cauchy-Schwarz inequality and the bounds in (A4) and (57) as follows:

[TABLE]

where the inequality in (97) follows since $t-\tau_{i}(t)\leq t-\delta_{i}(t)\leq\kappa_{i}(t)\leq t$ . Finally, substituting (97) and (93) into (C-3) yields

[TABLE]

which together with (90) gives the desired bound. ∎
