Quantized Distributed Online Projection-free Convex Optimization

Wentao Zhang; Yang Shi; Baoyong Zhang; Kaihong Lu; Deming Yuan

arXiv:2302.13559·math.OC·May 9, 2023·IEEE Control. Syst. Lett.

Open Access

TL;DR

This paper introduces a quantized distributed online optimization algorithm for convex problems over multi-agent networks, balancing communication efficiency and convergence performance through a novel quantization approach.

Contribution

It develops a new quantized algorithm for distributed online convex optimization that reduces communication costs while maintaining theoretical convergence guarantees.

Findings

1. Achieves dynamic regret bounds under different quantizer settings.

2. Demonstrates trade-off between quantization level and convergence performance.

3. Validates results with simulations on distributed linear regression.

Equations

$$\min_{\bm{x}_{t}\in\bm{X}}\ \sum_{t=1}^{T}F_{t}(\bm{x}_{t})$$

$$\textbf{Regret}_{d}^{j}(T)=\sum_{t=1}^{T}F_{t}(\bm{x}_{j,t})-\sum_{t=1}^{T}F_{t}(\bm{x}_{t}^{*})$$

$$H_{T}$$

$$\mathbb{E}[\mathds{Q}_{t}(\bm{y})]=\bm{y},\quad\mathbb{E}[\|\mathds{Q}_{t}(\bm{y})-\bm{y}\|^{2}]\leq\epsilon_{d,k_{t}}\|\bm{y}\|^{2},\quad t\in[T]$$

$$\mathds{Q}_{t}(a_{i})=\begin{cases}{\overline{a_{i}}}^{t}, & \text{w.p. } (a_{i}-{\underline{a_{i}}}^{t})k_{t},\\ {\underline{a_{i}}}^{t}, & \text{w.p. } (\overline{a_{i}}^{t}-a_{i})k_{t}.\end{cases}$$

$$\left\{\begin{array}{rcl}\bm{x}_{a,t}&=&\frac{1}{n}\sum_{i=1}^{n}\bm{x}_{i,t},\quad\bm{v}_{a,t}=\frac{1}{n}\sum_{i=1}^{n}\bm{v}_{i,t}\\ \bm{e}_{i,t}&=&\mathds{Q}_{t}(\bm{x}_{i,t})-\bm{x}_{i,t}\\ \bm{\theta}_{i,t}&=&\mathds{Q}_{t}[\nabla f_{i,t}(\hat{\bm{x}}_{i,t})]-\nabla f_{i,t}(\hat{\bm{x}}_{i,t})\\ \bm{\nabla}_{i,t}^{Q}&=&\mathds{Q}_{t}[\nabla f_{i,t}(\hat{\bm{x}}_{i,t})]-\mathds{Q}_{t-1}[\nabla f_{i,t-1}(\hat{\bm{x}}_{i,t-1})]\end{array}\right.$$

$$\begin{aligned}\sum_{t=1}^{T}\sum_{i=1}^{n}\mathbb{E}[\|\hat{\bm{x}}_{i,t}-\bm{x}_{a,t}\|]\leq{}&\frac{n\Gamma}{1-\sigma}\sum_{j=1}^{n}\|\bm{x}_{j,1}\|+\alpha T\frac{2n^{2}R\Gamma}{1-\sigma}\\ &+\Big(1+\frac{n\Gamma\sigma}{1-\sigma}\Big)\sum_{t=1}^{T}\sum_{i=1}^{n}\mathbb{E}[\|\bm{e}_{i,t}\|]\end{aligned}$$

$$\begin{aligned}\sum_{t=1}^{T}\sum_{i=1}^{n}\mathbb{E}\Big[\Big\|\widehat{\bm{s}}_{i,t}-\frac{1}{n}\nabla F_{t}(\bm{x}_{a,t})\Big\|\Big]\leq{}&C_{1}+G_{X}C_{2}\sum_{t=1}^{T}\sum_{i=1}^{n}\mathbb{E}[\|\hat{\bm{x}}_{i,t}-\bm{x}_{a,t}\|]+nL_{X}C_{2}\sum_{t=1}^{T}\epsilon_{d,k_{t}}\\ &+\frac{n\Gamma G_{X}}{1-\sigma}\sum_{t=1}^{T}\sum_{i=1}^{n}\mathbb{E}[\|\bm{e}_{i,t}\|]+\frac{n^{2}\Gamma}{1-\sigma}D_{T}+\frac{2n^{2}\Gamma RG_{X}}{1-\sigma}\alpha T\end{aligned}$$

$$\mathbb{E}[\textbf{Regret}_{d}^{j}(T)]\leq\ \cdots\ +\frac{D_{5}}{\alpha}\sum_{t=1}^{T}\epsilon_{d,k_{t}}+\frac{2n}{\alpha}H_{T}+D_{6}D_{T}$$

$$\begin{aligned}D_{1}&=nL_{X}\sum_{i=1}^{n}\|\bm{x}_{i,1}-\bm{x}_{a,1}\|+\frac{n\Gamma E_{0}}{1-\sigma}\sum_{i=1}^{n}\|\bm{x}_{i,1}\|+4RC_{1},\\ D_{2}&=4nR(nL_{X}+G_{X}R)+\frac{2n^{2}R\Gamma E_{0}}{1-\sigma}+\frac{8n^{2}\Gamma G_{X}R^{2}}{2},\\ D_{3}&=nRE_{0}\Big(1+\frac{n\Gamma\sigma}{1-\sigma}\Big)+n^{2}L_{X}R+4nRL_{X}C_{2}+\frac{4n^{2}\Gamma G_{X}R^{2}}{1-\sigma},\\ D_{4}&=2nL_{X}R,\quad D_{5}=nG_{X}R^{2},\\ D_{6}&=\frac{4n^{2}R\Gamma}{1-\sigma}+2nR,\quad E_{0}=4RC_{2}G_{X}+nL_{X}.\end{aligned}$$

$$\begin{aligned}\bm{x}_{a,t}&=\frac{1-\alpha}{n}\sum_{i=1}^{n}\mathds{Q}_{t-1}(\bm{x}_{i,t-1})+\alpha\bm{v}_{a,t-1}\\ &=\frac{1-\alpha}{n}\sum_{i=1}^{n}\bm{e}_{i,t-1}+(1-\alpha)\bm{x}_{a,t-1}+\alpha\bm{v}_{a,t-1}.\end{aligned}$$

$$\begin{aligned}F_{t}(\bm{x}_{j,t})-F_{t}(\bm{x}_{a,t})&\leq nL_{X}\|\bm{x}_{j,t}-\bm{x}_{a,t}\|\\ &\leq nL_{X}\sum_{i=1}^{n}\|\bm{x}_{i,t}-\bm{x}_{a,t}\|\\ &=nL_{X}\sum_{i=1}^{n}\Big\|\hat{\bm{x}}_{i,t-1}-\bm{x}_{a,t-1}+\alpha(\bm{v}_{i,t-1}-\hat{\bm{x}}_{i,t-1})-\alpha(\bm{v}_{a,t-1}-\bm{x}_{a,t-1})-\frac{1-\alpha}{n}\sum_{i=1}^{n}\bm{e}_{i,t-1}\Big\|\\ &\leq nL_{X}\sum_{i=1}^{n}\|\hat{\bm{x}}_{i,t-1}-\bm{x}_{a,t-1}\|+nL_{X}\sum_{i=1}^{n}\|\bm{e}_{i,t-1}\|+4n^{2}L_{X}\alpha R.\end{aligned}$$

$$\begin{aligned}\mathbb{E}[\textbf{Regret}_{d}^{j}(T)]\leq{}&\sum_{t=1}^{T}\mathbb{E}[F_{t}(\bm{x}_{j,t})-F_{t}(\bm{x}_{a,t})]+\sum_{t=1}^{T}\mathbb{E}[F_{t}(\bm{x}_{a,t})-F_{t}(\bm{x}_{t}^{*})]\\ \leq{}&nL_{X}\sum_{i=1}^{n}\|\bm{x}_{i,1}-\bm{x}_{a,1}\|+nL_{X}\sum_{t=1}^{T-1}\sum_{i=1}^{n}\mathbb{E}[\|\hat{\bm{x}}_{i,t}-\bm{x}_{a,t}\|]\\ &+n^{2}L_{X}R\sum_{t=1}^{T}\sqrt{\epsilon_{d,k_{t}}}+\sum_{t=1}^{T}\mathbb{E}[F_{t}(\bm{x}_{a,t})-F_{t}(\bm{x}_{t}^{*})]+4\alpha Tn^{2}L_{X}R\end{aligned}$$

$$\begin{aligned}F_{t+1}(\bm{x}_{a,t+1})-F_{t+1}(\bm{x}_{a,t})\leq{}&\langle\nabla F_{t+1}(\bm{x}_{a,t}),\bm{x}_{a,t+1}-\bm{x}_{a,t}\rangle+\frac{nG_{X}}{2}\|\bm{x}_{a,t+1}-\bm{x}_{a,t}\|^{2}\\ \leq{}&\Big\langle\nabla F_{t+1}(\bm{x}_{a,t}),\frac{1-\alpha}{n}\sum_{i=1}^{n}\bm{e}_{i,t}+\alpha(\bm{v}_{a,t}-\bm{x}_{a,t})\Big\rangle+\frac{nG_{X}}{2}\Big\|\frac{1-\alpha}{n}\sum_{i=1}^{n}\bm{e}_{i,t}+\alpha(\bm{v}_{a,t}-\bm{x}_{a,t})\Big\|^{2}\\ \leq{}&\alpha\sum_{i=1}^{n}\Big\langle\frac{1}{n}\nabla F_{t+1}(\bm{x}_{a,t}),\bm{v}_{i,t}-\bm{x}_{a,t}\Big\rangle+\frac{1-\alpha}{n}\sum_{i=1}^{n}\Upsilon_{i,t}+nG_{X}\Big(\frac{(1-\alpha)^{2}}{n^{2}}\Big\|\sum_{i=1}^{n}\bm{e}_{i,t}\Big\|^{2}+\alpha^{2}\|\bm{v}_{a,t}-\bm{x}_{a,t}\|^{2}\Big)\\ \leq{}&\alpha\sum_{i=1}^{n}\Big\langle\frac{1}{n}\nabla F_{t+1}(\bm{x}_{a,t}),\bm{v}_{i,t}-\bm{x}_{a,t}\Big\rangle+\frac{1-\alpha}{n}\sum_{i=1}^{n}\Upsilon_{i,t}+G_{X}\sum_{i=1}^{n}\|\bm{e}_{i,t}\|^{2}+4nG_{X}R^{2}\alpha^{2}\end{aligned}$$

$$\begin{aligned}\Big\langle\frac{1}{n}\nabla F_{t+1}(\bm{x}_{a,t}),\bm{v}_{i,t}-\bm{x}_{a,t}\Big\rangle\leq{}&\Big\langle\frac{1}{n}\nabla F_{t+1}(\bm{x}_{a,t})-\widehat{\bm{s}}_{i,t},\bm{v}_{i,t}-\bm{x}_{a,t}\Big\rangle+\langle\widehat{\bm{s}}_{i,t},\bm{x}_{t}^{*}-\bm{x}_{a,t}\rangle\\ \leq{}&\frac{2R}{n}\|\nabla F_{t+1}(\bm{x}_{a,t})-\nabla F_{t}(\bm{x}_{a,t})\|+4R\,\frac{1}{n}\nabla F_{t}(\bm{x}_{a,t})\ \ldots\end{aligned}$$

Taxonomy

Topics: Distributed Control Multi-Agent Systems · Energy Efficient Wireless Sensor Networks · Advanced Optimization Algorithms Research

Full text

Quantized Distributed Online Projection-free Convex Optimization

Wentao Zhang

Yang Shi

Baoyong Zhang

Kaihong Lu

Deming Yuan

*Corresponding author: Deming Yuan.

W. Zhang, B. Zhang, and D. Yuan are with the School of Automation, Nanjing University of Science and Technology, Nanjing 210094, Jiangsu, P. R. China.

Y. Shi is with the Department of Mechanical Engineering, University of Victoria, Victoria, BC V8W 2Y2, Canada.

K. Lu is with the College of Electrical Engineering and Automation, Shandong University of Science and Technology, Qingdao 266590, China.

Abstract

This paper considers online distributed convex constrained optimization over a time-varying multi-agent network. Agents in this network cooperate to minimize the global objective function through information exchange with their neighbors and local computation. Since the capacity or bandwidth of communication channels often is limited, a random quantizer is introduced to reduce the transmission bits. Through incorporating this quantizer, we develop a quantized distributed online projection-free optimization algorithm, which can achieve the saving of communication resources and computational costs. For different parameter settings of the quantizer, we establish the corresponding dynamic regret upper bounds of the proposed algorithm and reveal the trade-off between the convergence performance and the quantization effect. Finally, the theoretical results are illustrated by the simulation of distributed online linear regression problem.

I INTRODUCTION

In recent years, online distributed convex optimization has received increasing attention because of its wide applications in areas such as machine learning, sensor networks, and smart grids; see, e.g., [1, 2, 3, 4, 5, 6]. For online optimization problems with constraint sets, various algorithms based on projection operations have been developed, such as distributed online gradient descent [7, 8]. However, in high-dimensional and complex constrained scenarios, including multiclass classification [9] and matrix completion [10, 11], projection operations incur a heavy computational burden. In contrast, projection-free algorithms avoid this burden by replacing the projection with a linear optimization oracle.

In [9], Zhang et al. proposed an early online distributed projection-free algorithm and established a static regret upper bound of $\mathcal{O}(T^{3/4})$ . The works [12, 13, 14] further analyzed the static regret of several projection-free variants. In [15] and [16], dynamic regret bounds were studied for distributed online projection-free algorithms under convex and nonconvex conditions, respectively. Dynamic regret is a more stringent metric than static regret because its reference sequence varies over time. However, the communication channels between agents in [15], [16] are assumed to be perfect, whereas in most applications the channels have limited bandwidth or capacity [17].

Quantized communication can effectively reduce the number of communicated bits, and hence save communication resources [17], by transmitting information that has been quantized by a dedicated quantizer. Several distributed optimization algorithms with quantized communication have been developed [18, 19, 20, 21, 3, 22, 23, 24, 25]. In [18], Xiong et al. investigated the effect of quantization on the convergence performance of the distributed quantized mirror descent algorithm. The works [19, 20, 21] analyzed quantized distributed off-line optimization algorithms based on subgradient and inexact proximal-gradient methods. Doan et al. [22] considered a distributed off-line two-time-scale stochastic approximation algorithm under random quantization and established almost sure convergence to the optimal solution for both convex and strongly convex loss functions. In [23], Li et al. investigated a quantized distributed subgradient optimization algorithm with dynamic encoding and decoding and proved that consensus optimization can be achieved under mild conditions. For the online distributed optimization problem, Yuan et al. [3] proposed a distributed online bandit algorithm under quantized communication and established its static regret. To date, few results address distributed online optimization under quantized communication. This gap motivates us to investigate the dynamic regret of a distributed online projection-free algorithm under quantized communication, together with the effect of the quantizer parameters on the regret bounds.

The main contributions of this work are two-fold. First, motivated by [15] and [3], we develop a quantized distributed online projection-free optimization (Q-DOPFO) algorithm for solving the distributed online constrained optimization problem over a multi-agent network. The proposed algorithm saves communication resources compared with algorithms that transmit real-valued data, and saves computational cost compared with algorithms that use projection operations. Second, for different parameter settings of the quantizer, we establish the corresponding dynamic regret upper bounds of the proposed algorithm and reveal the trade-off between convergence performance and the quantization effect. In particular, when $H_{T}$ is known in advance, the optimal bound $\mathcal{O}(\sqrt{T(1+H_{T})}+D_{T})$ can be achieved under proper parameter settings, where $T$ , $H_{T}$ and $D_{T}$ denote the total time, the function variation and the gradient variation, respectively.

The remainder of the paper is organized as follows. The problem statement, quantizer description, and necessary assumptions are presented in Section II. Section III shows the algorithm design and convergence results. Sections IV and V give simulation examples and conclusion, respectively.

Notation: ${\mathbb{R}}^{n}$ represents the Euclidean space with $n$ dimensions. $[T]$ denotes $\{1,2,\ldots,T\}$ . $\|\bm{z}\|$ denotes the Euclidean norm of a vector $\bm{z}$ . $\lceil\cdot\rceil$ represents the round up function. The boundary of a set $\bm{X}$ is denoted as $\partial\bm{X}$ . The element in the $i$ -th row and $j$ -th column of matrix $W$ is denoted as $[W]_{ij}$ . $[\bm{w}]_{i}$ denotes the $i$ -th element of vector $\bm{w}$ . $\mathbb{B}_{R}^{d}:=\{\bm{z}\in\mathbb{R}^{d}|\ \|\bm{z}\|\leq R\}$ is the closed Euclidean ball with a center point of origin and a radius of $R$ .

II Problem Formulation

II-A The Optimization Problem

Consider a directed time-varying network $\mathcal{G}_{t}=\{\mathcal{V},\mathcal{E}_{t},W_{t}\}$ that consists of $n$ agents, where $\mathcal{V}:=\{1,\ldots,n\}$ is the agent set and $\mathcal{E}_{t}\subseteq\mathcal{V}\times\mathcal{V}$ denotes the edge set. Agent $i$ can receive information from the agents in its in-neighbor set $\mathcal{N}_{i}^{in}(t)=\{j\mid(j,i)\in\mathcal{E}_{t}\}$ . The weight matrix $W_{t}\in\mathbb{R}^{n\times n}$ is doubly stochastic, i.e., $\sum_{j=1}^{n}[W_{t}]_{ij}=\sum_{i=1}^{n}[W_{t}]_{ij}=1$ , $\forall t\in[T],\forall i,j\in\mathcal{V}$ , where $[W_{t}]_{ii}=1-\sum_{j\in\mathcal{N}_{i}^{in}(t)}[W_{t}]_{ij}$ . There exists a constant $\zeta>0$ such that $[W_{t}]_{ij}>\zeta$ for all $t\in[T]$ when $j\in\mathcal{N}_{i}^{in}(t)\cup\{i\}$ , and $[W_{t}]_{ij}=0$ otherwise. The distributed online optimization problem is described as follows:

$$\min_{\bm{x}_{t}\in\bm{X}}\ \sum_{t=1}^{T}F_{t}(\bm{x}_{t}),\tag{1}$$

where $F_{t}(\bm{x})=\sum_{i=1}^{n}{f_{i,t}}(\bm{x})$ and each function $f_{i,t}$ is convex over the convex and compact set $\bm{X}\subset{\mathbb{R}}^{d}$ . Agents in the network cooperate to search for the global optima of Problem (1) through local computation and information exchange with neighboring agents. The metric $\textbf{Regret}_{d}^{j}(T)$ defined in (2) is used to measure the algorithm performance; it represents the difference between the cumulative cost $\sum_{t=1}^{T}F_{t}(\bm{x}_{j,t})$ of agent $j$ over the horizon $T$ and the cumulative cost of the benchmark sequence $\bm{x}_{t}^{*}\in\bm{X}$ .

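$$\textbf{Regret}_{d}^{j}(T)=\sum_{t=1}^{T}F_{t}(\bm{x}_{j,t})-\sum_{t=1}^{T}F_{t}(\bm{x}_{t}^{*}),\tag{2}$$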

where $\bm{x}_{t}^{*}\in{\arg\min}_{\bm{x}\in\bm{X}}F_{t}(\bm{x})$ . Due to this varying benchmark $\bm{x}_{t}^{*}$ , dynamic regret is more stringent than static regret and has wider application scenarios, such as target tracking. It is well known that the upper bound of (2) generally depends on the regularity of the optimization problem. Considering this fact, we define the following function variation $H_{T}$ and gradient variation $D_{T}$ .

[TABLE]

where $f_{t}^{sup}=\max_{i\in\mathcal{V}}\max_{\bm{x}\in\bm{X}}|f_{i,t+1}(\bm{x})-f_{i,t}(\bm{x})|$ , $g_{t}^{sup}=\max_{i\in\mathcal{V}}\max_{\bm{x}\in\bm{X}}\|\nabla f_{i,t+1}(\bm{x})-\nabla f_{i,t}(\bm{x})\|$ .

Our objective is to design a distributed online algorithm with quantized communication for Problem (1) that achieves sublinear dynamic regret of every agent $j\in\mathcal{V}$ .
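The double stochasticity required of $W_{t}$ in this subsection can be checked numerically. The following Python sketch builds one admissible weight matrix using the Metropolis rule on an undirected graph. The Metropolis rule, the undirected ring topology, and the function names are illustrative assumptions and are not taken from the paper, which only requires that $W_{t}$ be doubly stochastic with $[W_{t}]_{ii}=1-\sum_{j}[W_{t}]_{ij}$ .

```python
def metropolis_weights(adjacency):
    """Build a doubly stochastic weight matrix from a symmetric 0/1 adjacency matrix.

    Assumption: the graph is undirected, so the resulting matrix is symmetric.
    """
    n = len(adjacency)
    degree = [sum(1 for j in range(n) if j != i and adjacency[i][j]) for i in range(n)]
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and adjacency[i][j]:
                W[i][j] = 1.0 / (1 + max(degree[i], degree[j]))
        # Self-weight absorbs the remaining mass: [W]_ii = 1 - sum_j [W]_ij.
        W[i][i] = 1.0 - sum(W[i])
    return W


def is_doubly_stochastic(W, tol=1e-9):
    """Check nonnegativity and that every row and every column sums to one."""
    n = len(W)
    rows_ok = all(abs(sum(W[i]) - 1.0) <= tol for i in range(n))
    cols_ok = all(abs(sum(W[i][j] for i in range(n)) - 1.0) <= tol for j in range(n))
    nonneg = all(W[i][j] >= -tol for i in range(n) for j in range(n))
    return rows_ok and cols_ok and nonneg


# Hypothetical example: a ring of four agents.
ring = [[0, 1, 0, 1],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [1, 0, 1, 0]]
W = metropolis_weights(ring)
print(is_doubly_stochastic(W))
```

On this ring every agent has two neighbors, so each neighbor weight and each self-weight equals 1/3, which also gives a concrete value for the lower bound $\zeta$ .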

II-B Random Quantizer

In this section, the following random quantizer is introduced to ensure that each agent in the network uses its quantized information to communicate with its neighbors.

Definition 1 ([3])

$\mathds{Q}_{t}(\bm{y})\in\mathbb{R}^{d}$ is the time-varying random quantizer of a vector $\bm{y}\in\mathbb{R}^{d}$ if it satisfies

$$\mathbb{E}[\mathds{Q}_{t}(\bm{y})]=\bm{y},\quad\mathbb{E}[\|\mathds{Q}_{t}(\bm{y})-\bm{y}\|^{2}]\leq\epsilon_{d,k_{t}}\|\bm{y}\|^{2},\quad t\in[T],$$

where $\epsilon_{d,k_{t}}$ denotes a quantization resolution that depends on the quantization level $k_{t}$ and the dimension $d$ .

Remark 1

Several common quantizers are naturally special cases of this random quantizer, such as randomized gossip [26], rescaled unbiased estimators [26], stochastic $k$ -level quantization [3], probabilistic quantizer [18]. We show the probabilistic quantizer in [18] as an example. Denote $\mathds{Q}_{t}(\bm{y})=[\mathds{Q}_{t}(a_{1}),\mathds{Q}_{t}(a_{2}),\ldots,\mathds{Q}_{t}(a_{d})]^{T}$ , where $a_{i}=[\bm{y}]_{i},i\in[d]$ . Then, for $[\bm{y}]_{i},i\in[d],t\in[T]$ , we have

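$$\mathds{Q}_{t}(a_{i})=\begin{cases}{\overline{a_{i}}}^{t}, & \text{w.p. } (a_{i}-{\underline{a_{i}}}^{t})k_{t},\\ {\underline{a_{i}}}^{t}, & \text{w.p. } (\overline{a_{i}}^{t}-a_{i})k_{t},\end{cases}$$

where ${\overline{a_{i}}}^{t}$ and ${\underline{a_{i}}}^{t}$ are obtained by rounding $a_{i}$ up and down, respectively, to the nearest integer multiple of $1/k_{t}$ . It is not hard to verify that the probabilistic quantizer satisfies Definition 1 with $\epsilon_{d,k_{t}}=d/(4{k_{t}}^{2})$ .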
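The probabilistic quantizer of Remark 1 can be sketched in a few lines of Python. This is a minimal illustration under the description above (round each coordinate up or down to a multiple of $1/k_{t}$ with probabilities proportional to the distance from the opposite grid point); the function name and the `rng` parameter are hypothetical choices, not part of the paper.

```python
import math
import random


def probabilistic_quantize(y, k, rng=random):
    """Quantize each entry of y to a multiple of 1/k so that E[Q(y)] = y."""
    out = []
    for a in y:
        low = math.floor(a * k) / k   # round down to a multiple of 1/k
        high = math.ceil(a * k) / k   # round up to a multiple of 1/k
        if high == low:               # a already lies on the grid
            out.append(low)
            continue
        p_up = (a - low) * k          # probability of rounding up
        out.append(high if rng.random() < p_up else low)
    return out


# Usage sketch with a seeded generator so that runs are repeatable.
rng = random.Random(0)
print(probabilistic_quantize([0.26, -1.37], k=4, rng=rng))
```

Each coordinate moves by at most $1/k_{t}$ , and its variance is $p(1-p)/k_{t}^{2}\leq 1/(4k_{t}^{2})$ where $p$ is the round-up probability, so the squared error summed over $d$ coordinates has expectation at most $d/(4k_{t}^{2})$ .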

Remark 2

According to Definition 1, $\epsilon_{d,k_{t}}$ has a wide range of values and when its value is smaller, the quantized data is closer to the real-value data. Note that large values of $\epsilon_{d,k_{t}}$ are allowed at the early stages of the running algorithm, which means that the quantized data at this stage is coarser and less precise than the real-value data. In order to achieve the sublinear dynamic regret, a sublinearly convergent sequence $\{\epsilon_{d,k_{t}}\}$ over time $t$ is desired and necessary, which can be verified in the following sections.

II-C Some Assumptions

Some necessary assumptions are needed to facilitate the following algorithm development.

Assumption 1

The union $\bigcup_{i=kQ+1}^{(k+1)Q}\mathcal{G}_{i}$ is strongly connected for some positive integer $Q$ and every integer $k\geq 0$ .

Assumption 2

The constraint set $\bm{X}\subset{\mathbb{R}}^{d}$ is convex and compact and satisfies that $\bm{X}\subseteq\mathbb{B}_{R}^{d},R>0$ .

Assumption 3

The function $f_{i,t}$ is $L_{X}$ -Lipschitz, i.e., $|f_{i,t}(\bm{x}_{1})-f_{i,t}(\bm{x}_{2})|\leq L_{X}\|\bm{x}_{1}-\bm{x}_{2}\|$ , $\forall\bm{x}_{1},\bm{x}_{2}\in\bm{X}$ , where $L_{X}$ is a known positive constant.

Assumption 4

The gradient $\nabla f_{i,t}(\bm{x})$ is $G_{X}$ -Lipschitz, i.e., $\|\nabla f_{i,t}(\bm{x}_{1})-\nabla f_{i,t}(\bm{x}_{2})\|\leq{G_{X}}\|\bm{x}_{1}-\bm{x}_{2}\|,\forall\bm{x}_{1},\bm{x}_{2}\in\bm{X}$ , which is equivalent to $f_{i,t}(\bm{x}_{1})-f_{i,t}(\bm{x}_{2})\leq\langle\nabla f_{i,t}(\bm{x}_{2}),\bm{x}_{1}-\bm{x}_{2}\rangle\quad+\frac{G_{X}}{2}\|\bm{x}_{1}-\bm{x}_{2}\|^{2}$ .

Remark 3

Assumptions 1-3 are common in the literature (see [27, 1], [11, 28], etc.) on centralized and distributed optimization. The purpose of assuming $\bm{X}\subseteq\mathbb{B}_{R}^{d}$ is to ensure that the variance of the random quantizer is bounded, i.e., $\mathbb{E}[\|\mathds{Q}_{t}(\bm{y})-\bm{y}\|^{2}]\leq\epsilon_{d,k_{t}}R^{2}$ , which is a necessary precondition. It is worth noting that Assumption 3 implies $\|\nabla f_{i,t}(\bm{x})\|\leq L_{X}$ according to Lemma 2.6 in [29].

III Algorithm Design and Convergence Analysis

III-A Algorithm Q-DOPFO

In this section, we develop Algorithm Q-DOPFO, which is illustrated in Algorithm 1. The key ingredients of the proposed algorithm include: 1) the quantized data $\mathds{Q}_{t}(\bm{x}_{j,t})$ and $\mathds{Q}_{t}[\nabla f_{i,t}(\hat{\bm{x}}_{i,t})]$ , instead of real-valued data, are utilized to perform consensus steps; 2) gradient tracking technique is introduced to correct the gradient change of loss function by using the global gradient estimation $\widehat{\bm{s}}_{i,t}$ instead of individual agent gradients; 3) the decision variable $\bm{x}_{i,t+1}$ is updated through a linear step. It is worth noting that the use of the random quantizer and projection-free oracle in the proposed algorithm can effectively save communication and computing resources of multi-agent systems.
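Algorithm 1 is not reproduced in this text, so the following Python sketch is only inferred from the relations used in the analysis: the linear oracle returns $\bm{v}_{i,t}\in\arg\min_{\bm{v}\in\bm{X}}\langle\widehat{\bm{s}}_{i,t},\bm{v}\rangle$ , and the decision moves by $\bm{x}_{i,t+1}=\hat{\bm{x}}_{i,t}+\alpha(\bm{v}_{i,t}-\hat{\bm{x}}_{i,t})$ . Taking $\bm{X}$ to be the Euclidean ball $\mathbb{B}_{R}^{d}$ itself, and the function names, are assumptions made for illustration; the consensus and gradient tracking steps are not shown.

```python
import math


def lmo_euclidean_ball(s, radius):
    """Linear oracle on the ball: argmin over ||v|| <= radius of <s, v>."""
    norm = math.sqrt(sum(c * c for c in s))
    if norm == 0.0:
        return [0.0] * len(s)  # every feasible point is optimal; pick the center
    return [-radius * c / norm for c in s]


def projection_free_step(x_hat, s_hat, alpha, radius):
    """One linear step x_hat + alpha * (v - x_hat), with v from the linear oracle."""
    v = lmo_euclidean_ball(s_hat, radius)
    return [xh + alpha * (vi - xh) for xh, vi in zip(x_hat, v)]


# Usage sketch with hypothetical values for the state and the gradient estimate.
x_next = projection_free_step([0.6, 0.0], [3.0, 4.0], alpha=0.5, radius=1.0)
print(x_next)
```

Because the new point is a convex combination of two points of a convex set, it stays feasible whenever `x_hat` is feasible and $0\leq\alpha\leq 1$ , which is why no projection is needed.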

In some extreme situations, such as $\bm{x}_{i,t}\in\partial\bm{X}$ at time $t$ , the quantized state $\mathds{Q}_{t}(\bm{x}_{i,t})$ may occasionally leave the constraint set because of the quantizer. However, because $\bm{x}_{t}^{*}$ varies over time and the quantizer is random, $\mathds{Q}_{t}(\bm{x}_{i,t})$ does not violate the set $\bm{X}$ at every step. To ensure that the updated decision $\bm{x}_{i,t+1}$ is always feasible, we require the quantized state $\mathds{Q}_{t}(\bm{x}_{i,t})$ to lie in the set $\bm{X}$ for all $t$ ; see step 4 of Algorithm 1.

III-B Main Convergence Results

In this section, we establish several lemmas and a bound on the dynamic regret defined in (2) for Algorithm 1. To facilitate the analysis, we define the transition matrix $\Phi(t,s)=W_{t}W_{t-1}\ldots W_{s}$ for all $t,s$ with $t\geq s\geq 1$ , and we define the running average vectors $\bm{x}_{a,t}$ , $\bm{v}_{a,t}$ , the quantization errors $\bm{e}_{i,t},\bm{\theta}_{i,t}$ and the difference of quantized gradients $\bm{\nabla}_{i,t}^{Q}$ as follows.

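$$\left\{\begin{array}{rcl}\bm{x}_{a,t}&=&\frac{1}{n}\sum_{i=1}^{n}\bm{x}_{i,t},\quad\bm{v}_{a,t}=\frac{1}{n}\sum_{i=1}^{n}\bm{v}_{i,t}\\ \bm{e}_{i,t}&=&\mathds{Q}_{t}(\bm{x}_{i,t})-\bm{x}_{i,t}\\ \bm{\theta}_{i,t}&=&\mathds{Q}_{t}[\nabla f_{i,t}(\hat{\bm{x}}_{i,t})]-\nabla f_{i,t}(\hat{\bm{x}}_{i,t})\\ \bm{\nabla}_{i,t}^{Q}&=&\mathds{Q}_{t}[\nabla f_{i,t}(\hat{\bm{x}}_{i,t})]-\mathds{Q}_{t-1}[\nabla f_{i,t-1}(\hat{\bm{x}}_{i,t-1})]\end{array}\right.$$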

Lemma 1

Let the decision sequence $\{\bm{x}_{i,t}\}$ be generated by Algorithm 1. Then, under Assumptions 1 and 2, we have for $T\geq 2$ that

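$$\begin{aligned}\sum_{t=1}^{T}\sum_{i=1}^{n}\mathbb{E}[\|\hat{\bm{x}}_{i,t}-\bm{x}_{a,t}\|]\leq{}&\frac{n\Gamma}{1-\sigma}\sum_{j=1}^{n}\|\bm{x}_{j,1}\|+\alpha T\frac{2n^{2}R\Gamma}{1-\sigma}\\ &+\Big(1+\frac{n\Gamma\sigma}{1-\sigma}\Big)\sum_{t=1}^{T}\sum_{i=1}^{n}\mathbb{E}[\|\bm{e}_{i,t}\|],\end{aligned}$$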

where $\sigma=(1-{\zeta}/{4n^{2}})^{1/Q},\Gamma=(1-{\zeta}/{4n^{2}})^{(1-2Q)/Q}$ .
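To see how the network constants in Lemma 1 react to the connectivity period $Q$ , the following Python sketch simply evaluates $\sigma=(1-\zeta/(4n^{2}))^{1/Q}$ and $\Gamma=(1-\zeta/(4n^{2}))^{(1-2Q)/Q}$ . The function name and the numerical values of $\zeta$ , $n$ and $Q$ are hypothetical.

```python
def mixing_constants(zeta, n, Q):
    """Return (sigma, Gamma) as defined after Lemma 1."""
    base = 1.0 - zeta / (4.0 * n ** 2)
    sigma = base ** (1.0 / Q)
    big_gamma = base ** ((1.0 - 2.0 * Q) / Q)
    return sigma, big_gamma


# Usage sketch: compare a graph connected at every step (Q = 1) with Q = 5.
for Q in (1, 5):
    sigma, big_gamma = mixing_constants(zeta=0.1, n=10, Q=Q)
    print(Q, sigma, big_gamma, 1.0 / (1.0 - sigma))
```

Since the base is below one, a larger $Q$ pushes $\sigma$ toward one and increases both $\Gamma$ and $1/(1-\sigma)$ , which loosens every term of the bound that carries these factors.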

Lemma 2

Let the sequence $\{\widehat{\bm{s}}_{i,t},{\nabla}f_{i,t}(\hat{\bm{x}}_{i,t})\}$ be generated by Algorithm 1. Then, under Assumptions 1 and 4, we have for any $T\geq 2$ that

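$$\begin{aligned}\sum_{t=1}^{T}\sum_{i=1}^{n}\mathbb{E}\Big[\Big\|\widehat{\bm{s}}_{i,t}-\frac{1}{n}\nabla F_{t}(\bm{x}_{a,t})\Big\|\Big]\leq{}&C_{1}+G_{X}C_{2}\sum_{t=1}^{T}\sum_{i=1}^{n}\mathbb{E}[\|\hat{\bm{x}}_{i,t}-\bm{x}_{a,t}\|]+nL_{X}C_{2}\sum_{t=1}^{T}\epsilon_{d,k_{t}}\\ &+\frac{n\Gamma G_{X}}{1-\sigma}\sum_{t=1}^{T}\sum_{i=1}^{n}\mathbb{E}[\|\bm{e}_{i,t}\|]+\frac{n^{2}\Gamma}{1-\sigma}D_{T}+\frac{2n^{2}\Gamma RG_{X}}{1-\sigma}\alpha T,\end{aligned}$$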

where $C_{1}=\frac{\sigma n\Gamma\sqrt{\epsilon_{d,k_{1}}}+n\Gamma}{1-\sigma}\sum_{i=1}^{n}\|{\nabla}f_{i,1}(\hat{\bm{x}}_{i,1})\|,C_{2}=\frac{2n\Gamma}{1-\sigma}+1$ .

Theorem 1

Let the decision sequence $\{\bm{x}_{i,t}\}$ be generated by Algorithm 1 and suppose Assumptions 1-4 hold. Then, for $T\geq 2$ and $j\in\mathcal{V}$ , the regret is bounded as follows:

[TABLE]

where

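$$\begin{aligned}D_{1}&=nL_{X}\sum_{i=1}^{n}\|\bm{x}_{i,1}-\bm{x}_{a,1}\|+\frac{n\Gamma E_{0}}{1-\sigma}\sum_{i=1}^{n}\|\bm{x}_{i,1}\|+4RC_{1},\\ D_{2}&=4nR(nL_{X}+G_{X}R)+\frac{2n^{2}R\Gamma E_{0}}{1-\sigma}+\frac{8n^{2}\Gamma G_{X}R^{2}}{2},\\ D_{3}&=nRE_{0}\Big(1+\frac{n\Gamma\sigma}{1-\sigma}\Big)+n^{2}L_{X}R+4nRL_{X}C_{2}+\frac{4n^{2}\Gamma G_{X}R^{2}}{1-\sigma},\\ D_{4}&=2nL_{X}R,\quad D_{5}=nG_{X}R^{2},\\ D_{6}&=\frac{4n^{2}R\Gamma}{1-\sigma}+2nR,\quad E_{0}=4RC_{2}G_{X}+nL_{X}.\end{aligned}$$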

Proof. Based on Algorithm 1 and double stochasticity of $W_{t-1}$ , we obtain that

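$$\begin{aligned}\bm{x}_{a,t}&=\frac{1-\alpha}{n}\sum_{i=1}^{n}\mathds{Q}_{t-1}(\bm{x}_{i,t-1})+\alpha\bm{v}_{a,t-1}\\ &=\frac{1-\alpha}{n}\sum_{i=1}^{n}\bm{e}_{i,t-1}+(1-\alpha)\bm{x}_{a,t-1}+\alpha\bm{v}_{a,t-1}.\end{aligned}$$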

Thus, according to Assumptions 2 and 3, for any $t\geq 2$ , we have that

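$$\begin{aligned}F_{t}(\bm{x}_{j,t})-F_{t}(\bm{x}_{a,t})&\leq nL_{X}\|\bm{x}_{j,t}-\bm{x}_{a,t}\|\\ &\leq nL_{X}\sum_{i=1}^{n}\|\bm{x}_{i,t}-\bm{x}_{a,t}\|\\ &=nL_{X}\sum_{i=1}^{n}\Big\|\hat{\bm{x}}_{i,t-1}-\bm{x}_{a,t-1}+\alpha(\bm{v}_{i,t-1}-\hat{\bm{x}}_{i,t-1})-\alpha(\bm{v}_{a,t-1}-\bm{x}_{a,t-1})-\frac{1-\alpha}{n}\sum_{i=1}^{n}\bm{e}_{i,t-1}\Big\|\\ &\leq nL_{X}\sum_{i=1}^{n}\|\hat{\bm{x}}_{i,t-1}-\bm{x}_{a,t-1}\|+nL_{X}\sum_{i=1}^{n}\|\bm{e}_{i,t-1}\|+4n^{2}L_{X}\alpha R.\end{aligned}$$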

Recalling the regret notion defined in (2) and combining (III-B), we obtain that

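$$\begin{aligned}\mathbb{E}[\textbf{Regret}_{d}^{j}(T)]\leq{}&\sum_{t=1}^{T}\mathbb{E}[F_{t}(\bm{x}_{j,t})-F_{t}(\bm{x}_{a,t})]+\sum_{t=1}^{T}\mathbb{E}[F_{t}(\bm{x}_{a,t})-F_{t}(\bm{x}_{t}^{*})]\\ \leq{}&nL_{X}\sum_{i=1}^{n}\|\bm{x}_{i,1}-\bm{x}_{a,1}\|+nL_{X}\sum_{t=1}^{T-1}\sum_{i=1}^{n}\mathbb{E}[\|\hat{\bm{x}}_{i,t}-\bm{x}_{a,t}\|]\\ &+n^{2}L_{X}R\sum_{t=1}^{T}\sqrt{\epsilon_{d,k_{t}}}+\sum_{t=1}^{T}\mathbb{E}[F_{t}(\bm{x}_{a,t})-F_{t}(\bm{x}_{t}^{*})]+4\alpha Tn^{2}L_{X}R,\end{aligned}$$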

where the last inequality is obtained by using the fact $\mathbb{E}[\|\bm{e}_{i,t}\|]\leq\sqrt{\mathbb{E}[\|\bm{e}_{i,t}\|^{2}]}\leq\sqrt{\epsilon_{d,k_{t}}\|\bm{x}_{i,t}\|^{2}}\leq R\sqrt{\epsilon_{d,k_{t}}}$ .

Next, we aim to bound the term $\sum_{t=1}^{T}\mathbb{E}[F_{t}{(\bm{x}_{a,t})}-F_{t}(\bm{x}_{t}^{*})]$ in (III-B). By using Assumption 4, we have

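$$\begin{aligned}F_{t+1}(\bm{x}_{a,t+1})-F_{t+1}(\bm{x}_{a,t})\leq{}&\langle\nabla F_{t+1}(\bm{x}_{a,t}),\bm{x}_{a,t+1}-\bm{x}_{a,t}\rangle+\frac{nG_{X}}{2}\|\bm{x}_{a,t+1}-\bm{x}_{a,t}\|^{2}\\ \leq{}&\Big\langle\nabla F_{t+1}(\bm{x}_{a,t}),\frac{1-\alpha}{n}\sum_{i=1}^{n}\bm{e}_{i,t}+\alpha(\bm{v}_{a,t}-\bm{x}_{a,t})\Big\rangle+\frac{nG_{X}}{2}\Big\|\frac{1-\alpha}{n}\sum_{i=1}^{n}\bm{e}_{i,t}+\alpha(\bm{v}_{a,t}-\bm{x}_{a,t})\Big\|^{2}\\ \leq{}&\alpha\sum_{i=1}^{n}\Big\langle\frac{1}{n}\nabla F_{t+1}(\bm{x}_{a,t}),\bm{v}_{i,t}-\bm{x}_{a,t}\Big\rangle+\frac{1-\alpha}{n}\sum_{i=1}^{n}\Upsilon_{i,t}+nG_{X}\Big(\frac{(1-\alpha)^{2}}{n^{2}}\Big\|\sum_{i=1}^{n}\bm{e}_{i,t}\Big\|^{2}+\alpha^{2}\|\bm{v}_{a,t}-\bm{x}_{a,t}\|^{2}\Big)\\ \leq{}&\alpha\sum_{i=1}^{n}\Big\langle\frac{1}{n}\nabla F_{t+1}(\bm{x}_{a,t}),\bm{v}_{i,t}-\bm{x}_{a,t}\Big\rangle+\frac{1-\alpha}{n}\sum_{i=1}^{n}\Upsilon_{i,t}+G_{X}\sum_{i=1}^{n}\|\bm{e}_{i,t}\|^{2}+4nG_{X}R^{2}\alpha^{2},\end{aligned}$$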

where $\Upsilon_{i,t}:=\langle\nabla F_{t+1}(\bm{x}_{a,t}),\bm{e}_{i,t}\rangle$ and the last inequality is obtained by using the fact $\|\sum_{i=1}^{n}\bm{e}_{i,t}\|^{2}\leq n\sum_{i=1}^{n}\|\bm{e}_{i,t}\|^{2}$ [30]. It can be further verified that

[TABLE]

where the first inequality is obtained by utilizing the following optimality condition: $\langle\widehat{\bm{s}}_{i,t},\bm{x}_{t}^{*}\rangle\geq\langle\widehat{\bm{s}}_{i,t},\bm{v}_{i,t}\rangle$ and the last inequality is derived based on the convexity condition of $F_{t}(\bm{x})$ together with Assumption 4. Then, it follows from (III-B) and (III-B) that

[TABLE]

where $\Omega_{t}={4\alpha R}\sum_{i=1}^{n}\|\frac{1}{n}\nabla F_{t}(\bm{x}_{a,t})-\widehat{\bm{s}}_{i,t}\|+\frac{1-\alpha}{n}\sum_{i=1}^{n}\Upsilon_{i,t}+{G_{X}}\sum_{i=1}^{n}\left\|\bm{e}_{i,t}\right\|^{2}+4nG_{X}R^{2}\alpha^{2}$ . Through simplifying (III-B) by using the method similar to Lemma 4 in [15], we obtain that

[TABLE]

For the first term on the right hand of (III-B), we obtain

[TABLE]

Through recalling the definition of $\Omega_{t}$ and combining (III-B), (III-B), (23), (24), Lemmas 1 and 2, we can readily obtain the condition (1). The proof is complete. $\square$

In Theorem 1, (1) shows that the upper bound of the regret depends on $D_{i},i\in\{1,2,\ldots,6\}$ , $\alpha$ , $T$ , $\epsilon_{d,k_{t}}$ , $H_{T}$ and $D_{T}$ , where the $D_{i}$ are scalars determined by the initial values, the optimization problem parameters, and the network parameters $\sigma,\Gamma$ . The shorter the joint connectivity period $Q$ of the graph is, the tighter the regret bound will be, because the parameters $\Gamma$ and $\frac{1}{1-\sigma}$ in the coefficients $D_{1},D_{2},D_{3}$ and $D_{6}$ are smaller. Once the optimization problem and the network graph are fixed, the coefficients $D_{i}$ are finite constants and $D_{T},H_{T}$ have a fixed order in $T$ . The regret bound of Algorithm 1 is then governed by $\alpha$ and $\epsilon_{d,k_{t}}$ , which leads to the following corollary.

Corollary 1

Suppose that the conditions in Theorem 1 and $H_{T}=o(T),D_{T}=o(T)$ hold. Given positive scalars $\kappa_{1},\gamma<1,\kappa_{2}\leq T^{\gamma}$ and $\xi$ , the quantization resolution and the step size are chosen as $\epsilon_{d,k_{t}}=\frac{\kappa_{1}}{t^{\xi}},\alpha=\frac{\kappa_{2}}{T^{\gamma}}$ , respectively. Then, we have that

[TABLE]

where $b:=\min\{\gamma,\xi/2,\xi-\gamma\}$ .

Proof. Substituting the choices of $\epsilon_{d,k_{t}},\alpha$ in Corollary 1 into (1), one can verify that $\sum_{t=1}^{T}\sqrt{\epsilon_{d,k_{t}}}=\sum_{t=1}^{T}\frac{\sqrt{\kappa_{1}}}{t^{\xi/2}}\leq\sqrt{\kappa_{1}}+\sqrt{\kappa_{1}}\int_{1}^{T}\frac{1}{t^{\xi/2}}dt\leq\mathcal{O}(T^{1-\xi/2})$ when $0<\xi<2$ , $\sum_{t=1}^{T}\sqrt{\epsilon_{d,k_{t}}}\leq\mathcal{O}(\ln T)$ when $\xi=2$ , and $\sum_{t=1}^{T}\sqrt{\epsilon_{d,k_{t}}}\leq\mathcal{O}(1)$ when $2<\xi$ , respectively. Similarly, $\sum_{t=1}^{T}\epsilon_{d,k_{t}}$ can be bounded as $\mathcal{O}(T^{1-\xi})$ when $0<\xi<1$ , $\mathcal{O}(\ln T)$ when $\xi=1$ , and $\mathcal{O}(1)$ when $1<\xi$ , respectively. Then, (1) is easily obtained based on the different ranges of $\xi$ . The proof is complete. $\square$
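The sum-versus-integral comparison used in this proof can be checked numerically. The Python sketch below compares $\sum_{t=1}^{T}t^{-p}$ with $1+\int_{1}^{T}t^{-p}dt$ , where $p=\xi/2$ corresponds to the sum of $\sqrt{\epsilon_{d,k_{t}}}$ and $p=\xi$ to the sum of $\epsilon_{d,k_{t}}$ (with $\kappa_{1}=1$ ). The function names and the tested values are illustrative assumptions.

```python
import math


def partial_sum(p, T):
    """Sum of t^(-p) for t = 1, ..., T."""
    return sum(t ** (-p) for t in range(1, T + 1))


def integral_bound(p, T):
    """1 + integral from 1 to T of t^(-p) dt, in closed form."""
    if p == 1.0:
        return 1.0 + math.log(T)
    return 1.0 + (T ** (1.0 - p) - 1.0) / (1.0 - p)


# Usage sketch: the three regimes p < 1, p = 1, p > 1.
for p in (0.5, 1.0, 1.5):
    for T in (10, 1000):
        print(p, T, partial_sum(p, T), integral_bound(p, T))
```

The bound grows like $T^{1-p}$ for $p<1$ , like $\ln T$ for $p=1$ , and stays bounded for $p>1$ , which matches the three cases listed in the proof.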

Remark 4

Note that the decreasing quantization resolution in Corollary 1 allows relatively coarse quantization in the early stage of algorithm execution. In particular, when the parameter $\xi$ is small, the saving of communication resources is significant but the regret bound is loose, so the choice of $\xi$ governs the trade-off between the two. When the total iteration time $T$ is large, the quantization effect weakens and the quantized data approaches the real-valued data, especially in the later stage of the algorithm. This behavior is reasonable, since the state variables can only approach the optima if increasingly precise data is obtained as the algorithm runs.

Remark 5

The result of Corollary 1 matches the centralized result [31] and the distributed results [15, 16] while taking quantized communication into account. Compared with [16], we additionally consider quantized communication and do not require the loss function to be bounded. Moreover, the requirements $H_{T}=o(T)$ and $D_{T}=o(T)$ in Corollary 1 mean that the cumulative variations of the function values and gradients grow slower than $T$ . In other words, the loss functions and their gradients satisfy certain regularity over time, for example when the variability of the function parameters decreases over time. According to Corollary 1, this requirement is reasonable and necessary for guaranteeing sublinear dynamic regret. In addition, if a bound on $H_{T}$ is known in advance, i.e., $H_{T}\leq\mathcal{O}(T^{\theta}),0<\theta<1$ , (1) can be improved to $\mathcal{O}(\sqrt{T(1+H_{T})}+D_{T})$ by setting $\gamma=1/2-\log_{T}\sqrt{1+T^{\theta}}$ .
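The exponent $\gamma=1/2-\log_{T}\sqrt{1+T^{\theta}}$ may look opaque, but it is equivalent to the step size $\alpha=\kappa_{2}T^{-\gamma}=\kappa_{2}\sqrt{(1+T^{\theta})/T}$ . The Python sketch below checks this identity numerically; the function names and the values of $T$ , $\theta$ and $\kappa_{2}$ are hypothetical.

```python
import math


def gamma_exponent(T, theta):
    """gamma = 1/2 - log_T sqrt(1 + T^theta), as in Remark 5."""
    return 0.5 - math.log(math.sqrt(1.0 + T ** theta), T)


def step_size(T, theta, kappa2=1.0):
    """alpha = kappa2 / T^gamma with the exponent above."""
    return kappa2 / T ** gamma_exponent(T, theta)


# Usage sketch: compare with the closed form sqrt((1 + T^theta) / T).
T, theta = 10_000, 0.5
print(step_size(T, theta), math.sqrt((1.0 + T ** theta) / T))
```

With this choice the step size grows with the assumed variation level $T^{\theta}$ , so a more rapidly changing problem is tracked with larger steps.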

IV Simulation

In this section, the following distributed online linear regression problem with a regularization term is simulated to verify the proposed algorithm.

[TABLE]

where $\bm{X}:=\{\bm{x}|\|\bm{x}\|_{1}\leq 2\}$ , $\bm{p}_{i,t}\in\mathbb{R}^{d}$ and $q_{i,t}\in\mathbb{R}$ represent the feature and label information, and $\rho$ is a regularization parameter. The feature vector $\bm{p}_{i,t}$ is generated uniformly at random and each of its elements satisfies $[\bm{p}_{i,t}]_{i}\in[-5,5]$ . The label $q_{i,t}$ satisfies $q_{i,t}=\bm{p}_{i,t}^{\top}\bm{x}_{0}+{\zeta_{i,t}}/(4{t})$ where $\zeta_{i,t}$ is generated randomly in the interval $[0,1]$ . In the following simulation, we set the algorithm parameters $n=10$ , $d=30$ , $\rho=5\times 10^{-6}$ , $\alpha=1/(2T^{0.3})$ and take the probabilistic quantizer mentioned in Remark 1 as an example. To measure the performance of the algorithm, we use the global average dynamic regret $\frac{1}{n}\sum_{j=1}^{n}[\textbf{Regret}_{d}^{j}(T)/T]$ .
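For the constraint set of this simulation, the $\ell_{1}$ ball of radius 2, the linear oracle has a closed form: put all the mass on the coordinate where the direction is largest in magnitude, with the opposite sign. The Python sketch below shows this oracle together with one draw of the synthetic data described above. The function names are hypothetical, and the reference vector `x0` is a placeholder because its value is not given in the text.

```python
import random


def lmo_l1_ball(s, radius=2.0):
    """argmin over ||v||_1 <= radius of <s, v>: a signed vertex of the ball."""
    j = max(range(len(s)), key=lambda i: abs(s[i]))
    v = [0.0] * len(s)
    if s[j] != 0.0:
        v[j] = -radius if s[j] > 0 else radius
    return v


def sample_data(t, x0, rng):
    """Draw a feature vector p in [-5, 5]^d and a label q = <p, x0> + zeta / (4 t)."""
    p = [rng.uniform(-5.0, 5.0) for _ in x0]
    noise = rng.uniform(0.0, 1.0) / (4.0 * t)
    q = sum(pi * xi for pi, xi in zip(p, x0)) + noise
    return p, q


# Usage sketch with d = 30 and a placeholder x0 (assumption, not from the paper).
rng = random.Random(0)
p, q = sample_data(t=5, x0=[0.05] * 30, rng=rng)
print(lmo_l1_ball([1.0, -3.0, 2.0]))
```

The oracle costs one pass over the $d$ coordinates, which is the computational saving the paper attributes to projection-free updates. The label noise shrinks as $1/(4t)$ , so the problem varies less as time grows.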

To investigate the convergence of Algorithm 1 and the effect of the quantization parameters, we compare the global average dynamic regret of Algorithm 1 in several cases: no quantization [15], and quantization levels $k_{t}=\lceil t^{0.8}\rceil,\lceil t^{1}\rceil,\lceil t^{1.3}\rceil$ and $\lceil t^{1.5}\rceil$ . From Fig. 2, Algorithm 1 converges, and the faster $k_{t}$ grows, the better the convergence performance. Note that in the early stage of the algorithm, the performance fluctuation caused by the relatively coarse quantization resolution can be tolerated, and this error diminishes as the iterations proceed. Further, we analyze the effect of capping the quantization level at a maximum number $B\geq k_{t}$ on the convergence performance of the designed algorithm. Taking the quantization level $k_{t}=\lceil t^{1.5}\rceil$ as an example and considering $B=50,80,100,$ the comparison results are shown in Fig. 2. When $B=100$ , the convergence curve is close to the one without a maximum number, whereas when $B=50$ the convergence performance is poor. A quantization level with an appropriate cap $B$ saves more communication resources than an uncapped one, but it always retains a quantization error $\bm{e}_{i,t}$ because the quantized data cannot approach the real-valued data.
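The quantization schedules compared here can be tabulated with a short Python sketch. It evaluates $k_{t}=\lceil t^{c}\rceil$ and the resolution $\epsilon_{d,k_{t}}=d/(4k_{t}^{2})$ of the probabilistic quantizer from Remark 1. Modeling the maximum number $B$ as $k_{t}=\min(\lceil t^{c}\rceil,B)$ is an assumption about how the cap is applied, and the function names are hypothetical.

```python
import math


def quantization_level(t, exponent, cap=None):
    """k_t = ceil(t^exponent), optionally capped at a maximum level (assumed min rule)."""
    k = math.ceil(t ** exponent)
    return k if cap is None else min(k, cap)


def resolution(d, k):
    """Resolution d / (4 k^2) of the probabilistic quantizer in Remark 1."""
    return d / (4.0 * k ** 2)


# Usage sketch with d = 30 as in the simulation.
for t in (1, 10, 100):
    k_free = quantization_level(t, 1.5)
    k_capped = quantization_level(t, 1.5, cap=50)
    print(t, k_free, resolution(30, k_free), k_capped, resolution(30, k_capped))
```

Under the capped schedule the resolution cannot fall below $d/(4B^{2})$ , which is consistent with the observation that a residual quantization error remains.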

Next, we carry out a comparative study for the convergence performance of Algorithm 1 under the step size design taking into account unknown total iteration time $T$ as well as the quantization level $k_{t}=\lceil t^{1.5}\rceil$ . Without loss of generality, setting the step size as $\alpha=1/(2T^{0.3}),0.2,0.1,0.05,0.02$ , respectively, the comparison results of the convergence performance are revealed in Fig. 3. Among the settings of step sizes, the dynamic regret under the step size with the knowledge of $T$ has a significantly better convergence effect, while that under the step size without the knowledge of $T$ has a large fluctuation of the convergence performance for different settings. Although the latter does not require prior knowledge of $T$ , it always has a performance gap in a theoretical sense according to Theorem 1. In addition, as the horizon $T$ varies, this step size without the horizon $T$ may cause the original convergence performance to be unmaintainable due to its invariant setting. Finally, the effect of the number of agents on the convergence performance is studied under $k_{t}=\lceil t^{1.5}\rceil$ . Through setting $n=10,30,50$ , the comparison of the global average dynamic regret is shown in Fig. 4, which verifies the theoretical results in Theorem 1 that the smaller the value of $n$ is, the better the convergence performance of Algorithm 1 is.

V CONCLUSIONS

For the distributed online constrained optimization problem under quantized communication, this paper has developed a quantized distributed online projection-free optimization algorithm. The use of random quantizers and a linear oracle in the proposed algorithm saves both communication resources and computational costs. For different settings of the quantization resolution $\epsilon_{d,k_{t}}$, the corresponding dynamic regret bounds have been established; in particular, the optimal bound $\mathcal{O}(\sqrt{T(1+H_{T})}+D_{T})$ can be achieved when $H_{T}$ is known in advance and $\xi>1$. In addition, we have revealed the trade-off between the convergence performance and the quantization effect. Finally, a simulation example has been presented to verify the theoretical results. A promising future direction is to investigate the case of nonconvex loss functions, which is more general yet more challenging.

-A Proof of Lemma 1

According to Algorithm 1, we get

[TABLE]

According to Algorithm 1, the term ${\bm{x}}_{a,t}$ can be further simplified as follows:

[TABLE]

where the second equality uses the double stochasticity of the weight matrix $W_{t-1}$.

Combining (-A) and (-A), for $t\geq 2$, we obtain

[TABLE]

where the second inequality follows from the fact that $\bm{v}_{j,l},\hat{\bm{x}}_{j,l}\in\bm{X}$ and from the property [27]: $\left|[\Phi(t,s)]_{ij}-\frac{1}{n}\right|\leq\Gamma\sigma^{(t-s)}$, $\forall i,j\in\mathcal{V}$, with $\sigma=(1-\zeta/4n^{2})^{1/Q}$ and $\Gamma=(1-\zeta/4n^{2})^{(1-2Q)/Q}$.
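The constants in this mixing bound are easy to evaluate numerically. The sketch below is illustrative only: the function names are hypothetical, $\zeta/4n^{2}$ is read as $\zeta/(4n^{2})$, and $\zeta$ and $Q$ are taken to be the weight lower bound and connectivity period from the paper's network assumption, which is not restated in this section.

```python
def mixing_constants(n, zeta, Q):
    """Return (sigma, Gamma) with
    sigma = (1 - zeta/(4 n^2))^(1/Q) and Gamma = (1 - zeta/(4 n^2))^((1 - 2Q)/Q).
    """
    base = 1.0 - zeta / (4.0 * n ** 2)
    sigma = base ** (1.0 / Q)
    gamma = base ** ((1.0 - 2.0 * Q) / Q)
    return sigma, gamma

def mixing_bound(n, zeta, Q, gap):
    """Upper bound Gamma * sigma**gap on |[Phi(t, s)]_ij - 1/n| for gap = t - s."""
    sigma, gamma = mixing_constants(n, zeta, Q)
    return gamma * sigma ** gap

if __name__ == "__main__":
    # Agent counts from the simulation; zeta and Q are placeholder values.
    for n in (10, 30, 50):
        print(n, mixing_constants(n, zeta=0.1, Q=1))
```

For fixed $\zeta$ and $Q$, $\sigma$ moves closer to 1 as $n$ grows, so the bound decays more slowly for larger networks.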

Summing from $i=1$ to $n$ and $t=1$ to $T$ on both sides of (-A), we get

[TABLE]

where the second inequality follows the fact

[TABLE]

The proof is complete. $\square$

-B Proof of Lemma 2

According to Assumption 4, we establish that

[TABLE]

For the first term on the right hand side of (-B), from Algorithm 1, for any $t\geq 2$ it can be verified that

[TABLE]

Note that $\overline{\nabla}f_{i,1}=\mathds{Q}_{1}[\nabla f_{i,1}(\hat{\bm{x}}_{i,1})]$ from Algorithm 1. Hence, the equality $\sum_{i=1}^{n}\overline{\nabla}f_{i,t}=\sum_{i=1}^{n}\mathds{Q}_{t}[\nabla f_{i,t}(\hat{\bm{x}}_{i,t})]$ holds for $t=1$. Now assume that $\sum_{i=1}^{n}\overline{\nabla}f_{i,t-1}=\sum_{i=1}^{n}\mathds{Q}_{t-1}[\nabla f_{i,t-1}(\hat{\bm{x}}_{i,t-1})]$ holds at time $t-1$; we show that the same conclusion holds at time $t$. Indeed,

[TABLE]

where the last equality follows from the double stochasticity of $W_{t}$ . With this condition, we obtain for any $t\geq 2$ that

[TABLE]
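The sum-preservation property established by the induction above can be checked numerically. The sketch below is not Algorithm 1 itself: it assumes the standard gradient-tracking form $\overline{\nabla}f_{i,t}=\widehat{\bm{s}}_{i,t-1}+q_{i,t}-q_{i,t-1}$ with $\widehat{\bm{s}}_{i,t}=\sum_{j}[W_{t}]_{ij}\overline{\nabla}f_{j,t}$, uses scalar random numbers $q_{i,t}$ as stand-ins for the quantized gradients, and builds hypothetical doubly stochastic weights from a ring.

```python
import random

def ring_weights(n, a):
    """Doubly stochastic matrix a*I + (1-a)*P, where P is a cyclic shift."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        W[i][i] += a
        W[i][(i + 1) % n] += 1.0 - a
    return W

def track(q, weights):
    """Run the assumed tracking recursion.

    q[t][i]    : stand-in for the quantized gradient of agent i at round t.
    weights[t] : doubly stochastic matrix used to mix the round-t estimates.
    Returns the list of tracked estimates, one list of n values per round.
    """
    n = len(q[0])
    gbar = list(q[0])  # initialization: the estimate equals the first gradient
    history = [gbar]
    for t in range(1, len(q)):
        W = weights[t - 1]
        s = [sum(W[i][j] * gbar[j] for j in range(n)) for i in range(n)]
        gbar = [s[i] + q[t][i] - q[t - 1][i] for i in range(n)]
        history.append(gbar)
    return history

def max_sum_gap(q, history):
    """Largest deviation between sum_i gbar_{i,t} and sum_i q_{i,t} over all rounds."""
    return max(abs(sum(g) - sum(qt)) for g, qt in zip(history, q))

if __name__ == "__main__":
    random.seed(0)
    n, T = 5, 50
    q = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(T)]
    weights = [ring_weights(n, random.uniform(0.2, 0.8)) for _ in range(T)]
    print(max_sum_gap(q, track(q, weights)))  # floating-point round-off only
```

The gap stays at round-off level because each column of a doubly stochastic matrix sums to one, so mixing leaves the network-wide sum unchanged.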

Similar to (-A), combining the fact $\widehat{\bm{s}}_{i,1}=\sum_{j=1}^{n}[W_{1}]_{ij}\mathds{Q}_{1}[{\nabla}f_{j,1}(\hat{\bm{x}}_{j,1})]$ , it follows from (-B) and (-B) that

[TABLE]

This implies that

[TABLE]

For the second term on the right-hand side of (-B), recalling the notation $\bm{\nabla}_{i,t}^{Q}$ defined in (12), we obtain

[TABLE]

where the last inequality is obtained based on the fact:

[TABLE]

Substituting the above inequalities into (-B) and combining (-B) with the fact that $\mathbb{E}[\|\bm{\theta}_{i,t}\|]\leq\sqrt{\mathbb{E}[\|\bm{\theta}_{i,t}\|^{2}]}\leq\sqrt{\epsilon_{d,k_{t}}\|\nabla f_{i,t}(\hat{\bm{x}}_{i,t})\|^{2}}\leq L_{X}\sqrt{\epsilon_{d,k_{t}}}$, we readily obtain (14). The proof is complete. $\square$
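The chain of inequalities used in this last step can be unpacked as follows. This is a sketch that assumes $\bm{\theta}_{i,t}$ denotes the quantization error of the local gradient, that the random quantizer satisfies the second-moment bound implied by the middle inequality, and that $L_{X}$ bounds the gradient norm on $\bm{X}$; the labels on the right are annotations added here, not the paper's.

```latex
\begin{align*}
\mathbb{E}[\|\bm{\theta}_{i,t}\|]
  &\leq \sqrt{\mathbb{E}[\|\bm{\theta}_{i,t}\|^{2}]}
  && \text{(Jensen's inequality, concavity of } \sqrt{\cdot}\text{)} \\
  &\leq \sqrt{\epsilon_{d,k_{t}}\,\|\nabla f_{i,t}(\hat{\bm{x}}_{i,t})\|^{2}}
  && \text{(second-moment bound of the quantizer)} \\
  &\leq L_{X}\sqrt{\epsilon_{d,k_{t}}}
  && \text{(}\|\nabla f_{i,t}(\hat{\bm{x}}_{i,t})\|\leq L_{X}\text{)}
\end{align*}
```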

Bibliography

[1] X. Yi, X. Li, T. Yang, L. Xie, T. Chai, and K. H. Johansson, “Distributed bandit online convex optimization with time-varying coupled inequality constraints,” IEEE Transactions on Automatic Control, vol. 66, no. 10, pp. 4620–4635, 2021.
[2] S. Shahrampour and A. Jadbabaie, “Distributed online optimization in dynamic environments using mirror descent,” IEEE Transactions on Automatic Control, vol. 63, no. 3, pp. 714–725, 2017.
[3] D. Yuan, B. Zhang, D. W. Ho, W. X. Zheng, and S. Xu, “Distributed online bandit optimization under random quantization,” Automatica, vol. 146, p. 110590, 2022.
[4] T. Yang, X. Yi, J. Wu, Y. Yuan, D. Wu, Z. Meng, Y. Hong, H. Wang, Z. Lin, and K. H. Johansson, “A survey of distributed optimization,” Annual Reviews in Control, vol. 47, pp. 278–305, 2019.
[5] X. Li, L. Xie, and N. Li, “A survey of decentralized online learning,” arXiv preprint arXiv:2205.00473, 2022.
[6] C. Liu, H. Li, Y. Shi, and D. Xu, “Distributed event-triggered gradient method for constrained convex minimization,” IEEE Transactions on Automatic Control, vol. 65, no. 2, pp. 778–785, 2020.
[7] S. Sundhar Ram, A. Nedić, and V. V. Veeravalli, “Distributed stochastic subgradient projection algorithms for convex optimization,” Journal of Optimization Theory and Applications, vol. 147, no. 3, pp. 516–545, 2010.
[8] F. Yan, S. Sundaram, S. Vishwanathan, and Y. Qi, “Distributed autonomous online learning: Regrets and intrinsic privacy-preserving properties,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 11, pp. 2483–2493, 2013.
