A Hybrid Stochastic Gradient Tracking Method for Distributed Online Optimization Over Time-Varying Directed Networks

Xinli Shi; Xingxing Yuan; Longkang Zhu; Guanghui Wen

arXiv:2508.20645·cs.LG·August 29, 2025

TL;DR

This paper introduces TV-HSGT, a novel distributed online optimization algorithm that effectively handles stochastic gradients over time-varying directed networks, improving convergence and regret bounds without requiring gradient boundedness.

Contribution

The paper presents a hybrid stochastic gradient tracking algorithm that operates over directed networks without Perron vector estimation, enhancing dynamic regret bounds in online optimization.

Findings

1. TV-HSGT outperforms existing methods in dynamic environments.

2. It reduces gradient variance through recursive stochastic gradient integration.

3. Experimental results validate its effectiveness in logistic regression tasks.

Table 1: Comparison with Distributed Online Optimization Algorithms

Works	Weight Matrix	TVN?	SG?	NBG?	Mo. Term?	Regret Type
Shahrampour2018	Undirected, DS	✗	✓	✗	✗	Dynamic
cao2023decentralized	Undirected, DS	✗	✗	✗	✗	Static
Zhang2022SMC	Directed, DS	✓	✗	✗	✗	Static
nazari2022dadam	Undirected, DS	✗	✓	✗	✓	Dynamic
Li2022TCMS	Directed, DS	✓	✓	✗	✗	Dynamic
carnevale2022gtadam	Undirected, DS	✗	✗	✓	✓	Dynamic
Sharma2024TSP	Undirected, DS	✗	✗	✓	✗	Dynamic
Li2024TAC	Directed, RS	✗	✓	✗	✗	Static
yao2025online	Directed, RCS	✗	✗	✗	✗	Dynamic
Ours	Directed, RCS	✓	✓	✓	✓	Dynamic

Abbreviations: DS = doubly stochastic; RS = row-stochastic; RCS = row- and column-stochastic; TVN = time-varying network; SG = stochastic gradient; NBG = no bounded-gradient assumption; Mo. Term = momentum term.

Equations

$$\min_{x\in\mathbb{R}^{d}}f_{t}(x):=\frac{1}{n}\sum_{i=1}^{n}f_{i,t}(x),\quad t\geq 0,$$

$$R_{T}^{d}:=\mathbb{E}\left[\sum_{t=1}^{T}f_{t}(\hat{x}_{t})-\sum_{t=1}^{T}f_{t}(x_{t}^{*})\right],$$

$$q_{t}:=\sup_{i\in\mathcal{V}}\sup_{x\in\mathbb{R}^{d}}\|\nabla f_{i,t+1}(x)-\nabla f_{i,t}(x)\|,$$

$$p_{t}:=\|x_{t+1}^{*}-x_{t}^{*}\|.$$

$$\langle\nabla f_{t}(x)-\nabla f_{t}(y),x-y\rangle\geq\mu\|x-y\|^{2},$$

$$\mathbb{E}\left[\|\nabla\hat{f}_{i,t}(x;\xi_{i,t})-\nabla\hat{f}_{i,t}(y;\xi_{i,t})\|^{2}\right]\leq L_{g}^{2}\|x-y\|^{2}.$$

$$\mathbb{E}\left[\nabla\hat{f}_{i,t}(x,\xi_{i,t})\mid\mathcal{F}_{t}\right]=\nabla f_{i,t}(x),$$

$$\mathbb{E}\left[\|\nabla\hat{f}_{i,t}(x,\xi_{i,t})-\nabla f_{i,t}(x)\|^{2}\mid\mathcal{F}_{t}\right]\leq\sigma^{2},$$

$$\|\nabla f_{i,t}(x)-\nabla f_{i,t}(y)\|\leq L_{g}\|x-y\|,\quad\forall x,y\in\mathbb{R}^{d}.$$

$$z_{i,t+1}=(1-\beta)\left(z_{i,t}-\nabla\hat{f}_{i,t+1}(x_{i,t},\xi_{i,t+1})\right)+\nabla\hat{f}_{i,t+1}(x_{i,t+1},\xi_{i,t+1}),$$

$$z_{i,t+1}=\beta\underbrace{\nabla\hat{f}_{i,t+1}(x_{i,t+1},\xi_{i,t+1})}_{\text{stochastic gradient}}+(1-\beta)\underbrace{\left(z_{i,t}+\nabla\hat{f}_{i,t+1}(x_{i,t+1},\xi_{i,t+1})-\nabla\hat{f}_{i,t+1}(x_{i,t},\xi_{i,t+1})\right)}_{\text{stochastic recursive gradient}}.$$

$$x_{i,t+1}=\sum_{j=1}^{n}[A_{t}]_{ij}(x_{j,t}-\alpha y_{j,t}),$$

$$y_{i,t+1}=\sum_{j=1}^{n}[B_{t}]_{ij}(y_{j,t}+z_{j,t+1}-z_{j,t}).$$

$$[A_{t}]_{ij}>0,\qquad[B_{t}]_{ji}>0,$$

$$\min\nolimits^{+}(A_{t})\geq a,\quad\forall t\geq 0,$$

$$\min\nolimits^{+}(B_{t})\geq b,\quad\forall t\geq 0,$$

$$\|x-\alpha\nabla f(x)-x^{*}\|\leq(1-\mu\alpha)\|x-x^{*}\|,$$

$$\left\|\sum_{i=1}^{k}\mathbf{m}_{i}\right\|^{2}\leq k\sum_{i=1}^{k}\|\mathbf{m}_{i}\|^{2}.$$

$$\left\|\sum_{i=1}^{k}\mathbf{m}_{i}\right\|^{2}\leq\zeta\|\mathbf{m}_{1}\|^{2}+\frac{(k-1)\zeta}{\zeta-1}\sum_{i=2}^{k}\|\mathbf{m}_{i}\|^{2}.$$

$$\|h_{t}(\mathbf{x}_{t})-\nabla f_{t}(\hat{x}_{t})\|\leq\frac{L_{g}}{\sqrt{n}}\|\mathbf{x}_{t}-\hat{\mathbf{x}}_{t}\|,$$

$$\left\|\sum_{i=1}^{n}\gamma_{i}u_{i}-\nu\right\|^{2}=\sum_{i=1}^{n}\gamma_{i}\|u_{i}-\nu\|^{2}-\sum_{i=1}^{n}\gamma_{i}\left\|u_{i}-\sum_{j=1}^{n}\gamma_{j}u_{j}\right\|^{2}.$$

$$\phi_{t+1}^{\top}A_{t}=\phi_{t}^{\top},\quad\forall t\geq 0.$$

$$\pi_{t+1}=B_{t}\pi_{t},\quad\text{with initial value }\pi_{0}=\frac{1}{n}\mathbf{1}.$$

$$\phi_{t+C}^{\top}(A_{t+C-1}\cdots A_{t+1}A_{t})=\phi_{t}^{\top},$$

$$\pi_{t+C}=(B_{t+C-1}\cdots B_{t+1}B_{t})\pi_{t}.$$

$$\sum_{i=1}^{n}\pi_{i}\left\|\sum_{j=1}^{n}A_{ij}x_{j}-\hat{x}_{\phi}\right\|^{2}\leq c\sum_{j=1}^{n}\phi_{j}\|x_{j}-\hat{x}_{\phi}\|^{2},$$

Taxonomy

Topics: Advanced Wireless Network Optimization · Network Traffic and Congestion Control · Energy Efficient Wireless Sensor Networks

Full text

A Hybrid Stochastic Gradient Tracking Method for Distributed Online Optimization Over Time-Varying Directed Networks

Xinli Shi, Xingxing Yuan, Longkang Zhu, Guanghui Wen

Abstract

With the increasing scale and dynamics of data, distributed online optimization has become essential for real-time decision-making in various applications. However, existing algorithms often rely on bounded gradient assumptions and overlook the impact of stochastic gradients, especially in time-varying directed networks. This study proposes a novel Time-Varying Hybrid Stochastic Gradient Tracking algorithm named TV-HSGT, based on hybrid stochastic gradient tracking and variance reduction mechanisms. Specifically, TV-HSGT integrates row-stochastic and column-stochastic communication schemes over time-varying digraphs, eliminating the need for Perron vector estimation or out-degree information. By combining current and recursive stochastic gradients, it effectively reduces gradient variance while accurately tracking global descent directions. Theoretical analysis demonstrates that TV-HSGT can achieve improved bounds on dynamic regret without assuming gradient boundedness. Experimental results on logistic regression tasks confirm the effectiveness of TV-HSGT in dynamic and resource-constrained environments.

Keywords: distributed online optimization; hybrid stochastic gradient tracking; time-varying directed networks; dynamic regret

1 Introduction

Distributed optimization has received significant attention and found applications in various fields such as control, signal processing, and machine learning shahrampour2015distributed ; nedic2017fast ; Shahrampour2017ACC . It aims to solve a large-scale optimization problem by decomposing it into smaller, more tractable subproblems that can be solved iteratively and in parallel by a network of interconnected agents through communication. Most traditional works on distributed optimization focus on static problems, making them unsuitable for dynamic tasks arising in real-world applications, such as networked autonomous vehicles, smart grids, and online machine learning, among others Dall2020 .

Online optimization, which addresses time-varying cost functions, plays a vital role in solving dynamic problems in time-sensitive applications Zinkevich2003 ; Mairal2009 ; Li2024TAC ; Cao2021TAC . In many practical scenarios, such as machine learning with information streams shalev2012online , the objective functions of optimization problems change over time, making them inherently dynamic wei2023distributed ; Zinkevich2003 . Online learning has emerged as a powerful method for handling sequential decision-making tasks in dynamic contexts, enabling real-time operation while ensuring bounded performance loss in terms of regret hazan2016introduction . Regret is the gap between the cumulative objective value achieved by the online algorithm and that of the optimal offline solution li2020distributed ; Shahrampour2018 . In the literature, two types of regret are commonly considered, i.e., static and dynamic regret. The former evaluates the performance of an online algorithm relative to a fixed optimal decision $x^{*}$ , and is typically formulated as $\sum_{t=1}^{T}(f_{t}(x_{t})-f_{t}(x^{*}))$ , where $x_{t}$ denotes the output of the online algorithm and $x^{*}$ is the optimal fixed decision in hindsight, i.e., $x^{*}\in\arg\min_{x}\sum_{t=1}^{T}f_{t}(x)$ . In contrast, the dynamic regret is obtained by replacing the static $x^{*}$ with a dynamic solution $x_{t}^{*}\in\arg\min_{x}f_{t}(x)$ . This makes dynamic regret more suitable for non-stationary environments, although it is generally more challenging to minimize due to the evolving nature of the optimal points. Both metrics are commonly used to assess the performance of online algorithms. Achieving sublinear regret growth, i.e., regret that grows slower than linearly with time, is often regarded as a key indicator of algorithmic efficiency yuan2017adaptive . Therefore, minimizing regret, particularly in terms of establishing sublinear regret bounds, is fundamental to the design and analysis of effective online optimization methods.
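To make the difference between the two regret notions concrete, the following minimal sketch evaluates both on a toy scalar problem. The losses $f_{t}(x)=(x-c_{t})^{2}$, the target sequence, and the "play the previous target" decision rule are illustrative assumptions and are not taken from the paper.

```python
# Toy scalar example: f_t(x) = (x - c_t)^2 with a drifting target c_t.
# Targets and decisions below are hypothetical, chosen only for illustration.

def loss(x, c):
    return (x - c) ** 2

targets = [0.0, 1.0, 2.0, 3.0]      # c_t for t = 1..T
decisions = [0.0, 0.0, 1.0, 2.0]    # x_t: reuse the previously observed target

# Cumulative loss of the online decisions.
online = sum(loss(x, c) for x, c in zip(decisions, targets))

# Static comparator: the single x* minimizing sum_t f_t(x) is the mean target.
x_star = sum(targets) / len(targets)
static_regret = online - sum(loss(x_star, c) for c in targets)

# Dynamic comparator: x_t* = c_t minimizes each f_t separately.
dynamic_regret = online - sum(loss(c, c) for c in targets)

print(static_regret, dynamic_regret)  # -2.0 3.0
```

In this drifting example the static regret is negative, because no fixed decision can follow the moving target, while the dynamic regret stays positive. This is one way to see why the dynamic comparator is the more demanding benchmark in non-stationary settings.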

Distributed online optimization offers a flexible framework for handling dynamic settings, combining the benefits of decentralized computation with the ability to adapt to non-stationary environments. Earlier works hosseini2013online ; yan2012distributed investigate online distributed optimization in networks with doubly stochastic mixing matrices and achieve a static regret bound of $\mathcal{O}(\sqrt{T})$ . Shahrampour2018 further consider dynamic regret for both deterministic and stochastic online distributed optimization. carnevale2022gtadam propose GTAdam without the bounded gradient assumption, combining gradient tracking and adaptive momentum. However, these works assume static or undirected communication topologies, which are insufficient for modeling dynamic networked systems with directional and time-varying interactions. To address this, several algorithms have been developed under time-varying directed graphs with corresponding theoretical guarantees. For instance, Lee2018TCNS propose the ODA-PS algorithm by integrating dual averaging with the Push-Sum protocol over a directed time-varying network, achieving an $\mathcal{O}(\sqrt{T})$ static regret. Li2021TAC further extend the Push-Sum framework to handle inequality-constrained optimization over unbalanced networks, establishing sublinear dynamic regret and constraint violation. Xiong2024TNSE address feedback delays and propose an event-triggered online mirror descent method with regret guarantees. In addition, stochastic gradient methods have been explored to reduce computational costs. Lee2017TAC analyze stochastic dual averaging under gradient noise, while Li2022TCMS introduce a gradient tracking scheme with aggregation variables, achieving regret bounds under both exact and noisy gradients.

Nevertheless, many of the above methods rely on the assumption of uniformly bounded gradients and neglect the high variance commonly encountered in practice. Moreover, few of them nazari2022dadam ; Lee2017TAC ; Li2022TCMS ; Li2024TAC incorporate variance reduction techniques, limiting both accuracy and stability in stochastic settings. To overcome these limitations, recent studies have focused on gradient tracking-based approaches, which aim to approximate global descent directions by dynamically aggregating local gradient information. Zhang2019CDC establish dynamic regret bounds for a basic tracking scheme, while carnevale2022gtadam propose a momentum-enhanced variant inspired by adaptive methods. Sharma2024TSP develop a generalized framework for strongly convex objectives without requiring gradient boundedness, further advancing the applicability of gradient tracking in decentralized online settings.

This work addresses distributed online stochastic optimization over time-varying directed networks under limited computational resources, where agents interact over asymmetric communication links modeled by time-varying row- and column-stochastic mixing matrices. To overcome the challenges introduced by stochastic gradient noise and dynamic topologies, we design a novel online algorithm that incorporates hybrid variance reduction, gradient tracking, and an AB communication scheme Saadatniaki2020TAC ; Pu2021TAC ; Nguyen2023 . Table 1 compares our method with several existing online optimization algorithms in terms of communication schemes, gradient assumptions, and types of regret. The main contributions are summarized as follows:

1. We propose a Time-Varying Hybrid Stochastic Gradient Tracking method, named TV-HSGT, for distributed online optimization over dynamic directed networks. It integrates a hybrid variance reduction strategy by combining current and recursive stochastic gradients. This method effectively reduces the variance introduced by stochastic gradients and accelerates convergence, as demonstrated in our experimental results.

2. To address the limited information access inherent in decentralized systems, the algorithm incorporates a gradient tracking mechanism to approximate the global gradient direction over time-varying directed networks. In addition, an AB communication scheme is employed, utilizing both row-stochastic and column-stochastic weight matrices. This design eliminates the need to estimate the Perron vector, as required in traditional Push-Sum methods, improving practical applicability in directed network settings.

3. The algorithm is implemented within an adapt-then-combine (ATC) framework, which allows for relaxed step-size conditions compared with the combine-then-adapt (CTA) framework li2024npga . We adopt a dynamic regret metric to evaluate performance and introduce a weighted averaging variable to characterize the deviation between local decisions and the global optimal trajectory. Theoretical analysis establishes upper bounds on dynamic regret, and numerical simulations validate the algorithm’s effectiveness in reducing stochastic gradient variance under dynamic and asymmetric communication topologies.

The remainder of this paper is organized as follows. Section 2 formulates the problem and introduces necessary notations. Section 3 presents the proposed TV-HSGT algorithm, and Section 4 analyzes its dynamic regret. Section 5 presents numerical studies. Finally, we conclude the paper and discuss future directions in Section 6.

2 PROBLEM FORMULATION

Consider a networked system composed of $n$ agents, denoted by the set $\mathcal{V}=\{1,2,\dots,n\}$ . The agents communicate through a sequence of time-varying directed graphs $\{\mathcal{G}_{t}=(\mathcal{V},\mathcal{E}_{t})\}_{t\geq 0}$ , where $\mathcal{E}_{t}\subseteq\mathcal{V}\times\mathcal{V}$ represents the set of available communication links at time $t$ . If $(j,i)\in\mathcal{E}_{t}$ , agent $i$ can receive information from agent $j$ at time $t$ . This work aims to solve the following distributed online optimization problem:

$$\min_{x\in\mathbb{R}^{d}}f_{t}(x):=\frac{1}{n}\sum_{i=1}^{n}f_{i,t}(x),\quad t\geq 0,\tag{1}$$

where $x\in\mathbb{R}^{d}$ is the decision variable, and $f_{i,t}(x):\mathbb{R}^{d}\rightarrow\mathbb{R}$ denotes the local loss function of agent $i$ at time $t$ , defined as the expected loss over a local random variable $\xi_{i,t}$ , i.e., $f_{i,t}(x):=\mathbb{E}_{\xi_{i,t}\sim\mathcal{D}_{i,t}}\left[\hat{f}_{i,t}(x;\xi_{i,t})\right],$ where $\xi_{i,t}$ is a random variable following the distribution $\mathcal{D}_{i,t}$ at time $t$ , and $\hat{f}_{i,t}(x;\xi_{i,t})$ denotes the loss function under the sampled random variable $\xi_{i,t}$ . In practical computation, due to limited computational resources, each agent constructs an unbiased stochastic gradient estimator $\nabla\hat{f}_{i,t}(x_{i,t};\xi_{i,t}),$ based on the current sample $\xi_{i,t}$ , and uses it to update its decision variable. The aim of this study is to design a distributed online optimization algorithm tailored to time-varying directed network topologies, where each agent relies solely on limited computational resources and cooperates with neighbors to effectively minimize $f_{t}(x)$ .

Definition 1 (Dynamic Regret).

For a sequence of local decisions $\{x_{i,t}\}$ generated by a given online distributed algorithm, the dynamic regret over $T$ time steps is defined as

$$R_{T}^{d}:=\mathbb{E}\left[\sum_{t=1}^{T}f_{t}(\hat{x}_{t})-\sum_{t=1}^{T}f_{t}(x_{t}^{*})\right],$$

where $\hat{x}_{t}:=\sum_{i=1}^{n}[\phi_{t}]_{i}x_{i,t}$ denotes a weighted average of all agents’ decisions at time $t$ , and $\{x_{t}^{*}\}_{t\geq 1}$ denotes the sequence of minimizers of the global objective functions $f_{t}(x)$ .

To evaluate the algorithm’s performance in a time-varying environment, this work adopts dynamic regret as the performance metric, defined formally in Definition 1. Dynamic regret quantifies the discrepancy between the cumulative loss of an online algorithm and that of a time-dependent sequence of optimal solutions. Various forms of dynamic regret have been proposed in the literature. In particular, the GTAdam framework carnevale2022gtadam considers the version $R_{T}^{d}:=\mathbb{E}\left[\sum_{t=1}^{T}f_{t}(\bar{x}_{t})-\sum_{t=1}^{T}f_{t}(x_{t}^{*})\right],$ where $\bar{x}_{t}:=\frac{1}{n}\sum_{i=1}^{n}x_{i,t}$ is the simple average of agents’ decisions. However, GTAdam assumes undirected networks with doubly stochastic weight matrices. In contrast, this work addresses time-varying directed networks, where the weight matrices are not necessarily symmetric or doubly stochastic. Hence, we adopt a weighted average $\hat{x}_{t}:=\sum_{i=1}^{n}[\phi_{t}]_{i}x_{i,t}$ , as specified in Definition 1, where $\phi_{t}\in\mathbb{R}^{n}$ is a stochastic vector used to accommodate such network structures. Compared with static regret, dynamic regret effectively captures the algorithm’s asymptotic behavior relative to the evolving optimal decisions $\{x_{t}^{*}\}_{t=1}^{T}$ .

The time-variability and non-stationarity of the problem are characterized by two regularity measures that reflect changes in the objective functions and the evolving optimal solutions. Specifically, $q_{t}$ characterizes the maximum discrepancy between the gradients of local objective functions across agents at two consecutive time steps, while $p_{t}$ quantifies the variation between successive optimal solutions. These measures are defined as follows

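$$q_{t}:=\sup_{i\in\mathcal{V}}\sup_{x\in\mathbb{R}^{d}}\|\nabla f_{i,t+1}(x)-\nabla f_{i,t}(x)\|,\qquad p_{t}:=\|x_{t+1}^{*}-x_{t}^{*}\|.$$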
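As a small worked example of the two regularity measures, the sketch below evaluates them for quadratic local losses $f_{i,t}(x)=\frac{1}{2}\|x-c_{i,t}\|^{2}$, an assumption made only for illustration. For these losses the gradient difference $\nabla f_{i,t+1}(x)-\nabla f_{i,t}(x)=c_{i,t}-c_{i,t+1}$ does not depend on $x$, and the global minimizer is the mean of the centers, so both quantities have closed forms. Here $p_{t}$ is taken as the Euclidean norm of the shift in the minimizer, and all names are hypothetical.

```python
import math

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# centers[t][i] is the (hypothetical) center c_{i,t} of agent i at time t.
centers = [
    [[0.0, 0.0], [2.0, 0.0]],   # t = 0
    [[3.0, 4.0], [2.0, 0.0]],   # t = 1
]

def q(t):
    # Largest change of any local gradient between t and t + 1.
    agents = range(len(centers[t]))
    return max(norm(sub(centers[t + 1][i], centers[t][i])) for i in agents)

def p(t):
    # Shift of the global minimizer x_t* = mean_i c_{i,t}.
    return norm(sub(mean(centers[t + 1]), mean(centers[t])))

print(q(0), p(0))
```

Only agent 0 moves in this example, so $q_{0}$ is the full displacement of its center while $p_{0}$ is that displacement divided by the number of agents.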

We impose the following standard assumptions on the loss functions.

Assumption 1.

The global objective function $f_{t}(x)$ is $\mu$ -strongly convex, i.e., for any $x,y\in\mathbb{R}^{d}$ , it holds that

$$\langle\nabla f_{t}(x)-\nabla f_{t}(y),x-y\rangle\geq\mu\|x-y\|^{2},$$

where $\mu>0$ is the strong convexity parameter.

Assumption 2.

For any agent $i\in\mathcal{V}$ , the stochastic gradient estimator is $L_{g}$ -Lipschitz continuous in the mean square sense. That is, for some constant $L_{g}>0$ and any $x,y\in\mathbb{R}^{d}$ , the following inequality holds

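$$\mathbb{E}\left[\|\nabla\hat{f}_{i,t}(x;\xi_{i,t})-\nabla\hat{f}_{i,t}(y;\xi_{i,t})\|^{2}\right]\leq L_{g}^{2}\|x-y\|^{2}.$$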

Let $\mathcal{F}_{t}$ denote the $\sigma$ -algebra generated by $\{\xi_{i,0},\xi_{i,1},\ldots,\xi_{i,t-1}\}$ . The following assumption is widely adopted in distributed stochastic optimization and federated learning 9226112 ; pmlr-v139-xin21a ; 10715643 ; 9713700 .

Assumption 3.

For any agent $i\in\mathcal{V}$ , its stochastic gradient is unbiased and has bounded variance, i.e.,

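$$\mathbb{E}\left[\nabla\hat{f}_{i,t}(x,\xi_{i,t})\mid\mathcal{F}_{t}\right]=\nabla f_{i,t}(x),\qquad\mathbb{E}\left[\|\nabla\hat{f}_{i,t}(x,\xi_{i,t})-\nabla f_{i,t}(x)\|^{2}\mid\mathcal{F}_{t}\right]\leq\sigma^{2},$$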

where $\sigma^{2}\geq 0$ is a finite constant.

Under Assumptions 2 and 3, one can derive that $f_{i,t}(x)$ is $L_{g}$ -smooth, i.e.,

$$\|\nabla f_{i,t}(x)-\nabla f_{i,t}(y)\|\leq L_{g}\|x-y\|,\quad\forall x,y\in\mathbb{R}^{d}.$$

Assumptions 2 and 3 are standard in establishing the convergence of distributed stochastic optimization algorithms pmlr-v139-xin21a ; Huang2024 ; liu2020optimal ; Dinh2022 .

3 PROPOSED ALGORITHMS

In this section, based on an improved stochastic gradient tracking scheme, a novel distributed online optimization algorithm called TV-HSGT is provided to efficiently solve the problem (1) over a time-varying directed network.

We define $\nabla\hat{f}_{i,t+1}(x_{i,t+1},\xi_{i,t+1})$ and $\nabla\hat{f}_{i,t+1}(x_{i,t},\xi_{i,t+1})$ as the stochastic gradients evaluated at $x_{i,t+1}$ and $x_{i,t}$ , respectively, based on the random sample $\xi_{i,t+1}$ . To reduce the variance inherent in stochastic gradient estimation, we adopt a hybrid variance-reduction approach introduced for stochastic optimization problems liu2020optimal ; Dinh2022 ; pmlr-v139-xin21a . Let $z_{i,t}$ denote the hybrid stochastic gradient variable, which is updated as follows

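$$z_{i,t+1}=(1-\beta)\left(z_{i,t}-\nabla\hat{f}_{i,t+1}(x_{i,t},\xi_{i,t+1})\right)+\nabla\hat{f}_{i,t+1}(x_{i,t+1},\xi_{i,t+1}),\tag{3}$$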

where $\beta\in[0,1]$ is the mixing parameter. This update rule is equivalent to

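$$z_{i,t+1}=\beta\underbrace{\nabla\hat{f}_{i,t+1}(x_{i,t+1},\xi_{i,t+1})}_{\text{stochastic gradient}}+(1-\beta)\underbrace{\left(z_{i,t}+\nabla\hat{f}_{i,t+1}(x_{i,t+1},\xi_{i,t+1})-\nabla\hat{f}_{i,t+1}(x_{i,t},\xi_{i,t+1})\right)}_{\text{stochastic recursive gradient}}.$$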

When $\beta=1$ , the method reduces to the standard stochastic gradient, while for $\beta=0$ , it is equivalent to the stochastic recursive gradient method 10.55553305890.3305951 . Compared to classical variance-reduction methods such as SVRG NIPS2013_ac1dd209 and SAGA Defazio2014 , this hybrid strategy offers improved convergence speed and stability pmlr-v139-xin21a .
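The hybrid estimator can be sketched in a few lines. The function below implements the convex combination of the current stochastic gradient and the recursive correction described above; the function and argument names are hypothetical, and the caller is assumed to evaluate both gradients with the same sample $\xi_{i,t+1}$. The second function is the algebraically equivalent compact rearrangement, included only to check the equivalence numerically.

```python
def hybrid_gradient(z_prev, g_new, g_old, beta):
    """Hybrid variance-reduced estimate z_{i,t+1} (illustrative sketch).

    z_prev: previous estimate z_{i,t}
    g_new:  stochastic gradient at x_{i,t+1} using sample xi_{i,t+1}
    g_old:  stochastic gradient at x_{i,t} using the SAME sample xi_{i,t+1}
    beta:   mixing parameter in [0, 1]
    """
    return [beta * gn + (1.0 - beta) * (zp + gn - go)
            for zp, gn, go in zip(z_prev, g_new, g_old)]

def hybrid_gradient_compact(z_prev, g_new, g_old, beta):
    """Same update rearranged as g_new + (1 - beta) * (z_prev - g_old)."""
    return [gn + (1.0 - beta) * (zp - go)
            for zp, gn, go in zip(z_prev, g_new, g_old)]

z_prev = [1.0, 2.0]
g_new = [0.5, -1.0]
g_old = [0.25, 0.0]

# beta = 1 keeps only the fresh stochastic gradient.
print(hybrid_gradient(z_prev, g_new, g_old, 1.0))
# beta = 0 gives the purely recursive update z_prev + g_new - g_old.
print(hybrid_gradient(z_prev, g_new, g_old, 0.0))
```

Reusing one sample for both gradient evaluations is what lets the difference `g_new - g_old` cancel most of the sampling noise when consecutive iterates are close.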

While variance reduction enhances gradient estimation stability, each agent in a distributed setting typically only accesses local information, which may not reflect the global objective direction accurately. To address this, the proposed algorithm incorporates a gradient tracking mechanism for estimating the global gradient direction. In contrast to the commonly used CTA framework 9226112 , our algorithm employs the ATC framework, which outperforms the CTA framework with larger step-sizes cattivelli2009diffusion ; li2024npga . Each agent $i\in\mathcal{V}$ maintains three variables: the decision variable $x_{i,t}\in\mathbb{R}^{d}$ , the hybrid stochastic gradient variable $z_{i,t}\in\mathbb{R}^{d}$ , and the gradient tracking variable $y_{i,t}\in\mathbb{R}^{d}$ . In each iteration, all agents execute the following procedures in parallel.

Each agent $i$ sends $x_{i,t}-\alpha y_{i,t}$ to its out-neighbors $j\in\mathcal{N}^{\text{out}}_{i,t}$ and receives corresponding vectors from its in-neighbors $j\in\mathcal{N}^{\text{in}}_{i,t}$ , then updates its decision variable as

$$x_{i,t+1}=\sum_{j=1}^{n}[A_{t}]_{ij}(x_{j,t}-\alpha y_{j,t}),$$

where $\alpha>0$ is the step size, $\mathcal{N}_{i,t}^{\mathrm{in}}$ and $\mathcal{N}_{i,t}^{\mathrm{out}}$ denote the in-neighbor and out-neighbor sets of agent $i$ at time $t$ , respectively.

Next, the agent computes the hybrid stochastic gradient $z_{i,t+1}$ using (3). It then forms the gradient tracking increment $y_{i,t}+z_{i,t+1}-z_{i,t}$ , transmits $[B_{t}]_{ji}(y_{i,t}+z_{i,t+1}-z_{i,t})$ to each out-neighbor, and updates its gradient tracking variable by

$$y_{i,t+1}=\sum_{j=1}^{n}[B_{t}]_{ij}(y_{j,t}+z_{j,t+1}-z_{j,t}).$$

The detailed execution steps are presented in Algorithm 1.
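The three updates can be put together in a short sketch of one iteration. This is an illustrative reading of the steps described above for scalar decisions ($d=1$), not the authors' reference implementation. The demo setup is an assumption: three agents with noise-free quadratic losses $f_{i}(x)=\frac{1}{2}(x-c_{i})^{2}$, a fixed graph, and the initialization $y_{i,0}=z_{i,0}=\nabla f_{i}(x_{i,0})$. Passing different matrices `A` and `B` on each call would give the time-varying case.

```python
def tv_hsgt_step(x, y, z, A, B, grad_next, alpha, beta):
    """One iteration for scalar decisions (illustrative sketch).

    x, y, z:   lists of length n with x_{i,t}, y_{i,t}, z_{i,t}
    A:         row-stochastic weights, B: column-stochastic weights
    grad_next: grad_next(i, point) returns a stochastic gradient of f_{i,t+1}
               at `point`, using one fixed sample for both calls per agent
    """
    n = len(x)
    # Decision update: mix the locally adapted states x_j - alpha * y_j with A.
    x_new = [sum(A[i][j] * (x[j] - alpha * y[j]) for j in range(n))
             for i in range(n)]
    # Hybrid stochastic gradient update.
    z_new = []
    for i in range(n):
        g_new = grad_next(i, x_new[i])
        g_old = grad_next(i, x[i])
        z_new.append(beta * g_new + (1.0 - beta) * (z[i] + g_new - g_old))
    # Gradient tracking update: mix the increments with B.
    y_new = [sum(B[i][j] * (y[j] + z_new[j] - z[j]) for j in range(n))
             for i in range(n)]
    return x_new, y_new, z_new

# Hypothetical demo: directed edges 0->1, 1->2, 2->0, 0->2 plus self-loops.
centers = [0.0, 3.0, 6.0]

def grad(i, point):
    return point - centers[i]   # exact gradient of 0.5 * (x - c_i)^2

third = 1.0 / 3.0
A = [[0.5, 0.0, 0.5], [0.5, 0.5, 0.0], [third, third, third]]  # rows sum to 1
B = [[third, 0.0, 0.5], [third, 0.5, 0.0], [third, 0.5, 0.5]]  # columns sum to 1

x = [5.0, -1.0, 2.0]
z = [grad(i, x[i]) for i in range(3)]
y = list(z)
for _ in range(300):
    x, y, z = tv_hsgt_step(x, y, z, A, B, grad, alpha=0.1, beta=0.5)
print(x)   # every entry should be close to the mean of the centers
```

Because `B` is column-stochastic, the sum of the tracking variables equals the sum of the hybrid gradient estimates at every iteration, which is the property that lets each $y_{i,t}$ act as a proxy for the global gradient.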

The iterative updates rely on two non-negative weight matrices $A_{t}$ and $B_{t}$ , consistent with the structure of the directed graph $\mathcal{G}_{t}$ . These matrices satisfy

[TABLE]

The following introduces the assumptions related to the time-varying communication networks.

Assumption 4.

For any $t\geq 0$ , the directed graph $\mathcal{G}_{t}$ is strongly connected, and each node $i\in\mathcal{V}$ has a self-loop, i.e., the edge $(i,i)$ exists.

Assumption 4 can be relaxed to the setting of a periodically strongly connected graph sequence. Specifically, if there exists a positive integer $C\geq 1$ such that for any $t\geq 0$ , the union of edge sets $\mathcal{E}_{t}^{C}:=\bigcup_{i=tC}^{(t+1)C-1}\mathcal{E}_{i}$ forms a strongly connected graph over $C$ consecutive iterations, then the sequence is said to be $C$ -strongly connected.
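The relaxed connectivity condition can be checked mechanically on a finite graph sequence. The sketch below, with hypothetical helper names, unions the edge sets over each window of $C$ consecutive steps and tests strong connectivity by forward and backward reachability from node 0. An edge `(j, i)` follows the convention above that agent $i$ receives from agent $j$.

```python
def strongly_connected(n, edges):
    """True if the digraph on nodes 0..n-1 with directed `edges` is strongly connected."""
    def reaches_all(adj):
        seen, stack = {0}, [0]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return len(seen) == n

    fwd = [[] for _ in range(n)]
    rev = [[] for _ in range(n)]
    for j, i in edges:
        fwd[j].append(i)
        rev[i].append(j)
    # Node 0 must reach every node and be reachable from every node.
    return reaches_all(fwd) and reaches_all(rev)

def is_C_strongly_connected(n, edge_sets, C):
    """Check each window E_{tC}, ..., E_{(t+1)C-1} of a finite sequence."""
    for start in range(0, len(edge_sets) - C + 1, C):
        union = set().union(*edge_sets[start:start + C])
        if not strongly_connected(n, union):
            return False
    return True

# One link is active per step; only the union over 3 steps forms a cycle.
edge_sets = [{(0, 1)}, {(1, 2)}, {(2, 0)}] * 2
print(is_C_strongly_connected(3, edge_sets, 3))
print(is_C_strongly_connected(3, edge_sets, 1))
```

In this example no single graph is strongly connected, yet the sequence satisfies the condition with $C=3$.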

Each agent $i$ independently determines the values of $[A_{t}]_{ij}$ for its in-neighbors $j\in\mathcal{N}_{i,t}^{\mathrm{in}}$ , while the corresponding values of $[B_{t}]_{ij}$ are determined by its out-neighbors. We further impose the following assumptions on the matrices $A_{t}$ and $B_{t}$ .

Assumption 5.

For any $t\geq 0$ , $A_{t}$ is row-stochastic associated with $\mathcal{G}_{t}$ , i.e., $A_{t}\mathbf{1}=\mathbf{1}$ , and for some constant $a>0$ , it satisfies

$$\min\nolimits^{+}(A_{t})\geq a,\quad\forall t\geq 0,$$

where $\min\nolimits^{+}(A_{t})$ denotes the smallest positive entry in $A_{t}$ .

Assumption 6.

For any $t\geq 0$ , $B_{t}$ is column-stochastic associated with $\mathcal{G}_{t}$ , i.e., $\mathbf{1}^{\top}B_{t}=\mathbf{1}^{\top}$ , and for some constant $b>0$ , it satisfies

$$\min\nolimits^{+}(B_{t})\geq b,\quad\forall t\geq 0,$$

where $\min\nolimits^{+}(B_{t})$ denotes the smallest positive entry in $B_{t}$ .
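One simple way to obtain matrices satisfying Assumptions 5 and 6 is to use uniform weights, as in the sketch below. The uniform choice is an assumption made for illustration; the assumptions themselves only require stochasticity, consistency with the graph, and a positive lower bound on the nonzero entries. Each receiver fills its own row of $A_{t}$, and each sender fills its own column of $B_{t}$.

```python
def build_weights(n, edges):
    """Uniform-weight A (row-stochastic) and B (column-stochastic).

    edges: set of (j, i) pairs meaning j -> i; self-loops are always added.
    """
    in_nbrs = [{i} for i in range(n)]
    out_nbrs = [{j} for j in range(n)]
    for j, i in edges:
        in_nbrs[i].add(j)
        out_nbrs[j].add(i)
    A = [[0.0] * n for _ in range(n)]
    B = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in in_nbrs[i]:
            A[i][j] = 1.0 / len(in_nbrs[i])    # receiver i sets row i of A
    for j in range(n):
        for i in out_nbrs[j]:
            B[i][j] = 1.0 / len(out_nbrs[j])   # sender j sets column j of B
    return A, B

def min_positive(M):
    """Smallest positive entry, i.e. min^+(M)."""
    return min(v for row in M for v in row if v > 0)

A, B = build_weights(3, {(0, 1), (1, 2), (2, 0), (0, 2)})
row_sums = [sum(row) for row in A]
col_sums = [sum(B[i][j] for i in range(3)) for j in range(3)]
print(row_sums, col_sums, min_positive(A), min_positive(B))
```

With uniform weights the constants $a$ and $b$ can be taken as the reciprocals of the largest in-degree and out-degree (counting the self-loop) over the whole graph sequence.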

4 CONVERGENCE ANALYSIS

This section presents a theoretical convergence analysis of the proposed TV-HSGT algorithm. We first provide several necessary preliminary lemmas in Subsection 4.1, and then give the main theoretical results in Subsection 4.2.

4.1 Preliminary Lemmas

Prior to conducting the convergence analysis, this subsection introduces several auxiliary lemmas that lay the theoretical foundation for the subsequent main results.

Lemma 1 (qu2017harnessing).

Suppose that $f(x)$ is $\mu$ -strongly convex and $L_{g}$ -smooth. Then, for any $x\in\mathbb{R}^{d}$ , if the step size satisfies $0<\alpha<\frac{2}{\mu+L_{g}}$ , the following inequality holds

$$\|x-\alpha\nabla f(x)-x^{*}\|\leq(1-\mu\alpha)\|x-x^{*}\|,$$

where $x^{*}$ denotes the optimal solution to $f(x)$ .
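The contraction in Lemma 1 can be checked numerically on a simple case. The sketch below assumes a diagonal quadratic $f(x)=\frac{1}{2}\sum_{k}a_{k}x_{k}^{2}$ with minimizer $x^{*}=0$, so that $\mu=\min_{k}a_{k}$ and $L_{g}=\max_{k}a_{k}$. The curvatures, step size, and test points are hypothetical.

```python
import math

curvatures = [1.0, 4.0]                 # a_k, so mu = 1 and L_g = 4
mu, L_g = min(curvatures), max(curvatures)
alpha = 0.3                             # satisfies 0 < alpha < 2 / (mu + L_g) = 0.4

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def gradient_step(x):
    # x - alpha * grad f(x), with grad f(x)_k = a_k * x_k
    return [xk - alpha * ak * xk for xk, ak in zip(x, curvatures)]

def contraction_holds(x):
    lhs = norm(gradient_step(x))            # ||x - alpha * grad f(x) - x*||
    rhs = (1.0 - mu * alpha) * norm(x)      # (1 - mu * alpha) * ||x - x*||
    return lhs <= rhs + 1e-12

points = [[1.0, -2.0], [0.0, 5.0], [-3.0, 0.5], [2.0, 2.0]]
print(all(contraction_holds(x) for x in points))
```

For this quadratic each coordinate is scaled by $1-\alpha a_{k}$, and the step-size condition is what guarantees $|1-\alpha L_{g}|\leq 1-\alpha\mu$.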

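Lemma 2 (liao2022compressed).

For any integer $k\geq 1$ and any set of vectors $\mathbf{m}_{i}\in\mathbb{R}^{n\times d}$ , it holds that

$$\left\|\sum_{i=1}^{k}\mathbf{m}_{i}\right\|^{2}\leq k\sum_{i=1}^{k}\|\mathbf{m}_{i}\|^{2}.$$

Moreover, for any constant $\zeta>1$ , we have

$$\left\|\sum_{i=1}^{k}\mathbf{m}_{i}\right\|^{2}\leq\zeta\|\mathbf{m}_{1}\|^{2}+\frac{(k-1)\zeta}{\zeta-1}\sum_{i=2}^{k}\|\mathbf{m}_{i}\|^{2}.$$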

Lemma 3 (nguyen2023distributed).

Suppose that $f_{i,t}$ is $L_{g}$ -smooth. Then, the following inequality holds

$$\|h_{t}(\mathbf{x}_{t})-\nabla f_{t}(\hat{x}_{t})\|\leq\frac{L_{g}}{\sqrt{n}}\|\mathbf{x}_{t}-\hat{\mathbf{x}}_{t}\|,$$

where $h_{t}(\mathbf{x}_{t}):=\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i,t}(x_{i,t})$ , $\hat{x}_{t}:=\sum_{i=1}^{n}[\phi_{t}]_{i}x_{i,t}$ , $\mathbf{x}_{t}=[x_{1,t},\ x_{2,t},\ \dots,\ x_{n,t}]^{\top}\in\mathbb{R}^{n\times d}$ , $\hat{\mathbf{x}}_{t}=\mathbf{1}_{n}\otimes\hat{x}_{t}^{\top}$ , and $\phi_{t}$ is a stochastic vector.

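Lemma 4 (nguyen2022distributed).

Given a set of vectors $\{u_{i}\}_{i\in\mathcal{V}}\subset\mathbb{R}^{d}$ and nonnegative weights $\{\gamma_{i}\}_{i\in\mathcal{V}}\subset\mathbb{R}$ satisfying $\sum_{i=1}^{n}\gamma_{i}=1$ , the following identity holds for any $\nu\in\mathbb{R}^{d}$

$$\left\|\sum_{i=1}^{n}\gamma_{i}u_{i}-\nu\right\|^{2}=\sum_{i=1}^{n}\gamma_{i}\|u_{i}-\nu\|^{2}-\sum_{i=1}^{n}\gamma_{i}\left\|u_{i}-\sum_{j=1}^{n}\gamma_{j}u_{j}\right\|^{2}.$$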
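The identity in Lemma 4 is a weighted bias-variance decomposition: the squared distance from the weighted mean to $\nu$ equals the weighted squared distance of the points to $\nu$ minus their weighted spread around the mean. A quick numerical check, with hypothetical vectors and weights, looks as follows.

```python
def sq_norm(v):
    return sum(a * a for a in v)

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def weighted_mean(vectors, weights):
    return [sum(w * vec[k] for w, vec in zip(weights, vectors))
            for k in range(len(vectors[0]))]

u = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]   # vectors u_i
gamma = [0.5, 0.25, 0.25]                  # nonnegative weights summing to 1
nu = [1.0, 1.0]                            # arbitrary reference point

u_bar = weighted_mean(u, gamma)
lhs = sq_norm(sub(u_bar, nu))
rhs = (sum(g * sq_norm(sub(ui, nu)) for g, ui in zip(gamma, u))
       - sum(g * sq_norm(sub(ui, u_bar)) for g, ui in zip(gamma, u)))
print(lhs, rhs)
```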

Lemma 5 (nguyen2022distributed).

Under Assumptions 4 and 5, there exists a corresponding sequence of stochastic vectors $\{\phi_{t}\}$ such that

$$\phi_{t+1}^{\top}A_{t}=\phi_{t}^{\top},\quad\forall t\geq 0.$$

Moreover, for all $i\in\mathcal{V}$ and $t\geq 0$ , it holds that $[\phi_{t}]_{i}\geq\frac{a^{n}}{n}.$

Lemma 6 (nedic2023ab).

Let Assumptions 4 and 6 hold. Define the vector sequence $\pi_{t}$ by

$$\pi_{t+1}=B_{t}\pi_{t},\quad\text{with initial value }\pi_{0}=\frac{1}{n}\mathbf{1}.$$

Then, for any $t\geq 0$ , $\pi_{t}$ is a stochastic vector satisfying $[\pi_{t}]_{i}\geq\frac{b^{n}}{n},\forall i\in\mathcal{V}.$
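The recursion in Lemma 6 is easy to simulate. The sketch below alternates between two hypothetical column-stochastic matrices, each consistent with a strongly connected graph with self-loops, and records $\pi_{t}$ starting from the uniform vector. The matrices and the horizon are assumptions for illustration; the constant $b$ is the smallest positive entry over both matrices.

```python
def mat_vec(M, v):
    return [sum(m * a for m, a in zip(row, v)) for row in M]

third = 1.0 / 3.0
# Columns of both matrices sum to 1 (column-stochastic).
B_even = [[third, 0.0, 0.5], [third, 0.5, 0.0], [third, 0.5, 0.5]]
B_odd = [[0.5, 0.0, 0.5], [0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]

n = 3
b = third                      # smallest positive entry across B_even and B_odd
lower_bound = b ** n / n       # the bound b^n / n from the lemma

pi = [1.0 / n] * n             # pi_0 = (1/n) * ones
history = [pi]
for t in range(20):
    B_t = B_even if t % 2 == 0 else B_odd
    pi = mat_vec(B_t, pi)      # pi_{t+1} = B_t pi_t
    history.append(pi)

print(history[-1], lower_bound)
```

Column-stochasticity preserves the sum of the entries, so every $\pi_{t}$ remains a stochastic vector, and in this small example the entries stay well above the worst-case bound.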

If the graph sequence $\{\mathcal{G}_{t}\}$ satisfies the strong connectivity condition over a period of length $C>1$ , then the results of Lemmas 5 and 6 can be extended. Specifically, for all $t\geq 0$ , there exist stochastic vector sequences $\{\phi_{t}\}$ and $\{\pi_{t}\}$ such that the following equalities hold nguyen2022distributed ; nedic2023ab ; 10337617

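$$\phi_{t+C}^{\top}(A_{t+C-1}\cdots A_{t+1}A_{t})=\phi_{t}^{\top},\qquad\pi_{t+C}=(B_{t+C-1}\cdots B_{t+1}B_{t})\pi_{t}.$$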

Moreover, for all $i\in\mathcal{V}$ , these vector sequences satisfy the following lower bounds $[\phi_{t}]_{i}\geq\frac{a^{nC}}{n},\quad[\pi_{t}]_{i}\geq\frac{b^{nC}}{n}.$

Let $\mathcal{G}=(\mathcal{V},\mathcal{E})$ be a strongly connected directed graph, and let the weight matrices $A$ and $B$ be consistent with the structure of $\mathcal{G}$ . Denote by $\mathrm{D}(\mathcal{G})$ the diameter of the graph and by $\mathrm{K}(\mathcal{G})$ its maximal edge utility nedic2023ab . The following lemmas describe the contraction properties satisfied by the matrices $A$ and $B$ .

Lemma 7 (nguyen2022distributed).

Let $A$ be a row-stochastic matrix, $\phi$ be a stochastic vector, and $\pi$ be a nonnegative vector such that $\pi^{\top}A=\phi^{\top}$ . For a set of vectors $\{x_{i}\in\mathbb{R}^{d}\}_{i=1}^{n}$ , define $\hat{x}_{\phi}=\sum_{i=1}^{n}\phi_{i}x_{i}$ . Then, it holds that

[TABLE]

where the scalar $c\in(0,1)$ is defined by

[TABLE]

Lemma 8 (nedic2023ab).

Let $B$ be a column-stochastic matrix, and let $\nu$ be a stochastic vector with strictly positive elements, i.e., $\nu_{i}>0$ for all $i\in\mathcal{V}$ . Let $\pi=B\nu$ . Then, for any set of vectors $\{y_{i}\in\mathbb{R}^{d}\}_{i=1}^{n}$ , it holds that

[TABLE]

where the scalar $\tau\in(0,1)$ is given by

[TABLE]

4.2 Main Results

This subsection establishes the key theoretical results on the convergence of the proposed algorithm. To simplify the mathematical exposition, we uniformly use the notation $\mathbb{E}[\cdot]$ to denote the expectation operator throughout the subsequent proofs and derivations. Unless otherwise specified, all expectations are interpreted as conditional expectations with respect to the filtration $\mathcal{F}_{t}$ , that is, we adopt the convention $\mathbb{E}[\cdot]:=\mathbb{E}[\cdot\mid\mathcal{F}_{t}]$ . The analysis focuses on bounding four critical error terms in terms of conditional expectations, which are the optimality error $\mathbb{E}[\|\hat{x}_{t}-x^{*}_{t}\|^{2}]$ , the consensus error $\mathbb{E}[\|\mathbf{x}_{t}-\hat{\mathbf{x}}_{t}\|_{\phi_{t}}^{2}]$ , the gradient tracking error $\mathbb{E}[S^{2}(\mathbf{y}_{t},\pi_{t})]$ , and the hybrid stochastic gradient estimation error $\mathbb{E}\left[\|\mathbf{z}_{t+1}-\nabla F_{t+1}(\mathbf{x}_{t+1})\|^{2}\right]$ . Here, the consensus error is measured by the weighted norm $\|\mathbf{x}_{t}-\hat{\mathbf{x}}_{t}\|_{\phi_{t}}$ , and the gradient tracking deviation is quantified by $S(\mathbf{y}_{t},\pi_{t})$ , which are defined as follows

[TABLE]

where $\hat{x}_{t}:=\sum_{i=1}^{n}[\phi_{t}]_{i}x_{i,t}$ represents the weighted average of local decision variables. The stochastic weight sequences $\{\phi_{t}\}$ and $\{\pi_{t}\}$ are defined by equations (17) and (18), respectively. Moreover, $x^{*}_{t}$ denotes the optimal solution to problem (1) at time $t$ . In the later analysis, we denote $\mathbf{x}_{t}=[x_{1,t},\ x_{2,t},\ \dots,\ x_{n,t}]^{\top}\in\mathbb{R}^{n\times d}$ (and similarly for $\mathbf{y}_{t}$ and $\mathbf{z}_{t}$ ), $\mathbf{\hat{x}}_{t}=\mathbf{1}_{n}\otimes\hat{x}_{t}^{\top}$ , $\mathbf{x}_{t}^{*}=\mathbf{1}_{n}\otimes(x_{t}^{*})^{\top}$ , $\nabla F_{t}(\mathbf{x}_{t})=[\nabla f_{1,t}(x_{1,t}),\ \nabla f_{2,t}(x_{2,t}),\ \dots,\ \nabla f_{n,t}(x_{n,t})]^{\top}$ , and $h_{t}(\mathbf{x}_{t}):=\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i,t}(x_{i,t})$ .

To facilitate the convergence analysis of the proposed algorithm under time-varying directed topologies, we introduce a set of auxiliary parameters: $\kappa_{t}\geq 1$ , $\varphi_{t}\geq 1$ , $\gamma_{t}\in(0,1]$ , $\psi_{t}>0$ , $\tau_{t}\in(0,1)$ , $c_{t}\in(0,1)$ , $\nu_{t}>0$ , and $\zeta_{t}>0$ . These quantities are defined as follows

[TABLE]

where $c\in(0,1)$ and $\tau\in(0,1)$ are constant upper bounds for the time-varying quantities $c_{t}$ and $\tau_{t}$ , respectively. Additionally, let $\eta$ denote a uniform lower bound of the inner product $\phi_{t}^{\top}\pi_{t}$ . Since $\phi_{t}$ and $\pi_{t}$ are stochastic vectors, it follows that $\phi_{t}^{\top}\pi_{t}\leq 1$ , and hence $\eta\leq 1$ . For notational conciseness and in order to establish uniform bounds on the algorithm’s performance, we also introduce constant upper bounds $\psi>0$ , $\kappa>1$ , and $\varphi>1$ for $\psi_{t}$ , $\kappa_{t}$ , and $\varphi_{t}$ , respectively. The bounding conditions are then given by

[TABLE]
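The bound $\phi_{t}^{\top}\pi_{t}\leq 1$ used above follows in one line from both vectors being entrywise nonnegative with unit sum. A short worked step, written with the same symbols:

```latex
\phi_t^{\top}\pi_t
  = \sum_{i=1}^{n}[\phi_t]_i\,[\pi_t]_i
  \le \Big(\max_{j}[\pi_t]_j\Big)\sum_{i=1}^{n}[\phi_t]_i
  = \max_{j}[\pi_t]_j
  \le 1 .
```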

In the following, we present Lemmas 9 to 16, which establish bounds on several key terms used in the subsequent convergence analysis. Detailed proofs can be found in the appendix.

Lemma 9.

Under Assumptions 2 and 6, the following inequality holds for all $t\geq 0$

[TABLE]

Lemma 10.

Under Assumptions 4 and 6, the following inequality holds for all $t\geq 0$

[TABLE]

Lemma 11.

Under Assumptions 1, 2, 3, and 4, if $0<\alpha<\frac{2}{n(\mu+L_{g})\phi_{t}^{\top}\pi_{t}}$ , it holds that for all $t\geq 0$

[TABLE]

Lemma 12.

Under Assumptions 2, 3, and 4, the following inequality holds for all $t\geq 0$

[TABLE]

Lemma 13.

Under Assumptions 4 and 5, the following inequality holds for all $t\geq 0$

[TABLE]

where $\varphi_{t}=\sqrt{\frac{1}{\min(\phi_{t})}}$ , and $\gamma_{t}=\sqrt{\max_{i\in\mathcal{V}}\left([\phi_{t}]_{i}[\pi_{t}]_{i}\right)}$ .
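The two quantities in Lemma 13 are simple functions of the weight vectors. As an illustration only, the sketch below evaluates them for hypothetical vectors `phi` and `pi` and confirms the ranges $\varphi_{t}\geq 1$ and $\gamma_{t}\in(0,1]$ stated earlier.

```python
import math


def varphi(phi):
    """varphi_t = sqrt(1 / min_i phi_i); at least 1 for a stochastic vector."""
    return math.sqrt(1.0 / min(phi))


def gamma(phi, pi):
    """gamma_t = sqrt(max_i phi_i * pi_i); lies in (0, 1] for positive stochastic vectors."""
    return math.sqrt(max(p * q for p, q in zip(phi, pi)))


phi = [0.5, 0.3, 0.2]     # hypothetical stochastic vector phi_t
pi = [0.25, 0.25, 0.5]    # hypothetical stochastic vector pi_t
varphi_t = varphi(phi)
gamma_t = gamma(phi, pi)
```

A very unbalanced $\phi_{t}$ (a tiny smallest entry) inflates $\varphi_{t}$, which is why the analysis needs the uniform upper bound $\varphi$.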

Lemma 14.

Under Assumptions 2 and 3, the following inequality holds for all $t\geq 0$

[TABLE]

Lemma 15.

Under Assumptions 2, 3, and 4, it holds that for all $t\geq 0$

[TABLE]

Lemma 16.

Under Assumptions 2 and 3, it holds that for all $t\geq 0$ and $\zeta_{0}>0$

[TABLE]

where $q_{t}$ is defined in (2).

To facilitate the analysis, we establish a coupled relationship among the expectations of the following four error terms by defining the vector $V_{t}$ as

[TABLE]

Based on the results of the previously established lemmas, the following linear inequality system can be established.

Proposition 1.

Let the collections of sequences $\{\{x_{i,t}\}_{i=1}^{n}\}_{t=1}^{T}$ , $\{\{z_{i,t}\}_{i=1}^{n}\}_{t=1}^{T}$ , and $\{\{y_{i,t}\}_{i=1}^{n}\}_{t=1}^{T}$ be generated by Algorithm 1. Under Assumptions 1–6, the following linear inequality system holds

[TABLE]

where $b_{1,t}$ and $b_{2}$ are vectors given by

[TABLE]

The coefficient parameters are defined as $k_{1}=\frac{6n\beta^{2}\tau^{2}\psi}{1-\tau}$ , $k_{2}=\frac{4}{\mu\alpha n\eta}$ , and $k_{3}=(8+\zeta_{0}^{-1})n(1-\beta)^{2}$ with $\zeta_{0}\in(0,\frac{1}{(1-\beta)^{2}}-1)$ .

Proof. By applying Lemma 14 to (15), we get the following inequality

[TABLE]

Substituting the result of Lemma 13, which bounds $\mathbb{E}[\|\mathbf{x}_{t+1}-\mathbf{x}_{t}\|^{2}]$, into (28) gives

[TABLE]

Then, combined with Lemmas 11 and 12, it follows that under the step size condition $0<\alpha<\frac{2}{n(\mu+\eta)L_{g}}$ , the vector $V_{t}$ satisfies the following dynamical system

[TABLE]

where $M_{t}(\alpha)$ can be expressed as

[TABLE]

and $B_{t}^{\prime}=b^{\prime}_{1,t}+b^{\prime}_{2,t}$ with

[TABLE]

By introducing the parameter definitions in (4.2), the entries in $M_{t}(\alpha)$ are defined as follows

[TABLE]

By substituting the upper and lower bounds of parameters defined in (4.2), the upper bound of $M_{t}(\alpha)$ can be given by

[TABLE]

satisfying $M_{t}(\alpha)\leq M(\alpha)$ , where the time-varying coefficients can be upper bounded by the following constants

[TABLE]

Here $\zeta=\frac{24L_{g}^{2}\varphi^{2}\tau^{2}\psi}{1-\tau}$ and $\nu=\frac{6L_{g}^{2}(c\varphi+1)^{2}\tau^{2}\psi}{1-\tau}$. Consequently, $B_{t}^{\prime}$ can be bounded by $B^{\prime}=b_{1,t}+b_{2}$, with $b_{1,t}$ and $b_{2}$ defined in (31) and (32). Thus, the proof is completed. $\Box$

To obtain the main theoretical result, we establish a regret bound for the proposed TV-HSGT algorithm under time-varying directed networks. The result demonstrates that the algorithm effectively reduces the variance caused by stochastic gradients.

Theorem 1.

Let the collections of sequences $\{\{x_{i,t}\}_{i=1}^{n}\}_{t=1}^{T}$ , $\{\{z_{i,t}\}_{i=1}^{n}\}_{t=1}^{T}$ , and $\{\{y_{i,t}\}_{i=1}^{n}\}_{t=1}^{T}$ be generated by Algorithm 1. Let Assumptions 1–6 hold and the step size $\alpha$ satisfy the condition (46). Then, there exists a constant $\tilde{\rho}\in(0,1)$ such that the dynamic regret satisfies

[TABLE]

where $b_{1,t}$ is defined in (31) and $b^{\prime}_{2}=\begin{bmatrix}0,\frac{6n\tau^{2}\psi}{1-\tau},0,n\end{bmatrix}^{\top}$ .

Proof 4.1.

Recall the linear inequality system (30), given by $V_{t+1}\leq M(\alpha)V_{t}+b_{1,t}+b_{2}$ for all $t\geq 0$. The goal is to determine a feasible range for the step size $\alpha$ such that the spectral radius $\rho(\alpha)$ of $M(\alpha)$ satisfies $\rho(\alpha)<1$. It is sufficient to find a positive vector $\bm{\delta}=[\delta_{1},\delta_{2},\delta_{3},\delta_{4}]^{\top}$ and a range for $\alpha>0$ such that $M(\alpha)\bm{\delta}<\bm{\delta}$ horn2012matrix . Expanding and rearranging this inequality element-wise, we obtain

[TABLE]

To ensure these inequalities hold for some $\alpha>0$ , the right-hand sides must be positive, which gives a set of constraints on the components of the vector $\bm{\delta}$ , i.e.,

[TABLE]

We now construct a feasible positive vector $\bm{\delta}$ that satisfies the conditions (43), (44), and (45). Let us fix $\delta_{1}=1$ . Based on (44), we can set $\delta_{4}=\frac{2m_{14}}{1-m_{0}}$ . Plugging this into (45), we select $\delta_{2}$ to satisfy

[TABLE]

Finally, based on (43), we set $\delta_{3}$ as

[TABLE]

With this choice, $\bm{\delta}=[\delta_{1},\delta_{2},\delta_{3},\delta_{4}]^{\top}$ is a positive vector satisfying the necessary constraints. Now, substituting these values back into inequalities (39), (40), and (42) to derive upper bounds on $\alpha$ yields

[TABLE]

To summarize, with the constructed positive vector $\bm{\delta}$ and the defined constants (38), together with Lemma 11, a sufficient condition on the step size $\alpha$ that guarantees $\rho(M(\alpha))<1$ is given by

[TABLE]
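The certificate used in this argument, a positive vector $\bm{\delta}$ with $M\bm{\delta}<\bm{\delta}$, can be checked numerically. The sketch below uses a hypothetical entrywise positive $3\times 3$ matrix (not the paper's $M(\alpha)$) and a plain power iteration to estimate the spectral radius, then compares it with the bound $\max_{i}[M\bm{\delta}]_{i}/\delta_{i}$ implied by the certificate.

```python
def matvec(M, v):
    """Plain matrix-vector product for a list-of-lists matrix."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def spectral_radius(M, iters=500):
    """Power iteration in the max-norm; adequate for an entrywise positive M."""
    v = [1.0] * len(M)
    lam = 0.0
    for _ in range(iters):
        w = matvec(M, v)
        lam = max(w)                      # growth factor, since max(v) == 1
        v = [w_i / lam for w_i in w]
    return lam


# Hypothetical positive matrix and candidate vector, for illustration only.
M = [[0.6, 0.1, 0.1],
     [0.2, 0.5, 0.1],
     [0.1, 0.2, 0.4]]
delta = [1.0, 1.2, 0.9]

M_delta = matvec(M, delta)
certificate_holds = all(md < d for md, d in zip(M_delta, delta))
upper = max(md / d for md, d in zip(M_delta, delta))   # bound implied by the certificate
rho = spectral_radius(M)
```

Finding such a $\bm{\delta}$ avoids computing eigenvalues of a matrix whose entries depend on $\alpha$, which is why the proof works with componentwise inequalities instead.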

Recall that each local function $f_{i,t}$ is $L_{g}$-smooth. By the definition $f_{t}(x):=\frac{1}{n}\sum_{i=1}^{n}f_{i,t}(x)$, the global function $f_{t}(x)$ is also $L_{g}$-smooth, and therefore satisfies

[TABLE]

Let $y=\hat{x}_{t}$ and $x=x_{t}^{*}$ . Since $x_{t}^{*}$ is the minimizer of $f_{t}(x)$ , the first-order optimality condition under Assumption 1 implies $\nabla f_{t}(x_{t}^{*})=0$ . Substituting these into (47) yields

[TABLE]

which simplifies to

[TABLE]

Taking the expectation and summing over $t$ from 1 to $T$ , we get

[TABLE]

In any finite-dimensional vector space, all norms are equivalent, so there exist constants $\lambda_{1}$ and $\lambda_{2}$ satisfying

[TABLE]

Substituting (49) into (48) gives $R_{T}^{d}\leq\frac{L_{g}\lambda_{1}}{2}\sum_{t=1}^{T}\|V_{t}\|_{\gamma}$. According to matrix analysis theory horn2012matrix , for any $\gamma>0$, a matrix norm $\|\cdot\|_{\gamma}$ exists such that

[TABLE]

Letting $\gamma\in(0,1-\rho(M(\alpha)))$ and defining $\tilde{\rho}=\rho(M(\alpha))+\gamma$ , we have $\|M(\alpha)\|_{\gamma}\leq\tilde{\rho}<1$ . Matrix norm submultiplicativity further implies $\|Nv\|_{\gamma}\leq\|N\|_{\gamma}\|v\|_{\gamma}$ for any matrix $N$ and vector $v$ . Applying this to the recursion (30), we obtain

[TABLE]

and applying (49) again yields

[TABLE]

Since the geometric sum satisfies $\sum_{k=0}^{t-1}\tilde{\rho}^{k}\leq\frac{1}{1-\tilde{\rho}}$, we get

[TABLE]

which further simplifies to

[TABLE]

This completes the proof with $b_{2}=\beta^{2}\sigma^{2}b_{2}^{\prime}$ .
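The last steps of the proof unroll a contractive recursion and sum a geometric series. A scalar analogue, given here as a sketch with hypothetical constants (`rho` standing in for $\tilde{\rho}$ and `inputs` for the norms of the perturbation terms), shows the two bounds this produces: a per-step bound and a cumulative bound of the form (initial term + accumulated perturbations) divided by $1-\tilde{\rho}$.

```python
def unroll(rho, v0, inputs):
    """Iterate v_{t+1} = rho * v_t + c_t and return [v_0, v_1, ..., v_T]."""
    v = [v0]
    for c in inputs:
        v.append(rho * v[-1] + c)
    return v


rho, v0, c_max, T = 0.9, 5.0, 0.2, 200
inputs = [c_max / (1 + t) ** 0.5 for t in range(T)]   # perturbations bounded by c_max
v = unroll(rho, v0, inputs)

# Per-step bound: v_t <= rho^t * v_0 + c_max / (1 - rho).
per_step_ok = all(
    v[t] <= rho ** t * v0 + c_max / (1 - rho) + 1e-12 for t in range(T + 1)
)

# Cumulative bound: sum_{t=1}^{T} v_t <= (v_0 + sum_t c_t) / (1 - rho).
cumulative = sum(v[1:])
cumulative_bound = (v0 + sum(inputs)) / (1 - rho)
```

In this toy model the cumulative quantity is sublinear in $T$ only if the accumulated perturbations are, which mirrors the role of the variation and noise terms in the regret bound.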

Remark 4.2.

Existing studies have shown that, in general settings, the dynamic regret cannot be guaranteed to grow sublinearly in $T$ li2022survey ; eshraghi2022improving ; Shahrampour2018 ; Notarnicola2023TAC ; Li2021TAC ; Dall2020 ; Mokhtari2016CDC , and the available bounds may depend explicitly on $P_{T}=\sum_{t=1}^{T-1}p_{t}$, the path length that measures the changes in the sequence of minimizers. Moreover, some works rely on strong assumptions about the objective functions. For example, eshraghi2022improving establishes a bound of the form $\mathcal{O}(1+P_{T})$ under the assumptions of strongly convex loss functions and bounded gradients. Shahrampour2018 gives a dynamic regret bound of $\mathcal{O}(\sqrt{(1+C_{T})T})$ with $C_{T}=\sum_{t=1}^{T}\|x_{t+1}^{*}-Ax_{t}^{*}\|$, requiring that the local time-varying functions have uniformly bounded gradients and that the graph be undirected and connected.

In contrast, Theorem 1 derives an upper bound on dynamic regret without the bounded gradient assumption under a stochastic setting and general time-varying digraphs. Due to the temporal variability of the gradients, the resulting bound incorporates additional error terms. Specifically, Theorem 1 shows that the dynamic regret $R^{d}_{T}$ consists of three components: a term dependent on initial conditions, a noise variance term induced by stochastic gradients, and an error that captures the time-varying nature of the problem, namely $p_{t}$ and $q_{t}$ . In particular, the parameter $\beta$ can be properly tuned to reduce variance introduced by stochastic gradients. Moreover, if the temporal variations of both the optimal solution and the objective function’s gradient decay sublinearly, and both the step size and the mixing parameter decrease over time, then the resulting dynamic regret can achieve sublinear convergence.

Specifically, for the static distributed optimization with time-invariant functions ( $f_{t}=f$ ), we can obtain a gradient-tracking based algorithm with variance reduction, as shown in the following corollary.

Corollary 4.3.

For the static case with $f_{t}=f$ for all $t\geq 0$, when Assumptions 1, 2, 4, 5, and 6 hold and $\alpha$ satisfies (46) with $m_{0}=(1-\beta)^{2}$, it holds that

[TABLE]

with a linear decay rate of $\rho^{t}(M(\alpha))$ , where $[u]_{i}$ denotes the $i$ th entry of $u$ and $b=[0,\frac{2n\tau^{2}\psi}{1-\tau}\beta^{2}\sigma^{2},0,2n\beta^{2}\sigma^{2}]^{\top}$ .

Remark 4.4.

Corollary 4.3 extends Nguyen2023 by incorporating the hybrid variance-reduction mechanism (3). As seen from the definition of $b$, the resulting error bounds in Corollary 4.3 can be made arbitrarily small by reducing the parameter $\beta$, which highlights the effectiveness of the variance-reduction strategy. Furthermore, in contrast to the CTA-based gradient tracking framework employed in Nguyen2023 for static distributed optimization, our algorithm adopts an ATC framework adapted for online distributed optimization settings, which has been shown to be superior to the CTA framework cattivelli2009diffusion ; li2024npga , particularly in terms of stability and convergence under dynamic conditions.

5 Numerical Examples

In this section, we evaluate the effectiveness of the proposed TV-HSGT algorithm on two multi-agent distributed learning problems. The first problem is a distributed logistic regression task based on structured data, using the A9A dataset. The second problem is a distributed logistic regression task based on image data, using the MNIST dataset. We compare the performance of the TV-HSGT algorithm with three baseline methods: DSGD lian2017can , DSGT pu2021distributed , and DSGT-HB Gao2023 . All methods adopt a unified strategy for constructing the communication weight matrices. Specifically, in each iteration of TV-HSGT, agents communicate over a time-varying strongly connected directed graph. This graph is constructed by randomly sampling edges from a predefined base directed graph while ensuring strong connectivity is maintained at each round. The communication mechanism follows the AB framework, employing a pair of row-stochastic and column-stochastic matrices for updating the decision and gradient tracking variables, respectively. The weights are uniformly distributed over each node’s in-neighbors or out-neighbors, making the implementation suitable for local computation. In contrast, the baseline methods DSGD, DSGT, and DSGT-HB operate over a fixed complete graph and assign uniform weights across all neighbors, forming symmetric doubly stochastic matrices.
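The weight construction described above can be sketched as follows. This is an illustration under stated assumptions, not the code used in the experiments: every node is assumed to carry a self-loop, edges are given as hypothetical pairs `(j, i)` meaning that `j` sends to `i`, and `build_weights` is a name introduced here.

```python
def build_weights(n, edges):
    """Build a row-stochastic A and a column-stochastic B with uniform weights.

    edges: iterable of directed pairs (j, i), meaning node j sends to node i.
    Self-loops are added for every node (an assumption of this sketch).
    """
    in_nbrs = [{i} for i in range(n)]
    out_nbrs = [{j} for j in range(n)]
    for j, i in edges:
        in_nbrs[i].add(j)
        out_nbrs[j].add(i)

    A = [[0.0] * n for _ in range(n)]
    B = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in in_nbrs[i]:
            A[i][j] = 1.0 / len(in_nbrs[i])    # receiver i averages over its in-neighbors
    for j in range(n):
        for i in out_nbrs[j]:
            B[i][j] = 1.0 / len(out_nbrs[j])   # sender j splits mass over its out-neighbors
    return A, B


edges = {(0, 1), (1, 2), (2, 0), (0, 2)}       # a small strongly connected digraph
A, B = build_weights(3, edges)
row_sums_A = [sum(row) for row in A]
col_sums_B = [sum(B[i][j] for i in range(3)) for j in range(3)]
```

Each row of `A` depends only on the receiver's in-neighbors and each column of `B` only on the sender's out-neighbors, so both can be formed locally; in a time-varying setting the function would simply be called again on each round's sampled edge set.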

5.1 Distributed Logistic Regression on Structured Data

This subsection evaluates the performance of the proposed TV-HSGT algorithm on a classification task using the structured A9A dataset with a logistic regression model. The loss function 10806815 is defined as:

[TABLE]

where $M^{i}$ is the number of samples for agent $i$ , $r^{i}$ is a regularization coefficient, and $s(a)$ denotes the sigmoid function. We conduct two groups of experiments: (1) algorithm comparison and (2) parameter sensitivity analysis.
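Since the loss formula itself is not reproduced in this excerpt, the following sketch assumes a standard form: the averaged binary cross-entropy with labels in $\{0,1\}$ plus the regularizer $\frac{r^{i}}{2}\|x\|^{2}$. Both the exact loss and the label convention are assumptions, and `loss_and_grad` and the toy batch are illustrative names and data.

```python
import math


def sigmoid(a):
    """Numerically stable logistic function s(a) = 1 / (1 + exp(-a))."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def loss_and_grad(x, batch, r):
    """Assumed local loss: mean cross-entropy over the batch plus (r / 2) * ||x||^2.

    batch: list of (feature_vector, label) pairs with label in {0, 1}.
    """
    m, d = len(batch), len(x)
    loss = 0.5 * r * sum(v * v for v in x)
    grad = [r * v for v in x]
    for a, b in batch:
        p = sigmoid(sum(a_k * x_k for a_k, x_k in zip(a, x)))
        loss -= (b * math.log(p) + (1 - b) * math.log(1.0 - p)) / m
        for k in range(d):
            grad[k] += (p - b) * a[k] / m      # derivative of cross-entropy w.r.t. x_k
    return loss, grad


batch = [([1.0, 2.0], 1), ([-1.0, 0.5], 0), ([0.3, -0.7], 1)]   # toy mini-batch
x = [0.1, -0.2]
loss, grad = loss_and_grad(x, batch, r=1e-5)
```

Evaluating `loss_and_grad` on a freshly drawn mini-batch each round gives the stochastic gradient that the online algorithms consume.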

We compare TV-HSGT with the online versions of DSGD, DSGT, and DSGT-HB. Following the setup in 10806815 , 10 agents independently receive mini-batches of 100 randomly drawn samples from the pre-shuffled A9A dataset at each round, simulating a dynamic online learning environment. All methods use a fixed step size of 0.001. TV-HSGT adopts a mixing parameter $\beta=0.01$ ; DSGT-HB uses a momentum coefficient of 0.9; and regularization is set as $r^{i}=10^{-5}$ for all agents. Figs. 1–3 show that TV-HSGT consistently outperforms all baselines in terms of regret, loss, and accuracy. The hybrid variance reduction design effectively mitigates gradient noise and accelerates convergence, in line with the theoretical results in Theorem 1.

To examine the impact of the mixing parameter $\beta$ , we test values in {0.01, 0.1, 0.2, 0.3, 0.4, 0.5}. Figs. 4–6 show that smaller $\beta$ values lead to better performance, confirming the theoretical insights in Theorem 1. A larger $\beta$ increases gradient noise, degrading performance.
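The effect of $\beta$ on gradient noise can be reproduced in a one-dimensional toy model. The sketch assumes that the estimator in (3) has the standard hybrid form $z_{t+1}=g_{t+1}(x_{t+1})+(1-\beta)\big(z_{t}-g_{t+1}(x_{t})\big)$, with both stochastic gradients evaluated on the same sample; this matches the noise terms used in the proof of Lemma 16, but the formula is not reproduced in this excerpt. The objective $f(x)=x^{2}$, the additive Gaussian noise, and the function name are illustrative choices.

```python
import random


def tracking_mse(beta, alpha=0.05, sigma=1.0, steps=20000, seed=0):
    """Average squared error of the hybrid estimator over the second half of a run."""
    rng = random.Random(seed)

    def grad(x):
        return 2.0 * x                     # exact gradient of f(x) = x^2

    x = 1.0
    z = grad(x) + rng.gauss(0.0, sigma)    # initial estimate: one noisy gradient
    sq_errors = []
    for _ in range(steps):
        x_next = x - alpha * z
        xi = rng.gauss(0.0, sigma)         # one sample shared by both evaluations
        z = (grad(x_next) + xi) + (1.0 - beta) * (z - (grad(x) + xi))
        x = x_next
        sq_errors.append((z - grad(x)) ** 2)
    tail = sq_errors[steps // 2:]
    return sum(tail) / len(tail)


mse_small_beta = tracking_mse(beta=0.01)
mse_large_beta = tracking_mse(beta=0.5)
```

In this toy model the estimation error obeys $e_{t+1}=(1-\beta)e_{t}+\beta\xi_{t+1}$, so its stationary variance is $\beta\sigma^{2}/(2-\beta)$ and shrinks with $\beta$. The model has no temporal variation of the objective, so it does not capture the tracking cost that a very small $\beta$ may incur when $f_{t}$ changes quickly.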

5.2 Distributed Logistic Regression on Image Data

To further evaluate the effectiveness of TV-HSGT in visual settings, we conduct experiments on the MNIST dataset using a multi-class logistic regression model with $L_{2}$ regularization. The loss function is given by

[TABLE]

where $\Theta=[\theta_{0},\dots,\theta_{9}]$ is the parameter matrix, $a_{s}^{i}$ and $b_{s}^{i}$ represent the feature vector and label of sample $s$ at agent $i$ , $M^{i}$ is the per-round batch size, and $r^{i}$ is the regularization coefficient.

All experimental settings match those of the structured-data experiments in Subsection 5.1. Each agent processes 100 random images per round. Figs. 7–9 show comparisons of time-averaged regret, loss, and accuracy across algorithms. The results demonstrate that TV-HSGT converges fastest, significantly reduces stochastic gradient noise, and achieves the highest final accuracy, outperforming DSGT-HB, DSGT, and DSGD—particularly in image classification applications.

We assess the effect of the mixing parameter $\beta\in\{0.01,0.1,0.2,0.3,0.4,0.5\}$ on performance. Figs. 10–12 illustrate that smaller $\beta$ values lead to better performance across regret, loss, and accuracy, consistent with our theoretical analysis in Theorem 1.

6 Conclusion

In this work, a novel decentralized online stochastic optimization algorithm named TV-HSGT has been proposed over time-varying directed networks with limited computation. By combining hybrid stochastic gradient estimation and gradient tracking strategies, an improved dynamic regret performance with variance reduction is achieved. An AB communication scheme is employed for a time-varying directed network to ensure consensus without eigenvector estimation. Theoretical analysis and experiments demonstrate the algorithm’s effectiveness in reducing variance and tracking the optimal solution. Future work will focus on improving the communication efficiency of TV-HSGT.

Appendix

Appendix A Proof of Lemma 9

Proof A.1.

To bound $\mathbb{E}\left[\left\|\sum_{i=1}^{n}y_{i,t}\right\|^{2}\right]$ , we first apply the triangle inequality of norms to split $\left\|\sum_{i=1}^{n}y_{i,t}\right\|$ as

[TABLE]

By the property of the global optimal solution $x_{t}^{*}$ , namely $\sum_{i=1}^{n}\nabla f_{i,t}(x_{t}^{*})=0$ , we obtain

[TABLE]

Since $\nabla f_{i,t}$ is $L_{g}$ -Lipschitz continuous, one has

[TABLE]

which leads to

[TABLE]

By applying Lemma 4 with $\gamma_{i}=\left[\phi_{t}\right]_{i}$ , $u_{i}=x_{i,t}$ , and $\nu=x_{t}^{*}$ , it can be derived that

[TABLE]

Noting that $\sum_{i=1}^{n}y_{i,t}=\sum_{i=1}^{n}z_{i,t}$ , we derive

[TABLE]

Combining (50) and (A.1), it holds that

[TABLE]

Taking the conditional expectation completes the proof.
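The identity $\sum_{i}y_{i,t}=\sum_{i}z_{i,t}$ used in this proof is a direct consequence of $B_{t}$ being column-stochastic. The sketch below checks it for scalar states with the tracking update $\mathbf{y}_{t+1}=B_{t}(\mathbf{y}_{t}+\mathbf{z}_{t+1}-\mathbf{z}_{t})$, assuming the initialization $\mathbf{y}_{0}=\mathbf{z}_{0}$; the matrix and the values of $\mathbf{z}$ are hypothetical.

```python
def track(B, y, z_new, z_old):
    """One tracking step y_next = B (y + z_new - z_old) for scalar agent states."""
    n = len(y)
    u = [y[j] + z_new[j] - z_old[j] for j in range(n)]
    return [sum(B[i][j] * u[j] for j in range(n)) for i in range(n)]


# Hypothetical column-stochastic matrix: every column sums to 1.
B = [[0.5, 0.0, 1 / 3],
     [0.5, 0.5, 1 / 3],
     [0.0, 0.5, 1 / 3]]

z = [1.0, -2.0, 0.5]
y = list(z)                                # assumed initialization y_0 = z_0
gaps = []
for z_next in ([0.7, 0.1, -0.3], [2.0, 2.0, 2.0], [-1.0, 0.0, 4.0]):
    y = track(B, y, z_next, z)
    z = z_next
    gaps.append(abs(sum(y) - sum(z)))      # should stay at rounding level
```

The individual entries of `y` differ from those of `z`, but their sums agree at every step, which is what allows the tracker to follow the aggregate gradient estimate.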

Appendix B Proof of Lemma 10

Proof B.1.

Under the given assumptions, Lemma 6 ensures that all components of the stochastic vector $\pi_{t}$ are strictly positive. The scaling $[\pi_{t}]_{i}^{-1}$ is therefore well-defined for all $i\in\mathcal{V}$ and $t\geq 0$ . By definition, we have

[TABLE]

Applying Lemma 4 with $\gamma_{i}=[\pi_{t}]_{i}$ , $u_{i}=y_{i,t}/[\pi_{t}]_{i}$ , and $\nu=0$ , it holds that

[TABLE]

Taking the conditional expectation on both sides and applying Lemma 9 completes the proof.

Appendix C Proof of Lemma 11

Proof C.1.

According to the update rule in (11), it follows that $\hat{x}_{t+1}=\hat{x}_{t}-\alpha\hat{y}_{t}$ , so that $\|\hat{x}_{t+1}-x_{t+1}^{*}\|^{2}=\|\hat{x}_{t}-\alpha\hat{y}_{t}-x_{t+1}^{*}\|^{2}$ . Introducing the auxiliary term $\alpha n\phi_{t}^{\top}\pi_{t}\bar{y}_{t}$ , where $\bar{y}_{t}=\frac{1}{n}\sum_{j=1}^{n}y_{j,t}$ , the error can be decomposed as

[TABLE]

Applying Lemma 2, the following inequality holds

[TABLE]

Since $f_{t}$ is $\mu$ -strongly convex, Lemma 1 implies that if the step size satisfies $0<\alpha<\frac{2}{n(\mu+L_{g})\phi_{t}^{\top}\pi_{t}}$ , then $\|r_{1}\|^{2}\leq(1-\mu\alpha n\phi_{t}^{\top}\pi_{t})^{2}\|\hat{x}_{t}-x_{t}^{*}\|^{2}.$ By Lemma 3, we obtain $\|r_{2}\|^{2}\leq n\alpha^{2}(\phi_{t}^{\top}\pi_{t})^{2}L_{g}^{2}\varphi_{t}^{2}\|\mathbf{x}_{t}-\hat{\mathbf{x}}_{t}\|_{\phi_{t}}^{2}.$ Since $\bar{y}_{t}=\bar{z}_{t}$ and based on the definition of the gradient tracking error, it holds that

[TABLE]

Applying Lemma 4 with $u_{i}=[\pi_{t}]_{i}\left(\frac{y_{i,t}}{[\pi_{t}]_{i}}-\sum_{j=1}^{n}y_{j,t}\right)$ , $\gamma_{i}=[\phi_{t}]_{i}$ , and $\nu=0$ , we obtain

[TABLE]

Therefore, from the definition of $S^{2}(\mathbf{y}_{t},\pi_{t})$ in (20), we have

[TABLE]

Combining the results above, and under the condition that $0<\alpha<\frac{2}{n(\mu+L_{g})\phi_{t}^{\top}\pi_{t}}$ , we have

[TABLE]

Finally, choosing $\zeta=\frac{1}{1-\mu\alpha n\phi_{t}^{\top}\pi_{t}}$ ensures convergence and completes the proof.
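The contraction factor $1-\mu\alpha n\phi_{t}^{\top}\pi_{t}$ in this proof comes from the classical fact that a gradient step on a $\mu$-strongly convex, $L_{g}$-smooth function contracts the distance to the minimizer by $1-\mu s$ whenever the step $s$ lies in $(0,\frac{2}{\mu+L_{g}})$. The sketch checks this on a hypothetical diagonal quadratic, with `alpha` playing the role of the effective step $\alpha n\phi_{t}^{\top}\pi_{t}$.

```python
import math


def gradient_step(x, h, c, alpha):
    """One step x - alpha * grad f(x) for f(x) = 0.5 * sum_k h[k] * (x[k] - c[k])^2."""
    return [x_k - alpha * h_k * (x_k - c_k) for x_k, h_k, c_k in zip(x, h, c)]


def dist(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


h = [0.5, 2.0, 4.0]            # curvatures, so mu = 0.5 and L_g = 4.0
c = [1.0, -1.0, 0.0]           # unique minimizer of the quadratic
mu, L_g = min(h), max(h)
alpha = 1.9 / (mu + L_g)       # inside the admissible range (0, 2 / (mu + L_g))

x = [3.0, 0.5, -2.0]
x_next = gradient_step(x, h, c, alpha)
contraction = dist(x_next, c) / dist(x, c)
factor = 1.0 - mu * alpha
```

Beyond $\frac{2}{\mu+L_{g}}$ the stiffest coordinate overshoots and the factor $1-\mu s$ no longer bounds the contraction, which is the origin of the step-size restriction in Lemma 11.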

Appendix D Proof of Lemma 12

Proof D.1.

Since $\hat{\mathbf{x}}_{t+1}=\hat{\mathbf{x}}_{t}-\alpha\hat{\mathbf{y}}_{t}$ and $\mathbf{x}_{t+1}=A_{t}\mathbf{x}_{t}-\alpha A_{t}\mathbf{y}_{t}$ , it follows that $\mathbf{x}_{t+1}-\hat{\mathbf{x}}_{t+1}=(A_{t}\mathbf{x}_{t}-\hat{\mathbf{x}}_{t})-\alpha(A_{t}\mathbf{y}_{t}-\hat{\mathbf{y}}_{t})$ . Taking the ${\phi_{t+1}}$ -norm on both sides and applying Lemma 2, we obtain

[TABLE]

Both terms $\|A_{t}\mathbf{x}_{t}-\hat{\mathbf{x}}_{t}\|_{\phi_{t+1}}^{2}$ and $\|A_{t}\mathbf{y}_{t}-\hat{\mathbf{y}}_{t}\|_{\phi_{t+1}}^{2}$ conform to the structure of Lemma 7, with $A=A_{t}$ and $x_{i}=x_{i,t}$ for all $i\in\mathcal{V}$ . In addition, Lemma 5 implies that $\phi_{t+1}^{\top}A_{t}=\phi_{t}^{\top}$ . Letting $\pi=\phi_{t+1}$ , $\phi=\phi_{t}$ , and $\hat{x}_{\phi}=x_{t}$ , and substituting into Lemma 7, we obtain $\|A_{t}\mathbf{x}_{t}-\hat{\mathbf{x}}_{t}\|_{\phi_{t+1}}^{2}\leq c_{t}^{2}\|\mathbf{x}_{t}-\hat{\mathbf{x}}_{t}\|_{\phi_{t}}^{2}.$ Using the upper bound of $c_{t}$ , this gives

[TABLE]

Similarly, it follows that

[TABLE]

To bound $\|\mathbf{y}_{t}-\hat{\mathbf{y}}_{t}\|_{\phi_{t}}^{2}$ , we apply Lemma 4 with $\gamma_{i}=[\phi_{t}]_{i}$ , $u_{i}=y_{i,t}$ , and $\nu=0$ . Then, we have

[TABLE]

where $\gamma_{t}=\sqrt{\max_{i\in\mathcal{V}}\left([\phi_{t}]_{i}[\pi_{t}]_{i}\right)}$ , and $\|\mathbf{y}_{t}\|_{\pi_{t}^{-1}}^{2}=\sum_{i=1}^{n}\frac{\|y_{i,t}\|^{2}}{[\pi_{t}]_{i}}$ . Therefore,

[TABLE]

Letting $\zeta=\frac{1+c^{2}}{2c^{2}}$ , we obtain

[TABLE]

Taking the conditional expectation and applying Lemma 10 completes the proof.

Appendix E Proof of Lemma 13

Proof E.1.

By adding and subtracting $\hat{\mathbf{x}}_{t}$ , we obtain $\|\mathbf{x}_{t+1}-\mathbf{x}_{t}\|=\|\mathbf{x}_{t+1}-\hat{\mathbf{x}}_{t}+\hat{\mathbf{x}}_{t}-\mathbf{x}_{t}\|\leq\|A_{t}\mathbf{x}_{t}-\hat{\mathbf{x}}_{t}\|+\|\mathbf{x}_{t}-\hat{\mathbf{x}}_{t}\|+\alpha\|A_{t}\mathbf{y}_{t}\|,$ where the inequality follows from the update rule of $x$ in Equation (11) and the triangle inequality. Expanding the norms and applying Lemma 7 yield

[TABLE]

Using inequality (56), (D.1) and the definition $\gamma_{t}=\sqrt{\max_{i}[\phi_{t}]_{i}[\pi_{t}]_{i}}$ , we obtain

[TABLE]

By employing the norm inequality $\|A_{t}\mathbf{x}_{t}-\hat{\mathbf{x}}_{t}\|_{\phi_{t+1}}\leq c\|\mathbf{x}_{t}-\hat{\mathbf{x}}_{t}\|_{\phi_{t}}$ as given in Equation (55) and invoking Lemma 2, we derive

[TABLE]

Taking expectation on both sides and applying the bound from Lemma 10 yields the desired result.

Appendix F Proof of Lemma 14

Proof F.1.

Based on the update rule of the hybrid stochastic gradient estimator given in Equation (3), the update difference between $z_{i,t+1}$ and $z_{i,t}$ can be expressed as

[TABLE]

Applying the norm inequality and Lemma 2, we decompose $\|z_{i,t+1}-z_{i,t}\|^{2}$ into three terms

[TABLE]

From Assumption 2, the stochastic gradient $\nabla\hat{f}_{i,t+1}(\cdot,\xi_{i,t+1})$ is $L_{g}$ -Lipschitz continuous, and hence $\mathbb{E}\left[\|\nabla\hat{f}_{i,t+1}(x_{i,t+1},\xi_{i,t+1})-\nabla\hat{f}_{i,t+1}(x_{i,t},\xi_{i,t+1})\|^{2}\right]\leq L_{g}^{2}\mathbb{E}\left[\|x_{i,t+1}-x_{i,t}\|^{2}\right].$

Furthermore, decomposing the variance of stochastic gradients and temporal variation yields

[TABLE]

where $\sigma^{2}$ denotes the variance from the stochastic gradients due to Assumption 3, and $q_{t}$ is defined in (2).

Combining the bounds above, we obtain

[TABLE]

Substituting the bound from Lemma 13 into the expression completes the proof.

Appendix G Proof of Lemma 15

Proof G.1.

Since $B_{t}$ is a column-stochastic matrix, the update rule of the gradient tracking variable can be written compactly as

[TABLE]

By multiplying both sides with $\text{diag}^{-1}(\pi_{t+1})$ and subtracting the state $\mathbf{s}_{t+1}=\mathbf{1}_{n}\mathbf{1}_{n}^{\top}\mathbf{y}_{t+1}=\mathbf{s}_{t}+\mathbf{1}_{n}\mathbf{1}_{n}^{\top}(\mathbf{z}_{t+1}-\mathbf{z}_{t})$ , we obtain

[TABLE]

Define $r_{1}=\text{diag}^{-1}(\pi_{t+1})B_{t}\mathbf{y}_{t}-\mathbf{s}_{t}$ , and $r_{2}=\text{diag}^{-1}(\pi_{t+1})B_{t}(\mathbf{z}_{t+1}-\mathbf{z}_{t})-\mathbf{1}_{n}\mathbf{1}_{n}^{\top}(\mathbf{z}_{t+1}-\mathbf{z}_{t})$ . We analyze $r_{1}$ and $r_{2}$ separately.

For $r_{1}$ , we have

[TABLE]

where the inequality is based on Lemma 8, by taking $\mathcal{G}=\mathcal{G}_{t}$ , $B=B_{t}$ , $\pi=\pi_{t+1}$ , and $\nu=\pi_{t}$ , together with the definition of $\tau_{t}$ .

Taking conditional expectation and applying $\tau_{t}\leq\tau$ , we obtain

[TABLE]

For $r_{2}$ , we define $\Delta\mathbf{z}_{t}=\mathbf{z}_{t+1}-\mathbf{z}_{t}$ and $\tilde{\Delta}=\sum_{j=1}^{n}\Delta z_{j,t}$ , then

[TABLE]

where $\kappa_{t}$ is defined in (4.2). Then, applying Lemma 2, it can be derived that

[TABLE]

Choosing $\zeta=\frac{1}{\tau}>1$ and substituting into (G.1) yields the desired result.

Appendix H Proof of Lemma 16

Proof H.1.

Define the stochastic gradient noise at agent $i$ and time $t+1$ as $\delta^{1}_{i,t+1}=\nabla\hat{f}_{i,t+1}(x_{i,t+1},\xi_{i,t+1})-\nabla f_{i,t+1}(x_{i,t+1})$ , and an auxiliary noise term $\delta^{2}_{i,t}=\nabla\hat{f}_{i,t+1}(x_{i,t},\xi_{i,t+1})-\nabla f_{i,t}(x_{i,t})$ , where the randomness is induced by $\xi_{i,t+1}$ . Note that $\mathbb{E}[\delta^{1}_{i,t+1}]=0$ but $\mathbb{E}[\delta^{2}_{i,t}]\neq 0$ generally due to the time-varying objective functions.

Let $\mathbf{\delta}^{1}_{t}=[\delta^{1}_{i,t}]_{i\in\mathcal{V}}$ and $\mathbf{\delta}^{2}_{t}=[\delta^{2}_{i,t}]_{i\in\mathcal{V}}$ . It can be derived that

[TABLE]

Moreover, for any $\zeta_{0}>0$ , we have

[TABLE]

By applying Assumptions 2 and 3, we have $\mathbb{E}\left[\|\delta^{1}_{i,t+1}\|^{2}\right]\leq\sigma^{2}$ and

[TABLE]

which implies that

[TABLE]

Then, substituting (61) and (62) into (60) results in (28).

Appendix I Proof of Corollary 4.3

Proof I.1.

When $f_{t}=f$ , the previous Lemmas 14 and 16 related to the time-varying term $q_{t}$ can be revised as follows. Following the proof of Lemma 14, we have

[TABLE]

where the above inequalities use Lemma 2 and Assumptions 2 and 3. Hence, we obtain

[TABLE]

For Lemma 16, we define $\delta^{1}_{i,t+1}=\nabla\hat{f}_{i}(x_{i,t+1},\xi_{i,t+1})-\nabla f_{i}(x_{i,t+1})$ and $\delta^{2}_{i,t}=\nabla\hat{f}_{i}(x_{i,t},\xi_{i,t+1})-\nabla f_{i}(x_{i,t})$ . Then, one can reorganize (63) as

[TABLE]

where the first inequality holds due to $\mathbb{E}[\mathbf{\delta}^{1}_{t}]=\mathbb{E}[\mathbf{\delta}^{2}_{t}]=0$, and the second inequality is obtained by applying $\mathbb{E}[\|\xi-\mathbb{E}[\xi]\|^{2}]=\mathbb{E}[\|\xi\|^{2}]-\|\mathbb{E}[\xi]\|^{2}$ and Assumption 2.
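For completeness, the variance identity invoked here follows by expanding the square and using linearity of expectation. For any random vector $\xi$ with finite second moment:

```latex
\mathbb{E}\big[\|\xi-\mathbb{E}[\xi]\|^{2}\big]
  = \mathbb{E}\big[\|\xi\|^{2}\big]
    - 2\big\langle \mathbb{E}[\xi],\,\mathbb{E}[\xi]\big\rangle
    + \|\mathbb{E}[\xi]\|^{2}
  = \mathbb{E}\big[\|\xi\|^{2}\big]-\|\mathbb{E}[\xi]\|^{2}
  \le \mathbb{E}\big[\|\xi\|^{2}\big].
```

The final inequality is the form typically needed in such bounds: centering a random vector never increases its second moment.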

With these modifications, one can derive a new positive matrix $\widehat{M}(\alpha)\leq M(\alpha)$ element-wise, sharing the same structure as $M(\alpha)$ but with slightly different numerical coefficients and $m_{0}=(1-\beta)^{2}$. In this case, the following inequality system holds

[TABLE]

with $b=[0,\frac{2n\tau^{2}\psi}{1-\tau}\beta^{2}\sigma^{2},0,2n\beta^{2}\sigma^{2}]^{\top}$ . By iteratively expanding this inequality, we get

[TABLE]

Since the spectral radius $\rho(M(\alpha))<1$, we have $\lim_{t\to\infty}M(\alpha)^{t}=0$. Therefore, the first term $M(\alpha)^{t}V_{0}$ tends to zero as $t\to\infty$ with a linear decay rate of $\rho(M(\alpha))$. Next, consider the sum $\sum_{k=0}^{t-1}M(\alpha)^{k}b$ , which is a geometric series that can be written as

[TABLE]

As $t\to\infty$ , $M(\alpha)^{t}\to 0$ , so the above expression simplifies to

[TABLE]

Therefore, $\limsup_{t\to\infty}V_{t}\leq(\mathbb{I}-M(\alpha))^{-1}b$, with a linear convergence rate of $\rho(M(\alpha))$.
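The limiting behavior can be illustrated with a small example. The sketch iterates $V_{t+1}=MV_{t}+b$ for a hypothetical nonnegative $2\times 2$ matrix with spectral radius below one (not the paper's $M(\alpha)$) and compares the result with the closed-form fixed point $(\mathbb{I}-M)^{-1}b$ obtained from the explicit $2\times 2$ inverse.

```python
def step(M, V, b):
    """One iteration V_next = M V + b for a 2x2 matrix M."""
    return [sum(M[i][j] * V[j] for j in range(2)) + b[i] for i in range(2)]


def neumann_limit(M, b):
    """Return (I - M)^{-1} b for a 2x2 matrix M via the explicit inverse."""
    a11, a12 = 1.0 - M[0][0], -M[0][1]
    a21, a22 = -M[1][0], 1.0 - M[1][1]
    det = a11 * a22 - a12 * a21
    return [(a22 * b[0] - a12 * b[1]) / det,
            (-a21 * b[0] + a11 * b[1]) / det]


M = [[0.5, 0.2],
     [0.1, 0.6]]               # eigenvalues 0.7 and 0.4, so the spectral radius is 0.7
b = [0.0, 0.3]
V = [10.0, 10.0]
for _ in range(200):
    V = step(M, V, b)
limit = neumann_limit(M, b)
```

For nonnegative $M$ and $b$ the fixed point is entrywise nonnegative, and it scales linearly with $b$, which is how a smaller $\beta^{2}\sigma^{2}$ translates into a smaller steady-state error.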

Bibliography


[1] Duong Thuy Anh Nguyen, Duong Tung Nguyen, and Angelia Nedic. Distributed stochastic optimization with gradient tracking over time-varying directed networks. In 2023 57th Asilomar Conference on Signals, Systems, and Computers, pages 1605–1609, 2023.
[2] Xuanyu Cao and Tamer Başar. Decentralized online convex optimization with compressed communications. Automatica, 156:111186, 2023.
[3] Xuanyu Cao, Junshan Zhang, and H. Vincent Poor. Online stochastic optimization with time-varying distributions. IEEE Transactions on Automatic Control, 66(4):1840–1847, 2021.
[4] Guido Carnevale, Francesco Farina, Ivano Notarnicola, and Giuseppe Notarstefano. GTAdam: Gradient tracking with adaptive momentum for distributed online optimization. IEEE Transactions on Control of Network Systems, 10(3):1436–1448, 2022.
[5] Federico S. Cattivelli and Ali H. Sayed. Diffusion LMS strategies for distributed estimation. IEEE Transactions on Signal Processing, 58(3):1035–1048, 2009.
[6] Yiyue Chen, Abolfazl Hashemi, and Haris Vikalo. Accelerated distributed stochastic nonconvex optimization over time-varying directed networks. IEEE Transactions on Automatic Control, 70(4):2196–2211, 2025.
[7] Ziqin Chen and Yongqiang Wang. Local differential privacy for decentralized online stochastic optimization with guaranteed optimality and convergence speed. IEEE Transactions on Automatic Control, pages 1–16, 2024.
[8] Emiliano Dall’Anese, Andrea Simonetto, Stephen Becker, and Liam Madden. Optimization and learning with information streams: Time-varying algorithms and applications. IEEE Signal Processing Magazine, 37(3):71–83, 2020.
