A Hybrid Stochastic Gradient Tracking Method for Distributed Online Optimization Over Time-Varying Directed Networks
Xinli Shi, Xingxing Yuan, Longkang Zhu, Guanghui Wen

TL;DR
This paper introduces TV-HSGT, a novel distributed online optimization algorithm that effectively handles stochastic gradients over time-varying directed networks, improving convergence and regret bounds without requiring gradient boundedness.
Contribution
The paper presents a hybrid stochastic gradient tracking algorithm that operates over directed networks without Perron vector estimation, enhancing dynamic regret bounds in online optimization.
Findings
TV-HSGT outperforms existing methods in dynamic environments.
It reduces gradient variance through recursive stochastic gradient integration.
Experimental results validate its effectiveness in logistic regression tasks.
Table 1: Comparison of TV-HSGT with existing distributed online optimization algorithms.
| Work | Weight matrix | Time-varying network | Stochastic gradients | No bounded-gradient assumption | Momentum term | Regret type |
| --- | --- | --- | --- | --- | --- | --- |
| Shahrampour2018 | Undirected, DS | ✗ | ✓ | ✗ | ✗ | Dynamic |
| cao2023decentralized | Undirected, DS | ✗ | ✗ | ✗ | ✗ | Static |
| Zhang2022SMC | Directed, DS | ✓ | ✗ | ✗ | ✗ | Static |
| nazari2022dadam | Undirected, DS | ✗ | ✓ | ✗ | ✓ | Dynamic |
| Li2022TCMS | Directed, DS | ✓ | ✓ | ✗ | ✗ | Dynamic |
| carnevale2022gtadam | Undirected, DS | ✗ | ✗ | ✓ | ✓ | Dynamic |
| Sharma2024TSP | Undirected, DS | ✗ | ✗ | ✓ | ✗ | Dynamic |
| Li2024TAC | Directed, RS | ✗ | ✓ | ✗ | ✗ | Static |
| yao2025online | Directed, RCS | ✗ | ✗ | ✗ | ✗ | Dynamic |
| Ours | Directed, RCS | ✓ | ✓ | ✓ | ✓ | Dynamic |
DS: doubly stochastic; RS: row-stochastic; RCS: row- and column-stochastic.
Abstract
With the increasing scale and dynamics of data, distributed online optimization has become essential for real-time decision-making in various applications. However, existing algorithms often rely on bounded gradient assumptions and overlook the impact of stochastic gradients, especially in time-varying directed networks. This study proposes a novel Time-Varying Hybrid Stochastic Gradient Tracking algorithm named TV-HSGT, based on hybrid stochastic gradient tracking and variance reduction mechanisms. Specifically, TV-HSGT integrates row-stochastic and column-stochastic communication schemes over time-varying digraphs, eliminating the need for Perron vector estimation or out-degree information. By combining current and recursive stochastic gradients, it effectively reduces gradient variance while accurately tracking global descent directions. Theoretical analysis demonstrates that TV-HSGT can achieve improved bounds on dynamic regret without assuming gradient boundedness. Experimental results on logistic regression tasks confirm the effectiveness of TV-HSGT in dynamic and resource-constrained environments.
Keywords: distributed online optimization; hybrid stochastic gradient tracking; time-varying directed networks; dynamic regret
1 Introduction
Distributed optimization has received significant attention and found applications in various fields such as control, signal processing, and machine learning shahrampour2015distributed ; nedic2017fast ; Shahrampour2017ACC . It aims to solve a large-scale optimization problem by decomposing it into smaller, more tractable subproblems that can be solved iteratively and in parallel by a network of interconnected agents through communication. Most traditional works on distributed optimization focus on static problems, making them unsuitable for dynamic tasks arising in real-world applications, such as networked autonomous vehicles, smart grids, and online machine learning, among others Dall2020 .
Online optimization, which addresses time-varying cost functions, plays a vital role in solving dynamic problems in time-sensitive applications Zinkevich2003 ; Mairal2009 ; Li2024TAC ; Cao2021TAC . In many practical scenarios, such as machine learning with information streams shalev2012online , the objective functions of optimization problems change over time, making them inherently dynamic wei2023distributed ; Zinkevich2003 . Online learning has emerged as a powerful method for handling sequential decision-making tasks in dynamic contexts, enabling real-time operation while ensuring bounded performance loss in terms of regret hazan2016introduction . Regret is the gap between the cumulative objective value achieved by the online algorithm and that of the optimal offline solution li2020distributed ; Shahrampour2018 . In the literature, two types of regret are commonly considered, i.e., static and dynamic regret. The former evaluates the performance of an online algorithm relative to a fixed optimal decision $x^*$, and is typically formulated as $\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x^*)$, where $x_t$ denotes the output of the online algorithm and $x^*$ is the optimal fixed decision in hindsight, i.e., $x^* \in \arg\min_{x} \sum_{t=1}^{T} f_t(x)$. In contrast, the dynamic regret is obtained by replacing the above static $x^*$ by a dynamic solution $x_t^* \in \arg\min_{x} f_t(x)$. This makes dynamic regret more suitable for non-stationary environments, although it is generally more challenging to minimize due to the evolving nature of the optimal points. Both metrics are commonly used to assess the performance of online algorithms. Achieving sublinear regret growth, i.e., regret that grows slower than linearly with time, is often regarded as a key indicator of algorithmic efficiency yuan2017adaptive . Therefore, minimizing regret, particularly in terms of establishing sublinear regret bounds, is fundamental to the design and analysis of effective online optimization methods.
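The difference between the two regret notions can be illustrated with a minimal sketch. It assumes toy quadratic losses $f_t(x) = \tfrac{1}{2}\|x - c_t\|^2$, which are not taken from the paper; the function name `static_and_dynamic_regret` and its arguments are hypothetical and chosen only for illustration.

```python
import numpy as np


def static_and_dynamic_regret(centers, decisions):
    """Regret for the toy losses f_t(x) = 0.5 * ||x - c_t||^2 (an assumption).

    centers   : (T, p) array, row t is c_t, the minimizer of f_t
    decisions : (T, p) array, row t is the decision x_t played at round t
    """
    centers = np.asarray(centers, dtype=float)
    decisions = np.asarray(decisions, dtype=float)

    def loss(x, c):
        return 0.5 * np.sum((x - c) ** 2, axis=-1)

    incurred = loss(decisions, centers).sum()
    # Best fixed decision in hindsight: for these quadratics it is the mean of c_t.
    x_fixed = centers.mean(axis=0)
    static = incurred - loss(x_fixed, centers).sum()
    # Per-round minimizers are x_t^* = c_t, and f_t(x_t^*) = 0 for every t.
    dynamic = incurred
    return static, dynamic
```

For these losses the dynamic regret is never smaller than the static one, because the per-round minimizers are at least as good a comparator as any fixed decision.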
Distributed online optimization offers a flexible framework for handling dynamic settings, combining the benefits of decentralized computation with the ability to adapt to non-stationary environments. Earlier works hosseini2013online ; yan2012distributed investigate online distributed optimization in networks with doubly stochastic mixing matrices and achieve sublinear static regret bounds. Shahrampour2018 further consider dynamic regret for both deterministic and stochastic online distributed optimization. carnevale2022gtadam propose GTAdam without the bounded gradient assumption, combining gradient tracking and adaptive momentum. However, these works assume static or undirected communication topologies, which are insufficient for modeling dynamic networked systems with directional and time-varying interactions. To address this, several algorithms have been developed under time-varying directed graphs with corresponding theoretical guarantees. For instance, Lee2018TCNS propose the ODA-PS algorithm by integrating dual averaging with the Push-Sum protocol over a directed time-varying network, achieving a sublinear static regret bound. Li2021TAC further extend the Push-Sum framework to handle inequality-constrained optimization over unbalanced networks, establishing sublinear dynamic regret and constraint violation. Xiong2024TNSE address feedback delays and propose an event-triggered online mirror descent method with regret guarantees. In addition, stochastic gradient methods have been explored to reduce computational costs. Lee2017TAC analyze stochastic dual averaging under gradient noise, while Li2022TCMS introduce a gradient tracking scheme with aggregation variables, achieving regret bounds under both exact and noisy gradients.
Nevertheless, many of the above methods rely on the assumption of uniformly bounded gradients and neglect the high variance commonly encountered in practice. Moreover, few of them nazari2022dadam ; Lee2017TAC ; Li2022TCMS ; Li2024TAC incorporate variance reduction techniques, limiting both accuracy and stability in stochastic settings. To overcome these limitations, recent studies have focused on gradient tracking-based approaches, which aim to approximate global descent directions by dynamically aggregating local gradient information. Zhang2019CDC establish dynamic regret bounds for a basic tracking scheme, while carnevale2022gtadam propose a momentum-enhanced variant inspired by adaptive methods. Sharma2024TSP develop a generalized framework for strongly convex objectives without requiring gradient boundedness, further advancing the applicability of gradient tracking in decentralized online settings.
This work addresses distributed online stochastic optimization over time-varying directed networks under limited computational resources, where agents interact over asymmetric communication links modeled by time-varying row- and column-stochastic mixing matrices. To overcome the challenges introduced by stochastic gradient noise and dynamic topologies, we design a novel online algorithm that incorporates hybrid variance reduction, gradient tracking, and an AB communication scheme Saadatniaki2020TAC ; Pu2021TAC ; Nguyen2023 . Table 1 summarizes the comparison of our method with several existing online optimization algorithms in terms of communication schemes, gradient assumptions, and types of regret. The main contributions are summarized as follows:
1. We propose a Time-Varying Hybrid Stochastic Gradient Tracking method, named TV-HSGT, for distributed online optimization over dynamic directed networks. It integrates a hybrid variance reduction strategy by combining current and recursive stochastic gradients. This method effectively reduces the variance introduced by stochastic gradients and accelerates convergence, as demonstrated in our experimental results.
2. To address the limited information access inherent in decentralized systems, the algorithm incorporates a gradient tracking mechanism to approximate the global gradient direction over time-varying directed networks. In addition, an AB communication scheme is employed, utilizing both row-stochastic and column-stochastic weight matrices. This design eliminates the need to estimate the Perron vector, as required in traditional Push-Sum methods, improving practical applicability in directed network settings.
3. The algorithm is implemented within an adapt-then-combine (ATC) framework, which allows for relaxed step-size conditions compared with the combine-then-adapt (CTA) framework li2024npga . We adopt a dynamic regret metric to evaluate performance and introduce a weighted averaging variable to characterize the deviation between local decisions and the global optimal trajectory. Theoretical analysis establishes upper bounds on dynamic regret, and numerical simulations validate the algorithm’s effectiveness in reducing stochastic gradient variance under dynamic and asymmetric communication topologies.
The remainder of this paper is organized as follows. Section 2 formulates the problem and introduces necessary notations. Section 3 provides the proposed TV-HSGT algorithm, and Section 4 analyzes its dynamic regret. Section 5 presents numerical studies. Finally, we conclude the paper and discuss future directions in Section 6.
2 Problem Formulation
Consider a networked system composed of $n$ agents, denoted by the set $\mathcal{V} = \{1, \dots, n\}$. The agents communicate through a sequence of time-varying directed graphs $\mathcal{G}_t = (\mathcal{V}, \mathcal{E}_t)$, where $\mathcal{E}_t \subseteq \mathcal{V} \times \mathcal{V}$ represents the set of available communication links at time $t$. If $(j, i) \in \mathcal{E}_t$, agent $i$ can receive information from agent $j$ at time $t$. This work aims to solve the following distributed online optimization problem:
$$\min_{x \in \mathbb{R}^p} f_t(x) = \frac{1}{n} \sum_{i=1}^{n} f_{i,t}(x), \tag{1}$$
where $x \in \mathbb{R}^p$ is the decision variable, and $f_{i,t}$ denotes the local loss function of agent $i$ at time $t$, defined as the expected loss over a local random variable $\xi_{i,t}$, i.e., $f_{i,t}(x) = \mathbb{E}_{\xi_{i,t} \sim \mathcal{D}_{i,t}}[f_{i,t}(x; \xi_{i,t})]$, where $\xi_{i,t}$ is a random variable following the distribution $\mathcal{D}_{i,t}$ at time $t$, and $f_{i,t}(x; \xi_{i,t})$ denotes the loss function under the sampled random variable $\xi_{i,t}$. In practical computation, due to limited computational resources, each agent constructs an unbiased stochastic gradient estimator $\nabla f_{i,t}(x; \xi_{i,t})$ based on the current sample $\xi_{i,t}$, and uses it to update its decision variable. The aim of this study is to design a distributed online optimization algorithm tailored to time-varying directed network topologies, where each agent relies solely on limited computational resources and cooperates with neighbors to effectively minimize $f_t$.
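Definition 1 (Dynamic Regret).
For a sequence of local decisions $\{x_{i,t}\}$ generated by a given online distributed algorithm, the dynamic regret over $T$ time steps is defined as
$$\mathrm{Reg}_T^d = \mathbb{E}\Big[\sum_{t=1}^{T} f_t(\hat{x}_t) - \sum_{t=1}^{T} f_t(x_t^*)\Big],$$
where $\hat{x}_t = \sum_{i=1}^{n} [\phi_t]_i x_{i,t}$ denotes a weighted average of all agents’ decisions at time $t$ with a stochastic weight vector $\phi_t$, and $\{x_t^*\}$ denotes the sequence of minimizers of the global objective functions $f_t$.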
To evaluate the algorithm’s performance in a time-varying environment, this work adopts dynamic regret as the performance metric, defined formally in Definition 1. Dynamic regret quantifies the discrepancy between the cumulative loss of an online algorithm and that of a time-dependent sequence of optimal solutions. Various forms of dynamic regret have been proposed in the literature. In particular, the GTAdam framework carnevale2022gtadam considers the version where $\bar{x}_t = \frac{1}{n} \sum_{i=1}^{n} x_{i,t}$ is the simple average of agents’ decisions. However, GTAdam assumes undirected networks with doubly stochastic weight matrices. In contrast, this work addresses time-varying directed networks, where the weight matrices are not necessarily symmetric or doubly stochastic. Hence, we adopt a weighted average $\hat{x}_t = \sum_{i=1}^{n} [\phi_t]_i x_{i,t}$, as specified in Definition 1, where $\phi_t$ is a stochastic vector used to accommodate such network structures. Compared with static regret, dynamic regret effectively captures the algorithm’s asymptotic behavior relative to the evolving optimal decisions $x_t^*$.
The time-variability and non-stationarity of the problem are characterized by two regularity measures that reflect changes in the objective functions and the evolving optimal solutions. Specifically, characterizes the maximum discrepancy between the gradients of local objective functions across agents at two consecutive time steps, while quantifies the variation between successive optimal solutions. These measures are defined as follows
[TABLE]
We impose the following standard assumptions on the loss functions.
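Assumption 1.
The global objective function $f_t$ is $\mu$-strongly convex, i.e., for any $x, y \in \mathbb{R}^p$, it holds that
$$f_t(y) \ge f_t(x) + \langle \nabla f_t(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2,$$
where $\mu > 0$ is the strong convexity parameter.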
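Assumption 2.
For any agent $i \in \mathcal{V}$, the stochastic gradient estimator $\nabla f_{i,t}(\cdot\,; \xi_{i,t})$ is $L$-Lipschitz continuous in the mean square sense. That is, for some constant $L > 0$ and any $x, y \in \mathbb{R}^p$, the following inequality holds
$$\mathbb{E}\big[\|\nabla f_{i,t}(x; \xi_{i,t}) - \nabla f_{i,t}(y; \xi_{i,t})\|^2\big] \le L^2 \|x - y\|^2.$$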
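Let $\mathcal{F}_t$ denote the $\sigma$-algebra generated by the random samples drawn before time $t$. The following assumption is widely adopted in distributed stochastic optimization and federated learning 9226112 ; pmlr-v139-xin21a ; 10715643 ; 9713700 .
Assumption 3.
For any agent $i \in \mathcal{V}$, its stochastic gradient is unbiased and has bounded variance, i.e.,
$$\mathbb{E}\big[\nabla f_{i,t}(x; \xi_{i,t}) \mid \mathcal{F}_t\big] = \nabla f_{i,t}(x),$$
$$\mathbb{E}\big[\|\nabla f_{i,t}(x; \xi_{i,t}) - \nabla f_{i,t}(x)\|^2 \mid \mathcal{F}_t\big] \le \sigma^2,$$
where $\sigma$ is a finite constant.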
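Under Assumptions 2 and 3, one can derive that $f_{i,t}$ is $L$-smooth, i.e.,
$$\|\nabla f_{i,t}(x) - \nabla f_{i,t}(y)\| \le L \|x - y\|, \quad \forall x, y \in \mathbb{R}^p.$$
This follows from unbiasedness and Jensen's inequality, since $\|\mathbb{E}[u]\|^2 \le \mathbb{E}[\|u\|^2]$ for any random vector $u$. Assumptions 2 and 3 are standard in establishing the convergence of distributed stochastic optimization algorithms pmlr-v139-xin21a ; Huang2024 ; liu2020optimal ; Dinh2022 .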
3 Proposed Algorithms
In this section, based on an improved stochastic gradient tracking scheme, a novel distributed online optimization algorithm called TV-HSGT is proposed to efficiently solve problem (1) over a time-varying directed network.
We define $g_{i,t}(x_{i,t}) = \nabla f_{i,t}(x_{i,t}; \xi_{i,t})$ and $g_{i,t}(x_{i,t-1}) = \nabla f_{i,t}(x_{i,t-1}; \xi_{i,t})$ as the stochastic gradients evaluated at $x_{i,t}$ and $x_{i,t-1}$, respectively, based on the random sample $\xi_{i,t}$. To reduce the variance inherent in stochastic gradient estimation, we adopt a hybrid variance-reduction approach introduced for stochastic optimization problems liu2020optimal ; Dinh2022 ; pmlr-v139-xin21a . Let $v_{i,t}$ denote the hybrid stochastic gradient variable, which is updated as follows
$$v_{i,t} = g_{i,t}(x_{i,t}) + (1 - \beta)\big(v_{i,t-1} - g_{i,t}(x_{i,t-1})\big), \tag{3}$$
where $\beta \in [0, 1]$ is the mixing parameter. This update rule is equivalent to
$$v_{i,t} = \beta\, g_{i,t}(x_{i,t}) + (1 - \beta)\big(v_{i,t-1} + g_{i,t}(x_{i,t}) - g_{i,t}(x_{i,t-1})\big).$$
When $\beta = 1$, the method reduces to the standard stochastic gradient, while for $\beta = 0$, it is equivalent to the stochastic recursive gradient method 10.55553305890.3305951 . Compared to classical variance-reduction methods such as SVRG NIPS2013_ac1dd209 and SAGA Defazio2014 , this hybrid strategy offers improved convergence speed and stability pmlr-v139-xin21a .
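The two limiting cases of the mixing parameter can be checked with a minimal sketch. It assumes the estimator takes the usual hybrid form, a convex combination of the plain stochastic gradient and the recursive estimator; the function name `hybrid_gradient` and its argument names are hypothetical.

```python
import numpy as np


def hybrid_gradient(v_prev, grad_new, grad_old, beta):
    """Assumed hybrid form:
        v = beta * g(x_new) + (1 - beta) * (v_prev + g(x_new) - g(x_old)).

    grad_new and grad_old must be evaluated with the SAME random sample,
    at the current and the previous iterate respectively.
    """
    v_prev = np.asarray(v_prev, dtype=float)
    grad_new = np.asarray(grad_new, dtype=float)
    grad_old = np.asarray(grad_old, dtype=float)
    return beta * grad_new + (1.0 - beta) * (v_prev + grad_new - grad_old)
```

With `beta = 1.0` the function returns `grad_new` unchanged (plain stochastic gradient); with `beta = 0.0` it returns `v_prev + grad_new - grad_old` (recursive estimator). Reusing one sample for both gradient evaluations is what lets the difference `grad_new - grad_old` cancel part of the noise.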
While variance reduction enhances gradient estimation stability, each agent in a distributed setting typically only accesses local information, which may not reflect the global objective direction accurately. To address this, the proposed algorithm incorporates a gradient tracking mechanism for estimating the global gradient direction. In contrast to the commonly used CTA framework 9226112 , our algorithm employs the ATC framework, which supports larger step sizes than the CTA framework cattivelli2009diffusion ; li2024npga . Each agent $i$ maintains three variables: the decision variable $x_{i,t}$, the hybrid stochastic gradient variable $v_{i,t}$, and the gradient tracking variable $y_{i,t}$. In each iteration, all agents execute the following procedures in parallel.
Each agent $i$ sends $x_{i,t} - \alpha y_{i,t}$ to its out-neighbors and receives the corresponding vectors from its in-neighbors $j \in \mathcal{N}^{\mathrm{in}}_{i,t}$, then updates its decision variable as
$$x_{i,t+1} = \sum_{j \in \mathcal{N}^{\mathrm{in}}_{i,t} \cup \{i\}} a_{ij,t} \big(x_{j,t} - \alpha y_{j,t}\big),$$
where $\alpha > 0$ is the step size, and $\mathcal{N}^{\mathrm{in}}_{i,t}$ and $\mathcal{N}^{\mathrm{out}}_{i,t}$ denote the in-neighbor and out-neighbor sets of agent $i$ at time $t$, respectively.
Next, the agent computes the hybrid stochastic gradient $v_{i,t+1}$ using (3). It then forms the gradient tracking increment $y_{i,t} + v_{i,t+1} - v_{i,t}$, transmits $b_{ji,t}\big(y_{i,t} + v_{i,t+1} - v_{i,t}\big)$ to each out-neighbor $j \in \mathcal{N}^{\mathrm{out}}_{i,t}$, and updates its gradient tracking variable by
$$y_{i,t+1} = \sum_{j \in \mathcal{N}^{\mathrm{in}}_{i,t} \cup \{i\}} b_{ij,t} \big(y_{j,t} + v_{j,t+1} - v_{j,t}\big).$$
The detailed execution steps are presented in Algorithm 1.
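The round described above can be sketched in stacked matrix form. This is an illustration under stated assumptions and not the authors' Algorithm 1: the update order (ATC decision step with a row-stochastic `A`, hybrid gradient step, then tracking step with a column-stochastic `B`), the toy quadratic local losses, the fixed four-agent graph, and the names `tv_hsgt_step` and `run_demo` are all hypothetical choices.

```python
import numpy as np


def tv_hsgt_step(x, y, v, A, B, alpha, beta, stoch_grad, rng):
    """One round for all agents; row i of x, y, v belongs to agent i."""
    # Adapt-then-combine decision update with the row-stochastic matrix A.
    x_new = A @ (x - alpha * y)
    # One fresh sample per agent, reused at both x_new and x.
    xi = rng.standard_normal(x.shape)
    g_new = stoch_grad(x_new, xi)
    g_old = stoch_grad(x, xi)
    # Hybrid variance-reduced estimator with mixing parameter beta.
    v_new = beta * g_new + (1.0 - beta) * (v + g_new - g_old)
    # Gradient-tracking update with the column-stochastic matrix B.
    y_new = B @ (y + v_new - v)
    return x_new, y_new, v_new


def run_demo(noise=0.0, steps=2000, alpha=0.05, beta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n, p = 4, 2
    # adj[i, j] = 1 when agent i receives from agent j (self-loops included).
    adj = np.eye(n)
    for i in range(n):
        adj[i, (i - 1) % n] = 1.0  # directed ring
    adj[2, 0] = 1.0                # one extra edge 0 -> 2 breaks the symmetry
    A = adj / adj.sum(axis=1, keepdims=True)  # rows sum to 1
    B = adj / adj.sum(axis=0, keepdims=True)  # columns sum to 1
    c = rng.standard_normal((n, p))           # local targets (toy data)

    def grad(x, xi):
        # Toy local loss 0.5 * ||x - c_i - noise * xi_i||^2, unbiased in xi.
        return x - c - noise * xi

    x = np.zeros((n, p))
    v = grad(x, rng.standard_normal((n, p)))
    y = v.copy()
    for _ in range(steps):
        x, y, v = tv_hsgt_step(x, y, v, A, B, alpha, beta, grad, rng)
    return x, y, v, c
```

Because `B` is column-stochastic and `y` is initialized to `v`, the column sums of `y` and `v` stay equal at every round, which is the tracking property that lets `y` follow the network-wide gradient. A time-varying network would pass a different pair `(A, B)` at each call.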
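The iterative updates rely on two non-negative weight matrices $A_t = [a_{ij,t}]$ and $B_t = [b_{ij,t}]$, consistent with the structure of the directed graph $\mathcal{G}_t$. These matrices satisfy
$$a_{ij,t} > 0 \ \text{if } j \in \mathcal{N}^{\mathrm{in}}_{i,t} \cup \{i\}, \quad a_{ij,t} = 0 \ \text{otherwise}; \qquad b_{ij,t} > 0 \ \text{if } i \in \mathcal{N}^{\mathrm{out}}_{j,t} \cup \{j\}, \quad b_{ij,t} = 0 \ \text{otherwise}.$$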
The following introduces the assumptions related to the time-varying communication networks.
Assumption 4.
For any $t \ge 0$, the directed graph $\mathcal{G}_t$ is strongly connected, and each node has a self-loop, i.e., the edge $(i, i)$ exists for every $i \in \mathcal{V}$.
Assumption 4 can be relaxed to the setting of a periodically strongly connected graph sequence. Specifically, if there exists a positive integer such that for any , the union of edge sets forms a strongly connected graph over consecutive iterations, then the sequence is said to be -strongly connected.
Each agent independently determines the values of for its in-neighbors , while the corresponding values of are determined by its out-neighbors. We further impose the following assumptions on the matrices and .
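Assumption 5.
For any $t \ge 0$, $A_t$ is row-stochastic associated with $\mathcal{G}_t$, i.e., $A_t \mathbf{1} = \mathbf{1}$, and for some constant $a > 0$, it satisfies
$$\min{}^{+}(A_t) \ge a,$$
where $\min^{+}(A_t)$ denotes the smallest positive entry in $A_t$.
Assumption 6.
For any $t \ge 0$, $B_t$ is column-stochastic associated with $\mathcal{G}_t$, i.e., $\mathbf{1}^{\top} B_t = \mathbf{1}^{\top}$, and for some constant $b > 0$, it satisfies
$$\min{}^{+}(B_t) \ge b,$$
where $\min^{+}(B_t)$ denotes the smallest positive entry in $B_t$.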
4 Convergence Analysis
This section presents a theoretical convergence analysis of the proposed TV-HSGT algorithm. We first provide several necessary preliminary lemmas in Subsection 4.1, and then give the main theoretical results in Subsection 4.2.
4.1 Preliminary Lemmas
Prior to conducting the convergence analysis, this subsection introduces several auxiliary lemmas that lay the theoretical foundation for the subsequent main results.
Lemma 1.
Suppose that is -strongly convex and -smooth. Then, for any , if the step size satisfies , the following inequality holds
[TABLE]
where denotes the optimal solution to .
Lemma 2.
For any integer and any set of vectors , it holds that
[TABLE]
Moreover, for any constant , we have
[TABLE]
Lemma 3.
Suppose that is -smooth. Then, the following inequality holds
[TABLE]
where , , , and is a stochastic vector.
Lemma 4.
Given a set of vectors and nonnegative weights satisfying . Then, for any , the following identity holds
[TABLE]
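Lemma 5.
Under Assumptions 4 and 5, there exists a corresponding sequence of stochastic vectors $\{\phi_t\}$ such that
$$\phi_{t+1}^{\top} A_t = \phi_t^{\top}.$$
Moreover, for all $i \in \mathcal{V}$ and $t \ge 0$, it holds that $[\phi_t]_i \ge a^n / n$, where $a$ is the lower bound on the positive entries of $A_t$ in Assumption 5.
Lemma 6.
Let Assumptions 4 and 6 hold. Define the vector sequence $\{\pi_t\}$ by
$$\pi_{t+1} = B_t \pi_t, \quad \pi_0 = \tfrac{1}{n} \mathbf{1}.$$
Then, for any $t \ge 0$, $\pi_t$ is a stochastic vector satisfying $[\pi_t]_i \ge b^n / n$ for all $i \in \mathcal{V}$, where $b$ is the lower bound on the positive entries of $B_t$ in Assumption 6.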
If the graph sequence satisfies the strong connectivity condition over a period of length , then the results of Lemmas 5 and 6 can be extended. Specifically, for all , there exist stochastic vector sequences and such that the following equalities hold nguyen2022distributed ; nedic2023ab ; 10337617
[TABLE]
Moreover, for all , these vector sequences satisfy the following lower bounds
Let be a strongly connected directed graph, and let the weight matrices and be consistent with the structure of . Denote by the diameter of the graph and by its maximal edge utility nedic2023ab . The following lemmas describe the contraction properties satisfied by the matrices and .
Lemma 7.
Let be a row-stochastic matrix, be a stochastic vector, and be a nonnegative vector such that . For a set of vectors , define . Then, it holds that
[TABLE]
where the scalar is defined by
[TABLE]
Lemma 8.
Let be a column-stochastic matrix, and let be a stochastic vector with strictly positive elements, i.e., for all . Let . Then, for any set of vectors , it holds that
[TABLE]
where the scalar is given by
[TABLE]
4.2 Main Results
This subsection establishes the key theoretical results on the convergence of the proposed algorithm. To simplify the mathematical exposition, we uniformly use the notation to denote the expectation operator throughout the subsequent proofs and derivations. Unless otherwise specified, all expectations are interpreted as conditional expectations with respect to the filtration , that is, we adopt the convention . The analysis focuses on bounding four critical error terms in terms of conditional expectations, which are the optimality error , the consensus error , the gradient tracking error , and the hybrid stochastic gradient estimation error . Here, the consensus error is measured by the weighted norm , and the gradient tracking deviation is quantified by , which are defined as follows
[TABLE]
where represents the weighted average of local decision variables. The stochastic weight sequences and are defined by equations (17) and (18), respectively. Moreover, denotes the optimal solution to problem (1) at time . In the later analysis, we denote (same to and ), , , , and .
To facilitate the convergence analysis of the proposed algorithm under time-varying directed topologies, we introduce a set of auxiliary parameters: , , , , , , , and . These quantities are defined as follows
[TABLE]
where and are constant upper bounds for the time-varying quantities and , respectively. Additionally, let denote a uniform lower bound of the inner product . Since and are stochastic vectors, it follows that , and hence . For notational conciseness and in order to establish uniform bounds on the algorithm’s performance, we also introduce constant upper bounds , , and for , , and , respectively. The bounding conditions are then given by
[TABLE]
In the following, we present Lemmas 9 to 16, which establish bounds on several key terms used in the subsequent convergence analysis. Detailed proofs can be found in the appendix.
Lemma 9.
Under Assumptions 2 and 6, the following inequality holds for all
[TABLE]
Lemma 10.
Under Assumptions 4 and 6, the following inequality holds for all
[TABLE]
Lemma 11.
Under Assumptions 1, 2, 3, and 4, if , it holds that for all
[TABLE]
Lemma 12.
Under Assumptions 2, 3, and 4, the following inequality holds for all
[TABLE]
Lemma 13.
Under Assumptions 4 and 5, the following inequality holds for all
[TABLE]
where , and .
Lemma 14.
Under Assumptions 2 and 3, the following inequality holds for all
[TABLE]
Lemma 15.
Under Assumptions 2, 3, and 4, it holds that for all
[TABLE]
Lemma 16.
Under Assumptions 2 and 3, it holds that for all and
[TABLE]
where is defined in (2).
To facilitate the analysis, we establish a coupled relationship among the expectations of the following four error terms by defining the vector as
[TABLE]
Based on the results of the previously established lemmas, the following linear inequality system can be established.
Proposition 1.
Let the collections of sequences , , and be generated by Algorithm 1. Under Assumptions 1–6, the following linear inequality system holds
[TABLE]
where and are vectors given by
[TABLE]
The coefficient parameters are defined as , , and with .
Proof. By applying Lemma 14 to (15), we get the following inequality
[TABLE]
Substituting the result of Lemma 13, which bounds , into (28) gives
[TABLE]
Then, combined with Lemmas 11 and 12, it follows that under the step size condition , the vector satisfies the following dynamical system
[TABLE]
where can be expressed as
[TABLE]
and with
[TABLE]
[TABLE]
By introducing the parameter definitions in (4.2), the entries in are defined as follows
[TABLE]
By substituting the upper and lower bounds of parameters defined in (4.2), the upper bound of can be given by
[TABLE]
satisfying , where the time-varying coefficients can be upper bounded by the following constants
[TABLE]
Here , . Consequently, can be bounded by defined in (31) and (32). Thus, the proof is completed.
To obtain the main theoretical result, we establish a regret bound for the proposed TV-HSGT algorithm under time-varying directed networks. The result demonstrates that the algorithm effectively reduces the variance caused by stochastic gradients.
Theorem 1.
Let the collections of sequences , , and be generated by Algorithm 1. Let Assumptions 1–6 hold and the step size satisfy the condition (46). Then, there exists a constant such that the dynamic regret satisfies
[TABLE]
where is defined in (31) and .
Proof 4.1.
Recall the linear inequality system (30), given by for all . The goal is to determine a feasible range for the step size such that the spectral radius of satisfies . It is sufficient to find a positive vector and a range for such that horn2012matrix . Expanding and rearranging this inequality element-wise, we obtain
[TABLE]
To ensure these inequalities hold for some , the right-hand sides must be positive, which gives a set of constraints on the components of the vector , i.e.,
[TABLE]
We now construct a feasible positive vector that satisfies the conditions (43), (44), and (45). Let us fix . Based on (44), we can set . Plugging this into (45), we select to satisfy
[TABLE]
Finally, based on (43), we set as
[TABLE]
With this choice, is a positive vector satisfying the necessary constraints. Now, substituting these values back into inequalities (39), (40), and (42) to derive upper bounds on yields
[TABLE]
To summarize, with the constructed positive vector and the defined constants (38), together with Lemma 11, a sufficient condition on the step size that guarantees is given by
[TABLE]
Recalling that the local function is -smooth and by the definition , it implies the global function is also -smooth, which satisfies
[TABLE]
Let and . Since is the minimizer of , the first-order optimality condition under Assumption 1 implies . Substituting these into (47) yields
[TABLE]
which simplifies to
[TABLE]
Taking the expectation and summing over from 1 to , we get
[TABLE]
In any finite-dimensional vector space, all norms are equivalent, so there exist constants and satisfying
[TABLE]
Substituting (49) into (48) gives . According to matrix analysis theory horn2012matrix , for any , a matrix norm exists such that
[TABLE]
Letting and defining , we have . Matrix norm submultiplicativity further implies for any matrix and vector . Applying this to the recursion (30), we obtain
[TABLE]
and applying (49) again yields
[TABLE]
Since the geometric sum satisfies , we get
[TABLE]
which further simplifies to
[TABLE]
This completes the proof with .
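The proof above relies on a standard fact from matrix analysis: for an entrywise nonnegative matrix, finding a strictly positive vector that the matrix maps strictly below itself (entry by entry) certifies a spectral radius smaller than one. The sketch below checks this numerically on a toy 2×2 matrix; the matrix entries and the names `certifies_contraction` and `spectral_radius` are hypothetical and unrelated to the paper's actual coefficient matrix.

```python
import numpy as np


def certifies_contraction(M, delta):
    """True when M >= 0, delta > 0 and M @ delta < delta hold entrywise."""
    M = np.asarray(M, dtype=float)
    delta = np.asarray(delta, dtype=float)
    return bool((M >= 0).all() and (delta > 0).all() and (M @ delta < delta).all())


def spectral_radius(M):
    """Largest eigenvalue modulus of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(np.asarray(M, dtype=float)))))


def contraction_bound(M, delta):
    """max_i (M delta)_i / delta_i, an upper bound on the spectral radius
    whenever M is nonnegative and delta is strictly positive."""
    M = np.asarray(M, dtype=float)
    delta = np.asarray(delta, dtype=float)
    return float(np.max((M @ delta) / delta))
```

In the proof, the same test is applied symbolically: the positive vector is constructed first, and the entrywise inequalities are then solved for the admissible step-size range.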
Remark 4.2.
Existing studies have shown that, in general settings, the dynamic regret cannot be guaranteed to grow sublinearly in time li2022survey ; eshraghi2022improving ; Shahrampour2018 ; Notarnicola2023TAC ; Li2021TAC ; Dall2020 ; Mokhtari2016CDC , and the bound may explicitly depend on , the path length related to the changes in the sequence of minimizers. Moreover, some works depend on strong assumptions about objective functions. For example, eshraghi2022improving establishes a bound of the form under the assumptions of strongly convex loss functions and bounded gradients. Shahrampour2018 gives a dynamic regret bound of with , requiring that the local time-varying functions have uniformly bounded gradients and the graph is undirected and connected.
In contrast, Theorem 1 derives an upper bound on dynamic regret without the bounded gradient assumption under a stochastic setting and general time-varying digraphs. Due to the temporal variability of the gradients, the resulting bound incorporates additional error terms. Specifically, Theorem 1 shows that the dynamic regret consists of three components: a term dependent on initial conditions, a noise variance term induced by stochastic gradients, and an error that captures the time-varying nature of the problem, namely and . In particular, the parameter can be properly tuned to reduce variance introduced by stochastic gradients. Moreover, if the temporal variations of both the optimal solution and the objective function’s gradient decay sublinearly, and both the step size and the mixing parameter decrease over time, then the resulting dynamic regret can achieve sublinear convergence.
Specifically, for the static distributed optimization with time-invariant functions (), we can obtain a gradient-tracking based algorithm with variance reduction, as shown in the following corollary.
Corollary 4.3.
For the static case with , when Assumptions 1, 2, 4, 5, 6 hold and satisfies (46) with , it satisfies
[TABLE]
with a linear decay rate of , where denotes the th entry of and .
Remark 4.4.
Corollary 4.3 extends Nguyen2023 by incorporating the hybrid variance-reduction mechanism (3). As seen from the definition of , the resulting error bounds in Corollary 4.3 can be made arbitrarily small by reducing the parameter , which highlights the effectiveness of the variance-reduction strategy. Furthermore, in contrast to the CTA-based gradient tracking framework employed in Nguyen2023 for static distributed optimization, our algorithm adopts an ATC framework adapted for online distributed optimization settings, which has been shown to be superior to the CTA framework cattivelli2009diffusion ; li2024npga , particularly in terms of stability and convergence under dynamic conditions.
5 Numerical Examples
In this section, we evaluate the effectiveness of the proposed TV-HSGT algorithm on two multi-agent distributed learning problems. The first problem is a distributed logistic regression task based on structured data, using the A9A dataset. The second problem is a distributed logistic regression task based on image data, using the MNIST dataset. We compare the performance of the TV-HSGT algorithm with three baseline methods: DSGD lian2017can , DSGT pu2021distributed , and DSGT-HB Gao2023 . All methods adopt a unified strategy for constructing the communication weight matrices. Specifically, in each iteration of TV-HSGT, agents communicate over a time-varying strongly connected directed graph. This graph is constructed by randomly sampling edges from a predefined base directed graph while ensuring strong connectivity is maintained at each round. The communication mechanism follows the AB framework, employing a pair of row-stochastic and column-stochastic matrices for updating the decision and gradient tracking variables, respectively. The weights are uniformly distributed over each node’s in-neighbors or out-neighbors, making the implementation suitable for local computation. In contrast, the baseline methods DSGD, DSGT, and DSGT-HB operate over a fixed complete graph and assign uniform weights across all neighbors, forming symmetric doubly stochastic matrices.
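The graph construction described above can be sketched as follows. This is an illustration under assumptions: rejection sampling is one simple way to keep each sampled graph strongly connected (the text does not say how connectivity is enforced), the reachability test is only intended for small networks, and the names `is_strongly_connected` and `sample_round` are hypothetical.

```python
import numpy as np


def is_strongly_connected(adj):
    """adj[i, j] > 0 means agent i receives from agent j. Small n only."""
    adj = np.asarray(adj)
    n = adj.shape[0]
    step = ((adj > 0) | np.eye(n, dtype=bool)).astype(np.int64)
    # Entry (i, j) of step^(n-1) is positive iff j reaches i in at most n-1 hops.
    reach = np.linalg.matrix_power(step, n - 1)
    return bool((reach > 0).all())


def sample_round(base_adj, keep_prob, rng, max_tries=1000):
    """Sample a strongly connected subgraph of base_adj and its (A, B) weights."""
    base = np.asarray(base_adj) > 0
    n = base.shape[0]
    for _ in range(max_tries):
        adj = base & (rng.random((n, n)) < keep_prob)
        np.fill_diagonal(adj, True)  # self-loops, as required by Assumption 4
        if is_strongly_connected(adj):
            adj = adj.astype(float)
            # Uniform weights over in-neighbors: each row sums to one.
            A = adj / adj.sum(axis=1, keepdims=True)
            # Uniform weights over out-neighbors: each column sums to one.
            B = adj / adj.sum(axis=0, keepdims=True)
            return A, B
    raise RuntimeError("no strongly connected sample found")
```

Calling `sample_round` once per iteration with a shared random generator yields a time-varying sequence of weight pairs that share the same sparsity pattern within each round.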
5.1 Distributed Logistic Regression on Structured Data
This subsection evaluates the performance of the proposed TV-HSGT algorithm on a classification task using the structured A9A dataset with a logistic regression model. The loss function 10806815 is defined as:
[TABLE]
where is the number of samples for agent , is a regularization coefficient, and denotes the sigmoid function. We conduct two groups of experiments: (1) algorithm comparison and (2) parameter sensitivity analysis.
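For orientation, a regularized logistic loss of this general kind can be sketched as follows. The exact loss used in the paper is not reproduced here: the sketch assumes labels in {0, 1}, a mean cross-entropy data term, and an ℓ2 regularizer `(reg / 2) * ||w||^2`, all of which are illustrative assumptions, as are the function names.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_loss(w, batch, reg):
    """Mean cross-entropy over (x, y) pairs with y in {0, 1}, plus (reg/2)*||w||^2.

    The l2 regularizer is an assumption made for this sketch.
    """
    eps = 1e-12  # guards log(0)
    total = 0.0
    for x, y in batch:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(batch) + 0.5 * reg * sum(wi * wi for wi in w)


def logistic_grad(w, batch, reg):
    """Gradient of the data term and regularizer (ignores the eps guard)."""
    g = [reg * wi for wi in w]
    for x, y in batch:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for k, xk in enumerate(x):
            g[k] += (p - y) * xk / len(batch)
    return g


# Toy mini-batch: two features, binary labels.
batch = [([1.0, 2.0], 1), ([-1.0, 0.5], 0), ([0.3, -0.7], 1)]
w = [0.1, -0.2]
loss_value = logistic_loss(w, batch, reg=0.01)
grad_value = logistic_grad(w, batch, reg=0.01)
```

With a positive `reg`, this assumed loss is strongly convex in `w`, which is the setting the convergence analysis relies on.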
We compare TV-HSGT with the online versions of DSGD, DSGT, and DSGT-HB. Following the setup in 10806815 , 10 agents independently receive mini-batches of 100 randomly drawn samples from the pre-shuffled A9A dataset at each round, simulating a dynamic online learning environment. All methods use a fixed step size of 0.001. TV-HSGT adopts a mixing parameter ; DSGT-HB uses a momentum coefficient of 0.9; and regularization is set as for all agents. Figs. 1–3 show that TV-HSGT consistently outperforms all baselines in terms of regret, loss, and accuracy. The hybrid variance reduction design effectively mitigates gradient noise and accelerates convergence, in line with the theoretical results in Theorem 1.
To examine the impact of the mixing parameter , we test values in {0.01, 0.1, 0.2, 0.3, 0.4, 0.5}. Figs. 4–6 show that smaller values lead to better performance, confirming the theoretical insights in Theorem 1. A larger increases gradient noise, degrading performance.
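The role of the mixing parameter can be illustrated with a minimal sketch of a generic hybrid stochastic gradient estimator from the variance-reduction literature. The paper's update (3) is not reproduced in this section, so the form below (a convex combination of the current stochastic gradient and a recursive correction, with weight `beta` on the current gradient) is an assumption, and all names are hypothetical.

```python
import random


def hybrid_estimate(v_prev, g_new, g_old, beta):
    """Generic hybrid estimator (assumed form, not taken from the paper):

        v = beta * g_new + (1 - beta) * (v_prev + g_new - g_old)

    g_new: stochastic gradient at the new iterate for the current sample
    g_old: stochastic gradient at the previous iterate for the same sample
    """
    return [beta * gn + (1.0 - beta) * (vp + gn - go)
            for vp, gn, go in zip(v_prev, g_new, g_old)]


def frozen_iterate_variance(beta, steps=20000, seed=0):
    """Empirical second moment of the estimator when the iterate does not move.

    The true gradient is taken to be 0 and the noise is unit Gaussian, so
    g_new == g_old and the estimator reduces to an exponential moving average.
    """
    rng = random.Random(seed)
    v, acc = 0.0, 0.0
    for _ in range(steps):
        noise = rng.gauss(0.0, 1.0)
        (v,) = hybrid_estimate([v], [noise], [noise], beta)
        acc += v * v
    return acc / steps


var_small = frozen_iterate_variance(0.1)
var_large = frozen_iterate_variance(0.5)
```

In this frozen-iterate toy case, the stationary variance of the estimator is `beta / (2 - beta)` times the noise variance, so a smaller `beta` gives a less noisy estimate. This matches the trend reported in the sensitivity experiments, although the toy case says nothing about tracking a moving optimum.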
5.2 Distributed Logistic Regression on Image Data
To further evaluate the effectiveness of TV-HSGT in visual settings, we conduct experiments on the MNIST dataset using a multi-class logistic regression model with regularization. The loss function is given by
[TABLE]
where is the parameter matrix, and represent the feature vector and label of sample at agent , is the per-round batch size, and is the regularization coefficient.
All experimental settings match those of the structured-data experiments in Subsection 5.1. Each agent processes 100 random images per round. Figs. 7–9 compare the time-averaged regret, loss, and accuracy of the algorithms. The results indicate that TV-HSGT converges fastest, mitigates stochastic gradient noise, and achieves the highest final accuracy, outperforming DSGT-HB, DSGT, and DSGD on this image classification task.
We assess the effect of the mixing parameter on performance. Figs. 10–12 illustrate that smaller values lead to better performance across regret, loss, and accuracy, consistent with our theoretical analysis in Theorem 1.
6 Conclusion
In this work, a novel decentralized online stochastic optimization algorithm named TV-HSGT has been proposed over time-varying directed networks with limited computation. By combining hybrid stochastic gradient estimation and gradient tracking strategies, an improved dynamic regret performance with variance reduction is achieved. An AB communication scheme is employed for a time-varying directed network to ensure consensus without eigenvector estimation. Theoretical analysis and experiments demonstrate the algorithm’s effectiveness in reducing variance and tracking the optimal solution. Future work will focus on improving the communication efficiency of TV-HSGT.
Appendix
Appendix A Proof of Lemma 9
Proof A.1.
To bound , we first apply the triangle inequality of norms to split as
[TABLE]
By the property of the global optimal solution , namely , we obtain
[TABLE]
Since is -Lipschitz continuous, one has
[TABLE]
which leads to
[TABLE]
By applying Lemma 4 with , , and , it can be derived that
[TABLE]
Noting that , we derive
[TABLE]
Combining (50) and (A.1), it holds that
[TABLE]
Taking the conditional expectation completes the proof.
Appendix B Proof of Lemma 10
Proof B.1.
Under the given assumptions, Lemma 6 ensures that all components of the stochastic vector are strictly positive. The scaling is therefore well-defined for all and . By definition, we have
[TABLE]
Applying Lemma 4 with , , and , it holds that
[TABLE]
Taking the conditional expectation on both sides and applying Lemma 9 completes the proof.
Appendix C Proof of Lemma 11
Proof C.1.
According to the update rule in (11), it follows that , so that . Introducing the auxiliary term , where , the error can be decomposed as
[TABLE]
Applying Lemma 2, the following inequality holds
[TABLE]
Since is -strongly convex, Lemma 1 implies that if the step size satisfies , then By Lemma 3, we obtain Since and based on the definition of the gradient tracking error, it holds that
[TABLE]
Applying Lemma 4 with , , and , we obtain
[TABLE]
Therefore, from the definition of in (20), we have
[TABLE]
Combining the results above, and under the condition that , we have
[TABLE]
Finally, choosing ensures convergence and completes the proof.
Appendix D Proof of Lemma 12
Proof D.1.
Since and , it follows that . Taking the -norm on both sides and applying Lemma 2, we obtain
[TABLE]
Both terms and conform to the structure of Lemma 7, with and for all . In addition, Lemma 5 implies that . Letting , , and , and substituting into Lemma 7, we obtain Using the upper bound of , this gives
[TABLE]
Similarly, it follows that
[TABLE]
To bound , we apply Lemma 4 with , , and . Then, we have
[TABLE]
where , and . Therefore,
[TABLE]
Letting , we obtain
[TABLE]
Taking the conditional expectation and applying Lemma 10 completes the proof.
Appendix E Proof of Lemma 13
Proof E.1.
By adding and subtracting , we obtain where the inequality follows from the update rule of in Equation (11) and the triangle inequality. Expanding the norms and applying Lemma 7 yield
[TABLE]
Using inequalities (56) and (D.1) together with the definition , we obtain
[TABLE]
By employing the norm inequality as given in Equation (55) and invoking Lemma 2, we derive
[TABLE]
Taking expectation on both sides and applying the bound from Lemma 10 yields the desired result.
Appendix F Proof of Lemma 14
Proof F.1.
Based on the update rule of the hybrid stochastic gradient estimator given in Equation (3), the update difference between and can be expressed as
[TABLE]
Applying the norm inequality and Lemma 2, we decompose into three terms
[TABLE]
From Assumption 2, the stochastic gradient is -Lipschitz continuous, and hence
Furthermore, decomposing the variance of stochastic gradients and temporal variation yields
[TABLE]
where denotes the variance from the stochastic gradients due to Assumption 3, and is defined in (2).
Combining the bounds above, we obtain
[TABLE]
Substituting the bound from Lemma 13 into the expression completes the proof.
Appendix G Proof of Lemma 15
Proof G.1.
Since is a column-stochastic matrix, the update rule of the gradient tracking variable can be written compactly as
[TABLE]
By multiplying both sides with and subtracting the state , we obtain
[TABLE]
Define , and . We analyze and separately.
For , we have
[TABLE]
where the inequality is based on Lemma 8, by taking , , , and , together with the definition of .
Taking conditional expectation and applying , we obtain
[TABLE]
For , we define and , then
[TABLE]
where is defined in (4.2). Then, applying Lemma 2, it can be derived that
[TABLE]
Choosing and substituting into (G.1) yields the desired result.
Appendix H Proof of Lemma 16
Proof H.1.
Define the stochastic gradient noise at agent and time as , and an auxiliary noise term , where the randomness is induced by . Note that but generally due to the time-varying objective functions.
Let and . It can be derived that
[TABLE]
Moreover, for any , we have
[TABLE]
By applying Assumptions 2 and 3, we have and
[TABLE]
which implies that
[TABLE]
Then, substituting (61) and (62) into (60) results in (28).
Appendix I Proof of Corollary 4.3
Proof I.1.
When , the previous Lemmas 14 and 16 related to the time-varying term can be revised as follows. Following the proof of Lemma 14, we have
[TABLE]
where the above inequalities use Lemma 2 and Assumptions 2 and 3. Hence, we obtain
[TABLE]
For Lemma 16, we define and . Then, one can reorganize (63) as
[TABLE]
where the first inequality holds due to , and the second inequality is obtained by applying and Assumption 2.
With these modifications, one can derive a new positive matrix element-wise, sharing the same structure as but with slightly different numerical coefficients and . In this case, the following inequality system holds
[TABLE]
with . By iteratively expanding this inequality, we get
[TABLE]
Since the spectral radius , we have . Therefore, the first term tends to zero as with a linear decay rate of . Next, consider the sum , which is a geometric series that can be written as
[TABLE]
As , , so the above expression simplifies to
[TABLE]
Therefore, when , with a linear convergence rate of .
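The limiting argument above can be checked numerically with a small sketch. The matrix `M` and vector `b` below are hypothetical stand-ins and are not the paper's coefficients. The sketch treats the inequality system as an equality, iterates `v <- M v + b` for a nonnegative `M` with spectral radius below one, and compares the result with the closed-form limit `(I - M)^{-1} b`.

```python
def mat_vec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def iterate(M, b, v0, steps):
    """Run v <- M v + b for the given number of steps."""
    v = list(v0)
    for _ in range(steps):
        v = [mv + bi for mv, bi in zip(mat_vec(M, v), b)]
    return v


def limit_2x2(M, b):
    """Solve (I - M) x = b for a 2x2 matrix M via the explicit inverse."""
    a, c = 1.0 - M[0][0], -M[0][1]
    d, e = -M[1][0], 1.0 - M[1][1]
    det = a * e - c * d
    return [(e * b[0] - c * b[1]) / det, (a * b[1] - d * b[0]) / det]


# Hypothetical example: the eigenvalues of M are 0.7 and 0.4.
M = [[0.5, 0.2], [0.1, 0.6]]
b = [1.0, 1.0]
v_limit = limit_2x2(M, b)
v_200 = iterate(M, b, v0=[10.0, -3.0], steps=200)
gap = max(abs(x - y) for x, y in zip(v_200, v_limit))
```

In this example the gap to the limit shrinks by roughly the spectral radius (0.7) per step, independently of the starting vector, which mirrors the linear rate stated in the proof.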
- [1] Duong Thuy Anh Nguyen, Duong Tung Nguyen, and Angelia Nedic. Distributed stochastic optimization with gradient tracking over time-varying directed networks. In 2023 57th Asilomar Conference on Signals, Systems, and Computers, pages 1605–1609, 2023.
- [2] Xuanyu Cao and Tamer Başar. Decentralized online convex optimization with compressed communications. Automatica, 156:111186, 2023.
- [3] Xuanyu Cao, Junshan Zhang, and H. Vincent Poor. Online stochastic optimization with time-varying distributions. IEEE Transactions on Automatic Control, 66(4):1840–1847, 2021.
- [4] Guido Carnevale, Francesco Farina, Ivano Notarnicola, and Giuseppe Notarstefano. GTAdam: Gradient tracking with adaptive momentum for distributed online optimization. IEEE Transactions on Control of Network Systems, 10(3):1436–1448, 2022.
- [5] Federico S. Cattivelli and Ali H. Sayed. Diffusion LMS strategies for distributed estimation. IEEE Transactions on Signal Processing, 58(3):1035–1048, 2009.
- [6] Yiyue Chen, Abolfazl Hashemi, and Haris Vikalo. Accelerated distributed stochastic nonconvex optimization over time-varying directed networks. IEEE Transactions on Automatic Control, 70(4):2196–2211, 2025.
- [7] Ziqin Chen and Yongqiang Wang. Local differential privacy for decentralized online stochastic optimization with guaranteed optimality and convergence speed. IEEE Transactions on Automatic Control, pages 1–16, 2024.
- [8] Emiliano Dall’Anese, Andrea Simonetto, Stephen Becker, and Liam Madden. Optimization and learning with information streams: Time-varying algorithms and applications. IEEE Signal Processing Magazine, 37(3):71–83, 2020.
