TL;DR
This paper introduces two online algorithms for real-time identification of causality graphs from multivariate time series using VAR models, suitable for big data and dynamic environments, with proven asymptotic optimality.
Contribution
It develops two novel online algorithms for tracking time-varying causality graphs from VAR models, with theoretical performance guarantees and applicability to large-scale data.
Findings
Algorithms achieve asymptotic performance matching batch estimators.
Algorithms have constant complexity per update, suitable for big data.
Numerical results validate effectiveness in static and dynamic scenarios.
Abstract
Causality graphs are routinely estimated in social sciences, natural sciences, and engineering due to their capacity to efficiently represent the spatiotemporal structure of multivariate data sets in a format amenable for human interpretation, forecasting, and anomaly detection. A popular approach to mathematically formalize causality is based on vector autoregressive (VAR) models and constitutes an alternative to the well-known, yet usually intractable, Granger causality. Relying on such a VAR causality notion, this paper develops two algorithms with complementary benefits to track time-varying causality graphs in an online fashion. Their constant complexity per update also renders these algorithms appealing for big-data scenarios. Despite using data sequentially, both algorithms are shown to asymptotically attain the same average performance as a batch estimator which uses the entire…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Online Topology Identification from Vector Autoregressive Time Series
Bakht Zaman, Luis Miguel Lopez Ramos,
Daniel Romero, and Baltasar Beferull-Lozano The work in this paper was supported by the SFI Offshore Mechatronics grant 237896/E30, the PETROMAKS Smart-Rig grant 244205 and the IKTPLUSS Indurb grant 270730/O70 from the Research Council of Norway. The authors are with the WISENET Lab, Dept. of ICT, University of Agder, Jon Lilletunsvei 3, Grimstad, 4879 Norway. E-mails:{bakht.zaman, luismiguel.lopez, daniel.romero, baltasar.beferull}@uia.no.The material in this work was presented, in part, at CAMSAP 2017 [1].
Abstract
Causality graphs are routinely estimated in social sciences, natural sciences, and engineering due to their capacity to efficiently represent the spatiotemporal structure of multi-variate data sets in a format amenable for human interpretation, forecasting, and anomaly detection. A popular approach to mathematically formalize causality is based on vector autoregressive (VAR) models and constitutes an alternative to the well-known, yet usually intractable, Granger causality. Relying on such a VAR causality notion, this paper develops two algorithms with complementary benefits to track time-varying causality graphs in an online fashion. Their constant complexity per update also renders these algorithms appealing for big-data scenarios. Despite using data sequentially, both algorithms are shown to asymptotically attain the same average performance as a batch estimator which uses the entire data set at once. To this end, sublinear (static) regret bounds are established. Performance is also characterized in time-varying setups by means of dynamic regret analysis. Numerical results with real and synthetic data further support the merits of the proposed algorithms in static and dynamic scenarios.
I Introduction
Inferring causal relations among time series finds countless applications in social sciences, natural sciences, and engineering. These relations are typically encoded as the edges of a causality graph, where each node corresponds to a time series, and oftentimes reveal the topology of e.g. an underlying social, biological, or brain network [2]. Causality graphs may also offer valuable insights into the spatio-temporal structure of time series and assist data processing tasks such as forecasting [3], signal reconstruction [4], anomaly detection [5], and dimensionality reduction [6]. In some applications, graphs capturing different forms of causality can be constructed based on domain knowledge; see e.g. [7, Ch. 8]. However, this approach is often impractical in the aforementioned applications due to the large dimension of the data or because such prior knowledge is unavailable. Instead, causality graphs need to be inferred from data in these situations. This paper accomplishes this task in an online fashion.
Identifying graphs capturing the spatiotemporal “interactions” among time series has attracted great attention [2, 8]. Some approaches focus on instantaneous interactions, i.e., they disregard the temporal structure. The simplest one is to connect two nodes if the sample correlation between the associated time series exceeds a certain threshold [2]. To distinguish mediated from unmediated interactions [2, Sec. 7.3.2], one may resort to conditional independence, partial correlations, Markov random fields, or other approaches in graph signal processing; see e.g. [9, 10, 11, 7, 12, 13]. For directed interactions, one may employ structural equation models (SEM) [14] (see also [15] and references therein) or Bayesian networks [7, Sec. 8.1]. However, these methods account only for memoryless interactions, i.e., they cannot accommodate delayed interactions where the value of a time series at a given time instant is related to the past values of other time series.
The earliest effort to formalize the notion of causality among time series is due to Granger [16] and relies on the rationale that the cause precedes the effect. A time series is said to be Granger-caused by another if the optimal prediction error of the former is decreased when the past of the latter is taken into account. Albeit elegant, this definition is generally impractical since the optimal prediction error is difficult to determine [17, p. 33], [18]. Thus, alternative causality definitions based on vector autoregressive (VAR) models are typically preferred [19, 20, 21]. VAR causality is determined from the support of VAR matrix parameters and is equivalent to Granger causality [22, Chap. 2] in certain cases (yet sometimes treated as equivalent [20, 21]). VAR causality is further motivated by the widespread usage of VAR models to approximate the response of systems of linear partial differential equations [23] and, more generally, in disciplines such as econometrics, bio-informatics, neuroscience, and engineering [24, 25, 26]. VAR topologies are estimated assuming Gaussianity and stationarity in [27, 28] and assuming sparsity in [29, 30, 31, 32]. All these approaches assume that the graph does not change over time. Since this is not the case in many applications, approaches have been devised to identify undirected time-varying topologies [33, 34] and directed piecewise-constant time-varying topologies [35].
The complexity of all previously discussed approaches becomes prohibitive for long observation windows since they process the entire data set at once and cannot accommodate data arriving sequentially. The modern approach to tackle these issues is online optimization, where an estimate is refined with every new data instance. Existing online topology identification algorithms include [36, 15],[37, 38, 39], and [40], but they only account for memoryless interactions.
The present work is the first to propose online algorithms to estimate the memory-aware causality graphs associated with a collection of time series111The related work in [41] was run in parallel and published after the conference version [1] of this work.. We take as a starting point an online algorithm for estimating directed VAR causality graphs which basically minimizes a sequential, sparse topology identification criterion by means of a composite-objective iteration [42]. This procedure, which we termed TISO (Topology Identification via Sparse Online learning) throughout the paper, promotes sparse updates and enjoys constant computational complexity and memory requirements per iteration, which renders it suitable for sequential and big-data scenarios. Building upon this basic algorithm, the contributions of the present paper include the derivation of a more advanced algorithm, theoretical results that characterize the performance of both algorithms, and empirical validation of their performance through extensive experiments with synthetic and real data sets.
The proposed algorithm is named Topology Identification via Recursive Sparse Online learning (TIRSO), which substantially improves the tracking performance of TISO and robustness to input variability by minimizing a novel estimation criterion inspired by recursive least squares (RLS) where the instantaneous loss function accounts for past samples. TIRSO inherits certain benefits of TISO but incurs a moderate increase in computational complexity, which is still constant per iteration.
We summarize our theoretical results, which constitute the main contribution of our paper: (R1) it is established that the hindsight solution of TISO and TIRSO are asymptotically the same; (R2) The performance of TISO and TIRSO is analyzed in terms of static regret bounds, which are sublinear and suggest that TIRSO outperforms TISO. Hence, in the long run, these algorithms perform as well as the best (batch) predictor in hindsight, which supports their adoption for online topology identification. The static regret analysis goes beyond simply stating that the regret is sublinear (which is a direct consequence of applying the algorithm in [42] to the aforementioned criterion), but rather establishes a bound based on properties of the time series that can be checked in practice; (R3) A logarithmic regret bound is proved for TIRSO (such a bound has been proven for TIRSO and could not be proven for TISO thanks to the strong convexity of the loss function). (R4) To analyze the performance of TIRSO when the topology is time-varying, a dynamic regret bound is derived. Moreover, the steady-state error of TIRSO in time-varying scenarios is quantified in terms of the data properties. Remarkably, the performance (regret) analysis does not require probabilistic assumptions, which endows the developed approaches with high generality.
The conference version [1] of this work presents two online algorithms that are different from the algorithms presented here. One is based on the subgradient approximation for regularized RLS proposed in [43] and has computational complexity comparable to that of TIRSO, and the other one is based on a block coordinate minimization via Newton’s method and has lower computational complexity for large networks with small process order. In addition, no convergence guarantees were provided.
The rest of the paper is organized as follows: Sec. II presents the model, a batch estimation criterion, and background on online optimization. Sec. III develops TISO and TIRSO. Sec. IV and Sec. V respectively assess performance analytically and via simulations, whereas Sec. VI concludes the paper. All code will be made public at the authors’ websites.
Notation. Bold lowercase (uppercase) letters denote column vectors (matrices). Operators , , , , , , , , , and respectively denote expectation, gradient, subgradient, sub-differential, matrix transpose, vectorization, maximum eigenvalue, range or column space, pseudo-inverse, and diagonal of a matrix. Symbols , , , and respectively represent the all-zero vector of size , the all-ones vector of size , the all-zero matrix of size , and the size- identity matrix. Also, . For functions and , the notation means . The operator is the indicator satisfying if is true and otherwise. Finally, for time series, the notation corresponds to .
II Preliminaries
After outlining the notion of directed causality graphs, this section reviews how these graphs can be identified in a batch fashion. Later, the basics of online optimization are described.
II-A Directed Causality Graphs
Consider a collection of time series , , where denotes the value of the -th time series at time . A causality graph is a graph where the -th vertex in is identified with the -th time series and there is an edge (or arc) from to (i.e. ) if and only if (iff) causes according to a certain causality notion. For the reasons outlined in Sec. I, a prominent notion of causality described later in this section can be defined using VAR models. To this end, let and define a VAR time series as a sequence generated by the order- VAR model[22]
[TABLE]
where , are the VAR parameters222For the sake of clarity, matrices are deemed constant throughout this section. However, all the notions explained here can be easy generalized to time-varying scenarios, as detailed in subsequent sections. and is the innovation process. This process is generally assumed to be a temporally white, zero-mean stochastic process, i.e., and for . Yet, the present work does not even need to assume that is random, which benefits its generality; see the remark at the end of Sec. IV. With the -th entry of , expression (1) becomes
[TABLE]
for , where and . Recognizing the convolution operation in the right-hand side enables one to express (2) as in signal processing notation. Thus, in a VAR model, equals the sum of noise and the output of linear time-invariant filters where the -th filter has input and coefficients .
When is a zero-mean and temporally white stochastic process, the term in (2) is the minimum mean square error estimator of given the previous values of all time series ; see e.g. [18, Sec. 12.7]. The set therefore collects the indices of those time series that participate in this optimal predictor of or, alternatively, the information provided by time series with is not informative to predict . This motivates the following definition of causality, which embodies the spirit of Granger causality (see Sec. I): VAR-causes whenever . Equivalently, VAR-causes if . A detailed comparison with Granger causality lies out of scope, yet it is worth mentioning that the main distinction lies in the prediction horizon333Whereas VAR causality just pertains to prediction 1 time instant ahead, Granger causality involves prediction of all future samples , given the ones up to a certain time instant . Therefore VAR causality implies Granger causality, but the converse is false.; see [22, Sec. 2.3.1] for a more detailed comparison. VAR causality relations among the time series can be represented using a causality graph where . Clearly, in such a graph, is the in-neighborhood of node . To quantify the strength of these causality relations, a weighted graph can be constructed by assigning e.g. the weight to the edge .
With these definitions, the batch problem of identifying a VAR causality graph reduces to estimating the VAR coefficient matrices given and the observations . To simplify notation, form the tensor by stacking the matrices along the third dimension as shown in Fig. 1.
II-B Batch Estimation Criterion for Topology Identification
This section presents an estimation criterion to address the batch problem formulated in Sec. II-A. A natural estimate could be pursued through least-squares by minimizing [22]
[TABLE]
This estimation task becomes underdetermined unless the number of available data samples meaningfully exceeds the number of unknowns , or, equivalently, . Even more, to obtain a reasonable performance, one requires which may not be possible in practice, especially if the parameters remain constant only for short periods of time. To circumvent this limitation, one may note that most causality relations between two time series will be mediated by one or more time series. This means that the causality graph introduced in Sec. II-A is expected to be sparse, meaning that many of the vectors equal zero. Such a sparsity structure can be promoted by properly regularizing the aforementioned least squares objective. To this end, the following criterion has been proposed in [29]:
[TABLE]
where is a regularization parameter that can be adjusted e.g. via cross-validation [7, Ch. 1]. The second term in (3) is conventionally referred to as a group-lasso444Although other norms (such as the sum of infinity norms) can be used to enforce group sparsity, recoverability results associated with this norm are provided in [29]. regularizer and the solution to (3) as a group-lasso estimate [44]. This promotes a group-sparse structure in to exploit the information that the number of edges in is typically small. Self-connections (, ) are excluded from the regularization term so that the inferred causal relations pertain to the component of each time series that cannot be predicted using its own past. This is motivated by the improvement in consistency reported in [29]. The criterion (3) can be further motivated on the grounds of the consistency of group-lasso estimators [45].
Remarkably, (3) separates along . To see this, let and
[TABLE]
and express as , where and . Then, (3) becomes with
[TABLE]
for . Thus, the VAR causality graph can be identified by separately estimating the VAR coefficients, and hence incoming edge weights, for each node.
The batch estimation criterion in (5) requires all data before an estimate can be obtained and cannot track changes. Furthermore, solving (5) eventually becomes prohibitively complex for sufficiently large . To address these challenges, this paper adopts the framework of online optimization, which is reviewed in the following subsection.
Remark: As seen in (3), is the same for all candidate edges . This can be readily replaced with an edge-dependent regularization parameter without any complexity increase to exploit possibly available prior-information about edges.
II-C Background on Online Optimization
This section reviews the fundamental notions of online optimization from a general perspective, not necessarily applied to the problem of topology identification. To this end, consider the generic unconstrained optimization problem
[TABLE]
where is a convex function, which in many applications depends on the data received at time . For example, in least squares , where and are the data vector and matrix made available at time . To solve (6), it is necessary that all be available. Approaches that process all data at once are termed batch and, hence, suffer from potentially long waiting times, which generally render them inappropriate for real-time operation. Besides, computational complexity and memory generally grow super-linearly with , which eventually becomes prohibitive.
Online algorithms alleviate these limitations. Let denote an estimate of the solution to (6) at time produced by an online algorithm. Online algorithms compute a new every time a new data element (or, more generally, a new ) is processed. At every iteration, also known as update, is obtained from , , , and possibly some additional information carried from each update to the next. The memory requirements and number of arithmetic operations per iteration must not grow unbounded for increasing . This requirement rules out approaches involving solving (6) as a batch problem per update or carrying all the past data from the -th update to the -th update. Thus, online algorithms are especially appealing when data vectors are received sequentially or is so large that batch solvers are not computationally affordable. Additionally, online algorithms can track changes in the underlying model. When represents a probabilistic model parameter that must be estimated, the estimate obtained through an online method is generally capable of tracking variations in so long as they do not occur too rapidly.
The most common performance metric to evaluate online algorithms is the regret, which quantifies the cumulative loss incurred by an online algorithm relative to the loss corresponding to the optimal constant solution in hindsight. Formally, the (static) regret555The static regret is known simply as regret in earlier works, e.g. [46], and different types of regret were formalized later, see e.g. [47]. at iteration is given by [46]:
[TABLE]
where is the optimal constant hindsight solution, i.e., the batch solution after data vectors have been processed. Observe that the regret in (7) may be negative since the estimates are allowed to depend on and, hence, it may hold that for multiple (potentially all) values of . In practice, nevertheless, the regret will typically be positive and increase with . To be deemed admissible, online algorithms must yield a sublinear regret, i.e., as . Thus, online algorithm with sublinear regret perform asymptotically as well as the batch solution on average. It is worth noting that the online learning framework does not involve statistical assumptions on the data, which can even be generated by an “adversary” [48].
In dynamic settings where the parameters of the data generating process vary over time, may not be a suitable reference since its computation involves potentially very old data, namely , which is informative about old values of the true parameters but not about the new values. In those cases, it is customary to compare against the instantaneous minimizer by means of the so-called dynamic regret [47], [49]:
[TABLE]
More details about the dynamic regret are given in Sec. IV-C.
III Online Topology Identification
This section develops online algorithms for the considered problem of topology identification from time series. To this end, cast (5) for the -th node in the form (6) by setting
[TABLE]
for . The most immediate approach to solve (6) would be applying online subgradient descent (OSGD), whose updates are given by with a subgradient of at and the step size at time . From (8), equals plus times a valid subgradient of the form
[TABLE]
evaluated at . For example, for , set for and for . It is easy to see that the resulting iterates are not necessarily sparse; see also [42]. Since the solution to the batch problem is indeed sparse for a properly selected , alternative approaches are required
To this end, note that OSGD fails to provide sparse iterates because it implicitly linearizes the instantaneous objective . Since the regularizer (last term in (8)) is not differentiable, it is not well approximated by a linear function and, as a result, it fails to promote sparsity. To address this issue, composite algorithms decompose as , where is a convex loss function and is a convex regularizer, and linearize only . Algorithms of this family, which include regularized dual averaging (RDA) [50] and composite objective mirror descent (COMID) [42], solve the generic problem
[TABLE]
by linearizing but not . This work focuses on COMID since, unlike RDA, there exist bounds for its regret for constant step size when the regularizer is not strongly convex. The COMID update is
[TABLE]
where is a subgradient of at point (that is, ), is a step size, and is the so-called Bregman divergence associated with a -strongly convex and continuously differentiable function . The strong convexity condition means that , which motivates using as a surrogate of a distance between and . Thus, the Bregman divergence in (10) penalizes updates lying far from the previous one , which essentially smoothes the sequence of iterates.
Relative to each term in (9), the loss in (10) has been linearized but the regularizer has been kept intact. When is a sparsity-promoting regularizer, then the online estimate is therefore expected to be sparse.
In view of these appealing features, the algorithm proposed in Sec. III-A builds upon COMID to address the problem of online causality graph identification from time series.
III-A Topology Identification via Sparse Online optimization
This section proposes topology identification via sparse online optimization (TISO), an online algorithm for the problem in Sec. II-B that provides a causality graph estimate every time a new is processed. The key idea of this first algorithm is to refine the previous topology estimate with the information provided by the new data vector by means of a COMID update.
To this end, express in (8) in the form by setting
[TABLE]
for
To choose , note that (10) with and given by (11) can be solved in closed form when . In that case, and can be found via a modified group soft-thresholding operator, as detailed next. With these expressions, the TISO update after processing is
[TABLE]
where
[TABLE]
and (using the vector defined in (4))
[TABLE]
To solve (12) in closed form, expand the squared norm in (13) to obtain
[TABLE]
where and . From (15), it can be observed that the updates in (12) can be computed separately for each group .
For , the -th subvector of (or -th group) can be expressed in terms of the so-called multidimensional shrinkage-thresholding operator [51] as:
[TABLE]
where . Expression (16) is composed of two terms: whereas is the result of performing a gradient-descent step in a direction that decreases the instantaneous loss , the second term promotes group sparsity by setting for those groups with . Recalling that each vector corresponds to an edge in the estimated causality graph (see Sec. II-A), expression (16) indicates that only the relatively strong edges (i.e. causality relations) survive. In view of such a shrinkage operation, a larger will result in sparser estimates.
On the other hand, when , the -th subvector of in (12) is given by:
[TABLE]
and, as intended, no sparsity is promoted on self-connections; see Sec. II-B. Combining (16) and (17), the estimate of the -th group at time is given by:
[TABLE]
The performance of TISO depends on the choice of the step-size sequence , as discussed in Sec. IV. The overall TISO algorithm is listed as Procedure 1. It only requires memory entries to store the last data vectors and the last estimate. On the other hand, each update requires arithmetic operations, which is in the same order as the number of parameters to be estimated. Thus, TISO can arguably be deemed a low-complexity algorithm.
The next section will build upon TISO to develop an algorithm with increased robustness to input variability.
III-B Topology Identification via Recursive Sparse Online optimization
As seen in Sec. III-A, each update of TISO depends on the data through the instantaneous loss , which quantifies the prediction error of the newly received vector when the VAR parameters are given by the previous estimate . Thus, the residual of predicting each data vector is used only in a single TISO update. Although this renders TISO a computationally efficient algorithm for online topology identification, it also increases sensitivity to noise and input variability. To this end, this section pursues an alternative approach at the expense of a moderate increase in computational complexity and memory requirements.
It is clear from (12) that is determined by and . The latter incorporates the residual only at time . The step size controls how much variability in the input data propagates to the estimates . When a diminishing step-size sequence is adopted, the influence of each new on the estimate becomes arbitrarily small, and the variability of the estimates fades away. However, decreasing sequences cannot be utilized when the application at hand demands tracking changes in the coefficients . In these settings, a constant step size is preferable. In such a scenario, a desire to reduce output variability would therefore force one to adopt a small , but this would hinder TISO from tracking changes in the topology.
An approach to reduce output variability without sacrificing tracking capability will be developed next by drawing inspiration from the connections between TISO, the least mean squares (LMS) algorithm, and the recursive least squares (RLS) algorithm [52]. Indeed, observe that TISO generalizes LMS, which is recovered for . To speed up convergence and reduce variability in the output of LMS, it is customary to resort to RLS, which accommodates the received data in a more sophisticated fashion, allowing to control the influence of each data vector on future estimates through forgetting factors.
Along these lines, the trick is to replace the instantaneous loss in (11) with a running average loss. To maintain tracking capabilities, a heavier weight is assigned to recent data using the exponential window customarily adopted by RLS. Specifically, consider setting in (11) with
[TABLE]
where is the user-selected forgetting factor and is set to normalize the exponential weighting window, i.e., .
Having specified a loss function, the next step is to derive the update equation. In a direct application of COMID to solve (9) with , each iteration would involve the evaluation of the gradient of the terms of . The computational complexity per iteration would grow with and, therefore, the resulting updates would not make up a truly online algorithm according to the requirements expressed in Sec. II-C. To remedy this issue, the structure of (19) will be exploited next to develop an algorithm with constant memory and complexity per iteration. To this end, expand and rewrite (19) to obtain
[TABLE]
where
[TABLE]
The variables and can be respectively thought of as a weighted sample autocorrelation matrix and a weighted sample cross-correlation vector. The key observation here is that, as occurs in RLS, these quantities can be updated recursively as and . Noting that
[TABLE]
and letting , the estimate after receiving becomes
[TABLE]
where
[TABLE]
Proceeding similarly to Sec. III-A yields the update
[TABLE]
where . Due to the recursive nature of the updates for and , the resulting algorithm is termed Topology Identification via Recursive Sparse Online optimization (TIRSO) and tabulated as Procedure 2.
The choice of the step size affects the convergence properties of TIRSO, as analyzed in Sec. IV. Regarding step size selection, natural choices include (i) constant step size, which is convenient in dynamic setups where changes in the coefficients need to be tracked over time (see \threfth:dynamicregretbound) but also gives rise to performance guarantees in static scenarios (\threfprop:regrettiso and \threfprop:regrettirso in the supplementary material); (ii) diminishing step size, commonly in the form of or (see \threfth:strongconvexitytirso); or (iii) an adaptive step size that depends on the data, as discussed at the end of Sec. IV-C.
Observe that only needs to be updated once per observed sample , whereas the vector need to be updated for each at every . The computational complexity is dominated by step 7, which is operations per . However, exploiting the group-sparse structure of may reduce the computation by disregarding the columns of corresponding to the zero entries of . If, for instance, the number of edges is , then the complexity of TIRSO becomes per . Regarding memory complexity, TIRSO requires memory positions to store and positions to store .
IV Theoretical Results
In this section, the performance of TISO and TIRSO is analyzed. The upcoming results will make use of one or more of the following assumptions:
- A1.
Bounded samples: There exists {\color[rgb]{0,0,0}B}_{y}\!>0\! such that |y_{n}[t]|^{2}\leq{\color[rgb]{0,0,0}B}_{y}~{}\forall\,n,t. 2. A2.
Bounded minimum eigenvalue of : There exists such that . 3. A3.
Bounded maximum eigenvalue of : There exists such that . 4. A4.
Asymptotically invertible sample covariance: There exists and such that
[TABLE]
Note that A1 entails no loss of generality in real-world applications, where
data are bounded and thus {\color[rgb]{0,0,0}B}_{y} necessarily exists. A2 usually holds in practice unless the data is redundant, meaning that some time series can be obtained as a linear combination of the others. In general, the latter will not be the case e.g. if the data adheres to a continuous probability distribution, in which case is positive definite for all with probability 1. A3 will also hold in practice since it can be shown that it is implied by A1. In particular, if A1 holds, then A3 holds with L=PN{\color[rgb]{0,0,0}B}_{y}. Similarly, A26 will also generally hold since it is a weaker version of A2.
Next, the asymptotic equivalence of the batch solutions for TISO and TIRSO is established.
IV-A Asymptotic Equivalence between TISO and TIRSO
To complement the arguments given in Sec. III-B to support the decision of setting , which laid the grounds to develop TIRSO, we establish that the batch problems that TISO and TIRSO implicitly solve become asymptotically equivalent as . To this end, let denote the hindsight solution for TISO, which is given by
[TABLE]
where
[TABLE]
Observe that (28) is identical to the objective in the batch criterion (5). Likewise, let denote the hindsight solution of TIRSO, which is given by
[TABLE]
with
[TABLE]
In this case, (30) no longer coincides with the objective in (5). Therefore, one can argue that the TIRSO algorithm is not pursuing the estimates that minimize the batch criterion (5). This idea is dispelled next by establishing the asymptotic equivalence between minimizing {\color[rgb]{0,0,0}\tilde{C}}_{T}(\bm{a}_{n}) and minimizing {\color[rgb]{0,0,0}C}_{T}(\bm{a}_{n}), since the latter is identical to (5).
Theorem 1**.**
\thlabel
prop:asymptoticequivalence Under assumption A1:
It holds for all that \displaystyle\lim_{T\rightarrow\infty}|{\color[rgb]{0,0,0}C}_{T}(\bm{a}_{n})-{\color[rgb]{0,0,0}\tilde{C}}_{T}(\bm{a}_{n})|=0. 2. 2.
It holds that \displaystyle\lim_{T\rightarrow\infty}\big{|}\inf_{\bm{a}_{n}}{\color[rgb]{0,0,0}C}_{T}(\bm{a}_{n})-\inf_{\bm{a}_{n}}{\color[rgb]{0,0,0}\tilde{C}}_{T}(\bm{a}_{n})\big{|}=0. 3. 3.
If, additionally, assumption A2 holds, then
Proof:
See Appendix A in the supplementary material. ∎
\thref
prop:asymptoticequivalence essentially establishes not only that the TISO and TIRSO hindsight objectives are asymptotically the same but also that their minima and minimizers asymptotically coincide. Since the TISO hindsight objective equals the batch objective (5), it follows that the TIRSO hindsight objective asymptotically approaches the batch objective (5). This observation is very important since the regret analysis from Sec. IV-B will establish that the TISO and TIRSO estimates asymptotically match their hindsight counterparts.
IV-B Static Regret Analysis
This section characterizes the performance of TISO and TIRSO analytically. Specifically, it is shown that the sequences of estimates produced by these algorithms yield a sublinear static regret, which is a basic requirement in online optimization; see Sec. II-C. Broadly speaking, this property means that, on average and asymptotically, the online estimates perform as well as their hindsight counterparts.
A general definition of the regret metric is given in (7). Since the problem at hand is separable across nodes, it is natural to separately quantify the regret for each node. The total regret will be the sum of the regret for all nodes. Applying this idea and shifting the time index to simplify notation, one can replace in (7) with , function with , and with to write the regret of TISO for the -th node at time as
[TABLE]
where and is defined in (27). For TIRSO, the regret for the -th node is given by
[TABLE]
where and is defined in (29).
Since constant step size sequences allow tracking time-varying topologies, one could think of seeking a sublinear bound for the regret. However, it is easy to see (cf. (17) and (18) in the case of TISO) that the sequences of estimates in this case are generally noisy, unless the innovation process in (1) is . For this reason, a sublinear regret bound cannot be obtained for a constant . However, it is possible to establish sublinear regret when the step size is “asymptotically constant,” as described next.
The idea is to run the selected algorithm in time windows of exponentially increasing length with a step size that differs across windows but is constant within each one. Specifically, let the -th window, , comprise the time indices for some user-selected . Set for those satisfying . The following result proves sublinear regret for TISO.
Theorem 2**.**
\thlabel
cor:doublingtricktiso Let be generated by applying TISO (Procedure 1) with step size in the window , Then, the regret of TISO under assumptions A1 and A26 is
[TABLE]
where B_{\bm{a}}=1/\beta({\color[rgb]{0,0,0}B}_{y}\sqrt{PN}+\sqrt{{\color[rgb]{0,0,0}B}_{y}^{2}PN+\beta{\color[rgb]{0,0,0}B}_{y}}).
Proof:
See Appendix B in the supplementary material. ∎
Similarly, the regret of TIRSO is characterized as follows:
Theorem 3**.**
\thlabel
cor:doublingtricktirso Let be generated by applying TIRSO (Procedure 2) with step size in the window , Then, the regret of TIRSO under assumptions A1, A2, and A3, is
[TABLE]
where B_{{\tilde{\bm{a}}}}\triangleq 1/\beta_{\tilde{\ell}}({\color[rgb]{0,0,0}B}_{y}\sqrt{PN}+\sqrt{{\color[rgb]{0,0,0}B}_{y}^{2}PN+\beta_{\tilde{\ell}}{\color[rgb]{0,0,0}B}_{y}}).
Proof:
See Appendix D in the supplementary material. ∎
\thref
cor:doublingtricktirso has the same form as \threfcor:doublingtricktiso with the exception of (79), where the constant term multiplying differs from the one in (33). However, it can be readily shown that L\leq PN{\color[rgb]{0,0,0}B}_{y}, which implies that TIRSO also satisfies (33).
To sum up, both TISO and TIRSO behave asymptotically in the same fashion and provide, on average, the same performance as the hindsight solution of TISO, which coincides with the batch solution in (5). The difference between TISO and TIRSO is, therefore, in the non-asymptotic regime, where TIRSO can track changes in the estimated graph more swiftly than TISO. This is at the expense of a slight increase in the number of arithmetic operations and required memory. Note, however, that TIRSO offers an additional degree of freedom through the selection of the forgetting factor . This enables the user to select the desired point in the trade-off between adaptability to changes and low variability in the estimates.
As demonstrated next, tighter regret bounds can be obtained when a diminishing step size sequence is adopted. Such sequences are of special interest when the VAR coefficients do not change over time. Even in this scenario, the application of online algorithms such as TISO or TIRSO is well-motivated when the number or dimension of the data vectors is prohibitively large to tackle with a batch algorithm.
Theorem 4**.**
\thlabel
*th:strongconvexitytirso Under assumptions A1, A2, and A3, let be generated by TIRSO (Procedure 2) with . Then, the static regret of TIRSO satisfies
[TABLE]
where G_{\tilde{\ell}}\triangleq(1+\kappa_{\bm{\Phi}})\sqrt{PN}{\color[rgb]{0,0,0}B}_{y} with and is defined in \threfcor:doublingtricktirso.
Proof:
See Appendix F in the supplementary material. ∎
Next, we analyze the performance of TIRSO in dynamic environments.
IV-C Dynamic Regret Analysis of TIRSO
In this section, the performance of TIRSO is analyzed in dynamic settings. Specifically, a dynamic regret bound is derived for TIRSO, and its steady-state tracking error in dynamic scenarios is also discussed.
To characterize the performance of TIRSO in dynamic setups, the dynamic regret is defined as:
[TABLE]
where is the TIRSO estimate and .
The dynamic regret in (36) compares the estimate with in terms of the metric . As opposed to , estimate does not “know” since is obtained from whereas depends on both and . This means that the dynamic regret captures the ability of an algorithm to attain small future residuals. Furthermore, note that comparing with is highly meaningful in the present case since, by definition, , which therefore minimizes a version of the batch (5) or hindsight (29) objectives where the more recent residuals are weighted more heavily. Thus, constitutes a significant estimator in dynamic setups and therefore the dynamic regret also quantifies the ability of an estimator to track changes.
It can be easily shown that the static regret is upper-bounded by the dynamic regret. The dynamic regret in (36) would coincide with the static regret if were replaced with . Attaining a low dynamic regret is therefore more challenging because the estimator under consideration is compared with a time-varying reference.
This implies that a sublinear dynamic regret may not be attained if this time-varying reference changes too rapidly, which generally occurs when the tracked parameters vary too quickly. For this reason, the dynamic regret is commonly upper-bounded in terms of the cumulative distance between two consecutive instantaneous optimal solutions, known as path length:
[TABLE]
Next, we bound the dynamic regret of TIRSO.
Theorem 5**.**
\thlabel
th:dynamicregretbound Under assumptions A1, A2, and A3, let be generated by TIRSO (Procedure 2) with a constant step size . If there exists such that
[TABLE]
then the dynamic regret of TIRSO satisfies:
[TABLE]
where .
Proof:
See Appendix G in the supplementary material. ∎
The derivation of the dynamic regret bound above relies on the strong convexity of (20), and thus cannot be done for TISO. The derivation of a different dynamic regret bound in [53] relies on the strong convexity of the quadratic term of an elastic-net regularizer, which is not necessary here. Several remarks about \threfth:dynamicregretbound are in order. If the path length is sublinear in , then the dynamic regret is also sublinear in .
When the path length is not sublinear, the dynamic regret may not be sublinear, but we can still bound the steady-state error under certain conditions:
Theorem 6**.**
\thlabel
*th:boundingerror Under assumptions A1, A2, and A3, let be generated by TIRSO (Procedure 2) with a constant step size . If there exists such that (38) holds, then *
[TABLE]
Proof:
Following similar arguments as in the proof of \threfth:dynamicregretbound, (40) follows by applying [54, Lemma 4]. ∎
This theorem establishes that the steady-state error incurred by TIRSO with in dynamic scenarios eventually becomes bounded, which shows its tracking capability in time-varying environments. If , then the upper bound on the steady-state error becomes , where is an upper bound on the condition number of . This clearly agrees with intuition. In practice, one may not know the value of and therefore selecting an guaranteed to be in would not be possible. In those cases, it makes sense to compute a running approximation of given by and adopt the approximately constant step size , where . However, in setups where the true VAR parameters change over time, the operation may lead the algorithm to use an overly pessimistic approximation of . Thus, it may be preferable to directly adopt the adaptive step size , as analyzed in Sec. V.
Remark. None of the algorithms and analytical results in this paper require any probabilistic assumption or mention to probability theory, making our results fully compatible with the deterministic interpretation of the estimator at hand. This is because these results establish performance guarantees for the proposed online algorithms relative to the batch estimator or hindsight solutions. If one wished to obtain performance guarantees in terms of probabilistic metrics, such as consistency of the estimators, probabilistic assumptions would of course be required. For example, when , the batch estimator in (3) boils down to the ordinary least squares estimator, which is consistent if the VAR process is stable and the noise is standard white [22, Lemma 3.1]. When , consistency of (3) is discussed in [29]. Remarkably, consistency of the VAR coefficient estimates is not enough to ensure the correct identification of the true graph. Theorem 1 in [29] provides conditions that depend on the true VAR parameters that guarantee that the graph is successfully recovered.
V Numerical Results and Analysis
Simulation tests for the proposed algorithms are performed on both synthetic and real data. All code will be made public at the authors’ websites.
The proposed algorithms are evaluated based on the performance metrics described next, where expectations are approximated by the Monte Carlo method. For synthetic-data experiments, the normalized mean square deviation
[TABLE]
measures the difference between the estimates and the (possibly time-varying) true VAR coefficients . The ability to detect edges of the true VAR-causality graph is assessed using the probability of miss detection
[TABLE]
for a given threshold , which is the probability of not identifying an edge that actually exists, and the probability of false alarm
[TABLE]
which is the probability of detecting an edge that does not exist. Another relevant metric is the edge identification error rate (EIER), which measures how many edges are misidentified relative to the number of possible edges [55]:
[TABLE]
Note that self-loops are excluded in these metrics. To quantify the forecasting performance, define recursively the -step ahead predictor given as:
[TABLE]
where are the estimated VAR coefficients at time and for . The -step normalized mean square error is given by
[TABLE]
The values of all parameters involved in the experiments are listed in the captions and legends of the figures.
V-A Synthetic Data Tests
Throughout this section, unless otherwise stated, the expectations in (41) to (44) are taken with respect to realizations of the graph, VAR parameters, and innovation process . Similarly, the step size is set to ; see Sec. IV-C. The regularization parameter is selected to approximately minimize NMSD.
V-A1 Stationary VAR Processes
An Erdős-Rényi random graph is generated with edge probability and self-loop probability 1. This graph determines which entries of the matrices are zero. The rest of entries are drawn i.i.d. from a standard normal distribution. Matrices are scaled down afterwards by a constant that ensures that the VAR process is stable [22]. The innovation process samples are drawn independently as .
The first experiment analyzes TISO and TIRSO in a stationary setting. Figs. 2(a) and 2(b) depict the NMSD and for three different values of . As a benchmark, Fig. 2(b) includes the of the genie-aided predictor, obtained from (43) after replacing with . It is observed that yields a better NMSD and than lower and higher values of . This corroborates the importance of promoting sparse solutions, as done in TISO and TIRSO. Furthermore, as expected, TIRSO generally converges faster than TISO. Fig. 2(c) shows the receiver operating characteristic (ROC) curve, composed of pairs for different values of the threshold . The values of these pairs are obtained by respectively averaging and over time in the interval . Remarkably, both TISO and TIRSO can simultaneously attain and below 10%. This ability to satisfactorily detect edges is further investigated in Figs. 2(d-f), where is set for each algorithm so that and have the same average over the time interval .
Fig. 3 analyzes different step size sequences. Because the true VAR parameters remain constant, the diminishing sequence yields the best performance; see \threfth:strongconvexitytirso. Besides, TISO and TIRSO are compared with benchmarks in Fig. 4, namely online subgradient descent (OSGD) and proximal gradient descent (PGD). The former obtains a minimizer for (5) in an online fashion (labeled as OSGDTISO since it uses the same information as TISO at each iteration). The latter approximates by using the (batch) algorithm PGD for iterations over (labeled as PGDTIRSO since it uses the same information as TIRSO at each iteration). Fig. 4 shows that TISO outperforms OSGDTISO in terms of NMSD, and TIRSO eventually attains better NMSD level than PGDTIRSO. Note that the computational complexity of PGDTIRSO is significantly larger than the complexity of TIRSO. Although the NMSD of TISO in Fig. 4 is close to that of OSGD, a more in-depth study reveals that the former yields sparse iterates without any thresholding; moreover, TIRSO offers a significantly improved edge-detection performance (EIER), see Fig. 5. Fig. 6 compares the true (left) and recovered (right) graphs via TIRSO and TISO by thresholding the average of the estimated VAR coefficients across the intervals , . The threshold is selected to detect edges. Note that this is displayed for a single graph and realization of the VAR process; in other words, this is not a Monte Carlo experiment. It is observed that both TIRSO and TISO can identify the true graph quite accurately and approximate the true VAR coefficients soon afterwards.
V-A2 Non-stationary VAR Processes
The next experiment analyzes TISO and TIRSO when is a (non-stationary) smooth-transition VAR process [56, Ch. 18] \bm{y}[t]=\sum_{p=1}^{P}\big{(}\bm{A}_{p}+s_{f}[t](\bm{B}_{p}-\bm{A}_{p})\big{)}\bm{y}[t-p]+\bm{u}[t]. The signal determines the transition profile from a VAR model with parameters to a VAR model with parameters . In this experiment, where controls the transition speed and denotes transition starting instant. Over an Erdős-Rényi random graph, and are generated independently as in Sec. V-A1. It is easy to show that the coefficients yield a stable VAR process for all .
Figs. 7(a) and 7(b) illustrate the influence of the forgetting factor , of critical importance in non-stationary setups. TISO and TIRSO are seen to satisfactorily estimate and track the model coefficients. As intuition predicts, the lower is, the more rapidly TIRSO can adapt to changes, but after a sufficiently long time after the transition, a higher is preferred.
Finally, to demonstrate that TISO and TIRSO successfully leverage sparsity to track time-varying topologies, Fig. 8 illustrates an approximately optimal point in the trade-off of selecting .
V-B Real-Data Tests
The real data is taken from Lundin’s offshore oil and gas (O&G) platform Edvard-Grieg666https://www.lundin-petroleum.com/operations/production/norway-edvard-grieg. Each node corresponds to a temperature, pressure, or oil-level sensor placed in the decantation system that separates oil, gas, and water. The measured time series are physically coupled due to the pipelines connecting the system parts and due to the control systems. Hence, causal relations among time series are expected. Topology identification is motivated to forecast the short-term future state of the system and to unveil dependencies that cannot be detected by simple inspection. All time series are resampled to a common set of equally-spaced sampling instants using linear interpolation. Since the data was quantized and compressed using a lossy scheme, a significant amount of noise is expected. Each time series is normalized to have zero mean and unit sample standard deviation.
Here, the step size is set to and the NMSE is defined as
Fig. 9 shows the vs. the prediction horizon for the time series in the data set. The temperature, pressure, and oil level time series are respectively denoted by T, P, and L and an identifying index. As expected, the prediction error increases with . The NMSE ranges from to due to the different predictability of each time series.
Fig. 10 presents the graph obtained by thresholding the average coefficient estimates over a three-hour duration. The threshold is such that the number of reported edges is . Self-loops are omitted for clarity, and arrow colors encode edge weights. It is observed that most identified edges connect sensors within each subsystem.
VI Conclusions
Two online algorithms were proposed for identifying and tracking VAR-causality graphs from time series. These algorithms sequentially accommodate data and refine their sparse topology estimates accordingly. The proposed algorithms offer complementary benefits: whereas TISO is computationally simpler, TIRSO showcases improved tracking behavior. Performance is assessed theoretically and empirically. Asymptotic equivalence of the hindsight solutions of the proposed algorithms is established and sublinear regret bounds are derived. Experiments with synthetic and real data validate the conclusions of the theoretical analysis. Future directions include explicitly modeling the variations in the VAR coefficients, possibly along the lines of [57, 58, 59], as well as identifying topologies whose adjacency matrix has a low-rank plus sparse structure along the lines of [60] to account for clusters.
Appendix A Proof of \threfprop:asymptoticequivalence
The first step is to rewrite (30) to be able to obtain a simple expression for {\color[rgb]{0,0,0}C}_{T}(\bm{a}_{n})-{\color[rgb]{0,0,0}\tilde{C}}_{T}(\bm{a}_{n}). To this end, substitute (19) into (30) and exchange the order of the summations to obtain
[TABLE]
where . From the geometric series summation formula, which establishes that , and noting that , the above equation becomes
[TABLE]
From (28) and the equation above, the difference d_{T}(\bm{a}_{n})\triangleq{\color[rgb]{0,0,0}C}_{T}(\bm{a}_{n})-{\color[rgb]{0,0,0}\tilde{C}}_{T}(\bm{a}_{n}) between the TISO and TIRSO hindsight objectives is given by:
[TABLE]
To prove part 1, it suffices to show that as for all . To this end, expand
[TABLE]
and apply Cauchy-Schwarz inequality to obtain
[TABLE]
On the other hand, the hypothesis |y_{n}[t]|^{2}\leq{\color[rgb]{0,0,0}B}_{y}\forall n,t implies that \lVert\bm{y}[t]\rVert_{2}^{2}\leq N{\color[rgb]{0,0,0}B}_{y}, and hence
[TABLE]
Substituting the upper bound of into (47) yields
[TABLE]
Applying the latter bound to (45) results in
[TABLE]
Taking the limit of the right-hand side clearly yields
[TABLE]
Noting from (45) that , it follows that , which concludes the proof of part 1.
To prove part 2, note from (45) that , which in turn implies that
[TABLE]
for all and . On the other hand, it follows from (29) that
[TABLE]
Thus, by combining (51) and (52),
[TABLE]
Similarly, from (27), it holds that {\color[rgb]{0,0,0}C}_{T}(\bm{a}_{n}^{*}[T])\leq{\color[rgb]{0,0,0}C}_{T}({\tilde{\bm{a}}}_{n}^{*}[T]). Subtracting {\color[rgb]{0,0,0}\tilde{C}}_{T}({\tilde{\bm{a}}}_{n}^{*}[T]) from both sides of the latter inequality yields
[TABLE]
By combining (53) and (A), it holds that
[TABLE]
Since , (55) implies that
[TABLE]
Finally, to establish part 3, note that it follows from assumption A2, (20) and (30) that {\color[rgb]{0,0,0}\tilde{C}}_{T} is -strongly convex for some . Thus, from (29), one finds that
[TABLE]
By combining (51) and (57), it follows that
[TABLE]
or, equivalently,
[TABLE]
Taking limits gives rise to
[TABLE]
From (56) and the sandwich theorem applied to (60), we have
[TABLE]
which concludes the proof.
Appendix B Proof of \threfcor:doublingtricktiso
Consider first the regret of TISO with constant step size.
Lemma 1**.**
\thlabel
prop:regrettiso Let be generated by TISO (Procedure 1) with constant step size \alpha_{t}\!=\alpha\!=\!\mathcal{O}\big{(}1/\sqrt{T}\big{)}. Under assumptions A1 and A26, we have
[TABLE]
Proof:
See Appendix C. ∎
Observe that the step size in \threfprop:regrettiso depends on and therefore (62) cannot be interpreted as directly establishing sublinear regret for TISO. To understand this result, consider a sequence of copies of TISO, each one for a value of . Each copy has a (potentially) different step size, but uses the same step size for all . Expression (62) bounds the regret of the -th copy at time . However, \threfprop:regrettiso can be used next to establish sublinear regret for step size sequences that remain constant over windows of exponentially increasing length; see the doubling trick [46].
To this end, let the regret in the window be
[TABLE]
where is an arbitrary sequence and
[TABLE]
The next result establishes a bound on the static regret given the regret at each window.
Lemma 2**.**
\thlabel
lemma:individualregrets For and for an arbitrary sequence , the regret in (31) is bounded as:
[TABLE]
Proof:
For , expression (31) can be written as:
[TABLE]
On the other hand, it follows from (63) that (65) is equivalent to
[TABLE]
The inequality in (67) can also be rewritten as
[TABLE]
By comparing (66) and (68), proving (65) is equivalent to showing that
[TABLE]
From the definitions of in (27) and in (64), the above inequality holds since . ∎
The next step is to bound the regret at each window using \threfprop:regrettiso. To this end, one must set as a function , where is the length of the -th window, . Invoking \threfprop:regrettiso, the regret for the -th window is given by R_{s}^{(n)}[t_{0}2^{m-1}+1,t_{0}2^{m}]=\mathcal{O}(PN{\color[rgb]{0,0,0}B}_{y}B_{\bm{a}}^{2}\sqrt{2^{m-1}}). By \threflemma:individualregrets, the regret of TISO becomes
[TABLE]
which concludes the proof.
Appendix C Proof of \threfprop:regrettiso
First we present a lemma that establishes that the hindsight solution of TISO is bounded and then we will present the proof of \threfprop:regrettiso.
Lemma 3**.**
\thlabel
lemma:boundedhindsightsolutionTISO Under assumptions A1, A2, and A26, the hindsight solution of TISO given in (27) is bounded as
[TABLE]
Proof:
Note that belongs to the sublevel set of TISO hindsight objective for , given by
[TABLE]
where {\color[rgb]{0,0,0}C}_{T}(\bm{0}_{NP}) is upper bounded by
[TABLE]
This means that we can write:
[TABLE]
Next, we find a lower bound to {\color[rgb]{0,0,0}C}_{T}(\bm{a}_{n}^{*}[T]) that is an increasing function of as follows
[TABLE]
Therefore,
[TABLE]
Further, we can write
[TABLE]
with B_{\bm{a}}\triangleq 1/\beta({\color[rgb]{0,0,0}B}_{y}\sqrt{PN}+\sqrt{{\color[rgb]{0,0,0}B}_{y}^{2}PN+\beta{\color[rgb]{0,0,0}B}_{y}}). Expression (75) implies that the TISO hindsight solution is bounded. ∎
Now, we present the proof of \threfprop:regrettiso. This proof is based on the idea that if the inequality \lVert\nabla\ell_{t}^{(n)}(\bm{a}_{n})\rVert_{2}^{2}\leq 2PN{\color[rgb]{0,0,0}B}_{y}\,\ell_{t}^{(n)}(\bm{a}_{n}),\forall\,t,n holds and the strong convexity parameter of is 1, then it follows from [42, Corollary 5] that:
[TABLE]
where is defined in (70). We still need to show that the inequality \lVert\nabla\ell_{t}^{(n)}(\bm{a}_{n})\rVert_{2}^{2}~{}\leq~{}2PN{\color[rgb]{0,0,0}B}_{y}\,\ell_{t}^{(n)}(\bm{a}_{n}), , holds. To this end, note from (14) that:
[TABLE]
On the other hand, the hypothesis |y_{n}[t]|^{2}\leq{\color[rgb]{0,0,0}B}_{y}~{}\forall~{}n,t implies that \left\lVert\bm{y}[t]\right\rVert_{2}^{2}\leq N{\color[rgb]{0,0,0}B}_{y} and, therefore:
[TABLE]
Combining (76) and (77) yields
[TABLE]
Thus, to satisfy
[TABLE]
it suffices to set \rho=2PN{\color[rgb]{0,0,0}B}_{y}.
Appendix D Proof of \threfcor:doublingtricktirso
The first step is to obtain a bound for constant step size.
Lemma 4**.**
\thlabel
prop:regrettirso Let be generated by TIRSO (Procedure 2) with constant step size \alpha_{t}\!=\!\alpha\!=\!\mathcal{O}\big{(}1/\sqrt{T}\big{)}. Under assumptions A1, A2, and A3, we have
[TABLE]
Proof:
See Appendix E. ∎
The rest of the proof proceeds along the lines of the proof of \threfcor:doublingtricktiso.
Appendix E Proof of \threfprop:regrettirso
First, we present a lemma that establishes that the hindsight solution of TIRSO is bounded. Then, we will present the proof of \threfprop:regrettirso.
Lemma 5**.**
\thlabel
lemma:boundedhindsightsolutionTIRSO Under the assumptions A1 and A2, the hindsight solution of TIRSO given in (29) is bounded as
[TABLE]
Proof:
The proof follows similar steps to those of \threflemma:boundedhindsightsolutionTISO. Consider the sublevel set of TIRSO hindsight objective for ,
[TABLE]
where {\color[rgb]{0,0,0}\tilde{C}}_{T}(\bm{0}_{NP}) is upper bounded as follows:
[TABLE]
This implies that
[TABLE]
Next, we find a lower bound to {\color[rgb]{0,0,0}\tilde{C}}_{T}({\tilde{\bm{a}}}_{n}^{*}[T]) that is an increasing function of as follows
[TABLE]
Therefore,
[TABLE]
Further, we can write
[TABLE]
Expression (86) implies that the TIRSO hindsight solution is bounded. ∎
Now, we present the proof of \threfprop:regrettirso. The proof has two parts. The first step is to prove that there exists such that
[TABLE]
holds for all . The second step is to apply the result of [42, Corollary 5] in the present case. To prove the first part, from (20) and , it follows that (87) is equivalent to
[TABLE]
By expanding the left-hand side of (E), rearranging terms, and introducing as
[TABLE]
the condition in (87) is equivalent to . So the goal becomes finding such that for all and . For this condition to hold, it is necessary that (a) is finite for all , and (b) for all . It can be seen [61, Appendix A.5] that condition (a) holds iff (a1) the Hessian matrix is positive semidefinite, and (a2) , where denotes the span of the columns of a matrix . The first step is to find such that (a1) holds. To this end, consider the eigenvalue decomposition of , where the index is omitted to simplify notation. Therefore,
[TABLE]
Let denote the maximum eigenvalue of . It follows from (90) that is positive semidefinite if
[TABLE]
It remains to be shown that there exists such that (91), (a2), and (b) simultaneously hold. To this end, focus first on (a2), which can be rewritten as
[TABLE]
Clearly, if , then is invertible and, hence, [62, Ch. 4]. Thus, (92) holds if and . The former condition is trivial. To verify the latter, define
[TABLE]
and ; note that . It follows that . Therefore, holds and, consequently, (a2) holds whenever .
So far, this proof has established that, if , then both (a1) and (a2) hold. The next step is to show that (b) also holds when . To this end, set the gradient of equal to zero and use to obtain , where the symbol denotes pseudo-inverse. From this expression and (E), it follows that
[TABLE]
Applying the properties of the pseudoinverse and simplifying results in
[TABLE]
From this expression, note that the condition is equivalent to
[TABLE]
and, upon defining ,
[TABLE]
This inequality trivially holds when . Thus, assume without loss of generality that . By setting , one obtains and .
Since the nonzero eigenvalues of and are the same [63, Sec. 3.2.11] and the maximum eigenvalue of is 1, then the maximum eigenvalue of is also 1. Therefore
[TABLE]
and, hence, (97) holds. To sum up, conditions (a) and (b) hold if . In other words, (87) holds for any choice of such that for all . This completes the first part of the proof. The second part of the proof consists of setting with an arbitrary constant, and invoking [42, Corollary 5] to conclude that
[TABLE]
Using assumption A3 and substituting the upper bound on from (80) into the above expression completes the proof.
Appendix F Proof of \threfth:strongconvexitytirso
To prove \threfth:strongconvexitytirso, first we present two lemmas. Before presenting the result related to logarithmic regret of TIRSO, it is worth mentioning that a related result is presented in [42, Th. 7], which is applicable to strongly convex regularization functions. Note that in TIRSO, the data-fitting function is strongly convex. It can be easily shown that COMID applied to a problem with strongly convex regularizer produces different iterates than COMID applied to a strongly convex data-fitting function.
Lemma 6**.**
\thlabel
lemma:strongcvxlemma Under assumption A2, let the sequence be generated by TIRSO (Procedure 2) with a step size , and let be the hindsight solution for TIRSO at time defined in (29). Then
[TABLE]
for , .
Proof:
For a strongly convex , by the subgradient inequality, we have
[TABLE]
. On the other hand, since is convex,
[TABLE]
. Adding (100) and (101), scaling by , and rearranging terms,
[TABLE]
where (a) results from adding and subtracting the term followed by rearranging terms; in (b) the inequality is used, which is implied by the optimality of in (23), i.e., ; in (c) the Pythagorean theorem for Euclidean distance (i.e. ) is used; in (d) the inequality is used. Dividing both sides of (102) by completes the proof. ∎
Next, we establish that TIRSO estimates are bounded and a bound on that depends on parameters of the algorithm, is derived.
Lemma 7**.**
\thlabel
lemma:boundingestiamtes Under assumptions A1 and A2, and let the sequence of iterates be generated by TIRSO (Procedure 2). Then
[TABLE]
Proof:
From the update expression of TIRSO, we have
[TABLE]
Now, we derive an upper bound on . By the definition of in (21b) and assumption A1, we have
[TABLE]
Substituting the upper bound of from (105b) into (104) completes the proof. ∎
Lemma 8**.**
\thlabel
lemma:boundg Under assumptions A1, A2, and A3, and let the sequence of iterates be generated by TIRSO (Procedure 2) with . Then
[TABLE]
[TABLE]
Proof:
Invoking \threflemma:boundingestiamtes and setting in (103),
[TABLE]
Substituting the upper bound of using (108a), we have
[TABLE]
After substitutions, the above bound can be written in terms of as follows
[TABLE]
. The bound on in terms of the initial estimate is obtained for in the above inequality, given by
[TABLE]
This completes the proof of (106), the first part of the lemma. To prove the second part of the lemma, by taking the value of the gradient in (22), and by the triangular inequality,
[TABLE]
∎
Now, we are ready to prove \threfth:strongconvexitytirso. We start from the result presented in \threflemma:strongcvxlemma. Summing both sides of (99) from to results in
[TABLE]
where the inequality in (a) results from ignoring the term and combining similar terms. To relate the l.h.s. of (111) and the static regret in this case, consider the definition of the static regret for TIRSO in (32)
[TABLE]
Adding and subtracting the term to the r.h.s. of (113) and rearranging of terms results in
[TABLE]
where and are used in the above inequality. Observe that the r.h.s. of the above inequality coincides with the l.h.s. of (111). Therefore, from (112) and (114), we have
[TABLE]
Setting in the above inequality yields
[TABLE]
where in (a) the bound on the gradient given in (107) is used; in (b) the inequality and the fact is used, and (c) is obtained by using the bound from (80).
Appendix G Proof of \threfth:dynamicregretbound
We derive the dynamic regret of TIRSO. To this end, since is convex, we have by definition
[TABLE]
, where with . Rearranging (115) and summing both sides of the inequality from to results in:
[TABLE]
By applying the Cauchy–Schwarz inequality on each term of the summation in the r.h.s. of the above inequality, we obtain
[TABLE]
The next step is to derive an upper bound on . From the definition of and by the triangular inequality, we have
[TABLE]
To bound , we invoke \threflemma:boundingestiamtes and set to obtain
[TABLE]
where . Observe that for , we have . Substituting (118b) recursively, we obtain
[TABLE]
where . For , the above inequality becomes
[TABLE]
which implies that \lVert\nabla\tilde{\ell}_{t}^{(n)}({\tilde{\bm{a}}}_{n}[t])\rVert_{2}\leq(1+L/\beta_{\tilde{\ell}})\sqrt{PN}{\color[rgb]{0,0,0}B}_{y}, as in the proof of \threflemma:boundg by following the same arguments as in (110a). Next, we need to find an upper bound on in (117). To this end, we apply the result in [46, Lemma 2.6] to , which establishes that all the subgradients of are bounded by its Lipschitz continuity parameter . In the following, we show that . Lipschitz smoothness of means that there exists such that
[TABLE]
for all . By definition, we have with . Let and by taking the l.h.s. of (121), we have
[TABLE]
where the inequality in (122a) holds due to the triangle inequality for scalars ( as scalars); (122b) holds due to the reverse triangle inequality (given by ); and (122c) follows from the inequality with [64, Sec. 2.2.2]. The inequality in (122c) implies that (121) is satisfied with , i.e., is -Lipschitz continuous. Thus, we have \lVert\tilde{\nabla}\tilde{h}_{t}^{(n)}({\tilde{\bm{a}}}_{n}[t])\rVert_{2}\leq(1+L/\beta_{\tilde{\ell}})\sqrt{PN}{\color[rgb]{0,0,0}B}_{y}+\lambda\sqrt{N}. Substituting this bound in (116) leads to:
[TABLE]
Next, we show that TIRSO for a constant step size can alternatively be derived by applying online proximal gradient descent to minimize . With given by (19) and is given by (11b), applying the online proximal gradient algorithm with a constant step size yields:
[TABLE]
where the proximal operator of a function at point is defined by [65]:
[TABLE]
The parameter controls the trade-off between minimizing and being close to . According to the definition in Sec. III-B, , and , which enables us to write the above update expression as
[TABLE]
Observe that the above problem is separable and the solution to the -th problem is given by:
[TABLE]
which is the same as (25) with a constant step size . Therefore, TIRSO can be equivalently derived by applying online proximal gradient descent method. Next, we apply Lemma 2 in [54] in order to bound in (G). The hypotheses of Lemma 2 are Lipschitz smoothness of , Lipschitz continuity of , and strong convexity of . Lipschitz continuity of is proved in (122c) whereas strong convexity of is implied by the assumption A2. So we need to verify that is Lipschitz-smooth, which means that there is such that
[TABLE]
for all . To this end, taking the l.h.s. of (127) and substituting the value of the gradient of from (22) results in:
[TABLE]
where denotes the maximum eigenvalue of the input matrix. Due to assumption A3, the inequality in (127) holds with . To apply Lemma 2 in [54], one can set in [54] as , as , and as , it follows that in [54] equals and equals . Then, since we have already shown above that the hypotheses of Lemma 2 in [54] hold in our case, applying it to bound in (G) yields:
[TABLE]
Noting that concludes the proof.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] B. Zaman, L. M. López-Ramos, D. Romero, and B. Beferull-Lozano, “Online topology estimation for vector autoregressive processes in data networks,” in Proc. IEEE Int. Workshop Comput. Advan. Multi-Sensor Adapt. Process. , Curaçao, Dutch Antilles, Dec. 2017.
- 2[2] E. D. Kolaczyk, Statistical Analysis of Network Data: Methods and Models , Springer, New York, 2009.
- 3[3] E. Isufi, A. Loukas, N. Perraudin, and G. Leus, “Forecasting time series with varma recursions on graphs,” ar Xiv preprint ar Xiv:1810.08581 , 2018.
- 4[4] P. Di Lorenzo, S. Barbarossa, P. Banelli, and S. Sardellitti, “Adaptive least mean squares estimation of graph signals,” IEEE Trans. Signal Info. Process. Netw. , vol. 2, no. 4, pp. 555–568, Dec. 2016.
- 5[5] C. Liu, S. Ghosal, Z. Jiang, and S. Sarkar, “An unsupervised spatiotemporal graphical modeling approach to anomaly detection in distributed CPS,” in ACM/IEEE Int. Conf. Cyber-Physical Syst. , Apr. 2016, pp. 1–10.
- 6[6] Y. Shen, P. A. Traganitis, and G. B. Giannakis, “Nonlinear dimensionality reduction on graphs,” in Proc. IEEE Int. Workshop Comput. Advan. Multi-Sensor Adapt. Process. , Curacao, Netherlands Antilles, Dec. 2017.
- 7[7] C. M. Bishop, Pattern Recognition and Machine Learning , Information Science and Statistics. Springer, 2006.
- 8[8] G. Mateos, S. Segarra, A. G. Marques, and A. Ribeiro, “Connecting the dots: Identifying network structure via graph signal processing,” ar Xiv preprint ar Xiv:1810.13066 , 2018.
