Online Topology Identification from Vector Autoregressive Time Series

Bakht Zaman; Luis Miguel Lopez Ramos; Daniel Romero; Baltasar Beferull-Lozano

arXiv:1904.01864·eess.SP·November 16, 2020·IEEE Trans. Signal Process.

TL;DR

This paper introduces two online algorithms for real-time identification of causality graphs from multivariate time series using VAR models, suitable for big data and dynamic environments, with proven asymptotic optimality.

Contribution

It develops two novel online algorithms for tracking time-varying causality graphs from VAR models, with theoretical performance guarantees and applicability to large-scale data.

Findings

01. Algorithms achieve asymptotic performance matching batch estimators.

02. Algorithms have constant complexity per update, suitable for big data.

03. Numerical results validate effectiveness in static and dynamic scenarios.

Equations

$$\bm{y}[t]=\sum_{p=1}^{P}\bm{A}_{p}\,\bm{y}[t-p]+\bm{u}[t],$$

$$y_{n}[t]=\sum_{n^{\prime}\in\mathcal{N}(n)}\sum_{p=1}^{P}a_{n,n^{\prime}}^{(p)}\,y_{n^{\prime}}[t-p]+u_{n}[t]$$

$$\mathcal{L}(\mathcal{A})\triangleq\frac{1}{2(T-P)}\sum_{\tau=P}^{T-1}\Big\lVert\bm{y}[\tau]-\sum_{p=1}^{P}\bm{A}_{p}\,\bm{y}[\tau-p]\Big\rVert_{2}^{2}=\frac{1}{2(T-P)}\sum_{n=1}^{N}\sum_{\tau=P}^{T-1}\Big[y_{n}[\tau]-\sum_{n^{\prime}=1}^{N}\sum_{p=1}^{P}a_{n,n^{\prime}}^{(p)}\,y_{n^{\prime}}[\tau-p]\Big]^{2}.$$

$$\operatorname*{arg\,min}_{\mathcal{A}}~\mathcal{L}(\mathcal{A})+\lambda\sum_{n=1}^{N}\sum_{n^{\prime}=1,\,n^{\prime}\neq n}^{N}\lVert\bm{a}_{n,n^{\prime}}\rVert_{2},$$

$$\bm{g}[t]\triangleq\mathrm{vec}\big(\left[\bm{y}[t-1],\ldots,\bm{y}[t-P]\right]^{\top}\big)\in\mathbb{R}^{NP},$$

$$\bm{a}_{n}^{*}=\operatorname*{arg\,min}_{\bm{a}_{n}}~\ell^{(n)}(\bm{a}_{n})+\lambda\sum_{n^{\prime}=1,\,n^{\prime}\neq n}^{N}\lVert\bm{a}_{n,n^{\prime}}\rVert_{2}$$

$$\underset{\bm{a}}{\text{minimize}}~\frac{1}{T_{0}}\sum_{t=0}^{T_{0}-1}h_{t}(\bm{a}),$$

$$R_{s}[T_{0}]\triangleq\sum_{t=0}^{T_{0}-1}\big[h_{t}(\bm{a}[t])-h_{t}(\bm{a}^{*}[T_{0}])\big],$$

$$R_{d}[T_{0}]\triangleq\sum_{t=0}^{T_{0}-1}\big[h_{t}(\bm{a}[t])-h_{t}(\bm{a}^{\circ}[t])\big].$$

$$h_{t}(\bm{a}_{n})=\ell_{t+P}^{(n)}(\bm{a}_{n})+\lambda\sum_{n^{\prime}=1,\,n^{\prime}\neq n}^{N}\lVert\bm{a}_{n,n^{\prime}}\rVert_{2},$$

$$\tilde{\nabla}_{\bm{a}_{n}}\sum_{n^{\prime}=1,\,n^{\prime}\neq n}^{N}\lVert\bm{a}_{n,n^{\prime}}\rVert_{2}\triangleq\big[\tilde{\nabla}_{\bm{a}_{n,1}}^{\top}\lVert\bm{a}_{n,1}\rVert_{2},\ldots,\tilde{\nabla}_{\bm{a}_{n,n-1}}^{\top}\lVert\bm{a}_{n,n-1}\rVert_{2},\bm{0}_{P}^{\top},\tilde{\nabla}_{\bm{a}_{n,n+1}}^{\top}\lVert\bm{a}_{n,n+1}\rVert_{2},\ldots,\tilde{\nabla}_{\bm{a}_{n,N}}^{\top}\lVert\bm{a}_{n,N}\rVert_{2}\big]^{\top},$$

$$\underset{\bm{a}_{n}}{\text{minimize}}~\frac{1}{T_{0}}\sum_{t=0}^{T_{0}-1}\big[f_{t}^{(n)}(\bm{a}_{n})+\Omega^{(n)}(\bm{a}_{n})\big],$$

$$\bm{a}_{n}[t+1]=\underset{\bm{a}_{n}}{\arg\min}\big[\alpha_{t}\tilde{\nabla}{f_{t}^{(n)}}^{\top}(\bm{a}_{n}[t])\left(\bm{a}_{n}-\bm{a}_{n}[t]\right)+B_{\psi}\left(\bm{a}_{n},\bm{a}_{n}[t]\right)+\alpha_{t}\Omega^{(n)}(\bm{a}_{n})\big],$$

$$f_{t}^{(n)}(\bm{a}_{n})$$

$$\Omega^{(n)}(\bm{a}_{n})$$

$$\bm{a}_{n}[t+1]$$

$$J_{t}^{(n)}(\bm{a}_{n})\triangleq\bm{v}_{n}^{\top}[t]\left(\bm{a}_{n}-\bm{a}_{n}[t]\right)+\frac{1}{2\alpha_{t}}\lVert\bm{a}_{n}-\bm{a}_{n}[t]\rVert_{2}^{2}+\lambda\sum_{n^{\prime}=1,\,n^{\prime}\neq n}^{N}\lVert\bm{a}_{n,n^{\prime}}\rVert_{2}$$

$$\bm{v}_{n}[t]\triangleq\nabla\ell_{t}^{(n)}(\bm{a}_{n}[t])=\bm{g}[t]\left(\bm{g}^{\top}[t]\,\bm{a}_{n}[t]-y_{n}[t]\right).$$

$$J_{t}^{(n)}(\bm{a}_{n})$$

$$=\sum_{n^{\prime}=1}^{N}\Big[\frac{1}{2\alpha_{t}}\lVert\bm{a}_{n,n^{\prime}}\rVert_{2}^{2}+\bm{a}_{n,n^{\prime}}^{\top}\big(\bm{v}_{n,n^{\prime}}[t]-\frac{1}{\alpha_{t}}\,\bm{a}_{n,n^{\prime}}[t]\big)+\lambda\lVert\bm{a}_{n,n^{\prime}}\rVert_{2}\,\mathds{1}\{n^{\prime}\neq n\}\Big],$$

$$\bm{a}_{n,n^{\prime}}[t+1]=\bm{a}_{n,n^{\prime}}^{\text{f}}[t]\left[1-\frac{\alpha_{t}\lambda}{\lVert\bm{a}_{n,n^{\prime}}^{\text{f}}[t]\rVert_{2}}\right]_{+},$$

$$\bm{a}_{n,n^{\prime}}[t+1]=\bm{a}_{n,n^{\prime}}[t]-\alpha_{t}\bm{v}_{n,n^{\prime}}[t]=\bm{a}_{n,n^{\prime}}^{\text{f}}[t]$$

$$\bm{a}_{n,n^{\prime}}[t+1]=\bm{a}_{n,n^{\prime}}^{\text{f}}[t]\left[1-\frac{\alpha_{t}\lambda\,\mathds{1}\{n\neq n^{\prime}\}}{\lVert\bm{a}_{n,n^{\prime}}^{\text{f}}[t]\rVert_{2}}\right]_{+}.$$

$$\tilde{\ell}_{t}^{(n)}(\bm{a}_{n})\triangleq\mu\sum_{\tau=P}^{t}\gamma^{t-\tau}\ell_{\tau}^{(n)}(\bm{a}_{n}),$$

$$\tilde{\ell}_{t}^{(n)}(\bm{a}_{n})=$$

$$\bm{\Phi}[t]\triangleq\mu\sum_{\tau=P}^{t}\gamma^{t-\tau}\bm{g}[\tau]\,\bm{g}^{\top}[\tau],$$

$$\bm{r}_{n}[t]\triangleq\mu\sum_{\tau=P}^{t}\gamma^{t-\tau}y_{n}[\tau]\,\bm{g}[\tau].$$

$$\nabla\tilde{\ell}_{t}^{(n)}(\bm{a}_{n})=\bm{\Phi}[t]\,\bm{a}_{n}-\bm{r}_{n}[t],$$

$$\tilde{\bm{a}}_{n}[t+1]$$

$$\tilde{J}_{t}^{(n)}(\tilde{\bm{a}}_{n})\triangleq\tilde{\bm{v}}_{n}^{\top}[t]\left(\tilde{\bm{a}}_{n}-\tilde{\bm{a}}_{n}[t]\right)+\frac{1}{2\alpha_{t}}\lVert\tilde{\bm{a}}_{n}-\tilde{\bm{a}}_{n}[t]\rVert_{2}^{2}+\lambda\sum_{n^{\prime}=1,\,n^{\prime}\neq n}^{N}\lVert\tilde{\bm{a}}_{n,n^{\prime}}\rVert_{2}.$$

Code & Models

Repositories

uia-wisenet/OnlineTopologyId (official)

Full text

Online Topology Identification from Vector Autoregressive Time Series

Bakht Zaman, Luis Miguel Lopez Ramos, Daniel Romero, and Baltasar Beferull-Lozano

The work in this paper was supported by the SFI Offshore Mechatronics grant 237896/E30, the PETROMAKS Smart-Rig grant 244205 and the IKTPLUSS Indurb grant 270730/O70 from the Research Council of Norway. The authors are with the WISENET Lab, Dept. of ICT, University of Agder, Jon Lilletunsvei 3, Grimstad, 4879 Norway. E-mails: {bakht.zaman, luismiguel.lopez, daniel.romero, baltasar.beferull}@uia.no. The material in this work was presented, in part, at CAMSAP 2017 [1].

Abstract

Causality graphs are routinely estimated in social sciences, natural sciences, and engineering due to their capacity to efficiently represent the spatiotemporal structure of multi-variate data sets in a format amenable for human interpretation, forecasting, and anomaly detection. A popular approach to mathematically formalize causality is based on vector autoregressive (VAR) models and constitutes an alternative to the well-known, yet usually intractable, Granger causality. Relying on such a VAR causality notion, this paper develops two algorithms with complementary benefits to track time-varying causality graphs in an online fashion. Their constant complexity per update also renders these algorithms appealing for big-data scenarios. Despite using data sequentially, both algorithms are shown to asymptotically attain the same average performance as a batch estimator which uses the entire data set at once. To this end, sublinear (static) regret bounds are established. Performance is also characterized in time-varying setups by means of dynamic regret analysis. Numerical results with real and synthetic data further support the merits of the proposed algorithms in static and dynamic scenarios.

I Introduction

Inferring causal relations among time series finds countless applications in social sciences, natural sciences, and engineering. These relations are typically encoded as the edges of a causality graph, where each node corresponds to a time series, and oftentimes reveal the topology of e.g. an underlying social, biological, or brain network [2]. Causality graphs may also offer valuable insights into the spatio-temporal structure of time series and assist data processing tasks such as forecasting [3], signal reconstruction [4], anomaly detection [5], and dimensionality reduction [6]. In some applications, graphs capturing different forms of causality can be constructed based on domain knowledge; see e.g. [7, Ch. 8]. However, this approach is often impractical in the aforementioned applications due to the large dimension of the data or because such prior knowledge is unavailable. Instead, causality graphs need to be inferred from data in these situations. This paper accomplishes this task in an online fashion.

Identifying graphs capturing the spatiotemporal “interactions” among time series has attracted great attention [2, 8]. Some approaches focus on instantaneous interactions, i.e., they disregard the temporal structure. The simplest one is to connect two nodes if the sample correlation between the associated time series exceeds a certain threshold [2]. To distinguish mediated from unmediated interactions [2, Sec. 7.3.2], one may resort to conditional independence, partial correlations, Markov random fields, or other approaches in graph signal processing; see e.g. [9, 10, 11, 7, 12, 13]. For directed interactions, one may employ structural equation models (SEM) [14] (see also [15] and references therein) or Bayesian networks [7, Sec. 8.1]. However, these methods account only for memoryless interactions, i.e., they cannot accommodate delayed interactions where the value of a time series at a given time instant is related to the past values of other time series.

The earliest effort to formalize the notion of causality among time series is due to Granger [16] and relies on the rationale that the cause precedes the effect. A time series is said to be Granger-caused by another if the optimal prediction error of the former is decreased when the past of the latter is taken into account. Albeit elegant, this definition is generally impractical since the optimal prediction error is difficult to determine [17, p. 33], [18]. Thus, alternative causality definitions based on vector autoregressive (VAR) models are typically preferred [19, 20, 21]. VAR causality is determined from the support of VAR matrix parameters and is equivalent to Granger causality only in certain cases [22, Chap. 2], although the two notions are sometimes treated as equivalent [20, 21]. VAR causality is further motivated by the widespread usage of VAR models to approximate the response of systems of linear partial differential equations [23] and, more generally, in disciplines such as econometrics, bio-informatics, neuroscience, and engineering [24, 25, 26]. VAR topologies are estimated assuming Gaussianity and stationarity in [27, 28] and assuming sparsity in [29, 30, 31, 32]. All these approaches assume that the graph does not change over time. Since this is not the case in many applications, approaches have been devised to identify undirected time-varying topologies [33, 34] and directed piecewise-constant time-varying topologies [35].

The complexity of all previously discussed approaches becomes prohibitive for long observation windows since they process the entire data set at once and cannot accommodate data arriving sequentially. The modern approach to tackle these issues is online optimization, where an estimate is refined with every new data instance. Existing online topology identification algorithms include [36, 15], [37, 38, 39], and [40], but they only account for memoryless interactions.

The present work is the first to propose online algorithms to estimate the memory-aware causality graphs associated with a collection of time series (the related work in [41] was conducted in parallel and published after the conference version [1] of this work). We take as a starting point an online algorithm for estimating directed VAR causality graphs, which minimizes a sequential, sparse topology identification criterion by means of a composite-objective iteration [42]. This procedure, termed TISO (Topology Identification via Sparse Online learning) throughout the paper, promotes sparse updates and enjoys constant computational complexity and memory requirements per iteration, which renders it suitable for sequential and big-data scenarios. Building upon this basic algorithm, the contributions of the present paper include the derivation of a more advanced algorithm, theoretical results that characterize the performance of both algorithms, and empirical validation of their performance through extensive experiments with synthetic and real data sets.

The more advanced algorithm, named Topology Identification via Recursive Sparse Online learning (TIRSO), substantially improves the tracking performance of TISO and its robustness to input variability by minimizing a novel estimation criterion inspired by recursive least squares (RLS), where the instantaneous loss function accounts for past samples. TIRSO inherits certain benefits of TISO but incurs a moderate increase in computational complexity, which is still constant per iteration.

We summarize our theoretical results, which constitute the main contribution of our paper: (R1) It is established that the hindsight solutions of TISO and TIRSO are asymptotically the same. (R2) The performance of TISO and TIRSO is analyzed in terms of static regret bounds, which are sublinear and suggest that TIRSO outperforms TISO. Hence, in the long run, these algorithms perform as well as the best (batch) predictor in hindsight, which supports their adoption for online topology identification. The static regret analysis goes beyond simply stating that the regret is sublinear (which is a direct consequence of applying the algorithm in [42] to the aforementioned criterion) and establishes a bound based on properties of the time series that can be checked in practice. (R3) A logarithmic regret bound is proved for TIRSO; thanks to the strong convexity of its loss function, such a bound could be proven for TIRSO but not for TISO. (R4) To analyze the performance of TIRSO when the topology is time-varying, a dynamic regret bound is derived. Moreover, the steady-state error of TIRSO in time-varying scenarios is quantified in terms of the data properties. Remarkably, the performance (regret) analysis does not require probabilistic assumptions, which endows the developed approaches with high generality.

The conference version [1] of this work presents two online algorithms that are different from the algorithms presented here. One is based on the subgradient approximation for regularized RLS proposed in [43] and has computational complexity comparable to that of TIRSO, and the other one is based on a block coordinate minimization via Newton’s method and has lower computational complexity for large networks with small process order. In addition, no convergence guarantees were provided.

The rest of the paper is organized as follows: Sec. II presents the model, a batch estimation criterion, and background on online optimization. Sec. III develops TISO and TIRSO. Sec. IV and Sec. V respectively assess performance analytically and via simulations, whereas Sec. VI concludes the paper. All code will be made public at the authors’ websites.

Notation. Bold lowercase (uppercase) letters denote column vectors (matrices). Operators $\mathbb{E}[\cdot]$ , $\nabla$ , $\tilde{\nabla}$ , $\partial$ , $(\cdot)^{\top}$ , $\mathrm{vec}(\cdot)$ , $\lambda_{\mathrm{max}}(\cdot)$ , $\mathcal{R}(\cdot)$ , $(\cdot)^{\dagger}$ , and $\mathrm{diag}(\cdot)$ respectively denote expectation, gradient, subgradient, sub-differential, matrix transpose, vectorization, maximum eigenvalue, range or column space, pseudo-inverse, and diagonal of a matrix. Symbols $\bm{0}_{N}$ , $\bm{1}_{N}$ , $\bm{0}_{N\times N}$ , and $\bm{I}_{N}$ respectively represent the all-zero vector of size $N$ , the all-ones vector of size $N$ , the all-zero matrix of size $N\times N$ , and the size- $N$ identity matrix. Also, $[\cdot]_{+}=\mathrm{max}(\cdot,0)$ . For functions $f(x)$ and $g(x)$ , the notation $f(x)\propto g(x)$ means $\exists a>0,b:f(x)=ag(x)+b$ . The operator $\mathds{1}$ is the indicator satisfying $\mathds{1}\{x\}=1$ if $x$ is true and $\mathds{1}\{x\}=0$ otherwise. Finally, for time series, the notation $\{y_{n}[t]\}_{t}$ corresponds to $\{y_{n}[t]\}_{t\in\mathbb{Z}}$ .

II Preliminaries

After outlining the notion of directed causality graphs, this section reviews how these graphs can be identified in a batch fashion. Later, the basics of online optimization are described.

II-A Directed Causality Graphs

Consider a collection of $N$ time series $\{y_{n}[t]\}_{t}$, $n=1,...,N$, where $y_{n}[t]$ denotes the value of the $n$-th time series at time $t$. A causality graph $\mathcal{G}\!\triangleq\!(\mathcal{V},\mathcal{E})$ is a graph where the $n$-th vertex in $\mathcal{V}\!=\!\{1,\ldots,N\}$ is identified with the $n$-th time series $\{y_{n}[t]\}_{t}$ and there is an edge (or arc) from $n^{\prime}$ to $n$ (i.e. $(n,n^{\prime})\in\mathcal{E}$) if and only if (iff) $\{y_{n^{\prime}}[t]\}_{t}$ causes $\{y_{n}[t]\}_{t}$ according to a certain causality notion. For the reasons outlined in Sec. I, a prominent notion of causality described later in this section can be defined using VAR models. To this end, let $\bm{y}[t]\!\triangleq\![y_{1}[t],\ldots,y_{N}[t]]^{\top}$ and define a VAR time series $\{\bm{y}[t]\}_{t}$ as a sequence generated by the order-$P$ VAR model [22]

$$\bm{y}[t]=\sum_{p=1}^{P}\bm{A}_{p}\,\bm{y}[t-p]+\bm{u}[t], \tag{1}$$

where $\bm{A}_{p}\!\in\!\mathbb{R}^{N\times N}$, $p=1,\ldots,P$, are the VAR parameters and $\bm{u}[t]\!\triangleq\![u_{1}[t],\ldots,u_{N}[t]]^{\top}$ is the innovation process. (For the sake of clarity, the matrices $\{\bm{A}_{p}\}_{p=1}^{P}$ are deemed constant throughout this section. However, all the notions explained here can be easily generalized to time-varying scenarios, as detailed in subsequent sections.) The innovation process is generally assumed to be a temporally white, zero-mean stochastic process, i.e., $\mathbb{E}[\bm{u}[t]]=\bm{0}_{N}$ and $\mathbb{E}[\bm{u}[t]\bm{u}^{\top}[\tau]]\!=\!\bm{0}_{N\times N}$ for $t\!\neq\!\tau$. Yet, the present work does not even need to assume that $\bm{u}[t]$ is random, which benefits its generality; see the remark at the end of Sec. IV. With $a_{n,n^{\prime}}^{(p)}$ the $(n,n^{\prime})$-th entry of $\bm{A}_{p}$, expression (1) becomes

$$y_{n}[t]=\sum_{n^{\prime}\in\mathcal{N}(n)}\sum_{p=1}^{P}a_{n,n^{\prime}}^{(p)}\,y_{n^{\prime}}[t-p]+u_{n}[t] \tag{2}$$

for $n=1,\ldots,N$ , where $\mathcal{N}(n)\!\triangleq\!\{n^{\prime}\!:\!\bm{a}_{n,n^{\prime}}\neq\bm{0}_{P}\}$ and $\bm{a}_{n,n^{\prime}}\!\triangleq\![a_{n,n^{\prime}}^{(1)},\ldots,a_{n,n^{\prime}}^{({P})}]^{\top}$ . Recognizing the convolution operation in the right-hand side enables one to express (2) as $y_{n}[t]=\textstyle{\sum_{n^{\prime}\in\mathcal{N}(n)}}a_{n,n^{\prime}}^{(t)}\ast y_{n^{\prime}}[t]\!+\!u_{n}[t]$ in signal processing notation. Thus, in a VAR model, $y_{n}[t]$ equals the sum of noise and the output of $|\mathcal{N}(n)|$ linear time-invariant filters where the $n,n^{\prime}$ -th filter has input $\{y_{n^{\prime}}[t]\}_{t}$ and coefficients $\{a_{n,n^{\prime}}^{(p)}\}_{p=1}^{P}$ .
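To make the generative model (1) concrete, the following minimal sketch simulates an order-$P$ VAR process with NumPy. The function name `simulate_var`, the Gaussian innovations, the noise level, and the example coefficients are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def simulate_var(A, T, noise_std=0.1, seed=0):
    """Generate T samples of y[t] = sum_p A[p-1] @ y[t-p] + u[t].

    A has shape (P, N, N). The first P samples are pure noise
    (an assumed initialization) and u[t] is white Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    P, N, _ = A.shape
    y = np.zeros((T, N))
    y[:P] = noise_std * rng.standard_normal((P, N))
    for t in range(P, T):
        u = noise_std * rng.standard_normal(N)
        # A[p] multiplies the sample delayed by p + 1 time instants
        y[t] = sum(A[p] @ y[t - 1 - p] for p in range(P)) + u
    return y

# Hypothetical example: N = 3 series, order P = 2 (0-based node indices).
A = np.zeros((2, 3, 3))
A[0] = 0.5 * np.eye(3)   # self-connections at lag 1
A[1, 1, 0] = 0.3         # y_0[t-2] enters y_1[t]
y = simulate_var(A, T=200)
```

Each row of `y` is one vector sample $\bm{y}[t]$; in this toy example only series 0 feeds another series, through a single lag-2 coefficient.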

When $\bm{u}[t]$ is a zero-mean and temporally white stochastic process, the term $\hat{y}_{n}[t]\!\triangleq\!\sum_{n^{\prime}\in\mathcal{N}(n)}\sum_{p=1}^{P}a_{n,n^{\prime}}^{(p)}y_{n^{\prime}}[t-p]$ in (2) is the minimum mean square error estimator of $y_{n}[t]$ given the previous values of all time series $\{y_{n^{\prime}}[\tau],n^{\prime}\!=\!1,...,N,\tau<\!t\}$; see e.g. [18, Sec. 12.7]. The set $\mathcal{N}(n)$ therefore collects the indices of those time series that participate in this optimal predictor of $y_{n}[t]$ or, equivalently, the time series $\{y_{n^{\prime}}[\tau]\}_{\tau<t}$ with $n^{\prime}\!\notin\!\mathcal{N}(n)$ are not informative to predict $y_{n}[t]$. This motivates the following definition of causality, which embodies the spirit of Granger causality (see Sec. I): $\{y_{n^{\prime}}[t]\}_{t}$ VAR-causes $\{y_{n}[t]\}_{t}$ whenever $n^{\prime}\!\in\!\mathcal{N}(n)$. Equivalently, $\{y_{n^{\prime}}[t]\}_{t}$ VAR-causes $\{y_{n}[t]\}_{t}$ if $\bm{a}_{n,n^{\prime}}\!\neq\!\bm{0}_{P}$. A detailed comparison with Granger causality lies out of scope, yet it is worth mentioning that the main distinction lies in the prediction horizon: whereas VAR causality just pertains to prediction 1 time instant ahead, Granger causality involves prediction of all future samples $y_{n}[t^{\prime}],~t^{\prime}\geq t$, given the ones up to a certain time instant $\{y_{n^{\prime}}[\tau],~n^{\prime}=1,\ldots,N,~\tau<t\}$. Therefore VAR causality implies Granger causality, but the converse is false; see [22, Sec. 2.3.1] for a more detailed comparison. VAR causality relations among the $N$ time series can be represented using a causality graph where $\mathcal{E}\triangleq\{(n,n^{\prime})\!:\!\bm{a}_{n,n^{\prime}}\!\neq\!\bm{0}_{P}\}$. Clearly, in such a graph, $\mathcal{N}(n)$ is the in-neighborhood of node $n$. To quantify the strength of these causality relations, a weighted graph can be constructed by assigning e.g. the weight $\|\bm{a}_{n,n^{\prime}}\|_{2}$ to the edge $(n,n^{\prime})$.
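As an illustration of the edge set $\mathcal{E}$ and the edge weights $\|\bm{a}_{n,n^{\prime}}\|_{2}$ just defined, the sketch below extracts both from a coefficient array. The helper name `causality_graph`, the `(P, N, N)` array layout, and the tolerance argument are assumptions made for this example.

```python
import numpy as np

def causality_graph(A, tol=0.0):
    """Return (edges, W) from VAR coefficients A of shape (P, N, N).

    W[n, m] = ||a_{n,m}||_2, the norm over the P lags of A[:, n, m].
    An edge (n, m) means that series m VAR-causes series n.
    """
    W = np.linalg.norm(A, axis=0)                      # shape (N, N)
    edges = {(int(n), int(m)) for n, m in zip(*np.nonzero(W > tol))}
    return edges, W

# Hypothetical coefficients: N = 3, P = 2 (0-based node indices).
A = np.zeros((2, 3, 3))
A[0] = 0.5 * np.eye(3)
A[1, 1, 0] = 0.3
edges, W = causality_graph(A)
in_neighbors_of_1 = sorted(m for n, m in edges if n == 1)
```

Here `in_neighbors_of_1` plays the role of $\mathcal{N}(n)$ for node 1. In practice a small positive `tol` may be preferable to an exact zero test when the coefficients are estimates.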

With these definitions, the batch problem of identifying a VAR causality graph reduces to estimating the VAR coefficient matrices $\{\bm{A}_{p}\}_{p=1}^{P}$ given $P$ and the observations $\{\bm{y}[t]\}_{t=0}^{T-1}$ . To simplify notation, form the tensor $\mathcal{A}$ by stacking the matrices $\{\bm{A}_{p}\}_{p=1}^{P}$ along the third dimension as shown in Fig. 1.

II-B Batch Estimation Criterion for Topology Identification

This section presents an estimation criterion to address the batch problem formulated in Sec. II-A. A natural estimate could be pursued through least-squares by minimizing [22]

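$$\mathcal{L}(\mathcal{A})\triangleq\frac{1}{2(T-P)}\sum_{\tau=P}^{T-1}\Big\lVert\bm{y}[\tau]-\sum_{p=1}^{P}\bm{A}_{p}\,\bm{y}[\tau-p]\Big\rVert_{2}^{2}=\frac{1}{2(T-P)}\sum_{n=1}^{N}\sum_{\tau=P}^{T-1}\Big[y_{n}[\tau]-\sum_{n^{\prime}=1}^{N}\sum_{p=1}^{P}a_{n,n^{\prime}}^{(p)}\,y_{n^{\prime}}[\tau-p]\Big]^{2}.$$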

This estimation task becomes underdetermined unless the number $NT$ of available data samples meaningfully exceeds the number of unknowns $PN^{2}$ , or, equivalently, $T\geq PN+P$ . Even more, to obtain a reasonable performance, one requires $T\gg PN+P$ which may not be possible in practice, especially if the parameters $\{\bm{A}_{p}\}_{p=1}^{P}$ remain constant only for short periods of time. To circumvent this limitation, one may note that most causality relations between two time series will be mediated by one or more time series. This means that the causality graph introduced in Sec. II-A is expected to be sparse, meaning that many of the vectors $\bm{a}_{n,n^{\prime}}$ equal zero. Such a sparsity structure can be promoted by properly regularizing the aforementioned least squares objective. To this end, the following criterion has been proposed in [29]:

$$\operatorname*{arg\,min}_{\mathcal{A}}~\mathcal{L}(\mathcal{A})+\lambda\sum_{n=1}^{N}\sum_{n^{\prime}=1,\,n^{\prime}\neq n}^{N}\lVert\bm{a}_{n,n^{\prime}}\rVert_{2}, \tag{3}$$

where $\lambda>0$ is a regularization parameter that can be adjusted e.g. via cross-validation [7, Ch. 1]. The second term in (3) is conventionally referred to as a group-lasso regularizer and the solution to (3) as a group-lasso estimate [44]. (Although other norms, such as the sum of infinity norms, can be used to enforce group sparsity, recoverability results associated with this norm are provided in [29].) This regularizer promotes a group-sparse structure in $\{\bm{A}_{p}\}_{p=1}^{P}$ to exploit the information that the number of edges in $\mathcal{E}$ is typically small. Self-connections ($\bm{a}_{n,n}$, $n=1,...,N$) are excluded from the regularization term so that the inferred causal relations pertain to the component of each time series that cannot be predicted using its own past. This is motivated by the improvement in consistency reported in [29]. The criterion (3) can be further motivated on the grounds of the consistency of group-lasso estimators [45].
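The regularization term in (3) can be evaluated with a few lines of NumPy. This is only a sketch: the name `group_lasso_penalty` and the `(P, N, N)` coefficient layout are assumptions for illustration.

```python
import numpy as np

def group_lasso_penalty(A, lam):
    """lam * sum over n != n' of ||a_{n,n'}||_2 for A of shape (P, N, N)."""
    W = np.linalg.norm(A, axis=0)   # W[n, n'] = ||a_{n,n'}||_2
    np.fill_diagonal(W, 0.0)        # self-connections are not penalized
    return lam * W.sum()

# Hypothetical coefficients: only one off-diagonal group is nonzero.
A = np.zeros((2, 3, 3))
A[0] = 0.5 * np.eye(3)
A[1, 1, 0] = 0.3
penalty = group_lasso_penalty(A, lam=2.0)
```

Because the diagonal groups are zeroed out before summing, the self-connection coefficients in `A[0]` do not contribute, and the penalty depends only on the off-diagonal group with norm 0.3.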

Remarkably, (3) separates along $n$ . To see this, let $\bm{a}_{n}\triangleq[\bm{a}_{n,1}^{\top},\bm{a}_{n,2}^{\top},...,\,\bm{a}_{n,N}^{\top}]^{\top}\!\in\!\mathbb{R}^{NP}$ and

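$$\bm{g}[t]\triangleq\mathrm{vec}\big(\left[\bm{y}[t-1],\ldots,\bm{y}[t-P]\right]^{\top}\big)\in\mathbb{R}^{NP}, \tag{4}$$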

and express $\mathcal{L}(\mathcal{A})$ as $\mathcal{L}(\mathcal{A})\!=\!{\sum_{n=1}^{N}{\ell}^{(n)}(\bm{a}_{n})}$ , where ${\ell}^{(n)}(\bm{a}_{n})\!\triangleq\!{1}/(T-P)\sum_{t=P}^{T-1}\ell_{t}^{(n)}(\bm{a}_{n})$ and $\ell_{t}^{(n)}(\bm{a}_{n})\!\triangleq\!{1}/{2}(y_{n}[t]-\bm{g}^{\top}[t]\bm{a}_{n})^{2}$ . Then, (3) becomes $\{{\bm{a}}_{n}^{*}\}_{n=1}^{N}\!=\!\operatorname*{arg\,min}_{\{\bm{a}_{n}\}_{n=1}^{N}}\!\sum_{n=1}^{N}\![{\ell}^{(n)}(\bm{a}_{n})\!\!+\!\!{\lambda}\!\sum_{\begin{subarray}{c}n^{\prime}=1,n^{\prime}\neq n\end{subarray}}^{N}\!\left\lVert\bm{a}_{n,n^{\prime}}\right\rVert_{2}],$ with

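$$\bm{a}_{n}^{*}=\operatorname*{arg\,min}_{\bm{a}_{n}}~\ell^{(n)}(\bm{a}_{n})+\lambda\sum_{n^{\prime}=1,\,n^{\prime}\neq n}^{N}\lVert\bm{a}_{n,n^{\prime}}\rVert_{2} \tag{5}$$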

for $n=1,\ldots,N$ . Thus, the VAR causality graph can be identified by separately estimating the VAR coefficients, and hence incoming edge weights, for each node.
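The separability argument can be checked numerically. The sketch below builds the regressor $\bm{g}[t]$ and the per-node vectors $\bm{a}_{n}$ with the stacking described above, and verifies that the least-squares cost equals the sum of the per-node costs. The function names, the `(P, N, N)` layout, and the random test data are assumptions for illustration.

```python
import numpy as np

def regressor(y, t, P):
    """g[t] = vec([y[t-1], ..., y[t-P]]^T), i.e. lags grouped per series."""
    lags = np.stack([y[t - p] for p in range(1, P + 1)])   # shape (P, N)
    return lags.flatten(order="F")                         # column-major vec

def node_coeffs(A, n):
    """a_n = [a_{n,1}^T, ..., a_{n,N}^T]^T from A of shape (P, N, N)."""
    return A[:, n, :].flatten(order="F")

def batch_loss(A, y):
    """Least-squares cost L(A) with the 1 / (2 (T - P)) normalization."""
    P, T = A.shape[0], y.shape[0]
    total = 0.0
    for t in range(P, T):
        pred = sum(A[p] @ y[t - 1 - p] for p in range(P))
        total += np.sum((y[t] - pred) ** 2)
    return total / (2 * (T - P))

def node_loss(A, y, n):
    """Per-node cost: average over t of 0.5 * (y_n[t] - g[t]^T a_n)^2."""
    P, T = A.shape[0], y.shape[0]
    a_n = node_coeffs(A, n)
    terms = [0.5 * (y[t, n] - regressor(y, t, P) @ a_n) ** 2
             for t in range(P, T)]
    return sum(terms) / (T - P)

rng = np.random.default_rng(0)
y = rng.standard_normal((50, 4))                 # arbitrary test data
A = 0.1 * rng.standard_normal((3, 4, 4))         # arbitrary coefficients
total = sum(node_loss(A, y, n) for n in range(4))
```

Since the group-lasso term also splits across $n$, each node's incoming coefficients can be estimated independently, which is what makes a per-node online update possible.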

The batch estimation criterion in (5) requires all data $\{\bm{y}[t]\}_{t=0}^{T-1}$ before an estimate can be obtained and cannot track changes. Furthermore, solving (5) eventually becomes prohibitively complex for sufficiently large $T$ . To address these challenges, this paper adopts the framework of online optimization, which is reviewed in the following subsection.

Remark: As seen in (3), $\lambda$ is the same for all candidate edges $(n,n^{\prime})$ . This can be readily replaced with an edge-dependent regularization parameter $\lambda_{n,n^{\prime}}$ without any complexity increase to exploit possibly available prior-information about edges.

II-C Background on Online Optimization

This section reviews the fundamental notions of online optimization from a general perspective, not necessarily applied to the problem of topology identification. To this end, consider the generic unconstrained optimization problem

$$\underset{\bm{a}}{\text{minimize}}~\frac{1}{T_{0}}\sum_{t=0}^{T_{0}-1}h_{t}(\bm{a}), \tag{6}$$

where $h_{t}(\bm{a})$ is a convex function, which in many applications depends on the data received at time $t$ . For example, in least squares $h_{t}(\bm{a})\!=\!\lVert\bm{X}[t]\bm{a}-\bm{y}[t]\rVert_{2}^{2}$ , where $\bm{y}[t]$ and $\bm{X}[t]$ are the data vector and matrix made available at time $t$ . To solve (6), it is necessary that all $\{h_{t}(\bm{a})\}_{t=0}^{T_{0}-1}$ be available. Approaches that process all data at once are termed batch and, hence, suffer from potentially long waiting times, which generally render them inappropriate for real-time operation. Besides, computational complexity and memory generally grow super-linearly with $T_{0}$ , which eventually becomes prohibitive.

Online algorithms alleviate these limitations. Let $\bm{a}[t+1]$ denote an estimate of the solution to (6) at time $t$ produced by an online algorithm. Online algorithms compute a new $\bm{a}[t+1]$ every time a new $(\bm{X}[t],\bm{y}[t])$ data element (or, more generally, a new $h_{t}(\bm{a})$ ) is processed. At every iteration, also known as update, $\bm{a}[t+1]$ is obtained from $\bm{a}[t]$ , $\bm{y}[t]$ , $\bm{X}[t]$ , and possibly some additional information carried from each update to the next. The memory requirements and number of arithmetic operations per iteration must not grow unbounded for increasing $t$ . This requirement rules out approaches involving solving (6) as a batch problem per update or carrying all the past data $\{(\bm{X}[\tau],\bm{y}[\tau])\}_{\tau=0}^{t-1}$ from the $(t\!-\!1)$ -th update to the $t$ -th update. Thus, online algorithms are especially appealing when data vectors are received sequentially or $T_{0}$ is so large that batch solvers are not computationally affordable. Additionally, online algorithms can track changes in the underlying model. When $\bm{a}$ represents a probabilistic model parameter that must be estimated, the estimate obtained through an online method is generally capable of tracking variations in $\bm{a}$ so long as they do not occur too rapidly.

The most common performance metric to evaluate online algorithms is the regret, which quantifies the cumulative loss incurred by an online algorithm relative to the loss corresponding to the optimal constant solution in hindsight. The static regret is known simply as regret in earlier works, e.g. [46]; different types of regret were formalized later, see e.g. [47]. Formally, the (static) regret at iteration $T_{0}-1$ is given by [46]:

$$R_{s}[T_{0}]\triangleq\sum_{t=0}^{T_{0}-1}\big[h_{t}(\bm{a}[t])-h_{t}(\bm{a}^{*}[T_{0}])\big], \tag{7}$$

where $\bm{a}^{\ast}[T_{0}]\triangleq\operatorname*{arg\,min}_{\bm{a}}~({1}/{T_{0}})\sum_{t=0}^{T_{0}-1}h_{t}(\bm{a})$ is the optimal constant hindsight solution, i.e., the batch solution after $T_{0}$ data vectors have been processed. Observe that the regret in (7) may be negative since the estimates $\{\bm{a}[t]\}_{t}$ are allowed to depend on $t$ and, hence, it may hold that $h_{t}(\bm{a}[t])\leq h_{t}(\bm{a}^{\ast}[T_{0}])$ for multiple (potentially all) values of $t$. In practice, nevertheless, the regret will typically be positive and increase with $T_{0}$. To be deemed admissible, online algorithms must yield a sublinear regret, i.e., $R_{s}[T_{0}]/T_{0}\!\rightarrow\!0$ as $T_{0}\!\rightarrow\!\infty$. Thus, online algorithms with sublinear regret perform asymptotically as well as the batch solution on average. It is worth noting that the online learning framework does not involve statistical assumptions on the data, which can even be generated by an “adversary” [48].
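The static regret can be computed directly from its definition. The following sketch does so for a hypothetical least-squares stream processed by plain online gradient descent; the data model, the step-size rule, and all names are assumptions chosen for illustration and are not the algorithms developed in this paper.

```python
import numpy as np

def static_regret(losses, iterates, a_star):
    """R_s[T0] = sum_t [h_t(a[t]) - h_t(a_star)] for callables h_t."""
    return sum(h(a) - h(a_star) for h, a in zip(losses, iterates))

# Hypothetical stream: h_t(a) = 0.5 * (x_t @ a - y_t)**2
rng = np.random.default_rng(0)
T0, d = 500, 3
X = rng.standard_normal((T0, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(T0)
losses = [lambda a, x=x, yt=yt: 0.5 * (x @ a - yt) ** 2
          for x, yt in zip(X, y)]

# Online gradient descent with a decaying step size (an assumed choice)
a = np.zeros(d)
iterates = []
for t in range(T0):
    iterates.append(a.copy())                 # a[t] is used before seeing h_t
    grad = X[t] * (X[t] @ a - y[t])
    a = a - 0.1 / np.sqrt(t + 1) * grad

a_star = np.linalg.lstsq(X, y, rcond=None)[0]  # batch hindsight solution
R_s = static_regret(losses, iterates, a_star)
average_regret = R_s / T0
```

Any constant sequence of iterates yields a nonnegative value here, because `a_star` minimizes the summed loss; a negative regret can only arise when the iterates change over time.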

In dynamic settings where the parameters of the data generating process vary over time, $\bm{a}^{\ast}[T_{0}]$ may not be a suitable reference since its computation involves potentially very old data, namely $\{h_{t}\}_{t\ll T_{0}}$ , which is informative about old values of the true parameters but not about the new values. In those cases, it is customary to compare against the instantaneous minimizer $\bm{a}^{\circ}[t]\!\triangleq\!\operatorname*{arg\,min}_{\bm{a}}h_{t}(\bm{a})$ by means of the so-called dynamic regret [47], [49]:

$$R_{d}[T_{0}]\triangleq\sum_{t=0}^{T_{0}-1}\big[h_{t}(\bm{a}[t])-h_{t}(\bm{a}^{\circ}[t])\big].$$

More details about the dynamic regret are given in Sec. IV-C.

III Online Topology Identification

This section develops online algorithms for the considered problem of topology identification from time series. To this end, cast (5) for the $n$ -th node in the form (6) by setting

[TABLE]

for $t=0,...,T-P-1$ . The most immediate approach to solve (6) would be applying online subgradient descent (OSGD), whose updates are given by $\bm{a}_{n}[t+1]=\bm{a}_{n}[t]-\alpha_{t}\tilde{\mathbf{w}}_{n}[t]$ with $\tilde{\mathbf{w}}_{n}[t]$ a subgradient of $h_{t}$ at $\bm{a}_{n}[t]$ and $\alpha_{t}$ the step size at time $t$ . From (8), $\tilde{\mathbf{w}}_{n}[t]$ equals $\nabla\ell^{(n)}_{t+P}(\bm{a}_{n}[t])$ plus $\lambda$ times a valid subgradient of the form

[TABLE]

evaluated at $\bm{a}_{n}[t]$. For example, for $\bm{x}\in\mathbb{R}^{P}$, set $\tilde{\nabla}_{\bm{x}}\lVert\bm{x}\rVert_{2}=\bm{x}/\lVert\bm{x}\rVert_{2}$ for $\bm{x}\neq\bm{0}_{P}$ and $\tilde{\nabla}_{\bm{x}}\lVert\bm{x}\rVert_{2}=\bm{0}_{P}$ for $\bm{x}=\bm{0}_{P}$. It is easy to see that the resulting iterates $\bm{a}_{n}[t]$ are not necessarily sparse; see also [42]. Since the solution to the batch problem is indeed sparse for a properly selected $\lambda$, alternative approaches are required.
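The lack of sparsity of OSGD can be seen in a small sketch. The code below applies one OSGD step to a single group using the subgradient choice stated above; the function names, the numerical values, and the zero loss gradient are assumptions made for illustration.

```python
import math

def subgrad_l2(x):
    """Subgradient of ||x||_2 as chosen in the text: x/||x|| if x != 0, else 0."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm > 0 else [0.0] * len(x)

def osgd_step(a, grad_loss, lam, alpha):
    """One OSGD step on one group: a - alpha * (grad_loss + lam * subgradient)."""
    s = subgrad_l2(a)
    return [ai - alpha * (gi + lam * si) for ai, gi, si in zip(a, grad_loss, s)]

# A small group is pushed toward zero but does not land exactly on it.
a_next = osgd_step([0.01, -0.02], grad_loss=[0.0, 0.0], lam=1.0, alpha=0.005)
```

Even with a vanishing loss gradient, the step only shortens the group by $\alpha_{t}\lambda$ in norm, so an entry becomes exactly zero only by coincidence.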

To this end, note that OSGD fails to provide sparse iterates because it implicitly linearizes the instantaneous objective $h_{t}(\bm{a}_{n})$ . Since the regularizer (last term in (8)) is not differentiable, it is not well approximated by a linear function and, as a result, it fails to promote sparsity. To address this issue, composite algorithms decompose $h_{t}(\bm{a}_{n})$ as $h_{t}(\bm{a}_{n})\!=\!{f_{t}^{(n)}}(\bm{a}_{n})+{{\Omega}^{(n)}}(\bm{a}_{n})$ , where ${f_{t}^{(n)}}(\bm{a}_{n})$ is a convex loss function and ${{\Omega}^{(n)}}(\bm{a}_{n})$ is a convex regularizer, and linearize only ${f_{t}^{(n)}}(\bm{a}_{n})$ . Algorithms of this family, which include regularized dual averaging (RDA) [50] and composite objective mirror descent (COMID) [42], solve the generic problem

[TABLE]

by linearizing ${f_{t}^{(n)}}(\bm{a}_{n})$ but not ${\Omega}^{(n)}(\bm{a}_{n})$ . This work focuses on COMID since, unlike RDA, there exist bounds for its regret for constant step size when the regularizer is not strongly convex. The COMID update is

[TABLE]

where $\tilde{\nabla}{f_{t}^{(n)}}(\bm{a}_{n}[t])$ is a subgradient of ${f_{t}^{(n)}}$ at point $\bm{a}_{n}[t]$ (that is, $\tilde{\nabla}{f_{t}^{(n)}}(\bm{a}_{n}[t])\in\partial f_{t}^{(n)}(\bm{a}_{n}[t])$), $\alpha_{t}>0$ is a step size, and $B_{\psi}(\bm{w},\bm{v})\triangleq\psi(\bm{w})-\psi(\bm{v})-\nabla\psi^{\top}\left(\bm{v}\right)(\bm{w}-\bm{v})$ is the so-called Bregman divergence associated with a $\zeta$-strongly convex and continuously differentiable function $\psi$. The strong convexity condition means that $B_{\psi}(\bm{w},\bm{v})\geq({\zeta}/{2})\lVert\bm{w}-\bm{v}\rVert^{2}$, which motivates using $B_{\psi}(\bm{w},\bm{v})$ as a surrogate of a distance between $\bm{w}$ and $\bm{v}$. Thus, the Bregman divergence in (10) penalizes updates $\bm{a}_{n}[t+1]$ lying far from the previous one $\bm{a}_{n}[t]$, which essentially smooths the sequence of iterates.

Relative to each term in (9), the loss ${f_{t}^{(n)}}$ in (10) has been linearized but the regularizer ${{\Omega}^{(n)}}(\bm{a}_{n})$ has been kept intact. When ${{\Omega}^{(n)}}(\bm{a}_{n})$ is a sparsity-promoting regularizer, the online estimate $\bm{a}_{n}[t+1]$ is therefore expected to be sparse.

In view of these appealing features, the algorithm proposed in Sec. III-A builds upon COMID to address the problem of online causality graph identification from time series.

III-A Topology Identification via Sparse Online Optimization

This section proposes topology identification via sparse online optimization (TISO), an online algorithm for the problem in Sec. II-B that provides a causality graph estimate every time a new $\bm{y}[t]$ is processed. The key idea of this first algorithm is to refine the previous topology estimate with the information provided by the new data vector by means of a COMID update.

To this end, express $h_{t}$ in (8) in the form $h_{t}(\bm{a}_{n})={f_{t}^{(n)}}(\bm{a}_{n})+{{\Omega}^{(n)}}(\bm{a}_{n})$ by setting

[TABLE]

for $t\!=\!0,...,T-P-1.$

To choose $B_{\psi}(\bm{w},\bm{v})$ , note that (10) with ${f_{t}^{(n)}}(\bm{a}_{n})$ and ${{\Omega}^{(n)}}(\bm{a}_{n})$ given by (11) can be solved in closed form when $\psi(\cdot)=1/2\lVert\cdot\rVert_{2}^{2}$ . In that case, $B_{\psi}(\bm{w},\bm{v})\!=\!1/2\lVert\bm{w}-\bm{v}\rVert_{2}^{2}$ and $\bm{a}_{n}[t+1]$ can be found via a modified group soft-thresholding operator, as detailed next. With these expressions, the TISO update after processing $\{\bm{y}[\tau]\}_{\tau=0}^{t}$ is

[TABLE]

where

[TABLE]

and (using the vector $\bm{g}[t]$ defined in (4))

[TABLE]

To solve (12) in closed form, expand the squared norm in (13) to obtain

[TABLE]

where $\bm{v}_{n}[t]\triangleq[\,\bm{v}_{n,1}^{\top}[t],...,\bm{v}_{n,N}^{\top}[t]\,]^{\top}$ and $\bm{v}_{n,n^{\prime}}[t]\!\in\!\mathbb{R}^{P}~{}\forall n^{\prime}$ . From (15), it can be observed that the updates in (12) can be computed separately for each group $n^{\prime}=1,...,N$ .

For $n^{\prime}\neq n$ , the $n^{\prime}$ -th subvector of $\bm{a}_{n}[t+1]$ (or $n^{\prime}$ -th group) can be expressed in terms of the so-called multidimensional shrinkage-thresholding operator [51] as:

[TABLE]

where $\bm{a}_{n,n^{\prime}}^{\text{f}}\left[t\right]\!\triangleq\!\bm{a}_{n,n^{\prime}}[t]\!-\!\alpha_{t}\bm{v}_{n,n^{\prime}}[t]$ . Expression (16) is composed of two terms: whereas $\bm{a}_{n,n^{\prime}}^{\text{f}}[t]$ is the result of performing a gradient-descent step in a direction that decreases the instantaneous loss ${\ell}_{t}^{(n)}(\bm{a}_{n})$ , the second term promotes group sparsity by setting $\bm{a}_{n,n^{\prime}}[t+1]\!=\!\bm{0}_{P}$ for those groups $n^{\prime}$ with $\lVert\bm{a}_{n,n^{\prime}}^{\text{f}}[t]\rVert_{2}\leq\alpha_{t}\,\lambda$ . Recalling that each vector $\bm{a}_{n,n^{\prime}}$ corresponds to an edge in the estimated causality graph (see Sec. II-A), expression (16) indicates that only the relatively strong edges (i.e. causality relations) survive. In view of such a shrinkage operation, a larger $\lambda$ will result in sparser estimates.

On the other hand, when $n^{\prime}\!=\!n$ , the $n^{\prime}$ -th subvector of $\bm{a}_{n}[t+1]$ in (12) is given by:

[TABLE]

and, as intended, no sparsity is promoted on self-connections; see Sec. II-B. Combining (16) and (17), the estimate of the $n^{\prime}$ -th group at time $t+1$ is given by:

[TABLE]

The performance of TISO depends on the choice of the step-size sequence $\{\alpha_{t}\}_{t}$ , as discussed in Sec. IV. The overall TISO algorithm is listed as Procedure 1. It only requires $\mathcal{O}(N^{2}P)$ memory entries to store the last $P$ data vectors and the last estimate. On the other hand, each update requires $\mathcal{O}(N^{2}P)$ arithmetic operations, which is in the same order as the number of parameters to be estimated. Thus, TISO can arguably be deemed a low-complexity algorithm.
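A single TISO update for node $n$ can be sketched in a few lines of Python. This is an illustrative sketch and not the authors' Procedure 1: it assumes the instantaneous loss is the squared prediction residual $\frac{1}{2}(y_{n}[t]-\bm{g}^{\top}[t]\bm{a}_{n})^{2}$, so that its gradient is the residual times $\bm{g}[t]$, and it stores $\bm{a}_{n}$ as a flat list with $P$ consecutive entries per group. The function name and argument layout are hypothetical.

```python
import math

def tiso_update(a, g, y_n, n, P, alpha, lam):
    """One TISO-style step for node n; a and g are flat lists of length N*P."""
    # Residual of predicting y_n[t] with the previous estimate (assumed loss).
    resid = sum(gi * ai for gi, ai in zip(g, a)) - y_n
    # Gradient-descent step on the instantaneous loss.
    a_f = [ai - alpha * resid * gi for ai, gi in zip(a, g)]
    out = []
    for k in range(len(a) // P):
        group = a_f[k * P:(k + 1) * P]
        if k == n:
            out.extend(group)  # self-connection: no sparsity is promoted
            continue
        norm = math.sqrt(sum(v * v for v in group))
        # Group shrinkage: zero the group if its norm is at most alpha * lam.
        scale = max(0.0, 1.0 - alpha * lam / norm) if norm > 0 else 0.0
        out.extend(scale * v for v in group)
    return out

# Two nodes, P = 1: the weak cross edge is set exactly to zero.
a_new = tiso_update([0.0, 0.0], g=[1.0, 0.1], y_n=1.0, n=0, P=1, alpha=0.5, lam=1.0)
```

The loop touches each of the $NP$ entries once per node, which is consistent with the $\mathcal{O}(N^{2}P)$ operation count per update when all $N$ nodes are processed.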

The next section will build upon TISO to develop an algorithm with increased robustness to input variability.

III-B Topology Identification via Recursive Sparse Online Optimization

As seen in Sec. III-A, each update of TISO depends on the data through the instantaneous loss $\ell_{t}^{(n)}(\bm{a}_{n}[t])$, which quantifies the prediction error of the newly received vector $\bm{y}[t]$ when the VAR parameters $\mathcal{A}$ are given by the previous estimate $\bm{a}_{n}[t]$. Thus, the residual of predicting each data vector is used only in a single TISO update. Although this renders TISO a computationally efficient algorithm for online topology identification, it also increases sensitivity to noise and input variability. To mitigate this sensitivity, this section pursues an alternative approach at the expense of a moderate increase in computational complexity and memory requirements.

It is clear from (12) that $\bm{a}_{n}[t+1]$ is determined by $\bm{a}_{n}[t]$ and $\bm{v}_{n}[t]$ . The latter incorporates the residual only at time $t$ . The step size $\alpha_{t}$ controls how much variability in the input data propagates to the estimates $\{\bm{a}_{n}[t]\}_{t}$ . When a diminishing step-size sequence is adopted, the influence of each new $\bm{y}[t]$ on the estimate becomes arbitrarily small, and the variability of the estimates fades away. However, decreasing sequences cannot be utilized when the application at hand demands tracking changes in the coefficients $\mathcal{A}$ . In these settings, a constant step size $\alpha_{t}\!=\!\alpha$ is preferable. In such a scenario, a desire to reduce output variability would therefore force one to adopt a small $\alpha$ , but this would hinder TISO from tracking changes in the topology.

An approach to reduce output variability without sacrificing tracking capability will be developed next by drawing inspiration from the connections between TISO, the least mean squares (LMS) algorithm, and the recursive least squares (RLS) algorithm [52]. Indeed, observe that TISO generalizes LMS, which is recovered for $\lambda=0$. To speed up convergence and reduce variability in the output of LMS, it is customary to resort to RLS, which accommodates the received data in a more sophisticated fashion, allowing one to control the influence of each data vector on future estimates through forgetting factors.

Along these lines, the trick is to replace the instantaneous loss $\ell_{t}^{(n)}(\bm{a}_{n})$ in (11) with a running average loss. To maintain tracking capabilities, a heavier weight is assigned to recent data using the exponential window customarily adopted by RLS. Specifically, consider setting ${f_{t}^{(n)}}(\bm{a}_{n})\!=\!\tilde{\ell}_{t}^{(n)}(\bm{a}_{n})$ in (11) with

[TABLE]

where $\gamma\!\in\!(0,1)$ is the user-selected forgetting factor and $\mu\!=\!1-\gamma$ is set to normalize the exponential weighting window, i.e., $\mu\sum_{\tau=0}^{\infty}\gamma^{\tau}\!=\!1$ .

Having specified a loss function, the next step is to derive the update equation. In a direct application of COMID to solve (9) with ${f_{t}^{(n)}}(\bm{a}_{n})\!=\!\tilde{\ell}_{t}^{(n)}(\bm{a}_{n})$ , each iteration would involve the evaluation of the gradient of the $t\!-\!P\!+\!1$ terms of $\tilde{\ell}_{t}^{(n)}$ . The computational complexity per iteration would grow with $t$ and, therefore, the resulting updates would not make up a truly online algorithm according to the requirements expressed in Sec. II-C. To remedy this issue, the structure of (19) will be exploited next to develop an algorithm with constant memory and complexity per iteration. To this end, expand and rewrite (19) to obtain

[TABLE]

where

[TABLE]

The variables $\bm{\Phi}[t]$ and $\bm{r}_{n}[t]$ can be respectively thought of as a weighted sample autocorrelation matrix and a weighted sample cross-correlation vector. The key observation here is that, as occurs in RLS, these quantities can be updated recursively as $\bm{\Phi}[t]=\gamma\,\bm{\Phi}[t-1]+\mu\,\bm{g}[t]\,\bm{g}^{\top}[t]$ and $\bm{r}_{n}[t]=\gamma\,\bm{r}_{n}[t-1]+\mu\,y_{n}[t]\,\bm{g}[t]$ . Noting that

[TABLE]

and letting ${\tilde{\bm{v}}}_{n}[t]\triangleq[{\tilde{\bm{v}}}_{n,1}^{\top}[t],\ldots,{\tilde{\bm{v}}}_{n,N}^{\top}[t]]^{\top}\!\triangleq\!\nabla\tilde{\ell}_{t}^{(n)}(\bm{a}_{n}[t])$ , the estimate ${{\tilde{\bm{a}}}}_{n}[t+1]$ after receiving $\{\bm{y}[\tau]\}_{\tau=0}^{t}$ becomes

[TABLE]

where

[TABLE]

Proceeding similarly to Sec. III-A yields the update

[TABLE]

where ${\tilde{\bm{a}}}^{\text{f}}_{n,n^{\prime}}[t]\!\triangleq\!{\tilde{\bm{a}}}_{n,n^{\prime}}[t]\!-\!\alpha_{t}{\tilde{\bm{v}}}_{n,n^{\prime}}[t]$ . Due to the recursive nature of the updates for $\bm{\Phi}[t]$ and $\bm{r}_{n}[t]$ , the resulting algorithm is termed Topology Identification via Recursive Sparse Online optimization (TIRSO) and tabulated as Procedure 2.
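The recursive updates $\bm{\Phi}[t]=\gamma\,\bm{\Phi}[t-1]+\mu\,\bm{g}[t]\,\bm{g}^{\top}[t]$ and $\bm{r}_{n}[t]=\gamma\,\bm{r}_{n}[t-1]+\mu\,y_{n}[t]\,\bm{g}[t]$ stated above can be written as the short Python sketch below. The zero initialization and the sample values are assumptions made for illustration; the paper's Procedure 2 may initialize these quantities differently.

```python
def update_correlations(Phi, r, g, y_n, gamma):
    """Exponentially weighted recursive update of Phi and r_n with mu = 1 - gamma."""
    mu = 1.0 - gamma
    d = len(g)
    Phi_new = [[gamma * Phi[i][j] + mu * g[i] * g[j] for j in range(d)]
               for i in range(d)]
    r_new = [gamma * r[i] + mu * y_n * g[i] for i in range(d)]
    return Phi_new, r_new

gamma = 0.9
# Hypothetical (g[t], y_n[t]) pairs processed in order of arrival.
samples = [([1.0, 2.0], 1.0), ([0.5, -1.0], -1.0), ([2.0, 0.0], 0.5)]
Phi = [[0.0, 0.0], [0.0, 0.0]]
r = [0.0, 0.0]
for g, y_n in samples:
    Phi, r = update_correlations(Phi, r, g, y_n, gamma)
```

Each call costs a fixed number of operations regardless of how many samples have been processed, which is what allows the algorithm to remain online despite the running-average loss.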

The choice of the step size affects the convergence properties of TIRSO, as analyzed in Sec. IV. Regarding step size selection, natural choices include (i) constant step size, which is convenient in dynamic setups where changes in the coefficients $\mathcal{A}$ need to be tracked over time (see Theorem 5) but also gives rise to performance guarantees in static scenarios (see the static regret results for TISO and TIRSO in the supplementary material); (ii) diminishing step size, commonly in the form of $\mathcal{O}(1/\sqrt{t})$ or $\mathcal{O}(1/t)$ (see Theorem 4); or (iii) an adaptive step size that depends on the data, as discussed at the end of Sec. IV-C.

Observe that $\bm{\Phi}[t]$ only needs to be updated once per observed sample $t$, whereas the vector $\bm{r}_{n}[t]$ needs to be updated for each $n$ at every $t$. The computational complexity is dominated by step 7, which requires $\mathcal{O}(N^{3}P^{2})$ operations per $t$. However, exploiting the group-sparse structure of ${{\tilde{\bm{a}}}}_{n}[t]$ may reduce the computation by disregarding the columns of $\bm{\Phi}[t]$ corresponding to the zero entries of ${\tilde{\bm{a}}}_{n}[t]$. If, for instance, the number of edges is $\mathcal{O}(N)$, then the complexity of TIRSO becomes $\mathcal{O}(N^{2}P^{2})$ per $t$. Regarding memory complexity, TIRSO requires $N^{2}P^{2}$ memory positions to store $\bm{\Phi}[t]$ and $N^{2}P$ positions to store $\{\bm{r}_{n}[t]\}_{n=1}^{N}$.
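A TIRSO iteration for one node can be sketched as follows. The sketch assumes that the gradient of the running-average loss is $\bm{\Phi}[t]\,{\tilde{\bm{a}}}_{n}[t]-\bm{r}_{n}[t]$, which is what the quadratic form in terms of $\bm{\Phi}[t]$ and $\bm{r}_{n}[t]$ suggests, and it skips the columns of $\bm{\Phi}[t]$ that multiply zero entries of the estimate. Names and data layout are hypothetical.

```python
import math

def tirso_update(a, Phi, r, n, P, alpha, lam):
    """One TIRSO-style step for node n; a and r are flat lists, Phi a square matrix."""
    d = len(a)
    support = [j for j in range(d) if a[j] != 0.0]  # non-zero entries only
    # Assumed gradient Phi @ a - r, evaluated over the support of a.
    v = [sum(Phi[i][j] * a[j] for j in support) - r[i] for i in range(d)]
    a_f = [a[i] - alpha * v[i] for i in range(d)]
    out = []
    for k in range(d // P):
        group = a_f[k * P:(k + 1) * P]
        if k == n:
            out.extend(group)  # self-connection is not shrunk
            continue
        norm = math.sqrt(sum(x * x for x in group))
        scale = max(0.0, 1.0 - alpha * lam / norm) if norm > 0 else 0.0
        out.extend(scale * x for x in group)
    return out

Phi = [[1.0, 0.0], [0.0, 1.0]]
r = [1.0, 0.05]
a_sparse = tirso_update([0.0, 0.0], Phi, r, n=0, P=1, alpha=1.0, lam=0.1)
a_dense = tirso_update([0.0, 0.0], Phi, r, n=0, P=1, alpha=1.0, lam=0.01)
```

When the estimate has few non-zero groups, the inner sum runs over a short support, which reflects the complexity reduction mentioned above.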

IV Theoretical Results

In this section, the performance of TISO and TIRSO is analyzed. The upcoming results will make use of one or more of the following assumptions:

A1. Bounded samples: There exists $B_{y}>0$ such that $|y_{n}[t]|^{2}\leq B_{y}~\forall\,n,t$.

A2. Bounded minimum eigenvalue of $\bm{\Phi}[t]$: There exists $\beta_{\tilde{\ell}}>0$ such that $\lambda_{\mathrm{min}}(\bm{\Phi}[t])\geq\beta_{\tilde{\ell}},~\forall\,t\geq P$.

A3. Bounded maximum eigenvalue of $\bm{\Phi}[t]$: There exists $L>0$ such that $\lambda_{\mathrm{max}}(\bm{\Phi}[t])\leq L,~\forall\,t\geq P$.

A4. Asymptotically invertible sample covariance: There exist $T_{m}$ and $\beta$ such that

[TABLE]

Note that A1 entails no loss of generality in real-world applications, where data are bounded and thus $B_{y}$ necessarily exists. A2 usually holds in practice unless the data are redundant, meaning that some time series can be obtained as a linear combination of the others. In general, the latter will not be the case, e.g., if the data $\{\bm{y}[t]\}_{t}$ adhere to a continuous probability distribution, in which case $\bm{\Phi}[t]$ is positive definite for all $t\geq P$ with probability 1. A3 will also hold in practice since it can be shown that it is implied by A1. In particular, if A1 holds, then A3 holds with $L=PNB_{y}$. Similarly, A4 will also generally hold since it is a weaker version of A2.

Next, the asymptotic equivalence of the batch solutions for TISO and TIRSO is established.

IV-A Asymptotic Equivalence between TISO and TIRSO

To complement the arguments given in Sec. III-B to support the decision of setting ${f_{t}^{(n)}}(\bm{a}_{n})\!\!=\!\!\tilde{\ell}_{t}^{(n)}(\bm{a}_{n})$ , which laid the grounds to develop TIRSO, we establish that the batch problems that TISO and TIRSO implicitly solve become asymptotically equivalent as $T\!\rightarrow\!\infty$ . To this end, let $\bm{a}_{n}^{*}[T]$ denote the hindsight solution for TISO, which is given by

[TABLE]

where

[TABLE]

Observe that (28) is identical to the objective in the batch criterion (5). Likewise, let ${\tilde{\bm{a}}}^{*}_{n}[T]$ denote the hindsight solution of TIRSO, which is given by

[TABLE]

with

[TABLE]

In this case, (30) no longer coincides with the objective in (5). Therefore, one can argue that the TIRSO algorithm is not pursuing the estimates that minimize the batch criterion (5). This idea is dispelled next by establishing the asymptotic equivalence between minimizing $\tilde{C}_{T}(\bm{a}_{n})$ and minimizing $C_{T}(\bm{a}_{n})$, since the latter is identical to (5).

Theorem 1.

Under assumption A1:

1. It holds for all $\bm{a}_{n}$ that $\displaystyle\lim_{T\rightarrow\infty}|C_{T}(\bm{a}_{n})-\tilde{C}_{T}(\bm{a}_{n})|=0.$

2. It holds that $\displaystyle\lim_{T\rightarrow\infty}\big|\inf_{\bm{a}_{n}}C_{T}(\bm{a}_{n})-\inf_{\bm{a}_{n}}\tilde{C}_{T}(\bm{a}_{n})\big|=0.$

3. If, additionally, assumption A2 holds, then $\lim_{T\rightarrow\infty}\|\bm{a}_{n}^{*}[T]-{\tilde{\bm{a}}}_{n}^{*}[T]\|_{2}=0.$

Proof:

See Appendix A in the supplementary material. ∎

Theorem 1 essentially establishes not only that the TISO and TIRSO hindsight objectives are asymptotically the same but also that their minima and minimizers asymptotically coincide. Since the TISO hindsight objective equals the batch objective (5), it follows that the TIRSO hindsight objective asymptotically approaches the batch objective (5). This observation is very important since the regret analysis from Sec. IV-B will establish that the TISO and TIRSO estimates asymptotically match their hindsight counterparts.

IV-B Static Regret Analysis

This section characterizes the performance of TISO and TIRSO analytically. Specifically, it is shown that the sequences of estimates produced by these algorithms yield a sublinear static regret, which is a basic requirement in online optimization; see Sec. II-C. Broadly speaking, this property means that, on average and asymptotically, the online estimates perform as well as their hindsight counterparts.

A general definition of the regret metric is given in (7). Since the problem at hand is separable across nodes, it is natural to separately quantify the regret for each node. The total regret will be the sum of the regret for all nodes. Applying this idea and shifting the time index to simplify notation, one can replace $R_{{s}}[T_{0}]$ in (7) with $R_{s}^{(n)}[T_{0}+P-1]$ , function $h_{t}$ with $h_{t+P}^{(n)}$ , and $T_{0}$ with $T-P+1$ to write the regret of TISO for the $n$ -th node at time $T$ as

[TABLE]

where $h_{t}^{(n)}(\cdot)={\ell_{t}^{(n)}}(\cdot)+{{\Omega}^{(n)}}(\cdot)$ and $\bm{a}_{n}^{*}[T]$ is defined in (27). For TIRSO, the regret for the $n$ -th node is given by

[TABLE]

where $\tilde{h}_{t}^{(n)}(\cdot)={\tilde{\ell}_{t}^{(n)}}(\cdot)+{{\Omega}^{(n)}}(\cdot)$ and ${\tilde{\bm{a}}}_{n}^{*}[T]$ is defined in (29).

Since constant step size sequences allow tracking time-varying topologies, one could think of seeking a sublinear bound for the regret. However, it is easy to see (cf. (17) and (18) in the case of TISO) that the sequences of estimates in this case are generally noisy, unless the innovation process $\bm{u}[t]$ in (1) is $\bm{0}_{N}$ . For this reason, a sublinear regret bound cannot be obtained for a constant $\alpha_{t}$ . However, it is possible to establish sublinear regret when the step size is “asymptotically constant,” as described next.

The idea is to run the selected algorithm in time windows of exponentially increasing length with a step size that differs across windows but is constant within each one. Specifically, let the $(m+1)$ -th window, $m=1,\ldots,M$ , comprise the time indices $t_{0}2^{m-1}<t\leq t_{0}2^{m}$ for some user-selected $t_{0}\geq P$ . Set $\alpha_{t}=\alpha_{[m]}$ for those $t$ satisfying $t_{0}2^{m-1}<t\leq t_{0}2^{m}$ . The following result proves sublinear regret for TISO.

Theorem 2.

Let $\{\bm{a}_{n}[t]\}_{t=P}^{T}$ be generated by applying TISO (Procedure 1) with step size $\alpha_{t}=\alpha_{[m]}=\mathcal{O}(1/\sqrt{t_{0}2^{m-1}})$ in the window $t_{0}2^{m-1}<t\leq t_{0}2^{m}$, $m=1,2,\ldots$ Then, the regret of TISO under assumptions A1 and A4 is

[TABLE]

where $B_{\bm{a}}=(1/\beta)\big(B_{y}\sqrt{PN}+\sqrt{B_{y}^{2}PN+\beta B_{y}}\big)$.

Proof:

See Appendix B in the supplementary material. ∎
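The windowed step-size schedule used in this result can be sketched as follows. The constant `c` stands for the unspecified factor hidden in the $\mathcal{O}(\cdot)$ notation and is an assumption, as are the function names; indices $t\leq t_{0}$ fall outside the windows described in the text and are mapped to the first window here for convenience.

```python
import math

def window_index(t, t0):
    """Smallest m >= 1 with t <= t0 * 2**m, so that t0*2**(m-1) < t <= t0*2**m."""
    m = 1
    while t > t0 * 2 ** m:
        m += 1
    return m

def step_size(t, t0, c=1.0):
    """Step size that is constant within each window of doubling length."""
    m = window_index(t, t0)
    return c / math.sqrt(t0 * 2 ** (m - 1))

# With t0 = 4, the windows are (4, 8], (8, 16], (16, 32], ...
schedule = [step_size(t, t0=4) for t in range(5, 33)]
```

Because each window is twice as long as the previous one, the step size changes ever more rarely, which is the sense in which it is “asymptotically constant.”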

Similarly, the regret of TIRSO is characterized as follows:

Theorem 3.

Let $\{{\tilde{\bm{a}}}_{n}[t]\}_{t=P}^{T}$ be generated by applying TIRSO (Procedure 2) with step size $\alpha_{t}=\alpha_{[m]}=\mathcal{O}(1/\sqrt{t_{0}2^{m-1}})$ in the window $t_{0}2^{m-1}<t\leq t_{0}2^{m}$, $m=1,2,\ldots$ Then, the regret of TIRSO under assumptions A1, A2, and A3 is

[TABLE]

where $B_{{\tilde{\bm{a}}}}\triangleq(1/\beta_{\tilde{\ell}})\big(B_{y}\sqrt{PN}+\sqrt{B_{y}^{2}PN+\beta_{\tilde{\ell}}B_{y}}\big)$.

Proof:

See Appendix D in the supplementary material. ∎

Theorem 3 has the same form as Theorem 2 with the exception of (79), where the constant term multiplying $\sqrt{T}$ differs from the one in (33). However, it can be readily shown that $L\leq PNB_{y}$, which implies that TIRSO also satisfies (33).

To sum up, both TISO and TIRSO behave asymptotically in the same fashion and provide, on average, the same performance as the hindsight solution of TISO, which coincides with the batch solution in (5). The difference between TISO and TIRSO is, therefore, in the non-asymptotic regime, where TIRSO can track changes in the estimated graph more swiftly than TISO. This is at the expense of a slight increase in the number of arithmetic operations and required memory. Note, however, that TIRSO offers an additional degree of freedom through the selection of the forgetting factor $\gamma$ . This enables the user to select the desired point in the trade-off between adaptability to changes and low variability in the estimates.

As demonstrated next, tighter regret bounds can be obtained when a diminishing step size sequence is adopted. Such sequences are of special interest when the VAR coefficients do not change over time. Even in this scenario, the application of online algorithms such as TISO or TIRSO is well-motivated when the number or dimension of the data vectors is prohibitively large to tackle with a batch algorithm.

Theorem 4.

Under assumptions A1, A2, and A3, let $\{{\tilde{\bm{a}}}_{n}[t]\}_{t=P}^{T}$ be generated by TIRSO (Procedure 2) with $\alpha_{t}=1/(\beta_{\tilde{\ell}}t)$. Then, the static regret of TIRSO satisfies

[TABLE]

where $G_{\tilde{\ell}}\triangleq(1+\kappa_{\bm{\Phi}})\sqrt{PN}B_{y}$ with $\kappa_{\bm{\Phi}}=L/\beta_{\tilde{\ell}}$ and $B_{{\tilde{\bm{a}}}}$ is defined in Theorem 3.

Proof:

See Appendix F in the supplementary material. ∎

Next, we analyze the performance of TIRSO in dynamic environments.

IV-C Dynamic Regret Analysis of TIRSO

In this section, the performance of TIRSO is analyzed in dynamic settings. Specifically, a dynamic regret bound is derived for TIRSO, and its steady-state tracking error in dynamic scenarios is also discussed.

To characterize the performance of TIRSO in dynamic setups, the dynamic regret is defined as:

[TABLE]

where ${\tilde{\bm{a}}}_{n}[t]$ is the TIRSO estimate and ${\tilde{\bm{a}}}_{n}^{\circ}[t]=\arg\min_{{\tilde{\bm{a}}}_{n}}\tilde{h}_{t}^{(n)}({\tilde{\bm{a}}}_{n})$ .

The dynamic regret in (36) compares the estimate ${\tilde{\bm{a}}}_{n}[t]$ with ${\tilde{\bm{a}}}_{n}^{\circ}[t]$ in terms of the metric $\tilde{h}_{t}^{(n)}(\cdot)$ . As opposed to ${\tilde{\bm{a}}}_{n}^{\circ}[t]$ , estimate ${\tilde{\bm{a}}}_{n}[t]$ does not “know” $\tilde{h}_{t}^{(n)}(\cdot)$ since ${\tilde{\bm{a}}}_{n}[t]$ is obtained from $\{\bm{y}[\tau]\}_{\tau<t}$ whereas $\tilde{h}_{t}^{(n)}(\cdot)$ depends on both $\{\bm{y}[\tau]\}_{\tau<t}$ and $\bm{y}[t]$ . This means that the dynamic regret captures the ability of an algorithm to attain small future residuals. Furthermore, note that comparing with ${\tilde{\bm{a}}}_{n}^{\circ}[t]$ is highly meaningful in the present case since, by definition, ${\tilde{\bm{a}}}_{n}^{\circ}[t]=\arg\min_{{\tilde{\bm{a}}}_{n}}\mu\sum_{\tau=P}^{t}\gamma^{t-\tau}\ell_{\tau}^{(n)}({\tilde{\bm{a}}}_{n})+\lambda\textstyle\sum_{\begin{subarray}{c}n^{\prime}=1,n^{\prime}\neq n\end{subarray}}^{N}\left\lVert{\tilde{\bm{a}}}_{n,n^{\prime}}\right\rVert_{2}$ , which therefore minimizes a version of the batch (5) or hindsight (29) objectives where the more recent residuals are weighted more heavily. Thus, ${\tilde{\bm{a}}}_{n}^{\circ}[t]$ constitutes a significant estimator in dynamic setups and therefore the dynamic regret also quantifies the ability of an estimator to track changes.

It can be easily shown that the static regret is upper-bounded by the dynamic regret. The dynamic regret in (36) would coincide with the static regret if ${\tilde{\bm{a}}}_{n}^{\circ}[t]$ were replaced with $\arg\min_{{\tilde{\bm{a}}}_{n}}\sum_{t=P}^{T}\tilde{h}_{t}^{(n)}({\tilde{\bm{a}}}_{n})$ . Attaining a low dynamic regret is therefore more challenging because the estimator under consideration is compared with a time-varying reference.

This implies that a sublinear dynamic regret may not be attained if this time-varying reference changes too rapidly, which generally occurs when the tracked parameters vary too quickly. For this reason, the dynamic regret is commonly upper-bounded in terms of the cumulative distance between two consecutive instantaneous optimal solutions, known as path length:

[TABLE]
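As a rough illustration of the path length, the sketch below sums the Euclidean distances between consecutive instantaneous minimizers. The exact summation range of the paper's definition is not reproduced here, so the indexing (all consecutive pairs of the supplied list) is an assumption, and the minimizers are hypothetical values.

```python
import math

def path_length(minimizers):
    """Cumulative Euclidean distance between consecutive instantaneous minimizers."""
    return sum(math.dist(cur, prev)
               for cur, prev in zip(minimizers[1:], minimizers[:-1]))

# Hypothetical sequence: one jump of length 5, then no further change.
W = path_length([[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]])
```

A reference sequence that stops moving contributes nothing further to the path length, whereas one that keeps drifting by a fixed amount makes it grow linearly with the number of samples.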

Next, we bound the dynamic regret of TIRSO.

Theorem 5.

Under assumptions A1, A2, and A3, let $\{{\tilde{\bm{a}}}_{n}[t]\}_{t=P}^{T}$ be generated by TIRSO (Procedure 2) with a constant step size $\alpha\in(0,1/L]$. If there exists $\sigma$ such that

[TABLE]

then the dynamic regret of TIRSO satisfies:

[TABLE]

where $\kappa_{\bm{\Phi}}\triangleq L/\beta_{\tilde{\ell}}$ .

Proof:

See Appendix G in the supplementary material. ∎

The derivation of the dynamic regret bound above relies on the strong convexity of (20), and thus cannot be done for TISO. The derivation of a different dynamic regret bound in [53] relies on the strong convexity of the quadratic term of an elastic-net regularizer, which is not necessary here. Several remarks about Theorem 5 are in order. If the path length $W^{(n)}[T]$ is sublinear in $T$, then the dynamic regret is also sublinear in $T$.

When the path length is not sublinear, the dynamic regret may not be sublinear, but we can still bound the steady-state error under certain conditions:

Theorem 6.

Under assumptions A1, A2, and A3, let $\{{\tilde{\bm{a}}}_{n}[t]\}_{t=P}^{T}$ be generated by TIRSO (Procedure 2) with a constant step size $\alpha\in(0,1/L]$. If there exists $\sigma$ such that (38) holds, then

[TABLE]

Proof:

Following similar arguments as in the proof of Theorem 5, (40) follows by applying [54, Lemma 4]. ∎

This theorem establishes that the steady-state error incurred by TIRSO with $\alpha\in(0,1/L]$ in dynamic scenarios eventually becomes bounded, which shows its tracking capability in time-varying environments. If $\alpha=1/L$ , then the upper bound on the steady-state error becomes $\sigma\kappa_{\bm{\Phi}}$ , where $\kappa_{\bm{\Phi}}\triangleq L/\beta_{\tilde{\ell}}$ is an upper bound on the condition number of $\bm{\Phi}[t],\,t\geq P$ . This clearly agrees with intuition. In practice, one may not know the value of $L$ and therefore selecting an $\alpha$ guaranteed to be in $(0,1/L]$ would not be possible. In those cases, it makes sense to compute a running approximation of $L$ given by $\hat{L}_{t}=\max_{P\leq\tau\leq t}\lambda_{\mathrm{max}}(\bm{\Phi}[\tau])$ and adopt the approximately constant step size $\alpha_{t}=c/\hat{L}_{t}$ , where $c\in(0,1]$ . However, in setups where the true VAR parameters change over time, the $\max$ operation may lead the algorithm to use an overly pessimistic approximation of $L$ . Thus, it may be preferable to directly adopt the adaptive step size $\alpha_{t}=c/\lambda_{\mathrm{max}}(\bm{\Phi}[t])$ , as analyzed in Sec. V.
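The two data-driven step sizes mentioned above can be sketched as follows. The largest eigenvalue is approximated here by power iteration with a seeded random start, which is a choice made for this illustration and not something prescribed in the text; the sketch also assumes $\bm{\Phi}[t]$ is symmetric positive definite, as under A2.

```python
import math
import random

def lambda_max(Phi, iters=200, seed=0):
    """Power-iteration estimate of the largest eigenvalue of a symmetric PSD matrix."""
    d = len(Phi)
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in x))
    x = [v / norm for v in x]
    lam = 0.0
    for _ in range(iters):
        y = [sum(Phi[i][j] * x[j] for j in range(d)) for i in range(d)]
        lam = math.sqrt(sum(v * v for v in y))
        if lam == 0.0:
            return 0.0
        x = [v / lam for v in y]
    return lam

def adaptive_step(Phi, c=0.25):
    """alpha_t = c / lambda_max(Phi[t]) with c in (0, 1]."""
    return c / lambda_max(Phi)

def running_max_step(Phi, L_hat_prev, c=0.25):
    """alpha_t = c / L_hat_t, where L_hat_t is the running maximum of lambda_max."""
    L_hat = max(L_hat_prev, lambda_max(Phi))
    return c / L_hat, L_hat

alpha = adaptive_step([[2.0, 0.0], [0.0, 1.0]])
```

The running-maximum variant never increases its step size again after a transient with large $\lambda_{\mathrm{max}}$, which illustrates why it can be overly pessimistic when the underlying process changes.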

Remark. None of the algorithms and analytical results in this paper require any probabilistic assumption or reference to probability theory, making our results fully compatible with the deterministic interpretation of the estimator at hand. This is because these results establish performance guarantees for the proposed online algorithms relative to the batch estimator or hindsight solutions. If one wished to obtain performance guarantees in terms of probabilistic metrics, such as consistency of the estimators, probabilistic assumptions would of course be required. For example, when $\lambda=0$, the batch estimator in (3) boils down to the ordinary least squares estimator, which is consistent if the VAR process is stable and the noise is standard white [22, Lemma 3.1]. When $\lambda>0$, consistency of (3) is discussed in [29]. Notably, consistency of the VAR coefficient estimates is not enough to ensure the correct identification of the true graph. Theorem 1 in [29] provides conditions that depend on the true VAR parameters and that guarantee that the graph is successfully recovered.

V Numerical Results and Analysis

Simulation tests for the proposed algorithms are performed on both synthetic and real data. All code will be made public at the authors’ websites.

The proposed algorithms are evaluated based on the performance metrics described next, where expectations are approximated by the Monte Carlo method. For synthetic-data experiments, the normalized mean square deviation

[TABLE]

measures the difference between the estimates $\{{\hat{\bm{a}}}_{n}[t]\}_{t}$ and the (possibly time-varying) true VAR coefficients $\{\bm{a}_{n}^{\text{true}}[t]\}_{t}$ . The ability to detect edges of the true VAR-causality graph is assessed using the probability of miss detection

[TABLE]

for a given threshold $\delta$ , which is the probability of not identifying an edge that actually exists, and the probability of false alarm

[TABLE]

which is the probability of detecting an edge that does not exist. Another relevant metric is the edge identification error rate (EIER), which measures how many edges are misidentified relative to the number of possible edges [55]:

[TABLE]
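For a single realization, the edge-detection metrics above can be sketched as below. The sketch assumes that an edge is declared whenever the norm of the estimated group exceeds $\delta$ and replaces expectations by empirical fractions; the exact expressions in the paper may differ, and all names and numbers are hypothetical. It also assumes that the true graph has at least one edge and one non-edge.

```python
def edge_metrics(true_adj, est_norms, delta):
    """Empirical (P_MD, P_FA, EIER) for one realization, self-loops excluded."""
    N = len(true_adj)
    missed = false_alarms = edges = 0
    for n in range(N):
        for m in range(N):
            if n == m:
                continue  # self-loops are not counted
            detected = est_norms[n][m] > delta
            if true_adj[n][m]:
                edges += 1
                missed += not detected
            else:
                false_alarms += detected
    possible = N * (N - 1)
    return (missed / edges,
            false_alarms / (possible - edges),
            (missed + false_alarms) / possible)

true_adj = [[1, 1, 0], [0, 1, 1], [0, 0, 1]]
est_norms = [[9.0, 0.5, 0.3], [0.0, 9.0, 0.05], [0.0, 0.0, 9.0]]
p_md, p_fa, eier = edge_metrics(true_adj, est_norms, delta=0.1)
```

Sweeping `delta` and recording the resulting pairs of false-alarm and miss-detection fractions would trace out an empirical ROC curve of the kind used in the experiments.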

Note that self-loops are excluded in these metrics. To quantify the forecasting performance, define recursively the $h$ -step ahead predictor given $\{\bm{y}[\tau]\}_{\tau\leq t}$ as:

[TABLE]

where $\{\mathbf{\hat{A}}_{p}[t]\}_{p=1}^{P}$ are the estimated VAR coefficients at time $t$ and $\hat{\bm{y}}[t+j|t]=\bm{y}[t+j]$ for $j\leq 0$ . The $h$ -step normalized mean square error is given by

[TABLE]
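The recursive $h$-step predictor can be sketched in Python as follows, assuming the usual VAR recursion in which each prediction is the sum over lags of the estimated coefficient matrices applied to the most recent (observed or predicted) vectors. The function name and the list-of-lists representation are assumptions for illustration, and the history must contain at least $P$ vectors.

```python
def predict(A_hat, history, h):
    """h-step-ahead VAR prediction.

    A_hat: list of P matrices (N x N), A_hat[p] being the lag-(p+1) coefficients.
    history: observed vectors ordered in time, most recent last (length >= P).
    """
    P = len(A_hat)
    buf = [list(v) for v in history[-P:]]
    for _ in range(h):
        N = len(buf[-1])
        # buf[-1 - p] holds the vector p+1 steps before the one being predicted.
        nxt = [sum(A_hat[p][i][j] * buf[-1 - p][j]
                   for p in range(P) for j in range(N))
               for i in range(N)]
        buf.append(nxt)
    return buf[-1]

# Scalar VAR(2) with unit coefficients reproduces the Fibonacci recursion.
y_hat = predict([[[1.0]], [[1.0]]], history=[[1.0], [1.0]], h=3)
```

For $h=0$ the function returns the last observed vector, matching the convention $\hat{\bm{y}}[t+j|t]=\bm{y}[t+j]$ for $j\leq 0$.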

The values of all parameters involved in the experiments are listed in the captions and legends of the figures.

V-A Synthetic Data Tests

Throughout this section, unless otherwise stated, the expectations in (41) to (44) are taken with respect to realizations of the graph, VAR parameters, and innovation process $\bm{u}[t]$ . Similarly, the step size is set to $\alpha_{t}=1/(4\lambda_{\max}(\bm{\Phi}[t]))$ ; see Sec. IV-C. The regularization parameter is selected to approximately minimize NMSD.

V-A1 Stationary VAR Processes

An Erdős-Rényi random graph is generated with edge probability $p_{e}$ and self-loop probability 1. This graph determines which entries of the matrices $\{\bm{A}_{p}\}_{p=1}^{P}$ are zero. The remaining entries are drawn i.i.d. from a standard normal distribution. Matrices $\{\bm{A}_{p}\}_{p=1}^{P}$ are scaled down afterwards by a constant that ensures that the VAR process is stable [22]. The innovation process samples are drawn independently as $\bm{u}[t]\sim\mathcal{N}(\bm{0},\sigma_{u}^{2}\bm{I}_{N})$.
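A data-generation step of this kind can be sketched as follows. The scaling rule is an assumption: the sketch enforces the sufficient stability condition $\sum_{p}\lVert\bm{A}_{p}\rVert_{\infty}<1$ through a hypothetical target `rho`, which need not be the constant used by the authors, and it draws each directed off-diagonal edge independently.

```python
import random

def generate_var(N, P, p_e, rho=0.9, seed=0):
    """Random sparse VAR coefficients scaled to satisfy a sufficient stability bound."""
    rng = random.Random(seed)
    # Directed Erdos-Renyi support with self-loops always present.
    adj = [[n == m or rng.random() < p_e for m in range(N)] for n in range(N)]
    A = [[[rng.gauss(0.0, 1.0) if adj[n][m] else 0.0 for m in range(N)]
          for n in range(N)] for _ in range(P)]
    # Sum over lags of the induced infinity norm (maximum absolute row sum).
    total = sum(max(sum(abs(x) for x in row) for row in Ap) for Ap in A)
    scale = rho / total
    A = [[[scale * x for x in row] for row in Ap] for Ap in A]
    return adj, A

adj, A = generate_var(N=4, P=2, p_e=0.3)
```

The norm-based bound is conservative; a scaling based on the spectral radius of the companion matrix would allow larger coefficients while still keeping the process stable.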

The first experiment analyzes TISO and TIRSO in a stationary setting. Figs. 2(a) and 2(b) depict the NMSD and $\text{NMSE}_{1}$ for three different values of $\lambda$ . As a benchmark, Fig. 2(b) includes the $\text{NMSE}_{1}$ of the genie-aided predictor, obtained from (43) after replacing $\mathbf{\hat{A}}_{p}$ with $\bm{A}_{p}$ . It is observed that $\lambda=10^{-6}$ yields a better NMSD and $\text{NMSE}_{1}$ than lower and higher values of $\lambda$ . This corroborates the importance of promoting sparse solutions, as done in TISO and TIRSO. Furthermore, as expected, TIRSO generally converges faster than TISO. Fig. 2(c) shows the receiver operating characteristic (ROC) curve, composed of pairs $(\text{P}_{\text{FA}},\text{P}_{\text{MD}})$ for different values of the threshold $\delta$ . The values of these pairs are obtained by respectively averaging $\text{P}_{\text{FA}}[t]$ and $\text{P}_{\text{MD}}[t]$ over time in the interval $[T_{1},T_{2}]$ . Remarkably, both TISO and TIRSO can simultaneously attain $\text{P}_{\text{FA}}$ and $\text{P}_{\text{MD}}$ below 10%. This ability to satisfactorily detect edges is further investigated in Figs. 2(d-f), where $\delta$ is set for each algorithm so that $\text{P}_{\text{FA}}[t]$ and $\text{P}_{\text{MD}}[t]$ have the same average over the time interval $[T_{1},T_{2}]$ .

Fig. 3 analyzes different step size sequences. Because the true VAR parameters remain constant, the diminishing sequence yields the best performance; see Theorem 4. In addition, Fig. 4 compares TISO and TIRSO with two benchmarks, namely online subgradient descent (OSGD) and proximal gradient descent (PGD). The former obtains a minimizer for (5) in an online fashion (labeled as OSGDTISO since it uses the same information as TISO at each iteration). The latter approximates ${\tilde{\bm{a}}}_{n}^{\circ}[t]=\arg\min_{{\tilde{\bm{a}}}_{n}}\tilde{h}_{t}^{(n)}({\tilde{\bm{a}}}_{n})$ by running the (batch) PGD algorithm for $K_{\text{PGD}}$ iterations over $\tilde{h}_{t}^{(n)}({\tilde{\bm{a}}}_{n})$ (labeled as PGDTIRSO since it uses the same information as TIRSO at each iteration). Fig. 4 shows that TISO outperforms OSGDTISO in terms of NMSD, and that TIRSO eventually attains a better NMSD level than PGDTIRSO. Note that the computational complexity of PGDTIRSO is significantly larger than that of TIRSO. Although the NMSD of TISO in Fig. 4 is close to that of OSGD, a more in-depth study reveals that the former yields sparse iterates without any thresholding; moreover, TIRSO offers a significantly improved edge-detection performance (EIER); see Fig. 5. Fig. 6 compares the true (left) and recovered (right) graphs via TIRSO and TISO by thresholding the average of the estimated VAR coefficients across the intervals $[kT/3,(k+1)T/3]$, $k=0,1,2$. The threshold $\delta$ is selected to detect $p_{e}(N^{2}-N)$ edges. Note that this is displayed for a single graph and realization of the VAR process; in other words, this is not a Monte Carlo experiment. It is observed that both TIRSO and TISO can identify the true graph quite accurately and approximate the true VAR coefficients soon afterwards.

V-A2 Non-stationary VAR Processes

The next experiment analyzes TISO and TIRSO when $\bm{y}[t]$ is a (non-stationary) smooth-transition VAR process [56, Ch. 18] $\bm{y}[t]=\sum_{p=1}^{P}\big(\bm{A}_{p}+s_{f}[t](\bm{B}_{p}-\bm{A}_{p})\big)\bm{y}[t-p]+\bm{u}[t].$ The signal $s_{f}[t]$ determines the transition profile from a VAR model with parameters $\{\bm{A}_{p}\}_{p}$ to a VAR model with parameters $\{\bm{B}_{p}\}_{p}$. In this experiment, $s_{f}[t]=1-\exp(-\kappa([t-T_{B}]_{+})^{2}),$ where $\kappa>0$ controls the transition speed and $T_{B}$ denotes the instant at which the transition starts. Over an Erdős-Rényi random graph, $\{\bm{A}_{p}\}$ and $\{\bm{B}_{p}\}$ are generated independently as in Sec. V-A1. It is easy to show that the coefficients $\bm{A}_{p}+s_{f}[t](\bm{B}_{p}-\bm{A}_{p})$ yield a stable VAR process for all $t$.
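The transition profile and the resulting time-varying coefficients can be sketched as follows; the function names are illustrative and not taken from the paper.

```python
import numpy as np

def transition_profile(t, kappa, T_B):
    """s_f[t] = 1 - exp(-kappa * ([t - T_B]_+)^2): zero up to T_B, then rising towards 1."""
    return 1.0 - np.exp(-kappa * np.maximum(t - T_B, 0.0) ** 2)

def time_varying_coefficients(A, B, s):
    """Coefficients A_p + s * (B_p - A_p) for a given value s of the profile."""
    return A + s * (B - A)
```

For instance, with $\kappa=10^{-3}$ and $T_{B}=50$, the profile stays at zero for $t\leq 50$ and exceeds $0.9$ roughly 48 samples later, since $1-\exp(-10^{-3}\cdot 48^{2})\approx 0.90$.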

Figs. 7(a) and 7(b) illustrate the influence of the forgetting factor $\gamma$, which is of critical importance in non-stationary setups. TISO and TIRSO are seen to satisfactorily estimate and track the model coefficients. As intuition predicts, the lower $\gamma$ is, the more rapidly TIRSO can adapt to changes; long after the transition, however, a higher $\gamma$ is preferred.
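The trade-off governed by $\gamma$ can be made concrete through the exponential window. The sketch below assumes that past samples enter the TIRSO objective with weights $(1-\gamma)\gamma^{t-\tau}$, in line with the relation $\mu=1-\gamma$ used in Appendix A; the names and the notion of effective memory $1/(1-\gamma)$ are illustrative.

```python
import numpy as np

def exponential_weights(gamma, t):
    """Weights (1 - gamma) * gamma**(t - tau) for tau = 0, ..., t (newest sample last)."""
    ages = np.arange(t, -1, -1)
    return (1.0 - gamma) * gamma ** ages

def effective_memory(gamma):
    """Approximate number of samples that dominate the window."""
    return 1.0 / (1.0 - gamma)
```

Under this assumption, $\gamma=0.9$ corresponds to a memory of about 10 samples, whereas $\gamma=0.99$ corresponds to about 100: the former adapts faster, the latter averages more noise once the model has stopped changing.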

Finally, to demonstrate that TISO and TIRSO successfully leverage sparsity to track time-varying topologies, Fig. 8 illustrates an approximately optimal point in the trade-off of selecting $\lambda$ .

V-B Real-Data Tests

The real data is taken from Lundin’s offshore oil and gas (O&G) platform Edvard-Grieg (https://www.lundin-petroleum.com/operations/production/norway-edvard-grieg). Each node corresponds to a temperature, pressure, or oil-level sensor placed in the decantation system that separates oil, gas, and water. The measured time series are physically coupled due to the pipelines connecting the system parts and due to the control systems. Hence, causal relations among time series are expected. Topology identification is motivated to forecast the short-term future state of the system and to unveil dependencies that cannot be detected by simple inspection. All time series are resampled to a common set of equally-spaced sampling instants using linear interpolation. Since the data was quantized and compressed using a lossy scheme, a significant amount of noise is expected. Each time series is normalized to have zero mean and unit sample standard deviation.

Here, the step size is set to $\alpha_{t}=1/\lambda_{\max}(\bm{\Phi}[t])$ and the NMSE is defined as $\text{NMSE}_{h}=\frac{\sum_{t}\lVert\bm{y}[t+h]-\hat{\bm{y}}[t+h|t]\rVert_{2}^{2}}{\sum_{t}\lVert\bm{y}[t+h]\rVert_{2}^{2}}.$
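A minimal sketch of how this metric could be evaluated is given next. It assumes that $\hat{\bm{y}}[t+h|t]$ is obtained by iterating the estimated VAR recursion $h$ times with the innovations set to zero; the function names and array layout are illustrative.

```python
import numpy as np

def predict_h_steps(A_hat, past, h):
    """h-step-ahead prediction; A_hat has shape (P, N, N), past holds the last P samples (newest last)."""
    P = A_hat.shape[0]
    buffer = [row for row in past[-P:]]
    for _ in range(h):
        # A_hat[p] plays the role of A_{p+1}; buffer[-1 - p] is the sample p steps back.
        buffer.append(sum(A_hat[p] @ buffer[-1 - p] for p in range(P)))
    return buffer[-1]

def nmse(y_true, y_pred):
    """Ratio of accumulated squared prediction error to accumulated signal energy."""
    return np.sum((y_true - y_pred) ** 2) / np.sum(y_true ** 2)
```

With this normalization, the trivial all-zero predictor attains an NMSE of 1, which gives a reference level for the values reported below.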

Fig. 9 shows the $\text{NMSE}_{h}$ vs. the prediction horizon $h$ for the time series in the data set. The temperature, pressure, and oil level time series are respectively denoted by T, P, and L and an identifying index. As expected, the prediction error increases with $h$ . The NMSE ranges from $10^{-4}$ to $1$ due to the different predictability of each time series.

Fig. 10 presents the graph obtained by thresholding the average coefficient estimates over a three-hour duration. The threshold is such that the number of reported edges is $4N$ . Self-loops are omitted for clarity, and arrow colors encode edge weights. It is observed that most identified edges connect sensors within each subsystem.
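Selecting the threshold so that a prescribed number of edges is reported amounts to keeping the strongest off-diagonal entries. The sketch below assumes that edge strength is measured by the $\ell_2$ norm of the averaged coefficients across lags; this choice and the function name are illustrative assumptions.

```python
import numpy as np

def top_k_edges(A_avg, k):
    """Boolean adjacency with the k strongest off-diagonal edges; A_avg has shape (P, N, N)."""
    N = A_avg.shape[1]
    strength = np.linalg.norm(A_avg, axis=0)      # one strength value per ordered node pair
    np.fill_diagonal(strength, -np.inf)           # self-loops are omitted
    strongest = np.argsort(strength, axis=None)[::-1][:k]
    adjacency = np.zeros((N, N), dtype=bool)
    adjacency[np.unravel_index(strongest, (N, N))] = True
    return adjacency
```

For the graph in Fig. 10, one would call this with `k = 4 * N`.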

VI Conclusions

Two online algorithms were proposed for identifying and tracking VAR-causality graphs from time series. These algorithms sequentially accommodate data and refine their sparse topology estimates accordingly. The proposed algorithms offer complementary benefits: whereas TISO is computationally simpler, TIRSO showcases improved tracking behavior. Performance is assessed theoretically and empirically. Asymptotic equivalence of the hindsight solutions of the proposed algorithms is established and sublinear regret bounds are derived. Experiments with synthetic and real data validate the conclusions of the theoretical analysis. Future directions include explicitly modeling the variations in the VAR coefficients, possibly along the lines of [57, 58, 59], as well as identifying topologies whose adjacency matrix has a low-rank plus sparse structure along the lines of [60] to account for clusters.

Appendix A Proof of Theorem 1

The first step is to rewrite (30) so as to obtain a simple expression for $C_{T}(\bm{a}_{n})-\tilde{C}_{T}(\bm{a}_{n})$. To this end, substitute (19) into (30) and exchange the order of the summations to obtain

[TABLE]

where $\theta_{\tau,T}\triangleq\sum_{t=\tau}^{T-1}\gamma^{t-\tau}$ . From the geometric series summation formula, which establishes that $\theta_{\tau,T}=({1-\gamma^{T-\tau}})/({1-\gamma})$ , and noting that $\mu=1-\gamma$ , the above equation becomes

[TABLE]

From (28) and the equation above, the difference $d_{T}(\bm{a}_{n})\triangleq C_{T}(\bm{a}_{n})-\tilde{C}_{T}(\bm{a}_{n})$ between the TISO and TIRSO hindsight objectives is given by:

[TABLE]

To prove part 1, it suffices to show that $d_{T}(\bm{a}_{n})\rightarrow 0$ as $T\rightarrow\infty$ for all $\bm{a}_{n}$ . To this end, expand $\ell_{t}^{(n)}(\bm{a}_{n})$

[TABLE]

and apply the Cauchy-Schwarz inequality to obtain

[TABLE]

On the other hand, the hypothesis $|y_{n}[t]|^{2}\leq B_{y}~\forall\,n,t$ implies that $\lVert\bm{y}[t]\rVert_{2}^{2}\leq NB_{y}$, and hence

[TABLE]

Substituting the upper bound of $\lVert\bm{g}[t]\rVert_{2}^{2}$ into (47) yields

[TABLE]

Applying the latter bound to (45) results in

[TABLE]

Taking the limit of the right-hand side clearly yields

[TABLE]

Noting from (45) that $d_{T}(\bm{a}_{n})\geq 0$ , it follows that $\lim_{T\rightarrow\infty}d_{T}(\bm{a}_{n})=0$ , which concludes the proof of part 1.

To prove part 2, note from (45) that $d_{T}(\bm{a}_{n})\geq 0$ , which in turn implies that

[TABLE]

for all $\bm{a}_{n}$ and $T>P$ . On the other hand, it follows from (29) that

[TABLE]

Thus, by combining (51) and (52),

[TABLE]

Similarly, from (27), it holds that $C_{T}(\bm{a}_{n}^{*}[T])\leq C_{T}({\tilde{\bm{a}}}_{n}^{*}[T])$. Subtracting $\tilde{C}_{T}({\tilde{\bm{a}}}_{n}^{*}[T])$ from both sides of the latter inequality yields

[TABLE]

By combining (53) and (54), it holds that

[TABLE]

Since $\lim_{T\rightarrow\infty}d_{T}({\tilde{\bm{a}}}_{n}^{*}[T])=0$ , (55) implies that

[TABLE]

Finally, to establish part 3, note that it follows from assumption A2, (20), and (30) that $\tilde{C}_{T}$ is $\tilde{\beta}$-strongly convex for some $\tilde{\beta}>0$ and all $T$. Thus, from (29), one finds that

[TABLE]

By combining (51) and (57), it follows that

[TABLE]

or, equivalently,

[TABLE]

Taking limits gives rise to

[TABLE]

From (56) and the sandwich theorem applied to (60), we have

[TABLE]

which concludes the proof.

Appendix B Proof of Theorem 2

Consider first the regret of TISO with constant step size.

Lemma 1.

Let $\{\bm{a}_{n}[t]\}_{t=P}^{T}$ be generated by TISO (Procedure 1) with constant step size $\alpha_{t}=\alpha=\mathcal{O}\big(1/\sqrt{T}\big)$. Under assumptions A1 and A26, we have

[TABLE]

Proof:

See Appendix C. ∎

Observe that the step size in Lemma 1 depends on $T$ and therefore (62) cannot be interpreted as directly establishing sublinear regret for TISO. To understand this result, consider a sequence of copies of TISO, each one for a value of $T$. Each copy has a (potentially) different step size, but uses the same step size for all $t$. Expression (62) bounds the regret of the $T$-th copy at time $T$. However, Lemma 1 can be used next to establish sublinear regret for step size sequences that remain constant over windows of exponentially increasing length; see the doubling trick [46].

To this end, let the regret in the window $[t_{1},t_{2}]$ be

[TABLE]

where $\{\bm{a}_{n}[t]\}_{t}\subset\mathbb{R}^{NP}$ is an arbitrary sequence and

[TABLE]

The next result establishes a bound on the static regret given the regret at each window.

Lemma 2.

For $T=t_{0}2^{M}$ and for an arbitrary sequence $\{\bm{a}_{n}[t]\}_{t}\subset\mathbb{R}^{NP}$, the regret in (31) is bounded as:

[TABLE]

Proof:

For $T=t_{0}2^{M}$ , expression (31) can be written as:

[TABLE]

On the other hand, it follows from (63) that (65) is equivalent to

[TABLE]

The inequality in (67) can also be rewritten as

[TABLE]

By comparing (66) and (68), proving (65) is equivalent to showing that

[TABLE]

From the definitions of $\bm{a}_{n}^{*}[T]$ in (27) and $\bm{a}_{n}^{*}[t_{1},t_{2}]$ in (64), the above inequality holds since $\inf_{\bm{x},\bm{y}}f(\bm{x},\bm{y})\leq\inf_{\bm{x}=\bm{y}}f(\bm{x},\bm{y})$ . ∎

The next step is to bound the regret at each window using Lemma 1. To this end, one must set $\alpha_{[m]}=\mathcal{O}(1/\sqrt{T_{m}})$, where $T_{m}\triangleq t_{0}2^{m}-t_{0}2^{m-1}=t_{0}2^{m-1}$ is the length of the $(m+1)$-th window, $m=1,\ldots,M$. Invoking Lemma 1, the regret for the $(m+1)$-th window is given by $R_{s}^{(n)}[t_{0}2^{m-1}+1,t_{0}2^{m}]=\mathcal{O}(PNB_{y}B_{\bm{a}}^{2}\sqrt{2^{m-1}})$. By Lemma 2, the regret of TISO becomes

[TABLE]

which concludes the proof.
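The arithmetic behind the doubling trick can be illustrated with a short sketch. The schedule below keeps the step size constant within each window and scales it as the inverse square root of the window length; the constant `c` and the function names are hypothetical. Summing the per-window terms $\sqrt{T_{m}}$ gives a geometric series in $\sqrt{2}$, which is at most $\sqrt{T}/(\sqrt{2}-1)$ and hence $\mathcal{O}(\sqrt{T})$.

```python
import math

def doubling_schedule(t0, M, c=1.0):
    """Windows (start, end, step_size) of lengths t0 * 2**(m - 1), for m = 1, ..., M."""
    schedule = []
    for m in range(1, M + 1):
        start, end = t0 * 2 ** (m - 1) + 1, t0 * 2 ** m
        length = end - start + 1                  # equals t0 * 2**(m - 1)
        schedule.append((start, end, c / math.sqrt(length)))
    return schedule

def summed_window_bound(t0, M):
    """Sum over windows of sqrt(window length), the factor the per-window regrets scale with."""
    return sum(math.sqrt(t0 * 2 ** (m - 1)) for m in range(1, M + 1))
```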

Appendix C Proof of Lemma 1

First, we present a lemma establishing that the hindsight solution of TISO is bounded; then, we present the proof of Lemma 1.

Lemma 3.

Under assumptions A1, A2, and A26, the hindsight solution of TISO $\bm{a}_{n}^{*}[T]$ given in (27) is bounded as

[TABLE]

Proof:

Note that $\bm{a}_{n}^{*}[T]$ belongs to the sublevel set of the TISO hindsight objective for $\bm{a}_{n}=\bm{0}_{NP}$, given by

[TABLE]

where $C_{T}(\bm{0}_{NP})$ is upper bounded by

[TABLE]

This means that we can write:

[TABLE]

Next, we find a lower bound on $C_{T}(\bm{a}_{n}^{*}[T])$ that is an increasing function of $\lVert\bm{a}_{n}^{*}[T]\rVert_{2}$, as follows:

[TABLE]

Therefore,

[TABLE]

Further, we can write

[TABLE]

with $B_{\bm{a}}\triangleq\frac{1}{\beta}\big(B_{y}\sqrt{PN}+\sqrt{B_{y}^{2}PN+\beta B_{y}}\big)$. Expression (75) implies that the TISO hindsight solution is bounded. ∎

Now, we present the proof of Lemma 1. This proof is based on the idea that if the inequality $\lVert\nabla\ell_{t}^{(n)}(\bm{a}_{n})\rVert_{2}^{2}\leq 2PNB_{y}\,\ell_{t}^{(n)}(\bm{a}_{n}),\forall\,t,n$ holds and the strong convexity parameter of $\psi$ is 1, then it follows from [42, Corollary 5] that:

[TABLE]

where $B_{\bm{a}}$ is defined in (70). We still need to show that the inequality $\lVert\nabla\ell_{t}^{(n)}(\bm{a}_{n})\rVert_{2}^{2}\leq 2PNB_{y}\,\ell_{t}^{(n)}(\bm{a}_{n})$, $\forall\,t,n$, holds. To this end, note from (14) that:

[TABLE]

On the other hand, the hypothesis $|y_{n}[t]|^{2}\leq B_{y}~\forall\,n,t$ implies that $\lVert\bm{y}[t]\rVert_{2}^{2}\leq NB_{y}$ and, therefore:

[TABLE]

Combining (76) and (77) yields

[TABLE]

Thus, to satisfy

[TABLE]

it suffices to set $\rho=2PNB_{y}$.
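The inequality $\lVert\nabla\ell_{t}^{(n)}(\bm{a}_{n})\rVert_{2}^{2}\leq 2PNB_{y}\,\ell_{t}^{(n)}(\bm{a}_{n})$ can be checked numerically. The sketch below assumes that the instantaneous loss in (14) has the least-squares form $\frac{1}{2}(y_{n}[t]-\bm{g}^{\top}[t]\bm{a}_{n})^{2}$ with $\bm{g}[t]\in\mathbb{R}^{NP}$ collecting the $P$ previous samples; this form and the function names are assumptions, since (14) is not reproduced here.

```python
import numpy as np

def loss_and_gradient(a, g, y):
    """Assumed instantaneous loss 0.5 * (y - g^T a)^2 and its gradient with respect to a."""
    e = y - g @ a
    return 0.5 * e ** 2, -e * g

def max_self_bounding_ratio(N, P, B_y, trials, rng):
    """Largest observed ||grad||^2 / (2 P N B_y * loss); it should never exceed 1."""
    worst = 0.0
    bound = np.sqrt(B_y)                          # samples satisfy |y|^2 <= B_y
    for _ in range(trials):
        g = rng.uniform(-bound, bound, size=N * P)
        y = rng.uniform(-bound, bound)
        a = rng.standard_normal(N * P)
        loss, grad = loss_and_gradient(a, g, y)
        if loss > 0:
            worst = max(worst, grad @ grad / (2 * P * N * B_y * loss))
    return worst
```

Under the assumed loss, the ratio equals $\lVert\bm{g}[t]\rVert_{2}^{2}/(PNB_{y})$, which is at most one because every entry of $\bm{g}[t]$ is bounded in magnitude by $\sqrt{B_{y}}$.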

Appendix D Proof of Theorem 3

The first step is to obtain a bound for constant step size.

Lemma 4.

Let $\{{\tilde{\bm{a}}}_{n}[t]\}_{t=P}^{T}$ be generated by TIRSO (Procedure 2) with constant step size $\alpha_{t}=\alpha=\mathcal{O}\big(1/\sqrt{T}\big)$. Under assumptions A1, A2, and A3, we have

[TABLE]

Proof:

See Appendix E. ∎

The rest of the proof proceeds along the lines of the proof of Theorem 2.

Appendix E Proof of Lemma 4

First, we present a lemma that establishes that the hindsight solution of TIRSO is bounded. Then, we present the proof of Lemma 4.

Lemma 5.

Under assumptions A1 and A2, the hindsight solution of TIRSO ${\tilde{\bm{a}}}_{n}^{*}[T]$ given in (29) is bounded as

[TABLE]

Proof:

The proof follows similar steps to those of Lemma 3. Consider the sublevel set of the TIRSO hindsight objective for ${\tilde{\bm{a}}}_{n}^{*}[T]=\bm{0}_{NP}$,

[TABLE]

where $\tilde{C}_{T}(\bm{0}_{NP})$ is upper bounded as follows:

[TABLE]

This implies that

[TABLE]

Next, we find a lower bound on $\tilde{C}_{T}({\tilde{\bm{a}}}_{n}^{*}[T])$ that is an increasing function of $\lVert{\tilde{\bm{a}}}_{n}^{*}[T]\rVert_{2}$, as follows:

[TABLE]

Therefore,

[TABLE]

Further, we can write

[TABLE]

Expression (86) implies that the TIRSO hindsight solution is bounded. ∎

Now, we present the proof of Lemma 4. The proof has two parts. The first part is to prove that there exists $\tilde{\rho}>0$ such that

[TABLE]

holds for all $\bm{a}_{n}$ . The second step is to apply the result of [42, Corollary 5] in the present case. To prove the first part, from (20) and $\nabla\tilde{\ell}_{t}^{(n)}(\bm{a}_{n})=\bm{\Phi}[t]\bm{a}_{n}-\bm{r}_{n}[t]$ , it follows that (87) is equivalent to

[TABLE]

By expanding the left-hand side of (88), rearranging terms, and introducing $Z_{t}(\bm{a}_{n})$ as

[TABLE]

the condition in (87) is equivalent to $Z_{t}(\bm{a}_{n})\geq 0$ . So the goal becomes finding $\tilde{\rho}$ such that $Z_{t}(\bm{a}_{n})\geq 0$ for all $\bm{a}_{n}$ and $t$ . For this condition to hold, it is necessary that (a) $\inf_{\bm{a}_{n}}Z_{t}(\bm{a}_{n})$ is finite for all $t$ , and (b) $\inf_{\bm{a}_{n}}Z_{t}(\bm{a}_{n})\geq 0$ for all $t$ . It can be seen [61, Appendix A.5] that condition (a) holds iff (a1) the Hessian matrix $\bm{H}Z_{t}(\bm{a}_{n})={\tilde{\rho}}\bm{\Phi}[t]-2\bm{\Phi}^{\top}[t]\bm{\Phi}[t]$ is positive semidefinite, and (a2) $2\bm{\Phi}[t]\bm{r}_{n}[t]-\tilde{\rho}\bm{r}_{n}[t]\in\mathcal{R}(\bm{H}Z_{t}(\bm{a}_{n}))$ , where $\mathcal{R}(\bm{A})$ denotes the span of the columns of a matrix $\bm{A}$ . The first step is to find $\tilde{\rho}$ such that (a1) holds. To this end, consider the eigenvalue decomposition of $\bm{\Phi}[t]=\bm{U}\bm{\Lambda}\bm{U}^{\top}$ , where the index $t$ is omitted to simplify notation. Therefore,

[TABLE]

Let $\lambda_{\textrm{max}}(\bm{\Phi}[t])$ denote the maximum eigenvalue of $\bm{\Phi}[t]$ . It follows from (90) that $\bm{H}Z_{t}(\bm{a}_{n})$ is positive semidefinite if

[TABLE]

It remains to be shown that there exists $\tilde{\rho}>0$ such that (91), (a2), and (b) simultaneously hold. To this end, focus first on (a2), which can be rewritten as

[TABLE]

Clearly, if $\tilde{\rho}>2\lambda_{\textrm{max}}(\bm{\Phi}[t])$ , then ${\tilde{\rho}}\bm{I}-2\bm{\Phi}[t]$ is invertible and, hence, $\mathcal{R}(\bm{\Phi}[t]({\tilde{\rho}}\bm{I}-2\bm{\Phi}[t]))=\mathcal{R}(\bm{\Phi}[t])$ [62, Ch. 4]. Thus, (92) holds if $2\bm{\Phi}[t]\bm{r}_{n}[t]\in\mathcal{R}(\bm{\Phi}[t])$ and $\tilde{\rho}\bm{r}_{n}[t]\in\mathcal{R}(\bm{\Phi}[t])$ . The former condition is trivial. To verify the latter, define

[TABLE]

and $\bm{B}\triangleq\bm{G}\bm{\Gamma}^{1/2}$ ; note that $\bm{\Phi}[t]=\bm{G}\bm{\Gamma}\bm{G}^{\top}=\bm{B}\bm{B}^{\top}$ . It follows that $\bm{r}_{n}[t]=\bm{G}\bm{\Gamma}\bm{y}_{n}=\bm{B}\bm{\Gamma}^{1/2}\bm{y}_{n}\in\mathcal{R}(\bm{B})=\mathcal{R}(\bm{B}\bm{B}^{\top})=\mathcal{R}(\bm{\Phi}[t])$ . Therefore, $\tilde{\rho}\bm{r}_{n}[t]\in\mathcal{R}(\bm{\Phi}[t])$ holds and, consequently, (a2) holds whenever $\tilde{\rho}>2\lambda_{\textrm{max}}(\bm{\Phi}[t])$ .

So far, this proof has established that, if $\tilde{\rho}>2\lambda_{\textrm{max}}(\bm{\Phi}[t])$ , then both (a1) and (a2) hold. The next step is to show that (b) also holds when $\tilde{\rho}>2\lambda_{\textrm{max}}(\bm{\Phi}[t])$ . To this end, set the gradient of $Z_{t}(\bm{a}_{n})$ equal to zero and use $\tilde{\rho}>2\lambda_{\textrm{max}}(\bm{\Phi}[t])$ to obtain $\bm{\Phi}^{\dagger}[t]\bm{r}_{n}[t]\in\underset{\bm{a}_{n}}{\arg\min}~{}Z_{t}(\bm{a}_{n})$ , where the symbol $\dagger$ denotes pseudo-inverse. From this expression and (E), it follows that

[TABLE]

Applying the properties of the pseudoinverse and simplifying results in

[TABLE]

From this expression, note that the condition $\underset{\bm{a}_{n}}{\inf}~{}Z_{t}(\bm{a}_{n})\geq 0$ is equivalent to

[TABLE]

and, upon defining $\tilde{\bm{y}}_{n}\triangleq\bm{\Gamma}^{1/2}\bm{y}_{n}$ ,

[TABLE]

This inequality trivially holds when $\tilde{\bm{y}}_{n}=\bm{0}_{t-P+1}$ . Thus, assume without loss of generality that $\tilde{\bm{y}}_{n}\neq\bm{0}_{t-P+1}$ . By setting $\bm{A}\triangleq\bm{\Gamma}^{1/2}\bm{G}^{\top}(\bm{G}\bm{\Gamma}\bm{G}^{\top})^{\dagger}$ , one obtains $\bm{A}\bm{B}=\bm{\Gamma}^{1/2}\bm{G}^{\top}(\bm{G}\bm{\Gamma}\bm{G}^{\top})^{\dagger}\bm{G}\bm{\Gamma}^{1/2}$ and $\bm{B}\bm{A}=\bm{\Phi}[t]\bm{\Phi}^{\dagger}[t]$ .

Since the nonzero eigenvalues of $\bm{A}\bm{B}$ and $\bm{B}\bm{A}$ are the same [63, Sec. 3.2.11] and the maximum eigenvalue of $\bm{B}\bm{A}$ is 1, then the maximum eigenvalue of $\bm{A}\bm{B}$ is also 1. Therefore

[TABLE]

and, hence, (97) holds. To sum up, conditions (a) and (b) hold if $\tilde{\rho}>2\lambda_{\textrm{max}}(\bm{\Phi}[t])$. In other words, (87) holds for any choice of $\tilde{\rho}$ such that $\tilde{\rho}>2\lambda_{\textrm{max}}(\bm{\Phi}[t])$ for all $t$. This completes the first part of the proof. The second part of the proof consists of setting $\tilde{\rho}=2\sup_{t}\lambda_{\textrm{max}}(\bm{\Phi}[t])+\epsilon$ with $\epsilon>0$ an arbitrary constant, and invoking [42, Corollary 5] to conclude that

[TABLE]

Using assumption A3 and substituting the upper bound on $\lVert{\tilde{\bm{a}}}_{n}^{*}[T]\rVert_{2}$ from (80) into the above expression completes the proof.
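The first part of this proof, namely that $\lVert\nabla\tilde{\ell}_{t}^{(n)}(\bm{a}_{n})\rVert_{2}^{2}\leq\tilde{\rho}\,\tilde{\ell}_{t}^{(n)}(\bm{a}_{n})$ for any $\tilde{\rho}$ exceeding $2\lambda_{\textrm{max}}(\bm{\Phi}[t])$, can be checked numerically. The sketch below assumes that the loss in (20) equals the weighted least-squares cost $\frac{1}{2}\bm{a}_{n}^{\top}\bm{\Phi}[t]\bm{a}_{n}-\bm{r}_{n}^{\top}[t]\bm{a}_{n}+\frac{1}{2}\bm{y}_{n}^{\top}\bm{\Gamma}\bm{y}_{n}$ with $\bm{\Phi}[t]=\bm{G}\bm{\Gamma}\bm{G}^{\top}$ and $\bm{r}_{n}[t]=\bm{G}\bm{\Gamma}\bm{y}_{n}$; the function names are illustrative.

```python
import numpy as np

def tirso_loss_terms(G, weights, y):
    """Phi = G Gamma G^T, r = G Gamma y, and the constant 0.5 * y^T Gamma y (Gamma diagonal)."""
    Phi = (G * weights) @ G.T                     # scales column k of G by weights[k]
    r = G @ (weights * y)
    const = 0.5 * np.sum(weights * y ** 2)
    return Phi, r, const

def self_bounding_gap(a, Phi, r, const, rho):
    """Z(a) = rho * loss(a) - ||grad(a)||^2, which is nonnegative when the bound holds."""
    loss = 0.5 * a @ Phi @ a - r @ a + const
    grad = Phi @ a - r
    return rho * loss - grad @ grad
```

Choosing fewer columns in `G` than rows makes $\bm{\Phi}[t]$ singular, which exercises the range condition (a2) handled through the pseudo-inverse in the proof.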

Appendix F Proof of Theorem 4

To prove Theorem 4, we first present several auxiliary lemmas. Before presenting the result related to the logarithmic regret of TIRSO, it is worth mentioning that a related result is presented in [42, Th. 7], which is applicable to strongly convex regularization functions. Note that in TIRSO, the data-fitting function is strongly convex. It can be easily shown that COMID applied to a problem with a strongly convex regularizer produces different iterates than COMID applied to a strongly convex data-fitting function.

Lemma 6.

Under assumption A2, let the sequence $\{{\tilde{\bm{a}}}_{n}[t]\}_{t=P}^{T}$ be generated by TIRSO (Procedure 2) with a step size $\alpha_{t}$, and let ${\tilde{\bm{a}}}_{n}^{*}[T]$ be the hindsight solution for TIRSO at time $T$ defined in (29). Then

[TABLE]

for $P\leq t\leq T$ , $\forall~{}\bm{g}_{t}^{\tilde{\ell}}\in\partial(\tilde{\ell}_{t}^{(n)}({\tilde{\bm{a}}}_{n}[t]))$ .

Proof:

For a strongly convex $\tilde{\ell}_{t}^{(n)}$ , by the subgradient inequality, we have

[TABLE]

$\forall\bm{g}_{t}^{\tilde{\ell}}\in\partial(\tilde{\ell}_{t}^{(n)}({\tilde{\bm{a}}}_{n}[t]))$ . On the other hand, since ${\Omega}^{(n)}$ is convex,

[TABLE]

$\forall~{}\bm{g}_{t+1}^{\Omega}\in\partial({\Omega}^{(n)}({\tilde{\bm{a}}}_{n}[t+1]))$ . Adding (100) and (101), scaling by $\alpha_{t}$ , and rearranging terms,

[TABLE]

where (a) results from adding and subtracting the term ${\tilde{\bm{a}}}_{n}^{\top}[t+1]\bm{g}_{t}^{\tilde{\ell}}+({\tilde{\bm{a}}}_{n}^{*}[T]-{\tilde{\bm{a}}}_{n}[t+1])^{\top}({\tilde{\bm{a}}}_{n}[t]-{\tilde{\bm{a}}}_{n}[t+1])$ followed by rearranging terms; in (b) the inequality $({\tilde{\bm{a}}}_{n}^{*}[T]-{\tilde{\bm{a}}}_{n}[t+1])^{\top}({\tilde{\bm{a}}}_{n}[t]-{\tilde{\bm{a}}}_{n}[t+1]-\alpha_{t}\bm{g}_{t}^{\tilde{\ell}}-\alpha_{t}\bm{g}_{t+1}^{\Omega})\leq 0$ is used, which is implied by the optimality of ${\tilde{\bm{a}}}_{n}[t+1]$ in (23), i.e., $(\bm{a}_{n}-{\tilde{\bm{a}}}_{n}[t+1])^{\top}(\tilde{\nabla}\tilde{J}_{t}^{(n)}({\tilde{\bm{a}}}_{n}[t+1]))\geq 0,\forall\,\bm{a}_{n}$ ; in (c) the Pythagorean theorem for Euclidean distance (i.e. $({\tilde{\bm{a}}}_{n}^{*}[T]-{\tilde{\bm{a}}}_{n}[t+1])^{\top}({\tilde{\bm{a}}}_{n}[t+1]-{\tilde{\bm{a}}}_{n}[t])=1/2\lVert{\tilde{\bm{a}}}_{n}^{*}[T]-{\tilde{\bm{a}}}_{n}[t]\rVert_{2}^{2}-1/2\lVert{\tilde{\bm{a}}}_{n}^{*}[T]-{\tilde{\bm{a}}}_{n}[t+1]\rVert_{2}^{2}-1/2\lVert{\tilde{\bm{a}}}_{n}[t+1]-{\tilde{\bm{a}}}_{n}[t]\rVert_{2}^{2}$ ) is used; in (d) the inequality $\langle\bm{x},\bm{y}\rangle\leq 1/2(\|\bm{x}\rVert_{2}^{2}+\lVert\bm{y}\rVert_{2}^{2})$ is used. Dividing both sides of (102) by $\alpha_{t}$ completes the proof. ∎

Next, we establish that the TIRSO estimates ${\tilde{\bm{a}}}_{n}[t]$ are bounded and derive a bound on $\lVert\nabla\tilde{\ell}_{t}^{(n)}({\tilde{\bm{a}}}_{n}[t])\rVert_{2}$ that depends on the parameters of the algorithm.

Lemma 7.

Under assumptions A1 and A2, let the sequence of iterates $\{{\tilde{\bm{a}}}_{n}[t]\}$ be generated by TIRSO (Procedure 2). Then

[TABLE]

Proof:

From the update expression of TIRSO, we have

[TABLE]

Now, we derive an upper bound on $\lVert\bm{r}_{n}[t]\rVert_{2}$ . By the definition of $\bm{r}_{n}[t]$ in (21b) and assumption A1, we have

[TABLE]

Substituting the upper bound of $\bm{r}_{n}[t]$ from (105b) into (104) completes the proof. ∎

Lemma 8.

Under assumptions A1, A2, and A3, let the sequence of iterates $\{{\tilde{\bm{a}}}_{n}[t]\}$ be generated by TIRSO (Procedure 2) with $\alpha_{t}=1/(\beta_{\tilde{\ell}}t)$. Then

[TABLE]

Proof:

Invoking Lemma 7 and setting $\alpha_{t}=1/(\beta_{\tilde{\ell}}t)$ in (103),

[TABLE]

Substituting the upper bound of $\lVert{\tilde{\bm{a}}}_{n}[t-1]\rVert_{2}$ using (108a), we have

[TABLE]

After $k$ substitutions, the above bound can be written in terms of $k$ as follows

[TABLE]

$1\leq k\leq t-P+1$ . The bound on $\lVert{\tilde{\bm{a}}}_{n}[t+1]\rVert_{2}$ in terms of the initial estimate $\lVert{\tilde{\bm{a}}}_{n}[P]\rVert_{2}$ is obtained for $k=t-P+1$ in the above inequality, given by

[TABLE]

This completes the proof of (106), the first part of the lemma. To prove the second part of the lemma, take the value of the gradient in (22) and apply the triangle inequality:

[TABLE]

∎

Now, we are ready to prove Theorem 4. We start from the result presented in Lemma 6. Summing both sides of (99) from $t=P$ to $T$ results in

[TABLE]

where the inequality in (a) results from ignoring the term $1/(2\alpha_{T})\lVert{\tilde{\bm{a}}}_{n}^{*}[T]-{\tilde{\bm{a}}}_{n}[T+1]\rVert_{2}^{2}$ and combining similar terms. To relate the l.h.s. of (111) and the static regret in this case, consider the definition of the static regret for TIRSO in (32)

[TABLE]

Adding and subtracting the term ${\Omega}^{(n)}({\tilde{\bm{a}}}_{n}[T+1])$ to the r.h.s. of (113) and rearranging terms results in

[TABLE]

where ${\Omega}^{(n)}({\tilde{\bm{a}}}_{n}[P])=0$ and ${\Omega}^{(n)}({\tilde{\bm{a}}}_{n}[T+1])\geq 0$ are used in the above inequality. Observe that the r.h.s. of the above inequality coincides with the l.h.s. of (111). Therefore, from (112) and (114), we have

[TABLE]

Setting $\alpha_{t}=1/(\beta_{\tilde{\ell}}t)$ in the above inequality yields

[TABLE]

where in (a) the bound on the gradient given in (107) is used; in (b) the inequality $\sum_{t=1}^{T}1/t\leq\log(T)+1$ and the fact that ${\tilde{\bm{a}}}_{n}[P]=\bm{0}_{NP}$ are used; and (c) is obtained by using the bound from (80).

Appendix G Proof of Theorem 5

We derive the dynamic regret of TIRSO. To this end, since $\tilde{h}_{t}$ is convex, we have by definition

[TABLE]

$\forall\,{\tilde{\bm{a}}}_{n}^{\circ}[t],{\tilde{\bm{a}}}_{n}[t]$ , where $\tilde{\nabla}\tilde{h}_{t}^{(n)}({\tilde{\bm{a}}}_{n}[t])=\nabla\tilde{\ell}_{t}^{(n)}({\tilde{\bm{a}}}_{n}[t])+\bm{u}_{t}$ with $\bm{u}_{t}\in\partial\Omega^{(n)}({\tilde{\bm{a}}}_{n}[t])$ . Rearranging (115) and summing both sides of the inequality from $t=P$ to $T$ results in:

[TABLE]

By applying the Cauchy–Schwarz inequality on each term of the summation in the r.h.s. of the above inequality, we obtain

[TABLE]

The next step is to derive an upper bound on $\lVert\tilde{\nabla}\tilde{h}_{t}^{(n)}({\tilde{\bm{a}}}_{n}[t])\rVert_{2}$. From the definition of $\tilde{\nabla}\tilde{h}_{t}^{(n)}({\tilde{\bm{a}}}_{n}[t])$ and by the triangle inequality, we have

[TABLE]

To bound $\lVert\nabla\tilde{\ell}_{t}^{(n)}({\tilde{\bm{a}}}_{n}[t])\rVert_{2}$, we invoke Lemma 7 and set $\alpha_{t}=\alpha$ to obtain

[TABLE]

where $\delta\triangleq 1-\alpha\beta_{\tilde{\ell}}$ . Observe that for $0<\alpha\leq 1/L$ , we have $0<\delta<1$ . Substituting (118b) recursively, we obtain

[TABLE]

where $1\leq k\leq t-P+1$ . For $k=t-P+1$ , the above inequality becomes

[TABLE]

which implies that $\lVert\nabla\tilde{\ell}_{t}^{(n)}({\tilde{\bm{a}}}_{n}[t])\rVert_{2}\leq(1+L/\beta_{\tilde{\ell}})\sqrt{PN}B_{y}$, as in the proof of Lemma 8, by following the same arguments as in (110a). Next, we need to find an upper bound on $\lVert\bm{u}_{t}\rVert_{2}$ in (117). To this end, we apply the result in [46, Lemma 2.6] to ${\Omega}^{(n)}$, which establishes that all the subgradients of ${\Omega}^{(n)}$ are bounded by its Lipschitz continuity parameter $L_{\Omega^{(n)}}$. In the following, we show that $L_{\Omega^{(n)}}=\lambda\sqrt{N}$. Lipschitz continuity of ${\Omega}^{(n)}$ means that there exists $L_{{\Omega}^{(n)}}$ such that

[TABLE]

for all $\bm{a},\bm{b}$ . By definition, we have ${\Omega}^{(n)}(\bm{x}_{n})=\lambda\sum_{\begin{subarray}{c}n^{\prime}=1,n^{\prime}\neq n\end{subarray}}^{N}\left\lVert\bm{x}_{n,n^{\prime}}\right\rVert_{2}$ with $\bm{x}_{n}=[\bm{x}_{n,1}^{\top},...,\bm{x}_{n,N}^{\top}]^{\top},\bm{x}_{n,n^{\prime}}\in\mathbb{R}^{P},n^{\prime}=1,...,N$ . Let $\bm{z}_{n}=[\bm{z}_{n,1}^{\top},...,\bm{z}_{n,N}^{\top}]^{\top},\bm{z}_{n,n^{\prime}}\in\mathbb{R}^{P},n^{\prime}=1,...,N$ and by taking the l.h.s. of (121), we have

[TABLE]

where the inequality in (122a) holds due to the triangle inequality for scalars (treating the differences $\lVert\bm{x}_{n,n^{\prime}}\rVert_{2}-\lVert\bm{z}_{n,n^{\prime}}\rVert_{2}$ as scalars); (122b) holds due to the reverse triangle inequality (given by $\lvert\lVert\bm{x}_{1}\rVert_{2}-\lVert\bm{x}_{2}\rVert_{2}\rvert\leq\lVert\bm{x}_{1}-\bm{x}_{2}\rVert_{2}$); and (122c) follows from the inequality $\lVert\bm{b}\rVert_{1}\leq\sqrt{N}\lVert\bm{b}\rVert_{2}$ with $\bm{b}\in\mathbb{R}^{N}$ [64, Sec. 2.2.2]. The inequality in (122c) implies that (121) is satisfied with $L_{{\Omega}^{(n)}}=\lambda\sqrt{N}$, i.e., ${\Omega}^{(n)}$ is $\lambda\sqrt{N}$-Lipschitz continuous. Thus, we have $\lVert\tilde{\nabla}\tilde{h}_{t}^{(n)}({\tilde{\bm{a}}}_{n}[t])\rVert_{2}\leq(1+L/\beta_{\tilde{\ell}})\sqrt{PN}B_{y}+\lambda\sqrt{N}$. Substituting this bound in (116) leads to:

[TABLE]
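The Lipschitz constant $\lambda\sqrt{N}$ of the regularizer can be checked numerically with the following sketch, which uses the definition of ${\Omega}^{(n)}$ stated above with the vector split into $N$ consecutive blocks of length $P$. The function names are illustrative.

```python
import numpy as np

def group_penalty(x, n, N, P, lam):
    """lam * sum over n' != n of ||x_{n,n'}||_2, with x split into N blocks of length P."""
    norms = np.linalg.norm(x.reshape(N, P), axis=1)
    return lam * (norms.sum() - norms[n])         # the self-loop block n is not penalized

def lipschitz_ratio(x, z, n, N, P, lam):
    """|Omega(x) - Omega(z)| / ||x - z||_2, to be compared against lam * sqrt(N)."""
    gap = abs(group_penalty(x, n, N, P, lam) - group_penalty(z, n, N, P, lam))
    return gap / np.linalg.norm(x - z)
```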

Next, we show that TIRSO with a constant step size can alternatively be derived by applying online proximal gradient descent to minimize $\tilde{\ell}_{t}^{(n)}+{\Omega}^{(n)}$. With $\tilde{\ell}_{t}^{(n)}$ given by (19) and $\Omega^{(n)}$ given by (11b), applying the online proximal gradient algorithm with a constant step size $\alpha$ yields:

[TABLE]

where the proximal operator of a function $\Psi$ at point $\bm{v}$ is defined by [65]:

[TABLE]

The parameter $\eta$ controls the trade-off between minimizing $\Psi(\cdot)$ and being close to $\bm{v}$ . According to the definition in Sec. III-B, ${\tilde{\bm{a}}}_{n}^{\text{f}}[t]\triangleq{\tilde{\bm{a}}}_{n}[t]-\alpha\nabla\tilde{\ell}_{t}^{(n)}({\tilde{\bm{a}}}_{n}[t])$ , and ${\tilde{\bm{a}}}^{\text{f}}_{n}[t]=[({\tilde{\bm{a}}}^{\text{f}}_{n,1}[t])^{\top},\ldots,({\tilde{\bm{a}}}^{\text{f}}_{n,N}[t])^{\top}]^{\top}$ , which enables us to write the above update expression as

[TABLE]

Observe that the above problem is separable and the solution to the $n^{\prime}$ -th problem is given by:

[TABLE]

which is the same as (25) with a constant step size $\alpha$. Therefore, TIRSO can be equivalently derived by applying the online proximal gradient descent method. Next, we apply Lemma 2 in [54] in order to bound $\sum_{t=P}^{T}\lVert{\tilde{\bm{a}}}_{n}[t]-{\tilde{\bm{a}}}_{n}^{\circ}[t]\rVert_{2}$ in (G). The hypotheses of Lemma 2 are Lipschitz smoothness of $\tilde{\ell}_{t}^{(n)}$, Lipschitz continuity of ${\Omega}^{(n)}$, and strong convexity of $\tilde{\ell}_{t}^{(n)}$. Lipschitz continuity of ${\Omega}^{(n)}$ is proved in (122c), whereas strong convexity of $\tilde{\ell}_{t}^{(n)}$ is implied by assumption A2. It thus remains to verify that ${\tilde{\ell}_{t}^{(n)}}$ is Lipschitz-smooth, which means that there is $L^{\prime}$ such that

[TABLE]

for all $\bm{a},\bm{b}$ . To this end, taking the l.h.s. of (127) and substituting the value of the gradient of ${\tilde{\ell}_{t}^{(n)}}$ from (22) results in:

[TABLE]

where $\lambda_{\mathrm{max}}(\cdot)$ denotes the maximum eigenvalue of the input matrix. Due to assumption A3, the inequality in (127) holds with $L^{\prime}=L$. To apply Lemma 2 in [54], one can set $K$ in [54] as $T-P+1$, $g_{k}$ as $\Omega^{(n)}$, and $f_{k}$ as $\tilde{\ell}_{P+k-1}^{(n)}$; it then follows that $\bm{x}_{k}$ in [54] equals ${\tilde{\bm{a}}}_{n}[P+k-1]$ and $\bm{x}_{k}^{\circ}$ equals ${\tilde{\bm{a}}}_{n}^{\circ}[P+k-1]$. Then, since we have already shown above that the hypotheses of Lemma 2 in [54] hold in our case, applying it to bound $\lVert{\tilde{\bm{a}}}_{n}[t]-{\tilde{\bm{a}}}_{n}^{\circ}[t]\rVert_{2}$ in (G) yields:

[TABLE]

Noting that ${\tilde{\bm{a}}}_{n}[P]=\bm{0}_{NP}$ concludes the proof.
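The equivalence used in this appendix, namely that a constant-step TIRSO iteration is a gradient step on $\tilde{\ell}_{t}^{(n)}$ followed by the proximal operator of ${\Omega}^{(n)}$, can be sketched as follows. The closed form below is the standard group soft-thresholding operator, which solves the separable proximal problem block by block. The function names and the block ordering are illustrative assumptions, and the gradient is taken as $\bm{\Phi}[t]\bm{a}_{n}-\bm{r}_{n}[t]$ as stated in Appendix E.

```python
import numpy as np

def group_soft_threshold(v, n, N, P, thresh):
    """Proximal operator of the group penalty: shrink every block except the self-loop block n."""
    blocks = v.reshape(N, P).copy()
    for k in range(N):
        if k == n:
            continue                              # self-loop block is not penalized
        norm = np.linalg.norm(blocks[k])
        blocks[k] *= max(0.0, 1.0 - thresh / norm) if norm > 0 else 0.0
    return blocks.ravel()

def prox_objective(u, v, n, N, P, lam, alpha):
    """Objective minimized by the proximal step: Omega(u) + ||u - v||^2 / (2 * alpha)."""
    norms = np.linalg.norm(u.reshape(N, P), axis=1)
    return lam * (norms.sum() - norms[n]) + np.sum((u - v) ** 2) / (2 * alpha)

def proximal_gradient_step(a, Phi, r, n, N, P, alpha, lam):
    """One constant-step iteration: forward gradient step, then group soft-thresholding."""
    a_forward = a - alpha * (Phi @ a - r)
    return group_soft_threshold(a_forward, n, N, P, alpha * lam)
```

Blocks whose norm after the gradient step does not exceed $\alpha\lambda$ are set exactly to zero, which is why the iterates are sparse without any additional thresholding.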

Bibliography


[1] B. Zaman, L. M. López-Ramos, D. Romero, and B. Beferull-Lozano, “Online topology estimation for vector autoregressive processes in data networks,” in Proc. IEEE Int. Workshop Comput. Advan. Multi-Sensor Adapt. Process., Curaçao, Dutch Antilles, Dec. 2017.
[2] E. D. Kolaczyk, Statistical Analysis of Network Data: Methods and Models, Springer, New York, 2009.
[3] E. Isufi, A. Loukas, N. Perraudin, and G. Leus, “Forecasting time series with VARMA recursions on graphs,” arXiv preprint arXiv:1810.08581, 2018.
[4] P. Di Lorenzo, S. Barbarossa, P. Banelli, and S. Sardellitti, “Adaptive least mean squares estimation of graph signals,” IEEE Trans. Signal Info. Process. Netw., vol. 2, no. 4, pp. 555–568, Dec. 2016.
[5] C. Liu, S. Ghosal, Z. Jiang, and S. Sarkar, “An unsupervised spatiotemporal graphical modeling approach to anomaly detection in distributed CPS,” in ACM/IEEE Int. Conf. Cyber-Physical Syst., Apr. 2016, pp. 1–10.
[6] Y. Shen, P. A. Traganitis, and G. B. Giannakis, “Nonlinear dimensionality reduction on graphs,” in Proc. IEEE Int. Workshop Comput. Advan. Multi-Sensor Adapt. Process., Curaçao, Dutch Antilles, Dec. 2017.
[7] C. M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics, Springer, 2006.
[8] G. Mateos, S. Segarra, A. G. Marques, and A. Ribeiro, “Connecting the dots: Identifying network structure via graph signal processing,” arXiv preprint arXiv:1810.13066, 2018.
