Topology Inference over Networks with Nonlinear Coupling

Augusto Santos; Vincenzo Matta; and Ali H. Sayed

arXiv:1906.09029·cs.MA·June 24, 2019

Topology Inference over Networks with Nonlinear Coupling

Augusto Santos, Vincenzo Matta, and Ali H. Sayed

PDF

Open Access

TL;DR

This paper addresses the challenge of inferring network topology in discrete-time nonlinear stochastic systems by establishing conditions for consistent graph learning, specifically for logistic-type dynamical systems.

Contribution

It introduces new theoretical conditions enabling accurate topology inference in nonlinear stochastic networks, expanding beyond linear models.

Findings

01

Conditions for consistent graph inference in nonlinear systems

02

Application to logistic-type dynamical systems

03

Theoretical framework for topology recovery

Abstract

This work examines the problem of topology inference over discrete-time nonlinear stochastic networked dynamical systems. The goal is to recover the underlying digraph linking the network agents, from observations of their state-evolution. The dynamical law governing the state-evolution of the interacting agents might be nonlinear, i.e., the next state of an agent can depend nonlinearly on its current state and on the states of its immediate neighbors. We establish sufficient conditions that allow consistent graph learning over a special class of networked systems, namely, logistic-type dynamical systems.

Equations211

y_{i, n + 1} = σ_{i} (g_{i} (y_{i, n}) j = 1 \sum N a_{ij} h_{j} (y_{j, n}) + x_{i, n + 1})

y_{i, n + 1} = σ_{i} (g_{i} (y_{i, n}) j = 1 \sum N a_{ij} h_{j} (y_{j, n}) + x_{i, n + 1})

F_{0} (n) ≜ E [F_{0} (y_{n})], F_{1} (n) ≜ E [F_{1} (y_{n}, y_{n + 1})] .

F_{0} (n) ≜ E [F_{0} (y_{n})], F_{1} (n) ≜ E [F_{1} (y_{n}, y_{n + 1})] .

A = F_{1} (n) [F_{0} (n)]^{- 1} [Our nonlinear case]

A = F_{1} (n) [F_{0} (n)]^{- 1} [Our nonlinear case]

y_{i, n + 1} = j = 1 \sum N a_{ij} y_{j, n} + x_{i, n + 1} .

y_{i, n + 1} = j = 1 \sum N a_{ij} y_{j, n} + x_{i, n + 1} .

A = R_{1} (n) [R_{0} (n)]^{- 1} [Classic linear case]

A = R_{1} (n) [R_{0} (n)]^{- 1} [Classic linear case]

\frac{1}{n} k = 0 \sum n - 1 F_{0} (y_{k}) ⟶ a.s. F_{0}, \frac{1}{n} k = 0 \sum n - 1 F_{1} (y_{k}, y_{k + 1}) ⟶ a.s. F_{1}

\frac{1}{n} k = 0 \sum n - 1 F_{0} (y_{k}) ⟶ a.s. F_{0}, \frac{1}{n} k = 0 \sum n - 1 F_{1} (y_{k}, y_{k + 1}) ⟶ a.s. F_{1}

A = F_{1} F_{0}^{- 1}

A = F_{1} F_{0}^{- 1}

∥ M ∥ = v \neq = 0 sup \frac{∥ M v ∥}{∥ v ∥} .

∥ M ∥ = v \neq = 0 sup \frac{∥ M v ∥}{∥ v ∥} .

y_{n} = [y_{1, n}, y_{2, n}, \dots, y_{N, n}]^{⊤} .

y_{n} = [y_{1, n}, y_{2, n}, \dots, y_{N, n}]^{⊤} .

y_{i, n + 1} = T_{i} (y_{n}, x_{n + 1}),

y_{i, n + 1} = T_{i} (y_{n}, x_{n + 1}),

y_{n + 1} = T (y_{n}, x_{n + 1}),

y_{n + 1} = T (y_{n}, x_{n + 1}),

y_{n + 1} = σ (g (y_{n}) ⊙ A h (y_{n}) + x_{n + 1})

y_{n + 1} = σ (g (y_{n}) ⊙ A h (y_{n}) + x_{n + 1})

σ (y)

σ (y)

g (y)

h (y)

z_{n + 1} = g (z_{n}) ⊙ A h (z_{n}) + x_{n + 1},

z_{n + 1} = g (z_{n}) ⊙ A h (z_{n}) + x_{n + 1},

g (y) = g (σ (y)), h (y) = h (σ (y)) .

g (y) = g (σ (y)), h (y) = h (σ (y)) .

ω (y) ≜ [\frac{1}{g _{1} ( y _{1} )}, \frac{1}{g _{2} ( y _{2} )}, \dots, \frac{1}{g _{N} ( y _{N} )}]^{⊤} .

ω (y) ≜ [\frac{1}{g _{1} ( y _{1} )}, \frac{1}{g _{2} ( y _{2} )}, \dots, \frac{1}{g _{N} ( y _{N} )}]^{⊤} .

Z = {y \in R^{N} : g (y) has at least one zero entry} .

Z = {y \in R^{N} : g (y) has at least one zero entry} .

F_{0} (y_{n}) ≜ h (y_{n}) h (y_{n})^{⊤},

F_{0} (y_{n}) ≜ h (y_{n}) h (y_{n})^{⊤},

F_{1} (y_{n + 1}, y_{n}) ≜ [ω (y_{n}) ⊙ σ^{- 1} (y_{n + 1})] h (y_{n})^{⊤} I (y_{n} \in / Z),

F_{1} (y_{n + 1}, y_{n}) ≜ [ω (y_{n}) ⊙ σ^{- 1} (y_{n + 1})] h (y_{n})^{⊤} I (y_{n} \in / Z),

F_{0} (n) = E [F_{0} (y_{n})], E [ω (y_{n}) h (y_{n})^{⊤}]

F_{0} (n) = E [F_{0} (y_{n})], E [ω (y_{n}) h (y_{n})^{⊤}]

F_{1} (n) ≜ E [F_{1} (y_{n + 1}, y_{n})]

F_{1} (n) ≜ E [F_{1} (y_{n + 1}, y_{n})]

F_{1} (n) = A F_{0} (n)

F_{1} (n) = A F_{0} (n)

A = F_{1} (n) [F_{0} (n)]^{- 1}

A = F_{1} (n) [F_{0} (n)]^{- 1}

ω (y_{n}) ⊙ σ^{- 1} (y_{n + 1}) I [y_{n} \in / Z]

ω (y_{n}) ⊙ σ^{- 1} (y_{n + 1}) I [y_{n} \in / Z]

F_{1} (y_{n + 1}, y_{n})

F_{1} (y_{n + 1}, y_{n})

F_{1} (n) = A F_{0} (n) + E [x_{n + 1} ⊙ ω (y_{n}) h (y_{n})^{⊤}] .

F_{1} (n) = A F_{0} (n) + E [x_{n + 1} ⊙ ω (y_{n}) h (y_{n})^{⊤}] .

E [x_{n + 1} ⊙ ω (y_{n}) h (y_{n})^{⊤}]

E [x_{n + 1} ⊙ ω (y_{n}) h (y_{n})^{⊤}]

B argmin E [ω (y_{n}) ⊙ σ^{- 1} (y_{n + 1}) - B h (y_{n})_{2}^{2}] .

B argmin E [ω (y_{n}) ⊙ σ^{- 1} (y_{n + 1}) - B h (y_{n})_{2}^{2}] .

F_{0} (n) ≜ \frac{1}{n} k = 0 \sum n - 1 F_{0} (y_{k}), F_{1} (n) ≜ \frac{1}{n} k = 0 \sum n - 1 F_{1} (y_{k}, y_{k + 1})

F_{0} (n) ≜ \frac{1}{n} k = 0 \sum n - 1 F_{0} (y_{k}), F_{1} (n) ≜ \frac{1}{n} k = 0 \sum n - 1 F_{1} (y_{k}, y_{k + 1})

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsGene Regulatory Network Analysis · Markov Chains and Monte Carlo Methods · Neural dynamics and brain function

Full text

Topology Inference over Networks with Nonlinear Coupling

Augusto Santos*⋆* , Vincenzo Matta†, and Ali H. Sayed*⋆* ⋆ A. Santos (email: [email protected]) and A. H. Sayed (email: [email protected]) are with the École Polytechnique Fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland. The work of A. H. Sayed was also supported in part by US NSF grant CCF-1524250 and Swiss NSF (SNSF) grant 205121_184999. $\dagger$ V. Matta is with DIEM, University of Salerno, via Giovanni Paolo II, I-84084, Fisciano (SA), Italy (email: [email protected]).

Abstract

This work examines the problem of topology inference over discrete-time nonlinear stochastic networked dynamical systems. The goal is to recover the underlying digraph linking the network agents, from observations of their state-evolution. The dynamical law governing the state-evolution of the interacting agents might be nonlinear, i.e., the next state of an agent can depend nonlinearly on its current state and on the states of its immediate neighbors. We establish sufficient conditions that allow consistent graph learning over a special class of networked systems, namely, logistic-type dynamical systems.

Index Terms:

Topology inference, causal inference, graph learning, structure estimation, nonlinear stochastic networked dynamical systems.

I Introduction

T he evolution of a networked dynamical system is determined by the local interactions among its neighboring agents. Graph or structure learning refers to the problem of estimating the underlying graph from observations collected at the agents. This is a challenging inverse problem, which would allow us to understand more fully the evolution of systems arising across several application domains including, e.g., epidemics [1], social networks [2], and brain activity [3]. Figure 1 provides an illustration of the topology inference problem with reference to the last application.

Very often, the dynamics of real-world phenomena is governed by highly nonlinear and possibly random forms of interaction. For this reason, one useful class of graph learning problems concerns the case of nonlinear stochastic dynamical systems. In this article, we focus on the following class of logistic-type systems, which can model several forms of nonlinear coupling between locally-interacting (i.e., neighboring) agents:

[TABLE]

In the above formulation, we use an index $n=0,1,\ldots$ to denote the $n$ -th time epoch, and an index $i=1,2,\ldots N$ to denote the $i$ -th agent of the network. The relevant quantities in (1) are as follows.

— The random variable $\bm{y}_{i,n}$ denotes the state of agent $i$ at time $n$ .

— The random variable $\bm{x}_{i,n}$ represents an input source of randomness or noise, affecting agent $i$ at time $n$ .

— A local nonlinear coupling between agents $i$ and $j$ at time $n$ is determined by the product $g_{i}(\bm{y}_{i,n})h_{j}(\bm{y}_{j,n})$ .

— We will be dealing with weighted directed graphs, where: $i)$ the graph accounts for the topology linking pairs of nodes; $ii)$ the interaction between pairs of nodes can be directional, e.g., a coupling effect can exist from $i$ to $j$ , but not from $j$ to $i$ ; and $iii)$ the graph weights are encoded in a combination or interaction matrix $A$ , whose $(i,j)$ -th entry, $a_{ij}$ , quantifies the strength of interaction in the directional coupling from $i$ to $j$ . We observe from (1) that, if $a_{ij}=0$ , then information about the state of agent $j$ does not flow to agent $i$ . Otherwise, if $a_{ij}\neq 0$ , then the state of agent $j$ directly impacts the state of agent $i$ , provided that the coupling functions $g$ and $h$ do not vanish.

— $\sigma_{i}:\mathbb{R}\rightarrow\mathbb{R}$ is a nonlinear invertible function that plays the role of an activation function (such as a sigmoidoscopy function). Depending on its particular shape, the function $\sigma_{i}$ can emphasize or de-emphasize its input values. At two extremes, a constant $\sigma_{i}$ kills the dynamics, whereas a linear $\sigma_{i}$ is basically transparent to its input.

Formulations of the type shown in (1) are often used to model interactions over nonlinear oscillators [4, 5], or population dynamics and epidemics over networks [6, 7, 8, 9, 10, 11, 12]. For example, in the context of epidemics over networks, the state ${\bm{y}}_{i,n}$ can model the likelihood of a node $i$ being infected at time $n$ at the individual/microscopical level [13]; at the aggregate/macroscopical level, the state can also represent the fraction of infected individuals within community $i$ in an epidemics across communities, as can be established through a thermodynamic or fluid-limit analysis — see [14, 15, 16, 17]. Likewise, in a general SIR (Susceptible Infected Recovered) formulation, the functions $g_{i}$ and $h_{j}$ in (1) can represent the so-called incidence rate of the infection, whereas the function $\sigma_{i}$ is simply chosen as the identity function [11, 12]. Within the aforementioned frameworks, a natural question is whether the underlying network of interactions can be inferred given the evolution of the infection across nodes or communities.

II Main Results

Our main goal is to answer the following identifiability question: Is it possible to learn the network graph over this class of models? We will see that the answer is in the affirmative.

To answer, we start by adopting a classic nonlinear regression approach to construct two nonlinear functions (see (20) and (21) further ahead for details) that depend on the node measurements. More specifically, stacking the observations $\bm{y}_{i,n}$ of all agents at time $n$ into the $N\times 1$ vector $\bm{y}_{n}$ , we will introduce a zero-lag function, $F_{0}(\bm{y}_{n})$ , which depends only on the current observation vector $\bm{y}_{n}$ , and a one-lag function, $F_{1}(\bm{y}_{n},\bm{y}_{n+1})$ , which depends on the interaction between observation vectors corresponding to adjacent time epochs. The expected values of these functions will play an important role:

[TABLE]

Under suitable conditions on the various nonlinearities $(\sigma,g,h)$ , the combination matrix $A$ admits the following closed-form representation:

[TABLE]

We shall call this relationship generalized Granger estimator for the following reason. In the special case where $\sigma_{i}(y)=y$ , $g_{i}(y)\equiv 1$ , and $h_{j}(y)=y$ , the model in (1) degenerates into the classical linear model (a.k.a. first-order vector autoregressive model):

[TABLE]

In this particular case, a well-known representation for $A$ is given by the one-step linear predictor, sometimes called Granger estimator in the context of causal analysis:

[TABLE]

where $R_{0}(n)=\mathbb{E}[\bm{y}_{n}\bm{y}_{n}^{\top}]$ and $R_{1}(n)=\mathbb{E}[\bm{y}_{n+1}\bm{y}_{n}^{\top}]$ are the zero and one-lag correlation matrices of the samples at time $n$ . The functions $\mathcal{F}_{0}(n)$ and $\mathcal{F}_{1}(n)$ depend on the nonlinear functions that characterize the system in (1), in a way that highlights the role that these nonlinearities play on topology estimation, as we will show in Sec. V

The relationship in (3) actually suggests a strategy to estimate the topology. However, we see from (3) that $A$ depends upon expected values, which in turn depend on the knowledge of the distribution of the data. This leads to a circular argument since the distribution would require knowledge of the matrix $A$ . Accordingly, since only a sample path (i.e., one realization of the process) is observed, in order to show that (3) can be useful, we will prove that the aforementioned expected values can be consistently learned from the samples. To this aim, it will be critical to establish that the considered dynamical system possesses the following ergodic property (the symbol $\stackrel{{\scriptstyle\textnormal{a.s.}}}{{\longrightarrow}}$ denotes almost-sure convergence as $n\rightarrow\infty$ ):

[TABLE]

with:

[TABLE]

The main contributions of this work are as follows. We first answer an identifiability question, namely, we establish whether graph learning is possible over the considered class of logistic nonlinear dynamical systems. To this aim, we exploit a closed-form relationship existing between the interaction matrix $A$ and a pair of functionals of the samples that arise from a solution of a nonlinear regression problem. Through this closed-form representation, we characterize the various nonlinearities and attributes that determine the dynamics in (1), in order to ascertain under which conditions the graph can be learned consistently. This characterization is obtained through a set of transparent sufficient conditions, which allows relating the graph identifiability to the physical evolution of the dynamics. We note that the method and theory are consistent for arbitrary topologies (whether directed or undirected, and dense or sparse).

Second, to show that the graph is consistently learned as the number of samples grows, we will prove that the consistency result in (6) holds true, through a limiting characterization that builds on powerful results available for Markov chains in the general state space.

Finally, in order to enlarge the range of application, we explore numerically scenarios where some of the conditions used to carry out the technical analysis are relaxed. For example, while the aforementioned theoretical results apply in the regime of full observability (i.e., when all nodes of the network are accessible), in Sec. VIII we will show some examples that illustrate how the proposed method can work also in the case of partial observability, along the lines of what has been developed in [38, 39, 40, 41, 42, 43] for linear models.

III Related Work

The problem of graph learning arises in several disciplines, giving rise to different terminologies and relevant models, including structural equations, structural dynamical systems, graphical or vector autoregressive models. In all these models, the evolution of the observables is determined by some form of local interaction between neighboring nodes, and this structure of interaction is encoded in an underlying network graph. In the following, we provide a list of works that are relevant to the present treatment.

III-A Learning Graphical Models

Graphical models conform to an important and well-studied class of systems in the framework of topology inference. In a graphical model, the state of each agent is represented by the realization of a random variable, and a joint distribution among these variables encodes the dependency among the agents, and, hence, encodes the underlying topology. Then, inference about the topology is carried out by assuming that independent samples from the joint distribution can be collected.

There are several works on topology inference for specific classes of graphical models, including, e.g., Ising models [28, 30] and Gaussian graphical models [29]. These works deal with topology inference under the assumption that measurement from all network nodes can be gathered. Another relevant framework, especially over large networks, is that of partial observations. This paradigm corresponds to a challenging inverse problem, where one tries to figure out the graph connecting the observed nodes, despite the latent influence of the unobserved ones. With reference to the partial observations setting for graphical models, in [31] a technique to learn the topology over large-girth graphs (e.g., the bipartite Ramanujan graphs or the random Cayley graphs) is proposed. In [32], a consistent graph-learning algorithm is proposed under the assumption that the adjacency graph matrix is sparse, and that the error matrix associated to the latent-variables is low-rank. In [33], an approach based on influence maximization is adopted to establish when the graph learning problem is feasible over the class of restricted Boltzmann machines.

However, the independence among samples is one fundamental assumption in the framework of graphical models, i.e., it is often assumed that the previous system state does not affect the next state. For this reason, graphical models do not natively match the dynamical system framework that we need here. Likewise, the shape of proper graph estimators can differ substantially from those suited to a dynamical system. For example, while in a Gaussian graphical model, the inverse of the correlation matrix (a.k.a. precision matrix) contains full information on the graph topology, the Granger estimator $R_{1}R_{0}^{-1}$ that is the optimal solution for a first-order diffusion model must rely also on the dependency between subsequent samples as encoded in the one-lag correlation matrix $R_{1}$ .

III-B Graph Learning with Linear Dynamics

Many works on graph learning or causal relationship identification focus on linear models, such as diffusion or Vector AutoRegressive (VAR) models [18]. These linear systems are of practical interest since they arise in several applications, for example, they are used in economics [53]; they may represent well the linearized dynamics of more general nonlinear systems [54]; they can describe the dynamics of biological systems [19] and the operation of distributed algorithms such consensus or diffusion [20, 21, 22, 23].

The problem of topology inference over linear systems has been actively studied in the past several years. A great emphasis lies on exploiting the natural regression formulation that these linear systems exhibit and on reinforcing priors on the network structure (e.g., sparsity, smoothness) when available. For example, recent techniques include: spectral-domain techniques based on optimization with sparsity constraints [24]; graph signal processing techniques applied to causal graph processes [25]; directed information graphs [27] to infer causal dynamics; and approaches based on Wiener filtering to infer the topology [26].

There are also works addressing topology inference over linear systems under partial observations. There are results established with reference to particular graphs, such as polytrees [34, 35], as well as results for more general graphs [36, 37]. For the case of large-scale graphs, asymptotic statistical approaches are exploited, where the conditions on the graph are encoded into average summary indicators, such as the connection probability between any two nodes. The works [38, 39, 40, 41, 42, 43] pursue this approach to show that the graph of the observed component can be faithfully reconstructed as the network dimension scales to infinity, under different regimes of connectivity and/or graph models.

III-C Graph Learning with Nonlinear Dynamics

One common approach to tackle the nonlinear problem is to perform some form of linearization of the system. In some cases, by a proper change of coordinates, certain dynamics can be represented linearly. For example, linearization can be obtained by a variational characterization (under a small-noise assumption) [44]; by suitable augmentation of observable space dimension [45]; by appropriate exploitation of the vector-field Jacobian, in conjunction with a compressed sensing method to mitigate the course of dimensionality and the computational complexity [46].

However, linear or linearized models, while useful under some favorable (e.g., small-noise or small-deviations) conditions, in other cases offer only a convenient approximation of the real dynamics, failing to capture some important aspects thereof.

One relevant class of genuinely nonlinear models is given by nonlinear vector autoregressive models [47, 48]. In [47, 48], the nonlinear interaction is modeled through a sum of nonlinear univariate functions of the node variables, belonging to a certain reproducing kernel Hilbert space. Advanced methods are developed to learn the topology, reinforcing prior information in suitably formulated regression problems in order to boost sample-complexity performance. These methods are flexible enough to incorporate different regularization constraints (such as, e.g., smoothness or sparsity) as well as to cope with the presence of unknown nonlinearities. While fairly general, the class of dynamical systems treated in these references does not encompass the class of logistic systems.

Another relevant class of nonlinear systems (more closely related to our approach) focuses on proper modeling of the nonlinear coupling between pairs of node variables. These models arise across many domains, especially in fundamental physics applications, with one notable model being the Kuramoto model. In [4], a general class of continuous-time nonlinear dynamical systems of Kuramoto-type is considered to model oscillators. The vector field of the dynamical law is approximated in a certain complete orthogonal basis of a Hilbert space (e.g., Legendre/Chebyshev polynomials or Fourier series). The coefficients of this expansion encompass information about the topology. Linear regression (i.e., linear in the entries of the coefficients) is used to extract the coefficients and hence the topology.

Recent efforts aim at finding proper measures of influence or causality among the agents. The underlying structure of the directed network is then generally obtained under appropriate thresholding of the connectivity-measures. For example, functional dependency graphs are introduced in [49] for a fairly broad class of dynamical systems. In [50], a “causal information” measure is proposed to estimate the underlying network of causal relationships. In particular, a Kullback-Leibler based measure is devised for a neural network model.

Along the line of the present work, other works have been devoted to identify the underlying network structure of dynamical systems that model natural phenomena such as oscillatory systems or spread of diseases. For example, in [51] the object of inference is the phylogenetic tree that accounts also for evolutionary elements concerning the disease spread under study and a Bayesian method is proposed. In [8], a log-MLE estimator is devised to infer the underlying structure. The dynamical model assumed is of SIR (susceptible-infected-recovered) type, driven by a Poisson process. The stochastic dynamical model yields a particular distribution parametrized by (among other parameters) the network structure. In [5] (for Kuramoto type of models) an estimator based on a heuristic influence function that maps the relative phase difference between nodes’ outputs into an estimate of their pairwise connection is proposed.

Notation. We use boldface letters to denote random variables, and normal font letters for their realizations. Matrices are denoted by capital letters, and vectors by small letters. Sets and events are denoted by calligraphic letters. We denote by $\mathbb{P}[\mathcal{A}]$ the probability of event $\mathcal{A}$ . For a random variable $\bm{y}$ , the notation $\mathbb{E}[\bm{y}]$ denotes the expected value of $\bm{y}$ . When we say that $\mathbb{E}[\bm{y}]$ exists and is well-defined, we mean that $\mathbb{E}[|\bm{y}|]<\infty$ . The symbol $\odot$ denotes the Hadamard product. The symbol $\mathcal{N}_{i}$ stands for the set of nodes that point to node $i$ in a directed-graph. For a vector $v\in\mathbb{R}^{N}$ , we denote by $\|v\|$ a generic vector norm. When dealing with an $N\times N$ matrix $M$ , the symbol $\|M\|$ will denote the matrix norm induced by the particular vector norm $\|\cdot\|$ , which is defined as:

[TABLE]

We will be dealing with random variables defined at network nodes and evolving over time. The notation $\bm{y}_{i,n}$ will generally denote a (random) variable defined at time $n=0,1,\ldots$ , and at node $i\in\{1,2,\ldots,N\}$ . With the notation $\bm{y}_{n}$ we shall denote the vector that collects the variables of all nodes at time $n$ , namely,

[TABLE]

IV Nonlinear model

In a networked dynamical system, the state of each agent evolves over time as a result of their peer-to-peer interactions. In particular, in the context of discrete-time systems, the state of an agent $i$ at time $n+1$ depends upon its own current state at time $n$ and also on the state of its immediate neighbors at time $n$ . The underlying network defining the neighborhood plays critical role in the long-term properties of such dynamical systems. In its most general form, a discrete-time first-order time-homogeneous continuous state-space nonlinear stochastic system can be described by the following law (refer to Proposition $7.6$ in [52])

[TABLE]

or in vector form

[TABLE]

where $T_{i}\,:\,\mathbb{R}^{N}\rightarrow\mathbb{R}$ , is the vector-field (at node $i$ ) representing how the local interactions affect the evolution of node $i$ ; $\{{\bm{x}}_{n}\}$ is an i.i.d random process modeling exogenous input/perturbations across nodes over time $n$ . The process is time-homogeneous because the law $T$ depends only on the values of the states and of the noise, and does not change over time. The process is with continuous state space because the range of output values can in general vary in $\mathbb{R}^{N}$ .

One characterizing property of such a networked dynamical system is its locality: the state ${\bm{y}}_{i,n+1}$ of agent $i$ at time $n+1$ , only depends on its own state ${\bm{y}}_{i,n}$ and the states of the agents within its neighborhood $\{{\bm{y}}_{j,n}\}_{j\in\mathcal{N}_{i}}$ at time $n$ . In other words, ${\bm{y}}_{i,n+1}$ is independent of the state of all nodes outside its neighborhood, given the state of the nodes in its neighborhood. This property is referred to as a local Markov property and it is naturally induced by the characterizing vector field, $T_{i}$ , which must be sensitive only to the entries associated with the node $i$ and its neighbors in $\mathcal{N}_{i}$ , formally, for any vectors $y\in\mathbb{R}^{N}$ and $x\in\mathbb{R}^{N}$ , the function $T_{i}(y,x)$ depends only on $y_{i},\{y_{j}\}_{j\in\mathcal{N}_{i}}$ and $x$ . The underlying network topology critically determines the evolution of the vector field via this local Markov property.

In this paper, we are interested in the inverse problem of inferring the underlying network characterizing the local Markov property of $T$ given the samples $\{{\bm{y}}_{n}\}$ . It is however hard to devise a universal scheme to extract consistently information about the underlying network of interactions over such a general class of dynamical systems given by (11). It is obvious that not all models can allow consistent graph learning.

For this reason, in this work we focus on a subclass of discrete-time continuous-state systems with the vector field defined by (1), which can be compactly represented in vector form as111In order to avoid confusion, we note explicitly that the notation $g(\bm{y}_{n})\odot Ah(\bm{y}_{n})$ used in (12) denotes the Hadamard product between $g(\bm{y}_{n})$ and the vector $Ah(\bm{y}_{n})$ .:

[TABLE]

having defined the vector-valued functions of vector argument $y\in\mathbb{R}^{N}$ :

[TABLE]

Throughout this work, we assume that $\{\bm{x}_{n}\}$ are independent and identically distributed (i.i.d.) random vectors, with zero-mean and finite second moments, and independent of the initial condition $\bm{y}_{0}$ .

Finally, it is worth noting that, since $\sigma$ is an invertible mapping, by setting $\bm{z}_{n}=\sigma^{-1}(\bm{y}_{n})$ , the system in (12) can be also represented as:

[TABLE]

where we have introduced the composition of functions:

[TABLE]

The type of model in (16) is commonly referred to as additive noise model, and is extensively studied, e.g., in the literature of stochastic dynamical systems [56, 57, 58]. Throughout our treatment, we will generally work in terms of the original untransformed model in (12), since working in terms of the composed functions (17) might obfuscate the role of the different nonlinear functions. On the other hand, the representation in (16) will be useful to prove the technical results in Appendix C, where we will make appeal to existing results pertaining to the additive noise model [56, 57, 58].

V Generalized Granger Estimator

The proper scheme to process the samples with the goal of estimating either the graph-structure or more generally the interaction matrix relies critically on the nature of the samples. Our goal is to establish the proper structure retrieval scheme for the class of networked dynamical systems (1).

We highlight that for the most general class of networked dynamical systems (11), one should not expect to have a closed-form expression for the underlying structure in terms of a functional of the samples – in general, the inference is carried out by indirect means, e.g., as the solution of an optimization problem. By closed-form expression, we mean $A=\mathcal{F}\left({\bm{y}}_{n+1},{\bm{y}}_{n}\right)$ , for some functional $\mathcal{F}$ (e.g., expectation) that can be written in closed-form and expressed only in terms of the observable variables.

In what follows, we introduce a weighting function defined for any $y\in\mathbb{R}$ as:

[TABLE]

Since in principle $\omega(y)$ can be singular when $g(y)$ has some nonzero entries, it is useful to introduce the set:

[TABLE]

Let us also introduce the zero-lag matrix:

[TABLE]

and the one-lag matrix:

[TABLE]

where $\mathbb{I}[\cdot]$ is the indicator function, which takes on the value $1$ if the event under brackets is true, and [math] otherwise. The indicator has been included in the definition to assign a finite value (zero) to the one-lag matrix when the function $\omega$ is singular. The next lemma exploits the nonlinear auto-regressive structure of (12) to relate the matrix $A$ to the expected value of two functions of the samples.

Lemma 1 (Generalized Granger)

If all the entries in the vector $g(\bm{y_{n}})$ are nonzero with probability one, and if the following expectations are well-defined:

[TABLE]

then the following expectation is well-defined:

[TABLE]

and we have that:

[TABLE]

Moreover, if the matrix $\mathcal{F}_{0}(n)$ is invertible, we have that:

[TABLE]

Proof:

From (12) we have the identity:

[TABLE]

Multiplying both sides of (26) by $h(\bm{y}_{n})^{\top}$ , from (21) we get:

[TABLE]

Since by assumption all entries of vector $g({\bm{y}}_{n})$ are nonzero with probability one, we have that $\mathbb{P}[\bm{y}_{n}\notin\mathcal{Z}]=1$ . Thus, by taking expectations of both sides in (27), and using the definitions of $F_{0}(n)$ in (22), and of $F_{1}(n)$ in (23), we obtain:

[TABLE]

In view of the definition in (20), the first term on the RHS is automatically well-defined because we assume integrability of the random variable $F_{0}(\bm{y}_{n})$ — see the first relationship in (22). As to the second term, from the independence between $\bm{x}_{n+1}$ and $\bm{y}_{n}$ we can write:

[TABLE]

where the last equality holds since $\mathbb{E}\left[\bm{x}_{n+1}\right]=0$ , whereas the second relationship holds in view of the integrability assumption in (22), and since $\bm{x}$ has finite first moment [52]. ∎

Remark 1

The solution $\mathcal{F}_{1}(n)[\mathcal{F}_{0}(n)]^{-1}$ arises as the classical solution to the following nonlinear regression problem (where $\|\cdot\|_{2}$ is the Euclidean norm):

[TABLE]

Likewise, the empirical counterparts of $\mathcal{F}_{0}(n)$ and $\mathcal{F}_{1}(n)$ :

[TABLE]

would give $\widehat{\mathcal{F}}_{1}(n)[\widehat{\mathcal{F}}_{0}(n)]^{-1}$ as the solution of the least-squares fitting problem:

[TABLE]

$\square$ **

The relationships in Lemma 1 are basically obtained through straightforward algebra. However, there are two fundamental issues that have been overlooked so far.

First, the relationship in (25) contains expected values. These expected values are obviously unknown (since they depend on the unknown distribution of the samples, which in turn depends on the object of the topology inference, the unknown matrix $A$ ). Nevertheless, Eq. (25) would become very useful if we can show that the matrix functions $\mathcal{F}_{0}(n)$ and $\mathcal{F}_{1}(n)$ can be consistently estimated from the samples, e.g., using (31). This analysis requires addressing issues of stability and ergodicity, and will be carried out in Sec. VI. $\square$

Second, the statement of Lemma 1 is in a sense optimistic, since it relies on two generic assumptions like “assume this is well-defined” and “assume this is invertible”. More precisely, for the direct relationship (24) to hold, one needs that the expected values in (22) are well-defined. Once this condition is ascertained, one more condition is needed. Indeed, to get the fundamental inverse relationship (25), which allows retrieving the object of topology inference, $A$ , from the zero-lag and one-lag functions, one needs $\mathcal{F}_{0}(n)$ to be invertible. $\square$

We conclude that the inference problem treated in this work is not tractable for all possible functions $(\sigma,g,h)$ and statistical laws for the noise $\bm{x}$ characterizing the dynamical system. That is why establishing when the aforementioned conditions are met (we will provide next a set of sufficient conditions for that) is critical to establish whether or not full information about the topology is contained in the samples. Moreover, it is also relevant to understand the practical meaning of these assumptions in connection to the properties of the networked dynamical system under consideration. The forthcoming sections address all these fundamental concerns.

V-A Existence of $\mathcal{F}_{0}(n)$ and $\mathcal{F}_{1}(n)$

Examining (1), we see that the map $g$ controls the flow of information among nodes. To give (an extreme) example, if $g\equiv 0$ then the information about the state of the neighboring agents does not flow or in particular, the network information (that is entailed in $A$ ) is lost. It is further natural to expect that if $g$ is generally too small then, even though information is technically flowing, it is hard to extract it.

Thus, first of all, we have to guarantee that the inverse vector $1/g$ is almost surely well defined, i.e., that the probability that $g$ has some zero entries is zero. One critical difficulty here is that $g$ is a function of $\bm{y}_{n}$ , and that the distribution of $\bm{y}_{n}$ depends in some intricate way on the evolution of the dynamical system.

Such physical property reflects into the following mathematical requirement. Since the matrix function $\mathcal{F}_{1}(n)$ in (22) contains the inverse vector function $1/g$ , one must guarantee that the latter behaves properly in terms of integrability. One example that shows how the result of Lemma 1 needs to be carefully checked, is shown in Fig. 2, where we used the simple linear function $g_{i}(y)=y$ for all $i=1,2,\ldots,N$ . We evaluate empirically the matrix functions $\mathcal{F}_{0}(n)$ and $\mathcal{F}_{1}(n)$ , and we observe that they basically blow up.

The next assumption formally sets a condition on the aforementioned aspect.

Assumption 1 (Integrability conditions)

We assume bounded moments:

[TABLE]

for the initial condition $\bm{y}_{0}$ and the noise term $\bm{x}$ . We assume also that:

[TABLE]

and for some constant $K$ . $\square$

Moreover, since there are several functions involved in the definition of $\mathcal{F}_{1}(n)$ as well as in the definition of $\mathcal{F}_{0}(n)$ , we need a set of sufficient conditions to guarantee the existence of the latter two matrix functions.

Assumption 2 (Regularity of the nonlinearities)

We assume that $\sigma$ is a diffeomorphism222Actually, to prove Proposition 1 we need just $\sigma$ to be continuous and invertible.. We assume that $g$ and $h$ are continuous functions. Furthermore, let us introduce the diagonal matrix:

[TABLE]

We assume the following conditions on the nonlinearities $(\sigma,g,h)$ :

[TABLE]

for some nonnegative constants $\alpha_{\sigma},\alpha_{g},\alpha_{h},\beta_{\sigma},\beta_{g},\beta_{h}$ with $p\geq 0$ , $q\geq 0$ , and $p+q=1$ . $\square$

For instance, a condition like (36) holds when $\|\sigma(y)\|$ grows not faster than $\|y\|$ outside some compact set. Indeed, in this case the constant $\beta_{\sigma}$ can be given by the maximum of $\|\sigma(y)\|$ inside that compact set.

Likewise, when some of the functions $\sigma,h,g$ are bounded, we assume the pertinent $\alpha$ -constants equal to zero.

We have the following result.

Proposition 1 (Existence of $\mathcal{F}_{0}(n)$ and $\mathcal{F}_{1}(n)$ )

Under Assumptions 1 and 2, the expectations

[TABLE]

are well-defined, and the random variable $g(\bm{y}_{n})$ has all nonzero entries with probability $1$ .

Proof:

The proof is given in Appendix A. ∎

V-B Invertibility of $\mathcal{F}_{0}(n)$

A critical requirement for retrieving $A$ from the matrix functions $\mathcal{F}_{0}(n)$ and $\mathcal{F}_{1}(n)$ is invertibility of $\mathcal{F}_{0}(n)$ . However, it is important to remark that there might be dynamical systems where this condition can be violated.

In Fig. 3, we show an example where the function $\sigma$ has different behaviors for different agents. In particular, for all agents we have that the component-wise functions have a hyperbolic tangent shape. For agents $1$ and $2$ , however, they are shifted (upward and downward, respectively). The function $g$ is chosen so as to fulfill the conditions for integrability. The function $h$ is the limiter (i.e., linear saturating function) displayed in Fig. 3. We can give to these nonlinearities some physical meaning. The nonlinear shape is typical of several applications (for instance, neural networks). The saturation effects present both in $\sigma$ and $h$ are as well typical. For instance, the saturation in $h$ takes on the practical meaning of limiting the interaction when the sample $\bm{y}_{i,n}$ is too large. This behavior can be present, e.g., in a social learning problem where the agents might use saturation to filter observations that look like outliers.

In the example reported in Fig. 3, we have computed empirically the matrices $\mathcal{F}_{0}(n)$ and $\mathcal{F}_{1}(n)$ . In particular, we have verified that $\mathcal{F}_{0}(n)$ is not invertible, which gives rise to the singular behavior of the topology estimator observed in the lowermost-rightmost panel. This behavior is due to the fact that the vector function $h$ (uppermost-rightmost panel) becomes constant along the two coordinates corresponding to nodes $1$ and $2$ , since the output range of $\sigma_{1}$ and $\sigma_{2}$ (uppermost-leftmost panel) forces $h_{1}$ and $h_{2}$ to operate only in the saturation region (i.e., $y<-1$ for node $1$ , and $y>1$ for node $2$ ). This yields singularity of the matrix:

[TABLE]

Indeed, since the matrix $\mathbb{E}[h(\bm{y}_{n})h(\bm{y}_{n})^{\top}]$ is always positive semi-definite, in order to grant invertibility we must exclude the condition that there exists some (deterministic) vector $v$ such that:

[TABLE]

However, the above condition would correspond to say that:

[TABLE]

which basically means that $h(\bm{y}_{n})$ must not be a low-rank map. The following assumption (which is, e.g., usually encountered in ergodic theory [56]) makes this statement precise.

Assumption 3 (Non-singularity of response $h$ )

Let $\mu_{\textnormal{Leb}}$ be the Lebesgue measure in $\mathbb{R}^{N}$ . We assume that $h\,:\,\mathbb{R}^{N}\rightarrow\mathbb{R}^{N}$ is such that

[TABLE]

In words, this assumption states that if the input set $\mathcal{A}$ has full-dimension, then its image $h(\mathcal{A})$ is non-degenerate as well. It can be verified that transformations that fulfill property (43) are: linear full-rank maps, open maps, diffeomorphisms, and differentiable maps with non-singular Jacobian almost everywhere [56].

On the other hand, a constant map does not have this property or, more generally, a low-rank linear map (e.g., a projection onto a lower-dimensional subspace) does not have this property as the image of any set necessarily lies in a lower-dimensional subspace (orthogonal to the kernel of the linear application). $\square$

Proposition 2 (Invertibility of $\mathcal{F}_{0}(n)$ )

Under Assumptions 1– 3, and if $\sigma$ is a diffeomorphism, then the matrix

[TABLE]

is invertible for all $n\in\mathbb{N}$ .

Proof:

The proof is given in Appendix B. ∎

VI Consistent Graph Learning

Lemma 1 states that the interaction matrix $A$ can be obtained in terms of two matrix functions, namely, the functions $\mathcal{F}_{0}(n)$ and $\mathcal{F}_{1}(n)$ . However, this property alone would not be sufficient to guarantee that we can reliably estimate the topology from the samples $\{\bm{y}_{n}\}$ . For this to be possible, we should demonstrate that the aforementioned functions can be consistently learned from the samples. The requirement of consistency means that we should be able to converge to the desired matrix functions as the number of available samples grows. This property would yield a practical scheme to estimate the interaction matrix $A$ (and hence its support) from the samples.

In other words, we are requiring the system to be stable and ergodic. And indeed, inference problems over dynamical systems often rely on the stability of the system: it is hard to perform faithful inference over systems that blow up.

Technically speaking, the model in (12) corresponds to a Markov chain with a general state space, since, for instance, when the noise component $\bm{x}$ is absolutely continuous, the chain can walk over a continuous space [55]. For such type of Markov chains, the limiting results (e.g., stability and/or ergodicity) are much more involved than those corresponding to the classic case of finite or discrete state space. In the following, we will appeal to powerful results to prove that our dynamical system is in fact ergodic, which would be a critical property to enable consistent topology learning.

We have the following result.

Proposition 3 (Consistent Graph Learning)

Assume that Assumptions 1–3 are fulfilled. Assume further that the noise $\bm{x}_{n}\in\mathbb{R}^{N}$ is absolutely continuous with almost-everywhere (with respect to the Lebesgue measure in $\mathbb{R}^{N}$ ) positive density. Let us introduce the stability constant:

[TABLE]

Then, if $\kappa_{s}<1$ , we have that:

[TABLE]

where

[TABLE]

and:

[TABLE]

Proof:

The proof is given in Appendix C. ∎

Proposition 3 provides consistency (i.e., faithful estimation as the number of samples grows) of the generalized Granger within the scope of the class of nonlinear dynamical systems (1). Based on this result, we are now in the position of proposing the following graph learning algorithm to reconstruct the underlying (directed) graph from each realization (or sample-path) of the nonlinear system.

One special comment is deserved by the last step of the algorithm. Once we estimate the combination matrix $A$ , we need to reconstruct its support graph. However, due to finiteness of the number of samples, also the zero entries in $A$ would result in some (possibly small) nonzero entries. Accordingly, in the last step of the algorithm we apply a clustering algorithm (here the $k$ -means with $k=2$ ) to the entries of the estimated interaction matrix $\widehat{A}$ . Such clustering algorithm is aimed at devising a boundary between causality and non-causality among agents, i.e., to provide an automated data-driven classification threshold: $j$ affects $i$ if the entry $\widehat{a}_{ij}$ assumes a high(er) value, otherwise, if $a_{ij}$ assumes a weak(er) value then $j$ is deemed as not affecting $i$ .

In order for the clustering algorithm to work properly, one might question that the zero/nonzero entries of $A$ should inherently possess some clustering property. Indeed, such form of clustering has been proved to hold (for sufficiently large network sizes) when the underlying graph is an Erdős-Rényi random graph, for a relevant class of combination matrices [38, 39, 40, 41, 42, 43].

VII Illustrative Examples

We now present a number of examples aimed at illustrating the validity of the theoretical analysis conducted so far. In what follows, the ground-truth combination matrix, $A$ , is constructed through the following steps. First, we generate a realization of a binomial graph with connection probability $p$ , which means that any arrow (i.e., directed edge) exists with probability $p$ , and independently from all the other edges [59].

Once the underlying graph has been generated, the combination weights are assigned to each arrow of the graph to yield the combination matrix $A$ . We will consider the following assignment rule (a.k.a. uniform averaging rule):

[TABLE]

where $g_{ij}\in\{0,1\}$ is equal to $1$ only if there is a directed edge from $j$ to $i$ , and where $d_{i}=|\mathcal{N}_{i}|$ is the in-degree or number of agents directly influencing agent $i$ .

In the following experiments, we will consider a network of $N=50$ nodes. We start by examining the full observation case. In Fig. 4, we consider the following nonlinearities to drive our dynamical system, for all $i=1,2,\ldots,N$ :

[TABLE]

The parameter $\rho$ of the combination matrix is set equal to $0.5$ . In the lowermost row of Fig. 4, we display the Empirical Generalized Granger (EGG) algorithm and contrast its performance against three standard estimators, namely, the (linear) Granger estimator, the precision matrix (i.e., the inverse of the correlation matrix) and the correlation matrix. For the sake of a neater data-visualization, and in order to display the matrices values in a one-dimensional plot, we proceeded as follows. First, in all panels the true matrix $A$ is represented in black. In particular, we first vectorized the true matrix $A$ and removed the entries associated to its diagonal. As a result, each element in the abscissa of each plot, say $10$ , corresponds to a particular entry index, say $(2,6)$ . Then, we sorted the entries of the resulting vector in ascending order — that is why the black curves are nondecreasing. Still in the lowermost row of Fig. 4, the blue curves are obtained by arranging the off-diagonal entries of the pertinent matrix estimators as induced by the aforementioned ordering of $A$ . In this way, we are contrasting entry-by-entry the ground-truth matrix $A$ with the matrix-estimators in a one-dimensional ordered plot. The inset plots displayed in the various panels show just one portion (for the sake of clarity) of the whole network graph learned by the pertinent algorithm. We remark that in this analysis, the algorithm has access to the full network, whereas the panel is limited to a sub-graph just for a matter of visualization. In the inset plots, the red arrows represent edges erroneously detected by the learning algorithm (i.e., edges that are not present in the true graph).

Three major observations arise by inspection of Fig. 4. First, we see that the EGG algorithm is able to estimate faithfully the combination weights. Second, the clustering algorithm is able to properly reconstruct the network skeleton from the estimated combination matrix. Third, and perhaps not unexpectedly, the classical methods that work in the linear case (Granger), or in the Gaussian graphical model setting (precision), or over correlation-networks (correlation matrix), are essentially blind in our nonlinear framework.

In Fig. 5, we repeat the experiment on another set of nonlinearities, namely,

[TABLE]

We see that similar conclusions apply. One interesting difference is the better convergence of the EGG and of the one-lag-functional estimators, which could be ascribed to the fact that the activation function, $\sigma$ , is now bounded, a feature that concurs to increase the system stability, and, hence, the speed of convergence of the various empirical estimators.

VIII Beyond the Theoretical Results

The results summarized in Proposition 3 allow consistent estimation of the network graph under the specific setting and under a set of assumptions that we have extensively discussed. On the other hand, there exist relevant scenarios where the setting need to be enlarged and some of the assumptions relaxed. In this section, we show how the analysis can be helpful to address some of these scenarios.

VIII-A Regularization of the Weighting Function $\omega(y)$

From our analysis, it is apparent that one limitation is the integrability condition (34) imposed on the weighting function $\omega(y)$ . For example, in Fig. 2 we have seen that a linear $g(y)$ can lead to a weighting function $\omega(y)$ with infinite expectation (because of the too-fast singularity of $g(y)=y$ around the origin), resulting in a singular estimator. In order to remedy this issue, one could replace $\omega(y)$ with a regularized version, $\widetilde{\omega}(y)$ , in the evaluation of the one-lag function $F_{1}(\bm{y}_{n+1},\bm{y}_{n})$ . For example, we could set $\widetilde{\omega}(y)=\omega(y)$ outside some small neighborhoods of its singularities, and $\widetilde{\omega}(y)$ equal to some constant within these neighborhoods. The rationale behind this regularization is that, on one hand, we rule out the pathological behavior of the estimator since we remove the singularities; on the other hand, for sufficiently small perturbations of the original $\omega(y)$ , we expect that the deviations from the true combination matrices are small enough to let the clustering algorithm be still able to classify correctly the links between nodes.

We apply the proposed regularization to the example of Fig. 2, and the result is shown in Fig. 6. Some notable features emerge. We start by examining the leftmost panel. First, we see that the regularized weighting function removes the singular behavior of the estimator (blue curve), which is now capable to follow the true profile (black curve) of the combination matrix entries. This behavior is reasonable because our regularization has in fact removed the singularity.

Second, we observe that the estimator looks noisier as compared to the examples of Figs. 4 and 5. This augmented irregularity can be explained because we started from a singular weighting function. Third, we see that the estimator seems not to converge to the true combination matrix, which is expected because consistent reproduction of the combination matrix is not granted when we use a surrogate weighting function. More specifically, a trade-off arises between the degree of irregularity and the fidelity of reconstruction, and we expect that the smaller the perturbation of the original weighting function is, the higher the irregularity of the estimator and the fidelity of reconstruction will be. This trade-off is confirmed in the rightmost panel of Fig. 6, where we consider a stronger perturbation (i.e., we modify the weighting function in an ampler neighborhood). Comparing the rightmost panel against the leftmost panel, we see that the blue curve is now less wild, but that it is more distant from the true matrix (black curve). In summary, a sort of uncertainty principle is exhibited: one can either get a more precise knowledge (less oscillating curves) of a less precise matrix value; or a less precise knowledge (i.e., more oscillating curves) of a matrix value closer to the true value.

This notwithstanding, we should keep in mind that the basic goal of graph learning is retrieving the skeleton of the network, i.e., the support graph of the combination matrix. For this reason, it is not so crucial that the estimator is not able to reproduce exactly the values of the combination matrix. What is critical is that the identifiability gap between the estimated matrix entries corresponding to disconnected or connected node pairs is still well recognizable from the estimator, which would allow to the clustering algorithm to recover the correct graph. The inset panels of Fig. 6 show that this can be in fact possible. This notion of identifiability gap has been introduced and extensively discussed in [42, 43], and will play a role especially in the partial observation setting, as we will see in the forthcoming section.

VIII-B Partial Observation Setting

All the analysis conducted so far was based on a full-observability assumption, namely, on the assumption that samples from all nodes can be collected. We note however that in complex large-scale systems, we are often unable to probe the state-evolution of all nodes. Accordingly, in this section we consider the partial observation setting where the data can be collected from only a subset $\mathcal{S}$ of nodes. The objective of the learning becomes then reconstructing the support graph of the partial sub-matrix, $A_{\mathcal{S}}$ , namely, of the sub-matrix corresponding to subset $\mathcal{S}$ . Likewise, in the partial observation case we assume that the truncated matrix functions333We remark that the sub-matrices $[\mathcal{F}_{0}]_{\mathcal{S}}$ and $[\mathcal{F}_{1}]_{\mathcal{S}}$ can be constructed from the samples $\{\bm{y}_{n}\}$ because the $i$ -th component of the functions $\sigma(y),h(y),g(y),g(y)$ depends solely on the $i$ -th entry $y_{i}$ .:

[TABLE]

will replace the full matrices $\mathcal{F}_{0}$ and $\mathcal{F}_{1}$ .

However, since $A=\mathcal{F}_{1}\mathcal{F}_{0}^{-1}$ , in general the latent (unobserved) samples for nodes outside $\mathcal{S}$ affect the possibility of constructing $A_{\mathcal{S}}$ from $[\mathcal{F}_{0}]_{\mathcal{S}}$ and $[\mathcal{F}_{1}]_{\mathcal{S}}$ . For example, we have:

[TABLE]

For the special case of linear networked dynamical systems, the generalized Granger boils down to the Granger estimator, and the partial construction in (58) is obtained by considering the zero-lag and one-lag correlation matrices. Recent works have in fact established the structural consistency of this partial (i.e., applied only to a subset of nodes) Granger estimator in the linear case, for a class of regular symmetric combination matrices $A$ (that include, for instance the classic Laplacian and Metropolis matrices), and when the underlying graph is an undirected Erdős-Rényi graph under various regimes of connectivity [38, 39, 40, 41, 42, 43]. In particular, in [39] feasibility of such inverse problem is proved for sparse Erdős-Rényi graphs and increasing cardinality of the observable space; the analysis is extended in [40, 41] to cover the relevant case where the observed nodes have arbitrary connection structure and the cardinality of observed nodes is fixed (thus, the degree of observability is low); in [42, 43], the analysis is extended to cover the case of densely connected networks.

Motivated by these results holding for the linear case, we now test the partial Generalized Granger estimator in (58) with the more general nonlinear dynamical systems addressed in this work. Devising technical guarantees for structural consistency of the generalized Granger under this latent scenario appears to be a highly nontrivial task. Therefore, we are not in the position here to prove that the proposed method is consistent under this particular latent regime. Nevertheless, it is useful to see whether, under the conditions used to examine the nonlinear model under full-observation, the algorithm applied to partial observation can work.

In Fig. 7, we use the same settings adopted in Figs. 4 and 5, but for one essential detail. Now, only a subset $\mathcal{S}$ with $10$ out of $50$ nodes is accessible. According to the discussion in Sec. VIII, we implement two estimators, namely, the EGG with $[\widehat{F}_{0}(n)]_{\mathcal{S}}$ and $[\widehat{F}_{1}(n)]_{\mathcal{S}}$ , namely, with the matrix functions evaluated only on data coming from the observed network component. We see how both the proposed method is able to faithfully estimate the network graph.

We remark that the proposed analysis is by no means exhaustive, and is meant to show that there are cases where the generalized Granger method can work in the nonlinear regime. On the other hand, even if we do not have a definite answer, we have also evidences that it can fail, and that it is in particular sensitive to two features: $i)$ heterogeneity, namely, when the nonlinear components at different nodes behave very differently, the role of some latent nodes might become dominant and corrupt the fidelity of the graph reconstruction; $ii)$ level of noise, which is a distinguishing feature of the nonlinear setting, since in the linear case the noise variance played basically as a scale factor that does not affect the graph identifiability. Perhaps not unexpectedly, in the nonlinear case the level of noise can alter even substantially the overall qualitative behavior of the dynamics.

Appendix A Proof of Proposition 1

Lemma 2

Under Assumption 1, we have, for all $n\in\mathbb{N}$ :

[TABLE]

Proof:

Let $k(y)=g(y)\odot Ah(y)$ . In view of (12), we have that $\bm{y}_{n}=\sigma(k(\bm{y}_{n-1})+\bm{x}_{n})$ . We can use the tower property to write:

[TABLE]

However, since $\bm{x}_{n}$ is statistically independent from $\bm{y}_{n-1}$ , we have:

[TABLE]

where in the first inequality we used the fact that whenever $\bm{x}$ and $\bm{y}$ are independent random variables, and for any measurable map $q$ , the underlying function $r$

[TABLE]

is given by $r(y)=\mathbb{E}\left[q(\bm{x},y)\right]$ for all $y\in\mathbb{R}^{N}$ [52] and we have that $r(\bm{y})\leq\sup_{c\in\mathbb{R}^{N}}r(c)$ almost surely (which yields the first inequality in (61) with the proper choice of $q$ ). The latter inequality in (61) holds in view of (34). ∎

Remark 2

Note that the integrability condition on $\omega(\bm{y}_{n})$ implies that the vector $g(\bm{y}_{n})$ has some zero entry with zero probability. $\square$

Lemma 3

Let us define, for an arbitrarily small $\epsilon>0$ , the following constant:

[TABLE]

where $p$ , $q$ and the various $\alpha$ and $\beta$ constants have been introduced in (36)–(38). We have that:

[TABLE]

for some $\kappa>0$ .

Proof:

Using (36)–(38) we have the following chain of inequalities:

[TABLE]

Now, consider first the case where both $p>0$ and $q>0$ . In this case, since $p+q=1$ , we have that both $p$ and $q$ are strictly less than $1$ , implying that, for an arbitrarily small $\epsilon$ , and for a certain constant $c$ we can write:

[TABLE]

Accordingly, from (65) we get:

[TABLE]

where we have collected all the constant terms into the factor $\beta$ , and we have used the first definition of $\gamma$ in (63).

Next consider the case $q=0$ . This implies that $h(y)$ is bounded by a constant, and we accordingly have $\alpha_{h}=0$ . Therefore, Eq. (65) becomes:

[TABLE]

where we have collected the constant terms into $\beta$ , and where we have used the second definition of $\gamma$ in (63). The proof for the case $p=0$ follows similarly. ∎

Lemma 4

If the constant $\gamma$ in (63) is strictly smaller than $1$ , then the sequence of random variables $\{\|h(\bm{y_{n}})\|^{2}\}$ is uniformly integrable.

Proof:

First, we observe that:

[TABLE]

for a deterministic function $\varphi$ . In view of Lemma 3, we can write:

[TABLE]

Now, since the $\bm{x}_{i}$ ’s are i.i.d., the function:

[TABLE]

that is obtained by applying the function $\varphi$ to a reversed sequence $\bm{x}_{n},\bm{x}_{n-1},\ldots,\bm{x}_{1}$ has the same distribution of $\bm{u}_{n}$ . Applying (71) to (70) we can write:

[TABLE]

Now, in view of Kolmogorov two-series theorem [61], it makes sense to introduce the random variable:

[TABLE]

which has the first two moments bounded since $\gamma<1$ and $\mathbb{E}[\|\bm{x}\|^{2}]<\infty$ . Therefore, from (72) we have:

[TABLE]

This clearly shows that:

[TABLE]

where $\bm{u}^{\prime}$ is an integrable random variable that does not depend on $n$ . This implies that $\bm{u}^{\prime}_{n}$ are uniformly integrable. Since uniform integrability is a property of the distribution, also the original $\bm{u}_{n}$ are uniformly integrable, and the claim of the lemma follows. ∎

Proof:

We need to show that the following expectations are well-defined:

[TABLE]

That $\mathcal{F}_{0}(n)$ is well-posed follows directly from Lemma 4. For what concerns $\mathbb{E}[\omega(\bm{y}_{n})h(\bm{y}_{n})^{\top}]$ , we know that $\|\omega(\bm{y}_{n})\|^{2}$ is integrable in view of Lemma 2, while $\|h(\bm{y}_{n})\|^{2}$ is integrable in view of Lemma 4. Thus, the claim follows since the product of two $L_{2}$ functions is $L_{1}$ . In a more explicit form, if we take the $(i,j)$ -th entry of the matrix $\omega(\bm{y}_{n})h(\bm{y}_{n})^{\top}$ , we get:

[TABLE]

and:

[TABLE]

by simple application of the Cauchy-Schwartz inequality. ∎

Appendix B Proof of Proposition 2

Proof:

We will prove the claim by contradiction. Assume $\mathcal{F}_{0}(n)$ not invertible. In view of (42), this corresponds to saying that $h(\bm{y}_{n})^{\top}v=0$ almost surely, i.e.:

[TABLE]

where:

[TABLE]

On the other hand, we have that ( $\mu_{\textnormal{Leb}}$ is the Lebesgue measure in $\mathbb{R}^{N}$ ):

[TABLE]

where the first equality holds true because $\mathcal{V}^{\perp}$ is a lower-dimensional subspace of $\mathbb{R}^{N}$ ; the intermediate implication comes from Assumption 3; whereas the last implication comes from the fact that $\bm{y}_{n}$ is absolutely continuous with respect to the Lebesgue measure, as we will now show. Indeed, from (12) we can write:

[TABLE]

Now, the argument of the function $\sigma$ is absolutely continuous because it is the sum of two independent random variables, with one of these being absolutely continuous. Since the mapping $\sigma$ is a diffeomorphism, it preserves absolute continuity. Indeed, letting $\bm{y}=\sigma(\bm{z})$ , with $\bm{z}$ being absolutely continuous with a density $f_{z}$ , we have that ( ${\sf D}$ denotes the Jacobian):

[TABLE]

and absolute continuity of $\bm{y}=\sigma(\bm{z})$ follows by absolute continuity of $\bm{z}$ . We conclude that $\bm{y}_{n}$ is absolutely continuous. But we have the equality:

[TABLE]

which violates condition (79), yielding a contradiction. ∎

Appendix C Proof of Proposition 3

Proof:

In order to prove ergodicity, it is convenient to use the additive noise model representation in (16), which we report here for convenience:

[TABLE]

where we have used the definitions,

[TABLE]

and, for $z\in\mathbb{R}^{N}$ :

[TABLE]

In view of Example 7.4.6 in [58], if $F$ is continuous, if the noise $\bm{x}_{n}$ is absolutely continuous with respect to the Lebesgue measure with almost everywhere positive density with finite mean, and if there exist positive constants $\beta$ and $\gamma<1$ such that444Actually, condition (88) rephrases the condition reported in [58] in a way that is more convenient in our setting.:

[TABLE]

then the chain is $w$ -geometrically ergodic with weight function $\omega(z)=1+|z|$ [58]. Since by assumption continuity of $F$ and the properties of the noise are fulfilled, it remains to verify that (88) holds true in our setting. Reasoning as done in (65), we get:

[TABLE]

We shall consider the case $p>0$ , $q>0$ . The proof for the remaining cases is similar. Reasoning along the same lines as in (66), we can conclude that, for an arbitrarily small $\epsilon>0$ :

[TABLE]

where $c$ is a certain positive constant. On the other hand, in view of (36)–(38) we can further write:

[TABLE]

where $\gamma=\|A\|\alpha_{h}\alpha_{g}\alpha_{\sigma}+\epsilon^{\prime}$ , and $\beta$ is a proper constant. Therefore, we have proved that (88) is verified, which implies $w$ -ergodicity of $\bm{z}_{n}$ in light of Example 7.4.6 in [58]. Since $\bm{z}_{n}=\sigma^{-1}(\bm{y}_{n})$ , we have in fact proved $w$ -ergodicity for $\bm{y}_{n}$ .

Since we now have $w$ -geometric ergodicity, we also have convergence in total variation to the unique invariant measure $\pi_{y}$ , i.e.,

[TABLE]

for any initial distribution $\nu$ , where $P$ is the transition kernel characterizing the Markov process $\left\{\bm{y}_{n}\right\}$ , and $\|\cdot\|_{{\sf TV}}$ is the total variation norm [58]. In other words, the unique invariant measure $\pi_{y}$ is a global attractor of the dynamical system $\nu P^{n}$ in the space of probability measures on $\mathbb{R}^{N}$ .

Now, the aforementioned convergence in total variation will imply the strong law in (48) via application of the ergodic theorem [55, 57], if we prove that the following moment (the notation $\mathbb{E}_{\pi_{y}}[\cdot]$ means that the expectation of the random variable under brackets is computed under the measure $\pi_{y}$ ):

[TABLE]

is well-defined. To this end, let us first observe that the constant $\kappa$ appearing in (45) is strictly smaller than $1$ by assumption, which implies that, for a sufficiently small $\epsilon$ , it is possible to choose a constant $\gamma<1$ in Lemma 4. Therefore, we have that the sequence of random variables $\left|\left|h(\bm{y}_{n})\right|\right|^{2}$ is uniformly integrable. Now, since: $i)$ $\bm{y}_{n}$ converges in total variation (and, hence, in distribution); $ii)$ $h$ is a continuous mapping; $iii)$ the sequence $\left|\left|h(\bm{y}_{n})\right|\right|^{2}$ is uniformly integrable, we can conclude that the expectation in (93) is well-defined, and that the first convergence of expectations in (47) holds true [60]. We switch to the analysis of the one-lag matrix function. First, in view of (24), we have:

[TABLE]

since we have proved that $\mathcal{F}_{0}(n)\rightarrow\mathcal{F}_{0}$ . Second, we want to show that (49) holds true. By applying the definitions in (20) and (21) to (27), we get:

[TABLE]

where we have defined the quantity:

[TABLE]

In light of (31), from (95) we can also write:

[TABLE]

Since uniform integrability is not impaired by the presence of the indicator, and since the probability that $\bm{y}_{k}\notin\mathcal{Z}$ is equal to one, from the previous analysis it is clear that:

[TABLE]

Therefore, we need to establish that the second term on the RHS in (97) converges to zero almost surely. To this aim, let us consider the joint process $\bm{v}_{n}\in\mathbb{R}^{2N}$ :

[TABLE]

Since $\bm{y}_{n}$ and $\bm{x}_{n+1}$ are statistically independent, and since the vectors $\bm{x}_{n+1}$ are i.i.d. across time (with a certain measure $\pi_{x}$ ), it is immediately seen that $\bm{v}_{n}$ admits a unique invariant measure, given by the product measure $\pi_{v}=\pi_{x}\times\pi_{y}$ , between the invariant measure of $\bm{y}_{n}$ and the (stationary) measure of $\bm{x}_{n+1}$ . Moreover, for any $n$ we obvisouly have that $\pi_{v_{n}}=\pi_{x}\times\pi_{y_{n}}$ , where $\pi_{y_{n}}=\nu P^{n}$ is the distribution of $\bm{y}_{n}$ , for a certain initial distribution $\nu$ . Using then (92), we can conclude that:

[TABLE]

Therefore, we can apply the ergodic theorem [55, 57] to $\bm{v}_{n}$ . In particular, we can observe that:

[TABLE]

provided that the latter expectation, computed under the invariant distribution, exists. However, from (96) we can write:

[TABLE]

since integrability of the pertinent random variables has been already established in Lemma 1. Moreover, since we have also established that $\omega(\bm{y}_{n})$ and $h(\bm{y}_{n})$ are $L_{2}$ -integrable, with integrals that are bounded by a constant independent of $n$ , joint application of Skorohod’s representation theorem and by Fatou’s lemma [60] imply that also the expectation $\mathbb{E}_{\pi_{y}}[\omega(\bm{y})h(\bm{y})^{\top}]$ (i.e., computed under the invariant distribution $\pi_{y}$ ) exists. But since the invariant distribution $\pi_{v}$ has the product form $\pi_{x}\times\pi_{y}$ , we can write:

[TABLE]

which applied in (101) yields the desired claim.

In order to conclude the proof of the proposition, we have to prove invertibility of $\mathcal{F}_{0}$ . To this aim, let us consider an initial state $\bm{y}_{0}$ distributed according to the invariant measure $\pi_{y}$ . In view of (12), we have:

[TABLE]

and we observe that the state $\bm{y}_{1}$ will be still distributed according to $\pi_{y}$ due to invariance, implying that we can write

[TABLE]

Therefore, the fact that $\mathcal{F}_{0}$ is invertible follows by applying to the random variable in (104) the reasoning used to prove Proposition 2. ∎

Bibliography61

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] A. Barrat, M. Barthélemy, and A. Vespignani, Dynamical Processes on Complex Networks . Cambridge University Press, 2012.
2[2] H. Salami, B. Ying, and A. H. Sayed, “Social learning over weakly connected graphs,” IEEE Trans. Signal Inf. Process. Netw. , vol. 3, no. 2, pp. 222–238, Jun. 2017.
3[3] C. J. Honey, O. Sporns, L. Cammoun, X. Gigandet, J. P. Thiran, R. Meuli, and P. Hagmann, “Predicting human resting-state functional connectivity from structural connectivity,” Proceedings of the National Academy of Sciences , vol. 106, no. 6, pp. 2035–2040, Feb. 2009.
4[4] S. Wang, E. D. Herzog, I. Z. Kiss, W. J. Schwartz, G. Bloch, M. Sebek, D. Granados-Fuentes, L. Wang, and J.-S. Li, “Inferring dynamic topology for decoding spatiotemporal structures in complex heterogeneous networks,” Proceedings of the National Academy of Sciences , vol. 115, no. 37, pp. 9300–9305, Sep. 2018.
5[5] F. Alderisio, G. Fiore, and M. di Bernardo, “Reconstructing the structure of directed and weighted networks of nonlinear oscillators,” Phys. Rev. E , vol. 95, no. 4, pp. 042302-1–042302-6, Apr. 2017.
6[6] L. J. S. Allen, “Some discrete-time SI, SIR, and SIS epidemic models,” Mathematical Biosciences , vol. 124, no. 1, pp. 83–105, Nov. 1994.
7[7] M. J. Keeling and K. T. D. Eames, “Networks and epidemic models,” Journal of the Royal Society Interface , vol. 2, pp. 295–307, Jun. 2005.
8[8] X. Wan, J. Liu, K.-W. Cheung, and T. Tong, “Inferring epidemic network topology from surveillance data,” Plo S one , vol. 9, no. 6, pp. e 100661-1–e 100661-9, Jun. 2014.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Taxonomy

Topology Inference over Networks with Nonlinear Coupling

Abstract

Index Terms:

I Introduction

II Main Results

III Related Work

III-A Learning Graphical Models

III-B Graph Learning with Linear Dynamics

III-C Graph Learning with Nonlinear Dynamics

IV Nonlinear model

V Generalized Granger Estimator

Lemma 1** (Generalized Granger)**

Proof:

Remark 1

V-A Existence of F0(n)\mathcal{F}_{0}(n)F0​(n) and F1(n)\mathcal{F}_{1}(n)F1​(n)

Assumption 1** (Integrability conditions)**

Assumption 2** (Regularity of the nonlinearities)**

Proposition 1** (Existence of F0(n)\mathcal{F}_{0}(n)F0​(n) and F1(n)\mathcal{F}_{1}(n)F1​(n))**

Proof:

V-B Invertibility of F0(n)\mathcal{F}_{0}(n)F0​(n)

Assumption 3** (Non-singularity of response hhh)**

Proposition 2** (Invertibility of F0(n)\mathcal{F}_{0}(n)F0​(n))**

Proof:

VI Consistent Graph Learning

Proposition 3** (Consistent Graph Learning)**

Proof:

VII Illustrative Examples

VIII Beyond the Theoretical Results

VIII-A Regularization of the Weighting Function ω(y)\omega(y)ω(y)

VIII-B Partial Observation Setting

Appendix A Proof of Proposition 1

Lemma 2

Proof:

Remark 2

Lemma 3

Proof:

Lemma 4

Proof:

Proof:

Appendix B Proof of Proposition 2

Proof:

Appendix C Proof of Proposition 3

Proof:

Lemma 1 (Generalized Granger)

V-A Existence of $\mathcal{F}_{0}(n)$ and $\mathcal{F}_{1}(n)$

Assumption 1 (Integrability conditions)

Assumption 2 (Regularity of the nonlinearities)

Proposition 1 (Existence of $\mathcal{F}_{0}(n)$ and $\mathcal{F}_{1}(n)$ )

V-B Invertibility of $\mathcal{F}_{0}(n)$

Assumption 3 (Non-singularity of response $h$ )

Proposition 2 (Invertibility of $\mathcal{F}_{0}(n)$ )

Proposition 3 (Consistent Graph Learning)

VIII-A Regularization of the Weighting Function $\omega(y)$