Approximate Identification of the Optimal Epidemic Source in Complex Networks

S. Jalil Kazemitabar; Arash A. Amini

arXiv:1906.03052·cs.SI·June 13, 2019

TL;DR

This paper develops a statistical framework to accurately identify the source of an epidemic in complex networks, outperforming traditional methods especially in social and community-rich networks.

Contribution

It introduces a Bayesian optimal solution and two scalable algorithms for epidemic source detection applicable to general network topologies.

Findings

1. The proposed methods outperform geometric and spectral measures on real social networks.

2. The algorithms are scalable to large sparse networks.

3. Performance is robust across networks with high transitivity and community structure.

Tables

Table 1: Network statistics

Network	$n$	mean deg.	max deg.	clust. coeff.
Internet	10670	4.0	2312	0.01
Power	4941	3.0	19	0.1
Wiki vote	7066	29.0	1065	0.13
UCSC68	8979	50.0	454	0.17
UC64	6810	46.0	660	0.19
DC-SBM	1962	66.0	897	0.3

Equations

$$\rho_{I\to O}:=\mathbb{P}\big(\text{$O$ is infected before $O^{c}$}\mid\text{$I$ is infected}\big).$$

$$p_{i}:=\mathbb{P}(\bm{i}_{*}=i\mid O)=\frac{\rho_{i\to O}}{\sum_{j\in O}\rho_{j\to O}}\,1\{i\in O\}.$$

$$\bm{\hat{\sigma}}^{*}:=\operatorname*{argmin}_{\sigma:[n]\to[n]}\mathbb{E}[\sigma(\bm{i}_{*})\mid O].$$

$$\rho_{I\to O}=\sum_{j\in O\setminus I}\frac{\operatorname{vol}(I,j)}{\operatorname{vol}(I,I^{c})}\,\rho_{I\cup j\to O}$$

$$\rho_{I\to O}=\sum_{j\in O\setminus I}\rho_{I\to O\setminus j}\,\frac{\operatorname{vol}(O\setminus j,j)}{\operatorname{vol}(O\setminus j,(O\setminus j)^{c})}.$$

$$\mathbb{P}\big(\text{path $\sigma$ observed}\mid\sigma_{1}=\bm{i}_{*}\big)=\prod_{k=1}^{|\sigma|-1}\frac{\operatorname{vol}(\sigma_{[k]},\sigma_{k+1})}{\operatorname{vol}(\sigma_{[k]},\sigma_{[k]}^{c})}$$

$$\mathbb{P}(\widetilde{O}_{k-1}=O\setminus j\mid\widetilde{O}_{k}=O)\propto\rho_{O\setminus j\to O}\cdot\mathbb{P}(\widetilde{O}_{k-1}=O\setminus j).$$

$$\widehat{\rho}_{I\to O}=\alpha_{0}\prod_{j\in O}b_{j}^{x_{j}^{I}-1}$$

$$\operatorname{vol}(I,I^{c})\,\widehat{\rho}_{I\to O}-\sum_{j\in O\setminus I}\operatorname{vol}(I,j)\,\widehat{\rho}_{I\cup\{j\}\to O}=0.$$

$$\widehat{\bm{b}}\;\in\;\operatorname*{argmin}_{\bm{b}}\sum_{I:\,I\subset O}\Big(\operatorname{vol}(I,I^{c})-\sum_{j\in O\setminus I}\operatorname{vol}(I,j)\,b_{j}\Big)^{2}=\operatorname*{argmin}_{\bm{b}}\|Q\bm{b}-\bm{r}\|_{2}^{2}$$

$$Q_{I,j}$$

$$\begin{aligned}S&=\Xi\big(A_{OO}+A_{OO}^{2}-A_{OO}\odot(\bm{u}\mathbf{1}^{T}+\mathbf{1}\bm{u}^{T})+\bm{u}\bm{u}^{T}\big)\in\mathbb{R}^{|O|\times|O|},\\ \bm{z}&=(\mathbf{1}^{T}\bm{u}+2\mathbf{1}^{T}\bm{v})\bm{u}-2\bm{v}\odot\bm{u}+2(A_{OO}\,\bm{v}+\bm{u})\end{aligned}$$

$$I^{*}_{\text{MAP}}=\operatorname*{argmax}_{I\subset O,\;|I|=s}\rho_{I\to O}$$

$$\mathbb{P}\Big(T_{i}<\min_{j\neq i}T_{j}\Big)=\frac{\beta_{i}}{\sum_{j}\beta_{j}}.$$

$$\rho_{I\to O}=\sum_{j\in O\setminus I}\rho_{I\to I\cup j}\cdot\rho_{I\cup j\to O}$$

$$\rho_{I\to I\cup j}=\frac{\beta\operatorname{vol}(I,j)}{\sum_{j'}\beta\operatorname{vol}(I,j')}=\frac{\operatorname{vol}(I,j)}{\operatorname{vol}(I,I^{c})}.$$

$$S_{jj'}$$

$$z_{j}$$

$$\qquad+2\operatorname{vol}\big(\operatorname{adj}_{O}(j),(O\setminus j)^{c}\big).$$

$$\begin{aligned}(Q^{T}\bm{r})_{j}&=\sum_{I\subset O\setminus\{j\}}\operatorname{vol}(I,I^{c})\cdot\operatorname{vol}(I,j)\\ &=\sum_{I\subset O\setminus\{j\}}\sum_{i,k,r}A_{ik}A_{rj}\,1\{i\in I,\,k\notin I,\,r\in I\}\\ &=\sum_{i,k,r}A_{ik}A_{rj}\,\gamma_{ikr}\end{aligned}$$

$$\gamma_{ikr}:=\sum_{I\subset O\setminus\{j\}}1\{i\in I,\,k\notin I,\,r\in I\}$$

$$\gamma_{ikr}=\begin{cases}2^{|O|-4}&i\neq r,\ k\in O_{\setminus j}\\ 2^{|O|-3}&i=r,\ k\in O_{\setminus j}\\ 2^{|O|-3}&i\neq r,\ k\notin O_{\setminus j}\\ 2^{|O|-2}&i=r,\ k\notin O_{\setminus j}\\ 0&\text{otherwise}\end{cases}$$

$$\begin{aligned}(Q^{T}\bm{r})_{j}&=\cdots+2^{|O|-3}\big(1+1\{i=r\}\big)1\{k\notin O_{\setminus j}\}\big]\\ &=2^{|O|-4}\sum_{i,r}d_{O\setminus\{j,r\}}(i)\,A_{rj}\big(1+1\{i=r\}\big)\\ &\quad+2^{|O|-3}\sum_{i,r}d_{(O\setminus j)^{c}}(i)\,A_{rj}\big(1+1\{i=r\}\big)\end{aligned}$$

$$\sum_{r}d_{O\setminus\{j,r\}}(i)\,A_{rj}=d_{O_{\setminus j}}(i)\,d_{O_{\setminus j}}(j)-\operatorname{vol}^{(2)}_{O_{\setminus j}}(i,j)$$

$$\begin{aligned}\sum_{i,r}d_{O\setminus\{j,r\}}(i)\,A_{rj}\big(1+1\{i=r\}\big)&=\sum_{i}d_{O_{\setminus j}}(i)\,d_{O}(j)\\ &=\operatorname{vol}(O_{\setminus j})\,d_{O}(j)\end{aligned}$$

$$\sum_{i\in A}\operatorname{vol}^{(2)}_{A}(i,j)=\sum_{i\in A}\sum_{r\in A}A_{ir}A_{rj}=\sum_{r\in A}d_{A}(r)\,A_{rj}$$

$$\sum_{i,r}d_{(O\setminus j)^{c}}(i)\,A_{rj}\big(1+1\{i=r\}\big)$$


Full text

Approximate Identification of the Optimal Epidemic Source in Complex Networks

S. Jalil Kazemitabar

Department of Statistics

University of California, Los Angeles

Arash A. Amini

Department of Statistics

University of California, Los Angeles

Abstract

We consider the problem of identifying the source of an epidemic, spreading through a network, from a complete observation of the infected nodes in a snapshot of the network. Previous work on the problem has often employed geometric, spectral or heuristic approaches to identify the source, with the trees being the most studied network topology. We take a fully statistical approach and derive novel recursions to compute the Bayes optimal solution, under a susceptible-infected (SI) epidemic model. Our analysis is time and rate independent, and holds for general network topologies. We then provide two tractable algorithms for solving these recursions, a mean-field approximation and a greedy approach, and evaluate their performance on real and synthetic networks. Real networks are far from tree-like and an emphasis will be given to networks with high transitivity, such as social networks and those with communities. We show that on such networks, our approaches significantly outperform geometric and spectral centrality measures, most of which perform no better than random guessing. Both the greedy and mean-field approximation are scalable to large sparse networks.

1 Introduction

Modern transportation networks have had profound effects on the geographical spread of infectious diseases [1, 2], giving rise to complicated epidemic evolutions [3]. These evolutions can be modeled as dynamic processes on transportation networks. Epidemic spread on networks can take other forms, such as outbreaks of foodborne diseases [4], intercontinental cascades of failures among financial institutions [5, 6], computer malware propagation on the internet and mobile networks [7, 8], and the spread of targeted fake news [9, 10] and rumors [11] on social media, especially during presidential elections [12, 13, 14]. In response to an adverse diffusion on a network, it is critical to trace back sources to enable appropriate prevention and containment of the spread [15]. Inferential methods have been developed to locate the source of foodborne diseases [16, 17] and influenza pandemics [18, 19]. In the context of online social networks, the spread of misinformation can be limited by the identification of influential users [20, 21]. Source recovery can also be used to assess the power of diffusions in generating anonymity in network protocols [22].

The epidemic source identification problem has received considerable attention in the past decade. Given a snapshot of the infected nodes in a network, the task is to discover which node originated the epidemic. Since the seminal work of Shah and Zaman [23], numerous attempts have been made to address the question and its extensions [24, 25, 26, 27, 28, 29]. By now, there are multiple methods that show satisfactory results in limited experimental setups or have proven guarantees in restricted network topologies [30]. However, identifying the source under general conditions remains a difficult task. Even under fairly simple models such as the Independent Cascade (IC) dynamics, the problem of optimal recovery appears to be NP-hard in the infection size [31]. The theoretical guarantees for optimal and consistent recovery are restricted to regular infinite trees [23, 26], and as we show in this paper, the popular and well-cited methods are quite unreliable in a wide range of real networks.

Source identification has remained largely unsolved and poorly understood for real complex networks. As we will show through experiments in Section 5, in real networks, even the optimal Bayes estimator applied to small infected sets has difficulty narrowing down to the true source. It is thus important to recover as much information from the likelihood of the model as possible. We develop techniques for computing the full likelihood of the infection, as opposed to identifying the most likely sample-path [26]. Moreover, we fully exploit the information from the boundary of the infection set, in addition to the structure inside the infected subgraph. This idea has been pointed out before [32], but has been mostly neglected by subsequent work; cf. [29, 24]. We develop all these ideas without restricting the structure of the network to trees. Our framework also easily extends to the case where there are multiple infecting sources (Appendix A).

In this paper, we develop statistical algorithms that outperform the state-of-the-art in a wide range of network topologies. Our contributions are distinct in several ways:

1. Our methods are parameter-free, meaning that they do not require knowing the duration of the epidemic or how fast it grows.

2. We show that the exact maximum likelihood estimator (MLE) of the source, or equivalently the Bayes optimal solution under a uniform prior, can be written as a dynamic program (DP), with easily computable coefficients based on the adjacency matrix of the network.

3. We develop two schemes to approximate the DP: an efficient greedy elimination (GE), and a novel mean-field approximation (MFA) of the likelihood, computed by solving a linear system.

4. Our approximations are more disciplined than existing approaches. They do not impose restrictions on the topology of the network, nor do they appeal to the partial likelihood of the candidate infecting sets. This is in contrast to the use of spanning trees to deal with general topologies [23] or the path-based approaches that rely on the likelihood of individual paths from potential sources to the infected set [26].

We will show that when applied to real networks, both approximation schemes (MFA and GE) outperform various geometric and spectral approaches, most of which perform no better than random guessing. We also show that even for basic models of real networks, e.g., models with community structure, most existing methods dramatically fail. The improvement in performance is most significant for the networks with many cycles, including social networks that are known to have high transitivity. In terms of computational efficiency, both the greedy and mean-field approximations are superior to the state-of-the-art likelihood-based and spectral approaches and comparable to centrality-based methods. In addition, the mean-field algorithm is easily parallelizable through standard linear algebraic routines and can be used to tackle very large-scale epidemics on real networks.

Related work.

Most of the existing literature on the source identification problem is based on SIR dynamics, where the infection spreads to a susceptible node at an exponential rate proportional to the number of its infected neighbors. All nodes are susceptible to the infection and, once infected, may recover at a fixed exponential rate [33]. Moreover, infections spreading along different edges are mutually independent. Variations of SIR may assume that no recovery is possible (SI) or that recovered nodes are not immune to reinfection (SIS).

Shah and Zaman [23] considered the SI dynamics and proposed the Rumor Centrality (RC), which counts the permitted permutations, a.k.a. infection paths, inside the infected subgraph. Their linear-time algorithm is an optimal estimator in regular trees and enjoys strong theoretical properties in such idealized settings [34]. Zhou and Ying [26] consider SIR dynamics on a tree and show that the most likely infection path is rooted at a Jordan center (JC) of the infected set $O$ , that is, a node with minimum eccentricity, where the eccentricity of a node is its maximum distance to the other nodes of $O$ . It has been shown [26, 34] that in regular trees, eccentricity ranking generates, with high probability, a confidence set containing the true source, whose size does not grow with the infection size.

Dynamic Message Passing (DMP) was proposed in [25] as an approximation of the maximum likelihood estimator in discrete SIR epidemics, by assuming that the marginal probabilities of infection for each node are independent. Despite compelling performance, DMP is computationally intensive and impractical for large networks with moderately dense structures, even for small infection sets. A spectral algorithm called Dynamical Age (DA) was introduced in [24], based on measuring the sensitivity of the maximum eigenvalue of the Laplacian matrix to the elimination of each node in the infection set. The algorithm was mainly developed to discover the initial node in a growing preferential attachment model. Another spectral method for the discrete SI model is proposed in [29].

2 Source detection in SI epidemics

We consider a continuous-time susceptible-infected (SI) epidemic [33] with rate of infection $\beta$ , on a static undirected network $G(V,E)$ with known edge set $E$ and $V=[n]$ . At time zero, all nodes but the source are in the susceptible state. Infection is a terminal state and susceptible nodes are exposed to the infection at an exponential rate proportional to the number of their infected neighbors. More precisely, given that nodes $I$ are infected at some time $t$ , we run exponential clocks $T_{j}\sim\operatorname{Exp}(\beta\operatorname{vol}(I,j))$ for all $j\in I^{c}$ and the first to expire determines the next infected node: if $\bm{j}^{*}=\operatorname*{argmin}_{j}T_{j}$ , then the dynamics move to the infected set $I\cup\{\bm{j}^{*}\}$ at time $t+T_{\bm{j}^{*}}$ . On a connected graph, the contagion eventually spreads to every node.
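The clock mechanism just described can be simulated directly. The following is a minimal sketch, not the authors' code: the function name `simulate_si`, the stopping rule (a target number of infected nodes rather than a fixed time), and the 4-cycle example graph are illustrative assumptions.

```python
import numpy as np

def simulate_si(A, source, size, beta=1.0, seed=0):
    """Run continuous-time SI dynamics until `size` nodes are infected.

    Returns the infection order and the elapsed time. Stopping at a target
    size (instead of a fixed time t) is an assumption made for illustration.
    """
    rng = np.random.default_rng(seed)
    A = np.asarray(A)
    infected = np.zeros(A.shape[0], dtype=bool)
    infected[source] = True
    order, t = [source], 0.0
    while len(order) < size:
        # vol(I, j): number of infected neighbours of every node j
        vol_to = A[infected].sum(axis=0)
        candidates = np.flatnonzero(~infected & (vol_to > 0))
        if candidates.size == 0:  # the source's component is exhausted
            break
        # One exponential clock per boundary node, with rate beta * vol(I, j)
        clocks = rng.exponential(1.0 / (beta * vol_to[candidates]))
        k = int(np.argmin(clocks))  # first clock to expire
        t += float(clocks[k])
        j = int(candidates[k])
        infected[j] = True
        order.append(j)
    return order, t

# Example graph (hypothetical): a cycle on four nodes.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])
order, t = simulate_si(A, source=0, size=3)
```

The set `set(order)` plays the role of the observed snapshot $O$ ; the ordering itself is hidden from the estimator.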

The infection source or patient zero, denoted as $\bm{i}_{*}$ , is unknown. What we observe is a snapshot of the contagion at some time $t$ , meaning the entire set of infected nodes at that time, which we denote by $O$ . The objective is to find $\bm{i}_{*}\in O$ or form a confidence set for $\bm{i}_{*}$ with a desired false exclusion probability. Our focus here will be on the single-source setting, but the analysis extends to the multi-source setting (cf. Appendix A).

Notation. We write $A\in\{0,1\}^{n\times n}$ for the adjacency matrix of the network and $\operatorname{vol}(I,J):=\sum_{i\in I,j\in J}A_{ij}$ for the volume of a cut in the network between subsets $I,J\subset[n]$ of nodes. For singleton subsets, we often drop the braces, e.g., $\operatorname{vol}(I,j):=\operatorname{vol}(I,\{j\})$ and $O\setminus j=O\setminus\{j\}$ .

2.1 Time and rate invariant analysis

We start by examining the probability of observing a particular set of infected nodes given a starting source. Let us introduce a parameter-free formulation of the problem (i.e. not dependent on rate $\beta$ and time $t$ ) that will be the foundation for our analysis of the continuous SI dynamics. The idea has been introduced in [32]. We generalize it to multiple sources and find forward and backward dynamic programming formulations that allow us to derive efficient approximations.

Suppose that, at some point in time, the infection reaches $I\subset[n]$ . Let $O\subset[n]$ be some superset of $I$ . We are interested in computing $\rho_{I\to O}$ , the chance that all the nodes in $O$ are infected before any node outside. More precisely, let

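$$\rho_{I\to O}:=\mathbb{P}\big(\text{$O$ is infected before $O^{c}$}\mid\text{$I$ is infected}\big).\tag{1}$$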

We refer to $\rho_{I\to O}$ as the transition probabilities. Note that these transition probabilities are independent of the infection source. Given that in a snapshot of the contagion, nodes $I$ are infected, $\rho_{I\to O}$ determines how likely it is that in some future snapshot, $O$ is the set of infected nodes. The Markov property of (continuous-time) SI dynamics allows us to define $\rho_{I\to O}$ without reference to the source, or the time of the first snapshot. We will also show that these probabilities do not depend on the infection rate or the time of the second snapshot.

2.2 Statistical Inference

Given the observed (random) infected set $O$ , the function $I\mapsto\rho_{I\to O}$ is the likelihood of the model. Writing $L_{O}(I):=\rho_{I\to O}$ for this likelihood, we observe that $L_{O}(I)=0$ for all $I$ not contained in $O$ . So, we can restrict $L_{O}(\cdot)$ to the subsets of $O$ . When dealing with the single-source setup, we restrict the parameter space to $I=\{i\}$ and with some abuse of notation write $\rho_{i\to O}$ for $\rho_{\{i\}\to O}$ , and $L_{O}(i)=\rho_{i\to O},i\in[n]$ for the likelihood.

We can further consider a Bayesian setup by putting a uniform prior on the source (i.e., uniform over $[n]$ ). The Bayesian setup allows us to consider various notions of optimality by changing the loss function. Letting $\bm{i}_{*}$ be the random initial source, we have a joint distribution on $(\bm{i}_{*},O)$ . Then the posterior probability that the source is $i$ , given that we observed infected nodes $O$ is

$$p_{i}:=\mathbb{P}(\bm{i}_{*}=i\mid O)=\frac{\rho_{i\to O}}{\sum_{j\in O}\rho_{j\to O}}\,1\{i\in O\}.$$

Therefore, the maximum a posteriori (MAP) estimate of the source is $\bm{i}^{*}_{\text{MAP}}=\operatorname*{argmax}_{i}\,\rho_{i\to O}$ , which minimizes the probability of error. That is, $\bm{i}^{*}_{\text{MAP}}$ minimizes $\mathbb{P}(\bm{\hat{i}}\neq\bm{i}_{*})$ among all estimators $\bm{\hat{i}}=\bm{\hat{i}}(O)$ . In some applications, the graph geodesic distance $d_{G}$ to the source determines the error of estimation. In that case, the Bayes optimal estimator is $\bm{i}^{*}_{\text{dist}}=\operatorname*{argmin}_{i}\sum_{j\in O}d_{G}(i,j)\,\rho_{j\to O}.$ It is not hard to see that $\bm{i}^{*}_{\text{dist}}$ minimizes $\mathbb{E}[d_{G}(\bm{\hat{i}},\bm{i}_{*})]$ among all possible estimators $\bm{\hat{i}}$ .

A third choice is to output a ranking instead of a single source. In this case an estimator is formally a permutation $\bm{\hat{\sigma}}=\bm{\hat{\sigma}}_{O}$ on $[n]$ , suppressing the dependence on $O$ for simplicity. We can then consider the rank loss $\ell(\bm{\hat{\sigma}},\bm{i}_{*})=\bm{\hat{\sigma}}(\bm{i}_{*})$ , and we call the associated risk the expected (source) rank $=\mathbb{E}\bm{\hat{\sigma}}(\bm{i}_{*})$ . The corresponding optimal Bayes estimator is obtained by minimizing the posterior risk:

$$\bm{\hat{\sigma}}^{*}:=\operatorname*{argmin}_{\sigma:[n]\to[n]}\mathbb{E}[\sigma(\bm{i}_{*})\mid O].$$

Noting that $\mathbb{E}[\sigma(\bm{i}_{*})\mid O]=\sum_{i}\sigma(i)\,p_{i}$ , the optimal estimator in this case is the ranking that sorts $p_{i}$ into descending order, i.e., $\bm{\hat{\sigma}}^{*}(j_{i})=i$ where $p_{j_{1}}\geq p_{j_{2}}\geq\cdots\geq p_{j_{n}}$ .
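To make the posterior and the rank estimator concrete, here is a small sketch that normalizes given likelihood values into $p_{i}$ and sorts them into a ranking. The function name `posterior_and_ranking` and the three likelihood values are hypothetical, chosen only for illustration; nodes outside $O$ have posterior zero and are left out.

```python
import numpy as np

def posterior_and_ranking(likelihood):
    """Map likelihoods {i: rho_{i->O}} to posteriors p_i and ranks sigma(i)."""
    nodes = sorted(likelihood)
    rho = np.array([likelihood[i] for i in nodes], dtype=float)
    p = rho / rho.sum()                    # the uniform prior cancels out
    order = np.argsort(-p, kind="stable")  # indices by descending posterior
    rank = {nodes[idx]: r + 1 for r, idx in enumerate(order)}
    posterior = {i: float(q) for i, q in zip(nodes, p)}
    return posterior, rank

# Hypothetical likelihood values for three infected nodes.
posterior, rank = posterior_and_ranking({0: 0.1, 1: 0.3, 2: 0.2})
map_node = min(rank, key=rank.get)  # rank 1 coincides with the MAP estimate
# Posterior risk of this ranking under the rank loss: sum_i sigma(i) * p_i
posterior_expected_rank = sum(rank[i] * posterior[i] for i in posterior)
```

Here the ranking is 1, 2, 0 and the posterior expected rank is $1\cdot\tfrac12+2\cdot\tfrac13+3\cdot\tfrac16=\tfrac53$ ; any other ordering of the three nodes gives a larger value.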

Remark 1.

The distance loss might be suitable in some applications, but in general it is a poor measure if the goal is to reveal the actual source. This is especially true in small-world networks, including most social networks, where the expected distance between any pair of nodes is small. At the other extreme, in terms of the precision in recovering the source, is the zero-one loss, which is too stringent. The rank loss can be considered a more robust version of the zero-one loss, and it will be our main evaluation measure.

3 Exact likelihood computation

The Bayesian estimators introduced in Section 2.2 require us to evaluate the posterior probabilities $(p_{i})$ , or equivalently the likelihood values $\rho_{j\to O}$ for all $j\in O$ . The main difficulty of the source identification problem is that computing the likelihood is itself challenging. We now develop exact equations that allow us to recursively compute the likelihood values $L_{O}(I)$ for all subsets $I\subset O$ .

Dynamic programming.

To begin, note that $\rho_{O\to O}=1$ for any $O\subset[n]$ . In addition, $\rho_{I\to O}=1$ for any nonempty $I\subset O$ whenever $O$ is a connected component of $G$ . We develop two dynamic programming expressions for $\rho_{I\to O}$ for general $I\subset O$ :

Proposition 1.

For $I\subset O\subset[n]$ , the probabilities $\rho_{I\to O}$ defined in (1) satisfy the forward program

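$$\rho_{I\to O}=\sum_{j\in O\setminus I}\frac{\operatorname{vol}(I,j)}{\operatorname{vol}(I,I^{c})}\,\rho_{I\cup j\to O}\tag{2}$$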

and the backward program

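$$\rho_{I\to O}=\sum_{j\in O\setminus I}\rho_{I\to O\setminus j}\,\frac{\operatorname{vol}(O\setminus j,j)}{\operatorname{vol}(O\setminus j,(O\setminus j)^{c})}.\tag{3}$$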

In the forward program (2), $j$ effectively iterates over the boundary of $I$ in $O$ , as $\operatorname{vol}(I,j)=0$ if $j$ is outside that boundary. Therefore, the running time of the forward program benefits from the sparsity of the network. In contrast, the iteration over $j$ in the backward program (3) cannot be restricted to a smaller set. A corollary of Proposition 1 is that the transition probabilities $\rho_{I\to O}$ are not affected by the rate and the duration of the infection.
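A memoized version of the forward recursion can be sketched as follows. This is an illustrative implementation under our own naming (`exact_likelihoods`, `vol`, `rho`), not the authors' code, and it is only practical for small $|O|$ since it may visit exponentially many subsets.

```python
from functools import lru_cache
import numpy as np

def exact_likelihoods(A, O):
    """Return {i: rho_{i->O}} for every i in O via the forward recursion."""
    A = np.asarray(A)
    O = frozenset(O)
    everything = frozenset(range(A.shape[0]))

    def vol(I, J):
        # vol(I, J) = sum of A_ij over i in I, j in J
        return int(A[np.ix_(sorted(I), sorted(J))].sum())

    @lru_cache(maxsize=None)
    def rho(I):
        if I == O:
            return 1.0  # base case: rho_{O->O} = 1
        cut = vol(I, everything - I)
        if cut == 0:
            return 0.0  # the infection cannot leave I, so O is unreachable
        total = 0.0
        for j in O - I:
            w = vol(I, {j})
            if w:  # only boundary nodes of I inside O contribute
                total += w / cut * rho(I | {j})
        return total

    return {i: rho(frozenset({i})) for i in O}

# Example (hypothetical): path graph 0-1-2-3 with observed set O = {0, 1, 2}.
A = np.zeros((4, 4), dtype=int)
for a, b in [(0, 1), (1, 2), (2, 3)]:
    A[a, b] = A[b, a] = 1
L = exact_likelihoods(A, {0, 1, 2})
```

On this example the likelihoods are 1, 3/4 and 1/4 for nodes 0, 1 and 2: an infection of size three started at the end node 0 can only produce $\{0,1,2\}$ , whereas one started at node 2 escapes to node 3 with probability 3/4.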

Let us now observe some connection with the path-based analysis. A permitted permutation or an infection path starting at a node $\bm{i}_{*}$ , refers to a permutation $\sigma$ of nodes with $\sigma_{1}=\bm{i}_{*}$ , and such that $\sigma_{k+1}$ is connected to at least one node in $\{\sigma_{1},\ldots,\sigma_{k}\}$ , for all $k\in[|\sigma|-1]$ . Notice that the probability of observing a given infection path is

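$$\mathbb{P}\big(\text{path $\sigma$ observed}\mid\sigma_{1}=\bm{i}_{*}\big)=\prod_{k=1}^{|\sigma|-1}\frac{\operatorname{vol}(\sigma_{[k]},\sigma_{k+1})}{\operatorname{vol}(\sigma_{[k]},\sigma_{[k]}^{c})}\tag{4}$$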

where $\sigma_{I}:=(\sigma_{i}\mid i\in I)$ . As noted in [32], one can obtain the transition probability $\rho_{\{\bm{i}_{*}\}\to O}$ by summing (4) over all infection paths $\sigma$ such that $\sigma_{1}=\bm{i}_{*}$ and $\{\sigma_{1},\ldots,\sigma_{|O|}\}=O$ . Our recursive representation is novel, avoids these explicit summations, and will be key in deriving approximation schemes for $\rho_{I\to O}$ in Section 4.
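As a small worked instance of this path sum (our own illustration, not taken from the paper), consider the path graph with edges 0-1, 1-2, 2-3, the observed set $O=\{0,1,2\}$ , and the candidate source 1. Exactly two infection paths starting at node 1 cover $O$ :

```latex
\begin{aligned}
\mathbb{P}\big(\sigma=(1,0,2)\big)
  &= \frac{\operatorname{vol}(\{1\},0)}{\operatorname{vol}(\{1\},\{1\}^{c})}
     \cdot\frac{\operatorname{vol}(\{0,1\},2)}{\operatorname{vol}(\{0,1\},\{0,1\}^{c})}
   = \frac{1}{2}\cdot\frac{1}{1} = \frac{1}{2},\\
\mathbb{P}\big(\sigma=(1,2,0)\big)
  &= \frac{\operatorname{vol}(\{1\},2)}{\operatorname{vol}(\{1\},\{1\}^{c})}
     \cdot\frac{\operatorname{vol}(\{1,2\},0)}{\operatorname{vol}(\{1,2\},\{1,2\}^{c})}
   = \frac{1}{2}\cdot\frac{1}{2} = \frac{1}{4},\\
\rho_{1\to O} &= \frac{1}{2}+\frac{1}{4} = \frac{3}{4}.
\end{aligned}
```

The same computation gives $\rho_{0\to O}=1$ (single path $(0,1,2)$ with both factors equal to one) and $\rho_{2\to O}=\tfrac12\cdot\tfrac12=\tfrac14$ (single path $(2,1,0)$ ), so the likelihood ranks node 0 first. Note how the boundary edge 2-3 enters only through the denominators $\operatorname{vol}(\cdot,\cdot^{c})$ .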

Path-based approaches such as Jordan center [26] forgo computing the complete likelihood (i.e., avoid summing the odds of all infection paths) and instead find the most probable path, that is, one that maximizes (4). In contrast, equations (2) and (3) compute the complete likelihood of the infection set, which has the following advantages over the path-based likelihood: It fully exploits the structure of the graph inside the infection set, not just a spanning tree or a permitted permutation of nodes in the infected subgraph. Moreover, it takes into account the boundary of the infected subgraph via $\operatorname{vol}(I,I^{c})$ .

NP-hardness. The recursions we derived are a dynamic programming (DP) solution that computes the likelihood more efficiently, by avoiding the direct summation over all possible paths. More precisely, one can obtain the optimal likelihood values by solving recursion (2) backwards: Starting from $\rho_{O\to O}=1$ , we can determine $\rho_{I\to O}$ for all proper subsets $I\subset O$ of maximal size (i.e., $|I|=|O|-1$ ) and then all the proper subsets of those at the previous stage and so on. However, this DP procedure may still take $O(2^{|O|})$ time in the worst case. It is an interesting open question whether polynomial-time solutions for computing the likelihood or its maximum exist.

Although it has been shown in [23] that for infinite regular trees, computing the likelihood reduces to enumerating all the infection paths from $i$ to $O$ , and therefore a polynomial-time algorithm exists in that case, the behavior of the exact likelihood under more general network topologies is much more difficult to investigate. Researchers have implicitly assumed the problem to be hard beyond trees and have resorted to approximations that are often based on heuristics. In the next section, we propose disciplined approximation schemes that yield much more accurate results under realistic classes of networks.

4 Approximations

We now provide two approximations to the likelihood function $L_{O}(I)$ based on the exact dynamic programming developed in Proposition 1.

Greedy Elimination (GE).

We can obtain a singleton source set $I=\{i\}$ that maximizes $\rho_{I\to O}$ with greedy elimination of elements in $O$ . The algorithm we propose is based on the backward recursion (3) and is detailed in Algorithm 1. We start with $O_{0}:=O$ and consider all maximal proper subsets of $O_{0}$ that induce a connected subgraph of $G$ . Among those, we choose the one that maximizes the transition probability to $O_{0}$ , i.e. $\rho_{O_{0}\setminus j\to O_{0}}=\operatorname{vol}(O_{0}\setminus j,j)/\operatorname{vol}(O_{0}\setminus j,(O_{0}\setminus j)^{c})$ . Suppose that $O_{1}:=O_{0}\setminus j^{*}$ is the maximizer. Next, we iterate the same procedure for $O_{1}$ and so forth, until we reach a singleton set $I:=O_{|O|-1}$ . The procedure has an $O(k^{2}m)$ runtime where $k=|O|$ and $m$ is the number of edges in the infected subgraph, $G_{O}$ .
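The elimination loop described above can be sketched as follows. Algorithm 1 itself is not reproduced in this text, so the function names, the tie-breaking rule (lowest node index wins), and the error raised for a disconnected $G_{O}$ are our assumptions.

```python
import numpy as np

def is_connected(A, nodes):
    """Depth-first search check that `nodes` induce a connected subgraph."""
    nodes = list(nodes)
    if not nodes:
        return False
    allowed = set(nodes)
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        u = stack.pop()
        for v in np.flatnonzero(A[u]):
            v = int(v)
            if v in allowed and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(nodes)

def greedy_elimination(A, O):
    """Peel nodes off O one at a time; return (estimated source, removal order)."""
    A = np.asarray(A)
    current, removed = set(O), []
    while len(current) > 1:
        best_j, best_score = None, -1.0
        for j in sorted(current):
            rest = current - {j}
            if not is_connected(A, rest):
                continue  # O \ j must induce a connected subgraph
            idx = sorted(rest)
            into_j = A[idx, j].sum()                        # vol(O \ j, j)
            cut = A[idx].sum() - A[np.ix_(idx, idx)].sum()  # vol(O \ j, (O \ j)^c)
            score = into_j / cut                            # rho_{O \ j -> O}
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:
            raise ValueError("the subgraph induced by O must be connected")
        current.remove(best_j)
        removed.append(best_j)
    return next(iter(current)), removed

# Example (hypothetical): path graph 0-1-2-3 with observed set O = {0, 1, 2}.
A = np.zeros((4, 4), dtype=int)
for a, b in [(0, 1), (1, 2), (2, 3)]:
    A[a, b] = A[b, a] = 1
source, removed = greedy_elimination(A, {0, 1, 2})
```

On this example node 2 is eliminated first (score 1 against 1/2 for node 0), then node 1, leaving node 0 as the estimate.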

GE has a Bayesian justification. Let $\widetilde{O}_{k}$ be the random infected set after $k$ steps. Suppose that we want to find the MAP for $\widetilde{O}_{k-1}$ given $\widetilde{O}_{k}$ . The Bayesian posterior probability is

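$$\mathbb{P}(\widetilde{O}_{k-1}=O\setminus j\mid\widetilde{O}_{k}=O)\propto\rho_{O\setminus j\to O}\cdot\mathbb{P}(\widetilde{O}_{k-1}=O\setminus j).$$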

Whenever $G_{O\setminus j}$ is connected, the prior is positive. GE finds a proxy for the MAP by maximizing the evidence while ensuring that the prior is positive.

Algorithm 1 has similarities with finding the most likely path from a source to the observed snapshot. Chang et al. [32] propose a similar path-based search called GSBA. They start from each node in $O$ , approximate the most likely path, and use it as a proxy for the most likely source. Algorithm 1, however, performs this greedy search in a backward fashion.

Mean-field Approximation (MFA).

We now approximate $\rho_{I\to O}$ by the mean-field technique. The idea is to treat the set function $I\mapsto\rho_{I\to O}$ as if it were a distribution (or measure) on $O$ and approximate it by the product of its marginals. Fix a subset $O\subset[n]$ . For any $I\subset O$ , let $\bm{x}^{I}=(x^{I}_{j})_{j\in O}$ be the binary representation of $I$ , i.e. $x^{I}_{j}=1\{j\in I\}$ for any $j\in O$ . We find $\alpha_{0}$ and $(b_{j})_{j\in O}$ such that

$$\widehat{\rho}_{I\to O}=\alpha_{0}\prod_{j\in O}b_{j}^{x_{j}^{I}-1}$$

is a good approximation to $\rho_{I\to O}$ for all $I\subset O$ , in the sense of minimizing the quadratic deviation from the solution of the recursion (2). First note that $\alpha_{0}=1$ since $\rho_{O\to O}=1$ . Next, we plug $\widehat{\rho}_{I\to O}$ into the forward recursion to get

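$$\operatorname{vol}(I,I^{c})\,\widehat{\rho}_{I\to O}-\sum_{j\in O\setminus I}\operatorname{vol}(I,j)\,\widehat{\rho}_{I\cup\{j\}\to O}=0.$$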

Dividing both sides by $\widehat{\rho}_{I\to O}=\prod_{j\in O\setminus I}b_{j}^{-1}$ , and noting that $\widehat{\rho}_{I\cup\{j\}\to O}=b_{j}\,\widehat{\rho}_{I\to O}$ , gives $\operatorname{vol}(I,I^{c})-\sum_{j\in O\setminus I}\operatorname{vol}(I,j)\,b_{j}=0.$ These equations in general cannot be satisfied exactly for all $I\subset O$ . Instead, letting $\bm{b}=(b_{j})_{j\in O}$ , we solve the following least-squares problem:

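$$\widehat{\bm{b}}\;\in\;\operatorname*{argmin}_{\bm{b}}\sum_{I:\,I\subset O}\Big(\operatorname{vol}(I,I^{c})-\sum_{j\in O\setminus I}\operatorname{vol}(I,j)\,b_{j}\Big)^{2}=\operatorname*{argmin}_{\bm{b}}\|Q\bm{b}-\bm{r}\|_{2}^{2}\tag{6}$$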

where $Q\in\mathbb{R}^{(2^{|O|}-1)\times|O|}$ and $\bm{r}\in\mathbb{R}^{(2^{|O|}-1)\times 1}$ are defined as follows:

$$Q_{I,j}:=\operatorname{vol}(I,j)\,1\{j\in O\setminus I\},\qquad r_{I}:=\operatorname{vol}(I,I^{c}).$$

The solution of (6) satisfies the normal equations $Q^{T}Q\widehat{\bm{b}}=Q^{T}\bm{r}$ . The following proposition shows that $Q^{T}Q$ and $Q^{T}\bm{r}$ can be computed efficiently. Let $A$ be the adjacency matrix of the network.

Proposition 2.

The solution $\widehat{\bm{b}}$ of (6) satisfies the linear system $S\widehat{\bm{b}}=\bm{z}$ with $S$ and $\bm{z}$ given by

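$$\begin{aligned}S&=\Xi\big(A_{OO}+A_{OO}^{2}-A_{OO}\odot(\bm{u}\mathbf{1}^{T}+\mathbf{1}\bm{u}^{T})+\bm{u}\bm{u}^{T}\big)\in\mathbb{R}^{|O|\times|O|},\\ \bm{z}&=(\mathbf{1}^{T}\bm{u}+2\mathbf{1}^{T}\bm{v})\bm{u}-2\bm{v}\odot\bm{u}+2(A_{OO}\,\bm{v}+\bm{u})\end{aligned}$$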

where $\bm{u}:=A_{OO}\mathbf{1}$ and $\bm{v}:=A_{OO^{c}}\mathbf{1}$ . Here $\odot$ is the element-wise matrix product, $\Xi(\cdot)$ is a matrix operator that returns the same matrix with its diagonal entries doubled, and $\mathbf{1}$ is the vector of all ones.

See Appendix B.2 for the proof. Proposition 2 shows that the mean-field approach reduces to solving a linear system of equations in $|O|$ variables, a task with much better computational complexity than solving the original recursion. Both $S$ and $\bm{z}$ can be computed in at most $O(|O|^{2})$ time. In the cases where $A$ is sparse (which is often the case for real networks), $S$ will be a rank-one perturbation of a sparse matrix (both $A_{OO}$ and $A_{OO}^{2}$ will be sparse), hence solving the resulting system is often much faster than the worst case, i.e., faster than $O(|O|^{3})$ .
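The linear system of Proposition 2 can be assembled with dense NumPy arrays as in the sketch below. The function name `mean_field_b` and the use of a least-squares solver (to tolerate a singular $S$ ) are our choices, not the authors'. Ranking candidate sources by $\widehat{b}_{i}$ in descending order is our reading of the product form, since $\widehat{\rho}_{i\to O}=\prod_{j\neq i}b_{j}^{-1}=b_{i}/\prod_{j}b_{j}$ ; it assumes all $\widehat{b}_{j}$ are positive.

```python
import numpy as np

def mean_field_b(A, O):
    """Solve S b = z from Proposition 2 and return {j: b_j} for j in O."""
    A = np.asarray(A, dtype=float)
    O = list(O)
    in_O = set(O)
    rest = [k for k in range(A.shape[0]) if k not in in_O]
    A_OO = A[np.ix_(O, O)]
    u = A_OO.sum(axis=1)                # u = A_OO 1: degrees inside O
    v = A[np.ix_(O, rest)].sum(axis=1)  # v = A_{O O^c} 1: edges leaving O
    one = np.ones(len(O))
    M = (A_OO + A_OO @ A_OO
         - A_OO * (np.outer(u, one) + np.outer(one, u))
         + np.outer(u, u))
    S = M + np.diag(np.diag(M))         # Xi(.): double the diagonal entries
    z = (u.sum() + 2 * v.sum()) * u - 2 * v * u + 2 * (A_OO @ v + u)
    b, *_ = np.linalg.lstsq(S, z, rcond=None)
    return {j: float(bj) for j, bj in zip(O, b)}

# Example (hypothetical): path graph 0-1-2 with observed set O = {0, 1}.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
b = mean_field_b(A, [0, 1])
ranking = sorted(b, key=b.get, reverse=True)  # larger b_i = more likely source
```

In this two-node example $S=4I$ and $\bm{z}=(8,4)$ , so $\widehat{\bm{b}}=(2,1)$ and $\widehat{\rho}_{0\to O}=1$ , $\widehat{\rho}_{1\to O}=1/2$ , which coincide with the exact transition probabilities. For large sparse networks one would use sparse matrices instead of the dense arrays used here.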

5 Simulations

The methods proposed in this paper, the Greedy Elimination (GE) and the Mean Field Approximation (MFA), have reasonable runtimes compared to popular source identification procedures and show superior performance in source identification. In this section, we make a comparison based on these two measures on real and synthetic networks. As discussed in Section 2.2, we consider ranking estimators (i.e., those outputting a permutation of the nodes according to their likelihood of being the source) and focus on the rank loss. If the method does not return a ranking, we tweak it to do so. We evaluate the methods based on expected rank $R$ , the expectation of the rank of the actual source among the list of candidates (cf. Section 2.2). We normalize to get a number in [0,1], with zero corresponding to perfect recovery, i.e., we use $(R-1)/n$ .

We consider a variety of real and simulated networks. Our selection includes an Internet Autonomous System network [35, 36], the US west-coast power grid [37], two Facebook-100 networks [38, 39], called UC64 and UCSC68, and a Wikipedia voting network [40]. In addition, we present our results on a number of synthetic networks that are well studied in the literature, including regular trees, random trees, and degree-corrected stochastic block models (DC-SBM) [41].

Table 1 summarizes the statistics on the largest connected component of these networks. The regular tree is of degree 3 and depth 10. The random tree has 500 nodes. For the DC-SBM network, we generate from a $3$ -community planted partition version, i.e., $\mathbb{E}[A_{ij}]=\theta_{i}\theta_{j}P_{ij}$ where $P_{ij}=0.5$ if nodes $i$ and $j$ are in the same community and $P_{ij}=0.02$ if they are in different communities. The degree parameters $\theta_{i}$ are generated from a rescaled Pareto distribution with $\alpha=2$ and threshold $=1$ .
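The DC-SBM construction described above could be sampled roughly as follows. This is a sketch under stated assumptions: the text does not say how the Pareto draws are rescaled or how nodes are assigned to communities, so the code divides $\theta$ by its maximum (to keep every edge probability at most one) and assigns communities uniformly at random. The function name `sample_dcsbm` is hypothetical.

```python
import numpy as np

def sample_dcsbm(n, n_comm=3, p_in=0.5, p_out=0.02, alpha=2.0, seed=0):
    """Sample an undirected DC-SBM with E[A_ij] = theta_i * theta_j * P_ij."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_comm, size=n)        # assumption: uniform communities
    # Generator.pareto draws from the Lomax (Pareto II) law; adding 1 gives a
    # classical Pareto distribution with shape alpha and threshold 1.
    theta = 1.0 + rng.pareto(alpha, size=n)
    theta /= theta.max()                         # assumption: rescale to (0, 1]
    same = labels[:, None] == labels[None, :]
    probs = np.outer(theta, theta) * np.where(same, p_in, p_out)
    # Sample the strict upper triangle, then symmetrize (no self-loops).
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    A = (upper | upper.T).astype(int)
    return A, labels, theta

A, labels, theta = sample_dcsbm(60)
```

As in the experiments, one would then restrict attention to the largest connected component of the sampled graph.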

The results are illustrated in Figures 1, 2. The methods we consider besides the optimal Bayes solution (BO), the MFA, and the GE are the Rumor Centrality (RC), the Degree Centrality (DC) and the Jordan Center (JC). We have also run the Dynamical Age (DA), but due to its overall poor performance and its computational complexity, we have omitted it from the plots. Our selection of the methods loosely follows the methods surveyed in [30]. Each curve shows the performance of one method for different values of the infection size, $2\leq|O|\leq 300$ . Each point is an average over 500 infection paths rooted at random sources. To avoid an unreasonable computation time, we skip the optimal Bayes for the infected sets of size greater than 10. The BO curve serves as the benchmark for the best achievable performance. Note that even the optimal solution needs to output a large set to catch the source, signifying the inherent difficulty of the problem.
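An infection path of a given size can be simulated without tracking time or rates, because under the SI model the next infected node $j$ is drawn with probability proportional to $\operatorname{vol}(I,j)$, the number of its currently infected neighbors. The sketch below illustrates this; the function name and the dense-matrix representation are assumptions made for brevity, not the authors' experimental code.

```python
import numpy as np

def simulate_si_snapshot(A, source, size, rng):
    """Return the infection order of an SI path grown from `source` to `size` nodes."""
    n = A.shape[0]
    infected = np.zeros(n, dtype=bool)
    infected[source] = True
    order = [source]
    while len(order) < size:
        # w[j] = vol(I, j): number of infected neighbors of node j.
        w = A[infected].sum(axis=0).astype(float)
        w[infected] = 0.0                 # already-infected nodes cannot be re-infected
        total = w.sum()                   # equals vol(I, I^c)
        if total == 0:
            break                         # the connected component is exhausted
        j = int(rng.choice(n, p=w / total))
        infected[j] = True
        order.append(j)
    return order

# Toy example: cycle on 6 nodes, snapshot of size 4 rooted at node 0.
n = 6
A = np.zeros((n, n), dtype=int)
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
order = simulate_si_snapshot(A, source=0, size=4, rng=np.random.default_rng(1))
```

Repeating this for random sources and averaging the normalized rank of each method would give one point on a performance curve.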

Rumor and Jordan centralities perform optimally on regular trees in Figure 1(a), as expected by the theory [23, 26], although the network is not exactly an infinite tree. Notice that RC, JC, and BO overlap for infection sizes not exceeding the depth of the tree. Degree centrality also appears as a close competitor in this figure. Moving to other networks, however, these popular methods do not perform better than random guessing. For all three, the expected relative rank is close to 0.5, even in a random tree. The plots in this section show that, despite their popularity, the RC and JC are quite unreliable for source recovery.

Among our proposed methods, MFA outperforms RC, JC, and DC in Figures 2(a), 2(b), 2(c), 2(d). MFA finds the true source, on average, in its top 30% guesses. The networks with superior MFA performance have the highest transitivity (also known as the clustering coefficient) in Table 1, that is, many triangles among triples of nodes. Transitivity has been studied extensively and it distinguishes human social networks from random trees and less cyclical networks, such as water distribution systems and traffic networks. In this sense, MFA is suitable for rumor source detection in social networks.

GE is the global winner, except in regular trees (Figure 1(a)). We were surprised that a greedy algorithm had such a widespread success. GE not only performs well in highly transitive networks, but it also outperforms RC, JC, and DC on random trees (Figure 1(b)) and less transitive networks (Figures 1(c), 1(d)).

Figure 3 illustrates the runtimes (on the log scale) for a single run on the UC64 network, when the infection size is 10 (the maximum size for which the Bayesian results are available). Degree centrality is the fastest, followed by RC, JC, MFA and GE, all four having comparable speed, with RC and JC having a slight edge. BO is about 10 times slower and its runtime grows exponentially with infection size.

Based on these results, we advocate for the use of GE as the main tool for identifying sources of epidemics, regardless of the network topology or the nature of the epidemics (rumor propagation, disease contagion, etc.). MFA should be applied with caution. It is superior in social (transitive) networks, and attractive for its simplicity and the potential for parallelism.

Appendix A Multi-source Extension

The inference problem discussed in Section 2.2 immediately extends to multi-source situations. Consider the case where more than one independent source, denoted by $\bm{I}_{*}$ , initiates the infection dynamics. Due to the Markovian nature of the dynamics, the infection path that leads to some set $I$ does not influence the value of $\rho_{I\to O}$ . Hence, Proposition 1 also describes the likelihood of the transition from the source set $\bm{I}_{*}$ to a snapshot $O$ .

If we know that there are $s$ original sources, i.e., $|\bm{I}_{*}|=s$ , with a uniform prior on the patient zeros, the Bayesian solution would be characterized by the optimization

[TABLE]

To compute this MAP estimate, we can still use the DP solution in Proposition 1, but we do not need to compute $\rho_{I\to O}$ for $|I|<s$ . Thus, the multi-source problem is in a sense “easier”, especially when $s\approx|O|$ , since one can terminate the recursion earlier (i.e., the case $s=1$ is the hardest).
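The early-termination remark can be made concrete with a small memoized recursion. The sketch below assumes the forward form $\rho_{I\to O}=\sum_{j\in O\setminus I}\rho_{I\to I\cup j}\,\rho_{I\cup j\to O}$ with $\rho_{I\to I\cup j}=\operatorname{vol}(I,j)/\operatorname{vol}(I,I^{c})$ and $\rho_{O\to O}=1$, and scores every candidate source set of size $s$ by brute force. The names `make_rho` and `map_source_set` are hypothetical, and the enumeration is exponential in $|O|$, so this is only usable on tiny snapshots.

```python
from functools import lru_cache
from itertools import combinations
import numpy as np

def make_rho(A, O):
    """Return a memoized function I -> rho_{I -> O} for a fixed snapshot O."""
    O = frozenset(O)

    @lru_cache(maxsize=None)
    def rho(I):
        if I == O:
            return 1.0
        idx = list(I)
        w = A[idx].sum(axis=0).astype(float)   # w[j] = vol(I, j)
        w[idx] = 0.0
        total = w.sum()                        # vol(I, I^c)
        # Sum over the next infected node j, which must stay inside O.
        return sum(w[j] / total * rho(I | {j}) for j in O - I if w[j] > 0)

    return rho

def map_source_set(A, O, s):
    """Score all source sets of size s inside O; sets smaller than s are never visited."""
    rho = make_rho(A, O)
    scores = {I: rho(frozenset(I)) for I in combinations(sorted(O), s)}
    best = max(scores, key=scores.get)
    return best, scores

# Toy network: path graph 0 - 1 - 2 - 3 with snapshot O = {0, 1, 2}.
A = np.zeros((4, 4), dtype=int)
for a, b in [(0, 1), (1, 2), (2, 3)]:
    A[a, b] = A[b, a] = 1
best1, scores1 = map_source_set(A, {0, 1, 2}, s=1)
best2, scores2 = map_source_set(A, {0, 1, 2}, s=2)
```

With $s=2$ the recursion starts from sets of size two, so fewer intermediate sets are evaluated than in the single-source case.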

Appendix B Proofs

B.1 Proof of Proposition 1

Let us first recall a known fact about the exponential distribution:

Fact 1.

Let $T_{i}\sim\operatorname{Exp}(\beta_{i})$ be a collection of independent exponential variables. Then,

$$\mathbb{P}\Big(T_{i}=\min_{j}T_{j}\Big)=\frac{\beta_{i}}{\sum_{j}\beta_{j}}.$$
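The standard race property of independent exponentials, namely that $T_{i}$ is the smallest with probability $\beta_{i}/\sum_{j}\beta_{j}$, is easy to check numerically. The rates and sample size below are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = np.array([1.0, 2.0, 3.0])          # arbitrary rates for the illustration

# NumPy parametrizes the exponential by its scale = 1 / rate.
T = rng.exponential(scale=1.0 / beta, size=(200_000, beta.size))
winners = T.argmin(axis=1)                # index of the first clock to expire

empirical = np.bincount(winners, minlength=beta.size) / len(winners)
expected = beta / beta.sum()              # beta_i / sum_j beta_j
```

The empirical frequencies should match `expected` up to Monte Carlo error.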

The forward programming (2) is an application of the law of total probability in the following sense: The event that nodes in $O\setminus I$ are infected before any other node in $I^{c}$ splits into sub-events that each node in $O\setminus I$ is infected before those in $O^{c}$ and we have

[TABLE]

where we have also used the Markov property of SI dynamics to split the probabilities on the RHS into the products. The ratio in (2) corresponds to the transition probability from $I$ to $I\cup j$ , that is $\rho_{I\to I\cup j}$ . Indeed, given that $I$ is infected, we run exponential clocks $T_{j}\sim\operatorname{Exp}(\beta\operatorname{vol}(I,j))$ and the first to expire determines the next infected node. By Fact 1, this happens for any node $j\in I^{c}$ with probability $\propto_{j}\beta\operatorname{vol}(I,j)$ . Thus,

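$$\rho_{I\to I\cup j}=\frac{\beta\operatorname{vol}(I,j)}{\sum_{k\in I^{c}}\beta\operatorname{vol}(I,k)}=\frac{\operatorname{vol}(I,j)}{\operatorname{vol}(I,I^{c})}.$$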

This proves the forward programming. The backward programming, on the other hand, connects $\rho_{I\to O}$ to $\rho_{I\to O\setminus j}$ and is proved similarly. Basically, the event of visiting $O$ can be divided into sub-events based on the last node in $O$ that is infected.
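As a small worked instance of the forward recursion (an illustration that is not taken from the paper), consider the path graph $a - b - c - d$ with snapshot $O=\{a,b,c\}$ and candidate source $b$ . Conditioning on the first node infected after $b$ , and using that each transition probability is $\operatorname{vol}(I,j)$ normalized by $\operatorname{vol}(I,I^{c})$ , gives

$$\rho_{\{b\}\to O}=\rho_{\{b\}\to\{a,b\}}\,\rho_{\{a,b\}\to O}+\rho_{\{b\}\to\{b,c\}}\,\rho_{\{b,c\}\to O}=\tfrac{1}{2}\cdot 1+\tfrac{1}{2}\cdot\tfrac{1}{2}=\tfrac{3}{4}.$$

Here $\rho_{\{a,b\}\to O}=1$ because $c$ is the only susceptible neighbor of $\{a,b\}$ , while $\rho_{\{b,c\}\to O}=\tfrac{1}{2}$ because $\{b,c\}$ has two susceptible neighbors, $a$ and $d$ , and only $a$ lies in $O$ . The same computation gives $\rho_{\{a\}\to O}=1$ and $\rho_{\{c\}\to O}=\tfrac{1}{4}$ , so the end node $a$ is the most likely source in this toy example.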

B.2 Proof of Proposition 2

We prove the following alternative expressions for $S=(S_{jj^{\prime}})^{|O|\times|O|}$ and $\bm{z}=(z_{j})^{|O|}$ ,

[TABLE]

Here, $d_{O}(i):=\sum_{j\in O}A_{ij}$ is the degree of node $i$ in $O$ , and $\operatorname{vol}^{(2)}(i,j):=\sum_{r\in O}A_{ir}A_{rj}$ is the number of paths of length 2 between nodes $i$ and $j$ that pass through $O$ . It is not hard to verify that these expressions are equivalent to the matrix form presented in (2).
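Both quantities defined above are simple matrix products, which is one way to see why they are cheap to compute. The following sketch (with hypothetical function names and a toy path graph) computes $d_{O}(i)$ and $\operatorname{vol}^{(2)}(i,j)$ for all nodes at once, and checks that restricting $\operatorname{vol}^{(2)}$ to $O$ gives $A_{OO}^{2}$.

```python
import numpy as np

def degree_in(A, O):
    """d_O(i) = sum_{j in O} A_ij, for every node i."""
    return A[:, O].sum(axis=1)

def two_paths_through(A, O):
    """vol^(2)(i, j) = sum_{r in O} A_ir A_rj, for every pair (i, j)."""
    return A[:, O] @ A[O, :]

# Toy network: path graph 0 - 1 - 2 - 3 with O = {0, 1, 2}.
A = np.zeros((4, 4), dtype=int)
for a, b in [(0, 1), (1, 2), (2, 3)]:
    A[a, b] = A[b, a] = 1
O = [0, 1, 2]

d_O = degree_in(A, O)
vol2 = two_paths_through(A, O)
A_OO = A[np.ix_(O, O)]
```

For a 0/1 adjacency matrix the diagonal of `vol2` coincides with `d_O`, since $A_{ir}^{2}=A_{ir}$.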

Recall that $\operatorname{vol}(I,I^{c})=\sum_{i,k}A_{ik}1\{i\in I,k\notin I\}$ and similarly $\operatorname{vol}(I,j)=\sum_{r}A_{rj}1\{r\in I\}$ . Here, the indices $i$ , $k$ and $r$ run over all nodes in the network, i.e., $i,k,r\in[n]$ . We have

[TABLE]

where the last equality follows by interchanging the order of summations and defining

[TABLE]

If $i$ or $r$ does not belong to $O\setminus\{j\}$ , or $k\in\{i,r\}$ , then $\gamma_{ikr}=0$ . Thus, in what follows, assume that $i,r\in O_{\setminus\,j}:=O\setminus\{j\}$ and $k\notin\{i,r\}$ . Then,

[TABLE]

To see the second equality, note that we are counting subsets of the set $O\setminus\{j\}$ (of cardinality $|O|-1$ ) that contain or exclude certain elements. For example, when $k,i,r$ are pairwise distinct and $k\in O\setminus\{j\}$ , looking at the binary representation of $I$ , we have two ones in the positions $i$ and $r$ and a zero in position $k$ , and the remaining $|O|-1-3$ positions are free to be zero or one.
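The counting argument can be verified by brute force on a small ground set. In the sketch below, the ground set plays the role of $O\setminus\{j\}$ with $|O|-1=6$ elements; the function name and the specific element labels are arbitrary choices for this illustration.

```python
from itertools import combinations

def count_subsets(ground, include, exclude):
    """Count subsets I of `ground` that contain `include` and avoid `exclude`."""
    ground = list(ground)
    include, exclude = set(include), set(exclude)
    count = 0
    for size in range(len(ground) + 1):
        for I in combinations(ground, size):
            members = set(I)
            if include <= members and not (exclude & members):
                count += 1
    return count

ground = range(6)  # stands in for O \ {j}, so |O| - 1 = 6

# i, r, k pairwise distinct and k inside the ground set: 2^(6 - 3) subsets.
distinct_inside = count_subsets(ground, include={0, 1}, exclude={2})
# k outside the ground set: only i and r are pinned, 2^(6 - 2) subsets.
k_outside = count_subsets(ground, include={0, 1}, exclude={99})
# i = r and k inside the ground set: two positions pinned, 2^(6 - 2) subsets.
i_equals_r = count_subsets(ground, include={0}, exclude={2})
```

Each pinned position (forced to one or forced to zero) halves the number of admissible subsets, which is the source of the powers of two in the expression for $\gamma_{ikr}$.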

Let $d_{S}(i)=\sum_{j\in S}A_{ij}$ be the degree of node $i$ in $S$ . We drop $S$ when $S=[n]$ . In what follows, $i$ and $r$ range over $O\setminus\{j\}$ (otherwise $\gamma_{ikr}=0$ ). Also, the condition $k\notin\{i,r\}$ can be replaced with $k\neq r$ , since $k\neq i$ is implicitly enforced by $A_{ik}=0$ if $k=i$ (no self-loops). We have

[TABLE]

where in the second term, we used the fact that if $k\notin O_{\setminus\,j}$ then we automatically have $k\neq r$ since $r$ ranges over $O_{\setminus\,j}$ . We have

[TABLE]

where $\operatorname{vol}_{O_{\setminus j}}^{(2)}(i,j):=\sum_{r\in O_{\setminus j}}A_{ir}A_{rj}$ is the number of paths of length two between $i$ and $j$ in $O_{\setminus j}$ . Note that $\operatorname{vol}_{O_{\setminus j}}^{(2)}(i,j)=\operatorname{vol}_{O}^{(2)}(i,j)$ and similarly $d_{O_{\setminus j}}(j)=d_{O}(j)$ since $A_{jj}=0$ . Thus,

[TABLE]

where $\operatorname{vol}(O_{\setminus j})=\operatorname{vol}(O_{\setminus j},O_{\setminus j})$ and the third equality follows since we have

[TABLE]

which was used with $A=O_{\setminus j}$ . Similarly, we have

[TABLE]

It follows that

[TABLE]

Calculating $Q^{T}Q$

Let us first take $j\neq j^{\prime}$ . Then, similar to the previous argument,

[TABLE]

where we have defined

[TABLE]

assuming $i,r\in O\setminus\{j,j^{\prime}\}$ , otherwise $\beta_{ir}=0$ . Thus, restricting summations over indices $i,r\in O\setminus\{j,j^{\prime}\}$

[TABLE]

Now consider the case $j=j^{\prime}$ . Then,

[TABLE]

assuming $i,r\in O\setminus j$ . It follows that

[TABLE]

Bibliography

[1] Andrew Cliff and Peter Haggett. Time, travel and infection. British medical bulletin, 69(1):87–99, 2004.
[2] Mitchell L Cohen. Changing patterns of infectious disease. Nature, 406(6797):762, 2000.
[3] Vittoria Colizza, Alain Barrat, Marc Barthélemy, and Alessandro Vespignani. The role of the airline transportation network in the prediction and predictability of global epidemics. Proceedings of the National Academy of Sciences, 103(7):2015–2020, 2006.
[4] Laurence Slutsker, Sean F Altekruse, and David L Swerdlow. Foodborne diseases: emerging pathogens and trends. Infectious Disease Clinics, 12(1):199–216, 1998.
[5] Matthew Elliott, Benjamin Golub, and Matthew O Jackson. Financial networks and contagion. American Economic Review, 104(10):3115–53, 2014.
[6] Daron Acemoglu, Asuman Ozdaglar, and Alireza Tahbaz-Salehi. Systemic risk and stability in financial networks. American Economic Review, 105(2):564–608, 2015.
[7] Suleyman Kondakci. Epidemic state analysis of computers under malware attacks. Simulation Modelling Practice and Theory, 16(5):571–584, 2008.
[8] Chris Fleizach, Michael Liljenstam, Per Johansson, Geoffrey M Voelker, and Andras Mehes. Can you infect me now?: malware propagation in mobile phone networks. In Proceedings of the 2007 ACM workshop on Recurring malcode, pages 61–68. ACM, 2007.
