Distributionally Robust Removal of Malicious Nodes from Networks

Sixie Yu; Yevgeniy Vorobeychik

arXiv:1901.11463·cs.LG·February 1, 2019


TL;DR

This paper introduces a distributionally robust method for removing malicious nodes in networks, addressing uncertainty in maliciousness estimates, and demonstrates its effectiveness through theoretical and empirical analysis.

Contribution

It develops a novel distributionally robust framework for node removal, overcoming limitations of prior methods that assume accurate maliciousness probabilities.

Findings

01

The proposed algorithm is highly effective in practice.

02

It outperforms existing methods in robustness.

03

The approach is validated on synthetic and real data.

Abstract

An important problem in networked systems is detection and removal of suspected malicious nodes. A crucial consideration in such settings is the uncertainty endemic in detection, coupled with considerations of network connectivity, which impose indirect costs from mistakenly removing benign nodes as well as failing to remove malicious nodes. A recent approach proposed to address this problem directly tackles these considerations, but has a significant limitation: it assumes that the decision maker has accurate knowledge of the joint maliciousness probability of the nodes on the network. This is clearly not the case in practice, where such a distribution is at best an estimate from limited evidence. To address this problem, we propose a distributionally robust framework for optimal node removal. While the problem is NP-Hard, we propose a principled algorithmic technique for solving it…

Tables

Table 1: Statistics of networks used in our experiments. $r$ is the exponent of the power-law degree distribution.

Network	$r$	density	#edges	clustering coeff.
BA-1	2.7167	0.0461	375	0.1340
BA-2	2.2789	0.0610	496	0.1504
BA-3	2.0374	0.0757	615	0.1646
SW-1		0.0787	640	0.3664
SW-2		0.1102	896	0.3875
SW-3		0.1575	1280	0.4059
Facebook		0.0106	1325	0.3930

Equations

$$\mathcal{L}(\mathbf{x}) := \alpha_{1}\underbrace{\sum_{i=1}^{N} x_{i}\,\mathbb{E}_{\pi\sim\mathcal{P}}[\bar{\pi}_{i}]}_{\mathcal{L}_{1}} - \alpha_{2}\underbrace{\sum_{i,j}^{N} A_{i,j}\,x_{i}x_{j}\,\mathbb{E}_{\pi\sim\mathcal{P}}[\bar{\pi}_{i}\bar{\pi}_{j}]}_{\mathcal{L}_{2}} + \alpha_{3}\underbrace{\sum_{i,j}^{N} x_{i}x_{j}A_{i,j}\,\mathbb{E}_{\pi\sim\mathcal{P}}[\pi_{i}\bar{\pi}_{j}]}_{\mathcal{L}_{3}}.$$

$$\mathbf{B}(\mathbf{\mu}) := \dots \qquad \mathbf{P}(\mathbf{\mu},\mathbf{\Sigma}) := \dots \qquad \mathbf{M}(\mathbf{\mu},\mathbf{\Sigma}) := \dots$$

$$\mathbf{Q}(\mathbf{\mu},\mathbf{\Sigma}) := \dots\ (\alpha_{2}/2)\big[\mathbf{P}(\mathbf{\mu},\mathbf{\Sigma})+\mathbf{P}(\mathbf{\mu},\mathbf{\Sigma})^{T}\big],$$

$$\mathcal{L}(\mathbf{x};\mathbf{\mu},\mathbf{\Sigma}) = \mathbf{x}^{T}\mathbf{Q}(\mathbf{\mu},\mathbf{\Sigma})\mathbf{x} + 2\mathbf{x}^{T}\mathbf{b}(\mathbf{\mu})$$

$$\min_{\mathbf{x}}\ \dots \quad \text{s.t.}\ \dots$$

$$\min_{\mathbf{x}}\ \sup_{\mathcal{F}\sim\Pi}\ \dots \quad \text{s.t.}\ \dots$$

$$\begin{aligned}&(\mathbb{E}[\mathbf{\mu}_{\mathcal{F}}]-\hat{\mathbf{\mu}})^{T}\hat{\mathbf{\Sigma}}^{-1}(\mathbb{E}[\mathbf{\mu}_{\mathcal{F}}]-\hat{\mathbf{\mu}})\leq\gamma_{1}\\ &\mathbb{E}\big[(\mathbf{\mu}_{\mathcal{F}}-\hat{\mathbf{\mu}})(\mathbf{\mu}_{\mathcal{F}}-\hat{\mathbf{\mu}})^{T}\big]\preceq\gamma_{2}\hat{\mathbf{\Sigma}},\end{aligned}$$

$$\Pi(\hat{\mathbf{\mu}},\hat{\mathbf{\Sigma}},\gamma_{1},\gamma_{2}) := \left\{\mathcal{F}\ \middle|\ \begin{array}{l}(\mathbb{E}[\mathbf{\mu}_{\mathcal{F}}]-\hat{\mathbf{\mu}})^{T}\hat{\mathbf{\Sigma}}^{-1}(\mathbb{E}[\mathbf{\mu}_{\mathcal{F}}]-\hat{\mathbf{\mu}})\leq\gamma_{1}\\ \mathbb{E}\big[(\mathbf{\mu}_{\mathcal{F}}-\hat{\mathbf{\mu}})(\mathbf{\mu}_{\mathcal{F}}-\hat{\mathbf{\mu}})^{T}\big]\preceq\gamma_{2}\hat{\mathbf{\Sigma}}\end{array}\right\}$$

$$\sup_{\mathcal{F}\sim\Pi}\ \dots \quad \text{s.t.}\quad \int_{\mathcal{S}}\big[(\mathbf{\mu}_{\mathcal{F}}-\hat{\mathbf{\mu}})(\mathbf{\mu}_{\mathcal{F}}-\hat{\mathbf{\mu}})^{T}\big]\,d\mathcal{F}(\mathbf{\mu}_{\mathcal{F}})\preceq\gamma_{2}\hat{\mathbf{\Sigma}}, \qquad \mathbf{\mu}_{\mathcal{F}}\in\mathcal{S},\ \forall\mathbf{\mu}_{\mathcal{F}}.$$

$$l(\mathcal{F},t,\mathbf{K})=\bigg[t+Tr\bigg(\big[\gamma_{2}\hat{\mathbf{\Sigma}}+\hat{\mathbf{\mu}}\hat{\mathbf{\mu}}^{T}\big]\mathbf{K}\bigg)\bigg]+\int_{\mathcal{S}}\bigg[\underbrace{\mathbf{x}^{T}\mathbf{Q}(\mathbf{\mu}_{\mathcal{F}},\hat{\mathbf{\Sigma}})\mathbf{x}+2\mathbf{x}^{T}\mathbf{b}(\mathbf{\mu}_{\mathcal{F}})-t-\mathbf{\mu}_{\mathcal{F}}^{T}\mathbf{K}\mathbf{\mu}_{\mathcal{F}}}_{:=f(\mathbf{\mu}_{\mathcal{F}})}\bigg]\,d\mathcal{F}(\mathbf{\mu}_{\mathcal{F}}),$$

$$\min_{t,\mathbf{K}}\ \dots \quad \text{s.t.}\ \dots,\quad t\in\mathbb{R},\ \mathbf{K}\in\mathbb{S}_{+}^{N}$$

$$\min_{\mathbf{x},t,\mathbf{K}}\ \dots \quad \text{s.t.}\ \dots,\quad f(\mathbf{\mu}_{\mathcal{F}})\leq 0,\ \forall\mathbf{\mu}_{\mathcal{F}}\in\mathcal{S},\quad t\in\mathbb{R},\ \mathbf{K}\in\mathbb{S}_{+}^{N}$$

$$\mathbf{x}^{T}\mathbf{A}_{1}\mathbf{x}+2\mathbf{b}_{1}^{T}\mathbf{x}+c_{1}\leq 0 \implies \mathbf{x}^{T}\mathbf{A}_{2}\mathbf{x}+2\mathbf{b}_{2}^{T}\mathbf{x}+c_{2}\leq 0$$

$$\mathbf{x}^{T}\mathbf{Q}(\mathbf{\mu}_{\mathcal{F}},\hat{\mathbf{\Sigma}})\mathbf{x}+2\mathbf{x}^{T}\mathbf{b}(\mathbf{\mu}_{\mathcal{F}}) = \mathbf{\mu}_{\mathcal{F}}^{T}\Big[-(\alpha_{2}+\alpha_{3})\big(\mathbf{A}\odot(\mathbf{x}\mathbf{x}^{T})\big)\Big]\mathbf{\mu}_{\mathcal{F}} + \mathbf{\mu}_{\mathcal{F}}^{T}\Big[(\alpha_{3}+2\alpha_{2})\,diag(\mathbf{x})\mathbf{A}\mathbf{x}-\alpha_{1}\mathbf{x}\Big] + \Big[\alpha_{1}\mathbf{1}^{T}\mathbf{x}-\mathbf{x}^{T}\big((\alpha_{2}+\alpha_{3})(\mathbf{A}\odot\hat{\mathbf{\Sigma}})+\alpha_{2}\mathbf{A}\big)\mathbf{x}\Big],$$

$$\forall\mathbf{\mu}_{\mathcal{F}}\in\mathcal{S}: f(\mathbf{\mu}_{\mathcal{F}})\leq 0 \iff \mathbf{\mu}_{\mathcal{F}}^{T}\mathbf{R}\mathbf{\mu}_{\mathcal{F}}+\mathbf{\mu}_{\mathcal{F}}^{T}\mathbf{r}+z\leq 0$$

$$\begin{aligned}\mathbf{R}&=-(\alpha_{2}+\alpha_{3})\big(\mathbf{A}\odot(\mathbf{x}\mathbf{x}^{T})\big)-\mathbf{K}\\ \mathbf{r}&=(\alpha_{3}+2\alpha_{2})\,diag(\mathbf{x})\cdot\mathbf{A}\cdot\mathbf{x}-\alpha_{1}\mathbf{x}\\ z&=\alpha_{1}\mathbf{1}^{T}\mathbf{x}-\mathbf{x}^{T}\big((\alpha_{2}+\alpha_{3})(\mathbf{A}\odot\hat{\mathbf{\Sigma}})+\alpha_{2}\mathbf{A}\big)\mathbf{x}-t,\end{aligned}$$

$$f(\mathbf{\mu}_{\mathcal{F}})=\mathbf{\mu}_{\mathcal{F}}^{T}\mathbf{R}\mathbf{\mu}_{\mathcal{F}}+\mathbf{\mu}_{\mathcal{F}}^{T}\mathbf{r}+z$$

$$\exists\lambda\geq 0:\begin{bmatrix}\mathbf{R}&\frac{1}{2}\mathbf{r}\\ \frac{1}{2}\mathbf{r}^{T}&z\end{bmatrix}\preceq\lambda\begin{bmatrix}\hat{\mathbf{\Sigma}}^{-1}&-\hat{\mathbf{\Sigma}}^{-1}\hat{\mathbf{\mu}}\\ -\hat{\mathbf{\mu}}^{T}\hat{\mathbf{\Sigma}}^{-1}&\hat{\mathbf{\mu}}^{T}\hat{\mathbf{\Sigma}}^{-1}\hat{\mathbf{\mu}}-\gamma_{1}\end{bmatrix}.$$

$$\mathbf{x}\mathbf{x}^{T},\quad diag(\mathbf{x})\mathbf{A}\mathbf{x},\quad \mathbf{x}^{T}\big((\alpha_{2}+\alpha_{3})(\mathbf{A}\odot\hat{\mathbf{\Sigma}})+\alpha_{2}\mathbf{A}\big)\mathbf{x}.$$

$$(r1):\ \mathbf{A}\odot(\mathbf{x}\mathbf{x}^{T})=\mathbf{A}\odot\mathbf{X}$$


Taxonomy

Topics: Facility Location and Emergency Management · Infrastructure Resilience and Vulnerability Analysis · Risk and Portfolio Optimization

Full text


Distributionally Robust Removal of Malicious Nodes from Networks


Preliminary work. Under review by the International Conference on Machine Learning (ICML). Do not distribute.

Abstract

An important problem in networked systems is detection and removal of suspected malicious nodes. A crucial consideration in such settings is the uncertainty endemic in detection, coupled with considerations of network connectivity, which impose indirect costs from mistakenly removing benign nodes as well as failing to remove malicious nodes. A recent approach proposed to address this problem directly tackles these considerations, but has a significant limitation: it assumes that the decision maker has accurate knowledge of the joint maliciousness probability of the nodes on the network. This is clearly not the case in practice, where such a distribution is at best an estimate from limited evidence. To address this problem, we propose a distributionally robust framework for optimal node removal. While the problem is NP-Hard, we propose a principled algorithmic technique for solving it approximately based on duality combined with Semidefinite Programming relaxation. A combination of both theoretical and empirical analysis, the latter using both synthetic and real data, provides strong evidence that our algorithmic approach is highly effective and, in particular, is significantly more robust than the state of the art.

1 Introduction

One of the major problems in networked settings is to identify and remove potentially malicious nodes. For example, in social networks, malicious nodes may correspond to accounts created by malicious parties which spread social spam, hate speech, fake news, and the like, with considerable deleterious effects Allcott & Gentzkow (2017); Cheng et al. (2015). Major social network platforms consequently devote considerable efforts to identify and remove fake or malicious accounts Rodriguez (2018); Scott & Isaac (2017). Nevertheless, evidence suggests that the problem remains pervasive Andrade (2018); Narayanan et al. (2018). Similarly, in cyber-physical systems (e.g., smart grid infrastructure), computing nodes compromised by malware can cause catastrophic losses, and mitigation through detection and removal of such malicious nodes is a major problem Mo et al. (2012); Yang et al. (2017).

A central challenge in deciding which potentially malicious nodes to remove is to account for the combination of uncertainty about whether particular nodes are malicious and the indirect (network) effects of the decision. This combination makes the decision about which nodes to remove fundamentally a subset selection problem, which is a challenging combinatorial optimization problem. Recently, Yu & Vorobeychik proposed an approach for solving it, which they term MINT, that approximately minimizes a loss with three terms: direct loss from removing benign nodes, indirect loss from cutting links in the benign subgraph, and indirect loss from maintaining connectivity between malicious and benign nodes.

This model is illustrated in Fig. 1, where we consider removing Jack and Emma, two benign nodes above the dotted blue line (and failing to remove the malicious node). Suppose that we pay a penalty of $\alpha_{1}$ for each benign node we remove, a penalty $\alpha_{2}$ for each link we cut between benign nodes, and $\alpha_{3}$ for each link between remaining malicious nodes and benign nodes. Since we remove $2$ benign nodes, cut $3$ links between benign nodes (one between Emma and Rachel, one between Emma and Ryan, and another between Jack and Ryan), and the malicious node is still connected to $2$ nodes (Rachel and Nancy), our total loss is: $2\alpha_{1}+3\alpha_{2}+2\alpha_{3}$ .
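The counting in this example can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: the edge list contains just the links named in the text (the full graph of Fig. 1 is not reproduced here), and `"Mallory"` is a hypothetical name for the unnamed malicious node.

```python
def count_loss_terms(edges, malicious, removed):
    """Return (benign nodes removed, benign-benign links cut, malicious-benign links kept)."""
    removed_benign = sum(1 for v in removed if v not in malicious)
    cut_links = 0        # links between a removed benign node and a remaining benign node
    kept_bad_links = 0   # links between a remaining malicious node and a remaining benign node
    for u, v in edges:
        u_bad, v_bad = u in malicious, v in malicious
        u_gone, v_gone = u in removed, v in removed
        if not u_bad and not v_bad and (u_gone != v_gone):
            cut_links += 1
        elif (u_bad != v_bad) and not u_gone and not v_gone:
            kept_bad_links += 1
    return removed_benign, cut_links, kept_bad_links


# Only the links mentioned in the text; "Mallory" is an assumed name.
edges = [("Emma", "Rachel"), ("Emma", "Ryan"), ("Jack", "Ryan"),
         ("Mallory", "Rachel"), ("Mallory", "Nancy")]
malicious = {"Mallory"}
removed = {"Jack", "Emma"}

n1, n2, n3 = count_loss_terms(edges, malicious, removed)


def total_loss(alpha1, alpha2, alpha3):
    """Total penalty alpha1*n1 + alpha2*n2 + alpha3*n3 for the toy example."""
    return alpha1 * n1 + alpha2 * n2 + alpha3 * n3
```

The three counts multiply $\alpha_{1}$, $\alpha_{2}$ and $\alpha_{3}$ respectively, matching the total loss stated above.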

A major shortcoming of MINT is that it assumes that the distribution of node maliciousness is known. In practice, such a distribution is estimated from limited evidence, such as node behavior and other characteristics, and this estimation may be quite inaccurate (particularly if our modeling assumptions are poor, for example, if we erroneously assume that maliciousness probabilities of nodes are independent). More precisely, consider an unknown ground-truth $\mathcal{P}$ , as illustrated in Fig. 1 in green. Whereas MINT assumes we know $\mathcal{P}$ , in reality we only have an estimate $\hat{\mathcal{P}}$ (shown in red in Fig. 1). To address this issue, we propose a new approach, MINT_DRO, which is a distributionally robust framework for optimal node removal. We design an uncertainty set around the estimate $\hat{\mathcal{P}}$ and optimize with respect to the worst-case scenario. We propose a principled algorithmic approach for solving this problem approximately based on duality combined with Semidefinite Programming relaxation, and prove that the uncertainty set in our model contains the ground-truth distribution $\mathcal{P}$ with high probability. This in turn implies that with high probability MINT_DRO is robust with respect to the ground-truth distribution. Finally, we conducted extensive experiments using both synthetic and real data to show that our model is significantly more robust than MINT.

Related Work

There are several prior efforts considering a related problem of graph scan statistics and hypothesis testing Arias-Castro et al. (2011); Priebe et al. (2005); Sharpnack et al. (2013). These study the following problem: given a graph $G$ where each node is associated with a random variable with an exogenously specified probability distribution, find a subset of nodes that maximizes a scan statistic defined over subsets of nodes (for example, this statistic may generalize log-likelihood ratio). The recent MINT approach Yu & Vorobeychik (2018) can be viewed through this lens as well, but as it has been shown to have state-of-the-art performance, our experimental evaluation focuses on comparing to MINT.

Also closely related to our problem is the broader literature on distributionally robust optimization (DRO) Scarf (1958). In the DRO framework one defines a set of probability distributions that is assumed to contain the true stochastic model of the problem. Many solutions have been proposed to solve specific problems under the DRO framework Xu & Mannor (2010); Calafiore & El Ghaoui (2006); Yue et al. (2006); Cheng et al. (2014); Wiesemann et al. (2014), although this framework has not been applied in the context of choosing which potentially malicious nodes to remove from a network.

Our design of the uncertainty set is inspired by the idea of moment-constrained uncertainty sets Delage & Ye (2010); Popescu (2007); Calafiore & El Ghaoui (2006). Yet another related research strand uses Semidefinite Programming (SDP) to approximate combinatorial optimization problems Goemans & Williamson (1995); Luo et al. (2010); Bertsimas & Sethuraman (2000), although such approaches are domain specific. Finally, our work bears some relationship to the burgeoning field of adversarial machine learning Vorobeychik & Kantarcioglu (2018), although we do not explicitly consider issues of adversarial response (such as evasion attacks) in our setting.

2 Model

We consider a network that is represented by a graph $G=(V,E)$ , where $V$ ( $|V|=N$ ) is the set of nodes and $E$ the set of edges connecting them. Each node $i\in V$ represents a user and each edge $(i,j)$ represents a connection (e.g., friendship on Facebook) between user $i$ and user $j$ . We focus our attention on undirected graphs. We denote the adjacency matrix of $G$ by $\mathbf{A}\in\mathbb{R}^{N\times N}$ . The elements of $\mathbf{A}$ are binary if the graph is unweighted, or non-negative real numbers if the graph is weighted. To make exposition easier we focus on unweighted graphs. Generalization to weighted graphs is straightforward.

We consider the problem of removing malicious nodes from the network $G$ . A configuration of the network is denoted by $\mathbf{\pi}\in\{0,1\}^{N}$ , with $\pi_{i}=1$ indicating that node $i$ is malicious and $\pi_{i}=0$ that it is benign. For convenience, we also let $\bar{\pi}_{i}=1-\mathbf{\pi}_{i}$ indicate that $i$ is benign. Consequently, $\mathbf{\pi}$ (and $\mathbf{\bar{\pi}}$ ) assigns a malicious or benign label to each node. The identities of malicious and benign nodes are usually uncertain, so instead we have a probability distribution over the configurations. Formally, let $\mathbf{\pi}\sim\mathcal{P}$ , where $\mathcal{P}$ captures the joint probability distribution over node configurations.

Our work builds upon the following model proposed by Yu & Vorobeychik (2018). Let $S$ denote the set of nodes to remove. Define a vector $\mathbf{x}\in\{-1,1\}^{N}$ , where $\mathbf{x}_{i}=1$ if and only if node $i$ is removed ( $i\in S$ ), and $\mathbf{x}_{i}=-1$ if node $i$ remains in the network ( $i\in V\setminus S$ ). The goal of their model is to identify a subset of nodes $S$ to remove so as to minimize the impact of the remaining malicious nodes on the network, while at the same time minimizing disruptions caused to the benign subnetwork. This goal is naturally captured by the loss function given in Eq. (1).

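$$\mathcal{L}(\mathbf{x}) := \alpha_{1}\underbrace{\sum_{i=1}^{N} x_{i}\,\mathbb{E}_{\pi\sim\mathcal{P}}[\bar{\pi}_{i}]}_{\mathcal{L}_{1}} - \alpha_{2}\underbrace{\sum_{i,j}^{N} A_{i,j}\,x_{i}x_{j}\,\mathbb{E}_{\pi\sim\mathcal{P}}[\bar{\pi}_{i}\bar{\pi}_{j}]}_{\mathcal{L}_{2}} + \alpha_{3}\underbrace{\sum_{i,j}^{N} x_{i}x_{j}A_{i,j}\,\mathbb{E}_{\pi\sim\mathcal{P}}[\pi_{i}\bar{\pi}_{j}]}_{\mathcal{L}_{3}}. \tag{1}$$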

As we can observe, the loss function is composed of three components. The first component, $\mathcal{L}_{1}$ , of the loss function is the direct loss associated with removing benign nodes. The second component, $\mathcal{L}_{2}$ , penalizes cutting connections between benign nodes that are removed and benign nodes that remain; in other words, it penalizes the degradation of connectivity within the benign subgraph. The third component of the loss function, $\mathcal{L}_{3}$ , captures the consequence of failing to remove malicious nodes in terms of connections from these to benign nodes. The nonnegative trade-off parameters $\alpha_{1}$ , $\alpha_{2}$ , and $\alpha_{3}$ satisfy $\alpha_{1}+\alpha_{2}+\alpha_{3}=1$ , and weigh the relative importance of the three components of the loss function.
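As a rough illustration of how the three components combine, the sketch below estimates the loss of Eq. (1) from sampled configurations. It assumes that samples of $\mathbf{\pi}$ are available as a 0/1 array; the function name `expected_loss` and its arguments are illustrative and not taken from the paper.

```python
import numpy as np


def expected_loss(x, A, samples, alphas):
    """Sample-based estimate of Eq. (1) for a removal vector x in {-1, +1}^N.

    samples: (M, N) array of 0/1 configurations pi (assumed to be draws from P).
    alphas:  (alpha1, alpha2, alpha3) trade-off weights.
    """
    a1, a2, a3 = alphas
    x = np.asarray(x, dtype=float)
    A = np.asarray(A, dtype=float)
    pi = np.asarray(samples, dtype=float)
    pi_bar = 1.0 - pi
    m = pi.shape[0]

    e_bar = pi_bar.mean(axis=0)          # E[pi_bar_i]
    e_bar_bar = pi_bar.T @ pi_bar / m    # E[pi_bar_i * pi_bar_j]
    e_mal_bar = pi.T @ pi_bar / m        # E[pi_i * pi_bar_j]

    xx = np.outer(x, x)
    l1 = x @ e_bar
    l2 = np.sum(A * xx * e_bar_bar)
    l3 = np.sum(A * xx * e_mal_bar)
    return a1 * l1 - a2 * l2 + a3 * l3


# Path graph 0 - 1 - 2 in which node 0 is known to be malicious.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
samples = np.array([[1, 0, 0]])
alphas = (0.2, 0.3, 0.5)
loss_remove_bad = expected_loss([1, -1, -1], A, samples, alphas)
loss_keep_all = expected_loss([-1, -1, -1], A, samples, alphas)
```

In this small example, removing only the malicious node yields a lower loss than keeping every node, as one would expect from the role of $\mathcal{L}_{3}$.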

The configuration $\mathbf{\pi}$ is a random variable distributed according to $\mathcal{P}$ . Let $\mathbf{\mu}\in\mathbb{R}^{N}$ and $\mathbf{\Sigma}\in\mathbb{R}^{N\times N}$ denote its mean and covariance, respectively. The loss function defined in Eq. (1) depends on both $\mathbf{\mu}$ and $\mathbf{\Sigma}$ . To make the dependency explicit we define several matrices and re-write the loss function in a matrix-vector form. We define the matrices $\mathbf{B}(\mathbf{\mu}),\mathbf{P}(\mathbf{\mu},\mathbf{\Sigma}),\mathbf{M}(\mathbf{\mu},\mathbf{\Sigma})$ as follows, where $diag(\mathbf{x})$ returns a diagonal matrix with diagonal elements equal to $\mathbf{x}$ :

[TABLE]

Note that the elements of these matrices are not constant, but depend on $\mathbf{\mu}$ and $\mathbf{\Sigma}$ (see the appendix for their detailed dependency).

Slightly abusing notation, we define two additional matrices, $\mathbf{Q}(\mathbf{\mu},\mathbf{\Sigma})$ and $\mathbf{b}(\mathbf{\mu})$ . Note that $\mathbf{Q}\in\mathbb{R}^{N\times N}$ is a symmetric matrix:

[TABLE]

and $\mathbf{b}(\mathbf{\mu}):=(\alpha_{1}/2)\mathbf{B}(\mathbf{\mu})\mathbf{1}$ . We can now rewrite the loss function in a compact matrix-vector form as the following:

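$$\mathcal{L}(\mathbf{x};\mathbf{\mu},\mathbf{\Sigma}) = \mathbf{x}^{T}\mathbf{Q}(\mathbf{\mu},\mathbf{\Sigma})\mathbf{x} + 2\mathbf{x}^{T}\mathbf{b}(\mathbf{\mu})$$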

Optimizing the loss function above (as done by Yu & Vorobeychik) critically assumes that the maliciousness distribution $\mathcal{P}$ is known. In reality, this is typically not the case, and such a distribution is estimated from data. Let $\hat{\mathcal{P}}$ denote the estimated distribution. The mean of $\hat{\mathcal{P}}$ is denoted by $\hat{\mathbf{\mu}}$ , where $\hat{\mathbf{\mu}}_{i}$ is the estimated probability that node $i$ is malicious given its features from past data. Similarly, the estimated covariance matrix is represented by $\hat{\mathbf{\Sigma}}$ . The model proposed by Yu & Vorobeychik is called MINT, which is to solve the following optimization problem:

[TABLE]

Although MINT has been shown to perform well on several real-world datasets, its performance is strongly influenced by the estimation error of $\mathbf{\mu}$ . In fact, in Section 5 we show that even a small estimation error can severely undermine the performance of MINT.

In order to mitigate the sensitivity of MINT to estimation error, we propose a novel Distributionally Robust Optimization (DRO) approach for solving the problem posed above. The general idea is to design a distributional set to capture the uncertainty about the estimated mean $\hat{\mathbf{\mu}}$ and make decisions considering the worst-case scenario. Specifically, we propose a model named MINT_DRO, which aims to solve the following optimization problem:

[TABLE]

where the set $\Pi$ captures uncertainty about the true mean $\mathbf{\mu}$ . There are several fundamental differences between MINT_DRO and MINT. First, there is an additional inner maximization problem in MINT_DRO. The inner maximization is optimized over a set $\Pi$ , which contains a set of probability distributions, where $\mathcal{F}$ is any distribution sampled from $\Pi$ , and $\mathbf{\mu}_{\mathcal{F}}$ are random variables distributed according to $\mathcal{F}$ . Inspired by Delage & Ye (2010) and Cheng et al. (2014), we parametrize the set $\Pi$ by the first and second moments of the distributions in it. Specifically, let $\mathcal{F}$ be any distribution in $\Pi$ . Consider the following two constraints:

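$$\begin{aligned}&(\mathbb{E}[\mathbf{\mu}_{\mathcal{F}}]-\hat{\mathbf{\mu}})^{T}\hat{\mathbf{\Sigma}}^{-1}(\mathbb{E}[\mathbf{\mu}_{\mathcal{F}}]-\hat{\mathbf{\mu}})\leq\gamma_{1}\\ &\mathbb{E}\big[(\mathbf{\mu}_{\mathcal{F}}-\hat{\mathbf{\mu}})(\mathbf{\mu}_{\mathcal{F}}-\hat{\mathbf{\mu}})^{T}\big]\preceq\gamma_{2}\hat{\mathbf{\Sigma}},\end{aligned} \tag{2}$$

where $\hat{\mathbf{\mu}}$ and $\hat{\mathbf{\Sigma}}$ are the mean and covariance matrix estimated from data, and $\mathbf{\mu}_{\mathcal{F}}$ are random variables distributed according to $\mathcal{F}$ . The first constraint defines an ellipsoid: the expectation of $\mathcal{F}$ lies in the ellipsoid centered at the estimate $\hat{\mathbf{\mu}}$ . The size of this ellipsoid is determined by $\gamma_{1}$ , which provides a natural measure to quantify our uncertainty about $\mathbf{\mu}$ given $\hat{\mathbf{\mu}}$ . Note that the same ellipsoid also defines the support of the distribution $\mathcal{F}$ (see Section 3). The second constraint enforces the covariance matrix of $\mathcal{F}$ to lie in a positive semi-definite cone. Intuitively, the second constraint captures how likely it is that the random variable $\mathbf{\mu}_{\mathcal{F}}$ is close to $\hat{\mathbf{\mu}}$ . The set $\Pi$ is then characterized by Eq. (3):

$$\Pi(\hat{\mathbf{\mu}},\hat{\mathbf{\Sigma}},\gamma_{1},\gamma_{2}) := \left\{\mathcal{F}\ \middle|\ \begin{array}{l}(\mathbb{E}[\mathbf{\mu}_{\mathcal{F}}]-\hat{\mathbf{\mu}})^{T}\hat{\mathbf{\Sigma}}^{-1}(\mathbb{E}[\mathbf{\mu}_{\mathcal{F}}]-\hat{\mathbf{\mu}})\leq\gamma_{1}\\ \mathbb{E}\big[(\mathbf{\mu}_{\mathcal{F}}-\hat{\mathbf{\mu}})(\mathbf{\mu}_{\mathcal{F}}-\hat{\mathbf{\mu}})^{T}\big]\preceq\gamma_{2}\hat{\mathbf{\Sigma}}\end{array}\right\} \tag{3}$$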

The set $\Pi$ is always non-empty, since it must contain the distribution $\hat{\mathcal{P}}$ . In Section 4 we provide probabilistic arguments to show that $\Pi$ contains the ground-truth distribution $\mathcal{P}$ with high probability, which guarantees that with high probability our model MINT_DRO is robust with respect to the ground-truth distribution $\mathcal{P}$ . The choice of the two parameters $\gamma_{1}$ and $\gamma_{2}$ is important for the robustness of MINT_DRO. If their values are too small, the benefit from the distributionally robust formulation is limited. In the extreme case where $\gamma_{1}$ and $\gamma_{2}$ are zero, our model MINT_DRO reverts to MINT. On the other hand, if their values are too large, our model would make excessively conservative decisions. In Section 4 we show how to make a sensible choice of these values.
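To make the role of $\gamma_{1}$ and $\gamma_{2}$ concrete, the following sketch checks the two moment constraints for a discrete distribution $\mathcal{F}$ over candidate mean vectors. It is an illustration under assumptions, not part of the paper's algorithm: the function name, the representation of $\mathcal{F}$ as `points` with `weights`, and the numerical tolerance are all hypothetical, and the support condition $\mathbf{\mu}_{\mathcal{F}}\in\mathcal{S}$ is not checked.

```python
import numpy as np


def in_uncertainty_set(points, weights, mu_hat, Sigma_hat, gamma1, gamma2, tol=1e-9):
    """Check both moment constraints for a discrete distribution F.

    F places probability weights[k] on the mean vector points[k].
    """
    points = np.asarray(points, dtype=float)
    w = np.asarray(weights, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)
    Sigma_hat = np.asarray(Sigma_hat, dtype=float)

    # First constraint: E[mu_F] lies in the ellipsoid around mu_hat.
    d = w @ points - mu_hat
    first_ok = d @ np.linalg.solve(Sigma_hat, d) <= gamma1 + tol

    # Second constraint: gamma2 * Sigma_hat - E[(mu_F - mu_hat)(mu_F - mu_hat)^T] is PSD.
    centered = points - mu_hat
    second_moment = (centered * w[:, None]).T @ centered
    gap = gamma2 * Sigma_hat - second_moment
    second_ok = np.linalg.eigvalsh((gap + gap.T) / 2).min() >= -tol

    return bool(first_ok and second_ok)


mu_hat = np.array([0.2, 0.7])
Sigma_hat = np.diag([0.04, 0.09])
shift = np.array([0.1, 0.0])
two_point = [mu_hat + shift, mu_hat - shift]
```

A point mass at $\hat{\mathbf{\mu}}$ passes the check even with $\gamma_{1}=\gamma_{2}=0$, which mirrors the remark that MINT_DRO reverts to MINT in that extreme case.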

3 Solution Approach

In this section we derive the algorithm to solve our model MINT_DRO. The optimization problem of MINT_DRO is a binary quadratic program, which is difficult to solve even if the loss function $\mathcal{L}(\mathbf{x};\mathbf{\mu},\mathbf{\Sigma})$ is convex. Additionally, in our problem the loss function is nonconvex since the matrix $\mathbf{Q}$ is usually not positive (semi)-definite, further complicating the situation. Indeed, given that MINT, which was shown by Yu & Vorobeychik to be NP-Hard, is a special case, the following result is immediate.

Theorem 1.

Solving MINT_DRO is NP-Hard.

In what follows, we derive an approximation approach for solving MINT_DRO. We first apply duality to transform the inner maximization into a minimization problem, which can be jointly minimized with the outer minimization over $\mathbf{x}$ . At this stage the optimization problem is still an NP-hard combinatorial optimization problem. Next, we apply Semidefinite Programming (SDP) to obtain a convex relaxation of our problem which can be solved efficiently.

The support of the distributions in $\Pi$ is $\mathcal{S}$ , which is defined as $\mathcal{S}:=\big{\{}\mathbf{\mu}_{\mathcal{F}}\,\big{|}\,(\mathbf{\mu}_{\mathcal{F}}-\hat{\mathbf{\mu}})^{T}\hat{\mathbf{\Sigma}}^{-1}(\mathbf{\mu}_{\mathcal{F}}-\hat{\mathbf{\mu}})\leq\gamma_{1}\big{\}}$ , where the subscript of $\mathbf{\mu}_{\mathcal{F}}$ indexes the distribution associated with this random variable. Note that $\mathbf{\mu}_{\mathcal{F}}\in\mathcal{S}$ is sufficient for the first constraint in Eq. (2) to be true, since $\mathbb{E}[\mathbf{\mu}_{\mathcal{F}}]$ is a convex combination of the instantiations of $\mathbf{\mu}_{\mathcal{F}}$ and $\mathcal{S}$ is a convex set. We rewrite the inner maximization problem as Eq. (4):

[TABLE]

The constraint Eq. (4b) ensures that $\mathcal{F}$ is a valid probability distribution. The constraint Eq. (4c) guarantees that $\mathcal{F}$ is in $\Pi$ . The constraint Eq. (4d) ensures that any random variable $\mathbf{\mu}_{\mathcal{F}}\sim\mathcal{F}$ must reside in $\mathcal{S}$ . Consequently, this constraint is actually an infinite dimensional constraint on the optimizer $\mathcal{F}$ . Later we introduce a technique called S-Lemma to convert it to a finite dimensional constraint. We derive the Lagrangian of Eq. (4), where we temporarily omit constraint Eq. (4d) and pull the terms that are independent of $\mathcal{F}$ out of the integral:

$$l(\mathcal{F},t,\mathbf{K})=\bigg[t+Tr\bigg(\big[\gamma_{2}\hat{\mathbf{\Sigma}}+\hat{\mathbf{\mu}}\hat{\mathbf{\mu}}^{T}\big]\mathbf{K}\bigg)\bigg]+\int_{\mathcal{S}}\bigg[\underbrace{\mathbf{x}^{T}\mathbf{Q}(\mathbf{\mu}_{\mathcal{F}},\hat{\mathbf{\Sigma}})\mathbf{x}+2\mathbf{x}^{T}\mathbf{b}(\mathbf{\mu}_{\mathcal{F}})-t-\mathbf{\mu}_{\mathcal{F}}^{T}\mathbf{K}\mathbf{\mu}_{\mathcal{F}}}_{:=f(\mathbf{\mu}_{\mathcal{F}})}\bigg]\,d\mathcal{F}(\mathbf{\mu}_{\mathcal{F}}),$$

where $t\in\mathbb{R}$ , $\mathbf{K}$ is a real symmetric positive semi-definite matrix, and $Tr(\mathbf{X})$ returns the trace of the matrix $\mathbf{X}$ . We require that $f(\mathbf{\mu}_{\mathcal{F}})\leq 0,\forall\mathbf{\mu}_{\mathcal{F}}\in\mathcal{S}$ holds, since otherwise the solution to Eq. (4) is unbounded.

By duality (Shapiro, 2001; Delage & Ye, 2010; Cheng et al., 2014), the dual problem of Eq. (4) is formulated as the following minimization problem:

[TABLE]

where $\mathbb{S}^{N}_{+}$ is the positive semi-definite cone. Strong duality holds between Eq. (5) and the original inner maximization problem. This is because for any $\gamma_{1}>0$ and $\gamma_{2}>0$ , the estimated distribution $\hat{\mathcal{P}}$ is always in the relative interior of $\Pi$ . Consequently, by Proposition 3.4 in Shapiro (2001) strong duality holds. Since Eq. (5) is a minimization problem, we can jointly minimize it with the outer minimization over $\mathbf{x}$ , which results in the following:

[TABLE]

where constraint Eq. (6b) is equivalent to $\mathbf{\mu}_{\mathcal{F}}\in\mathcal{S},\forall\mathbf{\mu}_{\mathcal{F}}$ . We write it this way in order to emphasize its quadratic form. Constraints Eq.(6b) and (6c) are infinite dimensional constraints. We apply a technique called S-Lemma to transform them to finite dimensional constraints. We first introduce the S-Lemma:

Lemma 3.1 (S-Lemma Boyd & Vandenberghe (2004)).

Let $\mathbf{A}_{1},\mathbf{A}_{2}\in\mathbb{S}^{n}$ , $\mathbf{b}_{1},\mathbf{b}_{2}\in\mathbb{R}^{n}$ , $c_{1},c_{2}\in\mathbb{R}$ , where $\mathbb{S}^{n}$ is the subspace of symmetric matrices in $\mathbb{R}^{n\times n}$ . Suppose there exists an $\hat{\mathbf{x}}\in\mathbb{R}^{n}$ such that: $\hat{\mathbf{x}}^{T}\mathbf{A}_{1}\hat{\mathbf{x}}+2\mathbf{b}_{1}^{T}\hat{\mathbf{x}}+c_{1}<0$ . Then the following implication holds for any $\mathbf{x}\in\mathbb{R}^{n}$ :

$$\mathbf{x}^{T}\mathbf{A}_{1}\mathbf{x}+2\mathbf{b}_{1}^{T}\mathbf{x}+c_{1}\leq 0 \implies \mathbf{x}^{T}\mathbf{A}_{2}\mathbf{x}+2\mathbf{b}_{2}^{T}\mathbf{x}+c_{2}\leq 0$$

if and only if $\exists\lambda\geq 0:\begin{bmatrix}\mathbf{A}_{2}&\mathbf{b}_{2}\\ \mathbf{b}_{2}^{T}&c_{2}\end{bmatrix}\preceq\lambda\begin{bmatrix}\mathbf{A}_{1}&\mathbf{b}_{1}\\ \mathbf{b}_{1}^{T}&c_{1}\end{bmatrix}$ .

Note that S-Lemma only requires $\mathbf{A}_{1}$ and $\mathbf{A}_{2}$ to be real symmetric matrices. In order to apply S-Lemma we need to have two quadratic functions. Constraint Eq. (6b) is a quadratic function in $\mathbf{\mu}_{\mathcal{F}}$ . Thus, what remains is to convert Eq. (6c) to a quadratic function in $\mathbf{\mu}_{\mathcal{F}}$ . Recall that the term, $\mathbf{x}^{T}\mathbf{Q}(\mathbf{\mu}_{\mathcal{F}},\hat{\mathbf{\Sigma}})\mathbf{x}+2\mathbf{x}^{T}\mathbf{b}(\mathbf{\mu}_{\mathcal{F}})$ in $f(\mathbf{\mu}_{\mathcal{F}})$ , is implicitly a quadratic function of $\mathbf{\mu}_{\mathcal{F}}$ . We re-formulate $\mathbf{Q}$ and $\mathbf{b}$ according to $\mathbf{\mu}_{\mathcal{F}}$ , which results in Eq.(7) (see the Appendix for details about this reformulation):

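$$\mathbf{x}^{T}\mathbf{Q}(\mathbf{\mu}_{\mathcal{F}},\hat{\mathbf{\Sigma}})\mathbf{x}+2\mathbf{x}^{T}\mathbf{b}(\mathbf{\mu}_{\mathcal{F}}) = \mathbf{\mu}_{\mathcal{F}}^{T}\Big[-(\alpha_{2}+\alpha_{3})\big(\mathbf{A}\odot(\mathbf{x}\mathbf{x}^{T})\big)\Big]\mathbf{\mu}_{\mathcal{F}} + \mathbf{\mu}_{\mathcal{F}}^{T}\Big[(\alpha_{3}+2\alpha_{2})\,diag(\mathbf{x})\mathbf{A}\mathbf{x}-\alpha_{1}\mathbf{x}\Big] + \Big[\alpha_{1}\mathbf{1}^{T}\mathbf{x}-\mathbf{x}^{T}\big((\alpha_{2}+\alpha_{3})(\mathbf{A}\odot\hat{\mathbf{\Sigma}})+\alpha_{2}\mathbf{A}\big)\mathbf{x}\Big], \tag{7}$$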

where $diag(\mathbf{x})$ returns a diagonal matrix with diagonal elements equal to $\mathbf{x}$ . We substitute Eq. (7) back to $f(\mathbf{\mu}_{\mathcal{F}})$ , which results in the following equivalence:

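$$\forall\mathbf{\mu}_{\mathcal{F}}\in\mathcal{S}: f(\mathbf{\mu}_{\mathcal{F}})\leq 0 \iff \mathbf{\mu}_{\mathcal{F}}^{T}\mathbf{R}\mathbf{\mu}_{\mathcal{F}}+\mathbf{\mu}_{\mathcal{F}}^{T}\mathbf{r}+z\leq 0$$

where:

$$\begin{aligned}\mathbf{R}&=-(\alpha_{2}+\alpha_{3})\big(\mathbf{A}\odot(\mathbf{x}\mathbf{x}^{T})\big)-\mathbf{K}\\ \mathbf{r}&=(\alpha_{3}+2\alpha_{2})\,diag(\mathbf{x})\cdot\mathbf{A}\cdot\mathbf{x}-\alpha_{1}\mathbf{x}\\ z&=\alpha_{1}\mathbf{1}^{T}\mathbf{x}-\mathbf{x}^{T}\big((\alpha_{2}+\alpha_{3})(\mathbf{A}\odot\hat{\mathbf{\Sigma}})+\alpha_{2}\mathbf{A}\big)\mathbf{x}-t,\end{aligned}$$

which results in a compact form of $f(\mathbf{\mu}_{\mathcal{F}})$ :

$$f(\mathbf{\mu}_{\mathcal{F}})=\mathbf{\mu}_{\mathcal{F}}^{T}\mathbf{R}\mathbf{\mu}_{\mathcal{F}}+\mathbf{\mu}_{\mathcal{F}}^{T}\mathbf{r}+z$$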

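Note that for any $\gamma_{1}>0$ the inequality in constraint Eq. (6b) is strict when $\mathbf{\mu}_{\mathcal{F}}=\hat{\mathbf{\mu}}$ . Consequently, by S-Lemma, the implication Eq. (6b) $\implies\mathbf{\mu}_{\mathcal{F}}^{T}\mathbf{R}\mathbf{\mu}_{\mathcal{F}}+\mathbf{\mu}_{\mathcal{F}}^{T}\mathbf{r}+z\leq 0$ is equivalent to Eq. (8):

$$\exists\lambda\geq 0:\begin{bmatrix}\mathbf{R}&\frac{1}{2}\mathbf{r}\\ \frac{1}{2}\mathbf{r}^{T}&z\end{bmatrix}\preceq\lambda\begin{bmatrix}\hat{\mathbf{\Sigma}}^{-1}&-\hat{\mathbf{\Sigma}}^{-1}\hat{\mathbf{\mu}}\\ -\hat{\mathbf{\mu}}^{T}\hat{\mathbf{\Sigma}}^{-1}&\hat{\mathbf{\mu}}^{T}\hat{\mathbf{\Sigma}}^{-1}\hat{\mathbf{\mu}}-\gamma_{1}\end{bmatrix}. \tag{8}$$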

The two infinite dimensional constraints Eq.(6b) and (6c) are thereby converted into a finite dimensional constraint Eq. (8). Additionally, the objective function in Eq. (6) is linear in its optimizer.
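The finite dimensional constraint can be checked numerically for fixed values of $\mathbf{x}$ , $t$ , $\mathbf{K}$ and $\lambda$ . The sketch below builds $\mathbf{R}$ , $\mathbf{r}$ and $z$ from their stated definitions and tests the matrix inequality of Eq. (8) through the smallest eigenvalue of the difference. The function names and the tolerance are assumptions made for illustration; in the actual method these quantities are decision variables of an optimization problem rather than fixed inputs.

```python
import numpy as np


def quadratic_terms(x, A, Sigma_hat, K, t, alphas):
    """Build R, r, z so that f(mu) = mu^T R mu + mu^T r + z."""
    a1, a2, a3 = alphas
    x = np.asarray(x, dtype=float)
    R = -(a2 + a3) * (A * np.outer(x, x)) - K
    r = (a3 + 2 * a2) * (np.diag(x) @ A @ x) - a1 * x
    z = a1 * x.sum() - x @ ((a2 + a3) * (A * Sigma_hat) + a2 * A) @ x - t
    return R, r, z


def lmi_holds(R, r, z, mu_hat, Sigma_hat, gamma1, lam, tol=1e-9):
    """Check [[R, r/2], [r^T/2, z]] <= lam * [[S^-1, -S^-1 mu], [-mu^T S^-1, mu^T S^-1 mu - gamma1]]."""
    S_inv = np.linalg.inv(Sigma_hat)
    v = -S_inv @ mu_hat
    corner = mu_hat @ S_inv @ mu_hat - gamma1
    lhs = np.block([[R, 0.5 * r[:, None]], [0.5 * r[None, :], np.array([[z]])]])
    rhs = np.block([[S_inv, v[:, None]], [v[None, :], np.array([[corner]])]])
    gap = lam * rhs - lhs
    return bool(np.linalg.eigvalsh((gap + gap.T) / 2).min() >= -tol)


# Path graph 0 - 1 - 2 with node 0 removed; all numbers are illustrative.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
x = np.array([1.0, -1.0, -1.0])
mu_hat = np.array([0.5, 0.2, 0.2])
Sigma_hat = 0.05 * np.eye(3)
alphas = (0.2, 0.3, 0.5)
R, r, z = quadratic_terms(x, A, Sigma_hat, 5.0 * np.eye(3), 10.0, alphas)
feasible = lmi_holds(R, r, z, mu_hat, Sigma_hat, gamma1=1.0, lam=0.0)
```

When the inequality holds for some $\lambda\geq 0$ , the quadratic $f$ is non-positive on the whole ellipsoid $\mathcal{S}$ , which is the direction of the S-Lemma that is easy to confirm by sampling.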

The last issue is that we still have two sources of non-convexity in Eq. (6): first, $\mathbf{x}$ is binary, and second, the constraint represented by Eq. (8) is not convex in $\mathbf{x}$ because of three terms appearing in $\mathbf{R}$ , $\mathbf{r}$ and $z$ :

$$\mathbf{x}\mathbf{x}^{T},\quad diag(\mathbf{x})\mathbf{A}\mathbf{x},\quad \mathbf{x}^{T}\big((\alpha_{2}+\alpha_{3})(\mathbf{A}\odot\hat{\mathbf{\Sigma}})+\alpha_{2}\mathbf{A}\big)\mathbf{x}. \tag{9}$$

To deal with the first issue, we relax the feasible region of $\mathbf{x}$ to $[-1,1]^{N}$ . To address the second, we next apply SDP relaxation to transform Eq. (6) into a convex optimization problem.

First, let us introduce a matrix $\mathbf{X}=\mathbf{x}\mathbf{x}^{T}$ . Then the following three relationships hold (see the Appendix for detailed proof):

[TABLE]

One problem is that the feasible regions involving $\mathbf{X}$ and $\mathbf{x}$ are nonconvex because of the equality $\mathbf{X}=\mathbf{x}\mathbf{x}^{T}$ . In order to transform the feasible regions to be convex, we apply a two-step relaxation. The first step is to relax the equality and enforce the diagonal elements of $\mathbf{X}$ equal to one, which results in: $\mathbf{X}\succeq\mathbf{x}\mathbf{x}^{T}$ and $\mathbf{X}_{ii}=1,\forall i=1,\cdots,N$ . This step transforms the feasible region of $\mathbf{X}$ to a positive semi-definite cone, which is a convex set. However, we still have a nonconvex term $\mathbf{x}\mathbf{x}^{T}$ . To handle this, in the second step we apply Schur Complement to transform $\mathbf{X}\succeq\mathbf{x}\mathbf{x}^{T}$ to the linear matrix inequality: $\begin{bmatrix}\mathbf{X}&\mathbf{x}\\ \mathbf{x}^{T}&1\end{bmatrix}\succeq 0$ . Combining the relationships in Eq. (10) with the results of the two-step relaxation above, the three nonconvex terms in Eq. (9) can be represented as the following convex set:
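The lifting and the Schur complement step can be sanity-checked numerically. The sketch below verifies three standard identities that hold when $\mathbf{X}=\mathbf{x}\mathbf{x}^{T}$ , and confirms that the block matrix test agrees with $\mathbf{X}\succeq\mathbf{x}\mathbf{x}^{T}$ . The second and third identities are general linear-algebra facts assumed here to correspond to the relationships the text refers to; the function names are illustrative.

```python
import numpy as np


def lifted_identities_hold(A, M, x):
    """Check three identities that are exact when X = x x^T (M is any symmetric matrix)."""
    X = np.outer(x, x)
    hadamard = np.allclose(A * np.outer(x, x), A * X)           # A ⊙ (x x^T) = A ⊙ X
    linear = np.allclose(np.diag(x) @ A @ x, np.diag(A @ X))    # diag(x) A x = diag(A X)
    scalar = bool(np.isclose(x @ M @ x, np.trace(M @ X)))       # x^T M x = Tr(M X)
    return hadamard, linear, scalar


def schur_block_psd(X, x, tol=1e-9):
    """True when [[X, x], [x^T, 1]] is positive semi-definite."""
    Z = np.block([[X, x[:, None]], [x[None, :], np.ones((1, 1))]])
    return bool(np.linalg.eigvalsh(Z).min() >= -tol)


def dominates_outer(X, x, tol=1e-9):
    """True when X - x x^T is positive semi-definite."""
    return bool(np.linalg.eigvalsh(X - np.outer(x, x)).min() >= -tol)


# Ring graph on five nodes and an arbitrary +/-1 assignment.
N = 5
A = np.zeros((N, N))
for i in range(N):
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1.0
x = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
M = 0.8 * (A * 0.05) + 0.3 * A
identities = lifted_identities_hold(A, M, x)
```

The two tests `schur_block_psd` and `dominates_outer` return the same answer on any input, which is exactly what the Schur complement argument asserts; the relaxed pair $\mathbf{X}=\mathbf{I}$ , $0.3\,\mathbf{x}$ is feasible for the relaxation even though $\mathbf{X}\neq\mathbf{x}\mathbf{x}^{T}$ .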

[TABLE]

With a slight abuse of notation, the operator $diag(\mathbf{A}\mathbf{X})$ in $(\ast)$ extracts the diagonal elements of $\mathbf{A}\mathbf{X}$ as a column vector. Finally, by substituting $\hat{\mathbf{R}}$, $\hat{\mathbf{r}}$ and $\hat{z}$ for the corresponding matrices in Eq. (8) we obtain the following Semidefinite Program, which approximately solves MINT_DRO (after we project the optimal solution $\mathbf{x}$ of this problem into $\{0,1\}^{N}$, for example, by rounding):

[TABLE]
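
The Schur complement step used in the relaxation above rests on a simple algebraic identity: for any vector $\mathbf{v}$ and scalar $t$, the quadratic form of the block matrix equals $\mathbf{v}^{T}(\mathbf{X}-\mathbf{x}\mathbf{x}^{T})\mathbf{v}+(\mathbf{x}^{T}\mathbf{v}+t)^{2}$. Since the squared term is nonnegative and vanishes at $t=-\mathbf{x}^{T}\mathbf{v}$, the block matrix is positive semidefinite exactly when $\mathbf{X}-\mathbf{x}\mathbf{x}^{T}$ is. The following is a minimal numerical sketch of that identity in plain Python; the function names and the random test data are illustrative assumptions and do not come from the paper.

```python
import random


def quad_form(mat, vec):
    """Return vec^T mat vec for a square matrix stored as nested lists."""
    n = len(vec)
    return sum(vec[i] * mat[i][j] * vec[j] for i in range(n) for j in range(n))


def block_matrix(X, x):
    """Assemble the (N+1)x(N+1) matrix [[X, x], [x^T, 1]]."""
    rows = [list(X[i]) + [x[i]] for i in range(len(x))]
    rows.append(list(x) + [1.0])
    return rows


def schur_identity_gap(X, x, v, t):
    """Left side minus right side of
    [v; t]^T [[X, x], [x^T, 1]] [v; t] = v^T (X - x x^T) v + (x^T v + t)^2."""
    n = len(x)
    lhs = quad_form(block_matrix(X, x), list(v) + [t])
    residual = [[X[i][j] - x[i] * x[j] for j in range(n)] for i in range(n)]
    x_dot_v = sum(x[i] * v[i] for i in range(n))
    return lhs - (quad_form(residual, v) + (x_dot_v + t) ** 2)


rng = random.Random(0)  # fixed seed; the data below are purely illustrative
n = 4
x = [rng.uniform(-1.0, 1.0) for _ in range(n)]

# Random symmetric X with unit diagonal, mirroring the constraint X_ii = 1.
X = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i, n):
        X[i][j] = X[j][i] = 1.0 if i == j else rng.uniform(-1.0, 1.0)

# The identity should hold for every (v, t), up to floating-point error.
max_gap = max(
    abs(schur_identity_gap(X, x, [rng.gauss(0.0, 1.0) for _ in range(n)], rng.gauss(0.0, 1.0)))
    for _ in range(200)
)
print("largest violation of the identity:", max_gap)
```

The check does not solve the SDP; it only illustrates why replacing $\mathbf{X}\succeq\mathbf{x}\mathbf{x}^{T}$ by the linear matrix inequality loses nothing.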

4 Theoretical Analysis

In this section we present a probabilistic argument that the uncertainty set $\Pi$ defined in Eq. (3) contains the ground-truth distribution $\mathcal{P}$ with high probability. This, in turn, implies that with high probability our model MINT_DRO is robust with respect to the unknown ground-truth distribution.

We show that the ground-truth distribution $\mathcal{P}$ belongs to $\Pi$ with high probability in two steps, arguing first that (C1) and, subsequently, that (C2) below hold with high probability, where (C1) and (C2) are defined as follows:

[TABLE]

The arguments in the first step are based on Lemma 4.1. Due to space limitations, we defer its proof to the appendix.

Lemma 4.1.

Let $\mathbf{\mu}$ and $\mathbf{\Sigma}$ denote the mean and covariance matrix of the ground-truth distribution $\mathcal{P}$ , and suppose that $\hat{\mathbf{\mu}}$ is estimated from $M$ samples, $\hat{\mathbf{\mu}}=\frac{1}{M}\sum_{i=1}^{M}{\zeta_{i}}$ , where $\zeta_{i}$ is bounded: ${\|\mathbf{\Sigma}^{-1/2}(\zeta_{i}-\mathbf{\mu})\|}_{2}^{2}\leq R^{2},\forall i$ . Then $\hat{\mathbf{\mu}}$ satisfies the following constraint with probability at least $1-\delta_{1}$ :

[TABLE]

where $\beta(\delta_{1})=\frac{R^{2}}{M}\bigg{(}2+\sqrt{2\log{\frac{1}{\delta_{1}}}}\bigg{)}^{2}$ .

We assume the estimated covariance matrix $\hat{\mathbf{\Sigma}}$ is close to $\mathbf{\Sigma}$ . Then, if we let $\gamma_{1}>\beta(\delta_{1})$ and note that $\mathbf{\mu}=\mathbb{E}_{\mathbf{\pi}\sim\mathcal{P}}[\mathbf{\pi}]$ , a direct application of Lemma 4.1 implies that (C1) holds with probability at least $1-\delta_{1}$ .

The arguments in the second step rely on the result due to Delage & Ye (2010):

Lemma 4.2 (Delage & Ye (2010)).

Suppose that $\zeta_{i}$ is distributed according to $\mathcal{G}$ , and the mean $\mathbf{\mu}$ of the distribution is known and used to formulate the estimated covariance matrix $\hat{\mathbf{\Sigma}}$ , which is estimated from $M$ samples: $\hat{\mathbf{\Sigma}}=(1/M)\sum_{i=1}^{M}{\big{(}\zeta_{i}-\mathbf{\mu}\big{)}\big{(}\zeta_{i}-\mathbf{\mu}\big{)}^{T}}$ , where $\zeta_{i}$ is bounded: ${\|\mathbf{\Sigma}^{-1/2}(\zeta_{i}-\mathbf{\mu})\|}_{2}^{2}\leq R^{2},\forall i$ . Then with probability at least $1-\delta_{2}$ :

[TABLE]

where $\alpha(\delta_{2})=(R^{2}/\sqrt{M})\bigg{(}\sqrt{1-N/R^{4}}+\sqrt{\log{1/\delta_{2}}}\bigg{)}$, $M>R^{4}\bigg{(}\sqrt{1-N/R^{4}}+\sqrt{\log{1/\delta_{2}}}\bigg{)}^{2}$ and $N$ is the dimension of $\mathbf{\mu}$.

In order to use Lemma 4.2 we assume that the estimated mean $\hat{\mathbf{\mu}}$ is close to the ground-truth $\mathbf{\mu}$. Given this assumption, showing that (C2) holds with high probability is equivalent to showing that the following holds with high probability:

[TABLE]

by Lemma 4.2, the above is true with high probability when: $\frac{1}{1-\alpha(\delta_{2})}\hat{\mathbf{\Sigma}}\preceq\gamma_{2}\hat{\mathbf{\Sigma}}+\mathbf{\mu}\mathbf{\mu}^{T}$ . Consequently, by setting $\gamma_{2}>\frac{1}{1-\alpha(\delta_{2})}$ , such that the effects of $\mathbf{\mu}\mathbf{\mu}^{T}$ are negligible, we conclude that (C2) holds with probability at least $1-\delta_{2}$ .
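
The quantities in Lemma 4.2 are straightforward to evaluate. The sketch below computes $\alpha(\delta_{2})$, the sample-size requirement on $M$, and the resulting lower bound $1/(1-\alpha(\delta_{2}))$ on $\gamma_{2}$. The function names and the numerical inputs ($R^{2}=4$, $N=2$, $M=1000$, $\delta_{2}=0.05$) are hypothetical values chosen only for illustration; they are not taken from the experiments.

```python
import math


def _inner_term(r_squared, dim, delta):
    """sqrt(1 - N / R^4) + sqrt(log(1 / delta)), shared by alpha and the bound on M."""
    return math.sqrt(1.0 - dim / r_squared ** 2) + math.sqrt(math.log(1.0 / delta))


def alpha(r_squared, dim, num_samples, delta):
    """alpha(delta) = (R^2 / sqrt(M)) * (sqrt(1 - N/R^4) + sqrt(log(1/delta)))."""
    return (r_squared / math.sqrt(num_samples)) * _inner_term(r_squared, dim, delta)


def min_samples(r_squared, dim, delta):
    """Lemma 4.2 requires M to be strictly larger than this value."""
    return r_squared ** 2 * _inner_term(r_squared, dim, delta) ** 2


# Hypothetical inputs, for illustration only.
r_squared, dim, num_samples, delta_2 = 4.0, 2, 1000, 0.05

a = alpha(r_squared, dim, num_samples, delta_2)
m_required = min_samples(r_squared, dim, delta_2)
gamma_2_lower = 1.0 / (1.0 - a)  # gamma_2 must exceed this value

print("alpha:", a)
print("M must exceed:", m_required)
print("gamma_2 must exceed:", gamma_2_lower)
```

Note that the condition on $M$ is the same as requiring $\alpha(\delta_{2})<1$, which is what keeps $1/(1-\alpha(\delta_{2}))$ positive and finite.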

Finally, by a union bound we obtain probabilistic guarantees that the uncertainty set $\Pi$ contains $\mathcal{P}$ .

Theorem 2.

With probability at least $1-\delta$ , where $\delta=\delta_{1}+\delta_{2}$ , the uncertainty set $\Pi$ defined in Eq. (3) contains the ground-truth distribution $\mathcal{P}$ .

Proof.

The detailed proof is deferred to the appendix. ∎

We now demonstrate how to utilize the probabilistic arguments to make a sensible choice for $\gamma_{1}$. The value of $\gamma_{2}$ can be similarly obtained. Note that $\gamma_{1}>\beta(\delta_{1})$ is necessary for (C1) to hold. Consider a network with $N=128$ nodes. Assume $\mathbf{\Sigma}$ is diagonal with diagonal elements equal to $0.01$, which is reasonable when a single estimator is used to estimate $\mathcal{P}$ and the maliciousness probabilities of nodes are independent. A reasonable estimate of $R$ is $\sqrt{128\times 2}$, which is the radius of the circumscribed sphere of a hypercube with side length equal to one. If $M=5$ and $\delta_{1}=0.05$, then $\beta(0.05)\approx 1012$. Therefore in order for $\Pi$ to contain $\mathcal{P}$ with probability $\geq 0.95$, we need $\gamma_{1}\geq 1012$. Similarly, for a network with $N=500$ nodes, we want $\gamma_{1}\geq 3956$.
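
The worked example above can be reproduced directly from the formula for $\beta(\delta_{1})$ in Lemma 4.1. The sketch below assumes $R^{2}=2N$, as in the text, together with $M=5$ and $\delta_{1}=0.05$, and takes $\log$ to be the natural logarithm; the function name `beta` is ours.

```python
import math


def beta(r_squared, num_samples, delta):
    """beta(delta) = (R^2 / M) * (2 + sqrt(2 * log(1 / delta)))^2."""
    return (r_squared / num_samples) * (2.0 + math.sqrt(2.0 * math.log(1.0 / delta))) ** 2


num_samples, delta_1 = 5, 0.05
thresholds = {}
for num_nodes in (128, 500):
    r_squared = 2 * num_nodes  # R = sqrt(2N), the estimate used in the text
    thresholds[num_nodes] = beta(r_squared, num_samples, delta_1)
    print(num_nodes, thresholds[num_nodes])
```

Under these assumptions the computed values truncate to $1012$ and $3956$, consistent with the thresholds quoted above.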

5 Experiments

In this section we present experimental results to show the effectiveness of our approach. Our experiments were conducted on both synthetic and real-world network structures, although in all cases the distribution $\mathcal{P}$ over maliciousness of nodes was derived using real data. We considered two types of network generative models to construct synthetic networks: Barabási-Albert (BA) Barabási & Albert (1999) and Watts-Strogatz networks (Small-World) Watts & Strogatz (1998). BA is characterized by its power-law degree distribution, where the probability that a randomly selected node has $k$ neighbors is proportional to $k^{-r}$. For the BA model we experimented with three variants, BA-1, BA-2, and BA-3, which differ in the value of the exponent $r$ of their power-law degree distributions. For Small-World networks we also experimented with three variants, SW-1, SW-2, and SW-3, that have different local clustering coefficients. For both network types we generated instances with $N=128$ nodes. For real-world networks, we used a network extracted from Facebook data Leskovec & Mcauley (2012), which consisted of $4039$ nodes and $88234$ edges. We experimented with randomly sampled sub-networks with $N=500$ nodes. Due to space limitations, the statistics of the networks used in our experiments are listed in the appendix.

For a fair comparison with MINT (the state-of-the-art alternative), we used the same experimental setup as Yu & Vorobeychik (2018). In all of our experiments, we derived the ground-truth distribution $\mathcal{P}$ as follows. We start with a dataset $\mathbf{D}$ which includes malicious and benign instances (the meaning of these designations is domain specific). The dataset $\mathbf{D}$ is partitioned into three subsets: $\mathbf{D}_{train}$, $\mathbf{D}_{1}$ and $\mathbf{D}_{2}$, with the ratio of $0.3:0.6:0.1$. Our first step is to learn a probabilistic predictor of maliciousness as a function of a feature vector $\mathbf{x}$, $\hat{p}(\mathbf{x})$, on $\mathbf{D}_{train}$. Then we randomly assign malicious and benign feature vectors from $\mathbf{D}_{2}$ to the nodes on the network, assigning $10\%$ of nodes malicious feature vectors and $90\%$ benign feature vectors. For each node we use its assigned feature vector $\mathbf{x}$ to obtain our estimated probability of this node being malicious, $\hat{p}(\mathbf{x})$. This gives us the estimated maliciousness probability distribution $\hat{\mathcal{P}}$. This is the distribution used to solve the model MINT, and also the distribution used to construct the uncertainty set $\Pi$ in our model. To ensure that our evaluation reasonably reflects realistic limitations of the knowledge about the ground-truth distribution $\mathcal{P}$, we train another predictor $p(\mathbf{x})$ using $\mathbf{D}_{train}\bigcup\mathbf{D}_{1}$. Applying this new predictor to the nodes and their assigned feature vectors, we obtain a distribution $\mathcal{P}^{\ast}$ which we use to evaluate effectiveness.
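
The partitioning and node-assignment steps described above can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: the toy dataset, the function names, the fixed random seeds, and the choice to sample feature vectors from $\mathbf{D}_{2}$ with replacement are all assumptions, and the training of the predictors $\hat{p}$ and $p$ is omitted.

```python
import random


def partition(dataset, ratios=(0.3, 0.6, 0.1), seed=0):
    """Shuffle the data and split it into (D_train, D_1, D_2) with the given ratios."""
    rng = random.Random(seed)
    shuffled = list(dataset)
    rng.shuffle(shuffled)
    cut1 = round(ratios[0] * len(shuffled))
    cut2 = cut1 + round(ratios[1] * len(shuffled))
    return shuffled[:cut1], shuffled[cut1:cut2], shuffled[cut2:]


def assign_features(num_nodes, d2, malicious_frac=0.1, seed=0):
    """Assign each node a feature vector from D_2; a fixed fraction of nodes is malicious."""
    rng = random.Random(seed)
    malicious_pool = [feat for feat, label in d2 if label == 1]
    benign_pool = [feat for feat, label in d2 if label == 0]
    nodes = list(range(num_nodes))
    rng.shuffle(nodes)
    malicious_nodes = set(nodes[: round(malicious_frac * num_nodes)])
    # Assumption: feature vectors are drawn with replacement from the matching pool.
    features = {
        v: rng.choice(malicious_pool if v in malicious_nodes else benign_pool)
        for v in range(num_nodes)
    }
    return features, malicious_nodes


# Toy dataset of (feature_vector, label) pairs, with label 1 meaning malicious.
toy_dataset = [((i, i % 7), i % 2) for i in range(1000)]

d_train, d_1, d_2 = partition(toy_dataset)
features, malicious_nodes = assign_features(128, d_2)
print(len(d_train), len(d_1), len(d_2), len(malicious_nodes))
```

In the setup described above, $\hat{p}$ would then be fit on `d_train` and $p$ on `d_train + d_1`, with both predictors applied to the same `features` mapping.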

We conducted three sets of experiments. In the first set of experiments we used synthetic networks and used data from the Spam Cormack et al. (2008) dataset. To simulate estimation error of $\mathcal{P}$, we add white Gaussian noise to the evaluation distribution $\mathcal{P}^{\ast}$. The standard deviation of the noise is increased from $0.1$ to $0.5$ to simulate different magnitudes of the estimation error.
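
A minimal sketch of the noise injection is given below. The text does not specify how perturbed values outside $[0,1]$ are handled or which intermediate noise levels are used, so the clipping step and the grid $0.1, 0.2, \ldots, 0.5$ are assumptions, as are the toy probabilities and the function name.

```python
import random


def add_white_gaussian_noise(probs, sigma, rng):
    """Perturb each probability with independent N(0, sigma^2) noise.

    Assumption: results are clipped to [0, 1] so they remain valid probabilities.
    """
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in probs]


rng = random.Random(0)  # fixed seed for reproducibility of the illustration
p_star = [0.05, 0.10, 0.90, 0.20]  # toy per-node maliciousness probabilities
noise_levels = [0.1, 0.2, 0.3, 0.4, 0.5]  # assumed grid between 0.1 and 0.5

noisy = {sigma: add_white_gaussian_noise(p_star, sigma, rng) for sigma in noise_levels}
for sigma in noise_levels:
    print(sigma, noisy[sigma])
```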

In the second set of experiments we used real-world networks from Facebook and used Hate Speech data Davidson et al. (2017) collected from Twitter to obtain $\mathcal{P}$ as discussed above. We categorized this dataset into two classes in terms of whether a tweet represents Hate Speech. After categorization, the total number of tweets is $24783$ , of which $1430$ are Hate Speech. We add white Gaussian noise to $\mathcal{P}^{\ast}$ to simulate estimation error as discussed above. Note that in this set of experiments we used real data for both the networks and the maliciousness probabilities $\mathcal{P}$ .

In the third set of experiments we considered the scenario that instead of being random, the location of the malicious nodes on the network is strategically determined. This scenario is not vacuous: in reality, for example, the nodes that have high degrees (e.g., celebrities with lots of followers on Twitter) may be targeted in order to maximize the influence of commercial advertisements Kempe et al. (2003). We conducted this set of experiments on synthetic networks. A set of nodes is greedily selected from the network to maximize the number of unique neighbors connecting to them. Then we assign malicious feature vectors to these nodes.
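
The greedy placement described above can be sketched as a standard greedy coverage heuristic: at each step, pick the node that adds the most neighbors not yet covered by previously selected nodes. The adjacency representation, the tie-breaking rule (smallest node id), and the toy graph are assumptions made for illustration.

```python
def greedy_max_neighbors(adjacency, budget):
    """Greedily select up to `budget` nodes to maximize the number of unique neighbors.

    `adjacency` maps each node to the set of its neighbors.
    Ties are broken in favor of the smallest node id (an assumption).
    """
    selected = []
    covered = set()
    candidates = set(adjacency)
    for _ in range(min(budget, len(adjacency))):
        # max() returns the first maximal element, so sorting fixes the tie-break.
        best = max(sorted(candidates), key=lambda v: len(adjacency[v] - covered))
        selected.append(best)
        covered |= adjacency[best]
        candidates.remove(best)
    return selected, covered


# Toy graph: two disjoint stars with hubs 0 and 5.
toy_graph = {
    0: {1, 2, 3, 4},
    1: {0}, 2: {0}, 3: {0}, 4: {0},
    5: {6, 7, 8},
    6: {5}, 7: {5}, 8: {5},
}

selected, covered = greedy_max_neighbors(toy_graph, budget=2)
print(selected, sorted(covered))
```

On this toy graph the two hubs are chosen, after which malicious feature vectors would be assigned to the selected nodes.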

Experiment Results

We compared our model with the state-of-the-art approach, MINT. The average losses for our first set of experiments, where $\mathcal{P}$ was simulated from Spam data, are shown in Figures 2 and 3. The experimental results on BA are shown in Figure 2, with the three columns corresponding to BA-1, BA-2 and BA-3, respectively. The experimental results on Small-World are shown in Figure 3, where the three columns correspond to SW-1, SW-2, and SW-3. In both figures, each row corresponds to a combination of trade-off parameters $(\alpha_{1},\alpha_{2},\alpha_{3})$; for example, $(0.2,0.7,0.1)$ corresponds to $(\alpha_{1}=0.2,\alpha_{2}=0.7,\alpha_{3}=0.1)$. Each bar was obtained by averaging over $30$ randomly generated network topologies.

The experimental results indicate that on both BA and Small-World networks our model MINT_DRO is significantly more robust than MINT. Additionally, when no noise is added to the evaluation distribution $\mathcal{P}^{\ast}$ (left-most bars in all subplots), MINT_DRO is more robust than MINT except for a few cases. This indicates that the generalization ability of MINT_DRO is better than that of MINT.

The average loss on Facebook data is shown in Figure 4, with the three columns corresponding to $(0.2,0.7,0.1)$, $(0.7,0.2,0.1)$, and $(\frac{1}{3},\frac{1}{3},\frac{1}{3})$. In this experiment, both the networks and the data used to simulate maliciousness probabilities are real data. Each bar was averaged over $30$ randomly sampled networks. Our model MINT_DRO is significantly more robust than MINT except for the cases where no noise is added. In this case, MINT_DRO is only worse than MINT at the left-most bars in the middle figure, although the difference is not significant. This is actually expected since MINT_DRO considers the worst-case scenario, which results in a decision that may be slightly conservative in the no-noise setting. One observation is that the Facebook networks used in this experiment are dramatically different from the simulated networks in terms of graph statistics (see the appendix for the detailed statistics). In particular, the Facebook networks are disconnected, highly sparse, and have approximately $16\%$ of nodes with zero degree. Therefore the robustness exhibited in Figure 4 provides strong evidence of the effectiveness of MINT_DRO.

The average losses on the third set of experiments are shown in Figures 5 and 6 for BA and Small-World networks, respectively. For both figures the three columns correspond to $(0.2,0.7,0.1)$, $(0.7,0.2,0.1)$, and $(\frac{1}{3},\frac{1}{3},\frac{1}{3})$. The results show that MINT_DRO is more robust than MINT across all settings. Recall that the loss function of MINT and MINT_DRO depends on the estimated covariance matrix $\hat{\mathbf{\Sigma}}$, which encodes correlation information of the distribution $\hat{\mathcal{P}}$. When the actual maliciousness of nodes becomes correlated, as we simulated in this experiment, the performance of MINT degrades since it is using the estimated distribution $\hat{\mathcal{P}}$, which now significantly deviates from the true distribution. When $\gamma_{1}$ and $\gamma_{2}$ are appropriately selected, $\Pi$ contains the distribution that characterizes the strategic correlation simulated in this experiment, resulting in significantly better robustness.

One may argue that the robustness exhibited in Figures 5 and 6 does not result from robustness against the correlation in the maliciousness distribution induced by strategic placement of the malicious nodes, but stems solely from the fact that MINT_DRO is more robust than MINT when no noise is added to $\mathcal{P}^{\ast}$. However, consider the left-most bars in the lower-left subplot of Figure 2. In this setting MINT_DRO performs worse than MINT. Now, consider another setting where the experimental setup is identical except that the malicious nodes are strategically chosen. This setting corresponds to the left-most bars in the right subplot of Figure 5, where MINT_DRO performs better than MINT. Similar observations can be found on Small-World networks. Consequently, we can see that a major advantage of MINT_DRO is its robustness even when the location of the malicious nodes on the graph is itself chosen strategically.

6 Conclusion

We considered the problem of removing malicious nodes from a network under uncertainty. We designed a model that considers the uncertainty around the estimated maliciousness probabilities, and makes decisions under the worst-case scenario. We then proposed a principled algorithmic technique for solving it approximately based on duality combined with Semidefinite Programming relaxation. We theoretically proved that our model is robust with respect to the ground-truth, and experimentally showed that our model is more robust than the state of the art.

Appendix

1 Proof of Lemma 4.1

The proof is a generalization of a result proved by Shawe-Taylor & Cristianini. For completeness we list their result in Lemma 1.1.

Lemma 1.1 (Shawe-Taylor & Cristianini (2003)).

Assume $\zeta\in\mathbb{R}^{N}$ is a random variable satisfying:

[TABLE]

where the last inequality bounds the support of $\zeta$. Let $\{\zeta_{i}\}_{i=1}^{M}$ be a set of $M$ independently and randomly sampled instances of $\zeta$. Then with probability at least $(1-\delta)$, the following inequality holds:

[TABLE]

In what follows we prove Lemma 4.1:

Lemma 4.1.

Let $\mathbf{\mu}$ and $\mathbf{\Sigma}$ denote the mean and covariance matrix of the ground-truth distribution $\mathcal{P}$ , and suppose that $\hat{\mathbf{\mu}}$ is estimated from $M$ samples, $\hat{\mathbf{\mu}}=\frac{1}{M}\sum_{i=1}^{M}{\zeta_{i}}$ , where $\zeta_{i}$ is bounded: ${\|\mathbf{\Sigma}^{-1/2}(\zeta_{i}-\mathbf{\mu})\|}_{2}^{2}\leq R^{2},\forall i$ . Then $\hat{\mathbf{\mu}}$ satisfies the following constraint with probability at least $1-\delta_{1}$ :

[TABLE]

where $\beta(\delta_{1})=\frac{R^{2}}{M}\bigg{(}2+\sqrt{2\log{\frac{1}{\delta_{1}}}}\bigg{)}^{2}$ .

Proof.

Apply a standardization to the $\zeta_{i}$, which results in a new random variable $\gamma_{i}:=\mathbf{\Sigma}^{-1/2}(\zeta_{i}-\mathbf{\mu})$. It is clear that $\gamma_{i}$ satisfies the conditions of Lemma 1.1. Let $\beta(\delta_{1})=\frac{R^{2}}{M}\bigg{(}2+\sqrt{2\log{\frac{1}{\delta_{1}}}}\bigg{)}^{2}$; then we have:

[TABLE]

∎

2 Proof of Theorem 2

Theorem 2.

With probability at least $1-\delta$ , where $\delta=\delta_{1}+\delta_{2}$ , the uncertainty set $\Pi$ contains the ground-truth distribution $\mathcal{P}$ .

Proof.

We define two events $A_{1}$ and $A_{2}$ as follows:

[TABLE]

Then we have:

[TABLE]

where $A_{1}\cap A_{2}$ is the event that $\mathcal{P}\in\Pi$ . In other words, $\mathbb{P}\big{(}\mathcal{P}\in\Pi\big{)}\geq 1-\delta$ , which completes the proof. ∎

3 Detailed dependency of $\mathbf{B}(\mathbf{\mu}),\mathbf{P}(\mathbf{\mu},\mathbf{\Sigma}),\mathbf{M}(\mathbf{\mu},\mathbf{\Sigma})$ on their arguments

In the following we expand the definition of $\mathbf{B}(\mathbf{\mu}),\mathbf{P}(\mathbf{\mu},\mathbf{\Sigma}),\mathbf{M}(\mathbf{\mu},\mathbf{\Sigma})$ , which makes their dependency on $\mathbf{\mu}$ and $\mathbf{\Sigma}$ clear:

[TABLE]

4 Detailed forms of the matrices $\mathbf{Q}(\mathbf{\mu},\mathbf{\Sigma})$ and $\mathbf{b}(\mathbf{\mu})$

The matrices $\mathbf{Q}(\mathbf{\mu},\mathbf{\Sigma})$ and $\mathbf{b}(\mathbf{\mu})$ defined in the paper have the following forms:

[TABLE]

5 Detailed reformulation of Eq. (7)

In the paper, in order to apply the S-Lemma to convert the two infinite dimensional constraints, Eq. (6b) and Eq. (6c), to a finite dimensional constraint, we need two functions in quadratic form. Notice that Eq. (6b) is already a quadratic function in $\mathbf{\mu}_{\mathcal{F}}$. So what remains is to convert Eq. (6c) to a quadratic function in $\mathbf{\mu}_{\mathcal{F}}$. We first convert the following to a quadratic function in $\mathbf{\mu}_{\mathcal{F}}$:

[TABLE]

From the last section we know:

[TABLE]

The three terms ①, ② and ③, together with $\mathbf{x}$, form three quadratic functions in $\mathbf{x}$. In what follows, we convert them to quadratic functions in $\mathbf{\mu}_{\mathcal{F}}$. Note that the operator $diag(\mathbf{x})$ returns a diagonal matrix with diagonal elements equal to $\mathbf{x}$:

[TABLE]

where $(\ast)$ comes from the fact that $\mathbf{x}^{T}[\mathbf{A}\odot\mathbf{B}]\mathbf{x}=Tr[diag(\mathbf{x})\cdot\mathbf{A}\cdot diag(\mathbf{x})\cdot\mathbf{B}^{T}]$ . Similarly we have:

[TABLE]

and:

[TABLE]

where $(\diamond)$ comes from the following:

[TABLE]

Putting the above derivation together we obtain:

[TABLE]

So the function $f(\mathbf{\mu}_{\mathcal{F}})$ becomes:

[TABLE]

which is a quadratic function in $\mathbf{\mu}_{\mathcal{F}}$ . Define $\mathbf{R},\mathbf{r}$ and $z$ as the following:

[TABLE]

which results in a compact form of $f(\mathbf{\mu}_{\mathcal{F}})$ :

[TABLE]
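
The identity $(\ast)$ used in the derivation above, $\mathbf{x}^{T}[\mathbf{A}\odot\mathbf{B}]\mathbf{x}=Tr[diag(\mathbf{x})\cdot\mathbf{A}\cdot diag(\mathbf{x})\cdot\mathbf{B}^{T}]$, can be checked numerically. The sketch below does so in plain Python on random matrices; the helper names and the random test data are illustrative assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def hadamard_quadratic(x, A, B):
    """x^T (A ⊙ B) x, with ⊙ the elementwise (Hadamard) product."""
    n = len(x)
    return sum(x[i] * A[i][j] * B[i][j] * x[j] for i in range(n) for j in range(n))


def trace_form(x, A, B):
    """Tr[diag(x) A diag(x) B^T]."""
    n = len(x)
    D = [[x[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    B_transposed = [[B[j][i] for j in range(n)] for i in range(n)]
    product = matmul(matmul(matmul(D, A), D), B_transposed)
    return sum(product[i][i] for i in range(n))


rng = random.Random(0)  # fixed seed; the matrices are arbitrary test data
n = 5
x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
B = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]

difference = abs(hadamard_quadratic(x, A, B) - trace_form(x, A, B))
print("difference between the two sides:", difference)
```

The matrices here are deliberately not symmetric, since the identity does not require symmetry.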

6 Proof of Eq.(10) in the paper

The relation $(r1)$ is direct. To see why $(r2)$ holds, note that the $i$ -th element of $diag(\mathbf{x})\cdot\mathbf{A}\cdot\mathbf{x}$ is:

[TABLE]

which is equal to the $i$ -th element of $diag(\mathbf{A}\mathbf{X})$ :

[TABLE]

The relation $(r3)$ holds because:

[TABLE]
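
Relation $(r2)$ as stated above says that $diag(\mathbf{x})\cdot\mathbf{A}\cdot\mathbf{x}$ and $diag(\mathbf{A}\mathbf{X})$ agree elementwise when $\mathbf{X}=\mathbf{x}\mathbf{x}^{T}$. A minimal numerical sketch follows; the function name and the random test data are assumptions made for illustration.

```python
import random


def r2_max_gap(x, A):
    """Largest elementwise gap between diag(x) A x and diag(A X), where X = x x^T."""
    n = len(x)
    X = [[x[i] * x[j] for j in range(n)] for i in range(n)]
    # i-th element of diag(x) A x: x_i * sum_j A_ij x_j
    lhs = [x[i] * sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    # i-th diagonal element of A X: sum_j A_ij X_ji
    rhs = [sum(A[i][j] * X[j][i] for j in range(n)) for i in range(n)]
    return max(abs(l - r) for l, r in zip(lhs, rhs))


rng = random.Random(0)  # fixed seed; arbitrary test data
n = 6
A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]

x_binary = [rng.choice([-1.0, 1.0]) for _ in range(n)]     # a vertex of [-1, 1]^N
x_relaxed = [rng.uniform(-1.0, 1.0) for _ in range(n)]     # a point in the relaxed region

gap_binary = r2_max_gap(x_binary, A)
gap_relaxed = r2_max_gap(x_relaxed, A)
print(gap_binary, gap_relaxed)
```

The relation holds for any real $\mathbf{x}$ as long as $\mathbf{X}=\mathbf{x}\mathbf{x}^{T}$ exactly; it is the later relaxation of this equality that makes the SDP an approximation.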

7 Statistics of the networks used in experiments

Bibliography


Allcott, H. and Gentzkow, M. Social media and fake news in the 2016 election. Journal of Economic Perspectives, 31(2):211–36, 2017.
Andrade, V. Facebook, WhatsApp step up efforts in Brazil’s fake news battle. Bloomberg, 2018. URL https://www.bloomberg.com/news/articles/2018-10-23/facebook-whatsapp-step-up-efforts-in-brazil-s-fake-news-battle
Arias-Castro, E., Candes, E. J., and Durand, A. Detection of an anomalous cluster in a network. The Annals of Statistics, pp. 278–304, 2011.
Barabási, A.-L. and Albert, R. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
Bertsimas, D. and Sethuraman, J. Moment problems and semidefinite optimization. In Handbook of Semidefinite Programming, pp. 469–509. Springer, 2000.
Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
Calafiore, G. C. and El Ghaoui, L. On distributionally robust chance-constrained linear programs. Journal of Optimization Theory and Applications, 130(1):1–22, 2006.
Cheng, J., Delage, E., and Lisser, A. Distributionally robust stochastic knapsack problem. SIAM Journal on Optimization, 24(3):1485–1506, 2014.
