Vertex Nomination, Consistent Estimation, and Adversarial Modification

Joshua Agterberg, Youngser Park, Jonathan Larson, Christopher White, Carey E. Priebe, and Vince Lyzinski

arXiv:1905.01776·stat.ML·April 15, 2020


TL;DR

This paper develops a theoretical framework for vertex nomination, introduces an adversarial contamination model, and proposes a regularization method to improve robustness against network contamination.

Contribution

It defines Bayes optimality and consistency classes for vertex nomination, and introduces a novel adversarial contamination model with mitigation strategies.

Findings

1. VN schemes perform well in uncontaminated networks
2. Adversarial contamination degrades VN performance
3. Regularization improves robustness in contaminated networks


Table 1: Table of frequently used notation

Notation	Description
$[k]$	The set of integers $\{1, 2, 3, \dots, k\}$
$G = (V, E)$	A (random) graph with vertex set $V$ and edge set $E$
$G_{1} = (V_{1}, E_{1})$, $G_{2} = (V_{2}, E_{2})$	Two random graphs with a presumed shared set of vertices
$C$	A core set of vertices shared between two graphs
$J_{1}$, $J_{2}$	Junk vertices not shared between graphs
$\mathcal{G}_{n}$	The set of $n$-vertex labeled graphs
$F_{c, \theta}^{(n, m)}$	A nominatable distribution on $\mathcal{G}_{n} \times \mathcal{G}_{m}$ with $c$ shared vertices and parameter $\theta$
$\mathcal{N}_{n, m}$	The set of nominatable distributions on $\mathcal{G}_{n} \times \mathcal{G}_{m}$
$g, g_{1}, g_{2}$	Observed graphs
$V^{*}$	A vertex set of interest shared between two graphs
$v^{*}$	A single vertex of interest
$\mathfrak{o}$	An obfuscation function changing observed vertex labels
$\mathfrak{O}_{W}$	The set of obfuscating functions mapping a vertex set to $W$
$\mathcal{T}_{W}$	The set of total orderings of the elements of a set $W$
$\mathcal{I}(u; g)$	The set of vertices in $g$ topologically equivalent to $u$
$\Phi(g_{1}, \mathfrak{o}(g_{2}), V^{*})$	A vertex nomination scheme with vertex set of interest $V^{*}$ and observed graphs $g_{1}$ and $\mathfrak{o}(g_{2})$
$\mathfrak{r}_{\Phi}(g_{1}, g_{2}, \mathfrak{o}, V^{*}, S)$	The set of ranks of a set $S$ under $\Phi(g_{1}, \mathfrak{o}(g_{2}), V^{*})$

Equations

$$y=\frac{\text{mean \# of v.o.i. with corresp. v.o.i. ranked in top } x \text{ by } \text{VN}\circ\text{GMM}\circ\text{ASE}}{\text{mean \# of v.o.i. with corresp. v.o.i. ranked in top } x \text{ by chance algorithm}}$$

$$\left\{F_{c,\theta}^{(n,m)} \text{ s.t. } 0\leq c\leq\min(n,m),\ c\in\mathbb{Z},\ \theta\in\Theta\subset\mathbb{R}^{d(n,m)}\right\}$$


Full text

Affiliations:
1. Department of Applied Mathematics and Statistics, Johns Hopkins University
2. Center for Imaging Sciences, Johns Hopkins University
3. Microsoft AI and Research, Microsoft
4. Department of Mathematics, University of Maryland


Abstract

Given a pair of graphs $G_{1}$ and $G_{2}$ and a vertex set of interest in $G_{1}$, the vertex nomination (VN) problem seeks to find the corresponding vertices of interest in $G_{2}$ (if they exist) and produce a rank list of the vertices in $G_{2}$, with the corresponding vertices of interest in $G_{2}$ concentrating, ideally, at the top of the rank list. In this paper, we define and derive the analogue of Bayes optimality for VN with multiple vertices of interest, and we define the notion of maximal consistency classes in vertex nomination. This theory forms the foundation for a novel VN adversarial contamination model, and we demonstrate with real and simulated data that there are VN schemes that perform effectively in the uncontaminated setting, and that adversarial network contamination adversely impacts the performance of our VN scheme. We further define a network regularization method for mitigating the impact of the adversarial contamination, and we demonstrate the effectiveness of regularization in both real and synthetic data.

1 Introduction and Background

Given graphs $G_{1}$ and $G_{2}$ and vertices of interest $V^{*}\subset V(G_{1})$, the aim of the vertex nomination (VN) problem is to rank the vertices of $G_{2}$ into a nomination list with the corresponding vertices of interest concentrating at the top of the nomination list. In recent years, a host of VN procedures have been introduced (see, for example, [coppersmith2012vertex, marchette2011vertex, LeePri2012, FisLyzPaoChePri2015, patsolic2017vertex, yoder2018vertex]) that have proven to be effective information retrieval tools in both synthetic and real data applications. Moreover, recent work establishing a fundamental statistical framework for VN has led to a novel understanding of the limitations of VN efficacy in evolving network environments [lyzinski2017consistent]. Herein, we consider a general statistical model for adversarial contamination in the context of vertex nomination (here the adversary can randomly add or remove edges and/or vertices in the network), and we examine the effect of these contaminations on VN performance. In addition, we extend existing theory on consistent vertex nomination to multiple vertices of interest and define and derive Bayes Optimal Classifiers in this setting. We further show that there are infinitely many classes of distributions for which a vertex nomination scheme is not consistent.

The practical additional value of this paper is to

1. extend the results of [lyzinski2017consistent] to the more realistic multiple VOI setting;
2. rigorously frame the concept of an adversary in the random graph framework;
3. develop theory showing how it is possible for an adversary to render vertex nomination schemes inconsistent;
4. demonstrate empirically that although an adversary can have a negative impact, regularization can succeed in recovering consistency.

The reason we do not prove that regularization succeeds is that the regularization scheme depends on the particular graph observation and introduces complex dependence structure into the problem. Such dependence, coupled with the already difficult spectral analysis problem, makes it unclear what exactly is even being estimated when using any spectral nomination scheme with regularization. Furthermore, the regularization scheme we consider is highly model-dependent, and our main theoretical contributions apply to any vertex nomination scheme and as such are necessary to begin to understand adversarial vertex nomination.

To motivate our mathematical and statistical results further, we first consider an illustrative real data example in Section 1.1, in which we demonstrate the following: a VN scheme that works effectively in the uncontaminated setting, and network contamination that adversely impacts the performance of this VN scheme. We provide a more thorough background of the relevant literature in Section 1.2, after the motivating example.

1.1 Motivating example

Consider the pair of high school friendship networks in [mastrandrea2015contact]: The first, $G_{1}$, has $156$ nodes, each representing a student, and has two vertices adjacent if the two students made contact with each other at school in a given time period; the second, $G_{2}$, has $134$ vertices, again with each vertex representing a student, and has two vertices adjacent if the two students are friends on Facebook. There are $82$ students appearing in both $G_{1}$ and $G_{2}$, and we pose the VN problem here as follows: given a student-of-interest in $G_{1}$, can we nominate the corresponding student (if they exist) in $G_{2}$? We note here that the vertex nomination approach outlined below easily adapts to the multiple vertices of interest (v.o.i.) scenario (i.e., given students-of-interest in $G_{1}$, can we nominate the corresponding students, if they exist, in $G_{2}$?), and we will provide the necessary details for handling both single and multiple v.o.i. below. Recall that the VN problem assumes there is a correspondence between the vertices but that the practitioner does not have access to this correspondence. To this end, we act as though we do not know the corresponding student in each graph.

In one idealized data setting, all students would appear in both graphs as this would potentially maximize the signal present in the correspondence of labels across graphs. This bears itself out in the following illustrative VN experiment. Consider the following simple VN scheme, which we denote $\text{VN}\circ\text{GMM}\circ\text{ASE}$ : Given vertex (or vertices) of interest $v^{*}$ in $G_{1}$ and seeded vertices $S\subset V_{1}\cap V_{2}$ (seeds here represent vertices whose identity across networks is known a priori), we proceed by embedding the graphs into a common Euclidean space $\mathbb{R}^{d}$ and clustering using Mahalanobis distances between the embeddings of the vertices (see Section LABEL:sec:asegmm for full detail).
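As a rough illustration of the embed-then-rank idea behind $\text{VN}\circ\text{GMM}\circ\text{ASE}$, the following Python sketch embeds both graphs with an adjacency spectral embedding, aligns the two embeddings with an orthogonal Procrustes rotation fitted on the seeds, and ranks the vertices of $G_{2}$ by distance to the embedded vertex of interest. It is a simplification under stated assumptions, not the scheme itself: plain Euclidean distance stands in for the GMM clustering and Mahalanobis distances, the seed-based alignment step is an assumption, and the function names `ase`, `procrustes_rotation`, and `nominate` are hypothetical.

```python
import numpy as np

def ase(adj, d):
    """Adjacency spectral embedding: top-d eigenpairs by magnitude,
    each eigenvector scaled by the square root of |eigenvalue|."""
    vals, vecs = np.linalg.eigh(adj)
    top = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))

def procrustes_rotation(x_seeds, y_seeds):
    """Orthogonal W minimizing the Frobenius norm of x_seeds @ W - y_seeds."""
    u, _, vt = np.linalg.svd(x_seeds.T @ y_seeds)
    return u @ vt

def nominate(adj1, adj2, seeds1, seeds2, voi, d=2):
    """Return vertices of graph 2 ordered from most to least likely match.

    seeds1[i] in graph 1 is assumed to correspond to seeds2[i] in graph 2.
    Euclidean distance is an illustrative stand-in for the Mahalanobis
    distances used by the scheme described in the text.
    """
    x1, x2 = ase(adj1, d), ase(adj2, d)
    w = procrustes_rotation(x1[seeds1], x2[seeds2])
    target = x1[voi] @ w                      # v.o.i. mapped into graph 2's space
    dist = np.linalg.norm(x2 - target, axis=1)
    dist[seeds2] = np.inf                     # seeds are already identified
    return np.argsort(dist, kind="stable")
```

The rank of a candidate vertex is its position in the returned array, so a well-performing scheme places the true match near the front.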

We can consider running the $\text{VN}\circ\text{GMM}\circ\text{ASE}$ in the idealized data setting where we only consider the induced subgraphs of $G_{1}$ and $G_{2}$ containing the $82$ common vertices across graphs (call these graphs $G_{1}^{(i)}$ and $G_{2}^{(i)}$ ), and we can also consider running the procedure in the setting where the $52$ vertices in $G_{2}$ without matches across graphs are added to $G_{2}^{(i)}$ as a form of contamination. These unmatchable vertices can have the effect of obfuscating the correspondence amongst the common vertices across graphs, and thus can diminish VN performance. Indeed, we see this play out in Figure 1.

In Figure 1, we plot the performance of $\text{VN}\circ\text{GMM}\circ\text{ASE}$ averaged over $nMC=500$ random seed sets of size $s=10$. In the left figure, the $x$-axis shows the ranks in the nomination list and the $y$-axis shows the mean ($\pm 2$ s.e.) number of vertices $v\in G_{1}^{(i)}$, when viewed as the lone v.o.i., that had their corresponding vertex of interest ranked in the top $x$ by $\text{VN}\circ\text{GMM}\circ\text{ASE}$. The right figure shows the same results normalized by chance performance, where we plot

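$$y=\frac{\text{mean \# of v.o.i. with corresp. v.o.i. ranked in top } x \text{ by } \text{VN}\circ\text{GMM}\circ\text{ASE}}{\text{mean \# of v.o.i. with corresp. v.o.i. ranked in top } x \text{ by chance algorithm}}$$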

versus $x$. The blue line represents performance in the idealized networks $G_{1}^{(i)}$ and $G_{2}^{(i)}$, and the red line represents performance in the contaminated network pair $(G_{1}^{(i)},G_{2})$. We see that the contamination detrimentally affects the performance of $\text{VN}\circ\text{GMM}\circ\text{ASE}$ at all levels, as for all $x$, the number of v.o.i. in $G_{1}^{(i)}$ with their corresponding v.o.i. ranked in the top $x$ in the second graph is larger in $(G_{1}^{(i)},G_{2}^{(i)})$ than in $(G_{1}^{(i)},G_{2})$. Note that the chance normalization is computed separately under the core and noisy models, and the seeming performance gain relative to chance in the contaminated setting is attributable to the fact that $G_{2}$ has significantly more vertices than the idealized $G_{2}^{(i)}$, and chance is therefore significantly worse. We emphasize here the effect of the contamination on VN performance; indeed, the adversarial contamination greatly (and negatively) affects the performance of our vertex nomination scheme, suggesting that perhaps the vertex nomination scheme is not consistent for this class of contaminated distributions. In effect, the adversary is knocking the networks out of the consistency class for $\text{VN}\circ\text{GMM}\circ\text{ASE}$; see Section LABEL:sec:CC for detail. While the results of Section LABEL:sec:verify show that we cannot verify (in an unsupervised manner, without the true labels) the extent to which the contamination negatively impacts the performance of VN, in Section LABEL:sec:regreg, we empirically explore the impact of regularization strategies for mitigating this contamination.
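The chance-normalized ratio can be made concrete with a small Python sketch. It assumes that the chance algorithm is a uniformly random ordering of the candidate vertices, so that a given v.o.i. has its match in the top $x$ with probability $x$ divided by the number of candidates; the text does not spell out the chance algorithm, so this is an assumption, and the function names are hypothetical.

```python
def top_x_counts(ranks, max_x):
    """counts[x - 1] = number of v.o.i. whose true match has rank <= x."""
    return [sum(1 for r in ranks if r <= x) for x in range(1, max_x + 1)]

def chance_counts(num_voi, num_candidates, max_x):
    """Expected counts when the nomination list is a uniformly random order
    (an assumed model of the chance algorithm)."""
    return [num_voi * min(x, num_candidates) / num_candidates
            for x in range(1, max_x + 1)]

def normalized_curve(ranks, num_candidates, max_x):
    """Ratio of observed to chance counts for x = 1, ..., max_x."""
    observed = top_x_counts(ranks, max_x)
    chance = chance_counts(len(ranks), num_candidates, max_x)
    return [o / c for o, c in zip(observed, chance)]

# Four v.o.i. whose matches were ranked 1, 2, 5 and 10 among 10 candidates.
curve_small = normalized_curve([1, 2, 5, 10], num_candidates=10, max_x=5)
# The same ranks among 20 candidates: chance is weaker, so the ratio grows.
curve_large = normalized_curve([1, 2, 5, 10], num_candidates=20, max_x=5)
```

With identical ranks, doubling the candidate pool from 10 to 20 doubles every ratio in this range, which is the mechanism behind the apparent gain relative to chance in the larger graph $G_{2}$.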

Remark 1 (The role of seeds).

Figure 1 shows performance of $\text{VN}\circ\text{GMM}\circ\text{ASE}$ averaged over $500$ randomly chosen seed sets of size $10$ . While performance, on the whole, increases with proper regularization, the story can vary wildly from seed set to seed set. While a full exploration of this is beyond the scope of the present text, this is an active area of our work.

1.2 Background

In modern statistics and machine learning, graphs are a common way to take into account the complex relationships between data objects, and graphs have been used in applications across the biological (see, for example, [neu1, neu2, neu3, bio1, bio2, bio3]) and social sciences (see, for example, [socnet1, socnet2, resp1, resp2]). In addition to more traditional statistical inference tasks such as clustering [rohe2011spectral, qin2013dcsbm, networks08:_v, newman2006modularity], classification [vogelstein2013graph, chen2016robust, neu3], and estimation [bickel2013asymptotic, BicChe2009, sussman2014consistent], there has been significant work in more network-specific inference tasks such as graph matching [ConteReview, foggia2014graph, yan2016short], and vertex nomination [marchette2011vertex, coppersmith2014vertex, FisLyzPaoChePri2015].

Recall that the vertex nomination problem can be stated loosely as follows: given graphs $G_{1}$ and $G_{2}$ and vertices of interest $V^{*}\subset V(G_{1})$ , rank the vertices of $G_{2}$ into a nomination list with the corresponding vertices of interest concentrating at the top of the nomination list (see Definition LABEL:def:VN for full detail). While vertex nomination has found applications in a number of different areas, such as social networks in [patsolic2017vertex] and data associated with human trafficking in [FisLyzPaoChePri2015], there are relatively few results establishing the statistical properties of vertex nomination. In [FisLyzPaoChePri2015], consistency is developed within the stochastic blockmodel random graph framework, where interesting vertices were defined via community membership. In [lyzinski2017consistent], the authors develop the concepts of consistency and Bayes optimality for a very general class of random graph models and a very general definition of what makes the v.o.i. interesting. In this paper, we further develop the ideas in [lyzinski2017consistent], with the aim of developing a theoretical regime in which to ground the notion of adversarial contamination in VN. In addition, their results are derived in the setting of a single vertex of interest; since many real application problems involve finding similar groups of nodes, we extend their results to multiple vertices of interest.

There has been significant recent attention towards better understanding the impact of adversarial attacks on machine learning methodologies (see, for example, [huang2011adversarial, cai2015robust, papernot2016limitations, adv1, adv2]). Herein, we define an adversarial attack on a machine learning algorithm to be a mechanism that changes the data distribution in order to negatively affect algorithmic performance; see Definition LABEL:def:Adv. From a practical standpoint, adversarial attacks model the very real problem of having data compromised; if an intelligent agent has access to the data and algorithm, the agent may want to modify the data or the algorithm to give the wrong prediction/inferential conclusion. Although there has been much work on adversarial modeling in machine learning, there has been less theory developed for adversarial attacks from a statistical perspective.

The adversarial framework we consider is similar to the model considered in [cai2015robust], and it is motivated by the example in the previous section in which the addition of the vertices without correspondences to $G_{2}$ negatively impacted VN performance. Suppose that we are interested in performing vertex nomination on a graph pair, but an adversary randomly adds and deletes some edges and/or vertices in the second graph. For example, suppose we are trying to find influencers on Instagram by vertex matching to Facebook. An influencer that has knowledge of our procedure may attempt to make our algorithm fail in its nominations, perhaps by friending and de-friending people on Facebook. Even if our vertex nomination scheme was working well prior to encountering the adversary, it may not be after modification by the adversary.
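To make the edge and vertex modifications concrete, here is a minimal Python sketch of one possible random contamination of the second graph. It is illustrative only and is not the formal adversary of Definition LABEL:def:Adv: independent edge flips with probability `flip_prob` and appended junk vertices with edge probability `junk_edge_prob` are assumptions, as is the function name `contaminate`.

```python
import numpy as np

def contaminate(adj, flip_prob, num_junk, junk_edge_prob, rng):
    """Randomly perturb a simple undirected graph given as a 0/1 adjacency matrix.

    Step 1: flip the edge status of each vertex pair independently
            with probability flip_prob (adds and deletes edges).
    Step 2: append num_junk new vertices; every pair involving a new vertex
            is joined independently with probability junk_edge_prob.
    """
    m = adj.shape[0]
    flips = np.triu(rng.random((m, m)) < flip_prob, k=1)
    flips = flips | flips.T                    # keep the graph undirected
    noisy = np.where(flips, 1 - adj, adj)

    total = m + num_junk
    out = np.zeros((total, total), dtype=adj.dtype)
    out[:m, :m] = noisy
    new_edges = np.triu(rng.random((total, total)) < junk_edge_prob, k=1)
    new_edges[:m, :m] = False                  # only pairs touching a new vertex
    new_edges = new_edges | new_edges.T
    out[new_edges] = 1
    return out
```

Setting `flip_prob=0` and `num_junk=0` returns the graph unchanged, so the same VN rule can be compared before and after contamination.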

From a statistical standpoint, what can we say about the statistical consistency of our original vertex nomination rule? Our motivating example suggests that there are adversaries that can render our vertex nomination scheme no longer consistent, but theory is needed both to explain why that may be the case and to properly frame the problem. Hence, to answer these questions, we further develop the theory in [lyzinski2017consistent] to situate the notion of adversarial contamination within the idea of maximal consistency classes for a given VN rule (Section LABEL:sec:CC). In this framework, the goal of an adversary is to move a model out of a rule’s consistency class. We demonstrate with real and synthetic data examples how an adversary is able to move a model out of a rule’s consistency class. We finish with a brief discussion on how regularization can effectively recover consistency, though we leave this for future work.

Notation: See Table 1 for frequently used notation.

2 Vertex Nomination and Consistency

Before discussing how to define adversarial attacks, we discuss the previous work of [lyzinski2017consistent], the first of its kind to derive the Bayes Optimal vertex nomination scheme for one vertex. This work can be viewed as a follow-on of that work, in which we provide a groundwork for the rigorous framing of an adversary in vertex nomination.

First, we will situate our analysis of the VN problem in the very general framework of nominatable distributions.

Definition 2 (Nominatable Distribution).

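For given $n,m\in\mathbb{Z}_{>0}$, the set of Nominatable Distributions of order $(n,m)$, denoted $\mathcal{N}_{n,m}$, is the collection of all families of distributions $\mathbf{F}^{(n,m)}_{\Theta}$ of the following form

$$\mathbf{F}^{(n,m)}_{\Theta}=\left\{F^{(n,m)}_{c,\theta} \text{ s.t. } 0\leq c\leq\min(n,m),\ c\in\mathbb{Z},\ \theta\in\Theta\subset\mathbb{R}^{d(n,m)}\right\},$$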

where $F^{(n,m)}_{c,\theta}$ is a distribution on $\mathcal{G}_{n}\times\mathcal{G}_{m}$ parameterized by $\theta\in\Theta$ satisfying:

1. The vertex sets $V_{1}=\{v_{1},v_{2},...,v_{n}\}$ and $V_{2}=\{u_{1},u_{2},...,u_{m}\}$ satisfy $v_{i}=u_{i}$ for $0<i\leq c$. We refer to $C=\{v_{1},v_{2},...,v_{c}\}=\{u_{1},u_{2},...,u_{c}\}$ as the core vertices. These are the vertices that are shared across the two graphs and imbue the model with a natural notion of corresponding vertices.
2. Vertices in $J_{1}=V_{1}\setminus C$ and $J_{2}=V_{2}\setminus C$ satisfy $J_{1}\cap J_{2}=\emptyset$. We refer to $J_{1}$ and $J_{2}$ as junk vertices. These are the vertices in each graph that have no corresponding vertex in the other graph.
3. The induced subgraphs $G_{1}[J_{1}]$ and $G_{2}[J_{2}]$ are conditionally independent given $\theta$.

The vertices in $C$ are those that have a corresponding paired vertex in each graph, where corresponding can be defined very generally. Corresponding vertices need not correspond to the same person/user/account; rather, corresponding vertices are understood as those that share a desired property (for example, a role in the network) across graphs. In particular, we will assume that the vertices of interest in $G_{1}$ have corresponding vertices in $G_{2}$, and that these corresponding vertices are the vertices of interest in $G_{2}$.
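One simple member of this class can be sampled as follows. The Python sketch below draws a pair of Erdős-Rényi graphs whose core-core edges are correlated and whose remaining edges are independent, so the junk subgraphs are independent as the definition requires. The specific model and the parameters `p` and `keep_prob` are illustrative assumptions, not a construction taken from the text.

```python
import numpy as np

def sample_pair(n, m, c, p, keep_prob, rng):
    """Sample (G1, G2) with c core vertices, indexed 0..c-1 in both graphs.

    Indices c..n-1 of G1 and c..m-1 of G2 are junk vertices; they denote
    different vertices even though the integer labels overlap.
    Each core-core pair in G2 copies its edge status from G1 with
    probability keep_prob; every other pair is an independent Bernoulli(p).
    """
    if not 0 <= c <= min(n, m):
        raise ValueError("need 0 <= c <= min(n, m)")

    def erdos_renyi(size):
        upper = np.triu(rng.random((size, size)) < p, k=1)
        return (upper | upper.T).astype(int)

    g1 = erdos_renyi(n)
    g2 = erdos_renyi(m)
    keep = np.triu(rng.random((c, c)) < keep_prob, k=1)
    keep = keep | keep.T
    g2[:c, :c] = np.where(keep, g1[:c, :c], g2[:c, :c])
    return g1, g2
```

With `keep_prob=1` the two core subgraphs coincide, while `keep_prob=0` makes the two graphs independent, so this single parameter controls how much signal the correspondence carries.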

Having access to the vertex labels would then render the VN problem trivial. To model the uncertainty often present in data applications, where the vertex labels (or correspondences) are unknown a priori, we adopt the notion of obfuscation functions from [lyzinski2017consistent].
