Towards Optimal Strategy for Adaptive Probing in Incomplete Networks

Tri P. Nguyen; Hung T. Nguyen; Thang N. Dinh

arXiv:1702.01452·cs.SI·February 7, 2017

TL;DR

This paper studies an adaptive network probing problem where an agent aims to maximize explored nodes with limited probes, establishing theoretical hardness and proposing learning-based strategies that outperform heuristics.

Contribution

It introduces a novel formulation of the adaptive probing problem, proves its strong inapproximability, and develops learning frameworks that effectively learn strategies for different networks.

Findings

01. Proves no finite approximation ratio algorithm exists unless P=NP.

02. Designs learning frameworks that outperform existing heuristics.

03. Demonstrates the effectiveness of learned strategies through extensive experiments.

Abstract

We investigate a graph probing problem in which an agent has only an incomplete view $G^{\prime}\subsetneq G$ of the network and wishes to explore the network with least effort. In each step, the agent selects a node $u$ in $G^{\prime}$ to probe. After probing $u$, the agent gains the information about $u$ and its neighbors. All the neighbors of $u$ become observed and are probable in the subsequent steps (if they have not been probed). What is the best probing strategy to maximize the number of nodes explored in $k$ probes? This problem serves as a fundamental component for other decision-making problems in incomplete networks such as information harvesting in social networks, network crawling, network security, and viral marketing with incomplete information. While there are a few methods proposed for the problem, none can perform consistently well across different network types. In…

Tables

Table 1: Highest performance (Perf.) metric-based methods.

GnuTella top 5	Perf.	Collaboration top 5	Perf.	Road top 5	Perf.
CLC	2471	BC	2937	CLC	358
BC	2341	PR	2108	DEG	346
CC	1999	CC	2085	BC	346
PR	1994	CLC	2061	PR	342
DEG	1958	DEG	2048	CC	326

Table 2: Set of features for learning.

Factor	Description
$BC$	Betweenness centrality score [26] of $u$ in $G^{\prime}$
$CC$	Closeness centrality score [26] of $u$ in $G^{\prime}$
$EIG$	Eigenvector centrality score [26] of $u$ in $G^{\prime}$
$PR$	PageRank centrality score [26] of $u$ in $G^{\prime}$
$Katz$	Katz centrality score [26] of $u$ in $G^{\prime}$
$CLC$	Clustering coefficient score of $u$ in $G^{\prime}$
$DEG$	Degree of $u$ in $G^{\prime}$
$BNum$	Number of black nodes in $G^{\prime}$
$GNum$	Number of gray nodes in $G^{\prime}$
$BDeg$	Total degree of black nodes in $G^{\prime}$
$BEdg$	Number of edges between black nodes in $G^{\prime}$

Table 3: Statistics for the networks used in our experiments. ACC stands for Average Clustering Coefficient. Bold and underlined networks are used for training.

Name	Network Type	#Nodes	#Edges	ACC
Roadnw-CA	Road	$21k$	$21k$	$7\cdot 10^{-5}$
Roadnw-OL	Road	$6k$	$7k$	$0.01$
Roadnw-TG	Road	$18k$	$23k$	$0.018$
GnuTella04	p2p (GnuTella)	$11k$	$40k$	$0.006$
GnuTella05	p2p	$9k$	$32k$	$0.007$
GnuTella09	p2p	$8k$	$26k$	$0.009$
Ca-GrQc	Collaboration (CA)	$5k$	$14k$	$0.529$
Ca-HepPh	Collaboration	$12k$	$118k$	$0.611$
Ca-HepTh	Collaboration	$10k$	$26k$	$0.471$
Ca-CondMat	Collaboration	$23k$	$93k$	$0.633$
Ca-AstroPh	Collaboration	$18k$	$198k$	$0.630$

Table 4: Number of new explored nodes at budget $k=300$ from all implemented probing methods in all datasets.

Method	GnuTella Net.	CA Net.	Road Net.
CLC	2471	3098	358
BC	2341	5052	346
DEG	1958	3471	346
CC	1999	3547	326
PR	1994	3639	342
RAND	2381	2911	329
MaxOutProbe	1820	902	28

Equations

$$\delta_{S^{i-1}}(v^{i}_{max})\geq\delta_{S^{i-1}}(v^{*})$$

$$|P_{S^{i-1}}(v^{i}_{max})|\cdot\delta_{S^{i-1}}(v^{i}_{max})\geq\sum_{j=|S^{i-1}|}^{|S^{i}|}\delta_{S^{i-1}}(v^{*}_{j})$$

$$\Delta_{S^{i-1}}(v^{i}_{max})\geq\sum_{j=|S^{i-1}|}^{|S^{i}|}\delta_{S^{i-1}}(v^{*}_{j})$$

$$\sum_{i=1}^{t}\Delta_{S^{i-1}}(v^{i}_{max})\geq\sum_{i=1}^{t}\sum_{j=|S^{i-1}|}^{|S^{i}|}\delta_{S^{i-1}}(v^{*}_{j})$$

$$\delta_{S^{i-1}}(v^{*}_{j})\geq\delta_{\hat{S}}(v^{*}_{j})\geq\frac{\Delta_{\hat{S}}(v^{*}_{j})}{|P_{\hat{S}}(v^{*}_{j})|}\geq\frac{\Delta_{\hat{S}}(v^{*}_{j})}{r}$$

$$\sum_{i=1}^{t}\sum_{j=|S^{i-1}|}^{|S^{i}|}\delta_{S^{i-1}}(v^{*}_{j})\geq\sum_{i=1}^{t}\sum_{j=|S^{i-1}|}^{|S^{i}|}\frac{\Delta_{\hat{S}}(v^{*}_{j})}{r}$$

$$\sum_{i=1}^{t}\sum_{j=|S^{i-1}|}^{|S^{i}|}\Delta_{\hat{S}}(v^{*}_{j})=\sum_{j=1}^{k}\Delta_{\hat{S}}(v^{*}_{j})=OPT-f(\hat{S})$$

$$f(\hat{S})\geq\frac{OPT-f(\hat{S})}{r}$$

$$f(\hat{S})\geq\frac{OPT}{r+1}$$

$$y_{u}=\begin{cases}1&\text{if node }u\text{ is observed,}\\ 0&\text{otherwise}.\end{cases}$$

$$x_{uj}=\begin{cases}1&\text{if node }u\text{ is selected at layer }j\text{ or earlier},\\ 0&\text{otherwise}.\end{cases}$$

$$\max$$

$$x_{r0}=1,\quad x_{u0}=0,\quad u\in V,\ u\neq r$$

$$\sum_{u\in V}x_{uk}\leq k$$

$$y_{u}\leq\sum_{v\in N(u)}x_{vk},\quad\forall u\in V,$$

$$x_{uj}\leq x_{u(j+1)},\quad\forall u\in V,\ j=0..k-1,$$

$$x_{uj}\leq x_{u(j-1)}+\sum_{v\in N(u)}x_{v(j-1)},\quad\forall u\in V,\ j=1..k,$$

$$x_{uj},y_{u}\in\{0,1\},\quad\forall u\in V,\ j=0..k.$$

$$l=\begin{cases}1&\text{if }w_{u}^{o}\geq w_{v}^{o};\\ 0&\text{if }w_{u}^{o}<w_{v}^{o}.\end{cases}$$

Taxonomy

Topics: Web Data Mining and Analysis · Caching and Content Delivery · Peer-to-Peer Network Technologies

Full text

Towards Optimal Strategy for Adaptive Probing in Incomplete Networks

Tri P. Nguyen, Hung T. Nguyen and Thang N. Dinh Computer Science Department, Virginia Commonwealth University. Email: {trinpm,hungnt,tndinh}@vcu.edu

Abstract

We investigate a graph probing problem in which an agent has only an incomplete view $G^{\prime}\subsetneq G$ of the network and wishes to explore the network with the least effort. In each step, the agent selects a node $u$ in $G^{\prime}$ to probe. After probing $u$, the agent gains the information about $u$ and its neighbors. All the neighbors of $u$ become observed and can be probed in the subsequent steps (if they have not been probed). What is the best probing strategy to maximize the number of nodes explored in $k$ probes? This problem serves as a fundamental component for other decision-making problems in incomplete networks such as information harvesting in social networks, network crawling, network security, and viral marketing with incomplete information.

While a few methods have been proposed for the problem, none performs consistently well across different network types. In this paper, we establish a strong (in)approximability result for the problem, proving that no algorithm can guarantee a finite approximation ratio unless P=NP. On the bright side, we design learning frameworks to capture the best probing strategies for individual networks. Our extensive experiments suggest that our framework can learn efficient probing strategies that consistently outperform previous heuristics and metric-based approaches.

1 Introduction

In many real-world networks, the complete topology is nearly impossible to acquire, so most decisions are made based on incomplete networks. The inability to obtain the complete network may stem from various sources: 1) the extreme size of the networks, e.g., Facebook and Twitter with billions of users and connections, or the Internet spanning the whole planet; 2) privacy or security concerns, e.g., in online social networks, we may not be able to see users’ connections because of privacy settings that protect them from unwelcome guests; 3) being hidden or undercover, e.g., terrorist networks in which only a small fraction is exposed and the rest is anonymous.

To support decision making processes based on local view and expand our observations of the networks, we investigate a network exploring problem, called Graph Probing Maximization (GPM). In GPM, an agent is provided with an incomplete network $G^{\prime}$ which is a subnetwork of an underlying real-world network $G\supsetneq G^{\prime}$ and wants to explore $G$ swiftly through node probing. Once a node $u\in G^{\prime}$ is probed, all neighbors $v\in G$ of $u$ will be observed and can be probed in the following steps. Given a budget $k$ , the agent wishes to identify $k$ nodes from $G^{\prime}$ to probe to maximize the number of newly observed nodes.

Real-world applications of GPM include exploring terrorist networks to help plan their dismantling. Given an incomplete adversarial network, each suspect node can be “probed”, e.g., by getting court orders for communication records, to reveal further suspects. In cybersecurity on online social networks (OSNs), intelligent attackers can gather users’ information by sending friend requests to users [1]. Understanding the attackers’ strategies is critical in coming up with effective hardening policies. Another example is viral marketing: from a partial observation of the network, a good probing strategy for new customers can lead to the discovery of potential product sales.

While several heuristics are proposed for GPM [2, 3, 4, 5], they share two main drawbacks. First, they all consider selecting nodes in one batch. We argue that this strategy is ineffective as the information gained from probing nodes is not used in making the selection as shown recently [6, 7]. Secondly, they are metric-based methods which use a single measure to make decisions. However, real-world networks have diverse characteristics, such as different power-law degree distributions and a wide range of clustering coefficients. Thus, the proposed heuristics may be effective for particular classes of networks, but perform poorly for the others.

In this paper, we first formulate Graph Probing Maximization and theoretically prove a strong inapproximability result: the optimal strategy based on the local incomplete network cannot be approximated within any finite factor. That is, no polynomial-time algorithm approximates the optimal strategy within any finite multiplicative error. On the bright side, we design a novel machine learning framework that takes local information, e.g., node-centric indicators, as input and predicts the best sequence of $k$ node probes to maximize the observed augmented network. We take into account a common scenario in which a reference network with similar characteristics is available, e.g., past exposed terrorist networks when investigating an emerging one. Our framework learns a probing strategy by simulating many subnetwork scenarios from the reference graph and learning the best strategy from those simulated samples.

The main difficulty in our machine learning framework is finding the best probing strategy in the subnetwork scenarios sampled from the reference network, that is, characterizing the long-term potential gain of a candidate node (with respect to future probes). We term this subproblem Topology-aware GPM (Tada-GPM) since both the subnetwork scenario and the underlying reference network are available. We propose a $\frac{1}{r+1}$-approximation algorithm for this problem, where $r$ is the radius of the optimal solution. Here, the radius of a solution is defined as the largest distance from a selected node to the subnetwork. Our algorithm looks ahead to the future gain of selecting a node and thus provides a nontrivial approximation guarantee. We further propose an effective heuristic improvement and study the optimal strategy via Integer Linear Programming.

Unlike metric-based methods, whose performance is inconsistent, our learning framework adapts easily to networks with different traits. Our experiments on real-world networks from diverse disciplines confirm this: our methods consistently outperform the state-of-the-art methods by around 20%.

Our contributions can be summarized as follows:

• We formulate Graph Probing Maximization (GPM) and show that none of the existing metric-based methods works consistently well on different networks. We also rigorously prove a strong hardness result for the GPM problem: it is inapproximable within any finite factor.

• We propose a novel machine learning framework that looks into the future potential benefit of probing a node to make the best decision.

• We experimentally show that our new approach significantly and consistently improves the performance of the probing task compared to the current state-of-the-art methods on a large set of diverse real-world networks.

Related works. Our work is connected to the early network crawling literature [8, 9, 10, 11]. Website crawlers collect public data and aim to find the least-effort strategy that gains as much information as possible. The common method for gathering relevant and usable information is to follow hyperlinks and expand the domain.

Later, with the creation and explosive growth of OSNs, attention largely shifted to harvesting public information on these networks [12, 13, 14]. Chau et al. [12] were able to crawl approximately 11 million auction users on eBay with a parallel crawler. The record of successful crawling belongs to Kwak et al. [14], who gathered 41.7 million public user profiles, 1.47 billion social relations and 106 million tweets. However, these crawlers are limited to users' public information because of the privacy settings on OSNs that protect private/updated data from unwelcome users.

More recently, a new crawling technique uses socialbots [15, 16, 17, 18, 19, 1] to befriend users and thereby see their private information. Boshmaf et al. [15] proposed building a large-scale socialbot system to infiltrate Facebook. Their work has several outcomes: they were able to infiltrate Facebook with a success rate of 80%; privacy breaches are possible; and the security defense systems are not effective enough to prevent or stop socialbots. The works in [16, 17] focus on targeting a specific organization on online social networks by using socialbots to befriend the organization’s employees. As a result, they succeed in discovering hidden nodes/edges and achieve an acceptance rate of 50% to 70%.

Graph sampling and its applications have been widely studied in the literature. For instance, Kim and Leskovec [20] study the problem of inferring the unobserved parts of a network. They address the network completion problem: given a network with missing nodes and edges, how can one complete the missing part? Maiya and Berger-Wolf [21] propose a sampling method that can effectively be used to infer and approximate community affiliation in a large network.

Most similar to our work, Soundarajan et al. propose the MaxOutProbe [3] probing method. MaxOutProbe estimates the degrees of nodes to decide which nodes should be probed in a partially observed network. This model performs better than probing approaches based on node centralities (selecting nodes that have high degree or low local clustering). However, in our experiments, we observe that MaxOutProbe still performs worse than methods that rank nodes by PageRank or betweenness.

The probing process can be seen as a diffusion of deception in the network. Thus, it is related to the vast literature in cascading processes in the network [22, 23, 24, 25].

Organization: The rest of this paper is organized as follows. Section 2 presents the studied problem and the hardness results. Section 3 proposes our approximation algorithm and machine learning model. Section 4 presents our comprehensive experiments, followed by the conclusion in Section 5.

2 Problem Definitions and Hardness

We abstract the underlying network using an undirected graph $G=(V,E)$ where $V$ and $E$ are the sets of nodes and edges. $G$ is not completely observed; instead, a subgraph $G^{\prime}=(V^{\prime},E^{\prime})$ of $G$ is seen with $V^{\prime}\subseteq V$, $E^{\prime}\subseteq E$. Nodes in $G$ can be divided into three disjoint sets, $V^{f}$, $V^{p}$ and $V^{u}$, as illustrated in Fig. 1.

• Black/Fully observed nodes: $V^{f}$ contains fully observed (probed) nodes, meaning all of their connections are revealed. That is, if $u\in V^{f}$ and $(u,v)\in E$ then $(u,v)\in E^{\prime}$.

• Gray/Partially observed nodes: $V^{p}$ contains partially observed nodes $u$ that are adjacent to at least one fully probed node in $V^{f}$ and $u\notin V^{f}$. Only the connections between $u\in V^{p}$ and nodes in $V^{f}$ are observed, while the connections to unobserved nodes are hidden. Therefore, the nodes $u\in V^{p}$ are the only candidates for discovering unobserved nodes. Note that $V^{\prime}=V^{f}\cup V^{p}$.

• White/Unobserved nodes: $V^{u}=V\setminus V^{\prime}$ consists of unobserved nodes. The nodes in $V^{u}$ have no connection to any node in $V^{f}$ but may be connected to nodes in $V^{p}$.

Node probing: At each step, we select a candidate gray node $u\in V^{p}$ to probe. Once probed, all the neighbors of $u$ in $G$ and the corresponding edges are revealed. That is, $u$ becomes a fully observed black node and each white neighbor $v\in V^{u}$ of $u$ becomes gray and is available to probe in subsequent steps. We call the resulting graph after probing the augmented graph and use the same notation $G^{\prime}$ when the context is clear. The main goal of ada-GPM is to increase the size of $G^{\prime}$ as much as possible.

Probing budget $k$: In addition to the subgraph $G^{\prime}$, a budget $k$ is given as the number of nodes we can probe. This budget may represent the effort or cost that can be spent on probing. Given the budget $k$, our probing problem becomes selecting at most $k$ connected nodes that maximize the size of the augmented $G^{\prime}$. Equivalently, we want to maximize the number of newly observed nodes.

We call our problem Graph Probing Maximization (GPM). A crucial consideration at this point is whether we should select all $k$ nodes at once or distribute the budget over multiple steps. The answer leads to two different versions: non-adaptive and adaptive. We focus on the adaptive problem.

Definition 2.1 (Adaptive GPM (ada-GPM))

Given an incomplete subnetwork $G^{\prime}$ of $G$ and a budget $k$, the Adaptive Graph Probing Maximization problem asks for $k$ partially observed nodes to probe in $k$ consecutive steps so as to maximize the number of newly observed nodes in $G^{\prime}$ at the end.

In ada-GPM, each node is selected after observing the outcomes of all previous probes. This is in contrast to the non-adaptive version, which asks to make all $k$ selections from the initial $V^{p}$ at once. Thus, adaptive probing is intuitively more effective than its non-adaptive counterpart. However, it is also considerably more challenging because of the vastly expanded search space.
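The probing mechanics described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the hidden graph is a dictionary of adjacency sets, and the names `probe`, `adaptive_probe` and `choose` are hypothetical, not taken from the paper. The selection rule `choose` sees only the black and gray sets, which mirrors the agent's incomplete view.

```python
from typing import Callable, Dict, Hashable, Set, Tuple

Node = Hashable
Graph = Dict[Node, Set[Node]]  # adjacency sets of the hidden graph G (assumed format)


def probe(G: Graph, black: Set[Node], gray: Set[Node], u: Node) -> Set[Node]:
    """Probe gray node u in place and return the newly observed (white -> gray) nodes."""
    if u not in gray:
        raise ValueError("only gray (partially observed) nodes can be probed")
    gray.remove(u)
    black.add(u)  # u is now fully observed
    newly_observed = {v for v in G[u] if v not in black and v not in gray}
    gray |= newly_observed  # white neighbors become gray candidates
    return newly_observed


def adaptive_probe(
    G: Graph,
    black: Set[Node],
    gray: Set[Node],
    k: int,
    choose: Callable[[Set[Node], Set[Node]], Node],
) -> Tuple[int, Set[Node], Set[Node]]:
    """Run up to k adaptive probes; choose(black, gray) only sees the observed sets."""
    black, gray = set(black), set(gray)  # work on copies
    gained = 0
    for _ in range(k):
        if not gray:
            break  # nothing left to probe
        u = choose(black, gray)
        gained += len(probe(G, black, gray, u))
    return gained, black, gray
```

Any metric-based heuristic or learned model can be plugged in as `choose`; the loop re-evaluates it after every probe, which is what distinguishes the adaptive setting from selecting all $k$ nodes up front.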

2.1 Hardness and Inapproximability.

2.1.1 Empirical Observations.

We show the inconsistent probing performance of metric-based methods, i.e., clustering coefficient (CLC), betweenness centrality (BC), closeness centrality (CC), PageRank (PR), local degree (DEG), MaxOutProbe [3] and random (RAND), through experiments on three real-world networks, i.e., Gnutella, co-authorship and road networks (see Sec. 4 for a detailed description). Our results are shown in Fig. 1. From the figure, we see that the performance of metric-based methods varies significantly across networks and none of them is consistently better than the others. For example, the clustering coefficient-based method gives the best results on Gnutella but performs poorly on collaboration networks. On road networks, all methods seem to be comparable in performance.

2.1.2 Inapproximability Result.

Here, we provide the hardness result for the ada-GPM problem. Our strong inapproximability result is stated in the following theorem.

Theorem 1

The ada-GPM problem on a partially observed network cannot be approximated within any finite factor. Here, the inapproximability is with respect to the optimal solution obtained when the underlying network is known.

To see this, we construct a class of instances of the problem for which no approximation algorithm with a finite factor exists.

We construct a class of instances of the probing problem as illustrated in Figure 7a. Each instance in this class has a single fully probed (black) node $b_{1}$ and $n$ observed (gray) nodes, each of which has an edge to $b_{1}$. One of the observed nodes, namely $g^{*}$, which varies between instances, has $m$ connections to $m$ unknown (white) nodes. Thus, the partially observed graph contains $n+1$ nodes, one fully probed node and $n$ observed nodes that are selectable, while the underlying graph has $n+m+1$ nodes in total. Each instance of the family has a different $g^{*}$ to which the $m$ unknown nodes are connected. We now prove that in this class, no algorithm can give a finite approximate solution for the two problems.

First, we observe that for any $k\geq 1$, the optimal solution, which probes the node with connections to unknown nodes, has the optimal value of $m$ newly explored nodes, denoted by $OPT=m$. Since no algorithm is aware of the number of connections that each gray node has, it does not know that $g^{*}$ leads to $m$ unobserved nodes. Thus, the chance that an algorithm selects $g^{*}$ is small, and the algorithm can perform arbitrarily badly. Our complete proof is presented in the supplementary material.
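As a rough illustration of this argument (not the authors' complete proof, which is in the supplementary material), assume $g^{*}$ is placed uniformly at random among the $n$ gray nodes and that a strategy, unable to distinguish the gray nodes, probes $k\leq n$ distinct ones. Under these assumptions:

$$\Pr[g^{*}\text{ is probed}]=\frac{k}{n},\qquad\mathbb{E}[\text{newly observed nodes}]=\frac{k}{n}\,m,\qquad\frac{OPT}{\mathbb{E}[\text{newly observed nodes}]}=\frac{m}{km/n}=\frac{n}{k}.$$

For a fixed budget $k$, the ratio $n/k$ grows without bound as $n$ increases. A deterministic strategy fares no better: probing any node other than $g^{*}$ reveals nothing new, so its probe order is fixed in advance, and whenever $k<n$ there is an instance in which $g^{*}$ lies outside that order and the strategy discovers no new node at all.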

3 Learning the Best Probing Strategy

Because of the hardness of ada-GPM, namely that no polynomial-time algorithm with any finite approximation factor can be devised unless $P=NP$, an efficient algorithm with a good guarantee on solution quality in the general case is unlikely to be achievable. This section proposes our machine learning based framework to tackle the ada-GPM problem. Our approach considers the case in which, in addition to the probed network, there is a reference network with similar characteristics that can be used to derive a good strategy.

Designing such a machine learning framework is challenging for three main reasons: 1) what should be selected as learning samples, e.g., (incomplete) subgraphs or (gray) nodes, and how should they be generated? (Subsec. 3.1); 2) what features of the incomplete subnetwork are useful for learning? (Subsec. 3.2); and 3) how should labels be assigned to learning samples to indicate the long-term benefit of selecting a node, i.e., to account for future probes? (Subsec. 3.3).

Overview. The general framework, depicted in Figure 3, contains four steps: 1) graph sampling, which generates many subnetworks from the reference network $G^{r}$, where each subnetwork is a sampled incomplete network with black, gray and white nodes, and each candidate gray node in each sampled subnetwork creates a data point in our training data; 2) data labeling, which labels each gray node with its long-term probing benefit; 3) training a model to learn the probing benefit of nodes from the features; and 4) probing the targeted network guided by the trained machine learning model.

3.1 Building Training Dataset.

We are given the reference network $G^{r}=(V^{r},E^{r})$, where $V^{r}$ is the set of $n$ nodes and $E^{r}$ is the set of $m$ edges. Let $\mathcal{G}=\{G^{\prime}_{1},G^{\prime}_{2},\dots,G^{\prime}_{K}\}$ be a collection of subnetworks sampled from $G^{r}$. The size of each sampled subgraph $G^{\prime}_{i}$ is randomly drawn between $0.5\%$ and $10\%$ of the size of the reference graph $G^{r}$, following a power-law distribution. Given a subnetwork size, the sample can be generated using different mechanisms, e.g., Breadth-First Search, Depth-First Search or Random Walk [21]. We use $\mathcal{G}$ to construct the training data, where each data point is a feature vector representing a candidate gray node. For each sample $G^{\prime}_{i}$ $(1\leq i\leq K)$ in $\mathcal{G}$, let $V_{G^{\prime}_{i}}^{p}$ be the set of gray nodes in $G^{\prime}_{i}$; we compute all the features for each node $u\in V_{G^{\prime}_{i}}^{p}$ to form a data point. As such, each sample $G^{\prime}_{i}$ creates $|V_{G^{\prime}_{i}}^{p}|$ training data points. To assign the label of each data point, we use our proposed algorithm Tada-Probe, the heuristic improvement or the ILP algorithm (presented in Subsec. 3.3).
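One possible way to realize the sampling step is sketched below. Several details are assumptions made only for illustration: the power-law exponent `alpha`, the convention that the drawn size counts fully observed (black) nodes, and the choice of BFS with expanded nodes treated as black and their unexpanded neighbors as gray. The function names `sample_size` and `bfs_sample` are hypothetical.

```python
import random
from collections import deque


def sample_size(n, rng, lo=0.005, hi=0.10, alpha=2.0):
    """Draw a size in [lo*n, hi*n] from a truncated power law p(x) ~ x^(-alpha).

    alpha is an assumed exponent (must differ from 1); the text only says "power-law".
    """
    a = 1.0 - alpha
    u = rng.random()
    # Inverse-CDF sampling on the interval [lo, hi].
    x = (lo ** a + u * (hi ** a - lo ** a)) ** (1.0 / a)
    return max(1, round(x * n))


def bfs_sample(G, size, rng):
    """BFS from a random seed until `size` nodes are expanded.

    Assumption: expanded nodes are black, their unexpanded neighbors are gray.
    G maps each node to the set of its neighbors in the reference graph.
    """
    seed = rng.choice(list(G))
    black, gray = set(), {seed}
    queue = deque([seed])
    while queue and len(black) < size:
        u = queue.popleft()
        gray.discard(u)
        black.add(u)
        for v in G[u]:
            if v not in black and v not in gray:
                gray.add(v)
                queue.append(v)
    return black, gray
```

Each gray node returned by `bfs_sample` would then yield one training data point, as described above; Depth-First Search or Random Walk samplers could be substituted without changing the rest of the pipeline.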

3.2 Features for Learning.

We select a rich set of intrinsic node features that only depend on the incomplete subnetwork and will be embedded in our learning model. Table 2 shows a complete list of node features that we use in our machine learning model.
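As a minimal sketch of how the simpler features in Table 2 could be computed, the snippet below derives $DEG$, $BNum$, $GNum$, $BDeg$ and $BEdg$ from the observed subnetwork alone. The function names and the dictionary-of-sets graph format are assumptions for illustration; the centrality features ($BC$, $CC$, $EIG$, $PR$, $Katz$, $CLC$) would typically come from a graph library and are omitted here.

```python
def observed_adjacency(G, black, gray):
    """Adjacency of G': only edges with at least one black endpoint are visible."""
    adj = {u: set() for u in black | gray}
    for u in black:
        for v in G[u]:
            adj[u].add(v)
            adj.setdefault(v, set()).add(u)
    return adj


def simple_features(adj, black, gray, u):
    """Degree and global count features of candidate node u, computed on G' only."""
    # Each black-black edge is seen from both endpoints, hence the division by 2
    # (assumes no self-loops).
    black_edges = sum(1 for x in black for y in adj[x] if y in black) // 2
    return {
        "DEG": len(adj[u]),
        "BNum": len(black),
        "GNum": len(gray),
        "BDeg": sum(len(adj[x]) for x in black),
        "BEdg": black_edges,
    }
```

Note that the degree of a gray node computed this way counts only its edges to black nodes, which is exactly the information available in the incomplete view.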

3.3 A $\frac{1}{r+1}$-Approximation Algorithm for Tada-GPM.

We first propose a $\frac{1}{r+1}$-approximate strategy for Tada-GPM to probe a sampled subnetwork of the reference network, where $r$, called the radius, is the largest distance from a node in the optimal solution to the initially observed network. This algorithm assigns labels to the training data.

An intuitive strategy, called Naive Greedy, is to select the node with the highest number of connections to unseen nodes. Unfortunately, a simple example shows that this strategy can perform arbitrarily badly. The example consists of a fully probed node with a connection to a degree-2 node, which is a bridge to a huge component that is not reachable from the other nodes, and many other connections to higher-degree nodes. Naive Greedy will not select the degree-2 node and will never reach the huge component.

Our algorithm is inspired by a recent theoretical result for solving the Connected Maximum Coverage (Connected Max-Cover) problem in [27]. Connected Max-Cover assumes a set of elements $\mathcal{I}$, a graph $G=(V,E)$ and a budget $k$. Each node $v\in V$ is associated with a subset $P_{v}\subseteq\mathcal{I}$. The problem asks for a connected subgraph $g$ of $G$ with at most $k$ nodes that maximizes $|\cup_{v\in g}P_{v}|$. The algorithm proposed in [27] sequentially selects nodes with the highest ratio of newly covered nodes to the length of the shortest path from observed nodes and is proved to obtain an $\frac{e-1}{(2e-1)r}$-approximation factor.

Each node in a network can be viewed as associated with a set of connected nodes. We desire to select $k$ connected nodes to maximize the union of $k$ associated sets. However, different from Connected Max-Cover in which any connected subgraph is a feasible solution, Tada-GPM requires the $k$ selected nodes to be connected from the fully observed nodes $V^{f}$ . Thus, we analogously put another constraint of connectivity to a fixed set of nodes on the returned solution and that adds a layer of difficulty. Interestingly, we show that rooting from observed nodes and greedily adding nodes in the same manner as [27] gives an $\frac{1}{r+1}$ -approximate solution. Additionally, our analysis leads to a better approximation result for Connected Max-Cover since $\frac{e-1}{(2e-1)r}<\frac{1}{2.58\cdot r}<\frac{1}{r+1}$ .
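The chain of inequalities at the end of the paragraph above can be checked numerically (values rounded to four significant digits):

$$\frac{e-1}{2e-1}\approx\frac{1.718}{4.437}\approx 0.3873\approx\frac{1}{2.582}<\frac{1}{2.58},\qquad\frac{1}{2.58\,r}<\frac{1}{r+1}\iff r+1<2.58\,r\iff r>\frac{1}{1.58}\approx 0.633.$$

So the comparison holds for every integer radius $r\geq 1$. For example, at $r=1$ the two factors are roughly $0.387$ and $0.5$.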

Let $O(v)$ denote the set of nodes that $v$ is connected to, i.e., $O(v)=\{u\mid(v,u)\in E\}$, and let $P_{V^{f}}(v)$ be the set of nodes on the shortest path from nodes in $V^{f}$ to $v$. For a set of nodes $S$, we call $f(S)$ the number of newly discovered nodes obtained by probing $S$. Hence, $f(S)$ is our objective function. Our approximation algorithm, namely Topology-aware Adaptive Probing (Tada-Probe), is described in Alg. 1.

The algorithm starts by collapsing all fully observed nodes in $G^{\prime}$ into a single root node $R$, which serves as the starting point. It iterates until the allotted budget of $k$ probes has been exhausted (Line 3). In each iteration, it selects a node $v_{max}\in V\backslash V^{f}$ within distance $k-i$ that has the maximum ratio of the number of unobserved neighbors $|O(v)\backslash V^{\prime}|$ to the length of the shortest path from nodes in $V^{f}$ to $v$ (Line 4). Afterwards, all the nodes on the shortest path from $V^{f}$ to $v_{max}$ are probed (Line 5) and the incomplete graph is updated accordingly (Line 6).
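The steps just described can be sketched as follows. This is a reading of the prose description, not a transcription of Alg. 1: the function name `tada_probe`, the dictionary-of-sets graph format, the multi-source BFS used in place of an explicit collapsed root $R$, and the tie-breaking (first node encountered) are all assumptions.

```python
from collections import deque


def tada_probe(G, black, k):
    """Sketch of Tada-Probe on a fully known graph G (node -> set of neighbors).

    Returns the final set of probed (black) nodes and the number of newly
    observed nodes.
    """
    black = set(black)
    observed = black | {v for u in black for v in G[u]}  # V' = black + gray
    initially_observed = len(observed)
    budget = k
    while budget > 0:
        # Multi-source BFS from all black nodes, equivalent to collapsing
        # them into one root; only nodes within the remaining budget matter.
        dist, parent = {}, {}
        queue = deque()
        for b in black:
            dist[b] = 0
            queue.append(b)
        while queue:
            u = queue.popleft()
            if dist[u] == budget:
                continue
            for v in G[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    parent[v] = u
                    queue.append(v)
        # Pick the node with the best ratio |O(v) \ V'| / distance.
        best, best_ratio = None, 0.0
        for v, d in dist.items():
            if d == 0:
                continue
            ratio = len(G[v] - observed) / d
            if ratio > best_ratio:
                best, best_ratio = v, ratio
        if best is None:
            break  # no reachable node reveals anything new
        # Probe every node on the shortest path back to the black set.
        v = best
        while v not in black:
            black.add(v)
            observed |= G[v]
            budget -= 1
            v = parent[v]
    return black, len(observed) - initially_observed
```

On the Naive Greedy counterexample from earlier in this subsection (a degree-2 bridge leading to a hub, next to gray nodes with slightly more unseen neighbors), this sketch spends its budget on the bridge and the hub because the hub's ratio dominates, whereas a rule that only counts a gray node's own unseen neighbors would skip the bridge.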

The approximation guarantee of Tada-Probe is stated in the following theorem.

Theorem 2

Tada-Probe returns a $\frac{1}{r+1}$-approximate probing strategy for the Tada-GPM problem, where $r$ is the radius of the optimal solution.

Proof.

Let $\hat{S}$ denote the solution returned by Tada-Probe and let $S^{*}=\{v^{*}_{1},v^{*}_{2},\dots,v^{*}_{k}\}$ be an optimal solution, which results in the maximum number of newly discovered nodes, denoted by $OPT$. We assume that both $\hat{S}$ and $S^{*}$ contain exactly $k$ nodes since adding more nodes never gives a worse solution. We call the number of unobserved nodes discovered by $S^{\prime}$ in addition to those discovered by $S$ the marginal benefit of $S^{\prime}$ with respect to $S$, denoted by $\Delta_{S}(S^{\prime})$. For a single node $v$, $\Delta_{S}(v)=\Delta_{S}(\{v\})$. In addition, the ratio of the marginal benefit to the distance from the set $S$ to a node $v$, called the benefit ratio, is denoted by $\delta_{S}(v)=\frac{\Delta_{S}(v)}{|P_{S}(v)|}$.

Since each iteration of Tada-Probe adds all the nodes along the shortest path connecting $V^{f}$ to $v_{max}$, we assume that $t\leq k$ iterations are performed. In iteration $i\geq 0$, node $v^{i}_{max}$ is selected to probe and, up to that iteration, the set $S^{i}$ of nodes has been selected.

Due to the greedy selection, we have, $\forall i\geq 1,\forall\hat{v}\in S^{i}\backslash S^{i-1},\forall v^{*}\in S^{*}$ ,

$$\delta_{S^{i-1}}(v^{i}_{max})\geq\delta_{S^{i-1}}(v^{*})$$

Thus, we obtain,

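$$|P_{S^{i-1}}(v^{i}_{max})|\cdot\delta_{S^{i-1}}(v^{i}_{max})\geq\sum_{j=|S^{i-1}|}^{|S^{i}|}\delta_{S^{i-1}}(v^{*}_{j})$$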

or, equivalently,

$$\Delta_{S^{i-1}}(v^{i}_{max})\geq\sum_{j=|S^{i-1}|}^{|S^{i}|}\delta_{S^{i-1}}(v^{*}_{j})\tag{3.3}$$

Adding Eq. 3.3 over all iterations gives,

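$$\sum_{i=1}^{t}\Delta_{S^{i-1}}(v^{i}_{max})\geq\sum_{i=1}^{t}\sum_{j=|S^{i-1}|}^{|S^{i}|}\delta_{S^{i-1}}(v^{*}_{j})\tag{3.4}$$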

The left-hand side is exactly $f(\hat{S})$, which is the sum of the marginal benefits over all iterations. The right-hand side is the sum of the benefit ratios over all the nodes in the optimal solution $S^{*}$ with respect to the sets $S^{i-1}$, where $0\leq i\leq k-1$, which are subsets of $\hat{S}$. Thus, $\forall i,j$,

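$$\delta_{S^{i-1}}(v^{*}_{j})\geq\delta_{\hat{S}}(v^{*}_{j})\geq\frac{\Delta_{\hat{S}}(v^{*}_{j})}{|P_{\hat{S}}(v^{*}_{j})|}\geq\frac{\Delta_{\hat{S}}(v^{*}_{j})}{r}$$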

Then, the right hand side is,

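$$\sum_{i=1}^{t}\sum_{j=|S^{i-1}|}^{|S^{i}|}\delta_{S^{i-1}}(v^{*}_{j})\geq\sum_{i=1}^{t}\sum_{j=|S^{i-1}|}^{|S^{i}|}\frac{\Delta_{\hat{S}}(v^{*}_{j})}{r}$$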

Notice that $\Delta_{\hat{S}}(v^{*}_{j})$ is the marginal benefit of node $v^{*}_{j}$ with respect to set $\hat{S}$ , hence, the summation itself becomes,

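$$\sum_{i=1}^{t}\sum_{j=|S^{i-1}|}^{|S^{i}|}\Delta_{\hat{S}}(v^{*}_{j})=\sum_{j=1}^{k}\Delta_{\hat{S}}(v^{*}_{j})=OPT-f(\hat{S})$$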

Thus, Eq. 3.4 is reduced to,

[TABLE]

Rearranging the above equation, we get,

[TABLE]

which completes our proof.
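To make the benefit ratio $\delta_{S}(v)$ used in the proof concrete, the following is a minimal sketch of one greedy selection step on a toy graph. The adjacency-dictionary representation, the function name `benefit_ratios` and the toy graph are illustrative assumptions and not part of the original algorithm listing. The sketch also assumes that $\Delta_{S}(v)$ counts only the neighbors of $v$ that are not yet observed and that $|P_{S}(v)|$ is the hop distance from the probed set to $v$.

```python
from collections import deque


def benefit_ratios(graph, probed):
    """Return {v: Delta_S(v) / |P_S(v)|} for every unprobed node reachable from `probed`.

    Assumptions (illustrative): Delta_S(v) counts neighbors of v that are not yet
    observed, and |P_S(v)| is the hop distance from the probed set to v.
    """
    # Observed nodes: the probed nodes plus all of their neighbors.
    seen = set(probed)
    for u in probed:
        seen.update(graph[u])

    # Breadth-first search from the probed set gives the hop distances.
    dist = {u: 0 for u in probed}
    queue = deque(probed)
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)

    ratios = {}
    for v, d in dist.items():
        if d == 0:
            continue  # already probed
        gain = len(set(graph[v]) - seen)
        ratios[v] = gain / d
    return ratios


# Toy topology (hypothetical): r is the fully probed root.
graph = {
    "r": ["a", "b"],
    "a": ["r", "c"],
    "b": ["r", "d", "e", "f"],
    "c": ["a"],
    "d": ["b"],
    "e": ["b"],
    "f": ["b"],
}
ratios = benefit_ratios(graph, {"r"})
# Greedy step: pick the node with the largest benefit ratio (ties broken by name).
v_max = max(sorted(ratios), key=ratios.get)
```

In this toy instance node `b` reveals three unobserved neighbors at distance one, so it is the node a greedy step would pick under the stated assumptions.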

3.3.1 Improved Heuristic.

Despite its $\frac{1}{r+1}$-approximation guarantee, the Tada-Probe algorithm considers only the gain of the ending node of a shortest path and completely ignores the on-the-way benefit. That is, the newly observed nodes discovered when probing the intermediate nodes on the shortest path are neglected in making decisions. Thus, we can improve Tada-Probe by counting all the newly observed nodes along the connecting paths, which are not necessarily shortest paths, and selecting the path with the largest benefit ratio. Since the selected path has a benefit ratio at least as high as that obtained by considering only the ending nodes, the $\frac{1}{r+1}$-approximation factor is preserved.

Following that idea, we propose a Dijkstra-based algorithm to select the path with maximum benefit ratio. We assign to each node $u$ a benefit ratio $\delta(u)$, a distance measure $d(u)$, and a benefit value $\Delta(u)$. Our algorithm iteratively selects a node $u$ with the highest benefit ratio and propagates the distance and benefit to its neighbors: if a neighbor $v$ observes that its benefit ratio gets higher by going through $u$, then $v$ updates its variables and takes $u$ as its direct predecessor. In this way, our algorithm finds the path with the highest benefit ratio.

Note that extra care needs to be taken in our algorithm to avoid loops in the computed paths. Loops normally do not appear, since closing a loop only increases the distance by one while leaving the benefit of the path unchanged. However, in extreme cases where a path passes through a node with an exceptionally high number of connections to unobserved nodes, loops may happen. To avoid loops, we store the predecessor of each node and, before updating a path, traverse the predecessors back until reaching a fully observed node to check whether the update would close a loop.
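The quantity that the improved heuristic targets, the benefit ratio of a whole path including the on-the-way benefit, can be illustrated on a toy graph. The sketch below is not the Dijkstra-based procedure described above; it is an exhaustive search over simple paths of bounded length, which is only practical for very small graphs and is meant as a reference for what "path benefit ratio" means. The function name `best_ratio_path`, the `max_len` bound and the toy graph are assumptions made for illustration.

```python
def best_ratio_path(graph, probed, max_len):
    """Exhaustively find the simple path (starting at an observed, unprobed node)
    that maximizes (newly observed nodes along the path) / (path length).

    Brute-force reference for tiny graphs only; not the Dijkstra-based heuristic.
    """
    probed = set(probed)
    seen = set(probed)
    for u in probed:
        seen.update(graph[u])
    frontier = sorted(seen - probed)  # nodes that can be probed right away

    best_ratio, best_path = 0.0, []

    def extend(path, revealed):
        nonlocal best_ratio, best_path
        ratio = len(revealed) / len(path)
        if ratio > best_ratio:
            best_ratio, best_path = ratio, list(path)
        if len(path) == max_len:
            return
        for v in sorted(graph[path[-1]]):
            if v in probed or v in path:
                continue  # keep the path simple, which also rules out loops
            path.append(v)
            extend(path, revealed | (set(graph[v]) - seen))
            path.pop()

    for s in frontier:
        extend([s], set(graph[s]) - seen)
    return best_ratio, best_path


# Toy topology (hypothetical): probing a, then c, reveals five new nodes.
graph = {
    "r": ["a", "b"],
    "a": ["r", "c", "x"],
    "b": ["r", "p", "q"],
    "c": ["a", "y1", "y2", "y3"],
    "x": ["a"],
    "p": ["b"],
    "q": ["b"],
    "y1": ["c"],
    "y2": ["c"],
    "y3": ["c"],
}
ratio, path = best_ratio_path(graph, {"r"}, max_len=3)
```

Here the endpoint-only ratio of `c` is 3/2, lower than the ratio 2 of `a` or `b` alone, while the path `a`, `c` reveals five nodes with two probes and therefore has ratio 2.5. This is the kind of case in which counting intermediate nodes changes the decision.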

3.3.2 Optimal ILP Algorithm.

To study the optimal solution of the Tada-GPM problem when the topology is available, we present an Integer Linear Programming (ILP) formulation. We can then use an off-the-shelf solver, e.g., CPLEX or Gurobi, to find an optimal solution. Unfortunately, integer linear programming is NP-hard in general, so no polynomial-time algorithm is known and solving the formulation is extremely time-consuming.

As a preprocessing step, we collapse all fully probed nodes into a single node $r$ and connect $r$ to the partially observed nodes. Assume that there are $n$ nodes including $r$. For each $u\in V$, we define $y_{u}\in\{0,1\}$ such that,

[TABLE]

Since at most $k$ nodes are selected adaptively, we can view the solution as a tree with at most $k$ layers. Thus, we define $x_{uj}\in\{0,1\},\forall u\in V,j=1..k$ such that,

[TABLE]

The Tada-GPM problem selects at most $k$ nodes, i.e., $\sum_{u\in V}x_{uk}\leq k$, to maximize the number of newly observed nodes, i.e., $\sum_{u\in V}y_{u}$. A node is observed if at least one of its neighbors is selected, meaning $y_{u}\leq\sum_{v\in N(u)}x_{vk}$, where $N(u)$ denotes the set of $u$'s neighbors. Since $r$ is the initial fully observed node, we have $x_{r0}=1$. Furthermore, $u$ can be selected at layer $j$ only if at least one of its neighbors has already been probed, and thus $x_{uj}\leq\sum_{v\in N(u)}x_{v(j-1)}$.

Our formulation is summarized as follows,

[TABLE]

From the solution of the above ILP, we obtain the solution for our Tada-GPM instance by simply selecting the nodes $u$ such that $x_{uk}=1$. Note that the layering scheme in our formulation guarantees both that the returned solution is connected and that it contains the root node $r$; thus, the returned solution is feasible and optimal.
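For very small instances, the optimum that the ILP is meant to deliver can be cross-checked by plain enumeration. The sketch below is not the ILP and does not call any solver. It enumerates all node sets of size at most $k$, keeps those that can be probed one node at a time starting from the root, and returns the set that reveals the most nodes outside the initially known part. The helper names, the toy graph and the exact definition of "newly observed" (nodes outside $r$ and its neighbors) are assumptions made for illustration.

```python
from itertools import combinations


def is_probeable(graph, root, chosen):
    """True if the nodes in `chosen` can be probed one by one, each being adjacent
    to the root or to a previously probed node (the connectivity requirement)."""
    reached = {root}
    remaining = set(chosen)
    progress = True
    while remaining and progress:
        progress = False
        for u in sorted(remaining):
            if any(v in reached for v in graph[u]):
                reached.add(u)
                remaining.discard(u)
                progress = True
    return not remaining


def brute_force_optimum(graph, root, k):
    """Enumerate all feasible probe sets of size <= k; exponential, tiny graphs only."""
    known = {root} | set(graph[root])  # observed before any probe
    candidates = sorted(set(graph) - {root})
    best_value, best_set = 0, ()
    for size in range(1, k + 1):
        for chosen in combinations(candidates, size):
            if not is_probeable(graph, root, chosen):
                continue
            revealed = set()
            for u in chosen:
                revealed.update(graph[u])
            value = len(revealed - known)
            if value > best_value:
                best_value, best_set = value, chosen
    return best_value, best_set


# Toy topology (hypothetical), with the fully probed nodes collapsed into r.
graph = {
    "r": ["a", "b"],
    "a": ["r", "c", "x"],
    "b": ["r", "p", "q"],
    "c": ["a", "y1", "y2", "y3"],
    "x": ["a"],
    "p": ["b"],
    "q": ["b"],
    "y1": ["c"],
    "y2": ["c"],
    "y3": ["c"],
}
value, chosen = brute_force_optimum(graph, "r", 2)
```

With a budget of two probes the enumeration selects `a` and `c`, revealing five new nodes, whereas probing `a` and `b` would reveal only four.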

Theorem 3

The solution of the ILP in Eq. 3.15 yields an optimal solution of the Tada-GPM problem.

3.3.3 Empirical Evaluation.

Here, we compare the probing performance, in terms of the number of newly probed nodes, that different algorithms deliver on a Facebook ego network (http://snap.stanford.edu/data/egonets-Facebook.html) with 347 nodes and 5029 edges. The results are presented in Fig. 4. The figure shows that our improved heuristic very often matches the optimal performance of the ILP algorithm, while Tada-Probe is just below these two methods. In contrast, the naive Greedy algorithm performs badly, which is consistent with its lack of any guarantee on solution quality.

3.4 Training Models.

We consider two classes of well-studied machine learning models. First, a linear regression model is applied to learn a linear combination of the features, characterized by its coefficients. These coefficients give a linear representation of the dependence of the labels on the features. The output of the training phase of the linear regression model is a function $f_{Lin}(.)$, which is used to estimate the gain $w_{u}^{o}$ of probing a node $u$ in the subgraph $G^{\prime}$. Note that we learn $f_{Lin}(.)$ from sampled graphs $G^{\prime}_{i}$ of the reference graph $G^{r}$ and use $f_{Lin}(.)$ to probe the incomplete graph $G^{\prime}$.

Second, we apply logistic regression to our problem as follows. Let $w_{u}^{o}$ and $w_{v}^{o}$ be the gains of probing the two nodes $u$ and $v$, respectively. Given a pair of nodes $<u,v>$ in $V_{G^{\prime}_{i}}^{p}$, our logistic model $f_{Log}(.)$ learns to predict whether $w_{u}^{o}$ is larger than $w_{v}^{o}$. Thus, for each $G^{\prime}_{i}$ in $\mathcal{G}$, we compute the node features of each node $u\in V_{G^{\prime}_{i}}^{p}$. We generate $\binom{|V_{G^{\prime}_{i}}^{p}|}{2}$ pairs of nodes for each subgraph $G^{\prime}_{i}$ and then concatenate the features of the two nodes in a pair to form a single data point. Each data point $<u,v>$ is labeled by a binary value ( $1$ or [math]):
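The construction of the pairwise training set can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_pairwise_dataset`, the dictionary inputs and the toy feature values are hypothetical, and the sketch assumes the label is 1 when the gain of $u$ exceeds that of $v$ and 0 otherwise, which the text above does not spell out for the second case.

```python
from itertools import combinations


def build_pairwise_dataset(features, gains):
    """Turn per-node features and gains into pairwise data points.

    features: {node: list of feature values}
    gains:    {node: observed gain of probing that node}
    Returns a list of (concatenated_features, label) tuples, one per unordered pair.
    Assumed labeling: 1 if gain(u) > gain(v), else 0.
    """
    rows = []
    for u, v in combinations(sorted(features), 2):
        x = list(features[u]) + list(features[v])  # concatenate the two feature vectors
        y = 1 if gains[u] > gains[v] else 0
        rows.append((x, y))
    return rows


# Toy candidate set with two hypothetical features per node.
features = {"a": [3, 0.5], "b": [1, 0.2], "c": [2, 0.9]}
gains = {"a": 4, "b": 7, "c": 1}
dataset = build_pairwise_dataset(features, gains)
```

Three candidate nodes yield $\binom{3}{2}=3$ data points, each with a feature vector twice as long as that of a single node. Any binary classifier, logistic regression included, can then be trained on these rows.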

[TABLE]

4 Experiments

In this section, we perform experiments on real-world networks to evaluate performance of the proposed methods.

Datasets. Table 3 describes three types of real-world networks used in our experiments. The road network [28, 29] includes edges connecting different points of interest (gas stations, restaurants) in cities. The second type of network includes several snapshots of the GnuTella network, in which nodes represent hosts in the GnuTella network topology and edges represent connections between hosts. In the third type, we use five collaboration networks that cover scientific collaborations between authors of papers, in which nodes represent scientists and edges represent collaborations. The last type of network models metabolic pathways, which are linked series of chemical reactions occurring in a cell.

Sampling Methods. We adopt Breadth-First Search sampling [21] to generate subgraphs for our experiments. In our regression model, for each network that is marked for training in Table 3, we generate samples whose number of nodes varies from $0.5\%$ to $10\%$ of the number of nodes in $G$. The sizes of the subgraphs follow a power-law distribution with exponent $\gamma=-1/4$. The size of the subgraphs used in validation is kept at roughly $5\%$ of the network size.
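As an illustration of the sampling step, the following is a minimal sketch of Breadth-First Search sampling that stops once a target number of nodes has been collected and returns the induced subgraph. The function name `bfs_sample`, the adjacency-dictionary format and the choice to return the induced subgraph are assumptions; the sketch does not reproduce the power-law size distribution or any detail of the sampler in [21].

```python
from collections import deque


def bfs_sample(graph, start, target_size):
    """Collect up to `target_size` nodes in BFS order from `start` and return the
    subgraph induced on them as an adjacency dictionary."""
    order = [start]
    seen = {start}
    queue = deque([start])
    while queue and len(order) < target_size:
        u = queue.popleft()
        for v in graph[u]:
            if v not in seen:
                seen.add(v)
                order.append(v)
                queue.append(v)
                if len(order) == target_size:
                    break
    kept = set(order)
    # Keep only edges whose two endpoints were both sampled.
    return {u: [v for v in graph[u] if v in kept] for u in order}


# Toy path graph 0 - 1 - 2 - 3 - 4 (hypothetical).
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
sample = bfs_sample(path_graph, start=0, target_size=3)
```

Calling the function with target sizes drawn from the desired size distribution would produce a set of training subgraphs of varying size.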

Probing Methods. We compare the performance of our linear regression (LinReg) and logistic regression (LogReg) probing methods with MaxOutProbe [3], RAND (probe a random gray node), and the centrality-based methods DEG, BC, CC, PR, and CLC (see Table 2), which probe the node with the highest centrality value in each step. Each method uses a different objective function for selecting nodes to probe, but all share the same idea: they rank the nodes in the set $V^{p}$ of $G^{\prime}$ and probe the highest-ranked nodes first.

Performance Metrics. For each probing method, we conduct probing at budgets $k\in\{1,100,200,300\}$. During the probing process, we compare the probing methods by the number of newly explored nodes, i.e., the increase in the number of nodes in $G^{\prime}$ after $k$ probes. For each network used in validation, we report the average result over 50 subgraphs. To avoid evaluating on training data, we do not use the training networks for validation.

We use the statistical programming language R to implement MaxOutProbe and C++ with the igraph framework to implement LinReg, LogReg, and the metric-based probing methods. All of our experiments are run on a Linux server with a 2.30GHz Intel(R) Xeon(R) CPU and 64GB memory.
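The shared rank-and-probe loop can be sketched as below. This is an illustrative simulation rather than the evaluation code used in the experiments: the function names, the `known_adj` bookkeeping and the toy graph are assumptions, and the example score is the observed degree, standing in for any of the ranking functions (a centrality value or a learned model's prediction).

```python
def adaptive_probe(graph, initially_probed, k, score):
    """Simulate k adaptive probes on the hidden `graph`.

    `score(u, known_adj)` ranks a candidate u using only the edges seen so far.
    Returns the number of nodes newly added to the observed subgraph.
    """
    probed = set()
    known_adj = {}  # observed subgraph: node -> set of known neighbors

    def probe(u):
        # Probing u reveals all of its edges and therefore all of its neighbors.
        probed.add(u)
        known_adj.setdefault(u, set())
        for v in graph[u]:
            known_adj[u].add(v)
            known_adj.setdefault(v, set()).add(u)

    for u in initially_probed:
        probe(u)
    start = len(known_adj)

    for _ in range(k):
        candidates = sorted(set(known_adj) - probed)  # observed but not yet probed
        if not candidates:
            break
        best = max(candidates, key=lambda u: score(u, known_adj))
        probe(best)
    return len(known_adj) - start


def observed_degree(u, known_adj):
    """Example ranking: degree of u in the currently observed subgraph."""
    return len(known_adj[u])


# Toy hidden topology (hypothetical).
graph = {
    "r": ["a", "b"],
    "a": ["r", "c", "x"],
    "b": ["r", "p", "q"],
    "c": ["a", "y1", "y2", "y3"],
    "x": ["a"],
    "p": ["b"],
    "q": ["b"],
    "y1": ["c"],
    "y2": ["c"],
    "y3": ["c"],
}
newly_explored = adaptive_probe(graph, ["r"], k=2, score=observed_degree)
```

On this toy graph all candidates have observed degree one, so the ranking is uninformative and two probes reveal four nodes, one fewer than the best achievable with the same budget.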

4.1 Comparison between Machine Learning Methods and Metric-based Probing Methods.

Table 4 presents the probing performance of metric-based methods in each group of networks, with the best one highlighted. The reported result is the number of explored nodes at the end of the probing process ( $k=300$ ).

In Fig. 5, we take the top 3 metric-based methods for each type of network and compare their performance with LinReg and LogReg. Here, both machine learning methods are trained using the benefit of a node only 1 step ahead ( $h=1$ , i.e., setting $k=1$ in Alg. 1). They outperform the metric-based methods on the GnuTella networks by $20\%$ on average. They match the performance of the other methods on the Road network and are $7\%$ worse than BC on the Collaboration networks. The experiment with MaxOutProbe on the Road network exposes the non-adaptive behavior of MaxOutProbe: $|V^{p}|$ of the Road network's samples is smaller than the maximal budget $k$, and, because the Road network has very low density, MaxOutProbe explores only a few new nodes in the underlying network after probing the $|V^{p}|$ nodes of a sampled network. These factors lead to very poor performance of MaxOutProbe. MaxOutProbe also performs poorly compared with the adaptive implementations of the metric-based methods: it is $67\%$ and $36\%$ worse than LinReg on the Collaboration and GnuTella networks, respectively.

4.2 Benefits of Looking Into Future Gain.

For LinReg and LogReg, we use different functions trained with different labels (marked as $h=1$ , $h=2$ , $h=3$ , $h=4$ , which indicate that the regression functions are trained with the benefit of a node 1, 2, 3, or 4 steps ahead). In Fig. 6, we compare LinReg and LogReg with RAND and the best metric-based method for each type of network reported in Table 4. We use DEG as the baseline for our performance comparison. Specifically, we take the ratio of the number of newly explored nodes of each probing method to the number of newly explored nodes of DEG at the end of the probing process. We omit the results of MaxOutProbe due to its poor performance observed in the previous subsection.

LinReg shows successive improvements in probing performance with $h=1$ , $h=2$ , $h=3$ , as it ranks the candidate nodes in the set $V^{p}$ based on their predicted gain at an increasing number of hops away from them. Overall, LinReg performs better than LogReg. With $h=3$ , LinReg and LogReg outperform DEG by $12\%$ to $15\%$ on the Collaboration and GnuTella networks; on the Road network, the improvement is $1\%$ to $5\%$ . LinReg with $h=3$ outperforms the best metric-based method on the GnuTella networks by $11.6\%$ ; the improvements on the Collaboration and Road networks are $7.5\%$ and $2.2\%$ , respectively. With $h=4$ , the performance of LinReg and LogReg starts to decrease relative to $h=1$ , $h=2$ and $h=3$ . This indicates that looking ahead at the future benefit of selecting a node helps only within a limited number of hops from the node.

Among the metric-based methods, while BC consistently performs well, the best method varies across the networks. This indicates that the underlying network structure affects the performance of metric-based methods and that it is hard to determine which node centrality is best for which type of network. Interestingly, random probing matches or even outperforms BC, DEG, and CC on the GnuTella networks. This is because GnuTella networks have a low average clustering coefficient, which gives BFS-based samples of these networks a star structure. Consequently, metric-based methods tend to assign the same score to the candidate nodes in a sampled network, which lets random probing perform better than metric-based methods on this type of network.

5 Conclusion

This paper studies the Graph Probing Maximization problem which serves as a fundamental component in many decision making problems. We first prove that the problem is not only NP-hard but also cannot be approximated within any finite factor. We then propose a novel machine learning framework to adaptively learn the best probing strategy in any individual network. The superior performance of our method over metric-based algorithms is shown by a set of comprehensive experiments on many real-world networks.

Supplementary Material for ‘Towards Optimal Strategy for Adaptive Probing in Incomplete Networks’

Complete proof of Theorem 1

Proof.

To prove Theorem 1, we construct a class of instances of the problems for which no approximation algorithm with a finite factor exists.

We construct a class of instances of the probing problems as illustrated in Figure 7a. Each instance in this class has a single fully probed node $b_{1}$ (in black) and $n$ observed nodes (in gray), each of which is connected to $b_{1}$ by an edge. One of the observed nodes, namely $g^{*}$, which varies between instances, has $m$ connections to $m$ unknown nodes (in white). Thus, the partially observed graph contains $n+1$ nodes, one fully probed node and $n$ observed nodes that are selectable, while the underlying graph has $n+m+1$ nodes in total. Each instance of the family has a different node $g^{*}$ to which the $m$ unknown nodes are connected. We now prove that, in this class, no algorithm can give a finite-factor approximate solution for the two problems.

First, we observe that for any $k\geq 1$ , the optimal solution, which probes the node with connections to the unknown nodes, has the optimal value of $m$ newly explored nodes, denoted by $OPT=m$ . We examine the two possible cases of algorithms in turn, i.e., deterministic and randomized.

– Consider a deterministic algorithm $\mathcal{A}$ . Since $\mathcal{A}$ is unaware of the connections from gray to unknown nodes, given a budget $1\leq k\ll n$ , the list or sequence of nodes that $\mathcal{A}$ selects is exactly the same for all instances in the class. Thus, there are instances in which $g^{*}$ is not in the fixed list/sequence of nodes selected by $\mathcal{A}$ . In such cases, the number of unknown nodes explored by $\mathcal{A}$ is 0. Compared to $OPT=m$ , $\mathcal{A}$ is not a finite-factor approximation algorithm.

– Consider a randomized algorithm $\mathcal{B}$ . Like the deterministic algorithm $\mathcal{A}$ , $\mathcal{B}$ does not know the connections from the partially observed nodes to the white ones. Thus, $\mathcal{B}$ essentially selects $k$ of the $n$ observed nodes at random. This randomized scheme does not guarantee that $g^{*}$ is among the selected nodes, and thus, in many situations, the number of unknown nodes discovered is 0, which prevents $\mathcal{B}$ from being a finite-factor approximation algorithm. On average, $\mathcal{B}$ has a $\frac{k}{n}$ chance of selecting $g^{*}$ , which leads to an optimal solution with $OPT=m$ . Hence, the expected objective value is $\frac{km}{n}$ and its ratio to the optimal value is $\frac{k}{n}$ . Since $k\ll n$ , the ratio is $O(\frac{1}{n})$ , which vanishes as $n$ grows, so $\mathcal{B}$ does not achieve a finite approximation factor even in the average case.

In both cases of deterministic and randomized algorithms, there is no finite factor approximation algorithm for ada-GPM or batch-GPM.
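The two cases of the argument can be checked numerically on a small member of the instance family. The sketch below is an illustration under simplifying assumptions: gray nodes are labeled $0,\dots,n-1$, the deterministic algorithm is modeled as always probing the first $k$ gray nodes (any other fixed list behaves the same way against an adversarial choice of $g^{*}$), and the randomized algorithm is modeled as a uniformly random $k$-subset. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def explored(selection, g_star, m):
    """Newly explored nodes: m if the special gray node is probed, else 0."""
    return m if g_star in selection else 0


def worst_case_deterministic(n, k, m):
    """A fixed selection (here: gray nodes 0..k-1) evaluated on the worst instance."""
    fixed = set(range(k))
    return min(explored(fixed, g_star, m) for g_star in range(n))


def expected_uniform_random(n, k, m, g_star=0):
    """Exact expected value of probing a uniformly random k-subset of gray nodes."""
    subsets = list(combinations(range(n), k))
    total = sum(explored(s, g_star, m) for s in subsets)
    return Fraction(total, len(subsets))


n, k, m = 6, 2, 10
det_value = worst_case_deterministic(n, k, m)
rand_value = expected_uniform_random(n, k, m)
```

For $n=6$, $k=2$ and $m=10$, the fixed selection explores no new node on the worst instance, while the uniformly random selection attains exactly $\frac{km}{n}=\frac{10}{3}$ in expectation, matching the ratio $\frac{k}{n}$ derived above.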

Bibliography

[1] H. T. Nguyen and T. N. Dinh, “Targeted cyber-attacks: Unveiling target reconnaissance strategy via social networks,” in INFOCOM WKSHPS, pp. –, IEEE, 2016.
[2] K. Avrachenkov, P. Basu, G. Neglia, B. Ribeiro, and D. Towsley, “Pay few, influence most: Online myopic network covering,” in INFOCOM WKSHPS, pp. 813–818, IEEE, 2014.
[3] S. Soundarajan, T. Eliassi-Rad, B. Gallagher, and A. Pinar, “Maxoutprobe: An algorithm for increasing the size of partially observed networks,” NIPS, 2015.
[4] S. Hanneke and E. P. Xing, “Network completion and survey sampling,” in AISTATS, pp. 209–215, 2009.
[5] F. Masrour, I. Barjesteh, R. Forsati, A. Esfahanian, and H. Radha, “Network completion with node similarity: A matrix completion approach with provable guarantees,” in ASONAM, pp. 302–307, IEEE, 2015.
[6] D. Golovin and A. Krause, “Adaptive submodularity: A new approach to active learning and stochastic optimization,” in COLT, pp. 333–345, 2010.
[7] L. Seeman and Y. Singer, “Adaptive seeding in social networks,” in FOCS, pp. 459–468, IEEE, 2013.
[8] J. Cho, H. Garcia-Molina, and L. Page, “Efficient crawling through url ordering,” 1998.
