Transitivity Demolition and the Falls of Social Networks
Hung T. Nguyen, Nam P. Nguyen, Tam Vu, Huan X. Hoang, Thang N. Dinh

TL;DR
This paper introduces efficient approximation algorithms to identify critical nodes and edges in social networks whose removal significantly disrupts the network's triangular connectivity, enhancing understanding of network robustness.
Contribution
It proposes novel approximation algorithms with proven guarantees for identifying vital network elements, scalable to large social networks, and demonstrates superior performance over existing methods.
Findings
Algorithms achieve near-optimal solutions with (1-1/e) guarantee.
Methods are up to 100x faster than current state-of-the-art.
Experiments on large real-world networks validate effectiveness.
| Notation | Meaning |
|---|---|
| Number of vertices/nodes () | |
| Number of edges/links () | |
| The degree of | |
| The set of ’s neighbors | |
| The set of triangles on a node | |
| The number of triangles on | |
| The set of triangles on an edge | |
| The set of triangles on | |
| The set of triangles on a subset of edges |
| Dataset | Type | #Nodes | #Edges | Avg. degree |
|---|---|---|---|---|
| Gnutella4 | Peer-to-peer network(*) | 10.9K | 40K | 3.7 |
| Flickr | Photo sharing network(†) | 80.5K | 11.8M | 138.8 |
| | Web graph(*) | 876K | 5.1M | 5.83 |
| Skitter | Internet Topology(*) | 1.7M | 11.1M | 6.53 |
| Wiki-Talk | Wikipedia Communication(*) | 2.4M | 5M | 2.1 |
| Orkut | Online Social Network(*) | 3M | 117M | 78 |
| Data | | | | | |
|---|---|---|---|---|---|
| Flickr | 0.65 | 0.74 | 0.81 | 0.85 | 0.88 |
| Gnutella | 0.77 | 0.90 | 1 | 1 | 1 |
| | 0.78 | 0.78 | 0.78 | 0.79 | 0.79 |
| Skitter | 0.77 | 0.80 | 0.82 | 0.84 | 0.85 |
| Wiki-Talk | 0.84 | 0.95 | 0.97 | 0.99 | 0.99 |
| Orkut | 0.75 | 0.79 | 0.81 | 0.81 | 0.82 |
Hung T. Nguyen and Thang N. Dinh are with the Computer Science Department, Virginia Commonwealth University, Richmond, VA, 23220. Email: {hungnt, tndinh}@vcu.edu. Nam P. Nguyen is with the Computer and Information Sciences Department, Towson University, Towson, MD, 21252. Tam Vu is with the Computer Science and Engineering Department, University of Colorado, Denver, CO, 80204. Huan X. Hoang is with the Information Technology Department, Vietnam National University, Hanoi, Vietnam.
Abstract
In this paper, we study crucial elements of a complex network, namely its nodes and connections, which play a key role in maintaining the network’s structure and function under unexpected structural perturbations of nodes and edges removal. Specifically, we want to identify vital nodes and edges whose failure (either random or intentional) will break the most number of connected triples (or triangles) in the network. This problem is extremely important because connected triples form the foundation of strong connections in many real-world systems, such as mutual relationships in social networks, reliable data transmission in communication networks, and stable routing strategies in mobile networks. Disconnected triples, analog to broken mutual connections, can greatly affect the network’s structure and disrupt its normal function, which can further lead to the corruption of the entire system. The analysis of such crucial elements will shed light on key factors behind the resilience and robustness of many complex systems in practice.
We formulate the analysis as multiple optimization problems and show their intractability. We then propose efficient approximation algorithms, namely DAK-n and DAK-e, which guarantee a $(1-1/e)$-approximation ratio (compared to the overall optimal solutions) while having the same time complexity as the best triangle counting and listing algorithm on power-law networks. This advantage lets our algorithms scale well even on very large networks. From an application perspective, we perform comprehensive experiments on real social traces with millions of nodes and billions of edges. These experiments indicate that our approaches achieve comparable or better results while being up to 100x faster than current state-of-the-art methods.
Index Terms:
Triangle breaking, Social networks, Approximation algorithms
I Introduction
Robustness and resilience to unexpected perturbations are perhaps among the most desirable properties of real-world complex systems, such as the World Wide Web, transportation networks, communication networks, biological networks and social information networks. In general, the resilience of a network evaluates how much the network’s normal function is affected by an external perturbation, i.e., it measures the network’s response to unexpected events such as adversarial attacks and random failures [1]. In order to improve the robustness of real-world systems, it is therefore important to obtain key insights into the structural vulnerabilities of the networks representing them. A major aspect of this is to analyze and understand the effect of the failure (either intentional or random) of individual components on the degree of clustering in the network.
Clustering, or more particularly, the number of connected triples/triangles, is a fundamental network property that has been shown to be relevant to a variety of topics, such as communities of genes in biological networks, forwarding and routing tables in mobile networks, and especially strong connections among users in online social networks (OSNs) [2]. Connected triples nicely capture the social intuition “a friend of your friend is also your friend” [3], and thus are the fundamental pattern of information diffusion in multiple systems. For example, consider the propagation of information through a social network, such as the spread of a rumor. A growing body of work has identified the importance of the number of connected triples to such propagation; the more connected triples a network has, the easier it is for information to propagate [4, 5, 6, 7, 8]. Connected triples are also behind the fall of some online social sites, such as MySpace and Friendster, which suffered a catastrophic decline in active users, activity traffic, and consequently, popularity. For instance, Friendster claimed to have over 100 million users at its peak, but most of them had quit and fled to other networks (e.g., Facebook) by the end of 2009 [9, 10], triggering a cascade of broken bonds and friends leaving Friendster. Identifying the elements that crucially affect the number of connected triples in the network is therefore of great importance.
The importance of connected triples is not limited to social networks; in the context of air transportation networks, [11] argued that connected triples in such a network are beneficial, as passengers on a canceled flight can be rerouted more easily. This metric also plays an important role in the network community structure, which is the core of mobile forwarding and routing strategies in Delay Tolerant Networks (DTNs). In particular, [12] has shown a correlation between the number of disconnected triples and a significant degradation in forwarded packets in DTNs. In addition, as a matter of homeland security, the critical elements for clustering in homeland communication networks should receive greater resources for protection; conversely, the identification of critical elements in a social network of adversaries could potentially limit the spread of information in such a network.
Many measures have been proposed for evaluating the resilience of technological and biological systems; however, only a few have been suggested for social networks. Most studies in the literature focus on how the network behaves under perturbation using measures such as pair-wise connectivity [13], natural connectivity [14], or centrality measures, e.g., degree, betweenness [15], geodesic length [1], eigenvector [16], etc. Nevertheless, most of them (1) focus only on the local but not the global structure of the network, and (2) do not take mutual interactions and social relationships into account. These limitations drive the need for another metric for social resilience. To our knowledge, no existing work has examined the number of connected triples from the perspective of vulnerability. As evidenced by the examples above, the damage caused by broken triples resulting from element-wise failures can have severe effects on the functionality of the network. This motivates an analysis of this metric in complex networks.
Our study in this paper investigates the structural resilience of OSNs under element-wise failures, particularly under the two scenarios of adversarial attacks and random failures. Our goal is to discover and protect critical network elements (nodes and links) whose failures will break the most triples in the network. In a nutshell, our contributions are:
1. We study the resilience of social networks through the number of connected triples. This is an important structural vulnerability of an OSN that can greatly affect its popularity. We formulate the analysis as multiple optimization problems, and show their hardness and intractability.
2. We propose efficient approximation algorithms to identify triangle-breaking points (i.e., nodes and links) in the network structure: the DAK-n algorithm for node removal and the DAK-e algorithm for edge removal. Our proposed approaches are guaranteed to be within a small constant factor of the optimal solutions. Interestingly, both DAK-n and DAK-e have the same time complexity as the best triangle counting/listing algorithms, $O(m^{3/2})$. This makes our algorithms scale well to large social data.
3. We also investigate the input-dependent bounding technique that previously appeared in [17] for the influence maximization problem. The input-dependent bound usually gives a better approximation guarantee than the worst-case bound since it accounts for the particular instance of the problem and the particular run of the algorithm. As shown in the experiments, the input-dependent bound vastly improves over the worst-case guarantee and, for some networks, certifies that the returned solutions are exactly optimal.
4. We carry out extensive experiments in comparison with state-of-the-art methods on real-world data with millions of nodes and edges. The results show that DAK-n and DAK-e substantially outperform the other algorithms in terms of running time: they are up to 100x faster than the direct competitor, GreedyAll [18], which was shown to have the best solution quality and to be among the most scalable methods in its original paper.
Paper organization: Section II reviews studies that are related to our work. Section III describes the notation, measure functions and problem definitions. Section IV presents the NP-completeness proofs, which imply the intractability of the investigated problems. Sections V and VI present our solutions DAK-n and DAK-e for the problems of interest, respectively. In Section VII, we report empirical results of our approaches in comparison with other strategies. Finally, Section VIII concludes the paper.
II Related Work
Many metrics and approaches have been proposed to account for network robustness and vulnerability [19, 20, 21, 22, 23]. While each of these measures has its own emphasis and rationale, they often come with shortcomings that prevent them from capturing desired characteristics of network connectivity and resilience. For example, measures based on shortest paths are rather sensitive to small changes (e.g., removing edges or nodes); algebraic connectivity and diameter are not meaningful for disconnected graphs (all disconnected graphs have the same values); and the number of connected components and component sizes, arguably, do not fully reflect the level of network connectivity.
Vulnerability assessment has attracted a large amount of attention from the network science community. Work in the literature can be divided into two categories: measuring the robustness and manipulating the robustness of a network. For measuring robustness, different measures and metrics have been proposed, such as graph connectivity [13], the diameter, the relative size of the largest components, and the average size of the isolated clusters [15]. Other work suggests using the minimum node/edge cut [24] or the second smallest non-zero eigenvalue of the Laplacian matrix [25]. In terms of manipulating robustness, different strategies have been proposed, such as [15][26], or using graph percolation [27]. Other studies focus on excluding nodes by centrality measures, such as betweenness and the geodesic length [1], eigenvector [16], the shortest path between node pairs [19], the pair-wise connectivity [13], and the propagation of worms and cascading failures [28, 29, 30]. More information on general vulnerability assessment can be found in [14] and the references therein.
Community structure [31] is another common pattern found in real-world networks. Structural vulnerability in social networks has so far been largely unexplored. In a related work [32], the authors introduced community structure vulnerability to analyze how communities are affected when top vertices are excluded from the underlying graphs. They further provided different heuristic approaches to find those critical components in modularity-based community structure. [33] suggested a method based on the generating edges of a community to find the critical components.
Counting and listing triangles in a graph is an important problem, motivated by applications in a variety of areas. Counting triangles on a graph with $n$ vertices and $m$ edges can be performed in a straightforward manner in $O(n^3)$ time. This has been improved to $O(m^{3/2})$ in [34] and to $O(m^{2\omega/(\omega+1)})$, where $\omega$ is the exponent of matrix multiplication [35]. To improve the performance of triangle counting in large graphs, parallel algorithms are also studied in [36]. There are also several works on approximate triangle counting [37, 38, 39]. Recently, the $k$-triangle-breaking-node and $k$-triangle-breaking-edge problems were investigated in [18]. The authors provide NP-completeness proofs and greedy algorithms for the problems. Unfortunately, the NP-completeness proofs contain fundamental flaws that cannot be easily fixed.
III Model and Problem Definition
In this section, we first describe the main problem of interest, and then define its four triangle-breaking variants. We then prove the NP-hardness of those problems. Based on the submodularity of the objective functions, the approximability of each problem is stated using the rich literature on optimizing submodular functions [40, 41].
We represent a social network by an undirected graph $G = (V, E)$ with $n$ nodes and $m$ undirected edges. Given a graph $G$, we study multiple attack models in which the attackers attempt to break the largest number of triangles in the graph by removing nodes and edges either intentionally or at random. Here, a triangle is broken if one of its edges or nodes is removed from the graph. In what follows, we define four variants of the triangle-breaking problem based on node and edge removals.
III-A Problem Definition
**Definition 1 ($k$-triangle-breaking-node)**
Given an undirected graph $G = (V, E)$ and a budget $k$, find a subset of $k$ nodes whose removal will break the maximum number of triangles in $G$,
[TABLE]
where the maximized quantity is the size of the set of triangles with at least one node in the chosen subset, i.e.,
[TABLE]
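To make Definition 1 concrete, the following is a minimal Python sketch that lists the triangles of a small graph, collects those broken by a given node set, and finds the best node set by exhaustive search. The function names (`list_triangles`, `broken_by_nodes`, `best_node_set`) are hypothetical and not from the paper, and the exhaustive search is only meant for tiny graphs.

```python
from itertools import combinations


def list_triangles(edges):
    """Return every triangle of an undirected graph as a sorted 3-tuple."""
    adj = {}
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    triangles = set()
    for u, v in edges:
        if u == v:
            continue
        # Every common neighbor w of u and v closes the triangle (u, v, w).
        for w in adj[u] & adj[v]:
            triangles.add(tuple(sorted((u, v, w))))
    return sorted(triangles)


def broken_by_nodes(triangles, removed):
    """Triangles that contain at least one removed node."""
    removed = set(removed)
    return [t for t in triangles if removed.intersection(t)]


def best_node_set(edges, k):
    """Exhaustive search over all k-subsets of nodes (tiny graphs only)."""
    triangles = list_triangles(edges)
    nodes = sorted({x for e in edges for x in e})
    return max(combinations(nodes, k),
               key=lambda s: len(broken_by_nodes(triangles, s)))


# Two triangles that share node 3 (a "bowtie").
bowtie = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5)]
print(list_triangles(bowtie))    # [(1, 2, 3), (3, 4, 5)]
print(best_node_set(bowtie, 1))  # (3,)
```

Removing node 3 breaks both triangles, while any other single node breaks only one.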
Note that we can formulate the above problem as an Integer Linear Programming (ILP) problem. For each node, define a binary variable that indicates whether the node is removed,
[TABLE]
and for each triangle, define an integral variable that indicates whether the triangle is broken,
[TABLE]
The $k$-triangle-breaking-node problem is to remove $k$ nodes, i.e., the node variables sum to $k$, so as to break the maximum number of triangles, i.e., to maximize the sum of the triangle variables. Because a triangle is broken only if at least one of its nodes is chosen for removal, we impose the following constraint,
[TABLE]
In summary, we have the following equivalent ILP formulation.
[TABLE]
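One way to write the ILP described in the text is sketched below. The symbols are assumptions introduced for illustration, since the original notation is not recoverable here: $x_v$ marks a removed node, $y_{uvw}$ marks a broken triangle, and $T$ denotes the set of all triangles of $G$.

```latex
\begin{align*}
\max \quad & \sum_{(u,v,w) \in T} y_{uvw} \\
\text{s.t.} \quad & y_{uvw} \le x_u + x_v + x_w && \forall (u,v,w) \in T, \\
& \sum_{v \in V} x_v = k, \\
& x_v \in \{0,1\} \ \ \forall v \in V, \qquad y_{uvw} \in \{0,1\} \ \ \forall (u,v,w) \in T.
\end{align*}
```

The first constraint encodes that a triangle can count as broken only when at least one of its three nodes is removed; the second enforces the budget.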
Observe that the above ILP formulation is a special case of the Max-$k$-Coverage [42] problem. Given a universe set of elements and a collection of subsets of that universe, the general Max-$k$-Coverage problem asks for $k$ subsets from the collection that maximize the coverage, where
[TABLE]
is the number of distinct elements in the union of the chosen subsets. We call the number of subsets that an element appears in the frequency of that element. Thus, in the ILP above, the universe set is the set of all triangles and the collection of subsets contains, for each node, the set of triangles on that node. This special case of Max-$k$-Coverage also satisfies the condition that all elements have the same frequency, three, as each triangle involves exactly three nodes.
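The mapping to Max-$k$-Coverage can be sketched in a few lines of Python. This is an illustrative sketch with hypothetical helper names (`node_coverage_instance`, `coverage`, `element_frequencies`); it takes a list of triangles as input and checks the frequency-three property on a toy example.

```python
from collections import Counter


def node_coverage_instance(triangles):
    """Map each node to the set of triangles it belongs to (one subset per node)."""
    subsets = {}
    for t in triangles:
        for v in t:
            subsets.setdefault(v, set()).add(t)
    return subsets


def coverage(subsets, chosen):
    """Number of distinct triangles covered by the chosen nodes' subsets."""
    covered = set()
    for v in chosen:
        covered |= subsets.get(v, set())
    return len(covered)


def element_frequencies(subsets):
    """How many subsets each triangle (element of the universe) appears in."""
    return Counter(t for s in subsets.values() for t in s)


triangles = [(1, 2, 3), (2, 3, 4)]
subsets = node_coverage_instance(triangles)
print(coverage(subsets, [2]))     # 2: node 2 lies on both triangles
print(coverage(subsets, [1, 4]))  # 2: one triangle each
print(set(element_frequencies(subsets).values()))  # {3}
```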
**Definition 2 ($k$-triangle-breaking-edge)**
Given an undirected graph $G = (V, E)$ and a budget $k$, find a subset of $k$ edges whose removal will break the maximum number of triangles in $G$,
[TABLE]
where the maximized quantity is the size of the set of triangles with at least one edge in the chosen subset.
The equivalent ILP of $k$-triangle-breaking-edge is,
[TABLE]
where
[TABLE]
for all .
$k$-triangle-breaking-edge is also a special case of Max-$k$-Coverage in which the elements to be covered are the triangles in $G$, and the collection of subsets contains, for each edge, the set of triangles involving that edge. As each triangle consists of three edges, the frequency of each element in this instance is also three. Moreover, any two subsets have at most one triangle in common.
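The two structural properties of the edge instance (frequency three, pairwise overlap at most one) can be checked on a small example. The sketch below uses hypothetical helper names and assumes node labels are mutually comparable so that edges can be stored as sorted pairs.

```python
from itertools import combinations


def edge_coverage_instance(triangles):
    """Map each edge (as a sorted pair) to the set of triangles containing it."""
    subsets = {}
    for t in triangles:
        u, v, w = sorted(t)
        for e in ((u, v), (u, w), (v, w)):
            subsets.setdefault(e, set()).add((u, v, w))
    return subsets


def max_pairwise_overlap(subsets):
    """Largest number of triangles shared by two distinct edges."""
    return max((len(a & b) for a, b in combinations(subsets.values(), 2)),
               default=0)


# The four triangles of the complete graph on nodes 1..4.
k4_triangles = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4)]
subsets = edge_coverage_instance(k4_triangles)
print(len(subsets))                   # 6 edges
print(max_pairwise_overlap(subsets))  # 1
```

Two distinct edges span at least three nodes, so they can lie together on at most one triangle, which is what the overlap check reflects.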
We also formulate the converse variants, in which we want to break a certain number (or a percentage of the total number) of triangles by removing the fewest nodes/edges from the graph. Their definitions and ILP formulations are given below.
**Definition 3 (min-triangle-breaking-node)**
Given an undirected graph $G = (V, E)$ and a positive integer threshold, find a minimum-size subset of nodes whose removal will break at least that many triangles in $G$.
The ILP for min-triangle-breaking-node is
[TABLE]
**Definition 4 (min-triangle-breaking-edge)**
Given an undirected graph $G = (V, E)$ and a positive integer threshold, find a minimum-size subset of edges whose removal will break at least that many triangles in $G$.
The ILP for min-triangle-breaking-edge is
[TABLE]
Note that min-triangle-breaking-node and min-triangle-breaking-edge are special cases of the Partial Set Cover problem [40], a variation of the set cover problem. Given a universe set and a collection of subsets of that universe, Partial Set Cover finds a smallest subcollection that covers only a required number of the elements in the universe. Thus, min-triangle-breaking-node and min-triangle-breaking-edge are equivalent to Partial Set Cover problems in which each element is in exactly three subsets and the intersection of any three subsets contains at most one element.
IV Hardness and Approximability
We next discuss the complexity and present the best approximation guarantees for our defined problems. The summary of the complexity and approximability results for the studied problems is presented in Table II.
IV-A NP-Completeness
Recent work of Li et al. [18] attempted to prove the NP-completeness of problems similar to $k$-triangle-breaking-node and $k$-triangle-breaking-edge. Unfortunately, their proofs contain some flaws. Specifically, the proof of Theorem 2.1 in [18] relies on a weaker constraint on the set system: “the intersection of any three subsets in the collection has at most one element”. For $k$-triangle-breaking-edge, the correct (and stronger) condition should be: the intersection of any two subsets in the collection has at most one element. Moreover, the proof relies on the assumption that if a problem is not NP-hard then there is a polynomial-time algorithm to solve it. It is not known whether NP-intermediate problems exist between P and the NP-complete problems. Consequently, the correctness of the reduction cannot be confirmed.
We show that all four aforementioned variants are NP-complete. We present a simple NP-completeness proof for min-triangle-breaking-node (and similarly $k$-triangle-breaking-node) via a reduction from the Vertex-Cover problem [42]. The following decision problem, called Node-Triangle-Free, is polynomial-time reducible to the decision versions of $k$-triangle-breaking-node and min-triangle-breaking-node:
“Given an undirected graph $G$ and a number $k$, can we delete $k$ nodes from $G$ so that no triangles remain in $G$ (i.e., $G$ becomes triangle-free)?”.
In turn, we show the more important result that the decision version of the Vertex Cover problem (defined below) is polynomial-time reducible to Node-Triangle-Free. This result establishes the NP-completeness of $k$-triangle-breaking-node.
“Given a graph $G$ and an integer $k$, is there a vertex cover of size $k$?”.
Reduction: Let $(G, k)$ be an instance of the vertex cover problem. For each edge $(u, v)$ of $G$, we add to $G$ a new node $t_{uv}$ and connect $t_{uv}$ to both $u$ and $v$. Let $G'$ be the new graph. We reduce $(G, k)$ to the instance $(G', k)$ of Node-Triangle-Free. Clearly, if we have a vertex cover of size $k$ in $G$, then we can delete the same set of nodes in $G'$ to obtain a triangle-free graph, since every edge of $G$ loses at least one endpoint. In the reverse direction, we can assume without loss of generality that a new node $t_{uv}$ is never removed, because we can always remove $u$ or $v$ instead and break an equal or greater number of triangles. Thus a subset of size $k$ whose removal makes $G'$ triangle-free must induce a vertex cover of size $k$ in $G$. This completes the reduction.
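The reduction can be checked by brute force on a small graph. The sketch below uses hypothetical function names and labels each added node with the tuple `("t", u, v)`, which is an illustrative choice; it verifies that a set of original nodes is a vertex cover exactly when deleting it leaves the gadget graph triangle-free.

```python
from itertools import combinations


def add_triangle_gadgets(edges):
    """For each edge (u, v), add a fresh node adjacent to both u and v."""
    new_edges = list(edges)
    for u, v in edges:
        t = ("t", u, v)  # fresh node, distinct from all original nodes
        new_edges += [(t, u), (t, v)]
    return new_edges


def is_triangle_free(edges, removed=()):
    """True if no triangle survives after deleting the nodes in `removed`."""
    removed = set(removed)
    adj = {}
    for u, v in edges:
        if u in removed or v in removed:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # A triangle exists iff some edge has endpoints with a common neighbor.
    return all(not (adj[u] & adj[v]) for u in adj for v in adj[u])


def is_vertex_cover(edges, cover):
    cover = set(cover)
    return all(u in cover or v in cover for u, v in edges)


def reduction_agrees(edges):
    """Check the equivalence for every subset of original nodes."""
    nodes = sorted({x for e in edges for x in e})
    gadget = add_triangle_gadgets(edges)
    return all(
        is_vertex_cover(edges, s) == is_triangle_free(gadget, s)
        for r in range(len(nodes) + 1)
        for s in combinations(nodes, r)
    )


graph = [(1, 2), (2, 3), (1, 3), (3, 4)]
print(reduction_agrees(graph))  # True
```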
Theorem 1
The problems $k$-triangle-breaking-node and min-triangle-breaking-node are NP-complete.
Using a similar argument, the following problem is polynomial-time reducible to both $k$-triangle-breaking-edge and min-triangle-breaking-edge:
“Can we delete $k$ edges from a graph $G$ so that no triangles remain in $G$ (i.e., to make the graph triangle-free)?”.
The above problem is known to be NP-complete according to [43]. Hence, we obtain the following result.
Theorem 2
The problems $k$-triangle-breaking-edge and min-triangle-breaking-edge are NP-complete.
IV-B Approximability
Since the min-triangle-breaking-node and min-triangle-breaking-edge problems are special cases of the Partial Set Cover problem with bounded frequencies [40], the primal-dual algorithm in [40] provides a 3-approximation algorithm for both problems. Instead of operating on sets, the primal-dual algorithm works on the elements of the universe set. It assigns to each element a dual covering cost that signifies the selection of a set to cover that element. The basic operation of the algorithm is to increase the dual covering costs of all uncovered elements simultaneously until the total cost of the uncovered elements in some set equals 1 (the cost of choosing that set). The corresponding set is then added to the solution, and the algorithm continues until the covering requirement is satisfied. To achieve the 3-approximation factor, the algorithm assumes that we know a set in the optimal solution (simply by trying all the possible sets) and applies the primal-dual selection to the rest. Therefore, we obtain the following result.
Theorem 3
There exist 3-approximation algorithms for min-triangle-breaking-node and min-triangle-breaking-edge.
The $k$-triangle-breaking-node and $k$-triangle-breaking-edge problems are special cases of Max-$k$-Coverage, and the pipage-rounding method in [41] results in an approximation algorithm with ratio $19/27$.
Pipage rounding is a general method that provides worst-case approximation guarantees for a large class of discrete optimization problems with assignment-type constraints, including Max-$k$-Coverage. It first reformulates the problem as a non-linear program which has an integral optimum and whose objective is at least as large as that of the starting problem at any feasible solution. It then finds an integral solution of the non-linear program in two phases: 1) solving the non-integral relaxation of the problem and 2) transforming the non-integral solution into an integral one by pipage rounding. The relaxation is polynomially solvable, and the second phase rounds its solution in such a manner that the objective value can only increase while the solution moves closer to integrality. As shown in [41], each rounding step in pipage rounding brings one element of the current solution to an integral value. The approximation factor follows directly from the properties of the non-linear program and the rounding procedure. Therefore, we obtain the following result.
Theorem 4
There exist 19/27-approximation algorithms for $k$-triangle-breaking-node and $k$-triangle-breaking-edge.
Remarks. Both the primal-dual method in [40] and the pipage-rounding algorithm in [41] have high time complexity and do not scale to large networks. As a result, efficient algorithms that can be applied to large-scale data are desirable. In the next sections, we propose efficient discounting algorithms for the studied problems on very large-scale networks with only a slightly looser approximation ratio.
V Algorithms for $k$-triangle-breaking-node
In this section, we first present a naive Greedy Algorithm (Alg. 1) to solve the $k$-triangle-breaking-node problem. We show that the greedy strategy returns a $(1-1/e)$-approximate solution but has prohibitively high time complexity. Thus, in the subsequent subsection, we propose the $k$-triangle-breaking-node Discounting Algorithm (DAK-n, Alg. 2), which achieves the same solution quality but is substantially faster. The core efficiency of DAK-n is that it employs a smart updating technique to keep track of the number of effective triangles associated with each of the remaining nodes.
V-A Naive Greedy Algorithm
The first algorithm (Alg. 1) selects at each step the node that breaks the largest number of remaining triangles and adds it to the solution. The algorithm continues until $k$ nodes have been selected into the returned solution.
Since $k$-triangle-breaking-node is a special case of Max-$k$-Coverage, the naive greedy algorithm provides a performance guarantee of $(1-1/e)$ for $k$-triangle-breaking-node. Another way of proving this is to show that the main objective function (the number of broken triangles) is monotone and submodular, which in turn admits a nearly optimal greedy approximation algorithm [18].
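The greedy rule can be sketched as follows. This is a minimal illustration rather than the paper's Alg. 1: the function name `naive_greedy_nodes` is hypothetical, ties are broken toward the smallest node label, and every marginal gain is recomputed from scratch in each round, which is exactly what makes the approach slow.

```python
def naive_greedy_nodes(edges, k):
    """Pick k nodes, each time taking the node on the most surviving triangles."""
    adj = {}
    for u, v in edges:
        if u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    triangles = {
        tuple(sorted((u, v, w)))
        for u in adj for v in adj[u] for w in adj[u] & adj[v]
    }
    remaining = set(triangles)
    nodes = sorted(adj)
    chosen = []
    for _ in range(min(k, len(nodes))):
        candidates = [v for v in nodes if v not in chosen]
        # Marginal gain of v = number of still-unbroken triangles containing v.
        best = max(candidates, key=lambda v: sum(v in t for t in remaining))
        chosen.append(best)
        remaining = {t for t in remaining if best not in t}
    return chosen, len(triangles) - len(remaining)


bowtie = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5)]
print(naive_greedy_nodes(bowtie, 1))  # ([3], 2)
```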
The complexity of Alg. 1 grows with the number $k$ of nodes selected into the solution. In recent work [18], the time complexity of Alg. 1 is reduced using the fast triangle computation method in [34]. For large values of $k$, however, the time complexity of the algorithm in [18] remains very expensive and does not scale to large practical data. To this end, we present in the next section our scalable Discounting Algorithm for $k$-triangle-breaking-node, which is substantially faster than the algorithm in [18].
V-B Discounting Algorithm for $k$-triangle-breaking-node
Our Discounting Algorithm for $k$-triangle-breaking-node (DAK-n, Alg. 2) significantly speeds up the simple greedy algorithm. For small values of $k$, this algorithm requires as much time as the best algorithm for counting the number of triangles.
In principle, DAK-n employs an adaptive strategy for computing the marginal gains (the number of broken triangles) when nodes are removed one after another. At each round, the node that breaks the largest number of triangles is selected into the solution. The selected node is then excluded from the structure, and the procedure repeats on the remaining nodes, efficiently recomputing the new marginal gain of each affected node.
We structure DAK-n into two phases. The first phase (lines 1–8) extends the algorithm in [34] to compute the number of triangles that are incident with each node in the graph. This algorithm was proved to be time-optimal, in $\Theta(m^{3/2})$, for triangle listing, and has been shown to be very efficient in practice. The second phase starts at line 9, where it creates a Max-priority queue to rank nodes according to their triangle counts. DAK-n then (lines 9–18) repeats the vertex selection for $k$ rounds. In each round, we select the node with the highest triangle count (from the top of the priority queue) into the solution. The algorithm then removes the selected node from the graph and performs the necessary updates on the triangle counts of the affected nodes. The algorithm subsequently updates the positions of the affected nodes in the queue according to their new counts. The key efficiency of the DAK-n algorithm lies in its update procedure for the triangle counts. Specifically, the total update of all counts after removing the selected node can be done in linear time, as indicated in lines 15–18. The linear-time update is made possible by the information on the number of triangles involving each node. This significantly reduces the cost of computing the marginal gains and speeds up the node selection process.
Complexity: The first phase takes $O(m^{3/2})$ time as in [34]. The second phase takes linear time in each round, plus the cost of creating and maintaining the Max-priority queue. In each round, the algorithm checks all the neighbors of the selected node and, for each neighbor, examines all of that neighbor’s neighbors. Thus, the cost of checking in a round depends on the degrees of the selected node and of its neighbors. Each update (lines 17–18) takes constant time, since each affected count decreases by 1 and the corresponding node moves at most one level in the queue. For small $k$, the algorithm has an effective time complexity of $O(m^{3/2})$, which is the same as that of the triangle counting procedure.
Approximation guarantees: DAK-n follows the original greedy method, as it selects the node with the highest marginal gain at each step. Hence, DAK-n retains the approximation guarantee of the greedy method for Max-$k$-Coverage. The following theorem summarizes our suggested approach.
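The discounting idea can be illustrated with the following Python sketch. It is not the paper's Alg. 2: the name `dak_n_sketch` is hypothetical, the first phase counts triangles by plain neighborhood intersection rather than the method of [34], and a binary heap with lazy deletion of stale entries stands in for the in-place priority-queue update. Node labels are assumed to be mutually comparable (e.g., integers) so that heap ties can be resolved.

```python
import heapq


def dak_n_sketch(edges, k):
    """Greedy node selection with incremental ("discounted") triangle counts."""
    adj = {}
    for u, v in edges:
        if u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)

    # Phase 1: number of triangles on each node. Each triangle (u, v, w)
    # is seen twice from u (once via v, once via w), hence the halving.
    count = {u: sum(len(adj[u] & adj[v]) for v in adj[u]) // 2 for u in adj}

    # Phase 2: max-heap on the counts (negated for heapq's min-heap).
    heap = [(-c, u) for u, c in count.items()]
    heapq.heapify(heap)
    chosen, broken, removed = [], 0, set()
    while heap and len(chosen) < k:
        neg, u = heapq.heappop(heap)
        if u in removed or -neg != count[u]:
            continue  # stale entry left behind by an earlier discount
        chosen.append(u)
        removed.add(u)
        broken += count[u]
        # Discount: every surviving triangle (u, v, w) disappears, so each
        # neighbor v loses one triangle per common neighbor w of u and v.
        for v in adj[u]:
            lost = len(adj[u] & adj[v])
            if lost:
                count[v] -= lost
                heapq.heappush(heap, (-count[v], v))
        for v in adj.pop(u):
            adj[v].discard(u)
    return chosen, broken


k4 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
print(dak_n_sketch(k4, 2))  # ([1, 2], 4)
```

Because counts only decrease, an outdated heap entry always carries a larger count than the current one and is skipped when popped, so the node chosen in each round is still one with the highest current marginal gain.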
Theorem 5
The DAK-n algorithm is a $(1-1/e)$-approximation algorithm for $k$-triangle-breaking-node with complexity .
Note that the naive Greedy (Alg. 1) and Discounting (Alg. 2) algorithms can be easily adapted for min-triangle-breaking-node by continuing to select nodes until the required number of triangles has been broken. This is due to the fact that min-triangle-breaking-node is a special case of the Partial Set Cover problem and the greedy strategy guarantees an approximation solution, where denotes the harmonic function . Thus, Algs. 1 and 2 are ()-approximation algorithms for min-triangle-breaking-node.
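The adaptation to the min variant only changes the stopping rule. A minimal sketch, assuming the triangles are already listed and using the hypothetical name `greedy_min_nodes` and a hypothetical threshold parameter `p`:

```python
def greedy_min_nodes(triangles, p):
    """Greedily remove nodes until at least p triangles are broken.

    Returns the chosen nodes, or None if the graph has fewer than p triangles.
    """
    remaining = set(triangles)
    chosen, broken = [], 0
    while broken < p and remaining:
        # Count the surviving triangles on each node.
        counts = {}
        for t in remaining:
            for v in t:
                counts[v] = counts.get(v, 0) + 1
        # Highest count wins; ties go to the smallest node label.
        best = max(sorted(counts), key=counts.get)
        hit = {t for t in remaining if best in t}
        chosen.append(best)
        broken += len(hit)
        remaining -= hit
    return chosen if broken >= p else None


chain = [(1, 2, 3), (3, 4, 5), (5, 6, 7)]
print(greedy_min_nodes(chain, 2))  # [3]
print(greedy_min_nodes(chain, 3))  # [3, 5]
```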
V-C Analysis in Networks with Power-law Degree Distribution
As discussed above, DAK-n’s time complexity is for a general network; however, many complex systems of interest, such as the Internet, social networks, and biological networks, commonly exhibit power-law degree distributions [44, 45]. Conceptually, in power-law degree distributed networks the fraction of nodes with degree ( connections to other nodes) is , where is the normalization factor as in the model [46]. Practical networks usually have . In this work, we deduce the maximum degree in a network to because for , the number of edges will be less than 1. We show that in power-law degree distributed networks, the overall time complexity is $O(m^{3/2})$, which implies that DAK-n is as fast as the state-of-the-art algorithms for counting/listing triangles with no additional cost (Theorem 6). This also demonstrates the scalability of DAK-n on large networks.
Theorem 6
The complexity of DAK-n algorithm is on power-law degree distributed networks. This implies DAK-n is as fast as the best available triangle counting/listing algorithms.
**Proof:** In a power-law degree distributed network, the numbers of vertices and edges are computed as follows,
[TABLE]
[TABLE]
where is the Riemann Zeta function [46, 47], which converges absolutely for and diverges for all . For the sake of simplicity, we use real numbers instead of rounding down to integers. The error terms can be easily bounded and are negligible in our proof.
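As a numerical sanity check of this kind of counting argument, the sketch below compares direct sums with their zeta-function approximations. The notation is an assumption, since the symbols of the original equations are not reproduced here: it supposes that the number of nodes with degree d is e^alpha / d^gamma, that the maximum degree is e^(alpha/gamma), and that 2 < gamma < 3. The helper names are hypothetical.

```python
import math

def power_law_counts(alpha, gamma):
    """Node and edge counts when e^alpha / d^gamma nodes have degree d (assumed model)."""
    # Beyond this degree the model predicts fewer than one node per degree value.
    d_max = int(math.exp(alpha / gamma))
    scale = math.exp(alpha)
    n = sum(scale / d**gamma for d in range(1, d_max + 1))
    # The degrees sum to twice the number of edges, hence the factor 1/2.
    m = 0.5 * sum(d * scale / d**gamma for d in range(1, d_max + 1))
    return n, m, d_max

def zeta(s, terms=200_000):
    """Truncated zeta series plus an integral estimate of the tail (s > 1 only)."""
    head = sum(1.0 / i**s for i in range(1, terms + 1))
    tail = terms ** (1 - s) / (s - 1)
    return head + tail

alpha, gamma = 12.0, 2.5  # illustrative values, not taken from the paper
n, m, d_max = power_law_counts(alpha, gamma)
n_approx = zeta(gamma) * math.exp(alpha)            # approximation for the node count
m_approx = 0.5 * zeta(gamma - 1) * math.exp(alpha)  # approximation for the edge count
print(d_max, n / n_approx, m / m_approx)
```

Under these assumptions both ratios are below 1 because the direct sums are truncated at the maximum degree, and the gap narrows as alpha grows.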
Since Phase 1 of Alg. 2 is for counting triangles, we will analyze phase 2 in Alg. 2 and show its complexity . To this end, we first find the workload at each round in phase 2, sum them all up and utilize the power-law property to obtain the final result. In particular,
[TABLE]
The worst case of the second phase happens when which means that the algorithm has to select all nodes in decreasing order of triangle-breaking gains into the solution set . That leads to the overall complexity of,
[TABLE]
We apply the power-law property on the number of nodes with degree being and the maximum degree is on the above equation which yields
[TABLE]
We consider two cases:
Case 1: . This implies . Eq. 20 becomes,
[TABLE]
Combining the above equation with the number of edges in power-law degree networks in Eq. 18, we obtain,
[TABLE]
where c1 is a constant that satisfies,
[TABLE]
Note that implies that converges, so c1 is a finite constant.
Thus, in this case, phase 2 has time complexity of .
Case 2: . In this case, Eq. 20 is equivalent to,
[TABLE]
where
[TABLE]
is a finite constant since . This yields the time complexity for Phase 2. Finally, we conclude that the overall time complexity of in both cases.
VI Algorithm for $k$-triangle-breaking-edge
Like $k$-triangle-breaking-node and min-triangle-breaking-node, the edge variants exhibit the same properties, so the greedy algorithm can be applied directly with a near-optimal guarantee. We present DAK-e for finding triangle-breaking edges in Alg. 3. On general networks, DAK-e is faster than its node version, DAK-n, because it has a complexity of .
Unlike DAK-n, DAK-e maintains for each edge the number of triangles incident on that edge and updates this measure efficiently when removing edges from . After removing an edge , we need to consider only updates to discount the triangles incident on from the corresponding edges. Thus, the overall complexity in each iteration depends on finding the edge that breaks the maximum number of triangles. As in the node version, the same approximation guarantee holds for the edge-removal problem, as summarized below.
Theorem 7
DAK-e is a $(1-1/e)$-approximation algorithm for $k$-triangle-breaking-edge with complexity .
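The per-edge bookkeeping can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' Alg. 3: the function name is hypothetical, node labels are assumed to be comparable, and the best edge is found by a linear scan instead of a priority queue.

```python
def greedy_triangle_breaking_edges(adj, k):
    """Pick up to k edges, each time the one lying on the most intact triangles.

    adj: dict mapping node -> set of neighbours (undirected, no self-loops).
    Returns (selected_edges, total_triangles_broken).
    """
    adj = {u: set(nbrs) for u, nbrs in adj.items()}

    # Triangles on edge (u, v) = number of common neighbours of u and v.
    tri = {}
    for u in adj:
        for v in adj[u]:
            if u < v:
                tri[(u, v)] = len(adj[u] & adj[v])

    selected, broken = [], 0
    for _ in range(k):
        if not tri:
            break
        u, v = max(tri, key=tri.get)
        if tri[(u, v)] == 0:
            break  # no intact triangle is left
        selected.append((u, v))
        broken += tri[(u, v)]
        # Each common neighbour w closes a triangle (u, v, w); removing (u, v)
        # breaks it, so the two other edges of that triangle are discounted.
        for w in adj[u] & adj[v]:
            tri[tuple(sorted((u, w)))] -= 1
            tri[tuple(sorted((v, w)))] -= 1
        adj[u].discard(v)
        adj[v].discard(u)
        del tri[(u, v)]
    return selected, broken
```

Only the edges that share a triangle with the removed edge are touched in each round, which is the locality that the discounting step relies on.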
On power-law degree distributed networks, by similar arguments to DAK-n, we can show that the overall complexity of DAK-e is which is also equal to that of counting/listing triangles in the networks.
Theorem 8
On power-law degree distributed networks, the complexity of DAK-e algorithm is .
Alg. 3 can be easily adapted to solve min-triangle-breaking-edge, returning a ()-approximate edge set, since min-triangle-breaking-edge is also a special case of the Partial Set Cover problem.
Input-dependent approximation guarantees
The $(1-1/e)$-approximation factor achieved by our algorithms, termed the fixed worst-case bound, provides a general lower bound on the solution quality of the selected set . This factor is known in advance, before the methods are executed. Nevertheless, we can often derive a better approximation bound on the solution quality, namely the input-dependent bound, which depends on the problem instance and even on the particular run of the algorithms. Inspired by the work in [17] on the Influence Maximization problem, we apply a similar bounding technique (named online-bound) to obtain an input-dependent bound on the solution quality of both the naive greedy algorithm and our DAK-n and DAK-e algorithms. The input-dependent bound for DAK-n is stated as follows,
Theorem 9 (DAK-n input-dependent bound)
For a set of selected nodes and each node , let be the marginal gain of when is included in . Let be the sequence of the remaining nodes (not in ) sorted in decreasing order of , then
[TABLE]
where is the triangles broken by the optimal solution with nodes.
By taking the top nodes with the largest marginal triangle-breaking gains with respect to the returned solution of DAK-n, we obtain an upper bound on the optimal solution. Dividing the number of triangles broken by by that upper bound then gives an input-dependent guarantee on ,
[TABLE]
Similarly, the input-dependent bound for the solution of DAK-e is computed by the following equation,
[TABLE]
where are the top edges with the highest marginal gains in broken triangles with respect to and is the number of triangles broken by the optimal edge set with edges.
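The computation described above can be sketched in a few lines of Python. The form of the upper bound used here, namely the triangles already broken plus the sum of the k largest remaining marginal gains, is an assumption based on the online-bound technique of [17]; the function and parameter names are hypothetical.

```python
def input_dependent_ratio(broken_by_solution, remaining_gains, k):
    """Lower bound on (triangles broken by the solution) / (optimum with budget k).

    broken_by_solution: triangles broken by the returned set.
    remaining_gains: marginal gains of the elements not in the returned set,
                     measured with respect to that set.
    """
    top_k = sorted(remaining_gains, reverse=True)[:k]
    # Assumed online bound: optimum <= broken_by_solution + sum of top-k gains.
    upper_bound = broken_by_solution + sum(top_k)
    if upper_bound == 0:
        return 1.0  # the graph has no triangles, so any solution is optimal
    return broken_by_solution / upper_bound

# Hypothetical run: 95 triangles broken, remaining gains 3, 1, 1, 0, budget 2.
ratio = input_dependent_ratio(95, [3, 1, 1, 0], 2)
print(round(ratio, 4))
```

When every remaining gain is zero the ratio equals 1, which corresponds to the case where all triangles have already been broken.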
VII Experimental Evaluation
In this section, we evaluate the quality and performance of our proposed methods, DAK-n and DAK-e. Empirical results show two important features of our approaches, performance and scalability, which are desirable for any practical technique. We compare our methods with the state-of-the-art method, GreedyAll [18], and with approaches based on centrality measures and randomization, i.e., Max-degree, Pagerank and Random. ([18] also proposed another algorithm, Approx, which uses an FM-sketch to approximate the triangle-breaking gain; however, this approximation algorithm has the same time complexity as GreedyAll.) On $k$-triangle-breaking-node and $k$-triangle-breaking-edge, results indicate that our methods outperform GreedyAll by up to two orders of magnitude in running time while achieving the same level of solution quality. The baseline methods based on centrality and randomization are slightly faster, but their solution quality is much worse. We also devote a good portion of the evaluation to studying networks under node and edge removal attacks using min-triangle-breaking-node and min-triangle-breaking-edge.
VII-A Experimental settings
Datasets
To make our experiments comprehensive, we select six real-world network traces from various domains, with sizes ranging from the thousand to the million scale. A summary of these networks is provided in Table III. The datasets are obtained from (*) http://snap.stanford.edu/data/index.html and (†) http://socialcomputing.asu.edu/pages/datasets.
Specifically, our dataset includes both physical networks (connected by physical links) and virtual networks (e.g., friendship, communication). In the first category, Gnutella4 is a snapshot of the Gnutella peer-to-peer file sharing network on August 4th, 2002, in which nodes represent hosts in the Gnutella network topology and edges represent connections between the hosts; Skitter is the Internet topology graph captured by tracerouting in 2005. In the second category, Flickr is a contact network crawled from the photo-sharing website Flickr, where nodes are users and edges are friendship connections between users; Google is a dataset of web pages and the hyperlinks between them, released by Google in 2002; Wiki-Talk contains Wikipedia users and their edit relationships (who edits whose talk page); and Orkut is an online social network with users as nodes and friendships as connections.
Performance and Scalability measures
(Performance) For a fair comparison between different methods, we count the number of triangles broken by the set of nodes/edges returned by the algorithms as the quality measure.
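A minimal sketch of this quality measure for the node variants is given below, together with a simple static degree ranking used as a stand-in for the Max-degree baseline. The function names are hypothetical, node labels are assumed to be comparable, and the exact form of the baseline used in the experiments is an assumption.

```python
from itertools import combinations

def count_triangles(adj, removed=frozenset()):
    """Count the triangles whose three nodes all lie outside `removed`."""
    count = 0
    for u in adj:
        if u in removed:
            continue
        # Keep only larger-labelled neighbours so each triangle is counted once.
        nbrs = [v for v in adj[u] if v not in removed and v > u]
        for v, w in combinations(nbrs, 2):
            if w in adj[v]:
                count += 1
    return count

def triangles_broken(adj, removed_nodes):
    """Quality measure: triangles destroyed by deleting `removed_nodes`."""
    removed = frozenset(removed_nodes)
    return count_triangles(adj) - count_triangles(adj, removed)

def max_degree_baseline(adj, k):
    """The k nodes of highest degree (static ranking, an assumed baseline)."""
    return sorted(adj, key=lambda u: len(adj[u]), reverse=True)[:k]
```

On a graph made of a star plus a separate triangle, the highest-degree node is the star centre, which lies on no triangle; this illustrates why degree alone can be a weak proxy for triangle-breaking gain.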
(Scalability) To assess scalability, we record the running time consumed by each algorithm. For the min-triangle-breaking-node and min-triangle-breaking-edge problems, we only measure the running time of DAK-n and DAK-e. The input-dependent bound of our algorithms is also illustrated in the last experiment.
Implementation and Testing Environment
We implemented our algorithms DAK-n and DAK-e in C++ (C++11), compiled with GCC 4.8. We also implemented the GreedyAll [18] algorithm, closely following the provided description and pseudo-code. All experiments are run in a Linux environment with a 2.2 GHz 8-core Xeon processor and 100 GB of RAM. In each execution, only a single core is assigned to each method.
VII-B Performance Evaluation
The performance, i.e., solution quality, measured by the number of triangles broken by the node or edge sets returned by the algorithms, is illustrated in Figs. 1 and 2 for the node and edge variants, respectively. As these figures show, DAK-n, DAK-e and GreedyAll consistently have the best performance on all the social traces. Pagerank and Max-degree achieve very good solution quality on certain datasets, e.g., Google and Wiki-Talk, but fall far behind DAK-n, DAK-e and GreedyAll on the other tests. The quality of the Random strategy, as expected, falls below that of the other methods and is inconsistent. In summary, empirical results on multiple real-world datasets confirm the performance of our suggested algorithms.
Figs. 1 and 2 also display the typical trend of monotone and submodular functions as they exhibit the diminishing return property. For the first few selections, the marginal gain (in terms of the number of broken triangles) is significant yet the later rounds provide smaller marginal gain, and the gain tends to saturate quickly.
VII-C Scalability Evaluation
Figs. 3 and 4 report the running time (in seconds) of the tested algorithms. These figures display three groups of methods with different magnitudes: (1) GreedyAll, with the highest time consumption (up to 100 times higher than the second group); (2) DAK-n, DAK-e, Pagerank and Max-degree; and (3) the Random method, which returns random nodes/edges almost instantly. Our suggested algorithms DAK-n and DAK-e require a comparable amount of time to Pagerank and Max-degree, which are two canonical centrality measures and very fast to compute. Better yet, DAK-n and DAK-e produce much better solution quality than Pagerank and Max-degree while being comparable in terms of scalability.
These extensive experiments illustrate that our proposed DAK-n and DAK-e algorithms are highly competitive with the current best method, GreedyAll, in solution quality while being much more scalable. As shown in the previous experiments, only GreedyAll reaches the same level of solution quality as DAK-n and DAK-e; however, our running-time results show that GreedyAll is up to 20 times slower than DAK-n on the node removal problem and up to 100 times slower than DAK-e on the edge removal variants.
VII-D Input-dependent bound testing
Finally, we perform experiments on the input-dependent bounding technique embedded in the DAK-n and DAK-e algorithms. Theoretically, the solutions returned by DAK-n and DAK-e are guaranteed to be within a factor of at least $(1-1/e)$ of the optimum on any problem instance. In practice, we can obtain a better guarantee depending on the problem instance and the execution itself. Our input-dependent bounding strategy is one way of finding such instance- and execution-dependent guarantees.
Table IV presents the input-dependent bounds provided by our proposed DAK-n algorithm for the node removal problem. The table shows that the input-dependent bounds are substantially better than the theoretical guarantee of $(1-1/e)$. For example, with on Wiki-Talk, DAK-n guarantees a solution within 95% of the optimum. For the Gnutella network, with , DAK-n is guaranteed to find the optimal solution, implying that all the triangles have been disrupted. One can also observe that the bound gets tighter as $k$ increases. This follows from the nature of our bounding technique: a larger $k$ means that more triangles are broken, so the gains of the next nodes become smaller and the approximation ratio approaches 1.
VIII Conclusion
In this paper, we study the problems of finding critical nodes and links whose failures will severely damage most triangles in the network, changing the network’s organization and (possibly) leading to the unpredictable dissolution of the network. We formulate this vulnerability analysis as optimization problems and provide proofs of their NP-completeness. We propose two algorithms, DAK-n and DAK-e, with notable performance and scalability. Both DAK-n and DAK-e obtain the best approximation guarantees: 19/27-approximation for $k$-triangle-breaking-node and $k$-triangle-breaking-edge as well as 3-approximation for min-triangle-breaking-node and min-triangle-breaking-edge, and are scalable to networks with millions of nodes and edges. These features make our approaches well suited to the analysis of various large-scale real-world problems. In the future, we aim to bridge the gap between theory and practice by designing scalable approximation algorithms with the best possible approximation ratios.
- [1] Petter Holme, Beom Jun Kim, Chang No Yoon, and Seung Kee Han. Attack vulnerability of complex networks. Phys. Rev. E, 65:056109, May 2002.
- [2] D. J. Watts and S. H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):409–10, 1998.
- [3] Jure Leskovec, Lada A. Adamic, and Bernardo A. Huberman. The dynamics of viral marketing. ACM Trans. Web, 1(1), May 2007.
- [4] Damon Centola. The spread of behavior in an online social network experiment. Science, 329(5996):1194–1197, 2010.
- [5] Kieron J Barclay, Christofer Edling, and Jens Rydgren. Peer clustering of exercise and eating behaviours among young adults in Sweden: a cross-sectional study of egocentric network data. BMC Public Health, 13(1):784, 2013.
- [6] Linyuan Lü, Duan-Bing Chen, and Tao Zhou. The small world yields the most effective information spreading. New Journal of Physics, 13(12):123005, 2011.
- [7] Nishant Malik and Peter J Mucha. Role of social environment and social clustering in spread of opinions in coevolving networks. Chaos: An Interdisciplinary Journal of Nonlinear Science, 23(4):043123, 2013.
- [8] Damon Centola. An experimental study of homophily in the adoption of health behavior. Science, 334(6060):1269–1272, 2011.
