Identifying vital nodes based on reverse greedy method
Tao Ren, Zhe Li, Yi Qi, Yixin Zhang, Simiao Liu, Yanjie Xu, and Tao Zhou

TL;DR
This paper introduces a reverse greedy method for identifying the nodes most vital to network connectivity. It iteratively adds the least important nodes to an initially empty network, so that nodes added later rank as more important, and it outperforms existing methods on ten real networks.
Contribution
The paper proposes a novel reverse greedy approach for vital node identification that demonstrates superior performance over existing methods.
Findings
Reverse greedy method outperforms state-of-the-art techniques
Method effectively identifies nodes critical for network connectivity
Empirical results on ten real networks validate the approach
Abstract
The identification of vital nodes that maintain the network connectivity is a long-standing challenge in network science. In this paper, we propose a so-called reverse greedy method where the least important nodes are preferentially chosen to make the size of the largest component in the corresponding induced subgraph as small as possible. Accordingly, the nodes being chosen later are more important in maintaining the connectivity. Empirical analyses on ten real networks show that the reverse greedy method performs remarkably better than well-known state-of-the-art methods.
Table 1. Topological features of the ten real networks.
| Networks | Nodes | Edges | Average degree | Clustering coefficient | Assortative coefficient | Degree heterogeneity |
|---|---|---|---|---|---|---|
| Jazz | 198 | 2742 | 27.6970 | 0.6334 | 0.0202 | 1.3951 |
| NS | 379 | 914 | 4.8232 | 0.7981 | -0.0817 | 1.6630 |
| Email | 1133 | 5451 | 9.6222 | 0.2540 | 0.0782 | 1.9421 |
| PB | 1222 | 16714 | 27.3552 | 0.3600 | -0.2213 | 2.9707 |
| Sex | 15810 | 38540 | 4.8754 | 0 | -0.1145 | 5.8276 |
| Facebook | 63731 | 817090 | 25.6418 | 0.2532 | 0.1769 | 3.4331 |
| USAir | 332 | 2126 | 12.8072 | 0.7494 | -0.2079 | 3.4639 |
| Power | 4941 | 6594 | 2.6691 | 0.1065 | 0.0035 | 1.4504 |
| Router | 5022 | 6258 | 2.4922 | 0.0329 | -0.1384 | 5.5031 |
| HepPh | 34546 | 420877 | 24.3662 | 0.2962 | -0.0063 | 2.6055 |
Table 2. Robustness of the ten real networks under node removal by each algorithm (smaller is better).
| Networks | Random | BC | CC | DC | H-index | KS | PR | CI | RG |
|---|---|---|---|---|---|---|---|---|---|
| Jazz | 0.4808 | 0.3956 | 0.4199 | 0.4409 | 0.4497 | 0.4571 | 0.4262 | 0.3913 | 0.3477 |
| NS | 0.2752 | 0.0488 | 0.1336 | 0.0540 | 0.1155 | 0.1582 | 0.0524 | 0.0551 | 0.0252 |
| Email | 0.4442 | 0.2578 | 0.2893 | 0.2519 | 0.2836 | 0.2937 | 0.2395 | 0.2231 | 0.1844 |
| PB | 0.4615 | 0.2192 | 0.2908 | 0.2286 | 0.2578 | 0.2611 | 0.2155 | 0.1968 | 0.1740 |
| Sex | 0.3842 | 0.0841 | 0.2208 | 0.0725 | 0.0981 | 0.1142 | 0.0690 | 0.0604 | 0.0513 |
| Facebook | 0.4545 | 0.2935 | 0.3570 | 0.3137 | 0.3328 | 0.3389 | 0.2893 | 0.2671 | 0.2372 |
| USAir | 0.4321 | 0.1129 | 0.1442 | 0.1228 | 0.1498 | 0.1588 | 0.1072 | 0.1105 | 0.0942 |
| Power | 0.2069 | 0.0656 | 0.1973 | 0.0634 | 0.1090 | 0.2628 | 0.0594 | 0.0489 | 0.0088 |
| Router | 0.3044 | 0.0142 | 0.0686 | 0.0121 | 0.0136 | 0.0276 | 0.0136 | 0.0140 | 0.0063 |
| HepPh | 0.4765 | 0.3504 | 0.4259 | 0.3664 | 0.3931 | 0.4022 | 0.3371 | 0.3015 | 0.2657 |
Tao Ren
Software College, Northeastern University of China, Shenyang, 110819, P. R. China
Zhe Li
Software College, Northeastern University of China, Shenyang, 110819, P. R. China
Yi Qi
Software College, Northeastern University of China, Shenyang, 110819, P. R. China
Yixin Zhang
Software College, Northeastern University of China, Shenyang, 110819, P. R. China
Simiao Liu
Software College, Northeastern University of China, Shenyang, 110819, P. R. China
Yanjie Xu
Software College, Northeastern University of China, Shenyang, 110819, P. R. China
Tao Zhou
CompleX Lab, University of Electronic Science and Technology of China, Chengdu, 611731, P. R. China
Introduction
Network science plays an increasingly significant role in many domains, including physics, sociology, engineering, biology, and management [1]. Because of the heterogeneous nature of real networks [2], the overall connectivity of a complex network may depend on a small set of nodes, usually called hub nodes. Taking the Internet as an example, a deliberate attack on several vital nodes may lead to the collapse of the whole network [3]. Therefore, an efficient algorithm to identify vital nodes that have critical impacts on network connectivity can help to prevent catastrophic outages in power grids or the Internet [3, 4, 5, 6], maintain the connectivity of communication networks or design efficient attacking strategies against them [7], improve urban transportation capacity at low cost [8], enhance the robustness of financial networks [9], and so on.
So far, the majority of known methods for identifying nodes vital to network connectivity make use of structural information only [10]. Typical representatives include degree centrality [11] (DC), H-index [12], the k-shell decomposition method [13] (KS), PageRank [14] (PR), LeaderRank [15], closeness centrality [16] (CC), and betweenness centrality [17] (BC). For DC, nodes with larger degrees are more vital. For H-index, nodes connected to many large-degree neighbors are more important. KS assigns a k-shell index to each node based on its topological location: nodes closer to the core of the network get higher k-shell indices, nodes in the periphery get lower ones, and nodes with higher k-shell indices are considered more vital. PR suggests that the importance of a node is determined by the influences of its neighbors. CC claims that a node that is on average closer to other nodes is more vital, while BC assumes that a node lying on many shortest paths is of high importance. Recently, Morone and Makse [18] proposed a novel index called collective influence (CI), which is based on site percolation theory and can find the minimal set of nodes that are crucial for the global connectivity. CI performs remarkably better than many previous methods in identifying the nodes’ importance for network connectivity [18, 19].
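As a concrete illustration of the k-shell decomposition described above, the following minimal Python sketch repeatedly peels off the nodes of lowest remaining degree. The function name `k_shell` and the adjacency-dictionary input format are assumptions made for illustration; they are not taken from the paper.

```python
def k_shell(adj):
    """Return {node: k-shell index} for an undirected simple graph.

    adj maps each node to the set of its neighbors (hypothetical input format).
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    shell = {}
    while remaining:
        # The next shell index is the smallest degree still present.
        k = min(degree[v] for v in remaining)
        stack = [v for v in remaining if degree[v] <= k]
        while stack:
            v = stack.pop()
            if v not in remaining:
                continue
            remaining.remove(v)
            shell[v] = k
            # Removing v lowers its neighbors' degrees; any neighbor that
            # drops to k or below belongs to the same shell.
            for u in adj[v]:
                if u in remaining:
                    degree[u] -= 1
                    if degree[u] <= k:
                        stack.append(u)
    return shell
```

For a triangle with one pendant node attached, the pendant node falls in shell 1 and the three triangle nodes in shell 2.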
This paper proposes a novel method named the reverse greedy (RG) method. The word "reverse" refers to the process of adding nodes one by one to an empty network, which is the inverse of the usual process of removing nodes from the original network. The word "greedy" emphasizes that each added node is chosen to minimize the size of the largest component. Empirical analyses on ten real networks show that RG performs remarkably better than well-known state-of-the-art methods.
Results
Algorithms
The core of the RG algorithm is the reverse process, which adds nodes one by one to an empty network while minimizing the cost function until all nodes in the considered network are added. Then, nodes are ranked inversely to the order of addition; that is, the later a node is added, the more important it is in maintaining the network connectivity. Denote by $G(V,E)$ the original network under consideration, where $V$ and $E$ are the sets of nodes and edges, respectively. This paper focuses on simple networks, where the weights and directions of edges are ignored and self-loops are not allowed. The reverse process starts from an empty network $G_0(V_0,E_0)$, where $V_0=\emptyset$ and $E_0=\emptyset$. At the $(n+1)$th time step, one node from the remaining set $V-V_n$ is selected and added to the current network $G_n(V_n,E_n)$ to form a new network of $n+1$ nodes, say $G_{n+1}(V_{n+1},E_{n+1})$. Note that all progressive networks $G_n$ ($n=0,1,\dots,N$, with $N$ being the size of the original network $G$) in the process are induced subgraphs of $G$. For example, $E_n$ consists of all edges in $E$ with both ends belonging to $V_n$. According to the greedy strategy, the selected node should minimize the size of the largest component in $G_{n+1}$. If there are multiple nodes satisfying this condition, we choose among them with the help of another structural feature of the node in $G$ (e.g., degree, betweenness, and so on). Therefore, the cost function can be defined as
$$F(i) = G_{\max}(i) + \epsilon f(i),$$
where $G_{\max}(i)$ is the size of the largest component after adding node $i$ into $G_n$, $f(i)$ is a certain structural feature of node $i$ in $G$, and $\epsilon$ is a very small positive parameter that works only when the values of $G_{\max}(i)$ are indistinguishable for multiple nodes. At each time step, we add the node minimizing the cost function into the network, and if there are still multiple nodes with the minimum cost, we select one of them randomly. This process stops after $N$ time steps, namely when all nodes are added and $G_N = G$. An illustration of such a process in a small network is shown in Figure 1.
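The reverse process described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `reverse_greedy`, the edge-list input, and the union-find bookkeeping are all hypothetical choices. The tie-breaking feature is taken to be the degree in the original network, the small-parameter limit of the cost function is emulated by comparing (largest component size, degree) pairs lexicographically, and any remaining ties are broken by node order instead of randomly so that the output is reproducible.

```python
from collections import defaultdict

def reverse_greedy(nodes, edges):
    """Rank nodes from most to least vital by reversing the addition order.

    nodes: iterable of sortable node labels; edges: iterable of (u, v) pairs.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:                      # simple network: ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    degree = {v: len(adj[v]) for v in nodes}

    parent, size = {}, {}               # union-find over already-added nodes

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    remaining = set(nodes)
    added, largest = [], 0
    while remaining:
        best, best_cost = None, None
        for v in sorted(remaining):
            # Components that would be merged if v were added now.
            roots = {find(u) for u in adj[v] if u in parent}
            merged = 1 + sum(size[r] for r in roots)
            # Lexicographic pair stands in for G_max + eps * degree.
            cost = (max(largest, merged), degree[v])
            if best_cost is None or cost < best_cost:
                best, best_cost = v, cost
        roots = {find(u) for u in adj[best] if u in parent}
        parent[best] = best
        size[best] = 1 + sum(size[r] for r in roots)
        for r in roots:
            parent[r] = best
        largest = max(largest, size[best])
        remaining.remove(best)
        added.append(best)
    return added[::-1]                  # later additions are more vital
```

Each step scans every remaining node, so this sketch runs in roughly quadratic time in the number of nodes and is meant for small examples only. On a star graph, the center is added last and therefore ranked first.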
Data Description
In this paper, ten real networks from disparate fields are used to test the performance of RG, including two collaboration networks (Jazz and NS), one communication network (Email), three social networks (PB, Sex and Facebook), one transportation network (USAir), one infrastructure network (Power), one technological network (Router) and one citation network (HepPh). Jazz [20] is a collaboration network of jazz musicians. NS [21] is a co-authorship network of scientists working on network science. Email [22] describes email interchanges between users including faculty, researchers, technicians, managers, administrators, and graduate students of the Rovira i Virgili University. PB [23] is a network of US political blogs. Sex [24] is a bipartite network in which nodes are females (sex sellers) and males (sex buyers) and edges between them are established when males write posts indicating sexual encounters with females. Facebook [25] is a sample of the friendship network of Facebook users. USAir [26] is the US air transportation network. Power [27] is the power grid of the western United States. Router [28] is a symmetrized snapshot of the structure of the Internet at the level of autonomous systems. HepPh [29] is a citation network of high energy physics phenomenology. These networks’ topological features (including the number of nodes, the number of edges, the average degree, the clustering coefficient [27], the assortative coefficient [30] and the degree heterogeneity [31]) are shown in Table 1.
Empirical Results
We apply the widely used metric called robustness [32] to evaluate the algorithms’ performance. Given a network, we remove one node at each time step and calculate the size of the largest component of the remaining network until the remaining network is empty. The robustness is defined as [32]
$$R = \frac{1}{N}\sum_{Q=1}^{N}\sigma(Q),$$
where $\sigma(Q)$ is the number of nodes in the largest component divided by $N$ after removing $Q$ nodes. The normalization factor $1/N$ ensures that the values of $R$ of networks with different sizes can be compared. Obviously, a smaller $R$ means a quicker collapse and thus a better performance.
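To make the robustness metric concrete, the sketch below computes it for a given removal order: the average, over all removal steps, of the largest-component fraction. The function name `robustness` and its arguments are assumptions for illustration. Instead of recomputing components after every removal, the sketch re-inserts nodes in reverse removal order with a union-find structure, which yields the same sequence of largest-component sizes.

```python
from collections import defaultdict

def robustness(nodes, edges, removal_order):
    """Average largest-component fraction over all N removal steps.

    removal_order must list every node exactly once, most vital first.
    """
    n = len(list(nodes))
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    parent, size = {}, {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    largest, total = 0, 0.0
    # Before re-inserting the node removed at step Q, the present nodes are
    # exactly those left after Q removals, so `largest` is the size needed.
    for v in reversed(list(removal_order)):
        total += largest / n
        roots = {find(u) for u in adj[v] if u in parent}
        parent[v] = v
        size[v] = 1 + sum(size[r] for r in roots)
        for r in roots:
            parent[r] = v
        largest = max(largest, size[v])
    return total / n
```

For a five-node path with removal order 2, 0, 1, 3, 4, the largest component sizes after each removal are 2, 2, 2, 1, 0, giving a robustness of 1.4 / 5 = 0.28.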
Figure 2 shows the collapsing processes of four representative networks, resulting from node removal by RG and the other benchmark algorithms (see Methods for details about these benchmark algorithms). RG leads to much faster collapse than all other algorithms, and CI is the second best algorithm. Table 2 compares the robustness of RG and the other benchmarks. As shown in Table 2, every algorithm is better than random removal, with the single exception of KS on Power, and to our surprise, RG is the best for all ten networks. In most cases, CI is the second best algorithm. One can further observe that the advantage of RG is particularly significant for sparse networks, such as Power and Router.
Discussion
To our knowledge, most previous methods directly identify the critical nodes by looking at the effects of their removal [10]. In contrast, our method tries to find the least important nodes, so that the remaining ones are the critical nodes. To our surprise, such a simple idea results in an efficient algorithm that outperforms many well-known benchmark algorithms. Beyond the percolation process considered in this paper, the reverse method provides a novel perspective that may find successful applications in other network-based optimization problems related to rankings of nodes or edges.
Lastly, we would like to emphasize that the current version of the RG algorithm is just the simplest implementation of the above reverse idea. For example, instead of degree, $f$ can be designed in a more sophisticated way to improve the algorithm’s performance. In addition, the simple adoption of the greedy strategy may lead to local optima. This shortcoming can be overcome to some extent by introducing beam search [33], which searches for the best set of $m$ nodes to add to the network so as to optimize the cost function. The present algorithm is the special case of $m=1$. Although beam search is still a kind of greedy strategy, it usually performs much better when $m$ is sufficiently large. At the same time, beam search with large $m$ is costly in time and space. Therefore, how to find a good tradeoff is also an open challenge in real practice.
Methods
Benchmark Centralities
Degree Centrality [11] of node $i$ is defined as
$$DC(i) = k_i = \sum_{j=1}^{N} a_{ij},$$
where $A=\{a_{ij}\}$ is the adjacency matrix, that is, $a_{ij}=1$ if $i$ and $j$ are directly connected and 0 otherwise.
H-index [12] of node $i$, denoted by $H(i)$, is defined as the maximal integer $h$ such that at least $h$ neighbors of node $i$ have degrees no less than $h$. This index is an extension of the famous H-index in scientific evaluation [34] to network analysis.
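The definition above translates directly into code: sort the neighbors' degrees in decreasing order and find the largest rank whose degree is at least that rank. The following is a minimal sketch; the function name `h_index` and the adjacency-dictionary input are assumptions for illustration.

```python
def h_index(adj, node):
    """H-index of `node` in an undirected graph.

    adj maps each node to the set of its neighbors (hypothetical input format).
    """
    neighbor_degrees = sorted((len(adj[u]) for u in adj[node]), reverse=True)
    h = 0
    for rank, deg in enumerate(neighbor_degrees, start=1):
        if deg >= rank:
            h = rank        # at least `rank` neighbors have degree >= rank
        else:
            break
    return h
```

For example, a node whose three neighbors have degrees 2, 2, and 1 has an H-index of 2.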
PageRank [14] of node $i$ is defined as the solution of the equations
$$PR_i = (1-c)\sum_{j=1}^{N} a_{ij}\frac{PR_j}{k_j} + \frac{c}{N},$$
where $k_j$ is the degree of node $j$ and $c$ is a free parameter controlling the probability of a random jump. In this paper, $c$ is set to […].
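A PageRank vector of this kind is usually obtained by power iteration. The sketch below assumes the common random-jump form, in which a walker follows a uniformly chosen edge with probability 1 - c and jumps to a uniformly random node with probability c. The function name `pagerank`, the convergence settings, and the handling of isolated nodes are assumptions, and the default `c=0.15` is only a placeholder, not the value used in the paper.

```python
def pagerank(adj, c=0.15, tol=1e-12, max_iter=1000):
    """Power iteration for PageRank on an undirected graph.

    adj maps each node to the set of its neighbors (hypothetical input format).
    c is the random-jump probability (placeholder default, an assumption).
    """
    nodes = list(adj)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Score held by isolated nodes is spread uniformly (an assumption).
        dangling = sum(pr[v] for v in nodes if not adj[v])
        new = {}
        for i in nodes:
            walk = sum(pr[j] / len(adj[j]) for j in adj[i])
            new[i] = (1 - c) * (walk + dangling / n) + c / n
        change = sum(abs(new[v] - pr[v]) for v in nodes)
        pr = new
        if change < tol:
            break
    return pr
```

The scores sum to one by construction, and on a regular graph such as a complete graph every node receives the same score.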
Closeness Centrality [16] of node $i$ is defined as
$$CC(i) = \frac{N-1}{\sum_{j\neq i} d_{ij}},$$
where $d_{ij}$ is the shortest distance between nodes $i$ and $j$.
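Shortest distances in an unweighted network can be obtained by breadth-first search. The sketch below assumes the standard normalization, the number of other nodes divided by the sum of shortest distances to them, and a connected network; the function name `closeness` and the input format are assumptions for illustration.

```python
from collections import deque

def closeness(adj, node):
    """Closeness of `node`: (N - 1) / (sum of shortest distances to others).

    adj maps each node to the set of its neighbors (hypothetical input format).
    Assumes a connected, unweighted, undirected graph.
    """
    dist = {node: 0}
    queue = deque([node])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    if len(dist) != len(adj):
        raise ValueError("closeness in this form needs a connected network")
    total = sum(dist.values())
    return (len(adj) - 1) / total if total > 0 else 0.0
```

In a three-node path, the middle node is at distance 1 from both ends and gets closeness 1, while each end node gets 2/3.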
Betweenness Centrality [17] of node $i$ is defined as
$$BC(i) = \sum_{s\neq i\neq t}\frac{g_{st}^{i}}{g_{st}},$$
where $g_{st}$ is the number of shortest paths between nodes $s$ and $t$, and $g_{st}^{i}$ is the number of shortest paths between nodes $s$ and $t$ that pass through node $i$.
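Collective Influence [18] (CI) of node $i$ is defined as
$$CI_{\ell}(i) = (k_i-1)\sum_{j\in\partial Ball(i,\ell)}(k_j-1),$$
where $Ball(i,\ell)$ is the set of nodes inside a ball of radius $\ell$, consisting of all nodes with distances no more than $\ell$ from node $i$, and $\partial Ball(i,\ell)$ is the frontier of this ball.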
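The collective influence score can be computed with a breadth-first search truncated at the ball radius. The sketch below assumes the Morone and Makse form, in which the node's degree minus one is multiplied by the sum of degree minus one over the nodes at exactly the given distance. The function name `collective_influence` and the default radius of 2 are assumptions for illustration, and the sketch only scores a single node in a static network.

```python
from collections import deque

def collective_influence(adj, node, radius=2):
    """(k_i - 1) times the sum of (k_j - 1) over nodes exactly `radius` away.

    adj maps each node to the set of its neighbors (hypothetical input format).
    """
    dist = {node: 0}
    queue = deque([node])
    frontier = []
    while queue:
        v = queue.popleft()
        if dist[v] == radius:
            frontier.append(v)      # on the frontier: do not expand further
            continue
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return (len(adj[node]) - 1) * sum(len(adj[j]) - 1 for j in frontier)
```

In a five-node path, the middle node has two degree-2 nodes on its radius-1 frontier, giving a score of 2, and two degree-1 end nodes on its radius-2 frontier, giving a score of 0.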
Data Availability
All relevant data are available at https://github.com/MLIF/Network-Data2.
Acknowledgements
The authors thank DataCastle for holding the related worldwide competition and sharing the data. This work is partially supported by the National Natural Science Foundation of China (61473073, 61104074, 61433014) and the Fundamental Research Funds for the Central Universities (N161702001, N171706003, N181706001, N182608003).
Author Contributions
T.R., Y.Q., Z.L. and T.Z. devised the research project. Y.Q., Y.X.Z. and S.M.L. performed the research. T.R., Z.L., Y.Q., Y.X.Z., S.M.L. and T.Z. analyzed the data. T.R., Z.L., Y.X.Z., S.M.L., Y.J.X. and T.Z. wrote the paper.
Additional Information
Competing Interests: The authors declare no competing interests.
References
- [1] Newman, M. E. J. Networks (Oxford University Press, Oxford, 2018).
- [2] Caldarelli, G. Scale-Free Networks: Complex Webs in Nature and Technology (Oxford University Press, Oxford, 2007).
- [3] Cohen, R., Erez, K., Ben-Avraham, D. & Havlin, S. Breakdown of the internet under intentional attack. Phys. Rev. Lett. 86, 3682–3685 (2001).
- [4] Motter, A. E. & Lai, Y. C. Cascade-based attacks on complex networks. Phys. Rev. E 66, 065102 (2002).
- [5] Motter, A. E. Cascade control and defense in complex networks. Phys. Rev. Lett. 93, 098701 (2004).
- [6] Albert, R., Albert, I. & Nakarado, G. L. Structural vulnerability of the North American power grid. Phys. Rev. E 69, 025103 (2004).
- [7] Albert, R., Jeong, H. & Barabási, A. L. Error and attack tolerance of complex networks. Nature 406, 378–382 (2000).
- [8] Li, D., Fu, B., Wang, Y., Lu, G., Berezin, Y., Stanley, H. E. & Havlin, S. Percolation transition in dynamical traffic network with evolving critical bottlenecks. Proc. Natl. Acad. Sci. USA 112, 669–672 (2015).
