Seedless Graph Matching via Tail of Degree Distribution for Correlated Erdos-Renyi Graphs
Mahdi Bozorg, Saber Salehkaleybar, Matin Hashemi

TL;DR
This paper introduces a seedless network alignment algorithm that leverages the tail of degree distributions to match nodes in correlated Erdos-Renyi graphs, outperforming previous methods on synthetic and real networks.
Contribution
The proposed algorithm uniquely uses degree distribution tails for seedless graph matching, eliminating the need for auxiliary information.
Findings
Outperforms previous methods in correct matching probability.
Effective on both synthetic Erdos-Renyi and real networks.
Works in sparse graph regimes where recovery is theoretically feasible.
The authors are with the Learning and Intelligent Systems Laboratory, Department of Electrical Engineering, Sharif University of Technology, Tehran, Iran. Webpage: http://lis.ee.sharif.edu.
Abstract
The network alignment (or graph matching) problem refers to recovering the node-to-node correspondence between two correlated networks. In this paper, we propose a network alignment algorithm which works without using a seed set of pre-matched node pairs or any other auxiliary information (e.g., node or edge labels) as an input. The algorithm assigns structurally innovative features to nodes based on the tail of the empirical degree distribution of their neighbor nodes. Then, it matches the nodes according to these features. We evaluate the performance of the proposed algorithm on both synthetic and real networks. For synthetic networks, we generate Erdős-Rényi graphs in the regions of $\Theta(\log(n)/n)$ and $\Theta(\log^{2}(n)/n)$, where a previous work theoretically showed that recovery is feasible in sparse Erdős-Rényi graphs if and only if the probability of having an edge between a pair of nodes in one of the graphs and also between the corresponding nodes in the other graph is in the order of $\Omega(\log(n)/n)$, where $n$ is the number of nodes. Experiments on both real and synthetic networks show that it outperforms previous works in terms of probability of correct matching.
Keywords: Graph Matching, Network Alignment, Erdős-Rényi Graphs
Highlights
- Proposing a network alignment (graph matching) algorithm that requires neither a seed set of pre-matched nodes nor any auxiliary node or edge information.
- Solving the problem solely based on structural similarities between the two graphs, specifically using the tail of the empirical degree distribution as node features.
- Significantly improving the probability of correct matching compared to previous methods in Erdős-Rényi graphs.
1 Introduction
Graph matching (or network alignment) between two correlated networks is the problem of finding a bijective mapping from the nodes of one network to the nodes of the other according to the structural similarities between them. If the two networks have exactly the same structure, the problem reduces to the graph isomorphism problem, but in general the two networks are only similar, which makes the problem more challenging.
Network alignment arises in various applications in different fields including computer vision [4], pattern recognition [5], autonomous driving [34], computational biology [10, 30], and social networks [27]. For instance, in computational biology, protein-protein interactions (PPI) can be modeled as networks. PPI networks of different species can be aligned by solving the network alignment problem, which is useful in investigating evolutionarily conserved pathways or reconstructing phylogenetic trees [16].
Network alignment algorithms can be classified along several axes, for example into seed-based and seedless algorithms. Seed-based network alignment algorithms work based on a set of pre-matched nodes from the two networks, called seeds [14, 26, 35], while seedless algorithms do not require any seed set as input [5]. Moreover, in order to assist the matching procedure, some algorithms employ node or edge features as side information (e.g., user names or locations in de-anonymization of social networks [25, 28]), while other matching algorithms do not require such prior knowledge and only utilize the structural similarities between the two networks as the most important feature in solving the problem [13]. In this paper, we propose a seedless network alignment algorithm which requires neither an input seed set nor any input features for the nodes or edges as side information. In other words, the proposed algorithm works solely based on structural similarities between the two correlated networks.
Most of the seed-based network alignment algorithms rely on the idea of percolation, in which the algorithm starts from a small set of pre-matched nodes (seeds) and gradually expands the set of matched nodes by applying some rules to the neighbors of previously matched nodes. The pioneering method in this category, which succeeded in de-anonymizing a social network with millions of nodes, was introduced by Narayanan and Shmatikov [27]. They empirically observed that their algorithm is very sensitive to the size of the seed set. If the seed set is too small, the algorithm cannot percolate, but if its size exceeds a threshold, the algorithm successfully percolates and de-anonymizes a large portion of the entire network. Yartseva and Grossglauser [32] later proved that such a phenomenon happens in random bigraph models. Subsequently, Kazemi et al. [15] proposed a percolation-based method called the NoisySeed algorithm. The main advantage of this algorithm, as the name implies, is that the initial seed set can include some incorrectly matched pairs as well. The required size of the seed set as well as the tolerable number of incorrect matches have been investigated in [15].
Compared with the above solutions, seedless algorithms do not require pre-matched node pairs as an input. In the literature, several seedless methods have been proposed based on convex relaxations of the network alignment problem. For instance, in [24], the alignment problem is relaxed to a quadratic programming problem, and the solution is then projected onto zeros and ones in order to recover the mapping between the nodes of the two networks. Some other seedless algorithms rely on computing the graph edit distance between the two networks, which is the minimum number of edge deletions or insertions required to convert one of the networks to the other one [5, 11]. Methods based on convex relaxations or graph edit distance are often much more time-consuming than other seedless network alignment algorithms [8].
Spectral methods are another type of seedless algorithm, which align nodes based on eigenvalues and eigenvectors of a transformation of the network’s adjacency matrix [3, 20]. The main idea in these methods is to obtain Laplacian matrices from the adjacency matrices of the two networks and then compute the eigenvectors and eigenvalues of these Laplacian matrices. Next, the $k$ eigenvectors corresponding to the top $k$ eigenvalues are selected to construct a $k$-dimensional feature vector for every node. From these feature vectors, the nodes in the two networks can be aligned based on a distance metric.
Besides the above seedless algorithms, several machine-learning based algorithms have been proposed that match nodes based on a set of features extracted by processing additional information about the nodes, e.g., user names or locations in social networks [1, 9, 28]. As mentioned before, the proposed method in this paper works merely based on structural similarities between the two networks, and does not require any additional features.
Recently, a few seedless network alignment algorithms have been proposed for Erdős-Rényi graphs. Barak et al. [2] presented a matching algorithm that finds certain small sub-networks that appear in both networks, based on which a set of seeds is formed. Next, a percolation algorithm extends the selected seeds to match all the nodes. This algorithm is designed for Erdős-Rényi graphs with average node degrees in the range $[n^{o(1)}, n^{1/153}]$ or $[n^{2/3}, n^{1-\epsilon}]$, where $\epsilon$ is a small positive constant. This range covers very sparse or very dense Erdős-Rényi graphs. Compared with this algorithm, the proposed solution in this paper works on Erdős-Rényi graphs with average node degrees of order $\Theta(\log n)$. In fact, it has been shown that the true graph matching can be recovered with high probability if and only if the average node degree is in the order of $\Omega(\log n)$ [6]. Thus, our proposed algorithm can work for the minimum value of average node degree for which it is possible to find the correct matching.
Dai et al. [7] proposed another network alignment algorithm for Erdős-Rényi graphs called canonical labeling. In the first step of this algorithm, the nodes in the two networks are sorted according to their degrees. Then, the top $h$ highest-degree nodes in the two networks are aligned based on the sorted lists. In the second step, each remaining node gets a binary vector of length $h$. Entry $k$ of this vector is equal to one if the node is connected to the $k$-th node in the sorted list. Otherwise, this entry is set to zero. The nodes are then aligned according to these binary feature vectors. Our experiments show that canonical labeling does not perform well in networks with average node degrees of order $\Theta(\log n)$ or even $\Theta(\log^{2} n)$. Ding et al. [8] proposed a network alignment algorithm for Erdős-Rényi graphs with average node degree in three regions including . In this algorithm, every node is assigned a feature vector containing the empirical degree distribution of its neighbors. Then, the minimum distance on these features is used to match the nodes. This algorithm has relatively higher accuracy in Erdős-Rényi graphs with average degree of $\Theta(\log^{2} n)$, but our experiments show that it has lower performance for graphs with average node degree of order $\Theta(\log n)$.
Besides the mentioned algorithms for Erdős-Rényi graphs, several graph matching algorithms have been proposed with specific applications in PPI networks, social networks, and image databases. Singh et al. [31] introduced a well-known network alignment algorithm for PPI networks, named IsoRank. In this algorithm, the similarity of a node $i$ in one network to a node $j$ in the other network depends on how similar the neighbors of node $i$ are to the neighbors of node $j$. More specifically, in the first step of this algorithm, the similarity matrix $R$ is constructed iteratively, where entry $R_{ij}$ indicates the similarity of node $i$ in one network to node $j$ in the other network. In each iteration, entry $R_{ij}$ is computed from other entries in $R$ like $R_{uv}$, where $u$ and $v$ are neighbor nodes of $i$ and $j$ in the two networks, respectively. In the second step, nodes in the two networks are aligned according to $R$. Later, Zhang et al. [36] proposed a network alignment algorithm called Final. The Final algorithm can work on both node- and edge-attributed networks or simple networks without any auxiliary information. Furthermore, this algorithm uses prior knowledge in the form of a pairwise alignment preference matrix $H$, where each entry shows the likelihood of aligning the two corresponding nodes from the two input networks. If this prior knowledge is not given, all entries of $H$ are set to the same value, i.e., a uniform distribution. This algorithm iteratively minimizes an objective function, which is constructed from the network structure (i.e., adjacency matrix) and the node and edge attributes. Zhang et al. [37] proposed another network alignment algorithm called Moana. This algorithm aligns nodes in three steps. First, it coarsens the input networks to a structured representation. Next, it aligns the coarsened representations. Finally, the alignment at multiple levels, including the node level, is obtained by interpolation.
Recently, several works [12, 33] with applications in computer vision used graph neural networks to obtain node embedding vectors and match the nodes based on them. These works utilized features extracted from images as inputs to the graph neural network to facilitate the process of graph matching.
In this paper, we propose a seedless network alignment algorithm which works without any auxiliary information. The proposed algorithm has two main steps. In the first step, for each node $i$ in either of the two correlated networks, we construct a feature vector containing the degrees of nodes $j$ having the following two properties: I) Node $j$ is in the neighborhood of node $i$. II) Its degree is in the tail of the empirical degree distribution of the nodes in the neighborhood of node $i$. Due to this property, we call the proposed algorithm the “Tail Degree Signature (TDS)” network alignment algorithm. In the second step, we compute a distance metric between every pair of feature vectors to generate the matrix of distances. Then we use a greedy algorithm or the Hungarian algorithm [17] to align nodes based on the constructed distance matrix. We evaluate the performance of the TDS algorithm on both synthetic and real networks. For synthetic networks, we select Erdős-Rényi graphs with average degree of order $\Theta(\log n)$ and $\Theta(\log^{2} n)$, which are difficult regions for the network alignment problem [6]. Experiments show that the proposed TDS algorithm outperforms other related works on both real-world networks and synthetic Erdős-Rényi graphs with average node degree of order $\Theta(\log n)$ and also $\Theta(\log^{2} n)$.
2 Problem Definition
Network alignment is the problem of identifying a bijective mapping between the nodes of two structurally similar graphs. Let $G_1=(V_1,E_1)$ and $G_2=(V_2,E_2)$ be two graphs with node sets $V_1$ and $V_2$ of size $n$, and edge sets $E_1$ and $E_2$. We denote the edge between nodes $i$ and $j$ by $(i,j)$. Let the mapping function $\pi: V_1 \rightarrow V_2$ denote a one-to-one mapping between the nodes of $G_1$ and $G_2$. The goal in the graph matching problem is to select a matching $\hat{\pi}$ from the $n!$ different possible mapping functions in the symmetric group $S_n$ such that:
$$\hat{\pi} = \arg\min_{\pi \in S_n} \left\| A_1 - \Pi A_2 \Pi^{T} \right\|_F^{2}, \qquad (1)$$
where $\|\cdot\|_F$ is the Frobenius norm and $A_1$ and $A_2$ are the adjacency matrices of $G_1$ and $G_2$, respectively. Moreover, the matrix $\Pi A_2 \Pi^{T}$ is a simultaneous row/column permuted version of $A_2$, and $\Pi$ is the permutation matrix corresponding to mapping $\pi$, which is defined as:
$$\Pi_{ij} = \begin{cases} 1, & \text{if } \pi(i)=j, \\ 0, & \text{otherwise.} \end{cases}$$
In other words, the objective function in Equation (1) measures the number of mismatched edges between the relabeled version of graph $G_2$ based on mapping $\pi$ and graph $G_1$. In the worst case, solving the above optimization problem is NP-hard [29].
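To make the objective concrete, the following minimal Python sketch counts the node pairs on which the two graphs disagree once the second graph is relabeled by a candidate mapping, and finds the best mapping by exhaustive search. For symmetric 0/1 adjacency matrices this count is half of the squared Frobenius norm of the difference. The function names `mismatch` and `brute_force_match` and the toy path graph are illustrative choices made here, not part of the paper; the exhaustive search is only feasible for very small graphs.

```python
from itertools import permutations

def mismatch(a1, a2, pi):
    """Count node pairs on which G1 and the relabeled G2 disagree.

    a1, a2: symmetric 0/1 adjacency matrices (lists of lists).
    pi: sequence where pi[i] is the node of G2 assigned to node i of G1.
    """
    n = len(a1)
    return sum(
        a1[i][j] != a2[pi[i]][pi[j]]
        for i in range(n)
        for j in range(i + 1, n)
    )

def brute_force_match(a1, a2):
    """Exhaustive search over all n! mappings; feasible only for tiny n."""
    n = len(a1)
    return min(permutations(range(n)), key=lambda pi: mismatch(a1, a2, pi))

# Toy example: the path 0-1-2-3 and a relabeled copy of it.
A1 = [[0, 1, 0, 0],
      [1, 0, 1, 0],
      [0, 1, 0, 1],
      [0, 0, 1, 0]]
TRUE_PI = (2, 0, 3, 1)  # node i of G1 is node TRUE_PI[i] of G2
A2 = [[0] * 4 for _ in range(4)]
for i in range(4):
    for j in range(4):
        A2[TRUE_PI[i]][TRUE_PI[j]] = A1[i][j]

best = brute_force_match(A1, A2)
```

Because a path has a non-trivial automorphism (reversal), `best` is only guaranteed to reach zero mismatches, not to equal `TRUE_PI`; this is the same ambiguity that makes exact isomorphic copies a degenerate case of the alignment problem.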
For synthetic graphs, we assume that $G_1$ and $G_2$ are two correlated Erdős-Rényi graphs where the original graph $G=(V,E)$ is generated with parameter $p$, i.e., there is an edge between any two nodes with probability $p$. Then, two correlated graphs $G_1$ and $G_2$ are constructed where edge sets $E_1$ and $E_2$ are sampled from $E$ with probability $s$. In other words, every edge in edge set $E$ is in $E_1$ and $E_2$ with probability $s$, independently. The vertex set $V_1$ is the same as $V$, but $V_2$ is a permuted version of $V$ according to mapping $\pi$. The matching algorithm tries to recover $\pi$ given only $G_1$ and $G_2$. For correlated Erdős-Rényi graphs, it can be shown [14] that maximum a-posteriori (MAP) estimation is equivalent to minimizing the objective function in Equation (1). Furthermore, the MAP estimator finds the ground truth matching, i.e., $\hat{\pi}=\pi$, with high probability if and only if $ps^{2}=\Omega(\log(n)/n)$ [6]. Hence, no matching algorithm can return the correct output for values of $ps^{2}$ below this order.
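The sampling model described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function name `correlated_er`, the edge-set representation, and the parameter values in the usage lines are all assumptions made here.

```python
import random

def correlated_er(n, p, s, rng):
    """Sample a parent G(n, p) graph and two correlated children.

    Each parent edge is kept in each child independently with probability s.
    The second child is relabeled by a uniformly random permutation pi, where
    pi[i] is the label in G2 of parent node i.
    Returns (edges1, edges2, pi); edges are sets of pairs (a, b) with a < b.
    """
    pi = list(range(n))
    rng.shuffle(pi)
    edges1, edges2 = set(), set()
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() >= p:       # pair is not an edge of the parent
                continue
            if rng.random() < s:        # edge survives in G1
                edges1.add((i, j))
            if rng.random() < s:        # edge survives in G2 (relabeled)
                a, b = pi[i], pi[j]
                edges2.add((min(a, b), max(a, b)))
    return edges1, edges2, pi

# Hypothetical parameter values, chosen only for illustration.
rng = random.Random(0)
edges1, edges2, pi = correlated_er(n=200, p=0.1, s=0.9, rng=rng)
```

A matching algorithm would receive only `edges1` and `edges2`; `pi` is kept aside as the ground truth for computing accuracy.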
3 Tail Degree Signature (TDS) Algorithm
Our proposed graph matching algorithm consists of two steps: I) For every node in both graphs, a feature vector is extracted. II) Based on these feature vectors, the nodes in the two graphs are matched.
3.1 Feature Extraction
Method: For every node $i \in V_1$, we extract a feature vector $F_i$ based on its neighbor nodes in $G_1$ as follows: Let $N_i^{l}$ be the set of nodes in graph $G_1$ whose distance from node $i$ is exactly equal to $l$, where $1 \le l \le \lambda$, and $\lambda$ is the maximum distance that is considered in the feature extraction procedure. For every node $i$ and every $l$, the set $D_i^{l}$ is formed as the degrees of the nodes in $N_i^{l}$, i.e.,
$$D_i^{l} = \{\deg_{G_1}(j) : j \in N_i^{l}\}.$$
Next, for a given integer parameter $k$, we pick $k$ of the smallest and $k$ of the largest elements in $D_i^{l}$ and put them in the feature vector $f_i^{l}$ of size $2k$. Finally, the feature vector $F_i$ is formed by concatenating the vectors $f_i^{l}$ as follows:
$$F_i = [f_i^{1}, f_i^{2}, \ldots, f_i^{\lambda}].$$
Thus, $F_i$ is a vector of size $2k\lambda$. By a similar procedure, a feature vector is also formed for every node $j \in V_2$. Fig. 1(a) shows two example graphs $G_1$ and $G_2$. Fig. 1(b) illustrates the construction procedure for the graphs in Fig. 1(a) and also shows the feature vectors of a few other nodes as examples.
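The feature extraction step can be sketched as a depth-limited BFS followed by a sort per distance level. The sketch below is one possible reading of the procedure, not the authors' code: the function name `tail_degree_signature`, the parameter names `k` and `lam`, the dict-of-sets graph representation, and in particular the zero-padding of levels with fewer than `k` nodes are assumptions made for illustration.

```python
from collections import deque

def tail_degree_signature(adj, root, k, lam):
    """Feature vector of `root`: for each distance l = 1..lam, the k smallest
    and k largest degrees among nodes at distance exactly l from `root`.

    adj: dict mapping node -> set of neighbors (undirected graph), k >= 1.
    Levels with fewer than k nodes are zero-padded (assumption of this sketch).
    """
    dist = {root: 0}
    queue = deque([root])
    levels = [[] for _ in range(lam + 1)]   # levels[l] = degrees at distance l
    while queue:
        u = queue.popleft()
        if dist[u] == lam:                   # do not expand beyond distance lam
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                levels[dist[v]].append(len(adj[v]))
                queue.append(v)
    features = []
    for l in range(1, lam + 1):
        degs = sorted(levels[l])
        smallest = degs[:k] + [0] * (k - len(degs[:k]))
        largest = [0] * (k - len(degs[-k:])) + degs[-k:]
        features.extend(smallest + largest)
    return features

# Small example graph with edges 0-1, 0-2, 0-3, 1-4, 3-4, 3-5.
EXAMPLE = {0: {1, 2, 3}, 1: {0, 4}, 2: {0}, 3: {0, 4, 5}, 4: {1, 3}, 5: {3}}
signature = tail_degree_signature(EXAMPLE, root=0, k=1, lam=2)
```

For node 0 in the example, the neighbors at distance 1 have degrees 1, 2, and 3, and the nodes at distance 2 have degrees 1 and 2, so with `k=1` the signature keeps only the extremes of each level.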
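Rationale: In constructing the vector $f_i^{l}$ (the part of the feature vector built from the nodes $N_i^{l}$ at distance $l$ from node $i$), we select the degrees of the nodes in $N_i^{l}$ which are in the tail region of the empirical degree distribution of the nodes in $N_i^{l}$. Herein, we give an intuition for why such a selection is preferable to considering nodes’ degrees outside of this region for $f_i^{l}$. Recall that the original graph $G$ is an Erdős-Rényi graph with parameter $p$, i.e., there is an edge between any two nodes in $G$ with probability $p$. Then $G_1$ and $G_2$ are constructed where edge sets $E_1$ and $E_2$ are sampled from $E$ with probability $s$. Thus, $G_1$ and $G_2$ are two Erdős-Rényi graphs with edge probability $ps$. It can be seen that the degree distribution of node $i$ in graph $G_1$ or $G_2$ is approximately a normal distribution with parameters $\mu=(n-1)ps$ and $\sigma^{2}=(n-1)ps(1-ps)$. Let $X_i$ be the normalized degree of node $i$ in graph $G_1$, i.e., $X_i = (\deg_{G_1}(i)-\mu)/\sigma$. $Y_j$ is defined similarly for node $j$ in graph $G_2$.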
Proposition 1.
If node $j \in V_2$ is the corresponding node of a node $i \in V_1$, i.e., $j=\pi(i)$, then the normalized degrees $X_i$ and $Y_j$ are two correlated random variables with the correlation coefficient $\rho = \frac{s(1-p)}{1-ps}$. Otherwise, they are approximately uncorrelated for large $n$.
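Proof.
To prove the above statement, we only need to compute the following term for the two correlated random variables $X_i$ and $Y_j$ with $j=\pi(i)$, expressed through the underlying degrees:
$$\begin{aligned} \mathbb{E}\big[\deg_{G_1}(i)\,\deg_{G_2}(j)\big] &= \mathbb{E}\Big[\sum_{k \neq i} \mathbb{1}\{(i,k)\in E_1\} \sum_{l \neq i} \mathbb{1}\{(\pi(i),\pi(l))\in E_2\}\Big] \\ &\overset{(a)}{=} \sum_{k \neq i}\ \sum_{l \neq i,\, l \neq k} \mathbb{E}\big[\mathbb{1}\{(i,k)\in E_1\}\big]\,\mathbb{E}\big[\mathbb{1}\{(\pi(i),\pi(l))\in E_2\}\big] + \sum_{k \neq i} \mathbb{E}\big[\mathbb{1}\{(i,k)\in E_1\}\,\mathbb{1}\{(\pi(i),\pi(k))\in E_2\}\big] \\ &\overset{(b)}{=} (n-1)(n-2)\,p^{2}s^{2} + (n-1)\,ps^{2}, \end{aligned}$$
where $\mathbb{1}\{\cdot\}$ is the indicator function.
(a) Due to the fact that the events $(i,k)\in E_1$ and $(\pi(i),\pi(l))\in E_2$ are independent for $k \neq l$.
(b) The probability that an edge exists between nodes $i$ and $k$ in the original graph $G$ is equal to $p$. Moreover, the probability of having that edge in both graphs $G_1$ and $G_2$ is $s^{2}$. Hence, the expectation of the event in the second sum is $ps^{2}$.
Thus, the correlation coefficient between $X_i$ and $Y_j$ is:
$$\rho = \frac{\mathbb{E}\big[\deg_{G_1}(i)\,\deg_{G_2}(j)\big] - \big((n-1)ps\big)^{2}}{(n-1)\,ps\,(1-ps)} = \frac{(n-1)\,ps^{2}(1-p)}{(n-1)\,ps\,(1-ps)} = \frac{s(1-p)}{1-ps}.$$
Similarly, for the case of $j \neq \pi(i)$, the two degrees share only the single original edge between $i$ and $\pi^{-1}(j)$, and it can be shown that $\rho = \frac{s(1-p)}{(n-1)(1-ps)}$. Hence, the two random variables are approximately uncorrelated for large $n$ if $j \neq \pi(i)$. ∎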
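The correlation claimed in Proposition 1 can be checked numerically. For a node and its true counterpart, a direct covariance computation on the two binomial degrees gives $s(1-p)/(1-ps)$; the Monte Carlo sketch below compares that closed form with a sample correlation. The function names and the parameter values (`n = 200`, `p = 0.1`, `s = 0.8`, 5000 trials) are hypothetical choices for illustration, not settings from the paper.

```python
import random

def degree_pair(n, p, s, rng):
    """Degrees of one node in G1 and of its true counterpart in G2."""
    d1 = d2 = 0
    for _ in range(n - 1):               # each potential incident edge
        if rng.random() < p:             # edge exists in the parent graph
            d1 += rng.random() < s       # kept in G1
            d2 += rng.random() < s       # kept in G2
    return d1, d2

def sample_correlation(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    m = len(pairs)
    mx = sum(x for x, _ in pairs) / m
    my = sum(y for _, y in pairs) / m
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    return cov / (vx * vy) ** 0.5

n, p, s = 200, 0.1, 0.8                  # illustrative values only
rng = random.Random(1)
pairs = [degree_pair(n, p, s, rng) for _ in range(5000)]
rho_hat = sample_correlation(pairs)       # empirical estimate
rho = s * (1 - p) / (1 - p * s)           # closed form for j = pi(i)
```

Normalizing the degrees does not change the correlation coefficient, so the raw degrees are used directly. Note that the closed form depends on $n$ only through the approximation quality, which is why the check works at a modest graph size.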
Based on the above observation, we can model the two random variables $X_i$ and $Y_j$ as $Y_j = \rho X_i + \sqrt{1-\rho^{2}}\,Z$, where $Z$ is a standard normal variable independent of $X_i$. To show the advantage of selecting nodes’ degrees in the tail of the degree distribution, we define the following two metrics between any two nodes $i$ and $j$:
[TABLE]
where and are the empirical distribution obtained from observing samples of and , respectively. In fact, and represent the total variation distances [23] of and in the tail and central domains of distributions, respectively. For a given node , we are interested in comparing with for any . We define the score for any and . We expect to have better matching results for higher . The score is defined similarly. We compare the average of and experimentally by generating samples of and for two correlation coefficients and . From these samples, one instance of and can be computed. Fig. 2 shows the average of and over instances against parameter for and . As can be seen, the score of tail region is about greater than the one for the central region. This observation illustrates that the empirical degree distribution in the tail region is much more robust to sampling parameter .
Complexity Analysis: For every node $i$, we run the BFS algorithm with root node $i$ and obtain all nodes at distance $l$ from $i$ for every $l \le \lambda$. Since the average degree of each node is in the order of $nps$, the average number of neighbor nodes up to distance $\lambda$ is in the order of $(nps)^{\lambda}$. Thus, the time complexity of this part is $O((nps)^{2\lambda})$. Moreover, it takes $O(\lambda(nps)^{\lambda}\log(nps))$ to sort the degrees of these nodes and construct the feature vector. Therefore, the total time complexity of the feature extraction step for all nodes is in the order of $O\Big(\big((nps)^{2\lambda}+\lambda(nps)^{\lambda}\log(nps)\big)\times n\Big)$. For and , the time complexity is simplified to . For , it is in the order of .
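To see how the total bound behaves in the two regimes studied in this paper, one can substitute the order of the average degree $nps$ while treating $\lambda$ as a constant. The worked simplification below keeps $\lambda$ symbolic; it is a sketch under that assumption and does not reproduce specific parameter values from the text.

```latex
\begin{align*}
nps = \Theta(\log n): \quad
  & O\Big(\big((\log n)^{2\lambda} + \lambda(\log n)^{\lambda}\log\log n\big)\, n\Big)
    = O\big(n \log^{2\lambda} n\big), \\
nps = \Theta(\log^{2} n): \quad
  & O\Big(\big((\log n)^{4\lambda} + 2\lambda(\log n)^{2\lambda}\log\log n\big)\, n\Big)
    = O\big(n \log^{4\lambda} n\big).
\end{align*}
```

In both cases the BFS term dominates the sorting term, so the feature extraction cost is near-linear in $n$ up to polylogarithmic factors.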
3.2 Matching Method
First, we compute the similarity matrix (or distance matrix) $S$ between $G_1$ and $G_2$. In particular, element $S_{ij}$ of this matrix is the norm distance between the feature vectors of nodes $i \in V_1$ and $j \in V_2$.
Next, we form the set of matched pairs between $V_1$ and $V_2$ by executing the Hungarian algorithm on the similarity matrix $S$. More specifically, the Hungarian algorithm selects $n$ entries from matrix $S$, where from each column and each row exactly one entry is chosen and the selected entries minimize the following cost:
$$\frac{1}{n}\sum_{i \in V_1} S_{i\pi(i)}.$$
In other words, by running the Hungarian algorithm, we form a mapping between $V_1$ and $V_2$ that has the minimum mean similarity distance over all possible choices. We call this version of the proposed method the “TDS-h algorithm”.
Another option, instead of using the Hungarian algorithm, is to use the following simple greedy algorithm. We select the minimum element $S_{ij}$ in the similarity matrix $S$, align node $i$ with node $j$, and delete row $i$ and column $j$ from the matrix. This process is repeated $n$ times. We call this version of the proposed method the “TDS-g algorithm”.
As an example, Fig. 1(c) shows -norm distances between constructed feature vectors and from Fig. 1(b). Four elements of the similarity matrix are shown in the figure. Either of the two matching methods can be applied. In this example, nodes and in graph are matched to nodes and in graph , respectively.
Complexity Analysis: Time complexity of finding similarity matrix is in the order of . Both matching methods are in the order of .
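The two matching variants can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: the $\ell_1$ norm in `distance_matrix` is an assumption (the norm is not recoverable from the text here), `hungarian` is a standard potentials-based $O(n^{3})$ implementation of the assignment problem for square matrices, and `greedy` sorts all entries once, which gives the same result as repeatedly taking the minimum remaining entry up to tie-breaking.

```python
def distance_matrix(feats1, feats2):
    """Pairwise L1 distances between feature vectors (norm is an assumption)."""
    return [[sum(abs(a - b) for a, b in zip(f, g)) for g in feats2]
            for f in feats1]

def greedy(cost):
    """Take the smallest remaining entry, then drop its row and column."""
    n = len(cost)
    entries = sorted((cost[i][j], i, j) for i in range(n) for j in range(n))
    match, used_cols = [None] * n, set()
    for _, i, j in entries:
        if match[i] is None and j not in used_cols:
            match[i] = j
            used_cols.add(j)
    return match

def hungarian(cost):
    """Minimum-cost perfect assignment; match[i] is the column of row i."""
    n = len(cost)
    inf = float("inf")
    u, v = [0.0] * (n + 1), [0.0] * (n + 1)   # row / column potentials
    p = [0] * (n + 1)                          # p[j]: row matched to column j
    way = [0] * (n + 1)                        # back-pointers for augmenting
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:                     # reached a free column
                break
        while j0:                              # flip the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [0] * n
    for j in range(1, n + 1):
        match[p[j] - 1] = j - 1
    return match

# A 2x2 cost matrix on which the two methods disagree.
COST = [[1, 2], [2, 100]]
greedy_match = greedy(COST)        # locks in the cheapest entry first
optimal_match = hungarian(COST)    # minimizes the total (and mean) cost
```

The small example shows the trade-off between the two variants: the greedy rule commits to the cheapest entry and is then forced into an expensive one, while the Hungarian algorithm accepts two moderately cheap entries for a lower overall cost.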
4 Experimental Evaluation
The proposed seedless graph matching algorithm, called tail degree signature (TDS), is experimentally evaluated in this section. The constant parameters and are set to and , respectively. The algorithm is implemented in Python.
4.1 Accuracy
The two versions of the TDS algorithm (TDS-h and TDS-g) are compared with recent seedless graph matching algorithms on Erdős-Rényi graphs with and . In particular, we consider the Degree Profile (DP) [8], Laplacian [3, 20], Canonical labeling [7], IsoRank [31], Final [36], and Moana [37] algorithms.
Fig. 3(a) shows the accuracy of the TDS algorithms and the other methods versus . Every value in this figure shows the average accuracy for randomly generated Erdős-Rényi graphs with and . As shown in the figure, both versions of the TDS algorithm achieve much higher accuracy compared to the other algorithms. Moreover, they yield accurate solutions for lower values of . For instance, at , TDS-h and TDS-g achieve about accuracy, while DP and Laplacian reach about and accuracy, respectively, and the accuracy of all the other methods is less than . At , TDS-h and TDS-g achieve about accuracy, while the accuracy of all the other methods is about or less than .
Fig. 3(b) presents the same comparisons as above for Erdős-Rényi graphs with . In this region, TDS-h has higher accuracy than TDS-g for lower values of . Both versions of the TDS algorithm achieve much higher accuracy compared to the other algorithms. For instance, at , TDS-h and TDS-g achieve about and accuracy, respectively, while all the other methods are about or less than .
4.2 Runtime
Fig. 4 compares the runtimes of the considered algorithms on Erdős-Rényi graphs with and for and . An empty value in Fig. 4 denotes that we stopped (killed) the process because the runtime exceeded 16 hours.
As can be seen, for , IsoRank, Final, and Moana have smaller runtimes compared to the other methods. TDS-g, Laplacian, and Canonical have comparable runtimes, while TDS-g has much higher accuracy (Fig. 3(b)). The runtimes of TDS-h and DP grow dramatically as the number of nodes increases.
4.3 Real-world Networks
In addition to Erdős-Rényi graphs, we also evaluated the proposed TDS algorithm on the following three real-world networks and compared it with previous seedless algorithms.
- Bitcoin-OTC [19, 18]: This is a who-trusts-whom network of people who traded using the Bitcoin cryptocurrency on a platform called Bitcoin-OTC. It contains 5,881 nodes and 35,592 edges. Members (nodes) on this platform can rate other members (nodes) in the range -10 to 10. To use Bitcoin-OTC as a benchmark for evaluating graph matching algorithms, we consider the following two unweighted networks. The first network contains all the nodes and edges in the original graph, and the second network contains only the positive edges.
- GR-QC [21]: The arXiv GR-QC (General Relativity and Quantum Cosmology) collaboration network contains 5,241 nodes and 11,923 edges. Each author is represented by a node. Two nodes are connected if the authors have at least one common paper in the GR-QC category from January 1993 to April 2003. This dataset contains two networks which are permuted versions of one another, i.e., [21].
- Facebook [22]: This dataset contains “friends lists” from Facebook, which were collected from a survey using a Facebook app. The dataset includes 4,039 nodes and 88,234 edges. For the task of network alignment, we added noise to the Facebook network edges, i.e., we constructed a new network with the same set of nodes as the Facebook network while each edge in the Facebook network is preserved in the new network with probability .
In Fig. 5, we compare the accuracy of TDS with other seedless algorithms. In almost all cases, the proposed algorithm outperforms the other algorithms. For instance, in the GR-QC benchmark, the accuracy is around in the TDS-h, TDS-g, Laplacian, IsoRank, and Final algorithms, while the other methods have less than accuracy. In the Bitcoin-OTC and Facebook benchmarks, TDS-h, TDS-g, and DP have much higher accuracy compared to the other methods.
The last column in Fig. 5 shows the geometric mean of the other columns. As can be seen, TDS-h and TDS-g achieve a mean accuracy of about , while the mean accuracy of DP is about , and the other methods have less than accuracy.
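For readers who want to reproduce this kind of summary, the sketch below shows one way to compute a per-benchmark accuracy and the geometric mean across benchmarks. The helper names `accuracy` and `geometric_mean` are illustrative, the accuracy definition (fraction of nodes mapped to their true partner) is an assumption about how the figures were computed, and the numbers in the usage line are made-up placeholders, not results from the paper.

```python
import math

def accuracy(predicted, truth):
    """Fraction of nodes whose predicted partner equals the true partner."""
    return sum(a == b for a, b in zip(predicted, truth)) / len(truth)

def geometric_mean(values):
    """n-th root of the product of the values; zero if any value is zero."""
    return math.prod(values) ** (1.0 / len(values))

# Placeholder per-benchmark accuracies for one hypothetical method.
summary = geometric_mean([0.9, 0.6, 0.4])
```

Unlike the arithmetic mean, the geometric mean is pulled down sharply by a single poor benchmark, so a method must do reasonably well on every network to score well in the last column.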
5 Conclusion
In this paper, we proposed a seedless graph matching algorithm for correlated Erdős-Rényi graphs. We introduced node features based on the tail of the degree distribution. We showed that this approach has advantages over matching nodes based on the center of the degree distribution. Our experiments showed that the proposed algorithm outperforms other related works on several real networks as well as on Erdős-Rényi graphs with average degree of order $\Theta(\log n)$ and $\Theta(\log^{2} n)$.
References
- [1] Abel, F., Henze, N., Herder, E., Krause, D., 2010. Interweaving public user profiles on the web, in: International Conference on User Modeling, Adaptation, and Personalization, Springer. pp. 16–27.
- [2] Barak, B., Chou, C.N., Lei, Z., Schramm, T., Sheng, Y., 2018. (Nearly) efficient algorithms for the graph matching problem on correlated random graphs. arXiv preprint arXiv:1805.02349.
- [3] Carcassoni, M., Hancock, E.R., 2002. Alignment using spectral clusters, in: BMVC, pp. 1–10.
- [4] Cho, M., Lee, K.M., 2012. Progressive graph matching: Making a move of graphs via probabilistic voting, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE. pp. 398–405.
- [5] Conte, D., Foggia, P., Sansone, C., Vento, M., 2004. Thirty years of graph matching in pattern recognition. International Journal of Pattern Recognition and Artificial Intelligence 18, 265–298.
- [6] Cullina, D., Kiyavash, N., 2016. Improved achievability and converse bounds for Erdős-Rényi graph matching, in: ACM SIGMETRICS Performance Evaluation Review, ACM. pp. 63–72.
- [7] Dai, O.E., Cullina, D., Kiyavash, N., Grossglauser, M., 2018. On the performance of a canonical labeling for matching correlated Erdős-Rényi graphs. arXiv preprint arXiv:1804.09758.
- [8] Ding, J., Ma, Z., Wu, Y., Xu, J., 2018. Efficient random graph matching via degree profiles. arXiv preprint arXiv:1811.07821.
