Seedless Graph Matching via Tail of Degree Distribution for Correlated Erdős-Rényi Graphs

Mahdi Bozorg; Saber Salehkaleybar; Matin Hashemi

arXiv:1907.06334·cs.DS·September 29, 2020


TL;DR

This paper introduces a seedless network alignment algorithm that leverages the tail of degree distributions to match nodes in correlated Erdos-Renyi graphs, outperforming previous methods on synthetic and real networks.

Contribution

The proposed algorithm uniquely uses degree distribution tails for seedless graph matching, eliminating the need for auxiliary information.

Findings

01

Outperforms previous methods in correct matching probability.

02

Effective on both synthetic Erdos-Renyi and real networks.

03

Works in sparse graph regimes where recovery is theoretically feasible.


Equations

$$\hat{\pi}=\underset{\pi\in S_{n}}{\arg\min}\;\big\|A(G_{b})-P_{\pi}^{T}A(G_{a})P_{\pi}\big\|_{F}^{2},$$

$$P_{\pi}[i,j]=\begin{cases}1 & :\ i\in\mathcal{V}_{a},\ j\in\mathcal{V}_{b},\ j=\pi(i),\\ 0 & :\ \text{otherwise.}\end{cases}$$

$$\mathcal{D}(\mathcal{N}^{\,a,t}_{i})=\{\mathcal{D}^{a}_{i'}\mid i'\in\mathcal{N}^{\,a,t}_{i}\}.$$

$$\Phi^{a}_{i}=\Phi^{a,1}_{i}\mid\Phi^{a,2}_{i}\mid\dots\mid\Phi^{a,\lambda}_{i}.$$

$$\begin{split}\mathbb{E}\Big[\mathcal{D}_{i}^{a}\mathcal{D}^{b}_{\pi^{*}(i)}\Big]&=\mathbb{E}\Bigg[\sum_{k\neq i}\mathbbm{1}[(i,k)\in\mathcal{E}_{a}]\sum_{k'\neq\pi^{*}(i)}\mathbbm{1}[(\pi^{*}(i),k')\in\mathcal{E}_{b}]\Bigg]\\ &\stackrel{(a)}{=}\big((n-1)^{2}-(n-1)\big)(ps)^{2}+\sum_{k\neq i}\mathbb{E}\Big[\mathbbm{1}[(i,k)\in\mathcal{E}_{a}]\,\mathbbm{1}[(\pi^{*}(i),\pi^{*}(k))\in\mathcal{E}_{b}]\Big]\\ &\stackrel{(b)}{=}\big((n-1)^{2}-(n-1)\big)(ps)^{2}+(n-1)ps^{2},\end{split}$$

$$\rho=\frac{\mathbb{E}\Big[\mathcal{D}_{i}^{a}\mathcal{D}^{b}_{\pi^{*}(i)}\Big]-\mu^{2}}{\sigma^{2}}=\frac{s(1-p)}{1-ps}.$$

$$\begin{split}\Delta_{tail}^{i,j}&=\frac{1}{2}\int_{-\infty}^{-\frac{1}{2}}\big|\hat{p}_{U_{i}^{a}}(x)-\hat{p}_{U_{j}^{b}}(x)\big|\,dx+\frac{1}{2}\int_{\frac{1}{2}}^{\infty}\big|\hat{p}_{U_{i}^{a}}(x)-\hat{p}_{U_{j}^{b}}(x)\big|\,dx,\\ \Delta_{center}^{i,j}&=\frac{1}{2}\int_{-\frac{1}{2}}^{\frac{1}{2}}\big|\hat{p}_{U_{i}^{a}}(x)-\hat{p}_{U_{j}^{b}}(x)\big|\,dx,\end{split}$$

$$\text{cost}=\frac{1}{n}\sum X_{ij}.$$


Full text


The authors are with the Learning and Intelligent Systems Laboratory, Department of Electrical Engineering, Sharif University of Technology, Tehran, Iran. Webpage: http://lis.ee.sharif.edu, E-mails: [email protected], [email protected] (corresponding author), [email protected].


Abstract

The network alignment (or graph matching) problem refers to recovering the node-to-node correspondence between two correlated networks. In this paper, we propose a network alignment algorithm which works without using a seed set of pre-matched node pairs or any other auxiliary information (e.g., node or edge labels) as an input. The algorithm assigns structurally innovative features to nodes based on the tail of the empirical degree distribution of their neighbor nodes. Then, it matches the nodes according to these features. We evaluate the performance of the proposed algorithm on both synthetic and real networks. For synthetic networks, we generate Erdős-Rényi graphs in the regions of $\Theta(\log(n)/n)$ and $\Theta(\log^{2}(n)/n)$, where a previous work theoretically showed that recovery is feasible in sparse Erdős-Rényi graphs if and only if the probability of having an edge between a pair of nodes in one of the graphs and also between the corresponding nodes in the other graph is in the order of $\Omega(\log(n)/n)$, where $n$ is the number of nodes. Experiments on both real and synthetic networks show that the proposed algorithm outperforms previous works in terms of probability of correct matching.

Keywords: Graph Matching; Network Alignment; Erdős-Rényi Graphs

Highlights

• Proposing a network alignment (graph matching) algorithm which requires neither a seed set of pre-matched nodes nor any auxiliary node or edge information.

• Solving the problem solely based on structural similarities between the two graphs, specifically, using the tail of the empirical degree distribution as node features.

• Significantly improving the probability of correct matching compared to previous methods in Erdős-Rényi graphs.

1 Introduction

Graph matching (or network alignment) between two correlated networks is the problem of finding a bijective mapping between the nodes of one network and the nodes of the other according to the structural similarities between them. If the two networks have exactly the same structure, the problem reduces to the graph isomorphism problem, but in general, the two networks are only similar, which makes the problem more challenging.

Network alignment arises in various applications in different fields including computer vision [4], pattern recognition [5], autonomous driving [34], computational biology [10, 30], and social networks [27]. For instance, in computational biology, protein-protein interactions (PPI) can be modeled as networks. PPI networks of different species can be aligned by solving the network alignment problem, which can be useful in investigating evolutionarily conserved pathways or reconstructing phylogenetic trees [16].

Network alignment algorithms can be classified along different aspects, for example into seed-based and seedless algorithms. Seed-based network alignment algorithms work based on a set of pre-matched nodes from the two networks, called seeds [14, 26, 35], while seedless algorithms do not require any seed set as input [5]. Moreover, in order to assist the matching procedure, some algorithms employ node or edge features as side information (e.g., user names or locations in de-anonymization of social networks [25, 28]), while other matching algorithms do not require such prior knowledge and only utilize the structural similarities between the two networks as the most important feature in solving the problem [13]. In this paper, we propose a seedless network alignment algorithm which requires neither an input seed set nor any input features for the nodes or edges as side information. In other words, the proposed algorithm works solely based on structural similarities between the two correlated networks.

Most of the seed-based network alignment algorithms rely on the idea of percolation, in which the algorithm starts from a small set of pre-matched nodes (seeds), and gradually expands the set of matched nodes by applying some rules on the neighbor nodes of previously matched nodes. The pioneering method in this category, which succeeded in de-anonymizing a social network with millions of nodes, was introduced by Narayanan and Shmatikov [27]. They empirically observed that their algorithm is very sensitive to the size of the seed set. If the seed set is too small, the algorithm cannot percolate, but if its size exceeds a threshold, the algorithm successfully percolates and de-anonymizes a large portion of the entire network. Yartseva and Grossglauser [32] later proved that such a phenomenon happens in random bigraph models. Later, Kazemi et al. [15] proposed a percolation-based method called the NoisySeed algorithm. The main advantage of this algorithm, as the name implies, is that the initial seed set can include some incorrectly matched pairs as well. The required size of the seed set as well as the tolerable number of incorrect matches have been investigated in [15].

Compared with the above solutions, seedless algorithms do not require pre-matched node pairs as an input. In the literature, several seedless methods have been proposed based on convex relaxations of the network alignment problem. For instance, in [24], the alignment problem is relaxed to a quadratic programming problem, and then the solution is projected onto zeros and ones in order to recover the mapping between the nodes of the two networks. Some other seedless algorithms rely on computing the graph edit distance between the two networks, which is the minimum number of edge deletions or insertions required to convert one of the networks into the other [5, 11]. Methods based on convex relaxations or graph edit distance are often much more time-consuming than other seedless network alignment algorithms [8].

Spectral methods are another type of seedless algorithm, which align nodes based on eigenvalues and eigenvectors of a transformation of the network's adjacency matrix [3, 20]. The main idea in these methods is to obtain Laplacian matrices from the adjacency matrices of the two networks and then compute the eigenvectors and eigenvalues of these Laplacian matrices. Next, the $k$ eigenvectors corresponding to the top $k$ eigenvalues are selected to construct a $k$-dimensional feature vector for every node. From these feature vectors, the nodes in the two networks can be aligned based on a distance metric.

Besides the above seedless algorithms, several machine-learning based algorithms have been proposed that match nodes based on a set of features extracted by processing additional information about the nodes, e.g., user names or locations in social networks [1, 9, 28]. As mentioned before, the method proposed in this paper works merely based on structural similarities between the two networks, and does not require any additional features.

Recently, a few seedless network alignment algorithms have been proposed for Erdős-Rényi graphs. Barak et al. [2] presented a matching algorithm that finds certain small sub-networks that appear in both networks, from which a set of seeds is formed. Next, a percolation algorithm extends the selected seeds to match all the nodes. This algorithm is designed for Erdős-Rényi graphs with average node degrees in the range $[n^{o(1)},n^{1/153}]$ or $[n^{2/3},n^{1-\epsilon}]$, where $\epsilon$ is a small positive constant. This range covers very sparse or very dense Erdős-Rényi graphs. Compared with this algorithm, the solution proposed in this paper works on Erdős-Rényi graphs with average node degrees of order $\log(n)$. In fact, it has been shown that the true graph matching can be recovered with high probability if and only if the average node degree is in the order of $\Omega(\log(n))$ [6]. Thus, our proposed algorithm works at the smallest order of average node degree for which finding the correct matching is possible.

Dai et al. [7] proposed another network alignment algorithm for Erdős-Rényi graphs called canonical labeling. In the first step of this algorithm, the nodes in the two networks are sorted according to their degrees. Then, the top $h$ highest-degree nodes in the two networks are aligned based on the sorted lists. In the second step, each remaining node $j$ gets a binary vector of length $h$. Entry $i$ of this vector is equal to one if node $j$ is connected to the $i$-th node in the sorted list. Otherwise, this entry is set to zero. The nodes are then aligned according to these binary feature vectors. Our experiments show that canonical labeling does not perform well in networks with average node degrees of order $\log(n)$ or even $\log^{2}(n)$. Ding et al. [8] proposed a network alignment algorithm for Erdős-Rényi graphs with average node degree in three regions including $\Theta(\log^{2}(n))$. In this algorithm, every node is assigned a feature vector containing the empirical degree distribution of its neighbors. Then, the minimum distance between these features is used to match the nodes. This algorithm has a relatively higher accuracy in Erdős-Rényi graphs with average degree of $\log(n)$, but our experiments show that it has lower performance for graphs with average node degree of order $\log^{2}(n)$.

Besides the aforementioned algorithms for Erdős-Rényi graphs, several graph matching algorithms have been proposed with specific applications in PPI networks, social networks, and image databases. Singh et al. [31] introduced a well-known network alignment algorithm for PPI networks, named IsoRank. In this algorithm, the similarity of a node $i$ in one network to a node $j$ in the other network depends on how similar the neighbors of node $i$ are to the neighbors of node $j$. More specifically, in the first step of this algorithm, the similarity matrix $R$ is constructed iteratively, where entry $R_{ij}$ indicates the similarity of node $i$ in one network to node $j$ in the other. In each iteration, entry $R_{ij}$ is computed from other entries $R_{uv}$ of $R$, where $u$ and $v$ are neighbors of $i$ and $j$ in the two networks, respectively. In the second step, the nodes in the two networks are aligned according to $R$. Later, Zhang et al. [36] proposed a network alignment algorithm called Final. The Final algorithm can work on both node- and edge-attributed networks and on simple networks without any auxiliary information. Furthermore, this algorithm uses prior knowledge in the form of a pairwise alignment preference matrix $H$, where each entry shows the likelihood of aligning the two corresponding nodes from the two input networks. If this prior knowledge is not given, all entries of $H$ are set to the same value, i.e., a uniform distribution. This algorithm iteratively minimizes an objective function, which is constructed from the network structure (i.e., the adjacency matrix) and the node and edge attributes. Zhang et al. [37] proposed another network alignment algorithm called Moana. This algorithm aligns nodes in three steps. First, it coarsens the input networks to a structured representation. Next, it aligns the coarsened representations. Finally, it obtains the alignment at multiple levels, including the node level, by interpolation.

Recently, several works [12, 33] with applications in computer vision used graph neural networks to obtain node embedding vectors and match the nodes based on them. These works used features extracted from images as inputs to the graph neural network to facilitate the process of graph matching.

In this paper, we propose a seedless network alignment algorithm, which works without any auxiliary information. The proposed algorithm has two main steps. In the first step, for each node $i$ in either of the two correlated networks, we construct a feature vector containing the degrees of nodes $j$ that have the following two properties: I) Node $j$ is in the neighborhood of node $i$. II) Its degree is in the tail of the empirical degree distribution of the nodes in the neighborhood of node $i$. Due to this property of the proposed algorithm, we call it the "Tail Degree Signature (TDS)" network alignment algorithm. In the second step, we compute a distance metric between every pair of feature vectors to generate the matrix of distances. Then we use a greedy algorithm or the Hungarian algorithm [17] to align the nodes based on the constructed distance matrix. We evaluate the performance of the TDS algorithm on both synthetic and real networks. For synthetic networks we select Erdős-Rényi graphs with average degree of order $\log(n)$ and $\log^{2}(n)$, which are difficult regions for the network alignment problem [6]. Experiments show that the proposed TDS algorithm outperforms other related works on both real-world networks and synthetic Erdős-Rényi graphs with average node degree of order $\log(n)$ and also $\log^{2}(n)$.

2 Problem Definition

Network alignment is the problem of identifying a bijective mapping between the nodes of two structurally similar graphs. Let $G_{a}(\mathcal{V}_{a},\mathcal{E}_{a})$ and $G_{b}(\mathcal{V}_{b},\mathcal{E}_{b})$ be two graphs with node sets $\mathcal{V}_{a}$ and $\mathcal{V}_{b}$ of size $n$, and edge sets $\mathcal{E}_{a}$ and $\mathcal{E}_{b}$. We denote the edge between nodes $i$ and $j$ by $(i,j)$. Let the mapping function $\pi:\mathcal{V}_{a}\rightarrow\mathcal{V}_{b}$ denote a one-to-one mapping between the nodes of $G_{a}$ and $G_{b}$. The goal in the graph matching problem is to select a matching $\hat{\pi}$ from the $n!$ possible mapping functions in the symmetric group $S_{n}$ such that:

$$\hat{\pi}=\underset{\pi\in S_{n}}{\arg\min}\;\big\|A(G_{b})-P_{\pi}^{T}A(G_{a})P_{\pi}\big\|_{F}^{2},\tag{1}$$

where $\|.\|_{F}$ is the Frobenius norm and $A(G_{a})$ and $A(G_{b})$ are the adjacency matrices of $G_{a}$ and $G_{b}$, respectively. Moreover, the matrix $P_{\pi}^{T}A(G_{a})P_{\pi}$ is a simultaneously row- and column-permuted version of $A(G_{a})$, and $P_{\pi}$ is the permutation matrix corresponding to mapping $\pi$, which is defined as:

$$P_{\pi}[i,j]=\begin{cases}1 & :\ i\in\mathcal{V}_{a},\ j\in\mathcal{V}_{b},\ j=\pi(i),\\ 0 & :\ \text{otherwise.}\end{cases}$$

In other words, the objective function in Equation (1) measures the number of mismatched edges between graph $G_{b}$ and the version of graph $G_{a}$ relabeled according to mapping $\pi$. In the worst case, solving the above optimization problem is NP-hard [29].
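As a small illustration of the objective in Equation (1), the following sketch evaluates it for a given mapping on a toy example. The function name `alignment_cost` and the list-of-lists matrix representation are assumptions made for this illustration, not part of the paper. It uses the fact that $(P_{\pi}^{T}A(G_{a})P_{\pi})[\pi(i),\pi(k)]=A(G_{a})[i,k]$, so for symmetric 0/1 adjacency matrices the value equals twice the number of mismatched edges.

```python
def alignment_cost(A, B, pi):
    """Squared Frobenius norm ||B - P^T A P||_F^2 for the mapping i -> pi[i].

    A, B: symmetric 0/1 adjacency matrices (lists of lists); pi: list with pi[i] in V_b.
    Hypothetical helper for illustration only.
    """
    n = len(A)
    return sum((B[pi[i]][pi[k]] - A[i][k]) ** 2
               for i in range(n) for k in range(n))


# Toy example: G_a is the path 0-1-2; G_b is G_a relabeled by pi* = [2, 0, 1].
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
B = [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]

true_cost = alignment_cost(A, B, [2, 0, 1])      # the true mapping has no mismatches
identity_cost = alignment_cost(A, B, [0, 1, 2])  # two mismatched edges, counted twice
```

Evaluating this cost for all $n!$ mappings is only feasible for tiny graphs, which is why the algorithm below relies on node features instead.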

For synthetic graphs, we assume that $G_{a}$ and $G_{b}$ are two correlated Erdős-Rényi graphs, where the original graph $G(\mathcal{V},\mathcal{E})$ is generated with parameter $p$, i.e., there is an edge between any two nodes with probability $p$. Then, two correlated graphs $G_{a}$ and $G_{b}$ are constructed where the edge sets $\mathcal{E}_{a}$ and $\mathcal{E}_{b}$ are sampled from $\mathcal{E}$ with probability $s$. In other words, every edge in the edge set $\mathcal{E}$ is in $\mathcal{E}_{a}$ and in $\mathcal{E}_{b}$ with probability $s$, independently. The vertex set $\mathcal{V}_{a}$ is the same as $\mathcal{V}$, but $\mathcal{V}_{b}$ is a permuted version of $\mathcal{V}$ according to mapping $\pi^{*}$. The matching algorithm tries to recover $\pi^{*}$ given only $G_{a}$ and $G_{b}$. For correlated Erdős-Rényi graphs, it can be shown [14] that maximum a-posteriori (MAP) estimation is equivalent to minimizing the objective function in Equation (1). Furthermore, the MAP estimator finds the ground truth matching, i.e., $\hat{\pi}=\pi^{*}$, with high probability if and only if $ps^{2}=\Omega(\log(n)/n)$ [6]. Hence, no matching algorithm can recover the correct matching with high probability when $ps^{2}$ is of smaller order than $\log(n)/n$.
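The generative model above can be sketched with the standard library alone. The function name `correlated_er_pair`, the edge-set representation (sorted node pairs), and the seeding scheme are assumptions of this sketch, not details given in the paper.

```python
import random
from itertools import combinations


def correlated_er_pair(n, p, s, seed=0):
    """Sample (E_a, E_b, pi_star) from the correlated Erdos-Renyi model.

    Each node pair is an edge of the parent graph G with probability p; each
    parent edge is kept in G_a and, independently, in G_b with probability s.
    Nodes of G_b are relabeled by a uniformly random permutation pi_star.
    """
    rng = random.Random(seed)
    pi_star = list(range(n))
    rng.shuffle(pi_star)
    edges_a, edges_b = set(), set()
    for i, k in combinations(range(n), 2):
        if rng.random() < p:                    # edge (i, k) exists in the parent graph
            if rng.random() < s:                # kept in G_a
                edges_a.add((i, k))
            if rng.random() < s:                # kept in G_b, under the relabeling
                u, v = pi_star[i], pi_star[k]
                edges_b.add((min(u, v), max(u, v)))
    return edges_a, edges_b, pi_star


# With s = 1 the two graphs are identical up to the relabeling pi_star.
ea, eb, pi = correlated_er_pair(n=50, p=0.2, s=1.0, seed=1)
relabeled = {(min(pi[i], pi[k]), max(pi[i], pi[k])) for i, k in ea}
```

For $s<1$ the two edge sets differ, and the difficulty of recovering `pi_star` grows as $s$ decreases.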

3 Tail Degree Signature (TDS) Algorithm

Our proposed graph matching algorithm consists of two steps: I) For every node in both graphs, a feature vector is extracted. II) Based on these feature vectors, the nodes in the two graphs are matched.

3.1 Feature Extraction

Method: For every node $i\in\mathcal{V}_{a}$, we extract a feature vector $\Phi^{a}_{i}$ based on its neighbor nodes in $G_{a}$ as follows. Let $\mathcal{N}^{\,a,t}_{i}$ be the set of nodes in graph $G_{a}$ whose distance from node $i$ is exactly equal to $t$, where $t\in[1,\lambda]$, and $\lambda$ is the maximum distance that is considered in the feature extraction procedure. For every node $i\in\mathcal{V}_{a}$ and every $t\in[1,\lambda]$, the set $\mathcal{D}(\mathcal{N}^{\,a,t}_{i})$ is formed from the degrees of the nodes in $\mathcal{N}^{\,a,t}_{i}$, i.e.,

$$\mathcal{D}(\mathcal{N}^{\,a,t}_{i})=\{\mathcal{D}^{a}_{i'}\mid i'\in\mathcal{N}^{\,a,t}_{i}\}.$$

Next, for a given integer parameter $\theta$, we pick the $\theta$ smallest and the $\theta$ largest elements in $\mathcal{D}(\mathcal{N}^{\,a,t}_{i})$ and put them in the feature vector $\Phi^{a,t}_{i}$ of size $2\,\theta$. Finally, the feature vector $\Phi^{a}_{i}$ is formed by concatenating the vectors $\Phi^{a,t}_{i}$ as follows:

$$\Phi^{a}_{i}=\Phi^{a,1}_{i}\mid\Phi^{a,2}_{i}\mid\dots\mid\Phi^{a,\lambda}_{i}.$$

Thus, $\Phi^{a}_{i}$ is a vector of size $2\,\theta\lambda$. By a similar procedure, the feature vector $\Phi^{b}_{j}$ is formed for every node $j\in\mathcal{V}_{b}$. Fig. 1(a) shows two example graphs $G_{a}$ and $G_{b}$. Fig. 1(b) shows the construction procedure of $\Phi^{a}_{18}$, where $\mathcal{N}^{\,a,t}_{18}$ and $\mathcal{D}(\mathcal{N}^{\,a,t}_{18})$ are generated according to Fig. 1(a). As further examples, $\Phi^{a}_{5}$, $\Phi^{b}_{12}$ and $\Phi^{b}_{9}$ are also shown in Fig. 1(b).
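The feature extraction step can be sketched as follows. The names `bfs_layers` and `tds_feature` and the adjacency-dictionary representation are hypothetical. The text does not say what happens when a layer contains fewer than $\theta$ nodes, so padding with a constant (`pad=0`) is purely an assumption of this sketch; when a layer has between $\theta$ and $2\theta$ nodes, the smallest and largest selections simply overlap here.

```python
from collections import deque


def bfs_layers(adj, root, max_dist):
    """Return {t: [nodes at distance exactly t from root]} for t = 1..max_dist."""
    dist = {root: 0}
    layers = {t: [] for t in range(1, max_dist + 1)}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        if dist[u] == max_dist:          # do not expand beyond distance max_dist
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                layers[dist[v]].append(v)
                queue.append(v)
    return layers


def tds_feature(adj, node, theta, lam, pad=0):
    """Concatenate, for t = 1..lam, the theta smallest and theta largest
    neighbor degrees at distance t (padding short layers is an assumption)."""
    feature = []
    layers = bfs_layers(adj, node, lam)
    for t in range(1, lam + 1):
        degs = sorted(len(adj[v]) for v in layers[t])
        low = degs[:theta]
        high = degs[-theta:] if degs else []
        feature += low + [pad] * (theta - len(low))
        feature += [pad] * (theta - len(high)) + high
    return feature


# Toy graph with edges 0-1, 0-2, 0-3, 1-4, 3-4, 3-5.
adj = {0: [1, 2, 3], 1: [0, 4], 2: [0], 3: [0, 4, 5], 4: [1, 3], 5: [3]}
phi_0 = tds_feature(adj, 0, theta=1, lam=2)   # [1, 3, 1, 2]
```

For node 0, the distance-1 degrees are {2, 1, 3} and the distance-2 degrees are {2, 1}, so with $\theta=1$ and $\lambda=2$ the vector has length $2\theta\lambda=4$.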

Rationale: In constructing the vector $\Phi^{a}_{i}$, we select the degrees of the nodes in $\mathcal{N}^{\,a,t}_{i}$ which lie in the tail region of the empirical degree distribution of the nodes in $\mathcal{N}^{\,a,t}_{i}$. Herein, we give an intuition, for $t=1$, of why such a selection is preferable to considering node degrees outside this region. Recall that the original graph $G(\mathcal{V},\mathcal{E})$ is generated with parameter $p$, i.e., there is an edge between any two nodes in $G(\mathcal{V},\mathcal{E})$ with probability $p$. Then $G_{a}$ and $G_{b}$ are constructed where the edge sets $\mathcal{E}_{a}$ and $\mathcal{E}_{b}$ are sampled from $\mathcal{E}$ with probability $s$. Thus, $G_{a}$ and $G_{b}$ are two Erdős-Rényi graphs with edge probability $ps$. It can be seen that the degree of node $i$ in graph $G_{a}$ or $G_{b}$ approximately follows a normal distribution $N(\mu,\sigma^{2})$ with parameters $\mu=(n-1)ps$ and $\sigma=\sqrt{(n-1)(1-ps)ps}$. Let $U_{i}^{a}$ be the normalized degree of node $i$ in graph $G_{a}$, i.e., $U_{i}^{a}=(\mathcal{D}_{i}^{a}-\mu)/\sigma$. $U_{i}^{b}$ is defined similarly in graph $G_{b}$.

Proposition 1.

If node $j\in\mathcal{V}_{b}$ is the corresponding node of a node $i\in\mathcal{V}_{a}$ , i.e., $j=\pi^{*}(i)$ , then $U_{i}^{a}$ and $U_{j}^{b}$ are two correlated random variables with the correlation coefficient: $\rho=s(1-p)/(1-ps)$ . Otherwise, they are approximately uncorrelated for large $n$ .

Proof.

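To prove the above statement, we only need to compute the following term for the two correlated random variables $U_{i}^{a}$ and $U_{\pi^{*}(i)}^{b}$:

$$\begin{split}\mathbb{E}\Big[\mathcal{D}_{i}^{a}\mathcal{D}^{b}_{\pi^{*}(i)}\Big]&=\mathbb{E}\Bigg[\sum_{k\neq i}\mathbbm{1}[(i,k)\in\mathcal{E}_{a}]\sum_{k'\neq\pi^{*}(i)}\mathbbm{1}[(\pi^{*}(i),k')\in\mathcal{E}_{b}]\Bigg]\\ &\stackrel{(a)}{=}\big((n-1)^{2}-(n-1)\big)(ps)^{2}+\sum_{k\neq i}\mathbb{E}\Big[\mathbbm{1}[(i,k)\in\mathcal{E}_{a}]\,\mathbbm{1}[(\pi^{*}(i),\pi^{*}(k))\in\mathcal{E}_{b}]\Big]\\ &\stackrel{(b)}{=}\big((n-1)^{2}-(n-1)\big)(ps)^{2}+(n-1)ps^{2},\end{split}$$

where $\mathbbm{1}[.]$ is the indicator function.

(a) This holds because the events $\mathbbm{1}[(i,k)\in\mathcal{E}_{a}]$ and $\mathbbm{1}[(\pi^{*}(i),k^{\prime})\in\mathcal{E}_{b}]$ are independent for $k^{\prime}\neq\pi^{*}(k)$.

(b) The probability that an edge exists between nodes $i$ and $k$ in the original graph $G$ is equal to $p$. Moreover, given that it exists, the probability of having that edge in both graphs $G_{a}$ and $G_{b}$ is $s^{2}$. Hence, the expectation of each term in the second sum is $ps^{2}$.

Thus, the correlation coefficient between $\mathcal{D}_{i}^{a}$ and $\mathcal{D}^{b}_{\pi^{*}(i)}$ is:

$$\rho=\frac{\mathbb{E}\Big[\mathcal{D}_{i}^{a}\mathcal{D}^{b}_{\pi^{*}(i)}\Big]-\mu^{2}}{\sigma^{2}}=\frac{s(1-p)}{1-ps}.$$

Similarly, for the case of $j\neq\pi^{*}(i)$, it can be shown that $\rho=s(1-p)/((n-1)(1-ps))$. Hence, the two random variables are approximately uncorrelated for large $n$ if $j\neq\pi^{*}(i)$. ∎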
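For readers who want the intermediate algebra, the step from the expectation to the correlation coefficient can be written out using $\mu=(n-1)ps$ and $\sigma^{2}=(n-1)ps(1-ps)$. The following is a worked restatement of the proof, with the unmatched case obtained from the observation that $\mathcal{D}_{i}^{a}$ and $\mathcal{D}_{j}^{b}$ then share a single parent edge:

```latex
\begin{aligned}
\mathrm{Cov}\big(\mathcal{D}_{i}^{a},\mathcal{D}^{b}_{\pi^{*}(i)}\big)
  &= \big((n-1)^{2}-(n-1)\big)(ps)^{2} + (n-1)ps^{2} - (n-1)^{2}(ps)^{2} \\
  &= (n-1)\big(ps^{2} - p^{2}s^{2}\big) = (n-1)\,ps^{2}(1-p), \\
\rho &= \frac{(n-1)\,ps^{2}(1-p)}{(n-1)\,ps(1-ps)} = \frac{s(1-p)}{1-ps}, \\
\text{and for } j\neq\pi^{*}(i):\quad
\rho &= \frac{ps^{2}(1-p)}{(n-1)\,ps(1-ps)} = \frac{s(1-p)}{(n-1)(1-ps)}.
\end{aligned}
```

As a sanity check, $s=1$ gives $\rho=1$ in the matched case, as expected when $G_{a}$ and $G_{b}$ are identical up to relabeling.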

Based on the above observation, we can model the two random variables $U_{i}^{a}$ and $U^{b}_{\pi^{*}(i)}$ as $U^{b}_{\pi^{*}(i)}=\rho U_{i}^{a}+\sqrt{1-\rho^{2}}Z$, where $Z$ is an independent standard normal variable. To show the advantage of selecting node degrees in the tail of the degree distribution, we define the following two metrics between any two nodes $i\in\mathcal{V}_{a}$ and $j\in\mathcal{V}_{b}$:

$$\begin{split}\Delta_{tail}^{i,j}&=\frac{1}{2}\int_{-\infty}^{-\frac{1}{2}}\big|\hat{p}_{U_{i}^{a}}(x)-\hat{p}_{U_{j}^{b}}(x)\big|\,dx+\frac{1}{2}\int_{\frac{1}{2}}^{\infty}\big|\hat{p}_{U_{i}^{a}}(x)-\hat{p}_{U_{j}^{b}}(x)\big|\,dx,\\ \Delta_{center}^{i,j}&=\frac{1}{2}\int_{-\frac{1}{2}}^{\frac{1}{2}}\big|\hat{p}_{U_{i}^{a}}(x)-\hat{p}_{U_{j}^{b}}(x)\big|\,dx,\end{split}$$

where $\hat{p}_{U_{i}^{a}}(x)$ and $\hat{p}_{U_{j}^{b}}(x)$ are the empirical distributions obtained from observing samples of $U_{i}^{a}$ and $U_{j}^{b}$, respectively. In fact, $\Delta^{i,j}_{tail}$ and $\Delta^{i,j}_{center}$ represent the total variation distances [23] between $\hat{p}_{U_{i}^{a}}(x)$ and $\hat{p}_{U_{j}^{b}}(x)$ in the tail and central domains of the distributions, respectively. For a given node $i$, we are interested in comparing $\Delta_{tail}^{i,\pi^{*}(i)}$ with $\Delta_{tail}^{i,j}$ for any $j\neq\pi^{*}(i)$. We define the score $s^{i,j}_{tail}=\Delta_{tail}^{i,\pi^{*}(i)}/\Delta_{tail}^{i,j}$ for any $j\neq\pi^{*}(i)$ and $j\in\mathcal{V}_{b}$. We expect to have better matching results for higher $s^{i,j}_{tail}$. The score $s^{i,j}_{center}$ is defined similarly. We compare the averages of $s^{i,j}_{tail}$ and $s^{i,j}_{center}$ experimentally by generating $100$ samples of $U_{i}^{a}$ and $U_{j}^{b}$ for the two correlation coefficients $\rho=s(1-p)/(1-ps)$ and $\rho=0$. From these samples, one instance of $s^{i,j}_{tail}$ and $s^{i,j}_{center}$ can be computed. Fig. 2 shows the averages of $s^{i,j}_{tail}$ and $s^{i,j}_{center}$ over $100$ instances against parameter $s$ for $n=1000$ and $p=\log(n)/n$. As can be seen, the score of the tail region is about $40\%$ greater than the one for the central region. This observation illustrates that the empirical degree distribution in the tail region is much more robust to the sampling parameter $s$.
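The Gaussian model $U^{b}_{\pi^{*}(i)}=\rho U_{i}^{a}+\sqrt{1-\rho^{2}}Z$ used in this experiment is easy to sample. The sketch below only draws pairs from that model and checks that their sample correlation is close to $\rho$; it does not reproduce the tail and center scores, whose histogram construction is not specified in the text. The names `degree_correlation`, `sample_pairs`, and `sample_correlation`, and the sample size of 20000, are assumptions for illustration.

```python
import math
import random


def degree_correlation(p, s):
    """Correlation coefficient of matched normalized degrees (Proposition 1)."""
    return s * (1 - p) / (1 - p * s)


def sample_pairs(rho, m, seed=0):
    """Draw m pairs (U_a, U_b) with U_b = rho * U_a + sqrt(1 - rho^2) * Z."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(m):
        u = rng.gauss(0.0, 1.0)
        z = rng.gauss(0.0, 1.0)          # independent standard normal noise
        pairs.append((u, rho * u + math.sqrt(1.0 - rho * rho) * z))
    return pairs


def sample_correlation(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    m = len(pairs)
    mx = sum(x for x, _ in pairs) / m
    my = sum(y for _, y in pairs) / m
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    return cov / math.sqrt(vx * vy)


n = 1000
rho = degree_correlation(p=math.log(n) / n, s=0.9)   # slightly below s = 0.9
estimate = sample_correlation(sample_pairs(rho, m=20000, seed=1))
```

For sparse graphs $p$ is small, so $\rho$ is close to $s$; the sampling parameter therefore directly controls how strongly matched degrees are correlated.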

Complexity Analysis: For every node $i\in\mathcal{V}_{a}$, we run the BFS algorithm with root node $i$ and obtain all nodes in $\mathcal{N}^{\,a,t}_{i}$ for every $t\in[1,\lambda]$. Since the average degree of each node is in the order of $O((n-1)ps)$, the average number of neighbor nodes up to distance $\lambda$ is in the order of $O((nps)^{\lambda})$. Thus, the time complexity of this part is $O((nps)^{2\lambda})$. Moreover, it takes $O(t(nps)^{t}\log(nps))$ to sort the nodes in $\mathcal{N}^{\,a,t}_{i}$ and construct $\Phi^{a,t}_{i}$. Therefore, the total time complexity of the feature extraction step for all $n$ nodes is in the order of $O\Big(\big((nps)^{2\lambda}+\lambda(nps)^{\lambda}\log(nps)\big)\times n\Big)$. For $p=\log(n)/n$ and $\lambda=2$, the time complexity simplifies to $O(n\log^{4}(n))$. For $p=\log^{2}(n)/n$, it is in the order of $O(n\log^{8}(n))$.
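The two simplified bounds follow by direct substitution into the general expression with $\lambda=2$ and constant $s$. A short worked version:

```latex
\begin{aligned}
p=\tfrac{\log n}{n}:\quad & nps = s\log n,\qquad
  (nps)^{2\lambda} = s^{4}\log^{4} n,\qquad
  \lambda (nps)^{\lambda}\log(nps) = 2s^{2}\log^{2} n\,\log(s\log n), \\
  & \Rightarrow\ O\big(n\,(\log^{4} n + \log^{2} n\,\log\log n)\big) = O(n\log^{4} n), \\
p=\tfrac{\log^{2} n}{n}:\quad & nps = s\log^{2} n,\qquad
  (nps)^{2\lambda} = s^{4}\log^{8} n
  \ \Rightarrow\ O(n\log^{8} n).
\end{aligned}
```

In both regimes the $(nps)^{2\lambda}$ term dominates the sorting term, so feature extraction is near-linear in $n$ up to polylogarithmic factors.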

3.2 Matching Method

First, we compute the similarity matrix (or distance matrix) $X$ between $\mathcal{V}_{a}$ and $\mathcal{V}_{b}$. In particular, element $(i,j)$ of this matrix is equal to $X_{ij}=||\Phi^{a}_{i}-\Phi^{b}_{j}||_{2}$, where $i\in\mathcal{V}_{a}$ and $j\in\mathcal{V}_{b}$.

Next, we form the set of matched pairs between $\mathcal{V}_{a}$ and $\mathcal{V}_{b}$ by executing the Hungarian algorithm on the similarity matrix $X$. More specifically, the Hungarian algorithm selects $n$ entries from matrix $X$, with exactly one entry chosen from each column and each row, such that the selected entries minimize the following cost:

$$\text{cost}=\frac{1}{n}\sum X_{ij},$$

where the sum runs over the $n$ selected entries.

In other words, by running the Hungarian algorithm, we form a mapping $\pi$ between $\mathcal{V}_{a}$ and $\mathcal{V}_{b}$ that has the minimum mean distance over all possible choices. We call this version of the proposed method the “TDS-h algorithm”.

Another option, instead of using the Hungarian algorithm, is to use the following simple greedy algorithm. We select the minimum element $X_{ij}$ in matrix $X$, align node $i\in G_{a}$ with node $j\in G_{b}$, and delete row $i$ and column $j$ from matrix $X$. This process is repeated $n$ times. We call this version of the proposed method the “TDS-g algorithm”.
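The two matching options can be contrasted on a tiny distance matrix. In the sketch below, `distance_matrix`, `greedy_match`, `optimal_match_bruteforce`, and `mean_cost` are hypothetical names. The brute-force search over all permutations only stands in for the Hungarian algorithm, which returns the same minimum-cost assignment in polynomial time (for example via `scipy.optimize.linear_sum_assignment`); it is not how TDS-h would be run on real inputs. The greedy variant sorts all entries once, which yields the same result as repeatedly extracting the minimum and deleting its row and column, up to tie-breaking.

```python
import math
from itertools import permutations


def distance_matrix(feats_a, feats_b):
    """X[i][j] = Euclidean distance between feature vectors Phi_i^a and Phi_j^b."""
    return [[math.dist(fa, fb) for fb in feats_b] for fa in feats_a]


def greedy_match(X):
    """TDS-g style matching: take the smallest remaining entry, drop its row and column."""
    n = len(X)
    entries = sorted((X[i][j], i, j) for i in range(n) for j in range(n))
    used_rows, used_cols, match = set(), set(), {}
    for _, i, j in entries:
        if i not in used_rows and j not in used_cols:
            match[i] = j
            used_rows.add(i)
            used_cols.add(j)
    return match


def optimal_match_bruteforce(X):
    """Minimum-cost assignment by exhaustive search (tiny n only)."""
    n = len(X)
    best = min(permutations(range(n)),
               key=lambda perm: sum(X[i][perm[i]] for i in range(n)))
    return dict(enumerate(best))


def mean_cost(X, match):
    """Mean of the selected entries, i.e. the cost minimized by TDS-h."""
    return sum(X[i][j] for i, j in match.items()) / len(X)


X = [[1, 2],
     [2, 10]]
greedy = greedy_match(X)                # {0: 0, 1: 1}, mean cost 5.5
optimal = optimal_match_bruteforce(X)   # {0: 1, 1: 0}, mean cost 2.0
```

The example shows why the two variants can disagree: the greedy choice of the smallest entry forces an expensive pairing later, whereas the optimal assignment accepts two moderate entries.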

As an example, Fig. 1(c) shows the $l_{2}$-norm distances between the constructed feature vectors $\Phi^{a}_{i}$ and $\Phi^{b}_{j}$ from Fig. 1(b). Four elements of the similarity matrix are shown in the figure. Either of the two matching methods can be applied. In this example, nodes $5$ and $18$ in graph $G_{a}$ are matched to nodes $12$ and $9$ in graph $G_{b}$, respectively.

Complexity Analysis: The time complexity of computing the similarity matrix $X$ is in the order of $O(n^{2})$. Both matching methods are in the order of $O(n^{3})$.

4 Experimental Evaluation

The proposed seedless graph matching algorithm, called tail degree signature (TDS), is experimentally evaluated in this section. The constant parameters $\lambda$ and $\theta$ are set to $2$ and $10$, respectively. The algorithm is implemented in Python.

4.1 Accuracy

The two versions of the TDS algorithm (TDS-h and TDS-g) are compared with recent seedless graph matching algorithms on Erdős-Rényi graphs with $p=\log(n)/n$ and $p=\log^{2}(n)/n$. In particular, we consider the Degree Profile (DP) [8], Laplacian [3, 20], Canonical labeling [7], IsoRank [31], Final [36], and Moana [37] algorithms.

Fig. 3(a) shows the accuracy of the TDS algorithms and the other methods versus $s$. Every value in this figure is the average accuracy over $50$ randomly generated Erdős-Rényi graphs with $p=\log(n)/n$ and $n=1000$. As shown in the figure, both versions of the TDS algorithm achieve much higher accuracy than the other algorithms. Moreover, they yield accurate solutions for lower values of $s$. For instance, at $s=0.98$, TDS-h and TDS-g achieve about $80\%$ accuracy, while DP and Laplacian reach about $40\%$ and $20\%$ accuracy, respectively, and the accuracy of all the other methods is less than $10\%$. At $s=0.95$, TDS-h and TDS-g achieve about $45\%$ accuracy, while the accuracy of all the other methods is about or less than $10\%$.

Fig. 3(b) presents the same comparisons as above for Erdős-Rényi graphs with $p=\log^{2}(n)/n$. In this region, TDS-h has higher accuracy than TDS-g for lower values of $s$. Both versions of the TDS algorithm achieve much higher accuracy than the other algorithms. For instance, at $s=0.98$, TDS-h and TDS-g achieve about $90\%$ and $50\%$ accuracy, respectively, while all the other methods are at about or less than $10\%$.
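The accuracy figures above can be read as the fraction of nodes whose recovered match equals the ground truth. That reading, and the helper name `matching_accuracy`, are assumptions of this sketch, since the text does not spell out the formula.

```python
def matching_accuracy(pi_hat, pi_star):
    """Fraction of nodes i in V_a with pi_hat[i] == pi_star[i].

    pi_hat, pi_star: dicts mapping nodes of G_a to nodes of G_b.
    Assumed definition of accuracy, for illustration only.
    """
    correct = sum(1 for i, j in pi_star.items() if pi_hat.get(i) == j)
    return correct / len(pi_star)


pi_star = {0: 1, 1: 2, 2: 0, 3: 3}
pi_hat = {0: 1, 1: 0, 2: 2, 3: 3}      # nodes 0 and 3 are matched correctly
acc = matching_accuracy(pi_hat, pi_star)
```

Averaging this quantity over independently generated graph pairs gives one point of an accuracy-versus-$s$ curve.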

4.2 Runtime

Fig. 4 compares the runtimes of the considered algorithms on Erdős-Rényi graphs with $p=\log(n)/n$ and $p=\log^{2}(n)/n$ for $n\in\{1000,2000,4000,8000\}$ and $s=0.99$. An empty value in Fig. 4 denotes that we stopped (killed) the process because the runtime exceeded 16 hours.

As can be seen, for $p=\log^{2}(n)/n$, IsoRank, Final, and Moana have smaller runtimes than the other methods. TDS-g, Laplacian, and Canonical have comparable runtimes, while TDS-g has much higher accuracy (Fig. 3(b)). The runtimes of TDS-h and DP grow dramatically as the number of nodes increases.

4.3 Real-world Networks

In addition to Erdős-Rényi graphs, we also evaluated the proposed TDS algorithm on the following three real-world networks and compared it with previous seedless algorithms.

• Bitcoin-OTC [19, 18]: This is a who-trusts-whom network of people who traded using the Bitcoin cryptocurrency on a platform called Bitcoin-OTC. It contains 5,881 nodes and 35,592 edges. Members (nodes) on this platform can rate other members (nodes) in the range -10 to 10. To use Bitcoin-OTC as a benchmark for evaluating graph matching algorithms, we consider the following two unweighted networks. The first network contains all the nodes and edges in the original graph, and the second network contains only the positive edges.

• GR-QC [21]: The arXiv GR-QC (General Relativity and Quantum Cosmology) collaboration network contains 5,241 nodes and 11,923 edges. Each author is represented by a node. Two nodes are connected if the authors have at least one common paper in the GR-QC category from January 1993 to April 2003. This dataset contains two networks that are permuted versions of one another, i.e., $s=1$ [21].

• Facebook [22]: This dataset contains “friends lists” from Facebook, which were collected from a survey using a Facebook app. The dataset includes 4,039 nodes and 88,234 edges. For the task of network alignment, we added noise to the Facebook network edges, i.e., we constructed a new network with the same set of nodes as the Facebook network, in which each edge of the Facebook network is preserved with probability $s$.
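The two benchmark constructions described in the list above (dropping non-positive edges for Bitcoin-OTC, and preserving each edge with probability $s$ for Facebook) can be sketched as follows. This is an illustrative sketch only. The function names and the toy rated edge list are hypothetical, and edges are kept as ordered pairs exactly as given, since the text does not say whether edge direction is retained.

```python
import random


def all_edge_graph(rated_edges):
    """Drop the ratings and keep every edge (first benchmark network)."""
    return {(u, v) for u, v, _ in rated_edges}


def positive_edge_graph(rated_edges):
    """Drop the ratings and keep only positively rated edges."""
    return {(u, v) for u, v, rating in rated_edges if rating > 0}


def preserve_edges(edges, s, rng):
    """Keep each edge independently with probability s (same node set)."""
    return {e for e in edges if rng.random() < s}


# Toy rated edge list (hypothetical values, not taken from Bitcoin-OTC).
rated = [(0, 1, 5), (1, 2, -3), (2, 3, 1), (3, 0, -10)]
full = all_edge_graph(rated)
positive = positive_edge_graph(rated)

# Noisy copy of a graph, in the style described for the Facebook benchmark.
rng = random.Random(1)
noisy = preserve_edges(full, 0.9, rng)
```

In both constructions the node set and node identities are unchanged, so the identity mapping is the ground truth against which a matching can be scored.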

In Fig. 5, we compare the accuracy of TDS with other seedless algorithms. In almost all cases, the proposed algorithm outperforms the other algorithms. For instance, on the GR-QC benchmark, the accuracy of the TDS-h, TDS-g, Laplacian, IsoRank, and Final algorithms is around $70\%$, while the other methods have less than $5\%$ accuracy. On the Bitcoin-OTC and Facebook benchmarks, TDS-h, TDS-g, and DP have much higher accuracy than the other methods.

The last column in Fig. 5 shows the geometric mean of the other columns. As can be seen, TDS-h and TDS-g achieve a mean accuracy of about $60\%$, while the mean accuracy of DP is about $30\%$, and the other methods have less than $20\%$ mean accuracy.
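As a small worked example of the summary statistic, the geometric mean of $k$ accuracies $a_{1},\dots,a_{k}$ is $(a_{1}a_{2}\cdots a_{k})^{1/k}$. The sketch below uses hypothetical accuracy values (they are not read from Fig. 5) to show that a single weak benchmark pulls the geometric mean down far more than the arithmetic mean.

```python
from statistics import geometric_mean

# Hypothetical per-benchmark accuracies (not values from Fig. 5).
balanced = [0.70, 0.55, 0.60]
one_weak = [0.70, 0.02, 0.60]   # same algorithm profile with one near-failure

gm_balanced = geometric_mean(balanced)
am_balanced = sum(balanced) / len(balanced)

gm_weak = geometric_mean(one_weak)
am_weak = sum(one_weak) / len(one_weak)
```

Under this statistic, a method scores well only if it is reasonably accurate on every benchmark.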

5 Conclusion

In this paper, we proposed a seedless graph matching algorithm for correlated Erdős-Rényi graphs. We introduced node features based on the tail of the degree distribution. We showed that this approach has advantages over matching nodes based on the center of the degree distribution. Our experiments showed that the proposed algorithm outperforms related works on several real networks as well as on Erdős-Rényi graphs with average degree of order $\Theta(\log(n))$ and $\Theta(\log^{2}(n))$.

Bibliography

[1] Abel, F., Henze, N., Herder, E., Krause, D., 2010. Interweaving public user profiles on the web, in: International Conference on User Modeling, Adaptation, and Personalization, Springer. pp. 16–27.
[2] Barak, B., Chou, C.N., Lei, Z., Schramm, T., Sheng, Y., 2018. (Nearly) efficient algorithms for the graph matching problem on correlated random graphs. arXiv preprint arXiv:1805.02349.
[3] Carcassoni, M., Hancock, E.R., 2002. Alignment using spectral clusters, in: BMVC, pp. 1–10.
[4] Cho, M., Lee, K.M., 2012. Progressive graph matching: Making a move of graphs via probabilistic voting, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE. pp. 398–405.
[5] Conte, D., Foggia, P., Sansone, C., Vento, M., 2004. Thirty years of graph matching in pattern recognition. International Journal of Pattern Recognition and Artificial Intelligence 18, 265–298.
[6] Cullina, D., Kiyavash, N., 2016. Improved achievability and converse bounds for Erdős-Rényi graph matching, in: ACM SIGMETRICS Performance Evaluation Review, ACM. pp. 63–72.
[7] Dai, O.E., Cullina, D., Kiyavash, N., Grossglauser, M., 2018. On the performance of a canonical labeling for matching correlated Erdős-Rényi graphs. arXiv preprint arXiv:1804.09758.
[8] Ding, J., Ma, Z., Wu, Y., Xu, J., 2018. Efficient random graph matching via degree profiles. arXiv preprint arXiv:1811.07821.
