Network Embedding: on Compression and Learning

Esra Akbas; Mehmet Aktas

arXiv:1907.02811·cs.SI·July 18, 2019

TL;DR

This paper introduces NECL, a graph compression method that preserves structural information and accelerates network embedding algorithms like DeepWalk and Node2Vec without sacrificing accuracy, especially on large real-world networks.

Contribution

NECL provides a novel graph compression strategy that enhances the efficiency of existing network embedding methods while maintaining their effectiveness.

Findings

01

Achieves 23-57% reduction in embedding time.

02

Maintains classification accuracy on large networks.

03

Effective on datasets like DBLP, BlogCatalog, Cora, and Wiki.

Abstract

Recently, network embedding, which encodes structural information of graphs into a vector space, has become popular for network analysis. Although recent methods show promising performance for various applications, the huge size of graphs may hinder the direct application of existing network embedding methods. This paper presents NECL, a novel efficient Network Embedding method built around two questions. 1) Is there an ideal Compression of a network? 2) Will the compression of a network significantly boost the representation Learning of the network? For the first problem, we propose a neighborhood similarity based graph compression method that compresses the input graph into a smaller graph without losing any/much information about the global structure of the graph and the local proximity of the vertices in the graph. For the second problem, we use the compressed graph for network embedding…

Tables

Table 1: Graph statistics ($K=10^{3}$ and $M=10^{6}$)

Network	$|V|$	$|E|$	class #	Multi-label
Wiki	2405	23192	17	No
Cora	2708	10858	7	No
DBLP	51330	133664	4	Yes
BlogCatalog	10312	668K	39	Yes

Table 2: Performance comparison of the single-label classification tasks for the similarity threshold $\lambda=0.5$ and training ratio $5\%$ for Cora and Wiki

	Cora (5%)			Wiki (5%)
	NECL(DW)	DW	Gain %	NECL(DW)	DW	Gain %
Macro $F_{1}$	0.671	0.675	-0.61	0.344	0.342	0.58
Micro $F_{1}$	0.704	0.704	0.01	0.477	0.483	-1.24
Time(s)	5.17	8.29	37.65	4.84	8.98	46.07
	NECL(N2V)	N2V	Gain %	NECL(N2V)	N2V	Gain %
Macro $F_{1}$	0.666	0.671	-0.84	0.342	0.348	-1.72
Micro $F_{1}$	0.691	0.709	-2.52	0.475	0.498	-4.62
Time(s)	11.96	17.96	33.39	9.41	19.10	50.74
	Compressed	Original	Gain %	Compressed	Original	Gain %
$|V|$	1427	2708	47.30	1060	2405	55.93
$|E|$	5236	10858	51.78	8584	23192	62.99

Table 3: Performance comparison of the multi-label classification tasks for the similarity threshold $\lambda=0.5$ and training ratios $5\%$ and $50\%$ for DBLP and BlogCatalog respectively

	DBLP (5%)			BlogCatalog (50%)
	NECL(DW)	DW	Gain %	NECL(DW)	DW	Gain %
Macro $F_{1}$	0.625	0.622	0.51	0.243	0.250	-2.92
Micro $F_{1}$	0.657	0.653	0.63	0.369	0.391	-5.68
Time(s)	39.97	93.96	57.46	71.7	99.3	27.79
	NECL(N2V)	N2V	Gain %	NECL(N2V)	N2V	Gain %
Macro $F_{1}$	0.626	0.625	0.13	0.251	0.262	-4.11
Micro $F_{1}$	0.658	0.657	0.24	0.368	0.396	-6.68
Time(s)	75.81	175.31	56.75	1247.14	1628.59	23.42
	Compressed	Original	Gain %	Compressed	Original	Gain %
$|V|$	8824	32984	69.78	8507	10312	17.50
$|E|$	32984	133664	75.32	543872	667966	18.58

Equations

$$\max_{\phi}\sum_{u\in V}\log Pr(N_{S}(u)\mid\phi(u))$$

$$Pr(N_{S}(u)\mid\phi(u))=\prod_{n_{i}\in N_{S}(u)}Pr(n_{i}\mid\phi(u))$$

$$Pr(n_{i}\mid\phi(u))=\frac{\exp(\phi(n_{i})\cdot\phi(u))}{\sum_{v\in V}\exp(\phi(v)\cdot\phi(u))}$$

$$sim(T_{u},T_{v})=|N(u)\cap N(v)|.$$

$$sim(T_{u},T_{v})=\frac{\sum_{i}T_{ui}T_{vi}}{\|T_{u}\|\,\|T_{v}\|}$$

$$\|T_{u}\|=\frac{1}{|N(u)|},\qquad\|T_{v}\|=\frac{1}{|N(v)|}$$

$$\sum_{i}A_{ui}A_{vi}=|N(u)\cap N(v)|.$$

$$sim(T_{u},T_{v})=\frac{\frac{1}{|N(u)||N(v)|}|N(u)\cap N(v)|}{\frac{1}{|N(u)||N(v)|}}=|N(u)\cap N(v)|.$$

$$NSim(u,v)=\frac{2|N(u)\cap N(v)|}{|N(u)|+|N(v)|}$$

Code & Models

Repositories

esraabil/NECL

Taxonomy

Methods: DeepWalk

Full text

Network Embedding: on Compression and Learning

Esra Akbas
Department of Computer Science, Oklahoma State University, Stillwater, OK 74078, USA

Mehmet Aktas
Department of Mathematics and Statistics, University of Central Oklahoma, Edmond, OK 73034, USA

Abstract

Recently, network embedding, which encodes structural information of graphs into a vector space, has become popular for network analysis. Although recent methods show promising performance for various applications, the huge size of graphs may hinder the direct application of existing network embedding methods. This paper presents NECL, a novel efficient Network Embedding method built around two questions. 1) Is there an ideal Compression of a network? 2) Will the compression of a network significantly boost the representation Learning of the network? For the first problem, we propose a neighborhood similarity based graph compression method that compresses the input graph into a smaller graph without losing any/much information about the global structure of the graph and the local proximity of the vertices in the graph. For the second problem, we use the compressed graph for network embedding instead of the original large graph to bring down the embedding cost. NECL is a general meta-strategy to improve the efficiency of all of the state-of-the-art graph embedding algorithms based on random walks, including DeepWalk and Node2vec, without losing their effectiveness. Extensive experiments on large real-world networks validate the efficiency of the NECL method, which yields an average improvement of 23-57% in embedding time, including walking and learning time, without decreasing classification accuracy as evaluated on single-label and multi-label classification tasks on real-world graphs such as DBLP, BlogCatalog, Cora and Wiki.

1 Introduction

Many kinds of real-world data can be modeled as networks that capture the interactions (i.e., edges) between individual units (i.e., vertices). Node classification, community detection and link prediction are common applications of network analysis in areas such as social networks and biological networks. Node classification finds the labels of vertices using the topology of the network and other labeled vertices, for example predicting demographic values, interests, beliefs or other characteristics of users in a social network, or predicting the labels of proteins in a biological network [18, 14, 5]. Similarly, link prediction determines whether there is an edge between a pair of vertices, for example recommending collaborations on academic social networks or identifying hidden interactions in a protein-protein interaction (PPI) network [20, 27].

On the other hand, there are some challenges in network analysis such as high computational complexity, low parallelizability and inapplicability of machine learning methods [11]. Recently, network embedding that encodes structural information of graphs into a vector space has become popular for network analysis [33, 17, 15, 6, 11]. The network embedding is defined as mapping the network data into a low-dimensional vector space which can capture characteristics or role of vertices in the network based on their connections [24].

Previous researchers considered network embedding as a dimensionality reduction problem [4]. While these methods are effective on small graphs, scalability is the major concern, as their time complexity is at least quadratic in the number of graph vertices. This makes them impossible to apply to large-scale networks with billions of vertices [33, 6, 11]. In recent years, network embedding has been recast as an optimization problem that preserves the local and global network structures and node proximity. Researchers focus on scalable methods that use graph factorization or neural networks. Many of them aim to preserve the first and second order proximity [29] or local neighborhood proximity with path sampling using short random walks, such as DeepWalk [24] and Node2vec [16]. The idea behind path sampling is that vertices in a similar neighborhood will get similar paths, and so their representations will be similar.

Although recent methods show promising performance for various applications, graph embedding still faces the challenge that the huge size of real-world graphs may obstruct the direct application of existing methods. On the other hand, when we consider a compressed or summary graph that conserves the key structures and patterns of the original graph, many methods become applicable to large graphs [19]. The aim of graph compression is to create a smaller graph without losing any/much information about the global structure of the graph and the local relationships between its vertices [34]. Vertices with similar characteristics are grouped and represented by super-nodes in a compressed graph.

Meanwhile, we have an observation that if two vertices share many common neighbors, they have strong second-order similarities and their paths from random walks will be very similar. From similar paths, we may get very similar representations for these vertices. This means we repeat the same walking and learning process to get similar results for these two vertices.

In addition, optimization of the co-occurrence probability of the vertices can easily get stuck at a bad local minimum as a result of poor initialization. This may generate dissimilar representations for vertices within the same or similar neighborhood sets. By combining such vertices into super-nodes, we give initial knowledge to the learning process, which can result in better representations.

According to these observations, we investigate network embedding via two problems:

1. Is there an ideal compression of a network?

2. Will the compression of a network significantly boost the representation learning of the network?

As a solution to these problems, we propose NECL, a novel network embedding method. For the first problem, we propose a neighborhood similarity based graph compression method that compresses the input graph into a relatively smaller graph without losing any/much information about the global structure of the graph and the local proximity of its vertices. NECL compresses the graph by merging vertices that have a high number of similar neighbors into super-nodes. For the second problem, we use the compressed graph for network embedding instead of the original large graph to bring down the embedding cost. This greatly benefits efficiency, since we do not need to process similar vertices separately to get similar representations. Instead, we learn the representations of super-nodes and use them as the representations of the vertices that were merged to create those super-nodes. Embedding a compressed graph is easier and more efficient than embedding the original graph. The reason is that random walks on the smaller set of super-nodes yield fewer pairwise relationships, which generates less diverse training data for the embedding step and makes the optimization easier. NECL is a general meta-strategy to improve the efficiency of state-of-the-art algorithms for embedding graphs, including DeepWalk and Node2vec.

Example 1

In Figure 1, we present graph compression on the well-known Les Miserables network, where vertices correspond to the characters in the novel and edges connect co-appearing characters. While the original network has 77 vertices and 254 edges, the compressed network has 33 vertices and 64 edges. As we see in the figure, the compressed network preserves the local structure of vertices in super-nodes without losing the global structure of the graph. For example, in Figure 1-(a) the neighborhood sets of the vertices $\{1,4,5,6,7,8,9\}$ are the same, consisting of just node [math]. Hence, random walks from these vertices have to pass through node [math] and produce the same results. We also expect the representations of these vertices to be the same or very similar. As another example, the neighborhood sets of the vertices $\{16,17,18,19,20,21,22\}$ are the same, drawn from $\{16,17,18,19,20,21,22,23\}$, except that 16 also has the neighbors $\{26,27\}$. Thus, random walks from these vertices will return to the group itself or leave it through node 23. Therefore, they will get the same walking results and, as a result, similar representations. Instead of walking separately from each of these vertices and learning the same or similar feature vectors for them, we merge them into the super-nodes $7$ and $16$ respectively in the compressed graph in Figure 1-(b); we then only need to walk from one super-node and learn one feature vector that we can use for all of its vertices. After applying the merge operation to the whole graph, we get a significantly smaller graph (Figure 1(b)) than the original graph (Figure 1(a)). Walking on the smaller graph and learning representations from the walking results is more efficient than doing so on the large original graph, without decreasing the effectiveness of the learning process.

We summarize the contributions of NECL as follows,

• New graph compressing method: Based on the observation that vertices with similar neighborhood sets get similar results from random walks and eventually similar representations, we merge these vertices into super-nodes to get a compressed (smaller) graph that preserves the characteristics of the original graph.

• Efficient graph embedding on compressed graph: We perform random walks on, and embed, the compressed graph, which has fewer vertices and edges than the large original graph. As a result, both steps are easier and more efficient than walking on and embedding the original graph. We use the representations of super-nodes as the representations of vertices in the original graph.

• Better efficiency without losing effectiveness: The compressed graph preserves the global structure of the network, and its super-nodes preserve the local neighborhoods of vertices. Using the embedding of the compressed graph does not decrease effectiveness. We demonstrate that NECL(DW) and NECL(N2V) embeddings consistently have better efficiency, with less walking and training time, but similar or better accuracy than the original methods on multi-class and multi-label classification tasks on several real-world networks.

2 Network Embedding using similarity based Compression

In this section, we first give preliminary information about network embedding and graph compression, then describe our neighborhood similarity based graph compression algorithm and how to use the compressed graph to improve the efficiency of network embedding.

2.1 Preliminaries

In this section, we briefly discuss the necessary preliminaries for our new meta-strategy for graph embedding.

In this paper, we consider an undirected, connected, simple graph $G=(V_{G};E_{G})$ where $V_{G}$ is the set of vertices, and $E_{G}\subseteq\{V_{G}\times V_{G}\}$ is the set of edges. The set of neighbors of a given vertex $v\in V_{G}$ is denoted as $N_{G}(v)$ , where $N_{G}(v)=\{u|u\in V_{G}:(u,v)\in E_{G}\}$ .

Compressed graph.

Compressed graph of a given graph $G=(V_{G};E_{G})$ is represented as $CG=(S;M)$ where $S=(V_{S};E_{S})$ is the graph summary with super-nodes $V_{S}$ and super-edges $E_{S}$ . Every node $v$ in $V_{G}$ belongs to a super-node in $V_{S}$ and $M$ is a mapping from each node $v$ to its super-node in $V_{S}$ . A super-edge $E=(V_{i};V_{j})$ in $E_{S}$ represents the set of all edges between vertices in the super-nodes $V_{i}$ and $V_{j}$ .

Network Embedding.

DeepWalk [24] is the pioneering work that uses the idea of word representation learning [21, 22] for network embedding. Vertices in a graph are treated as words, and their neighbors are treated as their context, as in natural language. A graph is represented as a set of random walk paths sampled from it. The learning process leverages the co-occurrence probability of the vertices that appear within a window in a sampled path. The node representation is learned by training the Skip-gram model [21, 22] on the random walks. From the co-occurrence of node pairs in the sampled paths, a “corpus” $D$ is generated. Formally, the corpus $D$ is a multiset that counts the multiplicity of vertex-context pairs. Node pairs with high co-occurrence probability are regarded as neighbors. As the size of the window is usually no less than two, we say that this kind of neighbor captures higher-order proximity.
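As a rough illustration of how the corpus $D$ can be built, the sketch below counts vertex-context pairs that fall within a window of each other in a set of walks. The function name `build_corpus` and the toy walks are assumptions made for this example and are not taken from the NECL code.

```python
from collections import Counter

def build_corpus(walks, window):
    """Count (vertex, context) pairs that co-occur within `window` steps."""
    corpus = Counter()
    for walk in walks:
        for i, v in enumerate(walk):
            lo = max(0, i - window)
            hi = min(len(walk), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    corpus[(v, walk[j])] += 1
    return corpus

# Two hypothetical walks over a toy graph.
walks = [["a", "n1", "b", "n2"], ["b", "n1", "a", "n3"]]
corpus = build_corpus(walks, window=2)
print(corpus[("a", "b")])   # pairs seen in both walks are counted twice
print(corpus[("a", "n2")])  # three steps apart, outside the window
```

Because `Counter` keeps multiplicities, the result behaves like the multiset described above: a pair that appears in several walks contributes several training examples.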

We define a representation as a mapping $\phi:V\rightarrow\mathbb{R}^{d},d\ll|V|$ which represents each vertex $v\in V$ as a point in a low dimensional space $\mathbb{R}^{d}$ . Here $d$ is a parameter specifying the number of dimensions of our feature representation. For every source node $u\in V$ , we define $N_{S}(u)\subset V$ as a network neighborhood of node $u$ generated through a neighborhood sampling strategy $S$ .

We seek to optimize the following objective function, which maximizes the log-probability of observing a network neighborhood $N_{S}(u)$ for a node $u$ conditioned on its feature representation, given by $\phi$

$$\max_{\phi}\sum_{u\in V}\log Pr(N_{S}(u)\mid\phi(u))\tag{1}$$

To make the optimization problem tractable, we assume conditional independence of the vertices and ignore the vertex ordering in the random walk. The likelihood is therefore factorized by assuming that the likelihood of observing a neighborhood node is independent of observing any other neighborhood node given the feature representation of the source:

$$Pr(N_{S}(u)\mid\phi(u))=\prod_{n_{i}\in N_{S}(u)}Pr(n_{i}\mid\phi(u))$$

The conditional likelihood of every source-neighborhood node pair is modeled as a softmax unit parametrized by a dot product of their features:

$$Pr(n_{i}\mid\phi(u))=\frac{\exp(\phi(n_{i})\cdot\phi(u))}{\sum_{v\in V}\exp(\phi(v)\cdot\phi(u))}$$

It is too expensive to compute the summation over all vertices for large networks and we approximate it using negative sampling [22]. We optimize Equation (1) using stochastic gradient ascent over the model parameters defining the embedding $\phi$ .
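To make the softmax likelihood concrete, the following sketch evaluates $Pr(n_{i}\mid\phi(u))$ exactly on a three-vertex toy embedding. The vectors in `phi` and the helper name `softmax_prob` are hypothetical; the full sum over $V$ used here is the expensive term that negative sampling approximates on large graphs.

```python
import math

def softmax_prob(phi, u, n_i):
    """Pr(n_i | phi(u)): softmax over dot products with every vertex in V."""
    def dot(x, y):
        return sum(a * b for a, b in zip(x, y))
    scores = {v: math.exp(dot(phi[v], phi[u])) for v in phi}
    return scores[n_i] / sum(scores.values())

# Hypothetical 2-dimensional embeddings for a toy vertex set.
phi = {"u": [1.0, 0.0], "v": [0.9, 0.1], "w": [-1.0, 0.5]}
probs = {v: softmax_prob(phi, "u", v) for v in phi}
print(probs)
```

Vertex `v`, whose vector points in nearly the same direction as that of `u`, receives a higher probability than `w`, and the probabilities over all vertices sum to one.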

Random walk based sampling.

The neighborhoods $N_{S}(u)$ are not restricted to just immediate neighbors but can have vastly different structures depending on the sampling strategy $S$ . There are many possible neighborhood sampling strategies for vertices as a form of local search. Different neighborhoods coming from different strategies result in different learned feature representations. For scalability of learning, random walk based methods are used to capture the structural relationships of vertices. They maximize the co-occurrence probability of subsequent vertices within a fixed length window of random walks to preserve higher-order proximity between vertices. With random walks, networks are represented as a collection of vertex sequences. In this section, we take a deeper look at the network neighborhood sampling strategy based on random walks and the proximity captured by random walks.

The co-occurrence probability of node pairs depends on the transition probabilities of vertices. Considering a graph $G$ , we define adjacency matrix $A$ that is symmetric for undirected graphs. For an unweighted graph, we have $A_{ij}=1$ if and only if there exists an edge from $v_{i}$ to $v_{j}$ and $A_{ij}=0$ otherwise. For a graph with adjacency matrix $A$ , we define the diagonal matrix, known as degree matrix, as $D_{ij}=\sum_{k}A_{ik}$ if $i=j$ and $D_{ij}=0$ otherwise. In a random walk, transition probability from one node to other depends on the degree of the vertices. The probability of leaving a node from one of its edges is split uniformly among the edges. We define this 1 step transition probability as $T$ : $T=D^{-1}A$ where $T_{ij}$ is the probability of a transition from vertex $v_{i}$ to vertex $v_{j}$ within one step.
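The definition $T=D^{-1}A$ amounts to dividing each row of the adjacency matrix by the degree of its vertex. A minimal sketch on a three-vertex path graph follows; the graph is chosen only for illustration, and the code assumes every vertex has at least one neighbor, which holds for the connected graphs considered here.

```python
def transition_matrix(A):
    """Row-normalise an adjacency matrix: T = D^{-1} A."""
    T = []
    for row in A:
        degree = sum(row)  # D_ii, assumed to be nonzero
        T.append([a / degree for a in row])
    return T

# Path graph 0 - 1 - 2 (illustrative).
A = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
T = transition_matrix(A)
for row in T:
    print(row)
```

Vertices 0 and 2 have the same single neighbor, so their rows of $T$ are identical, which is the situation the compression step is designed to exploit.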

We observe here that if two vertices, $i,j$ , of a graph have many common neighbors, they also have similar transition probabilities to other vertices. This means that if $A_{i}$ and $A_{j}$ are similar, $T_{i}=A_{i}D_{ii}^{-1}$ and $T_{j}=A_{j}D_{jj}^{-1}$ will be similar as well. Hence they have similar neighborhoods and get similar neighborhood sets from random walks and, as a result, they get very similar representations from the learning process. Therefore, while the random walk based neighborhood sampling strategy captures the higher order proximity within the neighborhood of the vertices, the representation learning process based on a language model, e.g. Skip-gram [21], captures the co-occurrence probability of the vertices that appear within a window in a random walk.

2.2 Neighborhood Similarity based Graph Compression

The critical problem in compressing a graph while preserving its global structure is to accurately identify vertices that have similar neighborhoods and are therefore more likely to have similar representations. In this section, we discuss how to select vertices to merge into super-nodes.

2.2.1 Motivation

The motivation of our method is that if two vertices have the same neighbors, their representations should be very similar. For example, in the toy graph in Figure 2, the neighbor sets of nodes $a$ and $b$ are the same. Hence, their transition probabilities to the other neighbor vertices are also the same, i.e. $p(n_{i}|a)=p(n_{i}|b)=1/4$ for all $i\in\{1,2,3,4\}$ . Starting on either $a$ or $b$ will yield the same walk, and we will get the same neighborhood set for them. Therefore, instead of walking and learning representations for both $a$ and $b$ , it is enough to learn for just one of them. We can merge this node pair $(a,b)$ into one super-node $ab$ . The transition probabilities of this super-node to the neighbors of $a$ and $b$ are still the same as those of $a$ and $b$ , i.e. $p(n_{i}|ab)=1/4$ for all $i\in\{1,2,3,4\}$ . When we obtain the representation of the super-node $ab$ , we can use it as the representation of each node in this pair. Merging these vertices preserves the first and second order proximity. Thus it does not affect the results of walking and learning, whereas it increases efficiency.

However, compressing may change the transition probabilities of vertices, since the number of their neighbors may decrease. As a result, the transition probability to each neighbor changes. For example, in the toy graph in Figure 2-(a), the transition probability from $n_{1}$ to each of its neighbors is $\frac{1}{|N(n_{1})|}$ , but after compressing it becomes $\frac{1}{|N(n_{1})|-1}$ , since the number of neighbors decreases by one. To avoid this problem, we give weights to the edges of super-nodes based on the number of edges merged during compression. For example, the super-edge between super-node $ab$ and $n_{1}$ includes 2 edges, which are $(a,n_{1})$ and $(b,n_{1})$ . Therefore, the weight of the super-edge ( $ab,n_{1}$ ) should be 2.

In a real-world graph, we do not expect many vertices to have exactly the same neighbors. However, for many graph mining problems, such as node classification and graph clustering, if two vertices share many common neighbors, they are expected to be in the same class or cluster even though their neighbor sets are not exactly the same. Hence we expect similar feature vectors for the vertices in the same class/cluster after embedding. Based on these observations, we can apply the same merge operation to these vertices as well. Following the same idea as in the example above, if the neighbors of two vertices are similar (but not exactly the same), instead of learning a representation for each separately, we can merge them into a super-node.
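The effect of the super-edge weights can be checked numerically. The sketch below encodes the toy graph of $a$, $b$ and $n_{1},\dots,n_{4}$ as weighted adjacency dictionaries, together with an extra vertex `x` attached to $n_{1}$ that is an assumption added here so that $n_{1}$ has a neighbor outside the merged pair.

```python
def transition_probs(graph, v):
    """One-step transition probabilities from v in a weighted graph."""
    total = sum(graph[v].values())
    return {u: w / total for u, w in graph[v].items()}

# Original graph: every edge has weight 1. "x" is a hypothetical extra vertex.
original = {
    "a": {"n1": 1, "n2": 1, "n3": 1, "n4": 1},
    "b": {"n1": 1, "n2": 1, "n3": 1, "n4": 1},
    "n1": {"a": 1, "b": 1, "x": 1},
    "n2": {"a": 1, "b": 1},
    "n3": {"a": 1, "b": 1},
    "n4": {"a": 1, "b": 1},
    "x": {"n1": 1},
}

# After merging a and b: each super-edge carries the number of merged edges.
compressed = {
    "ab": {"n1": 2, "n2": 2, "n3": 2, "n4": 2},
    "n1": {"ab": 2, "x": 1},
    "n2": {"ab": 2},
    "n3": {"ab": 2},
    "n4": {"ab": 2},
    "x": {"n1": 1},
}

p_orig = transition_probs(original, "n1")
p_comp = transition_probs(compressed, "n1")
print(transition_probs(compressed, "ab"))
print(p_orig, p_comp)
```

With weight 2 on the super-edge, the walk leaves $n_{1}$ towards $ab$ with the same total probability it previously had of reaching $a$ or $b$, and the probability of stepping to `x` is unchanged. With an unweighted super-edge both would be $1/2$.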

We now define our graph compressing algorithm formally as follows.

2.2.2 Graph Compressing

For a given graph $G$ , if a set of vertices $n_{1},n_{2},...,n_{r}$ in $V_{G}$ have similar neighbors, we merge these vertices into one super-node $n_{12...r}$ to get a smaller compressed graph $G^{\prime}(V_{G^{\prime}},E_{G^{\prime}})$ . The compressed graph $G^{\prime}$ preserves the local and global structure of the original graph but has significantly fewer vertices and edges.

To decide which vertices to merge, we define the neighborhood similarity based on the transition probability. Before defining the neighborhood similarity, we first show that the cosine similarity between the transition probabilities of two vertices $u,v$ , $T_{u}$ and $T_{v}$ , is determined by the number of their common neighbors.

Theorem 1

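Let $T$ be the 1-step transition probability matrix of vertices $V$ in a graph $G$ and let $u,v\in V$ . Let $T_{u}$ and $T_{v}$ be the transition probability vectors from vertices $u$ and $v$ to other vertices. Then the cosine similarity between $T_{u}$ and $T_{v}$ is

$$sim(T_{u},T_{v})=|N(u)\cap N(v)|.$$

Proof 2.2.

The cosine similarity between $T_{u}$ and $T_{v}$ is defined by

$$sim(T_{u},T_{v})=\frac{\sum_{i}T_{ui}T_{vi}}{\|T_{u}\|\,\|T_{v}\|}$$

By definition of $T$ , we have $T_{u}=\frac{A_{u}}{|N(u)|}$ and $T_{v}=\frac{A_{v}}{|N(v)|}$ . Furthermore, we have

$$\|T_{u}\|=\frac{1}{|N(u)|},\qquad\|T_{v}\|=\frac{1}{|N(v)|}$$

and

$$\sum_{i}A_{ui}A_{vi}=|N(u)\cap N(v)|.$$

Hence, substituting these into the definition of cosine similarity above, we get

$$sim(T_{u},T_{v})=\frac{\frac{1}{|N(u)||N(v)|}|N(u)\cap N(v)|}{\frac{1}{|N(u)||N(v)|}}=|N(u)\cap N(v)|.$$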

This finalizes the proof.

From Theorem 1, we see that the similarity of transition probabilities from two vertices to other vertices depends on the similarity of their neighbors. Therefore we define the neighborhood similarity between two vertices as follows.

Definition 2.3.

(Neighborhood similarity) Given a graph $G$ , the neighborhood similarity between two vertices $u,v$ is given by

$$NSim(u,v)=\frac{2|N(u)\cap N(v)|}{|N(u)|+|N(v)|}$$

In order to normalize the effect of high degree vertices, we divide the number of common neighbors by degree of vertices. The neighborhood similarity is between 0 and 1 where it is 0 when two vertices have no common neighbor and 1 when both have the exact same neighbors. According to the neighbor similarity, we merge vertices whose similarity value is greater than a given threshold.
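The neighborhood similarity is straightforward to compute when neighbor sets are stored as Python sets. In the sketch below, the graph `adj` and the function name `nsim` are illustrative assumptions.

```python
def nsim(adj, u, v):
    """Neighborhood similarity: 2|N(u) & N(v)| / (|N(u)| + |N(v)|)."""
    common = len(adj[u] & adj[v])
    return 2 * common / (len(adj[u]) + len(adj[v]))

# Toy neighbor sets (only the vertices being compared are listed).
adj = {
    "a": {"n1", "n2", "n3", "n4"},
    "b": {"n1", "n2", "n3", "n4"},
    "c": {"n1", "n2", "n5"},
    "d": {"n6"},
}
print(nsim(adj, "a", "b"))  # identical neighbor sets
print(nsim(adj, "a", "c"))  # two shared neighbors out of 4 and 3
print(nsim(adj, "a", "d"))  # no shared neighbors
```

With a threshold of $\lambda=0.5$, the pairs $(a,b)$ and $(a,c)$ in this toy example would both be merged, while $(a,d)$ would not.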

The neighborhood similarity based graph compressing algorithm is given in Algorithm 1. Two vertices can have a nonzero neighborhood similarity only if they share a neighbor, that is, only if they are 2-step neighbors. Therefore, we do not need to compute the similarity between all pairs of vertices; we only need to compute the similarity between each vertex and its neighbors’ neighbors. For each node $u\in V_{G}$ , we compute the similarity between $u$ and each $k$ that is a neighbor of a neighbor of $u$ (Line 3-10). Then, we check the similarity value of every pair ( $u$ , $k$ ) in the list, and if it is higher than the given threshold $\lambda$ (line 12), we merge $u$ and $k$ into a super-node $s_{u,k}$ (line 13). We then delete the edges of $u$ and $k$ and add edges between the neighbors of $u$ and $k$ and the new super-node $s_{u,k}$ (line 17-24). Edges of super-nodes are weighted, while original edges have weight 1.

The threshold $\lambda$ controls the trade-off between efficiency and effectiveness. A larger value merges fewer vertices. A smaller value merges more vertices and, as a side effect, may also merge some dissimilar vertices, which increases efficiency but decreases accuracy.

Note that the order of merging is arbitrary and one super-node may include more than two vertices of the original graph. For example, if the similarity between the vertices $x$ and $y$ , $NSim(x,y)$ , and between the vertices $y$ and $z$ , $NSim(y,z)$ , are both larger than the given threshold, we merge $x$ and $y$ into $s_{x,y}$ and then merge $s_{x,y}$ and $z$ into $s_{x,y,z}$ . Therefore, during the merge operation, we check whether a node has already been merged with another node, and if so, we use its super-node in place of the original node.
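The compression step can be sketched as follows. This is one possible reading of the description above, not a transcription of Algorithm 1: it assumes similarities are computed on the original graph, tracks merges with a union-find structure so that chains such as $x$, $y$, $z$ end up in one super-node, and drops edges that fall inside a super-node. All function and variable names are hypothetical.

```python
def nsim(adj, u, v):
    """Neighborhood similarity between u and v."""
    return 2 * len(adj[u] & adj[v]) / (len(adj[u]) + len(adj[v]))

def compress(adj, threshold):
    """Merge 2-step neighbors whose similarity exceeds `threshold`.

    adj maps each vertex to the set of its neighbors (undirected graph).
    Returns (super_adj, mapping): weighted super-graph and the map M from
    each original vertex to the representative of its super-node.
    """
    parent = {v: v for v in adj}

    def find(v):
        # Follow parent links to the current super-node representative.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u in adj:
        two_hop = set()
        for n in adj[u]:
            two_hop |= adj[n]
        two_hop.discard(u)
        for k in two_hop:
            if nsim(adj, u, k) > threshold:
                parent[find(k)] = find(u)

    mapping = {v: find(v) for v in adj}
    super_adj = {s: {} for s in set(mapping.values())}
    for u in adj:
        for v in adj[u]:
            su, sv = mapping[u], mapping[v]
            if su != sv:  # assumption: edges inside a super-node are dropped
                super_adj[su][sv] = super_adj[su].get(sv, 0) + 1
    return super_adj, mapping

# Toy graph: a and b share n1..n4; "x" is a hypothetical extra neighbor of n1.
adj = {
    "a": {"n1", "n2", "n3", "n4"},
    "b": {"n1", "n2", "n3", "n4"},
    "n1": {"a", "b", "x"},
    "n2": {"a", "b"},
    "n3": {"a", "b"},
    "n4": {"a", "b"},
    "x": {"n1"},
}
super_adj, mapping = compress(adj, threshold=0.9)
print(len(super_adj), super_adj[mapping["a"]][mapping["n1"]])
```

With the high threshold used here, only vertices with identical neighbor sets are merged: $a$ with $b$, and $n_{2}$, $n_{3}$, $n_{4}$ with each other. The super-edge between $ab$ and $n_{1}$ receives weight 2, matching the motivating example. Lowering the threshold to 0.5 also pulls $n_{1}$ into the second super-node.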

2.2.3 Network embedding on compressed graph

Our algorithm for network embedding on a compressed graph is given in Algorithm 2. After getting the weighted compressed graph $S$ (line 1), we obtain the representation of super-nodes $V_{S}$ as $\phi_{s}$ in the compressed graph with the provided network embedding algorithm (line 2). We apply any random walk based representation learning algorithm on the compressed graph. We just need to apply weighted random walks to take the edge weights into consideration. As the size of the compressed graph is smaller than the original graph, it is more efficient to get embeddings of super-nodes than vertices. Finally, we assign the embedding of super-nodes to vertices according to the mapping $M$ obtained from the compression (line 3-6).
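The two pieces specific to embedding on a compressed graph are the weighted walk and the final assignment of super-node vectors to original vertices. The sketch below shows both under simplifying assumptions: the learning step is skipped, the vectors in `phi_s` are made-up placeholders for its output, and all names are hypothetical.

```python
import random

def weighted_walk(super_adj, start, length, rng):
    """Random walk that follows each edge with probability
    proportional to its weight."""
    walk = [start]
    while len(walk) < length:
        nbrs = super_adj[walk[-1]]
        if not nbrs:
            break
        nodes = list(nbrs)
        weights = [nbrs[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk

def assign_embeddings(phi_s, mapping):
    """Give every original vertex the vector of its super-node."""
    return {v: phi_s[s] for v, s in mapping.items()}

# Small weighted compressed graph and mapping M (illustrative).
super_adj = {"ab": {"n1": 2}, "n1": {"ab": 2, "x": 1}, "x": {"n1": 1}}
mapping = {"a": "ab", "b": "ab", "n1": "n1", "x": "x"}

rng = random.Random(0)
walk = weighted_walk(super_adj, "ab", 5, rng)

# Placeholder super-node vectors standing in for the learned embedding.
phi_s = {"ab": [0.1, 0.9], "n1": [0.4, 0.2], "x": [0.8, 0.3]}
phi = assign_embeddings(phi_s, mapping)
print(walk)
print(phi["a"], phi["b"])
```

Vertices merged into the same super-node, such as `a` and `b` here, end up with the same vector, so the cost of walking and learning depends on the number of super-nodes rather than on the number of original vertices.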

3 Experiments

We perform experimental studies to evaluate the efficiency and effectiveness of our algorithms on challenging multi-class and multi-label classification tasks in several real-world networks. We first provide an overview of the datasets and embedding methods used for experiments. We further show the performance of algorithms and also the improvement of our method on efficiency and discuss parameter sensitivity for different values of similarity threshold $\lambda$ and training ratio.

3.1 Datasets

The general statistics of the datasets used for experiments are reported in Table 1.

• Cora - Cora is a citation network of machine learning papers. The labels of vertices indicate the topic of the paper. Each paper has a single topic. We convert it to an undirected graph and use only the link information. We do not consider the attribute information of vertices, which are word vectors indicating the absence/presence of the corresponding word from a given dictionary.

• Wiki - Wiki is a network with vertices as web pages from 19 classes. Each page has a single label. The links among vertices are the hyperlinks on the web pages. We convert it to an undirected graph and use only the link information. We do not consider the attribute information of vertices, which are the TF-IDF values of web pages.

• DBLP - This is a co-authorship network of researchers in computer science. The labels represent the research areas in which a researcher publishes. The 4 research areas included in this dataset are DB, DM, IR, and ML. A researcher may have more than one research area.

• BlogCatalog - BlogCatalog is a social network of bloggers on the BlogCatalog website. The links show the relationships between users. The labels of a user represent the categories that the blogger is interested in and has published in, extracted from the metadata provided by the user. A user may have more than one label.

3.2 Baseline methods

For the performance evaluation, we use DeepWalk and Node2vec as baseline embedding methods in our model and compare our model with them. We combine each baseline methods with NECL and compare their performance. We give a brief explanation about these methods as follows:

•

DeepWalk - DeepWalk is a random walk based method for network embedding. It preserves the higher order proximity between vertices with generating random walks of fixed length from all the vertices of a graph. With considering the walks as sentences in a language model, it optimizes the log-likelihood of random walks using the Skip-gram model [22], which is for learning word embeddings. DeepWalk uses hierarchical softmax for the efficiency of optimization.

•

Node2vec - Node2vec is a random walk based network embedding method that improves the random walk phase of DeepWalk. It applies biased random walks using the return parameter $p$ and the in-out parameter $q$ to combine DFS-like and BFS-like neighborhood explorations. In this way, it preserves both the community structure of the network and the structural roles of vertices. Unlike DeepWalk, Node2vec uses negative sampling for optimization.
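The role of $p$ and $q$ in the biased walk can be illustrated with a minimal sketch. It assumes an unweighted, undirected graph stored as a dictionary of neighbor sets; the function name `node2vec_weights` and the toy graph are hypothetical and only serve the illustration. The weights are unnormalized: a step back to the previous vertex gets $1/p$, a step to a common neighbor of the previous vertex gets $1$, and a step farther away gets $1/q$.

```python
def node2vec_weights(adj, prev, cur, p, q):
    """Unnormalized weights for the next step of a second-order biased walk.

    adj  : dict mapping each vertex to the set of its neighbours
    prev : vertex visited immediately before `cur`
    cur  : vertex the walk is currently on
    """
    weights = {}
    for nxt in adj[cur]:
        if nxt == prev:
            # Returning to the previous vertex is controlled by p.
            weights[nxt] = 1.0 / p
        elif nxt in adj[prev]:
            # Staying within distance 1 of prev (BFS-like behaviour).
            weights[nxt] = 1.0
        else:
            # Moving to distance 2 from prev (DFS-like behaviour), controlled by q.
            weights[nxt] = 1.0 / q
    return weights


# Hypothetical toy graph with edges a-b, a-c, b-c, c-d.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
w = node2vec_weights(adj, prev="a", cur="c", p=2.0, q=0.5)
```

Dividing each weight by the sum of all weights gives the transition probabilities. Because the weights depend on the pair (previous vertex, current vertex), the cost of preparing them grows with vertex degree.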

Parameter Settings: For DeepWalk, Node2vec, NECL(DW), and NECL(N2V), we set the following parameters: the number of random walks $\gamma$, the walk length $t$, the window size $w$ for the Skip-gram model, and the representation size $d$. The parameter setting for all models is $\gamma=40$, $t=10$, $w=10$, $d=128$. The initial and final learning rates are set to 0.025 and 0.001 respectively in all models.
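To make the roles of $\gamma$ and $t$ concrete, the following sketch generates a DeepWalk-style corpus of uniform random walks. It is only an illustration under stated assumptions: the graph is a dictionary of neighbor sets, $t$ is interpreted as the number of vertices per walk, and the function name `generate_walks` and the toy graph are hypothetical. The Skip-gram training step, which uses $w$ and $d$, is not shown.

```python
import random


def generate_walks(adj, num_walks=40, walk_length=10, seed=0):
    """Start `num_walks` uniform random walks from every vertex.

    num_walks   : gamma, the number of walks started at each vertex
    walk_length : t, taken here as the number of vertices in a walk
    """
    rng = random.Random(seed)
    vertices = sorted(adj)
    walks = []
    for _ in range(num_walks):
        rng.shuffle(vertices)              # visit start vertices in random order
        for start in vertices:
            walk = [start]
            while len(walk) < walk_length:
                neighbours = sorted(adj[walk[-1]])
                if not neighbours:         # isolated vertex: stop early
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks


# Hypothetical toy graph with edges a-b, a-c, b-c, c-d.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
corpus = generate_walks(toy, num_walks=40, walk_length=10)
```

The corpus contains $\gamma\,|V|$ walks, so a compressed graph with fewer vertices directly yields fewer walks to generate and to train on.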

3.3 Classification

In this section, we compare our method with the baseline methods in two different classification tasks, namely single-label and multi-label classifications. In the former case, vertices have only one label (Cora and Wiki datasets) and in the latter case, they can have more than one label.

To evaluate our method, firstly, we obtain the embeddings of the vertices with each method and then use them as features to train a classifier. A portion of the labeled vertices are sampled randomly from the graph to train the classifier and the rest of the vertices are used for testing.

To have a detailed comparison between NECL and the baseline methods, we vary the portion of labeled vertices used for classification and the similarity threshold value $\lambda$, and present the macro and micro $F_{1}$ scores together with the walking and embedding times. We also report the number of edges and vertices in the compressed graph to show how much each graph is compressed. We increase $\lambda$ from 0.45 to 0.8 to test its effect on the efficiency and effectiveness of the embedding algorithms. We vary the training ratio from $1\%$ to $50\%$ on the Cora, Wiki and DBLP datasets, and from $10\%$ to $80\%$ on the BlogCatalog network. BlogCatalog has up to about 10 times as many class labels as the other graphs, so we use a larger portion of labeled vertices.

To ensure the reliability of our experiments, the classification process is repeated 10 times, and the average macro $F_{1}$ scores, micro $F_{1}$ scores and running times are reported. All experiments are performed on a server running Ubuntu 14.04 with 4 Intel 2.6 GHz ten-core CPUs and 48 GB of memory.
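For readers unfamiliar with the two reported metrics, the following sketch computes macro and micro $F_{1}$ for a single-label task from scratch. The function name `f1_scores` and the toy labels are hypothetical; this is not the evaluation code used in the experiments.

```python
def f1_scores(y_true, y_pred):
    """Return (macro F1, micro F1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1   # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes first, so every vertex counts equally.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical toy labels for six test vertices.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
macro, micro = f1_scores(y_true, y_pred)
```

Macro $F_{1}$ is sensitive to performance on rare classes, whereas micro $F_{1}$ is dominated by frequent classes, which is why both are reported.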

3.3.1 Single-label Classification

In these experiments, each vertex in the datasets has exactly one label drawn from multiple classes. For the classification task, we employ a multi-class SVM with the one-vs-rest scheme as the classifier.

Table 2 shows the macro $F_{1}$ and micro $F_{1}$ scores and the embedding time on Cora and Wiki with $5\%$ labeled vertices and similarity threshold $\lambda=0.5$. When $\lambda<0.5$, the graphs become too small and accuracy decreases dramatically; therefore, we select $\lambda=0.5$ as the cutting point for compression. As the table shows, for both datasets there is no significant change in effectiveness, measured by the macro $F_{1}$ and micro $F_{1}$ scores, while there is a significant gain in efficiency, measured by the total embedding time. Compared with the baselines DeepWalk and Node2vec, the efficiency improvement is around 33.4% and 37.65% respectively on Cora, and 46% and 50.7% respectively on Wiki. Compression also reduces the graph size significantly for both datasets. The number of vertices decreases from 2708 to 1427 (47.3%) for Cora and from 2405 to 1060 (55.9%) for Wiki, and the number of edges decreases from 10858 to 5236 (51.8%) for Cora and from 23192 to 8584 (62.9%) for Wiki.
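The compression percentages quoted above are relative reductions, $(n_{\text{original}} - n_{\text{compressed}}) / n_{\text{original}}$. The short sketch below recomputes them from the vertex and edge counts given in the text; the helper name `reduction_pct` is hypothetical. The computed values agree with the quoted percentages up to rounding in the last digit.

```python
def reduction_pct(original, compressed):
    """Percentage by which `compressed` is smaller than `original`."""
    return 100.0 * (original - compressed) / original


# (original, compressed) counts quoted in the text for lambda = 0.5.
sizes = {
    ("Cora", "vertices"): (2708, 1427),
    ("Cora", "edges"): (10858, 5236),
    ("Wiki", "vertices"): (2405, 1060),
    ("Wiki", "edges"): (23192, 8584),
}

reductions = {key: reduction_pct(*pair) for key, pair in sizes.items()}
for (name, kind), pct in reductions.items():
    print(f"{name:5s} {kind:8s} reduced by {pct:.2f}%")
```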

The detailed comparison between NECL and the baseline methods for varying portions of labeled vertices and varying similarity threshold values $\lambda$ is given in Figure 3 and Figure 4 for the Cora and Wiki datasets respectively. We report the walking, training and total embedding times separately. As the figures show, the macro and micro $F_{1}$ scores are very similar to or higher than the baseline results for $\lambda\geq 0.5$, while the running times are significantly lower for both datasets, with improvements in both walking and training time. For both datasets, when $\lambda<0.5$, the macro $F_{1}$ and micro $F_{1}$ scores decrease dramatically, since compression then merges many vertices and edges, which may cause information loss in the graph.

3.3.2 Multi-label Classification

The datasets used in these experiments are multi-labeled, i.e., a node can belong to more than one class. For this task, we train a one-vs-rest logistic regression model with $L_{2}$ regularization on the graph embeddings for prediction. The logistic regression model is implemented by LibLinear [12].

Table 3 shows the macro $F_{1}$ and micro $F_{1}$ scores, and time for embedding on DBLP and BlogCatalog with 5% and 50% labeled vertices respectively and $\lambda=0.5$ similarity threshold value. Similar to single label classification, we select $\lambda=0.5$ as the cutting point for compression.

As the table shows, for the DBLP dataset the macro $F_{1}$ and micro $F_{1}$ scores of NECL are very similar to the baseline results, while there is a significant gain in embedding time of 57.46% and 56.75% for DeepWalk and Node2vec respectively. The graph compression ratio is also high for this dataset: the number of vertices decreases from 29199 to 8824 (69.8%), and the number of edges decreases from 133664 to 32984 (75.3%).

As a scale-free network with a complex structure, BlogCatalog is challenging for graph coarsening. While there is a slight decrease in both macro and micro $F_{1}$ scores (2.9% on macro $F_{1}$ and 5.6% on micro $F_{1}$ for DeepWalk, and 4.1% on macro $F_{1}$ and 6.6% on micro $F_{1}$ for Node2vec), we obtain gains of about 28.2% and 23.4% in the total running time respectively. Furthermore, the compressed graph has about 17.5% fewer vertices and 18.6% fewer edges.

The detailed comparison between NECL and the baseline methods with varying the portion of labeled vertices and similarity threshold value $\lambda$ for multi-label classification is given in Figure 5 and Figure 6. In addition to the macro and micro $F_{1}$ scores achieved on DBLP and BlogCatalog datasets, we also report detailed embedding time as walking, training and total embedding time separately in Figure 5-(c) and Figure 6-(c).

For the DBLP dataset (Figure 5), as with Cora and Wiki, NECL has very similar, even slightly higher, macro and micro $F_{1}$ scores than the baseline methods for $\lambda\geq 0.5$ at all training ratios, but again the scores decrease dramatically for smaller $\lambda$ values. At the same time, there is a significant gain in walking, training and total embedding time.

For the BlogCatalog dataset (Figure 6), the results are similar. Macro $F_{1}$ scores are close to each other for $\lambda\geq 0.5$; however, micro $F_{1}$ scores are slightly different for $\lambda\leq 0.7$. In the comparison between NECL and both baseline methods, DeepWalk and Node2vec, although there is a slight decrease in both macro and micro $F_{1}$ scores, we obtain gains in the running times, especially in walking times. For Node2vec, walking takes a large portion of the embedding time as a result of the biased walking. The main reason is that, since the degrees of the vertices are higher, defining the biased transition probabilities on them takes longer.

In short, for both the single-label and the multi-label classification tasks, NECL achieves similar classification accuracy in a consistently shorter time and with a relatively smaller compressed graph.

3.4 Graph Compression

In this section, we show how the graph size decreases under compression with different similarity threshold values $\lambda$. As Figure 7 shows, for the Cora, Wiki and DBLP datasets there is a linear relation between $\lambda$ and the number of vertices and edges down to $\lambda=0.5$, after which the graph sizes change dramatically for smaller $\lambda$; for BlogCatalog, however, the decrease is slow until $\lambda=0.7$. One possible reason for the behavior on BlogCatalog is that the neighbor sets of some vertices are very large, and it is hard to reach a high similarity between large sets. For example, two vertices with 15 edges each and 10 common neighbors can be considered highly similar. Two vertices with 150 edges each, on the other hand, would need 100 common neighbors to reach the same similarity value, which is not very common.
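The degree example above can be checked numerically. The sketch below uses the Jaccard coefficient of the two neighbor sets as a stand-in similarity; this is an assumption made for illustration, and the similarity actually used for compression is the one defined in Section 2.2. The helper names `jaccard` and `pair_with_overlap` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two neighbour sets (assumed stand-in measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_with_overlap(degree, common):
    """Build two neighbour sets of size `degree` sharing `common` elements."""
    shared = set(range(common))
    u = shared | {("u", i) for i in range(degree - common)}
    v = shared | {("v", i) for i in range(degree - common)}
    return u, v


small = jaccard(*pair_with_overlap(15, 10))     # 10 shared out of 20 distinct
large = jaccard(*pair_with_overlap(150, 100))   # 100 shared out of 200 distinct
```

Both pairs reach the same similarity, but the high-degree pair needs ten times as many common neighbors to do so, which illustrates why few vertex pairs pass a given threshold in a graph with many high-degree vertices.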

4 Related Work

In this section, we briefly discuss the related work in the areas of network embedding and graph compression.

Network embedding. Earlier work treats graph embedding as dimensionality reduction [9], using methods such as PCA [31], which captures linear structural information, and LLE (locally linear embedding) [26], which preserves the global structure of non-linear manifolds. While these methods are effective on small graphs, scalability is the major concern when applying them to large-scale networks with billions of vertices, since their time complexity is at least quadratic in the number of graph vertices [33, 30]. Recent approaches in graph representation learning, on the other hand, focus on scalable methods that use matrix factorization [25] or neural networks [29, 8, 32]. Many of these, such as DeepWalk and Node2vec, aim to preserve the first and second order proximity as the local neighborhood by sampling paths with short random walks [17, 15, 6, 11]. Some studies use network embedding for node and graph classification [24, 10, 23], and others use it for graph clustering [3, 2, 7].

DeepWalk preserves the higher order proximity between vertices by generating random walks of fixed length from all the vertices of a graph. Treating the walks as sentences in a language model, it optimizes the log-likelihood of random walks using the Skip-gram model [21], which was designed for learning word embeddings. DeepWalk uses hierarchical softmax to make the optimization efficient. Node2vec, one of the many extensions of DeepWalk, improves the random walk phase of DeepWalk. It applies biased random walks using the return parameter $p$ and the in-out parameter $q$ to combine DFS-like and BFS-like neighborhood explorations. In this way, it preserves both the community structure of the network and the structural roles of vertices. Unlike DeepWalk, Node2vec uses negative sampling for optimization.

Optimization in these methods can easily get stuck at a bad local minimum as a result of poor initialization. Moreover, while preserving the local proximities of vertices in a network, they may not preserve the global structure of the network. To address these issues, a multilevel graph representation learning paradigm, HARP, is proposed in [10] as a graph preprocessing step. In this approach, related vertices in the network are combined into super-nodes hierarchically, at varying levels of coarseness. After the embedding of the coarsened network is learned with a state-of-the-art graph embedding method, the learned embedding is used as the initial value for the next level. Coalescing captures the global structure of the input graph, and learning the representation on these smaller graphs provides a good initialization that improves the performance of the state-of-the-art methods.

NECL uses graph coarsening to capture the local structure of the network, without a hierarchy of levels, in order to improve the efficiency of the random walk based state-of-the-art methods.

Graph compressing. Although recent network embedding methods perform well across various applications, challenges remain because real-world graphs are massive in scale, which may obstruct the direct application of existing methods. When we instead consider a compressed or summary graph that conserves the key structure and patterns of the original graph, many methods become applicable to large graphs [19].

Graph compressing algorithms, which are popular in the graph mining community, compress a graph into a smaller one while preserving certain properties of the original graph, such as connectivity [34]. Vertices with similar characteristics are grouped and represented by super-nodes. Approximations obtained by compression are used to solve original problems more efficiently, such as all-pairs shortest paths and search engine storage and retrieval [1, 28]. Using an approximation of the original graph not only makes a complex problem simpler but also provides a good initialization for solving the problem. This approach has proved successful in various graph theory problems [13].

NECL extends the idea of the graph compressing layout to network representation learning methods. We illustrate the utility of this paradigm by combining NECL with two state-of-the-art representation learning methods, DeepWalk and Node2vec.

5 CONCLUSIONS

We propose NECL, a novel and efficient network embedding method that preserves the local structural features of the vertices. To overcome the efficiency limitations of the state-of-the-art methods, we extend the idea of the graph compressing layout to network representation learning methods. We combine related vertices of a network into super-nodes that preserve the neighborhood information of the vertices. Then, we use the compressed graph to learn the representation of the vertices in the original graph. We demonstrate the utility of this paradigm by combining NECL with two state-of-the-art representation learning methods, DeepWalk and Node2vec. Extensive experiments on a variety of real-world graphs validate the efficiency of our approach on challenging multi-class and multi-label classification tasks without decreasing the effectiveness.
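The overall pipeline summarized above can be sketched in a few lines. The sketch makes several assumptions: the grouping of vertices into super-nodes is already given as a `membership` dictionary, the compressed graph is the standard quotient graph of that grouping, and every original vertex simply inherits the vector of its super-node. The names `quotient_graph` and `project_embeddings`, the toy graph and the toy vectors are all hypothetical, and the embedding step itself (for example DeepWalk or Node2vec on the compressed graph) is not shown.

```python
def quotient_graph(adj, membership):
    """Collapse vertices into super-nodes according to `membership`.

    adj        : dict mapping each vertex to the set of its neighbours
    membership : dict mapping each vertex to the id of its super-node
    """
    compressed = {s: set() for s in set(membership.values())}
    for u, neighbours in adj.items():
        for v in neighbours:
            su, sv = membership[u], membership[v]
            if su != sv:                 # edges inside a super-node disappear
                compressed[su].add(sv)
                compressed[sv].add(su)
    return compressed


def project_embeddings(super_emb, membership):
    """Give every original vertex the vector of its super-node (assumption)."""
    return {v: super_emb[s] for v, s in membership.items()}


# Hypothetical toy graph: a and b have the same neighbours {c, d}.
adj = {"a": {"c", "d"}, "b": {"c", "d"}, "c": {"a", "b"}, "d": {"a", "b"}}
membership = {"a": "ab", "b": "ab", "c": "c", "d": "d"}
small_graph = quotient_graph(adj, membership)

# Placeholder vectors standing in for embeddings learned on `small_graph`.
super_emb = {"ab": [0.1, 0.2], "c": [0.3, 0.4], "d": [0.5, 0.6]}
emb = project_embeddings(super_emb, membership)
```

In this toy case the compressed graph has three vertices and two edges instead of four vertices and four edges, so random walks and Skip-gram training run on a smaller input.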

One future extension of NECL is to combine it with other kinds of graph representation learning methods, such as those based on matrix factorization and deep neural networks, to see whether it also works well with them. Another extension we are planning is to use different similarity measures for compression in order to preserve different properties of the network.

Bibliography

[1] M. Adler and M. Mitzenmacher. Towards compressing web graphs. In Proceedings of Data Compression Conference, DCC 2001, pages 203–212. IEEE, 2001.
[2] E. Akbas and P. Zhao. Attributed graph clustering: An attribute-aware graph embedding approach. In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017, ASONAM ’17, pages 305–308, New York, NY, USA, 2017. ACM.
[3] E. Akbas and P. Zhao. Graph Clustering Based on Attribute-Aware Graph Embedding, pages 109–131. Springer International Publishing, Cham, 2019.
[4] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, NIPS’01, pages 585–591, Cambridge, MA, USA, 2001. MIT Press.
[5] S. Bhagat, G. Cormode, and S. Muthukrishnan. Node classification in social networks. In Social network data analytics, pages 115–148. Springer, 2011.
[6] H. Cai, V. W. Zheng, and K. C.-C. Chang. A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering, 30(9):1616–1637, 2018.
[7] S. Cao, W. Lu, and Q. Xu. Grarep: Learning graph representations with global structural information. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM ’15, pages 891–900, New York, NY, USA, 2015. ACM.
[8] S. Cao, W. Lu, and Q. Xu. Deep neural networks for learning graph representations. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
