Learning Edge Representations via Low-Rank Asymmetric Projections

Sami Abu-El-Haija; Bryan Perozzi; Rami Al-Rfou

arXiv:1705.05615·cs.LG·September 15, 2017


TL;DR

This paper introduces a novel graph embedding method that explicitly models directed edges and uses a graph likelihood objective, resulting in more concise, accurate representations that outperform previous methods in link prediction tasks.

Contribution

The paper presents a new approach combining explicit edge modeling and a graph likelihood objective, improving embedding quality and efficiency for directed graphs.

Findings

01

Achieves up to 76% error reduction in link prediction.

02

Produces embeddings 10 times smaller with higher accuracy.

03

Effectively models directed edges in graph embeddings.

Abstract

We propose a new method for embedding graphs while preserving directed edge information. Learning such continuous-space vector representations (or embeddings) of nodes in a graph is an important first step for using network information (from social networks, user-item graphs, knowledge bases, etc.) in many machine learning tasks. Unlike previous work, we (1) explicitly model an edge as a function of node embeddings, and we (2) propose a novel objective, the "graph likelihood", which contrasts information from sampled random walks with non-existent edges. Individually, both of these contributions improve the learned representations, especially when there are memory constraints on the total size of the embeddings. When combined, our contributions enable us to significantly improve the state-of-the-art by learning more concise representations that better preserve the graph structure.…

Tables

Table 1. Link prediction results from ranking $E_{\text{test}}$ across five graph datasets. Numbers shown are the ROC-AUC. Row-wise: the first two datasets are directed and the last three are undirected graphs. Column-wise: the first three methods are adjacency (non-embedding) methods that use the direct neighbors $N(u)$ and $N(v)$ for computing $g(u,v)$. We then list all embedding methods, preceded by $d$, the dimensionality of the learned embeddings. We report three embedding baselines followed by our four embedding methods. We compare embedding methods across dimensionalities $\{8,16,32,64,128\}$, marking in bold the top performer for the given dimensionality. For asymmetric embedding methods, we train with half of the dimensionality ($\{4,8,16,32,64\}$), since in practice we need to store both sides of the embedding to compute an edge score; every row therefore contains the same total dimensions per node. The last column shows the relative reduction in error of our Asymmetric Deep model compared to the best baseline. $\dagger$ indicates that the algorithm runs out of memory on our machine with 32 GB RAM.

Graph type	Dataset	Jaccard	Common Neighbors	Adamic Adar	d	Eigen Maps	node2vec	DNGR	Symmetric shallow (ours)	Symmetric deep (ours)	Asymmetric shallow (ours)	Asymmetric deep (ours)	% Error Reduction
directed	soc-epinions	0.649	0.649	0.647	8	$†$	0.725	$†$	0.694	0.665	0.695	0.825	36.5%
					16	$†$	0.726	$†$	0.710	0.713	0.699	0.840	41.4%
					32	$†$	0.714	$†$	0.740	0.713	0.700	0.845	45.9%
					64	$†$	0.699	$†$	0.766	0.722	0.698	0.834	44.9%
					128	$†$	0.691	$†$	0.782	0.743	0.718	0.828	44.5%
	wiki-vote	0.579	0.580	0.562	8	0.613	0.643	0.630	0.603	0.602	0.608	0.871	63.7%
					16	0.607	0.642	0.622	0.623	0.639	0.643	0.900	71.9%
					32	0.600	0.641	0.619	0.642	0.661	0.683	0.911	75.2%
					64	0.613	0.642	0.598	0.660	0.672	0.702	0.917	76.7%
					128	0.622	0.643	0.554	0.682	0.685	0.730	0.917	76.8%
undirected	ca-HepTh	0.765	0.765	0.765	8	0.786	0.731	0.706	0.855	0.848	0.605	0.879	43.2%
					16	0.790	0.787	0.780	0.894	0.826	0.885	0.899	51.9%
					32	0.795	0.858	0.829	0.896	0.886	0.884	0.911	37.8%
					64	0.802	0.886	0.868	0.878	0.884	0.870	0.910	21.3%
					128	0.812	0.901	0.897	0.891	0.897	0.820	0.916	14.6%
	ca-AstroPh	0.942	0.942	0.944	8	0.825	0.811	0.852	0.923	0.925	0.592	0.917	44.1%
					16	0.825	0.833	0.877	0.950	0.923	0.657	0.945	55.8%
					32	0.825	0.899	0.917	0.955	0.938	0.942	0.955	46.1%
					64	0.824	0.934	0.939	0.948	0.936	0.936	0.958	30.7%
					128	0.829	0.955	0.968	0.953	0.936	0.939	0.957	n/a
	PPI	0.766	0.776	0.779	8	0.710	0.733	0.583	0.746	0.763	0.550	0.804	26.6%
					16	0.711	0.707	0.687	0.780	0.772	0.786	0.817	36.7%
					32	0.709	0.691	0.741	0.779	0.784	0.794	0.833	35.5%
					64	0.707	0.671	0.767	0.791	0.767	0.813	0.837	30.0%
					128	0.737	0.698	0.769	0.795	0.787	0.799	0.841	31.0%
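
The "% Error Reduction" column can be approximately reproduced from the AUC values in the table if error is taken to be $1-\text{AUC}$ and the comparison is made against the best embedding baseline in the same row. That reading is an assumption inferred from the reported numbers (it matches them up to rounding of the AUCs), and the function name `error_reduction` is hypothetical. A minimal sketch:

```python
def error_reduction(auc_ours: float, auc_baseline: float) -> float:
    """Relative reduction in error, assuming error is defined as 1 - AUC."""
    err_baseline = 1.0 - auc_baseline
    err_ours = 1.0 - auc_ours
    return (err_baseline - err_ours) / err_baseline


# wiki-vote, d = 128: best embedding baseline is node2vec (0.643),
# the Asymmetric Deep model reaches 0.917.
print(f"{error_reduction(0.917, 0.643):.1%}")  # prints 76.8%
```

Because the table rounds AUCs to three decimals, other rows recomputed this way can differ from the printed percentage in the last digit.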

Table 2. Generalization performance, measured by averaging $\left(\frac{\text{test AUC}}{\text{train AUC}}\right)$ across all runs for all datasets under two settings: “shallow” vs. “deep” asymmetric, i.e. absence vs. presence of the DNN $f()$. The last two columns show the t-test for the difference of test-over-train AUC between the two settings. Adding a DNN to the model is a statistically significant improvement on all graphs with $p<0.001$.

Dataset	mean $\left(\frac{\text{test AUC}}{\text{train AUC}}\right)$, shallow asymmetric	mean $\left(\frac{\text{test AUC}}{\text{train AUC}}\right)$, deep asymmetric	t-statistic	p-value
soc-epinions	$0.841$	$0.882$	5.797673	1.53E-06
wiki-vote	$0.908$	$0.948$	3.881161	4.32E-04
ca-HepTh	$0.881$	$0.915$	5.202880	1.62E-05
ca-AstroPh	$0.946$	$0.970$	5.946066	4.08E-07
ppi	$0.865$	$0.893$	4.187474	8.45E-05

Table 3. Inter-quartile ranges of the standard deviation of the embedding norm, across all runs. For shallow models, we calculate $\text{std}_{u\in V}(||Y_{u}||)$, where $\text{std}_{u\in V}(.)$ is the standard deviation over all $u\in V$, and we display the statistics across all runs (e.g. different embedding dimensions). For deep models, we calculate $\text{std}_{u\in V}(||f(Y_{u})||)$.

Dataset	Shallow Symmetric			Shallow Asymmetric			Deep Symmetric			Deep Asymmetric
	$25^{th}$	$50^{th}$	$75^{th}$	$25^{th}$	$50^{th}$	$75^{th}$	$25^{th}$	$50^{th}$	$75^{th}$	$25^{th}$	$50^{th}$	$75^{th}$
wiki-vote	0.106	0.202	0.327	0.122	0.142	0.152	0.597	0.901	1.119	0.811	1.096	1.938
soc-epinions	0.382	0.526	0.754	0.299	0.345	0.430	0.888	1.147	1.404	1.276	1.881	3.590
ppi	1.884	4.593	7.842	0.858	1.197	3.801	0.825	1.095	1.410	1.015	1.443	2.645
ca-HepTh	7.370	10.871	12.426	3.093	3.916	7.892	0.957	1.364	1.709	1.120	1.417	2.292
ca-AstroPh	12.069	326.483	4282.095	1.108	4.648	24.271	0.874	1.273	1.999	1.065	1.826	3.062

Equations

$$\min_{Y}\ \sum_{(u,v)\in E}A_{uv}\,||Y_{u}-Y_{v}||_{2}^{2}\quad\text{s.t.}\quad Y^{T}DY=I$$

$$\frac{\exp(Y_{w_{1}}^{T}Y_{w_{2}})}{\sum_{j}\exp(Y_{w_{j}}^{T}Y_{w_{2}})}$$

$$\min_{Y}\ \log Z-\sum_{u\in V,\,v\in V}D_{uv}\,(Y_{u}^{T}Y_{v})$$

$$f_{\theta}:Y_{u}\rightarrow\dots\rightarrow\text{FC}_{\{W_{2},b_{2}\}}\rightarrow\text{BatchNorm}\rightarrow f_{\theta}(Y_{u})$$

$$g(u,v)=f(Y_{u})^{T}\times M\times f(Y_{v})$$

$$g^{(2)}(u,v)=\left\langle w_{g}^{(2)},\,[\text{relu}(g_{1}(u,v)),\dots,\text{relu}(g_{h}(u,v))]\right\rangle$$

$$\Pr(G)=\prod_{(u,v)\in E_{\text{train}}}Q(u,v)\prod_{(u,v)\notin E_{\text{train}}}\big(1-Q(u,v)\big)$$

$$\prod_{u\in V}\prod_{v\in V}Q(u,v)^{\mathbbm{1}[(u,v)\in E_{\text{train}}]}\,\big(1-Q(u,v)\big)^{\mathbbm{1}[(u,v)\notin E_{\text{train}}]}$$

$$\Pr(G)\propto\prod_{u\in V}\prod_{v\in V}\sigma(g(u,v))^{\mathcal{D}_{uv}}\,\big(1-\sigma(g(u,v))\big)^{\mathbbm{1}[(u,v)\notin E_{\text{train}}]}$$

$$u_{1}\rightarrow u_{2}\rightarrow u_{3}\rightarrow\dots\rightarrow u_{\tau}$$

$$(u_{i},u_{j})\quad\forall j\in\mathbb{Z},\ i-w_{l}\leq j\leq i+w_{r},\ j\neq i$$

$$|\mathcal{D}|=n\tau\,\mathcal{O}\big((w_{l}+w_{r})(\tau-(w_{l}+w_{r}+1))\,|V|\big)=\mathcal{O}(|V|)$$

$$\bar{u}=\{v_{1}^{-},v_{2}^{-},\dots\}\quad\text{s.t.}\quad\forall v^{-}\in\bar{u},\ (u,v^{-})\notin E_{\text{train}}$$

$$\mathcal{L}=\mathop{\mathbb{E}}_{(u,v)\sim\mathcal{D}/Z}\bigg[\log\sigma(g(u,v))+\sum_{v^{-}\in\text{Sample}(K,\bar{u})}\log(1-\sigma(g(u,v^{-})))\bigg]$$

$$g(u,v)=\frac{|N(u)\cap N(v)|}{|N(u)\cup N(v)|}$$

$$g(u,v)=|N(u)\cap N(v)|$$

$$g(u,v)=\sum_{x\in N(u)\cap N(v)}\frac{1}{\log(|N(x)|)}$$

$$\operatorname*{arg\,max}_{v}\ \langle L^{T}f(Y_{u}),Rf(Y_{v})\rangle=\operatorname*{arg\,min}_{v}\ ||L^{T}f(Y_{u})-Rf(Y_{v})||$$
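
The three neighborhood-based scores in the list above (Jaccard, Common Neighbors and Adamic Adar, the non-embedding baselines of Table 1) can be sketched in a few lines. This is an illustrative sketch only: the function names, the dictionary-of-sets graph representation and the toy graph are assumptions, not taken from the paper's code.

```python
import math


def jaccard(nbrs, u, v):
    """|N(u) & N(v)| / |N(u) | N(v)|, defined as 0 when both sets are empty."""
    union = nbrs[u] | nbrs[v]
    return len(nbrs[u] & nbrs[v]) / len(union) if union else 0.0


def common_neighbors(nbrs, u, v):
    """|N(u) & N(v)|."""
    return len(nbrs[u] & nbrs[v])


def adamic_adar(nbrs, u, v):
    """Sum of 1 / log|N(x)| over common neighbors x.

    Neighbors of degree 1 are skipped to avoid dividing by log(1) = 0.
    """
    return sum(
        1.0 / math.log(len(nbrs[x]))
        for x in nbrs[u] & nbrs[v]
        if len(nbrs[x]) > 1
    )


# Toy undirected graph (hypothetical): node -> set of direct neighbors.
nbrs = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b"}, "d": {"b"}}
print(jaccard(nbrs, "a", "d"), common_neighbors(nbrs, "a", "d"))  # prints 0.5 1
```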


Code & Models

Repositories

google/asymproj_edge_dnn
tf


Full text

Learning Edge Representations via Low-Rank Asymmetric Projections

Sami Abu-El-Haija (Google Research, Mountain View, California), Bryan Perozzi (Google Research, New York City, New York), and Rami Al-Rfou (Google Research, Mountain View, California)

(2017)

Abstract.

We propose a new method for embedding graphs while preserving directed edge information. Learning such continuous-space vector representations (or embeddings) of nodes in a graph is an important first step for using network information (from social networks, user-item graphs, knowledge bases, etc.) in many machine learning tasks.

Unlike previous work, we (1) explicitly model an edge as a function of node embeddings, and we (2) propose a novel objective, the graph likelihood, which contrasts information from sampled random walks with non-existent edges. Individually, both of these contributions improve the learned representations, especially when there are memory constraints on the total size of the embeddings. When combined, our contributions enable us to significantly improve the state-of-the-art by learning more concise representations that better preserve the graph structure.

We evaluate our method on a variety of link-prediction tasks including social networks, collaboration networks, and protein interactions, showing that our proposed method learns representations with error reductions of up to 76% and 55% on directed and undirected graphs, respectively. In addition, we show that the representations learned by our method are quite space efficient, producing embeddings which have higher structure-preserving accuracy but are 10 times smaller.

Keywords: Graph, Edge Learning, Embedding, Random Walk, Link Prediction, Representation Learning

Conference: CIKM’17, November 6–10, 2017, Singapore, Singapore. Copyright: rights retained. DOI: 10.1145/3132847.3132959. ISBN: 978-1-4503-4918-5/17/11.

1. Introduction

Recent advancements in learning embedding vectors for words have resulted in a proliferation of methods which learn continuous space representations of graphs (e.g. DeepWalk (Perozzi et al., 2014)). These approaches process a graph and encode each node as a (real-valued) embedding vector, enabling easy integration with existing machine learning algorithms.

Such embedding methods learn a vector space that closely preserves the graph structure. Two nodes have large similarity (or small distance) in the embedding space if they are strongly connected in the original (discrete) graph. Edges can be weighted or unweighted. Traditional eigen methods (Hagen and Kahng, 1992; Shi and Malik, 2000; Belkin and Niyogi, 2001) learn embeddings that minimize the Euclidean distance of connected nodes, which can be solved (with orthonormal constraints) by eigen-decomposition of the symmetric graph Laplacian. Recent random-walk embedding methods (Perozzi et al., 2014; Grover and Leskovec, 2016) learn representations which encode the random walk transition matrix. These methods embed two nodes close together if they co-occur frequently in short random walks. In general, random-walk methods outperform “eigen” methods at producing vector representations that preserve the graph structure.

However, recent random-walk embedding methods have two shortcomings. First, these methods do not explicitly model edges. This node-centric assumption represents an edge $(u,v)$ identically to its reverse counterpart $(v,u)$, and is unable to capture asymmetric relationships. Second, to preserve the graph structure they embed nodes into a relatively high-dimensional space, sometimes producing an embedding dictionary larger than the sparse adjacency matrix.

In this work we propose to address these limitations by explicitly modeling edges in the network as a function of the nodes. Specifically, we model edges by (i) using a Deep Neural Network (DNN) to map nodes onto a low-dimensional manifold, (ii) defining an edge function between two nodes as a projection in the manifold coordinates, and (iii) jointly-optimizing the edge function and the manifold by maximizing a new objective we propose, the graph likelihood, which we define as a product of the edge function over all node pairs.

More formally, we learn an embedding vector $Y_{u}\in\mathbb{R}^{D}$ for every graph node $u$, a manifold-mapping Deep Neural Network (DNN) $f:\mathbb{R}^{D}\rightarrow\mathbb{R}^{d}$ that is shared across all nodes, and an asymmetric edge function $g:(\mathbb{R}^{d}\times\mathbb{R}^{d})\rightarrow\mathbb{R}$ to represent edges in the graph. Our entire model $g(u,v)=f(Y_{u})^{T}\times M\times f(Y_{v})$ is end-to-end differentiable. $M$ is low-rank, as $M=L\times R$, where both $L\in\mathbb{R}^{d\times b}$ and $R\in\mathbb{R}^{b\times d}$ project the node manifold coordinates to a smaller space $\mathbb{R}^{b}$. Since $b$ is much smaller than $D$, we are able to reduce the final node embedding significantly. Figure 1 shows a depiction of our architecture. Our desired likelihood is quadratic, but we estimate it with a tractable linear objective using negative sampling, similar to (Mikolov et al., 2013).

We find that explicitly modeling edges can drastically reduce the representation dimensionality, for both directed and undirected graphs, especially when coupled with a Deep Neural Network. Further, modeling asymmetry by representing edge $(u,v)$ differently than $(v,u)$ gives an additional performance boost when preserving the structure of directed graphs. We perform an extrinsic evaluation of our method, by comparing it to the state-of-the-art on link-prediction tasks over a variety of graphs from social networks, biology, and e-commerce. We show that we can consistently learn orders of magnitude smaller embedding dimensions, while improving ROC-AUC metrics. For example, we reduce the error on directed graphs by up to $\approx 70\%$ and undirected graphs by up to $\approx 50\%$ when using same-sized representations. However, when our model is restricted to representations which are 8 times smaller than the baselines, we reduce the error in some cases by up to $66\%$ on directed graphs and $16\%$ on undirected graphs. We perform intrinsic evaluations, by training and rendering two-dimensional embedding spaces for two datasets, which we use to gain intuition on placement choices made by our model.

To summarize, our contributions are as follows.

(1) We propose to explicitly model a directed edge function, which we define as low-rank affine projections on a manifold that is produced by a Deep Neural Network (i.e. “deep embeddings”).

(2) We propose a new objective function, the graph likelihood.

(3) These aspects significantly improve the state-of-the-art on learning continuous graph representations, especially on directed graphs, while producing significantly smaller representation spaces, as evaluated on five graph datasets.

2. Edge Representations

It is common to embed a graph by learning one continuous $D$ -dimensional vector $Y_{u}\in\mathbb{R}^{D}$ for every graph node $u\in V$ , where relationships between nodes $u$ and $v$ are captured in a very coarse way, through the use of a distance measure (e.g. $dist(Y_{u},Y_{v})$ ). This node-centric modeling assumes that all relationships in the graph are symmetric – a limiting assumption which fails to capture any directed relationships.

We seek to model the asymmetry which occurs in many real world graphs. Specifically, given two nodes, $u$ and $v$ , we desire that their distances are allowed to differ ( $dist(Y_{u},Y_{v})\neq dist(Y_{v},Y_{u})$ ) to reflect ordering in directed relationships, such as follower and followee on Twitter. In addition, the asymmetry can also model degree variance in undirected graphs. Consider a popular node $m$ , then the optimization could make $dist(Y_{u},Y_{m})$ small for all $u$ but not necessarily $dist(Y_{m},Y_{u})$ .

Even though it is possible to learn one representation $Y_{(u,v)}$ for all node pairs $(u,v)$, this direct modeling is prohibitive in practice and requires an upper-bound space of $O(|V|^{2})$. Instead, we propose to learn a trainable edge function defined over node embedding coordinates. Specifically, we learn asymmetric transformations of the nodes, which generate, for a node $u$, two representations: one when it is the source of a directed edge, $\hat{Y}^{\text{source}}_{u}$, and one when it is a destination, $\hat{Y}^{\text{dest}}_{u}$. These representations share a neural network $f$, and can be combined for any pair of nodes to model the strength of their directed relationships. That is, for nodes $u$ and $v$, we represent $(u,v)$ and $(v,u)$ as $dist(\hat{Y}^{\text{source}}_{u},\hat{Y}^{\text{dest}}_{v})$ and $dist(\hat{Y}^{\text{source}}_{v},\hat{Y}^{\text{dest}}_{u})$, respectively.

3. Preliminaries

3.1. Link Prediction

Link prediction is the problem of inferring missing edges in a graph. We use it to evaluate the generalization ability of our embedding spaces, as we aim to preserve the graph structure. The common setup (Grover and Leskovec, 2016) is to “hold out” test edges $E_{\textrm{test}}\subset E$ and train on the remaining $E_{\textrm{train}}=E-E_{\textrm{test}}$. Structure-preserving representations should retrieve the held-out $E_{\textrm{test}}$ with high accuracy.

3.2. Graph Embedding

Graph embedding approaches learn a $D$-dimensional embedding dictionary ${\bf Y}\in\mathbb{R}^{|V|\times D}$, containing a continuous real-valued vector $Y_{u}\in\mathbb{R}^{D}$ for every graph node $u\in V$. Earlier approaches to computing embeddings include Eigenmaps (Belkin and Niyogi, 2001), which embeds $Y_{u}$ and $Y_{v}$ to be close if they are connected (i.e. $(u,v)\in E$ or similarly $A_{uv}=1$). Formally, Eigenmaps learns embeddings by minimizing an objective:

$$\min_{Y}\ \sum_{(u,v)\in E}A_{uv}\,||Y_{u}-Y_{v}||_{2}^{2}\quad\text{s.t.}\quad Y^{T}DY=I,\tag{1}$$

where the weight of edge $(u,v)$ is stored in the adjacency matrix at $A_{uv}$ and $D$ is a diagonal weight matrix with $D_{vv}=\sum_{u}A_{uv}$. This optimization yields an embedding space where $Y_{u}$ and $Y_{v}$ are near if $A_{vu}$ is large (or non-zero). Equation (1) also appears in equivalent forms in (Hagen and Kahng, 1992; Shi and Malik, 2000). Furthermore, Bregman iterations have been proposed to optimize an L1-formulation of the above objective function (Yu et al., 2015).

3.3. Word2vec

Word2vec (Mikolov et al., 2013) processes a big text corpus (e.g. Wikipedia) and learns one embedding vector for every unique word. If two words $w_{1}$ and $w_{2}$ are frequently “close” (e.g. in same sentence), then the dot product of embeddings $Y_{w_{1}}^{T}Y_{w_{2}}$ is maximized. In particular, every time two words $w_{1}$ and $w_{2}$ are within $C$ words away, where integer $C$ is the “context window size” hyperparameter, then a gradient step increments this likelihood:

$$\frac{\exp(Y_{w_{1}}^{T}Y_{w_{2}})}{\sum_{j}\exp(Y_{w_{j}}^{T}Y_{w_{2}})}.\tag{2}$$

We refer the reader to (Mikolov et al., 2013) for further information. (Footnote 1: The denominator, rather than summing over all words, is approximated by hierarchical softmax. In addition, their original formulation learns two vectors per word, one used as “input” and another used as “output”.)

3.4. Graph Embedding with Random Walks

Rather than operating directly on the adjacency matrix ${\bf A}$, another embedding strategy has been recently proposed by Perozzi, Al-Rfou, and Skiena (Perozzi et al., 2014). Their method, DeepWalk, introduced a new class of random walk methods, which extend a node’s direct neighbors to include nodes that are within a small number of hops. These approaches sample many random walks from the graph. If nodes $u$ and $v$ are frequently close in the random walks, then the model learns a representation such that the inner product $\langle Y_{u},Y_{v}\rangle$ is a large positive value. Algorithm 1 extracts random walks. It begins by computing a probability transition matrix $\mathbf{\pi}$, where $\pi_{u\rightarrow v}$ indicates the probability of a random walker visiting node $v$ conditioned on the current node being $u$.

This model has been extended by node2vec (Grover and Leskovec, 2016) to use a second-order probability transition function containing $\pi_{t\rightarrow u\rightarrow v}$, where the probability of a random walker visiting node $v$ is conditioned on the current node $u$ and the previous node $t$. Node2vec’s random walk uses hyper-parameters $p$ and $q$, which effectively yield a graph traversal algorithm that interpolates between Depth-First Search (DFS) and Breadth-First Search (BFS). We refer the reader to (Grover and Leskovec, 2016) for further details. We adopt this method for generating random walks in our work.
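
To make the walk-sampling step concrete, here is a minimal sketch of first-order random walks with uniform transition probabilities. It is a deliberate simplification and not the second-order node2vec walk with $p$ and $q$ that the text adopts; the function name `sample_walks`, the adjacency-list format and the `seed` argument are assumptions made for illustration.

```python
import random


def sample_walks(adj, n, tau, seed=0):
    """Sample n uniform first-order random walks of length tau from every node.

    adj maps each node to a list of its successors. A walk stops early if it
    reaches a node with no outgoing edges (possible in directed graphs).
    """
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(n):
            walk = [start]
            while len(walk) < tau:
                successors = adj[walk[-1]]
                if not successors:
                    break  # dead end: no outgoing edge to follow
                walk.append(rng.choice(successors))
            walks.append(walk)
    return walks


# Toy directed cycle (hypothetical): 0 -> 1 -> 2 -> 0.
walks = sample_walks({0: [1], 1: [2], 2: [0]}, n=2, tau=5)
print(walks[0])  # prints [0, 1, 2, 0, 1]
```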

After sampling random walks, DeepWalk and node2vec treat each walk $(u_{1}\rightarrow u_{2}\rightarrow\dots\rightarrow u_{\tau})$ as a sequence, and then apply the skip-gram model to compute embeddings per word (i.e. node). The objective that they minimize is:

$$\min_{Y}\ \log Z-\sum_{u\in V,\,v\in V}D_{uv}\,(Y_{u}^{T}Y_{v}),\tag{3}$$

where $D_{uv}$ is the number of times nodes $u$ and $v$ appear close to each other (i.e. within the context size) in all random walks. We extend these random walk methods in three important ways. First, rather than using word2vec’s objective (Equation 3), we propose a novel alternative objective, the graph likelihood (see Section 4.2). Second, we explicitly represent an edge function as a function of nodes which we jointly train (see Section 4). Third, we define the “context” for directed graphs differently than for undirected ones. Specifically, a random walk $u_{1}\rightarrow u_{2}\rightarrow u_{3}\rightarrow u_{4}\rightarrow u_{5}$ would produce $\{u_{1},\dots,u_{5}\}$ as the context of $u_{3}$ if the graph is undirected and would produce $\{u_{4},u_{5}\}$ as the context if the graph is directed.

4. Our Method

We explain the details of our model and how we train it. The source code is made available online (Footnote 2: Code available at http://sami.haija.org/graph/deep_embedding.html).

4.1. Model

Given an (un)directed graph $G=(V,E)$ , we learn an embedding vector $Y_{u}\in\mathbb{R}^{D}$ for every node $u\in V$ . In addition, we learn a Deep Neural Network (DNN) $f_{\theta}:\mathbb{R}^{D}\rightarrow\mathbb{R}^{d}$ that maps a node onto a low-dimensional manifold. $f_{\theta}$ is depicted in Figure 2, and is defined as:

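$$f_{\theta}:Y_{u}\rightarrow\dots\rightarrow\text{FC}_{\{\mathbf{W}_{2},\mathbf{b}_{2}\}}\rightarrow\text{BatchNorm}\rightarrow f_{\theta}(Y_{u}),$$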

where $\text{FC}_{\{\mathbf{W},\mathbf{b}\}}$ is a fully-connected layer with weight matrix $\mathbf{W}$ and bias vector $\mathbf{b}$ , BatchNorm is described in (Ioffe and Szegedy, 2015), $\text{relu}(x)=\max(0,x)$ is an element-wise activation function, and $\theta=\{\mathbf{W}_{1},\mathbf{b}_{1},\mathbf{W}_{2},\mathbf{b}_{2},\dots\}$ .

We define a general class of edge functions $g(u,v)\in\mathbb{R}$ where symmetry is not imposed, allowing $g(u,v)\neq g(v,u)$. Consider a low-rank affine projection in the manifold space:

$$g(u,v)=f(Y_{u})^{T}\times M\times f(Y_{v}),\tag{4}$$

where low-rank projection matrix $M=L\times R$ with $L\in\mathbb{R}^{d\times b}$ and $R\in\mathbb{R}^{b\times d}$ . We refer to $b$ as the bottleneck dimension and we experiment with $b<d<D$ . We can factor $g(u,v)$ into an inner product $\langle L^{T}f(Y_{u}),Rf(Y_{v})\rangle$ . We refer to $L^{T}f(Y_{u})\in\mathbb{R}^{b}$ and $Rf(Y_{v})\in\mathbb{R}^{b}$ , respectively, as the left- and right-asymmetric embeddings.
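
The factorization of $g(u,v)$ into an inner product of left and right embeddings can be checked numerically. The sketch below uses plain Python lists; the helper names (`matvec`, `transpose`, `dot`, `edge_score`) and the toy values of $L$, $R$, $f(Y_{u})$ and $f(Y_{v})$ are hypothetical, and the DNN outputs are simply taken as given vectors.

```python
def matvec(A, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def edge_score(L, R, f_u, f_v):
    """g(u, v) = <L^T f(Y_u), R f(Y_v)>, equal to f(Y_u)^T (L R) f(Y_v)."""
    left = matvec(transpose(L), f_u)   # left embedding of u, in R^b
    right = matvec(R, f_v)             # right embedding of v, in R^b
    return dot(left, right)


# Toy sizes: d = 3 manifold dimensions, b = 2 bottleneck dimensions.
L = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # d x b
R = [[0.0, 1.0, 0.0], [1.0, 0.0, 2.0]]     # b x d
f_u = [1.0, 2.0, 0.0]                      # stands in for f(Y_u)
f_v = [0.0, 1.0, 1.0]                      # stands in for f(Y_v)

print(edge_score(L, R, f_u, f_v), edge_score(L, R, f_v, f_u))  # prints 5.0 4.0
```

The two printed scores differ, which is the asymmetry the edge function is meant to allow: swapping source and destination swaps which node passes through $L^{T}$ and which through $R$.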

We note Equation (4) can be extended to use a combination of multiple low-rank affine projections, as:

$$g^{(2)}(u,v)=\left\langle w_{g}^{(2)},\,[\text{relu}(g_{1}(u,v)),\dots,\text{relu}(g_{h}(u,v))]\right\rangle,\tag{5}$$

where $w^{(2)}_{g}\in\mathbb{R}^{h}$ is a parameter vector of the output layer, $h$ is the number of projections, and each projection $g_{i}$ has its own $L_{i}\in\mathbb{R}^{d\times b}$ and $R_{i}\in\mathbb{R}^{b\times d}$. Even though the total size of the parameters for ${\bf Y},f,g$ may be large, we can have a low memory footprint during inference if we precompute $L_{i}^{T}f(Y_{u})$ and $R_{i}f(Y_{u})$ for every $u\in V$.
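
As a small illustration of the multi-projection form, the sketch below combines $h$ per-projection scores with a relu and an output weight vector. The function names and the numeric values are hypothetical; the per-projection scores $g_{i}(u,v)$ are assumed to have been computed already (for instance from precomputed left and right embeddings).

```python
def relu(x):
    return max(0.0, x)


def combined_edge_score(w_g, projection_scores):
    """Weighted sum of relu(g_i(u, v)) over the h projections."""
    if len(w_g) != len(projection_scores):
        raise ValueError("need one weight per projection")
    return sum(w * relu(s) for w, s in zip(w_g, projection_scores))


# h = 2 projections: the negative score is zeroed by the relu.
print(combined_edge_score([0.5, 2.0], [5.0, -3.0]))  # prints 2.5
```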

4.2. Graph Likelihood

We introduce our proposed objective, step-by-step. We start with intuitions from the Maximum Likelihood Estimate of Logistic Regression. Given a training graph $G=(V,E_{\text{train}})$ , one can define a probability measure as a product of an edge estimate $Q$ on all node pairs:

$$\Pr(G)=\prod_{(u,v)\in E_{\text{train}}}Q(u,v)\prod_{(u,v)\notin E_{\text{train}}}\big(1-Q(u,v)\big),\tag{6}$$

where $Q:V\times V\rightarrow[0,1]$ is a trainable edge estimator. If $Q$ is a perfect estimator, then it should output $1$ on all $(u,v)\in E$ and should output $0$ on all $(u,v)\notin E$, which makes $Pr(x)=1$ iff $x=G$. An equivalent form of equation (6) is:

$$\prod_{u\in V}\prod_{v\in V}Q(u,v)^{\mathbbm{1}[(u,v)\in E_{\text{train}}]}\,\big(1-Q(u,v)\big)^{\mathbbm{1}[(u,v)\notin E_{\text{train}}]}\tag{7}$$

where the indicator function $\mathbbm{1}[x]=1$ if predicate $x$ is true and is $0$ otherwise. Note that the two product terms are mutually exclusive, as one of the powers $\mathbbm{1}[.]$ will evaluate to 1 and the other to 0.

Recent work shows that extending the neighbor-set of nodes beyond their direct connections via random walks, can improve generalization of prediction tasks such as link-prediction and node classification (Perozzi et al., 2014; Grover and Leskovec, 2016; Pan et al., 2016). Following this motivation, we propose to replace the binary edge presence $\mathbbm{1}[(u,v)\in E_{\text{train}}]$ in equation (7) by simulated random walk statistics, and formulate our proposed quadratic objective, the graph likelihood as:

$$\Pr(G)\propto\prod_{u\in V}\prod_{v\in V}\sigma(g(u,v))^{\mathcal{D}_{uv}}\,\big(1-\sigma(g(u,v))\big)^{\mathbbm{1}[(u,v)\notin E_{\text{train}}]}\tag{8}$$

where $\sigma(x)=1/(1+\exp(-x))$ is the standard logistic, the edge function $g$ is described in Section 4.1, and $\mathcal{D}_{uv}$ is the unnormalized frequency with which nodes $u$ and $v$ appear within the configured context window in simulated random walks (Perozzi et al., 2014). Note that our likelihood in equation (8) is not standard: in particular, the two terms are not exclusive, since it is possible for $\mathcal{D}_{uv}>0$ and $(u,v)\notin E_{\text{train}}$ to be simultaneously true. It follows that the expression under the product operator is $\in[0,1]$ since the logistic $\sigma:\mathbb{R}\rightarrow[0,1]$. We use proportional to ($\propto$) instead of equality since the normalizing constant $\int_{G}\Pr(G)$ is only a scaling factor and should not change the $\operatorname*{arg\,max}$ of the likelihood. Our experiments show that the likelihood yields a powerful representation for preserving the graph structure, as evaluated on link-prediction tasks, outperforming models trained with a skipgram objective (such as node2vec (Grover and Leskovec, 2016)). We show in our Experiments (Section 5) that even when our model is identical to node2vec (i.e. shallow and symmetric), training with our proposed objective produces embeddings that better preserve the graph structure, especially when using low embedding dimensions.

Although a naïve optimization of equation (8) is quadratic, $\mathcal{D}$ is sparse with $O(|V|)$ non-zero entries, making it possible to compute the first term in equation (8) in linear time. Further, we use negative sampling (Section 4.4) to estimate the product over $(u,v)\notin E$, which is important in many real applications where graphs are large and most node pairs are non-edges, having $|\{(u,v):u,v\in V\text{ and }(u,v)\notin E\}|\approx\mathcal{O}(|V|^{2})$.
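
Taking the logarithm of the graph likelihood in equation (8) makes the two cost regimes explicit. The following is a worked restatement under the definitions above, with $C$ standing for the log of the normalizing constant (a symbol introduced here only for illustration):

```latex
\log\Pr(G) =
  \underbrace{\sum_{u\in V}\sum_{v\in V}
    \mathcal{D}_{uv}\,\log\sigma(g(u,v))}_{\text{only } O(|V|) \text{ non-zero terms}}
  \;+\;
  \underbrace{\sum_{u\in V}\sum_{v\in V}
    \mathbbm{1}[(u,v)\notin E_{\text{train}}]\,
    \log\big(1-\sigma(g(u,v))\big)}_{\text{up to } O(|V|^{2}) \text{ terms}}
  \;+\; C
```

The first sum can be evaluated exactly by iterating over the non-zero entries of $\mathcal{D}$, while the second sum is the part that negative sampling estimates.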

4.3. Training Data Generation

Our training algorithm requires positive and negative pairs of nodes as its input. Here we briefly describe their generation.

4.3.1. Positives

Given a graph $G=(V,E)$ , we take a partition $E_{\text{train}}\subset E$ and extract random walks from $E_{\text{train}}$ using Algorithm 1. Starting from every node $u_{1}$ , we simulate $n$ random walks, each of length $\tau$ , like:

$$u_{1}\rightarrow u_{2}\rightarrow u_{3}\rightarrow\dots\rightarrow u_{\tau}.$$

Then, for every walk, we extract all node pairs within the context window, similar to (Mikolov et al., 2013):

$$(u_{i},u_{j})\quad\forall j\in\mathbb{Z},\ i-w_{l}\leq j\leq i+w_{r},\ j\neq i,\tag{9}$$

where $w_{l}$ and $w_{r}$ are the context window left and right offsets. For example, for an undirected graph, if we use a context window of size 5, then $w_{l}=w_{r}=2$ and therefore $j\in\{i-2,i-1,i+1,i+2\}$. Extracting positive pairs yields a list $\mathcal{D}$ that contains duplicates, and we overload this notation by defining $\mathcal{D}_{uv}$ as the frequency of $(u,v)$ in list $\mathcal{D}$. It is trivial to show that the number of pairs is linear in $|V|$:

$$|\mathcal{D}|=n\tau\,\mathcal{O}\big((w_{l}+w_{r})(\tau-(w_{l}+w_{r}+1))\,|V|\big)=\mathcal{O}(|V|)$$
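
The context-window rule above can be sketched as follows. The function name `context_pairs` and the clipping of the window at the ends of a walk are assumptions made for illustration; the counter plays the role of $\mathcal{D}_{uv}$.

```python
from collections import Counter


def context_pairs(walk, w_l, w_r):
    """All (u_i, u_j) with i - w_l <= j <= i + w_r and j != i, clipped to the walk."""
    pairs = []
    for i, u in enumerate(walk):
        lo = max(0, i - w_l)
        hi = min(len(walk) - 1, i + w_r)
        for j in range(lo, hi + 1):
            if j != i:
                pairs.append((u, walk[j]))
    return pairs


walk = ["u1", "u2", "u3", "u4", "u5"]

# Directed setting (w_l = 0, w_r = 2): only nodes that follow u3 are its context.
print([v for (u, v) in context_pairs(walk, 0, 2) if u == "u3"])  # prints ['u4', 'u5']

# Undirected setting (w_l = w_r = 2): nodes on both sides are context.
D = Counter(context_pairs(walk, 2, 2))  # D[(u, v)] is the pair frequency
```

Applied to every sampled walk and accumulated in one counter, this gives the sparse co-occurrence counts used as positives.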

4.3.2. Negatives

We fix a set of negatives for every node. Before training, for every node $u$ , we create its negative set $\bar{u}$ as:

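$$\bar{u}=\{v_{1}^{-},v_{2}^{-},\dots\}\quad\text{s.t.}\quad\forall v^{-}\in\bar{u},\ (u,v^{-})\notin E_{\text{train}}$$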

where elements of $\bar{u}$ are sampled uniformly at random. Arguably, it is possible to increase the accuracy of our models if we sub-sample frequent nodes (i.e. with a high degree) as recommended by (Mikolov et al., 2013), however we leave this as future work. In our training loop, we uniformly sample a subset of size $K$ from $\bar{u}$ , where $K$ is a hyper-parameter for Negative Sampling. We use a fixed $K=5$ for all our experiments.
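
A minimal sketch of the fixed negative sets and the per-step subsample might look like this. The function name `build_negative_sets`, the `size` parameter, the exclusion of self-pairs and the toy graph are all assumptions for illustration; enumerating every candidate per node as done here is quadratic overall and only sensible for small graphs.

```python
import random


def build_negative_sets(nodes, train_edges, size, seed=0):
    """For every node u, fix a set of nodes v with (u, v) not in the training edges."""
    rng = random.Random(seed)
    edge_set = set(train_edges)
    negatives = {}
    for u in nodes:
        # Assumption: self-pairs (u, u) are not used as negatives.
        candidates = [v for v in nodes if v != u and (u, v) not in edge_set]
        negatives[u] = rng.sample(candidates, min(size, len(candidates)))
    return negatives


nodes = [0, 1, 2, 3, 4, 5, 6, 7]
train_edges = [(0, 1), (1, 2), (2, 3)]
negatives = build_negative_sets(nodes, train_edges, size=6)

# In the training loop, K = 5 negatives are drawn from the fixed set of the anchor node.
K = 5
step_negatives = random.Random(1).sample(negatives[0], K)
```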

4.4. Negative Sampling

We define an objective, $\mathcal{L}$, that can be computed in linear time using negative sampling (similar to (Mikolov et al., 2013, Section 2.2)). $\mathcal{L}$ approximates our quadratic graph likelihood (8), and is defined as:

$$\mathcal{L}=\mathop{\mathbb{E}}_{(u,v)\sim\mathcal{D}/Z}\bigg[\log\sigma(g(u,v))+\sum_{v^{-}\in\text{Sample}(K,\bar{u})}\log(1-\sigma(g(u,v^{-})))\bigg],$$

where $\text{Sample}(K,\bar{u})$ uniformly samples $K$ negatives from $\bar{u}$ without replacement and $Z$ is a normalizing constant. Note that the outer expectation ranges over $\mathcal{D}$, whose size is linear in $|V|$, and the inner summation goes over $K$ items. We use TensorFlow (TensorflowTeam, 2015) to obtain the gradients $\frac{\partial\mathcal{L}}{\partial Y_{u}},\frac{\partial\mathcal{L}}{\partial\theta},\frac{\partial\mathcal{L}}{\partial L},\frac{\partial\mathcal{L}}{\partial R}$ for each mini-batch. We use PercentDelta (Abu-El-Haija, 2017) to optimize all parameters $\theta,f,g,Y$. We only update the anchor embeddings $Y_{u}$ during the gradient steps on the objective $\mathcal{L}$, as preliminary experiments showed that this gives better performance.
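
The value of the negative-sampling objective for a mini-batch can be sketched without any framework, given edge scores that have already been computed. The function names and the batch layout (one positive score plus a list of $K$ negative scores per anchor) are assumptions for illustration; gradients and the PercentDelta optimizer are out of scope here.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigma(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pair_objective(g_pos, g_negs):
    """log sigma(g(u,v)) + sum over negatives of log(1 - sigma(g(u,v^-))).

    Uses the identity 1 - sigma(x) = sigma(-x).
    """
    return log_sigmoid(g_pos) + sum(log_sigmoid(-g) for g in g_negs)


def batch_objective(batch):
    """Average of pair_objective over a mini-batch of (g_pos, g_negs) items."""
    return sum(pair_objective(g_pos, g_negs) for g_pos, g_negs in batch) / len(batch)


# Hypothetical scores: one positive pair with K = 2 sampled negatives each.
batch = [(2.0, [-1.0, -3.0]), (0.5, [0.0, -2.0])]
value = batch_objective(batch)  # to be maximized; always <= 0
```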

5. Experiments

For all of our experiments, we simulated $n=80$ walks from every node, each of length $\tau=100$. We used context window offsets $(w_{l}=0,w_{r}=2)$ for directed graphs and $(w_{l}=2,w_{r}=2)$ for undirected graphs.

5.1. Datasets

We test our algorithms on directed and undirected graphs. We obtain PPI from (Stark et al., 2006; Grover and Leskovec, 2016) and the other datasets from Stanford SNAP (Leskovec and Krevl, 2014). We only use the largest weakly connected component (WCC) from the original graph. The statistics and dataset description are as follows:

Directed graphs:

(1) soc-epinions: A social network with $|V|=75,877$ and $|E|=508,836$. Each directed edge represents whether a user trusts the opinion of another.

(2) wiki-vote: A voting network with $|V|=7,066$ and $|E|=103,663$. Nodes are Wikipedia editors. Each directed edge represents a vote by one editor for another to become an administrator.

Undirected graphs:

(1) ca-HepTh: A collaboration network of High Energy Physics Theory from Arxiv, with $|V|=8,638$ and $|E|=24,827$. Each undirected edge represents co-authorship between two author nodes.

(2) ca-AstroPh: A collaboration network of Astrophysics from Arxiv, with $|V|=17,903$ and $|E|=197,031$. Each undirected edge represents co-authorship between two author nodes.

(3) PPI: A protein-protein interaction graph, with $|V|=3,852$ and $|E|=20,881$. This is a challenging real-world dataset, where each node is a protein and an edge represents that two proteins interact.

(4) ego-Facebook: A small portion of the Facebook social network, with $|V|=4,039$ and $|E|=88,234$. The nodes are users and the edges indicate friendship. We note that this graph is an ego-network graph, which contains only the complete social connections of 10 seed users. Rather than running link-prediction experiments on this graph, we analyze its unique structure through visualization in Section 5.3.

5.2. Link Prediction

We follow the setup in (Grover and Leskovec, 2016) for link prediction. First, given a graph $G=(V,E)$, we partition its edges into two equal-size disjoint partitions $E_{\text{train}}$ and $E_{\text{test}}$, such that $E_{\text{train}}$ is connected. Second, we sample negative edges for training and testing, $E_{\text{train}}^{-}$ and $E_{\text{test}}^{-}$, where $E_{\text{train}}^{-}$ is sampled from the complement of $E_{\text{train}}$ and $E_{\text{test}}^{-}$ is sampled from the complement of $E$. All train/test edge sets are of equal size. Third, we simulate random walks on $E_{\text{train}}$ to get $\mathcal{D}$, using Algorithm 1 and Eq. (9). Fourth, only for directed graphs, we extend $E_{\text{test}}^{-}$ to contain all edges $(v,u)$ s.t. $(u,v)\in E$ and $(v,u)\notin E$. Finally, we train each algorithm and evaluate the ROC-AUC of its ranking of $(E_{\text{test}},E_{\text{test}}^{-})$.
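Two steps of this protocol are easy to illustrate in isolation: the reversed-edge negatives added for directed graphs, and the ROC-AUC of a ranking. The sketch below is a plain-Python illustration under stated assumptions, not the evaluation code used in the paper; the function names and the toy scores are hypothetical, and the quadratic-time AUC is chosen for readability rather than speed.

```python
def reversed_edge_negatives(edges):
    """Return all (v, u) such that (u, v) is an edge and (v, u) is not."""
    edge_set = set(edges)
    return [(v, u) for (u, v) in edges if (v, u) not in edge_set]


def roc_auc(pos_scores, neg_scores):
    """ROC-AUC as the probability that a positive outranks a negative.

    Ties count as half a win. This pairwise form is O(P * N) and is
    meant only as a readable reference.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy directed graph: 1 <-> 2 is reciprocated, 2 -> 3 is not.
edges = [(1, 2), (2, 1), (2, 3)]
extra_negatives = reversed_edge_negatives(edges)

# Hypothetical model scores for test positives and test negatives.
auc = roc_auc([0.9, 0.8], [0.1, 0.8])
```

Including the reversed edges as negatives penalizes a symmetric scoring function on directed graphs, since it must assign $(u,v)$ and $(v,u)$ the same score.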

Bibliography

Abu-El-Haija (2017) Sami Abu-El-Haija. 2017. Proportionate gradient updates with PercentDelta. In arXiv.
Atwood and Towsley (2016) James Atwood and Don Towsley. 2016. Diffusion-Convolutional Neural Networks. In Advances in Neural Information Processing Systems (NIPS).
Belkin and Niyogi (2001) M. Belkin and P. Niyogi. 2001. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems (NIPS).
Bruna et al. (2013) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. 2013. Spectral networks and deep locally connected networks on graphs. In International Conference on Learning Representations.
Cao et al. (2016) Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2016. Deep Neural Networks for Learning Graph Representations. In Proceedings of the Association for the Advancement of Artificial Intelligence.
Chen et al. (2017) Haochen Chen, Bryan Perozzi, Yifan Hu, and Steven Skiena. 2017. HARP: Hierarchical Representation Learning for Networks. arXiv preprint arXiv:1706.07845 (2017).
Dai et al. (2016) Hanjun Dai, Bo Dai, and Le Song. 2016. Discriminative Embeddings of Latent Variable Models for Structured Data. In International Conference on Machine Learning (ICML).