TL;DR
This paper introduces a novel graph embedding method that explicitly models directed edges and uses a graph likelihood objective, resulting in more concise, accurate representations that outperform previous methods in link prediction tasks.
Contribution
The paper presents a new approach combining explicit edge modeling and a graph likelihood objective, improving embedding quality and efficiency for directed graphs.
Findings
Achieves up to 76% error reduction in link prediction.
Produces embeddings 10 times smaller with higher accuracy.
Effectively models directed edges in graph embeddings.
| Graph type | Dataset | Adjacency baseline: Jaccard | Adjacency baseline: Common Neighbors | Adjacency baseline: Adamic Adar | d | Embedding baseline: Eigen Maps | Embedding baseline: node2vec | Embedding baseline: DNGR | Ours (Graph Likelihood): symmetric, shallow | Ours (Graph Likelihood): symmetric, deep | Ours (Graph Likelihood): asymmetric, shallow | Ours (Graph Likelihood): asymmetric, deep | % Error Reduction |
| directed | soc-epinions | 0.649 | 0.649 | 0.647 | 8 | 0.725 | 0.694 | 0.665 | 0.695 | 0.825 | 36.5% | ||
| 16 | 0.726 | 0.710 | 0.713 | 0.699 | 0.840 | 41.4% | |||||||
| 32 | 0.714 | 0.740 | 0.713 | 0.700 | 0.845 | 45.9% | |||||||
| 64 | 0.699 | 0.766 | 0.722 | 0.698 | 0.834 | 44.9% | |||||||
| 128 | 0.691 | 0.782 | 0.743 | 0.718 | 0.828 | 44.5% | |||||||
| wiki-vote | 0.579 | 0.580 | 0.562 | 8 | 0.613 | 0.643 | 0.630 | 0.603 | 0.602 | 0.608 | 0.871 | 63.7% | |
| 16 | 0.607 | 0.642 | 0.622 | 0.623 | 0.639 | 0.643 | 0.900 | 71.9% | |||||
| 32 | 0.600 | 0.641 | 0.619 | 0.642 | 0.661 | 0.683 | 0.911 | 75.2% | |||||
| 64 | 0.613 | 0.642 | 0.598 | 0.660 | 0.672 | 0.702 | 0.917 | 76.7% | |||||
| 128 | 0.622 | 0.643 | 0.554 | 0.682 | 0.685 | 0.730 | 0.917 | 76.8% | |||||
| undirected | ca-HepTh | 0.765 | 0.765 | 0.765 | 8 | 0.786 | 0.731 | 0.706 | 0.855 | 0.848 | 0.605 | 0.879 | 43.2% |
| 16 | 0.790 | 0.787 | 0.780 | 0.894 | 0.826 | 0.885 | 0.899 | 51.9% | |||||
| 32 | 0.795 | 0.858 | 0.829 | 0.896 | 0.886 | 0.884 | 0.911 | 37.8% | |||||
| 64 | 0.802 | 0.886 | 0.868 | 0.878 | 0.884 | 0.870 | 0.910 | 21.3% | |||||
| 128 | 0.812 | 0.901 | 0.897 | 0.891 | 0.897 | 0.820 | 0.916 | 14.6% | |||||
| ca-AstroPh | 0.942 | 0.942 | 0.944 | 8 | 0.825 | 0.811 | 0.852 | 0.923 | 0.925 | 0.592 | 0.917 | 44.1% | |
| 16 | 0.825 | 0.833 | 0.877 | 0.950 | 0.923 | 0.657 | 0.945 | 55.8% | |||||
| 32 | 0.825 | 0.899 | 0.917 | 0.955 | 0.938 | 0.942 | 0.955 | 46.1% | |||||
| 64 | 0.824 | 0.934 | 0.939 | 0.948 | 0.936 | 0.936 | 0.958 | 30.7% | |||||
| 128 | 0.829 | 0.955 | 0.968 | 0.953 | 0.936 | 0.939 | 0.957 | n/a | |||||
| PPI | 0.766 | 0.776 | 0.779 | 8 | 0.710 | 0.733 | 0.583 | 0.746 | 0.763 | 0.550 | 0.804 | 26.6% | |
| 16 | 0.711 | 0.707 | 0.687 | 0.780 | 0.772 | 0.786 | 0.817 | 36.7% | |||||
| 32 | 0.709 | 0.691 | 0.741 | 0.779 | 0.784 | 0.794 | 0.833 | 35.5% | |||||
| 64 | 0.707 | 0.671 | 0.767 | 0.791 | 0.767 | 0.813 | 0.837 | 30.0% | |||||
| 128 | 0.737 | 0.698 | 0.769 | 0.795 | 0.787 | 0.799 | 0.841 | 31.0% | |||||
| Dataset | Mean (shallow asymmetric) | Mean (deep asymmetric) | t-statistic | p-value |
| soc-epinions | | | 5.797673 | 1.53E-06 |
| wiki-vote | | | 3.881161 | 4.32E-04 |
| ca-HepTh | | | 5.202880 | 1.62E-05 |
| ca-AstroPh | | | 5.946066 | 4.08E-07 |
| ppi | | | 4.187474 | 8.45E-05 |
| Dataset | Shallow Symmetric | Shallow Asymmetric | Deep Symmetric | Deep Asymmetric | ||||||||
| wiki-vote | 0.106 | 0.202 | 0.327 | 0.122 | 0.142 | 0.152 | 0.597 | 0.901 | 1.119 | 0.811 | 1.096 | 1.938 |
| soc-epinions | 0.382 | 0.526 | 0.754 | 0.299 | 0.345 | 0.430 | 0.888 | 1.147 | 1.404 | 1.276 | 1.881 | 3.590 |
| ppi | 1.884 | 4.593 | 7.842 | 0.858 | 1.197 | 3.801 | 0.825 | 1.095 | 1.410 | 1.015 | 1.443 | 2.645 |
| ca-HepTh | 7.370 | 10.871 | 12.426 | 3.093 | 3.916 | 7.892 | 0.957 | 1.364 | 1.709 | 1.120 | 1.417 | 2.292 |
| ca-AstroPh | 12.069 | 326.483 | 4282.095 | 1.108 | 4.648 | 24.271 | 0.874 | 1.273 | 1.999 | 1.065 | 1.826 | 3.062 |
Learning Edge Representations via Low-Rank Asymmetric Projections
Sami Abu-El-Haija (Google Research, Mountain View, California)
Bryan Perozzi (Google Research, New York City, New York)
Rami Al-Rfou (Google Research, Mountain View, California)
(2017)
Abstract.
We propose a new method for embedding graphs while preserving directed edge information. Learning such continuous-space vector representations (or embeddings) of nodes in a graph is an important first step for using network information (from social networks, user-item graphs, knowledge bases, etc.) in many machine learning tasks.
Unlike previous work, we (1) explicitly model an edge as a function of node embeddings, and we (2) propose a novel objective, the graph likelihood, which contrasts information from sampled random walks with non-existent edges. Individually, both of these contributions improve the learned representations, especially when there are memory constraints on the total size of the embeddings. When combined, our contributions enable us to significantly improve the state-of-the-art by learning more concise representations that better preserve the graph structure.
We evaluate our method on a variety of link-prediction tasks including social networks, collaboration networks, and protein interactions, showing that our proposed method learns representations with error reductions of up to 76% and 55% on directed and undirected graphs, respectively. In addition, we show that the representations learned by our method are quite space efficient, producing embeddings which have higher structure-preserving accuracy but are 10 times smaller.
Graph, Edge Learning, Embedding, Random Walk, Link Prediction, Representation Learning
CIKM’17, November 6–10, 2017, Singapore, Singapore. Copyright: rights retained. DOI: 10.1145/3132847.3132959. ISBN: 978-1-4503-4918-5/17/11.
1. Introduction
Recent advancements in learning embedding vectors for words have resulted in a proliferation of methods which learn continuous space representations of graphs (e.g. DeepWalk (Perozzi et al., 2014)). These approaches process a graph and encode each node as a (real-valued) embedding vector, enabling easy integration with existing machine learning algorithms.
Such embedding methods learn a vector space that highly preserves the graph structure. Two nodes would have large similarity in the embedding space (or small distance) if they are strongly connected in the original (discrete) graph. Edges can be weighted or unweighted. Traditional eigen methods (Hagen and Kahng, 1992; Shi and Malik, 2000; Belkin and Niyogi, 2001) learn embeddings that minimize the euclidean distance of connected nodes, which can be solved (with orthonormal constraints) by eigen-decomposition of the symmetric graph Laplacian. Recent random-walk embedding methods (Perozzi et al., 2014; Grover and Leskovec, 2016) learn representations which encode the random walk transition matrix. These methods embed two nodes close if they co-occur frequently in short random walks. In general, random-walk methods outperform “eigen” methods on producing vector representations that preserve the graph structure.
However, recent random-walk embedding methods have two shortcomings. First, these methods do not explicitly model edges. This node-centric assumption represents an edge $(u, v)$ identically to its reverse counterpart $(v, u)$, and is unable to capture asymmetric relationships. Second, to preserve the graph structure they embed nodes into a relatively high-dimensional space, sometimes producing an embedding dictionary larger than the sparse adjacency matrix.
In this work we propose to address these limitations by explicitly modeling edges in the network as a function of the nodes. Specifically, we model edges by (i) using a Deep Neural Network (DNN) to map nodes onto a low-dimensional manifold, (ii) defining an edge function between two nodes as a projection in the manifold coordinates, and (iii) jointly-optimizing the edge function and the manifold by maximizing a new objective we propose, the graph likelihood, which we define as a product of the edge function over all node pairs.
More formally, we learn an embedding vector $Y_u$ for every graph node $u \in V$, a manifold-mapping Deep Neural Network (DNN) $f_\theta$ that is shared across all nodes, and an asymmetric edge function $g(u, v)$ to represent edges in the graph. Our entire model is end-to-end differentiable. $g$ is low-rank, as $g(u, v) = f_\theta(Y_u)^\top L R \, f_\theta(Y_v)$, where both $L$ and $R$ project the node manifold coordinates to a smaller space $\mathbb{R}^b$. Since $b$ is much smaller than the embedding dimension, we are able to reduce the final node embedding significantly. Figure 1 shows a depiction of our architecture. Our desired likelihood is quadratic but we estimate it with a tractable linear objective using negative sampling, similar to (Mikolov et al., 2013).
We find that explicitly modeling edges can drastically reduce the representation dimensionality, for both directed and undirected graphs, especially when coupled with a Deep Neural Network. Further, modeling asymmetry by representing edge differently than gives an additional performance boost when preserving the structure of directed graphs. We perform an extrinsic evaluation of our method, by comparing it to the state-of-the-art on link-prediction tasks over a variety of graphs from social networks, biology, and e-commerce. We show that we can consistently learn orders of magnitude smaller embedding dimensions, while improving ROC-AUC metrics. For example, we reduce the error on directed graphs by up to and undirected graphs by up to when using same-sized representations. However, when our model is restricted to representations which are 8 times smaller than the baselines, we reduce the error in some cases by up to on directed graphs and on undirected graphs. We perform intrinsic evaluations, by training and rendering two-dimensional embedding spaces for two datasets, which we use to gain intuition on placement choices made by our model.
To summarize, our contributions are as follows.
1. We propose to explicitly model a directed edge function, which we define as low-rank affine projections on a manifold that is produced by a Deep Neural Network (i.e. “deep embeddings”).
2. We propose a new objective function, the graph likelihood.
3. These aspects significantly improve the state of the art on learning continuous graph representations, especially on directed graphs, while producing significantly smaller representation spaces, as evaluated on five graph datasets.
2. Edge Representations
It is common to embed a graph by learning one continuous $d$-dimensional vector $Y_u$ for every graph node $u$, where relationships between nodes $u$ and $v$ are captured in a very coarse way, through the use of a distance measure (e.g. $\|Y_u - Y_v\|$). This node-centric modeling assumes that all relationships in the graph are symmetric, a limiting assumption which fails to capture any directed relationships.
We seek to model the asymmetry which occurs in many real-world graphs. Specifically, given two nodes, $u$ and $v$, we desire that their distances are allowed to differ ($\mathrm{dist}(u, v) \neq \mathrm{dist}(v, u)$) to reflect ordering in directed relationships, such as follower and followee on Twitter. In addition, the asymmetry can also model degree variance in undirected graphs. Consider a popular node $v$: the optimization could make $\mathrm{dist}(u, v)$ small for all $u$ but not necessarily $\mathrm{dist}(v, u)$.
Even though it is possible to learn one representation for all node pairs $(u, v)$, this direct modeling is prohibitive in practice, since its space requirement grows quadratically with the number of nodes. Instead, we propose to learn a trainable edge function defined over node embedding coordinates. Specifically, we learn asymmetric transformations of the nodes, which generate, for a node $u$, two representations: one when it is the source of a directed edge, and one when it is a destination. These representations share a neural network. They can be combined for any pair of nodes to model the strength of their directed relationship. That is, for nodes $u$ and $v$, the edge $(u, v)$ is represented by combining the source representation of $u$ with the destination representation of $v$, and the edge $(v, u)$ by combining the source representation of $v$ with the destination representation of $u$.
3. Preliminaries
3.1. Link Prediction
Link prediction is a problem of inferring missing edges in a graph. We use it to evaluate the generalization ability of our embedding spaces, as we aim to preserve the graph structure. The common setup (Grover and Leskovec, 2016) is to “hold out” test edges and train on the remaining . Structure-preserving representations should retrieve the held-out with high accuracy.
3.2. Graph Embedding
Graph embedding approaches learn a $d$-dimensional embedding dictionary $Y \in \mathbb{R}^{|V| \times d}$, containing a continuous real-valued vector $Y_u \in \mathbb{R}^d$ for every graph node $u \in V$. Earlier approaches in computing embeddings include Eigenmaps (Belkin and Niyogi, 2001), which embeds $u$ and $v$ to be close if they are connected (i.e. $(u, v) \in E$ or similarly $(v, u) \in E$). Formally, Eigenmaps learns embeddings by minimizing an objective:
$$\min_{Y} \sum_{u, v \in V} A_{uv} \, \|Y_u - Y_v\|^2 \quad \text{s.t.} \quad Y^\top D Y = I \qquad (1)$$
where the weight of edge $(u, v)$ is stored in the adjacency matrix at $A_{uv}$ and $D$ is a diagonal weight matrix with $D_{uu} = \sum_{v} A_{uv}$. This optimization yields an embedding space where $Y_u$ and $Y_v$ are near if $A_{uv}$ is large (or non-zero). Equation (1) also appears in equivalent forms in (Hagen and Kahng, 1992; Shi and Malik, 2000). Furthermore, Bregman iterations have been proposed to optimize an L1-formulation of the above objective function (Yu et al., 2015).
3.3. Word2vec
Word2vec (Mikolov et al., 2013) processes a big text corpus (e.g. Wikipedia) and learns one embedding vector for every unique word. If two words $w_i$ and $w_j$ are frequently “close” (e.g. in the same sentence), then the dot product of their embeddings $Y_{w_i}^\top Y_{w_j}$ is maximized. In particular, every time two words $w_i$ and $w_j$ are within $c$ words of each other, where integer $c$ is the “context window size” hyperparameter, a gradient step increments this likelihood:
$$\Pr(w_j \mid w_i) = \frac{\exp\left(Y_{w_i}^\top Y_{w_j}\right)}{\sum_{k} \exp\left(Y_{w_i}^\top Y_{w_k}\right)} \qquad (2)$$
We refer the reader to (Mikolov et al., 2013) for further information. (Footnote 1: The denominator, rather than summing over all words, is approximated by hierarchical softmax. In addition, their original formulation learns two vectors per word, one used as “input” and another used as “output”.)
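To make the softmax likelihood of this subsection concrete, here is a minimal sketch under stated assumptions: the toy embedding values and the name `softmax_likelihood` are illustrative and not from the paper, a single vector per word is used, and the denominator sums over the full vocabulary rather than using the hierarchical softmax approximation of the original word2vec.

```python
import math

def softmax_likelihood(embeddings, i, j):
    """Pr(word j | word i) as a softmax over embedding dot products.

    Assumption: one vector per word and an exact sum over the vocabulary.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scores = [dot(embeddings[i], e) for e in embeddings]
    top = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - top) for s in scores]
    return exps[j] / sum(exps)

# Toy vocabulary of three words with made-up 2-dimensional embeddings.
embeddings = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
probs = [softmax_likelihood(embeddings, 0, j) for j in range(3)]
```

Word 1 points in nearly the same direction as word 0, so it receives a higher conditional probability than word 2, whose embedding points the opposite way.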
3.4. Graph Embedding with Random Walks
Rather than operating directly on the adjacency matrix $A$, another embedding strategy has been recently proposed by Perozzi, Al-Rfou, and Skiena (Perozzi et al., 2014). Their method, DeepWalk, introduced a new class of random walk methods, which extend a node’s direct neighbors to include nodes that are within a small number of hops. These approaches sample many random walks from the graph. If nodes $u$ and $v$ are frequently close in the random walks, then the model learns a representation such that the inner product $Y_u^\top Y_v$ is a large positive value. Algorithm 1 extracts random walks. It begins by computing a probability transition matrix $\mathcal{T}$, where $\mathcal{T}_{uv}$ indicates the probability of a random walker visiting node $v$ conditioned on the current node being $u$.
This model has been extended by node2vec (Grover and Leskovec, 2016) to use a second-order probability transition function, where the probability of a random walker visiting a node is conditioned on both the current node and the previous node. Node2vec’s random walk uses hyper-parameters $p$ and $q$, which effectively yield a graph traversal algorithm that interpolates between Depth-First Search (DFS) and Breadth-First Search (BFS). We refer the reader to (Grover and Leskovec, 2016) for further details. We adopt this method for generating random walks in our work.
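The first-order walk sampling described above can be sketched as follows. This is not the paper's Algorithm 1 and not node2vec's second-order walk: it assumes an unweighted adjacency list, so each step picks a neighbor uniformly, and the names `transition_probs` and `sample_walk` are hypothetical.

```python
import random

def transition_probs(adj):
    """Row-normalize an unweighted adjacency list into transition probabilities."""
    return {u: {v: 1.0 / len(nbrs) for v in nbrs} for u, nbrs in adj.items() if nbrs}

def sample_walk(adj, start, length, rng):
    """Sample one first-order random walk with at most `length` nodes."""
    walk = [start]
    while len(walk) < length:
        nbrs = adj.get(walk[-1], [])
        if not nbrs:  # dead end: a node without outgoing edges
            break
        walk.append(rng.choice(nbrs))  # uniform choice over neighbors
    return walk

# Toy directed graph; node 3 has no outgoing edges.
adj = {0: [1, 2], 1: [2], 2: [0], 3: []}
rng = random.Random(0)
# Two walks of length 5 starting from every node.
walks = [sample_walk(adj, u, 5, rng) for u in adj for _ in range(2)]
```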
After sampling random walks, DeepWalk and node2vec treat each walk as a sequence, and then apply the skip-gram model to compute embeddings per word (i.e. node). The objective that they minimize is:
[TABLE]
where $D_{uv}$ is the number of times nodes $u$ and $v$ appear close to each other (i.e. within the context size) in all random walks. We extend these random walk methods in three important ways. First, rather than using word2vec’s objective (Equation 3), we propose a novel alternative objective, the graph likelihood (see Section 4.2). Second, we explicitly represent an edge function as a function of nodes which we jointly train (see Section 4). Third, we define the “context” for directed graphs differently than for undirected ones. Specifically, for a node in a random walk, the context includes nodes on both sides of it if the graph is undirected, and only nodes that follow it if the graph is directed.
4. Our Method
We explain the details of our model and how we train it. The source code is available online at http://sami.haija.org/graph/deep_embedding.html .
4.1. Model
Given an (un)directed graph , we learn an embedding vector for every node . In addition, we learn a Deep Neural Network (DNN) that maps a node onto a low-dimensional manifold. is depicted in Figure 2, and is defined as:
[TABLE]
where is a fully-connected layer with weight matrix and bias vector , BatchNorm is described in (Ioffe and Szegedy, 2015), is an element-wise activation function, and .
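We define a general class of edge functions where symmetry is not imposed, allowing $g(u, v) \neq g(v, u)$. Consider a low-rank affine projection in the manifold space:
$$g(u, v) = f_\theta(Y_u)^\top M \, f_\theta(Y_v) \qquad (4)$$
where the low-rank projection matrix $M = L R$ factors through a $b$-dimensional space, with $L$ having $b$ columns and $R$ having $b$ rows. We refer to $b$ as the bottleneck dimension and we experiment with . We can factor $g$ into an inner product $\langle L^\top f_\theta(Y_u), \, R f_\theta(Y_v) \rangle$. We refer to $L^\top f_\theta(Y_u)$ and $R f_\theta(Y_v)$, respectively, as the left- and right-asymmetric embeddings.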
We note Equation (4) can be extended to use a combination of multiple low-rank affine projections, as:
[TABLE]
where a parameter vector of the output layer, is the number of projections, and each projection has its own and . Even though the total size of the parameters may be large, we can have a low memory footprint during inference if we precompute the left and right embeddings for every node.
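The single-projection edge function of this section, and the precomputation of left and right embeddings, can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the matrices `L` (3 x 2) and `R` (2 x 3), the manifold coordinates in `h` (standing in for the DNN output of each node), and the helper names are not from the paper.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Assumed shapes: manifold dimension 3, bottleneck dimension b = 2.
L = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 x 2
R = [[0.5, 0.0, 1.0], [0.0, 2.0, 0.0]]     # 2 x 3
# Made-up manifold coordinates, standing in for the DNN output of each node.
h = {"u": [1.0, 2.0, 0.0], "v": [0.0, 1.0, 1.0]}

# Precompute b-dimensional left and right embeddings once per node.
left = {n: matvec(transpose(L), x) for n, x in h.items()}
right = {n: matvec(R, x) for n, x in h.items()}

def g(src, dst):
    """Edge score as an inner product of precomputed embeddings."""
    return dot(left[src], right[dst])

def g_direct(src, dst):
    """Same score computed as h_src^T (L R) h_dst, without precomputation."""
    return dot(h[src], matvec(L, matvec(R, h[dst])))

score_uv, score_vu = g("u", "v"), g("v", "u")
```

With these values the two directions score differently (5.0 for u to v, 8.5 for v to u), and at inference time only the two 2-dimensional vectors per node are needed.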
4.2. Graph Likelihood
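We introduce our proposed objective step by step. We start with intuitions from the Maximum Likelihood Estimate of Logistic Regression. Given a training graph $G = (V, E)$, one can define a probability measure as a product of an edge estimate on all node pairs:
$$\Pr(G) = \prod_{(u, v) \in E} Q(u, v) \prod_{(u, v) \notin E} \bigl(1 - Q(u, v)\bigr) \qquad (6)$$
where $Q(u, v)$ is a trainable edge estimator. If $Q$ is a perfect estimator, then it should output $1$ on all $(u, v) \in E$ and should output $0$ on all $(u, v) \notin E$, which makes $\Pr(G) = 1$ iff $Q$ is a perfect estimator. An equivalent form of equation (6) is:
$$\Pr(G) = \prod_{u, v \in V} Q(u, v)^{\mathbb{1}[(u, v) \in E]} \, \bigl(1 - Q(u, v)\bigr)^{\mathbb{1}[(u, v) \notin E]} \qquad (7)$$
where the indicator function $\mathbb{1}[x] = 1$ if predicate $x$ is true and is $0$ otherwise. Note that the two product terms are mutually exclusive, as one of the powers will evaluate to 1 and the other to 0.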
Recent work shows that extending the neighbor-set of nodes beyond their direct connections via random walks, can improve generalization of prediction tasks such as link-prediction and node classification (Perozzi et al., 2014; Grover and Leskovec, 2016; Pan et al., 2016). Following this motivation, we propose to replace the binary edge presence in equation (7) by simulated random walk statistics, and formulate our proposed quadratic objective, the graph likelihood as:
$$\Pr(G) \propto \prod_{u, v \in V} \sigma\bigl(g(u, v)\bigr)^{D_{uv}} \, \sigma\bigl(-g(u, v)\bigr)^{\mathbb{1}[(u, v) \notin E]} \qquad (8)$$
where $\sigma$ is the standard logistic, the edge function $g$ is described in Section 4.1, $E$ is the edge set of the training graph, and $D_{uv}$ is the unnormalized frequency that nodes $u$ and $v$ appear within the configured context window in simulated random walks (Perozzi et al., 2014). Note that our likelihood in equation (8) is not standard, especially in that the two terms are not exclusive: it is possible for $D_{uv} > 0$ and $(u, v) \notin E$ to be simultaneously true. It follows that the expression under the product operator is at most 1, since the logistic satisfies $0 < \sigma(x) < 1$. We use proportional to ($\propto$) instead of equality since the normalizing constant is only a scaling factor and should not change the maximizer of the likelihood. Our experiments show that the likelihood yields a powerful representation for preserving the graph structure, as evaluated on link-prediction tasks, out-performing models trained with a skipgram objective (such as node2vec (Grover and Leskovec, 2016)). We show in our Experiments (Section 5) that even when our model is identical to node2vec (i.e. shallow and symmetric), training with our proposed objective produces embeddings that better preserve the graph structure, especially when using low embedding dimensions.
Although a naïve optimization of equation (8) is quadratic, $D$ is sparse, with a number of non-zero entries that is linear in the number of nodes, making it possible to compute the first term in equation (8) in linear time. Further, we use negative sampling (Section 4.4) to estimate the product over non-edges $(u, v) \notin E$, which is important in many real applications where graphs are large and most node pairs are negatives.
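As a rough illustration of the graph likelihood, the sketch below evaluates its logarithm exactly (the quadratic form, without negative sampling) on a toy graph. It assumes the objective has the shape described in this section: each ordered node pair contributes the log-logistic of its edge score weighted by its random-walk co-occurrence count, plus the log-logistic of the negated score when the pair is not a training edge. The names `graph_log_likelihood`, `counts`, and `train_edges`, and the toy scores, are hypothetical.

```python
import math

def log_sigmoid(x):
    """Numerically stable log of the logistic function."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def graph_log_likelihood(nodes, g, counts, train_edges):
    """Unnormalized log graph likelihood summed over all ordered node pairs."""
    total = 0.0
    for u in nodes:
        for v in nodes:
            score = g(u, v)
            # Positive term, weighted by the co-occurrence count of (u, v).
            total += counts.get((u, v), 0) * log_sigmoid(score)
            # Negative term, active only for pairs that are not training edges.
            if (u, v) not in train_edges:
                total += log_sigmoid(-score)
    return total

nodes = [0, 1, 2]
train_edges = {(0, 1), (1, 2)}
counts = {(0, 1): 2, (1, 2): 1, (0, 2): 1}  # made-up walk statistics

def good(u, v):
    """Toy edge function that scores training edges high and other pairs low."""
    return 4.0 if (u, v) in train_edges else -4.0

def bad(u, v):
    return -good(u, v)

ll_good = graph_log_likelihood(nodes, good, counts, train_edges)
ll_bad = graph_log_likelihood(nodes, bad, counts, train_edges)
```

An edge function that agrees with the training edges obtains a higher value than one that contradicts them. Note that the pair (0, 2) has a positive count but is not an edge, so both terms are active for it.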
4.3. Training Data Generation
Our training algorithm requires positive and negative pairs of nodes as its input. Here we briefly describe their generation.
4.3.1. Positives
Given a graph , we take a partition and extract random walks from using Algorithm 1. Starting from every node , we simulate random walks, each of length , like:
[TABLE]
Then, for every walk, we extract all node pairs within the context window, similar to (Mikolov et al., 2013).
[TABLE]
where and are the context window left and right offsets. For example, for an undirected graph, if we use a context window of size 5, then and therefore . Extracting positive pairs yields a list that contain duplicates and we over-load this notation by defining as the frequency of in list . It is trivial to show that the number of pairs is linear in :
[TABLE]
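The extraction of positive pairs from a walk can be sketched as below. The parameter names `c_left` and `c_right` stand for the left and right context window offsets mentioned above; the specific offsets used here, and the choice of a zero left offset for the directed case, are assumptions for illustration.

```python
from collections import Counter

def context_pairs(walk, c_left, c_right):
    """Return (anchor, context) pairs within the given window offsets."""
    pairs = []
    for i, anchor in enumerate(walk):
        lo = max(0, i - c_left)
        hi = min(len(walk), i + c_right + 1)
        pairs.extend((anchor, walk[j]) for j in range(lo, hi) if j != i)
    return pairs

walk = ["a", "b", "c", "d"]
# Undirected: look one step to each side. Directed: only look ahead.
undirected = Counter(context_pairs(walk, c_left=1, c_right=1))
directed = Counter(context_pairs(walk, c_left=0, c_right=1))
```

The `Counter` plays the role of the pair frequencies: duplicates across many walks accumulate into counts, and the number of pairs per walk is bounded by the walk length times the window size, which is why the total grows linearly with the number of walks.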
4.3.2. Negatives
We fix a set of negatives for every node. Before training, for every node , we create its negative set as:
[TABLE]
where elements of are sampled uniformly at random. Arguably, it is possible to increase the accuracy of our models if we sub-sample frequent nodes (i.e. with a high degree) as recommended by (Mikolov et al., 2013), however we leave this as future work. In our training loop, we uniformly sample a subset of size from , where is a hyper-parameter for Negative Sampling. We use a fixed for all our experiments.
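A minimal sketch of the two-stage negative generation described above: a fixed per-node negative set built once before training, then a uniform subsample drawn in the training loop. The candidate pool (all other nodes), the set size, and the function names are assumptions; the paper's exact candidate set is given by its own equation.

```python
import random

def build_negative_sets(nodes, set_size, rng):
    """Fix, once before training, a uniformly sampled negative set per node.

    Assumption: candidates are all nodes other than the node itself.
    """
    return {
        u: rng.sample([v for v in nodes if v != u], set_size)
        for u in nodes
    }

def draw_negatives(negative_sets, u, k, rng):
    """In the training loop, sample k negatives for u without replacement."""
    return rng.sample(negative_sets[u], k)

nodes = list(range(6))
rng = random.Random(0)
negative_sets = build_negative_sets(nodes, set_size=4, rng=rng)
batch_negatives = draw_negatives(negative_sets, u=2, k=2, rng=rng)
```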
4.4. Negative Sampling
We define an objective, , that can be computed in linear-time using negative sampling, (similar to (Mikolov et al., 2013, Section 2.2)). approximates our quadratic graph likelihood (8), defined as:
[TABLE]
where uniformly samples negatives from without replacement and is a normalizing constant. Note that the outer expectation is linear and the inner summation goes over items. We use TensorFlow (TensorflowTeam, 2015) to obtain the gradients for each mini-batch. We use PercentDelta (Abu-El-Haija, 2017) to optimize all parameters . We only update the anchor embeddings during the gradient steps on the objective , as preliminary experiments showed that we get better performance.
5. Experiments
For all of our experiments, we simulated walks from every node, each walk is of length , and we used a right and left context window sizes, respectively, for directed and undirected graphs as and .
5.1. Datasets
We test our algorithms on directed and undirected graphs. We obtain PPI from (Stark et al., 2006; Grover and Leskovec, 2016) and the other datasets from Stanford SNAP (Leskovec and Krevl, 2014). We only use the largest weakly connected component (WCC) from the original graph. The statistics and dataset description are as follows:
Directed graphs:
1. soc-epinions: A social network and . Each directed edge represents whether a user trusts the opinion of another.
2. wiki-vote: A voting network with and . Nodes are Wikipedia editors. Each directed edge represents a vote by one editor for another to become an administrator.
Undirected graphs:
1. ca-HepTh: A collaboration network of High Energy Physics Theory from arXiv, with and . Each undirected edge represents co-authorship between two author nodes.
2. ca-AstroPh: A collaboration network of Astrophysics from arXiv, with and . Each undirected edge represents co-authorship between two author nodes.
3. PPI: A protein-protein interaction graph, with and . This is a challenging real-world dataset, where each node is a protein and an edge represents that two proteins interact.
4. ego-Facebook: A small portion of the Facebook social network, with and . The nodes are users and the edges indicate friendship. We note that this graph is an ego-network graph, which contains only the complete social connections of 10 seed users. Rather than running link-prediction experiments on this graph, we analyze its unique structure through visualization in Section 5.3.
5.2. Link Prediction
We follow the setup in (Grover and Leskovec, 2016) for link prediction. First, given a graph , we partition its edges into two equal-size disjoint partitions and , such that, is connected. Second, we sample negative edges for training and testing, and , where is sampled from the complement of and is sampled from the complement of . All train/test edge sets are of equal size. Third, we simulate random walks on to get , using Algorithm 1 and Eq. (9). Fourth, only for directed graphs, we extend to contain all edges s.t. and . Finally, we train each algorithm and we evaluate ROC-AUC metrics on their ranking of .
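The ROC-AUC metric used in this evaluation can be computed directly from the scores a model assigns to held-out positive edges and sampled negative edges. The sketch below uses the pairwise-ranking definition (the fraction of positive/negative pairs ranked correctly, with ties counted as half); the function name `roc_auc` and the example scores are illustrative, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(pos_scores, neg_scores):
    """ROC-AUC as the probability that a positive outranks a negative."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos_scores) * len(neg_scores))

# Made-up edge scores: two held-out test edges and two sampled non-edges.
auc_perfect = roc_auc([0.9, 0.8], [0.1, 0.2])
auc_partial = roc_auc([0.9, 0.4], [0.5, 0.1])
```

In the second example one of the four positive/negative pairs is ranked incorrectly (0.4 against 0.5), giving an ROC-AUC of 0.75.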
References
- Abu-El-Haija (2017) Sami Abu-El-Haija. 2017. Proportionate gradient updates with PercentDelta. In arXiv.
- Atwood and Towsley (2016) James Atwood and Don Towsley. 2016. Diffusion-Convolutional Neural Networks. In Advances in Neural Information Processing Systems (NIPS).
- Belkin and Niyogi (2001) M. Belkin and P. Niyogi. 2001. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems (NIPS).
- Bruna et al. (2013) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. 2013. Spectral networks and deep locally connected networks on graphs. In International Conference on Learning Representations.
- Cao et al. (2016) Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2016. Deep Neural Networks for Learning Graph Representations. In Proceedings of the Association for the Advancement of Artificial Intelligence.
- Chen et al. (2017) Haochen Chen, Bryan Perozzi, Yifan Hu, and Steven Skiena. 2017. HARP: Hierarchical Representation Learning for Networks. arXiv preprint arXiv:1706.07845 (2017).
- Dai et al. (2016) Hanjun Dai, Bo Dai, and Le Song. 2016. Discriminative Embeddings of Latent Variable Models for Structured Data. In International Conference on Machine Learning (ICML).
