Graph Space Embedding

João Pereira; Albert Groen; Erik Stroes; Evgeni Levin

arXiv:1907.13443·cs.LG·August 1, 2019


TL;DR

The paper introduces Graph Space Embedding (GSE), a computationally efficient method that maps inputs into a space where feature interactions are implicitly encoded, with theoretical analysis and improved clinical prediction performance.

Contribution

It presents a novel GSE technique with theoretical insights and a new interpretability strategy, outperforming traditional methods in clinical data.

Findings

01 GSE achieves superior predictive accuracy on clinical data.

02 Theoretical results define optimal parameter regimes for GSE.

03 A new interpretability approach identifies key interactions in predictions.

Abstract

We propose the Graph Space Embedding (GSE), a technique that maps the input into a space where interactions are implicitly encoded, with little computation required. We provide theoretical results on an optimal regime for the GSE, namely a feasibility region for its parameters, and demonstrate the experimental relevance of our findings. Next, we introduce a strategy to gain insight into which interactions are responsible for certain predictions, paving the way for a far more transparent model. In an empirical evaluation on a real-world clinical cohort containing patients with suspected coronary artery disease, the GSE achieves far better performance than traditional algorithms.

Tables

Table 1: The GSE benchmarked against the random-walk graph kernel (RWGK), random forests (RF), the GSE with a constant interaction matrix (GSE*), and the radial basis function kernel (RBF). For all kernels, an SVM was used as the learning algorithm.

Method	AUC std	AUC avg	Avg. run time (s)
GSE	0.055890	0.814141	7.63
RWGK	0.051704	0.808838	1720
RF	0.066036	0.764141	17.99
GSE*	0.082309	0.787879	6.59
RBF	0.095247	0.779293	1.16

Equations

$$\mathbb{G}_{\mathbb{x}}(\mathbb{A})=\varphi(\mathbb{A})\circ\mathbb{x}^{\top}\mathbb{x},\tag{1}$$

$$k_{n}(\mathbb{G},\mathbb{G}^{\prime})=\sum_{i,j=1}^{n}[\gamma]_{i,j}\left\langle[\mathbb{G}]^{i},[\mathbb{G}^{\prime}]^{j}\right\rangle_{F},\tag{2}$$

$$k_{n}(\mathbb{G},\mathbb{G}^{\prime})=\sum_{k,l=1}^{|V|}\sum_{i=1}^{n}\phi_{i,k,l}(\mathbb{G})\,\phi_{i,k,l}(\mathbb{G}^{\prime}),\tag{3}$$

$$\begin{aligned}k_{n}(\mathbb{G},\mathbb{G}^{\prime})&=\left\langle\sum_{i=1}^{n}\theta^{i}[\mathbb{G}]^{i},\sum_{j=1}^{n}\theta^{j}[\mathbb{G}^{\prime}]^{j}\right\rangle_{F}\\&=\left\langle\sum_{i=1}^{n}\theta^{i}[\mathbb{G}]^{i},\sum_{i=1}^{n}\theta^{i}[\mathbb{G}^{\prime}]^{i}\right\rangle_{F},\end{aligned}\tag{4}$$

$$k_{\infty}(\mathbb{G},\mathbb{G}^{\prime})=\left\langle e^{\beta\mathbb{G}},e^{\beta\mathbb{G}^{\prime}}\right\rangle_{F},\tag{5}$$

$$k(\mathbb{x},\mathbb{y})=e^{-\frac{||\mathbb{x}-\mathbb{y}||^{2}}{\sigma^{2}}}=c\,e^{\frac{2\langle\mathbb{x},\mathbb{y}\rangle}{\sigma^{2}}},\tag{6}$$

$$k(\mathbb{G},\mathbb{G}^{\prime})=c\,e^{\frac{2\langle\mathbb{x},\mathbb{y}\rangle}{\sigma^{2}}}=c\underbrace{\sum_{n=0}^{\infty}\frac{\left(2\left\langle\sqrt{\gamma}\,\mathbb{G},\sqrt{\gamma}\,\mathbb{G}^{\prime}\right\rangle_{F}\right)^{n}}{\sigma^{2n}\,n!}}_{r\_w}\tag{7}$$

$$k(\mathbb{G},\mathbb{G}^{\prime})=c\sum_{n=0}^{\infty}\underbrace{\left(\frac{2}{\nu}\right)^{n}}_{\lambda}\underbrace{\sum_{\boldsymbol{\alpha}^{n}(\cdot)}\frac{\prod_{i=1}^{|E|}[\mathbb{G}_{i}\mathbb{G}_{i}^{\prime}]^{\alpha_{i}}}{\prod_{i=1}^{|E|}\Gamma(\alpha_{i}+1)}}_{r\_e},\tag{8}$$

$$\sigma^{2}(\mathbb{K}(\nu))=E[\mathbb{K}(\nu)^{2}]-\underbrace{E[\mathbb{K}(\nu)]^{2}}_{b}=\left(\frac{D-1}{D}\right)\sum_{d=1}^{D}e^{-2\nu d}-\frac{1}{D^{2}}\sum_{i\neq j}^{D^{2}-D}2e^{-\nu(d_{i}+d_{j})},\tag{9}$$

$$\frac{\|\mathbb{K}^{\prime}(\nu)-\mathbb{K}^{\prime}(\nu^{\prime})\|}{\|\nu-\nu^{\prime}\|}\leq L(\mathbb{K}^{\prime})\,:\forall\,\nu,\nu^{\prime},\tag{10}$$

$$\frac{\|\top-\Lambda\|}{\|\nu-\nu^{\prime}\|},\quad\Lambda=2\left(\frac{D-1}{D^{2}}\right)\left[\sum_{d=1}^{D}d\left(e^{-2\nu d}-e^{-2\nu^{\prime}d}\right)\right],\quad\top=\frac{2}{D^{2}}\sum_{i\neq j}^{D^{2}-D}(d_{i}+d_{j})\left(e^{-\nu(d_{i}+d_{j})}-e^{-\nu^{\prime}(d_{i}+d_{j})}\right).\tag{11}$$

$$\|\top-\Lambda\|\leq 2\left(\frac{D-1}{D}\right)d_{max}.\tag{12}$$

$$e^{-c\nu}-e^{-c\nu^{\prime}}=\underbrace{\frac{e^{c\nu^{\prime}}-e^{c\nu}}{e^{c(\nu+\nu^{\prime})}}}_{\delta}\to 0,\,:\nu,\nu^{\prime}>0,\tag{13}$$

$$k_{n}(\mathbb{G},\mathbb{G}^{\prime})=\sum_{i,j=1}^{n}[\gamma]_{i,j}\left\langle[\mathbb{G}]_{kl}^{i},[\mathbb{G}^{\prime}]_{kl}^{j}\right\rangle_{F}=\sum_{k,l=1}^{|V|}\sum_{i=1}^{n}[\mathbb{G}]_{kl}^{i}\sum_{j=1}^{n}[\gamma]_{i,j}[\mathbb{G}^{\prime}]_{kl}^{j}.\tag{14}$$

$$E[\theta\Omega(h)]=\varepsilon E[\mathcal{L}]\leftrightarrow\theta=\frac{\varepsilon E[\mathcal{L}]}{E[\Omega(h)]}.\tag{15}$$

$$\xi(\mathbb{x}_{0})=\min_{h\in\mathcal{H}}\mathcal{L}\Big(h,f,k(\mathbb{G}_{\mathbb{x}_{i}},\mathbb{G}_{\mathbb{x}_{0}})\Big)+\theta\Omega(h).\tag{16}$$

$$f(\mathbb{x})\approx\hat{f}(\mathbb{x})=f(\mathbb{x}_{0})+\nabla_{\mathbb{x}}f(\mathbb{x}_{0})(\mathbb{x}-\mathbb{x}_{0}).\tag{17}$$

$$p_{F}(f=1|\delta)=\lambda e^{-\lambda\delta}.\tag{18}$$

$$p_{T}(t=1|\delta)=\begin{cases}\frac{2}{v}\frac{\delta-a(\tau)}{u},&a(\tau)<\delta\leq b\\\frac{2}{v},&b<\delta\leq c\\0,&\text{otherwise},\end{cases}\tag{19}$$

$$|\hat{f}(\mathbb{x})-f(\mathbb{x}_{0})|\geq\tau\Leftrightarrow\sum_{i=1}^{N}\nabla_{\mathbb{x}}f(\mathbb{x}_{0})[i]\,\delta[i]\geq\tau,\tag{20}$$

$$\delta[i]\geq\frac{\tau}{N^{\prime}\nabla_{\mathbb{x}}f(\mathbb{x}_{0})[i]}\equiv\tau_{0},\tag{21}$$

$$a(\tau)_{i}=\tau_{0}\left(1+\frac{\theta_{a}(M_{min}-i)}{M_{min}}\right).\tag{22}$$

$$c=b\left(c_{l}+\frac{E[||\tau_{0}||_{2}]-||\tau_{0}||_{2}}{E[||\tau_{0}||_{2}]+||\tau_{0}||_{2}}\right),\quad c_{l}\in\,]2,+\infty[.\tag{23}$$


Taxonomy

Topics: Bioinformatics and Genomic Networks · Advanced Graph Neural Networks · Gene expression and cancer classification

Full text


1 Introduction

Learning from interconnected systems can be a particularly difficult task due to the possibly non-linear interaction between the components Linde et al. (2015); Bereau et al. (2018). In some cases, these interactions are known and therefore constitute an important source of prior information Jonschkowski (2015); Zhou et al. (2018). Although prior knowledge can be leveraged in a variety of ways Yu et al. (2010), most of the research involving interactions is focused on their discovery. One popular approach to deal with feature interactions is to cast the interaction network as a graph and then use kernel methods based on graph properties, such as walk lengths or subgraphs Borgwardt and Kriegel (2005); Shervashidze et al. (2009); Kriege and Mutzel (2012) or, more recently, graph deep convolutional methods Defferrard et al. (2016); Fout et al. (2017); Kipf and Welling (2017). In this work, however, we focus on the case in which the interactions are feature specific and a universal property of the data instances, which makes pattern-search algorithms unsuitable for this task. To our knowledge, there is limited research involving this setting, although we suggest many problems can be formulated in the same way (see Figure 1). To address this knowledge gap, we present a novel method: Graph Space Embedding (GSE), an approach related to the ’random-walk’ graph kernel Gärtner et al. (2003); Kang et al. (2012) with an important difference: it is not limited to the sum of all walks of a given length, but rather compares similar edges in two different graphs, which results in better expressiveness. Our empirical evaluation demonstrates that GSE leads to an improvement in performance compared to other baseline algorithms when plasma protein measurements and their interactions are used to predict ischaemia in patients with Coronary Artery Disease (CAD) van Nunen et al. (2015); Zimmermann et al. (2015). Moreover, each kernel entry can be computed in $\mathcal{O}(n^{2})$, where $n$ is the number of features, and its hyperparameters can be efficiently optimized via maximization of the kernel matrix variation.

1.1 Main Contributions

1. Graph Space Embedding function that efficiently maps input into an “interaction-based” space.

2. Novel theoretical result on an optimal regime for the GSE, namely a feasibility region for its parameters.

3. Even Descent Sampling Algorithm: a strategy to gain insight into which interactions are responsible for a given prediction.

2 Approach

A remark on notation: we will use bold capital letters for matrices, bold lowercase letters for arrays and plain lowercase letters for scalars, functions and 1-d variables (e.g. $\mathbb{X},\mathbb{x},x$).

2.1 Interaction Graphs

Any network can be represented by a graph $\mathcal{G}=\{V,E\}$, where $V$ is a set of vertices and $E$ a set of edges. Denote by $\mathbb{A}_{|V|\times|V|}$ the adjacency matrix (with $|V|$ equal to the number of features $N$), where $\mathbb{A}_{i,j}$ represents the interaction between features $i$ and $j$, and takes the value $0$ if there is no interaction.

Let $\mathbb{x}_{{}_{1\times N}}$ be an array with measurements of features $1$ to $N$ for a given point in the data. In order to construct an instance-specific matrix, one can weigh the interaction between each pair of features with a function of their values’ product:

$$\mathbb{G}_{\mathbb{x}}(\mathbb{A})=\varphi(\mathbb{A})\circ\mathbb{x}^{\top}\mathbb{x},\tag{1}$$

where $\varphi(\mathbb{A})$ is some function of the network interaction matrix $\mathbb{A}$ , and the operator $\circ$ represents the Hadamard product, i.e. $(\mathbb{A}\circ\mathbb{B})_{i,j}=(\mathbb{A})_{i,j}(\mathbb{B})_{i,j}$ .
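As an illustration, equation 1 can be sketched in a few lines of dependency-free Python. This is a minimal sketch rather than the authors' implementation: the function name `instance_graph`, the element-wise application of $\varphi$ (the identity by default), and the toy values of `A` and `x` are all assumptions made for the example.

```python
def instance_graph(A, x, phi=lambda a: a):
    """Return G_x(A) = phi(A) ∘ (xᵀ x) as a nested list.

    A   : square interaction (adjacency) matrix, 0 meaning "no interaction"
    x   : list of N feature measurements for one data point
    phi : element-wise transform of A (assumed here; identity by default)
    """
    n = len(x)
    # Entry (i, j) of the outer product xᵀ x is x[i] * x[j]; the Hadamard
    # product multiplies it by the transformed interaction weight.
    return [[phi(A[i][j]) * x[i] * x[j] for j in range(n)] for i in range(n)]


# Toy example: three features, two known interactions (0-1 and 1-2).
A = [[0, 1, 0],
     [1, 0, 2],
     [0, 2, 0]]
x = [2.0, -1.0, 0.5]
G = instance_graph(A, x)
```

Pairs of features without a known interaction stay at zero in `G`, so only the edges of the prior network carry instance-specific weights.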

2.2 Graph Kernel

Unlike the distance in Euclidean geometry, which intuitively represents the length of a line between two points, there is no such tangible metric for graphs. Instead, one has to decide what is a reasonable evaluation for the difference between two graphs in the context of the problem.

A popular approach Gärtner et al. (2003) is to compare random walks on both graphs. The $(i,j)$th entry of the $k$th power of an adjacency matrix $\mathbb{A}_{|V|\times|V|}$: $\mathbb{A}^{k}=\underbrace{\mathbb{A}\mathbb{A}...\mathbb{A}}_{k\,times}$, corresponds to the number of walks of length $k$ from $i$ to $j$. For any function $\phi:X\rightarrow\mathcal{H}$ that maps the data into a feature space $\mathcal{H}$, $k(\mathbb{x},\mathbb{y})=\langle\phi(\mathbb{x}),\phi(\mathbb{y})\rangle$ is a kernel function. Using the original graph kernel formulation, it is possible to define a kernel that will implicitly map the data into a space where the interactions are incorporated:

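$$k_{n}(\mathbb{G},\mathbb{G}^{\prime})=\sum_{i,j=1}^{n}[\gamma]_{i,j}\left\langle[\mathbb{G}]^{i},[\mathbb{G}^{\prime}]^{j}\right\rangle_{F},\tag{2}$$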

where $\mathbb{G}$ and $\mathbb{G}^{\prime}$ correspond to $\mathbb{G}_{\mathbb{x}}(\mathbb{A})$ and $\mathbb{G}_{\mathbb{x}^{\prime}}(\mathbb{A})$ (see eq. 1); $\gamma_{i,j}$ is a function that “controls” the mapping $\phi(\cdot)$; and $n$ is the maximum allowed “random walk” length. If $\gamma$ is decomposed into $\mathbb{U}\mathbb{\Lambda}\mathbb{U}^{T}$, where $\mathbb{U}$ is a matrix whose columns are the eigenvectors of $\gamma$, and $\mathbb{\Lambda}$ a diagonal matrix with its eigenvalues at each diagonal entry, then equation 2 can be re-factored into:

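$$k_{n}(\mathbb{G},\mathbb{G}^{\prime})=\sum_{k,l=1}^{|V|}\sum_{i=1}^{n}\phi_{i,k,l}(\mathbb{G})\,\phi_{i,k,l}(\mathbb{G}^{\prime}),\tag{3}$$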

where $\phi_{i,k,l}(\mathbb{G})=\sum_{j=1}^{n}[\sqrt{\mathbb{\Lambda}}\mathbb{U}^{T}]_{i,j}[\mathbb{G}^{j}]_{k,l}$. Consequently, different forms of the function $\gamma$ can be chosen, with different interpretations. One such case is $\gamma_{i,j}=\theta^{i}\theta^{j}$, which yields:

$$\begin{aligned}k_{n}(\mathbb{G},\mathbb{G}^{\prime})&=\left\langle\sum_{i=1}^{n}\theta^{i}[\mathbb{G}]^{i},\sum_{j=1}^{n}\theta^{j}[\mathbb{G}^{\prime}]^{j}\right\rangle_{F}\\&=\left\langle\sum_{i=1}^{n}\theta^{i}[\mathbb{G}]^{i},\sum_{i=1}^{n}\theta^{i}[\mathbb{G}^{\prime}]^{i}\right\rangle_{F},\end{aligned}\tag{4}$$

Here the kernel entry can be interpreted as an inner product in a space where there is a feature for every node pair $\{k,l\}$, which represents the weighted sum of paths of length $1$ to $n$ from $k$ to $l$ $(\phi_{k,l}=\sum_{i=1}^{n}\theta^{i}\mathbb{G}_{k,l}^{i})$ Tsivtsivadze et al. (2011). The kernel can then be used with a method that employs the kernel trick, such as support vector machines, kernel PCA or kernel clustering. Another interesting case is when we consider the weighted sum of paths of length $1$ to $\infty$. This can be calculated using:

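$$k_{\infty}(\mathbb{G},\mathbb{G}^{\prime})=\left\langle e^{\beta\mathbb{G}},e^{\beta\mathbb{G}^{\prime}}\right\rangle_{F},\tag{5}$$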

since $e^{\beta\mathbb{G}}=\lim_{n\to+\infty}\sum_{i=0}^{n}\frac{\beta^{i}}{i!}\mathbb{G}^{i}$ , where $\beta$ is a parameter.
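The finite-length kernel with $\gamma_{i,j}=\theta^{i}\theta^{j}$ (equation 4) can be sketched as follows. This is an illustrative sketch using plain Python lists so that it has no dependencies; the helper names (`matmul`, `frobenius`, `weighted_walk_sum`, `walk_kernel`) and the toy graphs are hypothetical, and a practical implementation would more likely rely on a numerical library.

```python
def matmul(P, Q):
    """Product of two square matrices stored as nested lists."""
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def frobenius(P, Q):
    """Frobenius inner product <P, Q>_F = sum_kl P_kl * Q_kl."""
    return sum(p * q for rp, rq in zip(P, Q) for p, q in zip(rp, rq))


def weighted_walk_sum(G, theta, n):
    """Return sum_{i=1}^{n} theta^i * G^i (the feature map of equation 4)."""
    size = len(G)
    total = [[0.0] * size for _ in range(size)]
    power = [row[:] for row in G]            # holds G^i, starting at G^1
    for i in range(1, n + 1):
        weight = theta ** i
        for r in range(size):
            for c in range(size):
                total[r][c] += weight * power[r][c]
        power = matmul(power, G)             # advance to G^(i+1)
    return total


def walk_kernel(G1, G2, theta, n):
    """k_n(G, G') for gamma_ij = theta^i * theta^j."""
    return frobenius(weighted_walk_sum(G1, theta, n),
                     weighted_walk_sum(G2, theta, n))


G1 = [[0.0, 1.0], [1.0, 0.0]]
G2 = [[0.0, 2.0], [2.0, 0.0]]
k_2 = walk_kernel(G1, G2, theta=0.5, n=2)
```

Each matrix power costs a full matrix product, which is the main reason this family of kernels becomes expensive as the number of features grows.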

2.3 Graph Space Embedding

Since we are dealing with a universal interaction matrix for every data point and the interactions are feature specific, it makes sense to compare the same set of edges for every pair of points. As a consequence, we can also avoid solving time-consuming graph structure problems. With these two points in mind, we combined the previous graph kernel methods and the radial basis function (RBF) to develop a new kernel which we will henceforth refer to as Graph Space Embedding (GSE). The radial basis function is defined as:

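$$k(\mathbb{x},\mathbb{y})=e^{-\frac{||\mathbb{x}-\mathbb{y}||^{2}}{\sigma^{2}}}=c\,e^{\frac{2\langle\mathbb{x},\mathbb{y}\rangle}{\sigma^{2}}},\tag{6}$$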

where $c=e^{-\frac{||\mathbb{x}||^{2}}{\sigma^{2}}}e^{-\frac{||\mathbb{y}||^{2}}{\sigma^{2}}}$ . The GSE uses the distance $\left\langle\sqrt{\gamma}[\mathbb{G}],\sqrt{\gamma}[\mathbb{G}^{\prime}]\right\rangle_{F}$ in the radial basis function:

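$$k(\mathbb{G},\mathbb{G}^{\prime})=c\,e^{\frac{2\langle\mathbb{x},\mathbb{y}\rangle}{\sigma^{2}}}=c\underbrace{\sum_{n=0}^{\infty}\frac{\left(2\left\langle\sqrt{\gamma}\,\mathbb{G},\sqrt{\gamma}\,\mathbb{G}^{\prime}\right\rangle_{F}\right)^{n}}{\sigma^{2n}\,n!}}_{r\_w}\tag{7}$$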

If we then take the numerator of the fraction in $r\_w$ to be $\left[2\sum_{i=1}^{|E|}\gamma\,\mathbb{G}_{i}\mathbb{G}^{\prime}_{i}\right]^{n}$, we can use the multinomial theorem to expand each term of the exponential power series, and the expression for the kernel then becomes:

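$$k(\mathbb{G},\mathbb{G}^{\prime})=c\sum_{n=0}^{\infty}\underbrace{\left(\frac{2}{\nu}\right)^{n}}_{\lambda}\underbrace{\sum_{\boldsymbol{\alpha}^{n}(\cdot)}\frac{\prod_{i=1}^{|E|}[\mathbb{G}_{i}\mathbb{G}_{i}^{\prime}]^{\alpha_{i}}}{\prod_{i=1}^{|E|}\Gamma(\alpha_{i}+1)}}_{r\_e},\tag{8}$$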

where $\Gamma$ is the gamma function, $\mathbb{G}_{i}\in E$ is the value of edge i in $\mathbb{G}$ and $\nu=\frac{\sigma^{2}}{\gamma}$ . Here, $\boldsymbol{\alpha}^{n}(\cdot)$ represents a combination of $|E|$ integers: $(\alpha_{1},\alpha_{2},...,\alpha_{|E|})$ , with $\sum_{i}^{|E|}\boldsymbol{\alpha}_{i}^{n}(\cdot)=n$ , and the sum in $r\_e$ is taken over all possible combinations of $\boldsymbol{\alpha}^{n}(\cdot)$ . For instance, for $n=3$ in a graph with $|E|=5$ , possible examples of $\boldsymbol{\alpha}^{3}(\cdot)$ include $(0,1,1,1,0)$ or $(0,2,1,0,0)$ (see Figure 2).
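The multinomial step can be checked numerically on a small example. The sketch below enumerates every $\boldsymbol{\alpha}^{n}(\cdot)$ for a toy graph and confirms that the sum over combinations equals $\left(\sum_{i}\mathbb{G}_{i}\mathbb{G}^{\prime}_{i}\right)^{n}/n!$. The function names `compositions` and `r_e` and the edge values are assumptions chosen for illustration, and $\Gamma(\alpha_{i}+1)$ is evaluated as $\alpha_{i}!$ because every $\alpha_{i}$ is a non-negative integer.

```python
from itertools import product
from math import factorial, prod


def compositions(n, parts):
    """All tuples of `parts` non-negative integers that sum to n."""
    return [a for a in product(range(n + 1), repeat=parts) if sum(a) == n]


def r_e(edge_products, n):
    """Sum over alpha^n of prod_i p_i^alpha_i / prod_i alpha_i!.

    edge_products[i] stands for G_i * G'_i, the product of the values of
    edge i in the two graphs being compared.
    """
    total = 0.0
    for alpha in compositions(n, len(edge_products)):
        numerator = prod(p ** a for p, a in zip(edge_products, alpha))
        denominator = prod(factorial(a) for a in alpha)
        total += numerator / denominator
    return total


edge_products = [0.3, -1.2, 0.7]   # hypothetical values of G_i * G'_i
n = 3
lhs = sum(edge_products) ** n / factorial(n)   # one term of the power series
rhs = r_e(edge_products, n)                    # its multinomial expansion
```

For $n=3$ and $|E|=5$ the enumeration returns 35 combinations, among them $(0,1,1,1,0)$ and $(0,2,1,0,0)$ from the example in the text.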

We begin by noting that since the sum in $r\_e$ is taken over all combinations $(l,k)\in V\times V$ of size $n$, the GSE then represents a mapping from the input space to a space where all combinations of $n=0\rightarrow\infty$ edges are compared between $\mathbb{G}$ and $\mathbb{G}^{\prime}$, walks or otherwise (see Figure 2). Notice that this is in contrast with the kernel of equation 5, where the comparison is between a sum of all possible walks of length $n=0\rightarrow\infty$ from one node to another in the two graphs.

The GSE also allows repeated edges. However, if the data is normalized so that $\mu(\mathbb{G}_{i})\simeq 0,\sigma(\mathbb{G}_{i})\simeq 1$ , then both the power in the numerator and the denominator of $r\_e$ will effectively dampen most combinations with repeated edges, with a higher dampening factor for higher number of repetitions and/or combinations. Even for outlier values, the gamma function will quickly dominate the numerator of $r\_e$ . The $\lambda$ factor serves the purpose of shrinking the combinations with higher number of edges for $\nu>2$ . Finally, $\sigma^{2}$ now serves a dual purpose: the usual one in RBF to control the influence of points in relation to their distance (see equation 6), while at the same time controlling how much combinations of increasing order are penalized.
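In closed form, substituting $\sqrt{\gamma}\,\mathbb{G}$ for $\mathbb{x}$ in the radial basis function gives $k(\mathbb{G},\mathbb{G}^{\prime})=e^{-||\mathbb{G}-\mathbb{G}^{\prime}||_{F}^{2}/\nu}$ with $\nu=\sigma^{2}/\gamma$. The sketch below, which is an illustration under that reading rather than the authors' code, evaluates the kernel both directly and in the factored form $c\,e^{2\langle\mathbb{G},\mathbb{G}^{\prime}\rangle_{F}/\nu}$; the function names and the toy matrices are hypothetical.

```python
from math import exp


def frobenius(P, Q):
    """Frobenius inner product of two matrices stored as nested lists."""
    return sum(p * q for rp, rq in zip(P, Q) for p, q in zip(rp, rq))


def gse_kernel(G1, G2, nu):
    """exp(-||G1 - G2||_F^2 / nu), where nu = sigma^2 / gamma."""
    d = sum((p - q) ** 2 for rp, rq in zip(G1, G2) for p, q in zip(rp, rq))
    return exp(-d / nu)


def gse_kernel_factored(G1, G2, nu):
    """Same value written as c * exp(2 <G1, G2>_F / nu)."""
    c = exp(-frobenius(G1, G1) / nu) * exp(-frobenius(G2, G2) / nu)
    return c * exp(2.0 * frobenius(G1, G2) / nu)


G1 = [[0.0, -2.0], [-2.0, 0.0]]
G2 = [[0.0, 1.0], [1.0, 0.0]]
direct = gse_kernel(G1, G2, nu=4.0)
factored = gse_kernel_factored(G1, G2, nu=4.0)
```

The two functions agree up to floating-point error, which mirrors the step from equation 6 to equation 7.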

2.4 $\nu$ Feasibility Region

As discussed in the above section, the hyperparameter $\nu$ controls the shrinking of the contribution of higher order edge combinations. Intuitively, not all values of $\nu$ will yield a proper kernel matrix, since too large a value will leave out too many edge combinations while too small a value will saturate the kernel values. This motivates the search for a feasible operating region for $\nu$, where the kernel incorporates the necessary information for separability. Informally speaking, the kernel entry $k(\mathbb{G},\mathbb{G}^{\prime})$ measures the similarity of $\mathbb{G}$ and $\mathbb{G}^{\prime}$. In case too few/many edge combinations are considered, the variation of the kernel values will be equal to $1$. Therefore, we use the variation of the kernel matrix $\sigma^{2}(\mathbb{K})$ as a proxy to detect if $\nu$ is within acceptable bounds. We shall refer to the ability of the kernel to map the points in the data into separable images $\phi(\mathbb{x})$ as kernel expressiveness.

To determine this region analytically, we find the $\nu_{max}$ that yields the largest kernel variation, and then use the loss function around this value to determine in which direction $\nu$ should move for minimal loss.

Lemma 2.1.

$\max_{\nu}\,\,\sigma^{2}\left(\mathbb{K}(\nu)\right)$ can be numerically estimated and is guaranteed to converge with a learning rate $\alpha\leq\frac{D}{2(D-1)d_{max}}$, where $D$ is the total number of inter-graph combinations and $d_{max}$ is the largest combination distance.

Proof.

The analytical expression for the variance is:

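$$\sigma^{2}(\mathbb{K}(\nu))=E[\mathbb{K}(\nu)^{2}]-\underbrace{E[\mathbb{K}(\nu)]^{2}}_{b}=\left(\frac{D-1}{D}\right)\sum_{d=1}^{D}e^{-2\nu d}-\frac{1}{D^{2}}\sum_{i\neq j}^{D^{2}-D}2e^{-\nu(d_{i}+d_{j})},\tag{9}$$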

where we used the binomial theorem to expand $b$ , and $d=||\mathbb{G}-\mathbb{G}^{\prime}||^{2}$ . To guarantee the convergence of numerical methods the function derivative must be Lipschitz continuous:

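$$\frac{\|\mathbb{K}^{\prime}(\nu)-\mathbb{K}^{\prime}(\nu^{\prime})\|}{\|\nu-\nu^{\prime}\|}\leq L(\mathbb{K}^{\prime})\,:\forall\,\nu,\nu^{\prime},\tag{10}$$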

by overloading the notation: $\mathbb{K}^{\prime}(\nu)=\frac{\partial\sigma^{2}\left(\mathbb{K}(\nu)\right)}{\partial\nu}$ to simplify the expression. The left side of equation 10 becomes:

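$$\frac{\|\top-\Lambda\|}{\|\nu-\nu^{\prime}\|},\quad\Lambda=2\left(\frac{D-1}{D^{2}}\right)\left[\sum_{d=1}^{D}d\left(e^{-2\nu d}-e^{-2\nu^{\prime}d}\right)\right],\quad\top=\frac{2}{D^{2}}\sum_{i\neq j}^{D^{2}-D}(d_{i}+d_{j})\left(e^{-\nu(d_{i}+d_{j})}-e^{-\nu^{\prime}(d_{i}+d_{j})}\right).\tag{11}$$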

Since $0\leq e^{-\beta}\leq 1\,:\forall\,\beta\geq 0$, then:

$$\|\top-\Lambda\|\leq 2\left(\frac{D-1}{D}\right)d_{max}.\tag{12}$$

When $\epsilon=\nu-\nu^{\prime}\rightarrow 0$ :

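$$e^{-c\nu}-e^{-c\nu^{\prime}}=\underbrace{\frac{e^{c\nu^{\prime}}-e^{c\nu}}{e^{c(\nu+\nu^{\prime})}}}_{\delta}\to 0,\,:\nu,\nu^{\prime}>0,\tag{13}$$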

and $\delta$ tends to 0 much faster than $\epsilon$, since the denominator of $\delta$ is the exponential of the sum of $\nu$ and $\nu^{\prime}$. Thus, the function $\mathbb{K}^{\prime}(\nu)$ is Lipschitz continuous with constant equal to: $L(\mathbb{K}^{\prime}(\nu))=2\left(\frac{D-1}{D}\right)d_{max}$. ∎

We shall later demonstrate empirically that $\nu^{*}=\operatorname*{arg\,max}_{\nu}\,\sigma^{2}(\mathbb{K}(\nu))$ improves the class separability for our dataset.
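The search for $\nu^{*}$ can be sketched with plain gradient ascent and the step size from Lemma 2.1. This is a hedged illustration: it follows the proof's notation by taking each kernel value to be $e^{-\nu d}$ for a precomputed distance $d$, it uses fixed-step gradient ascent rather than any particular optimizer, and the function names, starting point and toy distances are assumptions.

```python
from math import exp


def kernel_values(distances, nu):
    """Kernel value exp(-nu * d) for each precomputed distance d."""
    return [exp(-nu * d) for d in distances]


def variance(distances, nu):
    """sigma^2(K(nu)) = E[K^2] - E[K]^2 over the D kernel values."""
    k = kernel_values(distances, nu)
    mean = sum(k) / len(k)
    return sum(v * v for v in k) / len(k) - mean * mean


def variance_grad(distances, nu):
    """Analytical derivative of the variance with respect to nu."""
    k = kernel_values(distances, nu)
    D = len(k)
    mean_k = sum(k) / D
    d_mean_k2 = sum(-2.0 * d * v * v for d, v in zip(distances, k)) / D
    d_mean_k = sum(-d * v for d, v in zip(distances, k)) / D
    return d_mean_k2 - 2.0 * mean_k * d_mean_k


def maximize_variance(distances, nu0=0.05, steps=2000):
    """Fixed-step gradient ascent using the learning-rate bound of Lemma 2.1."""
    D, d_max = len(distances), max(distances)
    lr = D / (2.0 * (D - 1) * d_max)
    nu = nu0
    for _ in range(steps):
        nu += lr * variance_grad(distances, nu)
    return nu


distances = [0.5, 1.0, 2.0, 3.0, 4.0]   # hypothetical inter-graph distances
nu_star = maximize_variance(distances)
```

Because the distances are computed once, each ascent step only re-evaluates exponentials, so the search stays cheap even for large graphs.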

2.5 Comparison with Standard Graph Kernels

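The original formulation of the graph kernel by Gärtner et al. (see eq. 2) multiplies sums of random walks of length $i$ from one node to another ($k\rightarrow l$) by sums of random walks $k\rightarrow l$ from the other graph being compared, of a length not necessarily equal to $i$:

$$k_{n}(\mathbb{G},\mathbb{G}^{\prime})=\sum_{i,j=1}^{n}[\gamma]_{i,j}\left\langle[\mathbb{G}]_{kl}^{i},[\mathbb{G}^{\prime}]_{kl}^{j}\right\rangle_{F}=\sum_{k,l=1}^{|V|}\sum_{i=1}^{n}[\mathbb{G}]_{kl}^{i}\sum_{j=1}^{n}[\gamma]_{i,j}[\mathbb{G}^{\prime}]_{kl}^{j}.\tag{14}$$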

The infinite length random walk formulation (see eq. 5) behaves in a similar way. Our method, though, always compares the same set of edges in the two graphs.

Another important difference is the complexity of our method versus the random-walk graph kernel. For an $m\times m$ kernel and $n\times n$ graph, the worst-case complexity for a length $k^{\prime}$ random walk kernel is $\mathcal{O}(m^{2}k^{\prime}n^{4})$ and $\mathcal{O}(m^{2}k^{\prime}n^{2})$ for dense and sparse graphs, respectively Vishwanathan et al. (2010). The GSE, on the other hand, is always $\mathcal{O}\left(m^{2}n^{2}\right)$ since the heaviest operation is the Frobenius inner product needed to compute the distance between $\mathbb{G}$ and $\mathbb{G}^{\prime}$. Moreover, once this distance is computed, evaluating the kernel for different values of $\nu$ is $\mathcal{O}(1)$, which, combined with the fact that the derivative of this kernel's variance is Lipschitz continuous, allows for efficient searching of optimal hyperparameters (see section 2.4).
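The cost argument can be made concrete with a small sketch: the pairwise squared Frobenius distances are computed once, and the kernel matrix for each candidate $\nu$ is then obtained with one exponential per entry. The helper names and the three toy graphs below are assumptions, and the kernel form $e^{-d/\nu}$ is the closed-form reading of the GSE used for illustration.

```python
from math import exp


def sq_frobenius_distance(P, Q):
    """||P - Q||_F^2 for two matrices stored as nested lists: O(n^2)."""
    return sum((p - q) ** 2 for rp, rq in zip(P, Q) for p, q in zip(rp, rq))


def pairwise_distances(graphs):
    """All m x m squared distances: O(m^2 n^2), computed a single time."""
    m = len(graphs)
    return [[sq_frobenius_distance(graphs[a], graphs[b]) for b in range(m)]
            for a in range(m)]


def kernel_matrix(dist, nu):
    """Kernel matrix for one value of nu: O(1) work per entry."""
    return [[exp(-d / nu) for d in row] for row in dist]


graphs = [
    [[0.0, 1.0], [1.0, 0.0]],
    [[0.0, 2.0], [2.0, 0.0]],
    [[0.0, -1.0], [-1.0, 0.0]],
]
dist = pairwise_distances(graphs)
# Re-use the same distances for every candidate value of nu.
kernels = {nu: kernel_matrix(dist, nu) for nu in (0.5, 2.0, 8.0)}
```

A grid search or gradient-based search over $\nu$ therefore never needs to revisit the graphs themselves.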

2.6 Interpretability

How could we better understand what the GSE is doing, when it maps points into an infinite-dimensional space? A successful recent development in explaining black-box models is that of Local Interpretable Model-agnostic Explanations (LIME) Ribeiro et al. (2016), where a model is interpreted locally by making slight perturbations in the input and building an interpretable model around the new predictions. We too shall monitor our model’s response to changes in the input, but instead of making random perturbations, we will perturb the input in the direction of maximum output change.

Given an instance from the dataset $\mathbb{x}_{1\times N}$, where $N$ is the number of features, and the function that will incorporate the feature connection network $\mathbb{G}_{\mathbb{x}}(\mathbb{A})$ (e.g. $\mathbb{G}_{\mathbb{x}}(\mathbb{A})=\mathbb{A}\circ\mathbb{x}^{\top}\mathbb{x}$), we will find the direction to which the model is the most sensitive (positive and negative). Unlike optimization, where the goal is to converge as fast as possible, here we are interested in the intermediate steps of the descent. This is because we shall use the set $\mathcal{\mathbb{G}}=\{\mathbb{G}_{\mathbb{x}_{1}},\,\mathbb{G}_{\mathbb{x}_{2}},\,...,\,\mathbb{G}_{\mathbb{x}_{M}}\}$ and the black-box model’s predictions $\mathbb{f}=\{f(\mathbb{x}_{1}),\,f(\mathbb{x}_{2}),\,...,\,f(\mathbb{x}_{M})\}$ to fit our interpretable model $h(\mathbb{G})\in\mathcal{H}$ (where $\mathbb{x}_{i}$ is a variation of the original sample $\mathbb{x}_{0}$, and $\mathcal{H}$ represents the space of all possible interpretable functions $h$). This way, we will indirectly unveil the interactions that our model is most sensitive to, and show how these impact the predictions. To penalize complex models over simpler ones, we will introduce a function $\Omega(h)$ that measures model complexity. To scale the model complexity term appropriately, we can find a scalar $\theta$ so that the expected value of $\theta\Omega(h)$ is equal to a fraction $\varepsilon$ of the expected value of the loss:

$$E[\theta\Omega(h)]=\varepsilon E[\mathcal{L}]\leftrightarrow\theta=\frac{\varepsilon E[\mathcal{L}]}{E[\Omega(h)]}.\tag{15}$$

Lastly, for highly non-linear models, the larger the input space the more complex the output explanations are likely to be, so we will weigh each sample deviation by its similarity to the original sample $\mathbb{x}_{0}$, using the model’s own similarity measure $k(\mathbb{G}_{\mathbb{x}_{i}},\mathbb{G}_{\mathbb{x}_{0}})$. Putting it all together:

$$\xi(\mathbb{x}_{0})=\min_{h\in\mathcal{H}}\mathcal{L}\Big(h,f,k(\mathbb{G}_{\mathbb{x}_{i}},\mathbb{G}_{\mathbb{x}_{0}})\Big)+\theta\Omega(h).\tag{16}$$

where $\mathcal{L}\Big{(}h,f,k(\mathbb{G}_{\mathbb{x}_{i}},\mathbb{G}_{\mathbb{x}_{0}})\Big{)}$ is the loss of $h$ when using $\mathbb{G}_{\mathbb{x}_{i}}$ to predict the black-box model output $f(\mathbb{x}_{i})$ , weighted by the kernel distance to the original sample $k(\mathbb{G}_{\mathbb{x}_{i}},\mathbb{G}_{\mathbb{x}_{0}})$ .
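A toy version of this selection rule is sketched below. Several choices are assumptions made purely for illustration: the loss is a kernel-weighted squared error, the expectations in equation 15 are taken over the candidate surrogates, and the candidate names, predictions and complexity scores are invented. The paper itself uses decision trees as surrogates and does not prescribe this code.

```python
def weighted_loss(predictions, targets, weights):
    """Kernel-weighted squared error between surrogate h and black box f."""
    total = sum(w * (p - t) ** 2
                for p, t, w in zip(predictions, targets, weights))
    return total / sum(weights)


def complexity_scale(losses, complexities, eps):
    """theta = eps * E[L] / E[Omega(h)], as in equation 15."""
    mean_loss = sum(losses) / len(losses)
    mean_complexity = sum(complexities) / len(complexities)
    return eps * mean_loss / mean_complexity


def select_surrogate(candidates, targets, weights, eps=0.1):
    """Pick the candidate minimising loss + theta * complexity.

    candidates: list of (name, predictions, complexity) tuples.
    """
    losses = [weighted_loss(p, targets, weights) for _, p, _ in candidates]
    theta = complexity_scale(losses, [c for _, _, c in candidates], eps)
    scores = [loss + theta * c
              for loss, (_, _, c) in zip(losses, candidates)]
    best = min(range(len(candidates)), key=scores.__getitem__)
    return candidates[best][0], scores[best]


targets = [0.2, 0.5, 0.9]      # black-box outputs f(x_i)
weights = [1.0, 0.8, 0.3]      # similarities k(G_xi, G_x0)
candidates = [
    ("stump", [0.3, 0.3, 0.8], 1),   # simple but less accurate
    ("deep", [0.2, 0.5, 0.9], 8),    # exact but complex
]
low_penalty = select_surrogate(candidates, targets, weights, eps=0.1)
high_penalty = select_surrogate(candidates, targets, weights, eps=2.0)
```

With a small $\varepsilon$ the exact but complex surrogate is selected, while a large $\varepsilon$ favours the simpler one, which is the trade-off that $\theta$ is meant to control.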

2.6.1 Even Descent Sampling Method

In order to adequately cover the most sensitive regions, we need to take steps with equidistant output values. Thus, we developed a novel adaptive method to sample more in steeper regions and less in flatter ones. The intuition is that we would like to approximate the function values in unexplored regions, so that we choose an appropriate sampling step while considering the uncertainty of the approximation. Due to the stochastic nature of the method, it is able to escape local extremes. Consider the value of function $f$ at a point $\mathbb{x}_{0}$ and its first order Taylor approximation at an arbitrary point $\mathbb{x}$ :

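$$f(\mathbb{x})\approx\hat{f}(\mathbb{x})=f(\mathbb{x}_{0})+\nabla_{\mathbb{x}}f(\mathbb{x}_{0})(\mathbb{x}-\mathbb{x}_{0}).\tag{17}$$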

The larger the difference $\delta=\mathbb{x}-\mathbb{x}_{0}$ , the less likely it is that the approximation error $f(\mathbb{x})-\hat{f}(\mathbb{x})$ is small. Assume we would like to model the random variable $F$ , which takes the value of $1$ if the approximation error is small ( $\delta=|\hat{f}(\mathbb{x})-f(\mathbb{x})|\approx 0$ ), and 0 otherwise. We will model the probability density function of $F$ as being:

$$p_{F}(f=1|\delta)=\lambda e^{-\lambda\delta}.\tag{18}$$

Consider also the random variable $T$ which takes the value of $1$ if the absolute difference in the output for a point $\mathbb{x}$ exceeds an arbitrary threshold ($|f(\mathbb{x})-f(\mathbb{x}_{0})|>\tau$), and $0$ otherwise. Assume there is zero probability this event occurs for sufficiently small steps: $\delta<a(\tau)$, for some value $a(\tau)$. Let us further assume that our confidence that $|f(\mathbb{x})-f(\mathbb{x}_{0})|>\tau$ increases linearly after the value $\delta=a(\tau)$, until the maximum confidence level is reached at $\delta=b$. After some value $\delta=c$, we decide not to make any further assumptions about this event, so we attribute zero probability from that point on. This can be modeled as:

$$p_{T}(t=1|\delta)=\begin{cases}\frac{2}{v}\frac{\delta-a(\tau)}{u},&a(\tau)<\delta\leq b\\\frac{2}{v},&b<\delta\leq c\\0,&\text{otherwise},\end{cases}\tag{19}$$

where $v=2c-a(\tau)-b\,,\,u=b-a(\tau)$ and $T=1$ if $|f(\mathbb{x})-f(\mathbb{x}_{0})|>\tau$ and $0$ otherwise. The distribution of interest is then $p_{S}=p(f=1\cap t=1|\delta)$. To simplify the calculations, we impose the uncertainty about our approximation (expressed by $F$) and the likelihood of a sufficiently large output difference (expressed by $T$) to be independent given $\delta$: $p(f=1\cap t=1|\delta)=p(t=1|\delta)p(f=1|\delta)$, and since the goal is to sample steps from this distribution, we will divide it by the normalization constant: $Z=p(f=1\cap t=1)=\int_{-\infty}^{+\infty}p(f=1\cap t=1|\delta)d\delta$. See Figure 3 for an illustration of the method.
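The step distribution can be sketched numerically as follows. This is an illustrative sketch rather than the authors' routine: the density $p_{S}\propto p_{T}\,p_{F}$ is discretised on a grid over $(a,c]$, $Z$ is approximated with the midpoint rule, and a step is drawn from the discretised density. The function names, the grid size and the parameter values are assumptions.

```python
import random
from math import exp


def p_f(delta, lam):
    """Exponential density lam * exp(-lam * delta) for delta >= 0."""
    return lam * exp(-lam * delta) if delta >= 0 else 0.0


def p_t(delta, a, b, c):
    """Ramp-then-plateau density: linear on (a, b], constant on (b, c]."""
    v, u = 2 * c - a - b, b - a
    if a < delta <= b:
        return (2.0 / v) * (delta - a) / u
    if b < delta <= c:
        return 2.0 / v
    return 0.0


def step_density(a, b, c, lam, grid=2000):
    """Discretised p_S on (a, c], normalised by Z via the midpoint rule."""
    h = (c - a) / grid
    points = [a + (i + 0.5) * h for i in range(grid)]
    weights = [p_t(d, a, b, c) * p_f(d, lam) for d in points]
    z = sum(weights) * h
    return points, [w / z for w in weights], h


def sample_step(a, b, c, lam, rng=random):
    """Draw one step size from the discretised density."""
    points, density, _ = step_density(a, b, c, lam)
    return rng.choices(points, weights=density, k=1)[0]


step = sample_step(a=0.1, b=0.4, c=1.0, lam=2.0)
```

Larger values of `lam` concentrate the mass near `a`, so steps become shorter when the linear approximation is trusted less.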

There are a couple of properties that can be manipulated for a successful sampling of the output space:

Controlled Termination

To force the algorithm to terminate after a minimum number of samples $M_{min}$ have been sampled, one can decrease the value of $a(\tau)$ with each iteration so that it becomes increasingly more likely that a value of $\delta$ will be picked such that $|f(\mathbb{x})-f(\mathbb{x}_{0})|<\tau$ , terminating the routine. For this purpose, one can compute the estimated threshold value $\tau_{0}$ that will keep the routine running.

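$$|\hat{f}(\mathbb{x})-f(\mathbb{x}_{0})|\geq\tau\Leftrightarrow\sum_{i=1}^{N}\nabla_{\mathbb{x}}f(\mathbb{x}_{0})[i]\,\delta[i]\geq\tau,\tag{20}$$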

where $N$ is the number of features. This is an underdetermined equation, but one possible trivial solution is to set:

$$\delta[i]\geq\frac{\tau}{N^{\prime}\nabla_{\mathbb{x}}f(\mathbb{x}_{0})[i]}\equiv\tau_{0},\tag{21}$$

where $N^{\prime}$ is the number of non-zero gradient values, then let $a(\tau)$ decay with time so that it will reach this limit value after $M_{min}$ iterations:

$$a(\tau)_{i}=\tau_{0}\left(1+\frac{\theta_{a}(M_{min}-i)}{M_{min}}\right).\tag{22}$$
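The termination schedule can be sketched as follows, under the assumption that $\tau_{0}$ is computed per feature from the non-zero gradient entries and that the decay is applied to one scalar $\tau_{0}$ at a time. The function names and the numerical values are hypothetical.

```python
def tau_zero(tau, gradient):
    """Per-feature step tau / (N' * grad_i) for each non-zero gradient entry.

    Returns a dict mapping feature index to its threshold step; features
    with zero gradient are skipped because they cannot change the output
    of the linear approximation.
    """
    n_nonzero = sum(1 for g in gradient if g != 0.0)
    return {i: tau / (n_nonzero * g)
            for i, g in enumerate(gradient) if g != 0.0}


def a_schedule(tau0, theta_a, m_min):
    """a(tau)_i = tau0 * (1 + theta_a * (M_min - i) / M_min), i = 0..M_min."""
    return [tau0 * (1.0 + theta_a * (m_min - i) / m_min)
            for i in range(m_min + 1)]


gradient = [0.5, 0.0, -2.0, 1.0]
steps = tau_zero(tau=0.3, gradient=gradient)
# With these steps the linearised change sums exactly to tau.
linear_change = sum(gradient[i] * d for i, d in steps.items())

schedule = a_schedule(tau0=0.2, theta_a=0.5, m_min=4)
```

The schedule starts at $\tau_{0}(1+\theta_{a})$ and reaches $\tau_{0}$ after $M_{min}$ iterations, at which point termination becomes progressively more likely.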

Escaping Local Extrema

To make it more likely to escape local extrema, one possibility is to set the cut-off value $c$ larger when the norm of $\tau_{0}$ (eq. 21) is larger than its expected value, and smaller otherwise:

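$$c=b\left(c_{l}+\frac{E[||\tau_{0}||_{2}]-||\tau_{0}||_{2}}{E[||\tau_{0}||_{2}]+||\tau_{0}||_{2}}\right),\quad c_{l}\in\,]2,+\infty[.\tag{23}$$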

This formulation allows jumping out from zones where the gradient is locally small, while taking smaller steps where the gradient is larger than expected.

Termination When Too Far from Original Sample

Since we are trying to explain the model locally, the sampling should terminate when the algorithm is exploring too far from the original sample. For that purpose, one can set $\lambda$ to increase with increasing distance $d$ to the original sample, pushing the probability density towards the left: $\lambda(d)=e^{-\frac{d}{\sigma^{2}}}$ .

Putting all of the above design considerations together, the complete routine is given in Algorithm 1.

3 Experiments

3.1 Materials

For all our analyses, we used plasma protein levels of patients with suspected coronary artery disease who were diagnosed for the presence of ischaemia Bom et al. (2018). A total of 332 protein levels were measured using proximity extension arrays Assarsson et al. (2014), and of the 196 patients, 108 were diagnosed with ischaemia. The protein-protein interaction data is available for download at StringDB Jensen et al. (2009). We implemented the GSE and the random-walk kernel in Python and used the scikit-learn implementations Pedregosa et al. (2011) for the other algorithms in the comparison.

3.2 Ischaemia Classification Performance

We benchmarked the GSE performance and running time when predicting ischaemia against the random-walk graph kernel, RBF, and random forests. Additionally, in order to test the hypothesis that the protein-interaction information is improving the analysis, we also tested the GSE using a constant matrix full of ones as the interaction matrix. For this benchmark, we performed a 10-cycle stratified shuffle cross-validation split on the normalized protein data and recorded the average ROC area under the curve (AUC). To speed up the analysis, we used a training set of 90 pre-selected proteins using univariate feature selection with the F-statistic Hira and Gillies (2015). The results are shown in Table 1.

The GSE outperformed all the other compared methods, and the fact that the GSE with a constant matrix (GSE*) had a lower performance increases our confidence that the prior interaction knowledge is beneficial for the analysis. The GSE is also considerably faster than the random-walk kernel, as expected. To test how both scale with increasing feature size, we compared the running time of both for different pre-selected numbers of proteins. The results are depicted in Figure 4.

3.3 Performance for Different $\nu$ Values

Recall from section 2.4 that a feasible operating region for the $\nu$ values in the GSE kernel was analytically determined. We wanted to investigate how the loss function performs within this region, and whether it is possible to draw conclusions regarding the GSE kernel behaviour with respect to the interactions. To test this, the $\nu^{\ast}=\operatorname*{arg\,max}_{\nu}\sigma^{2}[k(\nu)]$ was found using gradient descent (ADAM Kingma and Ba (2015)) on the training set over 20 stratified shuffle splits (same preprocessing as in section 3.2).

We then measured the ROC AUC on the validation set using 12 multiples of $\nu^{\ast}$. The results can be seen in Figure 5. It is quite interesting that our proxy for measuring kernel expressiveness turns out to be a concave function peaking at $\nu^{\ast}$.

3.4 Interpretability Test

To test how interpretable our model’s predictions are, first we trained the model on a random subset of our data and used the trained model to predict the rest of the data. Then we employed the method described in section 2.6 on a random patient in the test set, using decision trees as the interpretable models $h(\mathbb{G})\in\mathcal{H}$ , and a linear weighted combination of max depth and min samples per split as the complexity penalization term $\Omega(h)$ . We then picked the two most important features and made a 3d plot using an interpolation of the prediction space. The result is depicted in Figure 6.

The Even Descent Sampling tests instances which are approximately equidistant in the output values. For this patient, our model ’predicts’ that the ischaemia risk could be mitigated by lowering protein TIMP metallopeptidase inhibitor 4 (“TIMP4”) and the interaction between lipoprotein lipase (“LPL”) and renin (“REN”).

4 Conclusions

In this paper, we address the problem of analyzing interconnected systems and leveraging the often-known information about how the components interact. To tackle this task, we developed the Graph Space Embedding algorithm and compared it to other established methods using a dataset of proteins and their interactions from a clinical cohort to predict ischaemia. The GSE results outperformed the other algorithms in running time and average AUC. Moreover, we presented an optimal regime for the GSE in terms of a feasibility region for its parameters, which vastly decreases the optimization time. Finally, we developed a new technique for interpreting black-box models’ decisions, thus making it possible to inspect which features and/or interactions are the most relevant.

Bibliography

The reference list from the paper itself.

1. Assarsson et al. [2014] Erika Assarsson, Martin Lundberg, et al. Homogenous 96-plex PEA immunoassay exhibiting high sensitivity, specificity, and excellent scalability. PLoS ONE, 9(4):e95192, 2014.
2. Bereau et al. [2018] Tristan Bereau, Robert A. Di Stasio Jr., Alexandre Tkatchenko, and O. Anatole von Lilienfeld. Non-covalent interactions across organic and biological subsets of chemical space: Physics-based potentials parametrized from machine learning. The Journal of Chemical Physics, 148, 2018.
3. Bom et al. [2018] Michiel J. Bom, Evgeni Levin, Paul Knaapen, et al. Predictive value of targeted proteomics for coronary plaque morphology in patients with suspected coronary artery disease. EBioMedicine, 2018.
4. Borgwardt and Kriegel [2005] Karsten M. Borgwardt and Hans-Peter Kriegel. Shortest-path kernels on graphs. In Proceedings of the 5th International Conference on Data Mining, pages 74–81, 2005.
5. Defferrard et al. [2016] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
6. Fout et al. [2017] Alex Fout, Jonathon Byrd, Basir Shariat, and Asa Ben-Hur. Protein interface prediction using graph convolutional networks. In Advances in Neural Information Processing Systems, pages 6533–6542, 2017.
7. Gärtner et al. [2003] Thomas Gärtner, Peter Flach, and Stefan Wrobel. On graph kernels: Hardness results and efficient alternatives. In Computational Learning Theory and Kernel Machines, 16th Annual Conference on Computational Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, pages 129–143, 2003.
8. Hira and Gillies [2015] Zena M. Hira and Duncan F. Gillies. A review of feature selection and feature extraction methods applied on microarray data. Adv Bioinformatics, 2015.
