Graph Construction using Principal Axis Trees for Simple Graph Convolution

Mashaan Alshammari; John Stavrakakis; Adel F. Ahmed; Masahiro Takatsuka

arXiv:2302.12000·cs.LG·November 8, 2023


TL;DR

This paper proposes a novel graph construction method using Principal Axis trees combined with supervised information to improve GNN performance, especially with simple graph convolution, by effectively creating adjacency matrices.

Contribution

It introduces a new graph construction scheme leveraging PA-trees and supervised labels, enhancing GNNs' ability to learn from incomplete or missing adjacency information.

Findings

01. SGC outperforms GCN in speed and results.

02. Using supervised information improves graph construction.

03. Careful tuning of smoothing levels prevents oversmoothing.

Abstract

Graph Neural Networks (GNNs) are increasingly becoming the favorite method for graph learning. They exploit the semi-supervised nature of deep learning, and they bypass computational bottlenecks associated with traditional graph learning methods. In addition to the feature matrix $X$ , GNNs need an adjacency matrix $A$ to perform feature propagation. In many cases, the adjacency matrix $A$ is missing. We introduce a graph construction scheme that constructs the adjacency matrix $A$ using unsupervised and supervised information. Unsupervised information characterizes the neighborhood around points. We used Principal Axis trees (PA-trees) as a source for unsupervised information, where we create edges between points falling onto the same leaf node. For supervised information, we used the concept of penalty and intrinsic graphs. A penalty graph connects points with different class labels,…

Tables1

Table 1: Properties of tested datasets. $N$ = number of samples, $d$ = number of features.

Dataset	$N$	$d$	Train/Valid/Test	Source
Dataset 1	266	2	50/50/166	Zelnik-Manor and Perona (2004)
Dataset 2	399	2	50/50/199	Fränti and Sieranoja (2018)
Dataset 3	622	2	50/50/522	Zelnik-Manor and Perona (2004)
Dataset 4	788	2	50/50/688	Fränti and Sieranoja (2018)
iris	150	4	50/50/50	Pedregosa et al. (2011)
wine	178	13	50/50/78	Pedregosa et al. (2011)
BC-Wisc.	569	30	50/50/469	Pedregosa et al. (2011)
digits	1797	64	50/50/1697	Pedregosa et al. (2011)
Olivetti	400	4096	1024/1024/2048	Pedregosa et al. (2011)
PenDigits	10992	16	2748/2748/5496	Dua and Graff (2017)
mGamma	19020	10	4755/4755/9510	Dua and Graff (2017)
credit card	30000	24	7500/7500/15000	Dua and Graff (2017)

Equations

$$x *_{G} g = U\left(U^{\top}x \odot U^{\top}g\right), \tag{1}$$

$$x *_{G} g_{\theta} = U g_{\theta} U^{\top} x. \tag{2}$$

$$x *_{G} g_{\theta} = \sum_{i=0}^{K} \theta_{i} T_{i}(\tilde{L})\, x, \tag{3}$$

$$x *_{G} g_{\theta} = \theta_{0} x - \theta_{1} D^{-1/2} A D^{-1/2} x. \tag{4}$$

$$x *_{G} g_{\theta} = \theta \left(I_{n} + D^{-1/2} A D^{-1/2}\right) x. \tag{5}$$

$$H = X *_{G} g_{\Theta} = f\left(\bar{A} X \Theta\right), \tag{6}$$

$$\hat{Y} = \mathrm{softmax}\left(\bar{A} \cdots \bar{A}\, \bar{A} X \Theta^{(1)} \Theta^{(2)} \cdots \Theta^{(K)}\right), \tag{7}$$

$$\hat{Y} = \mathrm{softmax}\left(\bar{A}^{K} X \Theta\right). \tag{8}$$

$$A_{ij} = \exp\left(\frac{-d^{2}(i,j)}{\sigma}\right), \tag{9}$$

$$(x_{i}, x_{j}) \in E^{PA} \Leftrightarrow x_{i} \in W \text{ and } x_{j} \in W, \tag{10}$$

$$(x_{i}, x_{j}) \in E^{p} \Leftrightarrow y_{i} \neq y_{j}, \tag{11}$$

$$(x_{i}, x_{j}) \in E^{i} \Leftrightarrow y_{i} = y_{j}. \tag{12}$$


Code & Models

Repositories

mashaan14/PAtree-SGC (PyTorch, official)


Taxonomy

Topics: Advanced Graph Neural Networks · Graph Theory and Algorithms · Complex Network Analysis Techniques

Methods: Test · k-Nearest Neighbors · Convolution · Graph Convolutional Network

Full text

Graph Construction using Principal Axis Trees for Simple Graph Convolution

Mashaan Alshammari

Independent Researcher

Riyadh, Saudi Arabia

John Stavrakakis

School of Computer Science

The University of Sydney

NSW 2006, Australia

Adel F. Ahmed

Information and Computer Science Department

King Fahd University of Petroleum and Minerals

Dhahran, Saudi Arabia

Masahiro Takatsuka

School of Computer Science

The University of Sydney

NSW 2006, Australia

Abstract

Graph Neural Networks (GNNs) are increasingly becoming the favorite method for graph learning. They exploit the semi-supervised nature of deep learning, and they bypass computational bottlenecks associated with traditional graph learning methods. In addition to the feature matrix $X$ , GNNs need an adjacency matrix $A$ to perform feature propagation. In many cases, the adjacency matrix $A$ is missing. We introduce a graph construction scheme that constructs the adjacency matrix $A$ using unsupervised and supervised information. Unsupervised information characterizes the neighborhood around points. We used Principal Axis trees (PA-trees) as a source for unsupervised information, where we create edges between points falling onto the same leaf node. For supervised information, we used the concept of penalty and intrinsic graphs. A penalty graph connects points with different class labels, whereas an intrinsic graph connects points with the same class labels. We used the penalty and intrinsic graphs to remove or add edges to the graph constructed via PA-tree. We tested this graph construction scheme on two well-known GNNs: 1) Graph Convolutional Network (GCN) and 2) Simple Graph Convolution (SGC). The experiments show that it is better to use SGC because it is faster and delivers better or the same results as GCN. We also test the effect of oversmoothing on both GCN and SGC. We found out that the level of smoothing has to be carefully selected for SGC to avoid oversmoothing.

Keywords: Deep learning, Graph Convolutional Network (GCN), Simple Graph Convolution (SGC), Binary Space-Partitioning Trees (BSP-trees)

1 Introduction

Graph representation learning methods have gained popularity in recent years, largely because many machine learning problems are simple to model as graphs. Given a set of samples $X$ with $\{\vec{x_{1}},\vec{x_{2}},\cdots,\vec{x_{n}}\}$, one could construct a graph $G(V,E)$, where the set $V$ contains the feature vectors as graph nodes. The set $E$ holds the relations between feature vectors represented as graph edges Hart et al. (2000). Graph Neural Networks (GNNs) are among the most effective schemes for graph representation learning. They have been applied to sentiment analysis Zhou et al. (2020); Liang et al. (2021); Phan et al. (2022) and computer vision tasks Qi et al. (2021); Zhang et al. (2022); Jia et al. (2022).

The idea of designing a deep network for graph representation learning has come through multiple iterations Hammond et al. (2011); Bruna et al. (2013); Henaff et al. (2015); Defferrard et al. (2016). One of the well-known GNN methods is Graph Convolutional Network (GCN) Kipf and Welling (2017). Throughout the hidden layers, GCN performs feature propagation between neighbors on the graph, then a nonlinear transformation of the graph is passed to the next layer. The last layer in GCN was set as a softmax function to produce the labels on graph nodes. Simple Graph Convolution (SGC) was proposed by Wu et al. (2019) where they remove the nonlinearity between the layers. This means stacking $K$ hidden layers is a matrix multiplication between the feature matrix $X$ and the adjacency matrix $A$ for $K$ times.

Neither GCN nor SGC can create or modify graph edges, which means the graph has to be constructed before running either method. There are methods to modify the adjacency matrix $A$ to achieve higher accuracy Franceschi et al. (2019) or to improve robustness against attacks Jin et al. (2020). The problem is that these methods only work with GCN, because they need the transition between the hidden layers to perform adjacency matrix optimization. Since this transition is absent in SGC and replaced by matrix multiplication, these adjacency optimization methods are not compatible with SGC.

We propose a new graph construction scheme based on unsupervised and supervised information. The proposed scheme works with GCN and SGC. For unsupervised graph edges creation, we used Principal Axis trees (PA-trees) Sproull (1991); McNames (2001), which is one type of Binary Space Partitioning trees (BSP-trees) Ram and Gray (2013). We also used supervised information to create graph edges. The field of dimensionality reduction introduced the concept of penalty and intrinsic graph. We used this concept to create edges from the training data Yan et al. (2007). Both unsupervised and supervised information were blended in one adjacency matrix and processed by either GCN or SGC.

Our contributions can be summarized as follows:

• The proposed graph construction scheme uses Principal Axis trees (PA-trees) to highlight the density in the dataset. It also utilizes the training data to characterize intraclass compactness and interclass separability.

• We studied the effect of smoothing on GCN and SGC using the proposed graph construction. Our results provide empirical evidence that SGC is more vulnerable to oversmoothing than GCN. This supports the findings presented in Zhao and Akoglu (2019); Yang et al. (2020).

2 Related work

The learning task on graphs consists of two components: graph construction and learning algorithm. This study focuses on learning algorithms using deep learning. The next subsection introduces Graph Convolutional Network (GCN), while the following subsection discusses graph construction methods.

2.1 Graph Convolutional Network (GCN)

Performing learning tasks on graphs is one of the long-studied problems in machine learning literature. One of the oldest methods in this field is spectral clustering Shi and Malik (1997); Weiss (1999); Shi and Malik (2000); Ng et al. (2001); von Luxburg (2007). Given a set of samples $X$ with $\{\vec{x_{1}},\vec{x_{2}},\cdots,\vec{x_{n}}\}$ , spectral clustering method starts by constructing the adjacency matrix $A\in\mathbb{R}^{n\times n}$ using some pairwise similarity metric and the degree matrix $D_{ii}=\sum_{j}A_{ij}$ . Then, eigen decomposition is performed on the graph Laplacian $L=D^{-1/2}AD^{-1/2}$ to map the points into an embedding space. In that space, similar points fall closer to each other and can be detected using $k$ -means. The biggest hurdle for spectral clustering is decomposing an $n\times n$ adjacency matrix, which can be prohibitive with large datasets Defferrard et al. (2016); Shaham et al. (2018).

The superiority of spectral clustering comes from the mapping function. So, the question in the literature was: can we learn this mapping function instead of computing it through eigen decomposition? A deep network that learns new representations of the feature vectors $\{\vec{x_{1}},\vec{x_{2}},\cdots,\vec{x_{n}}\}$ can replace the deterministic mapping function in spectral clustering.

The connectivity of the graph $G$ is encoded in the graph Laplacian $L$ , and the graph Laplacian eigenvectors define the graph Fourier transform Shuman et al. (2013). Finding the new representation of a feature vector $x$ is done by performing the convolution between the input signal $x$ with a filter $g\in\mathbb{R}^{n}$ Wu et al. (2021):

$$x *_{G} g = U\left(U^{\top}x \odot U^{\top}g\right), \tag{1}$$

where $U$ is the matrix of the eigenvectors of the graph Laplacian and $\odot$ is the elementwise product. If we set the filter $g_{\theta}$ as $diag(U^{\top}g)$ , then the formula in equation 1 can be rewritten as:

$$x *_{G} g_{\theta} = U g_{\theta} U^{\top} x. \tag{2}$$

One of the earliest studies in this field was done by Bruna et al. (2013), where they designed a convolutional net that operates on the spectrum of the input features. However, their approach still needs the expensive eigen decomposition step with complexity $O(n^{3})$. The new direction of research was to approximate the filter $g_{\theta}$ using Chebyshev polynomials. This idea has come through a series of refinements by different studies Henaff et al. (2015); Defferrard et al. (2016). Graph convolutional network (GCN) introduces a first order approximation of Chebyshev polynomials. Approximating graph convolutions using Chebyshev polynomials takes the following form:

$$x *_{G} g_{\theta} = \sum_{i=0}^{K} \theta_{i} T_{i}(\tilde{L})\, x, \tag{3}$$

where $\tilde{L}=\frac{2L}{\lambda_{max}}-I_{n}$ . The Chebyshev polynomials are defined recursively as: $T_{i}(x)=2xT_{i-1}(x)-T_{i-2}(x)$ with $T_{0}(x)=1$ and $T_{1}(x)=x$ . GCN assumes that $K=1$ and $\lambda_{max}=2$ Kipf and Welling (2017), therefore equation 3 is simplified as:

$$x *_{G} g_{\theta} = \theta_{0} x - \theta_{1} D^{-1/2} A D^{-1/2} x. \tag{4}$$

GCN further assumes that $\theta=\theta_{0}=-\theta_{1}$ , which leads to a simpler definition of graph convolution:

$$x *_{G} g_{\theta} = \theta \left(I_{n} + D^{-1/2} A D^{-1/2}\right) x. \tag{5}$$
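The step from equation 3 to equation 5 can be written out term by term. The sketch below assumes that $L$ in the Chebyshev expansion is the normalized Laplacian $L=I_{n}-D^{-1/2}AD^{-1/2}$, as in Kipf and Welling (2017); this differs from the matrix $D^{-1/2}AD^{-1/2}$ used in the spectral clustering description above, and the sign in equation 4 only follows under this assumption.

$$
\begin{aligned}
\tilde{L} &= \frac{2L}{\lambda_{max}} - I_{n} = L - I_{n} = -D^{-1/2} A D^{-1/2} && (\lambda_{max}=2) \\
x *_{G} g_{\theta} &= \theta_{0} T_{0}(\tilde{L})\, x + \theta_{1} T_{1}(\tilde{L})\, x = \theta_{0} x + \theta_{1} \tilde{L} x && (K=1) \\
&= \theta_{0} x - \theta_{1} D^{-1/2} A D^{-1/2} x \\
&= \theta \left(I_{n} + D^{-1/2} A D^{-1/2}\right) x && (\theta=\theta_{0}=-\theta_{1})
\end{aligned}
$$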

GCN is a multilayer network where a single layer is defined as:

$$H = X *_{G} g_{\Theta} = f\left(\bar{A} X \Theta\right), \tag{6}$$

where $X$ is the feature matrix holding the feature vectors $\{\vec{x_{1}},\vec{x_{2}},\cdots,\vec{x_{n}}\}$ , and $\bar{A}=\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ with self loops added to the adjacency $\tilde{A}=A+I_{n}$ . $f(\cdot)$ is the activation function which was set as $ReLU(x)=max(0,x)$ for the hidden layers and $softmax(x)$ for the last layer.

A modification to GCN was introduced as Simple Graph Convolution (SGC) Wu et al. (2019). SGC removes the nonlinearity between GCN layers. In SGC, the learned representations $\hat{Y}$ of the input feature vectors $X$ are defined as:

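$$\hat{Y} = \mathrm{softmax}\left(\bar{A} \cdots \bar{A}\, \bar{A} X \Theta^{(1)} \Theta^{(2)} \cdots \Theta^{(K)}\right), \tag{7}$$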

where $K$ is the number of layers. Let $\bar{A}^{K}$ denote the repeated multiplication of the adjacency matrix and $\Theta=\Theta^{(1)}\Theta^{(2)}\cdots\Theta^{(K)}$ . Then, equation 7 can be rewritten as:

$$\hat{Y} = \mathrm{softmax}\left(\bar{A}^{K} X \Theta\right). \tag{8}$$

Given this definition, SGC brought down the computations in the hidden layers to a pre-processing step with no weights needed: $\bar{X}=\bar{A}^{K}X$. The final layer becomes a linear logistic regression classifier $\hat{Y}=softmax(\bar{X}\Theta)$.
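The pre-processing view of SGC can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `normalized_adjacency` and `sgc_precompute` are hypothetical, the graph is a toy 4-node path, and `Theta` is an untrained placeholder standing in for the weights that logistic regression would learn.

```python
import numpy as np

def normalized_adjacency(A):
    """Return A_bar = D~^{-1/2} (A + I) D~^{-1/2} (self-loops added)."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def sgc_precompute(A, X, K):
    """Weight-free propagation step: X_bar = A_bar^K X."""
    A_bar = normalized_adjacency(A)
    X_bar = X.copy()
    for _ in range(K):
        X_bar = A_bar @ X_bar
    return X_bar

def softmax(Z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

# Toy example (assumed data): a path graph 0-1-2-3 with 2 features per node.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

X_bar = sgc_precompute(A, X, K=2)   # computed once, before any training
Theta = np.eye(2)                   # placeholder weights, not trained
Y_hat = softmax(X_bar @ Theta)      # linear classifier on propagated features
```

Because `X_bar` does not depend on `Theta`, it only has to be computed once, which is the source of the speed advantage discussed in the experiments.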

Graph Convolutional Network (GCN) is still an active research area with several topics emerging from the literature. One of the topics is Graph AutoEncoder (GAE) where GCN is employed to compute node representations in the latent space Kipf and Welling (2016). A new contribution to this research was to replace the weight sharing in GCNs with factor sharing between reconstructed adjacency matrices to find similarities Chen et al. (2023). Another research track studies adversarial attacks on GCNs Dai et al. (2018). Recent work by Wu et al. (2022) proposed multi-view graph augmentation to defend against adversarial attacks.

2.2 Graph construction

The methods introduced in the previous section need an adjacency matrix $A$ to work on. For many applications, the adjacency matrix $A$ is not present, and researchers have to construct it from the feature matrix $X$ . It is important to mention some studies that perform adjacency matrix modifications while training the GCN. For example, Franceschi et al. (2019) designed a framework that modifies the adjacency matrix to improve the performance of GCN. Jin et al. (2020) modified the adjacency matrix to prevent malicious attacks from compromising the learning algorithm. Some researchers relied on feedback from GCN to improve the adjacency. Zhong et al. (2023) used a self-adaptive adjacency matrix network that learns the adjacency matrix based on feedback from the GCN network. The GCN network performs pseudo-labeling on the data. Then, the adjacency matrix learns the connections adaptively. They constructed the initial graph using the $k$ -nn graph. These methods use alternating optimization schema to update $\theta$ and $A$ . They are incompatible with SGC because SGC computes the hidden layers as a pre-processing step with no weight optimization.

Another approach is to construct the adjacency matrix $A$ independently from GCN. Ye et al. (2021) used Gaussian kernel to compute pairwise similarities between samples. The Gaussian kernel is a conventional choice to construct the adjacency matrix, but it needs the setting of the hyperparameter $\sigma$ . Another option is to involve supervised information in the adjacency matrix construction. Ma et al. (2023) proposed a new metric named cross-class neighborhood similarity (CCNS) to measure the similarity between nodes. CCNS quantifies how similar the neighborhoods of two nodes with the same label are across the entire graph.

Constructing the adjacency matrix $A$ involves identifying similar points and creating an edge linking them. In its simplest form, the adjacency matrix $A$ can be constructed using the Gaussian kernel:

$$A_{ij} = \exp\left(\frac{-d^{2}(i,j)}{\sigma}\right), \tag{9}$$

where $d\left(i,j\right)$ is the Euclidean distance between points $i$ and $j$, and $\sigma$ is a global scale set manually. Points separated by a small Euclidean distance are linked by an edge with large weight. There are two problems associated with constructing the adjacency matrix using the formula in equation 9: 1) the tuning of the parameter $\sigma$ and 2) the resulting adjacency matrix is not sparse.
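A short NumPy sketch of the Gaussian kernel in equation 9 makes both problems visible. The function name `gaussian_adjacency` and the three sample points are assumptions made for illustration; zeroing the diagonal is also an assumption, chosen because GCN adds self-loops separately.

```python
import numpy as np

def gaussian_adjacency(X, sigma):
    """Dense adjacency with A_ij = exp(-d^2(i, j) / sigma)."""
    sq_norms = (X ** 2).sum(axis=1)
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    d2 = np.maximum(d2, 0.0)          # guard against tiny negative round-off
    A = np.exp(-d2 / sigma)
    np.fill_diagonal(A, 0.0)          # assumption: no self-loops at this stage
    return A

# Three assumed 2-D points; sigma has to be chosen by hand.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])
A = gaussian_adjacency(X, sigma=1.0)
```

Every off-diagonal entry of `A` is strictly positive, however far apart the points are, so the matrix has no zeros to exploit unless a threshold is applied afterwards.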

Binary space-partitioning trees (BSP-trees) are very useful to define a hierarchical structure of the dataset Ram and Gray (2013). As we go deeper down a BSP-tree, the relevant neighborhood around the point $x$ is narrowed down. One of the famous BSP-trees is the principal-axis tree (PA-tree) Sproull (1991); McNames (2001). A PA-tree splits the feature vectors at the median along the first principal component. Constructing the adjacency matrix $A$ from PA-tree can be done by creating edges linking the points that fall into the same leaf node. Defining similarity between points using binary space-partitioning trees was implemented for spectral clustering Yan et al. (2019); Wang et al. (2019).

Graph construction using binary space-partitioning trees (BSP-trees) is done in an unsupervised way. Studies in the field of dimensionality reduction have constructed the graph using supervised information. A study by Yan et al. (2007) proposed the concept of penalty and intrinsic graphs. Edges in the penalty graph $G^{p}$ connect points from different classes. These edges were used to characterize the interclass separability. For intraclass compactness, they used the intrinsic graph $G^{i}$ that connects points from the same class.

From the review introduced in this section, we can identify three conditions for the graph $G$ to be passed to a deep network. First, it has to work with both GCN and SGC, regardless of the fact that SGC skips the nonlinearity between the hidden layers. Second, most of the graph edges have to be constructed in an unsupervised manner. Finally, the construction scheme must use the training samples to add edges to the graph or remove edges that link samples from different classes.

3 Graph construction for GCN and SGC

Our proposed graph construction method passes through two stages: 1) constructing graph edges using unsupervised information and 2) adding/removing edges from the graph based on supervised information. The proposed model expects feature vectors with the same dimensions. In the case of feature vectors with different dimensions, a preprocessing step is needed to ensure equal dimensions are passed to the neural net. The preprocessing can take the form of feature selection where the features with the most importance are kept. Another option is to apply dimensionality reduction on a subset of the features. The next subsections introduce the problem statement followed by graph construction stages.

3.1 Problem statement

The task of the proposed method is to perform node classification on the graph using two types of Graph Neural Networks (GNNs): GCN and SGC. Some notation is needed before we present the problem statement. Let $G=(V,E)$ be a graph where $V$ is the set of nodes and $E$ is the set of edges. Graph edges describe the similarity between each pair of points and are represented by the adjacency matrix $A\in\mathbb{R}^{n\times n}$. The feature matrix is $X=\{\vec{x_{1}},\vec{x_{2}},\cdots,\vec{x_{n}}\}\in\mathbb{R}^{n\times d}$, where $x_{i}$ is the feature vector for the node $v_{i}$. The graph $G$ can be represented using the adjacency and feature matrices $G=(A,X)$. In a node classification problem, only a subset of nodes $V_{l}=\{v_{1},v_{2},\cdots,v_{l}\}$ have known class labels $Y_{l}=\{y_{1},y_{2},\cdots,y_{l}\}$. The goal for a GNN is to learn a function $f_{\theta}:V_{l}\rightarrow Y_{l}$ that maps nodes to their corresponding labels, so that it can then predict the labels for unseen data.

With the introduction of these notations, the problem can be stated as follows:

Given a feature matrix $X$ and partial node label $Y_{l}$ in absence of the adjacency matrix $A$ , construct $A$ using a PA-tree, add/remove edges using penalty graph $G^{p}$ and intrinsic graph $G^{i}$ , then run GNN to perform node classification.

3.2 Constructing a graph using unsupervised information from PA-trees

Binary Space Partitioning trees (BSP-trees) provide a hierarchical view of the input points. Principal Axis trees (PA-trees) are one type of BSP-tree. The PA-tree algorithm starts by projecting all points in the dataset onto the first principal component and splitting them at the median. Points whose projections are less than the median are placed into the left child and the other points are placed into the right child. This process is repeated recursively until no leaf node holds more than the maximum number of data points $n_{0}$ Keivani and Sinha (2021). Algorithm 1 shows the steps for PA-tree construction.

We created edges from PA-trees by connecting the points falling into the same leaf node:

$$(x_{i}, x_{j}) \in E^{PA} \Leftrightarrow x_{i} \in W \text{ and } x_{j} \in W, \tag{10}$$

where $W$ is a leaf node. One parameter that influences this process is $n_{0}$, the maximum number of points allowed in a leaf node; a node holding at most $n_{0}$ points is not split further. In the experiments we set $n_{0}=20$, the same setting used by Yan et al. (2018, 2021).
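The recursive splitting and the leaf-based edge rule can be sketched as follows. This is an illustrative NumPy version rather than the authors' code: the names `pa_tree_leaves` and `pa_tree_adjacency` are hypothetical, the principal axis is recomputed at every node from the points in that node (an assumption), and edges are given unit weight.

```python
import numpy as np

def pa_tree_leaves(X, idx=None, n0=20):
    """Split recursively at the median of the projection onto the first
    principal axis; return a list of index arrays, one per leaf."""
    if idx is None:
        idx = np.arange(X.shape[0])
    if len(idx) <= n0:
        return [idx]
    P = X[idx] - X[idx].mean(axis=0)
    # The first right-singular vector of the centered points is the principal axis.
    _, _, Vt = np.linalg.svd(P, full_matrices=False)
    proj = P @ Vt[0]
    median = np.median(proj)
    left, right = idx[proj < median], idx[proj >= median]
    if len(left) == 0 or len(right) == 0:   # degenerate split caused by ties
        return [idx]
    return pa_tree_leaves(X, left, n0) + pa_tree_leaves(X, right, n0)

def pa_tree_adjacency(X, n0=20):
    """Connect every pair of points that share a leaf (unit edge weights)."""
    n = X.shape[0]
    A = np.zeros((n, n))
    for leaf in pa_tree_leaves(X, n0=n0):
        A[np.ix_(leaf, leaf)] = 1.0
    np.fill_diagonal(A, 0.0)
    return A

# Assumed toy data: 100 random 2-D points.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
leaves = pa_tree_leaves(X, n0=20)
A = pa_tree_adjacency(X, n0=20)
```

The resulting matrix is block-structured: each leaf contributes one fully connected block, and no edge crosses a leaf boundary, which is why a class that straddles a median split ends up disconnected.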

Figure 1 shows an example of constructing a graph using PA-trees. All points falling onto the same leaf node were fully connected. The points in the orange class that represents the smile, were split into two different tree branches. This can be explained by the position of this class. It stretches along the first principal component, and splitting at the median will break this class. This observation shows the importance of using supervised information to fill in these gaps created by unsupervised construction of the graph.

3.3 Constructing penalty and intrinsic graphs from the training data

From the graph shown in Figure 1, it is evident that unsupervised information cannot capture the high-level relationships between classes. We have to use the training feature vectors to capture these relationships. Yan et al. (2007) presented a framework to construct edges from the training feature vectors. They constructed two graphs. The first is a penalty graph $G^{p}$ (Figure 2-b) with edges connecting samples from different classes. This graph characterizes the interclass separability and is defined as:

$$(x_{i}, x_{j}) \in E^{p} \Leftrightarrow y_{i} \neq y_{j}, \tag{11}$$

where $y_{i}$ and $y_{j}$ are the class labels for the feature vectors $x_{i}$ and $x_{j}$ respectively.

The second graph is the intrinsic graph $G^{i}$ (Figure 2-c) which connects samples from the same class to identify the intraclass compactness. The intrinsic graph $G^{i}$ is defined as:

$$(x_{i}, x_{j}) \in E^{i} \Leftrightarrow y_{i} = y_{j}. \tag{12}$$

The proposed method relies on the unsupervised edges made by the PA-tree. The tree creates these edges based on the first principal component, which means it splits the points along the axis with the highest variance. However, this split does not suit all classes. Some classes stretch along the first principal component, for example, the class with the orange color in Figure 1-b, and are broken apart by the median split. The intrinsic graph helps us to compensate for such weaknesses. The intrinsic graph creates edges based on the class, not the location, of a point. For example, the class with the orange color was connected by the intrinsic graph in Figure 2-c.

The final adjacency matrix that was passed to GCN and SGC is the result of refining the graph produced by the PA-tree $G^{PA}$ . All edges in the penalty graph $G^{p}$ should be removed from the PA-tree graph $G^{PA}$ . Also, all edges in the intrinsic graph $G^{i}$ are added to the PA-tree graph $G^{PA}$ . The pseudocode in Algorithm 2 shows the steps for our graph construction method.
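Algorithm 2 is not reproduced in this text, so the refinement step is sketched here under stated assumptions: only labeled training nodes take part in the penalty and intrinsic graphs, edges carry unit weight, and the function name `refine_with_labels` is hypothetical.

```python
import numpy as np

def refine_with_labels(A_pa, train_idx, y_train):
    """Remove penalty-graph edges and add intrinsic-graph edges
    among the labeled training nodes; other entries are untouched."""
    A = A_pa.copy()
    train_idx = np.asarray(train_idx)
    y_train = np.asarray(y_train)
    same = y_train[:, None] == y_train[None, :]
    block = np.ix_(train_idx, train_idx)
    sub = A[block]              # copy of the training-node sub-matrix
    sub[~same] = 0.0            # penalty graph: different labels, drop edge
    sub[same] = 1.0             # intrinsic graph: same label, add edge
    A[block] = sub
    np.fill_diagonal(A, 0.0)    # assumption: no self-loops at this stage
    return A

# Assumed toy input: 5 nodes, fully connected except for the pair (0, 1).
A_pa = np.ones((5, 5)) - np.eye(5)
A_pa[0, 1] = A_pa[1, 0] = 0.0
# Nodes 0, 1, 2 are training nodes with labels 0, 0, 1.
A = refine_with_labels(A_pa, train_idx=[0, 1, 2], y_train=[0, 0, 1])
```

In this toy case the missing edge between nodes 0 and 1 (same label) is added, the edges from node 2 to nodes 0 and 1 (different labels) are removed, and every edge touching the unlabeled nodes 3 and 4 is left as the PA-tree produced it.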

4 Experiments and discussions

We designed the experiments to test the efficiency of the proposed graph construction scheme. We also examined different settings that have an influence over the learning algorithm. These settings include the level of smoothing used in GCN and SGC and the number of trees used to construct the graph. Our experiments include a comparison with ground truth adjacency and machine learning methods.

All the datasets used in the experiments are publicly available. We downloaded some datasets from the scikit-learn library Pedregosa et al. (2011); Buitinck et al. (2013), and we downloaded others from public repositories. Table 1 shows the properties of the datasets, their training splits, and their sources. We used three groups, each of which has four datasets. The first group is the 2-dimensional datasets Dataset 1 to Dataset 4. These are easy to visualize, and their classes were created artificially to make classification harder. The second group of datasets involves iris, wine, BC-Wisc., and digits. These datasets have a large number of dimensions. The last group of datasets involves Olivetti, PenDigits, mGamma, and credit card. These datasets have a large number of nodes, which will be useful in testing the running time for both SGC and GCN.

All experiments were coded in Python 3, and can be found on the following GitHub repository https://github.com/mashaan14/PAtree-SGC. Here are the properties of the machine used in the experiments: a Windows 11 machine with 20 GB of memory and a 3.10 GHz Intel Core i5-10500 CPU.

4.1 Testing the accuracy of GCN and SGC

The results for testing the accuracy of GCN and SGC are shown in Table 2. With Dataset 2 and Dataset 4, SGC outperforms GCN. But SGC falls short in Dataset 1 and Dataset 3. This can be explained by the nature of the sparse classes (i.e., points within the same class do not share a single mean). Since SGC performs feature propagation in the original feature space, points in these sparse classes get pulled towards other classes' means. GCN can be very useful in these situations because it uses a nonlinear transformation after each round of feature propagation. These nonlinear transformations map the points in sparse classes closer to each other, which helps improve the accuracy.

With the last eight datasets, SGC outperforms GCN in six datasets out of eight. This performance by SGC can be explained by the fact that most classes in these datasets have a single mean. Unlike small datasets, which are usually designed artificially with sparse classes, classes in real datasets have their own mean. GCN was the best performer in the iris and wine datasets. These datasets contain sparse classes, which can be better separated by nonlinear transformation.

Another aspect to look at when running GCN and SGC is the training time. The training process in GCN involves feature propagation and nonlinear transformation, while in SGC the training only involves feature propagation. Figure 4 shows the training time for all 12 datasets. For large datasets, the training time for SGC was very small compared to the GCN training time. There is a clear advantage for SGC over GCN, especially if we consider that SGC outperforms GCN in most of the datasets. The takeaway from this experiment is that we recommend using SGC because it is faster and will most likely deliver similar or better results than GCN. An exception would be if the user is certain that the dataset contains sparse classes where nonlinearity in GCN can be helpful.

4.2 The effect of graph smoothing on GCN and SGC accuracy

In a graph neural network (GNN), each node $v_{i}$ on the graph is associated with a feature vector $x_{i}$. The smoothing operation makes the features of nodes in the same cluster similar. This operation eases the classification task. Figure 5 shows graph smoothing in SGC.

The number of hidden layers in GCN and SGC controls the graph smoothing. Smoothing in GCN is coupled with a nonlinear transition from one hidden layer to another, while in SGC, smoothing is just a multiplication of the adjacency matrix $A$ by the feature matrix $X$. The risk of oversmoothing is that the features in different clusters become indistinguishable after a number of hidden layers Yang et al. (2020).
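Oversmoothing under repeated propagation can be observed on a toy graph. The sketch below is an assumed example, not one of the paper's experiments: two 5-node cliques are joined by a single edge, each cluster starts with a distinct scalar feature, and the gap between the cluster means is tracked as the number of propagation steps grows.

```python
import numpy as np

def normalized_adjacency(A):
    """Return A_bar = D~^{-1/2} (A + I) D~^{-1/2} (self-loops added)."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Two 5-node cliques joined by one bridge edge between nodes 4 and 5.
n = 10
A = np.zeros((n, n))
A[:5, :5] = 1.0
A[5:, 5:] = 1.0
np.fill_diagonal(A, 0.0)
A[4, 5] = A[5, 4] = 1.0

# One scalar feature: +1 in the first cluster, -1 in the second.
X = np.concatenate([np.ones(5), -np.ones(5)])[:, None]
A_bar = normalized_adjacency(A)

gaps = {}          # gaps[k] = distance between cluster means after k steps
X_k = X.copy()
for k in range(1, 51):
    X_k = A_bar @ X_k
    gaps[k] = float(abs(X_k[:5].mean() - X_k[5:].mean()))
```

After one step the clusters are still well separated, while after 50 steps `gaps[50]` is a small fraction of `gaps[1]`: without a nonlinearity or a carefully chosen $K$, the propagated features of the two clusters collapse towards each other.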

In this experiment, we test the effect of graph smoothing on GCN and SGC with our proposed graph construction. Eight datasets were tested, where we varied the number of hidden layers from $1$ to $50$ for both GCN and SGC. For each network setup, we averaged the test accuracy over $10$ runs. Out of the eight datasets (see Figure 6), GCN turns out to be more resilient than SGC in six datasets. By resilient we mean the ability to deliver better accuracy with an increasing number of layers. These results support the findings presented by Yang et al. (2020) where they stated that GCN has the ability to learn “anti-oversmoothing”.

The two datasets where SGC was better than GCN were Dataset 4 and BC-Wisc. This can be explained by the nature of classes in these two datasets, where points in each class share a single mean. In these situations, there is a low risk of oversmoothing, because with each layer of smoothing, node representations in one class move closer to the class mean. This is not the case when we have a sparse class, because with each layer of smoothing, node representations in the sparse class get mixed with the neighboring classes.

4.3 The effect of the number of trees on GCN and SGC accuracy

One factor that affects our graph construction scheme is the reliance on principal axis trees (PA-trees) to construct the graph. In this experiment, we investigate whether using another type of Binary Space Partitioning tree (BSP-tree) with an increasing number of trees would improve the test accuracy of GCN and SGC. We used random projection trees (RP-trees) Dasgupta and Freund (2008); Dasgupta and Sinha (2015). In RP-trees, a direction $\vec{r}$ is selected at random, and all points are projected onto this random direction. Then the points are split into left and right nodes depending on whether their projections are less or greater than a constant $c$, which is usually set to the median along $\vec{r}$. We set the number of trees in a range from $20$ to $100$.
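A single RP-tree split differs from a PA-tree split only in how the direction is chosen. The sketch below shows one such split in NumPy; the function name `rp_split` is hypothetical, the direction is drawn from a standard normal distribution and normalized (an assumption), and how edges from several trees are merged is not specified here.

```python
import numpy as np

def rp_split(X, idx, rng):
    """Split the points in idx at the median of their projection
    onto a random unit direction r."""
    r = rng.normal(size=X.shape[1])
    r /= np.linalg.norm(r)
    proj = X[idx] @ r
    c = np.median(proj)                      # split constant c = median along r
    return idx[proj < c], idx[proj >= c]

# Assumed toy data: 50 random 3-D points.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
idx = np.arange(X.shape[0])
left, right = rp_split(X, idx, rng)
```

Applying `rp_split` recursively until every node holds at most $n_{0}$ points gives one RP-tree; repeating with fresh random directions gives a forest.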

Table 3 shows the average test accuracy over $10$ runs. Increasing the number of RP-trees does not improve the performance of GCN and SGC. We can explain this by noting that splitting along random directions in RP-trees has the same effect as splitting along the principal component direction. Therefore, the performance of a single PA-tree is similar to the performance of multiple RP-trees, which makes the cost of constructing and storing $100$ RP-trees unjustified.

4.4 Comparing the constructed graph with a ground-truth adjacency

There are several datasets with ground truth adjacency matrices. The most widely used ones in Graph Neural Network (GNN) research are Citeseer and Cora Yang et al. (2016). Both are citation network datasets, where the set of nodes $V$ represents the documents and the set of edges $E$ represents the citation links. The features in Citeseer and Cora are bag-of-words (BoW) representations of the documents. Citeseer contains $3,327$ nodes, $9,104$ edges, and $3,703$ features. Cora contains $2,708$ nodes, $10,556$ edges, and $1,433$ features. We used PyG (PyTorch Geometric) Fey and Lenssen (2019) to download the Citeseer and Cora datasets.

For comparison, we used two graph construction methods that are commonly used in Graph Neural Network (GNN) research: the $\epsilon$ graph and the $k$-nn graph. The performance of these methods depends heavily on the selection of their hyperparameters. Therefore, we followed the recommendations in von Luxburg (2007) to set them. It is advised to set $\epsilon$ equal to the longest edge in the minimum spanning tree (MST). For the $k$-nn graph, it is recommended to set $k$ to $\log(N)$, where $N$ is the number of instances. For both graphs, we used the scikit-learn implementation Pedregosa et al. (2011) to construct them.
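The two hyperparameter heuristics can be sketched as follows. This is an illustration under stated assumptions rather than the exact code used in the experiments: the natural logarithm, rounding $k$ to the nearest integer, and the symmetrization of the $k$-nn graph are assumptions. The $\epsilon$ graph is obtained here by thresholding the same distance matrix used for the MST, so that the longest MST edge is included exactly; the $k$-nn graph uses scikit-learn's `kneighbors_graph`.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform
from sklearn.neighbors import kneighbors_graph

def epsilon_graph(X):
    """Connect points closer than eps, with eps = longest edge of the MST."""
    D = squareform(pdist(X))                    # dense Euclidean distance matrix
    eps = minimum_spanning_tree(D).data.max()   # zero entries of D count as "no edge"
    A = (D <= eps) & ~np.eye(len(X), dtype=bool)
    return A.astype(int), eps

def knn_graph(X):
    """k-nn graph with k set to log(N), rounded (natural log is an assumption)."""
    k = max(1, int(round(np.log(len(X)))))
    A = kneighbors_graph(X, n_neighbors=k, mode="connectivity", include_self=False)
    return A.maximum(A.T), k                    # symmetrize the directed k-nn graph

# Hypothetical usage on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
A_eps, eps = epsilon_graph(X)
A_knn, k = knn_graph(X)
n_components, _ = connected_components(A_eps, directed=False)
```

Because the $\epsilon$ graph contains every MST edge under this rule, it is connected by construction, which is the motivation for the heuristic.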

We compared our proposed method to the $\epsilon$ and $k$-nn graphs. Using the ground truth adjacency, we constructed a confusion matrix for each method. This confusion matrix has four cases: 1) the edge exists in neither our graph nor the ground truth graph; 2) the edge exists in the ground truth graph but was missed by our graph; 3) the edge was created by our graph but does not exist in the ground truth graph; 4) the edge exists in both our graph and the ground truth graph. These confusion matrices are shown in Table 4 and Table 5.
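The four cases can be counted with a few lines of NumPy. The sketch below assumes undirected graphs given as dense adjacency matrices and counts each unordered node pair once; whether Table 4 and Table 5 count pairs or matrix entries is not stated here, so this is one possible convention.

```python
import numpy as np

def edge_confusion(A_true, A_pred):
    """2x2 confusion counts over unordered node pairs (self-loops ignored).

    Rows: edge absent / present in the ground truth graph.
    Columns: edge absent / present in the constructed graph.
    """
    iu = np.triu_indices(A_true.shape[0], k=1)
    t = A_true[iu] > 0
    p = A_pred[iu] > 0
    return np.array([[np.sum(~t & ~p), np.sum(~t & p)],    # case 1, case 3
                     [np.sum(t & ~p),  np.sum(t & p)]])    # case 2, case 4

def from_edges(n, edges):
    """Helper for the toy example: symmetric adjacency from an edge list."""
    A = np.zeros((n, n), dtype=int)
    for u, v in edges:
        A[u, v] = A[v, u] = 1
    return A

# Hypothetical four-node example.
A_true = from_edges(4, [(0, 1), (1, 2)])
A_pred = from_edges(4, [(0, 1), (2, 3)])
cm = edge_confusion(A_true, A_pred)
```

In this toy example, one edge is shared, one ground truth edge is missed, one edge is created without support in the ground truth, and three pairs are unconnected in both graphs.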

Compared to the Citeseer ground truth graph, our proposed method removed $99.5\%$ of edges. This was much higher than for the $\epsilon$ graph, which only removed $77.4\%$. The $k$-nn graph removed more edges than our method, but it did not recover any of the edges present in the ground truth graph. The same was observed for the Cora ground truth graph, where the proposed and $k$-nn graphs performed better than the $\epsilon$ graph. The way the ground truth graph was created may explain the difference between the ground truth graph and the constructed graphs. The ground truth graph contains citation links, but the features are bags of words. It is not necessarily the case that two documents with similar bags of words cite each other. These high-level semantics are missing when we construct a graph only from the features.

4.5 Comparison with machine learning methods

This experiment was designed to compare our proposed method to well-known machine learning methods that do not require graph construction. We picked four machine learning methods: 1) $k$-nearest neighbor ($k$-nn), 2) support vector machine with radial basis function kernel (RBF SVM), 3) decision tree (DT), 4) random forest (RF). We used the scikit-learn implementations Pedregosa et al. (2011) to run these methods. Our selection of parameters was as follows: $k=5$ for $k$-nn, $\gamma=2$ for RBF SVM, and a maximum depth of 5 for the DT and RF classifiers.
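The four baselines with the stated parameters can be set up in scikit-learn roughly as follows. The train/test split, the random seeds, and the use of scikit-learn's bundled iris data are assumptions made only to keep the sketch self-contained; all other classifier parameters are left at their defaults.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Baselines with the parameters listed in the text.
baselines = {
    "k-nn": KNeighborsClassifier(n_neighbors=5),
    "RBF SVM": SVC(kernel="rbf", gamma=2),
    "DT": DecisionTreeClassifier(max_depth=5, random_state=0),
    "RF": RandomForestClassifier(max_depth=5, random_state=0),
}

# Assumed split: 50 training samples, stratified by class.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=50, stratify=y, random_state=0
)

# Test accuracy of each baseline on the held-out samples.
scores = {
    name: clf.fit(X_train, y_train).score(X_test, y_test)
    for name, clf in baselines.items()
}
```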

Using eight datasets, we compared the score of our method to the scores achieved by the other four machine learning methods. These scores are shown in Table 6. Our method achieved the best score on two datasets, iris and digits. These scores were achieved by GCN, not SGC. The RBF SVM and DT classifiers delivered inconsistent performance. They were the best performers on 2-dimensional datasets. However, their performance dropped significantly when tested on datasets with higher dimensions. The $k$-nn classifier achieved the highest score only once, on the wine dataset. This can be explained by the setting of the parameter $k$, which needs to be optimized for each dataset independently.

4.6 Ablation study

The ablation study was carried out so that each component of the proposed method is tested independently. We designed this experiment using four cases. The first is the base case, where the graph is constructed by adding the edges of the intrinsic graph $A^{i}$ to the edges found by the PA-tree graph $A^{PA}$, and then removing the edges found in the penalty graph $A^{p}$. The base case adjacency is represented as $A=(A^{PA}+A^{i})-A^{p}$. The second case is $A=A^{i}$, where we only pass the intrinsic graph to the neural net. The third case is $A=A^{PA}-A^{p}$, where we remove the penalty graph edges $A^{p}$ from the PA-tree graph $A^{PA}$. The final case is $A=A^{PA}$, where we ignore all supervised information and only pass the PA-tree graph $A^{PA}$ to the neural net.
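The four adjacency matrices can be assembled as in the sketch below. It assumes binary, symmetric adjacency matrices and the simplest reading of the supervised graphs, in which every pair of labeled points with the same class is linked in $A^{i}$ and every pair with different classes is linked in $A^{p}$. Clipping the result to $\{0,1\}$ is also an assumption, made so that the sum and difference behave like set union and set difference.

```python
import numpy as np

def supervised_graphs(y, train_mask):
    """Intrinsic graph A_i (same class) and penalty graph A_p (different class),
    built only between labeled (training) points."""
    labeled = train_mask[:, None] & train_mask[None, :]
    same = y[:, None] == y[None, :]
    off_diag = ~np.eye(len(y), dtype=bool)
    A_i = (labeled & same & off_diag).astype(int)
    A_p = (labeled & ~same).astype(int)
    return A_i, A_p

def ablation_cases(A_pa, A_i, A_p):
    """The four adjacency matrices of the ablation study, clipped to {0, 1}."""
    return {
        "(A_PA + A_i) - A_p": np.clip(A_pa + A_i - A_p, 0, 1),
        "A_i": A_i,
        "A_PA - A_p": np.clip(A_pa - A_p, 0, 1),
        "A_PA": A_pa,
    }

# Hypothetical five-node example; nodes 0, 1 and 2 are labeled.
y = np.array([0, 0, 1, 1, 0])
train_mask = np.array([True, True, True, False, False])
A_pa = np.zeros((5, 5), dtype=int)
for u, v in [(0, 2), (1, 4), (2, 3)]:       # edges from a hypothetical PA-tree
    A_pa[u, v] = A_pa[v, u] = 1

A_i, A_p = supervised_graphs(y, train_mask)
cases = ablation_cases(A_pa, A_i, A_p)
```

The third and fourth cases coincide whenever `A_pa * A_p` has no nonzero entry, that is, when no penalty edge is present in the PA-tree graph.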

All these cases were tested on both SGC and GCN. The results are shown in Table 7. On 2-dimensional datasets, the intrinsic graph $A^{i}$ did not achieve the best score. The best scores were achieved mainly by the PA-tree graph $A^{PA}$, either with the penalty graph edges $A^{p}$ removed or passed as is to the neural net. The intrinsic graph $A^{i}$ achieved the best score on the iris dataset. This could be explained by the percentage of training samples in the iris dataset: it has 50 training samples, which is 33% of the entire dataset, the highest percentage of training samples among all tested datasets.

In some cases, we get the same score from $A=A^{PA}$ (the PA-tree graph) and $A=A^{PA}-A^{p}$ (the PA-tree graph with penalty graph edges removed). This indicates that the edges in the penalty graph were already absent from the PA-tree graph.

4.7 Discussion

We conducted six experiments to test all aspects of our proposed method. We started by evaluating the accuracy of classification, where SGC and GCN performed similarly. Given its running time advantage, we recommend using SGC instead of GCN: it is faster and will most likely deliver similar performance.

Two factors that may affect the performance of the proposed method are graph smoothing and the number of trees used to construct the graph. We found that SGC is vulnerable to a drop in performance if the level of smoothing is not set carefully. For the number of trees, we did not find a correlation between increasing the number of trees and performance improvements. Therefore, we recommend using a single PA-tree to construct the graph.

We compared our constructed graph to ground truth graphs that are usually used in the literature. We found that our method constructed a graph that is much closer to ground truth than the one constructed by the $\epsilon$ graph. The $k$ -nn graph performed similarly to our method. However, our method has the advantage of not having hyperparameters such as $k$ , which might affect the performance of the $k$ -nn graph.

We also compared our method to well-known machine learning methods. Some of these methods delivered a strong performance with 2-dimensional data. However, their performance dropped when we tested them with higher dimensions. On the other hand, our method shows resilience to performance drops with higher dimensions.

Our last experiment was the ablation study. In that experiment, we found that the most effective component of our method is the PA-tree graph $A^{PA}$ . We constructed this graph based on unsupervised information.

5 Conclusion

Graph Neural Networks (GNNs) have become the go-to option for graph learning among researchers in the machine learning community. GNNs ease the computational demands associated with traditional graph learning techniques such as spectral clustering. The adjacency matrix $A$ is crucial for learning in GNNs. Despite some efforts to modify the adjacency matrix while training the GNN, these efforts are incompatible with some types of GNNs such as Simple Graph Convolution (SGC).

We present a graph construction scheme that uses unsupervised and supervised information to construct the adjacency matrix $A$. The proposed scheme is independent of GNN training, which makes it compatible with well-known types of GNNs such as Graph Convolutional Networks (GCNs) and Simple Graph Convolution (SGC). We used Principal Axis trees (PA-trees) as a source of unsupervised information to build the adjacency matrix. For supervised information, we used the concepts of penalty and intrinsic graphs from the dimensionality reduction field.

We designed the experiments to examine how GCN and SGC perform with the proposed graph construction in terms of test accuracy and training time. We also examined the factors that could affect the performance (e.g., graph smoothing and the number of BSP-trees). Other experiments compared the proposed method to ground truth adjacency matrices and to machine learning methods. We found that SGC can deliver better or similar test accuracy with far less training time compared to GCN. However, SGC was more vulnerable to the effects of graph smoothing than GCN. We also found that the proposed method constructed an adjacency matrix similar to the ground truth matrix. Based on these results, we recommend using SGC with the proposed graph construction because of its speed, but the level of graph smoothing has to be selected carefully.

The proposed method does not use feedback from GCN to optimize the adjacency matrix construction, which might be counted as a drawback. Some methods used a shared loss function for GCN training and adjacency matrix construction. Because we used two different neural net architectures (SGC and GCN), building a feedback loop or shared loss function in these two architectures could be an independent study by itself. Another direction to extend this study could be using other types of Binary Space-Partitioning Trees (BSP-trees) to explore unsupervised information.

Bibliography

1. Bruna et al. (2013) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs, 2013.
2. Buitinck et al. (2013) Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108–122, 2013.
3. Chen et al. (2023) Zhaoliang Chen, Zhihao Wu, Shiping Wang, and Wenzhong Guo. Dual low-rank graph autoencoder for semantic and topological networks. Proceedings of the AAAI Conference on Artificial Intelligence, 37(4):4191–4198, Jun. 2023. doi: https://doi.org/10.1609/aaai.v37i4.25536.
4. Dai et al. (2018) Hanjun Dai, Hui Li, Tian Tian, Xin Huang, Lin Wang, Jun Zhu, and Le Song. Adversarial attack on graph structured data. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1115–1124. PMLR, 10–15 Jul 2018.
5. Dasgupta and Freund (2008) Sanjoy Dasgupta and Yoav Freund. Random projection trees and low dimensional manifolds. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, STOC ’08, pages 537–546, New York, NY, USA, 2008. Association for Computing Machinery. ISBN 9781605580470. doi: https://doi.org/10.1145/1374376.1374452.
6. Dasgupta and Sinha (2015) Sanjoy Dasgupta and Kaushik Sinha. Randomized partition trees for nearest neighbor search. Algorithmica, 72(1):237–263, May 2015. ISSN 0178-4617. doi: https://doi.org/10.1007/s00453-014-9885-5.
7. Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. 2016. doi: https://doi.org/10.48550/ARXIV.1606.09375.
8. Dua and Graff (2017) Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.