Some Might Say All You Need Is Sum
Eran Rosenbluth, Jan Toenshoff, Martin Grohe

TL;DR
This paper investigates the expressivity of GNN aggregation functions when a single GNN must handle graphs of every size, showing that Sum GNNs cannot approximate some basic functions that Mean or Max GNNs compute exactly.
Contribution
It provides theoretical proofs that Sum GNNs cannot approximate basic functions computed by Mean or Max GNNs, highlighting the limitations of Sum aggregation.
Findings
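Sum GNNs cannot approximate certain basic functions that Mean or Max GNNs compute exactly.
Mean and Max GNNs do not subsume Sum GNNs either; with bounded input features, Sum GNNs can approximate every Mean or Max GNN.
Combining Sum with Mean or Max is more expressive than Sum alone.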
Some Might Say All You Need Is Sum
Eran Rosenbluth¹ (funded by the German Research Council (DFG), RTG 2236 (UnRAVeL))
Jan Toenshoff¹ (funded by the German Research Council (DFG), grants GR 1492/16-1; KI 2348/1-1 “Quantitative Reasoning About Database Queries”)
Martin Grohe¹
@informatik.rwth-aachen.de
¹RWTH Aachen University
Abstract
The expressivity of Graph Neural Networks (GNNs) is dependent on the aggregation functions they employ. Theoretical works have pointed towards Sum aggregation GNNs subsuming all other GNNs, while certain practical works have observed a clear advantage to using Mean and Max. An examination of the theoretical guarantee identifies two caveats. First, it is size-restricted, that is, the power of every specific GNN is limited to graphs of a specific size. Successfully processing larger graphs may require another GNN, and so on. Second, it concerns the power to distinguish non-isomorphic graphs, not the power to approximate general functions on graphs, and the former does not necessarily imply the latter.
It is desirable that a GNN’s usability not be limited to graphs of any specific size. Therefore, we explore the realm of unrestricted-size expressivity. We prove that basic functions, which can be computed exactly by Mean or Max GNNs, are inapproximable by any Sum GNN. We prove that under certain restrictions, every Mean or Max GNN can be approximated by a Sum GNN, but even there, a combination of (Sum, [Mean/Max]) is more expressive than Sum alone. Lastly, we prove further expressivity limitations for GNNs with a broad class of aggregations.
1 Introduction
Message passing graph neural networks (GNNs) are a fundamental deep learning architecture for machine learning on graphs. Most state-of-the-art machine learning techniques for graphs are based on GNNs. It is therefore worthwhile to understand their theoretical properties. Expressivity is one important aspect: which functions on graphs or their vertices can be computed by GNN models? To start with, functions computed by GNNs are always isomorphism invariant, or equivariant for node-level functions. A second important feature of GNNs is that a GNN can operate on input graphs of every size, since it is defined as a series of node-level computations with an optional graph-aggregating readout computation. These are desirable features that motivated the introduction of GNNs in the first place and may be seen as a crucial factor for their success. Research on the expressivity of GNNs has had a considerable impact in the field.
A GNN computation transforms a graph with an initial feature map (a.k.a. graph signal or node embedding) into a new feature map. The new map can represent a node-level function or can be “read out” as a function of the whole graph. The computation is carried out by a finite sequence of separate layers. On each layer, each node sends a real-valued message vector, which depends on its current feature vector, to all its neighbours. Then each node aggregates the messages it receives, using an order-invariant multiset function, typically entrywise summation (Sum), mean (Mean), or maximum (Max). Finally, the node features are updated using a neural network which receives as arguments the aggregation value and the node’s current feature. In the eyes of a GNN all vertices are equal: the message, aggregation and update functions of every layer are identical for every node, making GNNs auto-scalable and isomorphism-invariant.
By now, numerous works have studied the expressivity of GNNs and of their variants. However, many of the theoretical results have the following caveats:
- The expressivity considered is non-uniform: for a function that is defined on graphs of all sizes, it is asked whether for every $n$ there exists a GNN that expresses the function on graphs of size $n$. The expressing GNN may depend on $n$, and it may even be exponentially large in $n$. For some proofs, this exponential blow-up is necessary [?; ?]. This notion of expressivity is in contrast to uniform expressivity: for a function that is defined on graphs of all sizes, asking whether there exists one GNN that expresses the function on graphs of all sizes. In addition to being a significantly weaker theoretical notion, non-uniform expressivity leaves much to be desired also from a practical standpoint: it implies that a GNN may be no good for graphs of sizes larger than the sizes well-represented in the training data. This means that training may have to be done on very large graphs, and may have to be repeated often.
- The expressivity considered is the power to distinguish non-isomorphic graphs. A key theoretical result is the characterisation of the power of GNNs in terms of the Weisfeiler-Leman (WL) isomorphism test [?; ?], and subsequent works have used WL as a yardstick (see ’Related Work’). In applications of GNNs though, the goal is not to distinguish graphs but to regress or classify them or their nodes. There seems to be a hidden assumption that higher distinguishing power implies better ability to express general functions. While this is indeed the case in some settings [?], it is not the case under the uniform expressivity notion.
Our goal is to better understand the role that the aggregation function plays in the expressivity of GNNs. Specifically, we ask: Do Sum aggregation GNNs subsume Mean and Max GNNs, in terms of uniform expressivity of general functions?
A common perception is that an answer is already found in [?]: Sum-GNNs strictly subsume all other aggregation GNNs. Examining the details though, what is actually proven there is: in the non-uniform notion, considering a finite input domain, the distinguishing power of Sum-GNNs subsumes the distinguishing power of all other aggregation GNNs. Furthermore, in practice it has been observed that for certain tasks there is a clear advantage to using Mean and Max aggregations [?; ?; ?], with one of the most common models in practice using a variation of Mean aggregation [?]. While the difference between theoretical belief and practical evidence may be attributed to learnability rather than to expressivity, it calls for a better theoretical understanding of expressivity.
1.1 Our Contribution
All our results are in the uniform expressivity notion. Mainly, we prove that Sum-GNNs do not subsume Mean-GNNs nor Max-GNNs (and vice versa), in terms of vertex-embedding expressivity as well as graph-embedding expressivity. The statements in this paper consider additive approximation, yet the no-subsumption ones hold true also for multiplicative approximation.
- Advantage Sum. For the sake of completeness, in Section 3 we prove that even with single-value input features, the neighbors-sum function, which can be trivially exactly computed by a Sum-GNN, cannot be approximated by any Mean-GNN or Max-GNN.
- Sum subsumes. In Section 4 we prove that if the input features are bounded, Sum-GNNs can approximate all Mean-GNNs or Max-GNNs, though not without an increase in size which depends polynomially on the required accuracy, and exponentially on the depth of the approximated Mean-GNNs or Max-GNNs.
- Advantage Mean and Max. In Section 5.1 we show that if we allow unbounded input features then functions that are exactly computable by Mean-GNNs, Max-GNNs, and others cannot be approximated by Sum-GNNs.
- Essential also with finite input-features domain. In Section 5.2 we prove that even with just single-value input features, there are functions that can be exactly computed by a (Sum, Mean)-GNN (a GNN that uses both Sum-aggregation and Mean-aggregation) or by a (Sum, Max)-GNN, but cannot be approximated by Sum-GNNs.
- The world is not enough. In Section 6, we examine GNNs with any finite combination of Sum, Mean, Max, and other aggregations, and prove upper bounds on their expressivity already in the single-value input features setting.
Lastly, in Section 7 we experiment with synthetic data and observe that what we proved to be expressible is to an extent also learnable, and that in practice inexpressivity is manifested in a significantly higher error than implied in theory.
All proofs, some of the lemmas, and extended illustration and analysis of the experimentation, are found in the appendix.
1.2 Related Work
The term Graph Neural Network, along with one of the basic models of GNNs, was introduced in [?]. Since then, more than a few works have explored aspects of expressivity of GNNs. Some have explored the distinguishing power of different models of GNNs [?; ?; ?; ?; ?; ?; ?], and some have examined the expressivity of GNNs depending on the aggregations they use [?; ?]. In [?], a connection between distinguishing power and function approximation is described. In all of the above, the non-uniform notion was considered. In the uniform notion, it was proven that Sum-GNNs can express every logical formula in Guarded Countable Logic with 2 variables (GC2) [?; ?]. A theoretical survey of the expressivity of GNNs is found in [?], and a practical survey of different models of GNNs is found in [?].
2 Preliminaries
By we denote the sets of nonnegative integers, positive integers, rational numbers, and real numbers, respectively. For we denote the set by . For we denote the set by . For , we denote the set by . We may use the terms "average" and "mean" interchangeably to denote the arithmetic mean. We use "{}" as notation for a multiset. Let , we define the multiset consisting of instances of . Let and let a vector , we define . Let two vectors , we define : .
2.1 Graphs
An undirected graph is a pair, being a set of vertices and being a set of undirected edges. For a vertex we denote by the neighbourhood of in , and we denote the size of it by .
A (vertex) featured graph is a -tuple being a graph with a feature map , mapping each vertex to a -tuple over a set . We denote the set of graphs featured over by , we define , and we denote the set of all featured graphs by . The special set of graphs featured over {1} is denoted . We denote the set of all feature maps that map to by , we denote by , and we denote the set of all feature maps by . Let a featured-graph domain , a mapping to new feature maps is called a feature transformation.
For a featured graph and a vertex we define , , and . In this paper, we consider the size of a graph to be its number of vertices, that is, .
2.2 Feedforward Neural Networks
A feedforward neural network (FNN) $\mathfrak{F}$ is a directed acyclic graph where each edge carries a weight , each node of positive in-degree carries a bias , and each node $v$ has an associated continuous activation function $\mathfrak{a}^{\mathfrak{F}}_{v}$. The nodes of in-degree 0, usually , are the input nodes and the nodes of out-degree 0, usually , are the output nodes. We denote the underlying directed graph of an FNN $\mathfrak{F}$ by $\big(V(\mathfrak{F}),E(\mathfrak{F})\big)$, and we call $\big(V(\mathfrak{F}),E(\mathfrak{F}),(\mathfrak{a}^{\mathfrak{F}}_{v})_{v\in V(\mathfrak{F})}\big)$ the architecture of $\mathfrak{F}$, notated $A(\mathfrak{F})$. We drop the index $\mathfrak{F}$ at the weights and the activation function if $\mathfrak{F}$ is clear from the context.
The input dimension of an FNN is the number of input nodes, and the output dimension is the number of output nodes. The depth of an FNN is the maximum length of a path from an input node to an output node.
To define the semantics, let be an FNN of input dimension and output dimension . For each node , we define a function by for the th input node and
[TABLE]
for every node with incoming edges . Then computes the function defined by
[TABLE]
Let an FNN, we consider the size of to be the size of its underlying graph. That is, .
A common activation function is the ReLU activation, defined as $\operatorname{ReLU}(x) := \max(0, x)$. In this paper, we assume all FNNs to be ReLU activated. ReLU activated FNNs subsume every finitely-many-pieces piecewise-linear activated FNN, thus the results of this paper hold true for every such FNN. Every ReLU activated FNN is Lipschitz-Continuous. That is, there exists a minimal such that for every input and output coordinates , for every specific input arguments , and for every , it holds that
[TABLE]
We call the Lipschitz-Constant of .
2.3 Graph Neural Networks
Several GNN models are described in the literature. In this paper, we define and consider the Aggregate-Combine (AC-GNN) model [?; ?]. Some of our results extend straightforwardly to the messaging scheme of MPNN [?], yet such extensions are out of scope of this paper.
A GNN layer, of input and output (I/O) dimensions , is a pair such that: is an FNN of I/O dimensions , and is an order-invariant -dimension multiset-to-one aggregation function. An -layer GNN , of I/O dimensions , is a sequence of GNN layers of I/O dimensions such that: , and . It determines a series of feature transformations as follows: Let a graph and vertex , then , and for we define a transformation
[TABLE]
We notate by the final output of for . We define the size of to be the sum of its underlying FNNs’ sizes. We call $\big((A(\mathfrak{F}_{1}),agg_{1}),\ldots,(A(\mathfrak{F}_{m}),agg_{m})\big)$ the architecture of , notated , and say that realizes . For an aggregation function , we denote by -GNNs the class of GNNs for which . For aggregation functions , we denote by -GNNs the class of GNNs with layers such that .
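The aggregate-combine step described above can be sketched in a few lines. This is a minimal illustration, not the construction used in the paper: the update network is reduced to a single ReLU layer, every vertex is assumed to have at least one neighbour, and the names `aggregate`, `relu_layer` and `gnn_layer` are hypothetical.

```python
def aggregate(messages, agg):
    """Entrywise, order-invariant aggregation of a non-empty list of vectors."""
    columns = list(zip(*messages))
    if agg == "sum":
        return [sum(col) for col in columns]
    if agg == "mean":
        return [sum(col) / len(col) for col in columns]
    if agg == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown aggregation: {agg}")


def relu_layer(x, weights, biases):
    """Single ReLU layer: output j is max(0, biases[j] + sum_i x[i] * weights[i][j])."""
    return [
        max(0.0, b + sum(xi * row[j] for xi, row in zip(x, weights)))
        for j, b in enumerate(biases)
    ]


def gnn_layer(features, neighbors, weights, biases, agg):
    """Apply one aggregate-combine step to every vertex with the same parameters."""
    new_features = []
    for v, nbrs in enumerate(neighbors):
        aggregated = aggregate([features[u] for u in nbrs], agg)
        # The update network sees the vertex's own feature followed by the aggregate.
        new_features.append(relu_layer(features[v] + aggregated, weights, biases))
    return new_features


# Path graph 0 - 1 - 2 with scalar features 1, 2, 3.
features = [[1.0], [2.0], [3.0]]
neighbors = [[1], [0, 2], [1]]
weights = [[0.0], [1.0]]  # ignore the own feature, pass the aggregate through
biases = [0.0]

sum_out = gnn_layer(features, neighbors, weights, biases, "sum")
mean_out = gnn_layer(features, neighbors, weights, biases, "mean")
max_out = gnn_layer(features, neighbors, weights, biases, "max")
```

On this toy graph the three aggregations differ only at the middle vertex, the one vertex with more than one neighbour.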
2.4 Expressivity
Let , and a set . Let a set of feature transformations, and let a feature transformation . We say uniformly additively approximates , notated , if and only if
[TABLE]
The essence of uniformity is that one function ”works” for graphs of all sizes, unlike non-uniformity where it is enough to have a specific function for each specific size of input graphs. The proximity measure is additive - as opposed to multiplicative where it is required that . In this paper, approximation always means uniform additive approximation and we use the term ”approximates” synonymously with expresses. Although our no-approximation statements consider additive approximation, they hold true also for multiplicative approximation, and the respective proofs (in the appendix) require not much additional argumentation to show that.
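The contrast between the two notions can be written out as follows. The displayed definition did not survive in this copy, so the notation here is an assumption: $F$ is the set of feature transformations, $f$ the target, $\mathcal{G}$ the graph domain, and the max-norm is one possible choice of distance.

```latex
% Uniform additive approximation (assumed notation): one f' serves all graph sizes.
\forall \varepsilon > 0 \;\; \exists f' \in F \;\; \forall G \in \mathcal{G} \;\; \forall v \in V(G) :
  \quad \lVert f'(G)(v) - f(G)(v) \rVert_\infty \le \varepsilon

% Non-uniform variant: the approximating f'_n may depend on the graph size n.
\forall \varepsilon > 0 \;\; \forall n \;\; \exists f'_n \in F \;\; \forall G \in \mathcal{G},\ |V(G)| = n \;\; \forall v \in V(G) :
  \quad \lVert f'_n(G)(v) - f(G)(v) \rVert_\infty \le \varepsilon
```

The only difference is the position of the quantifier over the graph size relative to the choice of the approximating function.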
Let be sets of feature transformations , we say subsumes , notated if and only if for every it holds that . If the subsumption holds only for graphs featured with a subset we notate it as .
Let . We call an order-invariant mapping , from feature maps to -tuples, a readout function. Both and are commonly used to aggregate feature maps, possibly followed by an FNN that maps the aggregation value to a final output. We call a mapping , from featured graphs to -tuples, a graph embedding. Let , let a set of feature transformations , and let a readout , we notate the set of embeddings by . We use the expressivity terms and notations defined for feature transformations, for graph embeddings as well.
3 Mean and Max Do Not Subsume
It has already been stated that Sum-GNNs can express functions that Mean-GNNs and Max-GNNs cannot [?]. For the sake of completeness we provide formal proofs that Mean-GNNs and Max-GNNs subsume neither Sum-GNNs nor each other.
3.1 Mean and Max do not subsume Sum
Neither Mean-GNNs nor Max-GNNs subsume Sum-GNNs, even when the input-feature domain is a single value.
We define a featured star graph with (a parameter) leaves, (see Figure 3): For every :
- •
- •
- •
Let be an -layer GNN. We define , the feature of after operating the first layers of . Note that .
Lemma 3.1.
Assume is a Mean-GNN or a Max-GNN. Let the maximum input dimension of any layer be , and let the maximum Lipschitz-Constant of any FNN of be . Then, for every it holds that .
Theorem 3.2.
Let a feature transformation such that for every it holds that . Then, Mean-GNNs $\not\approx f$ and Max-GNNs $\not\approx f$.
Note that by Theorem 3.2, a function such as neighbors-count is inexpressible by Mean-GNNs and Max-GNNs.
Corollary 3.3.
We have that Mean-GNNs $\not\geq$ Sum-GNNs, Max-GNNs $\not\geq$ Sum-GNNs.
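The intuition behind this subsection can be checked numerically on a featureless star. The sketch below is only an illustration and not the proof: it assumes every vertex starts with feature 1, uses a hypothetical update rule `relu(own + aggregate)` in every layer, and relies on the symmetry that all leaves carry the same value throughout the computation.

```python
def run_star(k, agg, layers=2):
    """Center feature of a star with k leaves after `layers` toy GNN layers."""
    center, leaf = 1.0, 1.0
    for _ in range(layers):
        if agg == "sum":
            center_in = k * leaf  # k identical leaf values are added up
        elif agg in ("mean", "max"):
            center_in = leaf      # k identical values aggregate to that value
        else:
            raise ValueError(f"unknown aggregation: {agg}")
        leaf_in = center          # each leaf has the center as its only neighbour
        center, leaf = max(0.0, center + center_in), max(0.0, leaf + leaf_in)
    return center


mean_small, mean_large = run_star(3, "mean"), run_star(1000, "mean")
max_small, max_large = run_star(3, "max"), run_star(1000, "max")
sum_three, sum_five = run_star(3, "sum"), run_star(5, "sum")
```

With Mean or Max the center value does not depend on the number of leaves, so no choice of weights in this toy model could track a target such as the neighbors-count, whereas the Sum variant grows with `k`.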
3.2 Mean and Max do not subsume each other
Mean-GNNs and Max-GNNs do not subsume each other, even in a finite input-feature domain setting. We define a parameterized graph in which, depending on the parameters’ arguments, the average of the center’s neighbors is in while their max can be either [math] or . For every and :
- •
- •
- •
Theorem 3.4.
Let a feature transformation such that for every it holds that . Then, Max-GNNs $\not\approx f$.
Theorem 3.5.
Let a feature transformation such that for every it holds that . Then, Mean-GNNs $\not\approx f$.
Corollary 3.6.
We have that Max-GNNs $\not\geq$ Mean-GNNs, Mean-GNNs $\not\geq$ Max-GNNs.
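Two small multiset comparisons convey why neither aggregation can stand in for the other. The multisets below are hypothetical and are not the parameterized graph of this subsection, whose definition is not recoverable here.

```python
from statistics import fmean

# Max cannot see multiplicities: same max, different mean.
small, large = [0, 1], [0, 0, 0, 1]
same_max = max(small) == max(large)
mean_gap = fmean(small) - fmean(large)


def gaps(k):
    """Compare k zeros plus a single one against k + 1 zeros."""
    with_one = [0] * k + [1]
    all_zero = [0] * (k + 1)
    mean_difference = fmean(with_one) - fmean(all_zero)
    max_difference = max(with_one) - max(all_zero)
    return mean_difference, max_difference


mean_diff_999, max_diff_999 = gaps(999)
```

As `k` grows, the two mean values become arbitrarily close while the two max values stay a full unit apart, which is the kind of gap a Lipschitz-continuous update network on top of Mean cannot bridge.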
4 Sometimes Sum Subsumes
In a bounded input-feature domain setting, Sum-GNNs can express every function that Mean-GNNs and Max-GNNs can. The bounded input-feature domain results in a bounded range for Mean and Max, a fact which can be exploited to approximate the target GNN with a Sum-GNN. The approximating Sum-GNNs that we describe come at a size cost. We do not know whether an asymptotically lower-cost construction exists.
4.1 Mean by Sum
Sum-GNNs subsume Mean-GNNs in a bounded input-feature domain setting.
Lemma 4.1.
For every and , there exists a Sum-GNN of size such that for every featured graph it holds that .
Theorem 4.2.
Let a Mean-GNN consisting of layers, let the maximum input dimension of any layer be , and let the maximum Lipschitz-Constant of any FNN of be . Then, for every there exists a Sum-GNN such that:
.
- 2.
.
Corollary 4.3.
In a bounded input-feature domain setting, Sum-GNNs subsume Mean-GNNs.
4.2 Max by Sum
Sum-GNNs subsume Max-GNNs in a bounded input-feature domain setting.
Lemma 4.4.
For every and , there exists a Sum-GNN of size such that for every featured graph and vertex it holds that .
Theorem 4.5.
Let a Max-GNN consisting of layers, let the maximum input dimension of any layer be , and let the maximum Lipschitz-Constant of any FNN of be . Then, for every there exists a Sum-GNN such that:
.
- 2.
.
Corollary 4.6.
In a bounded input-feature domain setting, Sum-GNNs subsume Max-GNNs.
5 Mean and Max Have Their Place
In two important settings, Mean and Max aggregations enable expressing functions that cannot be expressed with Sum alone. As in Section 3, we define a graph parameterized by over domain . We define a feature transformation on that graph and prove that it cannot be approximated by Sum-GNNs. The line of proofs (in the appendix) is as follows:
1. We show that for every Sum-GNN there exists a finite set of polynomials of , such that those polynomials have a certain property , and it holds that:
[TABLE]
2. We show that for every finite set of polynomials that have , it holds that:
[TABLE]
5.1 Unbounded, Countable, Input-Feature Domain
In an unbounded input-feature domain setting, Mean, Max, and other GNNs are not subsumed by Sum-GNNs. We define a graph (see Figure 3): For ,
- •
- •
- •
Theorem 5.1.
Let a feature transformation, such that for every it holds that . Then, Sum-GNNs $\not\approx f$.
Corollary 5.2.
Denote by the set of all multisets over . Let $g$ be an aggregation such that , that is, $g$ aggregates every homogeneous multiset to its single unique value. Then, Sum-GNNs $\not\geq^{\mathbb{N}}$ $g$-aggregation GNNs.
Corollary 5.2 implies a limitation of Sum-GNNs compared to GNNs that use Mean, Max, or many other aggregations.
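The condition in Corollary 5.2, that a homogeneous multiset is aggregated to its single unique value, is easy to test for concrete aggregations. The helper below is a hypothetical sketch that checks the property on a small grid of values and multiset sizes; it is a spot check, not a proof.

```python
from statistics import fmean


def is_identity_on_homogeneous(agg, values=range(1, 6), sizes=range(1, 6)):
    """True if agg maps every tested multiset {x, ..., x} of k copies to x."""
    return all(agg([x] * k) == x for x in values for k in sizes)


mean_ok = is_identity_on_homogeneous(fmean)
max_ok = is_identity_on_homogeneous(max)
sum_ok = is_identity_on_homogeneous(sum)
```

Mean and Max pass the check, while Sum maps `k` copies of `x` to `k * x` and therefore fails as soon as `k` is at least 2.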
Graph Embedding
Sum-GNNs are limited compared to Mean, Max, and other GNNs, not only when used to approximate vertices’ feature transformations but also when used in combination with a readout function to approximate graph embeddings. Consider another variant of : For ,
- •
- •
- •
Theorem 5.3.
Let a graph embedding such that . Let an aggregation and an FNN , and define a readout . Then, $\mathfrak{ro}\circ\text{Sum-GNNs} \not\approx f$.
Corollary 5.4.
Denote by the set of all multisets over . Let an aggregation such that . Let an aggregation and an FNN , and define a readout . Then, $\mathfrak{ro}\circ\text{Sum-GNNs} \not\geq^{\mathbb{N}} \operatorname{avg}\circ g\text{-GNNs}$.
We have shown that Sum-GNNs do not subsume Mean and Max (and many other) GNNs. The setting, though, consisted of an input-feature domain $\mathbb{N}$, that is, countable and unbounded.
5.2 Finite Input-Feature Domain
Mean and Max aggregations are essential also when the input-feature domain is just a single value, i.e., when the input consists of featureless graphs. We define a new graph (see Figure 3): For every ,
- •
- •
- •
Theorem 5.5.
Let a feature transformation, such that for every it holds that . Then, Sum-GNNs $\not\approx f$.
Corollary 5.6.
Denote by the set of all multisets over , and let an aggregation such that . Then, Sum-GNNs $\not\geq$ (Sum, g)-GNNs.
Corollary 5.6 implies a limitation of Sum-GNNs compared to stereo aggregation GNNs that combine Sum with Mean, Max, or many other aggregations. The limitation exists even when the input-feature domain consists of only a single value.
Graph Embedding
Completing the no-subsumption picture, Sum-GNNs also fail to subsume, in a two-value input-feature domain setting, when used in combination with a readout function to approximate graph embeddings. We define : For every ,
- •
- •
- •
Theorem 5.7.
Let a graph embedding such that . Let an aggregation and an FNN , and define a readout . Then, $\mathfrak{ro}\circ\text{Sum-GNNs} \not\approx f$.
Corollary 5.8.
Denote by the set of all multisets over . Let an aggregation such that . Let an aggregation and an FNN , and define a readout . Then, $\mathfrak{ro}\circ\text{Sum-GNNs} \not\geq^{\{0,1\}} \operatorname{avg}\circ\text{(Sum, g)-GNNs}$.
6 Sum and More are Not Enough
In previous sections we showed that Sum-GNNs do not subsume Mean-GNNs and Max-GNNs, by proving that they cannot express specific functions. In this section, rather than comparing different GNN classes, we focus on one broad GNN class and show that it is limited in its ability to express any one of a certain range of functions.
Denote by the set of all multisets over , and let an aggregation . We say that is a uniform polynomial aggregation (UPA) if and only if for every homogeneous multiset it holds that is either a polynomial of or a polynomial of . Note that Sum, Mean, and Max are all UPAs. We say that a GNN is an MUPA-GNN (Multiple UPA) if and only if the aggregation input to each of its layers is defined by a series of UPAs. That is, , for some UPAs.
We define a parameterized graph (see Figure 3): For every :
- •
- •
- •
Lemma 6.1.
Let an -layer MUPA-GNN architecture, let be the maximum depth of any FNN in , and let be the maximum in-degree of any node in any FNN in . Then, there exists such that: for every GNN that realizes it holds that is piecewise-polynomial (of ) with at most pieces, and each piece is of degree at most .
Lemma 6.1 implies that the architecture bounds (from above) the number of polynomial pieces, and their degrees, that make up the function computed by any particular realization of the architecture. With Lemma 6.1 at our disposal, we consider any feature transformation that does not converge to a polynomial when applied to and viewed as a function of . We show that such a function is inexpressible by MUPA-GNNs.
Theorem 6.2.
Let a feature transformation, and define . Assume that does not converge to any polynomial, that is, there exists such that for every polynomial , for every , there exists such that . Then, MUPA-GNNs $\not\approx f$.
The last inexpressivity property we prove, concerns a class of functions which we call PIL (Polynomial-Intersection Limited). For denote by the set of all polynomials of degree . We say that a function is PIL if and only if for every there exists such that for every polynomial there exist at most consecutive integer points on which and assume the same value. Formally,
[TABLE]
We consider every feature transformation such that for it holds that is PIL. This is a different characterization than "no polynomial-convergence" (in Theorem 6.2), and neither one implies the other. The result, though, is weaker for the current characterization. We show that every MUPA-GNN architecture can approximate such a function only down to a certain . That is, every GNN that realizes the architecture, no matter the specific weights of its FNNs, is far from the function by at least (in at least one point). The following lemma is an adaptation of the Polynomial of Best Approximation theorem [?; ?] to the integer domain. There, it is a step in the proof of the Equioscillation theorem attributed to Chebyshev [?].
Lemma 6.3.
For define the set of consecutive integers starting at . Let be a PIL, let , and define
[TABLE]
Then, for every there exists such that: for every there exists for which . That is, for every starting point there is a bounded interval , and a gap , such that no polynomial of degree can approximate on that interval below that gap.
Lemma 6.4.
For every there exists a point and a gap such that: for every PIL , and every piecewise-polynomial with many pieces of degree , there exists for which . That is, the number of pieces and the max degree of a piecewise-polynomial determine a guaranteed minimum gap by which misses within a guaranteed interval.
Theorem 6.5.
Let a feature transformation, let , and assume that is PIL. Then, for every MUPA-GNN architecture , there exists such that for every MUPA-GNN that realizes there exists such that .
7 Experimentation
We experiment with vertex-level regression tasks. In previous sections we formally proved certain expressivity properties of Sum, Mean, and Max GNNs. Our goal in experimentation is to examine how these properties may affect practical learnability: searching for an approximating GNN using stochastic gradient descent. With training data ranging over only a small subsection of the true-distribution range, does the existence of a uniformly-expressing GNN increase the chance that a well-generalizing GNN will be learned?
Specific details concerning training and architecture, as well as additional illustrations and extended analysis, can be found in the appendix. Code for running the experiments is found at https://github.com/toenshoff/Uniform_Graph_Learning.
7.1 Data and Setup
For the graphs in the experiments, and with our GNN architecture consisting of two GNN layers (see appendix), Mean and Max aggregations output the same value for every vertex, up to machine precision. Thus, it is enough to experiment with Mean and assume identical results for Max.
We conduct experiments with two different datasets: one corresponds to the approximation task in Section 5.1, and the other to the task in Section 5.2:
1. Unbounded Countable Feature Domain (UC): This dataset consists of the star graphs from Section 5.1, for . The center’s ground truth value is , and it is the only vertex whose value we want to predict.
2. Single-Value Feature Domain (SV): This dataset consists of the graphs from Section 5.2, for . Again, the center’s ground truth value is , and we do not consider the other vertices’ predicted values.
As training data, we vary and . We therefore train on 10K graphs in each experiment. Afterwards, we test each GNN model on larger graphs with and . Here, we illustrate our results for two representative values of : , for all values of . Illustrations of the full results can be found in the appendix. The increased range of and in testing simulates the scenario of unbounded graph sizes and unbounded feature values, allowing us to study the performance in terms of uniform expressivity with unbounded features.
7.2 Results
Our primary evaluation metric is the relative error. Formally, if is the prediction of the GNN for the center vertex of an input graph , with truth label , we define the relative error as
[TABLE]
A relative error greater than or equal to 1 is strong evidence of inability to approximate, as the assessed approximation is no better than an always-0 output. It is also reasonable that in practice, when judging the regression of a function whose range varies by a factor of 1000, relative error would be the relevant measure.
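The metric can be sketched as follows. The displayed formula is missing from this copy, so the form below is an assumption: it is the usual absolute difference divided by the magnitude of the truth label, which is consistent with the remark that an always-0 output scores exactly 1. The function name `relative_error` is hypothetical.

```python
def relative_error(prediction, truth):
    """Assumed form of the metric: |prediction - truth| / |truth|, for truth != 0."""
    if truth == 0:
        raise ValueError("relative error is undefined for a zero truth label")
    return abs(prediction - truth) / abs(truth)


always_zero = relative_error(0.0, 5.0)  # the baseline referred to in the text
close_guess = relative_error(4.5, 5.0)
```

Under this form, any model scoring 1 or more on a given input does no better there than predicting 0 regardless of the graph.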
Unbounded, Countable, Feature Domain
Figure 4(a) provides the test results for UC. We plot the relative error against different values of . Note that the error has a logarithmic scale. Mean-GNNs achieve very low relative errors of less than across all considered combinations of and . Their relative error falls to less than when is within the range seen during training (). Therefore, Mean-GNNs do show some degree of overfitting. Notably, the value of has virtually no effect on the error of Mean-GNNs. This is expected, since mean aggregation should not be affected by the degree of a center vertex whose neighbors are identical, up to machine precision. Sum-GNNs yield a substantially higher relative error. For and the relative error is roughly , but this value increases as grows beyond the training range. Crucially, the relative error of Sum-GNNs also increases with . For , the relative error is above even when is within the range seen during training. Therefore, Sum-GNNs generalize significantly worse than Mean-GNNs in both parameters and .
Single-Value Feature Domain
Figure 4(b) provides the test results for SV. Again, we plot the relative error against different values of . Sum-GNNs yield similar relative errors as in the UC experiment. As expected, learned (Sum,Mean)-GNNs perform significantly better than Sum-GNNs. However, the learning of (Sum,Mean)-GNNs is not as successful as the learning of Mean-GNNs in the UC experiment: relative error is around for , and slightly larger for , clearly worse than the UC-experiment performance. In particular, the learned (Sum,Mean)-GNN is sensitive to increases in . Note that each (Sum,Mean)-GNN layer receives both Sum and Mean aggregation arguments and needs to choose the right one, thus it is a different learning challenge than in the first experiment.
Appendix A Proofs
For the reader’s convenience, we re-state the results that are proven in this appendix.
Proofs for Section 3
Lemma 3.1
Assume is a Mean-GNN or a Max-GNN. Let the maximum input dimension of any layer be , and let the maximum Lipschitz-Constant of any FNN of be . Then, for every it holds that .
Proof.
For every there is an automorphism of that maps to , thus they receive the same feature throughout the computation. We define for every . We view as functions of . First, assume is a Mean-GNN. We show by induction that for any it holds that . For , for some FNN whose Lipschitz-Constant is at most , hence . Also, . Assume correctness for . For we have for some FNN whose Lipschitz-Constant is at most . Hence, . Also, .
Next, assume is a Max-GNN. Notice that for every it holds that and . Hence, the proof idea for a Mean-GNN applies also for a Max-GNN. ∎
Theorem 3.2
*Let a feature transformation such that for every it holds that . Then, Mean-GNNs $\not\approx f$ and Max-GNNs $\not\approx f$.*
Proof.
Choose any . Let be either -GNN or -GNN . Let the maximum input dimension of any layer be , and let the maximum Lipschitz-Constant of any FNN of be . Choose , then by Lemma 3.1 we have that . ∎
Corollary 3.3
*We have that -GNNs -GNNs, -GNNs -GNNs. *
Proof.
Clearly, there is a -GNN that computes exactly. By Theorem 3.2, there is no -GNN or -GNN that approximates . ∎
Theorem 3.4
*Let a feature transformation such that for every it holds that . Then, -GNNs . *
Proof.
Let be an -layer -GNN. It is not hard to see by induction on that for every it holds that . Hence, . ∎
Theorem 3.5
*Let a feature transformation such that for every it holds that . Then, -GNNs . *
Proof.
Let be an -layer -GNN. It is not hard to show that is Lipschitz-Continuous with respect to the aggregation and that, with the aggregation being Mean, we have that . ∎
Corollary 3.6
*We have that -GNNs -GNNs , -GNNs -GNNs . *
Proof.
Clearly, there is a -GNN that computes of Theorem 3.4 exactly, and by Theorem 3.4 there is no -GNN that approximates . Clearly, there is a -GNN that computes of Theorem 3.5 exactly, and by Theorem 3.5 there is no -GNN that approximates . ∎
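One way to see why neither Mean nor Max can stand in for the other is to compare them on a family of neighbor multisets of growing size. The sketch below uses hypothetical multisets (a single 1 among k zeros), chosen only for illustration; they are not claimed to be the functions used in Theorems 3.4 and 3.5.

```python
def neighborhood(k):
    """Hypothetical neighbor multiset: one feature equal to 1 and k features equal to 0."""
    return [1.0] + [0.0] * k


def mean_agg(values):
    return sum(values) / len(values)


for k in (0, 1, 9, 99):
    values = neighborhood(k)
    # Max ignores multiplicities and stays constant; Mean shrinks as 1 / (k + 1).
    print(k, max(values), mean_agg(values))
```

Max is blind to how many zeros are present, while Mean depends on their count, so the two aggregations carry different information about the same neighborhood.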
Proofs for Section 4
Every reference in Lemma A.1 (and its proof) to a vertex-related value-vector is element-wise: for every vertex and a value-function of output dimension we use the notation to represent for all .
Lemma A.1
Let , let , and let . Then, there is a -GNN such that for every featured graph and every vertex it holds that
[TABLE]
Proof.
Please refer to Figure 6 for an illustration of the construction. Let be the value of a vertex after layer and let be the sum of neighbors’ values after layer . We denote the function computed in layer of by , that is, . First, we map the value of a vertex (and the sum of its neighbors) to a 2-tuple with the first coordinate being and the second being the vertex’s value. That is, we define to be . Then, we define to be . That is, . To see why fulfills the requirements, we describe the values of each of the three components for the different ranges of .
- •
- •
- •
- •
- •
∎
Lemma 4.1
*For every and , there exists a -GNN of size such that for every featured graph it holds that . *
Proof.
Please refer to Figure 6 for an illustration of the construction. We describe a construction of size which approximates Mean for one coordinate, the extension to is by a simple duplication. Every reference to a vertex-related value-vector is element-wise: for every vertex and a value-function of output dimension , we use the notation to represent for all .
Let be the minimal natural such that , and define . Define . The first layer of is identical to in Lemma A.1. The second layer uses a copy of from Lemma A.1, for each , multiplied by , and then sums the outputs. To see why this is correct, assume . For we have by Lemma A.1 zero contribution of to the final sum. Next, if then by Lemma A.1 we have a contribution of
[TABLE]
[TABLE]
Denoting the last term by and considering that we have that . Finally, if then by Lemma A.1 we have zero contribution of and a contribution of . Overall, we have that . ∎
Theorem 4.2
Let a -GNN consisting of layers, let the maximum input dimension of any layer be , and let the maximum Lipschitz-Constant of any FNN of be . Then, for every there exists a -GNN such that:
.
- 2.
.
Proof.
Let , that is, are the FNNs constituting ’s layers. Let and let the GNN constructed in Lemma 4.1, with parameter . Note that is indifferent to the aggregation parameter and is indifferent to the vertex’s state parameter, thus, for both parameters an argument of ’0’ is as good as any other. Define a -GNN with layers . For , each pair of layers approximates the operation of . For a graph and a vertex , denote the feature of after the layer of by , with . We define as follows.
[TABLE]
[TABLE]
[TABLE]
For denote the feature of after the layer of by , with , and denote by the maximum error of any coordinate of the output of the layer of . We prove by induction on that . Denote that upper bound by . For , we have
[TABLE]
[TABLE]
The first input coordinates to are identical. For each coordinate of the last coordinates, by definition of and we have
[TABLE]
That difference translates to a difference of at most in any coordinate of . In total, we have . Assume correctness for . Layer of is, by definition, the operation of on at most coordinates. The first coordinates constitute and the last coordinates constitute . The error of each of the first coordinates is, by assumption, at most . For each coordinate of the last coordinates, we have by assumption
[TABLE]
hence
[TABLE]
hence, by definition of and ,
[TABLE]
Combining the error bounds for the two types of input, we have that
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
With the induction proven, we have that
[TABLE]
Hence, the requirement that can be satisfied by setting
[TABLE]
implying
[TABLE]
Finally, using Lemma 4.1 we have that for each it holds
[TABLE]
hence
[TABLE]
∎
Lemma A.2
Let , define , and define a function such that
[TABLE]
That is, is an almost-unary representation of in units of , “almost” because it may contain a fraction (between [math] and ) in its last coordinate. For a finite multiset , define
[TABLE]
a mapping from the multiset to the sum of its elements’ representation, coordinate-wise capped at . Then,
[TABLE]
Proof.
W.l.o.g., assume . For the lower bound, it is not hard to verify that , hence . For the upper bound, assume , then necessarily and , hence . ∎
Lemma 4.4
*For every and , there exists a -GNN of size such that for every featured graph and vertex it holds that . *
Proof.
We describe a construction of size that approximates Max for one coordinate, the extension to is by a simple duplication. Every reference to a vertex-related value-vector is element-wise: for every vertex and a value-function of output dimension , we use the notation to represent for all .
Let be the minimal natural such that and define . The first GNN layer computes for each vertex a vector such that . Observe that the computation corresponds to the mapping in Lemma A.2. The second GNN layer first caps the sum-aggregation of the neighbors’ vectors, then sums the coordinates of the capped vector. That is, for a vertex , let , then . Using Lemma A.2, we get that . ∎
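The two-layer idea above (encode each value in almost-unary form, sum the neighbors' encodings, cap each coordinate, then add up the coordinates) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact construction: features are assumed to lie in [0, 1], the number of coordinates is taken as `ceil(1 / eps)`, the cap is assumed to be 1, and the names `unary_encode` and `capped_sum_max` are hypothetical.

```python
import math


def unary_encode(x, eps, n):
    """Almost-unary code of x in units of eps: full coordinates are 1,
    the last non-zero coordinate may hold a fraction, the rest are 0."""
    return [min(1.0, max(0.0, (x - i * eps) / eps)) for i in range(n)]


def capped_sum_max(values, eps):
    """Approximate max(values) using only a coordinate-wise sum over the multiset."""
    n = math.ceil(1.0 / eps)
    # "First layer": every neighbor is mapped to its almost-unary vector.
    # "Sum aggregation": the vectors are added coordinate-wise.
    total = [0.0] * n
    for x in values:
        for i, coordinate in enumerate(unary_encode(x, eps, n)):
            total[i] += coordinate
    # "Second layer": cap every coordinate at 1, then add up the coordinates.
    capped = [min(1.0, t) for t in total]
    return eps * sum(capped)


neighbors = [0.3, 0.3, 0.7, 0.1]  # hypothetical neighbor features
print(capped_sum_max(neighbors, eps=0.125), max(neighbors))
```

Under these assumptions the result never falls below the true maximum and overshoots it by at most `eps`, because only the first coordinate that the maximum does not fill completely can be pushed up by the other elements.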
Theorem 4.5
Let a -GNN consisting of layers, let the maximum input dimension of any layer be , and let the maximum Lipschitz-Constant of any FNN of be . Then, for every there exists a -GNN such that:
.
- 2.
.
Proof.
The proof is identical to that of Theorem 4.2, with the following adaptations:
1. Replacing any mention of ’Mean’ with ’Max’.
2. Replacing any usage of Lemma 4.1 with Lemma 4.4.
3. Replacing equations (1), (2) with equations (3), (4) hereinafter.
[TABLE]
[TABLE]
∎
Proofs for Section 5
A.1 Describability
Let be a set of polynomials in , and let be a function in .
We say that weakly-describes if and only if:
- a.
is finite.
- b.
.
We identify a polynomial as being good if and only if for some real coefficients and some maximum degree . That is, is a polynomial in with max degrees for , and every appearance of is with multiplication by a polynomial of of degree at least . We say that is good if and only if every polynomial in it is good.
We say that describes if and only if: weakly-describes and is good. We say that is describable (w-describable) if and only if there exists a set that (weakly-) describes it.
Let be a finite set of polynomials in , we denote by the building blocks of , that is, the degree combinations that appear in any of the polynomials in . Let , we define .
For every and a set of functions of , we define , and . For two sets of functions of , we define .
Lemma A.3
a. Let a function (w-)describable by a set . Let the composition of over , then is (w-)describable by a set such that .
b. Let be functions (w-)describable by respectively. Then, for every real coefficients the affine function is (w-)describable by a set such that .
c. Each output of a ReLU activated FNN whose inputs are all (w-)describable by a set is (w-)describable by a set such that .
d. Let a function w-describable by a set , then is describable by some set such that , and is w-describable by a set such that .
Proof.
a. Let a set that (w-)describes the function . For any either or , hence is (w-)describable by .
b. It is not hard to verify that if is (w-)describable by then for every it holds that is (w-)describable by , and is (w-)describable by . It is also not hard to verify then that for any it holds that is (w-)describable by . A straightforward induction proves that a linear combination of arbitrarily many (w-)describable functions is (w-)describable. Finally, let a set that (w-)describes the linear combination, then is a set that (w-)describes the affine function.
c. Implied by (a)+(b).
d. It is not hard to verify that if is w-describable by then is describable by . Also, it is not hard to verify that if is w-describable by then is w-describable by . ∎
Lemma A.4
Let a series of graphs , parameterized in , each having an identified vertex , such that for every -layer -GNN it holds that , viewed as a function of , is describable. Then, for every -GNN and for every there exists such that .
Proof.
Let be a finite set of polynomials that describes . Fix any specific , and for denote by only those polynomials in that intersect with in the domain . Denote the polynomials in that are a constant, by . Let and assume by contradiction that for every it holds that . Then, there must exist for which . Otherwise, as is assumed to describe , any appearance of , in any , is tied to , and we would have
[TABLE]
and
[TABLE]
in contradiction to . By definition, is a subset of which is finite, and so . Denote the last term by . As our reasoning thus far is true for any , it holds that . Finally, for necessarily for all it holds that . ∎
Section 5.1
Define a series of featured star graphs as follows: For ,
- •
- •
- •
Let be an -layer -GNN. We define , the feature of after operating the first layers of . Note that . For every there is an automorphism of that maps to , thus they receive the same feature throughout the computation. We define for every . In our argumentation, we view as functions of .
Lemma A.5
It holds that is describable.
Proof.
We show by induction that for every it holds that is w-describable and that is describable. For we have and the assumption holds. Assume correctness for . By definition, where is a ReLU FNN. By assumption, is w-describable and so by Lemma A.3 we have that is describable. Also, by assumption, is describable. Hence, by Lemma A.3 we have that is describable. The proof for is in similar fashion. ∎
Theorem 5.1
*Let a feature transformation, such that for every it holds that . Then, Sum-GNNs $\not\approx f$.*
Proof.
Immediate from combining Lemma A.5 and Lemma A.4. ∎
Corollary 5.2
*Denote by the set of all multisets over . Let an aggregation such that , that is, aggregates every homogeneous multiset to its single unique value. Then, -GNNs g-aggregation GNNs. *
Proof.
Let a feature transformation, such that for every featured graph , and for every vertex , it holds that . Then, by Theorem 5.1, -GNNs . Clearly, there is a -aggregation GNN that exactly computes . ∎
Consider another variant of :
- •
- •
- •
Let be an -layer -GNN. We use the notations and with similar meaning to before, where now refers to each of the vertices.
Lemma A.6
It holds that is describable by a set such that for every it holds that does not contain (with coefficient ).
Proof.
We prove the correctness of the following statements for every , from which the lemma clearly follows.
1. is describable.
2. is weakly-describable by a set such that for every it holds that does not contain .
Proof is by induction on . Correctness for is clear. Assume correctness for .
1. By definition, for some FNN . By the induction assumption, is describable and clearly is also describable. Hence, by Lemma A.3 we have that is describable.
2. By definition, for some FNN . By the induction assumption, obtains the stated property, and clearly so does . By Lemma A.3, we have that the output of operating on obtains the stated property. ∎
Lemma A.7
Let a graph embedding such that . Let an FNN , and define a readout . Then, $\mathfrak{ro} \circ$ Sum-GNNs $\not\approx f$.
Proof.
Let a -GNN . By definition, . By Lemma A.6, is weakly-describable by a set such that for every it holds that does not contain . Using a similar technique to the one in proof of Lemma A.3, it is not hard to show that is weakly-describable by a set such that for every it holds that does not contain . Let any polynomial and let be the coefficient of in . It is not hard to verify that for every it holds that . The finiteness of implies that there is a maximal such over all , denote it by . The finiteness of also implies that:
1. Given and there exists such that for every and every with a finite limit (as ) it holds that .
2. Given and there exists such that for every and every with an infinite limit (as ) it holds that .
Finally, for every it holds that . Let , then for there exists such that for every it holds that . ∎
Lemma A.8
Let a graph embedding such that . Let an FNN , and define a readout . Then, $\mathfrak{ro} \circ$ Sum-GNNs $\not\approx f$.
Proof.
Let , then . Clearly, is describable. Hence, by Lemma A.3, it holds that is describable. Let a describing set of , let any polynomial , and let be the coefficient of in . It is not hard to verify that for every it holds that . The finiteness of implies that there is a maximal such over all , denote it by . The finiteness of also implies that:
1. Given and there exists such that for every and every with a finite limit (as ) it holds that .
2. Given and there exists such that for every and every with an infinite limit (as ) it holds that .
Finally, for every it holds that . Let , then for there exists such that for every it holds that . Let , then for there exists such that for every it holds that . ∎
Theorem 5.3
*Let a graph embedding such that . Let an aggregation and an FNN , and define a readout . Then, $\mathfrak{ro} \circ$ Sum-GNNs $\not\approx f$.*
Proof.
Follows from combining Lemma A.7 and Lemma A.8. ∎
Corollary 5.4
*Denote by the set of all multisets over . Let an aggregation such that . Let an aggregation and an FNN , and define a readout . Then, $\mathfrak{ro} \circ$ Sum-GNNs $\not\geq^{\mathbb{N}} \operatorname{avg} \circ g$-GNNs.*
Proof.
Clearly, for a straightforward g-aggregation GNN it holds that and , hence . By Theorem 5.3, no composition of with a -GNN can approximate . ∎
Section 5.2
We define a new series of featured graphs (see Figure 3). For every :
- •
- •
- •
Let be an -layer -GNN. We define , , and , following a reasoning similar to Section 5.1, and view as functions of
Lemma A.9
It holds that is describable.
Proof.
We show by induction that for every it holds that is w-describable and that are describable. For we have and the assumption holds. Assume correctness for . By definition, where is a ReLU FNN. By assumption, is w-describable and so by Lemma A.3 we have that is describable. Also by assumption, is describable. Hence, by Lemma A.3 we have that is describable. For , by definition, , and by assumption are w-describable. Hence, by Lemma A.3 we have that is w-describable. The proof for is in similar fashion. ∎
Theorem 5.5
*Let a feature transformation, such that for every it holds that . Then, -GNNs . *
Proof.
Immediate from combining Lemma A.9 and Lemma A.4. ∎
Corollary 5.6
*Denote by the set of all multisets over , and let an aggregation such that . Then, -GNNs (Sum, g)-GNNs. *
Proof.
Let a feature transformation such that for every featured graph , for every vertex , it holds that . Then, by Theorem 5.5, -GNNs . Clearly, there is a GNN that uses Sum aggregation in its first layer and aggregation in its second layer, that exactly computes . ∎
We define one last variant of a series:
- •
- •
- •
Let be an -layer -GNN. The notations , , and , are used as before.
Lemma A.10
It holds that is describable by a set and for every it holds that does not contain (with coefficient ).
Proof.
We prove the correctness of the following statements, from which the lemma clearly follows.
1. is weakly-describable by a set such that for every it holds that does not contain .
2. is describable.
3. is weakly-describable by a set such that for every it holds that does not contain .
Proof is by induction on . Correctness for is immediate. Assume correctness for .
1. By definition, for some FNN . By the induction assumption, obtains the stated property and the same holds for . By Lemma A.3, we have that the output of operating on obtains the stated property.
2. By definition, for some FNN . By the induction assumption, obtains the stated property, and clearly so do . The rest follows similarly to the end of (1).
3. By definition, for some FNN . By the induction assumption, obtains the stated property, and clearly so does . The rest follows similarly to the end of (1). ∎
Lemma A.11
Let a graph embedding such that . Let an aggregation and an FNN , and define a readout . Then, $\mathfrak{ro} \circ$ Sum-GNNs $\not\approx f$.
Proof.
Let . Define , then . By Lemma A.10, is describable by a set such that for every it holds that does not contain , hence is describable. Hence, by Lemma A.3 is describable. Let a describing set be . Let any polynomial and let the coefficient of the component in . Then, it is not hard to verify that for every it holds that . The finiteness of implies that there is a maximal such over all , denote it by . The finiteness of also implies that:
1. Given and there exists such that for every and every with a finite limit (as ) it holds that .
2. Given and there exists such that for every and every with an infinite limit (as ) it holds that .
Finally, for every it holds that . Hence, for there exists such that for every it holds that , implying . ∎
Lemma A.12
Let a graph embedding, such that for every it holds that . Then, -GNNs .
Proof.
Let . Clearly, is describable. Let a describing set of , let any polynomial , and let be the coefficient of in . Then, it is not hard to verify that for every it holds that . The finiteness of implies that there is a maximal such over all , denote it by . The finiteness of also implies that:
1. Given and there exists such that for every and every with a finite limit (as ) it holds that .
2. Given and there exists such that for every and every with an infinite limit (as ) it holds that .
Finally, for every it holds that . Hence, for there exists such that for every it holds that , implying . ∎
Theorem 5.7
*Let a graph embedding such that . Let an aggregation and an FNN , and define a readout . Then, $\mathfrak{ro} \circ$ Sum-GNNs $\not\approx f$.*
Proof.
Follows from combining Lemma A.11 and Lemma A.12. ∎
Corollary 5.8
*Denote by the set of all multisets over . Let an aggregation such that . Let an aggregation and an FNN , and define a readout . Then, $\mathfrak{ro} \circ$ Sum-GNNs $\not\geq^{\{0,1\}} \operatorname{avg} \circ$ (Sum, g)-GNNs.*
Proof.
Clearly, for a straightforward stereo aggregation (Sum,g)-GNN it holds that , , and , hence . By Theorem 5.7, no composition of with a -GNN can approximate the graph embedding . ∎
Proofs for Section 6
Lemma 6.1
*Let an -layer -GNN architecture, let be the maximum depth of any FNN in , and let be the maximum in-degree of any node in any FNN in . Then, there exists such that: for every GNN that realizes it holds that is piecewise-polynomial (of ) with at most pieces, and each piece is of degree at most . *
Proof.
Note the following observations:
a. Let be piecewise polynomial with pieces, then a linear combination of has at most pieces. This can be seen by considering the set of pieces-joint points of , and noticing that it is the union of such points of and such points of . Accordingly, let be piecewise polynomial with at most pieces each, then a linear combination of has at most pieces.
b. Let be piecewise polynomial with at most pieces, then has at most pieces.
c. Let be an output of a ReLU FNN of depth with maximal in-degree for any node, with inputs which are at most -pieces polynomial each. Then, by (a)+(b), is piecewise-polynomial with pieces.
d. Let be piecewise polynomial with at most pieces, and let a polynomial, then is piecewise polynomial, with at most pieces, each of degree at most
e. Let be piecewise polynomial with at most pieces, and let a polynomial, then is piecewise polynomial, with at most pieces, each of degree at most .
Let be a GNN that realizes . We define , the feature of after operating the first layers of . Note that . For every there is an automorphism of that maps to , thus they receive the same feature throughout the computation. We define for every . In our argumentation, we view as functions of .
Using observations [a..e] above, we prove by induction on that , in each coordinate, are piecewise polynomial in with no more than pieces, each of degree at most for some . For we have that are constants. Assume correctness for . By definition, where is a shorthand for the aggregation value . By (d),(e), and the induction assumption, each of the input coordinates to is piecewise polynomial in with at most pieces, each of degree at most for some . Hence, by (c), each coordinate of has at most pieces, each of degree at most . By similar reasoning, can be shown to have no more than pieces, each of a certain maximal degree. ∎
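The piecewise-polynomial behavior can be observed on a toy instance. The sketch below assumes a star graph with `n` leaves, all initial features equal to 1, and three Sum layers with hand-picked scalar weights and ReLU activations. The graph family, the weights, and the name `center_feature` are hypothetical choices made for illustration only.

```python
def relu(x):
    return max(0, x)


def center_feature(n):
    """Center feature of a star with n leaves after three hand-picked Sum layers."""
    center, leaf = 1, 1  # assumed constant initial features
    # Layer 1: the center sums n identical leaf features; a leaf sees only the center.
    center, leaf = relu(n * leaf - 3), relu(leaf + center)
    # Layer 2: same message-passing pattern with different (hypothetical) weights.
    center, leaf = relu(center + n * leaf), relu(center - leaf)
    # Layer 3: the product n * leaf is where the degree of the polynomial grows.
    center = relu(n * leaf - center)
    return center


for n in (1, 5, 7, 8, 10, 20):
    print(n, center_feature(n))
```

For this particular choice of weights the output is the zero polynomial for `n <= 7` and the quadratic `n*n - 8*n + 3` for `n >= 8`, that is, two polynomial pieces: each ReLU can add breakpoints, and each multiplication by `n` coming from Sum aggregation can raise the degree.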
Theorem 6.2
*Let a feature transformation, and define . Assume that does not converge to any polynomial, that is, there exists such that for every polynomial , for every , there exists such that . Then, -GNNs. *
Proof.
Let an by which does not get forever close to any polynomial, and let a -GNN . By Lemma 6.1, there is a such that for every it holds that for some polynomial . By assumption, there exists such that . Hence, . ∎
Lemma 6.3
For define the set of consecutive integers starting at . Let be a PIL, let , and define
[TABLE]
*Then, for every there exists such that: for every there exists for which . That is, for every starting point there is a bounded interval , and a gap , such that no polynomial of degree can approximate on that interval below that gap. *
Proof.
Define . For a real-valued function whose domain contains , we define , the maximum absolute value attains on . Define , the distance of from the closest polynomial of degree , in the segment . We need to show that . For a vector denote by the Euclidean norm of . For we use as the metric in our continuity argumentation. Define the polynomial determined by . Note the following:
a) For , let , then is continuous.
b) For , let , then is continuous.
c) There exists such that
[TABLE]
Proof: Let and define . By (a), is continuous, and as is compact we have that there exists such that . Note that necessarily , then by definition of it must be that either or . Since , necessarily it is the former that holds. Hence, for every we have that , and by we have . Finally, note that , and let such that , then for all we have . Hence, .
By (b) and (c), is the infimum of a continuous function on a closed ball, hence there exists such that . By the assumption that is PIL, and the definition of , we have . ∎
Lemma 6.4
*For every there exists a point and a gap such that: for every PIL , and every piecewise-polynomial with many pieces of degree , there exists for which . That is, the number of pieces and the max degree of a piecewise-polynomial determine a guaranteed minimum gap by which misses within a guaranteed interval. *
Proof.
Define . Using the notation of from Lemma 6.3, for every define , define , and define . Note that by Lemma 6.3. Finally, define , . Assume by contradiction that is close to by less than for every , then, necessarily the first polynomial piece of ends at most at , the second at and the piece at , then the last polynomial piece starts the latest at and by it must have missed at least one point by at least . ∎
Theorem 6.5
*Let a feature transformation, let , and assume that is PIL. Then, for every -GNN architecture , there exists such that for every -GNN that realizes there exists such that . *
Proof.
Let the guaranteed by Lemma 6.1 for , and let the guaranteed by Lemma 6.4 for pieces of degree . Then, by Lemma 6.4, for and the statement holds. ∎
Appendix B Experimentation Ext.
Architecture and Training
We implement all GNNs using PyTorch Geometric [?]. The update function of each GNN layer is a standard 2-layer MLP with a ReLU-activated hidden layer and a linear output layer. We set the intermediate embedding dimension to 256 and use 2 message passing layers in all models. We minimize the smooth L1 loss on the training data using the Adam Optimizer [?]. No readout function is needed. For both considered graph families the ground truth is a label of the root vertex. The prediction and loss of all other vertices are simply masked out.
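The loss described above (smooth L1, evaluated only at the root vertex) can be sketched in plain Python, without PyTorch. The threshold `beta=1.0` is an assumption (it matches the default of `torch.nn.SmoothL1Loss`, but the value used in the experiments is not stated), and the names `smooth_l1`, `masked_loss`, and `root_mask` are hypothetical.

```python
def smooth_l1(prediction, target, beta=1.0):
    """Smooth L1: quadratic for small errors, linear for large ones."""
    diff = abs(prediction - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def masked_loss(predictions, targets, root_mask):
    """Average smooth L1 loss over the vertices selected by root_mask only."""
    losses = [
        smooth_l1(p, t)
        for p, t, is_root in zip(predictions, targets, root_mask)
        if is_root
    ]
    return sum(losses) / len(losses)


# Hypothetical graph with three vertices; only the first one is the root.
predictions = [2.0, 100.0, -5.0]
targets = [2.5, 0.0, 0.0]
root_mask = [True, False, False]
print(masked_loss(predictions, targets, root_mask))
```

The predictions at non-root vertices do not influence the loss at all, however far off they are, which is what masking them out amounts to.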
Before each training run we randomly choose 500 graphs from the training data as a validation dataset. Each model is trained for 500 epochs with a batch size of 100. The initial learning rate is selected from based on validation performance. The learning rate decays with a cosine annealing schedule [?] throughout training. We average all results over 5 models trained with different random seeds. All experiments are conducted on a machine with an NVIDIA RTX A6000 GPU (48GB) and 512GB of RAM running Ubuntu 22.04 LTS.
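A cosine annealing schedule over the 500 training epochs can be sketched with its standard closed form. The initial learning rate `1e-3`, the minimum rate `0.0`, and the per-epoch (rather than per-step) update are assumptions for illustration; the function name `cosine_annealed_lr` is hypothetical.

```python
import math


def cosine_annealed_lr(epoch, total_epochs, initial_lr, min_lr=0.0):
    """Closed-form cosine annealing without restarts."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (initial_lr - min_lr) * (1.0 + cosine)


TOTAL_EPOCHS = 500   # from the training setup described above
INITIAL_LR = 1e-3    # hypothetical value, not taken from the paper
for epoch in (0, 125, 250, 375, 500):
    print(epoch, cosine_annealed_lr(epoch, TOTAL_EPOCHS, INITIAL_LR))
```

The rate starts at the initial value, passes through half of it at the midpoint of training, and reaches the minimum at the final epoch.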
Extended Results
An illustration of the full experimental results can be seen in fig. 7. For both datasets, and each tested architecture, we provide the relative error (RE) over the full test range () as a 3D plot. The error is provided on the -axis, which is linearly scaled. The color map is linear as well and is scaled individually for each subplot to highlight additional details.
The results for the unbounded countable features (UC) experiment are provided in fig. 7(a). Note that the color map for the trained Mean-GNN is scaled by , since the learned function is very close to the ground truth. The trained Sum-GNN performs significantly worse. Relative to itself though, as long as is in the training range it generalizes well along the axis. Operating the trained Sum-GNN on in the training range resembles the bounded input-feature domain setting examined in Section 4. Hence, the generalization in , when is in the training range, resembles the result in Section 4: Sum-GNNs can approximate Mean when the input-feature domain is bounded. Once is beyond the training range, the relative error grows rapidly, both along the axis (for fixed ) and along the axis. Interestingly, the error of the trained Sum-GNN also tends upwards at . The learned function therefore lacks robustness even towards the lower end of the training range of .
The results for the single value features (SV) experiment are provided in fig. 7(b). Overall, the trained (Sum,Mean)-GNN achieves a significantly lower error than the Sum-GNN. As in the UC experiment, as long as is in the training range the trained Sum-GNN generalizes relatively well along the axis, and the performance deteriorates sharply (along both axes) when . We do note, though, that the results of the (Sum,Mean)-GNN in this experiment are substantially worse than those of the Mean-GNN in the UC experiment. While there exists a (Sum,Mean)-GNN that computes exactly the SV-experiment function (see proof of Corollary 5.6), Stochastic Gradient Descent (SGD) was not able to learn this function in fine detail. To arrive at a good (Sum,Mean)-GNN instance, the first GNN-layer has to learn to ignore the coordinates of the Mean-aggregation and to use the coordinates of the Sum-aggregation properly, and the second GNN-layer has to learn to ignore the Sum and use the Mean. These requirements constitute a more challenging learning problem than that of learning a good Mean-GNN for the UC task, and the difference is reflected in the results. Interestingly, the relative error of the (Sum,Mean)-GNN is worst at the lower end of the training range for high values of .
- [Abboud et al., 2021] Ralph Abboud, İsmail İlkan Ceylan, Martin Grohe, and Thomas Lukasiewicz. The surprising power of graph neural networks with random node initialization. In Zhi-Hua Zhou, editor, Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, pages 2112–2118. ijcai.org, 2021.
- [anonymous, 2022] anonymous. The equioscillation theorem. https://en.wikipedia.org/wiki/Equioscillation_theorem, 2022.
- [Barceló et al., 2020a] Pablo Barceló, Egor V Kostylev, Mikael Monet, Jorge Pérez, Juan Reutter, and Juan-Pablo Silva. The logical expressiveness of graph neural networks. In 8th International Conference on Learning Representations (ICLR 2020), 2020.
- [Barceló et al., 2020b] Pablo Barceló, Egor V Kostylev, Mikaël Monet, Jorge Pérez, Juan L Reutter, and Juan-Pablo Silva. The expressive power of graph neural networks as a query language. ACM SIGMOD Record, 49(2):6–17, 2020.
- [Barceló et al., 2021] Pablo Barceló, Floris Geerts, Juan Reutter, and Maksimilian Ryschkov. Graph neural networks with local graph parameters. Advances in Neural Information Processing Systems, 34:25280–25293, 2021.
- [Cappart et al., 2021] Quentin Cappart, Didier Chételat, Elias B. Khalil, Andrea Lodi, Christopher Morris, and Petar Veličković. Combinatorial optimization and reasoning with graph neural networks. In Zhi-Hua Zhou, editor, Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, pages 4348–4355. ijcai.org, 2021.
- [Chen et al., 2019] Zhengdao Chen, Soledad Villar, Lei Chen, and Joan Bruna. On the equivalence between graph isomorphism testing and function approximation with gnns. Advances in Neural Information Processing Systems, 32, 2019.
- [Corso et al., 2020] Gabriele Corso, Luca Cavalleri, Dominique Beaini, Pietro Liò, and Petar Veličković. Principal neighbourhood aggregation for graph nets. Advances in Neural Information Processing Systems, 33:13260–13271, 2020.
