Some Might Say All You Need Is Sum

Eran Rosenbluth; Jan Toenshoff; Martin Grohe

arXiv:2302.11603·cs.LG·May 22, 2023

TL;DR

This paper investigates how the choice of aggregation function affects GNN expressivity when a single GNN must work for graphs of all sizes. It shows that Sum GNNs cannot approximate some basic functions that Mean or Max GNNs compute exactly, and vice versa.

Contribution

It provides theoretical proofs that Sum GNNs cannot approximate basic functions computed by Mean or Max GNNs, highlighting the limitations of Sum aggregation.

Findings

01

Some basic functions computed exactly by Mean or Max GNNs cannot be approximated by any Sum GNN.

02

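Neither Sum GNNs nor Mean or Max GNNs subsume the other in uniform expressivity; with bounded input features, Sum GNNs can approximate Mean and Max GNNs.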

03

Combination of Sum with Mean or Max increases expressivity.

Equations

$f_{{\mathfrak{F}},v}(\vec{x})\coloneqq{\mathfrak{a}}_{v}\Big(b_{v}+\sum_{j=1}^{k}f_{{\mathfrak{F}},u_{j}}(\vec{x})\cdot w_{e_{j}}\Big)$

$f_{\mathfrak{F}}(\vec{x})\coloneqq\big(f_{{\mathfrak{F}},Y_{1}}(\vec{x}),\ldots,f_{{\mathfrak{F}},Y_{q}}(\vec{x})\big)$

$\frac{\left\lvert f_{\mathfrak{F}}(x_{1},\ldots,x_{n})_{j}-f_{\mathfrak{F}}(x_{1},\ldots,x_{i-1},x_{i}+\delta,\ldots,x_{n})_{j}\right\rvert}{\delta}\leq a_{\mathfrak{F}}$

${\mathcal{N}}^{(i)}(G,v)\coloneqq f_{{\mathfrak{F}}_{i}}\big({\mathcal{N}}^{(i-1)}(G,v),agg_{i}({\mathcal{N}}^{(i-1)}(G,w):w\in N(v))\big)$

$\forall\varepsilon>0\ \exists f\in F:\ \forall G\in{\mathcal{G}}_{S^{p}}\ \forall v\in V(G)\quad\left\lvert f(G)(v)-h(G)(v)\right\rvert\leq\varepsilon$

$\forall\theta\in\Theta\ \exists u_{\theta}\in V(G_{\theta})\ \exists p\in F_{{\mathcal{N}}}:\ {\mathcal{N}}(G_{\theta},u_{\theta})=p(\theta)$

$\forall\varepsilon>0\ \exists\theta\in\Theta:\ \forall p\in F\quad\left\lvert p(\theta)-f(G_{\theta})(u_{\theta})\right\rvert>\varepsilon$

$\sup\big(k:\forall p\in P_{n}\;\forall x\in{\mathbb{N}}\;\forall y\in[x..(x+k-1)]\;f(y)=p(y)\big)\in{\mathbb{N}}$

$1+\max\big(k:\forall p\in P_{n}\;\forall x\in{\mathbb{N}}\;\forall y\in[x..(x+k-1)]\;f(y)=p(y)\big)$

$RE(y_{pred},c)=\frac{\left\lvert y_{pred}-c\right\rvert}{\left\lvert c\right\rvert}$

${\mathcal{N}}(G,v)=\begin{cases}0&s\leq\operatorname{avg}(v)\\ \frac{n_{v}(s-\operatorname{avg}(v))}{a}&s-\frac{a}{n_{v}}<\operatorname{avg}(v)<s\\ 1&s-a\leq\operatorname{avg}(v)\leq s-\frac{a}{n_{v}}\\ 1-\frac{n_{v}(s-\operatorname{avg}(v)-a)}{a}&s-a-\frac{a}{n_{v}}<\operatorname{avg}(v)<s-a\\ 0&\operatorname{avg}(v)\leq s-a-\frac{a}{n_{v}}\end{cases}$

$s_{i+1}\Big(1-\frac{n_{v}(s_{i+1}-\operatorname{avg}(v)-a)}{a}\Big)+s_{i}\Big(\frac{n_{v}(s_{i}-\operatorname{avg}(v))}{a}\Big)=s_{i+1}\Big(1-\frac{n_{v}(s_{i}-\operatorname{avg}(v))}{a}\Big)+s_{i}\Big(\frac{n_{v}(s_{i}-\operatorname{avg}(v))}{a}\Big)$

$\hat{f}_{2j+1}\big(\hat{v}^{(2j)},\Sigma_{w\in N(v)}\hat{w}^{(2j)}\big)\coloneqq\big(\hat{v}^{(2j)},g_{1}(\hat{v}^{(2j)},0)\big)$

$\hat{f}_{2(j+1)}\big((\hat{v}^{(2j)},g_{1}(\hat{v}^{(2j)},0)),\Sigma_{w\in N(v)}(\hat{w}^{(2j)},g_{1}(\hat{w}^{(2j)},0))\big)\coloneqq f_{j+1}\big(\hat{v}^{(2j)},g_{2}(0,\Sigma_{w\in N(v)}g_{1}(\hat{w}^{(2j)},0))\big)$

$e_{1}=\left\lvert\hat{v}^{(2)}-v^{(1)}\right\rvert=\left\lvert f_{1}\big(\hat{v}^{(0)},g_{2}(0,\Sigma_{w\in N(v)}g_{1}(\hat{w}^{(0)},0))\big)-f_{1}\big(v^{(0)},\operatorname{Mean}(\{w^{(0)}\mid w\in N(v)\})\big)\right\rvert$

$\left\lvert g_{2}(0,\Sigma_{w\in N(v)}g_{1}(\hat{w}^{(0)},0))_{i}-\operatorname{Mean}(\{w^{(0)}:w\in N(v)\})_{i}\right\rvert\leq\hat{\varepsilon}$

$\forall w\in N(v)\quad\left\lvert\hat{w}_{i}^{(2n)}-w_{i}^{(n)}\right\rvert\leq b_{n}$

$\left\lvert\frac{1}{\left\lvert N(v)\right\rvert}\Sigma_{w\in N(v)}\hat{w}_{i}^{(2n)}-\frac{1}{\left\lvert N(v)\right\rvert}\Sigma_{w\in N(v)}w_{i}^{(n)}\right\rvert\leq b_{n}$

$\left\lvert g_{2}(0,\Sigma_{w\in N(v)}g_{1}(\hat{w}^{(2n)},0))_{i}-\frac{1}{\left\lvert N(v)\right\rvert}\Sigma_{w\in N(v)}w_{i}^{(n)}\right\rvert\leq b_{n}+\hat{\varepsilon}$

$e_{n+1}=\max\big(\left\lvert\hat{v}_{i}^{(2(n+1))}-v_{i}^{(n+1)}\right\rvert:i\in[d]\big)=\max\big(\left\lvert f_{n+1}(\hat{v}^{(2n)},g_{2}(0,\Sigma_{w\in N(v)}g_{1}(\hat{w}^{(2n)},0)))_{i}-f_{n+1}(v^{(n)},\operatorname{Mean}(\{w^{(n)}:w\in N(v)\}))_{i}\right\rvert:i\in[d]\big)\leq ad\,b_{n}+ad(b_{n}+\hat{\varepsilon})=2ad\,b_{n}+ad\,\hat{\varepsilon}=ad\,\hat{\varepsilon}\,\Sigma_{i=2}^{n+1}(2ad)^{i-1}+ad\,\hat{\varepsilon}=ad\,\hat{\varepsilon}\,\Sigma_{i\in[n+1]}(2ad)^{i-1}$

$b_{m}=ad\,\hat{\varepsilon}\,\Sigma_{i\in[m]}(2ad)^{i-1}=\hat{\varepsilon}\,ad\,\frac{1-(2ad)^{m}}{1-2ad}$

$\hat{\varepsilon}=\varepsilon\,\frac{1-2ad}{ad(1-(2ad)^{m})}$


Taxonomy

Topics: Advanced Graph Neural Networks · Bayesian Modeling and Causal Inference · Explainable Artificial Intelligence (XAI)

Full text

Some Might Say All You Need Is Sum

Eran Rosenbluth¹ (funded by the German Research Council (DFG), RTG 2236 (UnRAVeL))

Jan Toenshoff¹ (funded by the German Research Council (DFG), grants GR 1492/16-1; KI 2348/1-1 “Quantitative Reasoning About Database Queries”)

Martin Grohe¹

@informatik.rwth-aachen.de

¹RWTH Aachen University

Abstract

The expressivity of Graph Neural Networks (GNNs) is dependent on the aggregation functions they employ. Theoretical works have pointed towards Sum aggregation GNNs subsuming all other GNNs, while certain practical works have observed a clear advantage to using Mean and Max. An examination of the theoretical guarantee identifies two caveats. First, it is size-restricted, that is, the power of every specific GNN is limited to graphs of a specific size. Successfully processing larger graphs may require another GNN, and so on. Second, it concerns the power to distinguish non-isomorphic graphs, not the power to approximate general functions on graphs, and the former does not necessarily imply the latter.

It is desired that a GNN’s usability will not be limited to graphs of any specific size. Therefore, we explore the realm of unrestricted-size expressivity. We prove that basic functions, which can be computed exactly by Mean or Max GNNs, are inapproximable by any Sum GNN. We prove that under certain restrictions, every Mean or Max GNN can be approximated by a Sum GNN, but even there, a combination of (Sum, [Mean/Max]) is more expressive than Sum alone. Lastly, we prove further expressivity limitations for GNNs with a broad class of aggregations.

1 Introduction

Message passing graph neural networks (GNNs) are a fundamental deep learning architecture for machine learning on graphs. Most state-of-the-art machine learning techniques for graphs are based on GNNs. It is therefore worthwhile to understand their theoretical properties. Expressivity is one important aspect: which functions on graphs or their vertices can be computed by GNN models? To start with, functions computed by GNNs are always isomorphism invariant, or equivariant for node-level functions. A second important feature of GNNs is that a GNN can operate on input graphs of every size, since it is defined as a series of node-level computations with an optional graph-aggregating readout computation. These are desirable features that motivated the introduction of GNNs in the first place and may be seen as a crucial factor for their success. Research on the expressivity of GNNs has had a considerable impact in the field.

A GNN computation transforms a graph with an initial feature map (a.k.a. graph signal or node embedding) into a new feature map. The new map can represent a node-level function or can be “read out” as a function of the whole graph. The computation is carried out by a finite sequence of separate layers. On each layer, each node sends a real-valued message vector, which depends on its current feature vector, to all its neighbours. Then each node aggregates the messages it receives using an order-invariant multiset function, typically entrywise summation (Sum), mean (Mean), or maximum (Max). Finally, the node features are updated using a neural network which receives as arguments the aggregation value and the node’s current feature. In the eyes of a GNN all vertices are equal: the message, aggregation and update functions of every layer are identical for every node, making GNNs auto-scalable and isomorphism-invariant.

By now, numerous works have studied the expressivity of GNNs, considering several variants. However, many of the theoretical results have the following caveats:

1.

The expressivity considered is non-uniform: for a function that is defined on graphs of all sizes, it is asked whether for every $n$ there exists a GNN that expresses the function on graphs of size $n$. The expressing GNN may depend on $n$, and it may even be exponentially large in $n$. For some proofs, this exponential blow-up is necessary [?; ?]. This notion of expressivity is in contrast to uniform expressivity: for a function that is defined on graphs of all sizes, it is asked whether there exists one GNN that expresses the function on graphs of all sizes. In addition to being a significantly weaker theoretical notion, non-uniform expressivity leaves much to be desired also from a practical standpoint: it implies that a GNN may be no good for graphs larger than the sizes well-represented in the training data. This means that training may have to be done on very large graphs, and may have to be repeated often.

2.

The expressivity considered is the power to distinguish non-isomorphic graphs. A key theoretical result is the characterisation of the power of GNNs in terms of the Weisfeiler-Leman (WL) isomorphism test [?; ?], and subsequent works have used WL as a yardstick (see ’Related Work’). In applications of GNNs though, the goal is not to distinguish graphs but to regress or classify them or their nodes. There seems to be a hidden assumption that higher distinguishing power implies better ability to express general functions. While this is indeed the case in some settings [?], it is not the case under the uniform expressivity notion.

Our goal is to better understand the role that the aggregation function plays in the expressivity of GNNs. Specifically, we ask: Do Sum aggregation GNNs subsume Mean and Max GNNs, in terms of uniform expressivity of general functions?

A common perception is that an answer is already found in [?]: $\operatorname{Sum}$-GNNs strictly subsume GNNs with any other aggregation. Examining the details though, what is actually proven there is: in the non-uniform notion, considering a finite input domain, the distinguishing power of $\operatorname{Sum}$-GNNs subsumes the distinguishing power of GNNs with any other aggregation. Furthermore, in practice it has been observed that for certain tasks there is a clear advantage to using Mean and Max aggregations [?; ?; ?], with one of the most common models in practice using a variation of Mean aggregation [?]. While the difference between theoretical belief and practical evidence may be attributed to learnability rather than to expressivity, it calls for a better theoretical understanding of expressivity.

1.1 Our Contribution

All our results are in the uniform expressivity notion. Mainly, we prove that $\operatorname{Sum}$-GNNs subsume neither $\operatorname{Mean}$-GNNs nor $\operatorname{Max}$-GNNs (and vice versa), in terms of vertex-embedding expressivity as well as graph-embedding expressivity. The statements in this paper consider additive approximation, yet the no-subsumption ones also hold for multiplicative approximation.

•

Advantage Sum. For the sake of completeness, in Section 3 we prove that even with single-value input features, the neighbors-sum function which can be trivially exactly computed by a $\operatorname{Sum}$ -GNN cannot be approximated by any $\operatorname{Mean}$ -GNN or $\operatorname{Max}$ -GNN.

•

Sum subsumes. In Section 4 we prove that if the input features are bounded, $\operatorname{Sum}$ -GNNs can approximate all $\operatorname{Mean}$ -GNNs or $\operatorname{Max}$ -GNNs, though not without an increase in size which depends polynomially on the required accuracy, and exponentially on the depth of the approximated $\operatorname{Mean}$ -GNNs or $\operatorname{Max}$ -GNNs.

•

Advantage Mean and Max. In Section 5.1 we show that if we allow unbounded input features, then functions that are exactly computable by $\operatorname{Mean}$-GNNs, $\operatorname{Max}$-GNNs, and others cannot be approximated by $\operatorname{Sum}$-GNNs.

•

Essential also with finite input-features domain. In Section 5.2 we prove that even with just single-value input features, there are functions that can be exactly computed by a (Sum, Mean)-GNN (a GNN that uses both Sum-aggregation and Mean-aggregation) or by a (Sum, Max)-GNN, but cannot be approximated by $\operatorname{Sum}$-GNNs.

•

The world is not enough. In Section 6, we examine GNNs with any finite combination of Sum, Mean, Max, and other aggregations, and prove upper bounds on their expressivity already in the single-value input features setting.

Lastly, in Section 7 we experiment with synthetic data and observe that what we proved to be expressible is to an extent also learnable, and that in practice inexpressivity is manifested in a significantly higher error than implied in theory.

All proofs, some of the lemmas, and extended illustration and analysis of the experimentation, are found in the appendix.

1.2 Related Work

The term Graph Neural Network, along with one of the basic models of GNNs, was introduced in [?]. Since then, more than a few works have explored aspects of expressivity of GNNs. Some have explored the distinguishing power of different models of GNNs [?; ?; ?; ?; ?; ?; ?], and some have examined the expressivity of GNNs depending on the aggregations they use [?; ?]. In [?], a connection between distinguishing power and function approximation is described. In all of the above, the non-uniform notion was considered. In the uniform notion, it was proven that $\operatorname{Sum}$ -GNNs can express every logical formula in Guarded Countable Logic with 2 variables (GC2) [?; ?]. A theoretical survey of the expressivity of GNNs is found in [?], and a practical survey of different models of GNNs is found in [?].

2 Preliminaries

By ${\mathbb{N}},{\mathbb{N}}_{>0},{\mathbb{Q}},{\mathbb{R}}$ we denote the sets of nonnegative integers, positive integers, rational numbers, and real numbers, respectively. For $a,b\in{\mathbb{N}}:a\leq b$ we denote the set $\{n\in{\mathbb{N}}:a\leq n\leq b\}$ by $[a..b]$. For $b\in{\mathbb{N}}_{>0}$ we denote the set $[1..b]$ by $[b]$. For $a,b\in{\mathbb{R}}:a\leq b$, we denote the set $\{r\in{\mathbb{R}}:a\leq r\leq b\}$ by $[a,b]$. We may use the terms "average" and "mean" interchangeably to denote the arithmetic mean. We use "{}" as notation for a multiset. For $x\in{\mathbb{R}},b\in{\mathbb{N}}_{>0}$, we define ${\{x\}\choose b}\coloneqq\{x,\ldots,x\}$, the multiset consisting of $b$ instances of $x$. For $d\in{\mathbb{N}}_{>0}$ and a vector $v\in{\mathbb{R}}^{d}$, we define $\left\lvert v\right\rvert\coloneqq\max(\left\lvert v_{i}\right\rvert_{i\in[d]})$. For two vectors $u,v\in{\mathbb{R}}^{d}$, we define '$\leq$' by $u\leq v\Leftrightarrow\forall i\in[d]\ u_{i}\leq v_{i}$.

2.1 Graphs

An undirected graph $G=\langle V(G),E(G)\rangle$ is a pair, $V(G)$ being a set of vertices and $E(G)\subseteq\{\{u,v\}\mid u,v\in V(G)\}$ being a set of undirected edges. For a vertex $v\in V(G)$ we denote by $N(v)\coloneqq\{w\in V(G)\mid\{w,v\}\in E(G)\}$ the neighbourhood of $v$ in $G$ , and we denote the size of it by $n_{v}\coloneqq|N(v)|$ .

A (vertex) featured graph $G=\langle V(G),E(G),S^{d},Z(G)\rangle$ is a $4$ -tuple being a graph with a feature map $Z(G):V(G)\rightarrow S^{d}$ , mapping each vertex to a $d$ -tuple over a set $S$ . We denote the set of graphs featured over $S^{d}$ by ${\mathcal{G}}_{S^{d}}$ , we define ${{\mathcal{G}}_{S}\coloneqq\bigcup_{d\in{\mathbb{N}}}{\mathcal{G}}_{S^{d}}}$ , and we denote the set of all featured graphs by ${\mathcal{G}}_{*}$ . The special set of graphs featured over {1} is denoted ${\mathcal{G}}_{1}$ . We denote the set of all feature maps that map to $S^{d}$ by ${\mathcal{Z}}_{S^{d}}$ , we denote $\bigcup_{d\in{\mathbb{N}}}{\mathcal{Z}}_{S^{d}}$ by ${\mathcal{Z}}_{S}$ , and we denote the set of all feature maps by ${\mathcal{Z}}_{*}$ . Let a featured-graph domain $D\subseteq{\mathcal{G}}_{*}$ , a mapping $f:{\mathcal{G}}_{D}\rightarrow{\mathcal{Z}}_{*}$ to new feature maps is called a feature transformation.

For a featured graph $G$ and a vertex $v\in V(G)$ we define $\operatorname{sum}(v)\coloneqq\Sigma_{w\in N(v)}Z(G)(w)$ , $\operatorname{avg}(v)\coloneqq\frac{1}{n_{v}}\operatorname{sum}(v)$ , and $\max(v)\coloneqq\max(Z(G)(w):w\in N(v))$ . In this paper, we consider the size of a graph $G$ to be its number of vertices, that is, $\left\lvert G\right\rvert\coloneqq\left\lvert V(G)\right\rvert$ .

2.2 Feedforward Neural Networks

A feedforward neural network (FNN) ${\mathfrak{F}}$ is a directed acyclic graph where each edge $e$ carries a weight $w_{e}^{\mathfrak{F}}\in{\mathbb{R}}$, each node $v$ of positive in-degree carries a bias $b_{v}^{\mathfrak{F}}\in{\mathbb{R}}$, and each node $v$ has an associated continuous activation function ${\mathfrak{a}}_{v}^{\mathfrak{F}}:{\mathbb{R}}\to{\mathbb{R}}$. The nodes of in-degree $0$, usually $X_{1},\ldots,X_{p}$, are the input nodes and the nodes of out-degree $0$, usually $Y_{1},\ldots,Y_{q}$, are the output nodes. We denote the underlying directed graph of an FNN ${\mathfrak{F}}$ by $(V({\mathfrak{F}}),E({\mathfrak{F}}))$, and we call $\big(V({\mathfrak{F}}),E({\mathfrak{F}}),({\mathfrak{a}}^{\mathfrak{F}}_{v})_{v\in V({\mathfrak{F}})}\big)$ the architecture of ${\mathfrak{F}}$, notated $A({\mathfrak{F}})$. We drop the index ${\mathfrak{F}}$ from the weights, biases and activation functions if ${\mathfrak{F}}$ is clear from the context.

The input dimension of an FNN is the number of input nodes, and the output dimension is the number of output nodes. The depth $\operatorname{depth}({\mathfrak{F}})$ of an FNN ${\mathfrak{F}}$ is the maximum length of a path from an input node to an output node.

To define the semantics, let ${\mathfrak{F}}$ be an FNN of input dimension $p$ and output dimension $q$ . For each node $v\in V({\mathfrak{F}})$ , we define a function $f_{{\mathfrak{F}},v}:{\mathbb{R}}^{p}\to{\mathbb{R}}$ by $f_{{\mathfrak{F}},X_{i}}(x_{1},\ldots,x_{p})\coloneqq x_{i}$ for the $i$ th input node $X_{i}$ and

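$f_{{\mathfrak{F}},v}(\vec{x})\coloneqq{\mathfrak{a}}_{v}\Big(b_{v}+\sum_{j=1}^{k}f_{{\mathfrak{F}},u_{j}}(\vec{x})\cdot w_{e_{j}}\Big)$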

for every node $v$ with incoming edges $e_{j}=(u_{j},v)$ . Then ${\mathfrak{F}}$ computes the function $f_{\mathfrak{F}}:{\mathbb{R}}^{p}\to{\mathbb{R}}^{q}$ defined by

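$f_{\mathfrak{F}}(\vec{x})\coloneqq\big(f_{{\mathfrak{F}},Y_{1}}(\vec{x}),\ldots,f_{{\mathfrak{F}},Y_{q}}(\vec{x})\big)$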

For an FNN ${\mathfrak{F}}$, we consider the size of ${\mathfrak{F}}$ to be the size of its underlying graph. That is, $\left\lvert{\mathfrak{F}}\right\rvert=\left\lvert V({\mathfrak{F}})\right\rvert$.
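The semantics above can be sketched in code. The following is a minimal illustration, not the authors' implementation: the names `evaluate_fnn`, `in_edges`, `bias`, and the small example network (which computes $|x_1-x_2|$) are assumptions introduced here. Each non-input node applies ReLU to its bias plus the weighted sum of its predecessors' values, following the recursive definition of $f_{{\mathfrak{F}},v}$.

```python
from typing import Dict, List, Sequence, Tuple


def relu(x: float) -> float:
    return max(0.0, x)


def evaluate_fnn(
    inputs: Sequence[str],
    in_edges: Dict[str, List[Tuple[str, float]]],
    bias: Dict[str, float],
    outputs: Sequence[str],
    x: Sequence[float],
) -> Tuple[float, ...]:
    """Evaluate a ReLU-activated FNN given as a DAG.

    in_edges maps each non-input node v to its incoming edges (u_j, w_{e_j}).
    """
    # Input node X_i simply returns the i-th argument.
    values: Dict[str, float] = dict(zip(inputs, x))

    def value(v: str) -> float:
        # Memoised recursion over the DAG: a_v(b_v + sum_j f_{u_j}(x) * w_{e_j}).
        if v not in values:
            weighted = sum(value(u) * w for u, w in in_edges[v])
            values[v] = relu(bias[v] + weighted)
        return values[v]

    return tuple(value(y) for y in outputs)


# Hypothetical example network: Y1 = ReLU(ReLU(x1 - x2) + ReLU(x2 - x1)) = |x1 - x2|.
INPUTS = ["X1", "X2"]
IN_EDGES = {
    "H1": [("X1", 1.0), ("X2", -1.0)],
    "H2": [("X1", -1.0), ("X2", 1.0)],
    "Y1": [("H1", 1.0), ("H2", 1.0)],
}
BIAS = {"H1": 0.0, "H2": 0.0, "Y1": 0.0}
OUTPUTS = ["Y1"]
```

The example network has five nodes, so its size in the sense defined above is 5.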

A common activation function is the ReLU activation, defined as $ReLU(x)\coloneqq\max(0,x)$. In this paper, we assume all FNNs to be ReLU activated. ReLU activated FNNs subsume every FNN activated by a piecewise-linear function with finitely many pieces, thus the results of this paper hold true for all such FNNs. Every ReLU activated FNN ${\mathfrak{F}}$ is Lipschitz-Continuous. That is, there exists a minimal $a_{\mathfrak{F}}\in{\mathbb{R}}_{\geq 0}$ such that for every pair of input and output coordinates $(i,j)$, for all input arguments $x_{1},\ldots,x_{n}$, and for every $\delta>0$, it holds that

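$\frac{\left\lvert f_{\mathfrak{F}}(x_{1},\ldots,x_{n})_{j}-f_{\mathfrak{F}}(x_{1},\ldots,x_{i-1},x_{i}+\delta,\ldots,x_{n})_{j}\right\rvert}{\delta}\leq a_{\mathfrak{F}}$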

We call $a_{\mathfrak{F}}$ the Lipschitz-Constant of ${\mathfrak{F}}$.

2.3 Graph Neural Networks

Several GNN models are described in the literature. In this paper, we define and consider the Aggregate-Combine (AC-GNN) model [?; ?]. Some of our results extend straightforwardly to the messaging scheme of MPNN [?], yet such extensions are out of scope of this paper.

A GNN layer, of input and output (I/O) dimensions ${p;q}$, is a pair $({\mathfrak{F}},agg)$ such that: ${\mathfrak{F}}$ is an FNN of I/O dimensions ${2p;q}$, and $agg$ is an order-invariant $p$-dimension multiset-to-one aggregation function. An $m$-layer GNN ${\mathcal{N}}=(({\mathfrak{F}}_{1},agg_{1}),\ldots,({\mathfrak{F}}_{m},agg_{m}))$, of I/O dimensions $p;q$, is a sequence of $m$ GNN layers of I/O dimensions $p^{(i)};q^{(i)}$ such that: $p^{(1)}=p$, $q^{(m)}=q$ and $\forall i\in[m-1]\ p^{(i+1)}=q^{(i)}$. It determines a series of $m$ feature transformations as follows: Let a graph $G\in{\mathcal{G}}_{{\mathbb{R}}^{p}}$ and vertex $v\in V(G)$, then ${\mathcal{N}}^{(0)}(G,v)\coloneqq Z(G)(v)$, and for ${i\in[m]}$ we define a transformation

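${\mathcal{N}}^{(i)}(G,v)\coloneqq f_{{\mathfrak{F}}_{i}}\big({\mathcal{N}}^{(i-1)}(G,v),agg_{i}({\mathcal{N}}^{(i-1)}(G,w):w\in N(v))\big)$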

We notate by ${\mathcal{N}}(G,v)\coloneqq{\mathcal{N}}^{(m)}(G,v)$ the final output of ${\mathcal{N}}$ for $v$ . We define the size of ${\mathcal{N}}$ to be ${\left\lvert{\mathcal{N}}\right\rvert\coloneqq\Sigma_{i\in[m]}\left\lvert{\mathfrak{F}}_{i}\right\rvert}$ the sum of its underlying FNNs’ sizes. We call $\big{(}(A({\mathfrak{F}}_{1}),agg_{1}),\ldots,(A({\mathfrak{F}}_{m}),agg_{m})\big{)}$ the architecture of ${\mathcal{N}}$ , notated $A({\mathcal{N}})$ , and say that ${\mathcal{N}}$ realizes $A({\mathcal{N}})$ . For an aggregation function $agg$ , we denote by $agg$ -GNNs the class of GNNs for which $\forall i\in[m]\ agg_{i}=agg$ . For aggregation functions $agg_{1},agg_{2}$ , we denote by $(agg_{1},agg_{2})$ -GNNs the class of GNNs with $m=2n$ layers such that ${\forall i\in[n]\ agg_{2i-1}=agg_{1},agg_{2i}=agg_{2}}$ .
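The Aggregate-Combine scheme defined in this section can be sketched as follows. This is an illustrative sketch only: `run_gnn`, `relu_add`, the adjacency-dictionary graph encoding and the three-vertex path are assumptions made here, and each layer's FNN is stood in for by a plain Python callable taking the node's own feature and the aggregated neighbour feature. The sketch also assumes every vertex has at least one neighbour, so that Mean and Max are well defined.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]
Aggregator = Callable[[Sequence[Vector]], Vector]
Update = Callable[[Vector, Vector], Vector]


def sum_agg(msgs: Sequence[Vector]) -> Vector:
    return tuple(sum(col) for col in zip(*msgs))


def mean_agg(msgs: Sequence[Vector]) -> Vector:
    return tuple(sum(col) / len(msgs) for col in zip(*msgs))


def max_agg(msgs: Sequence[Vector]) -> Vector:
    return tuple(max(col) for col in zip(*msgs))


def run_gnn(
    layers: Sequence[Tuple[Update, Aggregator]],
    adj: Dict[str, List[str]],
    features: Dict[str, Vector],
) -> Dict[str, Vector]:
    """Apply the layers in order; every vertex uses the same update and aggregation."""
    current = dict(features)  # N^(0)(G, v) = Z(G)(v)
    for update, agg in layers:
        # N^(i)(G, v) = f_i(N^(i-1)(G, v), agg_i(N^(i-1)(G, w) : w in N(v)))
        current = {
            v: update(current[v], agg([current[w] for w in adj[v]]))
            for v in adj
        }
    return current


def relu_add(own: Vector, aggregated: Vector) -> Vector:
    # Hypothetical stand-in for a one-unit ReLU FNN: ReLU(own + aggregated).
    return tuple(max(0.0, o + a) for o, a in zip(own, aggregated))


# Hypothetical example: the path a - b - c with one-dimensional features.
PATH = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
FEATURES = {"a": (1.0,), "b": (2.0,), "c": (4.0,)}
```

On this path only the middle vertex `b` has two neighbours, so it is the only vertex where the three aggregations give different results.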

2.4 Expressivity

Let $p,q\in{\mathbb{N}}$ and let $S$ be a set. Let $F=\{f:{\mathcal{G}}_{S^{p}}\rightarrow{\mathcal{Z}}_{{\mathbb{R}}^{q}}\}$ be a set of feature transformations, and let ${h:{\mathcal{G}}_{S^{p}}\rightarrow{\mathcal{Z}}_{{\mathbb{R}}^{q}}}$ be a feature transformation. We say that $F$ uniformly additively approximates $h$, notated $F\approx h$, if and only if

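$\forall\varepsilon>0\ \exists f\in F:\ \forall G\in{\mathcal{G}}_{S^{p}}\ \forall v\in V(G)\quad\left\lvert f(G)(v)-h(G)(v)\right\rvert\leq\varepsilon$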

The essence of uniformity is that one function "works" for graphs of all sizes, unlike non-uniformity where it is enough to have a specific function for each specific size of input graphs. The proximity measure is additive, as opposed to multiplicative where it is required that ${\left\lvert\frac{f(G)(v)-h(G)(v)}{h(G)(v)}\right\rvert\leq\varepsilon}$. In this paper, approximation always means uniform additive approximation, and we use the term "approximates" synonymously with "expresses". Although our no-approximation statements consider additive approximation, they also hold for multiplicative approximation, and the respective proofs (in the appendix) need little additional argumentation to show that.
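The difference between the two notions is only the position of the existential quantifier. The block below writes both side by side; the non-uniform statement is a formalization assumed here from the informal description in the introduction, not a definition taken from the paper.

```latex
% Uniform additive approximation: a single f serves graphs of every size.
\forall \varepsilon > 0 \;\; \exists f \in F : \;\;
  \forall G \in \mathcal{G}_{S^{p}} \;\; \forall v \in V(G) \quad
  \lvert f(G)(v) - h(G)(v) \rvert \leq \varepsilon

% Non-uniform approximation (assumed formalization): f may depend on the size n.
\forall \varepsilon > 0 \;\; \forall n \in \mathbb{N} \;\; \exists f \in F : \;\;
  \forall G \in \mathcal{G}_{S^{p}} \text{ with } \lvert G \rvert = n \;\;
  \forall v \in V(G) \quad
  \lvert f(G)(v) - h(G)(v) \rvert \leq \varepsilon
```

Every uniformly approximable function is also non-uniformly approximable under this reading, since the single $f$ can be reused for each $n$; the converse is what fails in the results that follow.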

Let $F,H$ be sets of feature transformations $f:{\mathcal{G}}_{S^{p}}\rightarrow{\mathcal{Z}}_{{\mathbb{R}}^{q}}$ , we say $F$ subsumes $H$ , notated $F\geq H$ if and only if for every $h:{\mathcal{G}}_{S^{p}}\rightarrow{\mathcal{Z}}_{{\mathbb{R}}^{q}}$ it holds that $H\approx h\Rightarrow F\approx h$ . If the subsumption holds only for graphs featured with a subset $T^{p}\subset S^{p}$ we notate it as $F\geq^{{}_{T}}H$ .

Let $p,q\in{\mathbb{N}}$ . We call an order-invariant mapping ${f:{\mathcal{Z}}_{{\mathbb{R}}^{p}}\rightarrow{\mathbb{R}}^{q}}$ , from feature maps to $q$ -tuples, a readout function. Both $\operatorname{sum}$ and $\operatorname{avg}$ are commonly used to aggregate feature maps, possibly followed by an FNN that maps the aggregation value to a final output. We call a mapping ${f:{\mathcal{G}}_{S^{p}}\rightarrow{\mathbb{R}}^{q}}$ , from featured graphs to $q$ -tuples, a graph embedding. Let $w\in{\mathbb{N}}$ , let a set of feature transformations $F={\{f:{\mathcal{G}}_{S^{p}}\rightarrow{\mathcal{Z}}_{{\mathbb{R}}^{q}}\}}$ , and let a readout ${r:{\mathcal{Z}}_{{\mathbb{R}}^{q}}\rightarrow{\mathbb{R}}^{w}}$ , we notate the set of embeddings ${\{r\circ f:f\in F\}}$ by $r\circ F$ . We use the expressivity terms and notations defined for feature transformations, for graph embeddings as well.

3 Mean and Max Do Not Subsume

It has already been stated that $\operatorname{Sum}$ -GNNs can express functions that $\operatorname{Mean}$ -GNNs and $\operatorname{Max}$ -GNNs cannot [?]. For the sake of completeness we provide formal proofs that $\operatorname{Mean}$ -GNNs and $\operatorname{Max}$ -GNNs subsume neither $\operatorname{Sum}$ -GNNs nor each other.

3.1 Mean and Max do not subsume Sum

Neither $\operatorname{Mean}$ -GNNs nor $\operatorname{Max}$ -GNNs subsume $\operatorname{Sum}$ -GNNs, even when the input-feature domain is a single value.

We define a featured star graph with (a parameter) $k$ leaves, $G_{k}$ (see Figure 3): For every $k\in{\mathbb{N}}_{>0}$ :

•

$V(G_{k})=\{u\}\cup\{v_{1},\ldots,v_{k}\}$

•

$E(G_{k})=\bigcup_{i\in[k]}\{\{u,v_{i}\}\}$

•

$Z(G_{k})=\{(u,1)\}\bigcup_{i\in[k]}\{(v_{i},1)\}$

Let ${\mathcal{N}}$ be an $m$ -layer GNN. We define $u^{(t)}_{k}\coloneqq{\mathcal{N}}^{(t)}(G_{k},u)$ , the feature of $u\in V(G_{k})$ after operating the first $t$ layers of ${\mathcal{N}}$ . Note that $u^{(m)}_{k}={\mathcal{N}}(G_{k},u)$ .

Lemma 3.1.

Assume ${\mathcal{N}}$ is a $\operatorname{Mean}$ -GNN or a $\operatorname{Max}$ -GNN . Let the maximum input dimension of any layer be $d$ , and let the maximum Lipschitz-Constant of any FNN of ${\mathcal{N}}$ be $a$ . Then, for every $k$ it holds that $\left\lvert u^{(m)}_{k}\right\rvert\leq(da)^{m}$ .

Theorem 3.2.

Let $f:{\mathcal{G}}_{1}\rightarrow{\mathcal{Z}}_{\mathbb{R}}$ a feature transformation such that for every $k$ it holds that $f(G_{k})(u)=k$ . Then, ${\text{$ \operatorname{Mean} $-GNNs }\not\approx f}$ and $\operatorname{Max}$ -GNNs $\not\approx f$ .

Note that by Theorem 3.2, a function such as neighbors-count is inexpressible by $\operatorname{Mean}$ -GNNs and $\operatorname{Max}$ -GNNs .
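The intuition behind Lemma 3.1 and Theorem 3.2 can be checked numerically. In the sketch below, which is an illustration and not part of the paper's proof, `center_feature` simulates a GNN on the star $G_k$ by tracking one shared leaf value (all leaves are symmetric); the update rule `relu_unit` and its coefficients are arbitrary choices made here. With Mean or Max aggregation the center's feature is the same for every $k$, whereas with Sum it grows with $k$.

```python
from typing import Callable, List


def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)


def relu_unit(own: float, aggregated: float) -> float:
    # Arbitrary one-unit ReLU update, chosen only for illustration.
    return max(0.0, 0.5 * own + 2.0 * aggregated - 0.25)


def center_feature(
    k: int,
    agg: Callable[[List[float]], float],
    update: Callable[[float, float], float],
    layers: int,
) -> float:
    """Feature of the center u of the star G_k after `layers` identical layers."""
    center, leaf = 1.0, 1.0  # every vertex of G_k starts with feature 1
    for _ in range(layers):
        # The center aggregates k identical leaves; each leaf sees only the center.
        center, leaf = update(center, agg([leaf] * k)), update(leaf, agg([center]))
    return center


KS = [1, 10, 1000]
MEAN_OUT = [center_feature(k, mean, relu_unit, 3) for k in KS]
MAX_OUT = [center_feature(k, max, relu_unit, 3) for k in KS]
SUM_OUT = [center_feature(k, sum, relu_unit, 3) for k in KS]
```

Since a Mean-GNN or Max-GNN produces one fixed value at $u$ for all $k$, it cannot stay within a fixed $\varepsilon$ of a target such as $f(G_k)(u)=k$.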

Corollary 3.3.

We have that $\operatorname{Mean}$ -GNNs $\not\geq^{{}_{\{1\}}}$ $\operatorname{Sum}$ -GNNs, $\operatorname{Max}$ -GNNs $\not\geq^{{}_{\{1\}}}$ $\operatorname{Sum}$ -GNNs.

3.2 Mean and Max do not subsume each other

$\operatorname{Mean}$-GNNs and $\operatorname{Max}$-GNNs do not subsume each other, even in a finite input-feature domain setting. We define a parameterized graph in which, depending on the parameters’ arguments, the average of the center’s neighbors is in $[0,\frac{1}{2}]$ while their max can be either $0$ or $1$. For every $k\in{\mathbb{N}}$ and $b\in\{0,1\}$:

•

$V(G_{k,b})=\{u\}\cup\{v_{1},\ldots,v_{k}\}\cup\{w\}$

•

$E(G_{k,b})=\bigcup_{i\in[k]}\{\{u,v_{i}\}\}\cup\{\{u,w\}\}$

•

$Z(G_{k,b})=\{(u,0)\}\bigcup_{i\in[k]}\{(v_{i},0)\}\cup\{(w,b)\}$

Theorem 3.4.

Let $f:{\mathcal{G}}_{\{0,1\}}\rightarrow{\mathcal{Z}}_{\mathbb{R}}$ a feature transformation such that for every $k$ it holds that $f(G_{k,b})(u)=\frac{b}{k+1}$ . Then, $\operatorname{Max}$ -GNNs $\not\approx f$ .

Theorem 3.5.

Let $f:{\mathcal{G}}_{\{0,1\}}\rightarrow{\mathcal{Z}}_{\mathbb{R}}$ a feature transformation such that for every $k$ it holds that $f(G_{k,b})(u)=b$ . Then, $\operatorname{Mean}$ -GNNs $\not\approx f$ .

Corollary 3.6.

We have that $\operatorname{Mean}$ -GNNs $\not\geq^{{}_{\{0,1\}}}$ $\operatorname{Max}$ -GNNs , $\operatorname{Max}$ -GNNs $\not\geq^{{}_{\{0,1\}}}$ $\operatorname{Mean}$ -GNNs .
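The first-layer view of the center $u$ in $G_{k,b}$ makes the two theorems plausible. The helper below (`center_aggregates`, a name introduced here) computes the mean and max of the neighbours' input features exactly, using rational arithmetic. It only illustrates the aggregated inputs at the first layer; the full argument covering all layers is in the paper's appendix.

```python
from fractions import Fraction
from typing import Tuple


def center_aggregates(k: int, b: int) -> Tuple[Fraction, Fraction]:
    """Mean and max of the input features of u's neighbours in G_{k,b}.

    The neighbours are v_1..v_k with feature 0 and w with feature b.
    """
    neighbour_features = [Fraction(0)] * k + [Fraction(b)]
    mean = sum(neighbour_features) / len(neighbour_features)
    return mean, max(neighbour_features)


# The max is b for every k, so it carries no information about k.
# The mean is b / (k + 1), so the gap between b = 0 and b = 1 shrinks as k grows.
GAPS = [center_aggregates(k, 1)[0] - center_aggregates(k, 0)[0] for k in (1, 9, 99)]
```

The max equals $b$ regardless of $k$, which is why a Max-GNN cannot track $\frac{b}{k+1}$. The mean of the $b=1$ graph approaches that of the $b=0$ graph as $k$ grows, so a Lipschitz-continuous Mean-GNN cannot keep the outputs for $b=0$ and $b=1$ a fixed distance apart.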

4 Sometimes Sum Subsumes

In a bounded input-feature domain setting, $\operatorname{Sum}$-GNNs can express every function that $\operatorname{Mean}$-GNNs and $\operatorname{Max}$-GNNs can. The bounded input-feature domain results in a bounded range for Mean and Max, a fact which can be exploited to approximate the target GNN with a Sum-GNN. The approximating Sum-GNNs that we describe come at a size cost. We do not know whether an asymptotically-lower-cost construction exists.

4.1 Mean by Sum

$\operatorname{Sum}$ -GNNs subsume $\operatorname{Mean}$ -GNNs in a bounded input-feature domain setting.

Lemma 4.1.

For every $\varepsilon>0$ and $d\in{\mathbb{N}}_{>0}$, there exists a $\operatorname{Sum}$-GNN ${\mathcal{N}}$ of size $O(d\frac{1}{\varepsilon})$ such that for every featured graph ${G\in{\mathcal{G}}_{[0,1]^{d}}}$ it holds that ${\forall v\in V(G)}\;\left\lvert{\mathcal{N}}(G,v)-\operatorname{avg}(v)\right\rvert\leq\varepsilon$.

Theorem 4.2.

Let a $\operatorname{Mean}$ -GNN ${\mathcal{N}}_{M}$ consisting of $m$ layers, let the maximum input dimension of any layer be $d$ , and let the maximum Lipschitz-Constant of any FNN of ${\mathcal{N}}_{M}$ be $a$ . Then, for every $\varepsilon>0$ there exists a $\operatorname{Sum}$ -GNN ${\mathcal{N}}_{S}$ such that:

1. ${\forall G\in{\mathcal{G}}_{[0,1]^{d}}\;\forall v\in V(G)\quad|{\mathcal{N}}_{M}(G,v)-{\mathcal{N}}_{S}(G,v)|\leq\varepsilon}$ .

2. $\left\lvert{\mathcal{N}}_{S}\right\rvert\leq O(\left\lvert{\mathcal{N}}_{M}\right\rvert+\frac{d\cdot m\cdot ad(1-(2ad)^{m})}{\varepsilon(1-(2ad))})$ .

Corollary 4.3.

$\operatorname{Sum}$ -GNNs $\geq^{{}_{[0,1]}}$ $\operatorname{Mean}$ -GNNs.

4.2 Max by Sum

$\operatorname{Sum}$ -GNNs subsume $\operatorname{Max}$ -GNNs in a bounded input-feature domain setting.

Lemma 4.4.

For every $\varepsilon>0$ and $d\in{\mathbb{N}}_{>0}$ , there exists a $\operatorname{Sum}$ -GNN ${\mathcal{N}}$ of size $O(d\frac{1}{\varepsilon})$ such that for every featured graph $G\in{\mathcal{G}}_{[0,1]^{d}}$ and vertex $v\in V(G)$ it holds that $\left\lvert N(G,v)-\max(v)\right\rvert\leq\varepsilon$ .

Theorem 4.5.

Let a $\operatorname{Max}$ -GNN ${\mathcal{N}}_{M}$ consisting of $m$ layers, let the maximum input dimension of any layer be $d$ , and let the maximum Lipschitz-Constant of any FNN of ${\mathcal{N}}_{M}$ be $a$ . Then, for every $\varepsilon>0$ there exists a $\operatorname{Sum}$ -GNN ${\mathcal{N}}_{S}$ such that:

1. ${\forall G\in{\mathcal{G}}_{[0,1]^{d}}\;\forall v\in V(G)\quad|{\mathcal{N}}_{M}(G,v)-{\mathcal{N}}_{S}(G,v)|\leq\varepsilon}$ .

2. $\left\lvert{\mathcal{N}}_{S}\right\rvert\leq O(\left\lvert{\mathcal{N}}_{M}\right\rvert+\frac{d\cdot m\cdot ad(1-(2ad)^{m})}{\varepsilon(1-(2ad))})$ .

Corollary 4.6.

$\operatorname{Sum}$ -GNNs $\geq^{{}_{[0,1]}}$ $\operatorname{Max}$ -GNNs.

5 Mean and Max Have Their Place

In two important settings, Mean and Max aggregations enable expressing functions that cannot be expressed with Sum alone. As in Section 3, we define a graph $G_{\theta}$ parameterized by $\theta$ over domain $\Theta$ . We define a feature transformation $f$ on that graph and prove that it cannot be approximated by $\operatorname{Sum}$ -GNNs. The line of proofs (in the appendix) is as follows:

1. We show that for every $\operatorname{Sum}$ -GNN ${\mathcal{N}}$ there exists a finite set $F_{{\mathcal{N}}}$ of polynomials of $\theta$ , each satisfying a certain property $\varphi$ , such that:

[TABLE]

2. We show that for every finite set $F$ of polynomials ${\text{(of }\theta\text{)}}$ that satisfy $\varphi$ , it holds that:

[TABLE]

5.1 Unbounded, Countable, Input-Feature Domain

In an unbounded input-feature domain setting, Mean; Max; and other GNNs are not subsumed by $\operatorname{Sum}$ -GNNs. We define a graph $G_{k,c}$ (see Figure 3): For $(k,c)\in{\mathbb{N}}_{>0}^{2}$ ,

• $V(G_{k,c})=\{u\}\cup\{v_{1},\ldots,v_{k}\}$

• $E(G_{k,c})=\bigcup_{i\in[k]}\{\{u,v_{i}\}\}$

• $Z(G_{k,c})=\{(u,0)\}\bigcup_{i\in[k]}\{(v_{i},c)\}$

Theorem 5.1.

Let $f:{\mathcal{G}}_{{\mathbb{N}}^{1}}\rightarrow{\mathcal{Z}}_{\mathbb{R}}$ a feature transformation, such that for every $k,c$ it holds that $f(G_{k,c})(u)=c$ . Then, $\text{$ \operatorname{Sum} $-GNNs }\not\approx f$ .

Corollary 5.2.

Denote by $S$ the set of all multisets over ${\mathbb{N}}_{>0}$ . Let $g:S\rightarrow{\mathbb{R}}$ an aggregation such that $\forall a,b\in{\mathbb{N}}_{>0}\ g({\{a\}\choose b})=a$ , that is, $g$ aggregates every homogeneous multiset to its single unique value. Then, $\operatorname{Sum}$ -GNNs $\not\geq^{{}_{{\mathbb{N}}}}$ g-aggregation GNNs.

Corollary 5.2 implies a limitation of $\operatorname{Sum}$ -GNNs compared to GNNs that use Mean; Max; or many other aggregations.
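The role of the homogeneous-multiset condition in Corollary 5.2 can be illustrated with a minimal Python sketch. It models only the aggregation step at the center $u$ of the star $G_{k,c}$ , not a full GNN layer, and the function names are hypothetical.

```python
def star_neighbor_features(k: int, c: int) -> list[int]:
    """Features of the k leaves of G_{k,c}; each leaf carries the value c."""
    return [c] * k


def aggregate_at_center(k: int, c: int) -> dict[str, float]:
    """Aggregate the leaves' features as seen from the center u."""
    feats = star_neighbor_features(k, c)
    return {
        "sum": float(sum(feats)),         # grows with the degree k
        "mean": sum(feats) / len(feats),  # independent of k
        "max": float(max(feats)),         # independent of k
    }
```

Under these assumptions, Mean and Max return $c$ regardless of $k$ , whereas Sum returns the product $kc$ .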

Graph Embedding

$\operatorname{Sum}$ -GNNs are limited compared to Mean; Max; and other GNNs, not only when used to approximate vertices’ feature transformations but also when used in combination with a readout function to approximate graph embeddings. Consider another variant of $G_{k,c}$ : For $(k,c)\in{\mathbb{N}}_{>0}^{2}$ ,

• $V(G_{k,c})=\{u_{1},\ldots,u_{k^{2}}\}\cup\{v_{1},\ldots,v_{k}\}$

• $E(G_{k,c})=\bigcup_{i\in[k^{2}],j\in[k]}\{\{u_{i},v_{j}\}\}$

• $Z(G_{k,c})=\bigcup_{i\in[k^{2}]}\{(u_{i},0)\}\bigcup_{i\in[k]}\{(v_{i},c)\}$

Theorem 5.3.

Let $f:{\mathcal{G}}_{{\mathbb{N}}^{1}}\rightarrow{\mathbb{R}}$ a graph embedding such that $\forall k,c\ f(G_{k,c})=\frac{kc}{k+1}$ . Let an aggregation ${{\mathfrak{a}}\in\{\operatorname{sum},\operatorname{avg}\}}$ and an FNN ${\mathfrak{F}}$ , and define a readout ${{\mathfrak{r}}{\mathfrak{o}}}\coloneqq f_{\mathfrak{F}}\circ{\mathfrak{a}}$ . Then, ${{{\mathfrak{r}}{\mathfrak{o}}}\circ\text{$ \operatorname{Sum} $-GNNs }\not\approx f}$ .

Corollary 5.4.

Denote by $S$ the set of all multisets over ${\mathbb{N}}_{>0}$ . Let $g:S\rightarrow{\mathbb{R}}$ an aggregation such that $\forall a,b\in{\mathbb{N}}_{>0}\ g({\{a\}\choose b})=a$ . Let an aggregation ${{\mathfrak{a}}\in\{\operatorname{sum},\operatorname{avg}\}}$ and an FNN ${\mathfrak{F}}$ , and define a readout ${{\mathfrak{r}}{\mathfrak{o}}}\coloneqq f_{\mathfrak{F}}\circ{\mathfrak{a}}$ . Then, ${{{\mathfrak{r}}{\mathfrak{o}}}\circ\text{$ \operatorname{Sum} $-GNNs }\not\geq^{{}_{{\mathbb{N}}}}\operatorname{avg}\circ\ g\text{-GNNs}}$ .
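To see why the target $\frac{kc}{k+1}$ of Theorem 5.3 is within reach of an averaging readout over a $g$ -aggregation GNN, the following Python sketch evaluates one Mean-aggregation step on the graph-embedding variant of $G_{k,c}$ . It assumes, as a simplification that is not taken from the paper, that the layer outputs exactly the aggregated neighbor value; the function name is hypothetical.

```python
def embed_with_mean_then_avg(k: int, c: int) -> float:
    """One Mean-aggregation step on the embedding variant of G_{k,c},
    followed by an averaging readout over all vertices."""
    num_u, num_v = k * k, k
    # Every u_i is adjacent to all v_j (feature c), so its neighbor mean is c.
    u_value = (num_v * c) / num_v
    # Every v_j is adjacent to all u_i (feature 0), so its neighbor mean is 0.
    v_value = (num_u * 0) / num_u
    # Averaging readout over all k^2 + k vertices.
    return (num_u * u_value + num_v * v_value) / (num_u + num_v)
```

The result is $\frac{k^{2}c}{k^{2}+k}=\frac{kc}{k+1}$ , which matches the target embedding.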

We have shown that $\operatorname{Sum}$ -GNNs do not subsume Mean and Max (and many other) GNNs. That setting, though, used the input-feature domain ${\mathbb{N}}_{>0}$ , which is countable and unbounded.

5.2 Finite Input-Feature Domain

Mean and Max aggregations are also essential when the input-feature domain is just a single value, that is, when the input graphs are featureless. We define a new graph $G_{k,c}$ (see Figure 3): For every $(k,c)\in{\mathbb{N}}_{>0}^{2}$ ,

• $V(G_{k,c})=\{u\}\cup\{v_{1},\ldots,v_{k}\}\cup\{w_{1},\ldots,w_{c}\}$

• $E(G_{k,c})=\bigcup_{i\in[k]}\{\{u,v_{i}\}\}\bigcup_{i\in[k],j\in[c]}\{\{v_{i},w_{j}\}\}$

• $Z(G_{k,c})=\{(u,1)\}\bigcup_{i\in[k]}\{(v_{i},1)\}\bigcup_{i\in[c]}\{(w_{i},1)\}$

Theorem 5.5.

Let $f:{\mathcal{G}}_{1}\rightarrow{\mathcal{Z}}_{\mathbb{R}}$ a feature transformation, such that for every $k,c$ it holds that $f(G_{k,c})(u)=c$ . Then, $\operatorname{Sum}$ -GNNs $\not\approx f$ .

Corollary 5.6.

Denote by $S$ the set of all multisets over ${\mathbb{N}}_{>0}$ , and let $g:S\rightarrow{\mathbb{R}}$ an aggregation such that $\forall a,b\in{\mathbb{N}}_{>0}\ g({\{a\}\choose b})=a$ . Then, $\operatorname{Sum}$ -GNNs $\not\geq^{{}_{\{1\}}}$ (Sum, g)-GNNs.

Corollary 5.6 implies a limitation of $\operatorname{Sum}$ -GNNs compared to stereo aggregation GNNs that combine Sum with Mean; Max; or many other aggregations. The limitation exists even when the input-feature domain consists of only a single value.
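A small Python sketch may help to see how a stereo (Sum, Mean) computation can recover $c$ at the center of the featureless $G_{k,c}$ . It is a hand-built illustration, not the paper's construction: the vertex's own state is ignored, the first layer subtracts a constant 1 after a Sum aggregation, and all names are hypothetical.

```python
def build_graph(k: int, c: int) -> dict[str, list[str]]:
    """Adjacency lists of the featureless graph G_{k,c} from Section 5.2."""
    vs = [f"v{i}" for i in range(k)]
    ws = [f"w{j}" for j in range(c)]
    adj = {"u": list(vs)}
    for v in vs:
        adj[v] = ["u"] + ws
    for w in ws:
        adj[w] = list(vs)
    return adj


def mean(xs: list[float]) -> float:
    return sum(xs) / len(xs)


def center_value(k: int, c: int, second_aggregation) -> float:
    """Two hand-built layers; returns the value computed at the center u."""
    adj = build_graph(k, c)
    feat = {x: 1.0 for x in adj}  # single-value feature domain {1}
    # Layer 1 (Sum): each vertex sums its neighbors' features and subtracts 1,
    # so every v_i holds (1 + c) - 1 = c.
    h1 = {x: sum(feat[y] for y in adj[x]) - 1.0 for x in adj}
    # Layer 2: the center aggregates the layer-1 values of its neighbors.
    return second_aggregation([h1[y] for y in adj["u"]])
```

With `mean` as the second aggregation the center obtains $c$ for every $k$ ; with the built-in `sum` it obtains $kc$ instead.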

Graph Embedding

Completing the no-subsumption picture, $\operatorname{Sum}$ -GNNs also fail to subsume in a two-value input-feature domain setting when they are combined with a readout function to approximate graph embeddings. We define $G_{k,c}$ : For every $(k,c)\in{\mathbb{N}}_{>0}^{2}$ ,

• $V(G_{k,c})=\{u_{1},\ldots,u_{k^{2}}\}\cup\{v_{1},\ldots,v_{k^{3}}\}\cup\{w_{1},\ldots,w_{kc}\}$

• $E(G_{k,c})=\bigcup_{j\in[k^{2}],i\in[k^{3}]}\{\{u_{j},v_{i}\}\}\bigcup_{i\in[k^{3}],j\in[kc]}\{\{v_{i},w_{j}\}\}$

• $Z(G_{k,c})=\bigcup_{i\in[k^{2}]}\{(u_{i},0)\}\bigcup_{i\in[k^{3}]}\{(v_{i},0)\}\bigcup_{i\in[kc]}\{(w_{i},1)\}$

Theorem 5.7.

Let $f:{\mathcal{G}}_{\{0,1\}^{1}}\rightarrow{\mathbb{R}}$ a graph embedding such that $\forall k,c\ f(G_{k,c})=\frac{(k^{2}+kc)kc}{k^{3}+k^{2}+kc}$ . Let an aggregation ${{\mathfrak{a}}\in\{\operatorname{sum},\operatorname{avg}\}}$ and an FNN ${\mathfrak{F}}$ , and define a readout ${{\mathfrak{r}}{\mathfrak{o}}}\coloneqq f_{\mathfrak{F}}\circ{\mathfrak{a}}$ . Then, ${{{\mathfrak{r}}{\mathfrak{o}}}\circ\text{$ \operatorname{Sum} $-GNNs }\not\approx f}$ .

Corollary 5.8.

Denote by $S$ the set of all multisets over ${\mathbb{N}}_{>0}$ . Let ${g:S\rightarrow{\mathbb{R}}}$ an aggregation such that ${\forall a,b\in{\mathbb{N}}_{>0}\ g({\{a\}\choose b})=a}$ . Let an aggregation ${{\mathfrak{a}}\in\{\operatorname{sum},\operatorname{avg}\}}$ and an FNN ${\mathfrak{F}}$ , and define a readout ${{\mathfrak{r}}{\mathfrak{o}}}\coloneqq f_{\mathfrak{F}}\circ{\mathfrak{a}}$ . Then, ${{{\mathfrak{r}}{\mathfrak{o}}}\circ\text{$ \operatorname{Sum} $-GNNs }\not\geq^{{}_{\{0,1\}}}\operatorname{avg}\circ\ \text{(Sum, g)-GNNs}}$ .

6 Sum and More are Not Enough

In previous sections we showed that $\operatorname{Sum}$ -GNNs do not subsume $\operatorname{Mean}$ -GNNs and $\operatorname{Max}$ -GNNs, by proving that they cannot express specific functions. In this section, rather than comparing different GNN classes, we focus on one broad GNN class and show that it is limited in its ability to express any one of a certain range of functions.

Denote by $S$ the set of all multisets over ${\mathbb{R}}$ , and let an aggregation ${{\mathfrak{a}}:S\rightarrow{\mathbb{R}}}$ . We say that ${\mathfrak{a}}$ is a uniform polynomial aggregation (UPA) if and only if for every homogeneous multiset ${\{x\}\choose b},x\in{\mathbb{R}},b\in{\mathbb{N}}_{>0}$ it holds that ${\mathfrak{a}}({\{x\}\choose b})$ is either a polynomial of $x$ or a polynomial of $(bx)$ . Note that Sum; Mean; and Max are all UPAs. We say that a GNN ${\mathcal{N}}=({\mathcal{L}}^{(1)},\ldots,{\mathcal{L}}^{(m)})$ is an $\operatorname{MUPA}$ -GNN (Multiple UPA) if and only if the aggregation input to each of its layers is defined by a series of UPAs. That is, ${\mathcal{L}}^{(i)}=({\mathfrak{F}}^{(i)},({\mathfrak{a}}^{(i)}_{1},\ldots,{\mathfrak{a}}^{(i)}_{b_{i}}))$ , for some $b_{i}$ UPAs.

We define a parameterized graph $G_{k}$ (see Figure 3): For every $k\in{\mathbb{N}}_{>0}$ :

• $V(G_{k})=\{u\}\cup\{v_{1},\ldots,v_{k}\}$

• $E(G_{k})=\bigcup_{i\in[k]}\{\{u,v_{i}\}\}$

• $Z(G_{k})=\{(u,1)\}\bigcup_{i\in[k]}\{(v_{i},1)\}$

Lemma 6.1.

Let ${\mathcal{A}}$ an $m$ -layer $\operatorname{MUPA}$ -GNN architecture, let $l$ be the maximum depth of any FNN in ${\mathcal{A}}$ , and let $d$ be the maximum in-degree of any node in any FNN in ${\mathcal{A}}$ . Then, there exists $r\in{\mathbb{N}}$ such that: for every GNN ${\mathcal{N}}$ that realizes ${\mathcal{A}}$ it holds that ${\mathcal{N}}(G_{k},u)$ is piecewise-polynomial (of $k$ ) with at most $((d+1)^{l})^{m}$ pieces, and each piece is of degree at most $r$ .

Lemma 6.1 implies that the architecture bounds (from above) the number of polynomial pieces, and their degrees, that make up the function computed by any particular realization of the architecture. With Lemma 6.1 at our disposal, we consider any feature transformation that does not converge to a polynomial when applied to $u\in V(G_{k})$ and viewed as a function of $k$ . We show that such a function is inexpressible by $\operatorname{MUPA}$ -GNNs.
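The piecewise-polynomial behaviour described by Lemma 6.1 can be observed on a toy case. The Python sketch below assumes a hypothetical one-layer Sum-GNN whose FNN has two ReLU units with hand-picked breakpoints 3 and 7; it only illustrates the phenomenon and does not check the bound $((d+1)^{l})^{m}$ .

```python
def relu(x: float) -> float:
    return max(0.0, x)


def center_output(k: int) -> float:
    """Toy one-layer Sum-GNN output at u in G_k; the Sum aggregation equals k."""
    aggregated = float(k)  # k leaves, each with feature 1
    # Hypothetical FNN with two ReLU units and integer breakpoints 3 and 7.
    return relu(aggregated - 3.0) - 2.0 * relu(aggregated - 7.0)


def count_breakpoints(k_max: int) -> int:
    """Count the points where the slope of k -> center_output(k) changes."""
    values = [center_output(k) for k in range(1, k_max + 1)]
    slopes = [b - a for a, b in zip(values, values[1:])]
    return sum(1 for s, t in zip(slopes, slopes[1:]) if s != t)
```

As a function of $k$ the output is linear on each of three ranges, so the number of slope changes is fixed by the FNN and does not grow with $k$ .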

Theorem 6.2.

Let $f:{\mathcal{G}}_{1}\rightarrow{\mathcal{Z}}_{\mathbb{R}}$ a feature transformation, and define $g(k)\coloneqq f(G_{k})(u)$ . Assume that $g$ does not converge to any polynomial, that is, there exists $\varepsilon>0$ such that for every polynomial $p$ , for every $K_{0}$ , there exists $k\geq K_{0}$ such that $\left\lvert g(k)-p(k)\right\rvert\geq\varepsilon$ . Then, $\operatorname{MUPA}$ -GNNs $\not\approx f$ .

The last inexpressivity property we prove, concerns a class of functions which we call PIL (Polynomial-Intersection Limited). For $n\in{\mathbb{N}}$ denote by $P_{n}$ the set of all polynomials of degree $\leq n$ . We say that a function $f:\mathbb{N}\rightarrow\mathbb{R}$ is PIL if and only if for every $n\in{\mathbb{N}}$ there exists $k_{n}\in{\mathbb{N}}$ such that for every polynomial $p\in P_{n}$ there exist at most $k_{n}-1$ consecutive integer points on which $p$ and $f$ assume the same value. Formally,

[TABLE]

We consider every feature transformation $f$ such that for $g(k)\coloneqq f(G_{k})(u)$ it holds that $g$ is PIL. This is a different characterization than "no polynomial-convergence" (in Theorem 6.2), and neither one implies the other. The result, though, is weaker for the current characterization. We show that every $\operatorname{MUPA}$ -GNN architecture can approximate such a function only down to a certain $\varepsilon>0$ . That is, every GNN that realizes the architecture, no matter the specific weights of its FNNs, is far from the function by at least $\varepsilon$ (in at least one point). The following lemma is an adaptation of the Polynomial of Best Approximation theorem [?; ?] to the integer domain. There, it is a step in the proof of the Equioscillation theorem attributed to Chebyshev [?].

Lemma 6.3.

For $x,k\in{\mathbb{N}}$ define $I_{x,k}\coloneqq\{x,x+1,\ldots,x+k-1\}$ the set of consecutive $k$ integers starting at $x$ . Let $f:\mathbb{N}\rightarrow\mathbb{R}$ be a PIL, let $n\in\mathbb{N}$ , and define $k_{n}\coloneqq$

[TABLE]

Then, for every $x\in\mathbb{N}$ there exists $\varepsilon_{x,k_{n}}>0$ such that: for every $p\in P_{n}$ there exists $y\in I_{x,k_{n}}$ for which $\left\lvert p(y)-f(y)\right\rvert\geq\varepsilon_{x,k_{n}}$ . That is, for every starting point $x$ there is a bounded interval $I_{x,k_{n}}$ , and a gap $\varepsilon_{x,k_{n}}$ , such that no polynomial of degree $\leq n$ can approximate $f$ on that interval below that gap.

Lemma 6.4.

For every $q,n\in{\mathbb{N}}$ there exists a point $T_{q,n}\in{\mathbb{N}}$ and a gap $\delta_{T_{q,n}}>0$ such that: for every PIL $f:\mathbb{N}\rightarrow\mathbb{R}$ , and every piecewise-polynomial $g$ with $q$ many pieces of degree $\leq n$ , there exists $y\in\mathbb{N},\;0\leq y\leq T_{q,n}$ for which $\left\lvert g(y)-f(y)\right\rvert\geq\delta_{T_{q,n}}$ . That is, the number of pieces and the max degree of a piecewise-polynomial $g$ determine a guaranteed minimum gap by which $g$ misses $f$ within a guaranteed interval.

Theorem 6.5.

Let $f:{\mathcal{G}}_{1}\rightarrow{\mathcal{Z}}_{\mathbb{R}}$ be a feature transformation, let $g(k)\coloneqq f(G_{k})(u)$ , and assume that $g$ is PIL. Then, for every $\operatorname{MUPA}$ -GNN architecture ${\mathcal{A}}$ , there exists $\varepsilon_{{\mathcal{A}}}>0$ such that for every $\operatorname{MUPA}$ -GNN ${\mathcal{N}}$ that realizes ${\mathcal{A}}$ there exists $k$ such that $\left\lvert{\mathcal{N}}(G_{k},u)-f(G_{k})(u)\right\rvert\geq\varepsilon_{{\mathcal{A}}}$ .

7 Experimentation

We experiment with vertex-level regression tasks. In previous sections we formally proved certain expressivity properties of Sum; Mean; and Max GNNs. Our goal in experimentation is to examine how these properties may affect practical learnability: searching for an approximating GNN using stochastic gradient descent. With training data ranging over only a small subsection of the true-distribution range, does the existence of a uniformly-expressing GNN increase the chance that a well-generalizing GNN will be learned?

Specific details concerning training and architecture, as well as additional illustrations and extended analysis, can be found in the appendix. Code for running the experiments is available at https://github.com/toenshoff/Uniform_Graph_Learning.

7.1 Data and Setup

For the graphs in the experiments, and with our GNN architecture consisting of two GNN layers (see appendix), Mean and Max aggregations output the same value for every vertex, up to machine precision. Thus, it is enough to experiment with Mean and assume identical results for Max.

We conduct experiments with two different datasets, one corresponds to the approximation task in Section 5.1, and the other to the task in Section 5.2:

1. Unbounded Countable Feature Domain (UC): This dataset consists of the star graphs $\{G_{k,c}\}$ from Section 5.1, for $k,c\in[1..1000]$ . The center’s ground truth value is $c$ , and it is the only vertex whose value we want to predict.

2. Single-Value Feature Domain (SV): This dataset consists of the graphs $\{G_{k,c}\}$ from Section 5.2, for $k,c\in[1..1000]$ . Again, the center’s ground truth value is $c$ , and we do not consider the other vertices’ predicted values.

As training data, we vary $k\in[1..100]$ and $c\in[1..100]$ . We therefore train on 10K graphs in each experiment. Afterwards, we test each GNN model on larger graphs with $k\in[101..1000]$ and $c\in[101..1000]$ . Here, we illustrate our results for two representative values of $k$ : $500$ and $1000$ , for all values of $c$ . Illustrations of the full results can be found in the appendix. The increased range of $k$ and $c$ in testing simulates the scenario of unbounded graph sizes and unbounded feature values, allowing us to study the performance in terms of uniform expressivity with unbounded features.
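The training grid described above can be sketched in a few lines of Python. This is an illustrative reconstruction of the UC data only, not the authors' code from the repository; the function names and the vertex labels are hypothetical.

```python
def uc_star(k: int, c: int):
    """Edge list, features, and center target for the star G_{k,c} of Section 5.1."""
    edges = [("u", f"v{i}") for i in range(1, k + 1)]
    features = {"u": 0, **{f"v{i}": c for i in range(1, k + 1)}}
    target = c  # ground truth value of the center vertex u
    return edges, features, target


def parameter_grid(lo: int, hi: int) -> list[tuple[int, int]]:
    """All (k, c) pairs with lo <= k, c <= hi."""
    return [(k, c) for k in range(lo, hi + 1) for c in range(lo, hi + 1)]


# Training parameters: k, c in [1..100], one graph per pair.
train_params = parameter_grid(1, 100)
```

The list `train_params` contains one entry per training graph, which agrees with the 10K graphs mentioned in the text.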

7.2 Results

Our primary evaluation metric is the relative error. Formally, if $y_{\text{pred}}$ is the prediction of the GNN for the center vertex of an input graph $G$ , with truth label $c$ , we define the relative error as

$\frac{\left\lvert y_{\text{pred}}-c\right\rvert}{\left\lvert c\right\rvert}$ .

A relative error greater than or equal to 1 is strong evidence of an inability to approximate, as the assessed approximation is no better than an always-0 output. It is also reasonable that in practice, when judging the regression of a function whose range varies by a factor of 1000, relative error would be the relevant measure.
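As a minimal illustration, the metric can be written in Python as follows, assuming the usual definition $\left\lvert y_{\text{pred}}-c\right\rvert/\left\lvert c\right\rvert$ , which is consistent with the remark that an always-0 output has relative error 1. The function name is hypothetical.

```python
def relative_error(y_pred: float, c: float) -> float:
    """Relative error of a prediction y_pred against a nonzero truth label c."""
    return abs(y_pred - c) / abs(c)
```

For example, predicting $0$ for any label gives exactly $1$ , and predicting twice the label gives $1$ as well.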

Unbounded, Countable, Feature Domain

Figure 4(a) provides the test results for UC. We plot the relative error against different values of $c$ . Note that the error has a logarithmic scale. $\operatorname{Mean}$ -GNNs achieve very low relative errors of less than $10^{-4}$ across all considered combinations of $k$ and $c$ . Their relative error falls to less than $10^{-6}$ when $c$ is within the range seen during training ( $\leq 100$ ). Therefore, $\operatorname{Mean}$ -GNNs do show some degree of overfitting. Notably, the value of $k$ has virtually no effect on the error of $\operatorname{Mean}$ -GNNs. This is expected, since mean aggregation should not be affected by the degree $k$ of a center vertex whose neighbors are identical, up to machine precision. $\operatorname{Sum}$ -GNNs yield a substantially higher relative error. For $k=500$ and $c\leq 100$ the relative error is roughly $1$ , but this value increases as $c$ grows beyond the training range. Crucially, the relative error of $\operatorname{Sum}$ -GNNs also increases with $k$ . For $k=1000$ , the relative error is above $1$ even when $c$ is within the range seen during training. Therefore, $\operatorname{Sum}$ -GNNs do generalize significantly worse than $\operatorname{Mean}$ -GNNs in both parameters $k$ and $c$ .

Single-Value Feature Domain

Figure 4(b) provides the test results for SV. Again, we plot the relative error against different values of $c$ . $\operatorname{Sum}$ -GNNs yield relative errors similar to those in the UC experiment. As expected, learned (Sum,Mean)-GNNs do perform significantly better than $\operatorname{Sum}$ -GNNs. However, the learning of (Sum,Mean)-GNNs is not as successful as the learning of $\operatorname{Mean}$ -GNNs in the UC experiment: relative error is around $10^{-1}$ for $k=500$ , and slightly larger for $k=1000$ , clearly worse than the UC-experiment performance. In particular, the learned (Sum,Mean)-GNN is sensitive to increases in $k$ . Note that each (Sum,Mean)-GNN layer receives both Sum and Mean aggregation arguments and needs to choose the right one, thus it is a different learning challenge than in the first experiment.

Appendix A Proofs

For the reader’s convenience, we re-state the results that are proven in this appendix.

Proofs for Section 3

Lemma 3.1

*Assume ${\mathcal{N}}$ is a $\operatorname{Mean}$ -GNN or a $\operatorname{Max}$ -GNN . Let the maximum input dimension of any layer be $d$ , and let the maximum Lipschitz-Constant of any FNN of ${\mathcal{N}}$ be $a$ . Then, for every $k$ it holds that $\left\lvert u^{(m)}_{k}\right\rvert\leq(da)^{m}$ . *

Proof.

For every $i,j\in[k]$ there is an automorphism of $G_{k}$ that maps $v_{i}$ to $v_{j}$ , thus they receive the same feature throughout the computation. We define $v^{(t)}_{k}\coloneqq{\mathcal{N}}^{(t)}(G_{k},v_{i})$ for every ${i\in[k]}$ . We view $u^{(t)}_{k},v^{(t)}_{k}$ as functions of $k$ . First, assume ${\mathcal{N}}$ is a $\operatorname{Mean}$ -GNN . We show by induction that for any $t\in[m]$ it holds that $\left\lvert v^{(t)}_{k}\right\rvert\leq(2da)^{t},\left\lvert u^{(t)}_{k}\right\rvert\leq(2da)^{t}$ . For $t=1$ , $v^{(t)}_{k}=f_{1}(1,1)$ for some FNN $f_{1}$ whose Lipschitz-Constant is at most $a$ , hence $\left\lvert f_{1}(1,\frac{1}{1})\right\rvert\leq 2a$ . Also, $u^{(t)}_{k}=f_{1}(1,\frac{k}{k})=f_{1}(1,1)\leq 2a$ . Assume correctness for $t=n$ . For $t=n+1$ we have $v^{(n+1)}_{k}=f_{n+1}(v^{(n)}_{k},u^{(n)}_{k})$ for some FNN $f_{n+1}$ whose Lipschitz-Constant is at most $a$ . Hence, $v^{(n+1)}_{k}\leq 2da(2da)^{n}=(2da)^{n+1}$ . Also, $u^{(n+1)}_{k}=f_{n+1}(u^{(n)}_{k},\frac{kv^{(n)}_{k}}{k})=f_{n+1}(u^{(n)}_{k},v^{(n)}_{k})\leq 2da(2da)^{n}=(2da)^{n+1}$ .

Next, assume ${\mathcal{N}}$ is a $\operatorname{Max}$ -GNN . Notice that for every $t\in[0..(m-1)]$ it holds that $\frac{u^{(t)}_{k}}{1}=\max(u^{(t)}_{k})$ and $\frac{kv^{(t)}_{k}}{k}=\max(v_{k}^{(t)},\ldots,v_{k}^{(t)})$ . Hence, the proof idea for a $\operatorname{Mean}$ -GNN applies also for a $\operatorname{Max}$ -GNN . ∎

Theorem 3.2

*Let $f:{\mathcal{G}}_{1}\rightarrow{\mathcal{Z}}_{\mathbb{R}}$ a feature transformation such that for every $k$ it holds that $f(G_{k})(u)=k$ . Then, ${\text{$ \operatorname{Mean} $-GNNs }\not\approx f}$ and $\operatorname{Max}$ -GNNs $\not\approx f$ . *

Proof.

Choose any $\varepsilon>0$ . Let ${\mathcal{N}}$ be either an $m$ -layer $\operatorname{Mean}$ -GNN or an $m$ -layer $\operatorname{Max}$ -GNN . Let the maximum input dimension of any layer be $d$ , and let the maximum Lipschitz-Constant of any FNN of ${\mathcal{N}}$ be $a$ . Choose an integer $k\geq(2da)^{m}+\varepsilon$ ; then by Lemma 3.1 we have that $\left\lvert{\mathcal{N}}(G_{k},u)-f(G_{k})(u)\right\rvert\geq\varepsilon$ . ∎

Corollary 3.3

*We have that $\operatorname{Mean}$ -GNNs $\not\geq^{{}_{\{1\}}}$ $\operatorname{Sum}$ -GNNs, $\operatorname{Max}$ -GNNs $\not\geq^{{}_{\{1\}}}$ $\operatorname{Sum}$ -GNNs. *

Proof.

Clearly, there is a $\operatorname{Sum}$ -GNN that computes $f$ exactly. By Theorem 3.2, there is no $\operatorname{Mean}$ -GNN or $\operatorname{Max}$ -GNN that approximates $f$ . ∎

Theorem 3.4

*Let $f:{\mathcal{G}}_{\{0,1\}}\rightarrow{\mathcal{Z}}_{\mathbb{R}}$ a feature transformation such that for every $k$ it holds that $f(G_{k,b})(u)=\frac{b}{k+1}$ . Then, $\operatorname{Max}$ -GNNs $\not\approx f$ . *

Proof.

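Let ${\mathcal{N}}$ be an $m$ -layer $\operatorname{Max}$ -GNN. It is not hard to see by induction on $m$ that for every $i>0,j>0$ it holds that ${\mathcal{N}}(G_{i,1},u)={\mathcal{N}}(G_{j,1},u)$ . Since $f(G_{1,1})(u)=\frac{1}{2}$ while $f(G_{k,1})(u)=\frac{1}{k+1}$ tends to $0$ as $k$ grows, no single output value can be within $0.24$ of both for sufficiently large $k$ . Hence, $\exists k:\left\lvert f(G_{k,1})(u)-{\mathcal{N}}(G_{k,1},u)\right\rvert>0.24$ . ∎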

Theorem 3.5

*Let $f:{\mathcal{G}}_{\{0,1\}}\rightarrow{\mathcal{Z}}_{\mathbb{R}}$ a feature transformation such that for every $k$ it holds that $f(G_{k,b})(u)=b$ . Then, $\operatorname{Mean}$ -GNNs $\not\approx f$ . *

Proof.

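Let ${\mathcal{N}}$ be an $m$ -layer $\operatorname{Mean}$ -GNN. It is not hard to show that ${\mathcal{N}}(G_{k,b},u)$ is Lipschitz-Continuous with respect to the aggregation and that with the aggregation being Mean we have that $\lim_{k\rightarrow\infty}\left\lvert{\mathcal{N}}(G_{k,0},u)-{\mathcal{N}}(G_{k,1},u)\right\rvert=0$ . Since $\left\lvert f(G_{k,0})(u)-f(G_{k,1})(u)\right\rvert=1$ for every $k$ , ${\mathcal{N}}$ cannot be close to $f$ on both graphs when $k$ is large. ∎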

Corollary 3.6

*We have that $\operatorname{Mean}$ -GNNs $\not\geq^{{}_{\{0,1\}}}$ $\operatorname{Max}$ -GNNs , $\operatorname{Max}$ -GNNs $\not\geq^{{}_{\{0,1\}}}$ $\operatorname{Mean}$ -GNNs . *

Proof.

Clearly, there is a $\operatorname{Mean}$ -GNN that computes $f$ of Theorem 3.4 exactly, and by Theorem 3.4 there is no $\operatorname{Max}$ -GNN that approximates $f$ . Clearly, there is a $\operatorname{Max}$ -GNN that computes $f$ of Theorem 3.5 exactly, and by Theorem 3.5 there is no $\operatorname{Mean}$ -GNN that approximates $f$ . ∎

Proofs for Section 4

Every reference in Lemma A.1 (and its proof) to a vertex-related value-vector is element-wise: for every vertex $v$ and a value-function $f(v)$ of output dimension $d$ we use the notation $f(v)$ to represent $f(v)_{i}$ for all $i\in[d]$ .

Lemma A.1.

Let $d\in{\mathbb{N}}_{>0}$ , let $s\in[0,1]$ , and let $0<a\leq s$ . Then, there is a $\operatorname{Sum}$ -GNN ${\mathcal{N}}$ such that for every featured graph $G\in{\mathcal{G}}_{[0,1]^{d}}$ and every vertex $v\in V(G)$ it holds that

[TABLE]

Proof.

Please refer to Figure 6 for an illustration of the construction. Let $v^{(t)}$ be the value of a vertex $v$ after layer $t$ and let $g^{(t)}_{v}=\sum_{w\in N(v)}w^{(t)}$ be the sum of the values of $v$ 's neighbors after layer $t$ . We denote the function computed in layer $t$ of ${\mathcal{N}}$ by $f_{t}$ , that is, $v^{(t)}=f_{t}(v^{(t-1)},g^{(t-1)}_{v})$ . First, we map the value of a vertex (and the sum of its neighbors) to a 2-tuple with the first coordinate being $1$ and the second being the vertex’ value. That is, we define $f_{1}:\mathbb{R}^{2}\rightarrow\mathbb{R}^{2}$ to be $f_{1}(x,y)=(1,x)$ . Then, we define $f_{2}:\mathbb{R}^{2}\times\mathbb{R}^{2}\rightarrow\mathbb{R}$ to be $f_{2}(x,y)=\text{ReLU}(\frac{sy_{1}-y_{2}}{a})-\text{ReLU}(\frac{sy_{1}-y_{2}}{a}-1)+\text{ReLU}(\frac{sy_{1}-y_{2}}{a}-n_{v}-1)-\text{ReLU}(\frac{sy_{1}-y_{2}}{a}-n_{v})$ , where $n_{v}$ denotes the number of neighbors of $v$ , which equals $y_{1}$ . That is, $v^{(2)}=\text{ReLU}(\frac{n_{v}(s-\operatorname{avg}(v))}{a})-\text{ReLU}(\frac{n_{v}(s-\operatorname{avg}(v))}{a}-1)+\text{ReLU}(\frac{n_{v}(s-\operatorname{avg}(v))}{a}-n_{v}-1)-\text{ReLU}(\frac{n_{v}(s-\operatorname{avg}(v))}{a}-n_{v})$ . To see why $v^{(2)}$ fulfills the requirements, we describe the values of each of the three components for the different ranges of $\operatorname{avg}(v)$ .

• $s\leq\operatorname{avg}(v)\Rightarrow\frac{n_{v}(s-\operatorname{avg}(v))}{a}\leq 0\Rightarrow\text{ReLU}(\frac{n_{v}(s-\operatorname{avg}(v))}{a})=\text{ReLU}(\frac{n_{v}(s-\operatorname{avg}(v))}{a}-1)=\text{ReLU}(\frac{n_{v}(s-\operatorname{avg}(v))}{a}-n_{v}-1)=\text{ReLU}(\frac{n_{v}(s-\operatorname{avg}(v))}{a}-n_{v})=0\Rightarrow v^{(2)}=0$

• $s-\frac{a}{n_{v}}<\operatorname{avg}(v)<s\Rightarrow 0<\frac{n_{v}(s-\operatorname{avg}(v))}{a}<1\Rightarrow\text{ReLU}(\frac{n_{v}(s-\operatorname{avg}(v))}{a})-\text{ReLU}(\frac{n_{v}(s-\operatorname{avg}(v))}{a}-1)=\frac{n_{v}(s-\operatorname{avg}(v))}{a};$ $\text{ReLU}(\frac{n_{v}(s-\operatorname{avg}(v))}{a}-n_{v}-1)=\text{ReLU}(\frac{n_{v}(s-\operatorname{avg}(v))}{a}-n_{v})=0\Rightarrow v^{(2)}=\frac{n_{v}(s-\operatorname{avg}(v))}{a}$

• $s-a\leq\operatorname{avg}(v)\leq s-\frac{a}{n_{v}}\Rightarrow 1\leq\frac{n_{v}(s-\operatorname{avg}(v))}{a}\leq n_{v}\Rightarrow\text{ReLU}(\frac{n_{v}(s-\operatorname{avg}(v))}{a})-\text{ReLU}(\frac{n_{v}(s-\operatorname{avg}(v))}{a}-1)=1;$ $\text{ReLU}(\frac{n_{v}(s-\operatorname{avg}(v))}{a}-n_{v}-1)=\text{ReLU}(\frac{n_{v}(s-\operatorname{avg}(v))}{a}-n_{v})=0\Rightarrow v^{(2)}=1$

• $s-a-\frac{a}{n_{v}}<\operatorname{avg}(v)<s-a\Rightarrow n_{v}<\frac{n_{v}(s-\operatorname{avg}(v))}{a}<n_{v}+1\Rightarrow\text{ReLU}(\frac{n_{v}(s-\operatorname{avg}(v))}{a})-\text{ReLU}(\frac{n_{v}(s-\operatorname{avg}(v))}{a}-1)=1;$ $\text{ReLU}(\frac{n_{v}(s-\operatorname{avg}(v))}{a}-n_{v}-1)=0;$ $\text{ReLU}(\frac{n_{v}(s-\operatorname{avg}(v))}{a}-n_{v})=\frac{n_{v}(s-\operatorname{avg}(v)-a)}{a}\Rightarrow v^{(2)}=1-\frac{n_{v}(s-\operatorname{avg}(v)-a)}{a}$

• $\operatorname{avg}(v)\leq s-a-\frac{a}{n_{v}}\Rightarrow n_{v}+1\leq\frac{n_{v}(s-\operatorname{avg}(v))}{a}\Rightarrow\text{ReLU}(\frac{n_{v}(s-\operatorname{avg}(v))}{a})-\text{ReLU}(\frac{n_{v}(s-\operatorname{avg}(v))}{a}-1)=1;\text{ReLU}(\frac{n_{v}(s-\operatorname{avg}(v))}{a}-n_{v}-1)-\text{ReLU}(\frac{n_{v}(s-\operatorname{avg}(v))}{a}-n_{v})=-1\Rightarrow v^{(2)}=0$

∎

Lemma 4.1

*For every $\varepsilon>0$ and $d\in{\mathbb{N}}_{>0}$ , there exists a $\operatorname{Sum}$ -GNN ${\mathcal{N}}$ of size $O(d\frac{1}{\varepsilon})$ such that for every featured graph ${G\in{\mathcal{G}}_{[0,1]\subset{\mathbb{R}}^{d}}}$ it holds that ${\forall v\in V(G)}\;\left\lvert N(G,v)-\operatorname{avg}(v)\right\rvert\leq\varepsilon$ . *

Proof.

Please refer to Figure 6 for an illustration of the construction. We describe a construction of size $O(\frac{1}{\varepsilon})$ which approximates Mean for one coordinate, the extension to $d$ is by a simple duplication. Every reference to a vertex-related value-vector is element-wise: for every vertex $v$ and a value-function $f(v)$ of output dimension $d$ , we use the notation $f(v)$ to represent $f(v)_{i}$ for all $i\in[d]$ .

Let $q\in\mathbb{N}_{>0}$ be the minimal natural number such that $\frac{1}{q}<\varepsilon$ , and define $a=\frac{1}{q}$ . Define $\{s_{1}=a,s_{2}=2a,\ldots,s_{q+1}=1+a\}$ . The first layer of ${\mathcal{N}}$ is identical to $f_{1}$ in Lemma A.1. The second layer uses a copy of $f_{2}$ from Lemma A.1, for each $s_{i}$ , multiplied by $s_{i}$ , and then sums the $q+1$ outputs. To see why this is correct, assume $s_{i}-a\leq\operatorname{avg}(v)\leq s_{i}$ . For $j<i\text{\; or\; }j>i+1$ we have by Lemma A.1 zero contribution of $s_{j}$ to the final sum. Next, if $s_{i}-\frac{a}{n_{v}}\leq\operatorname{avg}(v)$ then by Lemma A.1 we have a contribution of

[TABLE]

Denoting the last term by $x$ and considering that $s_{i}-\frac{a}{n_{v}}\leq\operatorname{avg}(v)\leq s_{i}$ we have that $\operatorname{avg}(v)\leq x\leq\operatorname{avg}(v)+a$ . Finally, if $\operatorname{avg}(v)\leq s_{i}-\frac{a}{n_{v}}$ then by Lemma A.1 we have zero contribution of $s_{i+1}$ and a contribution of $s_{i}\leq\operatorname{avg}(v)+a$ . Overall, we have that $\operatorname{avg}(v)\leq{\mathcal{N}}(G,v)\leq\operatorname{avg}(v)+a$ . ∎
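The construction in this proof can be checked numerically for a single coordinate. The Python sketch below is a hedged illustration: it evaluates the ReLU gadget $f_{2}$ from the proof of Lemma A.1 directly on the two Sum-aggregated quantities (the neighbor count and the sum of neighbor values) rather than building an actual two-layer GNN, and the function names are hypothetical.

```python
def relu(x: float) -> float:
    return max(0.0, x)


def gadget(s: float, a: float, count: float, total: float) -> float:
    """The f_2 gadget: count = sum of 1s over neighbors, total = sum of their values."""
    z = (s * count - total) / a  # equals n_v * (s - avg(v)) / a
    return relu(z) - relu(z - 1) + relu(z - count - 1) - relu(z - count)


def mean_by_sum(neighbor_values: list[float], q: int) -> float:
    """Approximate the neighbor mean using Sum aggregation only, with a = 1/q."""
    a = 1.0 / q
    count = float(len(neighbor_values))  # what Sum yields on the constant 1
    total = float(sum(neighbor_values))  # what Sum yields on the values
    # One gadget per threshold s_i = i * a, i = 1..q+1, weighted by s_i.
    return sum((i * a) * gadget(i * a, a, count, total) for i in range(1, q + 2))
```

Up to floating-point rounding, the returned value lies between $\operatorname{avg}(v)$ and $\operatorname{avg}(v)+a$ , which is the bound stated at the end of the proof.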

Theorem 4.2

Let a $\operatorname{Mean}$ -GNN ${\mathcal{N}}_{M}$ consisting of $m$ layers, let the maximum input dimension of any layer be $d$ , and let the maximum Lipschitz-Constant of any FNN of ${\mathcal{N}}_{M}$ be $a$ . Then, for every $\varepsilon>0$ there exists a $\operatorname{Sum}$ -GNN ${\mathcal{N}}_{S}$ such that:

1. ${\forall G\in{\mathcal{G}}_{[0,1]^{d}}\;\forall v\in V(G)\quad|{\mathcal{N}}_{M}(G,v)-{\mathcal{N}}_{S}(G,v)|\leq\varepsilon}$ .

2. $\left\lvert{\mathcal{N}}_{S}\right\rvert\leq O(\left\lvert{\mathcal{N}}_{M}\right\rvert+\frac{d\cdot m\cdot ad(1-(2ad)^{m})}{\varepsilon(1-(2ad))})$ .

Proof.

Let ${\mathcal{N}}_{M}=((f_{1},\text{Mean}),\ldots,(f_{m},\text{Mean}))$ , that is, $f_{1},\ldots,f_{m}$ are the FNNs constituting ${\mathcal{N}}_{M}$ ’s layers. Let $\hat{\varepsilon}>0$ and let ${\mathcal{N}}_{\hat{\varepsilon}}=((g_{1},\text{Sum}),(g_{2},\text{Sum}))$ the GNN constructed in Lemma 4.1, with parameter $\hat{\varepsilon}$ . Note that $g_{1}$ is indifferent to the aggregation parameter and $g_{2}$ is indifferent to the vertex’s state parameter, thus, for both parameters an argument of ’0’ is as good as any other. Define a $\operatorname{Sum}$ -GNN with $2m$ layers ${\mathcal{N}}_{S}=((\hat{f_{1}},\text{Sum}),\ldots,(\hat{f}_{2m},\text{Sum}))$ . For $j=0\ldots(m-1)$ , each pair of layers $(\hat{f}_{2j+1},\text{Sum}),(\hat{f}_{2(j+1)},\text{Sum})$ approximates the operation of $(f_{j+1},\text{Mean})$ . For a graph $G$ and a vertex $v\in V(G)$ , denote the feature of $v$ after the $(2(j+1))^{th}$ layer of ${\mathcal{N}}_{S}$ by $\hat{v}^{(2(j+1))}$ , with $\hat{v}^{(0)}\coloneqq Z(G)(v)$ . We define $(\hat{f}_{2j+1},\hat{f}_{2(j+1)})$ as follows.

[TABLE]

For $t\in[m]$ denote the feature of $v$ after the $t^{th}$ layer of ${\mathcal{N}}_{M}$ by $v^{(t)}$ , with $v^{(0)}\coloneqq Z(G)(v)$ , and denote by $e_{t}\coloneqq\max_{i}\left\lvert\hat{v}^{(2t)}_{i}-v^{(t)}_{i}\right\rvert$ the maximum error of any coordinate of the output of the $(2t)^{th}$ layer of ${\mathcal{N}}_{S}$ . We prove by induction on $t$ that $e_{t}\leq ad\hat{\varepsilon}\Sigma_{i\in[t]}(2ad)^{i-1}$ . Denote that upper bound by $b_{t}$ . For $t=1$ , we have

[TABLE]

The first $d$ input coordinates to $f_{1}$ are identical. For each coordinate $i$ of the last $d$ coordinates, by definition of $g_{1}$ and $g_{2}$ we have

[TABLE]

That difference translates to a difference of at most $a\hat{\varepsilon}$ in any coordinate of $\left\lvert\hat{v}^{(2)}-v^{(1)}\right\rvert$ . In total, we have $e_{1}\leq ad\hat{\varepsilon}$ . Assume correctness for $t=n$ . Layer $2(n+1)$ of ${\mathcal{N}}_{S}$ is, by definition, the operation of $f_{n+1}$ on at most $2\cdot d$ coordinates. The first $d$ coordinates constitute $\hat{v}^{(2n)}$ and the last $d$ coordinates constitute $g_{2}(0,\Sigma_{w\in N(v)}g_{1}(\hat{w}^{(2n)},0))$ . The error of each of the first $d$ coordinates is, by assumption, at most $b_{n}$ . For each coordinate $i$ of the last $d$ coordinates, we have by assumption

[TABLE]

hence

[TABLE]

hence, by definition of $g_{1}$ and $g_{2}$ ,

[TABLE]

Combining the error bounds for the two types of input, we have that

[TABLE]

With the induction proven, we have that

[TABLE]

Hence, the requirement that $b_{m}\leq\varepsilon$ can be satisfied by setting

[TABLE]

implying

[TABLE]

Finally, using Lemma 4.1 we have that for each $i\in[m]$ it holds

[TABLE]

hence

[TABLE]

∎
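The bound $b_{t}=ad\hat{\varepsilon}\Sigma_{i\in[t]}(2ad)^{i-1}$ stated in the proof can be checked numerically. The following is a minimal sketch, not the paper's construction: the function names `error_bound`, `eps_hat_for`, and `closed_form` are hypothetical, the closed form is the standard geometric-series identity (assuming $2ad\neq 1$), and the recurrence $b_{n+1}=ad\hat{\varepsilon}+2ad\,b_{n}$ is an algebraic consequence of the stated bound. Exact rational arithmetic is used so that the equalities hold without rounding.

```python
from fractions import Fraction

def error_bound(t, a, d, eps_hat):
    """b_t = a*d*eps_hat * sum_{i in [t]} (2*a*d)**(i-1), as stated in the proof."""
    return a * d * eps_hat * sum((2 * a * d) ** (i - 1) for i in range(1, t + 1))

def eps_hat_for(eps, m, a, d):
    """Value of eps_hat for which b_m equals eps exactly (hypothetical helper)."""
    return eps / (a * d * sum((2 * a * d) ** (i - 1) for i in range(1, m + 1)))

def closed_form(t, a, d, eps_hat):
    """Geometric-series closed form of b_t; assumes 2*a*d != 1."""
    r = 2 * a * d
    return a * d * eps_hat * (1 - r ** t) / (1 - r)

# Illustrative values only: Lipschitz constant a, input dimension d, m layers.
a, d, m, eps = Fraction(3, 2), 4, 3, Fraction(1, 100)
eps_hat = eps_hat_for(eps, m, a, d)
```

With these illustrative values, $1/\hat{\varepsilon}$ grows like $\frac{ad(1-(2ad)^{m})}{\varepsilon(1-2ad)}$, the same factor that appears in the size bound of Theorem 4.5.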

Lemma A.2.

Let $q\in{\mathbb{N}}_{>0}$ , define $a\coloneqq\frac{1}{q}$ , and define a function $f:[0,1]\rightarrow\mathbb{R}^{q}$ such that

[TABLE]

That is, $f$ is an almost-unary representation of $x$ in units of $\frac{1}{q}$, "almost" because it may contain a fraction (between [math] and $1$) in its last coordinate. For a finite multiset $x=\{x_{1},\ldots,x_{n}\},x_{i}\in[0,1]$, define

[TABLE]

a mapping from the multiset to the sum of its elements’ representation, coordinate-wise capped at $a$ . Then,

[TABLE]

Proof.

W.l.o.g. assume $\max(x)=x_{1}$. For the lower bound, it is not hard to verify that $\forall i\in[q]\;g(x)_{i}\geq f(x_{1})_{i}$, hence $\Sigma_{i\in[q]}g(x)_{i}\geq\Sigma_{i\in[q]}f(x_{1})_{i}=x_{1}$. For the upper bound, let $j=\max(i:g(x)_{i}>0)$; then necessarily $x_{1}\geq(j-1)a$ and $\Sigma_{i\in[q]}g(x)_{i}\leq ja$, hence $\Sigma_{i\in[q]}g(x)_{i}\leq x_{1}+a$. ∎

Lemma 4.4

*For every $\varepsilon>0$ and $d\in{\mathbb{N}}_{>0}$, there exists a $\operatorname{Sum}$-GNN ${\mathcal{N}}$ of size $O(d\frac{1}{\varepsilon})$ such that for every featured graph $G\in{\mathcal{G}}_{[0,1]^{d}}$ and vertex $v\in V(G)$ it holds that $\left\lvert{\mathcal{N}}(G,v)-\max(v)\right\rvert\leq\varepsilon$.*

Proof.

We describe a construction of size $O(\frac{1}{\varepsilon})$ that approximates Max for one coordinate, the extension to $d$ is by a simple duplication. Every reference to a vertex-related value-vector is element-wise: for every vertex $v$ and a value-function $f(v)$ of output dimension $d$ , we use the notation $f(v)$ to represent $f(v)_{i}$ for all $i\in[d]$ .

Let $q\in\mathbb{N}_{>0}$ be the minimal natural number such that $\frac{1}{q}<\varepsilon$ and define $a\coloneqq\frac{1}{q}$. The first GNN layer computes for each vertex $v$ a vector $v^{(1)}\in[0,a]^{q}$ such that $(v^{(1)})_{i}=ReLU(Z(v)-(i-1)a)-ReLU(Z(v)-(i-1)a-a)$. Observe that the computation corresponds to the mapping $f$ in Lemma A.2. The second GNN layer first caps the sum-aggregation of the neighbors' vectors, then sums the coordinates of the capped vector. That is, for a vertex $v$, let $y_{v}=\Sigma_{w\in N(v)}w^{(1)}$, then $v^{(2)}=\Sigma_{i\in[q]}(ReLU((y_{v})_{i})-ReLU((y_{v})_{i}-a))$. Using Lemma A.2, we get that $\max(v)\leq v^{(2)}\leq\max(v)+a<\max(v)+\varepsilon$. ∎
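The two-layer construction above can be traced on a single vertex's neighborhood. The sketch below is illustrative only: it handles one coordinate, takes the neighbors' features as a plain list of floats in $[0,1]$, and the names `encode` and `approx_max` are hypothetical. The first function is the layer-one encoding, the second performs the sum aggregation, the coordinate-wise cap at $a$, and the final summation.

```python
import math

def relu(x):
    return max(0.0, x)

def encode(x, q):
    """Layer 1: almost-unary encoding of x in [0, 1] in units of a = 1/q."""
    a = 1.0 / q
    # Index i here is 0-based, so it plays the role of (i - 1) in the text.
    return [relu(x - i * a) - relu(x - i * a - a) for i in range(q)]

def approx_max(neighbor_values, eps):
    """Layer 2: sum-aggregate the encodings, cap each coordinate at a, add up."""
    q = math.floor(1.0 / eps) + 1          # minimal q with 1/q < eps
    a = 1.0 / q
    encodings = [encode(x, q) for x in neighbor_values]
    y = [sum(column) for column in zip(*encodings)]   # sum aggregation
    return sum(relu(y_i) - relu(y_i - a) for y_i in y)
```

For instance, `approx_max([0.2, 0.9, 0.5], 0.1)` returns a value between $0.9$ and $0.9+\frac{1}{11}$, up to floating-point rounding, regardless of how many neighbors are added with features below $0.9$.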

Theorem 4.5

Let ${\mathcal{N}}_{M}$ be a $\operatorname{Max}$-GNN consisting of $m$ layers, let the maximum input dimension of any layer be $d$, and let the maximum Lipschitz constant of any FNN of ${\mathcal{N}}_{M}$ be $a$. Then, for every $\varepsilon>0$ there exists a $\operatorname{Sum}$-GNN ${\mathcal{N}}_{S}$ such that:

1.

${\forall G\in{\mathcal{G}}_{[0,1]^{d}}\;\forall v\in V(G)\quad|{\mathcal{N}}_{M}(G,v)-{\mathcal{N}}_{S}(G,v)|\leq\varepsilon}$ .

2.

$\left\lvert{\mathcal{N}}_{S}\right\rvert\leq O(\left\lvert{\mathcal{N}}_{M}\right\rvert+\frac{d\cdot m\cdot ad(1-(2ad)^{m})}{\varepsilon(1-(2ad))})$ .

Proof.

The proof is identical to that of Theorem 4.2, with the following adaptations:

1.

Replacing any mention of 'Mean' with 'Max'.

2.

Replacing any usage of Lemma 4.1, with Lemma 4.4.

3.

Replacing equations (1),(2), with equations (3),(4) hereinafter.

[TABLE]

∎

Proofs for Section 5

A.1 Describability

Let $F$ be a set of polynomials in $k,c$ , and let $g(k,c)$ be a function in $k,c$ .

We say that $F$ weakly-describes $g$ if and only if:

a.

$F$ is finite.

b.

$\forall k,c\in{\mathbb{N}}\ \ \exists p\in F\ :\ p(k,c)=g(k,c)$ .

We identify a polynomial $p(k,c)$ as being good if and only if $p(k,c)=\Sigma_{i\in[n],j\in[n]}a_{i,j}k^{i}c^{j}+\Sigma_{i=0}^{n}b_{i}k^{i}$ for some real coefficients $\{a_{i,j}\},\{b_{i}\}$ and some maximum degree $n\in{\mathbb{N}}$ . That is, $p(k,c)$ is a polynomial in $k,c$ with max degrees $n$ for $k,c$ , and every appearance of $c$ is with multiplication by a polynomial of $k$ of degree at least $1$ . We say that $F$ is good if and only if every polynomial in it is good.

We say that $F$ describes $g$ if and only if: $F$ weakly-describes $g$ and $F$ is good. We say that $g$ is describable (w-describable) if and only if there exists a set that (weakly-) describes it.

Let $F$ be a finite set of polynomials in $k,c$ , we denote by ${{\mathcal{B}}(F)\coloneqq\{k^{i}c^{j}\ :\ \exists p\in F\quad p=(...+a_{i,j}k^{i}c^{j})\quad a_{i,j}\neq 0\}}$ the building blocks of $F$ , that is, the degree combinations that appear in any of the polynomials in $F$ . Let $b\in\{k,c\}$ , we define ${b{\mathcal{B}}(F)\coloneqq\{b\cdot k^{i}c^{j}:k^{i}c^{j}\in{\mathcal{B}}(F)\}}$ .

For every $a\in{\mathbb{R}}$ and a set of functions $F$ of $k,c$ , we define $aF\coloneqq\{af:f\in F\}$ , and $(a+F)\coloneqq\{a+f:f\in F\}$ . For two sets of functions $F,H$ of $k,c$ , we define $F+H\coloneqq\{f+h:f\in F,h\in H\}$ .
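To make the definitions of good polynomials and building blocks concrete, the following sketch represents a polynomial in $k,c$ as a dictionary mapping an exponent pair $(i,j)$ to the coefficient of $k^{i}c^{j}$. This representation and the names `is_good`, `building_blocks`, and `shift_blocks` are assumptions made for illustration, not notation from the paper.

```python
def is_good(poly):
    """Good: every monomial that contains c (j >= 1) also contains k (i >= 1)."""
    return all(i >= 1 for (i, j), coef in poly.items() if coef != 0 and j >= 1)

def building_blocks(polys):
    """B(F): the exponent pairs (i, j) that occur with a nonzero coefficient."""
    return {mono for p in polys for mono, coef in p.items() if coef != 0}

def shift_blocks(var, blocks):
    """k*B(F) or c*B(F): multiply every building block by the variable var."""
    di, dj = (1, 0) if var == "k" else (0, 1)
    return {(i + di, j + dj) for (i, j) in blocks}

p1 = {(1, 1): 2, (2, 0): -1, (0, 0): 5}   # 2kc - k^2 + 5, good
p2 = {(0, 1): 1}                          # c alone, not good
```

Here `p2` fails the goodness test because $c$ appears without a factor of $k$, while multiplying its single building block by $k$ via `shift_blocks("k", ...)` yields $kc$, which is allowed.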

Lemma A.3.

a.

Let $f(k,c)$ be a function (w-)describable by a set $F$. Let $g(k,c):=ReLU(f(k,c))$ be the composition of $ReLU$ over $f(k,c)$; then $g$ is (w-)describable by a set $F^{\prime}$ such that ${\mathcal{B}}(F^{\prime})\subseteq({\mathcal{B}}(F)\cup\{k^{0}c^{0}\})$.

b.

Let $f_{1}(k,c),\ldots,f_{l}(k,c)$ be functions (w-)describable by $F_{1},\ldots,F_{l}$ respectively. Then, for every real coefficients $\{a_{i}\},b$ the affine function $(\Sigma_{i=1}^{l}a_{i}f_{i})+b$ is (w-)describable by a set $F$ such that ${\mathcal{B}}(F)\subseteq(\{k^{0}c^{0}\}\bigcup_{i\in[l]}{\mathcal{B}}(F_{i}))$.

c.

Each output of a ReLU activated FNN whose inputs are all (w-)describable by a set $F$ is (w-)describable by a set $F^{\prime}$ such that ${\mathcal{B}}(F^{\prime})\subseteq({\mathcal{B}}(F)\cup\{k^{0}c^{0}\})$ .

d.

Let $f(k,c)$ a function w-describable by a set $F$ , then $kf(k,c)$ is describable by some set $F^{\prime}$ such that ${\mathcal{B}}(F^{\prime})\subseteq k{\mathcal{B}}(F)$ , and $cf(k,c)$ is w-describable by a set $F^{\prime\prime}$ such that ${\mathcal{B}}(F^{\prime\prime})\subseteq c{\mathcal{B}}(F)$ .

Proof.

a. Let $F$ a set that (w-)describes the function $f$ . For any $k,c$ either $g(k,c)=f(k,c)$ or $g(k,c)=0$ , hence $ReLU(f)$ is (w-)describable by $F\cup\{0\}$ .

b. It is not hard to verify that if $f_{i}$ is (w-)describable by $F_{i}$ then for every $a\in\mathbb{R}$ it holds that $af_{i}$ is (w-)describable by $aF_{i}$ , and $f_{i}+a$ is (w-)describable by $F_{i}+a$ . It is also not hard to verify then that for any $a_{i},a_{j}\in\mathbb{R}$ it holds that $a_{i}f_{i}+a_{j}f_{j}$ is (w-)describable by $(a_{i}F_{i})+(a_{j}F_{j})$ . A straightforward induction proves that a linear combination of arbitrarily many (w-)describable functions is (w-)describable. Finally, let $F$ a set that (w-)describes the linear combination, then $F+b$ is a set that (w-)describes the affine function.

c. Implied by (a)+(b).

d. It is not hard to verify that if $f$ is w-describable by $F$ then $kf$ is describable by $kF$ . Also, it is not hard to verify that if $f$ is w-describable by $F$ then $cf$ is w-describable by $cF$ . ∎
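Parts (a) and (d) of the proof can be spot-checked on a small example. The sketch below is only a finite check on a grid of $(k,c)$ values, not a proof; it reuses a dictionary representation of polynomials (exponent pair $(i,j)$ mapped to the coefficient of $k^{i}c^{j}$), and all function names are hypothetical. It takes $f(k,c)=c-3$, which is w-describable by $\{c-3\}$, and checks that $ReLU(f)$ is w-described by $F\cup\{0\}$ and that $k\cdot ReLU(f)$ is described by a good set.

```python
def evaluate(poly, k, c):
    """Value of the polynomial at (k, c); the empty dict is the zero polynomial."""
    return sum(coef * k ** i * c ** j for (i, j), coef in poly.items())

def is_good(poly):
    """Every monomial containing c must also contain k."""
    return all(i >= 1 for (i, j), coef in poly.items() if coef != 0 and j >= 1)

def weakly_describes_on(polys, g, grid):
    """Finite check of condition (b): some p in polys agrees with g at each point."""
    return all(any(evaluate(p, k, c) == g(k, c) for p in polys) for k, c in grid)

def times_k(poly):
    """Multiply a polynomial by k."""
    return {(i + 1, j): coef for (i, j), coef in poly.items()}

def relu(x):
    return max(0, x)

F = [{(0, 1): 1, (0, 0): -3}]            # {c - 3}
F_relu = F + [{}]                        # F united with the zero polynomial
F_k = [times_k(p) for p in F_relu]       # kF: {kc - 3k, 0}
grid = [(k, c) for k in range(1, 8) for c in range(1, 8)]
```

The set `F` itself is not good, since $c$ appears without $k$, whereas every polynomial in `F_k` is good, matching the step from w-describable to describable in part (d).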

Lemma A.4.

Let $\{H_{k,c}\}$ be a series of graphs, parameterized in $k,c\in{\mathbb{N}}_{>0}$, each having an identified vertex $u$, such that for every $m$-layer $\operatorname{Sum}$-GNN ${\mathcal{N}}$ it holds that ${\mathcal{N}}(H_{k,c},u)$, viewed as a function of $k,c$, is describable. Then, for every $\operatorname{Sum}$-GNN ${\mathcal{N}}$ and for every $\varepsilon>0$ there exist $k,c$ such that $\left\lvert{\mathcal{N}}(H_{k,c},u)-c\right\rvert>\varepsilon$.

Proof.

Let $F$ be a finite set of polynomials that describes ${\mathcal{N}}(H_{k,c},u)$. Fix any specific $c\in{\mathbb{N}}_{>0}$, and for $K\in{\mathbb{N}}_{>0}$ denote by $F_{K,c}=\{p\in F:\exists k\geq K:{\mathcal{N}}(H_{k,c},u)=p(k,c)\}$ only those polynomials in $F$ that intersect with ${\mathcal{N}}(H_{k,c},u)$ in the domain $[K,\infty)\times\{c\}$. Denote the polynomials in $F_{K,c}$ that are constant by $\widehat{F}_{K,c}=\{p:p\in F_{K,c},\ p\text{ constant}\}$. Let $\varepsilon>0$ and assume by contradiction that for every $k\in{\mathbb{N}}_{>0}$ it holds that $\left\lvert{\mathcal{N}}(H_{k,c},u)-c\right\rvert\leq\varepsilon$. Then, there must exist $K_{c}\in{\mathbb{N}}_{>0}$ for which $\widehat{F}_{K_{c},c}=F_{K_{c},c}$. Otherwise, as $F$ is assumed to describe ${\mathcal{N}}(H_{k,c},u)$, any appearance of $c$, in any $p\in F_{K,c}$, is tied to $k$, and we would have

[TABLE]

and

[TABLE]

in contradiction to $\left\lvert{\mathcal{N}}(H_{k,c},u)-c\right\rvert\leq\varepsilon$ . By definition, $\widehat{F}_{K,c}$ is a subset of $F$ which is finite, and so $\max(\widehat{F}_{K_{c},c})\leq\max(p\in F:p\text{ constant})$ . Denote the last term by $max_{F}$ . As our reasoning thus far is true for any $c$ , it holds that $\max(\max(\widehat{F}_{K_{c},c}):{c\in{\mathbb{N}}})\leq max_{F}$ . Finally, for $c=\left\lceil max_{F}+\varepsilon+1\right\rceil$ necessarily for all $k\geq K_{c}$ it holds that $\left\lvert{\mathcal{N}}(H_{k,c},u)-c\right\rvert>c-max_{F}>\varepsilon$ . ∎

Section 5.1

Define a series of featured star graphs $\{G_{k,c}\}$ as follows: For $(k,c)\in{\mathbb{N}}_{>0}^{2}$ ,

•

$V(G_{k,c})=\{u\}\cup\{v_{1},\ldots,v_{k}\}$

•

$E(G_{k,c})=\bigcup_{i\in[k]}\{\{u,v_{i}\}\}$

•

$Z(G_{k,c})=\{(u,0)\}\bigcup_{i\in[k]}\{(v_{i},c)\}$

Let ${\mathcal{N}}$ be an $m$-layer $\operatorname{Sum}$-GNN. We define $u^{(t)}_{k,c}\coloneqq{\mathcal{N}}^{(t)}(G_{k,c},u)$, the feature of $u\in V(G_{k,c})$ after operating the first $t$ layers of ${\mathcal{N}}$. Note that $u^{(m)}_{k,c}={\mathcal{N}}(G_{k,c},u)$. For every $i,j\in[k]$ there is an automorphism of $G_{k,c}$ that maps $v_{i}$ to $v_{j}$, thus they receive the same feature throughout the computation. We define $v^{(t)}_{k,c}\coloneqq{\mathcal{N}}^{(t)}(G_{k,c},v_{i})$ for every ${i\in[k]}$. In our argumentation, we view $u^{(t)}_{k,c},v^{(t)}_{k,c}$ as functions of $k,c$.

Lemma A.5.

It holds that $u^{(m)}_{k,c}$ is describable.

Proof.

We show by induction that for every $t\in[m]$ it holds that $v_{k,c}^{(t)}$ is w-describable and that $u_{k,c}^{(t)}$ is describable. For $t=0$ we have $u_{k,c}^{(t)}=0,v_{k,c}^{(t)}=c$ and the assumption holds. Assume correctness for $t=n$ . By definition, $u_{k,c}^{(n+1)}=f_{n+1}(u_{k,c}^{(n)},kv_{k,c}^{(n)})$ where $f_{n+1}$ is a ReLU FNN. By assumption, $v_{k,c}^{(n)}$ is w-describable and so by Lemma A.3 we have that $kv_{k,c}^{(n)}$ is describable. Also, by assumption, $u_{k,c}^{(n)}$ is describable. Hence, by Lemma A.3 we have that $u_{k,c}^{(n+1)}$ is describable. The proof for $v_{k,c}^{(n+1)}$ is in similar fashion. ∎
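The recursion $u_{k,c}^{(n+1)}=f_{n+1}(u_{k,c}^{(n)},kv_{k,c}^{(n)})$ used in the induction can be compared against explicit message passing on the star graph $G_{k,c}$. The sketch below is a toy illustration under stated assumptions: features are one-dimensional integers, and each layer's FNN is a single hypothetical ReLU unit `layer(state, agg) = ReLU(w_state*state + w_agg*agg + bias)` with arbitrarily chosen integer parameters.

```python
def relu(x):
    return max(0, x)

def layer(state, agg, w_state, w_agg, bias):
    """A one-unit ReLU 'FNN' acting on (own state, sum-aggregated neighbors)."""
    return relu(w_state * state + w_agg * agg + bias)

def run_explicit(k, c, params):
    """Sum message passing on the star graph: vertex 0 is u, 1..k are the v_i."""
    feats = [0] + [c] * k
    nbrs = [list(range(1, k + 1))] + [[0] for _ in range(k)]
    for p in params:
        aggs = [sum(feats[w] for w in nb) for nb in nbrs]
        feats = [layer(x, s, *p) for x, s in zip(feats, aggs)]
    return feats

def run_recursion(k, c, params):
    """The symmetric recursion: u sees k copies of v, each v_i sees only u."""
    u, v = 0, c
    for p in params:
        u, v = layer(u, k * v, *p), layer(v, u, *p)
    return u, v

params = [(1, 1, -2), (2, -1, 3)]   # hypothetical (w_state, w_agg, bias) per layer
```

Both functions agree on every $(k,c)$ tried, which reflects the automorphism argument: all $v_{i}$ carry the same feature, so the sum over $N(u)$ collapses to $k\cdot v^{(n)}_{k,c}$.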

Theorem 5.1

*Let $f:{\mathcal{G}}_{{\mathbb{N}}^{1}}\rightarrow{\mathcal{Z}}_{\mathbb{R}}$ a feature transformation, such that for every $k,c$ it holds that $f(G_{k,c})(u)=c$ . Then, $\text{$ \operatorname{Sum} $-GNNs }\not\approx f$ . *

Proof.

Immediate from combining Lemma A.5 and Lemma A.4. ∎

Corollary 5.2

*Denote by $S$ the set of all multisets over ${\mathbb{N}}_{>0}$ . Let $g:S\rightarrow{\mathbb{R}}$ an aggregation such that $\forall a,b\in{\mathbb{N}}_{>0}\ g({\{a\}\choose b})=a$ , that is, $g$ aggregates every homogeneous multiset to its single unique value. Then, $\operatorname{Sum}$ -GNNs $\not\geq^{{}_{{\mathbb{N}}}}$ g-aggregation GNNs. *

Proof.

Let $f:{\mathcal{G}}_{{\mathbb{N}}^{1}}\rightarrow{\mathcal{Z}}_{\mathbb{R}}$ a feature transformation, such that for every featured graph $G$ , and for every vertex $v\in V(G)$ , it holds that $f(G)(v)\coloneqq g(N(v))$ . Then, by Theorem 5.1, $\operatorname{Sum}$ -GNNs $\not\approx f$ . Clearly, there is a $g$ -aggregation GNN that exactly computes $f$ . ∎

Consider another variant of $\{G_{k,c}\}$ :

•

$V(G_{k,c})=\{u_{1},\ldots,u_{k^{2}}\}\cup\{v_{1},\ldots,v_{k}\}$

•

$E(G_{k,c})=\bigcup_{i\in[k^{2}],j\in[k]}\{\{u_{i},v_{j}\}\}$

•

$Z(G_{k,c})=\bigcup_{i\in[k^{2}]}\{(u_{i},0)\}\bigcup_{i\in[k]}\{(v_{i},c)\}$

Let ${\mathcal{N}}$ be an $m$ -layer $\operatorname{Sum}$ -GNN. We use the notations $u_{k,c}^{(t)}$ and $v_{k,c}^{(t)}$ with similar meaning to before, where $u_{k,c}^{(t)}$ now refers to each of the $u_{i}$ vertices.

Lemma A.6.

It holds that $k^{2}u_{k,c}^{(m)}+kv_{k,c}^{(m)}$ is describable by a set $F$ such that for every $p\in F$ it holds that $p$ does not contain $k^{2}c$ (with coefficient $\neq 0$ ).

Proof.

We prove the correctness of the following statements for every $t\in[m]$ , from which the lemma clearly follows.

1.

$u_{k,c}^{(t)}$ is describable.

2.

$v_{k,c}^{(t)}$ is weakly-describable by a set $F$ such that for every $p\in F$ it holds that $p$ does not contain $kc$ .

Proof is by induction on $t$ . Correctness for $t=0$ is clear. Assume correctness for $t=n$ .

1. By definition, ${u_{k,c}^{(n+1)}=f_{n+1}(u_{k,c}^{(n)},kv_{k,c}^{(n)})}$ for some FNN $f_{n+1}$. By the induction assumption, $u_{k,c}^{(n)}$ is describable and clearly $kv_{k,c}^{(n)}$ is also describable. Hence, by Lemma A.3 we have that $u_{k,c}^{(n+1)}$ is describable.

2. By definition, ${v_{k,c}^{(n+1)}=f_{n+1}(v_{k,c}^{(n)},k^{2}u_{k,c}^{(n)})}$ for some FNN $f_{n+1}$. By the induction assumption, $v_{k,c}^{(n)}$ obtains the stated property, and clearly so does $k^{2}u_{k,c}^{(n)}$. By Lemma A.3, we have that the output of operating $f_{n+1}$ on $v_{k,c}^{(n)},k^{2}u_{k,c}^{(n)}$ obtains the stated property. ∎

Lemma A.7.

Let $f:{\mathcal{G}}_{{\mathbb{N}}^{1}}\rightarrow{\mathbb{R}}$ a graph embedding such that $\forall k,c\ f(G_{k,c})=\frac{kc}{k+1}$ . Let an FNN ${\mathfrak{F}}$ , and define a readout ${{\mathfrak{r}}{\mathfrak{o}}}\coloneqq f_{\mathfrak{F}}\circ\operatorname{avg}$ . Then, ${{{\mathfrak{r}}{\mathfrak{o}}}\circ\text{$ \operatorname{Sum} $-GNNs }\not\approx f}$ .

Proof.

Let a $\operatorname{Sum}$ -GNN ${\mathcal{N}}$ . By definition, ${\operatorname{avg}\circ{\mathcal{N}}(G_{k,c})=\frac{k^{2}\cdot u_{k,c}^{(m)}+k\cdot v_{k,c}^{(m)}}{k(k+1)}=\frac{k\cdot u_{k,c}^{(m)}+v_{k,c}^{(m)}}{(k+1)}}$ . By Lemma A.6, $k\cdot u_{k,c}^{(m)}+v_{k,c}^{(m)}$ is weakly-describable by a set $F^{\prime}$ such that for every $p\in F^{\prime}$ it holds that $p$ does not contain $kc$ . Using a similar technique to the one in proof of Lemma A.3, it is not hard to show that $f_{\mathfrak{F}}\circ\operatorname{avg}\circ{\mathcal{N}}(G_{k,c})$ is weakly-describable by a set $F$ such that for every $p\in F$ it holds that $p$ does not contain $kc$ . Let any polynomial ${p\in F}$ and let $b\in{\mathbb{R}}$ be the coefficient of $k$ in $p$ . It is not hard to verify that for every $c$ it holds that ${\lim_{k\rightarrow\infty}\left\lvert\frac{p(k,c)}{k+1}\right\rvert\in\{0,|b|,\infty\}}$ . The finiteness of $F$ implies that there is a maximal such $|b|$ over all $p\in F$ , denote it by $b_{max}$ . The finiteness of $F$ also implies that:

1.

Given $c$ and $\delta>0$ there exists $K_{0}$ such that for every $l>K_{0}$ and every $p\in F$ with a finite limit (as ${k\rightarrow\infty}$ ) it holds that ${\left\lvert\frac{p(l,c)}{l+1}-\lim_{k\rightarrow\infty}\frac{p(k,c)}{k+1}\right\rvert<\delta}$ .

2.

Given $c$ and $\delta>0$ there exists $K_{0}$ such that for every $l>K_{0}$ and every $p\in F$ with an infinite limit (as ${k\rightarrow\infty}$ ) it holds that ${\left\lvert\frac{p(l,c)}{l+1}-c\right\rvert>\delta}$ .

Finally, for every $c$ it holds that ${\lim_{k\rightarrow\infty}\frac{kc}{k+1}=c}$ . Let $\varepsilon>0$ , then for $c=\left\lceil 2\varepsilon+b_{max}\right\rceil$ there exists $k$ such that for every $p\in F$ it holds that ${\left\lvert\frac{p(k,c)-kc}{k+1}\right\rvert>\varepsilon}$ . ∎

Lemma A.8.

Let $f:{\mathcal{G}}_{{\mathbb{N}}^{1}}\rightarrow{\mathbb{R}}$ a graph embedding such that $\forall k,c\ f(G_{k,c})=\frac{kc}{k+1}$ . Let an FNN ${\mathfrak{F}}$ , and define a readout ${{\mathfrak{r}}{\mathfrak{o}}}\coloneqq f_{\mathfrak{F}}\circ\operatorname{sum}$ . Then, ${{{\mathfrak{r}}{\mathfrak{o}}}\circ\text{$ \operatorname{Sum} $-GNNs }\not\approx f}$ .

Proof.

Let $\varepsilon>0$ , then ${\operatorname{sum}\circ{\mathcal{N}}(G_{k,c})=k^{2}\cdot u_{k,c}^{(m)}+k\cdot v_{k,c}^{(m)}}$ . Clearly, $k^{2}\cdot u_{k,c}^{(m)}+k\cdot v_{k,c}^{(m)}$ is describable. Hence, by Lemma A.3, it holds that $f_{\mathfrak{F}}\circ\operatorname{sum}\circ{\mathcal{N}}(G_{k,c})$ is describable. Let $F$ a describing set of $k^{2}\cdot u_{k,c}^{(m)}+k\cdot v_{k,c}^{(m)}$ , let any polynomial $p\in F$ , and let $b\in{\mathbb{R}}$ be the coefficient of $k^{0}$ in $p$ . It is not hard to verify that for every $c$ it holds that $\lim_{k\rightarrow\infty}\left\lvert p(k,c)\right\rvert\in\{0,|b|,\infty\}$ . The finiteness of $F$ implies that there is a maximal such $|b|$ over all $p\in F$ , denote it by $b_{max}$ . The finiteness of $F$ also implies that:

1.

Given $c$ and $\delta>0$ there exists $K_{0}$ such that for every $l>K_{0}$ and every $p\in F$ with a finite limit (as ${k\rightarrow\infty}$ ) it holds that ${\left\lvert p(l,c)-\lim_{k\rightarrow\infty}p(k,c)\right\rvert<\delta}$ .

2.

Given $c$ and $\delta>0$ there exists $K_{0}$ such that for every $l>K_{0}$ and every $p\in F$ with an infinite limit (as ${k\rightarrow\infty}$ ) it holds that ${\left\lvert p(l,c)-c\right\rvert>\delta}$ .

Finally, for every $c$ it holds that ${\lim_{k\rightarrow\infty}\frac{kc}{k+1}=c}$. Then, for $c=\left\lceil 2\varepsilon+b_{max}\right\rceil$ there exists $k$ such that for every $p\in F$ it holds that ${\left\lvert p(k,c)-\frac{kc}{k+1}\right\rvert>\varepsilon}$. ∎

Theorem 5.3

*Let $f:{\mathcal{G}}_{{\mathbb{N}}^{1}}\rightarrow{\mathbb{R}}$ a graph embedding such that $\forall k,c\ f(G_{k,c})=\frac{kc}{k+1}$ . Let an aggregation ${{\mathfrak{a}}\in\{\operatorname{sum},\operatorname{avg}\}}$ and an FNN ${\mathfrak{F}}$ , and define a readout ${{\mathfrak{r}}{\mathfrak{o}}}\coloneqq f_{\mathfrak{F}}\circ{\mathfrak{a}}$ . Then, ${{{\mathfrak{r}}{\mathfrak{o}}}\circ\text{$ \operatorname{Sum} $-GNNs }\not\approx f}$ . *

Proof.

Follows from combining Lemma A.7 and Lemma A.8. ∎

Corollary 5.4

*Denote by $S$ the set of all multisets over ${\mathbb{N}}_{>0}$ . Let $g:S\rightarrow{\mathbb{R}}$ an aggregation such that $\forall a,b\in{\mathbb{N}}_{>0}\ g({\{a\}\choose b})=a$ . Let an aggregation ${{\mathfrak{a}}\in\{\operatorname{sum},\operatorname{avg}\}}$ and an FNN ${\mathfrak{F}}$ , and define a readout ${{\mathfrak{r}}{\mathfrak{o}}}\coloneqq f_{\mathfrak{F}}\circ{\mathfrak{a}}$ . Then, ${{{\mathfrak{r}}{\mathfrak{o}}}\circ\text{$ \operatorname{Sum} $-GNNs }\not\geq^{{}_{{\mathbb{N}}}}\operatorname{avg}\circ\ g\text{-GNNs}}$ . *

Proof.

Clearly, for a straightforward $g$-aggregation GNN ${\mathcal{N}}_{g}$ it holds that ${\mathcal{N}}_{g}(G_{k,c})(u_{i})=c$ and ${\mathcal{N}}_{g}(G_{k,c})(v_{i})=0$, hence $\operatorname{avg}\circ{\mathcal{N}}_{g}(G_{k,c})=\frac{k^{2}c}{k^{2}+k}=\frac{kc}{k+1}$. By Theorem 5.3, no composition of ${{\mathfrak{r}}{\mathfrak{o}}}$ with a $\operatorname{Sum}$-GNN can approximate the graph embedding $f(G)\coloneqq\operatorname{avg}\circ{\mathcal{N}}_{g}(G)$. ∎
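The value $\frac{kc}{k+1}$ can be reproduced by running one aggregation step on the complete bipartite variant of $G_{k,c}$ and averaging. The sketch below is an illustration under assumptions: it takes $g=\max$ as one concrete aggregation satisfying the homogeneous-multiset condition, lets the layer output the aggregated value directly, and uses hypothetical names `build_variant`, `g_layer`, and `avg_readout`.

```python
from fractions import Fraction

def build_variant(k, c):
    """k^2 vertices u_i (feature 0) fully connected to k vertices v_j (feature c)."""
    n_u, n_v = k * k, k
    feats = [0] * n_u + [c] * n_v
    nbrs = ([list(range(n_u, n_u + n_v)) for _ in range(n_u)]
            + [list(range(n_u)) for _ in range(n_v)])
    return feats, nbrs

def g_layer(feats, nbrs, g=max):
    """One g-aggregation layer whose FNN simply outputs the aggregated value."""
    return [g([feats[w] for w in nb]) for nb in nbrs]

def avg_readout(feats):
    """Average of all vertex features, as an exact fraction."""
    return Fraction(sum(feats), len(feats))
```

For example, with $k=3$ and $c=5$ the nine $u_{i}$ vertices receive $5$ and the three $v_{j}$ vertices receive $0$, so the average is $\frac{45}{12}=\frac{15}{4}=\frac{kc}{k+1}$.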

Section 5.2

We define a new series of featured graphs $\{G_{k,c}\}$ (see Figure 3). For every $(k,c)\in{\mathbb{N}}_{>0}^{2}$ :

•

$V(G_{k,c})=\{u\}\cup\{v_{1},\ldots,v_{k}\}\cup\{w_{1},\ldots,w_{c}\}$

•

$E(G_{k,c})=\bigcup_{i\in[k]}\{\{u,v_{i}\}\}\bigcup_{i\in[k],j\in[c]}\{\{v_{i},w_{j}\}\}$

•

$Z(G_{k,c})=\{(u,1)\}\bigcup_{i\in[k]}\{(v_{i},1)\}\bigcup_{i\in[c]}\{(w_{i},1)\}$

Let ${\mathcal{N}}$ be an $m$-layer $\operatorname{Sum}$-GNN. We define $u^{(t)}_{k,c}\coloneqq{\mathcal{N}}^{(t)}(G_{k,c},u)$, $v^{(t)}_{k,c}\coloneqq{\mathcal{N}}^{(t)}(G_{k,c},v_{i})$, and $w^{(t)}_{k,c}\coloneqq{\mathcal{N}}^{(t)}(G_{k,c},w_{i})$, following a reasoning similar to Section 5.1, and view $u^{(t)}_{k,c},v^{(t)}_{k,c},w^{(t)}_{k,c}$ as functions of $k,c$.

Lemma A.9.

It holds that $u_{k,c}^{(m)}$ is describable.

Proof.

We show by induction that for every $t\in[m]$ it holds that $v_{k,c}^{(t)}$ is w-describable and that $u_{k,c}^{(t)},w_{k,c}^{(t)}$ are describable. For $t=0$ we have $u_{k,c}^{(t)}=v_{k,c}^{(t)}=w_{k,c}^{(t)}=1$ and the assumption holds. Assume correctness for $t=n$ . By definition, $u_{k,c}^{(n+1)}=f_{n+1}(u_{k,c}^{(n)},kv_{k,c}^{(n)})$ where $f_{n+1}$ is a ReLU FNN. By assumption, $v_{k,c}^{(n)}$ is w-describable and so by Lemma A.3 we have that $kv_{k,c}^{(n)}$ is describable. Also by assumption, $u_{k,c}^{(n)}$ is describable. Hence, by Lemma A.3 we have that $u_{k,c}^{(n+1)}$ is describable. For $v_{k,c}^{(n+1)}$ , by definition, $v_{k,c}^{(n+1)}=f_{n+1}(v_{k,c}^{(n)},cw_{k,c}^{(n)}+u_{k,c}^{(n)})$ , and by assumption $u_{k,c}^{(n)},v_{k,c}^{(n)},w_{k,c}^{(n)}$ are w-describable. Hence, by Lemma A.3 we have that $v_{k,c}^{(n+1)}$ is w-describable. The proof for $w_{k,c}^{(n+1)}$ is in similar fashion. ∎

Theorem 5.5

*Let $f:{\mathcal{G}}_{1}\rightarrow{\mathcal{Z}}_{\mathbb{R}}$ a feature transformation, such that for every $k,c$ it holds that $f(G_{k,c})(u)=c$ . Then, $\operatorname{Sum}$ -GNNs $\not\approx f$ . *

Proof.

Immediate from combining Lemma A.9 and Lemma A.4. ∎

Corollary 5.6

*Denote by $S$ the set of all multisets over ${\mathbb{N}}_{>0}$ , and let $g:S\rightarrow{\mathbb{R}}$ an aggregation such that $\forall a,b\in{\mathbb{N}}_{>0}\ g({\{a\}\choose b})=a$ . Then, $\operatorname{Sum}$ -GNNs $\not\geq^{{}_{\{1\}}}$ (Sum, g)-GNNs. *

Proof.

Let $f:{\mathcal{G}}_{\{1\}}\rightarrow{\mathcal{Z}}_{\mathbb{R}}$ a feature transformation such that for every featured graph $G$ , for every vertex $v\in V(G)$ , it holds that $f(G)(v)\coloneqq g(\{\operatorname{sum}(w):w\in N(v)\})$ . Then, by Theorem 5.5, $\operatorname{Sum}$ -GNNs $\not\approx f$ . Clearly, there is a GNN that uses Sum aggregation in its first layer and $g$ aggregation in its second layer, that exactly computes $f$ . ∎

We define one last variant of a $\{G_{k,c}\}$ series:

•

$V(G_{k,c})=\{u_{1},\ldots,u_{k^{2}}\}\cup\{v_{1},\ldots,v_{k^{3}}\}\cup\{w_{1},\ldots,w_{kc}\}$

•

$E(G_{k,c})=\bigcup_{j\in[k^{2}],i\in[k^{3}]}\{\{u_{j},v_{i}\}\}\bigcup_{i\in[k^{3}],j\in[kc]}\{\{v_{i},w_{j}\}\}$

•

$Z(G_{k,c})=\bigcup_{i\in[k^{2}]}\{(u_{i},0)\}\bigcup_{i\in[k^{3}]}\{(v_{i},0)\}\bigcup_{i\in[kc]}\{(w_{i},1)\}$

Let ${\mathcal{N}}$ be an $m$ -layer $\operatorname{Sum}$ -GNN. The notations $u_{k,c}^{(t)}$ , $v_{k,c}^{(t)}$ , and $w_{k,c}^{(t)}$ , are used as before.

Lemma A.10.

It holds that $k^{2}u_{k,c}^{(m)}+k^{3}v_{k,c}^{(m)}+kcw_{k,c}^{(m)}$ is describable by a set $F$ and for every $p\in F$ it holds that $p$ does not contain $k^{3}c$ (with coefficient $\neq 0$ ).

Proof.

We prove the correctness of the following statements for every $t\in[m]$, from which the lemma clearly follows.

1.

$u_{k,c}^{(t)}$ is weakly-describable by a set $F$ such that for every $p\in F$ it holds that $p$ does not contain $kc$ .

2.

$v_{k,c}^{(t)}$ is describable.

3.

$w_{k,c}^{(t)}$ is weakly-describable by a set $F$ such that for every $p\in F$ it holds that $p$ does not contain $k^{2}$ .

Proof is by induction on $t$ . Correctness for $t=0$ is immediate. Assume correctness for $t=n$ .

1. By definition, ${u_{k,c}^{(n+1)}=f_{n+1}(u_{k,c}^{(n)},k^{3}v_{k,c}^{(n)})}$ for some FNN $f_{n+1}$. By the induction assumption, $u_{k,c}^{(n)}$ obtains the stated property and the same holds for $k^{3}v_{k,c}^{(n)}$. By Lemma A.3, we have that the output of operating $f_{n+1}$ on $u_{k,c}^{(n)},k^{3}v_{k,c}^{(n)}$ obtains the stated property.

2. By definition, ${v_{k,c}^{(n+1)}=f_{n+1}(v_{k,c}^{(n)},k^{2}u_{k,c}^{(n)}+kcw_{k,c}^{(n)})}$ for some FNN $f_{n+1}$. By the induction assumption, $v_{k,c}^{(n)}$ obtains the stated property, and clearly so do $k^{2}u_{k,c}^{(n)},kcw_{k,c}^{(n)}$. The rest follows similarly to the end of (1).

3. By definition, ${w_{k,c}^{(n+1)}=f_{n+1}(w_{k,c}^{(n)},k^{3}v_{k,c}^{(n)})}$ for some FNN $f_{n+1}$. By the induction assumption, $w_{k,c}^{(n)}$ obtains the stated property, and clearly so does $k^{3}v_{k,c}^{(n)}$. The rest follows similarly to the end of (1). ∎

Lemma A.11.

Let $f:{\mathcal{G}}_{\{0,1\}^{1}}\rightarrow{\mathbb{R}}$ a graph embedding such that $\forall k,c\ f(G_{k,c})=\frac{(k^{2}+kc)kc}{k^{3}+k^{2}+kc}$ . Let an aggregation ${{\mathfrak{a}}\in\{\operatorname{sum},\operatorname{avg}\}}$ and an FNN ${\mathfrak{F}}$ , and define a readout ${{\mathfrak{r}}{\mathfrak{o}}}\coloneqq f_{\mathfrak{F}}\circ{\mathfrak{a}}$ . Then, ${{{\mathfrak{r}}{\mathfrak{o}}}\circ\text{$ \operatorname{Sum} $-GNNs }\not\approx f}$ .

Proof.

Let $\varepsilon>0$. Define ${A\coloneqq k^{2}u_{k,c}^{(m)}+k^{3}v_{k,c}^{(m)}+kcw_{k,c}^{(m)}}$, then ${\operatorname{avg}\circ{\mathcal{N}}(G_{k,c})=\frac{A}{k^{3}+k^{2}+kc}}$. By Lemma A.10, $A$ is describable by a set $F^{\prime}$ such that for every $p\in F^{\prime}$ it holds that $p$ does not contain $k^{3}c$, hence $\operatorname{avg}\circ{\mathcal{N}}(G_{k,c})$ is describable. Hence, by Lemma A.3, $f_{\mathfrak{F}}\circ\operatorname{avg}\circ{\mathcal{N}}(G_{k,c})$ is describable. Let $F$ be a describing set. Let $p\in F$ be any polynomial and let $b\in{\mathbb{R}}$ be the coefficient of the component $k^{3}$ in $p$. Then, it is not hard to verify that for every $c$ it holds that ${\lim_{k\rightarrow\infty}\left\lvert\frac{p(k,c)}{k^{3}+k^{2}+kc}\right\rvert\in\{0,|b|,\infty\}}$. The finiteness of $F$ implies that there is a maximal such $|b|$ over all ${p\in F}$, denote it by $b_{max}$. The finiteness of $F$ also implies that:

1.

Given $c$ and $\delta>0$ there exists $K_{0}$ such that for every $l>K_{0}$ and every $p\in F$ with a finite limit (as ${k\rightarrow\infty}$ ) it holds that ${\left\lvert\frac{p(l,c)}{l^{3}+l^{2}+lc}-\lim_{k\rightarrow\infty}\frac{p(k,c)}{k^{3}+k^{2}+kc}\right\rvert<\delta}$ .

2.

Given $c$ and $\delta>0$ there exists $K_{0}$ such that for every $l>K_{0}$ and every $p\in F$ with an infinite limit (as ${k\rightarrow\infty}$ ) it holds that ${\left\lvert\frac{p(l,c)}{l^{3}+l^{2}+lc}-c\right\rvert>\delta}$ .

Finally, for every $c$ it holds that ${\lim_{k\rightarrow\infty}\frac{(k^{2}+kc)kc}{k^{3}+k^{2}+kc}=c}$ . Hence, for $c=\left\lceil 2\varepsilon+b_{{max}}\right\rceil$ there exists $k$ such that for every $p\in F$ it holds that ${\left\lvert\frac{p(k,c)-(k^{2}+kc)kc}{k^{3}+k^{2}+kc}\right\rvert>\varepsilon}$ , implying ${\left\lvert\operatorname{avg}\ \circ{\mathcal{N}}(G_{k,c})-f(G_{k,c})\right\rvert>\varepsilon}$ . ∎

Lemma A.12.

Let $f:{\mathcal{G}}_{{\mathbb{N}}^{1}}\rightarrow{\mathbb{R}}$ a graph embedding, such that for every $k,c$ it holds that $f(G_{k,c})=\frac{(k^{2}+kc)kc}{k^{3}+k^{2}+kc}$ . Then, $\operatorname{sum}\circ$ $\operatorname{Sum}$ -GNNs $\not\approx f$ .

Proof.

Let $\varepsilon>0$ . Clearly, ${k^{2}u_{k,c}^{(m)}+k^{3}v_{k,c}^{(m)}+kcw_{k,c}^{(m)}}$ is describable. Let $F$ a describing set of ${k^{2}u_{k,c}^{(m)}+k^{3}v_{k,c}^{(m)}+kcw_{k,c}^{(m)}}$ , let any polynomial $p\in F$ , and let $b\in{\mathbb{R}}$ be the coefficient of $k^{0}$ in $p$ . Then, it is not hard to verify that for every $c$ it holds that ${\lim_{k\rightarrow\infty}\left\lvert p(k,c)\right\rvert\in\{0,|b|,\infty\}}$ . The finiteness of $F$ implies that there is a maximal such $|b|$ over all $p\in F$ , denote it by $b_{max}$ . The finiteness of $F$ also implies that:

1.

Given $c$ and $\delta>0$ there exists $K_{0}$ such that for every $l>K_{0}$ and every $p\in F$ with a finite limit (as ${k\rightarrow\infty}$ ) it holds that ${\left\lvert p(l,c)-\lim_{k\rightarrow\infty}p(k,c)\right\rvert<\delta}$ .

2.

Given $c$ and $\delta>0$ there exists $K_{0}$ such that for every $l>K_{0}$ and every $p\in F$ with an infinite limit (as ${k\rightarrow\infty}$ ) it holds that ${\left\lvert p(l,c)-c\right\rvert>\delta}$ .

Finally, for every $c$ it holds that ${\lim_{k\rightarrow\infty}\frac{(k^{2}+kc)kc}{k^{3}+k^{2}+kc}=c}$ . Hence, for $c=\left\lceil 2\varepsilon+max(0,b_{{max}})\right\rceil$ there exists $k$ such that for every $p\in F$ it holds that ${\left\lvert p(k,c)-\frac{(k^{2}+kc)kc}{k^{3}+k^{2}+kc}\right\rvert>\varepsilon}$ , implying ${\left\lvert\operatorname{sum}\ \circ{\mathcal{N}}(G_{k,c})-f(G_{k,c})\right\rvert>\varepsilon}$ . ∎

Theorem 5.7

*Let $f:{\mathcal{G}}_{\{0,1\}^{1}}\rightarrow{\mathbb{R}}$ a graph embedding such that $\forall k,c\ f(G_{k,c})=\frac{(k^{2}+kc)kc}{k^{3}+k^{2}+kc}$ . Let an aggregation ${{\mathfrak{a}}\in\{\operatorname{sum},\operatorname{avg}\}}$ and an FNN ${\mathfrak{F}}$ , and define a readout ${{\mathfrak{r}}{\mathfrak{o}}}\coloneqq f_{\mathfrak{F}}\circ{\mathfrak{a}}$ . Then, ${{{\mathfrak{r}}{\mathfrak{o}}}\circ\text{$ \operatorname{Sum} $-GNNs }\not\approx f}$ . *

Proof.

Follows from combining Lemma A.11 and Lemma A.12. ∎

Corollary 5.8

*Denote by $S$ the set of all multisets over ${\mathbb{N}}_{>0}$ . Let ${g:S\rightarrow{\mathbb{R}}}$ an aggregation such that ${\forall a,b\in{\mathbb{N}}_{>0}\ g({\{a\}\choose b})=a}$ . Let an aggregation ${{\mathfrak{a}}\in\{\operatorname{sum},\operatorname{avg}\}}$ and an FNN ${\mathfrak{F}}$ , and define a readout ${{\mathfrak{r}}{\mathfrak{o}}}\coloneqq f_{\mathfrak{F}}\circ{\mathfrak{a}}$ . Then, ${{{\mathfrak{r}}{\mathfrak{o}}}\circ\text{$ \operatorname{Sum} $-GNNs }\not\geq^{{}_{\{0,1\}}}\operatorname{avg}\circ\ \text{(Sum, g)-GNNs}}$ . *

Proof.

Clearly, for a straightforward stereo aggregation (Sum,g)-GNN ${\mathcal{N}}_{g}$ it holds that ${\mathcal{N}}_{g}(G_{k,c})(u_{i})=kc$ , ${\mathcal{N}}_{g}(G_{k,c})(v_{i})=0$ , and ${\mathcal{N}}_{g}(G_{k,c})(w_{i})=kc$ , hence $\operatorname{avg}\circ{\mathcal{N}}_{g}(G_{k,c})=\frac{(k^{2}+kc)kc}{k^{3}+k^{2}+kc}$ . By Theorem 5.7, no composition of ${{\mathfrak{r}}{\mathfrak{o}}}$ with a $\operatorname{Sum}$ -GNN can approximate the graph embedding $f(G)\coloneqq\operatorname{avg}\circ{\mathcal{N}}_{g}(G)$ . ∎
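The vertex values and the resulting average claimed in the proof can be reproduced on small instances. The sketch below rests on assumptions that are not spelled out in the text: the "straightforward" (Sum, g)-GNN is modeled as a Sum-aggregation layer followed by a $g$-aggregation layer, each outputting the aggregated value directly, with $g=\max$ as one concrete choice; the names `build`, `aggregate`, and `stereo_avg` are hypothetical.

```python
from fractions import Fraction

def build(k, c):
    """u: k^2 vertices (0), v: k^3 vertices (0), w: k*c vertices (1); v joins u and w."""
    n_u, n_v, n_w = k ** 2, k ** 3, k * c
    u = range(0, n_u)
    v = range(n_u, n_u + n_v)
    w = range(n_u + n_v, n_u + n_v + n_w)
    feats = [0] * n_u + [0] * n_v + [1] * n_w
    nbrs = ([list(v) for _ in u]
            + [list(u) + list(w) for _ in v]
            + [list(v) for _ in w])
    return feats, nbrs

def aggregate(feats, nbrs, agg):
    """Replace every vertex feature by agg over its neighbors' features."""
    return [agg([feats[x] for x in nb]) for nb in nbrs]

def stereo_avg(k, c, g=max):
    """Sum layer, then g layer, then average readout (exact fraction)."""
    feats, nbrs = build(k, c)
    summed = aggregate(feats, nbrs, sum)   # v_i now holds kc; u_i and w_i hold 0
    out = aggregate(summed, nbrs, g)       # u_i and w_i now hold kc; v_i hold 0
    return Fraction(sum(out), len(out))
```

For $k=2$, $c=3$ this gives $\frac{60}{18}=\frac{10}{3}$, which equals $\frac{(k^{2}+kc)kc}{k^{3}+k^{2}+kc}$.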

Proofs for Section 6

Lemma 6.1

*Let ${\mathcal{A}}$ an $m$ -layer $\operatorname{MUPA}$ -GNN architecture, let $l$ be the maximum depth of any FNN in ${\mathcal{A}}$ , and let $d$ be the maximum in-degree of any node in any FNN in ${\mathcal{A}}$ . Then, there exists $r\in{\mathbb{N}}$ such that: for every GNN ${\mathcal{N}}$ that realizes ${\mathcal{A}}$ it holds that ${\mathcal{N}}(G_{k},u)$ is piecewise-polynomial (of $k$ ) with at most $((d+1)^{l})^{m}$ pieces, and each piece is of degree at most $r$ . *

Proof.

Note the following observations:

a. Let $f_{1},f_{2}$ be piecewise polynomial with $p_{1},p_{2}$ pieces, then a linear combination of $f_{1},f_{2}$ has at most $p_{1}+p_{2}$ pieces. This can be seen by considering the set of pieces-joint points of $f_{1}+f_{2}$ , and noticing that it is the union of such points of $f_{1}$ and such points of $f_{2}$ . Accordingly, let $f_{1},\ldots,f_{d}$ be piecewise polynomial with at most $p$ pieces each, then a linear combination of $f_{1},\ldots,f_{d}$ has at most $p\cdot d$ pieces.

b. Let $f$ be piecewise polynomial with at most $p$ pieces, then $ReLU(f)$ has at most $p+1$ pieces.

c. Let $g$ be an output of a ReLU FNN of depth $l$ with maximal in-degree $d$ for any node, whose inputs are each piecewise polynomial with at most $p$ pieces. Then, by (a)+(b), $g$ is piecewise polynomial with at most $((\cdots((pd+1)d+1)\cdots)d+1)\leq p\cdot(d+1)^{l}$ pieces.

d. Let $f(x)$ be piecewise polynomial with at most $p$ pieces, and let $g(x)$ be a polynomial, then $g(f(x))$ is piecewise polynomial, with at most $p$ pieces, each of degree at most $deg(f)deg(g)$.

e. Let $f(x)$ be piecewise polynomial with at most $p$ pieces, and let $g(y)$ a polynomial, then $g(xf(x))$ is piecewise polynomial, with at most $p$ pieces, each of degree at most $(deg(f)+1)deg(g)$ .
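The piece counts in observations (a) to (c) follow a simple recurrence: each FNN layer takes a linear combination of at most $d$ inputs (at most $p\cdot d$ pieces) and applies ReLU (one more piece). The sketch below, with hypothetical function names, iterates that recurrence and compares it with the bounds $p\cdot(d+1)^{l}$ and $((d+1)^{l})^{m}$ used in the lemma; it assumes $p\geq 1$.

```python
def pieces_after_fnn(p, d, depth):
    """Iterate p -> p*d + 1 once per FNN layer, per observations (a) and (b)."""
    count = p
    for _ in range(depth):
        count = count * d + 1
    return count

def fnn_bound(p, d, depth):
    """The closed-form upper bound p * (d + 1)**depth from observation (c)."""
    return p * (d + 1) ** depth

def gnn_bound(d, depth, layers):
    """The overall bound ((d + 1)**depth)**layers on pieces after all GNN layers."""
    return ((d + 1) ** depth) ** layers
```

For instance, starting from a single piece with $d=2$ and depth $3$, the recurrence gives $1\rightarrow 3\rightarrow 7\rightarrow 15$ pieces, below the bound $3^{3}=27$.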

Let ${\mathcal{N}}$ be a GNN that realizes ${\mathcal{A}}$ . We define $u^{(t)}_{k}\coloneqq{\mathcal{N}}^{(t)}(G_{k},u)$ , the feature of $u\in V(G_{k})$ after operating the first $t$ layers of ${\mathcal{N}}$ . Note that $u^{(m)}_{k}={\mathcal{N}}(G_{k},u)$ . For every $i,j\in[k]$ there is an automorphism of $G_{k}$ that maps $v_{i}$ to $v_{j}$ , thus they receive the same feature throughout the computation. We define $v^{(t)}_{k}\coloneqq{\mathcal{N}}^{(t)}(G_{k},v_{i})$ for every ${i\in[k]}$ . In our argumentation, we view $u^{(t)}_{k},v^{(t)}_{k}$ as functions of $k$ .

Using observations [a..e] above, we prove by induction on $t$ that $v_{k}^{(t)},u_{k}^{(t)}$ , in each coordinate, are piecewise polynomial in $k$ with no more than $((d+1)^{l})^{t}$ pieces, each of degree at most $r_{t}$ for some $r_{t}\in{\mathbb{N}}$ . For $t=0$ we have that $v_{k}^{(t)},u_{k}^{(t)}$ are constants. Assume correctness for $t=n$ . By definition, ${u_{k}^{(n+1)}=f_{n+1}(u_{k}^{(n)},{\mathfrak{a}}^{(n+1)}_{1},\ldots,{\mathfrak{a}}^{(n+1)}_{b_{n+1}})}$ where ${\mathfrak{a}}^{(n+1)}_{j}$ is a shorthand for the aggregation value ${{\mathfrak{a}}^{(n+1)}_{j}(\{v_{k,c}^{(n)}\}^{k})}$ . By (d),(e), and the induction assumption, each of the input coordinates to $f_{n+1}$ is piecewise polynomial in $k$ with at most $((d+1)^{l})^{n}$ pieces, each of degree at most $r_{n+1}$ for some $r_{n+1}\in{\mathbb{N}}$ . Hence, by (c), each coordinate of $u_{k,c}^{(n+1)}$ has at most $((d+1)^{l})^{n}\cdot(d+1)^{l}=((d+1)^{l})^{n+1}$ pieces, each of degree at most $r_{n+1}$ . By similar reasoning, $v_{k}^{(n+1)}$ can be shown to have no more than $((d+1)^{l})^{n+1}$ pieces, each of a certain maximal degree. ∎
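The piece counts used in observation (c) and in the induction can be checked numerically. The following is a minimal sketch and not part of the original proof: the function names are illustrative, and it assumes that every FNN layer applies a linear combination of at most $d$ inputs followed by ReLU.

```python
def fnn_piece_bound(p: int, d: int, l: int) -> int:
    """Iterate observations (a) and (b) through l layers: p -> p*d + 1."""
    pieces = p
    for _ in range(l):
        # (a): a linear combination of d inputs has at most p*d pieces,
        # (b): applying ReLU adds at most one more piece.
        pieces = pieces * d + 1
    return pieces


def closed_form_bound(p: int, d: int, l: int) -> int:
    """The looser closed form p * (d+1)^l stated in observation (c)."""
    return p * (d + 1) ** l


def gnn_piece_bound(d: int, l: int, m: int) -> int:
    """Bound ((d+1)^l)^m after m GNN layers, starting from constant features."""
    return ((d + 1) ** l) ** m


if __name__ == "__main__":
    # One constant input piece, in-degree 2, depth 3.
    print(fnn_piece_bound(1, 2, 3), closed_form_bound(1, 2, 3))
    # Two GNN layers with the same FNN shape.
    print(gnn_piece_bound(2, 3, 2))
```

The iterated count never exceeds the closed form when $p\geq 1$ , because $pd+1\leq p(d+1)$ in that case.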

Theorem 6.2

*Let $f:{\mathcal{G}}_{1}\rightarrow{\mathcal{Z}}_{\mathbb{R}}$ be a feature transformation, and define $g(k)\coloneqq f(G_{k})(u)$ . Assume that $g$ does not converge to any polynomial, that is, there exists $\varepsilon>0$ such that for every polynomial $p$ , for every $K_{0}$ , there exists $k\geq K_{0}$ such that $\left\lvert g(k)-p(k)\right\rvert\geq\varepsilon$ . Then, $\operatorname{MUPA}$ -GNNs $\not\approx f$ . *

Proof.

Let $\varepsilon>0$ be a gap by which $g$ fails to stay close to every polynomial, and let ${\mathcal{N}}$ be a $\operatorname{MUPA}$ -GNN. By Lemma 6.1, ${\mathcal{N}}(G_{k},u)$ is piecewise polynomial in $k$ with finitely many pieces, so there is a $K_{0}$ such that for every $k\geq K_{0}$ it holds that ${\mathcal{N}}(G_{k},u)=p(k)$ for some polynomial $p$ (the last piece). By assumption, there exists $k\geq K_{0}$ such that $\left\lvert g(k)-p(k)\right\rvert\geq\varepsilon$ . Hence, $\left\lvert{\mathcal{N}}(G_{k},u)-f(G_{k})(u)\right\rvert\geq\varepsilon$ . ∎

Lemma 6.3

For $x,k\in{\mathbb{N}}$ define $I_{x,k}\coloneqq\{x,x+1,\ldots,x+k-1\}$ , the set of $k$ consecutive integers starting at $x$ . Let $f:\mathbb{N}\rightarrow\mathbb{R}$ be a PIL, let $n\in\mathbb{N}$ , and define $k_{n}\coloneqq$

[TABLE]

*Then, for every $x\in\mathbb{N}$ there exists $\varepsilon_{x,k_{n}}>0$ such that: for every $p\in P_{n}$ there exists $y\in I_{x,k_{n}}$ for which $\left\lvert p(y)-f(y)\right\rvert\geq\varepsilon_{x,k_{n}}$ . That is, for every starting point $x$ there is a bounded interval $I_{x,k_{n}}$ , and a gap $\varepsilon_{x,k_{n}}$ , such that no polynomial of degree $\leq n$ can approximate $f$ on that interval below that gap. *

Proof.

Define $I\coloneqq I_{x,k_{n}}$ . For a real-valued function $h$ whose domain contains $I$ , we define $\left\lVert h\right\rVert_{I}\coloneqq\max(\left\lvert h(y)\right\rvert:y\in I_{x,k_{n}})$ , the maximum absolute value $h$ attains on $I_{x,k_{n}}$ . Define ${\varepsilon_{x,k_{n}}\coloneqq\inf(\left\lVert f-p\right\rVert_{I}:p\in P_{n})}$ , the distance of $f$ from the closest polynomial of degree $\leq n$ , in the segment $I_{x,k_{n}}$ . We need to show that $\varepsilon_{x,k_{n}}>0$ . For a vector $a=(a_{0},\ldots,a_{n})\in{\mathbb{R}}^{n+1}$ denote by $\left\lVert a\right\rVert_{2}$ the Euclidean norm of $a$ . For $a,b\in{\mathbb{R}}^{n+1}$ we use $d(a,b)\coloneqq\left\lVert a-b\right\rVert_{2}$ as the metric in our continuity argumentation. Define $p_{a}(x)\coloneqq a_{0}+\cdots+a_{n}x^{n}$ the polynomial determined by $a$ . Note the following:

a)

For $a\in{\mathbb{R}}^{n+1}$ , let $g(a)\coloneqq\left\lVert p_{a}\right\rVert_{I}$ , then $g$ is continuous.

b)

For $a\in{\mathbb{R}}^{n+1}$ , let $g(a)\coloneqq\left\lVert f-p_{a}\right\rVert_{I}$ , then $g$ is continuous.

c)

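There exists $T\in{\mathbb{R}}$ such that ${\inf(\left\lVert f-p\right\rVert_{I}:p\in P_{n})=\inf(\left\lVert f-p_{a}\right\rVert_{I}:\left\lVert a\right\rVert_{2}\leq T)}$ .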

Proof: Let $S=\{a\in{\mathbb{R}}^{n+1}:\left\lVert a\right\rVert_{2}=1\}$ and define ${\delta_{S}\coloneqq\inf(\left\lVert p_{a}\right\rVert_{I}:a\in S)}$ . By (a), $\left\lVert p_{a}\right\rVert_{I}$ is continuous, and as $S$ is compact we have that there exists $a^{*}\in S$ such that $\left\lVert p_{a^{*}}\right\rVert_{I}=\delta_{S}$ . Note that necessarily $k_{n}\geq n+1$ , and a nonzero polynomial of degree $\leq n$ has at most $n$ roots, so it cannot vanish on all of $I$ . Thus, by definition of $\left\lVert p_{a^{*}}\right\rVert_{I}$ , it must be that either $\delta_{S}>0$ or $p_{a^{*}}=0$ . Since $a^{*}\in S$ implies $a^{*}\neq 0$ , necessarily it is the former that holds. Hence, for every nonzero $a\in{\mathbb{R}}^{n+1}$ we have that $\left\lVert p_{a/\left\lVert a\right\rVert_{2}}\right\rVert_{I}\geq\delta_{S}$ , and by $\left\lVert p_{a}\right\rVert_{I}=\left\lVert a\right\rVert_{2}\cdot\left\lVert p_{a/\left\lVert a\right\rVert_{2}}\right\rVert_{I}$ we have $\left\lVert p_{a}\right\rVert_{I}\xrightarrow[\left\lVert a\right\rVert_{2}\rightarrow\infty]{}\infty$ . Finally, note that $\left\lVert f-p_{a}\right\rVert_{I}\geq\left\lVert p_{a}\right\rVert_{I}-\left\lVert f\right\rVert_{I}$ , and let $T$ be such that $\left\lVert a\right\rVert_{2}\geq T\Rightarrow\left\lVert p_{a}\right\rVert_{I}>\varepsilon_{x,k_{n}}+1+\left\lVert f\right\rVert_{I}$ , then for all $a:\left\lVert a\right\rVert_{2}\geq T$ we have $\left\lVert f-p_{a}\right\rVert_{I}\geq\varepsilon_{x,k_{n}}+1+\left\lVert f\right\rVert_{I}-\left\lVert f\right\rVert_{I}=\varepsilon_{x,k_{n}}+1$ . Hence, ${\inf(\left\lVert f-p\right\rVert_{I}:p\in P_{n})=\inf(\left\lVert f-p_{a}\right\rVert_{I}:\left\lVert a\right\rVert_{2}\leq T)}$ .

By (b) and (c), $\varepsilon_{x,k_{n}}$ is the infimum of a continuous function on a closed ball, hence there exists $a^{*}\in{\mathbb{R}}^{n+1}$ such that $\varepsilon_{x,k_{n}}=\lVert f-p_{a^{*}}\rVert_{I}$ . By the assumption that $f$ is PIL, and the definition of $k_{n}$ , we have $\left\lVert f-p_{a^{*}}\right\rVert_{I}>0$ . ∎
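A positive gap can also be exhibited explicitly on a window of at least $n+2$ consecutive integers, using finite differences instead of the compactness argument above. This is an alternative, elementary route and not the authors' proof. The $(n+1)$ -th forward difference annihilates every polynomial of degree $\leq n$ , and it is a signed sum of $n+2$ values whose absolute coefficients sum to $2^{n+1}$ , so $\left\lVert f-p\right\rVert_{I}\geq\lvert\Delta^{n+1}f\rvert/2^{n+1}$ for every $p\in P_{n}$ . The sketch below computes this lower bound; the function names and the example function $2^{y}$ are illustrative assumptions, and $2^{y}$ is not claimed to be PIL in the paper's sense.

```python
def forward_difference(values, order):
    """Apply the forward difference operator `order` times to a list of values."""
    vals = list(values)
    for _ in range(order):
        vals = [b - a for a, b in zip(vals, vals[1:])]
    return vals


def gap_lower_bound(f, x, k, n):
    """Lower bound on max |f(y) - p(y)| over y in {x, ..., x+k-1}, for any
    polynomial p of degree <= n, via the (n+1)-th forward difference."""
    window = [f(y) for y in range(x, x + k)]
    diffs = forward_difference(window, n + 1)
    if not diffs:
        raise ValueError("window too short: need k >= n + 2")
    return max(abs(d) for d in diffs) / 2 ** (n + 1)


if __name__ == "__main__":
    # Hypothetical example: 2^y cannot be matched by a quadratic on 4 points.
    print(gap_lower_bound(lambda y: 2 ** y, 0, 4, 2))
    # A genuine quadratic yields the trivial bound 0.
    print(gap_lower_bound(lambda y: y * y, 5, 6, 2))
```

A bound of zero is uninformative, while any positive value certifies that no polynomial of degree $\leq n$ fits $f$ on that window within that gap.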

Lemma 6.4

*For every $q,n\in{\mathbb{N}}$ there exists a point $T_{q,n}\in{\mathbb{N}}$ and a gap $\delta_{T_{q,n}}>0$ such that: for every PIL $f:\mathbb{N}\rightarrow\mathbb{R}$ , and every piecewise-polynomial $g$ with $q$ many pieces of degree $\leq n$ , there exists $y\in\mathbb{N},\;0\leq y\leq T_{q,n}$ for which $\left\lvert g(y)-f(y)\right\rvert\geq\delta_{T_{q,n}}$ . That is, the number of pieces and the max degree of a piecewise-polynomial $g$ determine a guaranteed minimum gap by which $g$ misses $f$ within a guaranteed interval. *

Proof.

Define $T_{0}=1$ . Using the notation of $k_{n}$ from Lemma 6.3, for every $i\in[q]$ define $T_{i}\coloneqq(k_{n}-1)\cdot i+1$ , define $I_{i}\coloneqq I_{T_{i-1},k_{n}}$ , and define $\delta_{i}\coloneqq\inf(\lVert{f-p}\rVert_{I_{i}}:p\in P_{n})$ . Note that $\delta_{i}>0$ by Lemma 6.3. Finally, define ${T_{q,n}\coloneqq T_{q}}$ and ${\delta_{T_{q,n}}\coloneqq\min(\delta_{i}:i\in[q])}$ . Assume for contradiction that $\left\lvert g(y)-f(y)\right\rvert<\delta_{T_{q,n}}$ for every $y\in[0..T_{q,n}]$ . Then no single polynomial piece of $g$ can cover an entire window $I_{i}$ , so necessarily the first polynomial piece of $g$ ends at most at $T_{1}-1$ , the second at $T_{2}-1$ , and the $(q-1)$ -th piece at $T_{q-1}-1$ . The last polynomial piece therefore starts at the latest at $T_{q-1}$ and covers all of $I_{q}$ , so by $T_{q,n}$ it must have missed at least one point by at least $\delta_{T_{q,n}}>0$ , a contradiction. ∎
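The window construction in this proof can be made concrete with a short sketch. It is illustrative only: `k_n` is passed in as a plain parameter because its definition comes from Lemma 6.3, and the function name is an assumption.

```python
def proof_windows(k_n: int, q: int):
    """Return the points T_0..T_q and the windows I_1..I_q from the proof.

    T_0 = 1, T_i = (k_n - 1) * i + 1, and I_i holds the k_n consecutive
    integers starting at T_{i-1}.
    """
    points = [(k_n - 1) * i + 1 for i in range(q + 1)]
    windows = [
        list(range(points[i - 1], points[i - 1] + k_n))
        for i in range(1, q + 1)
    ]
    return points, windows


if __name__ == "__main__":
    points, windows = proof_windows(k_n=4, q=3)
    print(points)
    for window in windows:
        print(window)
```

Each window $I_{i}$ starts at $T_{i-1}$ and ends at $T_{i}$ , so consecutive windows share one endpoint and together cover $[1..T_{q}]$ .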

Theorem 6.5

*Let $f:{\mathcal{G}}_{1}\rightarrow{\mathcal{Z}}_{\mathbb{R}}$ be a feature transformation, let $g(k)\coloneqq f(G_{k})(u)$ , and assume that $g$ is PIL. Then, for every $\operatorname{MUPA}$ -GNN architecture ${\mathcal{A}}$ , there exists $\varepsilon_{{\mathcal{A}}}>0$ such that for every $\operatorname{MUPA}$ -GNN ${\mathcal{N}}$ that realizes ${\mathcal{A}}$ there exists $k$ such that $\left\lvert{\mathcal{N}}(G_{k},u)-f(G_{k})(u)\right\rvert\geq\varepsilon_{{\mathcal{A}}}$ . *

Proof.

Let $q,r$ be the number of pieces and the maximum degree guaranteed by Lemma 6.1 for ${\mathcal{A}}$ , and let $T_{q,r},\delta_{T_{q,r}}$ be the point and the gap guaranteed by Lemma 6.4 for $q$ pieces of degree $\leq r$ . Then, by Lemma 6.4, the statement holds for $\varepsilon_{\mathcal{A}}\coloneqq\delta_{T_{q,r}}$ and some $k\leq T_{q,r}$ . ∎

Appendix B Experimentation Ext.

Architecture and Training

We implement all GNNs using PyTorch Geometric [?]. The update function $f_{\mathfrak{F}}$ of each GNN layer is a standard 2-layer MLP with a ReLU-activated hidden layer and a linear output layer. We set the intermediate embedding dimension to 256 and use 2 message passing layers in all models. We minimize the smooth L1 loss on the training data using the Adam optimizer [?]. No readout function is needed, since for both considered graph families the ground truth is a label of the root vertex. The predictions and losses of all other vertices are simply masked out.
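The masked loss described above can be sketched in plain Python. This is a minimal illustration and not the authors' PyTorch Geometric code: the function names are hypothetical, and the threshold `beta=1.0` is an assumption, since the text does not state which value was used.

```python
def smooth_l1(pred: float, target: float, beta: float = 1.0) -> float:
    """Smooth L1 loss: quadratic for small errors, linear for large ones."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def masked_root_loss(preds, targets, root_mask, beta: float = 1.0) -> float:
    """Average smooth L1 loss over root vertices only.

    `root_mask[i]` is True when vertex i is a root; all other vertices
    are masked out and contribute nothing to the loss.
    """
    losses = [
        smooth_l1(p, t, beta)
        for p, t, is_root in zip(preds, targets, root_mask)
        if is_root
    ]
    if not losses:
        raise ValueError("mask selects no vertex")
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Three vertices, the middle one is not a root and is ignored.
    print(masked_root_loss([0.5, 10.0, 3.0], [0.0, 0.0, 0.0], [True, False, True]))
```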

Before each training run we randomly choose 500 graphs from the training data as a validation dataset. Each model is trained for 500 epochs with a batch size of 100. The initial learning rate is selected from $\{10^{-3},10^{-4},10^{-5}\}$ based on validation performance. The learning rate decays with a cosine annealing schedule [?] throughout training. We average all results over 5 models trained with different random seeds. All experiments are conducted on a machine with an NVIDIA RTX A6000 GPU (48GB) and 512GB of RAM running Ubuntu 22.04 LTS.
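The learning-rate schedule can be illustrated with the standard closed form of cosine annealing without restarts. This is a sketch under assumptions: the minimum learning rate of zero and the helper names are not taken from the text, and `validation_error` stands in for a full training run that returns a validation score.

```python
import math

CANDIDATE_LRS = (1e-3, 1e-4, 1e-5)  # initial learning rates listed in the text
TOTAL_EPOCHS = 500                  # epochs per training run


def cosine_annealed_lr(initial_lr, epoch, total_epochs=TOTAL_EPOCHS, min_lr=0.0):
    """Learning rate at `epoch` under cosine annealing (no restarts)."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (initial_lr - min_lr) * cos_term


def select_initial_lr(validation_error):
    """Pick the candidate with the lowest validation error.

    `validation_error` is a hypothetical callable mapping a learning rate
    to the validation error reached when training with it.
    """
    return min(CANDIDATE_LRS, key=validation_error)


if __name__ == "__main__":
    for epoch in (0, 125, 250, 375, 500):
        print(epoch, cosine_annealed_lr(1e-3, epoch))
```

The rate starts at the selected initial value, reaches half of it at the midpoint of training, and decays to the minimum at the final epoch.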

Extended Results

An illustration of the full experimental results can be seen in fig. 7. For both datasets, and each tested architecture, we provide the relative error (RE) over the full test range ( ${k\in[1..1000],c\in[1..1000]}$ ) as a 3D plot. The error is provided on the $z$ -axis, which is linearly scaled. The color map is linear as well and is scaled individually for each subplot to highlight additional details.
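The error surface can be tabulated as in the following sketch. It assumes the usual definition of relative error, $\lvert\text{prediction}-\text{truth}\rvert/\lvert\text{truth}\rvert$ , which this appendix does not restate; the `model` and `truth` callables are toy placeholders rather than the trained GNNs or the experiment's target functions.

```python
def relative_error(pred: float, truth: float) -> float:
    """Relative error |pred - truth| / |truth| (assumed definition)."""
    if truth == 0:
        raise ValueError("relative error is undefined for a zero ground truth")
    return abs(pred - truth) / abs(truth)


def error_grid(model, truth, ks, cs):
    """Relative error for every (k, c) pair, as a dict keyed by (k, c)."""
    return {
        (k, c): relative_error(model(k, c), truth(k, c))
        for k in ks
        for c in cs
    }


if __name__ == "__main__":
    # Toy placeholders: the "model" is off by a constant 1 from the "truth".
    toy_truth = lambda k, c: float(c)
    toy_model = lambda k, c: float(c) + 1.0
    grid = error_grid(toy_model, toy_truth, ks=range(1, 4), cs=(1, 10, 100))
    for key in sorted(grid):
        print(key, grid[key])
```

For the full test range one would pass `ks=range(1, 1001)` and `cs=range(1, 1001)` and plot the resulting values over the $(k,c)$ plane.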

The results for the unbounded countable features (UC) experiment are provided in fig. 7(a). Note that the color map for the trained $\operatorname{Mean}$ -GNN is scaled by $10^{-5}$ , since the learned function is very close to the ground truth. The trained $\operatorname{Sum}$ -GNN performs significantly worse. Relative to its own overall performance, though, it generalizes well along the $k$ axis as long as $c$ is in the training range $[1..100]$ . Operating the trained $\operatorname{Sum}$ -GNN on $c$ in the training range resembles the bounded input-feature domain setting examined in Section 4. Hence, the generalization in $k$ , when $c$ is in the training range, resembles the result in Section 4: $\operatorname{Sum}$ -GNNs can approximate Mean when the input-feature domain is bounded. Once $c$ is beyond the training range, the relative error grows rapidly, both along the $k$ axis (for fixed $c$ ) and along the $c$ axis. Interestingly, the error of the trained $\operatorname{Sum}$ -GNN also tends upwards at $c<10$ . The learned function therefore lacks robustness even towards the lower end of the training range of $c$ .

The results for the single value features (SV) experiment are provided in fig. 7(b). Overall, the trained (Sum,Mean)-GNN achieves a significantly lower error than the $\operatorname{Sum}$ -GNN. As in the UC experiment, as long as $c$ is in the training range $[1..100]$ the trained $\operatorname{Sum}$ -GNN generalizes relatively well along the $k$ axis, and the performance deteriorates sharply (along both axes) when $c>100$ . We do note, though, that the results of the (Sum,Mean)-GNN in this experiment are substantially worse than those of the $\operatorname{Mean}$ -GNN in the UC experiment. While there exists a (Sum,Mean)-GNN that computes exactly the SV-experiment function (see proof of Corollary 5.6), Stochastic Gradient Descent (SGD) was not able to learn this function in fine detail. To arrive at a good (Sum,Mean)-GNN instance, the first GNN-layer has to learn to ignore the coordinates of the Mean-aggregation and to use the coordinates of the Sum-aggregation properly, and the second GNN-layer has to learn to ignore the Sum and use the Mean. These requirements constitute a more challenging learning problem than that of learning a good $\operatorname{Mean}$ -GNN for the UC task, and the difference is reflected in the results. Interestingly, the relative error of the (Sum,Mean)-GNN is worst at the lower end of the training range $c<10$ for high values of $k$ .

Bibliography


[Abboud et al., 2021] Ralph Abboud, İsmail İlkan Ceylan, Martin Grohe, and Thomas Lukasiewicz. The surprising power of graph neural networks with random node initialization. In Zhi-Hua Zhou, editor, Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, pages 2112–2118. ijcai.org, 2021.
[anonymous, 2022] anonymous. The equioscillation theorem. https://en.wikipedia.org/wiki/Equioscillation_theorem, 2022.
[Barceló et al., 2020a] Pablo Barceló, Egor V Kostylev, Mikael Monet, Jorge Pérez, Juan Reutter, and Juan-Pablo Silva. The logical expressiveness of graph neural networks. In 8th International Conference on Learning Representations (ICLR 2020), 2020.
[Barceló et al., 2020b] Pablo Barceló, Egor V Kostylev, Mikaël Monet, Jorge Pérez, Juan L Reutter, and Juan-Pablo Silva. The expressive power of graph neural networks as a query language. ACM SIGMOD Record, 49(2):6–17, 2020.
[Barceló et al., 2021] Pablo Barceló, Floris Geerts, Juan Reutter, and Maksimilian Ryschkov. Graph neural networks with local graph parameters. Advances in Neural Information Processing Systems, 34:25280–25293, 2021.
[Cappart et al., 2021] Quentin Cappart, Didier Chételat, Elias B. Khalil, Andrea Lodi, Christopher Morris, and Petar Velickovic. Combinatorial optimization and reasoning with graph neural networks. In Zhi-Hua Zhou, editor, Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, pages 4348–4355. ijcai.org, 2021.
[Chen et al., 2019] Zhengdao Chen, Soledad Villar, Lei Chen, and Joan Bruna. On the equivalence between graph isomorphism testing and function approximation with gnns. Advances in Neural Information Processing Systems, 32, 2019.
[Corso et al., 2020] Gabriele Corso, Luca Cavalleri, Dominique Beaini, Pietro Liò, and Petar Veličković. Principal neighbourhood aggregation for graph nets. Advances in Neural Information Processing Systems, 33:13260–13271, 2020.
