Neural network identifiability for a family of sigmoidal nonlinearities

Verner Vlačić; Helmut Bölcskei

arXiv:1906.06994·math.CO·September 3, 2020


TL;DR

This paper investigates when the input-output map of a neural network with various nonlinearities uniquely determines its architecture and parameters, providing minimal conditions for identifiability across diverse network structures.

Contribution

It derives necessary genericity conditions for neural network identifiability of arbitrary depth and connectivity, and constructs a broad family of nonlinearities satisfying these conditions.

Findings

01. Identifiability conditions are established for networks of any depth and connectivity.

02. A large family of nonlinearities is constructed that meets the minimal genericity conditions.

03. The family of nonlinearities can approximate many common nonlinear functions arbitrarily well.

Abstract

This paper addresses the following question of neural network identifiability: Does the input-output map realized by a feed-forward neural network with respect to a given nonlinearity uniquely specify the network architecture, weights, and biases? Existing literature on the subject (Sussman 1992; Albertini, Sontag et al. 1993; Fefferman 1994) suggests that the answer should be yes, up to certain symmetries induced by the nonlinearity, and provided the networks under consideration satisfy certain "genericity conditions". The results in Sussman 1992 and Albertini, Sontag et al. 1993 apply to networks with a single hidden layer, and in Fefferman 1994 the networks need to be fully connected. In an effort to answer the identifiability question in greater generality, we derive necessary genericity conditions for the identifiability of neural networks of arbitrary depth and connectivity with an…

Equations

$$\mathcal{N}=(D_{0},D_{1},\dots,D_{L};\,W^{1},\theta^{1},W^{2},\theta^{2},\dots,W^{L},\theta^{L}),$$

$$\langle\mathcal{N}\rangle^{\rho}(x)=\rho(W^{L}(\rho(W^{L-1}(\dots\rho(W^{1}x+\theta^{1})\dots)+\theta^{L-1}))+\theta^{L}),\quad x\in\mathbb{R}^{D_{0}},$$

$$\mathcal{N}_{1}\sim\mathcal{N}_{2}\implies\langle\mathcal{N}_{1}\rangle^{\rho}(x)=\langle\mathcal{N}_{2}\rangle^{\rho}(x),\quad\forall x\in\mathbb{R}^{D_{in}}.$$

$$\langle\mathcal{N}_{1}\rangle^{\rho}(x)=\langle\mathcal{N}_{2}\rangle^{\rho}(x),\quad\forall x\in\mathbb{R}^{D_{in}}\implies\mathcal{N}_{1}\sim\mathcal{N}_{2}.$$

$$W^{\ell}_{jk}/W^{\ell}_{j^{\prime}k}\notin\{p/q:p,q\in\mathbb{Z},\,1\leq q\leq 100D_{\ell}^{2}\}.$$

$$\widetilde{W}^{\ell}_{jk}=\epsilon^{\ell}_{j}\,W^{\ell}_{\gamma_{\ell}(j)\gamma_{\ell-1}(k)}\,\epsilon^{\ell-1}_{k},\quad\text{and}\quad\widetilde{\theta}^{\ell}_{j}=\epsilon^{\ell}_{j}\,\theta^{\ell}_{\gamma_{\ell}(j)}.$$

$$\widetilde{W}^{\ell}_{jk}=W^{\ell}_{\gamma_{\ell}(j)\gamma_{\ell-1}(k)},\quad\text{and}\quad\widetilde{\theta}^{\ell}_{j}=\theta^{\ell}_{\gamma_{\ell}(j)}.$$

$$(\theta^{\ell}_{j},W^{\ell}_{j1},\dots,W^{\ell}_{jD_{\ell-1}})=(\theta^{\ell}_{j^{\prime}},W^{\ell}_{j^{\prime}1},\dots,W^{\ell}_{j^{\prime}D_{\ell-1}}).$$

$$\mathscr{N}^{D_{in},D_{out}}_{nc}=\{\mathcal{N}\in\mathscr{N}^{D_{in},D_{out}}:\mathcal{N}\text{ satisfies the no-clones condition}\},$$

$$\rho\big(\rho(x)-\tfrac{1}{2}\rho(2x)-\tfrac{1}{2}\rho(2x-1)+0\big)=0,\quad\text{for all }x\in\mathbb{R}.$$

$$\mathcal{N}_{0}=\left(W^{1},\theta^{1},W^{2},\theta^{2},\dots,W^{L},\theta^{L},\Big(\begin{smallmatrix}1\\ 2\\ 2\end{smallmatrix}\Big),\Big(\begin{smallmatrix}0\\ 0\\ -1\end{smallmatrix}\Big),\big(1\;\,-\tfrac{1}{2}\;\,-\tfrac{1}{2}\big),0\right)$$

$$BV(\mathbb{R})=\Bigg\{f\in L^{1}(\mathbb{R}):\|f\|_{BV(\mathbb{R})}:=\sup_{\substack{\varphi\in C_{c}^{1}(\mathbb{R})\\ \|\varphi\|_{L^{\infty}(\mathbb{R})}\leq 1}}\int_{\mathbb{R}}f(x)\varphi^{\prime}(x)\,\mathrm{d}x<\infty\Bigg\}$$

$$\mathcal{N}_{m}:=(D_{0},D_{1},\dots,D_{L-1},2;\,W^{1},\theta^{1},W^{2},\theta^{2},\dots,W^{L-1},\theta^{L-1},U^{m},\bm{0}),$$

$$\langle\mathcal{N}_{1}\rangle^{\rho}=(f_{1},f_{3}),\quad\langle\mathcal{N}_{2}\rangle^{\rho}=(f_{1},f_{4}),\quad\langle\mathcal{N}_{3}\rangle^{\rho}=(f_{2},f_{4}),\quad\text{and}\quad\langle\mathcal{N}_{4}\rangle^{\rho}=(f_{2},f_{3}),$$

$$\langle\mathcal{N}_{1}\rangle^{\rho}-\langle\mathcal{N}_{2}\rangle^{\rho}+\langle\mathcal{N}_{3}\rangle^{\rho}-\langle\mathcal{N}_{4}\rangle^{\rho}=\begin{pmatrix}0+f_{1}-f_{1}+f_{2}-f_{2}\\ 0+f_{3}-f_{4}+f_{4}-f_{3}\end{pmatrix}=0.$$

$$0\cdot\bm{1}+\langle\mathcal{M}_{1,j}\rangle^{\sigma}-\langle\mathcal{M}_{2,j}\rangle^{\sigma}=(\langle\mathcal{N}_{1}\rangle^{\sigma})_{j}-(\langle\mathcal{N}_{2}\rangle^{\sigma})_{j}=0,$$

$$\lambda_{0}+\sum_{j=1}^{n}\lambda_{j}(\langle\mathcal{M}\rangle^{\sigma})_{j}(z)=0,$$

$$\bigvee_{k=1}^{n}\mathcal{N}_{k}=\mathcal{N}_{1}\lor\mathcal{N}_{2}\lor\dots\lor\mathcal{N}_{n}:=(\dots(\mathcal{N}_{1}\lor\mathcal{N}_{2})\lor\dots)\lor\mathcal{N}_{n}.$$

$$\Big(c\,\bm{1}+\sum_{j=1}^{n}\lambda_{j}\langle w_{j}\rangle^{\rho,\,\mathcal{M}}\Big)(t)=\Big(c\,\bm{1}+\sum_{j=1}^{n}\lambda_{j}\langle\mathcal{N}_{j}\rangle^{\rho}\Big)\big((\omega_{\tilde{v}v_{in}}t)_{\tilde{v}\in V_{in}}\big)=0,$$

$$\langle w\rangle^{\rho,\,\mathcal{M}_{a}}(t_{1},t_{2},\dots,t_{D_{0}-1})=\langle w\rangle^{\rho,\,\mathcal{M}}(t_{1},t_{2},\dots,t_{D_{0}-1},a),$$

$$(t_{1},t_{2},\dots,t_{D_{0}-1})\mapsto\langle w\rangle^{\rho,\,\mathcal{M}}(t_{1},t_{2},\dots,t_{D_{0}-1},a)$$

$$\sum_{w\in V_{out}^{\mathcal{M}}}\lambda_{w}\langle w\rangle^{\rho,\,\mathcal{M}}=0.$$

$$\sum_{w\in V_{out}^{\mathcal{M}}\setminus V_{out}^{\mathcal{M}_{a}}}\lambda_{w}\langle w\rangle^{\rho,\,\mathcal{M}}(a)\,\bm{1}+\sum_{w\in V_{out}^{\mathcal{M}_{a}}}\lambda_{w}\langle w\rangle^{\rho,\,\mathcal{M}_{a}}=\sum_{w\in V_{out}^{\mathcal{M}}}\lambda_{w}\langle w\rangle^{\rho,\,\mathcal{M}}=0,$$

$$a_{v}=\begin{cases}a,&v=v_{D_{0}}^{0}\\ \rho\Big(\sum_{u\in\mathrm{par}_{\mathcal{M}}(v)}\omega_{vu}a_{u}+\theta_{v}\Big),&v\neq v_{D_{0}}^{0}\end{cases}.$$

$$\widetilde{\theta}_{v}=\theta_{v}+\sum_{u\in\mathrm{par}_{\mathcal{M}}(v)\setminus V^{\mathcal{M}_{a}}}\omega_{vu}a_{u},$$

$$E_{(c_{1},c_{2})}=\{a\in\mathbb{R}:c_{1},c_{2}\in V^{\mathcal{M}_{a}},\text{ and }c_{1},c_{2}\text{ are clones in }\mathcal{M}_{a}\}.$$

$$\mathbb{R}=\bigcup_{(c_{1},c_{2})\in V^{\mathcal{M}}\times V^{\mathcal{M}}}E_{(c_{1},c_{2})}.$$

$$V^{\mathcal{N}}=\begin{cases}S\cup\{c_{1},c_{2}\},&\text{if }v_{D_{0}}^{0}\in V^{\mathcal{M}(c_{2})}\\ S\cup\{c_{1}\},&\text{if }v_{D_{0}}^{0}\notin V^{\mathcal{M}(c_{2})}\end{cases}.$$

$$\mathrm{par}_{\mathcal{M}}(c_{1})\cap V^{\mathcal{N}}$$

$$\theta_{c_{1}}+r$$


Full text

Neural Network Identifiability for a Family of Sigmoidal Nonlinearities

Verner Vlačić and Helmut Bölcskei

Dept. of EE and Dept. of Math., ETH Zurich, Switzerland


Abstract

This paper addresses the following question of neural network identifiability: Does the input-output map realized by a feed-forward neural network with respect to a given nonlinearity uniquely specify the network architecture, weights, and biases? Existing literature on the subject [1, 2, 3] suggests that the answer should be yes, up to certain symmetries induced by the nonlinearity, and provided the networks under consideration satisfy certain “genericity conditions”. The results in [1] and [2] apply to networks with a single hidden layer and in [3] the networks need to be fully connected. In an effort to answer the identifiability question in greater generality, we derive necessary genericity conditions for the identifiability of neural networks of arbitrary depth and connectivity with an arbitrary nonlinearity. Moreover, we construct a family of nonlinearities for which these genericity conditions are minimal, i.e., both necessary and sufficient. This family is large enough to approximate many commonly encountered nonlinearities to within arbitrary precision in the uniform norm.

I Introduction

Deep learning has become a highly successful machine learning method employed in a wide range of applications such as optical character recognition [4], image classification [5], and speech recognition [6]. In a typical deep learning scenario one aims to fit a parametric model, realized by a deep neural network, to match a set of training data points. In order to make the ensuing discussion more concrete, we begin with the definition of a neural network and the map it realizes under a nonlinearity.

Definition 1 (Neural network).

We call an ordered sequence

$$\mathcal{N}=(D_{0},D_{1},\dots,D_{L};\,W^{1},\theta^{1},W^{2},\theta^{2},\dots,W^{L},\theta^{L})$$

a neural network, where

– $L$ is a positive integer, referred to as the depth of $\mathcal{N}$,

– $(D_{0},D_{1},\dots,D_{L})$ is an $(L+1)$-tuple of positive integers, called the layout,

– $W^{\ell}={(W_{jk}^{\ell})}\in\mathbb{R}^{D_{\ell}\times D_{\ell-1}}$, $\ell\in\{1,\dots,L\}$, are matrices whose entries are referred to as the network’s weights, and

– $\theta^{\ell}={(\theta_{j}^{\ell})}\in\mathbb{R}^{D_{\ell}}$, $\ell\in\{1,\dots,L\}$, are vectors of the so-called biases.

Furthermore, we stipulate that none of the $W^{\ell}$ , $\ell\in\{1,\dots,L\}$ , have an identically zero row or an identically zero column.

Definition 2.

Given a neural network $\mathcal{N}$ and a nonlinear function $\rho:\mathbb{R}\to\mathbb{R}$ , referred to as the nonlinearity, we define the map realized by $\mathcal{N}$ under $\rho$ as the function $\langle{\mathcal{N}}\rangle^{\rho}:\mathbb{R}^{D_{0}}\to\mathbb{R}^{D_{L}}$ given by

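$$\langle\mathcal{N}\rangle^{\rho}(x)=\rho(W^{L}(\rho(W^{L-1}(\dots\rho(W^{1}x+\theta^{1})\dots)+\theta^{L-1}))+\theta^{L}),\quad x\in\mathbb{R}^{D_{0}},$$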

where $\rho$ acts on real vectors in a componentwise fashion.
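Definition 2 can be read as a plain layer-by-layer evaluation. The following is a minimal NumPy sketch of that reading; the function name `realize`, the list-of-arrays representation of the network, and the random example values are illustrative assumptions rather than anything prescribed by the paper. Note that, following the formula above, $\rho$ is also applied after the last affine map.

```python
import numpy as np

def realize(weights, biases, rho, x):
    """Evaluate <N>^rho(x) for N = (W^1, theta^1, ..., W^L, theta^L).

    `weights[l]` has shape (D_{l+1}, D_l) and `biases[l]` has shape (D_{l+1},).
    """
    a = np.asarray(x, dtype=float)
    for W, theta in zip(weights, biases):
        # rho acts componentwise, including on the output layer
        a = rho(W @ a + theta)
    return a

# Hypothetical example with layout (D_0, D_1, D_2) = (2, 3, 1).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal(3), rng.standard_normal(1)]
y = realize(weights, biases, np.tanh, [0.5, -1.0])
```

With `np.tanh` as the nonlinearity, `y` is a length-1 array whose entry lies strictly between -1 and 1.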

The requirement that the matrices $W^{\ell}$ in Definition 1 have nonzero rows corresponds to the absence of nodes whose contributions depend on the biases only, and are therefore constant as functions of the input. Similarly, columns that are identically zero correspond to nodes whose contributions do not enter the computation at the next layer. The map of a neural network failing this requirement can be realized by a network obtained by simply removing such spurious nodes. In practical applications, the numbers $L,D_{0},D_{1},\dots,D_{L}$ are typically determined through heuristic considerations, whereas the coefficients $W^{\ell},\,\theta^{\ell}$ of the affine maps $x\mapsto W^{\ell}x+\theta^{\ell}$ are learned based on training data. For an overview of practical techniques for deep learning, see [7]. Neural networks are often studied as mathematical objects in their own right, for instance in approximation theory [8, 9, 10, 11] and in control theory [12, 13]. In this context, a natural question is that of identification: Can a neural network be uniquely identified from the map it is to realize? Specifically, we will be interested in identifiability according to the following definition.

Definition 3 (Identifiability).

Given positive integers $D_{in}$ and $D_{out}$ , define $\mathscr{N}^{D_{in},D_{out}}$ to be the set of all neural networks whose layouts $(D_{0},\dots,D_{L})$ satisfy $D_{0}=D_{in}$ and $D_{L}=D_{out}$ , but are otherwise arbitrary. Let $\mathscr{N}$ be a subset of $\mathscr{N}^{D_{in},D_{out}}$ , $\rho$ a nonlinearity, and $\sim$ an equivalence relation on $\mathscr{N}^{D_{in},D_{out}}$ .

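(i) We say that $\sim$ is compatible with $(\mathscr{N},\rho)$ if, for all $\mathcal{N}_{1},\mathcal{N}_{2}\in\mathscr{N}$,

$$\mathcal{N}_{1}\sim\mathcal{N}_{2}\implies\langle\mathcal{N}_{1}\rangle^{\rho}(x)=\langle\mathcal{N}_{2}\rangle^{\rho}(x),\quad\forall x\in\mathbb{R}^{D_{in}}.$$

(ii) We say that $(\mathscr{N},\rho)$ is identifiable up to $\sim$ if, for all $\mathcal{N}_{1},\mathcal{N}_{2}\in\mathscr{N}$,

$$\langle\mathcal{N}_{1}\rangle^{\rho}(x)=\langle\mathcal{N}_{2}\rangle^{\rho}(x),\quad\forall x\in\mathbb{R}^{D_{in}}\implies\mathcal{N}_{1}\sim\mathcal{N}_{2}.$$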

Thus, by informally saying that a neural network $\mathcal{N}_{1}$ in a certain class is identifiable, we mean that any neural network $\mathcal{N}_{2}$ in the same class giving rise to the same output map, i.e., $\langle{\mathcal{N}_{1}}\rangle^{\rho}=\langle{\mathcal{N}_{2}}\rangle^{\rho}$, is necessarily equivalent to $\mathcal{N}_{1}$. The role of the equivalence relation $\sim$ in the previous definition is thus to “measure the degree of non-uniqueness”, and in particular, to accommodate symmetries within the network that may arise either from symmetries induced by the network weights and biases (such as the presence of clone pairs, to be introduced in Definition 5), symmetries of the nonlinearity (e.g., $\tanh$ is odd), or both simultaneously. These abstract concepts will be made concrete shortly when discussing the seminal work by Fefferman [3], and in Section II through Definitions 4 and 5, as well as in the examples leading up to the formulation of the paper’s main results.

In [3], Fefferman showed that neural networks satisfying the following genericity conditions are, indeed, uniquely determined by the map they realize under the nonlinearity $\rho=\tanh$ , up to certain obvious isomorphisms of networks:

Assumptions 1 (Fefferman’s genericity conditions).

(i) $\theta^{\ell}_{j}\neq 0$, for all $\ell$ and $j$, and $|\theta^{\ell}_{j}|\neq|\theta^{\ell}_{j^{\prime}}|$, for all $\ell$ and $j,j^{\prime}$ with $j\neq j^{\prime}$,

(ii) $W^{\ell}_{jk}\neq 0$, for all $\ell$, $j$, and $k$, and

(iii) for all $\ell$, $k$ and $j,j^{\prime}$ with $j\neq j^{\prime}$,

$$W^{\ell}_{jk}/W^{\ell}_{j^{\prime}k}\notin\{p/q:p,q\in\mathbb{Z},\,1\leq q\leq 100D_{\ell}^{2}\}.$$

More precisely, for fixed positive integers $D_{in}$ and $D_{out}$ , Fefferman showed that $(\mathscr{N}_{A1}^{D_{in},D_{out}},\tanh)$ is identifiable up to $\sim_{\pm}$ , where $\mathscr{N}_{A1}^{D_{in},D_{out}}$ is defined as the set of all neural networks in $\mathscr{N}^{D_{in},D_{out}}$ satisfying Assumptions 1, and $\sim_{\pm}$ is defined by stipulating that $\mathcal{N}\sim_{\pm}\widetilde{\mathcal{N}}$ if and only if

(i) $L=\widetilde{L}$ and $(D_{0},D_{1},\dots,D_{L})=(\widetilde{D}_{0},\widetilde{D}_{1},\dots,\widetilde{D}_{L})$, and

(ii) there exists a collection of signs $\{\epsilon^{\ell}_{j}:0\leq\ell\leq L,1\leq j\leq D_{\ell}\}$, $\epsilon^{\ell}_{j}\in\{-1,+1\}$, and permutations $\gamma_{\ell}:\{1,\dots,D_{\ell}\}\to\{1,\dots,D_{\ell}\}$ such that

– $\gamma_{\ell}$ is the identity permutation and $\epsilon^{\ell}_{j}=+1$, $j\in\{1,\dots,D_{\ell}\}$, whenever $\ell=0$ or $\ell=L$, and

– for all $\ell\in\{1,\dots,L\}$, $k\in\{1,\dots,D_{\ell-1}\}$, and $j\in\{1,\dots,D_{\ell}\}$,

$$\widetilde{W}^{\ell}_{jk}=\epsilon^{\ell}_{j}\,W^{\ell}_{\gamma_{\ell}(j)\gamma_{\ell-1}(k)}\,\epsilon^{\ell-1}_{k},\quad\text{and}\quad\widetilde{\theta}^{\ell}_{j}=\epsilon^{\ell}_{j}\,\theta^{\ell}_{\gamma_{\ell}(j)}.$$

It can be verified that $\sim_{\pm}$ is an equivalence relation on $\mathscr{N}_{A1}^{D_{in},D_{out}}$ . Networks $\mathcal{N}$ , $\widetilde{\mathcal{N}}$ such that $\mathcal{N}\sim_{\pm}\widetilde{\mathcal{N}}$ are said to be isomorphic up to sign changes. The permutations $\gamma_{\ell}$ reflect the fact that the ordering of the neurons in the hidden layers $1,\dots,L-1$ is not unique, whereas the freedom in choosing the signs $\epsilon^{\ell}_{j}$ reflects that $\tanh$ is an odd function. It can be verified that any two networks isomorphic up to sign changes give rise to the same map under the $\tanh$ nonlinearity, so $\sim_{\pm}$ is compatible with $(\mathscr{N}_{A1}^{D_{in},D_{out}},\tanh)$ . The crux of Fefferman’s result therefore lies in proving the converse statement, namely that two networks giving rise to the same map with respect to $\tanh$ are necessarily isomorphic up to sign changes. This is effected by the insight that the depth, the layout, and the weights and biases of a network $\mathcal{N}\in\mathscr{N}^{D_{in},D_{out}}_{A1}$ are encoded in the geometry of the singularities of the analytic continuation of $\langle{\mathcal{N}}\rangle^{\tanh}$ .
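The compatibility of $\sim_{\pm}$ with $\tanh$ can be checked numerically. The sketch below is only an illustration under stated assumptions: the helper names `realize` and `sign_permute`, the layout $(2,4,3,2)$, and the random weights are all hypothetical. It draws random hidden-layer permutations $\gamma_{\ell}$ and signs $\epsilon^{\ell}_{j}$, builds the transformed weights $\epsilon^{\ell}_{j}W^{\ell}_{\gamma_{\ell}(j)\gamma_{\ell-1}(k)}\epsilon^{\ell-1}_{k}$ and biases $\epsilon^{\ell}_{j}\theta^{\ell}_{\gamma_{\ell}(j)}$, and compares the two realized maps at a sample point.

```python
import numpy as np

def realize(weights, biases, rho, x):
    """Evaluate the map realized by the network under rho."""
    a = np.asarray(x, dtype=float)
    for W, theta in zip(weights, biases):
        a = rho(W @ a + theta)
    return a

def sign_permute(weights, biases, rng):
    """Apply random hidden-layer permutations and sign flips.

    Input and output layers keep the identity permutation and signs +1.
    """
    L = len(weights)
    dims = [weights[0].shape[1]] + [W.shape[0] for W in weights]
    gammas = [np.arange(d) for d in dims]
    signs = [np.ones(d) for d in dims]
    for l in range(1, L):  # hidden layers 1, ..., L-1 only
        gammas[l] = rng.permutation(dims[l])
        signs[l] = rng.choice([-1.0, 1.0], size=dims[l])
    new_w, new_b = [], []
    for l in range(1, L + 1):
        W, theta = weights[l - 1], biases[l - 1]
        # entry [j, k] of Wp is W[gamma_l(j), gamma_{l-1}(k)]
        Wp = W[np.ix_(gammas[l], gammas[l - 1])]
        new_w.append(signs[l][:, None] * Wp * signs[l - 1][None, :])
        new_b.append(signs[l] * theta[gammas[l]])
    return new_w, new_b

rng = np.random.default_rng(1)
dims = [2, 4, 3, 2]  # hypothetical layout
weights = [rng.standard_normal((dims[l + 1], dims[l])) for l in range(3)]
biases = [rng.standard_normal(dims[l + 1]) for l in range(3)]
w2, b2 = sign_permute(weights, biases, rng)

x = rng.standard_normal(2)
max_gap = np.max(np.abs(realize(weights, biases, np.tanh, x)
                        - realize(w2, b2, np.tanh, x)))
```

Here `max_gap` should vanish up to floating-point rounding. Replacing `np.tanh` by a nonlinearity that is not odd generally breaks the equality unless all signs are $+1$, which is why the sign freedom is tied to the oddness of $\tanh$.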

We note that Fefferman distilled the precise conditions of Assumptions 1 from his proof technique, in order to define a class of neural networks that is, on the one hand, sufficiently small to guarantee identifiability, and on the other hand, sufficiently large to encompass “generic” networks. Indeed, if we consider the network weights and biases $(W^{1},\theta^{1},\dots,W^{L},\theta^{L})$ as elements of the space $\mathbb{R}^{D_{1}\times D_{0}}\times\mathbb{R}^{D_{1}}\times\dots\times\mathbb{R}^{D_{L}\times D_{L-1}}\times\mathbb{R}^{D_{L}}$ , then Assumptions 1 rule out only a set of measure zero. In the contemporary practical machine learning literature, however, a network satisfying Assumptions 1 would hardly be considered generic, as Part (i) of Assumptions 1 implies that all biases are nonzero, and Part (ii) imposes full connectivity throughout the network.

Indeed, Fefferman remarks explicitly that it would be interesting to replace Assumptions 1 with minimal hypotheses, and to study nonlinearities other than $\tanh$ . The present paper aims to address these two issues. Characterizing the fundamental nature of conditions necessary for identifiability with respect to a fixed nonlinearity, even a simple one such as $\tanh$ , is likely a rather formidable task. In fact, the minimal identifiability conditions may generally depend on “fine” properties of the nonlinearity under consideration, and it is hence unclear how much insight can be obtained by having conditions that are specific to a given nonlinearity. We will thus be interested in an identification result with very mild conditions on the weights and biases of the neural networks to be identified, while still accommodating a broad class of nonlinearities.

II Contributions

We begin with two motivating examples. These lead up to the statements of our main contributions, whose corresponding proofs are developed in the remainder of the paper. We consider nonlinearities $\rho$ that are not necessarily odd (as $\tanh$ is), and thus need an equivalence relation which dispenses with sign changes.

Definition 4 (Neural network isomorphism).

We say that the neural networks $\mathcal{N}$ and $\widetilde{\mathcal{N}}$ are isomorphic, and write $\mathcal{N}\simeq\widetilde{\mathcal{N}}$ , if

(i) $L=\widetilde{L}$ and $(D_{0},D_{1},\dots,D_{L})=(\widetilde{D}_{0},\widetilde{D}_{1},\dots,\widetilde{D}_{L})$, and

(ii) there exist permutations $\gamma_{\ell}:\{1,\dots,D_{\ell}\}\to\{1,\dots,D_{\ell}\}$ such that

– $\gamma_{\ell}$ is the identity permutation for $\ell=0$ and $\ell=L$, and

– for all $\ell\in\{1,\dots,L\}$, $k\in\{1,\dots,D_{\ell-1}\}$, and $j\in\{1,\dots,D_{\ell}\}$,

$$\widetilde{W}^{\ell}_{jk}=W^{\ell}_{\gamma_{\ell}(j)\gamma_{\ell-1}(k)},\quad\text{and}\quad\widetilde{\theta}^{\ell}_{j}=\theta^{\ell}_{\gamma_{\ell}(j)}.$$

In the remainder of the paper we will work exclusively with isomorphisms in the sense of Definition 4. Note that any two isomorphic networks give rise to the same map with respect to any nonlinearity $\rho$, and thus $\simeq$ is an equivalence relation compatible with any pair $(\mathscr{N},\rho)$. The requirement that $\gamma_{\ell}$ be the identity map for $\ell\in\{0,L\}$ in the previous definition again corresponds to the fact that the inputs and the outputs of a neural network are not generally interchangeable. Indeed, suppose that $\langle{\mathcal{N}}\rangle^{\rho}:\mathbb{R}^{2}\to\mathbb{R}^{2}$, $\langle{\mathcal{N}}\rangle^{\rho}(x,y)=(x,2y)$, is the map of a neural network with respect to some nonlinearity $\rho$. Let $\mathcal{N}_{1}$, $\mathcal{N}_{2}$, and $\mathcal{N}_{3}$ be the networks obtained from $\mathcal{N}$ by interchanging the inputs of $\mathcal{N}$, the outputs of $\mathcal{N}$, and both inputs and outputs, respectively. Then $\langle{\mathcal{N}_{1}}\rangle^{\rho}(x,y)=(y,2x)$, $\langle{\mathcal{N}_{2}}\rangle^{\rho}(x,y)=(2y,x)$, and $\langle{\mathcal{N}_{3}}\rangle^{\rho}(x,y)=(2x,y)$ are, indeed, distinct functions.

We now give an example that Fefferman uses to motivate the necessity of restricting the class of all neural networks $\mathscr{N}^{D_{in},D_{out}}$ to a smaller class in order to obtain identifiability up to an equivalence relation. In Fefferman’s case, the equivalence relation is $\sim_{\pm}$, but the example is equally pertinent to the relation $\simeq$. Suppose that $\mathcal{N}$ is a neural network with $L\geq 2$, and $\ell_{0},j_{1},j_{2}$ with $1\leq\ell_{0}\leq L-1$ and $1\leq j_{1}<j_{2}\leq D_{\ell_{0}}$ are such that $\theta^{\ell_{0}}_{j_{1}}=\theta^{\ell_{0}}_{j_{2}}$ and $W^{\ell_{0}}_{j_{1}k}=W^{\ell_{0}}_{j_{2}k}$, for all $k$. Then, if $\widetilde{\mathcal{N}}$ is obtained from $\mathcal{N}$ by replacing $W_{1j_{1}}^{\ell_{0}+1}$ and $W_{1j_{2}}^{\ell_{0}+1}$ with an arbitrary pair of numbers $\widetilde{W}_{1j_{1}}^{\ell_{0}+1}$ and $\widetilde{W}_{1j_{2}}^{\ell_{0}+1}$ such that $W_{1j_{1}}^{\ell_{0}+1}+W_{1j_{2}}^{\ell_{0}+1}=\widetilde{W}_{1j_{1}}^{\ell_{0}+1}+\widetilde{W}_{1j_{2}}^{\ell_{0}+1}$, we have $\langle{\widetilde{\mathcal{N}}}\rangle^{\rho}=\langle{\mathcal{N}}\rangle^{\rho}$, for any $\rho$. This example motivates the following definition.
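The weight-redistribution example above can be reproduced in a few lines. This is a small sketch under stated assumptions: the layout $(2,3,1)$, the logistic nonlinearity, the value of `shift`, and the helper `realize` are illustrative choices, not part of the paper. Two hidden nodes share the same bias and incoming weights, and their outgoing weights are changed while keeping the sum fixed.

```python
import numpy as np

def realize(weights, biases, rho, x):
    """Evaluate the map realized by the network under rho."""
    a = np.asarray(x, dtype=float)
    for W, theta in zip(weights, biases):
        a = rho(W @ a + theta)
    return a

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)

# Hidden nodes 0 and 1 share bias and incoming weights (0-based indexing).
W1 = rng.standard_normal((3, 2))
W1[1] = W1[0]
t1 = rng.standard_normal(3)
t1[1] = t1[0]
W2 = rng.standard_normal((1, 3))
t2 = rng.standard_normal(1)

# Redistribute the outgoing weights of the two nodes, keeping their sum fixed.
shift = 0.7  # arbitrary illustrative value
W2_alt = W2.copy()
W2_alt[0, 0] += shift
W2_alt[0, 1] -= shift

xs = rng.standard_normal((100, 2))
gap = max(
    np.max(np.abs(realize([W1, W2], [t1, t2], logistic, x)
                  - realize([W1, W2_alt], [t1, t2], logistic, x)))
    for x in xs
)
```

The two weight configurations differ, yet `gap` is zero up to rounding, and the same holds for any other choice of nonlinearity in place of `logistic`.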

Definition 5 (No-clones condition).

Let $\mathcal{N}$ be a neural network as in Definition 1. We say that $\mathcal{N}$ has a clone pair if there exist $\ell\in\{1,\dots,L\}$ and $j,j^{\prime}\in\{1,\dots,D_{\ell}\}$ with $j\neq j^{\prime}$ such that

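$$(\theta^{\ell}_{j},W^{\ell}_{j1},\dots,W^{\ell}_{jD_{\ell-1}})=(\theta^{\ell}_{j^{\prime}},W^{\ell}_{j^{\prime}1},\dots,W^{\ell}_{j^{\prime}D_{\ell-1}}).$$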

If $\mathcal{N}$ does not have a clone pair, we say that $\mathcal{N}$ satisfies the no-clones condition.
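Definition 5 amounts to a row-duplicate test on the augmented matrices $(\theta^{\ell}\;|\;W^{\ell})$. A possible check is sketched below; the function name `has_clone_pair` and the toy values are assumptions made for illustration, and the comparison is by exact equality of floating-point entries, which is what the definition literally asks for.

```python
import numpy as np

def has_clone_pair(weights, biases):
    """Return True if some layer contains two nodes j != j' with identical
    (theta_j, W_j1, ..., W_jD), i.e. a clone pair."""
    for W, theta in zip(weights, biases):
        rows = np.column_stack([theta, W])  # row j is (theta_j, W_j1, ..., W_jD)
        if len(np.unique(rows, axis=0)) < len(rows):
            return True
    return False

# Toy single-layer example: rows 0 and 1 of W coincide.
W = np.array([[1.0, 2.0], [1.0, 2.0], [0.5, -1.0]])
theta_clone = np.array([0.3, 0.3, 0.0])  # equal biases -> clone pair
theta_ok = np.array([0.3, 0.4, 0.0])     # distinct biases -> no clones

clone_result = has_clone_pair([W], [theta_clone])
ok_result = has_clone_pair([W], [theta_ok])
```

In this toy case `clone_result` is `True` and `ok_result` is `False`: equal incoming weights alone do not make two nodes clones, since the biases must agree as well.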

As the nonlinearity $\rho$ in the example above is completely arbitrary, the no-clones condition is necessary to have any hope of obtaining identifiability up to $\simeq$ . Hence, with our program in mind, given positive integers $D_{in}$ and $D_{out}$ , we define

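$$\mathscr{N}^{D_{in},D_{out}}_{nc}=\{\mathcal{N}\in\mathscr{N}^{D_{in},D_{out}}:\mathcal{N}\text{ satisfies the no-clones condition}\},$$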

and seek nonlinearities $\rho$ such that $(\mathscr{N}^{D_{in},D_{out}}_{nc},\rho)$ is identifiable up to $\simeq$ . As any class strictly containing $\mathscr{N}^{D_{in},D_{out}}_{nc}$ , paired with any nonlinearity, fails identifiability up to $\simeq$ , the no-clones condition furnishes a canonical minimal assumption for identifiability up to $\simeq$ . Similarly to $\mathscr{N}^{D_{in},D_{out}}_{A1}$ , the class $\mathscr{N}^{D_{in},D_{out}}_{nc}$ , paired with any measurable nonlinearity $\rho$ such that $\displaystyle\lim_{x\to\infty}\rho(x)$ and $\displaystyle\lim_{x\to-\infty}\rho(x)$ exist and are not equal, satisfies the universal approximation property in the sense of Hornik [14] and Cybenko [15]. The following example demonstrates that insisting on the no-clones condition as the only assumption on the weights, biases, and layout will necessarily come at the cost of restricting the class of nonlinearities that allow for identifiability. Let $\rho(x)=\min\{1,\max\{0,x\}\}$ be the clipped rectified linear unit (ReLU) function. Note that

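$$\rho\big(\rho(x)-\tfrac{1}{2}\rho(2x)-\tfrac{1}{2}\rho(2x-1)+0\big)=0,\quad\text{for all }x\in\mathbb{R}.$$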

Now, given an arbitrary neural network $\mathcal{N}=(W^{1},\theta^{1},W^{2},\theta^{2},\dots,W^{L},\theta^{L})$ with $D_{L}=1$ satisfying the no-clones condition, the network

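$$\mathcal{N}_{0}=\left(W^{1},\theta^{1},W^{2},\theta^{2},\dots,W^{L},\theta^{L},\Big(\begin{smallmatrix}1\\ 2\\ 2\end{smallmatrix}\Big),\Big(\begin{smallmatrix}0\\ 0\\ -1\end{smallmatrix}\Big),\big(1\;\,-\tfrac{1}{2}\;\,-\tfrac{1}{2}\big),0\right)$$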

also satisfies the no-clones condition, and yields the identically-zero output, i.e., $\langle{\mathcal{N}_{0}}\rangle^{\rho}\equiv 0$. We have thus constructed an infinite collection of distinct networks satisfying the no-clones condition and all yielding the identically-zero map. The class of networks realizing the identically-zero map therefore contains networks of different depths and layouts, and thus identifiability up to $\simeq$ fails. This leads to the conclusion that a uniqueness result for neural networks with the clipped ReLU nonlinearity would need to encompass genericity conditions more stringent than the no-clones condition. Nonetheless, we are able to construct a class of real meromorphic nonlinearities $\sigma$ yielding identifiability without any assumptions on the neural networks beyond the no-clones condition, and which is large enough to uniformly approximate any piecewise $C^{1}$ nonlinearity $\rho$ with $\rho^{\prime}\in BV(\mathbb{R})$, where

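$$BV(\mathbb{R})=\Bigg\{f\in L^{1}(\mathbb{R}):\|f\|_{BV(\mathbb{R})}:=\sup_{\substack{\varphi\in C_{c}^{1}(\mathbb{R})\\ \|\varphi\|_{L^{\infty}(\mathbb{R})}\leq 1}}\int_{\mathbb{R}}f(x)\varphi^{\prime}(x)\,\mathrm{d}x<\infty\Bigg\}$$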

is the space of functions of bounded variation on $\mathbb{R}$ .
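Returning to the clipped ReLU example, both the scalar identity $\rho(x)-\tfrac{1}{2}\rho(2x)-\tfrac{1}{2}\rho(2x-1)=0$ and the vanishing of the extended network $\mathcal{N}_{0}$ can be checked numerically. The sketch below assumes a hypothetical base network with layout $(2,3,1)$ and random weights; the helper names are illustrative. It appends the two extra layers $\big((1,2,2)^{T},(0,0,-1)^{T}\big)$ and $\big((1,-\tfrac{1}{2},-\tfrac{1}{2}),0\big)$ to the base network.

```python
import numpy as np

def clipped_relu(z):
    """rho(x) = min{1, max{0, x}}, applied componentwise."""
    return np.minimum(1.0, np.maximum(0.0, z))

def realize(weights, biases, rho, x):
    """Evaluate the map realized by the network under rho."""
    a = np.asarray(x, dtype=float)
    for W, theta in zip(weights, biases):
        a = rho(W @ a + theta)
    return a

# 1) The scalar identity on a grid covering all linear pieces of rho.
xs = np.linspace(-3.0, 4.0, 7001)
inner = clipped_relu(xs) - 0.5 * clipped_relu(2 * xs) - 0.5 * clipped_relu(2 * xs - 1)
identity_gap = np.max(np.abs(inner))

# 2) Extend an arbitrary single-output network N to N_0 by two extra layers.
rng = np.random.default_rng(3)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal(3), rng.standard_normal(1)]
weights0 = weights + [np.array([[1.0], [2.0], [2.0]]),
                      np.array([[1.0, -0.5, -0.5]])]
biases0 = biases + [np.array([0.0, 0.0, -1.0]), np.array([0.0])]

samples = rng.standard_normal((200, 2))
outputs = np.array([realize(weights0, biases0, clipped_relu, x)[0] for x in samples])
```

Both `identity_gap` and the entries of `outputs` are zero up to rounding, regardless of which base network is chosen. This is the mechanism by which networks of different depths realize the same map under the clipped ReLU.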

Concretely, we have the following main result of this paper.

Theorem 1 (Uniqueness Theorem).

Let $D_{in}$ and $D_{out}$ be arbitrary positive integers. Furthermore, let $\rho$ be a piecewise $C^{1}$ function with $\rho^{\prime}\in BV(\mathbb{R})$ and let $\epsilon>0$ . Then there exists a meromorphic function $\sigma:\mathcal{D}\to\mathbb{C}$ , $\mathcal{D}\supset\mathbb{R}$ , $\sigma(\mathbb{R})\subset\mathbb{R}$ such that $\|\rho-\sigma\|_{L^{\infty}(\mathbb{R})}<\epsilon$ and $(\mathscr{N}^{D_{in},D_{out}}_{nc},\sigma)$ is identifiable up to $\simeq$ .

We note that, having fixed the input and output dimensions $D_{in}$ and $D_{out}$, the depths and the layouts of the networks in $\mathscr{N}^{D_{in},D_{out}}_{nc}$ are completely arbitrary. Examples of nonlinearities $\rho(x)$ covered by Theorem 1 include many sigmoidal functions such as the aforementioned clipped ReLU, the logistic function $\frac{1}{1+e^{-x}}$, the hyperbolic tangent $\tanh(x)$, the inverse tangent $\arctan(x)$, the softsign function $\frac{x}{1+|x|}$, the inverse square root unit $\frac{x}{\sqrt{1+ax^{2}}}$, the clipped identity $\frac{x}{\max\{1,|x|/a\}}$, and the soft clipping function $\frac{1}{a}\log\frac{1+e^{ax}}{1+e^{a(x-1)}}$, where $a>0$ is fixed in the last two cases. Unbounded nonlinearities such as the ReLU are not covered. The nonlinearities $\sigma$ for which we have identifiability, unfortunately, need to be constructed, and, at the present time, we do not have an identification result for arbitrary given $\sigma$.

Furthermore, we remark that the statement of Theorem 1 is “not continuous” in the approximation error $\epsilon$. Indeed, while the clipped ReLU function satisfies the conditions of Theorem 1, as shown in the example above, there exist non-isomorphic networks $\mathcal{N}_{0}$ and $\widetilde{\mathcal{N}}_{0}$ satisfying the no-clones condition and $\langle{\mathcal{N}_{0}}\rangle^{\rho}(x)=0=\langle{\widetilde{\mathcal{N}}_{0}}\rangle^{\rho}(x)$, for all $x\in\mathbb{R}^{D_{0}}$, where $\rho$ is the clipped ReLU function. We will see that Theorem 1 is, in fact, a consequence of the following result, which states that the maps realized by pairwise non-isomorphic networks with $D_{L}=1$, under a nonlinearity $\sigma$ according to Theorem 1, are linearly independent functions $\mathbb{R}^{D_{0}}\to\mathbb{R}$.

Theorem 2 (Linear Independence Theorem).

Let $D_{in}$ be an arbitrary positive integer, let $\rho$ be a piecewise $C^{1}$ function with ${\rho^{\prime}\in BV(\mathbb{R})}$, and let $\epsilon>0$. Then there exists a meromorphic function $\sigma:\mathcal{D}\to\mathbb{C}$, $\mathcal{D}\supset\mathbb{R}$, $\sigma(\mathbb{R})\subset\mathbb{R}$ such that $\|\rho-\sigma\|_{L^{\infty}(\mathbb{R})}<\epsilon$ with the following property: Suppose that $\mathcal{N}_{j}$, $j=1,2,\dots,n$, are pairwise non-isomorphic (in the sense of $\simeq$) neural networks in $\mathscr{N}^{D_{in},1}_{nc}$. Then, $\{\langle{\mathcal{N}_{j}}\rangle^{\sigma}\}_{j=1}^{n}\cup\{\bm{1}\}$ is a linearly independent set of functions $\mathbb{R}^{D_{0}}\to\mathbb{R}$, where $\bm{1}$ denotes the constant function taking on the value 1.

Remark.

The function $\bm{1}$ is included in the linearly independent set both for the sake of greater generality of the statement, and to facilitate the proof of Theorem 2.

Unfortunately, Theorem 2 does not generalize to multiple outputs $D_{out}>1$, as shown by the following example: Fix an arbitrary network $\mathcal{N}$ according to Definition 1 such that $L\geq 2$, $D_{L}=4$, $\theta^{L}=\bm{0}$, and $\mathcal{N}$ satisfies the no-clones condition. Define $U^{m}\in\mathbb{R}^{2\times D_{L-1}}$, $m\in\{1,2,3,4\}$, as the submatrices of $W^{L}$ consisting of the rows $1$ and $3$, $1$ and $4$, $2$ and $4$, and $2$ and $3$, respectively. Furthermore, define the networks

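$$\mathcal{N}_{m}:=(D_{0},D_{1},\dots,D_{L-1},2;\,W^{1},\theta^{1},W^{2},\theta^{2},\dots,W^{L-1},\theta^{L-1},U^{m},\bm{0}),$$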

for $m\in\{1,2,3,4\}$ . As $\mathcal{N}$ satisfies the no-clones condition, the networks $\mathcal{N}_{m}$ , $m\in\{1,2,3,4\}$ , also satisfy the no-clones condition, and are pairwise non-isomorphic.

Now, let $\rho$ be an arbitrary nonlinearity, and write $\langle{\mathcal{N}}\rangle^{\rho}=(f_{1},f_{2},f_{3},f_{4})$ , where $f_{m}:\mathbb{R}^{D_{0}}\to\mathbb{R}$ , $m\in\{1,2,3,4\}$ . Then

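$$\langle\mathcal{N}_{1}\rangle^{\rho}=(f_{1},f_{3}),\quad\langle\mathcal{N}_{2}\rangle^{\rho}=(f_{1},f_{4}),\quad\langle\mathcal{N}_{3}\rangle^{\rho}=(f_{2},f_{4}),\quad\text{and}\quad\langle\mathcal{N}_{4}\rangle^{\rho}=(f_{2},f_{3}),$$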

and so

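$$\langle\mathcal{N}_{1}\rangle^{\rho}-\langle\mathcal{N}_{2}\rangle^{\rho}+\langle\mathcal{N}_{3}\rangle^{\rho}-\langle\mathcal{N}_{4}\rangle^{\rho}=\begin{pmatrix}0+f_{1}-f_{1}+f_{2}-f_{2}\\ 0+f_{3}-f_{4}+f_{4}-f_{3}\end{pmatrix}=0.$$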
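The cancellation among the four two-output networks can be observed numerically. In the sketch below, the base layout $(2,3,4)$, the random weights, and the use of $\tanh$ are illustrative assumptions (the argument itself holds for any nonlinearity). The row pairs are written in 0-based indexing, so rows $(1,3)$, $(1,4)$, $(2,4)$, $(2,3)$ become `(0, 2)`, `(0, 3)`, `(1, 3)`, `(1, 2)`.

```python
import numpy as np

def realize(weights, biases, rho, x):
    """Evaluate the map realized by the network under rho."""
    a = np.asarray(x, dtype=float)
    for W, theta in zip(weights, biases):
        a = rho(W @ a + theta)
    return a

rng = np.random.default_rng(4)
W1, t1 = rng.standard_normal((3, 2)), rng.standard_normal(3)
WL = rng.standard_normal((4, 3))  # last layer of N, with zero bias theta^L

row_pairs = [(0, 2), (0, 3), (1, 3), (1, 2)]  # rows defining U^1, ..., U^4
coeffs = [1.0, -1.0, 1.0, -1.0]               # N_1 - N_2 + N_3 - N_4

def combination(x):
    """Evaluate the alternating sum of the four realized maps at x."""
    total = np.zeros(2)
    for c, rows in zip(coeffs, row_pairs):
        U = WL[list(rows)]  # 2 x D_{L-1} submatrix of W^L
        total += c * realize([W1, U], [t1, np.zeros(2)], np.tanh, x)
    return total

samples = rng.standard_normal((100, 2))
dependence_gap = max(np.max(np.abs(combination(x))) for x in samples)
```

The value of `dependence_gap` is zero up to rounding even though the four submatrices, and hence the four networks, are distinct.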

The set $\{\langle{\mathcal{N}_{m}}\rangle^{\rho}\}_{m=1}^{4}$ is hence linearly dependent, showing that Theorem 2 cannot be generalized to multiple outputs by replacing $\mathscr{N}_{nc}^{D_{in},1}$ with $\mathscr{N}_{nc}^{D_{in},D_{out}}$.

We now provide an overview of the proofs of Theorems 1 and 2. The proof of Theorem 1 is by way of contradiction with Theorem 2. Specifically, assume that $D_{in}$, $D_{out}$, $\rho$, and $\epsilon>0$ are as in the statement of Theorem 1, and let $\sigma$ be a nonlinearity satisfying the conclusion of Theorem 2 with these $D_{in}$, $\rho$, and $\epsilon$. For a network $\mathcal{N}\in\mathscr{N}_{nc}^{D_{in},D_{out}}$, we write the map $\langle{\mathcal{N}}\rangle^{\sigma}=\left((\langle{\mathcal{N}}\rangle^{\sigma})_{1},\dots,(\langle{\mathcal{N}}\rangle^{\sigma})_{D_{out}}\right)$ in terms of the coordinate functions $(\langle{\mathcal{N}}\rangle^{\sigma})_{j}:\mathbb{R}^{D_{in}}\to\mathbb{R}$, $j\in\{1,\dots,D_{out}\}$. Now, let $\mathcal{N}_{1},\mathcal{N}_{2}\in\mathscr{N}_{nc}^{D_{in},D_{out}}$ be networks such that $\langle{\mathcal{N}_{1}}\rangle^{\sigma}(x)=\langle{\mathcal{N}_{2}}\rangle^{\sigma}(x)$, for all $x\in\mathbb{R}^{D_{in}}$, and suppose by way of contradiction that they are non-isomorphic. We construct a network $\mathcal{M}$ containing both $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ as subnetworks (a precise definition of “subnetwork” is given in Section III, Definition 9). It follows that $\mathcal{M}$ contains subnetworks $\mathcal{M}_{m,j}\in\mathscr{N}^{D_{in},1}_{nc}$ with maps satisfying $\langle{\mathcal{M}_{m,j}}\rangle^{\sigma}={(\langle{\mathcal{N}_{m}}\rangle^{\sigma})}_{j}$, for $m\in\{1,2\}$ and $j\in\{1,\dots,D_{out}\}$. We then show that, as a consequence of $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ being non-isomorphic, there exists a $j\in\{1,\dots,D_{out}\}$ such that $\mathcal{M}_{1,j}$ and $\mathcal{M}_{2,j}$ are non-isomorphic. But then

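$$0\cdot\bm{1}+\langle\mathcal{M}_{1,j}\rangle^{\sigma}-\langle\mathcal{M}_{2,j}\rangle^{\sigma}=(\langle\mathcal{N}_{1}\rangle^{\sigma})_{j}-(\langle\mathcal{N}_{2}\rangle^{\sigma})_{j}=0,$$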

which stands in contradiction to Theorem 2. This completes the proof of Theorem 1.

The proof of Theorem 2 is significantly more involved, as it requires extensive “fine tuning” of the function $\sigma$ . Let $\sigma:\mathcal{D}\to\mathbb{C}$ be as in the statement of Theorem 2. In addition to the properties stated in Theorem 2, the function $\sigma$ we construct exhibits the following convenient structural properties:

1) the domain $\mathcal{D}\subset\mathbb{C}$ of $\sigma$ is the complement of an (infinite) discrete set of poles,

2) $\sigma$ is $i$-periodic, i.e., $\sigma(z+i)=\sigma(z)$, for all $z\in\mathcal{D}$, and

3) for any network $\mathcal{N}\in\mathscr{N}^{1,1}$, the natural domain $\mathcal{D}_{\langle{\mathcal{N}}\rangle^{\sigma}}\subset\mathbb{C}$ of $\langle{\mathcal{N}}\rangle^{\sigma}$, viewed as a holomorphic function, is the complement of a closed countable subset of $\mathbb{C}$, and therefore a connected open set.

These three properties are all satisfied by the function $\tanh(\pi\,\cdot)$, and are essentially the key insight leading to Fefferman’s identifiability result in [3], which establishes that, under the genericity conditions stated in Assumptions 1, a neural network can be read off from the asymptotic (as the imaginary part of the argument tends to infinity) locations of the singularities of the map it realizes under the $\tanh$ nonlinearity. The properties 1) – 3) will be key to our results as well, but instead of studying the set of singularities of the map in its own right, our proof of Theorem 2 will proceed by contradiction. The proof consists of three steps that we call amalgamation, input splitting, and input anchoring, and involves the use of analytic continuation, graph-theoretic constructions, and Kronecker’s theorem [16], the latter two of which are novel tools in this context and mark a significant departure from Fefferman’s proof technique in [3].

We now briefly describe the proof of Theorem 2 according to the aforementioned program. Suppose that $\mathcal{N}_{1},\dots,\mathcal{N}_{n}$ are pairwise non-isomorphic neural networks satisfying the no-clones condition. For the sake of simplicity of this informal discussion, we assume that $L_{1}=L_{2}=\dots=L_{n}$, $D_{0}^{1}=D_{0}^{2}=\dots=D_{0}^{n}=1$, and $D_{L_{1}}^{1}=D_{L_{2}}^{2}=\dots=D_{L_{n}}^{n}=1$. By way of contradiction, we suppose that there exist scalars $\lambda_{0},\lambda_{1},\dots,\lambda_{n}$, not all zero, such that $\lambda_{0}\bm{1}(x)+\sum_{j=1}^{n}\lambda_{j}\langle{\mathcal{N}_{j}}\rangle^{\sigma}(x)=0$, for all $x\in\mathbb{R}$.

Amalgamation. In Section III we construct a neural network $\mathcal{M}\in\mathscr{N}^{1,n}_{nc}$, called the amalgam of $\{\mathcal{N}_{j}\}_{j=1}^{n}$, containing each $\mathcal{N}_{j}$ as a subnetwork. In particular, we have ${(\langle{\mathcal{M}}\rangle^{\sigma})}_{j}=\langle{\mathcal{N}_{j}}\rangle^{\sigma}$, for all $j\in\{1,\dots,n\}$. The linear dependence of $\{\langle{\mathcal{N}_{j}}\rangle^{\sigma}\}_{j=1}^{n}\cup\{\bm{1}\}$ thus translates to

$$\lambda_{0}+\sum_{j=1}^{n}\lambda_{j}\,{(\langle{\mathcal{M}}\rangle^{\sigma})}_{j}(z)=0,\tag{1}$$

for all $z\in\mathbb{R}$ . By our construction of $\sigma$ , the natural domains $\mathcal{D}_{\langle{\mathcal{N}_{j}}\rangle^{\sigma}}=\mathcal{D}_{(\langle{\mathcal{M}}\rangle^{\sigma})_{j}}$ are complements of closed countable sets, and hence, by analytic continuation, (1) is valid for all $z\in\bigcap_{j=1}^{n}\mathcal{D}_{\langle{\mathcal{N}_{j}}\rangle^{\sigma}}$ . Now define $\mathscr{M}$ to be the set of all neural networks in $\bigcup_{m=1}^{n}\mathscr{N}_{nc}^{1,m}$ with linear dependency as in (1) between the output functions and the constant function. Note that $\mathscr{M}$ is nonempty, simply as $\mathcal{M}\in\mathscr{M}$ . We then fix a network $\mathcal{M}^{\prime}\in\mathscr{M}$ of minimum size (the precise definition of size will be given in the proof of Theorem 4). Write $(1,D^{\mathcal{M^{\prime}}}_{1},\dots,D^{\mathcal{M^{\prime}}}_{m})$ for the layout of $\mathcal{M^{\prime}}$ , and let $(\omega_{1},\dots,\omega_{D^{\mathcal{M^{\prime}}}_{1}})$ be the weights of the first layer of $\mathcal{M^{\prime}}$ (i.e., the entries of $W^{1}$ according to Definition 1). At this point the proof splits into two cases, depending on whether there exist $j,j^{\prime}\in\{1,\dots,{D^{\mathcal{M^{\prime}}}_{1}}\}$ , $j\neq j^{\prime}$ , such that $\omega_{j}/\omega_{j^{\prime}}$ is irrational.

Input splitting, the easy case. Provided there do exist such $j$ and $j^{\prime}$, we use Kronecker’s theorem [16] and the properties 1) – 3) of $\sigma$ to construct a network $\mathcal{M}^{\prime\prime}\in\mathscr{M}$ with layout $(k,D^{\mathcal{M^{\prime}}}_{1},\dots,D^{\mathcal{M^{\prime}}}_{m})$, for some $k\in\{2,\dots,D^{\mathcal{M^{\prime}}}_{1}\}$, and first-layer weights $\widetilde{W}^{1}\in\mathbb{R}^{D^{\mathcal{M^{\prime}}}_{1}\times k}$ such that the first $k$ rows of $\widetilde{W}^{1}$ form a $k\times k$ identity matrix.

Input anchoring. We then construct a third network $\mathcal{N}\in\mathscr{M}$ , obtained by fixing $k-1$ of the $k$ inputs of $\mathcal{M}^{\prime\prime}$ to specific real numbers, and “cutting out” all the parts of the network whose contributions to the output map have become constant in the process. The resulting network $\mathcal{N}$ will be a network in $\mathscr{M}$ of size smaller than $\mathcal{M}^{\prime}$ , which contradicts the minimality of $\mathcal{M}^{\prime}$ , and thereby completes the proof.

Input splitting, the hard case. If, however, all the ratios $\omega_{j}/\omega_{j^{\prime}}$ , $j\neq j^{\prime}$ are rational, the input splitting construction described above cannot be carried out. This problem will be remedied by further refining our initial construction of $\sigma$ . Specifically, we will ensure that the real parts of the poles of $\sigma$ form a subset of $\mathbb{R}$ satisfying what we call the self-avoiding property, to be introduced in Section V. This will enable an alternative construction of a network $\mathcal{M}^{\prime\prime}$ with at least two inputs. The resulting $\mathcal{M}^{\prime\prime}$ will, however, not be a neural network in the sense of Definition 1, but rather a generalized network in the sense of Definition 8, to be introduced in Section III.

Input anchoring. Finally, we apply an input anchoring procedure to $\mathcal{M}^{\prime\prime}$ similar to the one described above. Even though now $\mathcal{M}^{\prime\prime}$ is not a network in the sense of Definition 1, the input anchoring procedure will result in a network $\mathcal{N}\in\mathscr{M}$ which is a network in the sense of Definition 1, and is of smaller size than $\mathcal{M}^{\prime}$ , again completing the proof by contradiction.

We conclude this section by laying out the organization of the remainder of the paper. In Section III we develop a graph-theoretic framework needed to define amalgams of neural networks and several other technical concepts. In Section IV we state results from complex analysis and Kronecker’s theorem needed in arguments involving analytic continuation and input splitting, respectively. The proofs of these results are relegated to the Appendix. In Section V we discuss the fine structural properties of the function $\sigma$ constructed in the proof of Theorem 2. Finally, Section VI contains the proofs of our two main results.

III Directed acyclic graphs, general neural networks, and neural network amalgams

As already mentioned, in the proof of Theorem 2 we will work with a form of neural networks that does not fit in with Definitions 1 and 2. In order to accommodate this notion of neural networks, and to lighten the manipulations needed to formalize the aforementioned techniques of amalgamation and input anchoring, we introduce a graph-theoretic framework.

We start by introducing the concept of a directed acyclic graph (DAG), commonly encountered in the graph theory literature [17].

Definition 6 (Directed acyclic graph).

–

A directed graph is an ordered pair $G=(V,E)$ where $V$ is a finite set of nodes, and $E\subset V\times V$ is a set of directed edges.

–

A directed cycle of a directed graph $G$ is a set $\{v_{1},\dots,v_{k}\}\subset V$ such that, for every $j\in\{1,\dots,k\}$ , $(v_{j},v_{j+1})\in E$ , where we set $v_{k+1}\vcentcolon=v_{1}$ .

–

A directed graph $G$ is said to be a directed acyclic graph (DAG) if it has no directed cycles.

We interpret an edge $(v,\widetilde{v})$ as an arrow connecting the nodes $v$ and $\widetilde{v}$ and pointing at $\widetilde{v}$ .

Definition 7 (Parent set, input nodes, and node level).

Let $G=(V,E)$ be a DAG.

–

We define the parent set of a node by $\mathrm{par}(v)=\{\widetilde{v}:(\widetilde{v},v)\in E\}$ .

–

We say that $v\in V$ is an input node if $\mathrm{par}(v)=\varnothing$ , and we write $\mathrm{In}(G)$ for the set of input nodes.

–

We define the level $\mathrm{lv}(v)$ of a node $v\in V$ recursively as follows. If $\mathrm{par}(v)=\varnothing$ , we set $\mathrm{lv}(v)=0$ . If $\mathrm{par}(v)=\{v_{1},v_{2},\dots,v_{k}\}$ and $\mathrm{lv}(v_{1}),\mathrm{lv}(v_{2}),\dots,\mathrm{lv}(v_{k})$ are defined, we set $\mathrm{lv}(v)=\max\{\mathrm{lv}(v_{1}),\mathrm{lv}(v_{2}),\dots,\mathrm{lv}(v_{k})\}+1$ .
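The recursion in Definition 7 can be mirrored by a memoized traversal. The following Python sketch is purely illustrative and not part of the formal development; the function name `node_levels` and the three-node example graph are hypothetical. Edges are stored as (parent, child) pairs, in line with the convention that $(v,\widetilde{v})$ points at $\widetilde{v}$.

```python
def node_levels(nodes, edges):
    """Return {v: lv(v)} for a DAG given by its nodes and (parent, child) edges."""
    parents = {v: set() for v in nodes}
    for u, v in edges:
        parents[v].add(u)

    levels = {}

    def lv(v, stack=()):
        # A node reappearing on the current path means the graph is not acyclic.
        if v in stack:
            raise ValueError("graph has a directed cycle")
        if v not in levels:
            ps = parents[v]
            # Input nodes (empty parent set) get level 0; otherwise max parent level + 1.
            levels[v] = 0 if not ps else 1 + max(lv(u, stack + (v,)) for u in ps)
        return levels[v]

    for v in nodes:
        lv(v)
    return levels


# Hypothetical example: x feeds h, and both x and h feed y (a "skip" edge).
nodes = ["x", "h", "y"]
edges = {("x", "h"), ("h", "y"), ("x", "y")}
levels = node_levels(nodes, edges)
```

In this example the node `y` has parents of levels 0 and 1, so its level is determined by the longest directed path from an input node.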

Since the graph $G$ in Definition 7 is assumed to be acyclic, the level is well-defined for all nodes of $G$ . We are now ready to introduce our generalized definition of a neural network.

Definition 8.

A general feed-forward neural network (GFNN) is an ordered sextuple $\mathcal{N}=(V,E,V_{in},\allowbreak V_{out},\Omega,\Theta)$ , where

–

$G=(V,E)$ is a DAG, called the architecture of $\mathcal{N}$ ,

–

$V_{in}=\mathrm{In}(G)$ is the set of inputs of $\mathcal{N}$ ,

–

$V_{out}\subset V\setminus V_{in}$ is the set of outputs of $\mathcal{N}$ ,

–

$\Omega=\{\omega_{\widetilde{v}v}\in\mathbb{R}\setminus\{0\}:(v,\widetilde{v})\in E\}$ is the set of weights of $\mathcal{N}$ , and

–

$\Theta=\{\theta_{v}\in\mathbb{R}:v\in V\setminus V_{in}\}$ is the set of biases of $\mathcal{N}$ .

The depth of a GFNN is defined as $L(\mathcal{N})=\max\{\mathrm{lv}(v):v\in V\}$ .

When translating from Definition 1 to Definition 8, we will interpret a zero weight $W_{jk}^{\ell}=0$ simply as the absence of a directed edge between the nodes concerned, hence we do not allow the edges of a GFNN to have zero weight. If $V^{1}$ and $V^{2}$ are the sets of nodes of GFNNs $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ , respectively, and $v\in V^{1}\cap V^{2}$ , we will say that $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ share the node $v$ . When dealing with several networks sharing a node $v$ , we will write $\mathrm{par}_{\mathcal{N}}(v)$ for the parent set of $v$ in the architecture $(V,E)$ of $\mathcal{N}$ , to avoid ambiguity. Note that the set of outputs of a GFNN can be an arbitrary subset of the non-input nodes. In particular, $V_{out}$ can include nodes $w$ with $\mathrm{lv}(w)<L(\mathcal{N})$ . Related to the concept of the parent set of a node is the concept of a subnetwork introduced next.

Definition 9 (Subnetwork and ancestor subnetwork).

Let $\mathcal{N}=(V,E,V_{in},V_{out},\Omega,\Theta)$ be a GFNN. A subnetwork of $\mathcal{N}$ is a GFNN $\mathcal{N}^{\prime}=(V^{\prime},E^{\prime},V_{in}^{\prime},V_{out}^{\prime},\Omega^{\prime},\Theta^{\prime})$ such that there exists a set $S\subset V$ so that

(i)

$V^{\prime}=\{v\in V:v\in\mathrm{par}^{r}(S)\text{ for some }r\geq 0\}$, where, for a set $W\subset V$, we define $\mathrm{par}^{0}(W)=W$ and $\mathrm{par}^{r}(W)=\bigcup_{s\in W}\mathrm{par}^{r-1}(\mathrm{par}(s))$, for $r\geq 1$,

(ii)

$E^{\prime}=\{(v,\widetilde{v})\in E:v,\widetilde{v}\in V^{\prime}\}$,

(iii)

$V_{in}^{\prime}=V_{in}\cap V^{\prime}$,

(iv)

$\Omega^{\prime}=\{\omega_{\widetilde{v}v}:(v,\widetilde{v})\in E^{\prime}\}$, and

(v)

$\Theta^{\prime}=\{\theta_{v}:v\in V^{\prime}\}$.

If additionally $V_{out}^{\prime}=S$ , then $\mathcal{N}^{\prime}$ is uniquely specified by $S$ . In this case we say that $\mathcal{N}^{\prime}$ is the ancestor subnetwork of $S$ in $\mathcal{N}$ , and write $\mathcal{N}(S)$ for this network.
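As an informal illustration of Definition 9, the node and edge sets of an ancestor subnetwork $\mathcal{N}(S)$ can be obtained by walking backwards along edges from $S$. The sketch below is an assumption-laden toy: the helper names `ancestor_nodes` and `ancestor_edges` and the example network are hypothetical, and weights and biases (which are simply inherited) are omitted.

```python
def ancestor_nodes(edges, S):
    """Nodes of the ancestor subnetwork N(S): S together with all its ancestors."""
    parents = {}
    for u, v in edges:  # edges are (parent, child) pairs
        parents.setdefault(v, set()).add(u)
    seen, frontier = set(S), list(S)
    while frontier:
        v = frontier.pop()
        for u in parents.get(v, ()):
            if u not in seen:
                seen.add(u)
                frontier.append(u)
    return seen


def ancestor_edges(edges, S):
    """Edges of N(S): all edges of the original graph between retained nodes."""
    keep = ancestor_nodes(edges, S)
    return {(u, v) for (u, v) in edges if u in keep and v in keep}


# Hypothetical two-input, two-output network.
edges = {("x1", "h1"), ("x1", "h2"), ("x2", "h2"), ("h1", "y1"), ("h2", "y2")}
sub_nodes = ancestor_nodes(edges, {"y1"})
sub_edges = ancestor_edges(edges, {"y1"})
```

Here the ancestor subnetwork of `{"y1"}` discards `x2`, `h2`, and `y2`, since none of them lies on a directed path into `y1`.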

Definition 10.

A layered feed-forward neural network (LFNN) is a GFNN satisfying $\mathrm{lv}(\widetilde{v})=\mathrm{lv}(v)+1$ , for all $(v,\widetilde{v})\in E$ .

For an example of a GFNN that is not layered, see Figure 1. We notice that LFNNs correspond to neural networks as specified by Definition 1, with the nodes of level $\ell$ corresponding to the $\ell$-th network layer. Specifically, if $\mathcal{N}=(V,E,V_{in},V_{out},\Omega,\Theta)$ is an LFNN, we can label the nodes $\{v\in V:\mathrm{lv}(v)=\ell\}$ by $v_{j}^{\ell}$, $j=1,\dots,D_{\ell}$, and let $\theta_{j}^{\ell}=\theta_{v_{j}^{\ell}}$, $W^{\ell}_{jk}=\omega_{v_{j}^{\ell}v_{k}^{\ell-1}}$ when $(v_{k}^{\ell-1},v_{j}^{\ell})\in E$ and $W^{\ell}_{jk}=0$ otherwise. Apropos, this correspondence is the reason for the indices of the weight $\omega_{\widetilde{v}v}$ associated with the edge $(v,\widetilde{v})$ of a GFNN appearing in “reverse order”. The following definition generalizes Definition 2 to GFNNs.

Definition 11 (Output maps of nodes and networks).

Let $\mathcal{N}=(V,E,V_{in},V_{out},\Omega,\Theta)$ be a GFNN, and let $\rho:\mathbb{R}\to\mathbb{R}$ be a nonlinearity. The map realized by a node $v\in V$ under $\rho$ is the function $\left\langle{v}\right\rangle^{\rho}:\mathbb{R}^{V_{in}}\to\mathbb{R}$ defined recursively as follows:

–

If $v\in V_{in}$ , set $\left\langle{v}\right\rangle^{\rho}\!(\bm{t})=t_{v}$ , for all $\bm{t}=(t_{u})_{u\in V_{in}}\in\mathbb{R}^{V_{in}}$ .

–

Otherwise set $\left\langle{v}\right\rangle^{\rho}\!(\bm{t})=\rho\left(\sum_{u\in\mathrm{par}(v)}\omega_{vu}\cdot\left\langle{u}\right\rangle^{\rho}(\bm{t})+\theta_{v}\right)$ , for all $\bm{t}\in\mathbb{R}^{V_{in}}$ .

The map realized by $\mathcal{N}$ under $\rho$ is the function $\left\langle{\mathcal{N}}\right\rangle^{\rho}:\mathbb{R}^{V_{in}}\to\mathbb{R}^{V_{out}}$ given by $\left\langle{\mathcal{N}}\right\rangle^{\rho}=(\left\langle{w}\right\rangle^{\rho})_{w\in V_{out}}$ . When dealing with several networks we will write $\left\langle{v}\right\rangle^{\rho,\,\mathcal{N}}$ for the map realized by $v$ in $\mathcal{N}$ , to avoid ambiguity.
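The recursive definition of the realized map can be made concrete with a short evaluator. The following Python sketch is only an illustration under stated assumptions: the function name `realize`, the dictionary-based encoding, and the example network with a skip connection are hypothetical, and weights are keyed as (child, parent) to match the index order of $\omega_{vu}$.

```python
import math


def realize(parents, weights, biases, outputs, rho, t):
    """Evaluate the map realized by a GFNN at input t (dict: input node -> value).

    parents: node -> list of parent nodes (non-input nodes only)
    weights: (child, parent) -> nonzero real weight, i.e. omega_{child parent}
    biases:  non-input node -> real bias
    """
    cache = dict(t)  # input nodes realize the coordinate projections

    def value(v):
        if v not in cache:
            pre = sum(weights[(v, u)] * value(u) for u in parents[v]) + biases[v]
            cache[v] = rho(pre)
        return cache[v]

    return {w: value(w) for w in outputs}


# Hypothetical non-layered example: x -> h -> y together with a skip edge x -> y.
parents = {"h": ["x"], "y": ["x", "h"]}
weights = {("h", "x"): 2.0, ("y", "x"): -1.0, ("y", "h"): 0.5}
biases = {"h": 0.1, "y": -0.3}
out = realize(parents, weights, biases, ["y"], math.tanh, {"x": 0.7})
```

Caching the value of each node reflects the fact that a node shared by several paths realizes a single function, regardless of how many children it feeds.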

We will treat nodes $v\in V$ only as “handles”, and never as variables or functions. This is relevant when dealing with several networks with shared nodes, such as depicted in Figure 2. On the other hand, the output map $\left\langle{v}\right\rangle^{\rho}$ realized by $v$ is a function.

In the special case when the nonlinearity is holomorphic on a neighborhood of $\mathbb{R}$ , the output maps realized by the nodes of a network will extend to holomorphic functions on their natural domains, as given by the following definition.

Definition 12 (Natural domain).

Let $\mathcal{N}=(V,E,V_{in},V_{out},\Omega,\Theta)$ be a GFNN, and let $\sigma:\mathcal{D}_{\sigma}\to\mathbb{C}$ be a function holomorphic on an open domain $\mathcal{D}_{\sigma}\supset\mathbb{R}$ and such that $\sigma(\mathbb{R})\subset\mathbb{R}$ . For a node $v\in V$ , we define the natural domain $\mathcal{D}_{\left\langle{v}\right\rangle^{\sigma}}\subset\mathbb{C}^{V_{in}}$ and extend the definition of the function $\left\langle{v}\right\rangle^{\sigma}:\mathcal{D}_{\left\langle{v}\right\rangle^{\sigma}}\to\mathbb{C}$ recursively as follows:

–

For $v\in V_{in}$ , let $\mathcal{D}_{\left\langle{v}\right\rangle^{\sigma}}=\mathbb{C}^{V_{in}}$ , and set $\left\langle{v}\right\rangle^{\sigma}\!(\bm{z})=z_{v}$ , for all $\bm{z}=(z_{u})_{u\in V_{in}}\in\mathbb{C}^{V_{in}}$ .

–

Otherwise, set $\mathcal{D}_{\left\langle{v}\right\rangle^{\sigma}}=\left\{\bm{z}\in\bigcap_{u\in\mathrm{par}(v)}\mathcal{D}_{\left\langle{u}\right\rangle^{\sigma}}:\sum_{u\in\mathrm{par}(v)}\omega_{vu}\left\langle{u}\right\rangle^{\sigma}\!(\bm{z})+\theta_{v}\in\mathcal{D}_{\sigma}\right\}$ , and let $\left\langle{v}\right\rangle^{\sigma}\!(\bm{z})\allowbreak=\sigma\left(\sum_{u\in\mathrm{par}(v)}\omega_{vu}\cdot\left\langle{u}\right\rangle^{\sigma}(\bm{z})+\theta_{v}\right)$ , for all $\bm{z}\in\mathcal{D}_{\left\langle{v}\right\rangle^{\sigma}}$ .

It follows that the natural domain $\mathcal{D}_{\left\langle{u}\right\rangle^{\sigma}}$ of a node $u$ is open, as it is the preimage of an open set with respect to a continuous map. Moreover, the output map $\left\langle{u}\right\rangle^{\sigma}$ realized by $u$ is holomorphic on $\mathcal{D}_{\left\langle{u}\right\rangle^{\sigma}}$ , as it is given explicitly by a concatenation of affine maps and the nonlinearity $\sigma$ , which are themselves holomorphic functions.

The following definition is a straightforward generalization of Definition 5.

Definition 13 (Clone pairs and the no-clones condition).

Let $\mathcal{N}=(V,E,V_{in},V_{out},\Omega,\Theta)$ be a GFNN. We say that the nodes $v_{1},v_{2}\in V$ , $v_{1}\neq v_{2}$ , are clones if $\mathrm{par}(v_{1})=\mathrm{par}(v_{2})$ , $\theta_{v_{1}}=\theta_{v_{2}}$ , and $\forall u\in\mathrm{par}(v_{1})$ , $\omega_{v_{1}u}=\omega_{v_{2}u}$ . We say that $\mathcal{N}$ satisfies the no-clones condition (or briefly, $\mathcal{N}$ is clones-free), if no two nodes $v_{1},v_{2}\in V$ , $v_{1}\neq v_{2}$ , are clones.
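The clone condition of Definition 13 amounts to comparing, for each pair of nodes, the parent set together with the incoming weights and the bias. A minimal Python sketch follows; the name `find_clones` and the example are hypothetical, the check is restricted to non-input nodes (an assumption made here because input nodes carry no bias), and parameters are compared with exact equality.

```python
from itertools import combinations


def find_clones(parents, weights, biases):
    """Return all unordered pairs of non-input nodes that are clones."""

    def signature(v):
        # Parent set with attached incoming weights, plus the bias of v.
        return (frozenset((u, weights[(v, u)]) for u in parents[v]), biases[v])

    non_inputs = [v for v in parents if parents[v]]
    return [(a, b) for a, b in combinations(non_inputs, 2)
            if signature(a) == signature(b)]


# Hypothetical example: h1 and h2 are clones, h3 differs in its bias.
parents = {"x": [], "h1": ["x"], "h2": ["x"], "h3": ["x"]}
weights = {("h1", "x"): 1.5, ("h2", "x"): 1.5, ("h3", "x"): 1.5}
biases = {"h1": 0.2, "h2": 0.2, "h3": -0.2}
clones = find_clones(parents, weights, biases)
```

A network is clones-free in the sense of Definition 13 precisely when this list is empty.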

The following definition generalizes Definition 4 to GFNNs, and introduces two new concepts, termed extensional isomorphism and faithful isomorphism, which will play an important technical role throughout the remainder of the paper.

Definition 14 (Extensional and faithful isomorphisms of GFNNs).

Let $\mathcal{N}^{1}=(V^{1},E^{1},V_{in},V_{out}^{1},\allowbreak\Omega^{1},\Theta^{1})$ and $\mathcal{N}^{2}=(V^{2},E^{2},V_{in},V_{out}^{2},\Omega^{2},\Theta^{2})$ be GFNNs with the same input nodes $V_{in}$ .

–

We say that $\mathcal{N}^{1}$ and $\mathcal{N}^{2}$ are extensionally isomorphic, and write $\mathcal{N}^{1}\stackrel{{\scriptstyle e}}{{\sim}}\mathcal{N}^{2}$ , if there exists a bijection $\pi:V^{1}\to V^{2}$ , called an extensional isomorphism, such that the following holds:

(i)

$\pi$ restricted to $V_{in}$ is the identity map,

(ii)

$\pi(V_{out}^{1})=V_{out}^{2}$,

(iii)

for all $(v,\widetilde{v})\in E^{1}$, we have $\omega^{2}_{\pi(\widetilde{v})\pi(v)}=\omega^{1}_{\widetilde{v}v}$, and

(iv)

for all $v\in V^{1}\setminus V_{in}$, we have $\theta^{2}_{\pi(v)}=\theta^{1}_{v}$.

–

We say that $\mathcal{N}^{1}$ and $\mathcal{N}^{2}$ are faithfully isomorphic, and write $\mathcal{N}^{1}\stackrel{{\scriptstyle f}}{{\sim}}\mathcal{N}^{2}$ , if they are extensionally isomorphic via $\pi:V^{1}\to V^{2}$ with the following additional property:

(v)

$V_{out}^{1}=V_{out}^{2}$ , and $\pi$ restricted to $V_{out}^{1}$ is the identity map.

In this case we call $\pi$ a faithful isomorphism.

Remark.

The concept of faithful isomorphisms in Definition 14 generalizes that of isomorphisms according to Definition 4. It is easily seen that extensional isomorphism is an equivalence relation on the set of all GFNNs with the same input nodes, whereas faithful isomorphism is an equivalence relation on the set of all GFNNs with the same input and output nodes. Furthermore, if $\mathcal{N}^{1}\stackrel{{\scriptstyle e}}{{\sim}}\mathcal{N}^{2}$ via $\pi:V^{1}\to V^{2}$ , then we have $\left\langle{\pi(v)}\right\rangle^{\rho,\,\mathcal{N}^{2}}=\left\langle{v}\right\rangle^{\rho,\,\mathcal{N}^{1}}$ , for all $v\in V^{1}$ and any nonlinearity $\rho$ , and if additionally $\mathcal{N}^{1}\stackrel{{\scriptstyle f}}{{\sim}}\mathcal{N}^{2}$ , then $\left\langle{\mathcal{N}^{1}}\right\rangle^{\rho}=\left\langle{\mathcal{N}^{2}}\right\rangle^{\rho}$ .

The following definition introduces the non-degeneracy property of a GFNN, which corresponds to the absence of spurious nodes, i.e., nodes that do not contribute to the map realized by the GFNN (with respect to an arbitrary nonlinearity). In the special case of LFNNs considered in the introduction, this property corresponds to the requirement that no matrix $W^{\ell}$ in Definition 1 has an identically zero row or column.

Definition 15 (Non-degeneracy).

We say that a GFNN $\mathcal{N}=(V,E,V_{in},V_{out},\Omega,\Theta)$ is non-degenerate if

$V=V^{\mathcal{N}(V_{out})}$ , where $V^{\mathcal{N}(V_{out})}$ is the set of nodes of the ancestor subnetwork of $V_{out}$ in $\mathcal{N}$ . Networks that are not non-degenerate are referred to as degenerate.

Informally, a network is non-degenerate if every one of its nodes “leads up” to at least one output. This notion is best understood with the help of examples as in Figure 3.
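The condition $V=V^{\mathcal{N}(V_{out})}$ can be tested by collecting the outputs together with all their ancestors and comparing the result with the full node set. The Python sketch below is an illustrative assumption rather than part of the paper; the name `is_non_degenerate` and the small example graph are hypothetical.

```python
def is_non_degenerate(nodes, edges, outputs):
    """Check V == V^{N(V_out)}: every node is an output or an ancestor of one."""
    parents = {v: set() for v in nodes}
    for u, v in edges:  # edges are (parent, child) pairs
        parents[v].add(u)
    reached, stack = set(outputs), list(outputs)
    while stack:
        for u in parents[stack.pop()]:
            if u not in reached:
                reached.add(u)
                stack.append(u)
    return reached == set(nodes)


# Hypothetical example: h2 is spurious because it feeds no output.
nodes = ["x", "h1", "h2", "y"]
edges = {("x", "h1"), ("x", "h2"), ("h1", "y")}
degenerate_case = is_non_degenerate(nodes, edges, {"y"})
repaired_case = is_non_degenerate(nodes, edges | {("h2", "y")}, {"y"})
```

Adding the edge from `h2` to `y` removes the spurious node, so only the second network is non-degenerate.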

We are now ready to introduce the concept of amalgams of LFNNs.

Definition 16 (Amalgam of two layered neural networks).

Let $\mathcal{N}_{1}=(V^{1},E^{1},V_{in},V_{out}^{1},\Omega^{1},\Theta^{1})$ and $\mathcal{N}_{2}=(V^{2},E^{2},V_{in},V_{out}^{2},\Omega^{2},\Theta^{2})$ be non-degenerate clones-free LFNNs with the same input set $V_{in}$ .

–

Let $\mathcal{A}=(V^{\mathcal{A}},E^{\mathcal{A}},V_{in},V_{out}^{\mathcal{A}},\Omega^{\mathcal{A}},\Theta^{\mathcal{A}})$ be a non-degenerate LFNN with the following properties:

(i)

There exist injective maps $\pi_{1}:V^{1}\to\pi_{1}(V^{1})\subset V^{\mathcal{A}}$ and $\pi_{2}:V^{2}\to\pi_{2}(V^{2})\subset V^{\mathcal{A}}$ such that the networks $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ are extensionally isomorphic to the ancestor subnetworks $\mathcal{A}(\pi_{1}(V_{out}^{1}))$ and $\mathcal{A}(\pi_{2}(V_{out}^{2}))$ via $\pi_{1}$ and $\pi_{2}$, respectively.

(ii)

$V^{\mathcal{A}}=\pi_{1}(V^{1})\cup\pi_{2}(V^{2})$ and $V_{out}^{\mathcal{A}}=\pi_{1}(V_{out}^{1})\cup\pi_{2}(V_{out}^{2})$.

We then say that $\mathcal{A}$ is a proto-amalgam of $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ .

–

If $\mathcal{A}$ is a clones-free proto-amalgam of $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ , we say that $\mathcal{A}$ is an amalgam of $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ .

Proposition 1.

Let $\mathcal{N}_{1}=(V^{1},E^{1},V_{in},V_{out}^{1},\Omega^{1},\Theta^{1})$ and $\mathcal{N}_{2}=(V^{2},E^{2},V_{in},V_{out}^{2},\Omega^{2},\Theta^{2})$ be non-degenerate clones-free LFNNs with a shared input set $V_{in}$ . Then there exists an amalgam $\mathcal{A}$ of $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ . Moreover, the amalgam is unique up to extensional isomorphisms.

As asserted in Proposition 1 (whose proof is deferred to the Appendix), an amalgam of two given non-degenerate clones-free LFNNs $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ always exists and is unique up to extensional isomorphisms. With slight abuse of notation, we will write $\mathcal{N}_{1}\vee\mathcal{N}_{2}$ for an arbitrary element of the equivalence class (induced by $\stackrel{{\scriptstyle e}}{{\sim}}$ ) of all the amalgams of $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ . A concrete example of an amalgam construction is provided in Figure 4. Having defined the amalgam of two non-degenerate clones-free LFNNs, we define the amalgam of any finite collection $\mathcal{N}_{1},\dots,\mathcal{N}_{n}$ of non-degenerate clones-free LFNNs according to

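$$\bigvee_{k=1}^{n}\mathcal{N}_{k}\vcentcolon=(\dots(\mathcal{N}_{1}\vee\mathcal{N}_{2})\vee\dots)\vee\mathcal{N}_{n}.$$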

By Definition 16, $\bigvee_{k=1}^{n}\mathcal{N}_{k}$ is a non-degenerate clones-free LFNN. Moreover, there exist extensional isomorphisms $\pi_{j}:\mathcal{N}_{j}\to\pi_{j}(\mathcal{N}_{j})\subset\bigvee_{k=1}^{n}\mathcal{N}_{k}$ , for $j\in\{1,\dots,n\}$ , and we have $\left\langle{\pi_{j}(v)}\right\rangle^{\rho,\,\bigvee_{k=1}^{n}\mathcal{N}_{k}}=\left\langle{v}\right\rangle^{\rho,\,\mathcal{N}_{j}}$ , for $j\in\{1,\dots,n\}$ , $v\in V^{\mathcal{N}_{j}}$ , and any nonlinearity $\rho$ .

We are now in a position to prove two lemmas that form the basis for the proof of Theorem 2. The first lemma formalizes the idea of combining multiple pairwise non-isomorphic single-output networks with linearly dependent output maps into one multiple-output network with linear dependency among the maps of its output nodes.

Lemma 1.

Let $\mathcal{N}_{1}$ , $\mathcal{N}_{2}$ , …, $\mathcal{N}_{n}$ be non-degenerate, clones-free LFNNs with a shared input set $V_{in}$ and the same single output node $\{v_{out}\}$ . Furthermore, assume that no two networks $\mathcal{N}_{j_{1}},\mathcal{N}_{j_{2}}$ , $j_{1}\neq j_{2}$ , are extensionally isomorphic. Let $\rho$ be a nonlinearity and suppose that $\bm{1},\left\langle{\mathcal{N}_{1}}\right\rangle^{\rho},\left\langle{\mathcal{N}_{2}}\right\rangle^{\rho},\dots,\left\langle{\mathcal{N}_{n}}\right\rangle^{\rho}$ are linearly dependent as functions $\mathbb{R}^{V_{in}}\to\mathbb{R}$ . Then there exists a non-degenerate clones-free LFNN $\mathcal{M}=(V^{\mathcal{M}},E^{\mathcal{M}},V_{in}^{\mathcal{M}},V_{out}^{\mathcal{M}},\Omega^{\mathcal{M}},\Theta^{\mathcal{M}})$ (obtained by modifying $\bigvee_{k=1}^{n}\mathcal{N}_{k}$ ) with a single input node $V_{in}^{\mathcal{M}}=\{v_{in}\}$ , such that $\{\left\langle{w}\right\rangle^{\rho}:w\in V_{out}^{\mathcal{M}}\}\cup\{\bm{1}\}$ is a linearly dependent set of functions from $\mathbb{R}$ to $\mathbb{R}$ .

Proof.

We first create a new node $v_{in}$ and select an arbitrary set $\{\omega_{\widetilde{v}v_{in}}:\widetilde{v}\in V_{in}\}\subset\mathbb{R}\setminus\{0\}$ of cardinality $\#V_{in}$, i.e., consisting of pairwise distinct nonzero weights. Now, we enlarge each $\mathcal{N}_{j}$ to a new network $\widetilde{\mathcal{N}}_{j}$ by gluing the node $v_{in}$ to the set $V_{in}$ through the edges $\{(v_{in},\widetilde{v}):\widetilde{v}\in V_{in}\}$ along with the corresponding weights $\omega_{\widetilde{v}v_{in}}$. The nodes $v\in V_{in}$ are non-input nodes of the $\widetilde{\mathcal{N}}_{j}$, as their parent sets $\mathrm{par}_{\widetilde{\mathcal{N}}_{j}}(v)=\{v_{in}\}$ are non-empty, and we set their biases $\theta_{v}$ to [math]. The node $v_{in}$ is now the shared single input of the networks $\widetilde{\mathcal{N}}_{j}$, $j=1,\dots,n$. Note that, as the networks $\mathcal{N}_{j}$ are clones-free, and the weights $\omega_{\widetilde{v}v_{in}}$ are distinct, the networks $\widetilde{\mathcal{N}}_{j}$ are clones-free as well. Further, since ${\mathcal{N}}_{j}$, $j\in\{1,\dots,n\}$, are pairwise non-isomorphic, so are the $\widetilde{\mathcal{N}}_{j}$, $j\in\{1,\dots,n\}$. We now construct a network $\mathcal{M}$ by amalgamating $\widetilde{\mathcal{N}}_{j}$, $j=1,\dots,n$, according to $\mathcal{M}=(\dots(\widetilde{\mathcal{N}}_{1}\vee\widetilde{\mathcal{N}}_{2})\vee\dots)\vee\widetilde{\mathcal{N}}_{n}$. Denote by $\pi_{j}:V^{\widetilde{\mathcal{N}}_{j}}\to\pi_{j}(V^{\widetilde{\mathcal{N}}_{j}})\subset V^{\mathcal{M}}$ the extensional isomorphism between $\widetilde{\mathcal{N}}_{j}$ and the corresponding subnetwork of $\mathcal{M}$, and let $w_{j}=\pi_{j}(v_{out})$ be the node of $\mathcal{M}$ corresponding to the output node of $\mathcal{N}_{j}$. We claim that $w_{j_{1}}\neq w_{j_{2}}$, for $j_{1}\neq j_{2}$. To see this, take $j_{1},j_{2}$ such that $w_{j_{1}}=w_{j_{2}}$, i.e., $\pi_{j_{1}}(v_{out})=\pi_{j_{2}}(v_{out})$.
Then, by Property (i) of Definition 16, $\widetilde{\mathcal{N}}_{j_{1}}(v_{out})\stackrel{{\scriptstyle e}}{{\sim}}\widetilde{\mathcal{N}}_{j_{2}}(v_{out})$ , and therefore ${\mathcal{N}}_{j_{1}}(v_{out})\stackrel{{\scriptstyle e}}{{\sim}}{\mathcal{N}}_{j_{2}}(v_{out})$ as well. But $\mathcal{N}_{j_{1}}(v_{out})=\mathcal{N}_{j_{1}}$ and $\mathcal{N}_{j_{2}}(v_{out})=\mathcal{N}_{j_{2}}$ by the non-degeneracy assumption, and hence $\mathcal{N}_{j_{1}}\stackrel{{\scriptstyle e}}{{\sim}}\mathcal{N}_{j_{2}}$ . It follows that $j_{1}=j_{2}$ , as $\mathcal{N}_{j}$ , $j=1,\dots,n$ , are assumed to be pairwise non-isomorphic. Thus the $w_{j}$ are, indeed, distinct nodes of $\mathcal{M}$ , and we have $V_{out}^{\mathcal{M}}=\{w_{1},w_{2},\dots,w_{n}\}$ . As $\bm{1},\left\langle{\mathcal{N}_{1}}\right\rangle^{\rho},\left\langle{\mathcal{N}_{2}}\right\rangle^{\rho},\dots,\left\langle{\mathcal{N}_{n}}\right\rangle^{\rho}$ are linearly dependent by assumption, there exists a nonzero vector $(c,\lambda_{1},\lambda_{2},\dots,\lambda_{n})\in\mathbb{R}^{n+1}$ such that $\left(c\,\bm{1}+\sum_{j=1}^{n}\lambda_{j}\left\langle{\mathcal{N}_{j}}\right\rangle^{\rho}\right)\big{(}(t_{v})_{v\in V_{in}}\big{)}=0$ , for all $(t_{v})_{v\in V_{in}}\in\mathbb{R}^{V_{in}}$ . We then have

[TABLE]

for all $t\in\mathbb{R}$ . This establishes that $\{\left\langle{w_{1}}\right\rangle^{\rho,\,\mathcal{M}},\left\langle{w_{2}}\right\rangle^{\rho,\,\mathcal{M}},\dots,\left\langle{w_{n}}\right\rangle^{\rho,\,\mathcal{M}}\}\cup\{\bm{1}\}$ is a linearly dependent set, so $\mathcal{M}$ is the desired network. ∎

Before stating the next lemma, we describe the procedure of input anchoring, which is a method for selecting and modifying a subnetwork of a non-degenerate GFNN in a manner that preserves linear dependencies between the maps realized by the output nodes of the original network. Concretely, let $\mathcal{M}=(V^{\mathcal{M}},E^{\mathcal{M}},V_{in}^{\mathcal{M}},V_{out}^{\mathcal{M}},\allowbreak\Omega^{\mathcal{M}},\allowbreak\Theta^{\mathcal{M}})$ be a non-degenerate, clones-free GFNN with input nodes $V_{in}^{\mathcal{M}}=\{v_{1}^{0},\dots,v_{D_{0}}^{0}\}$ , $D_{0}\geq 2$ . For specificity, let w.l.o.g. $v_{D_{0}}^{0}$ be the input node to be anchored, and let $a\in\mathbb{R}$ be the value $v_{D_{0}}^{0}$ is anchored to. Furthermore, let $\rho$ be a nonlinearity. We seek to construct a network ${\mathcal{M}}_{a}=(V^{{\mathcal{M}_{a}}},E^{{\mathcal{M}_{a}}},V_{in}^{{\mathcal{M}_{a}}},V_{out}^{{\mathcal{M}_{a}}},\Omega^{{\mathcal{M}_{a}}},\Theta^{{\mathcal{M}_{a}}})$ with $V_{in}^{{\mathcal{M}_{a}}}=\{v_{1}^{0},\dots,v_{D_{0}-1}^{0}\}$ and $V_{out}^{{\mathcal{M}_{a}}}=V_{out}^{\mathcal{M}}\cap V^{{\mathcal{M}_{a}}}$ satisfying the following two properties:

(IA-1)

For all $w\in V_{out}^{{\mathcal{M}_{a}}}$ ,

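$$\left\langle{w}\right\rangle^{\rho,\,\mathcal{M}_{a}}(t_{1},t_{2},\dots,t_{D_{0}-1})=\left\langle{w}\right\rangle^{\rho,\,\mathcal{M}}(t_{1},t_{2},\dots,t_{D_{0}-1},a),$$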

for all $(t_{1},t_{2},\dots,t_{D_{0}-1})\in\mathbb{R}^{D_{0}-1}$ (after identifying $\mathbb{R}^{V_{in}}$ with $\mathbb{R}^{D_{0}}$ ).

(IA-2)

For all $w\in V_{out}^{\mathcal{M}}\setminus V_{out}^{{\mathcal{M}_{a}}}$ , the function $\mathbb{R}^{D_{0}-1}\to\mathbb{R}$ given by

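$$(t_{1},t_{2},\dots,t_{D_{0}-1})\mapsto\left\langle{w}\right\rangle^{\rho,\,\mathcal{M}}(t_{1},t_{2},\dots,t_{D_{0}-1},a)$$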

is constant, and we denote its value by $\left\langle{w}\right\rangle^{\rho,\,\mathcal{M}}\!\left(a\right)$ .

As $V^{{\mathcal{M}_{a}}}\subset V^{{\mathcal{M}}}\setminus\{v_{D_{0}}^{0}\}$, the network $\mathcal{M}_{a}$ will, indeed, have fewer nodes than $\mathcal{M}$. Now suppose that $\mathcal{M}_{a}$ is such a network, and suppose that $\{\left\langle{w}\right\rangle^{\rho,\,\mathcal{M}}\}_{w\in V_{out}^{\mathcal{M}}}$ is a linearly dependent set of functions $\mathbb{R}^{D_{0}}\to\mathbb{R}$. In particular, let $(\lambda_{w})_{w\in V_{out}^{\mathcal{M}}}$ be scalars, not all zero, such that

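$$\sum_{w\in V_{out}^{\mathcal{M}}}\lambda_{w}\left\langle{w}\right\rangle^{\rho,\,\mathcal{M}}(t_{1},\dots,t_{D_{0}})=0,\quad\text{for all }(t_{1},\dots,t_{D_{0}})\in\mathbb{R}^{D_{0}}.$$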

We then have

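$$\sum_{w\in V_{out}^{\mathcal{M}_{a}}}\lambda_{w}\left\langle{w}\right\rangle^{\rho,\,\mathcal{M}_{a}}(t_{1},\dots,t_{D_{0}-1})+\Bigg(\sum_{w\in V_{out}^{\mathcal{M}}\setminus V_{out}^{\mathcal{M}_{a}}}\lambda_{w}\left\langle{w}\right\rangle^{\rho,\,\mathcal{M}}\!\left(a\right)\Bigg)\bm{1}(t_{1},\dots,t_{D_{0}-1})=0,\quad\text{for all }(t_{1},\dots,t_{D_{0}-1})\in\mathbb{R}^{D_{0}-1},$$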

and thus $\{\left\langle{w}\right\rangle^{\rho,\,\mathcal{M}_{a}}\}_{w\in V_{out}^{\mathcal{M}_{a}}}\cup\{\bm{1}\}$ is a linearly dependent set of functions $\mathbb{R}^{D_{0}-1}\to\mathbb{R}$ . Apropos, this derivation illustrates why it is often convenient to include the constant function $\bm{1}$ when dealing with linear dependencies between the outputs of GFNNs. In the following definition we construct a network $\mathcal{M}_{a}$ with the desired properties, and in Figure 5 we provide an illustration of this construction.

Definition 17.

Let $\mathcal{M}=(V^{\mathcal{M}},E^{\mathcal{M}},V_{in}^{\mathcal{M}},V_{out}^{\mathcal{M}},\Omega^{\mathcal{M}},\allowbreak\Theta^{\mathcal{M}})$ be a non-degenerate, clones-free GFNN with input nodes $V_{in}^{\mathcal{M}}=\{v_{1}^{0},\dots,v_{D_{0}}^{0}\}$ , $D_{0}\geq 2$ . Let $a\in\mathbb{R}$ , and let $\rho$ be a nonlinearity. The network obtained from $\mathcal{M}$ by anchoring the input $v_{D_{0}}^{0}$ to $a$ is the GFNN ${\mathcal{M}}_{a}=(V^{{\mathcal{M}_{a}}},E^{{\mathcal{M}_{a}}},V_{in}^{{\mathcal{M}_{a}}},V_{out}^{{\mathcal{M}_{a}}},\Omega^{{\mathcal{M}_{a}}},\allowbreak\Theta^{{\mathcal{M}_{a}}})$ given by the following:

–

$V^{{\mathcal{M}_{a}}}=\{v\in V^{\mathcal{M}}:\{v_{1}^{0},\dots,v_{D_{0}-1}^{0}\}\cap V^{\mathcal{M}(v)}\neq\varnothing\}$, where $\mathcal{M}(v)$ denotes the ancestor subnetwork of $\{v\}$ in $\mathcal{M}$,

–

$E^{{\mathcal{M}_{a}}}=\{(v,\widetilde{v})\in E^{\mathcal{M}}:v,\widetilde{v}\in V^{{\mathcal{M}_{a}}}\}$,

–

$V_{in}^{{\mathcal{M}_{a}}}=\{v_{1}^{0},\dots,v_{D_{0}-1}^{0}\}$ , $V_{out}^{{\mathcal{M}_{a}}}=V_{out}^{\mathcal{M}}\cap V^{{\mathcal{M}_{a}}}$ , and

–

$\Omega^{{\mathcal{M}_{a}}}=\{\omega_{\widetilde{v}v}:(v,\widetilde{v})\in E^{{\mathcal{M}_{a}}}\}$ .

–

For a node $v\in V^{\mathcal{M}}\setminus V^{{\mathcal{M}_{a}}}$ we define recursively

[TABLE]

(Note that all $a_{v}$ are well-defined, as $\mathrm{par}_{\mathcal{M}}(v)\subset V^{\mathcal{M}}\setminus V^{{\mathcal{M}_{a}}}$ whenever $v\in V^{\mathcal{M}}\setminus V^{{\mathcal{M}_{a}}}$ .) Now, for $v\in V^{{\mathcal{M}_{a}}}$ let

[TABLE]

and set $\Theta^{{\mathcal{M}_{a}}}=\{\widetilde{\theta}_{v}:v\in V^{{\mathcal{M}_{a}}}\}$ .
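The anchoring construction can be illustrated on a toy network. The following is a minimal sketch, not the authors' code. Because the display formulas of Definition 17 are not reproduced here, it assumes that the constants are $a_{v}=a$ at the anchored input and $a_{v}=\rho\big(\sum_{u\in\mathrm{par}(v)}\omega_{vu}a_{u}+\theta_{v}\big)$ at every other removed node, and that the new biases are $\widetilde{\theta}_{v}=\theta_{v}+\sum_{u}\omega_{vu}a_{u}$ with the sum running over the removed parents $u$ of $v$. The network, its weights, and all function names are hypothetical.

```python
import math

RHO = math.tanh  # any nonlinearity would do for this illustration

# Hypothetical toy GFNN: node -> (bias, {parent: weight}); inputs are x1, x2.
NODES = {
    "h1": (0.3, {"x1": 1.5}),
    "h2": (-0.2, {"x2": 0.7}),                      # depends on x2 only
    "out": (0.1, {"h1": 2.0, "h2": -1.0, "x2": 0.5}),
}
ORDER = ["h1", "h2", "out"]  # a topological order of the non-input nodes


def evaluate(nodes, order, inputs):
    """Evaluate every node given a dict of input values."""
    values = dict(inputs)
    for v in order:
        bias, parents = nodes[v]
        values[v] = RHO(sum(w * values[u] for u, w in parents.items()) + bias)
    return values


def anchor(nodes, order, kept_inputs, anchored_input, a):
    """Anchor one input to the constant a (assumed form of Definition 17)."""
    kept = set(kept_inputs)        # nodes with a kept input among their ancestors
    const = {anchored_input: a}    # assumed constants a_v of the removed nodes
    new_nodes = {}
    for v in order:
        bias, parents = nodes[v]
        if any(u in kept for u in parents):
            kept.add(v)
            # Fold the contribution of removed parents into the bias.
            shift = sum(w * const[u] for u, w in parents.items() if u not in kept)
            new_nodes[v] = (bias + shift,
                            {u: w for u, w in parents.items() if u in kept})
        else:
            # v sees only the anchored input, so it realizes a constant.
            const[v] = RHO(sum(w * const[u] for u, w in parents.items()) + bias)
    return new_nodes


a, x1 = 0.4, -0.8
anchored = anchor(NODES, ORDER, ["x1"], "x2", a)
order_a = [v for v in ORDER if v in anchored]
full_out = evaluate(NODES, ORDER, {"x1": x1, "x2": a})["out"]
anchored_out = evaluate(anchored, order_a, {"x1": x1})["out"]
```

In this sketch the node `h2` disappears from the anchored network, and the remaining output node realizes the same value as the original network evaluated at $x_{2}=a$.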

The network ${\mathcal{M}_{a}}$ satisfies (IA-1) and (IA-2) by construction, and if $\mathcal{M}$ is layered, then so is ${\mathcal{M}_{a}}$ . Moreover, ${\mathcal{M}_{a}}$ is non-degenerate. To see this, let $v\in V^{\mathcal{M}_{a}}$ be arbitrary. Then, by non-degeneracy of $\mathcal{M}$ , there exists a $w\in V^{\mathcal{M}}_{out}$ such that $v\in V^{\mathcal{M}(w)}$ . As $w$ is connected directly with a node in $V^{\mathcal{M}_{a}}$ , it follows that $w\in V^{{\mathcal{M}_{a}}}$ , and so $w\in V_{out}^{\mathcal{M}_{a}}$ .

Therefore $v\in V^{\mathcal{M}_{a}(w)}$ , and, as $v$ was arbitrary, we obtain $V^{\mathcal{M}_{a}}\subset\bigcup_{w\in V_{out}^{\mathcal{M}_{a}}}V^{\mathcal{M}_{a}(w)}$ , establishing by Definition 15 that ${\mathcal{M}_{a}}$ is non-degenerate. In general, however, ${\mathcal{M}_{a}}$ will not be clones-free. This is unfortunate, as our program for proving Theorem 2 envisages maintaining the no-clones property when constructing networks with linearly dependent outputs. Not all is lost, though: the following lemma states that, for nonlinearities holomorphic on a neighborhood of $\mathbb{R}$ , either there exists a value of $a\in\mathbb{R}$ such that the network ${\mathcal{M}_{a}}$ is, indeed, clones-free, or it is possible to modify a subnetwork of $\mathcal{M}$ (different from the subnetwork giving rise to $\mathcal{M}_{a}$ ) to yield a clones-free subnetwork $\mathcal{N}$ of $\mathcal{M}$ with input $\{v_{D_{0}}^{0}\}$ and a linear dependency among the maps realized by its output nodes. This will be sufficient for our purposes.

Lemma 2 (Input anchoring).

Let $\mathcal{M}=(V^{\mathcal{M}},E^{\mathcal{M}},V_{in}^{\mathcal{M}},V_{out}^{\mathcal{M}},\Omega^{\mathcal{M}},\Theta^{\mathcal{M}})$ be a non-degenerate, clones-free GFNN with input nodes $V_{in}^{\mathcal{M}}=\{v_{1}^{0},\dots,v_{D_{0}}^{0}\}$ , $D_{0}\geq 2$ . Let $\rho:\mathcal{U}\to\mathbb{R}$ be holomorphic on an open domain $\mathcal{U}\subset\mathbb{C}$ containing $\mathbb{R}$ , such that $\rho(\mathbb{R})\subset\mathbb{R}$ . Let ${\mathcal{M}}_{a}$ denote the network obtained by anchoring the input $v_{D_{0}}^{0}$ to some $a\in\mathbb{R}$ , according to Definition 17. Then one of the following two statements must be true:

(i)

There exists an $a\in\mathbb{R}$ such that ${\mathcal{M}}_{a}$ is clones-free.

(ii)

There exist a non-degenerate clones-free GFNN $\mathcal{N}=(V^{\mathcal{N}},E^{\mathcal{N}},\{v_{D_{0}}^{0}\},V_{out}^{\mathcal{N}},\Omega^{\mathcal{N}},\Theta^{\mathcal{N}})$ (obtained by modifying a subnetwork of $\mathcal{M}$ ), a real number $\lambda_{0}$ , and nonzero real numbers $(\lambda_{w})_{w\in V_{out}^{\mathcal{N}}}$ , such that the function $h_{out}^{\mathcal{N}}:=\lambda_{0}\,\bm{1}+\sum_{w\in V_{out}^{\mathcal{N}}}\lambda_{w}\left\langle{w}\right\rangle^{\rho,\,\mathcal{N}}$ is identically zero on $\mathbb{R}$ .

Proof.

For a pair of nodes $(c_{1},c_{2})\in V^{{\mathcal{M}}}\times V^{{\mathcal{M}}}$ define

[TABLE]

Suppose that (i) is false, so that, for every $a\in\mathbb{R}$ , we have $a\in E_{(c_{1},\,c_{2})}$ for some $(c_{1},c_{2})$ . Then we can write $\mathbb{R}$ as a finite union

$$\mathbb{R}=\bigcup_{(c_{1},c_{2})\in V^{\mathcal{M}}\times V^{\mathcal{M}}}E_{(c_{1},\,c_{2})}.$$

Since $\mathbb{R}$ is uncountable, whereas every subset of $\mathbb{R}$ without a limit point is countable, there exists a pair $(c_{1},c_{2})$ such that the set $E_{(c_{1},c_{2})}$ is not discrete, i.e., it has a limit point. Fix such a pair $(c_{1},c_{2})$ . Note that we have $v_{D_{0}}^{0}\in V^{\mathcal{M}(c_{j})}$ , for at least one of $j=1$ or $j=2$ , as otherwise we would have $\mathrm{par}_{\mathcal{M}_{a}}(c_{j})=\mathrm{par}_{\mathcal{M}}(c_{j})$ , for $j\in\{1,2\}$ and all $a\in E_{(c_{1},\,c_{2})}$ , and thus $c_{1}$ , $c_{2}$ would be clones in $\mathcal{M}_{a}$ if and only if they are clones in $\mathcal{M}$ . But, by the no-clones property of $\mathcal{M}$ , this would imply $E_{(c_{1},\,c_{2})}=\varnothing$ , contradicting the fact that $E_{(c_{1},c_{2})}$ is not discrete. Thus, we may w.l.o.g. assume that $v_{D_{0}}^{0}\in V^{\mathcal{M}(c_{1})}$ , which leaves us with the cases $v_{D_{0}}^{0}\in V^{\mathcal{M}(c_{2})}$ and $v_{D_{0}}^{0}\notin V^{\mathcal{M}(c_{2})}$ that will be treated separately when needed. Define the GFNN $\mathcal{N}=(V^{\mathcal{N}},E^{\mathcal{N}},\{v_{D_{0}}^{0}\},V_{out}^{\mathcal{N}},\Omega^{\mathcal{N}},\Theta^{\mathcal{N}})$ according to the following:

–

Let $S=\{v\in V^{\mathcal{M}(\{c_{1},c_{2}\})}:V_{in}^{\mathcal{M}}\cap V^{\mathcal{M}(v)}=\{v_{D_{0}}^{0}\}\}$ , and set

[TABLE]

–

$E^{{\mathcal{N}}}=\{(v,\widetilde{v})\in E^{\mathcal{M}}:\;v,\widetilde{v}\in V^{{\mathcal{N}}}\}$ ,

–

$V_{out}^{\mathcal{N}}=\{c_{1},c_{2}\}\cap V^{\mathcal{N}}$ ,

–

$\Omega^{\mathcal{N}}=\{\omega_{\widetilde{v}v}:(v,\widetilde{v})\in E^{\mathcal{N}}\}$ ,

–

choose a number $r\in\mathbb{R}\setminus\big{(}\{\theta_{v}-\theta_{c_{1}}:v\in S\}\cup\{\theta_{v}-\theta_{c_{2}}:v\in S\}\big{)}$ , and set $\overline{\theta}_{c_{1}}={\theta}_{c_{1}}+r$ , $\overline{\theta}_{c_{2}}={\theta}_{c_{2}}+r$ , and $\overline{\theta}_{v}=\theta_{v}$ , for $v\in S$ . Define $\Theta^{\mathcal{N}}=\{\overline{\theta}_{v}:v\in V^{\mathcal{N}}\}$ .

Informally, the so-constructed network $\mathcal{N}$ consists of the parts of $\mathcal{M}$ propagating the input at $v_{D_{0}}^{0}$ to $c_{1}$ and $c_{2}$ (and it might happen that this input does not reach $c_{2}$ , in which case this node is not included in $V^{\mathcal{N}}$ ), and the biases $\overline{\theta}_{c_{1}}$ and $\overline{\theta}_{c_{2}}$ are chosen so as to ensure that $\mathcal{N}$ has no clone pair $(v,\tilde{v})$ with $v\in\{c_{1},c_{2}\}$ and $\tilde{v}\in S$ . Thus, in order to show that $\mathcal{N}$ is clones-free, it suffices to establish that $c_{1}$ and $c_{2}$ are not clones in $\mathcal{N}$ (note that $c_{1}$ and $c_{2}$ can be clones in $\mathcal{N}$ only in the case $v_{D_{0}}^{0}\in V^{\mathcal{M}(c_{2})}$ ), as any clone pair $(v,\tilde{v})$ with $v,\tilde{v}\in S$ would also be a clone pair in $\mathcal{M}$ . By way of contradiction, assume that $c_{1}$ and $c_{2}$ are clones in $\mathcal{N}$ , i.e.,

[TABLE]

As the construction of $\mathcal{N}$ does not depend on $a$ , we can fix an arbitrary $a\in E_{(c_{1},\,c_{2})}$ , and the condition that $c_{1}$ and $c_{2}$ are clones in $\mathcal{M}_{a}$ then implies

[TABLE]

where the real numbers $a_{u}$ are defined according to (2). This, together with (4), yields

[TABLE]

which would say that $c_{1}$ and $c_{2}$ are clones in $\mathcal{M}$ and hence stands in contradiction to the no-clones property of $\mathcal{M}$ . This establishes the no-clones property of $\mathcal{N}$ . The non-degeneracy of $\mathcal{N}$ follows by its construction. Now, by adding $r$ to both sides of (5) and applying $\rho$ , we find

[TABLE]

for all $a\in E_{(c_{1},\,c_{2})}$ (note that $\mathrm{par}_{\mathcal{M}}(c_{2})\cap V^{\mathcal{N}}=\varnothing$ in the case $v_{D_{0}}^{0}\notin V^{\mathcal{M}(c_{2})}$ , and so the sum on the right-hand side of (5) evaluates to $0$ in this case). As $\rho$ is holomorphic on an open neighborhood of $\mathbb{R}$ and $\rho(\mathbb{R})\subset\mathbb{R}$ , we also have that $\left\langle{c_{1}}\right\rangle^{\rho,\,\mathcal{N}}$ , $\left\langle{c_{2}}\right\rangle^{\rho,\,\mathcal{N}}$ are holomorphic on a neighborhood of $\mathbb{R}$ . Further, since $E_{(c_{1},c_{2})}$ has a limit point, it follows by the identity theorem [18, Thm. 10.18] that (7) holds for all $a\in\mathbb{R}$ . We have hence shown that Statement (ii) is valid with this $\mathcal{N}$ , and

[TABLE]

∎

IV Auxiliary results from complex analysis and Kronecker’s theorem

We state the remaining auxiliary results needed in the proof of our main statements. Since these results are relatively simple consequences of standard results in complex analysis and of Kronecker’s theorem, their proofs are relegated to the appendix.

Recall the definition of the natural domain $\mathcal{D}_{\left\langle{u}\right\rangle^{\sigma}}$ of the map realized by a GFNN node $u$ with respect to a holomorphic nonlinearity as given in Definition 12.

In the proof of Theorem 2 it will be crucial that $\mathcal{D}_{\left\langle{u}\right\rangle^{\sigma}}$ be connected for all nodes $u$ of a certain GFNN with a single input. The following lemma establishes this fact.

Lemma 3.

Let $\mathcal{N}=(V,E,\{v_{in}\},V_{out},\Omega,\Theta)$ be a GFNN, and let $\sigma:\mathcal{D}_{\sigma}\to\mathbb{C}$ be a meromorphic function on $\mathbb{C}$ with its set of poles given by $P\subset\mathbb{C}\setminus\mathbb{R}$ . Furthermore, suppose that $\sigma(\mathbb{R})\subset\mathbb{R}$ . Then, for every $u\in V$ , we have $\mathcal{D}_{\left\langle{u}\right\rangle^{\sigma}}=\mathbb{C}\setminus E_{u}$ , where $E_{u}\subset\mathbb{C}$ is a closed countable subset of $\mathbb{C}\setminus\mathbb{R}$ . In particular, we have that $\mathcal{D}_{\left\langle{u}\right\rangle^{\sigma}}$ is an open connected set with $\mathcal{D}_{\left\langle{u}\right\rangle^{\sigma}}\supset\mathbb{R}$ .

In the following we write $D^{\circ}_{k}(\bm{a},\delta):=\{(z_{1},\dots,z_{k})\in\mathbb{C}^{k}:|z_{j}-a_{j}|<\delta,\forall j\}$ for the open polydisc of radius $\delta>0$ , centered at $\bm{a}=(a_{1},\dots,a_{k})\in\mathbb{C}^{k}$ . Further, for a set $S\subset\mathbb{C}^{k}$ , we write $\mathrm{cl}(S)$ for the closure of $S$ in $\mathbb{C}^{k}$ .

Lemma 4.

Let $F:\mathcal{U}\to\mathbb{C}$ be holomorphic on a connected open domain $\mathcal{U}\subset\mathbb{C}^{k}$ containing $\mathbb{R}^{k}$ . Let $\bm{a}=(a_{1},\dots,a_{k})\in\mathbb{R}^{k}$ and $\delta>0$ be given, and let

[TABLE]

Suppose that $D^{\circ}_{k}(\bm{a},\delta)\subset\mathcal{U}$ , and $F(z)=0$ , for all $z\in T$ . Then $F=0$ identically on $\mathcal{U}$ .

Lemma 5.

Let $t^{*}\in\mathbb{C}$ , $\bm{a}=(a_{1},\dots,a_{k})\in\mathbb{R}^{k}$ , and $\delta>0$ , and let $F:\mathcal{U}\to\mathbb{C}$ be holomorphic on a connected open domain $\mathcal{U}\subset\mathbb{C}^{1+k}$ containing $\{t^{*}\}\times\mathbb{R}^{k}$ . Define the set

[TABLE]

and suppose that $D^{\circ}_{1+k}(\bm{a},\delta)\subset\mathcal{U}$ . If there exists a set $\widetilde{T}\subset\mathbb{C}^{1+k}$ such that $\widetilde{T}\subset(\mathbb{C}\setminus\{t^{*}\})\times\mathbb{C}^{k}$ , $\mathrm{cl}(\widetilde{T})\supset T$ , and $F|_{\widetilde{T}}\equiv 0$ , then $F|_{\mathcal{U}}\equiv 0$ .

We will now elaborate on the tools needed in the proof of Theorem 2. The material touches upon the theory of Lie groups and representation theory, and will be presented in a self-contained fashion, only assuming familiarity with finitely-generated abelian groups and basic point-set topology. We write $T^{d}=\mathbb{R}^{d}/\mathbb{Z}^{d}$ for the $d$ -dimensional torus considered as a compact abelian topological group. For a finite set of real numbers $\{\alpha_{j}\}_{j\hskip 1.42262pt=\hskip 1.42262pt1}^{d}$ we let $\langle\alpha_{1},\dots,\alpha_{d}\rangle_{\mathbb{Q}}$ denote the span of $\{\alpha_{j}\}_{j\hskip 1.42262pt=\hskip 1.42262pt1}^{d}$ in the vector space $\mathbb{R}$ over the scalar field $\mathbb{Q}$ , and we write $\dim\langle\alpha_{1},\dots,\alpha_{d}\rangle_{\mathbb{Q}}$ for its dimension. We will need the following lemma, which is an easy consequence of Kronecker’s theorem [16]. For the sake of completeness, we provide an elementary proof from first principles.

Lemma 6 ([16] Kronecker).

Let $d\in\mathbb{N}$ and let $\{\alpha_{j}\}_{j\hskip 1.42262pt=\hskip 1.42262pt1}^{d}$ be an arbitrary set of nonzero real numbers with $k=\dim\langle\alpha_{1},\dots,\alpha_{d}\rangle_{\mathbb{Q}}$ . Define the following subset of $T^{d}$ :

$$M=\mathrm{cl}\big\{(\alpha_{1}t,\dots,\alpha_{d}t)+\mathbb{Z}^{d}:t\in\mathbb{R}\big\},$$

where $\mathrm{cl}$ denotes the closure in $T^{d}$ . Then $M$ is isomorphic to a $k$ -dimensional torus as a Lie group, i.e., there exists a $\Psi:M\to\mathbb{R}^{k}/\mathbb{Z}^{k}$ that is both a homeomorphism (between $M$ and $\mathbb{R}^{k}/\mathbb{Z}^{k}$ as topological spaces) and a homomorphism (between $M$ and $\mathbb{R}^{k}/\mathbb{Z}^{k}$ as abelian groups).

When $d=2$ , Lemma 6 simply says that the line $\ell:t\mapsto(\alpha_{1}t,\alpha_{2}t)+\mathbb{Z}^{2}$ , $t\in\mathbb{R}$ , either exhibits discrete periodic behavior and is thus homeomorphic to a 1-dimensional torus, which is the case if $k=1$ , i.e., $\alpha_{1}/\alpha_{2}$ is rational, or otherwise, if $k=2$ , i.e., when $\alpha_{1}/\alpha_{2}$ is irrational, $\ell$ is dense in the whole square, and so its closure is a $2$ -dimensional torus, namely $\mathbb{R}^{2}/\mathbb{Z}^{2}$ itself. This is illustrated in Figure 6. When $d\geq 3$ , the situation can be more complicated, as illustrated in Figure 7. Specifically, the torus $M$ obtained as the closure of the line $\ell:t\mapsto(\alpha_{1}t,\dots,\alpha_{d}t)+\mathbb{Z}^{d}$ , $t\in\mathbb{R}$ , may not occupy the entirety of $\mathbb{R}^{d}/\mathbb{Z}^{d}$ . In this case, Lemma 6 provides the precise dimension of $M$ , namely $k=\dim\langle\alpha_{1},\dots,\alpha_{d}\rangle_{\mathbb{Q}}$ . For the purpose of proving Theorem 2, it will suffice to consider the behavior of $\ell$ in a neighborhood of the point $\bm{0}+\mathbb{Z}^{d}\in T^{d}$ . Concretely, if $Q\in\mathbb{Q}^{d\times k}$ is the matrix representing $\alpha_{1},\dots,\alpha_{d}$ in the basis $\{\alpha_{1},\dots,\alpha_{k}\}$ , the following lemma states that, in a neighborhood of $\bm{0}$ , $\ell$ visits points arbitrarily close to the $k$ -dimensional subspace of $\mathbb{R}^{d}$ spanned by the columns of $Q$ .
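The dichotomy for $d=2$ can be observed numerically. The sketch below is an illustration only: it samples the line $t\mapsto(\alpha_{1}t,\alpha_{2}t)+\mathbb{Z}^{2}$ on a finite time window and counts which cells of a coarse grid on the unit square are visited. The function name, the grid size, the time window, and the two choices of $(\alpha_{1},\alpha_{2})$ are assumptions made for this example.

```python
import numpy as np


def visited_fraction(alpha1, alpha2, t_max=200.0, samples=200_000, grid=20):
    """Fraction of grid cells of the unit square hit by the sampled line."""
    t = np.linspace(0.0, t_max, samples)
    x = np.mod(alpha1 * t, 1.0)          # first coordinate modulo 1
    y = np.mod(alpha2 * t, 1.0)          # second coordinate modulo 1
    ix = np.minimum((x * grid).astype(int), grid - 1)
    iy = np.minimum((y * grid).astype(int), grid - 1)
    cells = set(zip(ix.tolist(), iy.tolist()))
    return len(cells) / grid**2


# Rational ratio alpha1/alpha2 = 1/2: the line closes up (k = 1).
frac_rational = visited_fraction(1.0, 2.0)
# Irrational ratio alpha1/alpha2 = 1/sqrt(2): the line fills the square (k = 2).
frac_irrational = visited_fraction(1.0, np.sqrt(2.0))
```

With a rational ratio the sampled points stay on a closed curve and touch only a small share of the cells, whereas with an irrational ratio essentially every cell is visited once the time window is long enough.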

Lemma 7.

Suppose that $\{\alpha_{j}\}_{j\hskip 1.42262pt=\hskip 1.42262pt1}^{d}$ are nonzero real numbers, and let $k=\dim\langle\alpha_{1},\dots,\alpha_{d}\rangle_{\mathbb{Q}}$ . Furthermore, assume that $\{\alpha_{j}\}_{j\hskip 1.42262pt=\hskip 1.42262pt1}^{k}$ is a basis for $\langle\alpha_{1},\dots,\alpha_{d}\rangle_{\mathbb{Q}}$ over $\mathbb{Q}$ , and let $Q=(Q_{pj})\in\mathbb{Q}^{d\times k}$ be the matrix such that $(\alpha_{1},\dots,\alpha_{d})=Q\cdot(\alpha_{1},\dots,\alpha_{k})$ . Then there exists an open set $C\subset\mathbb{R}^{k}$ with $\bm{0}\in C$ , such that, for every $\bm{s}=(s_{1},\dots,s_{k})\in C$ , there are sequences $(t^{n,\bm{s}})_{n\in\mathbb{N}}\subset\mathbb{R}$ and $(\bm{r}^{n,\bm{s}})_{n\in\mathbb{N}}=(r_{1}^{n,\bm{s}},\dots,r_{k}^{n,\bm{s}})_{n\in\mathbb{N}}\subset C$ with the following properties:

(i)

$(\alpha_{1}t^{n,\bm{s}},\alpha_{2}t^{n,\bm{s}},\dots,\alpha_{d}t^{n,\bm{s}})+\mathbb{Z}^{d}=Q\cdot(\alpha_{1}r_{1}^{n,\bm{s}},\dots,\alpha_{k}r_{k}^{n,\bm{s}})+\mathbb{Z}^{d}$ , for all $n\in\mathbb{N}$ ,

(ii)

$|t^{n,\bm{s}}|\to\infty$ as $n\to\infty$ ,

(iii)

$\bm{r}^{n,\bm{s}}\to\bm{s}$ in $\mathbb{R}^{k}$ , as $n\to\infty$ .
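As a concrete illustration of the matrix $Q$ (an example chosen here, not taken from the paper), consider $d=3$ with $\alpha=(1,\sqrt{2},1+\sqrt{2})$. Since $1$ and $\sqrt{2}$ are rationally independent, $k=2$ and $\{\alpha_{1},\alpha_{2}\}$ is a basis.

```latex
% Illustrative example (assumed values): d = 3, k = 2.
\[
  (\alpha_1,\alpha_2,\alpha_3) = \bigl(1,\ \sqrt{2},\ 1+\sqrt{2}\bigr),
  \qquad
  Q = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix},
  \qquad
  Q \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix}
  = \begin{pmatrix} 1 \\ \sqrt{2} \\ 1+\sqrt{2} \end{pmatrix}.
\]
% Property (i) of Lemma 7 then reads, modulo \mathbb{Z}^3:
\[
  \bigl(t,\ \sqrt{2}\,t,\ (1+\sqrt{2})\,t\bigr)
  \equiv
  \bigl(r_1,\ \sqrt{2}\,r_2,\ r_1+\sqrt{2}\,r_2\bigr).
\]
```

Here the third coordinate is determined by the first two, so the closure of the line is a $2$-dimensional torus inside $T^{3}$, in line with $k=2$.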

V Imaginary period and the self-avoiding property

We say that a holomorphic function $f:\mathcal{D}\to\mathbb{C}$ is $i$ -periodic if $f(z+i)=f(z)$ , for all $z\in\mathcal{D}$ . An example of such a function is the scaled hyperbolic tangent function $\tanh(\pi\,\cdot)$ . More generally, for an arbitrary discrete set $S\subset\mathbb{R}$ , and arbitrary $C\in\mathbb{R}$ and real sequence $\{c_{s}\}_{s\in S}\in\ell^{1}(S)$ , the function $\sigma=C+\sum_{s\in S}c_{s}\tanh(\pi(\,\cdot-s))$ is also $i$ -periodic, and in particular, the set of its poles $P$ has the structure $P=\bigcup_{n\in\mathbb{Z}}\left(S+\left(n+\frac{1}{2}\right)i\right)$ . We now introduce a property defined for discrete subsets of $\mathbb{R}$ , which will, when applied to the set $S$ , be the final technical ingredient in the proof of our main results.
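The $i$-periodicity and the pole structure can be checked numerically for a finite set $S$. The following is a minimal sketch under assumed values: the set `S`, the coefficients `COEFFS`, and the constant `C` are hypothetical and serve only as an illustration.

```python
import cmath
import math

# Hypothetical finite example of sigma = C + sum_s c_s * tanh(pi * (z - s)).
S = [-1.0, 0.25, 2.0]
COEFFS = [0.5, -1.5, 2.0]
C = 0.1


def sigma(z):
    """Evaluate sigma at a complex point z."""
    return C + sum(c * cmath.tanh(math.pi * (z - s)) for s, c in zip(S, COEFFS))


z = 0.3 + 0.2j
# tanh has period pi*i, so tanh(pi * .) has period i.
period_gap = abs(sigma(z + 1j) - sigma(z))
# Close to s + i/2 the modulus blows up, consistent with a pole there.
near_pole = abs(sigma(S[1] + 0.5j + 1e-6))
```

The value `period_gap` vanishes up to rounding, while `near_pole` is large because the evaluation point lies at distance $10^{-6}$ from the point $s+\frac{1}{2}i$ with $s=0.25$.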

Definition 18 (Self-avoiding set).

Let $S\subset\mathbb{R}$ be a discrete set. We say that $S$ is self-avoiding if, for every finite collection of distinct pairs $\{(\omega_{j},\theta_{j})\}_{j\hskip 1.42262pt=\hskip 1.42262pt1}^{m}\subset(2\mathbb{Z}+1)\times\mathbb{R}$ , there exist a $j^{*}\in\{1,\dots,m\}$ and a $t^{*}$ such that

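$$t^{*}\in\frac{S-\theta_{j^{*}}}{\omega_{j^{*}}}\setminus\bigcup_{j\neq j^{*}}\frac{S-\theta_{j}}{\omega_{j}}.$$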

Remark.

In other words, a set $S$ is self-avoiding if the union of a finite number of distinct copies of $S$ obtained by translating and scaling by an odd integer contains a real number which is an element of exactly one of the copies.
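A simple example of a set that fails this property is the integer lattice. The following worked example is an illustration added here; it writes $(S-\theta)/\omega$ for the copy of $S$ associated with the pair $(\omega,\theta)$.

```latex
% Illustration (not from the source): S = \mathbb{Z} is not self-avoiding.
% Take the two distinct pairs (\omega_1,\theta_1) = (1,0) and (\omega_2,\theta_2) = (1,1).
\[
  \frac{\mathbb{Z}-0}{1} \;=\; \mathbb{Z} \;=\; \frac{\mathbb{Z}-1}{1}.
\]
% The two copies coincide, so every element of their union lies in both copies,
% and no t^* belongs to exactly one of them.
```

Any set that is invariant under a nonzero translation fails in the same way, which is why the spacing of the elements of $S$ matters.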

Proposition 2.

Let $S=\{s_{k}:k\in\mathbb{Z}\}$ , $s_{k}-s_{k-1}>0$ , $\forall k\in\mathbb{Z}\,$ , be an infinite discrete set such that $\{s_{k}-s_{k-1}:k\in\mathbb{Z}\}$ is rationally independent. Then $S$ is self-avoiding.

Proof.

We use the shorthand notation $S_{\omega,\theta}=\frac{S-\theta}{\omega}$ . Suppose by way of contradiction that $A\subset(2\mathbb{Z}+1)\times\mathbb{R}$ , $\#A\geq 2$ , is a set of pairs such that, for every $(\omega,\theta)\in A$ and every $t\in S_{\omega,\theta}$ , there exists a pair $(\omega^{\prime},\theta^{\prime})\in A\setminus\{(\omega,\theta)\}$ such that $t\in S_{\omega^{\prime},\theta^{\prime}}$ . Fix a pair $(\omega_{1},\theta_{1})\in A$ . We then have, by assumption,

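$$S_{\omega_{1},\theta_{1}}\subset\bigcup_{(\omega,\theta)\in A\setminus\{(\omega_{1},\theta_{1})\}}S_{\omega,\theta}.$$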

Since $S$ is infinite, there exists a $(\omega_{2},\theta_{2})\in A\setminus\{(\omega_{1},\theta_{1})\}$ such that $\#(S_{\omega_{1},\theta_{1}}\cap S_{\omega_{2},\theta_{2}})\geq 3$ . Pick an arbitrary subset $\{t_{1}<t_{2}<t_{3}\}\subset S_{\omega_{1},\theta_{1}}\cap S_{\omega_{2},\theta_{2}}$ and note that there exist $k_{1}^{1},k_{2}^{1},k_{3}^{1}\in\mathbb{Z}$ and $k_{1}^{2},k_{2}^{2},k_{3}^{2}\in\mathbb{Z}$ such that

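$$t_{j}=\frac{s_{k_{j}^{1}}-\theta_{1}}{\omega_{1}}=\frac{s_{k_{j}^{2}}-\theta_{2}}{\omega_{2}},\quad j=1,2,3.\tag{8}$$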

Moreover, for $r=1,2$ , we have $k_{1}^{r}<k_{2}^{r}<k_{3}^{r}$ if $\omega_{r}>0$ and $k_{1}^{r}>k_{2}^{r}>k_{3}^{r}$ if $\omega_{r}<0$ . Define the index sets

[TABLE]

For brevity write $a_{k}=s_{k}-s_{k-1}$ , $\forall k\in\mathbb{Z}$ . We then have

[TABLE]

Now, since $\{a_{k}:k\in\mathbb{Z}\}$ is rationally independent and $|\omega_{1}|,|\omega_{2}|\in\mathbb{Z}$ , (9) implies $|\omega_{1}|=|\omega_{2}|$ and $K_{j}^{1}=K_{j}^{2}$ , for $j=1,2$ . In particular, $K_{j}^{1}=K_{j}^{2}$ , for $j=1,2$ , implies $\mathrm{sgn}(\omega_{1})=\mathrm{sgn}(\omega_{2})$ , so we have $\omega_{1}=\omega_{2}$ . Then, from the definition of $K_{j}^{r}$ , it follows that $k_{j}^{1}=k_{j}^{2}$ , for $j=1,2,3$ . We thus obtain from (8) that $\theta_{1}=\theta_{2}$ , contradicting $(\omega_{1},\theta_{1})\neq(\omega_{2},\theta_{2})$ . Therefore, our initial assumption was false, so we deduce that $S$ must be self-avoiding. ∎

The following proposition formalizes the notion that nonlinearities $\sigma$ of the form considered at the beginning of the section are dense in the set of sigmoidal nonlinearities, even after imposing the additional constraint that $S$ be self-avoiding.

Proposition 3.

Let $\rho$ be a piecewise $C^{1}$ nonlinearity with $\rho^{\prime}\in BV(\mathbb{R})\cap L^{1}(\mathbb{R})$ . Then, for every $\epsilon>0$ , there exist a discrete self-avoiding set $S\subset\mathbb{R}$ , a sequence $\{c_{s}\}_{s\in S}\in\ell^{1}(S)$ with $c_{s}\neq 0$ , for all $s\in S$ , and real numbers $\alpha>0$ and $C$ , such that the function $\sigma$ given by

$$\sigma=C+\sum_{s\in S}c_{s}\tanh\big(\alpha(\,\cdot-s)\big)$$

satisfies $\|\sigma-\rho\|_{L^{\infty}(\mathbb{R})}<\epsilon$ .

Proof.

First note that

[TABLE]

is a well-defined real number, as $\rho^{\prime}\in L^{1}(\mathbb{R})$ . Let $H$ denote the Heaviside step function. We now have, for all $x\in\mathbb{R}$ ,

[TABLE]

Denote $h_{\alpha}=\frac{1}{2}\left(1+\tanh(\alpha\,\cdot\,)\right)$ and consider the function $\rho_{\alpha}$ defined by

[TABLE]

We then have

[TABLE]

Now note that $\|\rho^{\prime}\|_{L^{\infty}(\mathbb{R})}<\infty$ as $\rho^{\prime}\in BV(\mathbb{R})$ , and $\|H-h_{\alpha}\|_{L^{1}(\mathbb{R})}\to 0$ as $\alpha\to\infty$ by dominated convergence, so there exists $\alpha>0$ such that $\|\rho-\rho_{\alpha}\|_{L^{\infty}(\mathbb{R})}<\frac{\epsilon}{3}$ . Let $b:\mathbb{Z}\to\mathbb{N}$ be a bijection, and $\beta\in(0,1)$ a parameter to be specified. Define the infinite discrete set $S_{\beta}=\{s_{k}^{\beta}:=\beta(k+\pi^{-b(k)}):k\in\mathbb{Z}\}\subset\mathbb{R}$ . Then, since $\pi$ is transcendental, Proposition 2 implies that $S_{\beta}$ is self-avoiding. Now, since $\rho^{\prime}$ is integrable on $\mathbb{R}$ and piecewise continuous, and $h_{\alpha}$ is bounded and continuous, we have that $\rho^{\prime}\cdot h_{\alpha}(x-\cdot)$ is integrable on $\mathbb{R}$ and piecewise continuous. Hence, as $\mathrm{mesh}(S_{\beta}):=\sup_{k\in\mathbb{Z}}|s_{k}^{\beta}-s_{k-1}^{\beta}|\to 0$ for $\beta\to 0$ , we have the following convergence of Riemann sums

[TABLE]

Therefore $\rho(-\infty)+\sum_{k\in\mathbb{Z}}(s_{k}^{\beta}-s_{k-1}^{\beta})\rho^{\prime}(s_{k}^{\beta})h_{\alpha}(\cdot-s_{k}^{\beta})\to\rho_{\alpha}$ pointwise. To upgrade this to convergence in $\|\cdot\|_{L^{\infty}(\mathbb{R})}$ , we proceed as follows. By the mean value theorem, for any $x\in\mathbb{R}$ and $\beta>0$ , there exist $y_{k}^{\beta,x}\in[s_{k-1}^{\beta},s_{k}^{\beta}]$ such that

[TABLE]

We can therefore write

[TABLE]

Since $\rho^{\prime}\in BV(\mathbb{R})$ by assumption, and $h_{\alpha}\in BV(\mathbb{R})$ by definition, the quantities in the parentheses are all finite. As they are moreover independent of $\beta$ , and $\mathrm{mesh}(S_{\beta})\to 0$ for $\beta\to 0$ , we can pick a $\beta>0$ such that

[TABLE]

where we used (10) to replace $\int_{\mathbb{R}}\rho^{\prime}(y)h_{\alpha}(x-y)\mathrm{d}y$ in (11) with $\rho_{\alpha}-\rho(-\infty)$ . Finally, let $\{d_{s}\}_{s\in S_{\beta}}$ be an arbitrary sequence of real numbers such that $\mathrm{mesh}(S_{\beta})\sum_{k\in\mathbb{Z}}|d_{s_{k}^{\beta}}|<\frac{\epsilon}{3}$ and, for each $s\in S_{\beta}$ , $d_{s}=0$ if and only if $\rho^{\prime}(s)\neq 0$ . We then have

[TABLE]

Now, combining the estimates (12), (13), and $\|\rho-\rho_{\alpha}\|_{L^{\infty}(\mathbb{R})}<\frac{\epsilon}{3}$ yields

[TABLE]

so the claim of the proposition holds with $S=S_{\beta}$ , $c_{s_{k}^{\beta}}=\frac{1}{2}(s_{k}^{\beta}-s_{k-1}^{\beta})(\rho^{\prime}(s_{k}^{\beta})+d_{s_{k}^{\beta}})$ , and $C=\rho(-\infty)+\sum_{k\in\mathbb{Z}}c_{s_{k}^{\beta}}$ . ∎
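The construction in the proof can be tried out numerically. The sketch below is an illustration under assumptions: it takes the logistic function as $\rho$ (so $\rho(-\infty)=0$ and $\rho^{\prime}\neq 0$ everywhere, hence $d_{s}=0$), truncates the index set to $|k|\leq K$, and uses hand-picked values of $\alpha$ and $\beta$. In floating point the perturbation $\pi^{-b(k)}$ underflows to zero for large $b(k)$, so the code illustrates the approximation quality only, not the self-avoiding property.

```python
import numpy as np


def rho(x):
    """Logistic sigmoid, used here as the target nonlinearity."""
    return 1.0 / (1.0 + np.exp(-x))


def rho_prime(x):
    r = rho(x)
    return r * (1.0 - r)


ALPHA, BETA, K = 20.0, 0.02, 1000     # illustrative choices, not from the paper

k = np.arange(-K - 1, K + 1)
# A bijection b: Z -> N (assumed form): positive k -> even, the rest -> odd.
b = np.where(k >= 1, 2 * k, 1 - 2 * k).astype(float)
s = BETA * (k + np.pi ** (-b))        # s_k = beta * (k + pi^{-b(k)})

gaps = np.diff(s)                     # s_k - s_{k-1}
nodes = s[1:]
c = 0.5 * gaps * rho_prime(nodes)     # c_s = (1/2)(s_k - s_{k-1}) rho'(s_k)
C = 0.0 + c.sum()                     # C = rho(-inf) + sum of c_s


def sigma(x):
    """sigma = C + sum_s c_s * tanh(alpha * (x - s)) on the truncated set."""
    x = np.asarray(x, dtype=float)
    return C + (c * np.tanh(ALPHA * (x[..., None] - nodes))).sum(axis=-1)


x_grid = np.linspace(-10.0, 10.0, 401)
sup_error = np.max(np.abs(sigma(x_grid) - rho(x_grid)))
```

For these parameter values the sup-norm error on the grid is small; decreasing $\beta$ and increasing $\alpha$ is expected to reduce it further, in line with the estimates in the proof.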

VI The main theorems

Theorem 3.

Let $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ be non-degenerate clones-free LFNNs with the same input and output sets $V_{in}$ and $V_{out}$ . Let

$$\sigma=C+\sum_{s\in S}c_{s}\tanh\big(\pi(\,\cdot-s)\big),$$

where $C\in\mathbb{R}$ , $S$ is a discrete self-avoiding set, and $\{c_{s}\}_{s\in S}\in\ell^{1}(S)$ are all nonzero and real. Suppose that $\left\langle{\mathcal{N}_{1}}\right\rangle^{\sigma}\!(\bm{t})=\left\langle{\mathcal{N}_{2}}\right\rangle^{\sigma}\!(\bm{t})$ , for all $\bm{t}\in\mathbb{R}^{V_{in}}$ . Then $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ are faithfully isomorphic.

Theorem 4.

Let $\mathcal{N}_{j}$ , $j\in\{1,2,\dots,n\}$ , be non-degenerate clones-free LFNNs with the same input set $V_{in}$ and the same single output node $\{v_{out}\}$ . Furthermore, suppose that no two networks $\mathcal{N}_{j_{1}}$ , $\mathcal{N}_{j_{2}}$ , $j_{1}\neq j_{2}$ , are extensionally isomorphic. Consider the nonlinearity

$$\sigma=C+\sum_{s\in S}c_{s}\tanh\big(\pi(\,\cdot-s)\big),$$

with $C\in\mathbb{R}$ , $S$ a discrete self-avoiding set, and $\{c_{s}\}_{s\in S}\in\ell^{1}(S)$ , where each $c_{s}$ is nonzero and real. Then $\{\left\langle{\mathcal{N}_{j}}\right\rangle^{\sigma}\}_{j\hskip 1.42262pt=\hskip 1.42262pt1}^{n}\cup\{\bm{1}\}$ is a linearly independent set of functions from $\mathbb{R}^{V_{in}}$ to $\mathbb{R}$ .

Before embarking on the proofs of Theorems 3 and 4, we show how Theorems 1 and 2 follow from these two results together with Proposition 3.

Proof of Theorem 1.

Let $\rho$ be as in the statement of Theorem 1, and let $\epsilon>0$ be arbitrary. Proposition 3 guarantees the existence of a discrete self-avoiding set $S\subset\mathbb{R}$ , a sequence $\{c_{s}\}_{s\in S}\in\ell^{1}(S)$ with $c_{s}\neq 0$ , for all $s\in S$ , and real numbers $\alpha>0$ and $C$ , such that the function $\sigma$ defined by

$$\sigma=C+\sum_{s\in S}c_{s}\tanh\big(\alpha(\,\cdot-s)\big)$$

satisfies $\|\sigma-\rho\|_{L^{\infty}(\mathbb{R})}<\epsilon$ . Now suppose that $\mathcal{N}=(V,E,V_{in},\allowbreak V_{out},\Omega,\Theta)$ and $\widetilde{\mathcal{N}}=(\widetilde{V},\widetilde{E},{V}_{in},\allowbreak{V}_{out},\widetilde{\Omega},\widetilde{\Theta})$ are clones-free non-degenerate LFNNs with the same input set $V_{in}$ and such that $\langle{\mathcal{N}}\rangle^{\sigma}(x)=\langle{\widetilde{\mathcal{N}}}\rangle^{\sigma}(x)$ , for all $x\in\mathbb{R}^{V_{in}}$ . Consider the scaled objects $\sigma_{\alpha}:=\sigma\left(\frac{\pi}{\alpha}\,\cdot\right)$ , $S_{\alpha}=\frac{\alpha}{\pi}S$ , $\mathcal{N^{\alpha}}=\big{(}V,E,V_{in},V_{out},\allowbreak\frac{\alpha}{\pi}\Omega,\frac{\alpha}{\pi}\Theta\big{)}$ , and $\widetilde{\mathcal{N}}^{\alpha}=\big{(}\widetilde{V},\widetilde{E},{V}_{in},{V}_{out},\frac{\alpha}{\pi}\widetilde{\Omega},\frac{\alpha}{\pi}\widetilde{\Theta}\big{)}$ , where $\frac{\alpha}{\pi}\Omega=\left\{\frac{\alpha}{\pi}\omega:\omega\in\Omega\right\}$ , and $\frac{\alpha}{\pi}\Theta,\frac{\alpha}{\pi}\widetilde{\Omega},\frac{\alpha}{\pi}\widetilde{\Theta}$ are defined analogously. Then $\langle{\mathcal{N}^{\alpha}}\rangle^{\sigma_{\alpha}}(x)=\langle{\mathcal{N}}\rangle^{\sigma}(x)=\langle{\widetilde{\mathcal{N}}}\rangle^{\sigma}(x)=\langle{\widetilde{\mathcal{N}}^{\alpha}}\rangle^{\sigma_{\alpha}}(x)$ , for all $x\in\mathbb{R}^{V_{in}}$ . Moreover,

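$$\sigma_{\alpha}=C+\sum_{s\in S}c_{s}\tanh\Big(\pi\Big(\,\cdot-\frac{\alpha}{\pi}s\Big)\Big)=C+\sum_{s^{\prime}\in S_{\alpha}}c_{\frac{\pi}{\alpha}s^{\prime}}\tanh\big(\pi(\,\cdot-s^{\prime})\big),$$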

and $S_{\alpha}$ is a discrete self-avoiding set (as the self-avoiding property is preserved under scaling by a nonzero real number), so by Theorem 3 we obtain $\mathcal{N}^{\alpha}\stackrel{{\scriptstyle f}}{{\sim}}\widetilde{\mathcal{N}}^{\alpha}$ , which implies $\mathcal{N}\simeq\widetilde{\mathcal{N}}$ . ∎
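The rescaling step can be sanity-checked numerically. The sketch below is an illustration under assumptions: it uses a small fully connected layered network in which every non-input node applies the nonlinearity to an affine function of the previous layer, a finite hypothetical set `S`, and random weights. It compares the original network under $\sigma$ with the rescaled network under $\sigma_{\alpha}$, and it compares $\sigma_{\alpha}$ with its form as a sum of shifted $\tanh(\pi\,\cdot)$ terms.

```python
import numpy as np

# Hypothetical finite example of sigma = C + sum_s c_s * tanh(alpha * (x - s)).
S = np.array([-0.7, 0.2, 1.3])
COEFFS = np.array([0.4, -1.1, 0.9])
C, ALPHA = 0.05, 3.0


def sigma(x):
    x = np.asarray(x, dtype=float)
    return C + (COEFFS * np.tanh(ALPHA * (x[..., None] - S))).sum(axis=-1)


def sigma_alpha(x):
    """sigma_alpha := sigma((pi / alpha) * x)."""
    return sigma(np.pi / ALPHA * np.asarray(x, dtype=float))


def sigma_alpha_shifted(x):
    """Same function written with tanh(pi * (x - s')), s' in (alpha / pi) * S."""
    x = np.asarray(x, dtype=float)
    shifted = ALPHA / np.pi * S
    return C + (COEFFS * np.tanh(np.pi * (x[..., None] - shifted))).sum(axis=-1)


def forward(weights, biases, nonlinearity, x):
    """Layered network: each layer applies the nonlinearity to W h + theta."""
    h = x
    for W, theta in zip(weights, biases):
        h = nonlinearity(W @ h + theta)
    return h


rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 2)), rng.normal(size=(1, 4))]
biases = [rng.normal(size=4), rng.normal(size=1)]
scale = ALPHA / np.pi
scaled_weights = [scale * W for W in weights]
scaled_biases = [scale * theta for theta in biases]

x = rng.normal(size=2)
out = forward(weights, biases, sigma, x)
out_scaled = forward(scaled_weights, scaled_biases, sigma_alpha, x)
```

Both comparisons agree up to rounding, which reflects that scaling all weights and biases by $\frac{\alpha}{\pi}$ exactly compensates the change of nonlinearity from $\sigma$ to $\sigma_{\alpha}$.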

Proof of Theorem 2.

Let $\rho$ be as in the statement of Theorem 2, and let $\epsilon>0$ be arbitrary. Proposition 3 guarantees the existence of a discrete self-avoiding set $S\subset\mathbb{R}$ , a sequence $\{c_{s}\}_{s\in S}\in\ell^{1}(S)$ with $c_{s}\neq 0$ , for all $s\in S$ , and real numbers $\alpha>0$ and $C$ , such that the function $\sigma$ defined by

$$\sigma=C+\sum_{s\in S}c_{s}\tanh\big(\alpha(\,\cdot-s)\big)$$

satisfies $\|\sigma-\rho\|_{L^{\infty}(\mathbb{R})}<\epsilon$ . Now suppose that $\mathcal{N}_{j}=(V^{j},E^{j},V_{in},\allowbreak\{v_{out}\},\Omega^{j},\Theta^{j})$ , $j\in\{1,\dots,n\}$ , are non-degenerate clones-free LFNNs such that no two $\mathcal{N}_{j_{1}}$ , $\mathcal{N}_{j_{2}}$ , $j_{1}\neq j_{2}$ , are faithfully isomorphic. As $\{v_{out}\}$ is a singleton, it follows that no two $\mathcal{N}_{j_{1}}$ , $\mathcal{N}_{j_{2}}$ , $j_{1}\neq j_{2}$ , are extensionally isomorphic either. Now, define the scaled objects $\sigma_{\alpha}:=\sigma\left(\frac{\pi}{\alpha}\,\cdot\right)$ , $S_{\alpha}=\frac{\alpha}{\pi}S$ , and $\mathcal{N}^{\alpha}_{j}=\left(V^{j},E^{j},V_{in},\{v_{out}\},\frac{\alpha}{\pi}\Omega^{j},\frac{\alpha}{\pi}\Theta^{j}\right)$ , for $j\in\{1,\dots,n\}$ , where $\frac{\alpha}{\pi}\Omega^{j}=\left\{\frac{\alpha}{\pi}\omega:\omega\in\Omega^{j}\right\}$ and $\frac{\alpha}{\pi}\Theta^{j}=\left\{\frac{\alpha}{\pi}\theta:\theta\in\Theta^{j}\right\}$ . Then the $\mathcal{N}_{j}^{\alpha}$ are non-degenerate and clones-free, and no two $\mathcal{N}_{j_{1}}^{\alpha}$ , $\mathcal{N}_{j_{2}}^{\alpha}$ , $j_{1}\neq j_{2}$ , are extensionally isomorphic. Moreover,

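$$\sigma_{\alpha}=C+\sum_{s\in S}c_{s}\tanh\Big(\pi\Big(\,\cdot-\frac{\alpha}{\pi}s\Big)\Big)=C+\sum_{s^{\prime}\in S_{\alpha}}c_{\frac{\pi}{\alpha}s^{\prime}}\tanh\big(\pi(\,\cdot-s^{\prime})\big),$$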

and $S_{\alpha}$ is a discrete self-avoiding set, so by Theorem 4 we obtain that $\{\langle{\mathcal{N}_{j}^{\alpha}}\rangle^{\sigma_{\alpha}}\}_{j\hskip 1.42262pt=\hskip 1.42262pt1}^{n}\cup\{\bm{1}\}$ is linearly independent. Now, suppose by way of contradiction that there is a nontrivial linear dependency $\lambda_{0}+\sum_{j=1}^{n}\lambda_{j}\,\langle{\mathcal{N}_{j}}\rangle^{\sigma}=0$ among $\{\left\langle{\mathcal{N}_{j}}\right\rangle^{\sigma}\}_{j\hskip 1.42262pt=\hskip 1.42262pt1}^{n}\cup\{\bm{1}\}$ . But then

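$$\lambda_{0}+\sum_{j=1}^{n}\lambda_{j}\,\langle{\mathcal{N}_{j}^{\alpha}}\rangle^{\sigma_{\alpha}}=\lambda_{0}+\sum_{j=1}^{n}\lambda_{j}\,\langle{\mathcal{N}_{j}}\rangle^{\sigma}=0,$$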

which contradicts the linear independence of $\{\langle{\mathcal{N}_{j}^{\alpha}}\rangle^{\sigma_{\alpha}}\}_{j\hskip 1.42262pt=\hskip 1.42262pt1}^{n}\cup\{\bm{1}\}$ . We deduce that $\{\left\langle{\mathcal{N}_{j}}\right\rangle^{\sigma}\}_{j\hskip 1.42262pt=\hskip 1.42262pt1}^{n}\cup\{\bm{1}\}$ must be linearly independent, as desired. ∎

Proof of Theorem 4.

We argue by contradiction, so suppose that the statement is false. Specifically, let $\mathcal{N}_{j}$ , $j\in\{1,2,\dots,n\}$ , be LFNNs and $\sigma$ a nonlinearity as in the statement of the theorem, and suppose that $\{\left\langle{\mathcal{N}_{j}}\right\rangle^{\sigma}\}_{j\hskip 1.42262pt=\hskip 1.42262pt1}^{n}\cup\{\bm{1}\}$ is linearly dependent. Then, by Lemma 1, there exists a non-degenerate clones-free LFNN $\mathcal{M}=(V^{\mathcal{M}},E^{\mathcal{M}},V_{in}^{\mathcal{M}},V_{out}^{\mathcal{M}},\Omega^{\mathcal{M}},\Theta^{\mathcal{M}})$ with a single input node $V_{in}^{\mathcal{M}}=\{v_{in}\}$ , such that $\{\left\langle{w}\right\rangle^{\sigma}:w\in V_{out}^{\mathcal{M}}\}\cup\{\bm{1}\}$ is a linearly dependent set of functions from $\mathbb{R}$ to $\mathbb{R}$ . Let $\mathscr{M}$ denote the set of all non-degenerate clones-free LFNNs $\widetilde{\mathcal{M}}=(V^{\widetilde{\mathcal{M}}},E^{\widetilde{\mathcal{M}}},\{v_{in}\},V_{out}^{\widetilde{\mathcal{M}}},\allowbreak\Omega^{\widetilde{\mathcal{M}}},\Theta^{\widetilde{\mathcal{M}}})$ such that $\{\left\langle{w}\right\rangle^{\sigma}:w\in V_{out}^{\widetilde{\mathcal{M}}}\}\allowbreak\cup\{\bm{1}\}$ is linearly dependent. We then have $\mathscr{M}\neq\varnothing$ , simply as $\mathcal{M}\in\mathscr{M}$ . Denote by $\mathscr{M}_{min}$ the set of all networks in $\mathscr{M}$ of minimum depth, and fix a network $\mathcal{M}^{\prime}\in\mathscr{M}_{min}$ with the minimal number of nodes among all the networks in $\mathscr{M}_{min}$ . The proof proceeds by constructing a network $\mathcal{N}\in\mathscr{M}_{min}$ with a strictly smaller number of nodes than $\mathcal{M}^{\prime}$ , thereby deriving a contradiction and concluding the proof. 
First note that linear dependence of $\{\left\langle{w}\right\rangle^{\sigma}:w\in V_{out}^{\mathcal{M}^{\prime}}\}\cup\{\bm{1}\}$ is equivalent to the existence of a nonzero set of real numbers $\{\lambda_{w}\}_{w\in V_{out}^{\mathcal{M}^{\prime}}}$ and a real number $c\in\mathbb{R}$ such that $h_{out}:\mathbb{R}\to\mathbb{R}$ , given by

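$$h_{out}(t)=\sum_{w\in V_{out}^{\mathcal{M}^{\prime}}}\lambda_{w}\left\langle{w}\right\rangle^{\sigma}(t),\quad t\in\mathbb{R},$$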

is constant-valued, i.e., $h_{out}(t)=c$ , for all $t\in\mathbb{R}$ . Note that $\lambda_{w}\neq 0$ , for all $w\in V_{out}^{\mathcal{M}^{\prime}}$ , for otherwise the ancestor subnetwork $\mathcal{M^{\prime}}\left(\{w\in V_{out}^{\mathcal{M}^{\prime}},\,\lambda_{w}\neq 0\}\right)$ would be an element of $\mathscr{M}_{min}$ with strictly fewer nodes than $\mathcal{M}^{\prime}$ , contradicting the minimality of $\mathcal{M}^{\prime}$ .

Next, note that $\sigma$ is a real meromorphic function whose set of poles is

$$P=\bigcup_{n\in\mathbb{Z}}\left(S+\left(n+\tfrac{1}{2}\right)i\right),$$

and in particular, $\mathcal{M}^{\prime}$ and $\sigma$ satisfy the assumptions of Lemma 3, and so the sets $\mathbb{C}\setminus\mathcal{D}_{\left\langle{w}\right\rangle^{\sigma}}$ are closed and countable, where $\mathcal{D}_{\left\langle{w}\right\rangle^{\sigma}}$ denotes the natural domain of $\left\langle{w}\right\rangle^{\sigma}$ , for $w\in V_{out}^{\mathcal{M}^{\prime}}$ . Therefore, as a linear combination of holomorphic functions, $h_{out}$ is a holomorphic function on $\mathcal{D}_{h_{out}}\vcentcolon=\bigcap_{w\in V_{out}^{\mathcal{M}^{\prime}}}\mathcal{D}_{\left\langle{w}\right\rangle^{\sigma}}$ . As $\mathbb{C}\setminus\mathcal{D}_{\left\langle{w}\right\rangle^{\sigma}}$ are closed and countable, $\mathbb{C}\setminus\mathcal{D}_{h_{out}}$ is also closed and countable, and therefore $\mathcal{D}_{h_{out}}$ is a connected open set. It follows by the identity theorem [18, Thm. 10.18] that $h_{out}$ continues in a unique fashion to a holomorphic function on $\mathcal{D}_{h_{out}}$ with $h_{out}(t)=c$ , for all $t\in\mathcal{D}_{h_{out}}$ .

Set $V_{\ell}=\{v\in V^{\mathcal{M}^{\prime}}:\mathrm{lv}(v)=\ell\}$ , for $\ell\geq 1$ . Let $k=\dim\left\langle\{\omega_{uv_{in}}:u\in V_{1}\}\right\rangle_{\mathbb{Q}}$ and enumerate the nodes $V_{1}=\{v_{1}^{1},\dots,v^{1}_{D_{1}}\}$ so that $\{\omega_{v_{1}^{1}v_{in}},\dots,\omega_{v_{k}^{1}v_{in}}\}$ is a basis for $\langle\omega_{v_{1}^{1}v_{in}},\dots,\allowbreak\omega_{v_{D_{1}}^{1}v_{in}}\rangle_{\mathbb{Q}}$ . In the remainder of the proof, we distinguish between the cases $k\geq 2$ and $k=1$ .

The case $k\geq 2$ . Fix a real number

[TABLE]

chosen so that none of $\left\langle{v_{p}^{1}}\right\rangle^{\sigma}(z)=\sigma(\omega_{v_{p}^{1}v_{in}}z+\theta_{v_{p}^{1}})$ , $p\in\{1,\dots,D_{1}\}$ , has singularities along $A+i\,\mathbb{R}$ . Such a number always exists, as $\bigcup_{p\hskip 1.42262pt=\hskip 1.42262pt1}^{D_{1}}{(S-\theta_{v_{p}^{1}})}/{\omega_{v_{p}^{1}v_{in}}}$ is a discrete set. Now, write $(\omega_{v_{p}^{1}v_{in}})_{p\hskip 1.42262pt=\hskip 1.42262pt1}^{D_{1}}={Q}\cdot(\omega_{v_{p}^{1}v_{in}})_{p\hskip 1.42262pt=\hskip 1.42262pt1}^{{k}}$ , where ${Q}=({q}_{pj})\in\mathbb{Q}^{D_{1}\times k}$ is a rational matrix whose first ${k}$ rows form a ${k}\times{k}$ identity matrix. Let $C\subset\mathbb{R}^{k}$ be a set satisfying the conclusion of Lemma 7 applied with $\alpha_{p}=\omega_{v_{p}^{1}v_{in}}$ , $p\in\{1,\dots,D_{1}\}$ . Given an arbitrary $\bm{s}=(s_{1},s_{2},\dots,s_{{k}})\in C$ , Lemma 7 yields sequences $(t^{n,\bm{s}})_{n\in\mathbb{N}}\subset\mathbb{R}$ and $(\bm{r}^{n,\bm{s}})_{n\in\mathbb{N}}\subset{C}$ such that

[TABLE]

We now perform a calculation that will enable us to interpret the single input variable of $\mathcal{M}^{\prime}$ as a rational linear combination of $k$ input variables of another LFNN $\mathcal{M}^{\prime\prime}$ , to be specified below. The argument will then proceed by anchoring at all but one of the inputs of $\mathcal{M}^{\prime\prime}$ . It is this last step that uses $k\geq 2$ as a key assumption, as anchoring requires at least two input nodes to be meaningful. We thus have

[TABLE]

for $p\in\{1,\dots,D_{1}\}$ , where in (19) we used the $i$ -periodicity of $\sigma$ , in (20) we used (16), and in (21) we used $\omega_{v_{p}^{1}v_{in}}=\sum_{j=1}^{k}{q}_{pj}\,\omega_{v_{j}^{1}v_{in}}$ and the $i$ -periodicity of $\sigma$ again. Owing to (15), none of $\left\langle{v_{p}^{1}}\right\rangle^{\sigma}$ , $p\in\{1,\dots,D_{1}\}$ , has singularities along $A+i\,\mathbb{R}$ , and thus all the quantities in (19) – (21) are well-defined. The calculation just presented suggests constructing a new LFNN by “splitting” the input node $v_{in}$ of $\mathcal{M}^{\prime}$ into $k$ new input nodes. Formally, we define an LFNN $\mathcal{M}^{\prime\prime}=(V^{\mathcal{M}^{\prime\prime}},E^{\mathcal{M}^{\prime\prime}},V_{in}^{\mathcal{M}^{\prime\prime}},V_{out}^{\mathcal{M}^{\prime\prime}},\Omega^{\mathcal{M}^{\prime\prime}},\Theta^{\mathcal{M}^{\prime\prime}})$ as follows:

– $V_{in}^{{\mathcal{M}^{\prime\prime}}}=\{u_{1},\dots,u_{k}\}$ is a set of $k$ newly created input nodes (disjoint from $V^{\mathcal{M}^{\prime}}$ ),

– $V^{{\mathcal{M}^{\prime\prime}}}\vcentcolon=V_{in}^{{\mathcal{M}^{\prime\prime}}}\cup\bigcup_{\ell\geq 1}V_{\ell}$ ,

– $E^{{\mathcal{M}^{\prime\prime}}}\vcentcolon=\{(v,\widetilde{v})\in E^{\mathcal{M}^{\prime}}:\;\mathrm{lv}(v)\geq 1\}\cup\{(u_{j},v_{p}^{1}):1\leq p\leq D_{1},\,1\leq j\leq k,\,q_{pj}\neq 0\},$

– $V_{out}^{{\mathcal{M}^{\prime\prime}}}\vcentcolon=V_{out}^{{\mathcal{M}^{\prime}}}$ ,

– define $\omega_{v_{p}^{1}u_{j}}:=q_{pj}\,\omega_{v_{j}^{1}v_{in}}$ , for $p\in\{1,\dots,D_{1}\}$ , $j\in\{1,\dots,k\}$ , and let

[TABLE]

– $\Theta^{{\mathcal{M}^{\prime\prime}}}:=\Theta^{\mathcal{M}^{\prime}}$ .

The procedure for constructing ${\mathcal{M}^{\prime\prime}}$ for a given $\mathcal{M}^{\prime}$ is illustrated in Figure 8.
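As a numerical sanity check of the construction above, the following Python sketch builds the split first-layer weights $\omega_{v_{p}^{1}u_{j}}=q_{pj}\,\omega_{v_{j}^{1}v_{in}}$ for a small made-up example and confirms that feeding the same value $z$ to all $k$ new inputs reproduces the first-layer outputs of $\mathcal{M}^{\prime}$ at $z$. The nonlinearity $\tanh(\pi z)$ (chosen only because it is $i$-periodic), the weights, the biases, and the matrix `Q` are illustrative assumptions and are not taken from the paper.

```python
import cmath
import math

def sigma(z):
    # Assumed stand-in nonlinearity: tanh(pi*z) has period i, since tanh has period i*pi.
    return cmath.tanh(math.pi * z)

# Hypothetical first layer of M' (D_1 = 3 nodes, a single input v_in).
SQRT2 = math.sqrt(2.0)
omega = [1.0, SQRT2, 3.0 + 0.5 * SQRT2]   # weights omega_{v_p^1 v_in}
theta = [0.1, -0.3, 0.7]                  # biases theta_{v_p^1}

# {1, sqrt(2)} is linearly independent over Q, so k = 2 and
# omega[p] = sum_j Q[p][j] * omega[j], with an identity block in the first k rows.
K = 2
Q = [[1.0, 0.0],
     [0.0, 1.0],
     [3.0, 0.5]]

def first_layer_original(z):
    """Outputs of the first-layer nodes of M' at the single input z."""
    return [sigma(w * z + b) for w, b in zip(omega, theta)]

def first_layer_split(zs):
    """Outputs of the same nodes in M'' at the inputs (z_1, ..., z_k)."""
    outs = []
    for p in range(len(omega)):
        # Edge weight from u_j to v_p^1 is Q[p][j] * omega[j].
        pre = sum(Q[p][j] * omega[j] * zs[j] for j in range(K)) + theta[p]
        outs.append(sigma(pre))
    return outs

def max_gap(z):
    """Largest discrepancy between M' at z and M'' at (z, ..., z)."""
    a = first_layer_original(z)
    b = first_layer_split([z] * K)
    return max(abs(x - y) for x, y in zip(a, b))

if __name__ == "__main__":
    z = 0.3 + 0.2j
    print("diagonal gap:", max_gap(z))
    print("periodicity gap:", abs(sigma(z + 1j) - sigma(z)))
```

Both printed gaps should be at the level of floating-point round-off. Off the diagonal, the $k$ inputs of the split network can be varied independently, which is the extra freedom the anchoring argument relies on.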

Owing to (19) – (21) and the construction of ${\mathcal{M}^{\prime\prime}}$ , we have the following “input splitting” relationship

[TABLE]

for $p\in\{1,\dots,D_{1}\}$ .

We now show that ${\mathcal{M}^{\prime\prime}}$ is non-degenerate and clones-free. To this end, first note that, for every $j\in\{1,\dots,k\}$ , there exists a $w\in V_{out}^{\mathcal{M}^{\prime}}$ such that $v_{j}^{1}\in V^{\mathcal{M}^{\prime}(w)}$ , by non-degeneracy of $\mathcal{M}^{\prime}$ , and as $u_{j}\in\mathrm{par}(v_{j}^{1})$ , we have $u_{j}\in V^{\mathcal{M}^{\prime\prime}(w)}$ . This establishes non-degeneracy. Next, we observe that a clone pair in $\mathcal{M}^{\prime\prime}$ would have to consist of nodes in $\{v_{1}^{1},v_{2}^{1},\dots,v_{D_{1}}^{1}\}$ , as a clone pair in $\mathcal{M}^{\prime\prime}$ consisting only of nodes in $\bigcup_{\ell\geq 2}V_{\ell}$ would also be a clone pair in $\mathcal{M}^{\prime}$ . Thus, by way of contradiction, suppose that $(v_{p_{1}}^{1},v_{p_{2}}^{1})$ , $1\leq p_{1}<p_{2}\leq D_{1}$ , is a clone pair in $\mathcal{M}^{\prime\prime}$ . Then $\theta_{v_{p_{1}}^{1}}=\theta_{v_{p_{2}}^{1}}$ and $\omega_{v_{p_{1}}^{1}v_{in}}=\sum_{j=1}^{k}{q}_{p_{1}j}\,\omega_{v_{j}^{1}v_{in}}=\sum_{j=1}^{k}{q}_{p_{2}j}\,\omega_{v_{j}^{1}v_{in}}=\omega_{v_{p_{2}}^{1}v_{in}}$ , so $(v_{p_{1}}^{1},v_{p_{2}}^{1})$ is a clone pair in $\mathcal{M}^{\prime}$ , which stands in contradiction to the no-clones property of $\mathcal{M}^{\prime}$ , and hence establishes that $\mathcal{M}^{\prime\prime}$ is clones-free.

We now revisit the constant-valued function $h_{out}(t)=\sum_{w\in V_{out}^{\mathcal{M}^{\prime}}}\lambda_{w}\left\langle{w}\right\rangle^{\sigma,\,\mathcal{M}^{\prime}}(t)=c$ , for all $t\in\mathcal{D}_{h_{out}}$ . Examining the structure of $\mathcal{M}^{\prime}$ , we see that, for each $w\in V_{out}^{\mathcal{M}^{\prime}}$ , we can write

[TABLE]

where $F_{w}$ corresponds to the map realized by the LFNN with nodes

[TABLE]

inputs $\{v_{1}^{1},\dots,v_{D_{1}}^{1}\}$ , output $\{w\}$ , and edges, weights, and biases inherited from $\mathcal{M}^{\prime}$ . As $F_{w}$ is the map realized by a node of a GFNN according to Definition 12, it is holomorphic on its natural domain $\mathcal{D}_{F_{w}}\subset\mathbb{C}^{D_{1}}$ containing $\mathbb{R}^{D_{1}}$ . We can therefore write

[TABLE]

where $F:\mathcal{D}_{F}\to\mathbb{C}$ , $F=\sum_{w\in V_{out}^{\mathcal{M}^{\prime}}}\lambda_{w}\,F_{w}$ , is holomorphic on $\mathcal{D}_{F}\vcentcolon=\bigcap_{w\in V_{out}^{\mathcal{M}^{\prime}}}\mathcal{D}_{F_{w}}\supset\mathbb{R}^{D_{1}}$ .

Now, by definition of natural domain, for each $w\in V_{out}^{\mathcal{M}^{\prime\prime}}$ , we have

[TABLE]

where the variables $z_{1},\dots,z_{k}$ correspond to the input nodes $u_{1},\dots,u_{k}$ , respectively. Therefore, for $(z_{1},\dots,z_{k})$ in the open domain $\mathcal{D}_{\widetilde{h}_{out}}\vcentcolon=\bigcap_{w\in V_{out}^{\mathcal{M}^{\prime\prime}}}\mathcal{D}_{\left\langle{w}\right\rangle^{\sigma,\,\mathcal{M}^{\prime\prime}}}$ , we can define the function $\widetilde{h}_{out}:\mathcal{D}_{\widetilde{h}_{out}}\to\mathbb{C}$ according to

[TABLE]

Moreover, as $\mathcal{M}^{\prime}$ and $\mathcal{M}^{\prime\prime}$ share the nodes in (23), as well as the associated edges, weights, and biases, we have

[TABLE]

for all $w\in V_{out}^{\mathcal{M}^{\prime\prime}}$ , and thus

[TABLE]

We are now in a position to show that, like $h_{out}$ , the function $\widetilde{h}_{out}$ is constant valued. As this will be effected by an analytic continuation argument through Lemma 4, we first need to ensure that the relevant quantities lie in $\mathcal{D}_{\widetilde{h}_{out}}$ . To this end, as $\langle{v_{p}^{1}}\rangle^{\sigma,\,\mathcal{M}^{\prime\prime}}(z_{1},\dots,z_{k})\in\mathbb{R}$ , for all $(z_{1},\dots,z_{k})\in\mathbb{R}^{k}$ , $p\in\{1,\dots,D_{1}\}$ , and $\mathcal{D}_{F}$ is an open set containing $\mathbb{R}^{D_{1}}$ , we can choose a small enough $\delta>0$ so that $\mathcal{D}_{\widetilde{h}_{out}}\supset D^{\circ}_{k}((A,\dots,A),\delta)$ . Now, fix an arbitrary $\bm{s}=(s_{1},\dots,s_{k})$ in the smaller open set $C\cap D^{\circ}_{{k}}(\bm{0},\delta)$ . We then have

[TABLE]

and since

[TABLE]

as $n\to\infty$ , we obtain

[TABLE]

for large enough $n\in\mathbb{N}$ . We may assume w.l.o.g. that this is true for all $n\in\mathbb{N}$ by discarding finitely many elements of the sequence $(\bm{r}^{n,\bm{s}})_{n\in\mathbb{N}}$ . Now, we use (22), (24), and (25) to get

[TABLE]

Define the set

[TABLE]

and note that $\mathrm{cl}(T)\supset\left((A,\dots,A)+(i\,C)\cap D^{\circ}_{k}(0,\delta)\right),$ so it follows by Lemma 4 that ${\widetilde{h}_{out}-c}\equiv 0$ everywhere in a neighborhood of $\mathbb{R}^{k}$ , and thus, in particular, $\widetilde{h}_{out}|_{\mathbb{R}^{k}}\equiv c$ . We now repeatedly apply Lemma 2 to $\mathcal{M}^{\prime\prime}$ , anchoring successively each of the inputs $u_{1},\dots,u_{k-1}$ . Observe that we will never find ourselves in the circumstance (ii) of Lemma 2, as this would mean that we have obtained a network $\mathcal{N}\in\mathscr{M}_{min}$ with a strictly smaller number of nodes than $\mathcal{M}^{\prime}$ . Moreover, as the first $k$ rows of $Q$ form an identity matrix, we have

[TABLE]

for all $p,j\in\{1,\dots,k\}$ . Therefore, for each $j\in\{1,\dots,k\}$ , the node $v_{j}^{1}$ will be removed when anchoring the input $u_{j}$ . A concrete example of this input anchoring procedure in the case $k\geq 2$ is shown schematically in Figure 9.

Thus, having anchored the nodes $u_{1},u_{2},\dots,u_{k-1}$ to appropriate real numbers $a_{1},\dots,a_{k-1}$ , we will be left with a non-degenerate clones-free LFNN ${\mathcal{N}}=(V^{\mathcal{N}},E^{\mathcal{N}},\{u_{k}\},V_{out}^{\mathcal{N}},\Omega^{\mathcal{N}},\Theta^{\mathcal{N}})$ such that the function $h_{out}^{\mathcal{N}}\vcentcolon=\sum_{w\in V_{out}^{\mathcal{N}}}\lambda_{w}\left\langle{w}\right\rangle^{\sigma,\,\mathcal{N}}$ satisfies

[TABLE]

We have shown that the first term on the right-hand side of (26) evaluates identically to $c$ . Moreover, as input anchoring yields networks satisfying (IA-2), the values $\left\langle{w}\right\rangle^{\sigma,\,\mathcal{M}^{\prime\prime}}$ , for $w\in V_{out}^{\mathcal{M}^{\prime\prime}}\setminus V_{out}^{\mathcal{N}}$ , are constant with respect to the input at $u_{k}$ . Therefore the value of the sum on the right-hand side of (26) is independent of $t$ , that is, $h_{out}^{\mathcal{N}}\equiv c_{\mathcal{N}}$ , for some $c_{\mathcal{N}}\in\mathbb{R}$ . As $\lambda_{w}\neq 0$ , for $w\in V_{out}^{\mathcal{M}^{\prime\prime}}$ , it follows that $\{\left\langle{w}\right\rangle^{\sigma,\,\mathcal{N}}:w\in V_{out}^{\mathcal{N}}\}\cup\{\bm{1}\}$ is linearly dependent. We have thus shown that the network $\mathcal{N}$ is in $\mathscr{M}_{min}$ . As $\mathcal{N}$ has strictly fewer nodes than $\mathcal{M}^{\prime}$ , we have established the desired contradiction and proved the theorem for $k\geq 2$ .

The case $k=1$ . We have $\dim\langle\omega_{v_{1}^{1}v_{in}},\dots,\omega_{v_{D_{1}}^{1}v_{in}}\rangle_{\mathbb{Q}}=1$ , so we can write $\omega_{v_{j}^{1}v_{in}}=N_{j}a$ , where $a\in\mathbb{R}$ and $N_{j}\in\mathbb{Z}$ , for $j=1,\dots,D_{1}$ . Moreover, by replacing $a$ with $2^{l}a$ and all $N_{j}$ with $N_{j}/2^{l}$ for an appropriate integer $l$ , we may assume w.l.o.g. that at least one of the $N_{j}$ is odd. We make the following crucial observation. For all $j=1,\dots,D_{1}$ and $t\in\mathbb{R}$ , we have

[TABLE]

We see that, along the line $\mathbb{R}+\frac{i}{2a}$ , the functions $\langle{v_{j}^{1}}\rangle^{\sigma,\,{\mathcal{M}^{\prime}}}$ are real-valued, for all $j=1,\dots,D_{1}$ , and, provided that $N_{j}$ is odd, they have poles at the points $\frac{1}{a}\left[\frac{S-{\theta_{v_{j}^{1}}}}{N_{j}}+\frac{i}{2}\right]$ . As $S$ is self-avoiding, and at least one of the $N_{j}$ is odd, there exist a $j^{*}\in\{1,\dots,D_{1}\}$ and a $t^{*}\in\mathbb{R}+\frac{i}{2a}$ such that $\langle{v_{j^{*}}^{1}}\rangle^{\sigma,\,{\mathcal{M}^{\prime}}}$ has a pole at $t^{*}$ , and all the other $\langle{v_{j}^{1}}\rangle^{\sigma,\,{\mathcal{M}^{\prime}}}$ , $j\in\{1,\dots,D_{1}\}\setminus\{j^{*}\}$ , are analytic and real-valued at $t^{*}$ . Let $\epsilon>0$ be such that $\langle{v_{j}^{1}}\rangle^{\sigma,\,{\mathcal{M}^{\prime}}}$ , $j\in\{1,\dots,D_{1}\}\setminus\{j^{*}\}$ , are analytic on an open set containing the closed disk $D(t^{*},\epsilon)$ , and such that $\langle{v_{j^{*}}^{1}}\rangle^{\sigma,\,{\mathcal{M}^{\prime}}}$ is analytic on the punctured disk $D(t^{*},\epsilon)\setminus\{t^{*}\}$ . Before embarking on the construction of $\mathcal{N}$ in the case $k=1$ , we verify the following auxiliary statement:

Claim 1: We have $L(\mathcal{M}^{\prime})\geq 2$ and $\{\widetilde{v}\in V_{2}:(v_{j^{*}}^{1},\widetilde{v})\in E^{\mathcal{M}^{\prime}}\}\neq\varnothing$ .

Proof of Claim 1. We first show that $L(\mathcal{M}^{\prime})\geq 2$ . To this end, suppose by way of contradiction that $L(\mathcal{M}^{\prime})=1$ . Then $V_{out}^{\mathcal{M}^{\prime}}=V_{1}$ by non-degeneracy, so the function $h_{out}=\sum_{w\in V_{out}^{\mathcal{M}^{\prime}}}\lambda_{w}\left\langle{w}\right\rangle^{\sigma,\,\mathcal{M}^{\prime}}$ can be written as

[TABLE]

where $g$ is analytic in an open neighborhood of $t^{*}$ . But $\langle{v_{j^{*}}^{1}}\rangle^{\sigma,\,\mathcal{M}^{\prime}}$ has a pole at $t^{*}$ , and so $h_{out}$ has a pole at $t^{*}$ , which stands in contradiction to $h_{out}\equiv c$ , and thus establishes $L(\mathcal{M}^{\prime})\geq 2$ .

Next, by way of contradiction assume that $\{\widetilde{v}\in V_{2}:(v_{j^{*}}^{1},\widetilde{v})\in E^{\mathcal{M}^{\prime}}\}=\varnothing$ . Then, by non-degeneracy of $\mathcal{M}^{\prime}$ , we have $v_{j^{*}}^{1}\in V_{out}^{\mathcal{M}^{\prime}}$ , and $\left\langle{w}\right\rangle^{\sigma,\,\mathcal{M}^{\prime}}$ , for $w\in V_{out}^{\mathcal{M}^{\prime}}\setminus\{v_{j^{*}}^{1}\}$ , are real holomorphic functions of $\big{(}\langle{v_{j}^{1}}\rangle^{\sigma,\,{\mathcal{M}^{\prime}}}\big{)}_{j\in\{1,\dots,D_{1}\}\setminus\{j^{*}\}}$ . Now, as $\langle{v_{j}^{1}}\rangle^{\sigma,\,{\mathcal{M}^{\prime}}}$ , $j\in\{1,\dots,D_{1}\}\setminus\{j^{*}\}$ , are analytic and real-valued at $t^{*}$ , the function $h_{out}$ can again be written in the form (28) with $g$ analytic in an open neighborhood of $t^{*}$ . This again contradicts $h_{out}\equiv c$ , and thus $\{\widetilde{v}\in V_{2}:(v_{j^{*}}^{1},\widetilde{v})\in E^{\mathcal{M}^{\prime}}\}\neq\varnothing$ , establishing the claim. We can therefore enumerate the nodes $V_{2}=\{v^{2}_{1},\dots,v^{2}_{d},v^{2}_{d+1},\dots,v^{2}_{D_{2}}\}$ so that

– $v_{j^{*}}^{1}\in\bigcap_{p\leq d}\mathrm{par}(\{v^{2}_{p}\})\setminus\bigcup_{p>d}\mathrm{par}(\{v^{2}_{p}\})$ , and

– $\{\omega_{v_{1}^{2}v_{j^{*}}^{1}},\dots,\omega_{v_{\bar{k}}^{2}v_{j^{*}}^{1}}\}$ is a basis for $\langle\omega_{v_{1}^{2}v_{j^{*}}^{1}},\dots,\omega_{v_{d}^{2}v_{j^{*}}^{1}}\rangle_{\mathbb{Q}}$ .

In particular, we have $\bar{k}=\dim\langle\omega_{v_{1}^{2}v_{j^{*}}^{1}},\dots,\omega_{v_{d}^{2}v_{j^{*}}^{1}}\rangle_{\mathbb{Q}}$ . We will apply a similar input splitting procedure as in the case $k\geq 2$ , but this time with the nodes $v_{j^{*}}^{1}$ and $v^{2}_{1},\dots,v^{2}_{d}$ taking on the roles of $v_{in}$ and $v_{1}^{1},\dots,v^{1}_{D_{1}}$ . Specifically, we will use the pole of $\langle{v_{j^{*}}^{1}}\rangle^{\sigma,\,\mathcal{M}^{\prime}}$ at $t^{*}$ to obtain sequences $(t^{n,\bm{s}})_{n\in\mathbb{N}}$ and $(\bm{r}^{n,\bm{s}})_{n\in\mathbb{N}}$ according to Lemma 7, that is to say, we will “split the non-input node” $v_{j^{*}}^{1}$ of $\mathcal{M}^{\prime}$ into input nodes of the new network $\mathcal{M}^{\prime\prime}$ to be constructed. We remark that the outputs of $v^{2}_{1},\dots,v^{2}_{d}$ depend on $\langle{v_{j^{*}}^{1}}\rangle^{\sigma,\,\mathcal{M}^{\prime}}$ , which, in turn, is a function of the input variables. This “extra level of separation” will cause the construction of $\mathcal{M}^{\prime\prime}$ to be more involved in the case $k=1$ than it was in the case $k\geq 2$ .
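The observation used above, namely that along the line $\mathbb{R}+\frac{i}{2a}$ the first-layer functions are real-valued and acquire poles exactly when $N_{j}$ is odd, can be checked numerically. The sketch below does this under the assumption $\sigma(z)=\tanh(\pi z)$, which is $i$-periodic with poles at $i(n+\tfrac{1}{2})$, $n\in\mathbb{Z}$. The values of `A_SCALE`, `THETA`, `N_ODD`, and `N_EVEN` are hypothetical, and whether this particular $\sigma$ satisfies every hypothesis of the theorem is not verified here.

```python
import cmath
import math

def sigma(z):
    # Assumed stand-in nonlinearity tanh(pi*z): i-periodic, poles at i*(n + 1/2).
    return cmath.tanh(math.pi * z)

def first_layer_on_shifted_line(t, n_j, a, theta_j):
    """Value of sigma(N_j * a * z + theta_j) at z = t + i/(2a), with t real."""
    z = complex(t, 1.0 / (2.0 * a))
    return sigma(n_j * a * z + theta_j)

# Hypothetical parameters: common factor a, one odd and one even multiplier.
A_SCALE, THETA = 0.8, 0.2
N_ODD, N_EVEN = 3, 2

# For this sigma and odd N_j, the value on the shifted line equals
# coth(pi * (N_j*a*t + theta_j)), so the pole sits where N_j*a*t + theta_j = 0.
T_POLE = -THETA / (N_ODD * A_SCALE)

if __name__ == "__main__":
    for t in (0.37, -1.1):
        print("odd :", t, first_layer_on_shifted_line(t, N_ODD, A_SCALE, THETA))
        print("even:", t, first_layer_on_shifted_line(t, N_EVEN, A_SCALE, THETA))
    near = first_layer_on_shifted_line(T_POLE + 1e-6, N_ODD, A_SCALE, THETA)
    print("modulus near the pole (odd N_j):", abs(near))
```

For even $N_{j}$ the shift by $\frac{i}{2a}$ is absorbed by $i$-periodicity, so the function stays bounded on the line. For odd $N_{j}$ the half-period shift moves a pole onto the line.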

In order to motivate the construction of $\mathcal{M}^{\prime\prime}$ in the case $k=1$ , we will carry out a calculation analogous to (19)–(21). We begin by determining a $B\in\mathbb{R}$ such that none of the functions

[TABLE]

for $p\in\{1,\dots,d\}$ , have singularities in the set $\mathcal{L}_{B}\vcentcolon=\{z\in D(t^{*},\epsilon):\langle{v_{j^{*}}^{1}}\rangle^{\sigma,\,\mathcal{M}^{\prime}}(z)\in B+i\,\mathbb{R}\}$ , where the functions $f_{p}:\mathcal{D}_{f_{p}}\to\mathbb{C}$ , for $p\in\{1,\dots,d\}$ , are defined according to

[TABLE]

When $D_{1}=1$ , the functions $f_{p}$ are all identically zero. For given $p\in\{1,\dots,d\}$ , $z\in\mathcal{L}_{B}$ is a singularity of $\langle{v_{p}^{2}}\rangle^{\sigma,\,\mathcal{M}^{\prime}}$ if and only if $z$ is an element of $D(t^{*},\epsilon)$ such that

[TABLE]

where $P$ is the set of poles of $\sigma$ , expressed in terms of $S$ by (14). But

[TABLE]

for all $z\in D(t^{*},\epsilon)$ , so it suffices to ensure that

[TABLE]

Next, let

[TABLE]

and note that, as $f_{p}$ , $p=1,\dots,d$ , are continuous in a neighborhood of $t^{*}$ , we have $\eta(\epsilon)\to 0$ as $\epsilon\to 0$ . Let $\mathrm{Leb}$ denote the Lebesgue measure on $\mathbb{R}$ . We then have

[TABLE]

for small enough values of $\epsilon$ . Therefore, by choosing a sufficiently small $\epsilon$ , we can ensure that there exists a $B\in[0,1]$ such that (31) holds, as desired. Now, write $(\omega_{v_{p}^{2}v_{j^{*}}^{1}})_{p\hskip 1.42262pt=\hskip 1.42262pt1}^{d}=\bar{Q}\cdot(\omega_{v_{p}^{2}v_{j^{*}}^{1}})_{p\hskip 1.42262pt=\hskip 1.42262pt1}^{\bar{k}}$ , where $\bar{Q}=(\bar{q}_{pj})_{p,j}\in\mathbb{Q}^{d\times\bar{k}}$ is a rational matrix whose first $\bar{k}$ rows form a $\bar{k}\times\bar{k}$ identity matrix. Let $C\subset\mathbb{R}^{\bar{k}}$ be a set satisfying the conclusion of Lemma 7 applied with $\alpha_{p}=\omega_{v_{p}^{2}v_{j^{*}}^{1}}$ , $p=1,\dots,\bar{k}$ .

Given an arbitrary $\bm{s}=(s_{1},s_{2},\dots,s_{\bar{k}})\in C$ , Lemma 7 yields sequences $(t^{n,\bm{s}})_{n\in\mathbb{N}}\subset\mathbb{R}$ , $(\bm{r}^{n,\bm{s}})_{n\in\mathbb{N}}\subset{C}$ such that

[TABLE]

As $\langle{v_{j^{*}}^{1}}\rangle^{\sigma,\,\mathcal{M}^{\prime}}$ is analytic on the punctured disk $D(t^{*},\epsilon)\setminus\{t^{*}\}$ and its singularity at $t^{*}$ is a pole, it follows that the reciprocal $1/\langle{v_{j^{*}}^{1}}\rangle^{\sigma,\,\mathcal{M}^{\prime}}$ is holomorphic on $D(t^{*},\epsilon)$ with a zero at $t^{*}$ . Thus, by the complex open mapping theorem [18, Thm. 10.32] applied to $1/\langle{v_{j^{*}}^{1}}\rangle^{\sigma,\,\mathcal{M}^{\prime}}$ , there exists a $\delta>0$ such that, for every $y\in D(0,\delta)$ , there is a $z_{y}\in D(t^{*},\epsilon)$ with $1/\langle{v_{j^{*}}^{1}}\rangle^{\sigma,\,\mathcal{M}^{\prime}}(z_{y})=y$ . Now, since $|t^{n,\bm{s}}|\to\infty$ , we also have $|B+i\,t^{n,\bm{s}}|\to\infty$ , so it follows that there exists a sequence $(z^{n,\bm{s}})_{n\in\mathbb{N}}$ in $D(t^{*},\epsilon)\setminus\{t^{*}\}$ with $z^{n,\bm{s}}\to t^{*}$ , such that $\langle{v_{j^{*}}^{1}}\rangle^{\sigma,\,\mathcal{M}^{\prime}}(z^{n,\bm{s}})=B+i\,t^{n,\bm{s}}$ (a finite number of elements of the sequence $(t^{n,\bm{s}})_{n\in\mathbb{N}}$ may need to be discarded to ensure that $(z^{n,\bm{s}})_{n\in\mathbb{N}}$ is, indeed, contained in $D(t^{*},\epsilon)\setminus\{t^{*}\}$ ). Now, for $p\in\{1,\dots,d\}$ , compute

[TABLE]

where in (35) we used the definition of $z^{n,\bm{s}}$ , in (36) we used the $i$ -periodicity of $\sigma$ , in (37) we used (32), and in (38) we used $\omega_{v_{p}^{2}v_{j^{*}}^{1}}=\sum_{j=1}^{\bar{k}}{\bar{q}}_{pj}\,\omega_{v_{j}^{2}v_{j^{*}}^{1}}$ and the $i$ -periodicity of $\sigma$ again. As $B$ was chosen so that the functions (29) do not have singularities in $\mathcal{L}_{B}$ , all the quantities in the calculation (35)–(38) are well-defined.
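The existence of points $z^{n,\bm{s}}\to t^{*}$ at which the function with the pole takes the prescribed values $B+i\,t^{n,\bm{s}}$ can be made concrete in a toy setting. The sketch below assumes the function is $g(z)=\tanh(\pi z)$ with pole $t^{*}=i/2$, and uses a few hypothetical target values $B+i\,t$ in place of the sequence supplied by Lemma 7. It solves $g(z)=w$ in closed form and confirms that the solutions approach $t^{*}$ as $|w|$ grows.

```python
import cmath
import math

def g(z):
    # Assumed stand-in for the first-layer function with a pole: tanh(pi*z),
    # which has a pole at t* = i/2.
    return cmath.tanh(math.pi * z)

T_STAR = 0.5j

def preimage_near_pole(w):
    """Return a point z with g(z) = w; z is close to T_STAR when |w| is large.

    Uses tanh(u + i*pi/2) = coth(u): g(T_STAR + u/pi) = coth(u) = w
    is solved by u = atanh(1/w).
    """
    return T_STAR + cmath.atanh(1.0 / w) / math.pi

B = 0.5  # hypothetical real offset playing the role of B
targets = [complex(B, t) for t in (10.0, 100.0, 1000.0)]
points = [preimage_near_pole(w) for w in targets]

if __name__ == "__main__":
    for w, z in zip(targets, points):
        print("target", w, "preimage", z, "distance to pole", abs(z - T_STAR))
```

In the proof the preimages are obtained abstractly from the open mapping theorem applied to the reciprocal. The closed form here is available only because of the specific choice of $g$.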

Motivated by (35)–(38), we construct a GFNN ${\mathcal{M}^{\prime\prime}}=(V^{{\mathcal{M}^{\prime\prime}}},E^{{\mathcal{M}^{\prime\prime}}},V_{in}^{{\mathcal{M}^{\prime\prime}}},V_{out}^{{\mathcal{M}^{\prime\prime}}},\Omega^{{\mathcal{M}^{\prime\prime}}},\Theta^{{\mathcal{M}^{\prime\prime}}})$ as follows:

– First, $\bar{k}$ new nodes are created and enumerated as $\{u_{1},\dots,u_{\bar{k}}\}$ . Now, if $D_{1}>1$ , then let $V_{in}^{{\mathcal{M}^{\prime\prime}}}=\{v_{in},u_{1},\dots,u_{\bar{k}}\}$ , and if $D_{1}=1$ , set $V_{in}^{{\mathcal{M}^{\prime\prime}}}=\{u_{1},\dots,u_{\bar{k}}\}$ .

– $V^{{\mathcal{M}^{\prime\prime}}}:=V_{in}^{{\mathcal{M}^{\prime\prime}}}\cup(V_{1}\setminus\{v_{j^{*}}^{1}\})\cup\bigcup_{\ell\geq 2}V_{\ell}$ .

– $E^{{\mathcal{M}^{\prime\prime}}}\vcentcolon=\{(v,\widetilde{v})\in E^{\mathcal{M}^{\prime}}:\;\mathrm{lv}(v)\geq 2\}\cup\{(v_{j}^{1},v_{p}^{2}):j\in\{1,\dots,D_{1}\}\setminus\{j^{*}\},\,p\in\{1,\dots,D_{2}\}\}\cup\{(u_{j},v_{p}^{2}):p\in\{1,\dots,d\},\,j\in\{1,\dots,\bar{k}\},\,{\bar{q}}_{pj}\neq 0\}$ ,

– $V_{out}^{{\mathcal{M}^{\prime\prime}}}\vcentcolon=V_{out}^{{\mathcal{M}^{\prime}}}\setminus\{v_{j^{*}}^{1}\}$ ,

– define $\omega_{v_{p}^{2}u_{j}}:={\bar{q}}_{pj}\,\omega_{v_{j}^{2}v_{j^{*}}^{1}}$ , for $p=1,\dots,d$ , $j=1,\dots,{\bar{k}}$ , and let $\Omega^{{\mathcal{M}^{\prime\prime}}}:=\{\omega_{\widetilde{v}v}\in\Omega^{\mathcal{M}^{\prime}}:\;\mathrm{lv}(v)\geq 2\}\cup\{\omega_{v_{p}^{2}v_{j}^{1}}:j\in\{1,\dots,D_{1}\}\setminus\{j^{*}\},\,p\in\{1,\dots,D_{2}\}\}\cup\{\omega_{v_{p}^{2}u_{j}}:p\in\{1,\dots,d\},\,j\in\{1,\dots,\bar{k}\},\,{\bar{q}}_{pj}\neq 0\}$ ,

– let $\Theta^{{\mathcal{M}^{\prime\prime}}}:=\{\theta_{v}\in\Theta^{\mathcal{M}^{\prime}}:\;\mathrm{lv}(v)\geq 2\}\cup\{\theta_{v_{j}^{1}}:j\in\{1,\dots,D_{1}\}\setminus\{j^{*}\}\}$ .

The construction of $\mathcal{M}^{\prime\prime}$ for a concrete $\mathcal{M}^{\prime}$ is illustrated in Figure 10. Note that ${\mathcal{M}^{\prime\prime}}$ is not layered in the case $D_{1}>1$ , due to the presence of the node $v_{in}$ . Owing to (35)–(38) and the construction of ${\mathcal{M}^{\prime\prime}}$ , we have the following “input splitting” relationship:

[TABLE]

for $p\in\{1,\dots,d\}$ .

We next show that ${\mathcal{M}^{\prime\prime}}$ is non-degenerate and clones-free. To establish non-degeneracy, it suffices to show $V_{in}^{{\mathcal{M}^{\prime\prime}}}\subset\bigcup_{w\in V_{out}^{{\mathcal{M}^{\prime\prime}}}}V^{\mathcal{M}^{\prime\prime}(w)}$ . First note that, in both cases $D_{1}=1$ and $D_{1}>1$ , for a given $j\in\{1,\dots,\bar{k}\}$ , there exists a $w\in V_{out}^{{\mathcal{M}^{\prime}}}\setminus\{v_{j^{*}}^{1}\}$ such that $v_{j}^{2}\in V^{\mathcal{M}^{\prime}(w)}$ , by non-degeneracy of $\mathcal{M}^{\prime}$ . It follows that $v_{j}^{2}\in V^{{\mathcal{M}^{\prime\prime}}(w)}$ and thus $u_{j}\in V^{{\mathcal{M}^{\prime\prime}}(w)}$ . As $j$ was arbitrary, we have $\{u_{1},\dots,u_{\bar{k}}\}\subset\bigcup_{w\in V_{out}^{{\mathcal{M}^{\prime\prime}}}}V^{\mathcal{M}^{\prime\prime}(w)}$ , which establishes non-degeneracy of $\mathcal{M}^{\prime\prime}$ in the case $D_{1}=1$ . For $D_{1}>1$ we need to additionally show that $v_{in}\in V^{{\mathcal{M}^{\prime\prime}}(w)}$ . To this end, note that there exist an $m^{*}\in\{1,\dots,D_{1}\}\setminus\{j^{*}\}$ and a $w\in V_{out}^{{\mathcal{M}^{\prime}}}\setminus\{v_{j^{*}}^{1}\}$ such that $v_{m^{*}}^{1}\in V^{\mathcal{M}^{\prime}(w)}$ , and so $v_{in}\in V^{{\mathcal{M}^{\prime\prime}}(w)}$ , as desired. The clones-free property of ${\mathcal{M}^{\prime\prime}}$ follows by the same argument as in the case $k\geq 2$ .

Once again, we revisit the function $h_{out}(t)=\sum_{w\in V_{out}^{\mathcal{M}^{\prime}}}\lambda_{w}\left\langle{w}\right\rangle^{\sigma,\,\mathcal{M}^{\prime}}(t)=c$ , for all $t\in\mathcal{D}_{h_{out}}$ , and proceed in a similar fashion as in the case $k\geq 2$ . This time, however, the output sets $V^{\mathcal{M}^{\prime}}_{out}$ and $V^{\mathcal{M}^{\prime\prime}}_{out}$ may differ by the node $v_{j^{*}}^{1}$ . This is a nuisance that will be dealt with below in Claim 2, but in the meantime, it is convenient to introduce the “truncated” linear dependency function

[TABLE]

and proceed exactly as in the case $k\geq 2$ . By examining the structure of $\mathcal{M}^{\prime}$ , we see that, for each $w\in V_{out}^{\mathcal{M}^{\prime}}\setminus\{v^{1}_{j^{*}}\}$ , we can write

[TABLE]

where $H_{w}:\mathcal{D}_{H_{w}}\to\mathbb{C}$ corresponds to the map realized by the GFNN with nodes

[TABLE]

inputs $\{v^{2}_{p}\}_{p\hskip 1.42262pt=\hskip 1.42262pt1}^{d}\cup\{v^{1}_{j}\}_{j\in\{1,\dots,D_{1}\}\setminus\{j^{*}\}}$ , single output $\{w\}$ , and edges, weights, and biases inherited from $\mathcal{M}^{\prime}$ . The function $H_{w}:\mathcal{D}_{H_{w}}\to\mathbb{C}$ is holomorphic on its natural domain $\mathcal{D}_{H_{w}}\subset\mathbb{C}^{d+(D_{1}-1)}$ containing $\mathbb{R}^{d+(D_{1}-1)}$ . We can therefore write

[TABLE]

where $H:\mathcal{D}_{H}\to\mathbb{C}$ , $H=\sum_{w\in V_{out}^{\mathcal{M}^{\prime}}\setminus\{v^{1}_{j^{*}}\}}\lambda_{w}\,H_{w}$ , is holomorphic on $\mathcal{D}_{H}=\bigcap_{w\in V_{out}^{\mathcal{M}^{\prime}}\setminus\{v^{1}_{j^{*}}\}}\mathcal{D}_{H_{w}}\supset\mathbb{R}^{d+(D_{1}-1)}$ .

Now, by definition of natural domain, for each $w\in V_{out}^{\mathcal{M}^{\prime\prime}}$ , the natural domain $\mathcal{D}_{\left\langle{w}\right\rangle^{\sigma,\,\mathcal{M}^{\prime\prime}}}$ is the set of all $\bm{z}\in\bigcap_{p=1}^{d}\mathcal{D}_{\langle{v_{p}^{2}}\rangle^{\sigma,\,\mathcal{M}^{\prime\prime}}}\cap\bigcap_{j\neq j^{*}}\mathcal{D}_{\langle{v_{j}^{1}}\rangle^{\sigma,\,\mathcal{M}^{\prime\prime}}}$ such that

[TABLE]

where the variable $\bm{z}=(z_{0},z_{1},\dots,z_{\bar{k}})$ corresponds to the input nodes $v_{in},u_{1},\dots,u_{\bar{k}}$ , in the case $D_{1}>1$ , and $\bm{z}=(z_{1},\dots,z_{\bar{k}})$ corresponds to the input nodes $u_{1},\dots,u_{\bar{k}}$ , in the case $D_{1}=1$ . Therefore, for $\bm{z}$ in the open domain $\mathcal{D}_{\widetilde{h}_{out}}\vcentcolon=\bigcap_{w\in V_{out}^{\mathcal{M}^{\prime\prime}}}\mathcal{D}_{\left\langle{w}\right\rangle^{\sigma,\,\mathcal{M}^{\prime\prime}}}$ , we can define the function $\widetilde{h}_{out}:\mathcal{D}_{\widetilde{h}_{out}}\to\mathbb{C}$ according to

[TABLE]

Moreover, as $\mathcal{M}^{\prime}$ and $\mathcal{M}^{\prime\prime}$ share the nodes in (41), as well as the associated edges, weights, and biases, we have

[TABLE]

for all $w\in V_{out}^{\mathcal{M}^{\prime\prime}}$ , and thus

[TABLE]

At this point we verify another auxiliary claim, which states that $h_{tr}$ and $h_{out}$ are always, in fact, the same function, and therefore $\tilde{h}_{out}\equiv c$ follows by a similar argument as in the case $k\geq 2$ .

Claim 2: Recall that $t^{*}\in\mathbb{R}+\frac{i}{2a}$ is such that $\langle{v_{j^{*}}^{1}}\rangle^{\sigma,\,{\mathcal{M}^{\prime}}}$ has a pole at $t^{*}$ , and all the other $\langle{v_{j}^{1}}\rangle^{\sigma,\,{\mathcal{M}^{\prime}}}$ , $j\in\{1,\dots,D_{1}\}\setminus\{j^{*}\}$ , are analytic and real-valued at $t^{*}$ . Further recall the open set $C\subset\mathbb{R}^{\bar{k}}$ containing $\bm{0}$ . We have $\{t^{*}\}\times\mathbb{R}^{\bar{k}}\subset\mathcal{D}_{\widetilde{h}_{out}}$ and $\widetilde{h}_{out}|_{\mathbb{R}^{\bar{k}+1}}\equiv c$ , in the case $D_{1}>1$ , and $\mathbb{R}^{\bar{k}}\subset\mathcal{D}_{\widetilde{h}_{out}}$ and $\widetilde{h}_{out}|_{\mathbb{R}^{\bar{k}}}\equiv c$ , in the case $D_{1}=1$ . Moreover, in both cases we have $v_{j^{*}}^{1}\notin V_{out}^{\mathcal{M}^{\prime}}$ .

Proof of Claim 2. First assume that $D_{1}>1$ . To show that $\{t^{*}\}\times\mathbb{R}^{\bar{k}}\subset\mathcal{D}_{\widetilde{h}_{out}}$ , first observe that, for $j\in\{1,\dots,D_{1}\}\setminus\{j^{*}\}$ and $(z_{1},\dots,z_{\bar{k}})\in\mathbb{R}^{\bar{k}}$ , we have $\langle{v_{j}^{1}}\rangle^{\sigma,\,\mathcal{M}^{\prime\prime}}(t^{*},z_{1},\dots,z_{\bar{k}})=\langle{v_{j}^{1}}\rangle^{\sigma,\,\mathcal{M}^{\prime}}(t^{*})$ , which, by (VI), is a real number. By (30), this further implies $f_{p}(t^{*})\in\mathbb{R}$ , for $p=1,\dots,d$ . Therefore

[TABLE]

for $p\in\{1,\dots,d\}$ and $(z_{1},\dots,z_{\bar{k}})\in\mathbb{R}^{\bar{k}}$ . As $\mathbb{R}^{d+(D_{1}-1)}\subset\mathcal{D}_{H}$ , we deduce that

[TABLE]

This establishes $\{t^{*}\}\times\mathbb{R}^{\bar{k}}\subset\mathcal{D}_{\widetilde{h}_{out}}$ . We proceed to showing $\widetilde{h}_{out}|_{\mathbb{R}^{\bar{k}+1}}\equiv c$ . As $\mathcal{D}_{\widetilde{h}_{out}}$ is open, it follows that $\mathcal{D}_{\widetilde{h}_{out}}\supset\mathcal{U}$ , for some connected open $\mathcal{U}\subset\mathbb{C}^{1+\bar{k}}$ containing $\{t^{*}\}\times\mathbb{R}^{\bar{k}}$ . Choose a small enough $\delta>0$ so that $\mathcal{U}\supset D^{\circ}_{1}(t^{*},\delta)\times D^{\circ}_{\bar{k}}((B,\dots,B),\delta)$ . Now, fix an arbitrary $\bm{s}=(s_{1},\dots,s_{\bar{k}})$ in the smaller open set $C\cap D^{\circ}_{\bar{k}}(\bm{0},\delta)$ . We then have

[TABLE]

and since

[TABLE]

as $n\to\infty$ , we obtain

[TABLE]

for large enough $n\in\mathbb{N}$ . We may again assume w.l.o.g. that this is true for all $n\in\mathbb{N}$ by discarding finitely many elements of the sequences $(z^{n,\bm{s}})_{n\in\mathbb{N}}$ and $(\bm{r}^{n,\bm{s}})_{n\in\mathbb{N}}$ . Now, we use (39), (42), and (43) to get

[TABLE]

for all $\bm{s}\in C\cap D^{\circ}_{\bar{k}}(\bm{0},\delta)$ . We are now ready to show that $v_{j^{*}}^{1}\notin V_{out}^{\mathcal{M}^{\prime}}$ (still in the case $D_{1}>1$ ). To this end, suppose by way of contradiction that $v_{j^{*}}^{1}\in V_{out}^{\mathcal{M}^{\prime}}$ and set $\bm{s}=\bm{0}$ . Note that $\widetilde{h}_{out}(t^{*},B,\dots,B)$ is a well-defined (finite) complex number, simply as $(t^{*},B,\dots,B)\in\{t^{*}\}\times\mathbb{R}^{\bar{k}}\subset\mathcal{D}_{\widetilde{h}_{out}}$ . Thus, by (40) and (44), we have

[TABLE]

as $n\to\infty$ , which contradicts the fact that $\langle{v_{j^{*}}^{1}}\rangle^{\sigma,\,\mathcal{M}^{\prime}}$ has a pole at $t^{*}$ . This establishes $v_{j^{*}}^{1}\notin V_{out}^{\mathcal{M}^{\prime}}$ . As a consequence we further have $h_{tr}=h_{out}$ , and so (44) reads

[TABLE]

for all $\bm{s}\in C\cap D^{\circ}_{\bar{k}}(0,\delta)$ . Now, define the set

[TABLE]

Note that $\widetilde{T}$ satisfies

[TABLE]

so by Lemma 5, it follows that $\widetilde{h}_{out}-c\equiv 0$ everywhere in an open neighborhood of $\mathbb{R}^{\bar{k}+1}$ , and thus $\widetilde{h}_{out}|_{\mathbb{R}^{\bar{k}+1}}\equiv c$ in particular. This establishes Claim 2 in the case $D_{1}>1$ . It remains to prove the claim for $D_{1}=1$ . Showing that $\mathbb{R}^{\bar{k}}\subset\mathcal{D}_{\widetilde{h}_{out}}$ is fully analogous to showing $\{t^{*}\}\times\mathbb{R}^{\bar{k}}\subset\mathcal{D}_{\widetilde{h}_{out}}$ in the case $D_{1}>1$ . We can hence proceed to establishing $\widetilde{h}_{out}|_{\mathbb{R}^{\bar{k}}}\equiv c$ . To this end, we first note that there is a connected open set $\mathcal{U}$ and a $\delta>0$ such that $\mathbb{R}^{\bar{k}}\subset\mathcal{U}\subset\mathcal{D}_{\widetilde{h}_{out}}$ and $D^{\circ}_{\bar{k}}((B,\dots,B),\delta)\subset\mathcal{U}$ , and we similarly obtain

[TABLE]

for all $n\in\mathbb{N}$ and $\bm{s}\in C\cap D^{\circ}_{\bar{k}}(0,\delta)$ . Again, showing $v_{j^{*}}^{1}\notin V_{out}^{\mathcal{M}^{\prime}}$ now proceeds in a manner entirely analogous to the case $D_{1}>1$ , as does obtaining the identity

[TABLE]

for all $\bm{s}\in C\cap D^{\circ}_{\bar{k}}(0,\delta)$ . Now, define the set

[TABLE]

Note that $T$ satisfies $\mathrm{cl}(T)\supset\left((B,\dots,B)+(i\,C)\cap D^{\circ}_{\bar{k}}(0,\delta)\right)$ , so, by Lemma 4, we have $\widetilde{h}_{out}\equiv c$ everywhere in an open neighborhood of $\mathbb{R}^{\bar{k}}$ , which concludes the proof of Claim 2.

Finally, it remains to apply an input anchoring procedure to $\mathcal{M}^{\prime\prime}$ , which will conclude the proof in a manner similar to the case $k\geq 2$ . Specifically, we use Lemma 2 to successively eliminate inputs of $\mathcal{M}^{\prime\prime}$ , starting with $v_{in}$ (if present), and proceeding with $u_{1},\dots,u_{\bar{k}-1}$ . If $D_{1}>1$ , the network ${\mathcal{M}}^{\prime\prime}$ is not layered (unlike in the case $k\geq 2$ and the case $k=1$ , $D_{1}=1$ ). However, every network obtained from $\mathcal{M}^{\prime\prime}$ by anchoring all but one of the input nodes $\{v_{in},u_{1},\dots,u_{\bar{k}}\}$ is layered. This means that, when anchoring $v_{in}$ , we do not find ourselves in the circumstance (ii) of Lemma 2, as this would mean we have obtained a network $\mathcal{N}\in\mathscr{M}_{min}$ with strictly fewer nodes than $\mathcal{M}^{\prime}$ . Thus, after having anchored $v_{in}$ , we are left with a layered network with inputs $u_{1},\dots,u_{\bar{k}}$ . At this point we proceed completely analogously to the case $k\geq 2$ by successively eliminating the inputs $u_{1},\dots,u_{\bar{k}-1}$ . We are left with a non-degenerate clones-free LFNN ${\mathcal{N}}=(V^{\mathcal{N}},E^{\mathcal{N}},\{u_{\bar{k}}\},V_{out}^{\mathcal{N}},\Omega^{\mathcal{N}},\Theta^{\mathcal{N}})$ and a vector of real constants $\bm{a}$ (specifically, $\bm{a}\in\mathbb{R}^{\bar{k}}$ in the case $D_{1}>1$ , and $\bm{a}\in\mathbb{R}^{\bar{k}-1}$ in the case $D_{1}=1$ ), such that the function $h_{out}^{\mathcal{N}}:=\sum_{w\in V_{out}^{\mathcal{N}}}\lambda_{w}\left\langle{w}\right\rangle^{\sigma,\,\mathcal{N}}$ satisfies

[TABLE]

A concrete example of this input anchoring procedure in the case $k\geq 2$ is shown schematically in Figure 11. By Claim 2, the first term on the right-hand side of (45) evaluates identically to $c$ . Moreover, as input anchoring yields networks satisfying (IA-2), the values of the functions $\left\langle{w}\right\rangle^{\sigma,\,\mathcal{M}^{\prime\prime}}$ , for $w\in V_{out}^{\mathcal{M}^{\prime\prime}}\setminus V_{out}^{\mathcal{N}}$ , do not depend on the input at $u_{\bar{k}}$ . Therefore $h_{out}^{\mathcal{N}}\equiv c_{\mathcal{N}}$ , for some $c_{\mathcal{N}}\in\mathbb{R}$ . We have thus shown that the network $\mathcal{N}$ is in $\mathscr{M}$ . But $L(\mathcal{N})=L(\mathcal{M})-1$ , which stands in contradiction to the minimality of depth of the elements of $\mathscr{M}_{min}$ , and therefore completes the proof of the theorem. ∎

Proof of Theorem 3.

Let $\mathcal{N}_{j}=(V^{j},E^{j},V_{in},V_{out},\Omega^{j},\Theta^{j})$ , $j\in\{1,2\}$ , be networks as in the theorem statement. Let $\mathcal{N}=\mathcal{N}_{1}\vee\mathcal{N}_{2}$ be their amalgam and $\pi_{j}:V^{\mathcal{N}_{j}}\to\pi_{j}(V^{\mathcal{N}_{j}})\subset V^{\mathcal{N}}$ the extensional isomorphisms between $\mathcal{N}_{j}$ and the corresponding subnetworks of $\mathcal{N}$ , for $j\in\{1,2\}$ . We start by claiming that $\pi_{1}(w)=\pi_{2}(w)$ , for all $w\in V_{out}$ . Indeed, suppose to the contrary that we have $\pi_{1}(w^{\prime})\neq\pi_{2}(w^{\prime})$ , for some $w^{\prime}\in V_{out}$ , and denote $w_{j}=\pi_{j}(w^{\prime})$ , $j\in\{1,2\}$ . Since $w_{1}\neq w_{2}$ , it follows that $\mathcal{N}(w_{1})$ and $\mathcal{N}(w_{2})$ are not extensionally isomorphic, for otherwise $w_{1}$ and $w_{2}$ would be clones, contradicting the no-clones condition for $\mathcal{N}$ . Now,

[TABLE]

by assumption. But this contradicts the conclusion of Theorem 4, and thus establishes $\pi_{1}(w)=\pi_{2}(w)$ , for all $w\in V_{out}$ . By non-degeneracy of $\mathcal{N}_{1}$ , for every $v\in V^{1}$ , there exists a $w\in V_{out}$ such that $v\in V^{\mathcal{N}_{1}(w)}$ . Then $\pi_{1}(v)\in V^{\mathcal{N}(\pi_{1}(w))}=V^{\mathcal{N}(\pi_{2}(w))}=\pi_{2}(V^{\mathcal{N}_{2}(w)})\subset\pi_{2}(V^{2})$ . Similarly, for every $v\in V^{2}$ , we have $\pi_{2}(v)\in\pi_{1}(V^{1})$ . Thus, the function $\psi:V^{1}\to V^{2}$ given by $\psi=\pi_{2}^{-1}\circ\pi_{1}$ is well-defined. This function is invertible with inverse $\pi_{1}^{-1}\circ\pi_{2}$ , so it is a bijection. Therefore $\psi$ is an extensional isomorphism between $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ , by virtue of being a composition of two extensional isomorphisms. Moreover, we have $\psi(w)=\pi_{2}^{-1}(\pi_{1}(w))=w$ , for all $w\in V_{out}$ , so $\psi$ restricted to $V_{out}$ is the identity map, and thus $\psi$ is a faithful isomorphism. ∎

Acknowledgment

The authors would like to thank Thomas Allard for useful suggestions regarding the proof of Proposition 3 and an anonymous reviewer for proposing a clearer exposition of Lemma 6.

Appendix: proofs of auxiliary results

Proof of Proposition 1.

Fix $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ as in the statement of the proposition. We begin by establishing the existence of a corresponding amalgam $\mathcal{A}$ . Let $\mathscr{A}$ denote the set of all proto-amalgams of $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ . To see that $\mathscr{A}$ is non-empty, consider the LFNN $\mathcal{N}=(V^{\mathcal{N}},E^{\mathcal{N}},V_{in},V_{out}^{\mathcal{N}},\Omega^{\mathcal{N}},\Theta^{\mathcal{N}})$ specified as follows:

– Let $S$ be a set of cardinality $\#(V^{1}\setminus V_{in})+\#(V^{2}\setminus V_{in})$ disjoint from $V_{in}$ , and set $V^{\mathcal{N}}\vcentcolon=V_{in}\cup S$ . Furthermore, let $\pi_{j}^{\,\mathcal{N}}:V^{j}\to\pi_{j}^{\,\mathcal{N}}(V^{j})\subset V^{\mathcal{N}}$ be injective functions such that $\pi_{j}^{\,\mathcal{N}}(v)=v$ , for $v\in V_{in}$ , $j\in\{1,2\}$ , and $\pi_{1}^{\,\mathcal{N}}(V^{1}\setminus V_{in})\cap\pi_{2}^{\,\mathcal{N}}(V^{2}\setminus V_{in})=\varnothing$ , but otherwise arbitrary.

– $E^{\mathcal{N}}\vcentcolon=\bigcup_{j=1,2}\{(\pi_{j}^{\,\mathcal{N}}(v),\pi_{j}^{\,\mathcal{N}}(\widetilde{v})):v,\widetilde{v}\in V^{j},(v,\widetilde{v})\in E^{j}\}$ .

– $V^{\mathcal{N}}_{out}\vcentcolon=\pi_{1}^{\,\mathcal{N}}(V_{out}^{1})\cup\pi_{2}^{\,\mathcal{N}}(V_{out}^{2})$ .

– For $j\in\{1,2\}$ and $v,\widetilde{v}\in V^{j}$ such that $(v,\widetilde{v})\in E^{j}$ , let $\omega_{\pi_{j}^{\,\mathcal{N}}(\widetilde{v})\pi_{j}^{\,\mathcal{N}}({v})}=\omega_{\widetilde{v}v}$ , and set $\Omega^{\mathcal{N}}\vcentcolon=\left\{\omega_{vu}:(u,v)\in E^{\mathcal{N}}\right\}$ .

– For $j=1,2$ and $v\in V^{j}\setminus V_{in}$ , let $\theta_{\pi_{j}^{\,\mathcal{N}}(v)}=\theta_{v}$ , and set $\Theta^{\mathcal{N}}\vcentcolon=\left\{\theta_{u}:u\in V^{\mathcal{N}}\setminus V_{in}\right\}$ .

Informally, the network ${\mathcal{N}}$ is obtained by putting $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ “side by side”, sharing only the input nodes $V_{in}$ . As $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ are non-degenerate, so is $\mathcal{N}$ . Moreover, Properties (i) and (ii) of Definition 16 hold for $\mathcal{N}$ with $\pi_{j}^{\,\mathcal{N}}:V^{j}\to\pi_{j}^{\,\mathcal{N}}(V^{j})\subset V^{\mathcal{N}}$ , for $j=1,2$ .
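The side-by-side construction can be sketched in code. The following is a minimal illustration only, assuming a hypothetical dictionary encoding of a network (the keys `inputs`, `nodes`, `edges`, `biases`, `outputs` and the function name `side_by_side` are not from the paper). It further assumes that input labels are never tuples of the form `(j, v)`, so that the fresh copies are disjoint from $V_{in}$ .

```python
def side_by_side(net1, net2):
    """Place two networks with a common input set next to each other.

    Each network is a dict (hypothetical encoding, not from the paper):
      inputs  : set of input nodes V_in (shared by both networks)
      nodes   : set of all nodes V^j, including the inputs
      edges   : dict {(parent, child): weight}
      biases  : dict {non-input node: bias}
      outputs : set of output nodes
    Returns the combined network and the injections pi_1, pi_2.
    """
    assert net1["inputs"] == net2["inputs"], "networks must share V_in"
    v_in = set(net1["inputs"])
    combined = {"inputs": v_in, "nodes": set(v_in), "edges": {},
                "biases": {}, "outputs": set()}
    injections = []
    for j, net in enumerate((net1, net2), start=1):
        # pi_j fixes the inputs and sends every other node to a fresh copy (j, v),
        # so the images of the non-input nodes of the two networks are disjoint.
        pi = {v: (v if v in v_in else (j, v)) for v in net["nodes"]}
        combined["nodes"].update(pi.values())
        for (u, v), w in net["edges"].items():
            combined["edges"][(pi[u], pi[v])] = w
        for v, theta in net["biases"].items():
            combined["biases"][pi[v]] = theta
        combined["outputs"].update(pi[v] for v in net["outputs"])
        injections.append(pi)
    return combined, injections[0], injections[1]


# Two single-hidden-node networks on the same input "x".
net_a = {"inputs": {"x"}, "nodes": {"x", "h"}, "edges": {("x", "h"): 1.5},
         "biases": {"h": 0.2}, "outputs": {"h"}}
net_b = {"inputs": {"x"}, "nodes": {"x", "h"}, "edges": {("x", "h"): -0.7},
         "biases": {"h": 0.0}, "outputs": {"h"}}
amalgam_candidate, pi_1, pi_2 = side_by_side(net_a, net_b)
```

In this toy case the combined network has three nodes: the shared input and one copy of the hidden node per network.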

Thus $\mathcal{N}$ is a proto-amalgam of $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ , and so $\mathscr{A}\neq\varnothing$ . Now, let $\mathcal{A}=(V^{\mathcal{A}},E^{\mathcal{A}},V_{in}^{\mathcal{A}},V_{out}^{\mathcal{A}},\Omega^{\mathcal{A}},\allowbreak\Theta^{\mathcal{A}})\in\mathscr{A}$ be a network with the least possible number of nodes among all the networks in $\mathscr{A}$ , and let $\pi_{j}:V^{j}\to\pi_{j}(V^{j})\subset V^{\mathcal{A}}$ , for $j\in\{1,2\}$ , be extensional isomorphisms between $\mathcal{N}_{j}$ and the appropriate subnetworks of $\mathcal{A}$ . We now show that $\mathcal{A}$ is clones-free. To this end, suppose by way of contradiction that $c_{1},c_{2}\in V^{\mathcal{A}}$ are clones. As $\mathcal{N}_{1}$ is clones-free, $c_{1},c_{2}$ cannot both be in $\pi_{1}(V^{1})$ , for otherwise $\pi_{1}^{-1}(c_{1})$ and $\pi_{1}^{-1}(c_{2})$ would be clones in $\mathcal{N}_{1}$ . By the same token, $c_{1},c_{2}$ cannot both be in $\pi_{2}(V^{2})$ . Thus, we may write w.l.o.g. $c_{1}=\pi_{1}(v_{1})$ and $c_{2}=\pi_{2}(v_{2})$ , for some $v_{1}\in V^{1}$ and $v_{2}\in V^{2}$ . Now, let $\widetilde{\mathcal{A}}$ be the network obtained from $\mathcal{A}$ by making the following alterations:

– For every edge $(c_{2},v)\in E^{\mathcal{A}}$ , where $v\in V^{\mathcal{A}}$ , introduce a new edge $(c_{1},v)$ together with the associated weight $\omega_{vc_{2}}$ , and delete the edge $(c_{2},v)$ .

– Delete the edges $(v,c_{2})\in E^{\mathcal{A}}$ , as well as the node $c_{2}$ .

– If $c_{2}$ was a node in $\pi_{2}(V_{out}^{2})$ , then add $c_{1}$ to the set $V_{out}^{\widetilde{\mathcal{A}}}$ .
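The three alterations above amount to a simple graph rewrite. The sketch below is illustrative only: it assumes a hypothetical dictionary encoding of a network (keys `inputs`, `nodes`, `edges`, `biases`, `outputs`; the name `merge_clone` is likewise not from the paper), it treats "$c_{2}$ is an output node" as the trigger for the third step, and it assumes that $c_{1}$ does not already have an edge into any child of $c_{2}$ .

```python
def merge_clone(net, c1, c2):
    """Remove the clone c2 from `net`, rerouting its outgoing edges to c1."""
    merged = {
        "inputs": set(net["inputs"]),
        "nodes": set(net["nodes"]) - {c2},
        "edges": {},
        "biases": {v: b for v, b in net["biases"].items() if v != c2},
        "outputs": set(net["outputs"]) - {c2},
    }
    for (u, v), w in net["edges"].items():
        if v == c2:
            continue  # edges into c2 are deleted together with c2
        if u == c2:
            # Sketch-only assumption: c1 does not already feed into v.
            assert (c1, v) not in net["edges"], "edge (c1, v) already present"
            merged["edges"][(c1, v)] = w  # new edge keeps the weight of (c2, v)
        else:
            merged["edges"][(u, v)] = w
    if c2 in net["outputs"]:
        merged["outputs"].add(c1)
    return merged


# "a" and "b" have the same parent, incoming weight, and bias, so they are clones.
net = {
    "inputs": {"x"},
    "nodes": {"x", "a", "b", "y"},
    "edges": {("x", "a"): 1.0, ("x", "b"): 1.0, ("b", "y"): 2.0},
    "biases": {"a": 0.5, "b": 0.5, "y": 0.0},
    "outputs": {"a", "y"},
}
merged = merge_clone(net, "a", "b")
```

The resulting network has one node fewer, which is exactly what drives the minimality contradiction in the proof.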

The network $\widetilde{\mathcal{A}}$ is a proto-amalgam of $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ via the extensional isomorphisms ${\widetilde{\pi}_{1}=\pi_{1}}$ and

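$$\widetilde{\pi}_{2}(v)=\begin{cases}c_{1}, & v=v_{2},\\ \pi_{2}(v), & v\in V^{2}\setminus\{v_{2}\}.\end{cases}$$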

But $\widetilde{\mathcal{A}}$ has strictly fewer nodes than $\mathcal{A}$ , which contradicts the minimality of $\mathcal{A}$ , and thereby establishes that $\mathcal{A}$ is clones-free, and hence $\mathcal{A}$ is an amalgam of $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ , completing the proof of existence. To establish uniqueness—up to extensional isomorphisms—of the amalgam, suppose that $\mathcal{A}$ and $\mathcal{A}^{\prime}$ are both amalgams of $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ via extensional isomorphisms $\pi_{j}:V^{j}\to\pi_{j}(V^{j})\subset V^{\mathcal{A}}$ , $\pi_{j}^{\prime}:V^{j}\to\pi_{j}^{\prime}(V^{j})\subset V^{\mathcal{A}^{\prime}}$ , for $j\in\{1,2\}$ . We first show that

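$$(\pi_{1}^{\prime}\circ\pi_{1}^{-1})(v)=(\pi_{2}^{\prime}\circ\pi_{2}^{-1})(v),\quad\text{for all }v\in\pi_{1}(V^{1})\cap\pi_{2}(V^{2}),\tag{46}$$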

by induction on $\mathrm{lv}_{\mathcal{A}}(v)$ . If $v\in V_{in}$ , then (46) holds trivially as the restrictions of the maps $\pi_{j}$ , $\pi_{j}^{\prime}$ , for $j\in\{1,2\}$ , to the set $V_{in}$ , all equal the identity map $\mathrm{id}_{V_{in}}$ . Now, let $L\geq 1$ and suppose that (46) holds for all $u\in\pi_{1}(V^{1})\cap\pi_{2}(V^{2})$ with $\mathrm{lv}_{\mathcal{A}}(u)<L$ . Let $v\in\pi_{1}(V^{1})\cap\pi_{2}(V^{2})$ with $\mathrm{lv}_{\mathcal{A}}(v)=L$ , but otherwise arbitrary, and write $w_{j}=(\pi^{\prime}_{j}\circ\pi_{j}^{-1})(v)$ , for $j=1,2$ . By Property (i) of Definition 16 for the amalgam $\mathcal{A}$ we have $\mathcal{N}_{1}\stackrel{{\scriptstyle e}}{{\sim}}\mathcal{A}(\pi_{1}(V_{out}^{1}))$ and $\mathcal{N}_{2}\stackrel{{\scriptstyle e}}{{\sim}}\mathcal{A}(\pi_{2}(V_{out}^{2}))$ , and so $\mathcal{N}_{1}\left(\pi_{1}^{-1}(v)\right)\stackrel{{\scriptstyle e}}{{\sim}}\mathcal{A}(v)$ and $\mathcal{N}_{2}\left(\pi_{2}^{-1}(v)\right)\stackrel{{\scriptstyle e}}{{\sim}}\mathcal{A}(v)$ by appropriately restricting $\pi_{1}$ and $\pi_{2}$ . Similarly, $\mathcal{N}_{1}\left((\pi_{1}^{\prime})^{-1}(w_{1})\right)\stackrel{{\scriptstyle e}}{{\sim}}\mathcal{A}^{\prime}(w_{1})$ and $\mathcal{N}_{2}\left((\pi_{2}^{\prime})^{-1}(w_{2})\right)\stackrel{{\scriptstyle e}}{{\sim}}\mathcal{A}^{\prime}(w_{2})$ . But $(\pi_{j}^{\prime})^{-1}(w_{j})=\pi_{j}^{-1}(v)$ , and so $\mathcal{N}_{j}\left((\pi_{j}^{\prime})^{-1}(w_{j})\right)=\mathcal{N}_{j}\left(\pi_{j}^{-1}(v)\right)$ , for $j\in\{1,2\}$ . Therefore $\mathcal{A}^{\prime}(w_{1})\stackrel{{\scriptstyle e}}{{\sim}}\mathcal{A}(v)$ and $\mathcal{A}^{\prime}(w_{2})\stackrel{{\scriptstyle e}}{{\sim}}\mathcal{A}(v)$ via $\pi_{1}\circ(\pi_{1}^{\prime})^{-1}$ and $\pi_{2}\circ(\pi_{2}^{\prime})^{-1}$ , respectively. Now, as $\mathcal{A}^{\prime}$ is an amalgam, it is clones-free, and thus we deduce that $w_{1}=w_{2}$ , for otherwise $w_{1}$ and $w_{2}$ would be clones in $\mathcal{A}^{\prime}$ . This establishes (46).

Now define $\psi:V^{\mathcal{A}}\to V^{\mathcal{A}^{\prime}}$ according to

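$$\psi(v)=\begin{cases}(\pi_{1}^{\prime}\circ\pi_{1}^{-1})(v), & v\in\pi_{1}(V^{1}),\\ (\pi_{2}^{\prime}\circ\pi_{2}^{-1})(v), & v\in\pi_{2}(V^{2}).\end{cases}\tag{47}$$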

It follows by (46) that this definition is consistent, in the sense that the two cases in (47) yield the same value for $\psi(v)$ when $v\in\pi_{1}(V^{1})\cap\pi_{2}(V^{2})$ . Now, Properties (i) and (ii) of Definition 14 for $\psi$ follow, so $\psi$ is an extensional isomorphism between $\mathcal{A}$ and $\mathcal{A}^{\prime}$ , finishing the proof. ∎
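The gluing of $\psi$ from the two pairs of embeddings, together with the consistency check on the overlap, can be illustrated as follows. This is a sketch under assumptions: the embeddings are encoded as plain Python dictionaries from nodes of $\mathcal{N}_{j}$ to nodes of $\mathcal{A}$ or $\mathcal{A}^{\prime}$ , and the function name `glue_isomorphism` and all node labels are hypothetical.

```python
def glue_isomorphism(pi1, pi2, pi1p, pi2p):
    """Build psi: V^A -> V^A' from embeddings pi_j: V^j -> V^A and pi_j': V^j -> V^A'.

    For a node pi_j(v) of A, psi takes the value pi_j'(v), i.e. (pi_j' o pi_j^{-1}).
    A ValueError is raised if the two branches disagree on the overlap
    pi_1(V^1) & pi_2(V^2), which is the consistency requirement discussed above.
    """
    psi = {}
    for pi, pip in ((pi1, pi1p), (pi2, pi2p)):
        for v, image in pi.items():
            candidate = pip[v]
            if image in psi and psi[image] != candidate:
                raise ValueError(f"inconsistent definition at node {image!r}")
            psi[image] = candidate
    return psi


# Toy data: both networks share the input "x"; their hidden nodes embed separately.
pi1 = {"x": "x", "h1": "a"}    # N_1 -> A
pi2 = {"x": "x", "h2": "b"}    # N_2 -> A
pi1p = {"x": "x", "h1": "p"}   # N_1 -> A'
pi2p = {"x": "x", "h2": "q"}   # N_2 -> A'
psi = glue_isomorphism(pi1, pi2, pi1p, pi2p)
```

Here the only overlap is the input node, where both branches return `"x"`, so the definition is consistent.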

Proof of Lemma 3.

Denote by $\mathcal{D}_{\sigma}=\mathbb{C}\setminus P$ the domain of holomorphy of $\sigma$ . We proceed by induction on $\mathrm{lv}(u)$ . In the base case $\mathrm{lv}(u)=0$ , i.e., $u=v_{in}$ , the claim is trivially true with $E_{u}=\varnothing$ . Now suppose that $\mathrm{lv}(u)\geq 1$ , and assume the statement holds for all $v\in V$ with $\mathrm{lv}(v)<\mathrm{lv}(u)$ , i.e., $\mathcal{D}_{\left\langle{v}\right\rangle^{\sigma}}=\mathbb{C}\setminus E_{v}$ , where $E_{v}$ are closed countable subsets of $\mathbb{C}\setminus\mathbb{R}$ . Set $E_{u}=\mathbb{C}\setminus\mathcal{D}_{\left\langle{u}\right\rangle^{\sigma}}$ . We will show that $E_{u}$ is a closed countable subset of $\mathbb{C}\setminus\mathbb{R}$ . To this end, first note that $S:=\bigcup_{v\in\mathrm{par}(u)}E_{v}$ is a closed countable subset of $\mathbb{C}\setminus\mathbb{R}$ , and thus $\mathbb{C}\setminus S$ is an open connected set containing $\mathbb{R}$ . We claim that if $z^{*}$ is a limit point of $E_{u}\setminus S$ , then $z^{*}\in S$ . Suppose otherwise, i.e., there exists a sequence $(z_{n})_{n\in\mathbb{N}}$ of distinct elements of $E_{u}\setminus S$ , and a point $z^{*}\in\mathbb{C}\setminus S$ , such that $z_{n}\to z^{*}$ . Define the function $f:\mathbb{C}\setminus S\to\mathbb{C}$ , $f(z)=\sum_{v\in\mathrm{par}(u)}\omega_{uv}\left\langle{v}\right\rangle^{\sigma}\!(z)+\theta_{u}$ . As the functions $\left\langle{v}\right\rangle^{\sigma}$ are holomorphic on $\mathcal{D}_{\left\langle{v}\right\rangle^{\sigma}}$ , they are, in particular, continuous, and so $f$ is continuous. Therefore $f(z_{n})\to f(z^{*})$ as $n\to\infty$ . As

[TABLE]

it follows by definition of natural domain that $f(z_{n})\in P$ , for all $n\in\mathbb{N}$ . Moreover, since $P$ is discrete, we deduce that there exists a point $p^{*}\in P$ such that $f(z_{n})=p^{*}$ , for all sufficiently large $n\in\mathbb{N}$ . Now, since $\mathbb{C}\setminus S$ is connected and $f$ is holomorphic, it follows that $f(z)=p^{*}$ , for all $z\in\mathbb{C}\setminus S$ . But $0\in\mathbb{R}\subset\mathbb{C}\setminus S$ , which thus implies $p^{*}=f(0)=\sum_{v\in\mathrm{par}(u)}\omega_{uv}\left\langle{v}\right\rangle^{\sigma}\!(0)+\theta_{u}\in\mathbb{R}$ , contradicting $P\subset\mathbb{C}\setminus\mathbb{R}$ . This completes the proof that any limit point of $E_{u}\setminus S$ is contained in $S$ . Now define the sets $E_{u}^{N}:=\{z\in E_{u}:|z|\leq N,\;d(z,S)\geq 1/N\}$ , for $N\in\mathbb{N}$ , where $d$ denotes the Euclidean distance in $\mathbb{C}$ . We see that $E_{u}^{N}$ is finite, for each $N\in\mathbb{N}$ , for otherwise there would exist a sequence $(z_{n})_{n\in\mathbb{N}}$ of distinct elements of $E_{u}^{N}$ converging to a point $z^{*}\in\mathbb{C}$ . But then, by the claim above, we have $z^{*}\in S$ , which contradicts $d(z_{n},S)\geq 1/N$ , for all $n\in\mathbb{N}$ . We deduce that $E_{u}=S\cup\bigcup_{N\in\mathbb{N}}E_{u}^{N}$ is a closed countable set, and therefore $\mathcal{D}_{\left\langle{u}\right\rangle^{\sigma}}=\mathbb{C}\setminus E_{u}$ is an open connected set. To see that $\mathcal{D}_{\left\langle{u}\right\rangle^{\sigma}}\supset\mathbb{R}$ , note that, for $z\in\mathbb{R}$ , we have $z\in\mathbb{C}\setminus S=\bigcap_{v\in\mathrm{par}(u)}\mathcal{D}_{\left\langle{v}\right\rangle^{\sigma}}$ , and $f(z)\in\mathbb{R}\subset\mathcal{D}_{\sigma}$ , so $z\in\mathcal{D}_{\left\langle{u}\right\rangle^{\sigma}}$ . ∎
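As a hedged illustration of the statement just proved, consider the simplest non-trivial case, taking $\sigma=\tanh$ as an illustrative choice (an assumption made here for concreteness, not a claim about which nonlinearities the lemma is stated for). The function $\tanh$ is meromorphic with poles exactly at the zeros of $\cosh$ , and a node $u$ at level 1 with real weight $\omega\neq 0$ and real bias $\theta$ has $S=E_{v_{in}}=\varnothing$ . The singular set can then be written down explicitly:

$$P=\left\{i\pi\left(n+\tfrac{1}{2}\right):n\in\mathbb{Z}\right\},\qquad\left\langle{u}\right\rangle^{\sigma}\!(z)=\tanh(\omega z+\theta),$$

$$E_{u}=\{z\in\mathbb{C}:\omega z+\theta\in P\}=\left\{\frac{i\pi\left(n+\tfrac{1}{2}\right)-\theta}{\omega}:n\in\mathbb{Z}\right\}.$$

Every element of $E_{u}$ has imaginary part $\pi(n+\tfrac{1}{2})/\omega\neq 0$ , so $E_{u}$ is a closed, countable (indeed discrete) subset of $\mathbb{C}\setminus\mathbb{R}$ , in agreement with the lemma.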

Proof of Lemma 4.

Let $\bm{a}$ , $\delta$ , and $T$ be as in the statement of the lemma, such that $D_{k}^{\circ}(\bm{a},\delta)\subset\mathcal{U}$ and $F|_{T}\equiv 0$ . Then the function $F_{\bm{a}}\vcentcolon=F(\,\cdot\,+\bm{a})$ is holomorphic on $\mathcal{U}-\bm{a}$ , and $F_{\bm{a}}|_{T-\bm{a}}\equiv 0$ . Thus, as $F|_{\mathcal{U}}\equiv 0$ if and only if $F_{\bm{a}}|_{\mathcal{U}-\bm{a}}\equiv 0$ , it suffices to prove the result for $\bm{a}=\bm{0}$ . Let $T_{0}\vcentcolon=T$ , $T_{k}\vcentcolon=D^{\circ}_{k}(\bm{0},\delta)$ , and, for $r=1,\dots,k-1$ , define the sets

[TABLE]

Note that $T_{r}\subset D^{\circ}_{k}(\bm{0},\delta)\subset\mathcal{U}$ , for $r\in\{0,\dots,k\}$ . We establish by induction over $r$ that $F|_{T_{r}}\equiv 0$ , $r\in\{0,\dots,k\}$ . The base case $F|_{T_{0}}\equiv 0$ holds by assumption. So suppose that $F|_{T_{r}}\equiv 0$ , for some $r\in\{0,\dots,k-1\}$ . If $0\leq r<k-1$ , fix arbitrary $z_{j}\in(-\delta,\delta)$ , for $j\in\{1,\dots,k-r-1\}$ . Similarly, if $0<r\leq k-1$ , fix arbitrary $s_{j}\in D_{1}^{\circ}(0,\delta)$ , for $j\in\{k-r+1,\dots,k\}$ . Consider the function $G:D_{1}^{\circ}(0,\delta)\to\mathbb{C}$ defined by

[TABLE]

Note that $G$ is holomorphic, and $G|_{(-\delta,\delta)}\,\equiv 0$ by the induction hypothesis. Since the zero set of a nonzero holomorphic function in one variable does not have a limit point in the domain, we deduce that $G|_{D_{1}^{\circ}(0,\delta)}\equiv 0$ . But $z_{j}$ and $s_{j}$ were arbitrary, so we have $F|_{T_{r+1}}\equiv 0$ . We have thus shown that $F$ is identically zero on an open subset $T_{k}=D^{\circ}_{k}(\bm{0},\delta)$ of its connected domain $\mathcal{U}$ , and so, by the multivariate identity theorem [19, 1.2.12], it must be identically zero on $\mathcal{U}$ . ∎

Proof of Lemma 5.

Let $t^{*}$ , $\bm{a}$ , $\delta$ , $T$ , and $\widetilde{T}$ be as in the statement of the lemma, such that $D_{k}^{\circ}(\bm{a},\delta)\subset\mathcal{U}$ , $\widetilde{T}\subset(\mathbb{C}\setminus\{t^{*}\})\times\mathbb{C}^{k}$ , $\mathrm{cl}(\widetilde{T})\supset T$ , and $F|_{\widetilde{T}}\equiv 0$ , and denote $\mathcal{V}\vcentcolon=D^{\circ}_{1+k}(\bm{a},\delta)$ . The function $F_{(t^{*}\!,\,\bm{a})}=F(\,\cdot\,+(t^{*},\bm{a}))$ is holomorphic on $\mathcal{U}-(t^{*},\bm{a})$ , and the sets

[TABLE]

and $\widetilde{T}_{(t^{*}\!,\,\bm{a})}\vcentcolon=\widetilde{T}-(t^{*},\bm{a})$ satisfy $\widetilde{T}_{(t^{*}\!,\,\bm{a})}\subset(\mathbb{C}\setminus\{0\})\times\mathbb{C}^{k}$ , $\mathrm{cl}(\widetilde{T}_{(t^{*}\!,\,\bm{a})})\supset T_{(t^{*}\!,\,\bm{a})}$ , and $F_{(t^{*}\!,\,\bm{a})}|_{{\widetilde{T}}_{(t^{*}\!,\,\bm{a})}}\equiv 0$ . Therefore, as $F|_{\mathcal{U}}\equiv 0$ if and only if $F_{(t^{*}\!,\,\bm{a})}|_{\mathcal{U}-(t^{*}\!,\,\bm{a})}\equiv 0$ , and $(t^{*}\!,\bm{a})$ was arbitrary, it suffices to prove the result for $(t^{*}\!,\bm{a})=(0,\bm{0})$ . Assume by way of contradiction that $F|_{\mathcal{V}}$ is not identically 0. Then, by inspection of the power series expansion of $F$ in the open neighborhood $\mathcal{V}$ of $(0,\bm{0})$ , we obtain that there exists a maximal $p\in\mathbb{N}_{0}$ such that $z_{0}^{-p}F(z_{0},z_{1},\dots,z_{k})$ is holomorphic in $\mathcal{V}$ . Write $G(z_{0},z_{1},\dots,z_{k})=z_{0}^{-p}F(z_{0},z_{1},\dots,z_{k})$ , with $G:\mathcal{V}\to\mathbb{C}$ holomorphic and not identically 0. Now, due to $\widetilde{T}\subset(\mathbb{C}\setminus\{0\})\times\mathbb{C}^{k}$ , we have $z_{0}\neq 0$ , for every $(z_{0},z_{1},\dots,z_{k})\in\widetilde{T}$ . Moreover, as $F|_{\widetilde{T}}\equiv 0$ , we have $G(z_{0},z_{1},\dots,z_{k})=z_{0}^{-p}\cdot 0=0$ , for all $(z_{0},z_{1},\dots,z_{k})\in\widetilde{T}$ . Now, since $G$ is continuous and $\mathrm{cl}(\widetilde{T})\supset T$ by assumption, it follows that $G(0,z_{1},\dots,z_{k})=0$ , for all $(0,z_{1},\dots,z_{k})\in T$ . The mapping $(z_{1},\dots,z_{k})\mapsto G(0,z_{1},\dots,z_{k})$ is holomorphic on $D_{k}^{\circ}(\bm{0},\delta)$ and identically zero on the set

[TABLE]

and so, by Lemma 4, we obtain $G(0,z_{1},\dots,z_{k})=0$ , for all $(0,z_{1},\dots,z_{k})\in\mathcal{V}$ . By inspection of the power series expansion of $G$ in $\mathcal{V}$ , we find that every monomial with a nonzero coefficient contains a positive power of $z_{0}$ , so $G$ must have the form $G(z_{0},z_{1},\dots,z_{k})=z_{0}\,H(z_{0},z_{1},\dots,z_{k})$ , for some function $H$ holomorphic in $\mathcal{V}$ . Hence $z_{0}^{-(p+1)}F(z_{0},\dots,z_{k})=H(z_{0},\dots,z_{k})$ is holomorphic in $\mathcal{V}$ , contradicting the maximality of $p$ . Our hypothesis that $F|_{\mathcal{V}}$ is not identically zero must hence be false, i.e., we have $F|_{\mathcal{V}}\equiv 0$ . Finally, by the multivariate identity theorem [19, 1.2.12], we deduce that $F|_{\mathcal{U}}\equiv 0$ . ∎

Proof of Lemma 6.

First note that $M$ is the closure of a one-parameter subgroup of $T^{d}=\mathbb{R}^{d}/\mathbb{Z}^{d}$ . Since $T^{d}$ is compact and abelian, so is $M$ . Moreover, $M$ is connected (as the closure of a connected set), and so, by [20, Theorem 11.2], it is itself isomorphic to a torus. It remains to determine its dimension. A character on a compact abelian group $G$ is a continuous group homomorphism $\chi:G\to S^{1}$ , where $S^{1}=\{z\in\mathbb{C}:|z|=1\}$ is the multiplicative circle group, and we denote by $\widehat{G}$ the set of all characters on $G$ . We claim that

[TABLE]

The inclusion of $M$ in the right-hand side is clear, so we only need to show the reverse inclusion. Note that, since $M$ is closed, $T^{d}/M$ is a Lie group. We will rewrite the right-hand side of (48) by establishing a bijective correspondence between the characters $\chi:T^{d}\to S^{1}$ such that $M\subset\ker(\chi)$ , and the characters $f:T^{d}/M\to S^{1}$ . To this end, let $\pi:T^{d}\to T^{d}/M$ be the projection map, and suppose that $\chi:T^{d}\to S^{1}$ is a character such that $M\subset\ker(\chi)$ . Then $\chi$ factors according to $\chi=f\circ\pi$ , for some continuous homomorphism $f:T^{d}/M\to S^{1}$ , in other words, $f$ is a character on $T^{d}/M$ . Conversely, for any such $f$ we have that $f\circ\pi$ is a character $\chi$ on $T^{d}$ with $M\subset\ker(\chi)$ . Therefore it suffices to show that

[TABLE]

Indeed, if this is the case, then

[TABLE]

as desired. We thus proceed to establishing (49). First note that, as $T^{d}$ is compact, connected, and abelian, then so is $T^{d}/M$ , and thus by [20, Theorem 11.2] we have that $T^{d}/M$ is isomorphic (as a Lie group) to the torus $T^{r}$ of some dimension $r\geq 0$ . Now suppose that $(u_{1},u_{2},\dots,u_{r})\in T^{r}$ is such that $f(u_{1},u_{2},\dots,u_{r})=1$ , for all characters $f:T^{r}\to S^{1}$ . Our goal is to show that $u_{j}=0\mod\mathbb{Z}$ , for all $j=1,\dots,r$ . For a given $j\in\{1,\dots,r\}$ let $f_{j}(t_{1},t_{2},\dots,t_{r})=e^{2\pi it_{j}}$ . Since $f_{j}:T^{r}\to S^{1}$ is a character, we have $1=f_{j}(u_{1},\dots,u_{r})=e^{2\pi iu_{j}}$ , and thus $u_{j}=0\mod\mathbb{Z}$ . Since this holds for all $j$ , we have (49), and therefore also (48). Note that any character on $T^{d}$ has the form

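$$\chi_{\bm{m}}\big((t_{1},t_{2},\dots,t_{d})+\mathbb{Z}^{d}\big)=e^{2\pi i(m_{1}t_{1}+m_{2}t_{2}+\dots+m_{d}t_{d})},$$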

where $\bm{m}=(m_{1},m_{2},\dots,m_{d})\in\mathbb{Z}^{d}$ (this is easily seen for $d=1$ , and follows by induction for other values of $d$ ). Now, for any character $\chi_{\bm{m}}:T^{d}\to S^{1}$ such that $M\subset\ker(\chi_{\bm{m}})$ , we have

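$$\chi_{\bm{m}}\big((\alpha_{1}t,\alpha_{2}t,\dots,\alpha_{d}t)+\mathbb{Z}^{d}\big)=e^{2\pi i(m_{1}\alpha_{1}+m_{2}\alpha_{2}+\dots+m_{d}\alpha_{d})t}=1,\quad\text{for all }t\in\mathbb{R},$$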

by definition of $M$ , which is equivalent to

$$m_{1}\alpha_{1}+m_{2}\alpha_{2}+\dots+m_{d}\alpha_{d}=0.$$

It follows immediately that $Z=\{\bm{m}\in\mathbb{Z}^{d}:\chi_{\bm{m}}\in\widehat{T^{d}},\,M\subset\ker(\chi_{\bm{m}})\}$ is a free abelian group of rank $r=d-k$ , where $k=\dim\langle\alpha_{1},\dots,\alpha_{d}\rangle_{\mathbb{Q}}$ . We can thus pick a basis $\{\bm{m}^{1},\dots,\bm{m}^{r}\}$ for $Z$ , and then, for any character $\chi_{\bm{m}}$ with $\bm{m}\in Z$ , we have $\chi_{\bm{m}}=\chi_{\bm{m}^{1}}^{n_{1}}\dots\chi_{\bm{m}^{r}}^{n_{r}}$ , for some $n_{1},\dots,n_{r}\in\mathbb{Z}$ . Therefore, by (48), $M$ is the kernel of the continuous surjective homomorphism $\Phi:T^{d}\to(S^{1})^{r}$ given by $\Phi=(\chi_{\bm{m}^{1}},\dots,\chi_{\bm{m}^{r}})$ , and hence its dimension is $d-r=k$ , as desired. ∎
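The dimension count can be checked numerically on a small example. The sketch below is a heuristic illustration, not a proof: it searches a bounded box of integer vectors for relations $m_{1}\alpha_{1}+\dots+m_{d}\alpha_{d}=0$ using a floating-point tolerance, and computes the rank of the relations found with exact rational arithmetic. The function names, the box size, the tolerance, and the choice $\bm{\alpha}=(1,\sqrt{2},1+\sqrt{2})$ are all assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product
import math


def integer_relations(alpha, bound, tol=1e-9):
    """Nonzero integer vectors m with |m_j| <= bound and |sum_j m_j * alpha_j| < tol."""
    found = []
    for m in product(range(-bound, bound + 1), repeat=len(alpha)):
        if any(m) and abs(sum(mj * aj for mj, aj in zip(m, alpha))) < tol:
            found.append(m)
    return found


def rank(rows):
    """Rank of a list of integer vectors, via Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in r] for r in rows]
    n_cols = len(rows[0]) if rows else 0
    rk = 0
    for col in range(n_cols):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(len(rows)):
            if i != rk and rows[i][col] != 0:
                factor = rows[i][col] / rows[rk][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[rk])]
        rk += 1
    return rk


# alpha_3 = alpha_1 + alpha_2, so the only relations are multiples of (1, 1, -1).
alpha = (1.0, math.sqrt(2.0), 1.0 + math.sqrt(2.0))
relations = integer_relations(alpha, bound=3)
r = rank(relations)      # rank of the relation lattice seen inside the box
k = len(alpha) - r       # corresponding dimension d - r of the closure of the orbit
```

For this choice the search finds only integer multiples of $(1,1,-1)$ , giving $r=1$ and $k=2$ , which matches $\dim\langle 1,\sqrt{2},1+\sqrt{2}\rangle_{\mathbb{Q}}=2$ .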

Proof of Lemma 7.

Define the following subsets of $T^{d}$ :

[TABLE]

as well as the map $\Phi:\mathbb{R}^{k}\to T^{d}$

[TABLE]

Let $K=\ker\Phi$ , and note that $M^{\prime}$ is the image of $\Phi$ . Further, note that $K$ is an abelian group, and a subgroup of $\mathbb{Z}^{k}$ . For $j=1,\dots,k$ , let $N_{j}\in\mathbb{Z}\setminus\{0\}$ be such that $q_{pj}N_{j}\in\mathbb{Z}$ , for all $p=1,\dots,d$ . Let $\bm{e}_{j}\in\mathbb{R}^{k}$ be the vector with $N_{j}$ in the $j$ -th entry, and $0$ in all the other entries. Then $\Phi(\bm{e}_{j})=\bm{0}+\mathbb{Z}^{d}$ , for all $j=1,\dots,k$ , so $E:=\{\bm{e}_{1},\dots,\bm{e}_{k}\}\subset K$ . Moreover, $E$ is a basis for $\mathbb{R}^{k}$ , so $K$ is a lattice of rank $k$ . Therefore $M^{\prime}$ and $\mathbb{R}^{k}/K$ are isomorphic as groups via the induced map

[TABLE]

Since $\widetilde{\Phi}$ is a continuous bijection, $\mathbb{R}^{k}/K$ is compact, and $T^{d}$ is Hausdorff, it follows that the map $\widetilde{\Phi}$ is, in fact, a Lie group isomorphism (when $M^{\prime}$ is equipped with the subspace topology inherited from $T^{d}$ ). In particular, $M^{\prime}$ is a torus of dimension $k$ . Let $\{\bm{b}_{1},\dots,\bm{b}_{k}\}$ be a basis for $K$ , and let

[TABLE]

be a fundamental domain of the lattice $K$ . Then, for any $\bm{u}\in\mathbb{R}^{k}$ we can write $\bm{u}=\bm{b}+\bm{k}$ with $\bm{b}\in B$ and $\bm{k}\in K$ . We will prove the lemma with

[TABLE]

where $\mathrm{int}(B)$ denotes the interior of $B$ . Note that $C$ is open and $0\in C$ . For $t\in\mathbb{R}$ we have

[TABLE]

and so $M\subset M^{\prime}$ . Moreover, by Lemma 6 we have that $\mathrm{cl}(M)$ is a torus of dimension $k$ , so we deduce $\mathrm{cl}(M)=M^{\prime}$ . We next establish that $\mathrm{cl}(M_{R})=M^{\prime}$ , for every $R>0$ . To this end, we distinguish between the cases $k=1$ and $k\geq 2$ .

The case $k=1$ . Let $(\alpha_{1}t,\alpha_{2}t,\dots,\alpha_{d}t)+\mathbb{Z}^{d}$ , $t\in\mathbb{R}$ , be an arbitrary element of $M$ . As $\dim\langle\alpha_{1},\dots,\allowbreak\alpha_{d}\rangle_{\mathbb{Q}}=k=1$ , there exist $a\in\mathbb{R}\setminus\{0\}$ and $m_{1},\dots,m_{d}\in\mathbb{Z}$ such that $(\alpha_{1},\alpha_{2},\dots,\alpha_{d})=(am_{1},am_{2},\dots,am_{d})$ . Now let $n\in\mathbb{Z}$ be an integer such that $t+n/a\notin[-R,R]$ . Then

[TABLE]

Therefore $M_{R}=M$ , and so $\mathrm{cl}(M_{R})=\mathrm{cl}(M)=M^{\prime}$ .

The case $k\geq 2$ . First note that

[TABLE]

is the image of $[-R,R]\subset\mathbb{R}$ under a continuous bijective map from $\mathbb{R}$ to $T^{d}$ . Since $[-R,R]\subset\mathbb{R}$ is compact and $T^{d}$ is Hausdorff, it follows by [21, Cor. 15.1.7] that $L_{R}$ is homeomorphic to $[-R,R]$ . In particular, $L_{R}$ is a 1-dimensional submanifold of $M$ with boundary. Now, by general properties of the closure, we have $\mathrm{cl}(M_{R})=\mathrm{cl}(M\setminus L_{R})\supset\mathrm{cl}(M)\setminus\mathrm{cl}(L_{R})=M^{\prime}\setminus L_{R}$ . Therefore, as $M^{\prime}$ has dimension $k>1$ and $L_{R}$ has dimension 1, we have $\mathrm{cl}(M_{R})=\mathrm{cl}(\mathrm{cl}(M_{R}))\supset\mathrm{cl}(M^{\prime}\setminus L_{R})=M^{\prime}$ . On the other hand, $\mathrm{cl}(M_{R})\subset\mathrm{cl}(M)=M^{\prime}$ , and thus $\mathrm{cl}(M_{R})=M^{\prime}$ , as desired. Now fix some $\bm{s}=(u_{1}/\alpha_{1},\dots,u_{k}/\alpha_{k})\in C$ , where $\bm{u}=(u_{1},\dots,u_{k})\in\mathrm{int}(B)$ . Since $M_{R}$ is dense in $M^{\prime}$ , for every $R>0$ , there exists a sequence $(t^{n,\bm{s}})_{n\in\mathbb{N}}$ in $\mathbb{R}$ with $|t^{n,\bm{s}}|\to\infty$ such that

[TABLE]

As $M\subset M^{\prime}$ , there exists a sequence $(\widetilde{\bm{u}}^{n,\bm{s}})_{n\in\mathbb{N}}$ such that

[TABLE]

for all $n\in\mathbb{N}$ . With this, (52) reads

[TABLE]

and after applying the isomorphism $\widetilde{\Phi}^{-1}$ , we obtain $\widetilde{\bm{u}}^{n,\bm{s}}+K\to\bm{u}+K$ as $n\to\infty$ . Now, for each $n\in\mathbb{N}$ , let $\bm{u}^{n,\bm{s}}=(u_{1}^{n,\bm{s}},\dots,u_{k}^{n,\bm{s}})\in B$ be such that $\bm{u}^{n,\bm{s}}-\widetilde{\bm{u}}^{n,\bm{s}}\in K$ . Then we have ${\bm{u}}^{n,\bm{s}}+K\to\bm{u}+K$ as $n\to\infty$ . Since $\bm{u}\in\mathrm{int}(B)$ , there exists an $n_{0}\in\mathbb{N}$ such that ${\bm{u}}^{n,\bm{s}}\in\mathrm{int}(B)$ , for $n\geq n_{0}$ . By discarding the first $n_{0}$ terms of the sequences $(t^{n,\bm{s}})_{n\in\mathbb{N}}$ and $(\widetilde{\bm{u}}^{n,\bm{s}})_{n\in\mathbb{N}}$ , we may assume w.l.o.g. that $n_{0}=0$ . It follows that $\bm{u}^{n,\bm{s}}\to\bm{u}$ as $n\to\infty$ . Now define $\bm{r}^{n,\bm{s}}=(u_{1}^{n,\bm{s}}/\alpha_{1},\dots,u_{k}^{n,\bm{s}}/\alpha_{k})$ . We then have $\bm{r}^{n,\bm{s}}\in C$ , $\bm{r}^{n,\bm{s}}\to\bm{s}$ , and (53) yields

[TABLE]

as desired. ∎
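The density statement used in the case $k\geq 2$ , namely that the orbit restricted to times $|t|>R$ is still dense in $M^{\prime}$ , can be illustrated numerically. The following is a rough sketch under assumptions: it takes $d=k=2$ with $\bm{\alpha}=(1,\sqrt{2})$ , so that $M^{\prime}=T^{2}$ , and it only searches times of the form $t=(\text{target}_{1}+n)/\alpha_{1}$ with integer $n$ . The function names, target point, and search window are hypothetical choices.

```python
import math


def torus_distance(x, y):
    """Sup-distance between two points of T^d = R^d / Z^d given by representatives."""
    return max(min((a - b) % 1.0, (b - a) % 1.0) for a, b in zip(x, y))


def approach_target(alpha, target, n_start, n_steps):
    """Find a time t at which (alpha_1 t, ..., alpha_d t) mod 1 is closest to `target`.

    Only times t = (target[0] + n) / alpha[0], n_start <= n < n_start + n_steps,
    are examined, so the first coordinate matches target[0] up to rounding.
    """
    best_t, best_dist = None, float("inf")
    for n in range(n_start, n_start + n_steps):
        t = (target[0] + n) / alpha[0]
        point = [(a * t) % 1.0 for a in alpha]
        dist = torus_distance(point, target)
        if dist < best_dist:
            best_t, best_dist = t, dist
    return best_t, best_dist


alpha = (1.0, math.sqrt(2.0))   # linearly independent over Q, so k = d = 2
target = (0.3, 0.7)
# Restrict the search to times t > R = 1000.
t_far, dist = approach_target(alpha, target, n_start=1000, n_steps=2000)
```

Even though all times with $|t|\leq 1000$ are excluded, the orbit still passes very close to the target point, as the density of $M_{R}$ in $M^{\prime}$ predicts.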

Bibliography

[1] H. J. Sussman, “Uniqueness of the weights for minimal feedforward nets with a given input-output map,” Neural Networks, vol. 5, no. 4, pp. 589–593, July 1992.
[2] F. Albertini, E. D. Sontag, and V. Maillot, “Uniqueness of weights for neural networks,” Artificial Neural Networks for Speech and Vision, pp. 113–125, 1993.
[3] C. Fefferman, “Reconstructing a neural net from its output,” Revista Matemática Iberoamericana, vol. 10, no. 3, pp. 507–555, 1994.
[4] Y. Le Cun, L. D. Jackel, L. Bottou, A. Brunot, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. A. Müller, E. Säckinger, P. Simard, and V. Vapnik, “Comparison of learning algorithms for handwritten digit recognition,” International Conference on Artificial Neural Networks, pp. 53–60, 1995.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25. Curran Associates, Inc., 2012, pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[6] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, 2012.
[7] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[8] H. Bölcskei, P. Grohs, G. Kutyniok, and P. Petersen, “Optimal approximation with sparsely connected deep neural networks,” SIAM Journal on Mathematics of Data Science, vol. 1, no. 1, pp. 8–45, 2019.
