This paper investigates when the input-output map of a neural network with various nonlinearities uniquely determines its architecture and parameters, providing minimal conditions for identifiability across diverse network structures.
Contribution
It derives necessary genericity conditions for neural network identifiability of arbitrary depth and connectivity, and constructs a broad family of nonlinearities satisfying these conditions.
Findings
1. Identifiability conditions are established for networks of any depth and connectivity.
2. A large family of nonlinearities is constructed that meets the minimal genericity conditions.
3. The family of nonlinearities can approximate many common nonlinear functions arbitrarily well.
Abstract
This paper addresses the following question of neural network identifiability: Does the input-output map realized by a feed-forward neural network with respect to a given nonlinearity uniquely specify the network architecture, weights, and biases? Existing literature on the subject (Sussman 1992; Albertini, Sontag et al. 1993; Fefferman 1994) suggests that the answer should be yes, up to certain symmetries induced by the nonlinearity, and provided the networks under consideration satisfy certain "genericity conditions". The results in Sussman 1992 and Albertini, Sontag et al. 1993 apply to networks with a single hidden layer, and in Fefferman 1994 the networks need to be fully connected. In an effort to answer the identifiability question in greater generality, we derive necessary genericity conditions for the identifiability of neural networks of arbitrary depth and connectivity with an…
Full text
Neural Network Identifiability for a Family of Sigmoidal Nonlinearities
Verner Vlačić and Helmut Bölcskei
Dept. of EE and Dept. of Math., ETH Zurich, Switzerland
This paper addresses the following question of neural network identifiability: Does the input-output map realized by a feed-forward neural network with respect to a given nonlinearity uniquely specify the network architecture, weights, and biases?
Existing literature on the subject [1, 2, 3] suggests that the answer should be yes, up to certain symmetries induced by the nonlinearity, and provided the networks under consideration satisfy certain “genericity conditions”.
The results in [1] and [2] apply to networks with a single hidden layer and in [3] the networks need to be fully connected.
In an effort to answer the identifiability question in greater generality, we derive necessary genericity conditions for the identifiability of neural networks of arbitrary depth and connectivity with an arbitrary nonlinearity. Moreover, we construct a family of nonlinearities for which these genericity conditions are minimal, i.e., both necessary and sufficient. This family is large enough to approximate many commonly encountered nonlinearities to within arbitrary precision in the uniform norm.
I Introduction
Deep learning has become a highly successful machine learning method employed in a wide range of applications such as optical character recognition [4], image classification [5], and speech recognition [6]. In a typical deep learning scenario one aims to fit a parametric model, realized by a deep neural network, to match a set of training data points. In order to make the ensuing discussion more concrete, we begin with the definition of a neural network and the map it realizes under a nonlinearity.
Definition 1 (Neural network).
We call an ordered sequence
$$N = (W^1, \theta^1, W^2, \theta^2, \dots, W^L, \theta^L)$$
a neural network, where
– $L$ is a positive integer, referred to as the depth of $N$,
– $(D_0, D_1, \dots, D_L)$ is an $(L+1)$-tuple of positive integers, called the layout,
– $W^\ell = (W^\ell_{jk}) \in \mathbb{R}^{D_\ell \times D_{\ell-1}}$, $\ell \in \{1, \dots, L\}$, are matrices whose entries are referred to as the network’s weights, and
– $\theta^\ell = (\theta^\ell_j) \in \mathbb{R}^{D_\ell}$, $\ell \in \{1, \dots, L\}$, are vectors of the so-called biases.
Furthermore, we stipulate that none of the $W^\ell$, $\ell \in \{1, \dots, L\}$, have an identically zero row or an identically zero column.
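Definition 2.
Given a neural network $N$ and a nonlinear function $\rho: \mathbb{R} \to \mathbb{R}$, referred to as the nonlinearity, we define the map realized by $N$ under $\rho$ as the function $\langle N \rangle^\rho: \mathbb{R}^{D_0} \to \mathbb{R}^{D_L}$ given by
$$\langle N \rangle^\rho(x) = \begin{cases} \rho(W^1 x + \theta^1), & L = 1, \\ \rho\big(W^L \langle N' \rangle^\rho(x) + \theta^L\big), & L \geq 2, \end{cases} \qquad N' = (W^1, \theta^1, \dots, W^{L-1}, \theta^{L-1}),$$
where $\rho$ acts on real vectors in a componentwise fashion.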
The requirement that the matrices Wℓ in Definition 1 have nonzero rows corresponds to the absence of nodes whose contributions depend on the biases only, and are therefore constant as functions of the input. Similarly, columns that are identically zero correspond to nodes whose contributions do not enter the computation at the next layer.
The map of a neural network failing this requirement can be realized by a network obtained by simply removing such spurious nodes.
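The map realized by a network can be sketched in a few lines of Python. This is a minimal illustration, assuming that the nonlinearity is applied after every affine map, including the last one (in line with the node maps of Definition 11); the list-of-pairs representation and the names `affine`, `realize`, and `example_net` are illustrative choices, not taken from the paper.

```python
import math

def affine(W, theta, x):
    """Return W x + theta, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W, theta)]

def realize(network, rho, x):
    """Evaluate the map of `network` under `rho` at the input vector `x`.

    `network` is a list of (W, theta) pairs, one per layer.
    Assumption: rho is applied after every affine map, including the last.
    """
    for W, theta in network:
        x = [rho(t) for t in affine(W, theta, x)]
    return x

# Layout (2, 3, 1): two inputs, three hidden nodes, one output.
# No weight matrix has an identically zero row or column.
example_net = [
    ([[1.0, -2.0], [0.5, 0.0], [0.0, 3.0]], [0.1, -0.2, 0.3]),
    ([[1.0, -1.0, 2.0]], [0.0]),
]
y = realize(example_net, math.tanh, [0.4, -0.7])
```

Here `y` is a list with a single entry, the value of the network map at the point (0.4, −0.7) under tanh.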
In practical applications, the numbers L,D0,D1,…,DL are typically determined through heuristic considerations, whereas the coefficients Wℓ,θℓ of the affine maps x↦Wℓx+θℓ are learned based on training data. For an overview of practical techniques for deep learning, see [7].
Neural networks are often studied as mathematical objects in their own right, for instance in approximation theory [8, 9, 10, 11] and in control theory [12, 13]. In this context, a natural question is that of identification: Can a neural network be uniquely identified from the map it is to realize? Specifically, we will be interested in identifiability according to the following definition.
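Definition 3 (Identifiability).
Given positive integers $D_{in}$ and $D_{out}$, define $\mathcal{N}_{D_{in},D_{out}}$ to be the set of all neural networks whose layouts $(D_0, \dots, D_L)$ satisfy $D_0 = D_{in}$ and $D_L = D_{out}$, but are otherwise arbitrary. Let $\mathcal{N}$ be a subset of $\mathcal{N}_{D_{in},D_{out}}$, $\rho$ a nonlinearity, and $\sim$ an equivalence relation on $\mathcal{N}_{D_{in},D_{out}}$.
(i) We say that $\sim$ is compatible with $(\mathcal{N}, \rho)$ if, for all $N_1, N_2 \in \mathcal{N}$,
$$N_1 \sim N_2 \implies \langle N_1 \rangle^\rho(x) = \langle N_2 \rangle^\rho(x) \ \text{ for all } x \in \mathbb{R}^{D_{in}}.$$
(ii) We say that $(\mathcal{N}, \rho)$ is identifiable up to $\sim$ if, for all $N_1, N_2 \in \mathcal{N}$,
$$\langle N_1 \rangle^\rho(x) = \langle N_2 \rangle^\rho(x) \ \text{ for all } x \in \mathbb{R}^{D_{in}} \implies N_1 \sim N_2.$$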
Thus, by informally saying that a neural network N1 in a certain class is identifiable, we mean that any neural network N2 in the same class giving rise to the same output map, i.e., ⟨N1⟩ρ=⟨N2⟩ρ, is necessarily equivalent to N1. The role of the equivalence relation ∼ in the previous definition is thus to “measure the degree of non-uniqueness”, and in particular, to accommodate symmetries within the network that may arise from symmetries induced by the network weights and biases (such as the presence of clone pairs, to be introduced in Definition 5), from symmetries of the nonlinearity (e.g., tanh is odd), or from both simultaneously. These abstract concepts will be made concrete momentarily when discussing the seminal work by Fefferman [3], and in Section II through Definitions 4 and 5, as well as in the examples leading up to the formulation of the paper’s main results.
In [3], Fefferman showed that neural networks satisfying the following genericity conditions are, indeed, uniquely determined by the map they realize under the nonlinearity ρ=tanh, up to certain obvious isomorphisms of networks:
Assumptions 1.
(i) $\theta^\ell_j \neq 0$, for all $\ell$ and $j$, and $|\theta^\ell_j| \neq |\theta^\ell_{j'}|$, for all $\ell$ and $j, j'$ with $j \neq j'$,
(ii) $W^\ell_{jk} \neq 0$, for all $\ell$, $j$, and $k$, and
(iii) for all $\ell$, $k$ and $j, j'$ with $j \neq j'$,
[TABLE]
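More precisely, for fixed positive integers $D_{in}$ and $D_{out}$, Fefferman showed that $(\mathcal{N}^{A1}_{D_{in},D_{out}}, \tanh)$ is identifiable up to $\sim_\pm$, where $\mathcal{N}^{A1}_{D_{in},D_{out}}$ is defined as the set of all neural networks in $\mathcal{N}_{D_{in},D_{out}}$ satisfying Assumptions 1, and $\sim_\pm$ is defined by stipulating that $N \sim_\pm \widetilde{N}$ if and only if
(i) $L = \widetilde{L}$ and $(D_0, D_1, \dots, D_L) = (\widetilde{D}_0, \widetilde{D}_1, \dots, \widetilde{D}_{\widetilde{L}})$, and
(ii) there exists a collection of signs $\{\epsilon^\ell_j : 0 \leq \ell \leq L, 1 \leq j \leq D_\ell\}$, $\epsilon^\ell_j \in \{-1, +1\}$, and permutations $\gamma_\ell: \{1, \dots, D_\ell\} \to \{1, \dots, D_\ell\}$ such that
– $\gamma_\ell$ is the identity permutation and $\epsilon^\ell_j = +1$, $j \in \{1, \dots, D_\ell\}$, whenever $\ell = 0$ or $\ell = L$, and
– for all $\ell \in \{1, \dots, L\}$, $k \in \{1, \dots, D_{\ell-1}\}$, and $j \in \{1, \dots, D_\ell\}$,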
[TABLE]
It can be verified that ∼± is an equivalence relation on NA1Din,Dout.
Networks $N$, $\widetilde{N}$ such that $N \sim_\pm \widetilde{N}$ are said to be isomorphic up to sign changes.
The permutations γℓ reflect the fact that the ordering of the neurons in the hidden layers 1,…,L−1 is not unique, whereas the freedom in choosing the signs ϵjℓ reflects that tanh is an odd function. It can be verified that any two networks isomorphic up to sign changes give rise to the same map under the tanh nonlinearity, so ∼± is compatible with (NA1Din,Dout,tanh). The crux of Fefferman’s result therefore lies in proving the converse statement, namely that two networks giving rise to the same map with respect to tanh are necessarily isomorphic up to sign changes. This is effected by the insight that the depth, the layout, and the weights and biases of a network N∈NA1Din,Dout are encoded in the geometry of the singularities of the analytic continuation of ⟨N⟩tanh.
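The sign-change symmetry can be checked numerically. The sketch below flips the sign of one hidden node (its incoming weights, its bias, and its outgoing weights) and compares the two maps under tanh on a grid of inputs. It assumes the nonlinearity is applied in every layer; the helper names `realize` and `flip_hidden_sign` and the example weights are hypothetical, and the example network is not claimed to satisfy Assumptions 1.

```python
import math

def realize(network, rho, x):
    """Map of a layered network (list of (W, theta) pairs) under rho."""
    for W, theta in network:
        x = [rho(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W, theta)]
    return x

def flip_hidden_sign(network, layer, j):
    """Negate hidden node j of `layer` (0-based): its incoming weights,
    its bias, and its outgoing weights in the next layer."""
    net = [([row[:] for row in W], theta[:]) for W, theta in network]
    W, theta = net[layer]
    W[j] = [-w for w in W[j]]
    theta[j] = -theta[j]
    W_next, _ = net[layer + 1]
    for row in W_next:
        row[j] = -row[j]
    return net

net = [([[1.0], [-2.0]], [0.3, -0.5]), ([[0.7, 1.1]], [0.2])]
flipped = flip_hidden_sign(net, 0, 1)

# Largest difference between the two maps over a grid of inputs in [-3, 3].
max_gap = max(
    abs(realize(net, math.tanh, [x / 10])[0] - realize(flipped, math.tanh, [x / 10])[0])
    for x in range(-30, 31)
)
```

Because tanh is odd, the two sign changes cancel and `max_gap` vanishes up to floating-point rounding; with a nonlinearity that is not odd, the two maps would in general differ.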
We note that Fefferman distilled the precise conditions of Assumptions 1 from his proof technique, in order to define a class of neural networks that is, on the one hand, sufficiently small to guarantee identifiability, and on the other hand, sufficiently large to encompass “generic” networks. Indeed, if we consider the network weights and biases (W1,θ1,…,WL,θL) as elements of the space RD1×D0×RD1×⋯×RDL×DL−1×RDL, then Assumptions 1 rule out only a set of measure zero.
In the contemporary practical machine learning literature, however, a network satisfying Assumptions 1 would hardly be considered generic, as Part (i) of Assumptions 1 implies that all biases are nonzero, and Part (ii) imposes full connectivity throughout the network.
Indeed, Fefferman remarks explicitly that it would be interesting to replace Assumptions 1 with minimal hypotheses, and to study nonlinearities other than tanh. The present paper aims to address these two issues.
Characterizing the fundamental nature of conditions necessary for identifiability with respect to a fixed nonlinearity, even a simple one such as tanh, is likely a rather formidable task.
In fact, the minimal identifiability conditions may generally depend on “fine” properties of the nonlinearity under consideration, and it is hence unclear how much insight can be obtained by having conditions that are specific to a given nonlinearity.
We will thus be interested in an identification result with very mild conditions on the weights and biases of the neural networks to be identified, while still accommodating a broad class of nonlinearities.
II Contributions
We begin with two motivating examples. These lead up to the statements of our main contributions, whose corresponding proofs are developed in the remainder of the paper.
We consider nonlinearities ρ which are not necessarily odd (as tanh), and thus need an equivalence relation which dispenses with sign changes.
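Definition 4 (Neural network isomorphism).
We say that the neural networks $N$ and $\widetilde{N}$ are isomorphic, and write $N \simeq \widetilde{N}$, if
(i) $L = \widetilde{L}$ and $(D_0, D_1, \dots, D_L) = (\widetilde{D}_0, \widetilde{D}_1, \dots, \widetilde{D}_{\widetilde{L}})$, and
(ii) there exist permutations $\gamma_\ell: \{1, \dots, D_\ell\} \to \{1, \dots, D_\ell\}$ such that
– $\gamma_\ell$ is the identity permutation for $\ell = 0$ and $\ell = L$, and
– for all $\ell \in \{1, \dots, L\}$, $k \in \{1, \dots, D_{\ell-1}\}$, and $j \in \{1, \dots, D_\ell\}$,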
[TABLE]
In the remainder of the paper we will work exclusively with isomorphisms in the sense of Definition 4.
Note that any two isomorphic networks give rise to the same map with respect to any nonlinearity ρ,
and thus ≃ is an equivalence relation compatible with any pair (N,ρ).
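That relabeling hidden nodes leaves the map unchanged for any nonlinearity can be illustrated as follows. The sketch reorders the nodes of one hidden layer (rows of its weight matrix and bias vector, and the matching columns of the next weight matrix); the indexing convention of `permute_hidden` is an assumption made for this example and need not coincide with the one used in Definition 4.

```python
import math

def realize(network, rho, x):
    """Map of a layered network (list of (W, theta) pairs) under rho."""
    for W, theta in network:
        x = [rho(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W, theta)]
    return x

def permute_hidden(network, layer, perm):
    """Reorder the nodes of hidden `layer` (0-based): new node j is old node perm[j]."""
    net = [([row[:] for row in W], theta[:]) for W, theta in network]
    W, theta = net[layer]
    W_next, theta_next = net[layer + 1]
    net[layer] = ([W[p] for p in perm], [theta[p] for p in perm])
    net[layer + 1] = ([[row[p] for p in perm] for row in W_next], theta_next)
    return net

def softsign(t):
    return t / (1.0 + abs(t))

net = [([[1.0], [-2.0], [0.5]], [0.3, -0.5, 0.0]), ([[0.7, 1.1, -0.4]], [0.2])]
permuted = permute_hidden(net, 0, [2, 0, 1])

# Compare the maps under an odd (tanh) and another (softsign) nonlinearity.
perm_gap = max(
    abs(realize(net, rho, [x / 10])[0] - realize(permuted, rho, [x / 10])[0])
    for rho in (math.tanh, softsign)
    for x in range(-30, 31)
)
```

The input and output layers are deliberately left untouched, matching the requirement that the permutations for ℓ=0 and ℓ=L be the identity.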
The requirement that γℓ be the identity map for ℓ∈{0,L} in the previous definition again corresponds to the fact that the inputs and the outputs of a neural network are not generally interchangeable. Indeed, suppose that $\langle N \rangle^\rho: \mathbb{R}^2 \to \mathbb{R}^2$, $\langle N \rangle^\rho(x, y) = (x, 2y)$, is the map of a neural network $N$ with respect to some nonlinearity ρ. Let $N_1$, $N_2$, and $N_3$ be the networks obtained from $N$ by interchanging the inputs of $N$, the outputs of $N$, and both inputs and outputs, respectively. Then $\langle N_1 \rangle^\rho(x, y) = (y, 2x)$, $\langle N_2 \rangle^\rho(x, y) = (2y, x)$, and $\langle N_3 \rangle^\rho(x, y) = (2x, y)$ are, indeed, distinct functions.
We now give an example that Fefferman uses to motivate the necessity of restricting the class of all neural networks NDin,Dout to a smaller class to be identifiable up to an equivalence relation. In Fefferman’s case, the equivalence relation is ∼±, but the example is equally pertinent to the relation ≃.
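Suppose that $N$ is a neural network with $L \geq 2$, and $\ell_0, j_1, j_2$ with $1 \leq \ell_0 \leq L-1$ and $1 \leq j_1 < j_2 \leq D_{\ell_0}$ are such that $\theta^{\ell_0}_{j_1} = \theta^{\ell_0}_{j_2}$ and $W^{\ell_0}_{j_1 k} = W^{\ell_0}_{j_2 k}$, for all $k$. Then, if $\widetilde{N}$ is obtained from $N$ by replacing $W^{\ell_0+1}_{1 j_1}$ and $W^{\ell_0+1}_{1 j_2}$ with an arbitrary pair of numbers $\widetilde{W}^{\ell_0+1}_{1 j_1}$ and $\widetilde{W}^{\ell_0+1}_{1 j_2}$ such that $W^{\ell_0+1}_{1 j_1} + W^{\ell_0+1}_{1 j_2} = \widetilde{W}^{\ell_0+1}_{1 j_1} + \widetilde{W}^{\ell_0+1}_{1 j_2}$, then $\langle N \rangle^\rho = \langle \widetilde{N} \rangle^\rho$, for any ρ. This example motivates the following definition.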
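Definition 5 (No-clones condition).
Let $N$ be a neural network as in Definition 1.
We say that $N$ has a clone pair if there exist $\ell \in \{1, \dots, L\}$ and $j, j' \in \{1, \dots, D_\ell\}$ with $j \neq j'$ such that
$$\theta^\ell_j = \theta^\ell_{j'} \quad \text{and} \quad W^\ell_{jk} = W^\ell_{j'k} \ \text{ for all } k \in \{1, \dots, D_{\ell-1}\}.$$
If $N$ does not have a clone pair, we say that $N$ satisfies the no-clones condition.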
As the nonlinearity ρ in the example above is completely arbitrary, the no-clones condition is necessary to have any hope of obtaining identifiability up to ≃.
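The clone-pair phenomenon is easy to reproduce numerically. In the sketch below, two hidden nodes have identical incoming weights and biases, and redistributing their outgoing weights while preserving the sum leaves the map unchanged under two different nonlinearities. The helper names `realize` and `has_clone_pair` and the example weights are hypothetical.

```python
import math

def realize(network, rho, x):
    """Map of a layered network (list of (W, theta) pairs) under rho."""
    for W, theta in network:
        x = [rho(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W, theta)]
    return x

def has_clone_pair(network):
    """True if some layer has two nodes with identical bias and incoming weights."""
    for W, theta in network:
        seen = set()
        for row, b in zip(W, theta):
            key = (b, tuple(row))
            if key in seen:
                return True
            seen.add(key)
    return False

def softsign(t):
    return t / (1.0 + abs(t))

# Hidden nodes 0 and 1 are clones: same incoming weight, same bias.
net_a = [([[1.5], [1.5], [-0.4]], [0.2, 0.2, 0.0]), ([[2.0, 3.0, 1.0]], [0.1])]
# Outgoing weights of the clones are redistributed; their sum 2 + 3 = 4.5 + 0.5 is kept.
net_b = [([[1.5], [1.5], [-0.4]], [0.2, 0.2, 0.0]), ([[4.5, 0.5, 1.0]], [0.1])]

clone_gap = max(
    abs(realize(net_a, rho, [x / 10])[0] - realize(net_b, rho, [x / 10])[0])
    for rho in (math.tanh, softsign)
    for x in range(-30, 31)
)
```

The two networks differ in their weights and are not related by a permutation of hidden nodes, yet `clone_gap` is zero up to rounding, which is why clone pairs have to be excluded.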
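Hence, with our program in mind, given positive integers $D_{in}$ and $D_{out}$, we define
$$\mathcal{N}^{nc}_{D_{in},D_{out}} := \{N \in \mathcal{N}_{D_{in},D_{out}} : N \text{ satisfies the no-clones condition}\}$$
and seek nonlinearities ρ such that $(\mathcal{N}^{nc}_{D_{in},D_{out}}, \rho)$ is identifiable up to ≃.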
As any class strictly containing NncDin,Dout, paired with any nonlinearity, fails identifiability up to ≃, the no-clones condition furnishes a canonical minimal assumption for identifiability up to ≃.
Similarly to $\mathcal{N}^{A1}_{D_{in},D_{out}}$, the class $\mathcal{N}^{nc}_{D_{in},D_{out}}$, paired with any measurable nonlinearity ρ such that $\lim_{x \to \infty} \rho(x)$ and $\lim_{x \to -\infty} \rho(x)$ exist and are not equal, satisfies the universal approximation property in the sense of Hornik [14] and Cybenko [15].
The following example demonstrates that insisting on the no-clones condition as the only assumption on the weights, biases, and layout will necessarily come at the cost of restricting the class of nonlinearities that allow for identifiability.
Let ρ(x)=min{1,max{0,x}} be the clipped rectified linear unit (ReLU) function. Note that
[TABLE]
Now, given an arbitrary neural network N=(W1,θ1,W2,θ2,…,WL,θL) with DL=1 satisfying the no-clones condition, the network
[TABLE]
also satisfies the no-clones condition, and yields the identically-zero output, i.e., ⟨N0⟩ρ≡0. We have thus constructed an infinite collection of distinct networks satisfying the no-clones condition and all yielding the identically-zero map. The class of identically-zero output maps therefore contains networks of different depths and layouts, and thus identifiability up to ≃ fails. This leads to the conclusion that a uniqueness result for neural networks with the clipped ReLU nonlinearity would need to encompass genericity conditions more stringent than the no-clones condition.
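One way such zero-output networks can arise is sketched below. The construction is an assumption made for illustration and may differ from the one in the paper: since the clipped ReLU takes values in [0,1], appending a one-node layer with weight −1 and bias 0 to any network with a single output yields ρ(−y)=0 for every output value y. The names `append_zeroing_layer`, `base`, `zero_net`, and `deeper_zero_net` are hypothetical.

```python
def clipped_relu(t):
    return min(1.0, max(0.0, t))

def realize(network, rho, x):
    """Map of a layered network (list of (W, theta) pairs) under rho."""
    for W, theta in network:
        x = [rho(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W, theta)]
    return x

def append_zeroing_layer(network):
    """Append a one-node layer with weight -1 and bias 0 (hypothetical helper).

    Assumes the last layer of `network` has a single node.
    """
    return network + [([[-1.0]], [0.0])]

# An arbitrary clone-free network with one input and one output.
base = [([[2.0], [-1.0]], [0.5, 0.25]), ([[1.0, 1.0]], [-0.3])]
zero_net = append_zeroing_layer(base)             # depth 3
deeper_zero_net = append_zeroing_layer(zero_net)  # depth 4

grid = [x / 4 for x in range(-20, 21)]
zero_outputs = [realize(zero_net, clipped_relu, [t])[0] for t in grid]
deeper_outputs = [realize(deeper_zero_net, clipped_relu, [t])[0] for t in grid]
```

Both `zero_net` and `deeper_zero_net` return 0 at every grid point although their depths differ, so no relation that preserves depth can identify them.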
Nonetheless, we are able to construct a class of real meromorphic nonlinearities σ yielding identifiability without any assumptions on the neural networks beyond the no-clones condition, and which is large enough to uniformly approximate any piecewise C1 nonlinearity ρ with ρ′∈BV(R), where
[TABLE]
is the space of functions of bounded variation on R.
Concretely, we have the following main result of this paper.
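Theorem 1 (Uniqueness Theorem).
Let $D_{in}$ and $D_{out}$ be arbitrary positive integers. Furthermore, let ρ be a piecewise $C^1$ function with $\rho' \in BV(\mathbb{R})$ and let $\epsilon > 0$. Then there exists a meromorphic function $\sigma: D \to \mathbb{C}$, $D \supset \mathbb{R}$, $\sigma(\mathbb{R}) \subset \mathbb{R}$, such that $\|\rho - \sigma\|_{L^\infty(\mathbb{R})} < \epsilon$ and $(\mathcal{N}^{nc}_{D_{in},D_{out}}, \sigma)$ is identifiable up to ≃.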
We note that, having fixed the input and output dimensions Din and Dout, the depths and the layouts of the networks in NncDin,Dout are completely arbitrary.
Examples of nonlinearities ρ(x) covered by Theorem 1 include many sigmoidal functions such as the aforementioned clipped ReLU, the logistic function $\frac{1}{1+e^{-x}}$, the hyperbolic tangent $\tanh(x)$, the inverse tangent $\arctan(x)$, the softsign function $\frac{x}{1+|x|}$, the inverse square root unit $\frac{x}{\sqrt{1+ax^2}}$, the clipped identity $\frac{x}{\max\{1, |x|/a\}}$, and the soft clipping function $\frac{1}{a}\log\frac{1+e^{ax}}{1+e^{a(x-1)}}$, where $a > 0$ is fixed in the last two cases. Unbounded nonlinearities such as the ReLU are not covered. The nonlinearities σ for which we have identifiability, unfortunately, need to be constructed, and, at the present time, we do not have an identification result for an arbitrary given σ.
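For reference, the listed nonlinearities can be written down directly in Python. The formulas below are the standard forms of these functions and are assumed to match the ones intended above; the function names and the default values of the parameter `a` are illustrative. The exponentials overflow for very large arguments, so the sketch is meant for moderate inputs only.

```python
import math

def clipped_relu(x):
    return min(1.0, max(0.0, x))

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def softsign(x):
    return x / (1.0 + abs(x))

def inverse_sqrt_unit(x, a=1.0):
    return x / math.sqrt(1.0 + a * x * x)

def clipped_identity(x, a=1.0):
    return x / max(1.0, abs(x) / a)

def soft_clipping(x, a=5.0):
    return math.log((1.0 + math.exp(a * x)) / (1.0 + math.exp(a * (x - 1.0)))) / a

# tanh and arctan are available as math.tanh and math.atan.
samples = [x / 2 for x in range(-10, 11)]
bounded = all(
    abs(f(t)) <= 2.0
    for f in (clipped_relu, logistic, softsign, inverse_sqrt_unit,
              clipped_identity, soft_clipping, math.tanh, math.atan)
    for t in samples
)
```

All of these functions are bounded, in contrast to the ReLU, which is the reason the latter falls outside the scope of the theorem.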
Furthermore, we remark that the statement of Theorem 1 is “not continuous” in the approximation error ϵ. Indeed, the clipped ReLU function satisfies the conditions on ρ in Theorem 1, yet, as shown in the example above, there exist non-isomorphic networks $N_0$ and $\widetilde{N}_0$ satisfying the no-clones condition and $\langle N_0 \rangle^\rho(x) = 0 = \langle \widetilde{N}_0 \rangle^\rho(x)$, for all $x \in \mathbb{R}^{D_0}$, where ρ is the clipped ReLU function.
We will see that Theorem 1 is, in fact, a consequence of the following result, which states that the maps realized by pairwise non-isomorphic networks with $D_L = 1$, under a nonlinearity σ according to Theorem 1, are linearly independent functions $\mathbb{R}^{D_0} \to \mathbb{R}$.
Theorem 2 (Linear Independence Theorem).
Let $D_{in}$ be an arbitrary positive integer, let ρ be a piecewise $C^1$ function with $\rho' \in BV(\mathbb{R})$, and let $\epsilon > 0$. Then there exists a meromorphic function $\sigma: D \to \mathbb{C}$, $D \supset \mathbb{R}$, $\sigma(\mathbb{R}) \subset \mathbb{R}$, such that $\|\rho - \sigma\|_{L^\infty(\mathbb{R})} < \epsilon$ with the following property: Suppose that $N_j$, $j = 1, 2, \dots, n$, are pairwise non-isomorphic (in the sense of ≃) neural networks in $\mathcal{N}^{nc}_{D_{in},1}$.
Then, $\{\langle N_j \rangle^\sigma\}_{j=1}^n \cup \{\mathbf{1}\}$ is a linearly independent set of functions $\mathbb{R}^{D_0} \to \mathbb{R}$, where $\mathbf{1}$ denotes the constant function taking on the value 1.
Remark.
The function $\mathbf{1}$ is included in the linearly independent set both for the sake of greater generality of the statement, and to facilitate the proof of Theorem 2.
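Unfortunately, Theorem 2 does not generalize to multiple outputs $D_{out} > 1$, as shown by the following example: Fix an arbitrary network $N$ according to Definition 1 such that $L \geq 2$, $D_L = 4$, $\theta^L = 0$, and $N$ satisfies the no-clones condition. Define $U_m \in \mathbb{R}^{2 \times D_{L-1}}$, $m \in \{1, 2, 3, 4\}$, as the submatrices of $W^L$ consisting of the rows 1 and 3, 1 and 4, 2 and 4, and 2 and 3, respectively. Furthermore, define the networks
$$N_m := (W^1, \theta^1, \dots, W^{L-1}, \theta^{L-1}, U_m, 0),$$
for $m \in \{1, 2, 3, 4\}$. As $N$ satisfies the no-clones condition, the networks $N_m$, $m \in \{1, 2, 3, 4\}$, also satisfy the no-clones condition, and are pairwise non-isomorphic.
Now, let ρ be an arbitrary nonlinearity, and write $\langle N \rangle^\rho = (f_1, f_2, f_3, f_4)$, where $f_m: \mathbb{R}^{D_0} \to \mathbb{R}$, $m \in \{1, 2, 3, 4\}$. Then
$$\langle N_1 \rangle^\rho = (f_1, f_3), \quad \langle N_2 \rangle^\rho = (f_1, f_4), \quad \langle N_3 \rangle^\rho = (f_2, f_4), \quad \langle N_4 \rangle^\rho = (f_2, f_3),$$
and so
$$\langle N_1 \rangle^\rho - \langle N_2 \rangle^\rho + \langle N_3 \rangle^\rho - \langle N_4 \rangle^\rho = 0.$$
The set $\{\langle N_m \rangle^\rho\}_{m=1}^4$ is hence linearly dependent, showing that Theorem 2 cannot be generalized to multiple outputs by replacing $\mathcal{N}^{nc}_{D_{in},1}$ with $\mathcal{N}^{nc}_{D_{in},D_{out}}$.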
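The linear dependence in this example can be verified numerically. The sketch builds a small network with four outputs and zero output biases, forms the four two-output networks from the row pairs named in the text, and evaluates the alternating sum of their maps. The helper names and weights are hypothetical, and the nonlinearity is assumed to be applied in every layer.

```python
import math

def realize(network, rho, x):
    """Map of a layered network (list of (W, theta) pairs) under rho."""
    for W, theta in network:
        x = [rho(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W, theta)]
    return x

def restrict_outputs(network, rows):
    """Keep only the given rows (0-based) of the last layer's weights and biases."""
    W, theta = network[-1]
    return network[:-1] + [([W[r] for r in rows], [theta[r] for r in rows])]

# Depth 2, layout (1, 3, 4), zero biases in the last layer, no clone pairs.
net = [
    ([[1.0], [-0.5], [2.0]], [0.1, 0.4, -0.3]),
    ([[1.0, 2.0, 0.5], [-1.0, 0.3, 1.0], [0.2, -0.7, 1.5], [2.0, 1.0, -1.0]],
     [0.0, 0.0, 0.0, 0.0]),
]
# Rows (1,3), (1,4), (2,4), (2,3) of the last weight matrix, written 0-based.
row_pairs = [(0, 2), (0, 3), (1, 3), (1, 2)]
subnets = [restrict_outputs(net, rows) for rows in row_pairs]

def alternating_sum(x, rho=math.tanh):
    """Evaluate <N1> - <N2> + <N3> - <N4> at the input x, componentwise."""
    outs = [realize(n, rho, x) for n in subnets]
    return [outs[0][i] - outs[1][i] + outs[2][i] - outs[3][i] for i in range(2)]

residual = max(abs(v) for x in range(-10, 11) for v in alternating_sum([x / 5]))
```

The residual vanishes up to rounding for any choice of `rho`, because each coordinate function of the original network appears once with a plus sign and once with a minus sign.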
We now provide a panorama of the proofs of Theorems 1 and 2.
The proof of Theorem 1 is by way of contradiction with Theorem 2. Specifically, assume that Din, Dout, ρ, and ϵ>0 are as in the statement of Theorem 1, and let σ be a nonlinearity satisfying the conclusion of Theorem 2 with these Din, ρ, and ϵ.
For a network N∈NncDin,Dout, we write the map ⟨N⟩σ=((⟨N⟩σ)1,…,(⟨N⟩σ)Dout) in terms of the coordinate functions (⟨N⟩σ)j:RDin→R, j∈{1,…,Dout}.
Now, let N1,N2∈NncDin,Dout be networks such that ⟨N1⟩σ(x)=⟨N2⟩σ(x), for all x∈RDin, and suppose by way of contradiction that they are non-isomorphic.
We construct a network M containing both N1 and N2 as subnetworks (a precise definition of “subnetwork” is given in Section III, Definition 9). It follows that M contains subnetworks Mm,j∈NncDin,1 with maps satisfying ⟨Mm,j⟩σ=(⟨Nm⟩σ)j, for m∈{1,2} and j∈{1,…,Dout}. We then show that, as a consequence of N1 and N2 being non-isomorphic, there exists a j∈{1,…,Dout} such that M1,j and M2,j are non-isomorphic. But then
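$$\langle M_{1,j} \rangle^\sigma - \langle M_{2,j} \rangle^\sigma = (\langle N_1 \rangle^\sigma)_j - (\langle N_2 \rangle^\sigma)_j = 0,$$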
which stands in contradiction to Theorem 2. This completes the proof of Theorem 1.
The proof of Theorem 2 is significantly more involved, as it requires extensive “fine tuning” of the function σ. Let σ:D→C be as in the statement of Theorem 2.
In addition to the properties stated in Theorem 2, the function σ we construct exhibits the following convenient structural properties:
1. The domain D⊂C of σ is the complement of an (infinite) discrete set of poles,
2. σ is i-periodic, i.e., σ(z+i)=σ(z), for all z∈D, and
3. for any network $N \in \mathcal{N}_{1,1}$, the natural domain $D_{\langle N \rangle^\sigma} \subset \mathbb{C}$ of $\langle N \rangle^\sigma$, viewed as a holomorphic function, is the complement of a closed countable subset of C, and therefore a connected open set.
These three properties are all satisfied by the function tanh(π⋅), and are essentially the key insight leading to Fefferman’s identifiability result in [3], which establishes that, under the genericity conditions stated in Assumptions 1, a neural network can be read off from the asymptotic (as the imaginary part of the argument tends to infinity) locations of the singularities of the map it realizes under the tanh nonlinearity.
The properties 1) – 3) will be key to our results as well, but instead of studying the set of singularities of the map in its own right, our proof of Theorem 2 will proceed by contradiction. The proof consists of three steps that we call amalgamation, input splitting, and input anchoring, and involves the use of analytic continuation, graph-theoretic constructions, and Kronecker’s theorem [16], the latter two of which are novel tools in this context and mark a significant departure from Fefferman’s proof technique in [3].
We now briefly describe the proof of Theorem 2 according to the aforementioned program.
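Suppose that $N_1, \dots, N_n$ are pairwise non-isomorphic neural networks satisfying the no-clones condition. For the sake of simplicity of this informal discussion, we assume that $L^1 = L^2 = \dots = L^n$, $D_0^1 = D_0^2 = \dots = D_0^n = 1$, and $D^1_{L^1} = D^2_{L^2} = \dots = D^n_{L^n} = 1$. By way of contradiction, we suppose that there exist coefficients $\lambda_0, \lambda_1, \dots, \lambda_n$, not all zero, such that $\lambda_0 \mathbf{1}(x) + \sum_{j=1}^n \lambda_j \langle N_j \rangle^\sigma(x) = 0$, for all $x \in \mathbb{R}$.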
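Amalgamation: In Section III we construct a neural network $M \in \mathcal{N}^{nc}_{1,n}$, called the amalgam of $\{N_j\}_{j=1}^n$, containing each $N_j$ as a subnetwork. In particular, we have $(\langle M \rangle^\sigma)_j = \langle N_j \rangle^\sigma$, for all $j \in \{1, \dots, n\}$.
The linear dependence of $\{\langle N_j \rangle^\sigma\}_{j=1}^n \cup \{\mathbf{1}\}$ thus translates to
$$\lambda_0 + \sum_{j=1}^n \lambda_j (\langle M \rangle^\sigma)_j(z) = 0, \tag{1}$$
for all $z \in \mathbb{R}$. By our construction of σ, the natural domains $D_{\langle N_j \rangle^\sigma} = D_{(\langle M \rangle^\sigma)_j}$ are complements of closed countable sets, and hence, by analytic continuation, (1) is valid for all $z \in \bigcap_{j=1}^n D_{\langle N_j \rangle^\sigma}$.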
Now define $\mathcal{M}$ to be the set of all neural networks in $\bigcup_{m=1}^n \mathcal{N}^{nc}_{1,m}$ with a linear dependency as in (1) between the output functions and the constant function. Note that $\mathcal{M}$ is nonempty, simply because the amalgam $M$ belongs to $\mathcal{M}$.
We then fix a network $M' \in \mathcal{M}$ of minimum size (the precise definition of size will be given in the proof of Theorem 4). Write $(1, D_1^{M'}, \dots, D_m^{M'})$ for the layout of $M'$, and let $(\omega_1, \dots, \omega_{D_1^{M'}})$ be the weights of the first layer of $M'$ (i.e., the entries of $W^1$ according to Definition 1). At this point the proof splits into two cases, depending on whether there exist $j, j' \in \{1, \dots, D_1^{M'}\}$, $j \neq j'$, such that $\omega_j / \omega_{j'}$ is irrational.
Input splitting, the easy case. Provided there do exist such $j$ and $j'$, we use Kronecker’s theorem [16] and the properties 1) – 3) of σ to construct a network $M'' \in \mathcal{M}$ with layout $(k, D_1^{M'}, \dots, D_m^{M'})$, for some $k \in \{2, \dots, D_1^{M'}\}$, and first-layer weights $W^1 \in \mathbb{R}^{D_1^{M'} \times k}$ such that the first $k$ rows of $W^1$ form a $k \times k$ identity matrix.
Input anchoring.
We then construct a third network N∈M, obtained by fixing k−1 of the k inputs of M′′ to specific real numbers, and “cutting out” all the parts of the network whose contributions to the output map have become constant in the process. The resulting network N will be a network in M of size smaller than M′, which contradicts the minimality of M′, and thereby completes the proof.
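Why irrational ratios help can be illustrated numerically, under assumptions that are not taken from the paper. By the one-dimensional form of Kronecker's theorem, the fractional parts of m·α, m=1,2,…, come arbitrarily close to any target in [0,1) when α is irrational, whereas for rational α they take only finitely many values. Since σ is i-periodic, only such fractional parts matter when an argument of σ is shifted along the imaginary axis; here α presumably plays the role of a ratio of two first-layer weights. The function name, the choice α=√2, and the target 0.3 are illustrative.

```python
import math

def closest_multiple(alpha, target, max_m):
    """Return (m, distance) minimizing the circular distance between the
    fractional part of m * alpha and `target`, over m = 1, ..., max_m."""
    best_m, best_dist = 1, 1.0
    for m in range(1, max_m + 1):
        d = abs((m * alpha) % 1.0 - target)
        d = min(d, 1.0 - d)  # distance on the circle [0, 1)
        if d < best_dist:
            best_m, best_dist = m, d
    return best_m, best_dist

# Irrational alpha: the fractional parts get very close to the target.
m_irr, dist_irr = closest_multiple(math.sqrt(2.0), 0.3, 1000)
# Rational alpha = 1/2: the fractional parts are only 0 and 1/2.
m_rat, dist_rat = closest_multiple(0.5, 0.3, 1000)
```

For the rational choice the distance stays at 0.2 no matter how many multiples are tried, which mirrors the need for a different argument in the case where all weight ratios are rational.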
Input splitting, the hard case. If, however, all the ratios ωj/ωj′, j≠j′, are rational, the input splitting construction described above cannot be carried out. This problem will be remedied by further refining our initial construction of σ. Specifically, we will ensure that the real parts of the poles of σ form a subset of R satisfying what we call the self-avoiding property, to be introduced in Section V. This will enable an alternative construction of a network M′′ with at least two inputs. The resulting M′′ will, however, not be a neural network in the sense of Definition 1, but rather a generalized network in the sense of Definition 8, to be introduced in Section III.
Input anchoring. Finally, we apply an input anchoring procedure to M′′ similar to the one described above. Even though now M′′ is not a network in the sense of Definition 1, the input anchoring procedure will result in a network N∈M which is a network in the sense of Definition 1, and is of smaller size than M′, again completing the proof by contradiction.
We conclude this section by laying out the organization of the remainder of the paper. In Section III we develop a graph-theoretic framework needed to define amalgams of neural networks and several other technical concepts. In Section IV we state results from complex analysis and Kronecker’s theorem needed in arguments involving analytic continuation and input splitting, respectively. The proofs of these results are relegated to the Appendix. In Section V we discuss the fine structural properties of the function σ constructed in the proof of Theorem 2. Finally, Section VI contains the proofs of our two main results.
III Directed acyclic graphs, general neural networks, and neural network amalgams
As already mentioned, in the proof of Theorem 2 we will work with a form of neural networks that does not fit in with Definitions 1 and 2. In order to accommodate this notion of neural networks, and to lighten the manipulations needed to formalize the aforementioned techniques of amalgamation and input anchoring, we introduce a graph-theoretic framework.
We start by introducing the concept of a directed acyclic graph (DAG), commonly encountered in the graph theory literature [17].
Definition 6 (Directed acyclic graph).
– A directed graph is an ordered pair $G = (V, E)$ where $V$ is a finite set of nodes, and $E \subset V \times V$ is a set of directed edges.
– A directed cycle of a directed graph $G$ is a set $\{v_1, \dots, v_k\} \subset V$ such that, for every $j \in \{1, \dots, k\}$, $(v_j, v_{j+1}) \in E$, where we set $v_{k+1} := v_1$.
– A directed graph $G$ is said to be a directed acyclic graph (DAG) if it has no directed cycles.
We interpret an edge $(v, \widetilde{v})$ as an arrow connecting the nodes $v$ and $\widetilde{v}$ and pointing at $\widetilde{v}$.
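Definition 7 (Parent set, input nodes, and node level).
Let $G = (V, E)$ be a DAG.
– We define the parent set of a node $v$ by $\mathrm{par}(v) = \{\widetilde{v} : (\widetilde{v}, v) \in E\}$.
– We say that $v \in V$ is an input node if $\mathrm{par}(v) = \emptyset$, and we write $\mathrm{In}(G)$ for the set of input nodes.
– We define the level $\mathrm{lv}(v)$ of a node $v \in V$ recursively as follows. If $\mathrm{par}(v) = \emptyset$, we set $\mathrm{lv}(v) = 0$. If $\mathrm{par}(v) = \{v_1, v_2, \dots, v_k\}$ and $\mathrm{lv}(v_1), \mathrm{lv}(v_2), \dots, \mathrm{lv}(v_k)$ are defined, we set $\mathrm{lv}(v) = \max\{\mathrm{lv}(v_1), \mathrm{lv}(v_2), \dots, \mathrm{lv}(v_k)\} + 1$.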
Since the graph G in Definition 7 is assumed to be acyclic, the level is well-defined for all nodes of G. We are now ready to introduce our generalized definition of a neural network.
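Parent sets, input nodes, and levels are straightforward to compute. The following sketch stores a DAG as a list of nodes and a list of edges, where an edge `(u, v)` points from `u` to `v`; the function names and the example graph are illustrative.

```python
def parent_sets(nodes, edges):
    """Map each node to its parent set; an edge (u, v) points from u to v."""
    par = {v: set() for v in nodes}
    for u, v in edges:
        par[v].add(u)
    return par

def node_levels(nodes, edges):
    """Level of each node: 0 for input nodes, else 1 + max level of its parents.

    Assumes the graph is acyclic, so that the recursion terminates.
    """
    par = parent_sets(nodes, edges)
    levels = {}

    def lv(v):
        if v not in levels:
            levels[v] = 0 if not par[v] else 1 + max(lv(u) for u in par[v])
        return levels[v]

    for v in nodes:
        lv(v)
    return levels

nodes = ["x", "a", "b", "y"]
edges = [("x", "a"), ("a", "b"), ("x", "b"), ("b", "y"), ("x", "y")]
levels = node_levels(nodes, edges)
input_nodes = {v for v, p in parent_sets(nodes, edges).items() if not p}
```

In this example the edges from `x` to `b` and from `x` to `y` skip levels, so the level of a node is determined by its longest path from an input node.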
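Definition 8.
A general feed-forward neural network (GFNN) is an ordered sextuple $N = (V, E, V_{in}, V_{out}, \Omega, \Theta)$, where
– $G = (V, E)$ is a DAG, called the architecture of $N$,
– $V_{in} = \mathrm{In}(G)$ is the set of inputs of $N$,
– $V_{out} \subset V \setminus V_{in}$ is the set of outputs of $N$,
– $\Omega = \{\omega_{\widetilde{v} v} \in \mathbb{R} \setminus \{0\} : (v, \widetilde{v}) \in E\}$ is the set of weights of $N$, and
– $\Theta = \{\theta_v \in \mathbb{R} : v \in V \setminus V_{in}\}$ is the set of biases of $N$.
The depth of a GFNN is defined as $L(N) = \max\{\mathrm{lv}(v) : v \in V\}$.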
When translating from Definition 1 to Definition 8, we will interpret a zero weight Wjkℓ=0 simply as the absence of a directed edge between the nodes concerned, hence we do not allow the edges of a GFNN to have zero weight. If V1 and V2 are the sets of nodes of GFNNs N1 and N2, respectively, and v∈V1∩V2, we will say that N1 and N2 share the node v. When dealing with several networks sharing a node v, we will write parN(v) for the parent set of v in the architecture (V,E) of N, to avoid ambiguity.
Note that the set of outputs of a GFNN can be an arbitrary subset of the non-input nodes. In particular, Vout can include nodes w with lv(w)<L(N).
Related to the concept of the parent set of a node is the concept of a subnetwork introduced next.
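Definition 9 (Subnetwork and ancestor subnetwork).
Let $N = (V, E, V_{in}, V_{out}, \Omega, \Theta)$ be a GFNN. A subnetwork of $N$ is a GFNN $N' = (V', E', V'_{in}, V'_{out}, \Omega', \Theta')$ such that there exists a set $S \subset V$ so that
(i) $V' = \{v \in V : v \in \mathrm{par}^r(S) \text{ for some } r \geq 0\}$, where, for a set $W \subset V$, we define $\mathrm{par}^0(W) = W$ and $\mathrm{par}^r(W) = \bigcup_{s \in W} \mathrm{par}^{r-1}(\mathrm{par}(s))$, for $r \geq 1$,
(ii) $E' = \{(v, \widetilde{v}) \in E : v, \widetilde{v} \in V'\}$,
(iii) $V'_{in} = V_{in} \cap V'$,
(iv) $\Omega' = \{\omega_{\widetilde{v} v} : (v, \widetilde{v}) \in E'\}$, and
(v) $\Theta' = \{\theta_v : v \in V'\}$.
If additionally $V'_{out} = S$, then $N'$ is uniquely specified by $S$. In this case we say that $N'$ is the ancestor subnetwork of $S$ in $N$, and write $N(S)$ for this network.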
Definition 10.
A layered feed-forward neural network (LFNN) is a GFNN satisfying $\mathrm{lv}(\widetilde{v}) = \mathrm{lv}(v) + 1$, for all $(v, \widetilde{v}) \in E$.
For an example of a GFNN that is not layered, see Figure 1.
We notice that LFNNs correspond to neural networks as specified by Definition 1, with the nodes of level ℓ corresponding to the ℓ-th network layer. Specifically, if $N = (V, E, V_{in}, V_{out}, \Omega, \Theta)$ is an LFNN, we can label the nodes $\{v \in V : \mathrm{lv}(v) = \ell\}$ by $v^\ell_j$, $j = 1, \dots, D_\ell$, and let $\theta^\ell_j = \theta_{v^\ell_j}$, $W^\ell_{jk} = \omega_{v^\ell_j v^{\ell-1}_k}$ when $(v^{\ell-1}_k, v^\ell_j) \in E$ and $W^\ell_{jk} = 0$ otherwise. This correspondence is also the reason why the indices of the weight $\omega_{\widetilde{v} v}$ associated with the edge $(v, \widetilde{v})$ of a GFNN appear in “reverse order”.
The following definition generalizes Definition 2 to GFNNs.
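Definition 11 (Output maps of nodes and networks).
Let $N = (V, E, V_{in}, V_{out}, \Omega, \Theta)$ be a GFNN, and let $\rho: \mathbb{R} \to \mathbb{R}$ be a nonlinearity.
The map realized by a node $v \in V$ under ρ is the function $\langle v \rangle^\rho: \mathbb{R}^{V_{in}} \to \mathbb{R}$ defined recursively as follows:
– If $v \in V_{in}$, set $\langle v \rangle^\rho(t) = t_v$, for all $t = (t_u)_{u \in V_{in}} \in \mathbb{R}^{V_{in}}$.
– Otherwise set $\langle v \rangle^\rho(t) = \rho\big(\sum_{u \in \mathrm{par}(v)} \omega_{vu} \cdot \langle u \rangle^\rho(t) + \theta_v\big)$, for all $t \in \mathbb{R}^{V_{in}}$.
The map realized by $N$ under ρ is the function $\langle N \rangle^\rho: \mathbb{R}^{V_{in}} \to \mathbb{R}^{V_{out}}$ given by $\langle N \rangle^\rho = (\langle w \rangle^\rho)_{w \in V_{out}}$. When dealing with several networks we will write $\langle v \rangle^{\rho, N}$ for the map realized by $v$ in $N$, to avoid ambiguity.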
We will treat nodes v∈V only as “handles”, and never as variables or functions. This is relevant when dealing with several networks with shared nodes, such as depicted in Figure 2. On the other hand, the output map ⟨v⟩ρ realized by v is a function.
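The recursive definition of node maps translates directly into code. In the sketch below, nodes are plain string handles, `par` maps each node to its parent set, and `weights[(v, u)]` holds the weight on the edge from `u` to `v` (mirroring the reversed index order of the weights). The data layout and the name `node_map` are assumptions made for this illustration.

```python
import math

def node_map(v, par, weights, biases, rho, t, cache=None):
    """Value at input `t` of the map realized by node `v` under `rho`.

    `t` maps each input node to a real number; `weights[(v, u)]` is the
    weight on the edge from parent u to node v.
    """
    if cache is None:
        cache = {}
    if v not in cache:
        if not par[v]:
            cache[v] = t[v]  # input node: return its coordinate of t
        else:
            s = sum(weights[(v, u)] * node_map(u, par, weights, biases, rho, t, cache)
                    for u in par[v])
            cache[v] = rho(s + biases[v])
    return cache[v]

# A small non-layered network: the output y reads both the input x and the node a.
par = {"x": set(), "a": {"x"}, "y": {"x", "a"}}
weights = {("a", "x"): 2.0, ("y", "a"): -1.0, ("y", "x"): 0.5}
biases = {"a": 0.1, "y": -0.2}
value = node_map("y", par, weights, biases, math.tanh, {"x": 0.3})
```

The cache keeps shared ancestors from being evaluated more than once, which matters when several nodes have common parents.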
In the special case when the nonlinearity is holomorphic on a neighborhood of R, the output maps realized by the nodes of a network will extend to holomorphic functions on their natural domains, as given by the following definition.
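Definition 12 (Natural domain).
Let $N = (V, E, V_{in}, V_{out}, \Omega, \Theta)$ be a GFNN, and let $\sigma: D_\sigma \to \mathbb{C}$ be a function holomorphic on an open domain $D_\sigma \supset \mathbb{R}$ and such that $\sigma(\mathbb{R}) \subset \mathbb{R}$.
For a node $v \in V$, we define the natural domain $D_{\langle v \rangle^\sigma} \subset \mathbb{C}^{V_{in}}$ and extend the definition of the function $\langle v \rangle^\sigma: D_{\langle v \rangle^\sigma} \to \mathbb{C}$ recursively as follows:
– For $v \in V_{in}$, let $D_{\langle v \rangle^\sigma} = \mathbb{C}^{V_{in}}$, and set $\langle v \rangle^\sigma(z) = z_v$, for all $z = (z_u)_{u \in V_{in}} \in \mathbb{C}^{V_{in}}$.
– Otherwise, set $D_{\langle v \rangle^\sigma} = \{z \in \bigcap_{u \in \mathrm{par}(v)} D_{\langle u \rangle^\sigma} : \sum_{u \in \mathrm{par}(v)} \omega_{vu} \langle u \rangle^\sigma(z) + \theta_v \in D_\sigma\}$, and let $\langle v \rangle^\sigma(z) = \sigma\big(\sum_{u \in \mathrm{par}(v)} \omega_{vu} \cdot \langle u \rangle^\sigma(z) + \theta_v\big)$, for all $z \in D_{\langle v \rangle^\sigma}$.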
It follows that the natural domain D⟨u⟩σ of a node u is open, as it is the preimage of an open set with respect to a continuous map. Moreover, the output map ⟨u⟩σ realized by u is holomorphic on D⟨u⟩σ, as it is given explicitly by a concatenation of affine maps and the nonlinearity σ, which are themselves holomorphic functions.
The following definition is a straightforward generalization of Definition 5.
Definition 13 (Clone pairs and the no-clones condition).
Let $N = (V, E, V_{in}, V_{out}, \Omega, \Theta)$ be a GFNN. We say that the nodes $v_1, v_2 \in V$, $v_1 \neq v_2$, are clones if $\mathrm{par}(v_1) = \mathrm{par}(v_2)$, $\theta_{v_1} = \theta_{v_2}$, and $\omega_{v_1 u} = \omega_{v_2 u}$ for all $u \in \mathrm{par}(v_1)$. We say that $N$ satisfies the no-clones condition (or briefly, $N$ is clones-free), if no two nodes $v_1, v_2 \in V$, $v_1 \neq v_2$, are clones.
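A clone check for networks given in graph form can be sketched as follows. Two non-input nodes are reported as clones when they have the same parent set, the same bias, and the same weight on every incoming edge. Input nodes are skipped, since they carry neither biases nor incoming weights; this reading, the data layout, and the name `find_clones` are assumptions made for this illustration.

```python
def find_clones(par, weights, biases):
    """Return a pair of clone nodes, or None if the network is clones-free.

    `par[v]` is the parent set of v and `weights[(v, u)]` the weight on the
    edge from parent u to node v.
    """
    seen = {}
    for v in par:
        if not par[v]:
            continue  # input nodes have no bias and no incoming weights
        # The set of (parent, weight) pairs also encodes the parent set itself.
        key = (frozenset((u, weights[(v, u)]) for u in par[v]), biases[v])
        if key in seen:
            return seen[key], v
        seen[key] = v
    return None

par = {"x": set(), "a": {"x"}, "b": {"x"}, "y": {"a", "b"}}
weights = {("a", "x"): 1.0, ("b", "x"): 1.0, ("y", "a"): 2.0, ("y", "b"): 3.0}
clone_biases = {"a": 0.5, "b": 0.5, "y": 0.0}
distinct_biases = {"a": 0.5, "b": 0.6, "y": 0.0}

clones = find_clones(par, weights, clone_biases)
no_clones = find_clones(par, weights, distinct_biases)
```

Changing a single bias is enough to break the clone relation, as the second call shows.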
The following definition generalizes Definition 4 to GFNNs, and introduces two new concepts, termed extensional isomorphism and faithful isomorphism, which will play an important technical role throughout the remainder of the paper.
Definition 14 (Extensional and faithful isomorphisms of GFNNs).
Let $N_1 = (V^1, E^1, V_{in}, V^1_{out}, \Omega^1, \Theta^1)$ and $N_2 = (V^2, E^2, V_{in}, V^2_{out}, \Omega^2, \Theta^2)$ be GFNNs with the same input nodes $V_{in}$.
– We say that $N_1$ and $N_2$ are extensionally isomorphic, and write $N_1 \overset{e}{\sim} N_2$, if there exists a bijection $\pi: V^1 \to V^2$, called an extensional isomorphism, such that the following holds:
(i) $\pi$ restricted to $V_{in}$ is the identity map,
(ii) $\pi(V^1_{out}) = V^2_{out}$,
(iii) for all $(v, \widetilde{v}) \in E^1$, we have $\omega^2_{\pi(\widetilde{v}) \pi(v)} = \omega^1_{\widetilde{v} v}$, and
(iv) for all $v \in V^1 \setminus V_{in}$, we have $\theta^2_{\pi(v)} = \theta^1_v$.
– We say that $N_1$ and $N_2$ are faithfully isomorphic, and write $N_1 \overset{f}{\sim} N_2$, if they are extensionally isomorphic via $\pi: V^1 \to V^2$ with the following additional property:
(v) $V^1_{out} = V^2_{out}$, and $\pi$ restricted to $V^1_{out}$ is the identity map.
In this case we call $\pi$ a faithful isomorphism.
Remark.
The concept of faithful isomorphisms in Definition 14 generalizes that of isomorphisms according to Definition 4. It is easily seen that extensional isomorphism is an equivalence relation on the set of all GFNNs with the same input nodes, whereas faithful isomorphism is an equivalence relation on the set of all GFNNs with the same input and output nodes. Furthermore, if N1∼eN2 via π:V1→V2, then we have ⟨π(v)⟩ρ,N2=⟨v⟩ρ,N1, for all v∈V1 and any nonlinearity ρ, and if additionally N1∼fN2, then ⟨N1⟩ρ=⟨N2⟩ρ.
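The conditions of Definition 14 are mechanical enough to be checked by a short routine. The sketch below is an illustration under assumptions: networks are encoded as dictionaries with hypothetical keys `inputs`, `outputs`, `parents`, and `bias`, and condition (iii) is read as requiring that the weighted parent set of π(v) in N2 be exactly the image under π of the weighted parent set of v in N1.

```python
def nodes(net):
    """All nodes of a network: inputs plus every node that has parents."""
    return set(net["inputs"]) | set(net["parents"])

def is_extensional_isomorphism(pi, net1, net2):
    """Check that pi is a bijection V1 -> V2 satisfying (i)-(iv)."""
    if set(pi) != nodes(net1) or set(pi.values()) != nodes(net2):
        return False
    if len(set(pi.values())) != len(pi):                      # injectivity
        return False
    if set(net1["inputs"]) != set(net2["inputs"]):
        return False
    if any(pi[v] != v for v in net1["inputs"]):               # (i)
        return False
    if {pi[w] for w in net1["outputs"]} != set(net2["outputs"]):  # (ii)
        return False
    for v, pars in net1["parents"].items():
        mapped = {pi[u]: w for u, w in pars.items()}
        if net2["parents"].get(pi[v]) != mapped:              # (iii)
            return False
        if net2["bias"][pi[v]] != net1["bias"][v]:            # (iv)
            return False
    return True

def is_faithful_isomorphism(pi, net1, net2):
    """Extensional isomorphism that additionally fixes the output nodes, (v)."""
    return (
        is_extensional_isomorphism(pi, net1, net2)
        and set(net1["outputs"]) == set(net2["outputs"])
        and all(pi[w] == w for w in net1["outputs"])
    )

# Hypothetical example: the two networks differ only in the name of the hidden node.
net1 = {"inputs": ["x"], "outputs": ["o"],
        "parents": {"h": {"x": 1.0}, "o": {"h": 2.0}},
        "bias": {"h": 0.0, "o": 0.5}}
net2 = {"inputs": ["x"], "outputs": ["o"],
        "parents": {"g": {"x": 1.0}, "o": {"g": 2.0}},
        "bias": {"g": 0.0, "o": 0.5}}
pi = {"x": "x", "h": "g", "o": "o"}
faithful = is_faithful_isomorphism(pi, net1, net2)
```

Since faithfully isomorphic networks differ only by a relabeling of non-input, non-output nodes, they realize the same map for every nonlinearity, as noted in the Remark.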
The following definition introduces the non-degeneracy property of a GFNN, which corresponds to the absence of spurious nodes, i.e., nodes that do not contribute to the map realized by the GFNN (with respect to an arbitrary nonlinearity). In the special case of LFNNs considered in the introduction, this property corresponds to the requirement that no matrix Wℓ in Definition 1 has an identically zero row or column.
Definition 15 (Non-degeneracy).
We say that a GFNN N=(V,E,Vin,Vout,Ω,Θ) is non-degenerate if V=VN(Vout), where VN(Vout) is the set of nodes of the ancestor subnetwork of Vout in N. Networks that are not non-degenerate are referred to as degenerate.
Informally, a network is non-degenerate if every one of its nodes “leads up” to at least one output. This notion is best understood with the help of examples as in Figure 3.
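Non-degeneracy amounts to a reachability check: starting from the outputs and walking backwards along edges, every node must be visited. The following sketch assumes a hypothetical dictionary encoding of the network and assumes that the ancestor subnetwork of a node contains the node itself.

```python
def ancestors(targets, parents):
    """Nodes of the ancestor subnetwork: the targets and everything upstream of them."""
    seen, stack = set(), list(targets)
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents.get(v, {}))   # input nodes have no parents
    return seen

def is_non_degenerate(inputs, outputs, parents):
    """True if every node of the network is an ancestor of some output node."""
    all_nodes = set(inputs) | set(parents)
    return ancestors(outputs, parents) == all_nodes

# Hypothetical example: the node "dead" does not lead up to the output "o".
parents = {"h": {"x": 1.0}, "dead": {"x": -2.0}, "o": {"h": 0.5}}
degenerate_case = is_non_degenerate(["x"], ["o"], parents)
pruned = {v: p for v, p in parents.items() if v != "dead"}
pruned_case = is_non_degenerate(["x"], ["o"], pruned)
```

Removing the spurious node yields a non-degenerate network that realizes the same map.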
We are now ready to introduce the concept of amalgams of LFNNs.
Definition 16 (Amalgam of two layered neural networks).
Let N1=(V1,E1,Vin,Vout1,Ω1,Θ1) and N2=(V2,E2,Vin,Vout2,Ω2,Θ2) be non-degenerate clones-free LFNNs with the same input set Vin.
– Let A=(VA,EA,Vin,VoutA,ΩA,ΘA) be a non-degenerate LFNN with the following properties:
(i) There exist injective maps π1:V1→π1(V1)⊂VA and π2:V2→π2(V2)⊂VA such that the networks N1 and N2 are extensionally isomorphic to the ancestor subnetworks A(π1(Vout1)) and A(π2(Vout2)) via π1 and π2, respectively.
(ii) VA=π1(V1)∪π2(V2) and VoutA=π1(Vout1)∪π2(Vout2).
We then say that A is a proto-amalgam of N1 and N2.
– If A is a clones-free proto-amalgam of N1 and N2, we say that A is an amalgam of N1 and N2.
Proposition 1.
Let N1=(V1,E1,Vin,Vout1,Ω1,Θ1) and N2=(V2,E2,Vin,Vout2,Ω2,Θ2) be non-degenerate clones-free LFNNs with a shared input set Vin. Then there exists an amalgam A of N1 and N2. Moreover, the amalgam is unique up to extensional isomorphisms.
As asserted in Proposition 1 (whose proof is deferred to the Appendix), an amalgam of two given non-degenerate clones-free LFNNs N1 and N2 always exists and is unique up to extensional isomorphisms. With slight abuse of notation, we will write N1∨N2 for an arbitrary element of the equivalence class (induced by ∼e) of all the amalgams of N1 and N2. A concrete example of an amalgam construction is provided in Figure 4.
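One way to picture the amalgam is to identify nodes of N1 and N2 whenever their ancestor subnetworks coincide. The sketch below does this by assigning each node a canonical signature built recursively from its bias and the signatures and weights of its parents. This is an illustrative construction under assumptions, not the proof given in the Appendix: the dictionary encoding and all function names are hypothetical, and the layered structure of the networks is not verified.

```python
def signatures(net):
    """Map each node to a hashable description of its ancestor subnetwork."""
    sig = {v: v for v in net["inputs"]}            # input nodes keep their names
    def visit(v):
        if v not in sig:
            pars = frozenset((visit(u), w) for u, w in net["parents"][v].items())
            sig[v] = ("node", net["bias"][v], pars)
        return sig[v]
    for v in net["parents"]:
        visit(v)
    return sig

def amalgam(net1, net2):
    """Merge two networks over the same inputs, identifying nodes with equal signatures.

    Returns the merged network together with the two embeddings pi1 and pi2.
    """
    pi1, pi2 = signatures(net1), signatures(net2)
    parents, bias = {}, {}
    for net, pi in ((net1, pi1), (net2, pi2)):
        for v, pars in net["parents"].items():
            parents[pi[v]] = {pi[u]: w for u, w in pars.items()}
            bias[pi[v]] = net["bias"][v]
    outputs = {pi1[w] for w in net1["outputs"]} | {pi2[w] for w in net2["outputs"]}
    merged = {"inputs": list(net1["inputs"]), "outputs": outputs,
              "parents": parents, "bias": bias}
    return merged, pi1, pi2

# Hypothetical example: both networks share the same hidden node up to renaming.
net1 = {"inputs": ["x"], "outputs": ["o1"],
        "parents": {"h": {"x": 1.0}, "o1": {"h": 2.0}},
        "bias": {"h": 0.0, "o1": 0.5}}
net2 = {"inputs": ["x"], "outputs": ["o2"],
        "parents": {"g": {"x": 1.0}, "o2": {"g": 3.0}},
        "bias": {"g": 0.0, "o2": 0.5}}
merged, pi1, pi2 = amalgam(net1, net2)
```

In this example the hidden nodes h and g are merged into a single node, while the two output nodes remain distinct because their incoming weights differ. Two nodes of the merged network with identical parents, weights, and bias would have identical signatures and hence be the same node, which is why the result is clones-free.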
Having defined the amalgam of two non-degenerate clones-free LFNNs, we define the amalgam of any finite collection N1,…,Nn of non-degenerate clones-free LFNNs according to
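$$\bigvee_{k=1}^{n}\mathcal{N}_{k}:=\big(\dots\big((\mathcal{N}_{1}\vee\mathcal{N}_{2})\vee\mathcal{N}_{3}\big)\vee\dots\big)\vee\mathcal{N}_{n}.$$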
By Definition 16, ⋁k=1nNk is a non-degenerate clones-free LFNN. Moreover, there exist extensional isomorphisms πj:Nj→πj(Nj)⊂⋁k=1nNk, for j∈{1,…,n}, and we have ⟨πj(v)⟩ρ,⋁k=1nNk=⟨v⟩ρ,Nj, for j∈{1,…,n}, v∈VNj, and any nonlinearity ρ.
We are now in a position to prove two lemmas that form the basis for the proof of Theorem 2. The first lemma formalizes the idea of combining multiple pairwise non-isomorphic single-output networks with linearly dependent output maps into one multiple-output network with linear dependency among the maps of its output nodes.
Lemma 1.
Let N1, N2, …, Nn be non-degenerate, clones-free LFNNs with a shared input set Vin and the same single output node {vout}. Furthermore, assume that no two networks Nj1,Nj2, j1≠j2, are extensionally isomorphic. Let ρ be a nonlinearity and suppose that 1,⟨N1⟩ρ,⟨N2⟩ρ,…,⟨Nn⟩ρ are linearly dependent as functions RVin→R. Then there exists a non-degenerate clones-free LFNN M=(VM,EM,VinM,VoutM,ΩM,ΘM) (obtained by modifying ⋁k=1nNk) with a single input node VinM={vin}, such that {⟨w⟩ρ:w∈VoutM}∪{1} is a linearly dependent set of functions from R to R.
Proof.
We first create a new node vin and select an arbitrary set {ωvvin:v∈Vin}⊂R∖{0} of cardinality #Vin. Now, we enlarge each Nj to a new network Ñj by gluing the node vin to the set Vin through the edges {(vin,v):v∈Vin} along with the corresponding weights ωvvin. The nodes v∈Vin are non-input nodes of the Ñj, as their parent sets parÑj(v)={vin} are non-empty, and we set their biases θv to [math]. The node vin is now the shared single input of the networks Ñj, j=1,…,n. Note that, as the networks Nj are clones-free by assumption, and the weights ωvvin are distinct, the networks Ñj are clones-free. Further, since Nj, j∈{1,…,n}, are pairwise non-isomorphic, so are the Ñj, j∈{1,…,n}.
We now construct a network M by amalgamating Ñj, j=1,…,n, according to M=(…(Ñ1∨Ñ2)∨…)∨Ñn. Denote by πj:VÑj→πj(VÑj)⊂VM the extensional isomorphism between Ñj and the corresponding subnetwork of M, and let wj=πj(vout) be the node of M corresponding to the output node of Ñj. We claim that wj1≠wj2, for j1≠j2. To see this, take j1,j2 such that wj1=wj2, i.e., πj1(vout)=πj2(vout). Then, by Property (i) of Definition 16, Ñj1(vout)∼eÑj2(vout), and therefore Nj1(vout)∼eNj2(vout) as well. But Nj1(vout)=Nj1 and Nj2(vout)=Nj2 by the non-degeneracy assumption, and hence Nj1∼eNj2. It follows that j1=j2, as Nj, j=1,…,n, are assumed to be pairwise non-isomorphic. Thus the wj are, indeed, distinct nodes of M, and we have VoutM={w1,w2,…,wn}.
As 1,⟨N1⟩ρ,⟨N2⟩ρ,…,⟨Nn⟩ρ are linearly dependent by assumption, there exists a nonzero vector (c,λ1,λ2,…,λn)∈Rn+1 such that $\big(c\,\mathbf{1}+\sum_{j=1}^{n}\lambda_{j}\langle\mathcal{N}_{j}\rangle^{\rho}\big)\big((t_{v})_{v\in V_{in}}\big)=0$, for all (tv)v∈Vin∈RVin. We then have
[TABLE]
for all t∈R. This establishes that {⟨w1⟩ρ,M,⟨w2⟩ρ,M,…,⟨wn⟩ρ,M}∪{1} is a linearly dependent set, so M is the desired network.
∎
Before stating the next lemma, we describe the procedure of input anchoring, which is a method for selecting and modifying a subnetwork of a non-degenerate GFNN in a manner that preserves linear dependencies between the maps realized by the output nodes of the original network.
Concretely, let M=(VM,EM,VinM,VoutM,ΩM,ΘM) be a non-degenerate, clones-free GFNN with input nodes VinM={v10,…,vD00}, D0≥2. For specificity, let w.l.o.g. vD00 be the input node to be anchored, and let a∈R be the value vD00 is anchored to. Furthermore, let ρ be a nonlinearity. We seek to construct a network Ma=(VMa,EMa,VinMa,VoutMa,ΩMa,ΘMa) with VinMa={v10,…,vD0−10} and VoutMa=VoutM∩VMa satisfying the following two properties:
(IA-1)
For all w∈VoutMa,
[TABLE]
for all (t1,t2,…,tD0−1)∈RD0−1 (after identifying RVin with RD0).
(IA-2)
For all w∈VoutM∖VoutMa, the function RD0−1→R given by
[TABLE]
is constant, and we denote its value by ⟨w⟩ρ,M(a).
As VMa⊂VM∖{vD00}, the network Ma will, indeed, have fewer nodes than M.
Now suppose that Ma is such a network, and suppose that {⟨w⟩ρ,M}w∈VoutM is a linearly dependent set of functions RD0→R. In particular, let (λw)w∈VoutM be a nonzero set of scalars such that
[TABLE]
We then have
[TABLE]
and thus {⟨w⟩ρ,Ma}w∈VoutMa∪{1} is a linearly dependent set of functions RD0−1→R. Apropos, this derivation illustrates why it is often convenient to include the constant function 1 when dealing with linear dependencies between the outputs of GFNNs. In the following definition we construct a network Ma with the desired properties, and in Figure 5 we provide an illustration of this construction.
Definition 17.
Let M=(VM,EM,VinM,VoutM,ΩM,ΘM) be a non-degenerate, clones-free GFNN with input nodes VinM={v10,…,vD00}, D0≥2. Let a∈R, and let ρ be a nonlinearity.
The network obtained from M by anchoring the input vD00 to a is the GFNN Ma=(VMa,EMa,VinMa,VoutMa,ΩMa,ΘMa) given by the following:
– VMa={v∈VM:{v10,…,vD0−10}∩VM(v)≠∅}, where M(v) denotes the ancestor network of v,
– EMa={(v,ṽ)∈EM:v,ṽ∈VMa},
– VinMa={v10,…,vD0−10}, VoutMa=VoutM∩VMa, and
– ΩMa={ωṽv:(v,ṽ)∈EMa}.
– For a node v∈VM∖VMa we define recursively
[TABLE]
(Note that all av are well-defined, as parM(v)⊂VM∖VMa whenever v∈VM∖VMa.)
Now, for v∈VMa let
[TABLE]
and set ΘMa={θ̃v:v∈VMa}.
The network Ma satisfies (IA-1) and (IA-2) by construction, and if M is layered, then so is Ma.
Moreover, Ma is non-degenerate. To see this, let v∈VMa be arbitrary. Then, by non-degeneracy of M, there exists a w∈VoutM such that v∈VM(w). As w is connected directly with a node in VMa, it follows that w∈VMa, and so w∈VoutMa.
Therefore v∈VMa(w), and, as v was arbitrary, we obtain VMa⊂⋃w∈VoutMaVMa(w), establishing by Definition 15 that Ma is non-degenerate.
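The anchoring construction can be sketched as follows. Because the displayed formulas for the constants av and for the modified biases are not reproduced in this text, the recursion used here is an assumption chosen to be consistent with (IA-1) and (IA-2): a removed node takes the constant value it has when the anchored input equals a, and each retained node absorbs the weighted constants of its removed parents into its bias. The dictionary encoding and all function names are hypothetical.

```python
import math

def ancestor_set(v, parents, cache):
    """Ancestors of v, including v itself."""
    if v not in cache:
        cache[v] = {v}.union(*(ancestor_set(u, parents, cache) for u in parents.get(v, {})))
    return cache[v]

def anchor_last_input(net, a, rho):
    """Anchor the last input of `net` to the real value a (sketch of Definition 17)."""
    *kept_inputs, anchored = net["inputs"]
    parents, bias, cache = net["parents"], net["bias"], {}
    all_nodes = set(net["inputs"]) | set(parents)
    # Retained nodes: those with at least one non-anchored input among their ancestors.
    keep = {v for v in all_nodes if ancestor_set(v, parents, cache) & set(kept_inputs)}
    const = {anchored: a}                       # constants a_v of the removed nodes
    def value(v):
        if v not in const:
            const[v] = rho(bias[v] + sum(w * value(u) for u, w in parents[v].items()))
        return const[v]
    new_parents, new_bias = {}, {}
    for v in keep - set(kept_inputs):
        new_parents[v] = {u: w for u, w in parents[v].items() if u in keep}
        new_bias[v] = bias[v] + sum(w * value(u) for u, w in parents[v].items() if u not in keep)
    return {"inputs": kept_inputs, "outputs": [w for w in net["outputs"] if w in keep],
            "parents": new_parents, "bias": new_bias}

def evaluate(net, x, rho):
    """Evaluate all output nodes of `net` at the input assignment x."""
    vals = dict(x)
    def val(v):
        if v not in vals:
            vals[v] = rho(net["bias"][v] + sum(w * val(u) for u, w in net["parents"][v].items()))
        return vals[v]
    return {w: val(w) for w in net["outputs"]}

# Hypothetical example: node "p" depends only on the anchored input "x2".
M = {"inputs": ["x1", "x2"], "outputs": ["o"],
     "parents": {"p": {"x2": 0.7}, "q": {"x1": 1.5, "x2": -0.4}, "o": {"p": 2.0, "q": -1.0}},
     "bias": {"p": 0.1, "q": 0.2, "o": 0.3}}
Ma = anchor_last_input(M, 0.9, math.tanh)
gap = max(abs(evaluate(M, {"x1": t, "x2": 0.9}, math.tanh)["o"]
              - evaluate(Ma, {"x1": t}, math.tanh)["o"]) for t in (-1.0, 0.0, 0.4, 2.5))
```

Here the node p is removed together with the anchored input, and the outputs of M restricted to x2=0.9 agree with those of Ma, which is the content of (IA-1) for this toy network.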
However, Ma will not, generally, be clones-free. This is unfortunate, as our program for proving Theorem 2 envisages maintaining the no-clones property when constructing networks with linearly dependent outputs. However, not all is lost, as the following lemma says that, for nonlinearities holomorphic on a neighborhood of R, either there exists some value of a∈R such that the network Ma is, indeed, clones-free, or it is possible to modify a subnetwork of M (different from the subnetwork giving rise to Ma) to yield a clones-free subnetwork N of M with input {vD00} and linear dependency among the maps realized by its output nodes. This will be sufficient for our purposes.
Lemma 2 (Input anchoring).
Let M=(VM,EM,VinM,VoutM,ΩM,ΘM) be a non-degenerate, clones-free GFNN with input nodes VinM={v10,…,vD00}, D0≥2. Let ρ:U→R be holomorphic on an open domain U⊂C containing R, such that ρ(R)⊂R. Let Ma denote the network obtained by anchoring the input vD00 to some a∈R, according to Definition 17. Then one of the following two statements must be true:
(i) There exists an a∈R such that Ma is clones-free.
(ii) There exist a non-degenerate clones-free GFNN N=(VN,EN,{vD00},VoutN,ΩN,ΘN) (obtained by modifying a subnetwork of M), a real number λ0, and nonzero real numbers (λw)w∈VoutN, such that the function houtN:=λ0⋅1+∑w∈VoutNλw⟨w⟩ρ,N is identically zero on R.
Proof.
For a pair of nodes (c1,c2)∈VM×VM define
[TABLE]
Suppose that (i) is false, so that, for every a∈R, we have a∈E(c1,c2) for some (c1,c2). Then we can write R as a finite union
[TABLE]
It follows that there exists a pair (c1,c2) such that at least one of the sets E(c1,c2) is not discrete, i.e., it has a limit point. Fix such a pair (c1,c2).
Note that we have vD00∈VM(cj), for at least one of j=1 or j=2, as otherwise we would have
parMa(cj)=parM(cj),
for j∈{1,2} and all a∈E(c1,c2), and thus c1, c2 would be clones in Ma if and only if they are clones in M. But, by the no-clones property of M, this would imply E(c1,c2)=∅, contradicting the fact that E(c1,c2) is not discrete. Thus, we may w.l.o.g. assume that vD00∈VM(c1), which leaves us with the cases vD00∈VM(c2) and vD00∉VM(c2) that will be treated separately when needed. Define the GFNN N=(VN,EN,{vD00},VoutN,ΩN,ΘN) according to the following:
– Let S={v∈VM({c1,c2}):VinM∩VM(v)={vD00}}, and set
[TABLE]
– EN={(v,ṽ)∈EM:v,ṽ∈VN},
– VoutN={c1,c2}∩VN,
– ΩN={ωṽv:(v,ṽ)∈EN},
– choose a number r∈R∖({θv−θc1:v∈S}∪{θv−θc2:v∈S}), and set θ̃c1=θc1+r, θ̃c2=θc2+r, and θ̃v=θv, for v∈S. Define ΘN={θ̃v:v∈VN}.
Informally, the so-constructed network N consists of the parts of M propagating the input at vD00 to c1 and c2 (and it might happen that this input does not reach c2, in which case this node is not included in VN), and the biases θ̃c1 and θ̃c2 are chosen so as to ensure that N has no clone pair (v,ṽ) with v∈{c1,c2} and ṽ∈S. Thus, in order to show that N is clones-free, it suffices to establish that c1 and c2 are not clones in N (note that c1 and c2 can be clones in N only in the case vD00∈VM(c2)), as any clone pair (v,ṽ) with v,ṽ∈S would also be a clone pair in M.
By way of contradiction, assume that c1 and c2 are clones in N, i.e.,
[TABLE]
As the construction of N does not depend on a, we can fix an arbitrary a∈E(c1,c2), and the condition that c1 and c2 are clones in Ma then implies
[TABLE]
where the real numbers au are defined according to (2). This, together with (4), yields
[TABLE]
which would say that c1 and c2 are clones in M and hence stands in contradiction to the no-clones property of M. This establishes the no-clones property of N. The non-degeneracy of N follows by its construction.
Now, by adding r to both sides of (5) and applying ρ, we find
[TABLE]
for all a∈E(c1,c2) (note that parM(c2)∩VN=∅ in the case vD00∉VM(c2), and so the sum on the right-hand side of (5) evaluates to [math] in this case). As ρ is holomorphic on an open neighborhood of R and ρ(R)⊂R, we also have that ⟨c1⟩ρ,N, ⟨c2⟩ρ,N are holomorphic on a neighborhood of R. Further, since E(c1,c2) has a limit point, it follows by the identity theorem [18, Thm. 10.18] that (7) holds for all a∈R.
We have hence shown that Statement (ii) is valid with this N, and
[TABLE]
∎
IV Auxiliary results from complex analysis and Kronecker’s theorem
We state the remaining auxiliary results needed in the proof of our main statements. Since these results are relatively simple consequences of standard results in complex analysis and of Kronecker’s theorem, their proofs are relegated to the appendix.
Recall the definition of the natural domain D⟨u⟩σ of the map realized by a GFNN node u with respect to a holomorphic nonlinearity as given in Definition 12.
In the proof of Theorem 2 it will be crucial that D⟨u⟩σ be connected for all nodes u of a certain GFNN with a single input. The following lemma establishes this fact.
Lemma 3.
Let N=(V,E,{vin},Vout,Ω,Θ) be a GFNN, and let σ:Dσ→C be a meromorphic function on C with its set of poles given by P⊂C∖R.
Furthermore, suppose that σ(R)⊂R. Then, for every u∈V, we have D⟨u⟩σ=C∖Eu, where Eu⊂C is a closed countable subset of C∖R. In particular, we have that D⟨u⟩σ is an open connected set with D⟨u⟩σ⊃R.
In the following we write Dk∘(a,δ):={(z1,…,zk)∈Ck:∣zj−aj∣<δ,∀j} for the open polydisc of radius δ>0, centered at a=(a1,…,ak)∈Ck. Further, for a set S⊂Ck, we write cl(S) for the closure of S in Ck.
Lemma 4.
Let F:U→C be holomorphic on a connected open domain U⊂Ck containing Rk. Let a=(a1,…,ak)∈Rk and δ>0 be given, and let
[TABLE]
Suppose that Dk∘(a,δ)⊂U, and F(z)=0, for all z∈T. Then F=0 identically on U.
Lemma 5.
Let t∗∈C, a=(a1,…,ak)∈Rk, and δ>0, and let F:U→C be holomorphic on a connected open domain U⊂C1+k containing {t∗}×Rk. Define the set
[TABLE]
and suppose that D1+k∘(a,δ)⊂U. If there exists a set T⊂C1+k such that T⊂(C∖{t∗})×Ck, cl(T)⊃T, and F∣T≡0,
then F∣U≡0.
We will now elaborate on the tools needed in the proof of Theorem 2. The material touches upon the theory of Lie groups and representation theory, and will be presented in a self-contained fashion, only assuming familiarity with finitely-generated abelian groups and basic point-set topology. We write Td=Rd/Zd for the d-dimensional torus considered as a compact abelian topological group. For a finite set of real numbers {αj}j=1d we let ⟨α1,…,αd⟩Q denote the span of {αj}j=1d in the vector space R over the scalar field Q, and we write dim⟨α1,…,αd⟩Q for its dimension. We will need the following lemma, which is an easy consequence of Kronecker’s theorem [16]. For the sake of completeness, we provide an elementary proof from first principles.
Lemma 6.
Let d∈N and let {αj}j=1d be an arbitrary set of nonzero real numbers with k=dim⟨α1,…,αd⟩Q. Define the following subset of Td:
[TABLE]
where cl denotes the closure in Td. Then M is isomorphic to a k-dimensional torus as a Lie group, i.e., there exists a Ψ:M→Rk/Zk that is both a homeomorphism (between M and Rk/Zk as topological spaces) and a homomorphism (between M and Rk/Zk as abelian groups).
When d=2, Lemma 6 simply says that the line ℓ:t↦(α1t,α2t)+Z2, t∈R, either exhibits discrete periodic behavior and is thus homeomorphic to a 1-dimensional torus, which is the case if k=1, i.e., α1/α2 is rational, or otherwise, if k=2, i.e., when α1/α2 is irrational, ℓ is dense in the whole square, and so its closure is a 2-dimensional torus, namely R2/Z2 itself. This is illustrated in Figure 6.
When d≥3, the situation can be more complicated, as illustrated in Figure 7. Specifically, the torus M obtained as the closure of the line ℓ:t↦(α1t,…,αdt)+Zd, t∈R, may not occupy the entirety of Rd/Zd.
In this case, Lemma 6 provides the precise dimension of M, namely k=dim⟨α1,…,αd⟩Q. For the purpose of proving Theorem 2, it will suffice to consider the behavior of ℓ in a neighborhood of the point 0+Zd∈Td. Concretely, if Q∈Qd×k is the matrix representing α1,…,αd in the basis {α1,…,αk}, the following lemma states that, in a neighborhood of 0, ℓ visits points arbitrarily close to the k-dimensional subspace of Rd spanned by the columns of Q.
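The dichotomy for d=2 can be observed numerically. The sketch below samples the line t↦(α1t,α2t)+Z2 and measures how close the samples come to a fixed target point on the torus. The target point, the sampling grid, and the function name are arbitrary choices made for illustration, and a finite sample can of course only suggest, not prove, density.

```python
import numpy as np

def min_torus_distance(alphas, target, ts):
    """Smallest flat-torus distance between `target` and the points (alpha_j * t) mod 1."""
    pts = np.outer(ts, alphas) % 1.0
    diff = np.abs(pts - np.asarray(target))
    diff = np.minimum(diff, 1.0 - diff)          # wrap-around distance per coordinate
    return float(np.sqrt((diff ** 2).sum(axis=1)).min())

ts = np.arange(0.0, 2000.0, 0.25)                # sample times along the line
target = (0.25, 0.0)                             # a point off the closed curve y = 2x mod 1

# Rational ratio (k = 1): the line closes up and stays away from the target.
d_rational = min_torus_distance((1.0, 2.0), target, ts)

# Irrational ratio (k = 2): the sampled line comes very close to the target.
d_irrational = min_torus_distance((1.0, np.sqrt(2.0)), target, ts)
```

With α=(1,2) the sampled points stay at a distance bounded away from zero from the target, whereas with α=(1,√2) they approach it closely, in line with the closure of the line being a 1-dimensional, respectively 2-dimensional, torus.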
Lemma 7.
Suppose that {αj}j=1d are nonzero real numbers, and let k=dim⟨α1,…,αd⟩Q. Furthermore, assume that {αj}j=1k is a basis for ⟨α1,…,αd⟩Q over Q, and let Q=(Qpj)∈Qd×k be the matrix such that (α1,…,αd)=Q⋅(α1,…,αk).
Then there exists an open set C⊂Rk with 0∈C, such that, for every s=(s1,…,sk)∈C, there are sequences (tn,s)n∈N⊂R and (rn,s)n∈N=(r1n,s,…,rkn,s)n∈N⊂C with the following properties:
(i) (α1tn,s,α2tn,s,…,αdtn,s)+Zd=Q⋅(α1r1n,s,…,αkrkn,s)+Zd, for all n∈N,
(ii) ∣tn,s∣→∞ as n→∞,
(iii) rn,s→s in Rk, as n→∞.
V Imaginary period and the self-avoiding property
We say that a holomorphic function f:D→C is i-periodic if f(z+i)=f(z), for all z∈D. An example of such a function is the scaled hyperbolic tangent function tanh(π⋅). More generally, for an arbitrary discrete set S⊂R, and arbitrary C∈R and real sequence {cs}s∈S∈ℓ1(S), the function σ=C+∑s∈Scstanh(π(⋅−s)) is also i-periodic, and in particular, the set of its poles P has the structure P=⋃n∈Z(S+(n+1/2)i).
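Both the i-periodicity and the location of the poles are easy to check numerically for a finite sum. The shifts, coefficients, and constant below are arbitrary illustrative values, not taken from the paper.

```python
import cmath

def sigma(z, shifts=(0.0, 0.7, -1.3), coeffs=(1.0, -0.5, 0.25), C=0.1):
    """Finite instance of C + sum_s c_s tanh(pi (z - s)) with illustrative parameters."""
    return C + sum(c * cmath.tanh(cmath.pi * (z - s)) for s, c in zip(shifts, coeffs))

z = 0.37 + 0.21j

# i-periodicity: sigma(z + i) should coincide with sigma(z) up to rounding.
period_gap = abs(sigma(z + 1j) - sigma(z))

# Poles are expected at s + (n + 1/2) i; close to s = 0.7, n = 0 the modulus blows up.
near_pole = abs(sigma(0.7 + 0.5j + 1e-6))
```

The periodicity follows from tanh(w+iπ)=tanh(w), and the blow-up near 0.7+i/2 reflects the zero of cosh(π(z−0.7)) at that point.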
We now introduce a property defined for discrete subsets of R, which will, when applied to the set S, be the final technical ingredient in the proof of our main results.
Definition 18 (Self-avoiding set).
Let S⊂R be a discrete set. We say that S is self-avoiding if, for every finite collection of distinct pairs {(ωj,θj)}j=1m⊂(2Z+1)×R, there exist a j∗∈{1,…,m} and a t∗ such that
[TABLE]
Remark.
In other words, a set S is self-avoiding if the union of a finite number of distinct copies of S obtained by translating and scaling by an odd integer contains a real number which is an element of exactly one of the copies.
Proposition 2.
Let S={sk:k∈Z}, sk−sk−1>0, ∀k∈Z, be an infinite discrete set such that {sk−sk−1:k∈Z} is rationally independent. Then S is self-avoiding.
Proof.
We use the shorthand notation Sω,θ=ωS−θ.
Suppose by way of contradiction that A⊂(2Z+1)×R, #A≥2, is a set of pairs such that, for every (ω,θ)∈A and every t∈Sω,θ, there exists a pair (ω′,θ′)∈A∖{(ω,θ)} such that t∈Sω′,θ′. Fix a pair (ω1,θ1)∈A. We then have, by assumption,
[TABLE]
Since S is infinite, there exists a (ω2,θ2)∈A∖{(ω1,θ1)} such that #(Sω1,θ1∩Sω2,θ2)≥3. Pick an arbitrary subset {t1<t2<t3}⊂Sω1,θ1∩Sω2,θ2 and note that there exist k11,k21,k31∈Z and k12,k22,k32∈Z such that
[TABLE]
Moreover, for r=1,2, we have k1r<k2r<k3r if ωr>0 and k1r>k2r>k3r if ωr<0. Define the index sets
[TABLE]
For brevity write ak=sk−sk−1, ∀k∈Z. We then have
[TABLE]
Now, since {ak:k∈Z} is rationally independent and ∣ω1∣,∣ω2∣∈Z, (9) implies ∣ω1∣=∣ω2∣ and Kj1=Kj2, for j=1,2. In particular, Kj1=Kj2, for j=1,2, implies sgn(ω1)=sgn(ω2), so we have ω1=ω2. Then, from the definition of Kjr, it follows that kj1=kj2, for j=1,2,3. We thus obtain from (8) that θ1=θ2, contradicting (ω1,θ1)≠(ω2,θ2). Therefore, our initial assumption was false, so we deduce that S must be self-avoiding.
∎
The following proposition formalizes the notion that nonlinearities σ of the form considered at the beginning of the chapter are dense in the set of sigmoidal nonlinearities, even after imposing the additional constraint that S be self-avoiding.
Proposition 3.
Let ρ be a piecewise C1 nonlinearity with ρ′∈BV(R)∩L1(R). Then, for every ϵ>0, there exist a discrete self-avoiding set S⊂R, a sequence {cs}s∈S∈ℓ1(S) with cs≠0, for all s∈S, and real numbers α>0 and C, such that the function σ given by
$$\sigma(x)=C+\sum_{s\in S}c_{s}\tanh\big(\alpha(x-s)\big),\quad x\in\mathbb{R},$$
satisfies ∥σ−ρ∥L∞(R)<ϵ.
Proof.
First note that
[TABLE]
is a well-defined real number, as ρ′∈L1(R). Let H denote the Heaviside step function. We now have, for all x∈R,
[TABLE]
Denote hα=(1/2)(1+tanh(α⋅)) and consider the function ρα defined by
[TABLE]
We then have
[TABLE]
Now note that ∥ρ′∥L∞(R)<∞ as ρ′∈BV(R), and ∥H−hα∥L1(R)→0 as α→∞ by dominated convergence, so there exists α>0 such that ∥ρ−ρα∥L∞(R)<ϵ/3.
Let b:Z→N be a bijection, and β∈(0,1) a parameter to be specified. Define the infinite discrete set Sβ={skβ:=β(k+π^{−b(k)}):k∈Z}⊂R. Then, since π is transcendental, Proposition 2 implies that Sβ is self-avoiding. Now, since ρ′ is integrable on R and piecewise continuous, and hα is bounded and continuous, we have that ρ′⋅hα(x−⋅) is integrable on R and piecewise continuous. Hence, as mesh(Sβ):=supk∈Z∣skβ−sk−1β∣→0 for β→0, we have the following convergence of Riemann sums
[TABLE]
Therefore ρ(−∞)+∑k∈Z(skβ−sk−1β)ρ′(skβ)hα(⋅−skβ)→ρα pointwise. To upgrade this to convergence in ∥⋅∥L∞(R), we proceed as follows.
By the mean value theorem, for any x∈R and β>0, there exist ykβ,x∈[sk−1β,skβ] such that
[TABLE]
We can therefore write
[TABLE]
Since ρ′∈BV(R) by assumption, and hα∈BV(R) by definition, the quantities in the parentheses are all finite. As they are moreover independent of β, and mesh(Sβ)→0 for β→0, we can pick a β>0 such that
[TABLE]
where we used (10) to replace ∫Rρ′(y)hα(x−y)dy in (11) with ρα−ρ(−∞).
Finally, let {ds}s∈Sβ be an arbitrary sequence of real numbers such that mesh(Sβ)∑k∈Z∣dskβ∣<ϵ/3 and, for each s∈Sβ, ds≠0 if and only if ρ′(s)=0. We then have
[TABLE]
Now, combining the estimates (12), (13), and ∥ρ−ρα∥L∞(R)<ϵ/3 yields
[TABLE]
so the claim of the proposition holds with S=Sβ, cskβ=(1/2)(skβ−sk−1β)(ρ′(skβ)+dskβ), and C=ρ(−∞)+∑k∈Zcskβ.
∎
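The construction in the proof can be tried out numerically. The sketch below approximates the logistic function, chosen here purely as an example of a sigmoidal ρ with ρ(−∞)=0, by a finite sum of shifted tanh functions with coefficients cs=(1/2)(sk−sk−1)ρ′(sk). For simplicity it uses a uniform, truncated grid and ds=0. This is an assumption of the illustration: a uniform grid is not the self-avoiding set Sβ of the proof, and the parameter values are arbitrary.

```python
import numpy as np

def rho(x):
    """Example target nonlinearity (assumption): the logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def rho_prime(x):
    r = rho(x)
    return r * (1.0 - r)

def tanh_sum_approximation(x, alpha=20.0, step=0.02, span=30.0):
    """Evaluate C + sum_s c_s tanh(alpha (x - s)) on a uniform grid of shifts s."""
    s = np.arange(-span, span + step, step)      # truncated uniform grid of shifts
    c = 0.5 * step * rho_prime(s)                # c_s = (1/2)(s_k - s_{k-1}) rho'(s_k)
    C = 0.0 + c.sum()                            # C = rho(-inf) + sum_s c_s, rho(-inf) = 0
    return C + (c[None, :] * np.tanh(alpha * (x[:, None] - s[None, :]))).sum(axis=1)

x = np.linspace(-10.0, 10.0, 801)
sup_error = float(np.max(np.abs(tanh_sum_approximation(x) - rho(x))))
```

Increasing alpha and decreasing step drives the error on the evaluation grid towards zero, mirroring the two approximation steps of the proof (replacing the Heaviside function by hα, and replacing the convolution integral by a Riemann sum).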
VI The main theorems
Theorem 3.
Let N1 and N2 be non-degenerate clones-free LFNNs with the same input and output sets Vin and Vout. Let
$$\sigma=C+\sum_{s\in S}c_{s}\tanh\big(\pi(\,\cdot\,-s)\big),$$
where C∈R, S is a discrete self-avoiding set, and {cs}s∈S∈ℓ1(S) are all nonzero and real. Suppose that ⟨N1⟩σ(t)=⟨N2⟩σ(t), for all t∈RVin. Then N1 and N2 are faithfully isomorphic.
Theorem 4.
Let Nj, j∈{1,2,…,n}, be non-degenerate clones-free LFNNs with the same input set Vin and the same single output node {vout}. Furthermore, suppose that no two networks Nj1, Nj2, j1≠j2, are extensionally isomorphic. Consider the nonlinearity
$$\sigma=C+\sum_{s\in S}c_{s}\tanh\big(\pi(\,\cdot\,-s)\big),$$
with C∈R, S a discrete self-avoiding set, and {cs}s∈S∈ℓ1(S), where each cs is nonzero and real. Then {⟨Nj⟩σ}j=1n∪{1} is a linearly independent set of functions from RVin to R.
Before embarking on the proofs of Theorems 3 and 4, we show how Theorems 1 and 2 follow from these two results together with Proposition 3.
Proof of Theorem 1. Let ρ be as in the statement of Theorem 1, and let ϵ>0 be arbitrary. Proposition 3 guarantees the existence of a discrete self-avoiding set S⊂R, a sequence {cs}s∈S∈ℓ1(S) with cs≠0, for all s∈S, and real numbers α>0 and C, such that the function σ defined by
$$\sigma(x)=C+\sum_{s\in S}c_{s}\tanh\big(\alpha(x-s)\big),\quad x\in\mathbb{R},$$
satisfies ∥σ−ρ∥L∞(R)<ϵ.
Now suppose that N=(V,E,Vin,Vout,Ω,Θ) and Ñ=(Ṽ,Ẽ,Vin,Vout,Ω̃,Θ̃) are clones-free non-degenerate LFNNs with the same input set Vin and such that ⟨N⟩σ(x)=⟨Ñ⟩σ(x), for all x∈RVin. Consider the scaled objects σα:=σ((π/α)⋅), Sα=(α/π)S, Nα=(V,E,Vin,Vout,(α/π)Ω,(α/π)Θ), and Ñα=(Ṽ,Ẽ,Vin,Vout,(α/π)Ω̃,(α/π)Θ̃), where (α/π)Ω={(α/π)ω:ω∈Ω}, and (α/π)Θ, (α/π)Ω̃, (α/π)Θ̃ are defined analogously.
Then ⟨Nα⟩σα(x)=⟨N⟩σ(x)=⟨Ñ⟩σ(x)=⟨Ñα⟩σα(x), for all x∈RVin.
Moreover,
[TABLE]
and Sα is a discrete self-avoiding set (as the self-avoiding property is preserved under scaling by a nonzero real number), so by Theorem 3 we obtain Nα∼fÑα, which implies N≃Ñ.
∎
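The rescaling step used in the argument above can be verified on a toy example: replacing σ by σα=σ((π/α)⋅) and multiplying all weights and biases by α/π leaves the realized map unchanged. The network, the parameter values, and the function names below are illustrative assumptions, and every non-input node is assumed to apply the nonlinearity, as in Definition 12.

```python
import math

ALPHA = 3.0  # illustrative value of the parameter alpha

def sigma(x, shifts=(0.0, 0.4), coeffs=(1.0, -0.5), C=0.2):
    """Finite instance of C + sum_s c_s tanh(alpha (x - s))."""
    return C + sum(c * math.tanh(ALPHA * (x - s)) for s, c in zip(shifts, coeffs))

def sigma_alpha(x):
    """Rescaled nonlinearity sigma^alpha = sigma((pi / alpha) * x)."""
    return sigma(math.pi / ALPHA * x)

def two_layer(x, w1, b1, w2, b2, act):
    """Single-input network with one hidden layer; the output node also applies act."""
    hidden = [act(w * x + b) for w, b in zip(w1, b1)]
    return act(sum(w * h for w, h in zip(w2, hidden)) + b2)

w1, b1, w2, b2 = [1.3, -0.7], [0.1, 0.4], [0.9, -1.1], 0.25
scale = ALPHA / math.pi
w1s, b1s = [scale * w for w in w1], [scale * b for b in b1]
w2s, b2s = [scale * w for w in w2], scale * b2

max_gap = max(
    abs(two_layer(x, w1, b1, w2, b2, sigma) - two_layer(x, w1s, b1s, w2s, b2s, sigma_alpha))
    for x in (-2.0, -0.3, 0.0, 0.8, 1.7)
)
```

The two factors cancel at every node, since σα((α/π)y)=σ(y), which is why the identity holds layer by layer.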
Proof of Theorem 2. Let ρ be as in the statement of Theorem 2, and let ϵ>0 be arbitrary. Proposition 3 guarantees the existence of a discrete self-avoiding set S⊂R, a sequence {cs}s∈S∈ℓ1(S) with cs≠0, for all s∈S, and real numbers α>0 and C, such that the function σ defined by
$$\sigma(x)=C+\sum_{s\in S}c_{s}\tanh\big(\alpha(x-s)\big),\quad x\in\mathbb{R},$$
satisfies ∥σ−ρ∥L∞(R)<ϵ.
Now suppose that Nj=(Vj,Ej,Vin,{vout},Ωj,Θj), j∈{1,…,n}, are non-degenerate clones-free LFNNs such that no two Nj1, Nj2, j1≠j2, are faithfully isomorphic. As {vout} is a singleton, it follows that no two Nj1, Nj2, j1≠j2, are extensionally isomorphic either.
Now, define the scaled objects σα:=σ((π/α)⋅), Sα=(α/π)S, and Njα=(Vj,Ej,Vin,{vout},(α/π)Ωj,(α/π)Θj), for j∈{1,…,n}, where (α/π)Ωj={(α/π)ω:ω∈Ωj} and (α/π)Θj={(α/π)θ:θ∈Θj}.
Then the Njα are non-degenerate and clones-free, and no two Nj1α, Nj2α, j1≠j2, are extensionally isomorphic.
Moreover,
[TABLE]
and Sα is a discrete self-avoiding set, so by Theorem 4 we obtain that {⟨Njα⟩σα}j=1n∪{1} is linearly independent.
Now, suppose by way of contradiction that there is linear dependency
λ0+∑j=1nλj⟨Nj⟩σ=0
among {⟨Nj⟩σ}j=1n∪{1}. But then
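$$\lambda_{0}+\sum_{j=1}^{n}\lambda_{j}\langle\mathcal{N}_{j}^{\alpha}\rangle^{\sigma^{\alpha}}=\lambda_{0}+\sum_{j=1}^{n}\lambda_{j}\langle\mathcal{N}_{j}\rangle^{\sigma}=0,$$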
which contradicts the linear independence of {⟨Njα⟩σα}j=1n∪{1}. We deduce that {⟨Nj⟩σ}j=1n∪{1} must be linearly independent, as desired.
∎
Proof of Theorem 4. We argue by contradiction, so suppose that the statement is false. Specifically, let Nj, j∈{1,2,…,n}, be LFNNs and σ a nonlinearity as in the statement of the theorem, and suppose that {⟨Nj⟩σ}j=1n∪{1} is linearly dependent. Then, by Lemma 1, there exists a non-degenerate clones-free LFNN M=(VM,EM,VinM,VoutM,ΩM,ΘM) with a single input node VinM={vin}, such that {⟨w⟩σ:w∈VoutM}∪{1} is a linearly dependent set of functions from R to R.
Let ℳ denote the set of all non-degenerate clones-free LFNNs M=(VM,EM,{vin},VoutM,ΩM,ΘM) such that {⟨w⟩σ:w∈VoutM}∪{1} is linearly dependent. We then have ℳ≠∅, simply as M∈ℳ.
Denote by ℳmin the set of all networks in ℳ of minimum depth, and fix a network M′∈ℳmin with the minimal number of nodes among all the networks in ℳmin. The proof proceeds by constructing a network N∈ℳmin with a strictly smaller number of nodes than M′, thereby deriving a contradiction and concluding the proof.
First note that linear dependence of {⟨w⟩σ:w∈VoutM′}∪{1} is equivalent to the existence of a nonzero set of real numbers {λw}w∈VoutM′ and a real number c∈R such that hout:R→R, given by
[TABLE]
is constant-valued, i.e., hout(t)=c, for all t∈R. Note that λw≠0, for all w∈VoutM′, for otherwise the ancestor subnetwork M′({w∈VoutM′:λw≠0}) would be an element of ℳmin with strictly fewer nodes than M′, contradicting the minimality of M′.
Next, note that σ is a real meromorphic function whose set of poles is
$$P=\bigcup_{n\in\mathbb{Z}}\Big(S+\big(n+\tfrac{1}{2}\big)i\Big),$$
and in particular, M′ and σ satisfy the assumptions of Lemma 3, and so the sets C∖D⟨w⟩σ are closed and countable, where D⟨w⟩σ denotes the natural domain of ⟨w⟩σ, for w∈VoutM′. Therefore, as a linear combination of holomorphic functions, hout is a holomorphic function on Dhout:=⋂w∈VoutM′D⟨w⟩σ.
As C∖D⟨w⟩σ are closed and countable, C∖Dhout is also closed and countable, and therefore Dhout is a connected open set.
It follows by the identity theorem [18, Thm. 10.18] that hout continues in a unique fashion to a holomorphic function on Dhout with hout(t)=c, for all t∈Dhout.
Set Vℓ={v∈VM′:lv(v)=ℓ}, for ℓ≥1. Let k=dim⟨{ωuvin:u∈V1}⟩Q and enumerate the nodes V1={v11,…,vD11} so that {ωv11vin,…,ωvk1vin} is a basis for ⟨ωv11vin,…,ωvD11vin⟩Q.
In the remainder of the proof, we distinguish between the cases k≥2 and k=1.
The case k≥2. Fix a real number
[TABLE]
chosen so that none of ⟨vp1⟩σ(z)=σ(ωvp1vinz+θvp1), p∈{1,…,D1}, has singularities along A+iR. Such a number always exists, as ⋃p=1D1(S−θvp1)/ωvp1vin is a discrete set.
Now, write (ωvp1vin)p=1D1=Q⋅(ωvp1vin)p=1k, where Q=(qpj)∈QD1×k is a rational matrix whose first k rows form a k×k identity matrix.
Let C⊂Rk be a set satisfying the conclusion of Lemma 7 applied with αp=ωvp1vin, p∈{1,…,D1}.
Given an arbitrary s=(s1,s2,…,sk)∈C, Lemma 7 yields sequences (tn,s)n∈N⊂R and (rn,s)n∈N⊂C such that
[TABLE]
We now perform a calculation that will enable us to interpret the single input variable of M′ as a rational linear combination of k input variables of another LFNN M′′, to be specified below. The argument will then proceed by anchoring at all but one of the inputs of M′′. It is this last step that uses k≥2 as a key assumption, as anchoring requires at least two input nodes to be meaningful.
We thus have
[TABLE]
for p∈{1,…,D1}, where in (19) we used the i-periodicity of σ, in (20) we used (16), and in (21) we used ωvp1vin=∑j=1kqpjωvj1vin and the i-periodicity of σ again. Owing to (15), none of ⟨vp1⟩σ, p∈{1,…,D1}, has singularities along A+iR, and thus all the quantities in (19) – (21) are well-defined. The calculation just presented suggests constructing a new LFNN by “splitting” the input node vin of M′ into k new input nodes. Formally, we define an LFNN M′′=(VM′′,EM′′,VinM′′,VoutM′′,ΩM′′,ΘM′′) as follows:
–
VinM′′={u1,…,uk} is a set of k newly-created input nodes (disjoint from VM′),
Define ωvp1uj:=qpjωvj1vin, for p∈{1,…,D1}, j∈{1,…,k}, and let
[TABLE]
–
ΘM′′:=ΘM′.
The procedure for constructing M′′ for a given M′ is illustrated in Figure 8.
Owing to (19) – (21) and the construction of M′′, we have the following “input splitting” relationship
[TABLE]
for p∈{1,…,D1}.
We now show that M′′ is non-degenerate and clones-free. To this end, first note that, for every j∈{1,…,k}, there exists a w∈VoutM′ such that vj1∈VM′(w), by non-degeneracy of M′, and as uj∈par(vj1), we have uj∈VM′′(w). This establishes non-degeneracy.
Next, we observe that a clone pair in M′′ would have to consist of nodes in {v11,v21,…,vD11}, as a clone pair in M′′ consisting only of nodes in ⋃ℓ≥2Vℓ would also be a clone pair in M′. Thus, by way of contradiction, suppose that (vp11,vp21), 1≤p1<p2≤D1, is a clone pair in M′′. Then θp11=θp21 and ωvp11vin=∑j=1kqp1jωvj1vin=∑j=1kqp2jωvj1vin=ωvp21vin, so (vp11,vp21) is a clone pair in M′, which stands in contradiction to the no-clones property of M′, and hence establishes that M′′ is clones-free.
We now revisit the constant-valued function hout(t)=∑w∈VoutM′λw⟨w⟩σ,M′(t)=c, for all t∈Dhout. Examining the structure of M′, we see that, for each w∈VoutM′, we can write
[TABLE]
where Fw corresponds to the map realized by the LFNN with nodes
[TABLE]
inputs {v11,…,vD11}, output {w}, and edges, weights, and biases inherited from M′. As Fw is the map realized by a node of a GFNN according to Definition 12, it is holomorphic on its natural domain DFw⊂CD1 containing RD1.
We can therefore write
[TABLE]
where F:DF→C, F=∑w∈VoutM′λwFw, is holomorphic on DF:=⋂w∈VoutM′DFw⊃RD1.
Now, by definition of natural domain, for each w∈VoutM′′, we have
[TABLE]
where the variables z1,…,zk correspond to the input nodes u1,…,uk, respectively. Therefore, for (z1,…,zk) in the open domain Dh̃out:=⋂w∈VoutM′′D⟨w⟩σ,M′′, we can define the function h̃out:Dh̃out→C according to
[TABLE]
Moreover, as M′ and M′′ share the nodes in (23), as well as the associated edges, weights, and biases, we have
[TABLE]
for all w∈VoutM′′, and thus
[TABLE]
We are now in a position to show that, like hout, the function h̃out is constant valued. As this will be effected by an analytic continuation argument through Lemma 4, we first need to ensure that the relevant quantities lie in Dh̃out. To this end, as ⟨vp1⟩σ,M′′(z1,…,zk)∈R, for all (z1,…,zk)∈Rk, p∈{1,…,D1}, and DF is an open set containing RD1, we can choose a small enough δ>0 so that Dh̃out⊃Dk∘((A,…,A),δ). Now, fix an arbitrary s=(s1,…,sk) in the smaller open set C∩Dk∘(0,δ). We then have
[TABLE]
and since
[TABLE]
as n→∞, we obtain
[TABLE]
for large enough n∈N. We may assume w.l.o.g. that this is true for all n∈N by discarding finitely many elements of the sequence (rn,s)n∈N.
Now, we use (22), (24), and (25) to get
[TABLE]
Define the set
[TABLE]
and note that
cl(T)⊃((A,…,A)+(iC)∩Dk∘(0,δ)),
so it follows by Lemma 4 that h~out−c≡0 everywhere in a neighborhood of Rk, and thus, in particular, h~out∣Rk≡c.
We now repeatedly apply Lemma 2 to M′′, anchoring successively each of the inputs u1,…,uk−1. Observe that we will never find ourselves in the circumstance (ii) of Lemma 2, as this would mean that we have obtained a network N∈Mmin with a strictly smaller number of nodes than M′. Moreover, as the first k rows of Q form an identity matrix, we have
[TABLE]
for all p,j∈{1,…,k}. Therefore, for each j∈{1,…,k}, the node vj1 will be removed when anchoring the input uj.
A concrete example of this input anchoring procedure in the case k≥2 is shown schematically in Figure 9.
Thus, having anchored the nodes u1,u2,…,uk−1 to appropriate real numbers a1,…,ak−1, we will be left with a non-degenerate clones-free LFNN N=(VN,EN,{uk},VoutN,ΩN,ΘN) such that the function houtN:=∑w∈VoutNλw⟨w⟩σ,N satisfies
[TABLE]
We have shown that the first term on the right-hand side of (26) evaluates identically to c.
Moreover, as input anchoring yields networks satisfying (IA-2),
the values ⟨w⟩σ,M′′, for w∈VoutM′′∖VoutN, are constant with respect to the input at uk. Therefore the value of the sum on the right-hand side of (26) is independent of t, that is, houtN≡cN, for some cN∈R. As λw≠0, for w∈VoutM′′, it follows that {⟨w⟩σ,N:w∈VoutN}∪{1} is linearly dependent.
We have thus shown that the network N is in Mmin. As N has strictly fewer nodes than M′, we have established the desired contradiction and proved the theorem for k≥2.
The case k=1. We have dim⟨ωv11vin,…,ωvD11vin⟩Q=1, so we can write ωvj1vin=Nja, where a∈R and Nj∈Z, for j=1,…,D1. Moreover, by replacing a with 2la and all Nj with Nj/2l for an appropriate integer l, we may assume w.l.o.g. that at least one of the Nj is odd. We make the following crucial observation. For all j=1,…,D1 and t∈R, we have
[TABLE]
We see that, along the line R+i/(2a), the functions ⟨vj1⟩σ,M′ are real-valued, for all j=1,…,D1, and, provided that Nj is odd, they have poles at the points (1/a)[NjS−θvj1+i/2]. As S is self-avoiding, and at least one of the Nj is odd, there exist a j∗∈{1,…,D1} and a t∗∈R+i/(2a) such that ⟨vj∗1⟩σ,M′ has a pole at t∗, and all the other ⟨vj1⟩σ,M′, j∈{1,…,D1}∖{j∗}, are analytic and real-valued at t∗. Let ϵ>0 be such that ⟨vj1⟩σ,M′, j∈{1,…,D1}∖{j∗}, are analytic on an open set containing the closed disk D(t∗,ϵ), and such that ⟨vj∗1⟩σ,M′ is analytic on the punctured disk D(t∗,ϵ)∖{t∗}.
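The normalization at the start of this case (rescaling a and the Nj so that at least one Nj is odd) can be sketched in a few lines of Python. The function name `normalize_odd` and the use of exact `Fraction` arithmetic for a are assumptions made purely for illustration; in the proof, a is an arbitrary real number.

```python
from fractions import Fraction

def normalize_odd(a, Ns):
    """Rescale (a, Ns) so that every product a * N_j is unchanged
    and at least one N_j is odd. Assumes the N_j are integers, not all zero."""
    if all(n == 0 for n in Ns):
        raise ValueError("at least one N_j must be nonzero")
    l = 0
    # Divide out the largest power of two common to all N_j.
    while all(n % 2 == 0 for n in Ns):
        Ns = [n // 2 for n in Ns]
        l += 1
    # a is replaced by 2^l * a, each N_j by N_j / 2^l.
    return a * 2**l, Ns, l

# Hypothetical weights (1/3) * (4, 12, -8).
a_new, Ns_new, l = normalize_odd(Fraction(1, 3), [4, 12, -8])
```

The weights a·Nj themselves are untouched by this step; only their representation changes, which is what allows the proof to single out an index j with Nj odd.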
Before embarking on the construction of N in the case k=1, we verify the following auxiliary statement:
Claim 1: We have L(M′)≥2 and {v∈V2:(vj∗1,v)∈EM′}≠∅.
Proof of Claim 1. We first show that L(M′)≥2. To this end, suppose by way of contradiction that L(M′)=1. Then VoutM′=V1 by non-degeneracy, so the function hout=∑w∈VoutM′λw⟨w⟩σ,M′ can be written as
[TABLE]
where g is analytic in an open neighborhood of t∗. But ⟨vj∗1⟩σ,M′ has a pole at t∗, and so hout has a pole at t∗, which stands in contradiction to hout≡c, and thus establishes L(M′)≥2.
Next, by way of contradiction assume that {v∈V2:(vj∗1,v)∈EM′}=∅. Then, by non-degeneracy of M′, we have vj∗1∈VoutM′, and ⟨w⟩σ,M′, for w∈VoutM′∖{vj∗1}, are real holomorphic functions of (⟨vj1⟩σ,M′)j∈{1,…,D1}∖{j∗}. Now, as ⟨vj1⟩σ,M′, j∈{1,…,D1}∖{j∗}, are analytic and real-valued at t∗, the function hout can again be written in the form (28) with g analytic in an open neighborhood of t∗. This again contradicts hout≡c, and thus {v∈V2:(vj∗1,v)∈EM′}≠∅, establishing the claim.
We can therefore enumerate the nodes V2={v12,…,vd2,vd+12,…,vD22} so that
–
vj∗1∈⋂p≤dpar({vp2})∖⋃p>dpar({vp2}), and
–
{ωv12vj∗1,…,ωvkˉ2vj∗1} is a basis for ⟨ωv12vj∗1,…,ωvd2vj∗1⟩Q.
In particular, we have kˉ=dim⟨ωv12vj∗1,…,ωvd2vj∗1⟩Q.
We will apply a similar input splitting procedure as in the case k≥2, but this time with the nodes vj∗1 and v12,…,vd2 taking on the roles of vin and v11,…,vD11. Specifically, we will use the pole of ⟨vj∗1⟩σ,M′ at t∗ to obtain sequences (tn,s)n∈N and (rn,s)n∈N according to Lemma 7, that is to say, we will “split the non-input node” vj∗1 of M′ into input nodes of the new network M′′ to be constructed. We remark that
the outputs of v12,…,vd2 depend on ⟨vj∗1⟩σ,M′, which, in turn, is a function of the input variables. This “extra level of separation” will cause the construction of M′′ to be more involved in the case k=1 than it was in the case k≥2.
In order to motivate the construction of M′′ in the case k=1, we will carry out a calculation analogous to (19)–(21). We begin by determining a B∈R such that none of the functions
[TABLE]
for p∈{1,…,d}, have singularities in the set LB:={z∈D(t∗,ϵ):⟨vj∗1⟩σ,M′(z)∈B+iR}, where the functions fp:Dfp→C, for p∈{1,…,d}, are defined according to
[TABLE]
When D1=1, the functions fp are all identically zero.
For given p∈{1,…,d}, z∈LB is a singularity of ⟨vp2⟩σ,M′ if and only if z is an element of D(t∗,ϵ) such that
[TABLE]
where P is the set of poles of σ, expressed in terms of S by (14). But
[TABLE]
for all z∈D(t∗,ϵ), so it suffices to ensure that
[TABLE]
Next, let
[TABLE]
and note that, as fp, p=1,…,d, are continuous in a neighborhood of t∗, we have η(ϵ)→0 as ϵ→0.
Let Leb denote the Lebesgue measure on R. We then have
[TABLE]
for small enough values of ϵ. Therefore, by choosing a sufficiently small ϵ, we can ensure that there exists a B∈[0,1] such that (31) holds, as desired.
Now, write (ωvp2vj∗1)p=1d=Qˉ⋅(ωvp2vj∗1)p=1kˉ, where Qˉ=(qˉpj)p,j∈Qd×kˉ is a rational matrix whose first kˉ rows form a kˉ×kˉ identity matrix.
Let C⊂Rkˉ be a set satisfying the conclusion of Lemma 7 applied with αp=ωvp2vj∗1, p=1,…,kˉ.
Given an arbitrary s=(s1,s2,…,skˉ)∈C, Lemma 7 yields sequences (tn,s)n∈N⊂R, (rn,s)n∈N⊂C such that
[TABLE]
As ⟨vj∗1⟩σ,M′ is analytic on the punctured disk D(t∗,ϵ)∖{t∗} and its singularity at t∗ is a pole, it follows that the reciprocal 1/⟨vj∗1⟩σ,M′ is holomorphic on D(t∗,ϵ) with a zero at t∗. Thus, by the complex open mapping theorem [18, Thm. 10.32] applied to 1/⟨vj∗1⟩σ,M′, there exists a δ>0 such that, for every y∈D(0,δ), there is a zy∈D(t∗,ϵ) with 1/⟨vj∗1⟩σ,M′(zy)=y. Now, since ∣tn,s∣→∞, we also have ∣B+itn,s∣→∞, so it follows that there exists a sequence (zn,s)n∈N in D(t∗,ϵ)∖{t∗} with zn,s→t∗, such that ⟨vj∗1⟩σ,M′(zn,s)=B+itn,s (a finite number of elements of the sequence (tn,s)n∈N may need to be discarded to ensure that (zn,s)n∈N is, indeed, contained in D(t∗,ϵ)∖{t∗}).
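The mechanism behind this step, namely that a function with a pole at t∗ attains every sufficiently large value B+it near t∗, can be checked numerically on a toy example. The function g below is a hypothetical stand-in with a simple pole, chosen so that preimages are available in closed form; it is not the node map ⟨vj∗1⟩σ,M′ from the proof, and the values of `T_STAR` and `B` are arbitrary.

```python
T_STAR = 0.5 + 0.25j   # hypothetical pole location t*
B = 0.3                # hypothetical real part of the target line B + iR

def g(z):
    """Toy meromorphic function with a simple pole at T_STAR."""
    return 1.0 / (z - T_STAR) + 2.0

def preimage(w):
    """Solve g(z) = w for z (valid for w != 2)."""
    return T_STAR + 1.0 / (w - 2.0)

ts = [10.0 ** k for k in range(1, 6)]           # |t_n| -> infinity
zs = [preimage(complex(B, t)) for t in ts]      # points with g(z_n) = B + i t_n
dists = [abs(z - T_STAR) for z in zs]           # distances |z_n - t*|
```

In this toy setting the distances in `dists` shrink as t grows, mirroring the convergence zn,s→t∗ obtained from the open mapping theorem in the general case.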
Now, for p∈{1,…,d}, compute
[TABLE]
where in (35) we used the definition of zn,s, in (36) we used the i-periodicity of σ, in (37) we used (32), and in (38) we used ωvp2vj∗1=∑j=1kˉqˉpjωvj2vj∗1 and the i-periodicity of σ again. As B was chosen so that the functions (29) do not have singularities in LB, all the quantities in the calculation (35)–(38) are well-defined.
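The role of i-periodicity in this calculation can be illustrated numerically: for an i-periodic σ, the value σ(x+iy) depends on y only through its fractional part. The sketch below uses z↦tanh(πz) as a stand-in nonlinearity, which is i-periodic because tanh has period iπ. This choice, as well as the numerical values of the weight, offset, and bias term, are assumptions for illustration and are not taken from the paper.

```python
import cmath
import math

def sigma(z):
    """Stand-in i-periodic nonlinearity: tanh(pi * z)."""
    return cmath.tanh(math.pi * z)

def reduce_imag(z):
    """Replace Im(z) by its fractional part, leaving Re(z) unchanged."""
    return complex(z.real, z.imag - math.floor(z.imag))

# Hypothetical weight, real offset B, large "time" t, and real bias-like term f.
omega, B, t, f = 0.7, 0.3, 12.9, 0.1
arg = omega * complex(B, t) + f        # omega * (B + i t) + f
lhs = sigma(arg)                       # evaluated far up the imaginary axis
rhs = sigma(reduce_imag(arg))          # evaluated after reducing Im modulo 1
```

Only the fractional part of ω·t survives inside σ, which is why the sequences from Lemma 7 are specified through their behavior modulo the integers.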
Motivated by (35)–(38), we construct a GFNN M′′=(VM′′,EM′′,VinM′′,VoutM′′,ΩM′′,ΘM′′) as follows
–
First, kˉ new nodes are created and enumerated as {u1,…,ukˉ}. Now, if D1>1, then let VinM′′={vin,u1,…,ukˉ}, and if D1=1, set VinM′′={u1,…,ukˉ}.
The construction of M′′ for a concrete M′ is illustrated in Figure 10. Note that M′′ is not layered in the case D1>1, due to the presence of the node vin.
Owing to (35)–(38) and the construction of M′′, we have the following “input splitting” relationship:
[TABLE]
for p∈{1,…,d}.
We next show that M′′ is non-degenerate and clones-free. To establish non-degeneracy, it suffices to show VinM′′⊂⋃w∈VoutM′′VM′′(w). First note that, in both cases D1=1 and D1>1, for a given j∈{1,…,kˉ}, there exists a w∈VoutM′∖{vj∗1} such that vj2∈VM′(w), by non-degeneracy of M′. It follows that vj2∈VM′′(w) and thus uj∈VM′′(w). As j was arbitrary, we have {u1,…,ukˉ}⊂⋃w∈VoutM′′VM′′(w), which establishes non-degeneracy of M′′ in the case D1=1. For D1>1 we need to additionally show that vin∈VM′′(w). To this end, note that there exist an m∗∈{1,…,D1}∖{j∗} and a w∈VoutM′∖{vj∗1} such that vm∗1∈VM′(w), and so vin∈VM′′(w), as desired.
The clones-free property of M′′ follows by the same argument as in the case k≥2.
Once again, we revisit the function hout(t)=∑w∈VoutM′λw⟨w⟩σ,M′(t)=c, for all t∈Dhout, and proceed in a similar fashion as in the case k≥2. This time, however,
the output sets VoutM′ and VoutM′′ may differ by the node vj∗1. This is a nuisance that will be dealt with below in Claim 2, but in the meantime, it is convenient to introduce the “truncated” linear dependency function
[TABLE]
and proceed exactly as in the case k≥2.
By examining the structure of M′, we see that, for each w∈VoutM′∖{vj∗1}, we can write
[TABLE]
where Hw:DHw→C corresponds to the map realized by the GFNN with nodes
[TABLE]
inputs {vp2}p=1d∪{vj1}j∈{1,…,D1}∖{j∗}, single output {w}, and edges, weights, and biases inherited from M′.
The function Hw:DHw→C is holomorphic on its natural domain DHw⊂Cd+(D1−1) containing Rd+(D1−1). We can therefore write
[TABLE]
where H:DH→C, H=∑w∈VoutM′∖{vj∗1}λwHw, is holomorphic on DH=⋂w∈VoutM′∖{vj∗1}DHw⊃Rd+(D1−1).
Now, by definition of natural domain, for each w∈VoutM′′, the natural domain D⟨w⟩σ,M′′ is the set of all z∈⋂p=1dD⟨vp2⟩σ,M′′∩⋂j≠j∗D⟨vj1⟩σ,M′′ such that
[TABLE]
where the variable z=(z0,z1,…,zkˉ) corresponds to the input nodes vin,u1,…,ukˉ, in the case D1>1, and z=(z1,…,zkˉ) corresponds to the input nodes u1,…,ukˉ, in the case D1=1.
Therefore, for z in the open domain Dh~out:=⋂w∈VoutM′′D⟨w⟩σ,M′′, we can define the function h~out:Dh~out→C according to
[TABLE]
Moreover, as M′ and M′′ share the nodes in (41), as well as the associated edges, weights, and biases, we have
[TABLE]
for all w∈VoutM′′, and thus
[TABLE]
At this point we verify another auxiliary claim, which states that htr and hout are always, in fact, the same function, and therefore h~out≡c follows by a similar argument as in the case k≥2.
Claim 2: Recall that t∗∈R+i/(2a) is such that ⟨vj∗1⟩σ,M′ has a pole at t∗, and all the other ⟨vj1⟩σ,M′, j∈{1,…,D1}∖{j∗}, are analytic and real-valued at t∗. Further recall the open set C⊂Rkˉ containing 0. We have {t∗}×Rkˉ⊂Dh~out and h~out∣Rkˉ+1≡c, in the case D1>1, and Rkˉ⊂Dh~out and h~out∣Rkˉ≡c, in the case D1=1. Moreover, in both cases we have vj∗1∉VoutM′.
Proof of Claim 2. First assume that D1>1. To show that {t∗}×Rkˉ⊂Dh~out, first observe that, for j∈{1,…,D1}∖{j∗} and (z1,…,zkˉ)∈Rkˉ, we have ⟨vj1⟩σ,M′′(t∗,z1,…,zkˉ)=⟨vj1⟩σ,M′(t∗), which, by (VI), is a real number. By (30), this further implies fp(t∗)∈R, for p=1,…,d. Therefore
[TABLE]
for p∈{1,…,d} and (z1,…,zkˉ)∈Rkˉ. As Rd+(D1−1)⊂DH, we deduce that
[TABLE]
This establishes {t∗}×Rkˉ⊂Dh~out. We proceed to showing h~out∣Rkˉ+1≡c.
As Dh~out is open, it follows that Dh~out⊃U, for some connected open U⊂C1+kˉ containing {t∗}×Rkˉ. Choose a small enough δ>0 so that U⊃D1∘(t∗,δ)×Dkˉ∘((B,…,B),δ).
Now, fix an arbitrary s=(s1,…,skˉ) in the smaller open set C∩Dkˉ∘(0,δ). We then have
[TABLE]
and since
[TABLE]
as n→∞, we obtain
[TABLE]
for large enough n∈N. We may again assume w.l.o.g. that this is true for all n∈N by discarding finitely many elements of the sequences (zn,s)n∈N and (rn,s)n∈N.
Now, we use (39), (42), and (43)
to get
[TABLE]
for all s∈C∩Dkˉ∘(0,δ).
We are now ready to show that vj∗1∉VoutM′ (still in the case D1>1). To this end, suppose by way of contradiction that vj∗1∈VoutM′ and set s=0. Note that h~out(t∗,B,…,B) is a well-defined (finite) complex number, simply as (t∗,B,…,B)∈{t∗}×Rkˉ⊂Dh~out. Thus, by (40) and (44), we have
[TABLE]
as n→∞, which contradicts the fact that ⟨vj∗1⟩σ,M′ has a pole at t∗. This establishes vj∗1∉VoutM′. As a consequence we further have htr=hout, and so (44) reads
[TABLE]
for all s∈C∩Dkˉ∘(0,δ). Now, define the set
[TABLE]
Note that T satisfies
[TABLE]
so by Lemma 5, it follows that h~out−c≡0 everywhere in an open neighborhood of Rkˉ+1, and thus h~out∣Rkˉ+1≡c in particular. This establishes Claim 2 in the case D1>1.
It remains to prove the claim for D1=1. Showing that Rkˉ⊂Dh~out is fully analogous to showing {t∗}×Rkˉ⊂Dh~out in the case D1>1. We can hence proceed to establishing h~out∣Rkˉ≡c. To this end, we first note that there is a connected open set U and a δ>0 such that Rkˉ⊂U⊂Dh~out and Dkˉ∘((B,…,B),δ)⊂U, and we similarly obtain
[TABLE]
for all n∈N and s∈C∩Dkˉ∘(0,δ). Again, showing vj∗1∉VoutM′ now proceeds in a manner entirely analogous to the case D1>1, as does obtaining the identity
[TABLE]
for all s∈C∩Dkˉ∘(0,δ).
Now, define the set
[TABLE]
Note that T satisfies
cl(T)⊃((B,…,B)+(iC)∩Dkˉ∘(0,δ)), so, by Lemma 4, we have h~out≡c everywhere in an open neighborhood of Rkˉ, which concludes the proof of Claim 2.
Finally, it remains to apply an input anchoring procedure to M′′, which will conclude the proof in a manner similar to the case k≥2.
Specifically, we use Lemma 2 to successively eliminate inputs of M′′, starting with vin (if present), and proceeding with u1,…,ukˉ−1.
If D1>1, the network M′′ is not layered (unlike in the case k≥2 and the case k=1, D1=1). However, every network obtained from M′′ by anchoring all but one of the input nodes {vin,u1,…,ukˉ} is layered. This means that, when anchoring vin, we do not find ourselves in the circumstance (ii) of Lemma 2, as this would mean we have obtained a network N∈Mmin with strictly fewer nodes than M.
Thus, after having anchored vin, we are left with a layered network with inputs u1,…,ukˉ. At this point we proceed completely analogously to the case k≥2 by successively eliminating the inputs u1,…,ukˉ−1.
We are left with a non-degenerate clones-free LFNN N=(VN,EN,{ukˉ},VoutN,ΩN,ΘN) and a vector of real constants a (specifically, a∈Rkˉ in the case D1>1, and a∈Rkˉ−1 in the case D1=1),
such that the function houtN:=∑w∈VoutNλw⟨w⟩σ,N satisfies
[TABLE]
A concrete example of this input anchoring procedure in the case k=1 is shown schematically in Figure 11.
By Claim 2, the first term on the right-hand side of (45) evaluates identically to c.
Moreover, as input anchoring yields networks satisfying (IA-2),
the values of the functions ⟨w⟩σ,M′′, for w∈VoutM′′∖VoutN, do not depend on the input at ukˉ.
Therefore houtN≡cN, for some cN∈R.
We have thus shown that the network N is in M. But L(N)=L(M)−1, which stands in contradiction to the minimality of depth of the elements of Mmin, and therefore completes the proof of the theorem.
∎
Let Nj=(Vj,Ej,Vin,Vout,Ωj,Θj), j∈{1,2}, be networks as in the theorem statement.
Let N=N1∨N2 be their amalgam and πj:VNj→πj(VNj)⊂VN the extensional isomorphisms between Nj and the corresponding subnetworks of N, for j∈{1,2}.
We start by claiming that π1(w)=π2(w), for all w∈Vout. Indeed, suppose to the contrary that we have π1(w′)≠π2(w′), for some w′∈Vout, and denote wj=πj(w′), j∈{1,2}. Since w1≠w2, it follows that N(w1) and N(w2) are not extensionally isomorphic, for otherwise w1 and w2 would be clones, contradicting the no-clones condition for N. Now,
[TABLE]
by assumption. But this contradicts the conclusion of Theorem 4, and thus establishes π1(w)=π2(w), for all w∈Vout.
By non-degeneracy of N1, for every v∈V1, there exists a w∈Vout such that v∈VN1(w). Then π1(v)∈VN(π1(w))=VN(π2(w))=π2(VN2(w))⊂π2(V2). Similarly, for every v∈V2, we have π2(v)∈π1(V1). Thus, the function ψ:V1→V2 given by ψ=π2−1∘π1 is well-defined. This function is invertible with inverse π1−1∘π2, so it is a bijection. Therefore ψ is an extensional isomorphism between N1 and N2, by virtue of being a composition of two extensional isomorphisms. Moreover, we have ψ(w)=π2−1(π1(w))=w, for all w∈Vout, so ψ restricted to Vout is the identity map, and thus ψ is a faithful isomorphism.
∎
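The construction of ψ=π2−1∘π1 in the proof above can be sketched with dictionaries standing in for the maps π1 and π2. The node labels and the helper name `compose_inverse` are hypothetical and serve only to illustrate the set-theoretic part of the argument; the sketch does not check that edges, weights, and biases are preserved.

```python
def compose_inverse(pi1, pi2):
    """Return psi = pi2^{-1} o pi1 as a dict.

    Assumes pi1 and pi2 are injective dicts with the same image, which is
    what the proof establishes for the embeddings into the amalgam.
    """
    if set(pi1.values()) != set(pi2.values()):
        raise ValueError("pi1 and pi2 must have the same image")
    pi2_inv = {img: node for node, img in pi2.items()}
    if len(pi2_inv) != len(pi2):
        raise ValueError("pi2 must be injective")
    return {v: pi2_inv[pi1[v]] for v in pi1}

# Hypothetical labels: N1 has nodes x, a1, out; N2 has nodes x, b1, out.
# Both embed into an amalgam with nodes x, h, y, and pi1(out) = pi2(out).
pi1 = {"x": "x", "a1": "h", "out": "y"}
pi2 = {"x": "x", "b1": "h", "out": "y"}
psi = compose_inverse(pi1, pi2)
```

Because π1 and π2 agree on the output nodes, the resulting ψ fixes them, which is the faithfulness property used at the end of the proof.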
Acknowledgment
The authors would like to thank Thomas Allard for useful suggestions regarding the proof of Proposition 3 and an anonymous reviewer for proposing a clearer exposition of Lemma 6.
Fix N1 and N2 as in the statement of the proposition.
We begin by establishing the existence of a corresponding amalgam A.
Let A denote the set of all proto-amalgams of N1 and N2. To see that A is non-empty, consider the LFNN N=(VN,EN,Vin,VoutN,ΩN,ΘN) specified as follows:
–
Let S be a set of cardinality #(V1∖Vin)+#(V2∖Vin) disjoint from Vin, and set VN:=Vin∪S. Furthermore, let πjN:Vj→πjN(Vj)⊂VN be injective functions such that πjN(v)=v, for v∈Vin, j∈{1,2}, and π1N(V1∖Vin)∩π2N(V2∖Vin)=∅, but otherwise arbitrary.
–
EN:=⋃j=1,2{(πjN(v),πjN(v~)):v,v~∈Vj,(v,v~)∈Ej}.
–
VoutN:=π1N(Vout1)∪π2N(Vout2).
–
For j∈{1,2} and v,v~∈Vj such that (v,v~)∈Ej, let ωπjN(v~)πjN(v)=ωv~v, and set
ΩN:={ωvu:(u,v)∈EN}.
–
For j=1,2 and v∈Vj∖Vin, let θπjN(v)=θv, and set ΘN:={θu:u∈VN∖Vin}.
Informally, the network N is obtained by putting N1 and N2 “side by side”, sharing only the input nodes Vin. As N1 and N2 are non-degenerate, so is N. Moreover, Properties (i) and (ii) of Definition 16 hold for N with πjN:Vj→πj(Vj)⊂VN, for j=1,2.
Thus N is a proto-amalgam of N1 and N2, and so A≠∅.
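The "side by side" construction can be sketched as follows. The dictionary-based network representation (keys `inputs`, `outputs`, `weights`, `biases`) and the tagging of non-input nodes by the index j are assumptions made for illustration; the proof only requires the images of the non-input nodes of N1 and N2 to be disjoint.

```python
def side_by_side(n1, n2):
    """Place two networks side by side, sharing only the input nodes.

    A network is a dict with keys 'inputs' (set), 'outputs' (set),
    'weights' ({(u, v): weight of the edge u -> v}) and 'biases' ({v: bias}).
    Both networks are assumed to have the same input set.
    """
    if n1["inputs"] != n2["inputs"]:
        raise ValueError("networks must share the same input nodes")
    inputs = set(n1["inputs"])
    merged = {"inputs": inputs, "outputs": set(), "weights": {}, "biases": {}}
    for j, net in enumerate((n1, n2), start=1):
        # pi_j fixes the inputs and tags every other node with the index j,
        # so the non-input nodes of N1 and N2 land on disjoint labels.
        def pi(v, j=j):
            return v if v in inputs else (j, v)
        for (u, v), w in net["weights"].items():
            merged["weights"][(pi(u), pi(v))] = w
        for v, theta in net["biases"].items():
            merged["biases"][pi(v)] = theta
        merged["outputs"] |= {pi(v) for v in net["outputs"]}
    return merged

# Two hypothetical networks on the single input x.
n1 = {"inputs": {"x"}, "outputs": {"a"},
      "weights": {("x", "a"): 1.0}, "biases": {"a": 0.5}}
n2 = {"inputs": {"x"}, "outputs": {"b"},
      "weights": {("x", "a"): 2.0, ("a", "b"): -1.0},
      "biases": {"a": 0.0, "b": 1.0}}
merged = side_by_side(n1, n2)
```

The number of non-input nodes of the result equals #(V1∖Vin)+#(V2∖Vin), matching the cardinality of the set S in the construction.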
Now, let A=(VA,EA,VinA,VoutA,ΩA,ΘA)∈A be a network with the least possible number of nodes among all the networks in A, and let πj:Vj→πj(Vj)⊂VA, for j∈{1,2}, be extensional isomorphisms between Nj and the appropriate subnetworks of A. We now show that A is clones-free. To this end, suppose by way of contradiction that c1,c2∈VA are clones. As N1 is clones-free, c1,c2 cannot both be in π1(V1), for otherwise π1−1(c1) and π1−1(c2) would be clones in N1. By the same token, c1,c2 cannot both be in π2(V2). Thus, we may write w.l.o.g. c1=π1(v1) and c2=π2(v2), for some v1∈V1 and v2∈V2.
Now, let A~ be the network obtained from A by making the following alterations:
–
For every edge (c2,v)∈EA, where v∈VA, introduce a new edge (c1,v) together with the associated weight ωvc2, and delete the edge (c2,v).
–
Delete the edges (v,c2)∈EA, as well as the node c2.
–
If c2 was a node in π2(Vout2), then add c1 to the set VoutA.
The network A~ is a proto-amalgam of N1 and N2 via the extensional isomorphisms π~1=π1 and
[TABLE]
But A~ has strictly fewer nodes than A, which contradicts the minimality of A, and thereby establishes that A is clones-free, and hence A is an amalgam of N1 and N2, completing the proof of existence.
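The alteration used in the existence argument (redirecting the outgoing edges of c2 to c1 and deleting c2) can be sketched as below. The network representation and the clone test are assumptions for illustration: two distinct nodes are treated as clones when they have equal biases and identical incoming weights, and the sketch presumes, as in the proof's setting, that c1 and c2 have no common child.

```python
def incoming(net, v):
    """Incoming edges of v as {parent: weight}."""
    return {u: w for (u, x), w in net["weights"].items() if x == v}

def are_clones(net, c1, c2):
    """Assumed clone test: distinct nodes, equal bias, equal incoming weights."""
    return (c1 != c2
            and net["biases"][c1] == net["biases"][c2]
            and incoming(net, c1) == incoming(net, c2))

def merge_clone(net, c1, c2):
    """Return a copy of net in which the clone c2 has been merged into c1."""
    if not are_clones(net, c1, c2):
        raise ValueError("c1 and c2 are not clones")
    weights = {}
    for (u, v), w in net["weights"].items():
        if v == c2:
            continue                              # delete the edges into c2
        edge = (c1, v) if u == c2 else (u, v)     # move (c2, v) to (c1, v)
        if edge in weights:
            raise ValueError("c1 and c2 share a child; not covered here")
        weights[edge] = w
    biases = {v: b for v, b in net["biases"].items() if v != c2}
    outputs = {c1 if v == c2 else v for v in net["outputs"]}
    return {"inputs": set(net["inputs"]), "outputs": outputs,
            "weights": weights, "biases": biases}

# Hypothetical network in which c1 and c2 are clones.
net = {"inputs": {"x"}, "outputs": {"y1", "y2"},
       "weights": {("x", "c1"): 1.0, ("x", "c2"): 1.0,
                   ("c1", "y1"): 2.0, ("c2", "y2"): 3.0},
       "biases": {"c1": 0.5, "c2": 0.5, "y1": 0.0, "y2": 0.0}}
reduced = merge_clone(net, "c1", "c2")
```

The merged network has one node fewer, which is the source of the contradiction with minimality in the proof.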
To establish uniqueness—up to extensional isomorphisms—of the amalgam, suppose that A and A′ are both amalgams of N1 and N2 via extensional isomorphisms πj:Vj→πj(Vj)⊂VA, πj′:Vj→πj′(Vj)⊂VA′, for j∈{1,2}. We first show that
[TABLE]
by induction on lvA(v). If v∈Vin, then (46) holds trivially as the restrictions of the maps πj, πj′, for j∈{1,2}, to the set Vin, both equal the identity map idVin.
Now, let L≥1 and suppose that (46) holds for all u∈π1(V1)∩π2(V2) with lvA(u)<L. Let v∈π1(V1)∩π2(V2) with lvA(v)=L, but otherwise arbitrary, and write wj=(πj′∘πj−1)(v), for j=1,2.
By Property (i) of Definition 16 for the amalgam A we have N1∼eA(π1(Vout1)) and N2∼eA(π2(Vout2)),
and so N1(π1−1(v))∼eA(v) and N2(π2−1(v))∼eA(v) by appropriately restricting π1 and π2.
Similarly, N1((π1′)−1(w1))∼eA′(w1) and N2((π2′)−1(w2))∼eA′(w2).
But (πj′)−1(wj)=πj−1(v), and so Nj((πj′)−1(wj))=Nj(πj−1(v)), for j∈{1,2}. Therefore A′(w1)∼eA(v) and A′(w2)∼eA(v) via π1∘(π1′)−1 and π2∘(π2′)−1, respectively. Now, as A′ is an amalgam, it is clones-free, and thus we deduce that w1=w2, for otherwise w1 and w2 would be clones in A′. This establishes (46).
Now define ψ:VA→VA′ according to
[TABLE]
It follows by (46) that this definition is consistent, in the sense that the two cases in (47) yield the same value for ψ(v) when v∈π1(V1)∩π2(V2). Now, Properties (i) and (ii) of Definition 14 for ψ follow, so ψ is an extensional isomorphism between A and A′, finishing the proof.
∎
Denote by Dσ=C∖P the domain of holomorphy of σ. We proceed by induction on lv(u). In the base case lv(u)=0, i.e., u=vin, the claim is trivially true with Eu=∅. Now suppose that lv(u)≥1, and assume the statement holds for all v∈V with lv(v)<lv(u), i.e., D⟨v⟩σ=C∖Ev, where Ev are closed countable subsets of C∖R. Set Eu=C∖D⟨u⟩σ. We will show that Eu is a closed countable subset of C∖R. To this end, first note that S:=⋃v∈par(u)Ev is a closed countable subset of C∖R, and thus C∖S
is an open connected set containing R.
We claim that if z∗ is a limit point of Eu∖S, then z∗∈S. Suppose otherwise, i.e., there exist a sequence (zn)n∈N of distinct elements of Eu∖S, and a point z∗∈C∖S, such that zn→z∗. Define the function f:C∖S→C, f(z)=∑v∈par(u)ωuv⟨v⟩σ(z)+θu. As the functions ⟨v⟩σ are holomorphic on D⟨v⟩σ, they are, in particular, continuous, and so f is continuous. Therefore f(zn)→f(z∗) as n→∞.
As
[TABLE]
it follows by definition of natural domain that f(zn)∈P, for all n∈N.
Moreover, since P is discrete, we deduce that there exists a point p∗∈P such that f(zn)=p∗, for all sufficiently large n∈N. Now, since C∖S is connected and f is holomorphic, it follows that f(z)=p∗, for all z∈C∖S. But 0∈R⊂C∖S, which thus implies p∗=f(0)=∑v∈par(u)ωuv⟨v⟩σ(0)+θu∈R, contradicting P⊂C∖R. This completes the proof that any limit point of Eu∖S is contained in S.
Now define the sets
EuN:={z∈Eu:∣z∣≤N, d(z,S)≥1/N}, for N∈N,
where d denotes the Euclidean distance in C. We see that EuN is finite, for each N∈N, for otherwise there would exist a sequence (zn)n∈N of distinct elements of EuN converging to a point z∗∈C. But then, by the claim above, we have z∗∈S, which contradicts d(zn,S)≥1/N, for all n∈N. We deduce that Eu=S∪⋃N∈NEuN is a closed countable set, and therefore D⟨u⟩σ=C∖Eu is an open connected set. To see that D⟨u⟩σ⊃R, note that, for z∈R, we have z∈C∖S=⋂v∈par(u)D⟨v⟩σ, and f(z)∈R⊂Dσ, so z∈D⟨u⟩σ.
∎
Let a, δ, and T be as in the statement of the lemma, such that Dk∘(a,δ)⊂U and F∣T≡0. Then the function Fa:=F(⋅+a) is holomorphic on U−a, and Fa∣T−a≡0. Thus, as F∣U≡0 if and only if Fa∣U−a≡0, it suffices to prove the result for a=0. Let T0:=T, Tk:=Dk∘(0,δ), and, for r=1,…,k−1, define the sets
[TABLE]
Note that Tr⊂Dk∘(0,δ)⊂U, for r∈{0,…,k}. We establish by induction over r that F∣Tr≡0, r∈{0,…,k}. The base case F∣T0≡0 holds by assumption. So suppose that F∣Tr≡0, for some r∈{0,…,k−1}.
If 0≤r<k−1, fix arbitrary zj∈(−δ,δ), for j∈{1,…,k−r−1}. Similarly, if 0<r≤k−1, fix arbitrary sj∈D1∘(0,δ), for j∈{k−r+1,…,k}. Consider the function G:D1∘(0,δ)→C defined by
[TABLE]
Note that G is holomorphic, and G∣(−δ,δ)≡0 by the induction hypothesis. Since the zero set of a nonzero holomorphic function in one variable does not have a limit point in the domain, we deduce that G∣D1∘(0,δ)≡0. But zj and sj were arbitrary, so we have F∣Tr+1≡0.
We have thus shown that F is identically zero on an open subset Tk=Dk∘(0,δ) of its connected domain U, and so, by the multivariate identity theorem [19, 1.2.12], it must be identically zero on U.
∎
Let t∗, a, δ, T, and T be as in the statement of the lemma, such that Dk∘(a,δ)⊂U, T⊂(C∖{t∗})×Ck, cl(T)⊃T, and F∣T≡0, and denote V:=D1+k∘(a,δ). The function F(t∗,a)=F(⋅+(t∗,a)) is holomorphic on U−(t∗,a), and the sets
[TABLE]
and T(t∗,a):=T−(t∗,a) satisfy T(t∗,a)⊂(C∖{0})×Ck, cl(T(t∗,a))⊃T(t∗,a), and F(t∗,a)∣T(t∗,a)≡0.
Therefore, as F∣U≡0 if and only if F(t∗,a)∣U−(t∗,a)≡0, and (t∗,a) was arbitrary, it suffices to prove the result for (t∗,a)=(0,0).
Assume by way of contradiction that F∣V is not identically 0. Then, by inspection of the power series expansion of F in the open neighborhood V of (0,0), we obtain that there exists a maximal p∈N0 such that z0−pF(z0,z1,…,zk) is holomorphic in V. Write G(z0,z1,…,zk)=z0−pF(z0,z1,…,zk), with G:V→C holomorphic and not identically 0. Now, due to T⊂(C∖{0})×Ck, we have z0≠0, for every (z0,z1,…,zk)∈T. Moreover, as F∣T≡0, we have G(z0,z1,…,zk)=z0−p⋅0=0, for all (z0,z1,…,zk)∈T. Now, since G is continuous and cl(T)⊃T by assumption, it follows that G(0,z1,…,zk)=0, for all (0,z1,…,zk)∈T. The mapping (z1,…,zk)↦G(0,z1,…,zk) is holomorphic on Dk∘(0,δ) and identically zero on the set
[TABLE]
and so, by Lemma 4, we obtain G(0,z1,…,zk)=0, for all (0,z1,…,zk)∈V.
By inspection of the power series expansion of G in V, we find that G must have the form G(z0,z1,…,zk)=z0(∂G/∂z0)(z0,z1,…,zk). As the function ∂G/∂z0 is holomorphic in V, we have that z0−(p+1)F(z0,…,zk)=(∂G/∂z0)(z0,…,zk) is holomorphic in V, contradicting the maximality of p. Our hypothesis that F∣V is not identically zero must hence be false, i.e., we have F∣V≡0. Finally, by the multivariate identity theorem [19, 1.2.12], we deduce that F∣U≡0.
∎
First note that M is the closure of a one-parameter subgroup of Td=Rd/Zd. Since Td is compact and abelian, so is M. Moreover, M is connected (as the closure of a connected set), and so, by [20, Theorem 11.2], it is itself isomorphic to a torus. It remains to determine its dimension. A character on a compact abelian group G is a continuous group homomorphism χ:G→S1, where S1={z∈C:∣z∣=1} is the multiplicative circle group, and we denote by Ĝ the set of all characters on G.
We claim that
[TABLE]
The inclusion of M in the right-hand side is clear, so we only need to show the reverse inclusion. Note that, since M is closed, Td/M is a Lie group.
We will rewrite the right-hand side of (48) by establishing a bijective correspondence between the characters χ:Td→S1 such that M⊂ker(χ), and the characters f:Td/M→S1. To this end, let π:Td→Td/M be the projection map, and suppose that χ:Td→S1 is a character such that M⊂ker(χ). Then χ factors according to χ=f∘π, for some continuous homomorphism f:Td/M→S1, in other words, f is a character on Td/M. Conversely, for any such f we have that f∘π is a character χ on Td with M⊂ker(χ). Therefore it suffices to show that
[TABLE]
Indeed, if this is the case, then
[TABLE]
as desired. We thus proceed to establishing (49). First note that, as Td is compact, connected, and abelian, so is Td/M, and thus by [20, Theorem 11.2] we have that Td/M is isomorphic (as a Lie group) to the torus Tr of some dimension r≥0. Now suppose that (u1,u2,…,ur)∈Tr is such that f(u1,u2,…,ur)=1, for all characters f:Tr→S1. Our goal is to show that uj=0 mod Z, for all j=1,…,r. For a given j∈{1,…,r} let fj(t1,t2,…,tr)=e2πitj. Since fj:Tr→S1 is a character, we have 1=fj(u1,…,ur)=e2πiuj, and thus uj=0 mod Z. Since this holds for all j, we have (49), and therefore also (48).
Note that any character on Td has the form
[TABLE]
where m=(m1,m2,…,md)∈Zd (this is easily seen for d=1, and follows by induction for other values of d). Now, for any character χm:Td→S1 such that M⊂ker(χm), we have
[TABLE]
by definition of M, which is equivalent to
[TABLE]
It follows immediately that Z={m∈Zd:χm∈T̂d, M⊂ker(χm)} is a free abelian group of rank r=d−k, where k=dim⟨α1,…,αd⟩Q. We can thus pick a basis {m1,…,mr} for Z, and then, for any character χm with m∈Z, we have χm=χm1n1…χmrnr, for some n1,…,nr∈Z. Therefore M is the kernel of the continuous surjective homomorphism Φ:Td→Sr given by Φ=(χm1,…,χmr), and hence its dimension is d−r=k, as desired.
∎
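The dimension count in the proof above can be made concrete on a small example. The sketch assumes that the numbers α1,…,αd are given through their coordinates in some fixed Q-basis (here {1, √2}), so that k=dim⟨α1,…,αd⟩Q is the rational rank of the coordinate matrix; the helper `rational_rank` and the example values are illustrative assumptions, not part of the paper.

```python
from fractions import Fraction

def rational_rank(rows):
    """Rank over Q of a matrix given as a list of rows of rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        # Find a pivot in this column among the rows not yet used.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate the column from all other rows.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

# alpha = (1, sqrt(2), 1 + sqrt(2)), written in the Q-basis {1, sqrt(2)}.
alpha_coords = [[1, 0], [0, 1], [1, 1]]
d = len(alpha_coords)
k = rational_rank(alpha_coords)   # dimension of the torus cl(M)
r = d - k                         # rank of the lattice of integer relations

# One integer relation m with m . alpha = 0, checked coordinate-wise.
m_rel = (1, 1, -1)
relation = [sum(mj * row[c] for mj, row in zip(m_rel, alpha_coords)) for c in range(2)]
```

In this example the three frequencies satisfy a single integer relation, so the orbit closure is a 2-dimensional subtorus of the 3-dimensional torus.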
Let K=kerΦ, and note that M′ is the image of Φ. Further, note that K is an abelian group, and a subgroup of Zk. For j=1,…,k, let Nj∈Z be such that qpjNj∈Z, for all p=1,…,d. Let ej∈Rk be the vector with Nj in the j-th entry, and 0 in all the other entries. Then Φ(ej)=0+Zd, for all j=1,…,k, so E:={e1,…,ek}⊂K. Moreover, E is a basis for Rk, so K is a lattice of rank k. Therefore M′ and Rk/K are isomorphic as groups via the induced map
[TABLE]
Since Φ is a continuous bijection, Rk/K is compact, and Td is Hausdorff, it follows that the map Φ is, in fact, a Lie group isomorphism (when M′ is equipped with the subspace topology inherited from Td). In particular, M′ is a torus of dimension k. Let {b1,…,bk} be a basis for K, and let
[TABLE]
be a fundamental domain of the lattice K. Then, for any u∈Rk we can write u=b+k with b∈B and k∈K. We will prove the lemma with
[TABLE]
where int(B) denotes the interior of B. Note that C is open and 0∈C.
For t∈R we have
[TABLE]
and so M⊂M′. Moreover, by Lemma 6 we have that cl(M) is a torus of dimension k, so we deduce cl(M)=M′. We next establish that cl(MR)=M′, for every R>0. To this end, we distinguish between the cases k=1 and k≥2.
The case k=1. Let (α1t,α2t,…,αdt)+Zd, t∈R, be an arbitrary element of M. As dim⟨α1,…,αd⟩Q=k=1, there exist a∈R∖{0} and m1,…,md∈Z such that (α1,α2,…,αd)=(am1,am2,…,amd). Now let n∈Z be an integer such that t+n/a∉[−R,R]. Then
[TABLE]
Therefore MR=M, and so cl(MR)=cl(M)=M′.
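The periodicity used in the case k=1 can be verified exactly on a small example. The sketch takes a rational value for a so that all arithmetic is exact (in the lemma a is an arbitrary nonzero real), and the integers mj, the time t, and the radius R are hypothetical.

```python
from fractions import Fraction

def torus_point(alpha, t):
    """Return (alpha_1 t, ..., alpha_d t) modulo 1 as exact fractions."""
    return tuple((a * t) % 1 for a in alpha)

a = Fraction(3, 7)                    # hypothetical common factor
m = (2, -5, 4)                        # hypothetical integers m_1, ..., m_d
alpha = tuple(a * mj for mj in m)     # alpha_j = a * m_j

t, R = Fraction(1, 4), 100
n = 1
while abs(t + n / a) <= R:            # push t + n/a outside [-R, R]
    n += 1
shifted = t + n / a

same_point = torus_point(alpha, shifted) == torus_point(alpha, t)
```

Shifting t by n/a changes each coordinate αjt by the integer mjn, so the point on the torus is unchanged while the parameter leaves [−R,R].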
The case k≥2. First note that
[TABLE]
is the image of [−R,R]⊂R under a continuous bijective map from R to Td. Since [−R,R]⊂R is compact and Td is Hausdorff, it follows by [21, Cor. 15.1.7] that LR is homeomorphic to [−R,R]. In particular, LR is a 1-dimensional submanifold of M with boundary. Now, by general properties of the closure, we have cl(MR)=cl(M∖LR)⊃cl(M)∖cl(LR)=M′∖LR.
Therefore, as M′ has dimension k>1 and LR has dimension 1, we have cl(MR)=cl(cl(MR))⊃cl(M′∖LR)=M′. On the other hand, cl(MR)⊂cl(M)=M′, and thus cl(MR)=M′, as desired.
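The density statement cl(MR)=M′ can be probed numerically for k=2. The sketch below uses the rationally independent pair (1, √2) and searches for a parameter t with ∣t∣>R whose orbit point comes close to a prescribed target on the 2-torus. The frequencies, the target, the search range, and the function names are assumptions for illustration only, and the brute-force search is not the argument used in the proof.

```python
import math

def circle_dist(x, y):
    """Distance between x and y on the circle R/Z."""
    d = abs(x - y) % 1.0
    return min(d, 1.0 - d)

def approach(target, R, n_steps=20000):
    """Search for t with |t| > R such that (t, sqrt(2) t) mod 1 is close to target."""
    u1, u2 = target
    best_t, best_dist = None, float("inf")
    start = math.floor(R) + 1
    for n in range(start, start + n_steps):
        t = n + u1                        # first coordinate equals u1 modulo 1
        dist = circle_dist(math.sqrt(2) * t, u2)
        if dist < best_dist:
            best_t, best_dist = t, dist
    return best_t, best_dist

t_found, dist_found = approach((0.3, 0.7), R=50.0)
```

Repeating the search with larger R yields parameters further out along the orbit that still approach the same target, which is the behavior the lemma extracts in the form of the sequences (tn,s)n∈N.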
Now fix some s=(u1/α1,…,uk/αk)∈C, where u=(u1,…,uk)∈int(B). Since MR is dense in M′, for every R>0, there exists a sequence (tn,s)n∈N in R with ∣tn,s∣→∞ such that
[TABLE]
As M⊂M′, there exists a sequence (un,s)n∈N such that
and after applying the isomorphism Φ−1, we obtain un,s+K→u+K
as n→∞. Now, for each n∈N, let u~n,s=(u~1n,s,…,u~kn,s)∈B be such that u~n,s−un,s∈K. Then we have u~n,s+K→u+K as n→∞. Since u∈int(B), there exists an n0∈N such that u~n,s∈int(B), for n≥n0. By discarding the first n0 terms of the sequences (tn,s)n∈N and (u~n,s)n∈N, we may assume w.l.o.g. that n0=0. It follows that u~n,s→u as n→∞. Now define rn,s=(u~1n,s/α1,…,u~kn,s/αk). We then have rn,s∈C, rn,s→s, and (53) yields
[TABLE]
as desired.
∎
Bibliography
[1] H. J. Sussman, “Uniqueness of the weights for minimal feedforward nets with a given input-output map,” Neural Networks, vol. 5, no. 4, pp. 589–593, July 1992.
[2] F. Albertini, E. D. Sontag, and V. Maillot, “Uniqueness of weights for neural networks,” Artificial Neural Networks for Speech and Vision, pp. 113–125, 1993.
[3] C. Fefferman, “Reconstructing a neural net from its output,” Revista Matemática Iberoamericana, vol. 10, no. 3, pp. 507–555, 1994.
[4] Y. Le Cun, L. D. Jackel, L. Bottou, A. Brunot, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. A. Müller, E. Säckinger, P. Simard, and V. Vapnik, “Comparison of learning algorithms for handwritten digit recognition,” International Conference on Artificial Neural Networks, pp. 53–60, 1995.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25. Curran Associates, Inc., 2012, pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[6] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, 2012.
[7] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[8] H. Bölcskei, P. Grohs, G. Kutyniok, and P. Petersen, “Optimal approximation with sparsely connected deep neural networks,” SIAM Journal on Mathematics of Data Science, vol. 1, no. 1, pp. 8–45, 2019.