When Can Neural Networks Learn Connected Decision Regions?
Trung Le
Dinh Phung
Abstract
Previous work has examined the conditions under which the decision
regions of a neural network are connected and has shown the implications
of the corresponding theory for the problem of adversarial manipulation
of classifiers. It has been proven that, for a class of activation
functions including leaky ReLU, neural networks with a pyramidal
structure (that is, no layer has more hidden units than the input dimension)
necessarily produce connected decision regions. In this paper, we
advance this important result by developing sufficient
and necessary conditions under which the decision regions of a neural
network are connected. We then apply our framework to overcome the
limits of existing work and study the capacity of neural networks to learn
connected regions for a much wider class of activation functions,
including those widely used, namely ReLU, sigmoid, tanh, softplus,
and the exponential linear function.
Keywords: Generative Model, Adversarial Learning.
1 Introduction
Deep learning has achieved transformative success in a wide variety
of application domains, notably computer vision (Krizhevsky et al., 2012),
natural language processing (Bahdanau et al., 2014), speech recognition
(Graves et al., 2013), and generative models (Kingma & Welling, 2013; Goodfellow et al., 2014).
While applied deep learning has been fueled by these successful
applications, important theoretical investigations have generally lagged
behind.
Theoretical studies go hand in hand with practice, providing
insights that help us train and tame deep learning models. Several important
theoretical questions have been studied intensively in the literature.
These include the representation power of neural networks with respect
to their depth and width, the landscape of the loss surfaces of deep
learning networks, and the capacity to learn connected regions in
the input data space. The first question relates to the design of
architectures for neural networks; the second concerns the
training of deep learning models; and the last has
important implications for the study of the generation of adversarial
samples.
The first important progress in the study of the representation power
of deep NNs was the universal approximation theorems (Cybenko, 1989; Hornik et al., 1989),
which state that a feed-forward network with a single hidden layer
containing a finite number of neurons can approximate continuous functions
on compact subsets of Rd, under mild assumptions on
the activation function. Other subsequent works (Delalleau & Bengio, 2011; Eldan & Shamir, 2016; Safran & Shamir, 2017; Mhaskar & Poggio, 2016; Liang & Srikant, 2016; Yarotsky, 2017; Poggio et al., 2017)
have been proposed to analyze the representation power of neural
networks w.r.t their depth. In particular, it has been shown that
there exist functions that can be computed efficiently by deep networks
of linear or polynomial size but require exponential size for shallow
networks. Last but not least, some recent works have studied the power
of width efficiency (Lu et al., 2017; Hanin & Sellke, 2017).
In particular, these works have indicated that neural networks with
ReLU activation function have to be wide enough in order to have the
universal approximation property as depth increases. More specifically,
the authors prove that the class of continuous functions on a compact
set cannot be arbitrarily well approximated by an arbitrarily deep
network if the maximum width of the network is not larger than the
input dimension d.
Regarding the second question on the landscape of the loss surfaces
of deep learning networks, there have been several interesting results
recently (Brutzkus & Globerson, 2017; Poggio & Liao, 2017; Rister & Rubin, 2017; Soudry & Hoffer, 2017).
For some classes of networks it can be shown that the global optimum
can be obtained efficiently. However, due to the requirement of knowledge
about the data generating measure, or the strict specification of
the neural network structure and optimization objective formulation
(Gautier et al., 2016), these approaches are generally not practical
(Janzamin et al., 2015; Soltanolkotabi, 2017). Another class
of networks whose every local minimum is also a global minimum has
been shown to be deep linear networks (Baldi & Hornik, 1989; Kawaguchi, 2016).
While this is a highly non-trivial result because the optimization problem
is non-convex, deep linear networks are generally less preferable
in practice since they are limited to the linear function regime. In order
to characterize the loss surface of general networks, an interesting
approach was taken by Choromanska et al. (2015). By randomizing
the nonlinear part of a feedforward network with ReLU activation function
and making some additional simplifying assumptions, the authors
map it to a certain spin glass model that can be analyzed analytically.
In particular, the local minima are shown to be close to the global
optimum and the number of bad local minima decreases quickly with
the distance to the global optimum. Recently, the works of (Nguyen & Hein, 2017, 2018)
have shown that for deep neural networks with a very wide layer, where
the number of hidden units is larger than the number of training points,
a large class of local minima is globally optimal, which generalizes
the previous work of (Yu & Chen, 1995).
The theoretical question of the capacity of deep networks to learn
connected decision regions is particularly important and has
recently been addressed by Nguyen et al. (2018). In particular, they
showed that for a feed-forward neural network with a pyramidal architecture,
full-rank weight matrices, and strictly monotonically increasing
continuous activation functions σ with σ(R)=R
at each layer, the decision regions are connected. While this work
pioneered preliminary results for this problem, its theoretical
analysis only holds for a fairly narrow class of activation functions,
notably including the leaky ReLU, which is less widely used in practice.
It is hence important to ask what the necessary and sufficient conditions are
under which a feedforward neural network's decision regions are connected,
and whether the theory can be extended to a much wider class of activation
functions including those widely used in practice, such as ReLU, sigmoid,
tanh, softplus, and the exponential linear function. Our goal in this paper
is to advance the theory of the previous work (Nguyen et al., 2018)
by answering these questions. Specifically, we first propose sufficient
and necessary conditions under which a feedforward neural network's
decision regions are connected, and then build on these conditions
to study when a feedforward neural network with the aforementioned popular
activation functions can learn connected decision regions.
2 Related Background
We briefly introduce the convention used to describe feedforward neural
networks, followed by the definition of a path-connected set and related
properties.
2.1 Feedforward Neural Networks
We consider feedforward neural networks for the multi-class classification
problem. Let us denote the number of classes by M (i.e., the class
label y∈{1,2,…,M}) and the input dimension
by d (i.e., the data sample x∈Rd). Let us consider
a feedforward neural network with L layers wherein the input layer
is indexed by 0 and the output layer is indexed by L. We further
denote the width of layer k (i.e., 0≤k≤L) by nk.
For consistency, we enforce the constraints n0=d and nL=M.
For each hidden layer k (i.e., 1≤k≤L−1), we define the
activation function for this layer as σk:R→R.
We also define the feature map function over the layer k (0≤k≤L)
as a function fk:Rd→Rnk, which
computes for every input x∈Rd a feature vector
at layer k defined recursively as:
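$$f_k(x)=\begin{cases}x, & k=0,\\ \sigma_k\left(W_kf_{k-1}(x)+b_k\right), & 1\le k\le L-1,\\ W_Lf_{L-1}(x)+b_L, & k=L,\end{cases}$$
where Wk∈Rnk×nk−1 is the weight matrix
and bk∈Rnk is the bias vector at the layer
k, and σk is applied element-wise.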
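The recursive definition above can be sketched in a few lines of plain Python. This is only an illustrative sketch: it assumes the usual convention that hidden layers apply the activation element-wise to an affine map and that the output layer L is purely affine, and the names `affine`, `feature_map`, and the tiny 2-2-2 network are hypothetical choices made for the example.

```python
import math

def affine(W, b, x):
    """Return W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def feature_map(x, weights, biases, activations, k):
    """Compute f_k(x): f_0(x) = x, hidden layers apply sigma_k, layer L is affine only."""
    L = len(weights)
    out = list(x)
    for layer in range(1, k + 1):
        out = affine(weights[layer - 1], biases[layer - 1], out)
        if layer < L:  # assumption: no activation on the output layer L
            out = [activations[layer - 1](t) for t in out]
    return out

# Hypothetical network with d = 2, one hidden layer of width 2, and M = 2 classes.
weights = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, -1.0], [-1.0, 1.0]]]
biases = [[0.0, 0.0], [0.0, 0.0]]
activations = [math.tanh]

print(feature_map([0.5, -0.5], weights, biases, activations, 2))
```

Calling `feature_map` with `k=0` returns the input unchanged, which mirrors the convention n0=d.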
2.2 Activation Functions
We consider a range of the activation functions widely used in deep
learning.
Sigmoid function
The sigmoid function squashes its input into the range (0;1):
$$\sigma(t)=\frac{1}{1+e^{-t}}$$
Tanh function
The tanh function squashes its input into the range (−1;1):
$$\sigma(t)=\frac{e^{t}-e^{-t}}{e^{t}+e^{-t}}$$
ReLU function
The ReLU function squashes its input into the range [0;+∞):
$$\sigma(t)=\max\{0,t\}$$
Leaky ReLU function
The leaky ReLU function maps its input onto the range (−∞;+∞):
$$\sigma(t)=\max\{t,\alpha t\}$$
where 0<α<1.
Softplus function
The softplus function squashes its input into the range (0;+∞):
$$\sigma(t)=\log\left(1+e^{t}\right)$$
Exponential linear function
The exponential linear function squashes its input into the range
(−α;+∞):
$$\sigma(t)=\begin{cases}t, & t\ge 0,\\ \alpha\left(e^{t}-1\right), & t<0.\end{cases}$$
We note that, except for the ReLU function, all of the above activation functions
are continuous bijections from R onto their ranges.
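A quick numerical sanity check of the remark on bijectivity can be sketched as follows. The formulas are the standard textbook definitions of these activations; the slope `alpha=0.1` for leaky ReLU, the scale `alpha=1.0` for ELU, and the sample grid are arbitrary assumptions. Strict monotonicity on a grid is only evidence of injectivity, not a proof.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def relu(t):
    return max(0.0, t)

def leaky_relu(t, alpha=0.1):  # alpha in (0, 1) is an assumed example value
    return max(t, alpha * t)

def softplus(t):
    return math.log1p(math.exp(t))

def elu(t, alpha=1.0):  # alpha > 0 is an assumed example value
    return t if t >= 0 else alpha * (math.exp(t) - 1.0)

def strictly_increasing(fn, grid):
    """True if fn is strictly increasing along the sorted sample grid."""
    values = [fn(t) for t in grid]
    return all(a < b for a, b in zip(values, values[1:]))

GRID = [-5.0 + 0.5 * i for i in range(21)]  # samples in [-5, 5]

for name, fn in [("sigmoid", sigmoid), ("tanh", math.tanh), ("relu", relu),
                 ("leaky_relu", leaky_relu), ("softplus", softplus), ("elu", elu)]:
    print(name, strictly_increasing(fn, GRID))
```

Only ReLU fails the check, because it is constant on the negative half-line and therefore not injective.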
2.3 Mapping Functions
Let f:U→V be a map from U⊂Rm to V⊂Rn.
We denote dom(f)=U and range(f)=f(U)={v∣v=f(u) for some u∈U}.
Given a subset A⊂U, the image f(A) of this
set via the map f is defined as:
$$f(A)=\{v\mid v=f(u)\text{ for some }u\in A\}.$$
Definition 1.
(Pre-image) Given a map f:U→V, the preimages of
an element v∈V and a subset A⊂V via this map are
defined as
$$f^{-1}(v)=\{u\in U\mid f(u)=v\},\qquad f^{-1}(A)=\{u\in U\mid f(u)\in A\}.$$
Proposition 2.
Let f:U→V, g:V→T with
U⊂Rm,V⊂Rn,T⊂Rp,
and A⊂Rp. Then we have
$$(g\circ f)^{-1}(A)=f^{-1}\left(g^{-1}(A)\right).$$
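Assuming Proposition 2 is the usual composition rule for preimages, (g∘f)^{-1}(A)=f^{-1}(g^{-1}(A)), it can be illustrated on finite sets. The helper `preimage` and the toy maps `f` and `g` below are hypothetical and serve only to make the identity concrete.

```python
def preimage(fn, domain, subset):
    """Return the elements of the finite domain that fn maps into subset."""
    return {u for u in domain if fn(u) in subset}

U = set(range(-3, 4))        # toy domain of f
f = lambda u: u * u          # f : U -> V
V = {f(u) for u in U}        # V = f(U) = {0, 1, 4, 9}
g = lambda v: v + 1          # g : V -> T
A = {2, 5}                   # a subset of the codomain of g

lhs = preimage(lambda u: g(f(u)), U, A)   # (g o f)^{-1}(A)
rhs = preimage(f, U, preimage(g, V, A))   # f^{-1}(g^{-1}(A))
print(sorted(lhs), sorted(rhs))
```

Both sides give the same set, here {−2, −1, 1, 2}.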
2.4 Connectivity of Decision Regions
We briefly recap the definition and properties of path connectivity
used in the sequel. We also recall the key theoretical results
reported in (Nguyen et al., 2018).
Definition 3.
(Path-connected) Consider Rm with the standard
topology. A subset A⊂Rm is said to be path-connected
if for every u,v∈A, there exists a continuous map f
from [0;1] to A, i.e., f:[0;1]→A
such that f(0)=u and f(1)=v.
Corollary 4.
If g:U→V is a continuous map and A⊂U is a path-connected
set then g(A) is also a path-connected set.
Corollary 5.
If g:U→V is a continuous bijection and B⊂V is
a path-connected set then g−1(B) is also a path-connected
set.
With reference to the description of feedforward neural networks in
Section 2.1, we now present
the definition of decision region for each class whose connectivity
is central to our theory.
Definition 6.
(Decision region) Given a neural network with L layers,
the decision region of a given class 1≤m≤M, denoted by
Cm, is defined as
$$C_m=\left\{x\in\mathbb{R}^d\mid [f_L(x)]_m>[f_L(x)]_j,\ \forall j\neq m\right\}.$$
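Membership in a decision region can be sketched as a strict arg-max test on the output vector. The sketch assumes the standard definition, in which x belongs to Cm exactly when the m-th output of fL(x) is strictly larger than every other output; this is consistent with the half-spaces Hj={o∣om>oj} used in Section 3.2. The function names are hypothetical.

```python
def in_decision_region(outputs, m):
    """True if class m (1-indexed) has the strictly largest network output."""
    return all(outputs[m - 1] > o
               for j, o in enumerate(outputs, start=1) if j != m)

def predicted_class(outputs):
    """Return the class whose region contains the input, or None on a tie."""
    for m in range(1, len(outputs) + 1):
        if in_decision_region(outputs, m):
            return m
    return None

print(predicted_class([0.2, 1.5, -0.3]))
print(predicted_class([1.0, 1.0, 0.0]))
```

Inputs whose top outputs tie belong to no decision region, which is why the regions are open sets under this definition.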
We now recall the main results studied in (Nguyen et al., 2018).
Theorem 7.
(Nguyen et al., 2018) Let the widths of the layers
of the feedforward neural network satisfy d=n0≥n1≥n2≥⋯≥nL−1,
let σl:R→R be a continuous, strictly
monotonically increasing activation function with σl(R)=R
for every layer 1≤l≤L−1, and let all the weight matrices $(W_l)_{l=1}^{L-1}$
have full rank. Then every decision region Cm is an open connected
subset of Rd for every 1≤m≤M.
3 Main Theoretical Results
3.1 Notations
We denote by $\mathbf{1}\in\mathbb{R}^n$ the vector of all ones, by
$\mathbf{1}_k\in\mathbb{R}^n$ the one-hot vector with 1 at
the k-th index and 0 elsewhere, and by $\mathbf{0}$ the vector
of all zeros. Given a vector $u\in\mathbb{R}^n$ and $1\le i\le j\le n$,
$u_{i:j}$ is defined as the sub-vector $[u_k]_{i\le k\le j}$.
Given two vectors $u,v\in\mathbb{R}^n$, the segment $[u,v]$
connecting u and v is defined as $[u,v]=\{x=(1-t)u+tv\mid t\in[0;1]\}$.
A set $A\subset\mathbb{R}^n$ is said to be a convex set if the
segment $[u,v]\subset A$ for every $u,v\in A$.
We say that $u\le v$ if and only if $u_i\le v_i$ for every
$1\le i\le n$; the other operators, namely ≥, <, and >, are
defined in a similar element-wise manner. We define $\max\{u,v\}=[\max\{u_i,v_i\}]_{i=1}^{n}$
and $\min\{u,v\}=[\min\{u_i,v_i\}]_{i=1}^{n}$.
We also define the closed rectangles
$\overline{\text{Rect}}(u,v)=\{x\in\mathbb{R}^n\mid\min\{u,v\}\le x\le\max\{u,v\}\}$ and
$\overline{\text{Rect}}(u)=\{x\in\mathbb{R}^n\mid u\le x\}$,
and the open rectangles $\text{Rect}(u,v)=\{x\in\mathbb{R}^n\mid\min\{u,v\}<x<\max\{u,v\}\}$ and
$\text{Rect}(u)=\{x\in\mathbb{R}^n\mid u<x\}$.
It is well-known that for a finite-dimensional normed space Rn,
all norms are equivalent (See Theorem 2.2.16 in (Hsing & Eubank, 2015)),
hence inducing the same topology. We use the standard topology on
Rn to imply this identical topology which can be induced
by any norm in this space. Consider Rn with the standard
topology and with the norm ∥⋅∥. An open ball with the
center x and the radius r>0 is defined as B(x,r)={y∈Rn∣∥y−x∥<r}.
Based on the standard topology on Rn, we denote the
closure of A by cl(A), which is the smallest
closed superset of A, and the interior of A by int(A),
which is the largest open subset of A.
3.2 Theoretical Results
In this section, we present our main theory for the path connectivity
of decision regions induced by a feedforward neural network. We start
this section with the definition of the piecewise connectivity.
Definition 8.
(Piecewise-connected) Consider Rm with the
standard topology. A subset A⊂Rm is said to be
a piecewise-connected set if for every u,v∈A, there exists
a sequence of elements x1=u,x2,…,xn=v
in A such that the segments [xi,xi+1]⊂A
for every 1≤i≤n−1.
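A chain of waypoints as in Definition 8 gives a continuous path by linear interpolation between consecutive points. The sketch below is one possible parametrisation, chosen for illustration: it assumes t lies in [0, 1] and splits that interval into n−1 equal parts. The name `polyline_path` is hypothetical.

```python
def polyline_path(points, t):
    """Evaluate at t in [0, 1] the piecewise-linear path through the waypoints."""
    n = len(points)
    if n == 1 or t >= 1.0:
        return list(points[-1])
    s = t * (n - 1)      # position measured in units of segments
    i = int(s)           # index of the current segment (0-based)
    lam = s - i          # fraction travelled along that segment
    return [(1 - lam) * a + lam * b for a, b in zip(points[i], points[i + 1])]

waypoints = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
print(polyline_path(waypoints, 0.0))
print(polyline_path(waypoints, 0.25))
print(polyline_path(waypoints, 1.0))
```

If every segment between consecutive waypoints lies in A, the whole path lies in A, which is the easy direction of the equivalence between the two notions of connectivity.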
In the following theorem, we study the theoretical relationship between
path connectivity and piecewise connectivity. It turns out that, for open
subsets of Rm with the standard topology, these two concepts of
connectivity are equivalent. To prove this central theorem, we need
the following lemmas.
Lemma 9.
Let B1=B(x1,r1) and B2=B(x2,r2)
be two intersecting balls (i.e., B1∩B2≠∅). Then the
segment [x1,x2]⊂B1∪B2.
Proof.
Let x=(1−t)x1+tx2 with 0≤t≤1.
We have ∥x−x1∥=t∥x1−x2∥ and ∥x−x2∥=(1−t)∥x1−x2∥.
Moreover, taking any z∈B1∩B2, the triangle inequality gives
∥x1−x2∥≤∥x1−z∥+∥z−x2∥<r1+r2.
Then
$$\|x-x_1\|+\|x-x_2\|=\|x_1-x_2\|<r_1+r_2.$$
Hence, either ∥x−x1∥<r1 or ∥x−x2∥<r2,
which implies x∈B1∪B2.
∎
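Lemma 9 can be checked numerically by sampling the segment between the two centres. This is a sanity check under the Euclidean norm, not a proof; the function name and the sample count are assumptions made for the example.

```python
import math

def segment_in_union_of_balls(x1, r1, x2, r2, samples=101):
    """Check on sample points that [x1, x2] lies in B(x1, r1) or B(x2, r2)."""
    for i in range(samples):
        t = i / (samples - 1)
        x = [(1 - t) * a + t * b for a, b in zip(x1, x2)]
        if not (math.dist(x, x1) < r1 or math.dist(x, x2) < r2):
            return False
    return True

# Intersecting balls: the centres are 1.5 apart and the radii sum to 2.
print(segment_in_union_of_balls((0.0, 0.0), 1.0, (1.5, 0.0), 1.0))
# Disjoint balls: the midpoint of the segment lies in neither ball.
print(segment_in_union_of_balls((0.0, 0.0), 1.0, (3.0, 0.0), 1.0))
```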
Lemma 10.
Let A⊂Rm be a path-connected subset, u,v∈A,
and f:[0;1]→A a continuous function with f(0)=u, f(1)=v.
Let P,Q be two open sets such that u∈P, v∈Q, and f([0;1])⊂P∪Q.
Then P∩Q≠∅.
Proof.
Since P,Q are two open sets and f is continuous, $f^{-1}(P)$ and $f^{-1}(Q)$
are open in [0;1], and both sets are non-empty
because $0\in f^{-1}(P)$ and $1\in f^{-1}(Q)$.
Moreover, $f^{-1}(P)\cup f^{-1}(Q)=[0;1]$.
Therefore, $f^{-1}(P)\cap f^{-1}(Q)\neq\emptyset$,
because otherwise [0;1] would be the union of two disjoint non-empty
open sets and hence not connected. Taking any t in this intersection gives
f(t)∈P∩Q, so P∩Q≠∅.
∎
Theorem 11.
Consider Rm with the standard topology. An open subset
A⊂Rm is path-connected if and only if it is piecewise-connected.
Proof.
We prove both directions of this theorem.
Assume that A is piecewise-connected. Given two elements u,v
in A, there exists a sequence of elements x1=u,x2,…,xn=v
in A such that the segments [xi,xi+1]⊂A
for every 1≤i≤n−1. Let us consider the following function
that maps from [0;1] to A:
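$$f(t)=\sum_{i=1}^{n-1}\mathbb{1}_{\frac{i-1}{n-1}\le t<\frac{i}{n-1}}(t)\left[\left(i-(n-1)t\right)x_i+\left((n-1)t-i+1\right)x_{i+1}\right]+\mathbb{1}_{t=1}(t)\,x_n,$$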
where 1S(t) returns 1 if the statement
S is true and 0 otherwise.
This function is continuous, f(0)=x1=u, f(1)=xn=v,
and f([0;1])⊂A. This implies that A
is also path-connected.
We now assume that A is path-connected. Given two elements u,v
in A, there exists a continuous function f mapping [0;1]
to A such that the arc f([0;1]) connecting
u and v lies in A. Since [0;1] is a compact
set and f is continuous, the arc f([0;1])
is a compact set in Rm. Since A is an open set,
for each x∈A there exists an open ball B(x,rx)⊂A.
We consider I={x∈A∣B(x,rx)∩f([0;1])≠∅}.
It is obvious that f([0;1])⊂I, hence
the collection {B(x,rx)∣x∈I}
is an open cover of f([0;1]). From the
compactness of f([0;1]), there exists a
finite subcover {B(x,rx)∣x∈J}
where J⊂I is finite. Without loss of generality, we assume
that u,v∈J because otherwise we can extend J. We now
construct a graph G=(V,E) where the set of vertices
V⊂J and the set of edges E are both initialized to ∅
and built up incrementally as follows. We first set V={z1}
where z1=u. We then set P=B(z1,rz1)
and Q=∪x∈J\VB(x,rx). Clearly,
P,Q are two open sets satisfying the conditions in Lemma
10, hence P∩Q≠∅, which implies that there
exists z2∈J\V such that P∩B(z2,rz2)≠∅.
We then add z2 to V and the edge z1z2
to E. In general, at each step we define P=∪x∈VB(x,rx)
and Q=∪x∈J\VB(x,rx). As long as v∉V, the two
open sets P,Q satisfy the conditions in Lemma 10,
hence P∩Q≠∅. We now consider two cases:
Case 1: B(x1,rx1)∩B(v,rv)≠∅
for some x1∈V. We set zn+1=v where n=∣V∣,
then add zn+1 to V as well as the edge x1zn+1
to E, and stop the algorithm.
Case 2: B(x1,rx1)∩B(x2,rx2)≠∅
for some x1∈V, x2∈J\V, but B(v,rv)∩P=∅.
We set zn+1=x2 where n=∣V∣, then add zn+1
to V as well as the edge x1zn+1 to E,
and continue the algorithm.
It is worth noting that the graph G=(V,E) constructed
by the above algorithm is always a connected tree. In addition,
due to the finiteness of J, the algorithm must
halt, and it ends with v∈V. We now consider the path u=z1=zt0,zt1,…,ztk−1,ztk=v
connecting u and v in G. By the construction of this
graph, we have B(ztj,rztj)∩B(ztj+1,rztj+1)≠∅
for j=0,1,…,k−1. Using Lemma 9, we obtain [ztj,ztj+1]⊂B(ztj,rztj)∪B(ztj+1,rztj+1)⊂A.
This concludes that A is a piecewise-connected set.
∎
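The second half of the proof is constructive: given a finite cover of the arc by open balls, it links the ball around u to the ball around v through pairwise intersecting balls. The sketch below finds such a chain with a breadth-first search over the intersection graph. Note that this swaps in BFS for the incremental tree construction used in the proof; the two agree in that both only ever join balls that intersect. Function and variable names are hypothetical, and the Euclidean norm is assumed.

```python
import math
from collections import deque

def ball_chain(centers, radii, start, goal):
    """Return indices of a chain of pairwise intersecting open balls, or None."""
    def intersects(i, j):
        # Two open balls intersect iff the centre distance is below the radius sum.
        return math.dist(centers[i], centers[j]) < radii[i] + radii[j]

    parent = {start: None}
    queue = deque([start])
    while queue:
        i = queue.popleft()
        if i == goal:
            break
        for j in range(len(centers)):
            if j not in parent and intersects(i, j):
                parent[j] = i
                queue.append(j)
    if goal not in parent:
        return None
    chain = []
    node = goal
    while node is not None:
        chain.append(node)
        node = parent[node]
    return chain[::-1]

centers = [(0.0, 0.0), (1.5, 0.0), (3.0, 0.0), (10.0, 10.0)]
radii = [1.0, 1.0, 1.0, 1.0]
print(ball_chain(centers, radii, 0, 2))
print(ball_chain(centers, radii, 0, 3))
```

By Lemma 9, the segments joining consecutive centres of the returned chain stay inside the union of the balls, so the centres form the waypoints of a piecewise-linear path.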
Lemma 12.
Let h:U→V be an onto affine map with U⊂Rm,
V⊂Rn, and h(u)=Wu+b. Let
B⊂V be an open path-connected subset of V. Then A=h−1(B)
is an open path-connected subset of U.
Proof.
The set A is open because h is continuous and B is open.
Let u1,u2∈A; then v1=h(u1)∈B
and v2=h(u2)∈B. By Theorem 11, the open path-connected set B is also
piecewise-connected, so there exist y1=v1,y2,…,yn−1,yn=v2
such that [yi,yi+1]⊂B for 1≤i≤n−1.
Since h is onto, there exists xi∈U such
that h(xi)=yi for every 1≤i≤n, and we may take x1=u1 and xn=u2.
In addition, since h is affine, h([xi,xi+1])=[h(xi),h(xi+1)]=[yi,yi+1]⊂B for all i.
It follows that [xi,xi+1]⊂h−1(B)=A for all i.
We conclude that A is an open piecewise-connected, and hence path-connected, set.
∎
Lemma 13.
Let h:U→V be an onto affine map with U⊂Rm,
V⊂Rn, and h(u)=Wu+b. Let
B⊂V be a convex subset of V. Then A=h−1(B)
is a convex subset of U.
Proof.
Let u1,u2∈A; then v1=h(u1)∈B
and v2=h(u2)∈B. Due to the convexity of
B, the segment [v1,v2]⊂B. In addition,
since h is affine, h([u1,u2])=[h(u1),h(u2)]=[v1,v2]⊂B.
It follows that [u1,u2]⊂h−1(B)=A.
We conclude that A is a convex set.
∎
Lemma 14.
Let h:U→V be an affine map with
U⊂Rm, V⊂Rn, and h(u)=Wu+b.
Let A⊂U be a convex subset of U. Then B=h(A)⊂V
is a convex subset of V.
Proof.
Let v1=h(u1)∈B and v2=h(u2)∈B
where u1,u2∈A. From the convexity of A, the segment
[u1,u2]⊂A. Since h is affine, we have
[v1,v2]=[h(u1),h(u2)]=h([u1,u2])⊂B.
It follows that B is convex.
∎
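The segment identity h([u1,u2])=[h(u1),h(u2)] used in the proofs of Lemmas 12 to 14 follows directly from the affine form h(u)=Wu+b. A short derivation, using only the definition of a segment from Section 3.1:

```latex
\begin{align*}
h\big((1-t)u_1+tu_2\big) &= W\big((1-t)u_1+tu_2\big)+b \\
&= (1-t)\,(Wu_1+b)+t\,(Wu_2+b) \\
&= (1-t)\,h(u_1)+t\,h(u_2), \qquad t\in[0;1].
\end{align*}
```

As t ranges over [0;1], the left-hand side sweeps h([u1,u2]) while the right-hand side sweeps [h(u1),h(u2)], so the two sets coincide.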
Lemma 15.
Let g:U→V be an onto continuous map
with U⊂Rm, V⊂Rn and B⊂V
be a subset of V. If B is not path-connected, then A=g−1(B)
is not path-connected either.
Proof.
This follows from the fact that if A is path-connected then
B=g(A) is also path-connected.
∎
Lemma 16.
Let σ:R→R be a
bijective, continuous activation function. Define σ^:Rn→Rn
as σ^(x)=[σ(x1)…σ(xn)]T
where x=[x1…xn]T. Let V⊂σ^(Rn)
be a path-connected set. Then U=σ^−1(V)
is also a path-connected set.
Proof.
This is trivial from the fact that σ^−1:V→U is
a continuous bijective map and V is path-connected.
∎
Lemma 17.
Let g:U→V be a continuous map with
U⊂Rm, V⊂Rn. Let A⊂B⊂U
such that cl(A)=B and g(B) is closed,
then g(A)⊂g(B)⊂V and cl(g(A))=g(B).
Proof.
We first have g(A)⊂g(B)⊂V,
hence cl(g(A))⊂cl(g(B))=g(B).
Now let v=g(u)∈g(B) with u∈B.
Since cl(A)=B, there exists a sequence [un]n⊂A
with limn→∞un=u. From the continuity of g,
we obtain limn→∞g(un)=g(u)=v.
It follows that v∈cl(g(A)), i.e.,
g(B)⊂cl(g(A)).
∎
We are now in a position to state the necessary and sufficient conditions
under which decision regions for classes are path-connected (cf. Theorem
18). To support the theorem, we further introduce
the set Dm defined as
$$D_m=\left\{o\in\mathbb{R}^M\mid o_m>o_j,\ \forall j\neq m\right\}.$$
It is clear that Dm is an open convex set since it is formed by the
intersection of the M−1 open half-spaces Hj={o∈RM∣om>oj}, j≠m.
Theorem 18.
Let fL(x) be a feedforward neural network whose activation functions σk, 1≤k≤L−1,
are continuous bijections (e.g., σk can be the sigmoid, tanh, leaky ReLU,
softplus, or exponential linear activation function). Then, for every 1≤m≤M,
the decision region Cm is an open path-connected set if and only if fL(Rd)∩Dm
is an open path-connected set.
Proof.
Given that Cm and fL(Rd)∩Dm are
open sets, we prove that if fL(Rd)∩Dm
is a path-connected set, then so is Cm, and that if fL(Rd)∩Dm
is not a path-connected set, then neither is Cm.
Let us denote A1=h1(Cm) where h1(⋅)=W1×⋅+b1
and B1=σ1^(A1)=f1(Cm).
The sets A2,B2 are defined based on B1 as A2=h2(B1)
where h2(⋅)=W2×⋅+b2 and B2=σ2^(A2).
In general, the sets Ak,Bk,∀2≤k≤L−1 are
defined recursively as Ak=hk(Bk−1) where hk(⋅)=Wk×⋅+bk
and Bk=σk^(Ak). Finally, we define
AL=hL(BL−1) where hL(⋅)=WL×⋅+bL.
We now prove that AL=fL(Rd)∩Dm,
hence fL(⋅) is an onto map from Cm to AL.
In fact, taking any o∈fL(Rd)∩Dm,
there exists x∈Rd such that fL(x)=o
and it is also obvious that x∈Cm from the definition of
Dm.
Since $B_{L-1}=h_L^{-1}(A_L)$ and $h_L(\cdot)$
is an affine map, $B_{L-1}$ is an open path-connected set. Using
the fact that $A_{L-1}=\hat{\sigma}_{L-1}^{-1}(B_{L-1})$
and $\sigma_{L-1}$ is a continuous bijection, $A_{L-1}$ is also
an open path-connected set (Lemma 16). Using the
fact that $A_{L-1}=h_{L-1}(B_{L-2})$ and $h_{L-1}(\cdot)$
is an affine map, we conclude that $B_{L-2}$ is an open path-connected set.
Using the fact that $A_{L-2}=\hat{\sigma}_{L-2}^{-1}(B_{L-2})$
and $\sigma_{L-2}$ is a continuous bijection, $A_{L-2}$ is also
an open path-connected set (Lemma 16). Repeating
this argument backward through the layers of the neural network, we obtain that
$A_1,B_1$ are open path-connected sets. Finally, since $A_1=h_1(C_m)$
and $h_1(\cdot)$ is an affine map, we reach the conclusion
that $C_m$ is an open path-connected subset of $\mathbb{R}^d$.
The converse is trivial from the fact that fL(Cm)=AL=fL(Rd)∩Dm
and fL(⋅) is a continuous map, hence if fL(Rd)∩Dm
is not a path-connected set, Cm is also not a path-connected
set (thanks to Lemma 15).
∎
Lemma 19.
If cl(B) is a
convex polyhedron, then B is a path-connected set.
Proof.
If B=∅, then it is path-connected. Now assume that B≠∅ and
let w∈int(B); then there exists B(w,r)⊂B.
Consider any u,v∈B. We prove that [w,u] and [w,v]
are subsets of B to reach the conclusion. In fact, we first have
[u,a]⊂cl(B) for any a∈B(w,r),
hence [u,w]\{u} is a subset of int(cl(B)).
Since cl(B) is a polyhedron, we obtain int(cl(B))=int(B)⊂B.
Therefore, we reach [w,u]⊂B. Similarly, we obtain [w,v]⊂B.
∎
Lemma.
If B1 and B2 are two polyhedrons, then cl(B1∩B2)=cl(B1)∩cl(B2).
Proof.
Let B1={u∣W11u<b11,W12u=b12}
and B2={u∣W21u<b21,W22u=b22}.
We then have:
[TABLE]
∎
Lemma.
Let B={u∣Mu≤m,Nu=n} be a closed
polyhedron and h(u)=Wu+b be an affine map. Then
h(B) is a closed polyhedron.
Proof.
We consider
$$C=\left\{(u,v)\mid Mu\le m,\ Nu=n,\ v=Wu+b\right\};$$
then C is a closed polyhedron.
We now remark that h(B)=πv(C) where
πv(u,v)=v is the projection map. This leads
to the conclusion.
∎
Theorem 18 sheds light on devising neural networks
whose decision regions are connected. Based on this theorem, we can
formulate a sufficient condition for a given neural network to be
able to learn connected decision regions, stated in Corollary 20.
Corollary 20.
Consider a feedforward neural network with L layers
and A=fL−1(Rd). If either A is a convex
set or cl(A) is a polyhedron, then Cm is
an open path-connected set for every 1≤m≤M.
Proof.
Let hL(⋅)=WL×⋅+bL be the affine
map at the last layer. Assume that A is convex; then fL(Rd)=hL(fL−1(Rd))=hL(A)
is a convex subset of RM (thanks to Lemma 14).
It follows that fL(Rd)∩Dm is
a convex set for every 1≤m≤M due to the convexity of Dm.
Theorem 18 can then be applied to reach the conclusion,
since fL(Rd)∩Dm is a path-connected
set for every m.
We now assume that cl(A) is a polyhedron. It
follows that hL(cl(A)) is a closed
polyhedron since hL is an affine map. Referring to Lemma 17,
we obtain cl(hL(A))=hL(cl(A)),
which is a polyhedron. Since Dm is also a polyhedron, we obtain that
cl(hL(A)∩Dm)=cl(hL(A))∩cl(Dm)
is also a polyhedron. By applying Lemma 19,
we conclude that hL(A)∩Dm=fL(Rd)∩Dm
is a path-connected set.
∎
To see the usefulness of our new result in Corollary 20,
we use it to provide an alternative proof for the result stated in
(Nguyen et al., 2018) (Theorem 3.10 in that paper). Compared with the original proof,
our alternative proof does not require the monotonically increasing property.
Corollary 20 also becomes extremely useful in our later
theoretical development, where we study decision regions for a general
continuous bijective activation function (e.g., the leaky ReLU, ELU,
softplus, sigmoid, and tanh activation functions), which was not possible
under the framework of (Nguyen et al., 2018).
Theorem 21.
(first stated in (Nguyen et al., 2018) and re-proved here)
Let the width of the layers of the feedforward neural network satisfy
d=n0≥n1≥n2≥⋯≥nL−1 . Let σl:R→R
be bijective continuous activation function for every layer 1≤l≤L−1
and all the weight matrices (Wl)l=1L−1 have
full rank. If σl(R)=R for
every layer 1≤l≤L−1 then every decision region Cm
(i.e., 1≤m≤M) is an open connected subset of Rd.
Proof.
Let $h_l(\cdot)=W_l\times\cdot+b_l$ be the affine map at layer l, and define
$B_0=\mathbb{R}^d$, $A_l=h_l(B_{l-1})$, and $B_l=\hat{\sigma}_l(A_l)=f_l(\mathbb{R}^d)$
for 1≤l≤L−1.
Since $A_1=h_1(\mathbb{R}^d)$, $W_1$ has full rank, and $n_1\le d$,
we have $A_1=\mathbb{R}^{n_1}$. Since $B_1=\hat{\sigma}_1(A_1)$
and $\sigma_1$ is a bijective continuous map with $\sigma_1(\mathbb{R})=\mathbb{R}$,
we have $B_1=\mathbb{R}^{n_1}$. Similarly, we obtain $A_2=B_2=\mathbb{R}^{n_2}$
and finally $A_{L-1}=B_{L-1}=\mathbb{R}^{n_{L-1}}$. Since $f_{L-1}(\mathbb{R}^d)=B_{L-1}=\mathbb{R}^{n_{L-1}}$
is convex, it satisfies the condition in Corollary 20, and we
reach the conclusion.
∎
Given a full rank matrix $W\in\mathbb{R}^{n\times m}$ with n≤m,
there exist n linearly independent columns, e.g., $W_1^c,\dots,W_n^c$,
of W. In other words, the matrix $W^1=[W_1^c\ W_2^c\ \dots\ W_n^c]\in\mathbb{R}^{n\times n}$
formed by these columns has rank n and is invertible, while the
matrix $W^2$ formed by the remaining columns is in $\mathbb{R}^{n\times(m-n)}$.
Here we note that the columns in $W^1$ do not need to be consecutive.
However, for the sake of simplicity and without loss of generality,
we assume that they are the first n columns of W. Furthermore, since each column in
$W^2$ can be represented as a linear combination of those in $W^1$,
there exists a matrix $U\in\mathbb{R}^{n\times(m-n)}$
such that $W^2=W^1U$. We next study under which conditions an
affine transformation maps $\overline{\text{Rect}}(u_1,u_2)$
to $\overline{\text{Rect}}(v_1,v_2)$ or $\overline{\text{Rect}}(u)$
to $\overline{\text{Rect}}(v)$.
Corollary 22.
Let $h:\mathbb{R}^m\to\mathbb{R}^n$
be an affine map with $h(u)=Wu+b$ where $W\in\mathbb{R}^{n\times m}$
is a full rank matrix ($m\ge n$). Let $W=[W^1\ W^2]$
wherein $W^1\in\mathbb{R}^{n\times n}$ and $W^2\in\mathbb{R}^{n\times(m-n)}$
are defined as above. If W and V are two non-negative matrices
with $V=(W^1)^{-1}$, the image of $\overline{\text{Rect}}(u)\subset\mathbb{R}^m$
is $\overline{\text{Rect}}(v)\subset\mathbb{R}^n$
with $v=h(u)$.
Proof.
Let $y\ge v=h(u)=Wu+b$. Let $a^1=V(y-v)\in\mathbb{R}^n$,
$a^2=0_{m-n}\in\mathbb{R}^{m-n}$, and $a=[(a^1)^T\ (a^2)^T]^T$.
Let $x=u+a\in\mathbb{R}^m$. We then have
$$h(x)=W(u+a)+b=v+W^1a^1+W^2a^2=v+W^1V(y-v)=y.$$
In addition, since $V\ge 0$ and $y-v\ge 0$, we obtain
$a^1\ge 0$ and hence $a\ge 0$. It follows that
$x\ge u$, i.e., $x\in\overline{\text{Rect}}(u)$.
Thus, we reach the conclusion that $\overline{\text{Rect}}(v)\subset h(\overline{\text{Rect}}(u))$.
Conversely, let $x\in\overline{\text{Rect}}(u)$ and $y=h(x)$. Since
$W\ge 0$, we have
$$y=Wx+b\ge Wu+b=v.$$
Therefore, $y\in\overline{\text{Rect}}(v)$,
which implies $h(\overline{\text{Rect}}(u))\subset\overline{\text{Rect}}(v)$.
Finally, we arrive at $h(\overline{\text{Rect}}(u))=\overline{\text{Rect}}(v)$.
∎
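The first half of the proof of Corollary 22 is constructive: for any y≥v it builds a preimage x=u+a with a1=V(y−v) and a2=0. The sketch below replays that construction on a small hypothetical example (n=2, m=3) using exact rational arithmetic. The matrices, the vectors, and the helper names are assumptions chosen so that W and V=(W1)−1 are both non-negative.

```python
from fractions import Fraction as Fr

def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def h(W, b, x):
    """The affine map h(x) = W x + b."""
    return [s + bi for s, bi in zip(matvec(W, x), b)]

def preimage_point(V, u, v, y):
    """Return x = u + [V (y - v); 0], the preimage built in the proof."""
    a1 = matvec(V, [yi - vi for yi, vi in zip(y, v)])
    a = a1 + [Fr(0)] * (len(u) - len(v))
    return [ui + ai for ui, ai in zip(u, a)]

# W = [W1 W2] with W1 = [[0, 2], [3, 0]] (monomial) and W2 = [[1], [1]].
W = [[Fr(0), Fr(2), Fr(1)], [Fr(3), Fr(0), Fr(1)]]
V = [[Fr(0), Fr(1, 3)], [Fr(1, 2), Fr(0)]]   # inverse of W1, non-negative
b = [Fr(1), Fr(-1)]
u = [Fr(0), Fr(0), Fr(0)]
v = h(W, b, u)
y = [Fr(3), Fr(2)]                            # any point with y >= v

x = preimage_point(V, u, v, y)
print(x, h(W, b, x))
```

The constructed point satisfies x≥u and h(x)=y, which is exactly the inclusion of the closed rectangle at v in the image of the closed rectangle at u.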
Corollary 23.
Let $h:\mathbb{R}^m\to\mathbb{R}^n$
be an affine map with $h(u)=Wu+b$ where $W\in\mathbb{R}^{n\times m}$
is a full rank matrix ($m\ge n$). Let $W=[W^1\ W^2]$
and $W^2=W^1U$ wherein $W^1\in\mathbb{R}^{n\times n}$, $W^2\in\mathbb{R}^{n\times(m-n)}$,
and $U\in\mathbb{R}^{n\times(m-n)}$ are defined as above.
If W and V are two non-negative matrices and $U[\Delta u_i]_{i=n+1}^{m}\le 0$,
where $V=(W^1)^{-1}$, $u_1\le u_2$, and $\Delta u=u_2-u_1$, then
the image of $\overline{\text{Rect}}(u_1,u_2)\subset\mathbb{R}^m$
is $\overline{\text{Rect}}(v_1,v_2)\subset\mathbb{R}^n$
with $v_1\le v_2$, where $v_1=h(u_1)$
and $v_2=h(u_2)$.
Proof.
It is trivial that v1≤v2 from the facts u1≤u2
and W≥0. Given $u\in\overline{\text{Rect}}(u_1,u_2)$,
i.e., u1≤u≤u2, it is obvious that h(u1)≤h(u)≤h(u2),
or v1≤h(u)≤v2. It follows that
$h(u)\in\overline{\text{Rect}}(v_1,v_2)$,
hence $h(\overline{\text{Rect}}(u_1,u_2))\subset\overline{\text{Rect}}(v_1,v_2)$.
Let 1i∈Rn be the one-hot vector with the only
1 at the i-th position. Let v∈Rect(v1,v2)
which can be represented as:
[TABLE]
where 0≤λi≤1,∀i.
Let us denote
[TABLE]
where Vic points out the i-th column of the matrix V,
and Wir points out the i-th row of the matrix W. We
now verify that u∈Rect(u1,u2)
or equivalently u1≤u≤u2. Since u1≤u2,
W≥0, and V≥0, it is obvious that u1≤u.
We further derive as follows:
[TABLE]
Therefore, we obtain
[TABLE]
We now prove that h(u)=v. Indeed, we have
[TABLE]
[TABLE]
Here we note that since $W^1V=I_n$, we have $W^1V_i^c=\mathbf{1}_i$.
Putting everything together, we have h(u)=v with u∈Rect(u1,u2),
which implies Rect(v1,v2)⊂h(Rect(u1,u2)).
Finally, we reach the conclusion that h(Rect(u1,u2))=Rect(v1,v2).
∎
The matrix W in Corollaries 22 and 23
is constructed from a non-negative matrix W1 whose inverse
V is also a non-negative matrix. Such matrices, known
as non-negative monomial matrices, have been studied in (Ding & Rhee, 2014),
wherein it is proven that W1 is a non-negative monomial
matrix if and only if it can be factorized as the product of a
non-negative diagonal matrix D and a permutation matrix P, i.e.,
W1=DP. Based on the matrix W1, we can flexibly construct
the matrix W2 satisfying the constraints in Corollaries 22
and 23. We now use Corollaries 22
and 23 as building blocks for Theorem 24,
wherein we address the question of under which conditions a feedforward
neural network with the sigmoid, tanh, softplus, and ELU activation
functions has connected decision regions.
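The factorization W1=DP gives a simple test for the matrices allowed in Corollaries 22 and 23: an invertible non-negative matrix of this form has exactly one positive entry in each row and each column, and its inverse is obtained by transposing the pattern and inverting the entries. The sketch below is illustrative only, with hypothetical helper names; it assumes the diagonal entries of D are strictly positive so that W1 is invertible.

```python
from fractions import Fraction

def is_nonnegative_monomial(W):
    """True if the square matrix W has one positive entry per row and column."""
    n = len(W)
    if any(len(row) != n for row in W):
        return False
    if any(x < 0 for row in W for x in row):
        return False
    rows_ok = all(sum(1 for x in row if x != 0) == 1 for row in W)
    cols_ok = all(sum(1 for row in W if row[j] != 0) == 1 for j in range(n))
    return rows_ok and cols_ok

def monomial_inverse(W):
    """Invert a monomial matrix: transpose the pattern, invert each entry."""
    n = len(W)
    V = [[Fraction(0)] * n for _ in range(n)]
    for i, row in enumerate(W):
        for j, x in enumerate(row):
            if x != 0:
                V[j][i] = Fraction(1) / Fraction(x)
    return V

W1 = [[0, 2], [3, 0]]
print(is_nonnegative_monomial(W1), monomial_inverse(W1))
print(is_nonnegative_monomial([[1, 1], [0, 1]]))
```

The second matrix is non-negative and invertible, but its inverse has a negative entry, so it is rejected.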
Theorem 24.
Let the width of the layers of the feedforward
neural network satisfy d=n0≥n1≥n2≥⋯≥nL−1
. Let σl:R→R be bijective continuous
activation function for every layer 1≤l≤L−1 , all the weight
matrices (Wl)l=1L−1 have full rank.
i) If limt→−∞σl(t)=a1 is finite,
limt→+∞σl(t)=+∞, Wl
and Vl are two non-negative matrices where Vl=(Wl1)−1
in which Wl1 is defined from Wl as above for every
2≤l≤L then every decision region Cm (i.e., 1≤m≤M)
is an open path-connected subset of Rd.
ii) If limt→−∞σl(t)=a1 is finite,
limt→+∞σl(t)=a2 is finite, Wl
and Vl are two non-negative matrices, and Ul[Δuil]i=nl+1nl+1≤0
where Δul=u2l−u1l,u1l=σ^l(Wlu1l−1+bl),u2l=σ^l(Wlu2l−1+bl)
with u11=[a1]n1,u21=[a2]n1,
and Vl=(Wl1)−1,Wl2=Wl1Ul
in which Wl1 and Wl2 are defined from Wl as
above for every 2≤l≤L then every decision region Cm
(i.e., 1≤m≤M) is an open path-connected subset of Rd.
Proof.
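Let us denote $A_l=f_l(\mathbb{R}^d)$ and $B_{l+1}=h_{l+1}(A_l)$
with $h_l(\cdot)=W_l\times\cdot+b_l$ (i.e., the
affine map at the layer l) for 0≤l≤L−1. It is obvious
that $A_l=\hat{\sigma}_l(B_l)$ for 1≤l≤L−1.
Since the matrix $W_1$ has full rank, $B_1=h_1(A_0)=h_1(\mathbb{R}^d)=\mathbb{R}^{n_1}$.
i) It follows that $A_1=\hat{\sigma}_1(B_1)=\hat{\sigma}_1(\mathbb{R}^{n_1})=(a_1,+\infty)^{n_1}=\text{Rect}(u^1)$
where $u^1=[a_1]_{n_1}$. Using the facts that
$W_2\ge 0$ and $V_2\ge 0$ where $V_2=(W_2^1)^{-1}$,
Corollary 22 gives us $h_2(\overline{\text{Rect}}(u^1))=\overline{\text{Rect}}(v)$
where $v=h_2(u^1)$. Using the facts that $\text{cl}(A_1)=\overline{\text{Rect}}(u^1)$
and $h_2(\overline{\text{Rect}}(u^1))=\overline{\text{Rect}}(v)$
is closed, Lemma 17 gives us $\text{cl}(B_2)=\overline{\text{Rect}}(v)$,
hence $B_2\supset\text{Rect}(v)$. Since $A_2=\hat{\sigma}_2(B_2)$,
we obtain $\text{cl}(A_2)=\overline{\text{Rect}}(u^2)$
where $u^2=\hat{\sigma}_2(v)$. Using the same
argument forward through the network, we obtain $\text{cl}(A_{L-1})=\overline{\text{Rect}}(u^{L-1})$
(i.e., a polyhedron), which concludes this proof (thanks to Corollary
20).
ii) It follows that $A_1=\hat{\sigma}_1(B_1)=\hat{\sigma}_1(\mathbb{R}^{n_1})=(a_1,a_2)^{n_1}=\text{Rect}(u_1^1,u_2^1)$
where $u_1^1=[a_1]_{n_1}$ and $u_2^1=[a_2]_{n_1}$.
Using the facts that $W_2\ge 0$, $V_2\ge 0$ where $V_2=(W_2^1)^{-1}$,
and the condition on $U_2$ assumed in ii), Corollary 23
gives us $h_2(\overline{\text{Rect}}(u_1^1,u_2^1))=\overline{\text{Rect}}(v_1,v_2)$
where $v_1=h_2(u_1^1)$ and $v_2=h_2(u_2^1)$.
Using the facts that $\text{cl}(A_1)=\overline{\text{Rect}}(u_1^1,u_2^1)$
and $h_2(\overline{\text{Rect}}(u_1^1,u_2^1))=\overline{\text{Rect}}(v_1,v_2)$
is closed, Lemma 17 gives us $\text{cl}(B_2)=\overline{\text{Rect}}(v_1,v_2)$,
hence $B_2\supset\text{Rect}(v_1,v_2)$. Since
$A_2=\hat{\sigma}_2(B_2)$, we obtain $\text{cl}(A_2)=\overline{\text{Rect}}(u_1^2,u_2^2)$.
Using the same argument forward through the network, we obtain $\text{cl}(A_{L-1})=\overline{\text{Rect}}(u_1^{L-1},u_2^{L-1})$
(i.e., a polyhedron), which concludes this proof (thanks to Corollary
20).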
∎
It is worth noting that Theorem 24 can be
applied to all bijective continuous activation functions including
the leaky ReLU, ELU, softplus, sigmoid, and tanh activation functions.
However, this cannot be applied to the ReLU activation function, which
is one of the most widely used activation functions. The reason is
that this activation function is not bijective. In what follows, we
study the capacity to learn path-connected regions of a feed-forward
neural net with the ReLU activation function.
Corollary 25.
Let h:Rm→Rn
be an affine map with h(u)=Wu+b where W∈Rn×m
is a full rank matrix. If V is a non-negative matrix and Vb≤0,
where V=(W1)−1 with W1,W2 defined
as above, then $\overline{\text{Rect}}(\mathbf{0}_{n})\subset h(\overline{\text{Rect}}(\mathbf{0}_{m}))$.
Proof.
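Let $\boldsymbol{v}=[a_{1}\dots a_{n}]^{\mathsf{T}}\in\overline{\text{Rect}}(\mathbf{0}_{n})\backslash\{\mathbf{0}_{n}\}$.
Let $\boldsymbol{v}_{i}=V\left(\mathbf{1}_{i}-\frac{1}{\sum_{j=1}^{n}a_{j}}\boldsymbol{b}\right)=V\mathbf{1}_{i}-\frac{1}{\sum_{j=1}^{n}a_{j}}V\boldsymbol{b}\geq\mathbf{0}\in\mathbb{R}^{n}$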
where 1i is the one-hot vector with 1 at the i-th
index for 1≤i≤n. We further define ui=[viT0m−nT]T
and u=∑i=1naiui∈Rm. We then have
u∈Rect(0m) and h(u)=v
because
[TABLE]
[TABLE]
Now let u=[−(Vb)T0m−nT]T≥0m,
we then have
[TABLE]
Therefore, we reach the conclusion.
∎
The matrix W1 whose inverse V is a non-negative matrix as
in Corollary 25 is known as a monotone (inverse-positive)
matrix (Fujimoto & Ranade, 2004), which forms a superclass of M-matrices
(Plemmons, 1977).
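The construction used in the proof of Corollary 25 can be checked on a small example. The following is a minimal sketch, not the authors' code: it assumes that W splits as [W1 W2] with W1 the leading n×n block (as the zero-padding of ui in the proof suggests), fixes n=2 to keep the inverse explicit, and uses an illustrative M-matrix for W1. The names `preimage`, `matvec`, and `inverse_2x2` are hypothetical helpers.

```python
from fractions import Fraction as F


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def inverse_2x2(M):
    """Exact inverse of a 2x2 matrix (illustrative helper)."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def preimage(W, b, v):
    """Return u >= 0 with W u + b = v, following the proof's construction.

    Assumes W = [W1 W2] with W1 the leading n x n block, V = W1^{-1} >= 0,
    V b <= 0, and v >= 0 (n = 2 here only to keep the inverse simple).
    """
    n, m = len(W), len(W[0])
    V = inverse_2x2([row[:n] for row in W])
    Vb = matvec(V, b)
    # u = sum_i a_i u_i collapses to [V v - V b, 0]; v = 0 gives [-V b, 0].
    head = [x - y for x, y in zip(matvec(V, v), Vb)]
    return head + [F(0)] * (m - n)


# Illustrative data: W1 = [[2, -1], [-1, 2]] is an M-matrix, so V >= 0.
W = [[F(2), F(-1), F(5)], [F(-1), F(2), F(-7)]]
b = [F(-3), F(0)]          # chosen so that V b = [-2, -1] <= 0
v = [F(1), F(4)]           # target point in the closed non-negative orthant
u = preimage(W, b, v)
h_u = [x + y for x, y in zip(matvec(W, u), b)]
```

Here `u` is non-negative and `h_u` equals `v`, which is the containment stated in the corollary for this particular point. Exact rational arithmetic is used so that the equality check does not depend on floating-point rounding.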
Lemma 26.
Assume U⊂Rect(0)⊂Rn
is a path-connected set. If u≱0 (i.e., u has a negative
coordinate) and u∈σ^−1(U), then the segment
[σ^(u),u]⊂σ^−1(U).
Proof.
Given u∈Rn
and 0≤λ≤1, we verify that σ^(λu+(1−λ)σ^(u))=σ^(u).
Let v=λu+(1−λ)σ^(u).
For i such that ui≥0, we have σ^(u)i=ui,
hence vi=λui+(1−λ)σ^(u)i=ui
and σ^(v)i=σ^(vi)=σ^(ui)=σ^(u)i.
For i such that ui<0, we have σ^(u)i=σ^(ui)=0,
hence vi=λui+(1−λ)σ^(u)i=λui≤0
and σ^(v)i=σ^(vi)=0=σ^(u)i.
It follows that σ^(v)i=σ^(u)i,∀i,
hence σ^(v)=σ^(u)
and v=λu+(1−λ)σ^(u)∈σ^−1(U).
In addition, it is trivial that U⊂σ^−1(U)
since U⊂Rect(0)⊂Rn.
We now prove that if u≱0 (i.e., u has a negative coordinate)
and u∈σ^−1(U) then the segment [σ^(u),u]⊂σ^−1(U).
Let v=λu+(1−λ)σ^(u)∈[σ^(u),u]
for some 0≤λ≤1. Then σ^(v)=σ^(u)∈U
which implies v∈σ^−1(U).
∎
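The key identity in this proof, that ReLU is constant along the segment from σ^(u) to u, can be checked directly. The sketch below is only an illustration under the assumption that σ is the coordinate-wise ReLU; the function names are hypothetical, and exact fractions are used so that equality tests are not affected by rounding.

```python
from fractions import Fraction as F


def relu(u):
    """Apply ReLU coordinate-wise to a vector given as a list."""
    return [max(F(0), x) for x in u]


def segment_point(u, lam):
    """Return lam * u + (1 - lam) * relu(u), a point on [relu(u), u]."""
    return [lam * x + (1 - lam) * r for x, r in zip(u, relu(u))]


def segment_keeps_relu(u, steps=10):
    """Check that ReLU takes the same value at sampled points of [relu(u), u]."""
    target = relu(u)
    return all(
        relu(segment_point(u, F(k, steps))) == target
        for k in range(steps + 1)
    )


# A vector with positive, zero, and negative coordinates.
u = [F(3, 2), F(-2), F(0), F(-1, 3)]
constant_along_segment = segment_keeps_relu(u)
```

Non-negative coordinates stay fixed along the segment, while negative coordinates are scaled by λ and therefore remain non-positive, so every sampled point has the same ReLU image as u.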
Lemma 27.
If U⊂Rect(0)⊂Rn
is a path-connected set and C is a convex set containing U (i.e.,
U⊂C) then σ^−1(U)∩C is also
a path-connected set provided that σ is the ReLU activation
function.
Proof.
Let u1,u2∈σ^−1(U)∩C. We
consider the following three cases:
- u1≥0 and u2≥0:
σ^(u1)=u1∈U and σ^(u2)=u2∈U.
Since U is path-connected, there exists a path in U connecting
σ^(u1) and σ^(u2);
hence this path also connects u1 and u2 in σ^−1(U)∩C
due to U⊂σ^−1(U) and U⊂C.
- u1≥0 and u2≱0:
σ^(u1)=u1 and [σ^(u2),u2]⊂σ^−1(U)
and also [σ^(u2),u2]⊂C
due to u2,σ^(u2)∈C and the convexity
of C. Since U is path-connected, there exists a path in U
connecting σ^(u1) and σ^(u2);
hence this path also connects u1 and σ^(u2)
in σ^−1(U)∩C. Combining this path with
the segment [σ^(u2),u2]⊂σ^−1(U)∩C,
we have a path connecting u1,u2 in σ^−1(U)∩C.
- u1≱0 and u2≱0:
[σ^(u1),u1]⊂σ^−1(U)∩C
and [σ^(u2),u2]⊂σ^−1(U)∩C.
The path connecting u1 and u2
is formed by the segment [u1,σ^(u1)]⊂σ^−1(U)∩C,
a path connecting σ^(u1) and σ^(u2) in U, hence in σ^−1(U)∩C,
and the segment [σ^(u2),u2]⊂σ^−1(U)∩C.
∎
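The three-piece path used in this proof can be made concrete. The following sketch is an illustration under stated assumptions, not part of the paper: U is a hypothetical L-shaped (non-convex but path-connected) subset of the closed non-negative quadrant, C is taken to be all of R2, and paths are represented by sampled points.

```python
from fractions import Fraction as F


def relu(u):
    """Coordinate-wise ReLU on a point given as a tuple."""
    return tuple(max(F(0), x) for x in u)


def in_U(p):
    """Membership in a hypothetical L-shaped set U (two unit axis segments)."""
    x, y = p
    return (y == 0 and 0 <= x <= 1) or (x == 0 and 0 <= y <= 1)


def segment(p, q, steps=8):
    """Sample points on the straight segment from p to q."""
    return [
        tuple(a + F(k, steps) * (b - a) for a, b in zip(p, q))
        for k in range(steps + 1)
    ]


def path_in_U(p, q):
    """Path inside U between p and q: travel along the axes via the origin."""
    origin = (F(0), F(0))
    return segment(p, origin) + segment(origin, q)


def connect(u1, u2):
    """Three-piece path from the proof: u1 -> relu(u1) ~> relu(u2) -> u2."""
    r1, r2 = relu(u1), relu(u2)
    return segment(u1, r1) + path_in_U(r1, r2) + segment(r2, u2)


u1 = (F(1, 2), F(-1))   # has a negative coordinate
u2 = (F(-1), F(1, 2))   # has a negative coordinate
path = connect(u1, u2)
stays_in_preimage = all(in_U(relu(p)) for p in path)
```

Every sampled point of `path` has its ReLU image in U, so the path stays inside the preimage of U. When an endpoint is already non-negative, its first or last segment degenerates to a single point, which covers the other two cases of the proof.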
Theorem 28.
Consider a neural network with the ReLU activation
function. Let Al=hl(fl−1(Rd))
where hl(⋅)=Wl×⋅+bl for every
1≤l≤L. If Al is a convex set for every 1≤l≤L
and satisfies σ^l(Al)⊂Al for
every 1≤l≤L−1, the decision region Cm is path-connected
for every 1≤m≤M.
Proof.
We first prove that $\hat{\sigma}_{l}(A_{l})=A_{l}\cap\overline{\text{Rect}}(\mathbf{0})$.
Indeed, we have $\hat{\sigma}_{l}(A_{l})\subset A_{l}$ by assumption
and $\hat{\sigma}_{l}(A_{l})\subset\overline{\text{Rect}}(\mathbf{0})$
(because $\sigma_{l}$ is ReLU), hence $\hat{\sigma}_{l}(A_{l})\subset A_{l}\cap\overline{\text{Rect}}(\mathbf{0})$.
Conversely, let $\boldsymbol{u}\in A_{l}\cap\overline{\text{Rect}}(\mathbf{0})$;
then $\hat{\sigma}_{l}(\boldsymbol{u})=\boldsymbol{u}$, hence $\boldsymbol{u}\in\hat{\sigma}_{l}(A_{l})$.
It is obvious that $f_{L}(C_{m})=A_{L}\cap D_{m}=U_{L}$
is a convex set. Let $B_{L-1}=h_{L}^{-1}(U_{L})\cap\hat{\sigma}_{L-1}(A_{L-1})$;
then $h_{L}$ is an onto affine map from $B_{L-1}$ to $U_{L}$, hence
$B_{L-1}$ is a path-connected subset of $\overline{\text{Rect}}(\mathbf{0})$.
Let $U_{L-1}=\hat{\sigma}_{L-1}^{-1}(B_{L-1})\cap A_{L-1}$;
using Lemma 27 and noting that $B_{L-1}\subset\hat{\sigma}_{L-1}(A_{L-1})\subset A_{L-1}$,
we obtain that $U_{L-1}$ is path-connected. Let $B_{L-2}=h_{L-1}^{-1}(U_{L-1})\cap\hat{\sigma}_{L-2}(A_{L-2})$;
then $B_{L-2}$ is a path-connected subset of $\overline{\text{Rect}}(\mathbf{0})$.
Let $U_{L-2}=\hat{\sigma}_{L-2}^{-1}(B_{L-2})\cap A_{L-2}$;
then $U_{L-2}$ is a path-connected set. Applying the same argument backward
through the network, we conclude that $B_{1}$ and $U_{1}$ are path-connected. Finally,
since $U_{1}=h_{1}(C_{m})$ and $h_{1}(\cdot)$
is an affine map, we obtain that $C_{m}$ is an open connected set.
∎
Theorem 29.
Let the width of the layers of the feedforward neural network satisfy
d=n0≥n1≥n2≥⋯≥nL−1. Let σl:R→R
be the ReLU activation function for every layer 1≤l≤L−1.
If all the weight matrices (Wl)l=1L−1 have
full rank, Vl is non-negative, and Vlbl≤0,
where Vl=(Wl1)−1 and Wl1 is defined
from Wl as above for every layer 1≤l≤L−1, then every
decision region Cm (i.e., 1≤m≤M) is an open connected
subset of Rd.
Proof.
Let Al=hl(fl−1(Rd))
where hl(⋅)=Wl×⋅+bl and Bl=fl(Rd)=σ^l(Al)
for every 1≤l≤L. According to Theorem 28,
we need to prove Al is a convex set for every 1≤l≤L
and σ^l(Al)⊂Al for every 1≤l≤L−1.
In fact, we have A1=h1(Rd)=Rn1
since W1 has full rank. It follows that $B_{1}=\hat{\sigma}_{1}(A_{1})=\overline{\text{Rect}}(\mathbf{0}_{n_{1}})\subset A_{1}$.
Corollary 25 gives us the convex set $A_{2}=h_{2}(B_{1})=h_{2}(\overline{\text{Rect}}(\mathbf{0}_{n_{1}}))\supset\overline{\text{Rect}}(\mathbf{0}_{n_{2}})$.
It follows that $B_{2}=\hat{\sigma}_{2}(A_{2})=\overline{\text{Rect}}(\mathbf{0}_{n_{2}})\subset A_{2}$.
Applying the same argument forward, we arrive at $A_{L-1}=h_{L-1}(B_{L-2})\supset\overline{\text{Rect}}(\mathbf{0}_{n_{L-1}})$.
It follows that $B_{L-1}=\hat{\sigma}_{L-1}(A_{L-1})=\overline{\text{Rect}}(\mathbf{0}_{n_{L-1}})\subset A_{L-1}$.
Finally, AL=hL(BL−1) is convex, which concludes
the proof.
∎
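The hypotheses of Theorem 29 are simple to test for a given set of weights. The sketch below is one possible checker, written under the assumption that each Wl (of size nl×nl−1) splits as [Wl1 Wl2] with Wl1 its leading square block; the helper names and the example weights are hypothetical. It only verifies the sufficient conditions of the theorem and says nothing about networks that fail them.

```python
from fractions import Fraction as F


def inverse(M):
    """Exact Gauss-Jordan inverse; returns None if M is singular."""
    n = len(M)
    A = [[F(x) for x in row] + [F(int(i == j)) for j in range(n)]
         for i, row in enumerate(M)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if A[r][col] != 0), None)
        if pivot is None:
            return None
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [x / p for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [row[n:] for row in A]


def layer_ok(W, b):
    """Check the per-layer hypotheses: n <= m, V = W1^{-1} >= 0, V b <= 0.

    Assumes W (n x m) splits as [W1 W2] with W1 its leading n x n block.
    Invertibility of W1 implies that W has full rank n.
    """
    n, m = len(W), len(W[0])
    if n > m:
        return False
    V = inverse([row[:n] for row in W])
    if V is None:
        return False
    Vb = [sum(v * x for v, x in zip(row, b)) for row in V]
    return all(x >= 0 for row in V for x in row) and all(x <= 0 for x in Vb)


def network_ok(layers):
    """layers: list of (W, b) pairs for the hidden layers 1..L-1."""
    return all(layer_ok(W, b) for W, b in layers)


# Illustrative pyramidal network with widths 3 -> 2 -> 1.
layers = [
    ([[2, -1, 5], [-1, 2, -7]], [-3, 0]),
    ([[1, 4]], [-1]),
]
conditions_hold = network_ok(layers)
```

For these example weights every leading block has a non-negative inverse and every product Vlbl is non-positive, so the check passes. A leading block such as [[1, 2], [3, 4]], whose inverse has negative entries, is rejected.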
Theorem 30.
Let the one-hidden-layer network satisfy
d=n0≥n1, let σ1 be the ReLU activation
function, and let the hidden layer’s weight matrix W1 have full rank.
Then the decision region Cm is an open connected subset of
Rd for every 1≤m≤M.
Proof.
The proof of this theorem follows directly from Theorem 28
by noting that A1=h1(f0(Rd))=h1(Rd)=Rn1,
which contains $\hat{\sigma}_{1}(A_{1})=\overline{\text{Rect}}(\mathbf{0})$.
∎
4 Conclusion
Previous work has examined an important theoretical question regarding
the capacity of feedforward neural networks to learn connected decision
regions. It has been proven that for a particular class of activation
functions including leaky ReLU, neural networks having a pyramidal
structure (i.e., no layer has more hidden units than the input dimension)
necessarily produce connected decision regions. In this paper, we
significantly extend this result to a more general theory by providing
sufficient and necessary conditions under which the decision regions
of a neural network are connected, and we then develop the main theoretical
results on neural networks’ capacity to learn connected regions for
a wide range of activation functions that could not be
studied before, namely ReLU, sigmoid, tanh, softplus, and exponential
linear function.