The capacity of feedforward neural networks
Pierre Baldi, Roman Vershynin

TL;DR
This paper introduces a quantitative measure of neural network capacity based on the number of functions they can compute, providing formulas for layered architectures and insights into their expressive power.
Contribution
It defines the capacity of layered neural networks, develops new techniques for capacity bounds, and analyzes how architecture influences function complexity and regularization.
Findings
Capacity is a cubic polynomial in layer sizes.
Bottleneck layers limit capacity.
Deep networks produce more regular, interesting functions.
The Capacity of feedforward neural networks
Pierre Baldi and Roman Vershynin
Department of Computer Science, University of California, Irvine
Department of Mathematics, University of California, Irvine
Abstract.
A long-standing open problem in the theory of neural networks is the development of quantitative methods to estimate and compare the capabilities of different architectures. Here we define the capacity of an architecture by the binary logarithm of the number of functions it can compute, as the synaptic weights are varied. The capacity provides an upper bound on the number of bits that can be extracted from the training data and stored in the architecture during learning. We study the capacity of layered, fully-connected architectures of linear threshold neurons with $L$ layers of size $n_1, n_2, \ldots, n_L$ and show that in essence the capacity is given by a cubic polynomial in the layer sizes: $C(n_1, \ldots, n_L) = \sum_{k=1}^{L-1} \min(n_1, \ldots, n_k)\, n_k n_{k+1}$, where layers that are smaller than all previous layers act as bottlenecks. In proving the main result, we also develop new techniques (multiplexing, enrichment, and stacking) as well as new bounds on the capacity of finite sets. We use the main result to identify architectures with maximal or minimal capacity under a number of natural constraints. This leads to the notion of structural regularization for deep architectures. While in general, everything else being equal, shallow networks compute more functions than deep networks, the functions computed by deep networks are more regular and “interesting”.
Work supported in part by DARPA grant D17AP00002 and NSF grant 1839429 to P. B., and U.S. Air Force grant FA9550-18-1-0031 to R. V.
Keywords: neural networks; capacity; complexity; deep learning.
Contents
- 1 Introduction
- 2 Neural architectures and their capacities
- 3 Overview of new results
- 4 Useful examples of threshold maps
- 5 Capacity of networks: upper bounds
- 6 Capacity of product sets: slicing
- 7 Capacity of general sets
- 8 Networks with one hidden layer: multiplexing
- 9 Networks with two hidden layers: enrichment
- 10 Networks with arbitrarily many layers: stacking
- 11 Extremal capacity
- 12 Structural regularization
- 13 Polynomial threshold functions
- 14 Open questions
- 15 Conclusion
1. Introduction
Since their early beginnings (e.g. [17, 21]), neural networks have come a long way. Today they are at the center of a myriad of successful applications, spanning the gamut from games all the way to biomedicine [22, 23, 4]. In spite of these successes, the problem of quantifying the power of a neural architecture, in terms of the space of functions it can implement as its synaptic weights are varied, has remained open. This quantification is fundamental to the science of neural networks. It is also important for applications in order to compare architectures, including the basic comparison between deep and shallow architectures, and to select the most efficient architectures. Furthermore, this quantification is essential for understanding the apparently unreasonable properties of deep learning and the well-known paradox that deep learning architectures have a tendency to not overfit, even when the number of synaptic weights significantly exceeds the number of training examples [27]. To address these problems, in this work we introduce a notion of capacity for neural architectures and study how this capacity can be computed. We focus primarily on architectures that are feedforward, layered, and fully connected, denoted by $A(n_1, \ldots, n_L)$, where $n_k$ is the number of neurons in layer $k$.
1.1. Functional capacity of neural networks
Ideally, one would like to be able to describe the functional capacity of a neural network architecture, i.e. completely characterize the class of functions that it can compute as its synaptic weights are varied. In the purely linear case, such a program can easily be carried out. Indeed, let $p = \min(n_2, \ldots, n_{L-1})$. If $p \ge n_1$, then $A(n_1, \ldots, n_L)$ is simply the class of all linear functions from $\mathbb{R}^{n_1}$ to $\mathbb{R}^{n_L}$, i.e. it is equivalent to $A(n_1, n_L)$. If $p < n_1$, then there is a rank restriction and $A(n_1, \ldots, n_L)$ is the class of all linear functions from $\mathbb{R}^{n_1}$ to $\mathbb{R}^{n_L}$ with rank less than or equal to $p$, i.e. it is equivalent to $A(n_1, p, n_L)$ [5]. In addition, if $n_L \le p$ then the effect of the bottleneck layer is nullified and $A(n_1, p, n_L)$ is equivalent to $A(n_1, n_L)$. The exact same result is true in the case of unrestricted Boolean architectures (i.e. architectures with no restrictions on the Boolean functions being used), where the notion of rank is replaced by the fact that a Boolean layer of size $p$ can only take $2^p$ distinct values.
Unfortunately, in the other and most relevant non-linear settings, such a program has proven to be difficult to carry out, except for some important, but limited, cases. Indeed, for a single threshold gate neuron, $A(n, 1)$ corresponds to the set of linearly separable functions. Variations of this model using sigmoidal or other non-linear transfer functions can be understood similarly. Furthermore, in the case of an $A(n, m)$ architecture, for a given input, the output of each neuron is independent of the weights, or the outputs, of the other neurons. Thus the functional capacity of $A(n, m)$ can be described in terms of $m$ independent $A(n, 1)$ components. When a single hidden layer is introduced, the main known results are those of universal approximation properties. In the Boolean case, using linear threshold gates, and noting that these gates can easily implement the standard AND, OR, and NOT Boolean operations, it is easy to see using conjunctive or disjunctive normal form that $A(n, 2^n, 1)$ can implement any Boolean function of $n$ variables, and thus $A(n, 2^n, m)$ can implement any Boolean map from $\{0,1\}^n$ to $\{0,1\}^m$. It is also known that in the case of Boolean unrestricted autoencoders, the corresponding architectures implement clustering [3, 2]. In the continuous case, there are various universal approximation theorems [15, 13] showing, for instance, that continuous functions defined over compact sets can be approximated to arbitrary degrees of precision by architectures of the form $A(n, \infty, m)$, where we use “$\infty$” to denote the fact that the hidden layer may be arbitrarily large. Beyond these results, very little is known about the functional capacity of $A(n_1, \ldots, n_L)$.
1.2. Cardinal capacity of neural networks
In order to make progress on the capacity issue, here we define a simpler notion of capacity, the cardinal capacity. The cardinal capacity $C(\mathcal{A})$ of a finite class $\mathcal{A}$ of functions is simply the logarithm base two of the number of functions contained in $\mathcal{A}$ (Figure 1): $C(\mathcal{A}) = \log_2 |\mathcal{A}|$. The cardinal capacity can thus be viewed as the number of bits required to specify, or communicate, an element of $\mathcal{A}$, in the worst case of a uniform distribution over $\mathcal{A}$. We are particularly interested in computing the cardinal capacity of feedforward architectures of linear threshold functions. In continuous settings, the cardinal capacity can be defined in a similar way in a measure-theoretic sense by taking the logarithm of the volume associated with $\mathcal{A}$. In the rest of the paper, in the absence of any qualification, the term capacity is used to mean cardinal capacity.
While in general the notion of cardinal capacity is simpler and less informative than the notion of functional capacity, in the case of neural architectures the cardinal capacity has a very important interpretation. Namely, it provides an upper bound on the number of bits that can be stored in the architecture during learning. Indeed, the learning process can be viewed as a communication process over the learning channel [7], whereby information is transferred from the training data to the synaptic weights. Thus the learning process is a process for selecting and storing an element of $\mathcal{A}$, which corresponds to $C(\mathcal{A})$ bits of information. Any increase in the precision of the synaptic weights that does not change the input-output function is not visible from the outside.
The bulk of this paper focuses on estimating the capacity of arbitrary feedforward, layered, and fully-connected architectures of any depth, which are widely used in many applications. As a side note, the capacity of fully connected recurrent networks is studied in [6]. In the process, several techniques and theorems of independent interest are developed. In addition, the extremal properties of the capacity of such architectures are analyzed, contrasting the capacity of shallow versus deep architectures, and leading to the notion of structural regularization. Structural regularization provides a partial explanation for why deep neural networks have a tendency to avoid overfitting.
1.3. Main result of the paper: the capacity formula
The main result of this paper, Theorem 3.1, provides an estimate of the capacity of a general feedforward, layered, fully connected neural network of linear threshold gates. Suppose that such a network has $L$ layers with $n_k$ neurons in layer $k$, where $k = 1$ corresponds to the input layer and $k = L$ corresponds to the output layer. We show that, under some very mild assumptions on the sizes of the layers, the capacity $C(n_1, \ldots, n_L)$ of this network, defined as the binary logarithm of the total number of functions $f: \{0,1\}^{n_1} \to \{0,1\}^{n_L}$ it can compute, satisfies
\[ C(n_1, \ldots, n_L) \asymp \sum_{k=1}^{L-1} \min(n_1, \ldots, n_k)\, n_k n_{k+1}. \tag{1.1} \]
Here the notation $a \asymp b$ means that there exist two positive absolute constants $c_1, c_2$ such that $c_1 b \le a \le c_2 b$. Actually, we will show that the upper bound in the capacity formula (1.1) holds with constant $c_2 = 1$. The absolute constant $c_1$ hidden in the lower bound does not depend on anything; in particular it is independent of the depth $L$ of the network and of the widths $n_k$ of the layers. The formula (1.1) thus shows that the capacity of such a network is essentially given by a cubic polynomial in the sizes of the layers, where the bottleneck layers play a special role.
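To make the shape of the capacity formula concrete, the following minimal Python sketch evaluates the cubic polynomial $\sum_{k=1}^{L-1} \min(n_1, \ldots, n_k)\, n_k n_{k+1}$ for a list of layer sizes. The function name `capacity_estimate` is hypothetical, and the value returned is only the polynomial itself: the true capacity agrees with it up to absolute constant factors, under the assumptions of Theorem 3.1.

```python
from typing import Sequence


def capacity_estimate(layers: Sequence[int]) -> int:
    """Evaluate sum_k min(n_1..n_k) * n_k * n_{k+1} for layer sizes n_1..n_L.

    This is the polynomial from the capacity formula, not the exact capacity.
    """
    if len(layers) < 2 or any(n < 1 for n in layers):
        raise ValueError("need at least two layers of positive size")
    total = 0
    bottleneck = layers[0]  # running minimum min(n_1, ..., n_k)
    for n_k, n_next in zip(layers, layers[1:]):
        bottleneck = min(bottleneck, n_k)
        total += bottleneck * n_k * n_next
    return total


if __name__ == "__main__":
    print(capacity_estimate([100, 1]))            # single neuron: n^2 = 10000
    print(capacity_estimate([100, 200, 10]))      # 2200000, no bottleneck
    print(capacity_estimate([100, 20, 200, 10]))  # 320000, layer of size 20 throttles later terms
```

Inserting a narrow layer of size 20 lowers every later term, because the running minimum multiplies all subsequent products; this is the bottleneck effect described above.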
1.4. Capacity of sets
In the process of proving the main capacity formula (1.1) we establish some other stand-alone results of independent value. At the heart of our analysis are new lower bounds on the capacity of sets. We define the capacity $C(S)$ of a set $S \subset \mathbb{R}^n$ as the binary logarithm of the number of all the linear threshold functions $f: S \to \{0,1\}$. In other words, $C(S)$ measures the capacity of a single neuron when the inputs are restricted to $S$. Equivalently, $C(S)$ is the binary logarithm of the number of all possible ways $S$ can be separated by affine hyperplanes. We prove that for any subset $S$ of the Boolean cube $\{0,1\}^n$, the capacity satisfies
\[ \frac{1}{16} \log_2^2 |S| \le C(S) \le 1 + n \log_2\left(\frac{e|S|}{n}\right). \]
The upper bound was previously known, it holds for any subset $S \subset \mathbb{R}^n$, and it can be replaced by the simpler form $n \log_2 |S|$ when $n \ge 4$. The lower bound is a new contribution, and it improves over the previously known (and easy) lower bound of $1 + \log_2 |S|$, which is also true for any set $S \subset \mathbb{R}^n$, as soon as $|S| \ge 2^{17}$.
In the next section, we provide mathematical definitions of feedforward neural networks and their capacities and describe several known results. A reader familiar with neural network theory may glance through it and rapidly go to Section 3, which provides a description of the new results and provides a roadmap for the paper.
2. Neural architectures and their capacities
2.1. Threshold functions and maps
Throughout this paper, the $n$-dimensional Boolean cube is denoted by:
\[ H^n = \{0,1\}^n. \]
The Heaviside function is defined by:
[TABLE]
We have chosen the $\{0,1\}$ formalism for convenience. It can easily be replaced with the $\{-1,1\}$ formalism, using the Boolean cube $\{-1,1\}^n$ and replacing the Heaviside function by the sign function.
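Definition 2.1 (Threshold functions).
Consider a set $S \subset \mathbb{R}^n$. A function $f: S \to \{0,1\}$ is called a (linear) threshold function on $S$ if there exist $a \in \mathbb{R}^n$ and $\alpha \in \mathbb{R}$ such that $f$ can be expressed as:
\[ f(x) = h(\langle a, x \rangle + \alpha), \quad x \in S. \]
The set of all threshold functions on $S$ is denoted by $T(S, 1)$, and $T(H^n, 1)$ is often abbreviated to $T(n, 1)$.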
The notion of threshold functions generalizes naturally to the multivariate setting.
Definition 2.2 (Threshold maps).
Consider a set $S \subset \mathbb{R}^n$. A function $f = (f_1, \ldots, f_m): S \to H^m$ is called a threshold map if all components $f_i$ are threshold functions on $S$. The set of all threshold maps is denoted $T(S, m)$; in the particular case where $S = H^n$, we abbreviate it to $T(n, m)$.
2.2. Neural architectures
A neural architecture (or network) is represented by a weighted directed graph, where the nodes represent neurons and the weights represent synaptic connection strengths. Neurons have numerical states and operate by taking the weighted sum of the corresponding parent states and applying a transfer function to this weighted sum. This paper focuses on one of the most widely used classes of architectures, namely layered feedforward neural architectures, where neurons are arranged into layers and connections run from one layer to the next. We will denote an architecture with $L$ layers numbered from 1 to $L$, and $n_k$ neurons in each layer $k$, by:
\[ A(n_1, n_2, \ldots, n_L). \]
The architecture has $n_1$ input neurons and $n_L$ output neurons. To simplify the analysis, we make two assumptions:
- (a) full connectivity between the layers, i.e. we assume that each neuron in layer $k$ is connected to every neuron in layer $k+1$, but not to any other neurons;
- (b) the transfer function is the Heaviside threshold function.
These two assumptions are not absolutely essential, and we discuss how to relax them in the conclusion.
Under these assumptions, the input-output function computed by a layered feedforward neural architecture with a fixed set of weights is a composition of threshold maps. Indeed, the architecture $A(n_1, \ldots, n_L)$ computes an input-output function of the form
\[ f = f_{L-1} \circ \cdots \circ f_2 \circ f_1, \tag{2.1} \]
where each $f_k$ is a threshold map from layer $k$ to layer $k+1$. Generally, a network architecture is able to compute infinitely many functions $f: \mathbb{R}^{n_1} \to H^{n_L}$, as the synaptic weights and biases (threshold values) are varied. However, if the inputs are restricted to a given finite set $S \subset \mathbb{R}^{n_1}$, then the number of functions computable by the architecture becomes finite. We denote this class of functions by
\[ T(S, n_2, \ldots, n_L). \]
In the most important case where $S = H^{n_1}$ is the Boolean cube, we drop the set from the notation. Thus,
\[ T(n_1, n_2, \ldots, n_L) \]
denotes the class of functions computable by the architecture $A(n_1, \ldots, n_L)$; it consists of all functions $f: H^{n_1} \to H^{n_L}$ that can be expressed as in (2.1) for some set of threshold maps $f_k$ ($k = 1, \ldots, L-1$).
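The composition of threshold maps can be sketched in a few lines of Python. This is only an illustration: the helper names and the XOR weights are hypothetical, and the tie-breaking convention `h(0) = 1` is an assumption (the chosen biases are half-integers, so no tie ever occurs).

```python
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)


def heaviside(t: float) -> int:
    # Assumption: ties at t == 0 map to 1; the biases below avoid ties anyway.
    return 1 if t >= 0 else 0


def threshold_map(layer: Layer, x: Sequence[int]) -> List[int]:
    """Apply one fully connected layer of linear threshold neurons."""
    weights, biases = layer
    return [heaviside(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def forward(layers: Sequence[Layer], x: Sequence[int]) -> List[int]:
    """Compose the threshold maps f_1, f_2, ... in order."""
    for layer in layers:
        x = threshold_map(layer, x)
    return list(x)


# Hypothetical weights for an A(2, 2, 1) network computing XOR.
XOR_NET: List[Layer] = [
    ([[1.0, 1.0], [1.0, 1.0]], [-0.5, -1.5]),  # hidden units: OR, AND
    ([[1.0, -1.0]], [-0.5]),                   # output: OR and not AND
]

if __name__ == "__main__":
    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x, forward(XOR_NET, x))
```

XOR is not a threshold function on the cube, so this example also shows that a composition of threshold maps need not be a threshold map.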
2.3. Definition of capacity
The main question we address in this paper is: how many different functions can a given neural architecture compute? This leads us to the notion of capacity, which we define as follows:
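Definition 2.3 (Capacity of a neural architecture).
The capacity of a neural architecture $A(n_1, \ldots, n_L)$ is the binary logarithm of the number of different functions $f: H^{n_1} \to H^{n_L}$ it can compute, i.e.
\[ C(n_1, \ldots, n_L) = \log_2 |T(n_1, \ldots, n_L)|. \]
More generally, the capacity of a neural architecture $A(n_1, \ldots, n_L)$ on a given finite set $S \subset \mathbb{R}^{n_1}$ is
\[ C(S, n_2, \ldots, n_L) = \log_2 |T(S, n_2, \ldots, n_L)|. \]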
The capacity of an architecture can be interpreted as the number of bits required to specify a function computable by the architecture. Remarkably, this can also be viewed as an upper bound on the number of bits that can be stored in the architecture during learning, or equivalently an upper bound on the number of bits that can be extracted from the training data. If the capacity of a network is $C$ and the number of connections (weights) is $W$, then at most $C/W$ bits can be stored on average per synaptic weight. This bound is independent of, and more fundamental than, any hardware limitation on the precision of the synaptic weights. It can be viewed as a bound on the effective capacity of the deep learning channel [7]. The bound holds even if the weights have infinite precision and thus in principle contain an infinite amount of information. This is because in the current framework different sets of weights that implement the same overall input-output function are indistinguishable from the outside world.
Getting optimal bounds on the capacity can be non-trivial even for simple network architectures. Consider, for example, a single-neuron network with $n$ inputs, which thus can implement any threshold function on $\mathbb{R}^n$. The capacity of this network on a given set of inputs $S \subset \mathbb{R}^n$ is the logarithm of the number of all distinct threshold functions that can be defined on $S$. We call this quantity the capacity of $S$.
Definition 2.4 (Capacity of a set).
The capacity $C(S)$ of a set $S \subset \mathbb{R}^n$ is the binary logarithm of the number of all threshold functions on $S$, i.e.
\[ C(S) = C(S, 1) = \log_2 |T(S, 1)|. \]
Equivalently, $C(S)$ is the binary logarithm of the total number of ways the set $S$ can be partitioned by affine hyperplanes in $\mathbb{R}^n$ (taking into account the binary assignment associated with each partition).
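For very small sets, the capacity of a set can be explored by brute force. The sketch below enumerates truth tables produced by small integer weights and half-integer biases on points with 0/1 coordinates. It is an illustration under stated assumptions: the function names are hypothetical, and restricting to bounded integer weights yields in general only a lower bound on the number of threshold functions. For the cubes $\{0,1\}^2$ and $\{0,1\}^3$, weights of magnitude at most 2 are enough to recover the classical counts 14 and 104.

```python
from itertools import product
from math import log2


def cube(n):
    """All points of the Boolean cube {0,1}^n."""
    return list(product((0, 1), repeat=n))


def threshold_tables(points, max_weight=2):
    """Truth tables on `points` (0/1 coordinates) realised by integer weights
    in [-max_weight, max_weight] and half-integer biases."""
    n = len(points[0])
    weights = range(-max_weight, max_weight + 1)
    bound = n * max_weight
    # Half-integer biases: <a, x> + alpha is never zero, so the value of the
    # Heaviside function at 0 does not matter here.
    biases = [k + 0.5 for k in range(-bound - 1, bound + 1)]
    tables = set()
    for a in product(weights, repeat=n):
        for alpha in biases:
            tables.add(tuple(
                1 if sum(ai * xi for ai, xi in zip(a, x)) + alpha > 0 else 0
                for x in points))
    return tables


def capacity_lower_estimate(points, max_weight=2):
    """log2 of the number of threshold functions found by the enumeration."""
    return log2(len(threshold_tables(points, max_weight)))


if __name__ == "__main__":
    for n in (1, 2, 3):
        found = len(threshold_tables(cube(n)))
        print(n, found, round(log2(found), 3))
```

The search is exponential in the dimension and the weight range, so it is useful only as a sanity check on tiny examples.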
A significant part of this paper is devoted to studying the capacity of sets, in particular to deriving optimal estimates for $C(S)$ in terms of the cardinality of $S$.
2.4. Basic properties of capacity
Here we summarize a few elementary properties of the capacity of neural architectures.
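Lemma 2.5 (Basic properties of capacity).
1. (Affine invariance) For any invertible affine transformation $F: \mathbb{R}^{n_1} \to \mathbb{R}^{n_1}$, we have:
\[ C(F(S), n_2, \ldots, n_L) = C(S, n_2, \ldots, n_L). \]
2. (Monotonicity) If $n_k \le m_k$ for all $k$, then:
\[ C(n_1, \ldots, n_L) \le C(m_1, \ldots, m_L). \]
3. (Sub-additivity) For any $1 < k < L$, we have:
\[ C(n_1, \ldots, n_L) \le C(n_1, \ldots, n_k) + C(n_k, \ldots, n_L). \]
4. (Contractivity) Capacity may only increase if a layer is duplicated. For example:
\[ C(n, m, p) \le C(n, m, m, p). \]
5. For any set $S \subset \mathbb{R}^n$:
\[ C(S, m) = C(S)\, m. \]
6. For any set $S \subset \mathbb{R}^n$ and a threshold map $f \in T(S, m)$, we have:
\[ C(f(S), p) \le C(S, m, p). \]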
The proofs are elementary and left as an exercise.
2.5. Known bounds on capacity of sets
First, as a useful reminder, the following theorem about partitions of $\mathbb{R}^n$ by hyperplanes is well known (e.g. [25]) and straightforward to prove by recurrence.
Theorem 2.6.
The number of connected regions created by $m$ hyperplanes in $\mathbb{R}^n$ (passing through the origin) is at most:
\[ 2 \sum_{k=0}^{n-1} \binom{m-1}{k}, \]
and the number of connected regions created by $m$ affine hyperplanes in $\mathbb{R}^n$ is at most:
\[ \sum_{k=0}^{n} \binom{m}{k}. \]
In both cases, equality is achieved if the hyperplanes are in general position.
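The two classical region counts are easy to tabulate. The sketch below uses the standard formulas $2\sum_{k=0}^{n-1}\binom{m-1}{k}$ for hyperplanes through the origin and $\sum_{k=0}^{n}\binom{m}{k}$ for affine hyperplanes; the function names are hypothetical.

```python
from math import comb


def max_regions_central(m: int, n: int) -> int:
    """Maximal number of regions cut out by m hyperplanes through the origin in R^n."""
    return 2 * sum(comb(m - 1, k) for k in range(n))


def max_regions_affine(m: int, n: int) -> int:
    """Maximal number of regions cut out by m affine hyperplanes in R^n."""
    return sum(comb(m, k) for k in range(n + 1))


if __name__ == "__main__":
    print(max_regions_central(5, 2))  # 5 lines through the origin: 10 sectors
    print(max_regions_central(3, 3))  # 3 coordinate planes: 8 octants
    print(max_regions_affine(3, 2))   # 3 lines in general position: 7 regions
```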
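One of the most basic questions addressed in this paper revolves around the best upper and lower bounds on the capacity $C(S)$ in terms of the cardinality of $S$. The following upper bound is known:
Lemma 2.7 (Capacity of sets: upper bound).
The number of threshold functions on a given set $S \subset \mathbb{R}^n$ is bounded by
\[ |T(S, 1)| \le 2 \sum_{k=0}^{n} \binom{|S| - 1}{k}. \]
In particular, if $|S| > n$ then:
\[ C(S) \le 1 + n \log_2\left(\frac{e|S|}{n}\right), \]
and if in addition $n \ge 4$, then $C(S) \le n \log_2 |S|$.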
Proof.
The first part of the lemma is presented in [1, Section 4.2] and follows immediately from the first part of Theorem 2.6 by considering the number of regions into which $\mathbb{R}^{n+1}$ can be partitioned by the arrangement of $|S|$ hyperplanes of the form $\{(a, \alpha) \in \mathbb{R}^{n+1} : \langle a, x \rangle + \alpha = 0\}$, $x \in S$. The second part of the Lemma can then be deduced from the first using the elementary bound on the binomial sums:
\[ \sum_{k=0}^{n} \binom{N}{k} \le \left(\frac{eN}{n}\right)^n, \]
which is valid for all integers $1 \le n \le N$, see e.g. [24, Exercise 0.0.5]. The last part follows easily for $n \ge 4$. ∎
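The chain of upper bounds in this argument can be checked numerically. The sketch below assumes the three forms $\log_2\bigl(2\sum_{k=0}^{n}\binom{|S|-1}{k}\bigr)$, $1 + n\log_2(e|S|/n)$ and $n\log_2|S|$, and takes the last comparison to require $n \ge 4$; the function name is hypothetical.

```python
from math import comb, e, log2


def upper_bounds(size: int, n: int):
    """Three successively weaker upper bounds on the capacity of a set of
    `size` points in R^n (requires size > n >= 1)."""
    binomial_form = log2(2 * sum(comb(size - 1, k) for k in range(n + 1)))
    entropy_form = 1 + n * log2(e * size / n)
    simple_form = n * log2(size)
    return binomial_form, entropy_form, simple_form


if __name__ == "__main__":
    for n in range(4, 9):
        b, h, s = upper_bounds(2 ** n, n)  # S = Boolean cube of dimension n
        print(n, round(b, 2), round(h, 2), round(s, 2))
```

For the cube $\{0,1\}^2$ the binomial form gives exactly 14, the true number of threshold functions of two variables, so the first bound cannot be improved in general.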
Remark 2.8 (Tightness).
Lemma 2.7 gives the best possible upper bound on the capacity of a set $S$ in terms of the cardinality of $S$. In Section 9, we describe an enrichment method that for a given $m \ge n$ transforms the cube $H^n$ into a subset $S \subset H^m$ of cardinality $|S| = 2^n$ for which:
\[ C(S) \asymp m \log_2 |S| = mn. \]
This shows that the bound in Lemma 2.7 is optimal for almost any magnitude of the cardinality $|S|$.
Lemma 2.9 (Capacity of sets: lower bound).
For any finite set $S \subset \mathbb{R}^n$, there exist at least $2|S|$ threshold functions on $S$. In particular:
\[ C(S) \ge 1 + \log_2 |S|. \]
Proof.
The proof is elementary and we only sketch it. The claim is easy to check for $n = 1$. For general $n$, choose a projection $P$ in $\mathbb{R}^n$ onto some line such that $P$ is injective on $S$. Then $C(S) \ge C(P(S))$. By affine invariance, we can realize $P(S)$ as a subset of $\mathbb{R}$ without changing the capacity. Then, applying the statement for $n = 1$, we get $C(P(S)) \ge 1 + \log_2 |P(S)| = 1 + \log_2 |S|$. Note that if $S = H^n$ the result can also be proved by noting that for any point of the hypercube there is a Boolean threshold function on $H^n$ that is equal to 1 on that point, and equal to 0 everywhere else. Including all such functions and their negations yields the lower bound. ∎
Remark 2.10 (Tightness).
The bound in Lemma 2.9 is generally tight: if the set $S$ lies on some line in $\mathbb{R}^n$, then there are exactly $2|S|$ threshold functions on $S$.
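The one-dimensional case can be verified directly. A threshold function on distinct points of a line is determined by a cut position and an orientation, so the sketch below enumerates all cuts between consecutive points (plus one on each side) with both orientations and counts the distinct labelings. The function name is hypothetical.

```python
def count_threshold_functions_on_line(ts):
    """Number of distinct functions t -> h(a*t + alpha) on distinct reals ts."""
    pts = sorted(ts)
    # One cut below all points, one between each consecutive pair, one above.
    cuts = ([pts[0] - 1.0]
            + [(u + v) / 2 for u, v in zip(pts, pts[1:])]
            + [pts[-1] + 1.0])
    tables = set()
    for c in cuts:
        for sign in (1, -1):
            tables.add(tuple(1 if sign * (t - c) > 0 else 0 for t in ts))
    return len(tables)


if __name__ == "__main__":
    points = [0.0, 1.0, 2.5, 4.0, 7.0]
    print(count_threshold_functions_on_line(points), 2 * len(points))
```

The two constant functions are each produced twice (once per orientation), which is why $|S| + 1$ cuts and two orientations give $2(|S| + 1) - 2 = 2|S|$ functions.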
Nevertheless, for many sets $S$ the lower bound given in Lemma 2.9 is too weak and can be improved. Consider, for example, the entire Boolean cube $H^n$. Lemmas 2.7 and 2.9 give $1 + n \le C(H^n) \le n^2$. As the following known result shows, the upper bound is tight, and the capacity of the Boolean cube is approximately $n^2$:
Theorem 2.11 (Capacity of the Boolean cube).
For any $n > 1$, we have:
\[ \frac{n(n-1)}{2} \le C(H^n) \le n^2. \]
Moreover:
\[ C(H^n) = n^2 \bigl(1 + o(1)\bigr) \quad \text{as } n \to \infty. \tag{2.3} \]
The first, non-asymptotic, part of this theorem can be found in [1, Theorems 4.3, 4.5]; see also [12, 18]. It can also be derived from more general results in this paper: the upper bound on $C(H^n)$ follows from Lemma 2.7 for $n \ge 4$, and the lower bound on $C(H^n)$ is derived in Example 6.9 below. The second, asymptotic, part of Theorem 2.11 was proved by Zuev [31]. A tighter estimate corresponding to:
\[ C(H^n) = n^2 - n \log_2 n \pm O(n) \]
was obtained in [16].
Remark 2.12 (Extensions).
Theorem 2.11 can be generalized to polynomial threshold functions [8] of degree $d$, i.e. functions of the form $h(p(x))$ where $p$ is a polynomial of degree $d$. The capacity $C_d(H^n)$, defined as the binary logarithm of the number of such functions on $H^n$, satisfies:
\[ C_d(H^n) = \frac{n^{d+1}}{d!} \bigl(1 + o(1)\bigr) \quad \text{as } n \to \infty, \]
thus generalizing Zuev’s result (2.3) which corresponds to $d = 1$. There exist further extensions of the capacity bounds for ReLU units, units with positive weights, and units with binary weights; they are described in [6].
Armed with these definitions and preliminary results, we are set up to study the capacity of arbitrary feedforward architectures.
2.6. Asymptotic notation
In the estimation of various quantities, we will use the notation $a \asymp b$ and $a \lesssim b$ for identities and inequalities that hold up to constant factors. To be precise, $a \asymp b$ means that there exist two positive absolute constants $c_1$ and $c_2$ such that:
\[ c_1 b \le a \le c_2 b. \]
Similarly, $a \lesssim b$ means that there exists a positive absolute constant $c$ such that:
\[ a \le c\, b. \]
These notations are useful only when the quantities $a$ and $b$ vary as a function of certain parameters (e.g. layer sizes). Positive absolute constants, which we denote by $c, c_1, c_2, \ldots$, may not depend on anything, in particular on the number of layers $L$ or the number of nodes $n_k$ in any layer $k$.
3. Overview of new results
3.1. A capacity formula
The main technical result of the paper is a two-sided bound on the capacity of fully-connected, layered, feedforward architectures with threshold transfer functions.
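Theorem 3.1 (Capacity formula).
Consider a neural architecture $A(n_1, \ldots, n_L)$ with $L \ge 2$ layers. Assume that the number of nodes in each layer satisfies $n_j > 18 \log_2(L n_k)$ for any pair $j, k$ such that $1 \le j < k \le L$. Then:
\[ C(n_1, \ldots, n_L) \asymp \sum_{k=1}^{L-1} \min(n_1, \ldots, n_k)\, n_k n_{k+1}. \]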
The upper bound in Theorem 3.1 is not difficult; we derive it in Section 5 from Lemma 2.7 and the sub-additivity of the capacity. The lower bound is significantly more challenging and requires new tools, which we call multiplexing, enrichment, and stacking. Once these tools are developed, we use them to prove the lower bound in Section 10.2.
The upper bound in Theorem 3.1 actually holds with the optimal factor $1$ if each non-output layer has at least four neurons (Proposition 5.2), and it does not require the assumption on the layer sizes. This mild assumption is important in the lower bound, though, to prevent layer sizes from expanding too rapidly. Although this assumption has an almost optimal form (Section 10.3), it can be slightly weakened (Section 10.4).
For the special single-neuron case $A(n, 1)$, Theorem 3.1 gives
\[ C(n, 1) \asymp n^2. \]
Since $C(n, 1) = C(H^n)$, this recovers the capacity estimate of the Boolean cube from Theorem 2.11 up to a constant factor. The proof of Theorem 2.11, however, does not offer any insights on how to compute the capacity of deeper networks.
The simplest new case of Theorem 3.1 is for networks with one hidden layer, where it states that $C(n, m, 1) \asymp n^2 m$. The constant factor implicit in this bound can be tightened for large $n$. Indeed, we will show in Corollary 8.4 that:
\[ C(n, m, 1) = n^2 m \bigl(1 + o(1)\bigr) \tag{3.1} \]
if $n \to \infty$ and $\log_2 m = o(n)$. This extends Zuev’s asymptotic result (Theorem 2.11).
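As a quick check of how the one-hidden-layer estimate arises, the following worked computation expands the capacity sum for $A(n, m, 1)$, taking the sum to be the cubic polynomial $\sum_k \min(n_1, \ldots, n_k)\, n_k n_{k+1}$ described in the abstract. The only observation needed is that the second term never exceeds the first, so the sum is within a factor of 2 of $n^2 m$.

```latex
\begin{align*}
\sum_{k=1}^{2} \min(n_1, \ldots, n_k)\, n_k n_{k+1}
  &= n \cdot n \cdot m + \min(n, m) \cdot m \cdot 1
   = n^2 m + \min(n, m)\, m, \\
n^2 m \;\le\; n^2 m + \min(n, m)\, m
  &\;\le\; 2\, n^2 m
  \qquad \text{since } \min(n, m) \le n \le n^2 .
\end{align*}
```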
An immediate and somewhat surprising consequence of Theorem 3.1 is that multiple output neurons can always be “channeled” through a single output neuron without a significant change in capacity of the network:
Corollary 3.2.
Under the assumptions of Theorem 3.1, we have:
\[ C(n_1, \ldots, n_{L-1}, n_L, 1) \asymp C(n_1, \ldots, n_{L-1}, n_L). \tag{3.2} \]
Proof.
Comparing the capacity formulas for these two architectures, we see that all the terms in the two sums match except for the last (extra) term in $C(n_1, \ldots, n_L, 1)$, which is $\min(n_1, \ldots, n_L)\, n_L$. However, this term is clearly bounded by $\min(n_1, \ldots, n_{L-1})\, n_{L-1} n_L$, which is the last term in the capacity sum for $C(n_1, \ldots, n_L)$. Therefore, the capacity sums for the two architectures are within a factor of $2$ from each other. ∎
Let us mention that the capacity formula in Theorem 3.1 obtained for inputs in $H^{n_1}$ can be extended to inputs from other finite sets $S \subset \mathbb{R}^{n_1}$. In Propositions 5.2 and 10.5 we give upper and lower bounds on this variation of the capacity in terms of the cardinality of $S$.
3.2. Networks achieving maximal capacity
We can use the capacity formula in Theorem 3.1 to find networks that maximize the capacity subject to natural constraints. Here we find the most capable networks (a) with a given number of connections (weights) and (b) with a given number of nodes (neurons).
Let us start with (a). The number of connections, or synaptic weights, of the neural architecture $A(n_1, \ldots, n_L)$ is
\[ W = \sum_{k=1}^{L-1} n_k n_{k+1}. \]
Fixing $W$ makes sense because it is approximately the same as fixing the number of parameters of the neural architecture. The difference between the number of parameters and $W$ is the number of biases, one per non-input neuron, so that the number of parameters equals $W + \sum_{k=2}^{L} n_k$. Thus, the number of parameters is always at least $W$. Furthermore, since the number of neurons is usually much smaller than the number of connections $W$, in most situations the number of parameters approximately equals $W$. We have the following corollary.
Corollary 3.3 (Optimal network with given number of connections).
Under the conditions of Theorem 3.1, we have:
\[ C(n_1, \ldots, n_L) \lesssim n_1 W. \]
Moreover, any network satisfying $n_k \ge n_1$ for $k = 2, \ldots, L-1$ approximately achieves maximal capacity:
\[ C(n_1, \ldots, n_L) \asymp n_1 W. \]
Proof.
The first statement is known and follows from prior results on the growth function of general (not necessarily fully connected) networks [11, Corollary 3]. It also trivially follows from Theorem 3.1 and the fact that $\min(n_1, \ldots, n_k) \le n_1$. The second statement follows from Theorem 3.1 and the fact that $\min(n_1, \ldots, n_k) = n_1$ for all $k \le L-1$ under the assumptions of the Corollary. ∎
Standard architectures that satisfy the condition of the second statement of the Corollary include monotonically expansive feedforward networks satisfying $n_1 \le n_2 \le \cdots \le n_{L-1}$ (the output layer can be expansive or contractive). Likewise, expansive autoencoder networks satisfying $n_1 = n_3$ and $n_2 \ge n_1$ (in the case of a single hidden layer) also satisfy the condition of the Corollary. Finally, any shallow network with a single hidden layer, where the hidden layer is at least as large as the input ($n_2 \ge n_1$), satisfies the condition of the Corollary and thus approximately achieves maximal capacity. In contrast, in many deep feedforward networks used in applications there exist layers that are smaller in size than the input layer, and thus these networks do not achieve maximal capacity. On the positive side, this implies that such networks do not require as many independent examples for their training as a network of maximal capacity would.
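The role of the condition on the hidden layers can be seen numerically. The sketch below compares the capacity polynomial $\sum_k \min(n_1, \ldots, n_k)\, n_k n_{k+1}$ with $n_1 W$, where $W = \sum_k n_k n_{k+1}$ is the number of connections. The function names are hypothetical, and the ratio concerns the polynomial only, not the exact capacity.

```python
from typing import Sequence


def connections(layers: Sequence[int]) -> int:
    """Number of weights W of a fully connected layered architecture."""
    return sum(a * b for a, b in zip(layers, layers[1:]))


def capacity_sum(layers: Sequence[int]) -> int:
    """Cubic polynomial sum_k min(n_1..n_k) * n_k * n_{k+1}."""
    total, bottleneck = 0, layers[0]
    for n_k, n_next in zip(layers, layers[1:]):
        bottleneck = min(bottleneck, n_k)
        total += bottleneck * n_k * n_next
    return total


def fraction_of_max(layers: Sequence[int]) -> float:
    """Ratio of the capacity polynomial to the ceiling n_1 * W."""
    return capacity_sum(layers) / (layers[0] * connections(layers))


if __name__ == "__main__":
    print(fraction_of_max([10, 20, 30, 5]))  # hidden layers >= input: ratio 1.0
    print(fraction_of_max([10, 4, 30, 5]))   # early bottleneck: ratio below 1
```

When no hidden layer is smaller than the input, every running minimum equals $n_1$ and the polynomial equals $n_1 W$ exactly; a bottleneck of size 4 after an input of size 10 cuts the ratio to roughly one half in this example.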
Next, let us find the most capable network with a given number of nodes, or neurons. The constraint on the number of neurons is loosely inspired by biological situations where the number of neurons may stay approximately constant, but the number and pattern of connections among the neurons may vary.
It turns out that for a fixed number of nodes
\[ N = \sum_{k=1}^{L} n_k \]
the most capable networks are shallow. To quickly see why, note that Theorem 3.1 yields:
[TABLE]
This shows that the capacity decreases if we rearrange the fixed set of nodes into more layers. Furthermore, we can identify the most capable neural architectures with given number of nodes:
Corollary 3.4 (Optimal network with given number of neurons: informal statement).
Among all neural architectures with a given number of nodes $N$, the architecture $A(2N/3, N/3)$ approximately maximizes capacity.
Suppose that in addition to fixing $N$, we also fix the number of input neurons $n_1 = n$. Then:
1. If $n \ge N/2$, the architecture $A(n, N - n)$ approximately maximizes capacity.
2. If $n < N/2$, the architecture $A(n, N/2, N/2 - n)$ approximately maximizes capacity.
This result is formally stated in Theorems 11.1 and 11.4. Due to the equivalence (3.2), similar results hold for architectures with a single output unit, as well as for architectures with a fixed number of output units. Complementary minimization results (Theorem 11.9) show that, under fixed budgets of units or connections, the capacity is minimized by the deepest possible networks, those with a single unit in each hidden layer.
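For a small node budget, the claim that shallow networks maximize capacity can be explored exhaustively. The sketch below enumerates every way of splitting $N$ nodes into at least two layers and maximizes the capacity polynomial $\sum_k \min(n_1, \ldots, n_k)\, n_k n_{k+1}$. The function names are hypothetical, and the search optimizes the polynomial rather than the exact capacity, so it ignores the constant factors and the size assumptions of Theorem 3.1.

```python
from typing import Iterator, List, Sequence, Tuple


def capacity_sum(layers: Sequence[int]) -> int:
    """Cubic polynomial sum_k min(n_1..n_k) * n_k * n_{k+1}."""
    total, bottleneck = 0, layers[0]
    for n_k, n_next in zip(layers, layers[1:]):
        bottleneck = min(bottleneck, n_k)
        total += bottleneck * n_k * n_next
    return total


def compositions(total: int, min_parts: int = 2) -> Iterator[Tuple[int, ...]]:
    """All ordered ways to write `total` as a sum of >= min_parts positive integers."""
    def rec(remaining: int, prefix: List[int]) -> Iterator[Tuple[int, ...]]:
        if remaining == 0:
            if len(prefix) >= min_parts:
                yield tuple(prefix)
            return
        for first in range(1, remaining + 1):
            yield from rec(remaining - first, prefix + [first])
    yield from rec(total, [])


def best_architecture(num_nodes: int) -> Tuple[int, ...]:
    """Layer sizes with `num_nodes` nodes in total that maximize the polynomial."""
    return max(compositions(num_nodes), key=capacity_sum)


if __name__ == "__main__":
    best = best_architecture(12)
    print(best, capacity_sum(best))               # (8, 4) 256
    print((3, 3, 3, 3), capacity_sum((3, 3, 3, 3)))  # a deeper split: 81
    print((1,) * 12, capacity_sum((1,) * 12))     # deepest possible: 11
```

With $N = 12$ the maximizer is the two-layer split $(8, 4)$, that is $(2N/3, N/3)$, while spreading the same nodes over more layers gives markedly smaller values.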
These optimization results go against the belief, held by some, that deep architectures are more powerful because they can compute more functions than shallow architectures. The contrary is actually true: everything else being equal, deep architectures tend to compute fewer functions, but the functions they compute are more “interesting”, or have “better properties”. This is related to the well-known regularizing effect of deep learning: deep architectures tend to avoid overfitting, even when the amount of training data is small compared to the number of parameters. While some of this regularizing effect can be attributed to learning methods based on stochastic gradient descent, our analysis shows that there is a strong structural component (Section 12).
3.3. Capacity of sets
The derivation of the capacity formula (Theorem 3.1) is based on new bounds on the capacity of finite sets. In Lemmas 2.9 and 2.7 we noted the lower and upper bounds
\[ 1 + \log_2 |S| \le C(S) \le 1 + n \log_2\left(\frac{e|S|}{n}\right), \]
which hold for any finite set $S$ in $\mathbb{R}^n$ with more than $n$ points. We mentioned that both bounds are generally best possible. Surprisingly, the lower bound can be significantly improved for subsets of the Boolean cube $H^n$. Indeed, the main result of Section 7 states the following:
Theorem 3.5 (Capacity of a set).
The capacity of any set $S \subset H^n$ satisfies:
\[ C(S) \ge \frac{1}{16} \log_2^2 |S|. \]
This new bound is tight up to an absolute constant factor. Indeed, if $S = H^m$ is a Boolean cube canonically embedded in $H^n$, Theorem 3.5 gives $C(S) \ge m^2/16$, which matches, up to a constant factor, the upper bound $C(H^m) \le m^2$ in Theorem 2.11.
Unfortunately, even the new lower bound may be too weak for some applications. In particular, we need a stronger result to prove Theorem 3.1 even for three layers. Thus one may wonder if in some sense the capacity of $S$ could be increased through some preprocessing of $S$. Specifically, can we transform $S$ into a set $S'$ whose capacity is significantly larger, ideally as large as the upper bound in (9.1) allows? Furthermore, in doing so, we would like to stay in the category of subsets of the Boolean cube and use only transformations that a network of threshold units can compute. Specifically, we will require that the enrichment map $F: S \to S'$ be a threshold map. We address the enrichment problem in the particular case where $S = H^n$, leaving the general case for future investigations. The main result of Section 9 states the following.
Theorem 3.6 (Enrichment).
Let and be positive integers satisfying . There exists an injective threshold map such that:
[TABLE]
The enrichment map $F$ transforms the cube $H^n$ into the set $F(H^n) \subset H^m$. The enriched set $F(H^n)$ has the same cardinality as $H^n$ and almost the maximally possible capacity:
\[ C(F(H^n)) \asymp m \log_2 |F(H^n)| = mn, \]
which matches the upper bound in Lemma 2.7 in dimension $m$.
3.4. Capacity of networks: new tools
In addition to the new bounds on capacity of sets, our proof of Theorem 3.1 uses some other new tools, which may be helpful in other applications. Let us briefly explain our argument.
The upper bound in Theorem 3.1 can be quickly derived from the upper bound in Lemma 2.7 and the sub-additivity of capacity. A similar argument was used before to obtain upper bounds on the VC-dimension of neural networks, see e.g. [10].
The matching lower bound is considerably harder to prove. For networks with one hidden layer, the proof of (3.1) is based on a new method that is inspired by the idea of multiplexing in signal processing (Section 8). Recall that estimating the capacity of a two-layer network involves counting all functions , where
[TABLE]
is a threshold map (i.e. a map all of whose components are threshold functions) and
[TABLE]
is a threshold function. Due to Theorem 2.11, there are approximately different functions . However, this does not yield any lower bound on the number of compositions : it might happen that two different functions , when composed with , produce the same function. The multiplexing method circumvents this issue by combining two signals: a selector signal, and a threshold map signal. It allows the network to compute any one of the components of ; the first bits of the input vector act as selector bits used to select which component of the map should be in the output.
Next, the capacity of networks with two hidden layers is handled by combining multiplexing with enrichment (Section 9). A fixed enrichment map whose existence is guaranteed by Theorem 3.6 is used to connect the first two layers of the network. The fact that the image of has large capacity gives us plenty of different threshold maps between the two hidden layers. Multiplexing is then used to preserve the multitude of functions in when they are composed with an output function .
Finally, to handle networks with arbitrarily many layers (Section 10), we stack three-layer networks in a particular way to ensure that: (1) they may perform computations independently; and (2) the number of nodes in each layer is at most . Figure 4 illustrates the stacking method. Then Theorem 3.1 can be deduced from the capacity analysis of three-layer networks and their stacking.
3.5. Paper roadmap
In Section 4, we give a few basic examples of threshold functions and maps. In Section 5, we derive upper bounds on the capacity of networks, and in particular the upper bound in Theorem 3.1. The reader interested only in the proof of Theorem 3.1 may then skip to Section 8. In Section 6, we develop combinatorial tools for the analysis of the capacity of sets. We use these tools in Section 7 to prove the main result on the capacity of subsets of the Boolean cube, Theorem 3.5. In Section 8, we develop the multiplexing technique and use it to estimate the capacity of networks with one hidden layer. In Section 9, we prove the Enrichment Theorem 3.6 and use it to handle networks with two hidden layers. In Section 10, we extend the results to arbitrarily many layers by stacking three-layer networks, and complete the proof of Theorem 3.1. In Section 11, we study networks with maximal or minimal capacity, and in particular prove a rigorous version of Corollary 3.4. Section 12 addresses the issue of structural regularization. Finally, several open questions are discussed in the conclusion (Section 14).
4. Useful examples of threshold maps
In this section we give several examples of threshold functions and threshold maps. These examples will become useful in the proofs of the main results.
Throughout this paper, the symbol denotes the direct sum. For two vectors and , the direct sum is obtained by concatenation of and . For two sets and , the direct sum is defined as:
[TABLE]
A similar notation is used for the direct sum of a set and a vector, for example:
[TABLE]
4.1. Examples of threshold functions
It is well known and trivial to prove that the Boolean negation not is a threshold function on , and the Boolean functions of variables and (), or (), and their negations nand and nor, are all threshold functions on . Note that the and operation amounts to checking whether all are equal to . The value is not special and can be replaced by any real number :
Lemma 4.1.
Consider the function on that checks whether the argument equals a given vector :
[TABLE]
Then is a Boolean threshold function, i.e. .
Proof.
We can assume without any loss of generality that , for otherwise is the zero function and trivially lies in . Let . Now, can be expressed as:
[TABLE]
and therefore is a threshold function. Indeed, if then and the right hand side of (4.1) is equal to . If , we consider two cases: and . It is easy to check that in each one of these cases, the argument of in (4.1) is or less, and thus . ∎
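The construction in this proof can be checked by brute force on a small cube. The sketch below is only an illustration: the weights $2u-1$, the bias $\tfrac12-\sum_i u_i$, the strict convention $h(t)=1$ for $t>0$, and all function names are assumptions of this example, chosen to be consistent with the argument above rather than copied from (4.1).

```python
from itertools import product

def heaviside(t):
    """Threshold activation: 1 if t > 0, else 0 (convention assumed here)."""
    return 1 if t > 0 else 0

def equals_gate(u):
    """Return (weights, bias) of a threshold unit that fires only on x == u."""
    weights = [2 * ui - 1 for ui in u]  # +1 where u_i = 1, -1 where u_i = 0
    bias = 0.5 - sum(u)                 # <weights, x> peaks at sum(u), only at x == u
    return weights, bias

def evaluate(weights, bias, x):
    """Apply the threshold unit to a Boolean vector x."""
    return heaviside(sum(w * xi for w, xi in zip(weights, x)) + bias)

def check_all(n):
    """Compare the gate with the indicator of {u} for all u, x in {0,1}^n."""
    for u in product((0, 1), repeat=n):
        w, b = equals_gate(u)
        for x in product((0, 1), repeat=n):
            if evaluate(w, b, x) != int(x == u):
                return False
    return True

print(check_all(4))  # prints True
```

At $x=u$ the pre-activation equals $\tfrac12$, and at any other Boolean point it is at most $-\tfrac12$, which is the case analysis used in the proof.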
Lemma 4.1 can be generalized one step further. It is possible to combine two operations into one threshold function: check whether the argument equals , and compute a given Boolean threshold function .
Lemma 4.2 (Adding a clause).
Consider a Boolean threshold function and a vector . Then the function
[TABLE]
is a Boolean threshold function, i.e. .
Proof.
We can assume without loss of generality that , for otherwise is the zero function and trivially lies in . Let . Express as:
[TABLE]
for suitable and . Choose any suitable constants and such that is in the interval for all the that satisfy , and in the interval for all the that satisfy . We claim that can be expressed as
[TABLE]
and therefore is a threshold function. Indeed, we have seen in the proof of Lemma 4.1 that the quantity is either equal to when , or at most for all other values of . It is then easy to check that when , we have , and when , we have . ∎
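A small numerical check of the clause construction may help. The sketch below uses a variant of the argument: instead of rescaling the pre-activation of the given threshold function into a bounded interval, it multiplies the equality test by a constant `big` that dominates that pre-activation. This variant, the ordering of the input as the argument of the function followed by the clause bits, and all names are assumptions of this example.

```python
from itertools import product

def heaviside(t):
    """Threshold activation: 1 if t > 0, else 0 (convention assumed here)."""
    return 1 if t > 0 else 0

def dot(w, z):
    return sum(wi * zi for wi, zi in zip(w, z))

def add_clause(a, b, u):
    """Given f(x) = h(<a, x> + b) on {0,1}^n and u in {0,1}^m, return the
    (weights, bias) of ONE unit on (x, y) computing f(x) if y == u, else 0."""
    n = len(a)
    # big dominates |<a, x> + b| over the whole cube.
    big = max(abs(dot(a, x) + b) for x in product((0, 1), repeat=n))
    big = max(big, 1)
    weights = list(a) + [big * (2 * ui - 1) for ui in u]
    bias = b - big * sum(u)
    return weights, bias

def verify(a, b, u):
    """Brute-force comparison with the intended function on all inputs."""
    n, m = len(a), len(u)
    w, c = add_clause(a, b, u)
    for x in product((0, 1), repeat=n):
        for y in product((0, 1), repeat=m):
            expected = heaviside(dot(a, x) + b) if y == tuple(u) else 0
            if heaviside(dot(w, x + y) + c) != expected:
                return False
    return True

# Majority on three bits, gated by the clause y == (1, 0).
print(verify((2, 2, 2), -3, (1, 0)))  # prints True
```

When the clause is satisfied the extra terms cancel and the unit reproduces the original pre-activation; otherwise they contribute at most `-big`, which forces the output to zero.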
Lemma 4.2 easily generalizes to functions computable by feedforward neural networks.
Lemma 4.3 (Adding a clause).
Consider a function and a vector . Then the function:
[TABLE]
satisfies .
The proof is elementary: it suffices to use the identity map on the additional coordinates up to the top layer.
4.2. Examples of threshold maps
Let us go over some basic examples of threshold maps. Obviously, these include the identity map on and all threshold functions. The next lemma gives a more interesting example.
Lemma 4.4 (Exponential map).
Fix an integer and let denote the canonical vector basis in . Then any one-to-one map:
[TABLE]
is a Boolean threshold map, i.e. .
Proof.
The components of the map trivially satisfy the following: equals if and $0$ otherwise. The last condition can be written as . Then Lemma 4.1 implies that is a threshold function, and hence is a threshold map. ∎
A specific example of in Lemma 4.4 is the exponential map, which interprets the input vector as a binary representation of a number and returns the binary representation of . For example, if , then:
[TABLE]
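As an illustration, the exponential map can be written out and compared with a layer of threshold units, one per output coordinate. The bit ordering (most significant bit first), the indexing of the output coordinates, and the function names below are assumptions of this sketch; the lemma itself allows any one-to-one assignment.

```python
from itertools import product

def exponential_map(x):
    """Map x in {0,1}^m to the canonical basis vector e_k of R^(2^m), where k is
    the integer with binary representation x (MSB first, an assumed convention)."""
    k = int("".join(map(str, x)), 2)
    out = [0] * (2 ** len(x))
    out[k] = 1
    return tuple(out)

def component_gate(i, m):
    """Threshold unit that fires only when the input encodes the integer i."""
    u = [int(bit) for bit in format(i, f"0{m}b")]
    return [2 * ui - 1 for ui in u], 0.5 - sum(u)

def threshold_version(x):
    """Evaluate the same map using one threshold unit per output coordinate."""
    m = len(x)
    out = []
    for i in range(2 ** m):
        w, b = component_gate(i, m)
        out.append(1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0)
    return tuple(out)

print(exponential_map((1, 0)))  # prints (0, 0, 1, 0)
print(all(exponential_map(x) == threshold_version(x)
          for x in product((0, 1), repeat=3)))  # prints True
```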
5. Capacity of networks: upper bounds
In this section, we prove general upper bounds on the capacity of neural networks, from which the upper bound in Theorem 3.1 will follow as a special case. The results rely on the following key remark.
Remark 5.1.
The capacity of a network is always upper bounded by the sum of the capacities of its neurons. However, in general this is a weak bound due to the restrictions in capacity imposed by bottleneck layers. To see this, consider two consecutive layers and . In principle, a unit in layer could have capacity of the order of by Theorem 2.11. However, if there is a layer with , for any setting of the weights, the units in layer can only take at most values, rather than . By Lemma 2.7, this will reduce the capacity of a unit in layer to be at most of the order of instead of . The same effect is seen if the values of the input layer are restricted.
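The bottleneck effect described in this remark can be made concrete with a few lines of arithmetic. The sketch below assumes that the cubic polynomial of Theorem 3.1 has the form $\sum_{k=1}^{L-1}\min(n_1,\dots,n_k)\,n_k\,n_{k+1}$; this form is an assumption of the example, since the formula is not reproduced in this section, and absolute constant factors are ignored. The function name is hypothetical.

```python
def capacity_estimate(layer_sizes):
    """Assumed cubic estimate: sum over k of min(n_1..n_k) * n_k * n_{k+1}.

    Constant factors are ignored; this only illustrates how a narrow layer
    caps the contribution of every later layer.
    """
    total = 0
    bottleneck = layer_sizes[0]
    for k in range(len(layer_sizes) - 1):
        bottleneck = min(bottleneck, layer_sizes[k])  # narrowest layer so far
        total += bottleneck * layer_sizes[k] * layer_sizes[k + 1]
    return total

print(capacity_estimate([10, 10, 10, 1]))  # prints 2100
print(capacity_estimate([10, 2, 10, 1]))   # prints 260
```

In the second architecture the layer of size 2 limits the terms contributed by all subsequent layers, even though the third layer is as wide as in the first architecture.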
Proposition 5.2 (Capacity formula: upper bounds).
For any and , , the following holds. Consider a finite set and let . Then:
[TABLE]
In particular, we have:
[TABLE]
Proof.
The proof is by induction. First consider the case where . Using property 5 in Lemma 2.5, the capacity bound from Lemma 2.7, and the assumptions on , we see that:
[TABLE]
which is the claimed bound. Assume the property is true for layers. To prove it for , just apply Remark 5.1 noting that the top layer contains units, and the capacity of each unit is at most . This completes the proof of the first inequality. The second inequality is obtained simply by letting (i.e. ). ∎
The assumption that each non-output layer should have at least four neurons can be removed from Proposition 5.2 at the cost of an absolute constant factor in the capacity formula.
Corollary 5.3 (Upper bound in Theorem 3.1).
For any and any , we have
[TABLE]
Proof.
Apply Proposition 5.2 for the capacity and note that can only be smaller. ∎
In summary, we have derived the general upper bound associated with Theorem 3.1, and shown that the upper bound holds with an absolute and optimal constant factor of in the general case, where each non-output layer has at least four nodes.
Finally, let us note that the same capacity bound holds if we extend the network by adding a single output node.
Corollary 5.4 (Adding an output node).
For any and any , we have
[TABLE]
The argument to prove this result is the same as the argument used to prove Corollary 3.2.
6. Capacity of product sets: slicing
Now that we have good upper bounds on the capacity of sets and neural networks, we turn to the lower bounds. In Lemma 2.9 we noted the elementary lower bound:
[TABLE]
which is valid for any set . We observed in Remark 2.10 that this bound is in general tight. Nevertheless, it can often be improved if additional information about the set is available. In Section 7, we will show that if is a subset of the Boolean cube, then the lower bound in (6.1) can be significantly improved. In this section, we develop general combinatorial tools that will be needed to derive the improved lower bound.
Early lower bounds on , and in particular the lower bound in (2.2), were based on simple combinatorial considerations and induction [18, 1]. In this section, we extend these combinatorial methods in order to be able to handle capacities of arbitrary sets . Although the methods can be applied to any subset , the best results are obtained when has a product structure, as explained below.
6.1. Slicing
The following theorem relates the capacity of a general set to the capacities and cardinalities of the slices of . A slice is obtained by fixing the values of certain coordinates. For example, the elements of whose first four coordinates are form a slice of . By monotonicity, the capacity of is lower bounded by the capacity of any slice of . This trivial bound can be boosted if, in addition, other slices have many points. Let us show this.
Theorem 6.1 (Slicing).
Let be a linearly independent set of vectors, and let be arbitrary finite sets. Consider the subset whose fibers at are , i.e. let:
[TABLE]
Then the number of threshold functions on satisfies:
[TABLE]
The proof of Theorem 6.1 is based on a lifting trick. Given a vector and a set , let us denote by the set of all functions that can be expressed as:
[TABLE]
for some . Thus, the functions in are obtained by “cloning”, i.e. by fixing and varying a single parameter – the bias .
Lemma 6.2.
If a vector separates the points of a finite set , then:
[TABLE]
i.e. every has exactly different clones.
Proof.
Let , where the points are ordered so that the sequence is increasing with . Now increase continuously from to . As crosses a point , the function $f_{a,\alpha}(x)=h\big(\langle a,x\rangle+\alpha\big)$ changes (since it changes its value on from $0$ to $1$), and there are no other points where changes. The crossover points partition into intervals. Each interval corresponds to a different function . Thus, the total number of such functions is . ∎
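The counting argument in this proof is easy to reproduce numerically. The sketch below sweeps the bias over one representative value per interval and counts the distinct functions obtained on a finite set. The strict threshold convention and the function names are assumptions of this example.

```python
from itertools import product

def count_clones(a, points):
    """Count distinct functions x -> h(<a, x> + alpha) on `points` as alpha varies."""
    scores = [sum(ai * xi for ai, xi in zip(a, x)) for x in points]
    cuts = sorted(-s for s in scores)  # bias values at which some point flips
    # One candidate bias below all cuts, one between consecutive cuts, one above.
    candidates = [cuts[0] - 1]
    candidates += [(lo + hi) / 2 for lo, hi in zip(cuts, cuts[1:])]
    candidates += [cuts[-1] + 1]
    functions = {tuple(1 if s + alpha > 0 else 0 for s in scores)
                 for alpha in candidates}
    return len(functions)

cube = list(product((0, 1), repeat=3))
print(count_clones((1, 2, 4), cube))  # separating vector: prints 9 = |S| + 1
print(count_clones((1, 1, 1), cube))  # non-separating vector: prints 5
```

The second call shows why the separation assumption matters: when several points share the same inner product with the vector, they flip together and fewer clones are produced.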
The lifting trick described in the next lemma allows us to combine given clones of into a single threshold function on a larger domain.
Lemma 6.3 (Lifting).
Let be a linearly independent set of vectors, and let be arbitrary finite sets. Fix and consider any functions , . Then we can find a function such that:
[TABLE]
Proof.
By definition, the functions can be expressed as:
[TABLE]
Since the vectors are linearly independent, there exist such that:
[TABLE]
For any vectors and , define:
[TABLE]
Obviously, is a threshold function on , and by restricting it to we can say that . Now:
[TABLE]
The lemma is proved. ∎
Proof of Theorem 6.1.
For each , let us choose and fix a vector so that (6.2) holds. Moreover, we can always choose so that it separates the points of , i.e. so that for any distinct pair of points . (This can be done by perturbing slightly. Such a perturbation does not change the function but allows us to separate the points.)
Consider the set of all -tuples of functions where for each . Each such tuple consists of a function and some “clones” of . Due to (6.3), each clone in the tuple can be chosen in exactly ways. Thus the number of all such tuples is:
[TABLE]
Lemma 6.3 implies that different tuples produce different liftings . Indeed, one can uniquely recover the tuple from the fibers of .
Summarizing, is lower bounded by the number of different liftings , which in turn is lower bounded by the number of different tuples , which finally is lower bounded by the expression in (6.4), completing the proof of Theorem 6.1. ∎
6.2. Product sets
Theorem 6.1 is especially useful when is a product of sets.
Corollary 6.4 (Capacity of product sets).
Let and be finite sets. If is linearly independent, then:
[TABLE]
Proof.
If , we can write . Applying Theorem 6.1 for , we get:
[TABLE]
Taking logarithms of both sides completes the proof. ∎
By induction, this bound extends to products of arbitrarily many sets.
Corollary 6.5 (Capacity of product sets).
Assume is the product of copies of a linearly independent subset with . Then:
[TABLE]
Proof.
Apply Corollary 6.4 for the sets and , whose cardinalities are and , and get:
[TABLE]
Apply Corollary 6.4 again for . Continuing in this way times, we obtain:
[TABLE]
Now:
[TABLE]
as , as , and . Substituting this into (6.5) completes the proof. ∎
Remark 6.6 (Relaxing the linear independence assumption).
In the main results of this section, we assumed that the set is linearly independent. This could be relaxed by assuming only that the set:
[TABLE]
be linearly independent.
To see this, modify the argument in the Lifting Lemma 6.3 as follows. Since the vectors are linearly independent, there exists a vector such that for all . Now define .
6.3. Totally separated sets
In addition to product sets, Theorem 6.1 can easily be specialized to totally separated sets.
Definition 6.7.
Two subsets and of are totally separated if they lie in two different parallel hyperplanes of .
Lemma 6.8.
Let . If and are totally separated subsets of then:
[TABLE]
Proof.
By affine invariance, we may assume that is a normal vector to both hyperplanes in which and lie. Thus, we can express:
[TABLE]
for some distinct numbers and sets . Moreover, since , the vectors and are linearly independent in . Applying Theorem 6.1 together with Remark 6.6 yields:
[TABLE]
The last step follows from the affine invariance of the capacity. ∎
Example 6.9 (Capacity of the Boolean cube).
Let us apply Lemma 6.8 to the Boolean cube . This cube splits naturally into two totally separated copies of formed by opposite faces of . Using Lemma 6.8 and taking logarithms of both sides, we get:
[TABLE]
By induction, this gives:
[TABLE]
This recovers the lower bound given in Theorem 2.11.
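For very small dimensions, the number of threshold functions on the Boolean cube can be counted directly, which gives a sanity check on the growth of the capacity (the binary logarithm of that number). The sketch below enumerates integer weights in a small range together with half-integer biases. The weight range, the bias grid, and the function name are assumptions of this example; in general such a grid search only yields a lower bound, although for the dimensions printed here it recovers the known counts.

```python
from itertools import product
from math import log2

def count_threshold_functions(n, max_weight=2):
    """Count distinct functions h(<w, x> + b) on {0,1}^n over a small grid.

    Weights range over integers in [-max_weight, max_weight]; biases are
    half-integers, which avoids ties at zero.
    """
    cube = list(product((0, 1), repeat=n))
    weight_values = range(-max_weight, max_weight + 1)
    biases = [k + 0.5 for k in range(-n * max_weight - 1, n * max_weight + 1)]
    found = set()
    for w in product(weight_values, repeat=n):
        for b in biases:
            found.add(tuple(
                1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
                for x in cube))
    return len(found)

for n in (1, 2, 3):
    count = count_threshold_functions(n)
    print(n, count, round(log2(count), 2))
# prints:
# 1 4 2.0
# 2 14 3.81
# 3 104 6.7
```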
7. Capacity of general sets
At the beginning of Section 6 we stated that the simple lower bound:
[TABLE]
which is valid for any finite set , can be significantly improved if we assume that lies in the Boolean cube . The following result (restating Theorem 3.5) gives such an improvement.
Theorem 7.1 (Capacity of a set).
The capacity of any set satisfies:
[TABLE]
Before we prove this result, let us note that this bound is generally tight for any magnitude of , up to an absolute constant factor. Indeed, consider the cube as a subset of . Theorem 7.1 gives:
[TABLE]
which matches the bound in Theorem 2.11.
7.1. A hierarchical decomposition
To prove Theorem 7.1, we are going to construct a hierarchical decomposition of into totally separated sets (introduced in Section 6.3). The next lemma defines the decomposition, and Lemma 6.8 will be used to keep track of the change in capacity at each step.
Lemma 7.2 (A totally separated partition).
Any set that consists of more than one point can be partitioned into two non-empty totally separated subsets and .
Proof.
Choose a pair of distinct points . They must differ in at least one coordinate , and without loss of generality we may assume that and . Let the set consist of all points of whose -th coordinate equals $0$, and consist of all points of whose -th coordinate equals $1$. Then the sets and form a partition of and they are non-empty since and . ∎
We use the following procedure to decompose into a tree of totally separated subsets. First, let:
[TABLE]
If , stop. Otherwise Lemma 7.2 gives a partition:
[TABLE]
where and are totally separated sets. If , stop. Otherwise Lemma 7.2 gives a partition:
[TABLE]
where and are totally separated sets. Generally, after partitioning steps, we check if and if so, we stop. Otherwise Lemma 7.2 gives a partition:
[TABLE]
where and are totally separated sets.
Since at each step the set becomes strictly smaller, the iterative construction must terminate after a finite number of steps, when we have:
[TABLE]
Figure 2 may help to visualize the decomposition process.
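The decomposition procedure can also be sketched in code. The version below splits on the first coordinate where two points of the current set differ, as in Lemma 7.2, and then continues with the larger of the two parts until a single point remains. The rule for choosing which part to continue with, the stopping condition, and the function names are assumptions of this illustration.

```python
def split(points):
    """Split a set of distinct Boolean vectors on a coordinate where two differ."""
    x, y = points[0], points[1]
    i = next(j for j in range(len(x)) if x[j] != y[j])
    zero_side = [p for p in points if p[i] == 0]
    one_side = [p for p in points if p[i] == 1]
    return i, zero_side, one_side

def decompose(points):
    """Return a list of steps (coordinate, kept_part, split_off_part).

    Each step splits the current set into two parts lying in the parallel
    hyperplanes x_i = 0 and x_i = 1, so the parts are totally separated.
    """
    current = list(points)
    steps = []
    while len(current) > 1:
        i, zero_side, one_side = split(current)
        if len(zero_side) >= len(one_side):
            keep, leaf = zero_side, one_side
        else:
            keep, leaf = one_side, zero_side
        steps.append((i, keep, leaf))
        current = keep  # assumption: continue with the larger part
    return steps

S = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
for i, keep, leaf in decompose(S):
    print(i, len(keep), len(leaf))
# prints:
# 1 3 2
# 0 2 1
# 2 1 1
```

Since every step splits off a non-empty part, the loop terminates after at most $|S|-1$ steps, matching the termination argument above.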
There are two overlapping situations where a hierarchical decomposition of automatically yields a good lower bound on the capacity of : (1) when the tree is tall, i.e. is large; and (2) when many “leaves” are not too small. The following lemma quantifies this statement.
Lemma 7.3 (Hierarchical decomposition and capacity).
In the hierarchical decomposition described above, one has:
[TABLE]
Proof.
Applying Lemma 6.8 for decompositions (7.1) and (7.2), we get:
[TABLE]
Continuing in this way, after steps we get:
[TABLE]
since at the last step and thus . To get the first conclusion of the lemma, note that and take the logarithm of both sides of (7.4). To get the second conclusion, note that and finish similarly. ∎
7.2. Proof of Theorem 7.1
Let:
[TABLE]
If , the conclusion of the theorem follows from the trivial capacity bound in Lemma 2.9:
[TABLE]
Thus in the rest of the proof we can assume that:
[TABLE]
Step 1. Stopping criterion.
Consider the hierarchical decomposition of constructed in Section 7.1. We will need only the initial portion of that tree decomposition, where the sets are still large. Specifically, let be the smallest integer such that:
[TABLE]
our argument will focus on the sets and for only. Note that:
[TABLE]
The upper bound is trivial. To check the lower bound, recall that:
[TABLE]
The definition of then yields .
Step 2. Tall trees.
If then the conclusion of the theorem follows from the first bound in Lemma 7.3. Indeed, in this case we have:
[TABLE]
Thus, in the rest of the proof we may assume that:
[TABLE]
Step 3. Decomposition proportions
Recall that in the hierarchical decomposition (7.3), the set is partitioned into two sets and . Let and denote the proportions of these sets (Figure 2), i.e.
[TABLE]
The condition in (7.3) implies that:
[TABLE]
By induction, we have:
[TABLE]
Let us use this identity for . By the stopping criterion (7.6), and since , we have:
[TABLE]
To get the last bound we used the numerical inequality , which is valid for all ; we can apply it since for all . Rearranging the terms in the bound (7.9) gives:
[TABLE]
As a consequence, there must be many that are not too small. Specifically, consider the subset of indices defined by:
[TABLE]
We claim that:
[TABLE]
Indeed, according to (7.10), we have:
[TABLE]
There are terms in the first sum, all of which are bounded by . There are at most terms in the second sum, all of which are bounded by according to the definition of . Therefore:
[TABLE]
Solving this inequality gives , as claimed in (7.11).
Step 4. Short trees.
We are going to use the second bound in Lemma 7.3. To apply it effectively, we will first show that all the sets for are not too small. So, fix an and recall that by the definition of the proportions , we have:
[TABLE]
Since , one has: . Furthermore considering that , together with the definition of the stopping time , yields . Thus:
[TABLE]
In the second bound we use the assumption from (7.7), and in the last bound we use the assumption from (7.5).
Now we are ready to apply the second bound in Lemma 7.3. It gives in particular:
[TABLE]
There are at least terms in this sum due to (7.11), each bounded below by according to (7.12). It follows that:
[TABLE]
completing the proof of Theorem 7.1. ∎
Although Theorem 7.1 gives a bound that is generally tight, for many subsets it can be improved even further. We address this phenomenon in Section 9, where we study the enrichment transformation as a way of increasing the capacity.
8. Networks with one hidden layer: multiplexing
Starting from this section, we focus on networks with at least one hidden layer. The ultimate goal is to prove the tight lower bound on their capacity stated in Theorem 3.1. But for now, we begin with a more basic question. For a given input set , can we relate the capacity of the network with one hidden layer to the capacity of the set ? It is easy to derive a simple upper bound.
Proposition 8.1 (The effect of a hidden layer: upper bound).
For any and any finite set , we have:
[TABLE]
Proof.
The argument is similar to the proof of Proposition 5.2. We need to count all functions of the form where and , and where:
[TABLE]
The cardinality of the image of is bounded by the cardinality of its domain, so:
[TABLE]
There are functions , and for each there are functions . Thus the total number of compositions is:
[TABLE]
where the maximum is taken over all subsets with cardinality at most . Taking logarithms on both sides gives:
[TABLE]
Property 5 of Lemma 2.5 gives:
[TABLE]
Furthermore, using the capacity bound from Lemma 2.7, the assumption on , and Lemma 2.9, we see that:
[TABLE]
Substituting these two bounds in (8.1) completes the proof. ∎
We can interpret Proposition 8.1 as a result that compares capacities of single-output and multiple-output networks. Indeed, due to part 5 of Lemma 2.5, the bound in Proposition 8.1 states that:
[TABLE]
What about the converse: can channeling the output through a single node substantially reduce the capacity of a network? In principle, it can. Indeed, is always bounded by , the logarithm of the total number of binary functions on , while is always bounded below by due to Lemma 2.9. Thus, whenever , we necessarily have:
[TABLE]
Nevertheless, we will now show how to prevent the collapse in capacity by modifying a little – namely, by adding just bits to the input.
Theorem 8.2 (The effect of a hidden layer: lower bound).
Let be a finite set. Let and . Then:
[TABLE]
The proof of this theorem is based on a multiplexing technique, which allows one to transmit output functions through a single channel. To describe this technique, fix an arbitrary injective map:
[TABLE]
where .
Lemma 8.3 (Multiplexing).
Let be a finite set. Then, for any function , we can construct a function such that:
[TABLE]
if for some .
Note that the injectivity of guarantees that there exists at most one that satisfies (8.2).
Proof.
Define:
[TABLE]
and:
[TABLE]
This definition and the injectivity of ensure that (8.2) holds. Moreover, each is a threshold function according to Lemma 4.2, i.e. . Since the or operation () is also a threshold function, it follows that: . ∎
Proof of Theorem 8.2.
The Multiplexing Lemma 8.3 implies that the transformation is an injective map from into . Indeed, (8.2) allows one to uniquely recover all the threshold functions and thus from . Thus:
[TABLE]
Taking logarithms on both sides completes the proof. ∎
Specializing the result to , yields a tight bound on the capacity of networks with a single hidden layer.
Corollary 8.4 (Capacity of networks with a single hidden layer).
If , then:
[TABLE]
Moreover, if and , then:
[TABLE]
Proof.
The upper bound in (8.4) is a special case of Corollary 5.4. To prove the asymptotic upper bound in (8.5), note that if , Proposition 5.2 gives
[TABLE]
Furthermore, we have if .
To obtain the lower bounds, apply Theorem 8.2 for . It gives:
[TABLE]
where in the last step we used the assumption that . This proves the first part of the corollary. The second part follows from the same argument and the assumption that . ∎
Figure 3 illustrates the multiplexing technique of Lemma 8.3. The additional input bits that form the vector act as selector bits. These bits are used to select any one of the functions to be the final output of the network. Since , the selector is very small and usually does not interfere with the capacity count.
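The multiplexing network can be simulated on a toy example. In the sketch below, each hidden unit computes one given threshold function gated by a clause on the selector bits (in the spirit of Lemma 4.2), and the output unit takes the or of the hidden units. The specific gating weights, the placement of the selector bits after the original input, the three example functions, and all names are assumptions of this illustration.

```python
from itertools import product

def heaviside(t):
    """Threshold activation: 1 if t > 0, else 0 (convention assumed here)."""
    return 1 if t > 0 else 0

def dot(w, z):
    return sum(wi * zi for wi, zi in zip(w, z))

def clause_gate(a, b, u):
    """One unit on (x, y) computing h(<a, x> + b) if y == u, else 0."""
    n = len(a)
    big = max(1, max(abs(dot(a, x) + b) for x in product((0, 1), repeat=n)))
    return list(a) + [big * (2 * ui - 1) for ui in u], b - big * sum(u)

def multiplex(hidden_units, selectors, x, y):
    """Hidden layer of gated units followed by a single OR output unit."""
    z = tuple(x) + tuple(y)
    hidden = []
    for (a, b), u in zip(hidden_units, selectors):
        w, c = clause_gate(a, b, u)
        hidden.append(heaviside(dot(w, z) + c))
    return heaviside(sum(hidden) - 0.5)  # OR is itself a threshold function

# Three threshold functions on {0,1}^2: AND, OR, NOT x1.
hidden_units = [((1, 1), -1), ((1, 1), 0), ((-1, 0), 1)]
selectors = [(0, 0), (0, 1), (1, 0)]  # injective assignment of selector codes

def verify_multiplexing():
    """The selector code of unit i must route f_i(x) to the output."""
    for (a, b), u in zip(hidden_units, selectors):
        for x in product((0, 1), repeat=2):
            if multiplex(hidden_units, selectors, x, u) != heaviside(dot(a, x) + b):
                return False
    return True

print(verify_multiplexing())  # prints True
```

Because every component of the hidden map can be read off from the single output by choosing the selector, distinct hidden maps give distinct network functions, which is the injectivity used in the proof of Theorem 8.2.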
9. Networks with two hidden layers: enrichment
9.1. Enrichment
A key recurrent question in this paper is: what is the relation between the capacity and cardinality of a general set ? Lemma 2.7 and Theorem 7.1 established upper and lower bounds that are generally best possible:
[TABLE]
The lower bound, however, is sometimes too weak for practical applications, particularly for the forthcoming analysis of networks with two hidden layers. One may wonder if the capacity of can be increased by first preprocessing . In particular, can we transform into a set whose capacity is significantly larger, ideally as large as the upper bound in (9.1) allows? In doing so, we would like to stay in the category of subsets of the Boolean cube and use only transformations that a neural network can compute. Thus, we require the enrichment map to be a threshold map , i.e. a map from to all of whose components are threshold functions. We address the enrichment problem in the particular case where , leaving the general case of for future work.
Theorem 9.1 (Enrichment).
Let and be positive integers satisfying . There exists an injective linear threshold map such that:
[TABLE]
Let us make two remarks before proving this result. First, the map transforms the cube into an “enriched” version . The enriched set has the same cardinality as and almost the largest possible capacity:
[TABLE]
which matches the upper bound in (9.1) in dimension . Second, note also that an upper bound associated with Theorem 9.1 holds for any map . This follows straight from Lemma 2.7. Indeed, is a subset of and has cardinality , so:
[TABLE]
The non-trivial part in Theorem 9.1 is the lower bound. Our construction of will be based on sparsity considerations.
9.2. Construction of the enrichment map
Let be a positive integer and be the canonical basis vectors of . Fix any one-to-one map:
[TABLE]
According to Lemma 4.4, . Define the enrichment map by applying to each block of successive coordinates of . For to be well defined, the length of the blocks must satisfy the equation:
[TABLE]
as both sides of the equation determine the number of blocks. Assume for now that this equation has an integer solution , and let us prove the theorem in this ‘balanced’ case. The general case will be considered in Sections 9.4–9.5.
For this, we partition a vector into vectors , each containing a block of successive coordinates of length :
[TABLE]
and define:
[TABLE]
Since is a Boolean threshold map, is a threshold map too, i.e. as required.
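The block construction can be illustrated on a small balanced instance. The sketch below applies a one-hot map to each block of $q$ successive coordinates, so that an input of length $n$ is sent to an output of length $(n/q)\,2^q$. The particular one-to-one map (binary indexing, most significant bit first) and the function names are assumptions of this example.

```python
from itertools import product

def one_hot(block):
    """One-to-one map from {0,1}^q onto the canonical basis vectors of R^(2^q)."""
    k = int("".join(map(str, block)), 2)
    out = [0] * (2 ** len(block))
    out[k] = 1
    return out

def enrich(x, q):
    """Apply the one-hot map to each block of q successive coordinates of x."""
    assert len(x) % q == 0, "balanced case: q must divide the input length"
    out = []
    for start in range(0, len(x), q):
        out.extend(one_hot(x[start:start + q]))
    return tuple(out)

n, q = 4, 2
images = {enrich(x, q) for x in product((0, 1), repeat=n)}
print(enrich((0, 1, 1, 0), q))         # prints (0, 1, 0, 0, 0, 0, 1, 0)
print(len(images))                     # prints 16, so the map is injective
print({sum(image) for image in images})  # prints {2}: one nonzero per block
```

The images are sparse, with exactly one nonzero coordinate per block, which is the product structure exploited in the balanced case of the proof.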
9.3. Proof of Theorem 9.1 in the balanced case
By construction, the image of consists of copies of the image of :
[TABLE]
Next, recall that the image of is the set of canonical vectors of , i.e.
[TABLE]
Let us apply Corollary 6.5. Since is linearly independent, , and by the assumptions on , the corollary can be applied. This application gives:
[TABLE]
This proves Theorem 9.1 in the special balanced case, where the equation (9.2) has an integer solution . Note that the argument so far did not use the assumption of the theorem; this assumption is used next in order to address the general (unbalanced) case. ∎
9.4. Balancing
The following lemma shows how to adjust and so that the equation (9.2) has an integer solution .
Lemma 9.2.
Let and be positive integers such that . Then there exist integers , , and such that:
[TABLE]
Proof.
We claim that:
[TABLE]
To show this, consider the function defined by:
[TABLE]
It is easy to check that increases to infinity on the interval . Since by assumption, the intermediate value theorem guarantees the existence of a point where . Equivalently, the equation (9.4) has a solution . To give an upper bound on , note that by the assumptions on and we have:
[TABLE]
Since is an increasing function on the interval and both and lie in this interval, it follows that . This verifies our claim.
Now define:
[TABLE]
Then the identity (9.3) obviously holds. Next, we must check the ranges for , and .
By the definition of , we have since and . Thus , as required.
By the definition of , we have: and:
[TABLE]
where the last bound holds since . Thus , as required.
Finally, by the definition of , we have:
[TABLE]
where the middle bound holds since increases on the interval , both and lie in that interval, and . The last bound in (9.5) follows from (9.4).
As for the lower bound on , the definition of yields:
[TABLE]
Thus, in short, and the proof of the lemma is complete. ∎
9.5. Proof of Theorem 9.1 in full generality
Without any loss of generality, we can assume that:
[TABLE]
Indeed, for the conclusion of the theorem is trivially true by adjusting the implicit absolute constant factors. In the range , we can use the identity embedding and get the conclusion from Theorem 7.1 or Theorem 2.11.
This allows us to apply Lemma 9.2. Let , and be the numbers from the conclusion of that lemma. Then there exists a map such that:
[TABLE]
This follows from the balanced case of the theorem we proved in Sections 9.2–9.3, by replacing and with and in that argument.
Now extend to a map using the identity function. More formally, partition each vector as:
[TABLE]
and define by:
[TABLE]
Here is a padding vector of zeros, which we add in order to make consist of exactly coordinates.
We must check that is well defined. The vector consists of coordinates and consists of coordinates. In order for the concatenation of these two vectors to fit in , we must have: . This is indeed the case since and by Lemma 9.2.
Since both and the identity map are injective threshold maps, the map is an injective threshold map too. By construction, the projection of onto the first coordinates equals . Therefore:
[TABLE]
where we used (9.6) and Lemma 9.2. This completes the proof of Theorem 9.1. ∎
9.6. Capacity of networks with two hidden layers
As an application of the Enrichment Theorem 9.1, we can estimate the capacity of networks with two hidden layers.
Theorem 9.3 (Two hidden layers).
If , , and , then:
[TABLE]
Proof.
The upper bound follows as a special case of Corollary 5.4. To prove the lower bound, let us first only assume that:
[TABLE]
Then we can obtain the term by comparing the network with the network. Indeed, we have:
[TABLE]
Next, we consider two cases: and .
Case 1:
In this regime, we can compare the two-hidden-layers network with the single-hidden-layer network . Just like above, using monotonicity, contractivity, and Corollary 8.4, we get:
[TABLE]
Case 2:
In this regime, we use both enrichment and multiplexing. The first assumption in (9.7) yields , which allows us to use Theorem 9.1. Fix an enrichment map whose existence is guaranteed by Theorem 9.1. Applying part 6 of Lemma 2.5, for the map that belongs to the class and for , we obtain:
[TABLE]
Putting everything together
In summary, we showed that is always bounded below by , and is also bounded below by if , and by if . This means that:
[TABLE]
The last two assumptions in (9.7) state that: and . Thus monotonicity gives:
[TABLE]
Recall that we proved this result under the assumptions (9.7), which are weaker than those in the statement of the theorem. Applying this result for instead of , and for instead of , completes the proof. ∎
Theorem 9.4 (Two hidden layers, multiple outputs).
If and , then:
[TABLE]
Proof.
The upper bound is a special case of Corollary 5.3. For the lower bound, we can essentially repeat the proof of Theorem 9.3 except for the multiplexing in the last step, which is not needed in this case. Instead, we can just use part 6 of Lemma 2.5 followed by the Enrichment Theorem 9.1 and get:
[TABLE]
The proof is complete. ∎
10. Networks with arbitrarily many layers: stacking
Now we extend the capacity lower bounds to feedforward networks with arbitrarily many layers, thus completing the proof of the main result (Theorem 3.1). Denote:
[TABLE]
Let us handle networks with three hidden layers first.
Lemma 10.1 (Three hidden layers).
Let for all . Then:
[TABLE]
Proof.
The upper bound is a special case of Corollary 5.4. As for the lower bound, monotonicity, contractivity (Lemma 2.5), and Theorem 9.3 yield:
[TABLE]
and also:
[TABLE]
Combining the two lower bounds, we conclude that:
[TABLE]
The proof is complete. ∎
10.1. Stacking
In principle, networks with arbitrarily many layers can be handled by a similar argument. However, instead of producing the sum over the layers claimed by Theorem 3.1, this argument will only produce the maximum over the layers. The maximum can be replaced with the sum by paying a factor of , which is weaker than the constant factor claimed in Theorem 3.1. Thus, to overcome this limitation, we develop a stacking technique and prove the following.
Lemma 10.2 (Four and more hidden layers).
Assume that and for all . Then:
[TABLE]
Proof.
To prove the lemma, we will compare the network with a smaller network, which we construct by “stacking” three-layer modules, and doing multiplexing in each one of them.
Step 1. Construction of the network
Fix an arbitrary injective map
[TABLE]
Consider arbitrary functions
[TABLE]
Lemma 4.2 states that the function
[TABLE]
belongs to , where we let:
[TABLE]
Now connect three-layer modules , , as shown in Figure 4. In that figure, denotes the coordinate projection onto that retains the first coordinates of a vector. Given an input , the first module computes in layer , and it passes to layer as the input to the second module. The second module computes in layer , then takes the ‘or’ with the output of the first module in layer , thereby computing ; it also passes to layer as the input to the third module, etc. Continuing in this way, we see that the network ultimately computes and outputs the function:
[TABLE]
Step 2. Estimating the capacity of the network using the capacities of modules
Now that we described the architecture, let us estimate how many Boolean functions the architecture can compute. Let us denote the set of all such computable functions by . By definition of the functions in (10.2) and in (10.4), we have:
[TABLE]
if for some (the injectivity of guarantees that there exists at most one that satisfies (10.5)). This implies that the map is an injective transformation from to . (Indeed, Equation (10.5) allows one to uniquely recover all and thus from .) Therefore:
[TABLE]
The right hand side can be estimated using the tight bounds on the capacity of three-layer networks from Theorem 9.3. Note that the conditions on guarantee that the assumptions of Theorem 9.3 are satisfied. We obtain:
[TABLE]
Step 3. Counting nodes
As is evident from Figure 4, the overall architecture has layers of units (not counting the output). The number of nodes in the -th layer of units, , is bounded by:
[TABLE]
Hence, by monotonicity:
[TABLE]
Combining this with the lower bound (10.6), we conclude that:
[TABLE]
Step 4. Adding one term to the sum
To complete the proof, we just need to add one last term to this sum. We can get it by comparison with a three-layer network . Indeed, monotonicity, contractivity (Lemma 2.5), and Theorem 9.3 give:
[TABLE]
Combining this with (10.7), we conclude that:
[TABLE]
This completes the proof of the Lemma. ∎
10.2. The lower bound in Theorem 3.1
Now we prove a special case of Theorem 3.1 for networks with a single output node:
Theorem 10.3.
Under the conditions of Theorem 3.1, we have:
[TABLE]
Proof.
We already proved the upper bound on the capacity in Corollary 5.4. The lower bound follows from Corollary 8.4 for a single hidden layer, Theorem 9.3 for two hidden layers, Lemma 10.1 for three hidden layers, and Lemma 10.2 for four and more layers, applied using instead of . (Precisely, the assumptions of Theorem 3.1 yield: . Dividing both sides by and taking the integer part, we get: . This means that Lemma 10.2 can indeed be applied using instead of .) ∎
Finally, we are ready to complete the proof of the main result:
Proof of Theorem 3.1.
The upper bound was already proven in Proposition 5.2. It remains to prove the lower bound. For , the result follows from Theorem 2.11, which gives:
[TABLE]
Now let . Monotonicity and Theorem 10.3 yield:
[TABLE]
To complete the proof, we need to add just one last term to this sum. We can get it by comparison with a three-layer network . Indeed, monotonicity, contractivity (Lemma 2.5), and Theorem 9.4 give:
[TABLE]
Combining this with (10.7), we conclude that:
[TABLE]
This completes the proof. ∎
10.3. Why are rapidly expanding networks excluded?
We stated Theorem 3.1 under the assumption that the network is not expanding too rapidly, as quantified by requiring that:
[TABLE]
It is worth noting that this requirement is almost optimal. To see this, note first that the number of all Boolean functions on is . This yields the trivial upper bound:
[TABLE]
Combining it with the lower bound given by Theorem 3.1 (and Corollary 3.2), we get
[TABLE]
Thus, in order for Theorem 3.1 to hold, we must have:
[TABLE]
In particular, if all for are of the same order (e.g. equal to each other), we must have:
[TABLE]
This shows that the condition (10.8) cannot be removed and that it has an almost optimal form.
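The trivial upper bound used in this argument can be checked by brute force for very small inputs. The sketch below is not part of the paper. It assumes a single linear threshold unit with the convention that the unit outputs 1 exactly when its activation is positive, and it searches a small illustrative grid of integer weights and half-integer biases (the function name and the grid are hypothetical choices). Since there are 2^(2^n) Boolean functions on n binary inputs, the base-two logarithm of any count found this way cannot exceed 2^n.

```python
from itertools import product
from math import log2


def count_threshold_functions(n, max_weight=2):
    """Count distinct Boolean functions on {0,1}^n found for one threshold unit.

    Weights range over integers in [-max_weight, max_weight] and biases over
    half-integers. This grid is an illustrative assumption; a finite grid can
    only undercount the true number of threshold functions.
    """
    inputs = list(product((0, 1), repeat=n))
    weights = range(-max_weight, max_weight + 1)
    # Half-integer biases keep the activation away from zero on integer inputs.
    biases = [k + 0.5 for k in range(-n * max_weight - 1, n * max_weight + 1)]
    functions = set()
    for w in product(weights, repeat=n):
        for b in biases:
            truth_table = tuple(
                int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0) for x in inputs
            )
            functions.add(truth_table)
    return len(functions)


if __name__ == "__main__":
    for n in (1, 2, 3):
        found = count_threshold_functions(n)
        print(n, found, round(log2(found), 3), "trivial bound:", 2 ** n)
```

For n = 2 the search finds 14 of the 16 Boolean functions (all but XOR and its complement), so the capacity of the unit, about 3.81 bits, sits just below the trivial bound of 4.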
10.4. Relaxing the assumption on the number of nodes
Although the assumption in Theorem 3.1 is almost optimal, it can still be slightly improved in order to accommodate small top layers. Specifically, with a little more work, it can be relaxed to:
[TABLE]
This relaxed condition can be useful since it allows for very small top layers.
The idea behind the relaxed condition is that in the proof of Lemma 10.2, it is not necessary to transmit all bits of to the top layer. Indeed, choose to be the binary representation of the number . Thus, the first bit of is [math] if it is in the upper half of the layers, the first two bits are if is in the upper quarter, the first three bits are if is in the upper eighth, etc. Now, we can drop the first bit of when we pass it between the modules in the upper half of the layers (i.e. for ); instead of verifying the clause , we verify the equivalent clause , where is the coordinate projection that drops the first bit. Similarly, we can drop the second bit of in the upper quarter of the layers, etc. Thus, the length of the portion of passed to the -th layer is approximately instead of the full length, i.e. . The rest of the proof is unchanged.
10.5. Restricted capacity
In this section we extend Theorem 3.1 to the case where the inputs to the network are not all possible binary vectors, but rather lie in a subset . We introduced this restricted version of capacity in Section 2.3 and denoted it by:
[TABLE]
We proved an upper bound on in Proposition 5.2. Now we will complement it with a lower bound. The notion of VC-dimension (see e.g. [24, Section 8.3]) allows us to reduce the problem of restricted capacity to the case of unrestricted capacity.
Lemma 10.4 (Restricted vs. unrestricted capacity).
Consider a subset . Then, for any number of layers and any number of nodes in each layer, we have:
[TABLE]
where is the VC-dimension of .
Proof.
By the definition of VC-dimension, there exists a subset of indices of cardinality that is shattered by . This means that:
[TABLE]
where is the coordinate projection that retains the coordinates in and drops the coordinates outside of . By excluding the input nodes outside , one immediately obtains:
[TABLE]
The proof is complete. ∎
Combining this bound with the Sauer-Shelah Lemma, we obtain the following:
Proposition 10.5 (Restricted capacity: a lower bound).
Consider a subset such that . Then, for any number of layers and any number of nodes in each layer, there exists an integer such that:
[TABLE]
and:
[TABLE]
Proof.
The Sauer-Shelah Lemma (see e.g. [24, Section 8.3.3]) gives the upper bound:
[TABLE]
where is the VC-dimension of . On the other hand, we have the lower bound . Combining the two bounds and taking logarithms, we get:
[TABLE]
An elementary computation then yields:
[TABLE]
An application of Lemma 10.4 completes the proof. ∎
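The two ingredients of this argument, the VC-dimension of a subset of the Boolean cube and the Sauer-Shelah bound, can be illustrated on a toy example. The sketch below is an illustration only, with hypothetical helper names. It uses the definition recalled in the proof of Lemma 10.4: an index set is shattered when the coordinate projection of the set onto those indices yields every binary vector. The Sauer-Shelah bound is taken in its classical form, the sum of binomial coefficients C(n, k) for k up to the VC-dimension d; this is an assumption, since the displayed bound is not legible in this section.

```python
from itertools import combinations, product
from math import comb


def vc_dimension(S, n):
    """Largest |I| such that projecting S onto the coordinates in I gives all of {0,1}^I."""
    S = set(S)
    best = 0
    for size in range(1, n + 1):
        shattered = any(
            len({tuple(x[i] for i in I) for x in S}) == 2 ** size
            for I in combinations(range(n), size)
        )
        if not shattered:
            break  # subsets of a shattered set are shattered, so larger sets cannot be
        best = size
    return best


def sauer_shelah_bound(n, d):
    """Classical Sauer-Shelah bound: sum of C(n, k) for k = 0..d."""
    return sum(comb(n, k) for k in range(d + 1))


if __name__ == "__main__":
    n = 4
    # Toy set: all vectors in {0,1}^4 with at most one coordinate equal to 1.
    S = [x for x in product((0, 1), repeat=n) if sum(x) <= 1]
    d = vc_dimension(S, n)
    print(len(S), d, sauer_shelah_bound(n, d))
```

On this toy set the bound holds with equality: the set has 5 elements and VC-dimension 1, and 1 + 4 = 5.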
Combining Proposition 10.5 with the capacity formula for given by Theorem 3.1, we can obtain a general lower bound on the restricted capacity in terms of the cardinality of .
Remark 10.6 (Tightness).
The bound in Proposition 10.5 is generally best possible up to a logarithmic factor. Indeed, if and then:
[TABLE]
11. Extremal capacity
The capacity formula in Theorem 3.1 is particularly useful when one wants to maximize the capacity of a network under some natural constraints. For example, if we fix the number of parameters of a network, which is essentially the same as fixing the number of edges, Corollary 3.3 states that any monotonically expansive network approximately maximizes capacity. In this section, we consider what happens if instead we fix the number of nodes and, possibly, also the number of nodes in the input layer.
We will use the symbols for identities that hold up to an factor, that is means that as . As before, we continue to use the symbols and for identities and inequalities that hold up to an absolute constant factor.
11.1. Fixing the number of nodes
It turns out that a network with a given number of nodes that asymptotically maximizes capacity is shallow. Specifically, the optimal network has just one hidden layer, which is half the size of the input layer:
Theorem 11.1.
Let and , . Let denote the total number of nodes. Then:
[TABLE]
as .
We shall first prove a version of Theorem 11.1 for the estimated capacity:
[TABLE]
and then replace the estimated capacity. The following lemma yields a general recipe to increase the (estimated) capacity of any network, by moving all nodes from layer and up into the input layer.
Lemma 11.2 (Move nodes out of upper layers to increase capacity).
Let . Then:
[TABLE]
Proof.
Let us first handle the case , where we have to show that:
[TABLE]
The definition of the estimated capacity (11.1) yields:
[TABLE]
In the last step we used that . On the other hand, the same definition yields:
[TABLE]
Hence (11.3) is evident.
Next, let . Combining the definition of the estimated capacity (11.1) with the fact that , , we obtain:
[TABLE]
On the other hand, using the same definition and expanding the square, we get:
[TABLE]
Hence (11.2) is evident. ∎
Armed with the recipe given in Lemma 11.2, we can easily maximize the (estimated) capacity over all networks with two layers and a given number of nodes.
Lemma 11.3 (The most capable network with two layers).
Let . Then:
[TABLE]
Proof.
The maximum of is attained for . ∎
Combining Lemmas 11.2 and 11.3, we obtain a version of Theorem 11.1 for the estimated capacity:
[TABLE]
Proof of Theorem 11.1.
Because the capacity is bounded by the estimated capacity (Proposition 5.2), and using (11.4), we get:
[TABLE]
Furthermore, Theorem 2.11 implies that:
[TABLE]
as . This completes the proof. ∎
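The extremal statement for the estimated capacity can be explored numerically on small networks. The following sketch is a sanity check under a stated assumption rather than the authors' procedure: it takes the estimated capacity (11.1) to be the sum over consecutive layers of min(n_1, ..., n_k) · n_k · n_{k+1}, the cubic polynomial with bottleneck effects described in the abstract, whose exact display is not legible in this section. All function names are hypothetical. Under that assumed form, a two-layer network (n_1, n_2) has estimate n_1² n_2, which over n_1 + n_2 = n is largest at n_1 = 2n/3 and equals 4n³/27 there, in agreement with the hidden layer being half the size of the input layer.

```python
def estimated_capacity(layers):
    """Assumed form of (11.1): sum over k of min(n_1..n_k) * n_k * n_{k+1}."""
    total, bottleneck = 0, layers[0]
    for n_k, n_next in zip(layers, layers[1:]):
        bottleneck = min(bottleneck, n_k)  # smallest layer seen so far
        total += bottleneck * n_k * n_next
    return total


def compositions(n, min_parts=2):
    """Yield every tuple of positive integers summing to n with at least min_parts parts."""
    def rec(remaining, prefix):
        if remaining == 0:
            if len(prefix) >= min_parts:
                yield tuple(prefix)
            return
        for first in range(1, remaining + 1):
            yield from rec(remaining - first, prefix + [first])
    yield from rec(n, [])


def best_architecture(n):
    """Exhaustively search all layered architectures with n nodes in total."""
    return max(compositions(n), key=estimated_capacity)


if __name__ == "__main__":
    n = 12
    best = best_architecture(n)
    print(best, estimated_capacity(best), 4 * n ** 3 / 27)
```

With 12 nodes the exhaustive search returns the architecture (8, 4), whose estimate 8 · 8 · 4 = 256 coincides with 4 · 12³ / 27.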
11.2. Fixing both the total number of nodes and the size of the input layer
In many applications, the input layer is fixed and cannot be optimized. In such situations, it makes sense to maximize capacity of networks with a given total number of nodes , as well as a given number of nodes in the input layer. While here we focus on the case where the total number of neurons and the size of the input layer are fixed, similar results are obtained also for the case where in addition the size of the output layer is fixed.
It turns out that a network that maximizes capacity under these constraints is again shallow. If , the optimal network has two hidden layers, the first having more nodes than the second. If , such an architecture is impossible; the optimal network has just one hidden layer. The following theorem makes this precise.
Theorem 11.4.
Let . Assume that the total number of nodes is . Then, the following holds if and .
1. If then:
[TABLE]
2. If then:
[TABLE]
As in the previous section, we first prove a version of Theorem 11.4 for the estimated capacity defined in (11.1). The following elementary fact will be helpful in our analysis.
Lemma 11.5.
For any , and any positive real numbers , we have:
[TABLE]
Proof.
Consider the difference:
[TABLE]
where is the set of pairs such that either is even and is odd, or is odd and is even. In particular, contains all pairs of the form . Since all the terms of the sum are positive, this yields:
[TABLE]
Combining this with (11.5), we conclude that:
[TABLE]
This yields the conclusion of the lemma. ∎
We are ready to prove the “estimated” version of the first part of Theorem 11.4.
Lemma 11.6 (Small input layer).
If , then:
[TABLE]
Proof.
On one hand, the definition (11.1) of the estimated capacity yields:
[TABLE]
where in the last step we used the assumption and simplified the expression. On the other hand, definition (11.1) gives:
[TABLE]
Comparing the two bounds completes the proof. ∎
Lemma 11.7 (Large input layer).
If , then:
[TABLE]
Proof.
The assumption that implies that , and in particular we have for all . Therefore, by the definition of the estimated capacity (11.1), we have:
[TABLE]
On the other hand, by the definition of the estimated capacity (11.1), we also have:
[TABLE]
This completes the proof. ∎
Proof of Theorem 11.4.
Consider the case first. Because the capacity is bounded by the estimated capacity (Proposition 5.2), Lemma 11.6 gives:
[TABLE]
Furthermore, by Theorem 9.4, the estimated capacity is equivalent to the actual capacity, i.e.
[TABLE]
This yields the first part of the conclusion.
We can argue similarly in the case . Indeed, using Lemma 11.7, we obtain:
[TABLE]
Finally, Theorem 2.11 yields:
[TABLE]
This completes the proof of the theorem. ∎
Remark 11.8 (Optimal single-output networks).
One can state similar results for single-output architectures , because their capacities are equivalent to the capacities of (Corollary 3.2). We skip the details.
11.3. Minimizing capacity
In the theorems above we have maximized the capacity. It is also possible to minimize the capacity and here too, everything else being equal, we find that capacity tends to be minimized by deep architectures. For example, we have the theorem:
Theorem 11.9.
Consider the set of architectures of the form with . Assume that is fixed, and that either the number of connections or the number of nodes is fixed. In either case, the capacity is minimized by the deepest possible architecture with .
Proof.
By definition, we must have at least one unit in each hidden layer, and each layer must be fully connected to the following layer. By Theorem 2.11, the first hidden layer contributes at least to the capacity and this number is minimized by having a single unit in the first hidden layer. If we stack layers of size 1 above this layer, the capacity remains unchanged and thus is minimized. Note that in this case the number of layers is dictated by the value of or . Thus, the minimal capacity is attained by the architecture . ∎
12. Structural regularization
Some have attributed the power of deep networks to their ability to compute more functions. The results of the previous section, summarized in Corollary 3.4, show that this cannot be the case as the opposite is true: everything else being equal, capacity tends to be maximized by shallow networks. However the functions computable by shallow and deep networks are different. For example, R. Eldan and O. Shamir [14] found that a three-layer network with moderate-sized hidden layers is able to compute certain functions that a two-layer network is unable to compute, unless its hidden layer has exponential size. Thus, the emerging picture is that deeper networks with the same number of nodes compute fewer but more sophisticated functions. This leads to the notion of structural regularization.
It has often been noted that deep networks have a tendency to avoid overfitting, even when the size of the training set is small compared to the number of parameters ([27] and references therein). Some of this effect has been attributed to the regularizing properties of the main learning algorithm, stochastic gradient descent, and its inherent tendency to converge towards critical points with relatively broad basins of attraction (e.g. [28] and references therein). However, the results presented here show that there is a major regularization associated with deep architectures that is purely structural and independent of the learning algorithm: compared to shallow networks, deep networks compute fewer functions, but these functions tend to be “smoother and more sophisticated”. The functions we see in practice are a tiny fraction of the universe of all possible functions, but they are the most interesting ones. And deep networks are able to “focus” on them. To see this more formally we can look at the behavior of various architectures on real-valued inputs. The situation is very different in the one-dimensional case, versus all other higher dimensional cases, as shown in the following results. In the one-dimensional case, the behavior of the architecture depends exclusively on the size of the first hidden layer and adding hidden layers does not increase the space of functions that can be implemented.
Proposition 12.1.
The set consists of all piecewise-constant functions with at most points of discontinuity. In particular, this class is determined entirely by alone.
Proof.
The first hidden layer, through the biases, creates potential points of discontinuity. Since there is a single output, every function must be constant, and equal to 0 or 1 on each of the corresponding regions. It is possible to select the units in the hidden layer such that the leftmost region is coded by the vector in the hidden layer, the second leftmost region is coded by the vector , the third leftmost region is coded by the vector , and so forth until the rightmost region which is coded by . The corresponding matrix, augmented with the vector to account for the bias has full rank . Therefore, by selecting the proper weights and biases, any value 0 or 1 can be assigned by the architecture to each one of the regions. ∎
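The construction in this proof can be made concrete for a network with one real input, m hidden threshold units, and one output unit. The sketch below is an illustration under assumptions that are not spelled out in the text: the hidden units fire when the input exceeds increasing cut points, so the regions receive the staircase codes (0, ..., 0), (1, 0, ..., 0), ..., (1, ..., 1), and a unit outputs 1 exactly when its activation is positive. Instead of inverting the full-rank matrix, it picks output weights as successive differences of the target values, which solves the same linear system by telescoping. The names are hypothetical.

```python
from itertools import product


def heaviside(t):
    """Threshold convention assumed here: 1 if the activation is positive, else 0."""
    return 1 if t > 0 else 0


def build_network(cut_points, region_values):
    """Return f: R -> {0,1} equal to region_values[j] on the j-th region from the left.

    cut_points: increasing thresholds c_1 < ... < c_m of the hidden units.
    region_values: m + 1 target bits, one per region.
    """
    m = len(cut_points)
    assert len(region_values) == m + 1
    # Successive differences make the output activation telescope to t_j - 0.5.
    out_weights = [region_values[j + 1] - region_values[j] for j in range(m)]
    out_bias = region_values[0] - 0.5

    def f(x):
        hidden = [heaviside(x - c) for c in cut_points]  # staircase code 1..1 0..0
        return heaviside(out_bias + sum(w * h for w, h in zip(out_weights, hidden)))

    return f


if __name__ == "__main__":
    cuts = [0.0, 1.0, 2.0]
    probes = [-0.5, 0.5, 1.5, 2.5]  # one sample point inside each region
    for values in product((0, 1), repeat=len(cuts) + 1):
        f = build_network(cuts, values)
        assert tuple(f(x) for x in probes) == values
    print("all", 2 ** (len(cuts) + 1), "assignments realized")
```

Every one of the 2^(m+1) assignments of 0 or 1 to the m + 1 regions is realized, which is the content of the proposition for a single hidden layer.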
Proposition 12.2.
The set of functions is characterized first by a splitting of into at most regions, each one of which produces a constant binary vector in the hidden layer, and then the assignment of a 0 or 1 output to each region which can be achieved in at most:
[TABLE]
different ways (the last inequality assumes ).
Proof.
The proof is easily obtained by using Theorem 2.6 to obtain the number of regions, noting that each region is mapped into a fixed vector in the hidden layer, and then applying Lemma 2.7 with . ∎
As an example, consider the class of architectures. The hidden layer gives rise to affine lines that partition the input space into regions. The number of possible binary assignments to these regions scales like . While in principle the output unit could have capacity and thus be able to handle all these assignments, in reality its capacity is reduced because only vectors, out of all possible vectors, are seen in the hidden layer. Thus the capacity of the output unit is considerably reduced to be at most: , using the standard upper bound on the capacity of sets.
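The gap between the number of regions and the number of conceivable hidden vectors can be seen numerically. The sketch below assumes a two-dimensional input and m hidden threshold units, and uses the classical fact that m affine hyperplanes cut R^d into at most C(m, 0) + C(m, 1) + ... + C(m, d) regions; whether this is exactly the statement of Theorem 2.6 is an assumption, since that theorem is not reproduced here. The random weights, the sampling window, and the function names are illustrative choices.

```python
import random
from math import comb


def max_regions(num_hyperplanes, dim):
    """Classical upper bound on the regions cut out by affine hyperplanes in R^dim."""
    return sum(comb(num_hyperplanes, k) for k in range(dim + 1))


def observed_hidden_codes(num_units, samples=20000, seed=0):
    """Count distinct hidden-layer vectors seen on random points of the plane."""
    rng = random.Random(seed)
    units = [
        (rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(num_units)
    ]
    codes = set()
    for _ in range(samples):
        x, y = rng.uniform(-5, 5), rng.uniform(-5, 5)
        codes.add(tuple(int(a * x + b * y + c > 0) for a, b, c in units))
    return len(codes)


if __name__ == "__main__":
    for m in range(1, 9):
        print(m, observed_hidden_codes(m), max_regions(m, 2), 2 ** m)
```

Sampling can miss small or distant regions, so the observed count is only a lower estimate; it never exceeds the quadratic bound, which in turn falls far below 2^m as m grows.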
The same approach can be applied to deep architectures.
Proposition 12.3.
The set of functions is characterized first by a splitting of into regions. Each one of these regions is mapped to a fixed binary vector in the first hidden layer, creating a set . The capacity of the upper part of the architecture, which counts the functions it can compute on this set, is given by and can be bounded using the results in Sections 5 and 10.5.
In short, the emerging intuitive picture is that the first hidden layer determines the number of regions into which the input space is fractured. The overall function is constant in each one of these regions, irrespective of the depth of the network. The larger the first hidden layer is, the greater the number of such regions. A network with a single, non-exponential hidden layer has limited power in terms of assigning values to these regions. A deep network with the same number of parameters, and hence a smaller first hidden layer, will fracture the input into fewer regions and thus its output will have fewer regions of discontinuity. On the other hand, the deep network will be able to compute more complex assignments to these regions.
13. Polynomial threshold functions
In search of more accurate models for biological neurons, or more powerful computational models, one may replace the linear activation with a polynomial activation of degree in the input variables, usually using a lower degree polynomial. Recently, we were able to develop a theory for the capacity of a single polynomial threshold gate [8] with inputs, generalizing Zuev’s result (Theorem 2.11) for all and showing that . The set and network capacity results presented here should be extended to feedforward networks of polynomial threshold functions. We present a first step in that direction beyond what is already in [8]. First we have the following theorem which generalizes Lemma 2.7.
Theorem 13.1 (Polynomial set capacity).
Consider a finite subset , where . Then, for any degree , we have:
[TABLE]
where:
[TABLE]
Proof.
First, it is easy to see that the number of coefficients of a polynomial of degree in variables is given by , including the constant term (bias). A vector can be canonically and injectively mapped into a vector whose components are the various monomials. Using this mapping, we can represent any polynomial of degree over as a linear affine function over . And vice versa, any linear affine function over is a polynomial of degree over . For example, if , the vector is canonically mapped to the vector . Any polynomial over is clearly an affine function of , and vice versa. Therefore:
[TABLE]
We complete the proof by applying Lemma 2.7 for the set , noting that has the same cardinality as since is injective. Note that when and , which is required for the application of the second part of Lemma 2.7. ∎
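The linearization step in this proof is easy to verify on a small case. The sketch below is illustrative and uses hypothetical names: it maps a vector to all monomials of degree 1 through d in its coordinates, checks that together with the constant term there are C(n + d, d) coefficients, checks that the map is injective on the Boolean cube, and evaluates XOR as a degree-2 polynomial threshold function through a linear threshold on the mapped vector. The XOR polynomial x_1 + x_2 - 2 x_1 x_2 - 1/2 is a standard example chosen here, not one taken from the paper.

```python
from itertools import combinations_with_replacement, product
from math import comb, prod


def monomial_map(x, degree):
    """Map x to the vector of all monomials of degree 1..degree in its coordinates."""
    return tuple(
        prod(x[i] for i in idx)
        for d in range(1, degree + 1)
        for idx in combinations_with_replacement(range(len(x)), d)
    )


def linear_threshold(weights, bias, z):
    """Affine threshold on the mapped vector z; outputs 1 if the activation is positive."""
    return int(sum(w * zi for w, zi in zip(weights, z)) + bias > 0)


if __name__ == "__main__":
    n, d = 2, 2
    cube = list(product((0, 1), repeat=n))
    images = [monomial_map(x, d) for x in cube]
    # One coefficient per monomial, plus one for the constant term (bias).
    assert len(images[0]) + 1 == comb(n + d, d)
    # The map is injective because the degree-1 monomials already recover x.
    assert len(set(images)) == len(cube)
    # Monomial order for n = 2, d = 2: x1, x2, x1*x1, x1*x2, x2*x2.
    xor_weights, xor_bias = (1, 1, 0, -2, 0), -0.5
    print([linear_threshold(xor_weights, xor_bias, z) for z in images])
```

XOR is not a linear threshold function of the original two inputs, yet it becomes one after the monomial map, which is the mechanism that lets Lemma 2.7 be applied to the image set.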
Note that if we apply Theorem 13.1 to , we get:
[TABLE]
which is somewhat weaker asymptotically than the result in [8] giving:
[TABLE]
Note also that the general lower bound: , and its improved version when is a subset of the Boolean cube: are trivially satisfied.
Using Theorem 13.1, we can now prove the first result for fully-connected feedforward networks of polynomial threshold gates of degree with a single hidden layer:
Theorem 13.2 (Polynomial capacity of a single-hidden-layer network).
Consider a fully-connected feedforward network of polynomial threshold gates of fixed degree . If and then:
[TABLE]
Proof.
The proof loosely follows the argument for the case , using the result in [8] for single units. The capacity of the hidden layer alone is given by:
[TABLE]
and it is easy to check that the multiplexing technique can equally be applied to this case. This immediately yields the lower bound:
[TABLE]
For the upper bound, the total capacity is bounded by the sum of the capacity of all the units. The capacity of each unit in the hidden layers is bounded by . The capacity of the output unit is bounded by . For the output unit, using Theorem 13.1, its capacity is also bounded by:
[TABLE]
if , and by:
[TABLE]
if . This completes the proof. ∎
Note that in the case where we get:
[TABLE]
The work presented here naturally leads to several open research questions. We briefly mention a few.
14. Open questions
14.1. Polynomial threshold functions.
The initial results given above on polynomial threshold functions need to be extended in several directions. We leave the tightening of the capacity result in Theorem 13.2 and its extensions to networks with multiple hidden layers of polynomial threshold functions of degree for future work. The same is also true for any extensions of the lower bound on set capacity in Theorem 7.1 to polynomial threshold functions of degree .
14.2. Asymptotic tightness.
Theorem 3.1 presents a capacity formula that is accurate within an absolute constant factor. Is this formula asymptotically tight? In other words, is it true that:
[TABLE]
as the number of nodes for some (or all) layers increases to infinity?
The upper bound in Theorem 3.1 is indeed tight (Proposition 5.2), but we only know that the lower bound is asymptotically tight for networks with a single hidden layer (Corollary 8.4). And even beyond that, can the term be further improved, in the same way that for single neurons the result in [16] refines the capacity estimate in [30]?
14.3. Restricted capacity.
In Sections 5 and 10.5 we estimated the restricted capacity for a general input set . Our lower bound (Proposition 10.5) is tight within a logarithmic factor. Can this factor be removed?
To do so, one may try to follow the proof of the lower bound in Theorem 3.1 and use Theorem 7.1 instead of the estimate in this argument. But this reasoning meets an obstacle: we do not know a version of the Enrichment Theorem 9.1 for a general set .
Thus, an important related problem is to generalize the Enrichment Theorem 9.1 for a general set . Is it true that there exists an injective map all of whose components are threshold functions and such that:
[TABLE]
14.4. Other transfer functions
This paper focused exclusively on networks with the threshold (Heaviside) transfer function . One may wonder if our results hold true for other transfer functions, such as the ReLU function , or the sigmoidal transfer functions (e.g. logistic or ) that are commonly used in neural networks. Suppose some general transfer function is applied at each hidden layer. As long as the inputs are from a finite set and the threshold function is applied at the output nodes, the set of functions that can be implemented by the architecture remains finite and the same definition of capacity can be applied without the need for any adjustments. How many functions can the network compute?
Our preliminary analysis, left for further investigations, suggests that the capacity might not depend too much on the shape of the transfer function , provided that is a piecewise-constant function that consists of more than one piece, but less than an exponential number of pieces. Thus, the capacity formula in Theorem 3.1 might be universal for the class of networks with piecewise-constant transfer functions.
Beyond piecewise-constant, we can consider piecewise-linear transfer functions, such as ReLU functions. We have recently shown that the capacity of a single unit is increased by a ReLU transfer function, but only marginally as it remains equal to [6]. We conjecture that, in essence, the same remains true for multilayer networks with ReLU transfer functions in the hidden layers. In light of the known results ([10, 9]) that show that the VC-dimension of ReLU neural networks may grow super-linearly with the depth , it is interesting to find out if the capacity of the networks (e.g. with equal sizes of layers) grows super-linearly as well.
The definition of capacity used here relies on a class of functions or hypotheses that is finite. However, the definition could clearly be extended to networks that can compute an infinite number of functions, for instance using real-valued inputs and continuous transfer functions everywhere, including in the output layer. For this purpose, one could still define capacity by but defining in a broader measure theoretic sense (e.g. volume).
14.5. Other connectivity models
Finally, it is possible to consider several other connectivity models, either by having more sparse connections or by constraining the values of the connections. A first example is to consider synaptic weights that are constrained to belong to a finite set, for instance (binary synapses). It is easy to see that the capacity of a binary synapse linear threshold neuron is exactly equal to the number of inputs [6], but the extension to multiple layers has not been studied. Likewise, it is possible to consider the case where all the incoming, or outgoing, synaptic weights must have the same sign (e.g. purely excitatory or purely inhibitory neurons). We have shown that, for a single linear threshold unit, if all the incoming weights are positive the capacity of the unit is marginally decreased but remains in the same class and equal to . Again the extension to multiple layers has not been studied. Finally, there is the case of local connectivity, where the indegree or outdegree of a neuron may be restricted. How can the theory be extended to some of those cases? Consider a typical convolutional neural network for computer vision applications. In this case, a convolutional layer may comprise an array of neurons, where each neuron has an identical set of incoming weights, the so-called weight-sharing approach. Because the weights are tied, the entire layer can implement only one set of weights, and thus only one function. If the neurons are modeled as linear threshold gates, the capacity of the entire layer can be estimated, and is equal to . Thus, in short, under the right assumptions the methods presented here can readily be applied to convolutional neural networks. All the previous examples assumed feedforward patterns of connections. Extensions to recurrent networks, other than the fully-connected case, have not been investigated.
15. Conclusion
In the 1940s, McCulloch and Pitts [17] and others introduced a simple neuronal model, whereby a neuron processes information by first computing an activation and then an output . The activation is typically a weighted linear, or polynomial, function of the inputs. The transfer function is typically a non-linear function, such as a threshold function, ReLU (rectified linear unit) function or more generally a piecewise linear function, or sigmoidal function (e.g. logistic, ).
Networks of McCulloch and Pitts model neurons are important for at least four fundamental reasons. First, although far simpler than biological neurons in their processing details, these simplified neural models have proven over and over to be useful to better understand biological networks (e.g. [29, 20, 26]). Second, these models are also used to guide the development of new, power efficient, neuromorphic chips [19]. Third, these neural network models are widely used today in all kinds of AI/deep learning applications with impressive results, often matching or exceeding human capabilities in specific tasks across the gamut of applications, from games to biomedicine (e.g. [22, 23, 4]). And finally, from a foundational standpoint, they are the dominant, and perhaps simplest, available analytical model for studying the neural style of storing information.
Indeed, in the standard description of McCulloch and Pitts neurons given above, the emphasis is placed on the processing aspect of these models, the input-output relationship. However, equal or even more emphasis should be given to the storage aspect of this neural model. Information about the world, e.g. in the form of “training sets”, is stored in a distributed “holographic” way in the synaptic weights (i.e. the coefficients of the activation function), through a learning process. This is significant because to achieve intelligent behavior, information processing systems must be able to learn and store information. To store information, there are two completely different approaches: (1) the Turing tape model, where information is stored at well organized, discrete, addresses of a physical substrate–this is the style used in all living systems at the cellular level (DNA), and in all our digital devices, from cell phones to supercomputers; (2) the neural model, where information is stored “holographically” in neural networks across large numbers of synapses–this is the style believed to be used by the brain, and simulated in our neural network–deep learning–technology. While the Turing style is relatively well understood, the neural style is not.
In this paper, we set out to study the most fundamental property of the neural style of storage, namely how many bits can be stored in a given neural architecture. To address this question, we first introduced the notion of cardinal capacity, the logarithm base two of the number of different functions a given architecture can compute. Remarkably, for neural architectures, the cardinal capacity is equal to the total number of bits that can be stored in a given architecture, or the number of bits that can be “communicated” from the outside world to the architecture by the learning process. We then estimated the capacity of feedforward neural architectures of arbitrary depth, under a relatively mild set of assumptions on the connectivity and the transfer functions of these architectures. The capacity is typically a cubic polynomial in the sizes of the layers. For fully connected, feedforward architectures with layers of sizes $n_1, \ldots, n_L$, it is essentially given by: $C(n_1, \ldots, n_L) \approx \sum_{k=1}^{L-1} \min(n_1, \ldots, n_k)\, n_k n_{k+1}$. As a side note, the capacity of fully connected recurrent networks can also be estimated [6], essentially by unfolding them in time and computing the capacity of the underlying feedforward network. In addition, we have improved the bounds on the capacity of sets, analyzed the extremal properties of the capacity and the structural regularization effects of deep architectures, and begun to extend the theory of capacity to polynomial threshold functions. We have also briefly surveyed several open questions in this area.
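The cubic estimate can be made concrete with a small sketch. It assumes the leading-order form given in the paper's abstract, $\sum_{k=1}^{L-1} \min(n_1, \ldots, n_k)\, n_k n_{k+1}$, and ignores lower-order terms and constants; the function name `capacity_estimate` is hypothetical and chosen only for this illustration.

```python
from typing import Sequence


def capacity_estimate(layer_sizes: Sequence[int]) -> int:
    """Leading-order capacity estimate, in bits, for layer sizes n_1..n_L.

    Assumed form (not an exact count of functions):
    sum over k = 1..L-1 of min(n_1, ..., n_k) * n_k * n_{k+1}.
    """
    if len(layer_sizes) < 2:
        raise ValueError("need at least an input and an output layer")
    total = 0
    bottleneck = layer_sizes[0]  # running minimum of the layers seen so far
    for n_k, n_next in zip(layer_sizes, layer_sizes[1:]):
        bottleneck = min(bottleneck, n_k)
        total += bottleneck * n_k * n_next
    return total


# A narrow second layer caps the contribution of every later layer.
print(capacity_estimate([100, 100, 100, 100]))  # no bottleneck
print(capacity_estimate([100, 10, 100, 100]))   # bottleneck of size 10
```

Under this assumed form, shrinking a single early layer from 100 to 10 units reduces the estimate by more than a factor of ten, which is the bottleneck effect described in the text.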
Finally, although this falls beyond the scope of this paper, the capacity is a fundamental quantity that can be related to other measures of complexity and generalization, including the VC dimension, the growth function, the Rademacher and Gaussian complexity, the metric entropy, and the minimum description length (MDL). For example, if the function to be learnt has MDL , and the neural architecture being used has capacity then it is easy to see that: (1) cannot be learnt without errors; and (2) the number of errors made by the best approximating function implementable by the architecture must satisfy . Another example is the connection to the VC dimension that was used in the bounds in Section 10.5. These connections will be described more systematically elsewhere.
Appendix: Examples
In this appendix, we apply the main result to a few basic architectures, assuming the layers are large so that the asymptotic regimes can be applied.
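First, consider a deep architecture which is expansive, where expansive is defined by the property: $n_1 \leq n_2 \leq \cdots \leq n_L$. Then the minimum over the preceding layers is always $n_1$ and, using the main result:
$$C(n_1, \ldots, n_L) \approx n_1 \sum_{k=1}^{L-1} n_k n_{k+1}$$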
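Second, consider a deep architecture which is compressive, where compressive is defined by the property: $n_1 \geq n_2 \geq \cdots \geq n_L$. Then the minimum over the preceding layers is always $n_k$ and, using the main result:
$$C(n_1, \ldots, n_L) \approx \sum_{k=1}^{L-1} n_k^2\, n_{k+1}$$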
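The two cases can be checked numerically with a short sketch. It assumes the main result has the leading-order form $\sum_{k=1}^{L-1} \min(n_1, \ldots, n_k)\, n_k n_{k+1}$, that "expansive" means non-decreasing layer sizes, and that "compressive" means non-increasing layer sizes. All function names are hypothetical.

```python
def capacity_estimate(sizes):
    """Assumed general form: sum_k min(n_1..n_k) * n_k * n_{k+1}."""
    total, bottleneck = 0, sizes[0]
    for n_k, n_next in zip(sizes, sizes[1:]):
        bottleneck = min(bottleneck, n_k)
        total += bottleneck * n_k * n_next
    return total


def expansive_form(sizes):
    """Simplification when n_1 <= n_2 <= ... <= n_L: the minimum is always n_1."""
    return sizes[0] * sum(a * b for a, b in zip(sizes, sizes[1:]))


def compressive_form(sizes):
    """Simplification when n_1 >= n_2 >= ... >= n_L: the minimum is always n_k."""
    return sum(a * a * b for a, b in zip(sizes, sizes[1:]))


expansive = [10, 20, 30, 40]
compressive = [40, 30, 20, 10]

# The simplified expressions agree with the general formula in each regime.
print(capacity_estimate(expansive), expansive_form(expansive))
print(capacity_estimate(compressive), compressive_form(compressive))
```

With the same four layer sizes, the compressive ordering gives a larger estimate than the expansive one, because in the expansive case every term is limited by the small input layer.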
Third, we consider autoencoder architectures with a single hidden layer (Figure 5). Clearly . For the capacity, there are two cases depending on whether the autoencoder is expansive or compressive. In the compressive case (), . If we let for , then:
[TABLE]
In the expansive case (), . If we let for , then:
[TABLE]
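The autoencoder cases can be illustrated with a sketch under stated assumptions. The symbols $n$ (input and output size) and $m$ (hidden size) are chosen here for illustration and may differ from the paper's notation. The capacity is again assumed to follow the leading-order form $\sum_k \min(n_1, \ldots, n_k)\, n_k n_{k+1}$, applied to layer sizes $(n, m, n)$.

```python
def autoencoder_capacity(n: int, m: int) -> int:
    """Assumed leading-order capacity of an autoencoder with layers (n, m, n).

    n and m are hypothetical names for the visible and hidden layer sizes.
    """
    first_term = n * n * m            # k = 1: min(n) * n * m
    second_term = min(n, m) * m * n   # k = 2: min(n, m) * m * n
    return first_term + second_term


# Compressive case (m < n): the estimate reduces to n * m * (n + m).
n, m = 100, 20
print(autoencoder_capacity(n, m), n * m * (n + m))

# Expansive case (m >= n): the estimate reduces to 2 * n**2 * m.
n, m = 100, 300
print(autoencoder_capacity(n, m), 2 * n * n * m)
```

In the compressive case the hidden layer limits the second term, while in the expansive case both terms are limited by the input size $n$.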
Finally, we contrast a shallow and a deep classification architecture (Figure 6). For the shallow architecture, we have:
[TABLE]
For the deep architecture, we have:
[TABLE]
if , and:
[TABLE]
if . Here is a parameter that represents the depth–the entire architecture has layers, not counting the single-unit output layer. Consider, for instance, the expansive case where and . Then both architectures satisfy: and will have roughly the same capacity when they have roughly the same number of parameters. If we let (), (), and () then for the architectures to have approximately the same number of parameters (and thus approximately the same capacity), one must have: . The other cases can be analyzed similarly.
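The link between capacity and parameter count in the expansive case can be sketched numerically. The example assumes the leading-order form $\sum_k \min(n_1, \ldots, n_k)\, n_k n_{k+1}$ and counts weights only, ignoring biases. The layer sizes below are hypothetical and are not taken from Figure 6.

```python
def capacity_estimate(sizes):
    """Assumed general form: sum_k min(n_1..n_k) * n_k * n_{k+1}."""
    total, bottleneck = 0, sizes[0]
    for n_k, n_next in zip(sizes, sizes[1:]):
        bottleneck = min(bottleneck, n_k)
        total += bottleneck * n_k * n_next
    return total


def weight_count(sizes):
    """Number of weights between consecutive layers (biases ignored)."""
    return sum(a * b for a, b in zip(sizes, sizes[1:]))


# Hypothetical classifiers with input size 50 and a single output unit.
shallow = [50, 1000, 1]
deep = [50, 100, 100, 100, 100, 1]

for name, sizes in (("shallow", shallow), ("deep", deep)):
    n_input = sizes[0]
    # When no hidden layer is smaller than the input, every term is
    # limited by the input size, so capacity = input size * weights.
    print(name, weight_count(sizes), capacity_estimate(sizes),
          n_input * weight_count(sizes))
```

When every hidden layer is at least as large as the input layer, the assumed estimate equals the input size times the number of weights, so two such architectures with the same input size have roughly equal capacity exactly when they have roughly equal parameter counts.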
References
- [1] Martin Anthony. Discrete mathematics of neural networks: selected topics, volume 8. SIAM, 2001.
- [2] P. Baldi. Autoencoders, Unsupervised Learning, and Deep Architectures. Journal of Machine Learning Research. Proceedings of 2011 ICML Workshop on Unsupervised and Transfer Learning, 27:37–50, 2012.
- [3] P. Baldi. Boolean autoencoders and hypercube clustering complexity. Designs, Codes, and Cryptography, 65(3):383–403, 2012.
- [4] P. Baldi. Deep learning in biomedical data science. Annual Review of Biomedical Data Science, 1:181–205, 2018.
- [5] P. Baldi and A. F. Atiya. Oscillations and synchronizations in neural networks: an exploration of the labeling hypothesis. International Journal of Neural Systems, 1(2):103–124, 1989.
- [6] P. Baldi and R. Vershynin. On neuronal capacity. In NIPS 2018. Accepted for oral presentation.
- [7] Pierre Baldi and Peter Sadowski. A theory of local learning, the learning channel, and the optimality of backpropagation. Neural Networks, 83:61–74, 2016.
- [8] Pierre Baldi and Roman Vershynin. Boolean polynomial threshold functions and random tensors. arXiv preprint arXiv:1803.10868, 2018.
