Interpretations of Deep Learning by Forests and Haar Wavelets

Changcun Huang

arXiv:1906.06706·cs.LG·December 9, 2019


TL;DR

This paper offers new interpretations of ReLU deep learning by linking it to decision forests and Haar wavelet functions, providing insights into its classification and approximation capabilities.

Contribution

It introduces two novel perspectives: viewing certain ReLU networks as forests and approximating Haar wavelet functions, enhancing understanding of deep learning's theoretical properties.

Findings

01

ReLU deep learning can be equivalent to decision forests in classification.

02

ReLU networks can approximate Haar wavelet functions with arbitrary precision.

03

Generalization of results from ReLU to sigmoid-unit deep learning.

Abstract

This paper presents a basic property of region dividing of ReLU (rectified linear unit) deep learning when new layers are successively added, from which two new perspectives for interpreting deep learning are derived. The first is related to decision trees and forests: we construct a deep learning structure equivalent to a forest in classification ability, which means that certain kinds of ReLU deep learning can be regarded as forests. The second is that functions represented by Haar wavelets can be approximated by ReLU deep learning with arbitrary precision, from which a general conclusion about the function approximation ability of ReLU deep learning follows. Finally, some of the conclusions for ReLU deep learning are generalized to the case of sigmoid-unit deep learning.

Equations

$y = f\left(\sum_{i=1}^{N} w_{i} s_{i}\right),$

$\vec{y} = W\vec{x} + \vec{b},$

$E = \iint_{B^{\prime}} \left(\left(\hat{f}(x_{1},x_{2}) - f(x_{1},x_{2})\right)^{2}\right)^{1/2} dx_{1}\,dx_{2},$

$\omega = \max_{S} \left|f(x_{1}^{\prime},x_{2}^{\prime}) - f(x_{1},x_{2})\right|,$

$E \leq \omega S_{B},$

$f(x) = \max(0, kx + b),$


Taxonomy

Topics: Neural Networks and Applications · Generative Adversarial Networks and Image Synthesis · Image and Signal Denoising Methods


Full text


Keywords: Deep learning · Interpretation · Region dividing · Forests · Haar wavelets · Function approximation

1 Introduction

Deep learning is among the most prominent areas of artificial intelligence today and has achieved great success in speech recognition [1], computer vision [2], playing the game of Go [3], and so on. Despite its successful applications and a history of nearly 40 years since Fukushima’s paper in 1982 [4], its underlying principles remain unclear, so that deep learning is often referred to as a “black box”, which greatly hinders its development.

One of the main concerns is why deep neural networks are more powerful than shallow ones. The answer to this question is the key to understanding deep learning, and existing results fall mainly into three kinds. The first kind is specific, explaining particular classes of functions realized by deep learning, such as [5]; [6]; [7]; [8]. The second kind [9]; [10]; [11]; [12]; [13] is more general, studying the expressive ability of deep layers compared with shallow ones. The third kind concerns the function approximation ability of deep learning [14]; [15].

Part of this paper belongs to the second kind. We present a basic property of region dividing of ReLU [16]; [17] deep learning when new layers are successively added (Section 3), and then realize multi-category classification via a deep learning structure equivalent to a forest, which gives a new interpretation of ReLU deep learning (Section 4).

The function approximation problem is also discussed (Section 5). We prove that functions represented by Haar wavelets can be approximated by ReLU deep learning with arbitrary precision. A general result follows: ReLU deep learning can approximate an arbitrary continuous function on a closed set of $n$-dimensional space. The proof is entirely different from those of [14]; [15], giving a new perspective for interpreting ReLU deep learning.

Finally, one distinction of this paper is that some conclusions of ReLU deep learning can be generalized to the case of sigmoid-unit deep learning (Section 6).

Since ReLU has become nearly the dominant choice of neural unit in deep learning in recent years [9]; [18], the main topics of this paper are general and useful in both theory and engineering.

2 Preliminaries

Before describing the region dividing property of deep learning, some preliminary results should be introduced first.

2.1 Mechanisms of 3-layer networks

The discussion of 3-layer networks is the basis for comparing shallow and deep networks. In addition, 3-layer subnetworks exist within deep learning structures, and their mechanism is the same as that of ordinary 3-layer networks.

We begin the discussion from a concrete example of two-category classification realized by a 3-layer network. It is well known that each ReLU corresponds to a hyperplane dividing the input space into two regions. In the case of two-dimensional input space, a hyperplane is reduced to a line. The following notes are applicable to the rest of this paper:

Note 1.

Hereafter, when referring to classification by a hyperplane, data points lying exactly on the hyperplane are not taken into consideration.

Note 2.

We do not distinguish between the terms “region dividing” and “data classification”; they usually do the same thing.

Note 3.

For simplicity, all the figures of neural networks omit the biases, although the biases are present.

Note 4.

Unless otherwise stated, the term “deep learning” is an abbreviation of “ReLU deep learning”.

Fig.1 (a) is a 3-layer network with three ReLUs in the hidden layer denoted by 1, 2 and 3, corresponding to lines of 1, 2, and 3 of Fig.1 (b), respectively.

Notation.

We denote the two sides of a hyperplane by “$l$-$s$”, where $l$ is the index of the hyperplane and $s$ expresses the output of the ReLU with respect to this hyperplane: $l$-+ represents the side of hyperplane $l$ where the ReLU output is greater than 0, and $l$-0 denotes the other side, where the ReLU output is zero.

For instance, in Fig.1 (b), 1-+ is the side above line 1 because the data in that half plane give a positive ReLU output, while 1-0 is the side below it, producing zero outputs. The objective of the 3-layer network of Fig.1 (a) is to classify the data points of Fig.1 (b) into two categories: the output of the third layer should be 0 or 1 when the input sample belongs to the “o” or “$\ast$” category, respectively. Output 1 can be obtained by normalizing the nonzero output of the ReLU.

In Fig.1 (b), we can add lines 1, 2, and 3 one by one for classification. First, line 1 is added, so that the “o” samples below line 1 are correctly classified. The samples above line 1 must be further classified by more lines, such as line 2 and line 3. However, when line 3 is added, for example, the “o” samples below line 1 are simultaneously on the 3-+ side, producing nonzero outputs. That is, subdividing the half plane above line 1 by line 3 turns the previously correct classification below line 1 into a wrong one, and we may need to add further lines to eliminate the influence of line 3.

In fact, the final output expression of 3-layer networks (with a single output) is

$$y = f\left(\sum_{i=1}^{N} w_{i} s_{i}\right), \tag{1}$$

where $s_{i}$ is the output of the $i$th ReLU of the hidden layer. If $s_{i}\neq 0$ and $w_{i}\neq 0$, the $i$th ReLU influences the whole sum $\sum{w_{i}s_{i}}$ through its nonzero output; in geometric terms, the $i$th hyperplane for region dividing influences the half of the input space where this ReLU output is nonzero. The influenced region may include regions that were already correctly divided, and the correct results may be reversed. If the influence cannot be eliminated by adjusting the present hyperplanes, new hyperplanes must be added. This procedure may occur recursively; hence the number of hyperplanes needed in a 3-layer network may be far larger than the number actually required to divide the input space into separate regions when mutual influences are ignored. This is the general explanation of Fig.1, from which a conclusion follows:

Theorem 1.

In 3-layer networks, any newly added ReLU of the hidden layer will influence the half of the input space where the output of this ReLU is nonzero.
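The effect stated in Theorem 1 can be illustrated with a minimal sketch of the output expression (1). The two lines, the output weights, and the sample points below are hypothetical values chosen for illustration, and the output unit $f$ is assumed here to be a ReLU.

```python
def relu(z):
    return max(0.0, z)

def three_layer_output(x, hidden, out_weights):
    """y = f(sum_i w_i * s_i), with f assumed to be a ReLU.

    hidden is a list of (weights, bias) pairs, one per hidden ReLU.
    """
    s = [relu(sum(w_k * x_k for w_k, x_k in zip(w, x)) + b) for w, b in hidden]
    return relu(sum(w_i * s_i for w_i, s_i in zip(out_weights, s)))

# Hypothetical line 1: x2 - 1 > 0, so 1-+ is the half plane above x2 = 1.
hidden = [((0.0, 1.0), -1.0)]
out_w = [1.0]
below = (3.0, 0.0)  # a point in 1-0, correctly mapped to output 0

before = three_layer_output(below, hidden, out_w)

# Add hypothetical line 3 (x1 - 1 > 0) in the SAME hidden layer.
hidden3 = hidden + [((1.0, 0.0), -1.0)]
out_w3 = out_w + [1.0]
after = three_layer_output(below, hidden3, out_w3)

print(before, after)  # the point below line 1 is now influenced by line 3
```

The point lies in 1-0 but also in 3-+, so its output changes as soon as line 3 is added, which is the interference the theorem describes.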

We shall show that the interference of hyperplanes to each other can be avoided in deep learning.

2.2 Transmitting of input-space regions through layers

In deep learning, the input space is only directly connected to the first hidden layer; how a region of the input space passes to subsequent layers is a key foundation of subregion dividing via a sequence of layers.

Pascanu et al. [9] used an “intermediate layer” to transmit an input-space region, which in fact works by means of affine transforms; however, no general rigorous conclusions with proofs were presented. Although the mathematics is simple, the problem is important enough that we state and prove the conclusions rigorously, and add the prerequisites needed for the results to hold.

Lemma 1.

Suppose that the input space $I$ is $n$-dimensional. The $n$ nonzero outputs of $n$ ReLUs in the first hidden layer form a new space $H$. If the weight matrix $W$ of size $n\times n$ between the input layer and the first hidden layer is nonsingular, then $H$ is $n$-dimensional and is an affine transform of a region of $I$. The intersection of the nonzero-output areas of the $n$ ReLUs in $I$ is the region to be transformed.

Proof.

We know that the nonzero output of a ReLU is $f(x)=x$ for $x>0$ . So an $n$ -nonzero output vector $\vec{y}$ of $H$ can be written as

$$\vec{y} = W\vec{x} + \vec{b}, \tag{2}$$

where $\vec{x}$ is a vector in a certain region of $I$ and $\vec{b}$ is the bias vector of the $n$ ReLUs. Equation (2) merely collects the outputs of the $n$ ReLUs in matrix form. Obviously, (2) is an affine transform, and if $W$ is nonsingular, the dimension of $H$ is $n$. ∎

Remark.

The geometric meaning of Lemma 1: a nonsingular $W$ in (2) implies non-parallel hyperplanes. Lemma 1 is equivalent to saying that if the $n$ hyperplanes corresponding to the $n$ ReLUs are not parallel to each other, the space $H$ is $n$-dimensional and is an affine transform of a region of the input space.
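Lemma 1 can be checked numerically in the two-dimensional case. The sketch below uses a hypothetical nonsingular $2\times 2$ matrix `W` and zero biases; it shows that inside the region where both ReLUs are nonzero the layer output coincides with the affine map (2) and can be inverted, while outside that region the outputs are clipped to zero.

```python
def relu(z):
    return max(0.0, z)

def affine(W, b, x):
    """The affine map y = W x + b of equation (2)."""
    return [sum(W[i][j] * x[j] for j in range(len(x))) + b[i] for i in range(len(W))]

def layer(W, b, x):
    """Output of a ReLU layer with weight matrix W and bias vector b."""
    return [relu(v) for v in affine(W, b, x)]

def invert2(W, b, y):
    """Solve W x + b = y for a nonsingular 2x2 matrix W."""
    det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    d0, d1 = y[0] - b[0], y[1] - b[1]
    return [(W[1][1] * d0 - W[0][1] * d1) / det,
            (-W[1][0] * d0 + W[0][0] * d1) / det]

# Hypothetical lines: x2 = 0 and 0.5*x1 + x2 = 0 (not parallel, det = -0.5).
W = [[0.0, 1.0], [0.5, 1.0]]
b = [0.0, 0.0]

x = (3.0, 2.0)                       # both ReLU outputs are nonzero here
recovered = invert2(W, b, layer(W, b, x))
clipped = layer(W, b, (3.0, -2.0))   # both pre-activations are negative here
```

Because `W` is nonsingular, the original point is recovered exactly from the hidden-layer output, so no information about the transmitted region is lost.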

Theorem 2.

In deep learning with $n$ -dimensional input, if each succeeding layer has $n$ ReLUs with nonsingular input weight matrix, a certain region of the input space can be transmitted to subsequent layers one by one in the sense of affine transforms.

Proof.

The first hidden layer of Lemma 1 again can be considered as a new input space. By doing this recursively, a certain region of the initial input space can be transmitted to succeeding layers one by one in the sense of affine transforms, as long as this region is always in the nonzero parts of all the $n$ ReLUs in each layer. ∎
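The recursion in this proof can be written out for two successive layers. This is a worked sketch under the theorem's assumptions; the subscripts on $W$, $\vec{b}$, and $\vec{y}$ are labels introduced here for the first and second hidden layers.

```latex
% On the region where all n ReLUs of both layers give nonzero outputs:
\[
\vec{y}_{1} = W_{1}\vec{x} + \vec{b}_{1}, \qquad
\vec{y}_{2} = W_{2}\vec{y}_{1} + \vec{b}_{2}
            = (W_{2}W_{1})\,\vec{x} + (W_{2}\vec{b}_{1} + \vec{b}_{2}).
\]
% The composition is again affine, and it is nonsingular because
\[
\det(W_{2}W_{1}) = \det(W_{2})\,\det(W_{1}) \neq 0 .
\]
```

The same step applies to each further layer, which is why the region is transmitted "one by one" as long as it stays in the nonzero part of every ReLU.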

3 Basic properties of region dividing via deep learning

Section 2 described the mechanism of 3-layer networks by which adding a new ReLU to the hidden layer influences half of the input space. This disadvantage can be avoided in deep learning: Theorem 3 describes a basic property of region dividing of deep learning, which is the basis of the whole paper. The proof of Lemma 2 will be referred to several times in the following sections to construct different types of deep learning structures.

3.1 The two-dimensional case

We again begin with an example. Fig.2 corresponds to Fig.1 of Section 2. In Fig.1, line 3 is added in the hidden layer to subdivide the region 1-+ above line 1; however, this operation influences the previously correct classification results. In deep learning, the mutual interference among lines can be avoided by adding a new layer that restricts the area influenced by a line.

As shown in Fig.2 (b), line 1 is first selected to divide the data points into two separate parts in different regions; then we can always find a line 2 having the same classification effect as line 1. In fact, when line 2 in Fig.2 (b) rotates counterclockwise towards line 1, the region between 1-+ and 2-+ (or between 1-0 and 2-0) can be made as large as desired, so that all the data points above line 1 (or below line 1) are encompassed by 1-+ and 2-+ (or by 1-0 and 2-0); this is how line 2 is found. Thus, all the data points are either in the region between 1-+ and 2-+ (denoted by region-+) or in the region between 1-0 and 2-0 (denoted by region-0).

Since line 1 and line 2 are not parallel to each other, by the remark of Lemma 1, the space of the two nonzero outputs of ReLU 1 and ReLU 2 of the first hidden layer is two-dimensional, as well as an affine transform of region-+; while region-0 is excluded from this layer in terms of zero outputs. Affine transforms do not affect the linear classification property of the data; so the linear classification in region-+ can be done in the space of the first hidden layer, without influencing region-0 because it has been excluded.

Now, instead of adding ReLU 3 in the same layer as ReLU 1 and ReLU 2 as in Fig.1 (a), we add it in a new layer, called the second hidden layer, as shown in Fig.2 (a), to perform the classification in the space of the first hidden layer. Strictly speaking, line 3 in Fig.2 (b) should be drawn in a region of the first hidden layer, which is an affine transform of region-+ of the input space; nevertheless, the illustration is reasonable because the effect of linear classification is equivalent.

Obviously, the principle and operation underlying this example are general in two-dimensional space. In what follows, we shall directly generalize it to the $n$ -dimensional case.
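The two-dimensional example can be sketched numerically. All lines and points below are hypothetical. In addition, this sketch chooses line 3 so that its bias in the hidden space is negative, which keeps the all-zero hidden vector of region-0 on the zero side of ReLU 3; that choice is an assumption of the sketch rather than something stated in the text.

```python
def relu(z):
    return max(0.0, z)

def first_hidden(x):
    """ReLU 1 (line x2 = 0) and ReLU 2 (line 0.5*x1 + x2 = 0)."""
    x1, x2 = x
    return (relu(x2), relu(0.5 * x1 + x2))

def shallow_relu3(x):
    """Line 3 (x1 = 1) added in the same layer as lines 1 and 2."""
    return relu(x[0] - 1.0)

def deep_relu3(x):
    """Line 3 added in a second hidden layer, acting on the hidden space.

    In region-+, h1 = x2 and h2 = 0.5*x1 + x2, hence x1 = 2*(h2 - h1)
    and x1 - 1 = -2*h1 + 2*h2 - 1.
    """
    h1, h2 = first_hidden(x)
    return relu(-2.0 * h1 + 2.0 * h2 - 1.0)

region_plus = [(3.0, 2.0), (0.0, 1.0)]     # both ReLU 1 and ReLU 2 nonzero
region_zero = [(3.0, -2.0), (-1.0, -1.0)]  # both ReLU 1 and ReLU 2 zero

for p in region_plus + region_zero:
    print(p, shallow_relu3(p), deep_relu3(p))
```

In region-+ the two versions of ReLU 3 agree, so the deep version divides that region exactly as line 3 would. In region-0 the shallow version gives a nonzero output at (3, -2), while the deep version stays at zero for every excluded point.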

3.2 The $n$ -dimensional case

Lemma 2.

For a 3-layer network with $n$-dimensional input, the hidden layer can be designed to realize an arbitrary linearly separable classification of two categories. One of the categories is excluded by the hidden layer, while the other is changed into its affine transform. Adding a new hidden layer can divide a selected region of the input space in the sense of affine transforms without influencing an excluded region.

Proof.

When the input space is $n$-dimensional, we need $n$ hyperplanes (ReLUs) to construct an $n$-dimensional space of the hidden layer, each of which realizes the same two-category classification. The function of those $n$ hyperplanes is similar to that of line 1 and line 2 in Fig.2 (b).

First choose hyperplane 1 to divide the input space into two regions, containing the data points of category-0 and category-+, respectively; category-0 is to be excluded, while category-+ may need to be subdivided. Then hyperplane 2, with the same classification effect as hyperplane 1, can be found by a method similar to that of the two-dimensional case. When hyperplane 2 rotates towards hyperplane 1 (counterclockwise or clockwise according to their relative positions), there exist infinitely many hyperplanes between them, all of which classify the data with the same effect; choose $n-2$ of them as the remaining hyperplanes to construct an $n$-dimensional coordinate system. Since the $n$ selected hyperplanes are not parallel to each other, by the remark of Lemma 1, the $n$ nonzero outputs of the $n$ ReLUs corresponding to those hyperplanes form an $n$-dimensional linear space, which is an affine transform of a region of the input space, while the region giving zero outputs for all $n$ ReLUs is excluded.

The constructed hidden layer has successfully excluded a region containing category-0 (region-0), as well as transmitted a region containing category-+ (region-+). If adding a new hidden layer, we can subdivide region-+ of the input space in the sense of affine transforms without influencing region-0. ∎
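One way to picture the construction in this proof for $n = 3$ is sketched below. The tilting rule (adding a small `eps` to one coordinate of the normal vector at a time) is a hypothetical choice made here to obtain hyperplanes close to hyperplane 1; it assumes the last component of the normal is nonzero. The data points are likewise invented for illustration.

```python
def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n, d = len(m), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[p][c]) < 1e-12:
            return 0.0
        if p != c:
            m[c], m[p] = m[p], m[c]
            d = -d
        d *= m[c][c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= factor * m[c][k]
    return d

def build_hyperplanes(a, eps):
    """Normal of hyperplane 1, then n-1 slightly tilted copies of it."""
    rows = [list(a)]
    for k in range(len(a) - 1):
        row = list(a)
        row[k] += eps
        rows.append(row)
    return rows

def positive_side(w, bias, x):
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + bias > 0

a, bias, eps = (0.0, 0.0, 1.0), 0.0, 0.1   # hyperplane 1: x3 = 0
W = build_hyperplanes(a, eps)

category_plus = [(1.0, 2.0, 1.0), (-2.0, 0.5, 3.0)]
category_zero = [(1.0, 1.0, -2.0), (-1.0, -3.0, -1.0)]

# Every hyperplane must reproduce the classification of hyperplane 1.
all_agree = all(
    all(positive_side(w, bias, x) for x in category_plus)
    and not any(positive_side(w, bias, x) for x in category_zero)
    for w in W
)
print(det(W), all_agree)
```

A nonzero determinant confirms that the weight matrix is nonsingular, so by Lemma 1 the hidden space is 3-dimensional, while `all_agree` confirms that the three hyperplanes realize the same two-category classification on this finite data set.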

Remark.

The purpose of selecting $n$ non-parallel hyperplanes (ReLUs) is to construct an $n$-dimensional space that maintains the complete data structure of the $n$-dimensional input space in the sense of affine transforms. If the number of non-parallel hyperplanes is less than $n$, the outputs lie in a lower-dimensional subspace, which may lose information.

Notation.

Denote an arbitrary 2-layer subnetwork of a deep learning structure by $P$ - $C$ with $n$ -dimensional input, representing the previous layer and current layer, respectively; $W$ is the weight matrix between layer $P$ and layer $C$ as in (2).

Theorem 3.

In deep learning, if current layer $C$ has $n$ ReLUs with nonsingular input weight matrix $W$ , adding ReLUs in a new layer $N$ after layer $C$ can divide a certain region of previous layer $P$ in the sense of affine transforms without influencing an excluded region. Similarly, adding new layers one by one can realize subregion dividing recursively; in each layer, data points that do not need to be subdivided can be put into the excluded region, so that the region dividing of succeeding layers will have no impact on them.

Proof.

The first part of the theorem is similar to Lemma 2. As long as $W$ is nonsingular, even if the $n$ ReLUs of layer $C$ are not specially designed, the region-transmitting property still holds. The rest of the proof is the recursive application of the first part. ∎

Remark 1.

Theorem 3 is a basic property of region dividing of certain kinds of ReLU deep learning, which is the key of this paper; all the following results are the consequences of this theorem.

Remark 2.

Theorem 3 indicates an advantage of deep layers for a type of deep learning structure. To classify complex data points, the deeper the network, the finer the subdividing will be. Once the step of adding new layers stops, the last three layers perform the classification via the mechanism of 3-layer networks in a final subregion.

4 Multi-category classification of deep learning

In this section, the multi-category classification ability of deep learning is given in Theorem 4. Lemma 3 deals with the two-category case, and Theorem 4 follows from repeated application of Lemma 3.

Lemma 3.

For a finite number of data points composed of two categories in $n$ -dimensional space, deep learning can classify them as a decision tree.

Proof.

The proof is constructive by Lemma 2, Theorem 3 and the theory of decision trees. First, we can always construct a decision tree to realize this two-category classification, whose decision functions are linear classifiers. Second, there exists a deep learning structure equivalent to that decision tree, which is given by the following method.

Fig.3 (a) shows a four-level decision tree classifying two-dimensional data points, and Fig.3 (b) is its corresponding deep learning structure. The layers of the deep learning structure correspond to the levels of the decision tree, except that the deep learning structure adds an output layer with one ReLU. Also in Fig.3, the first layer, i.e., the input layer, has two ReLUs with respect to root node 1 of the decision tree because the input space is two-dimensional; the general case of $n$-dimensional space is similar.

In each layer, for a node having two child nodes, construct $2n$ ReLUs in the next layer. The first $n$ of them (left child) separate the data points into region-+ and region-0 and are designed according to the decision function of this node by the method of Lemma 2; data points in region-+ can be subclassified by the succeeding layers of child nodes without influencing the excluded region-0. The other $n$ ReLUs (right child) differ from the first group only in the signs of their parameters; they reverse the ReLU outputs of data points in region-+ and region-0, so that region-0 becomes the region to be subdivided. For example, in Fig.3, node 1 has two child nodes, so four ReLUs are needed in the next layer: two for left child 2 and two for right child 3. In the second layer, the weights and biases of the ReLUs for node 3 have the same absolute values as those of the ReLUs for node 2 but opposite signs.

For a leaf node, if the next layer is the last one, just connect its ReLUs to the output ReLU, as for node 6 and node 7 of Fig.3. Otherwise, add one ReLU in each succeeding layer (except the last one) to transmit the classification result to the last layer, as for node 2 and node 5 in Fig.3; make sure that the weights and bias of the single ReLU of a leaf node in each layer maintain the nonzero output.

The weights and bias of the output-layer ReLU should be designed to distinguish between a left leaf node and a right leaf node. For instance, let the left and right leaf nodes of Fig.3 (a) correspond to zero and nonzero outputs of the deep learning structure of Fig.3 (b), respectively. The design is easy because, in the layer preceding the last one, when the output of one leaf node is nonzero, the outputs of all other leaf nodes are zero, since the branches of a decision tree are mutually exclusive. For example, in Fig.3 (b), when the output of ReLU 2 in the fourth layer (preceding the last one) is nonzero, all the outputs of the other ReLUs of this layer are zero. So we may consider only ReLU 2 to exist in that layer, and the weight between ReLU 2 and the output-layer ReLU can be designed without influencing other leaf nodes. The bias of the output-layer ReLU can be set to zero because the weight itself is enough to produce the desired output. Since ReLU 2 corresponds to a left leaf node, when the bias is zero, setting the weight to a value less than or equal to zero meets the need. The general case is similar. This completes the construction. ∎
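The sign-flipping step used for the right child in this proof can be illustrated in isolation. The sketch below covers only that one step, not the whole tree-to-network construction; the weights, bias, and points are hypothetical.

```python
def relu(z):
    return max(0.0, z)

def child_pair(w, b, x):
    """Outputs of a left-child ReLU and its sign-flipped right-child ReLU.

    Left child:  relu(w . x + b)
    Right child: relu(-w . x - b), i.e. the same parameters with opposite signs.
    """
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return relu(z), relu(-z)

w, b = (1.0, -1.0), 0.5
print(child_pair(w, b, (2.0, 0.0)))  # point on the positive side of the hyperplane
print(child_pair(w, b, (0.0, 2.0)))  # point on the other side
```

For any point off the hyperplane exactly one of the two outputs is nonzero, which is how the two groups of ReLUs route data points to the left or right subtree.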

A forest is a sum of decision trees [19]. We have proved that deep learning can realize the two-category classification as a decision tree. Next we show that deep learning can be equivalent to a forest in classification ability.

Theorem 4.

Deep learning can classify arbitrary multi-category finite number of data points as a forest.

Proof.

The proof reduces the multi-category classification to the two-category case of Lemma 3 [20]. Fig.4 is an example of three-category classification, in which each dotted rectangle classifies one of the three categories using the two-category method of Lemma 3. No matter how many categories are to be classified, apply the two-category method to each category separately and combine the resulting subnetworks into a whole deep learning structure. ∎
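The reduction in this proof has the shape of a one-versus-rest scheme. In the sketch below, each entry of `detectors` is a hypothetical stand-in for one dotted-rectangle module of Fig.4; simple one-dimensional predicates are used in place of the subnetworks of Lemma 3.

```python
def one_vs_rest(detectors, x):
    """Return the labels whose two-category detector fires on x."""
    return [label for label, fires in detectors.items() if fires(x)]

# Hypothetical two-category detectors, one per category.
detectors = {
    "a": lambda x: x < 0.0,
    "b": lambda x: 0.0 < x < 2.0,
    "c": lambda x: x > 2.0,
}

for sample in (-1.0, 1.0, 3.0):
    print(sample, one_vs_rest(detectors, sample))
```

Each detector only answers "this category or not", and the combined structure reads off the category from whichever module gives a nonzero response.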

Remark 1.

From the viewpoint of forests, the layer depth of deep learning corresponds to the level of decision trees comprising a forest; while the number of neural units in each layer is related to that of the nodes in the corresponding level of the decision tree.

Remark 2.

Bengio et al. [19] stated that decision trees are not easily generalized to variations of the training data, while forests do not have this limitation. By Theorem 4, deep learning can realize the function of forests and its generalization ability can be assured.

5 Function approximations of deep learning

There exist general results on the function approximation ability of 3-layer sigmoid-unit networks, such as those of Hecht-Nielsen [21], Cybenko [22], and Hornik et al. [23]. Among them, Hecht-Nielsen’s proof is constructive.

In the area of ReLU deep learning, Yarotsky [14] and Liang et al. [15] have also discussed such issues, both using similar methods. They first constructed deep learning structures to approximate polynomial functions; and then by Taylor series of smooth functions, the function approximation ability of deep learning was proved. Although the conclusions are assured, the deep learning structures they mentioned are constrained to certain types.

The proofs given below are entirely different from those of [14] and [15] and are aimed at other types of deep learning structures, which provides a new perspective for interpreting deep learning.

Lemma 4.

Any piecewise-constant function of Haar wavelets with finite number of building-block domains can be approximated by deep learning with arbitrary precision. If the input space is $n$ -dimensional, $2n$ hidden layers are enough.

Proof.

The proof is based on Lemma 2 and Theorem 3. We first prove the two-dimensional case. For a function $f(x_{1},x_{2})$ represented by Haar wavelets and defined on a closed set $S$ with finitely many building-block domains, we can always divide its domain into rectangles (or squares, similarly hereafter) of different sizes and locations, on each of which the function has a constant value (possibly the same as that of some other rectangles). The basic idea is to approximate the function by deep learning in each rectangle with arbitrary precision. Because the number of rectangles is finite, if the approximation error for each rectangle is arbitrarily small, then the deep learning approximation to the whole function is arbitrarily precise. So we just need to prove the case of one rectangle.

First, an isolated rectangle, such as $R_{i}$ in Fig.5 (a), can be separated via deep learning. For each side of $R_{i}$, such as the bottom one, we can always find two lines (ReLUs) that divide the rectangle domains of $f(x_{1},x_{2})$ into two parts in two different regions, with one of the two lines parallel to the bottom side (such as line 1). $R_{i}$ is in the region where the outputs of the two ReLUs are both nonzero; all the rectangles below line 1 are excluded by line 1 and line 2 and lie in the other region (the zero-output region). Note that the region between 1-+ and 2-0 or between 1-0 and 2-+ also gives nonzero output, which needs special treatment later, unlike the classification of discrete data points in Theorem 4.

After doing similar operations to the other sides, except for $R_{i}$ , the rectangle domains of $f(x_{1},x_{2})$ are all excluded. However, the intersection of nonzero-output regions of the four separations is not $R_{i}$ , but the region of the plane excluding four zero-output regions, which is a concave polygon (denoted by $P$ ) formed by eight lines such as in Fig.5 (a).

Note that only the separation of the bottom side of $R_{i}$, handled first, is done in the input space of deep learning; the separations of the three other sides are done in the spaces of the three succeeding layers, respectively, as shown in Fig.5 (c). The above operations are nevertheless reasonable because of the properties of affine transforms. For example, if the second hidden layer of Fig.5 (c) corresponds to the separation of the left side of $R_{i}$, then as long as we can find two lines for this side in the input space, as in Fig.5 (a), the corresponding two lines in the space of the first hidden layer can also be found, since parallelism and collinearity are invariant. By the architecture of Fig.5 (c), the effects of the four separations can be combined, and finally only the data points in polygon $P$ give nonzero outputs. We will not repeat this point in the rest of the proof.

In polygon $P$, let the output of deep learning be the value of the approximated function $f(x_{1},x_{2})$ in rectangle $R_{i}$. We now show that the limit of a sequence of polygons $P$ can be $R_{i}$ by adjusting the parameters of the eight lines. Denote by $R_{o}$ the outer rectangle formed by the four of the eight lines that are parallel to the four respective sides of $R_{i}$ (such as line 1). For the separation of the bottom side of $R_{i}$, when line 2 rotates clockwise towards line 1, which is parallel to the bottom side, the limit of line 2 is line 1; during the rotation, the classification effect of separating $R_{i}$ remains unchanged, while the region between 1-+ and 2-0 or between 1-0 and 2-+ becomes smaller and smaller. If we apply similar rotations to the three other sides, the concave polygon $P$ can approximate $R_{o}$ to any desired precision.

When the outer rectangle $R_{o}$ shrinks to $R_{i}$, the polygon $P$ constructed by deep learning can also approximate $R_{i}$ with arbitrary precision; therefore, deep learning can approximate $f(x_{1},x_{2})$ in $R_{i}$ with arbitrary precision.

We now discuss the case of adjacent rectangles. We call two rectangles adjacent when one side of each lies on the same line. In Fig.5 (b), suppose that $f(x_{1},x_{2})$ has different constant values in the big rectangle $R_{b}$ and the small rectangle $R_{s}$. As can be seen, the bottom side of $R_{b}$ shares a line with the top side of $R_{s}$, so they are adjacent. $R_{b}$ is to be separated and may have more than one adjacent rectangle; however, we illustrate only one of them, which is enough for the description of the proof.

As in the case of an isolated rectangle, a concave polygon $P_{b}$ encompassing $R_{b}$ can be constructed by deep learning. In polygon $P_{b}$, the output of deep learning is normalized to the value of function $f(x_{1},x_{2})$ in $R_{b}$. As shown in Fig.5 (b), part of $R_{s}$ is separated into $P_{b}$, where the output of deep learning is not equal to the actual function value in $R_{s}$. This type of approximation error occurs in all the adjacent rectangles separated into polygon $P_{b}$ whose function value is different from that of $R_{b}$. So the region of $P_{b}$ outside $R_{b}$ (denoted by $B$) is the source of the approximation error of deep learning. Define the approximation error in $B$ as

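$$E = \iint_{B^{\prime}} \left(\left(\hat{f}(x_{1},x_{2}) - f(x_{1},x_{2})\right)^{2}\right)^{1/2} dx_{1}\,dx_{2}, \tag{3}$$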

where $\hat{f}(x_{1},x_{2})$ is the approximating function of deep learning and $B^{\prime}$ is a subset of region $B$ on which $f(x_{1},x_{2})$ is defined.

Let

$$\omega = \max_{S} \left|f(x_{1}^{\prime},x_{2}^{\prime}) - f(x_{1},x_{2})\right|, \tag{4}$$

where $S$ is the domain of $f(x_{1},x_{2})$ and $\omega$ is the maximum variation of $f(x_{1},x_{2})$ , which always exists because $f(x_{1},x_{2})$ only has finite number of function values. Since the value of $\hat{f}(x_{1},x_{2})$ is also derived from $f(x_{1},x_{2})$ , it’s obvious that

$$E \leq \omega S_{B}, \tag{5}$$

where $S_{B}$ is the area of region $B$. Because the area of $P_{b}$ can be arbitrarily close to that of $R_{b}$, $S_{B}$ tends to zero as $P_{b}\to R_{b}$; thus, $E$ can be made arbitrarily small.
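The bound $E \leq \omega S_{B}$ can be checked on a small worked case. The sketch below makes several simplifying assumptions that are not in the text: $R_{b}$ is the unit square with value 5, $R_{s} = [0, 0.5]\times[-0.5, 0]$ lies directly below it with value 2, and $P_{b}$ is modeled as $R_{b}$ enlarged by a uniform margin `eps` instead of a concave polygon.

```python
def error_and_bound(eps):
    """Return (E, omega * S_B) for the simplified two-rectangle layout."""
    value_b, value_s = 5.0, 2.0
    omega = abs(value_b - value_s)           # maximum variation of f
    area_B = (1.0 + 2.0 * eps) ** 2 - 1.0    # S_B: area of P_b outside R_b
    # B' is the part of B where f is defined: the strip of R_s inside the margin.
    area_B_prime = 0.5 * min(eps, 0.5)
    # The integrand |f_hat - f| is constant on B' in this layout.
    error = abs(value_b - value_s) * area_B_prime
    return error, omega * area_B

for eps in (0.2, 0.1, 0.01):
    print(eps, error_and_bound(eps))
```

As the margin shrinks, both the error and its bound shrink with it, which is the limiting argument used in the proof.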

Fig.5 (c) is the structure of deep learning constructed for Fig.5 (a) or Fig.5 (b). The first hidden layer corresponds to the region dividing by line 1 and line 2 with respect to the bottom side of a rectangle, and the succeeding three layers correspond to the three other sides. The four region-dividing steps must be done in different layers successively to ensure that their effects can be combined. The final output should be normalized to the function value.

It is noted that four hidden layers for the 2-dimensional case are enough, since a rectangle has only four sides, each of which needs one hidden layer.

The whole structure of deep learning approximating $f(x_{1},x_{2})$ can be obtained by combining the subnetworks of all rectangle domains just like Fig.4, each module of a dotted rectangle representing a certain rectangle domain of $f(x_{1},x_{2})$ . This completes the proof of the two-dimensional case.

The $n$ -dimensional case can be proved similarly. To approximate a single hyperrectangle, use $2n$ hidden layers (one for each of the $2n$ sides of the hyperrectangle) instead of four as in Fig.5 (c), with each layer having $n$ ReLUs. For the rotating operations that change the parameters of the hyperplanes, refer to the proof of Lemma 2. For each side of a hyperrectangle, $n$ hyperplanes are constructed to separate the hyperrectangle by the method of Lemma 2, with hyperplane 1 parallel to the side. Hyperplane 2 is added second, and the other $n-2$ hyperplanes are chosen between hyperplane 1 and hyperplane 2. So we just need to rotate hyperplane 2 as in the two-dimensional case, and then insert the other $n-2$ hyperplanes between hyperplane 1 and the rotated hyperplane 2. The rest of the proof follows the two-dimensional case. ∎

Theorem 5.

Deep learning with $2n$ hidden layers can approximate any continuous function on a closed set of $n$ -dimensional space with arbitrary precision.

Proof.

We know that Haar wavelets are capable of approximating continuous functions, while deep learning can approximate Haar wavelets as demonstrated in Lemma 4. This completes the proof. ∎

Remark 1.

From the perspective of Haar wavelets, the $2n$ hidden layers of deep learning are used to approximate the hyperrectangle domains of Haar wavelet functions by the principle of Theorem 3, and the number of neural units in each layer is determined by the number of hyperrectangle domains. In more detail, if there exist $m$ hyperrectangle domains, each hidden layer should use $m\times n$ neural units.
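The unit count in this remark can be written down directly. The helper below is only a bookkeeping sketch under the stated figures ( $2n$ hidden layers, $m\times n$ units in each); the function names are hypothetical and nothing here builds an actual network.

```python
def hidden_layer_widths(n_dims, n_boxes):
    """Widths of the 2n hidden layers when m = n_boxes hyperrectangles are separated."""
    depth = 2 * n_dims           # one hidden layer per side of a hyperrectangle
    width = n_boxes * n_dims     # n units per hyperrectangle in every layer
    return [width] * depth


def total_hidden_units(n_dims, n_boxes):
    """Total number of hidden units, i.e. 2 * m * n**2."""
    return sum(hidden_layer_widths(n_dims, n_boxes))


# Two-dimensional input with three rectangle domains.
layout_2d = hidden_layer_widths(2, 3)
```

Under these figures the depth depends only on the input dimension, while the width grows linearly with the number of hyperrectangle domains.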

Remark 2.

The approximation accuracy of deep learning in Theorem 5 is determined by that of the Haar wavelets. To make the approximation error smaller, first make the Haar wavelet approximation of the function more precise, and then use deep learning to approximate the finer wavelets. The more precise the Haar wavelets, the more ReLUs we need; however, the number of hidden layers is always $2n$ .
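The trade-off between precision and the number of pieces can be illustrated in one dimension. The sketch below is an illustration under assumptions, not part of the paper: it uses the fact that a Haar expansion truncated at a given level is the piecewise-constant function formed by cell averages on dyadic intervals, and it picks $\sin(2\pi x)$ on $[0,1]$ as a hypothetical target. Averages and errors are estimated by sampling, so the numbers are approximate.

```python
import math


def cell_points(cell, cells, samples_per_cell):
    """Midpoint sample positions inside one dyadic interval of [0, 1]."""
    return [(cell + (s + 0.5) / samples_per_cell) / cells
            for s in range(samples_per_cell)]


def haar_level_approx(func, level, samples_per_cell=64):
    """Cell averages of func on the 2**level dyadic intervals of [0, 1]."""
    cells = 2 ** level
    return [
        sum(func(x) for x in cell_points(c, cells, samples_per_cell)) / samples_per_cell
        for c in range(cells)
    ]


def max_error(func, averages, samples_per_cell=64):
    """Largest sampled gap between func and its piecewise-constant approximation."""
    cells = len(averages)
    worst = 0.0
    for c, avg in enumerate(averages):
        for x in cell_points(c, cells, samples_per_cell):
            worst = max(worst, abs(func(x) - avg))
    return worst


def target(x):
    # Hypothetical continuous target function.
    return math.sin(2.0 * math.pi * x)


# Finer levels use more intervals (more units) and give a smaller error.
errors_by_level = {level: max_error(target, haar_level_approx(target, level))
                   for level in (2, 4, 6)}
```

Each extra level doubles the number of intervals, which in the setting of Remark 1 would mean more units per layer while the depth stays fixed.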

Remark 3.

Lippmann [24] gave a somewhat similar proof about the classification ability of 3-layer networks composed of threshold logic units (TLUs). Although he did not mention the function approximation problem, his region dividing by neural networks can accurately represent a Haar wavelet function. However, he discussed only the case of 3-layer networks with TLUs.

6 Generalizations to sigmoid-unit deep learning

Deep learning with sigmoid neural units has been successfully used in speech analysis [1] and computer vision [2] . It will be shown later that some related conclusions of ReLU deep learning of this paper can be generalized to the sigmoid-unit case.

Corollary 1.

All the conclusions of this paper about ReLU deep learning still hold in the case of a modified ReLU, which is

$$\sigma(x)=\max(0,\;kx+b), \tag{6}$$

where $k$ and $b$ are real with $k>0$ .

Proof.

(6) only changes the slope of the linear part and the position in $x$ axis of a ReLU; however, as long as a neural unit has zero and linear outputs separated by a threshold, all the proofs related to the ReLU are applicable to the modified case of (6). ∎
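A small check makes the argument concrete. The sketch below assumes the modified unit has the form $\max(0,kx+b)$ , which is an assumption about (6) based on the description of $k$ and $b$ ; the function names are hypothetical. Because $k>0$ , such a unit equals a plain ReLU applied to a shifted input and rescaled by $k$ , so it still has a zero part and a linear part separated by a threshold.

```python
def relu(x):
    """Plain rectified linear unit."""
    return max(0.0, x)


def modified_relu(x, k, b):
    # Assumed form of the modified unit: zero below the threshold -b/k,
    # linear with slope k above it.
    return max(0.0, k * x + b)


def as_plain_relu(x, k, b):
    # The same unit written as a plain ReLU on a shifted input, rescaled by k > 0.
    return k * relu(x + b / k)


k, b = 2.5, -1.0
threshold = -b / k
grid = [i / 10.0 for i in range(-30, 31)]
largest_gap = max(abs(modified_relu(x, k, b) - as_plain_relu(x, k, b)) for x in grid)
```

The shift and the scale can be absorbed into the weights and biases of the neighbouring layers, which is one way to read why the ReLU proofs carry over.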

Corollary 2.

In sigmoid-unit deep learning, a certain region of input space can be approximately transmitted to hidden layers by any desired precision in the sense of affine transforms.

Proof.

The derivative of the sigmoid function $S(x)$ is $S^{\prime}(x)=S(x)(1-S(x))$ , which tends to $1/4$ as $x\to 0$ , and $S(0)=1/2$ ; that is to say, $S(x)$ can be approximated by the line $y=x/4+1/2$ to any desired precision when $x$ is close enough to zero. Thus, a certain segment of the sigmoid function can be approximately considered as a line. According to Corollary 1 and Theorem 2, this corollary holds. ∎
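The quality of the linear approximation near zero can be checked numerically. The following is a minimal sketch that measures the largest sampled gap between $S(x)$ and the line $y=x/4+1/2$ on an interval $[-r,r]$ ; the radii and the helper names are illustrative choices.

```python
import math


def sigmoid(x):
    """Logistic function S(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def tangent_line(x):
    """Tangent of the sigmoid at x = 0: slope S'(0) = 1/4, intercept S(0) = 1/2."""
    return x / 4.0 + 0.5


def linearization_gap(radius, samples=2001):
    """Largest sampled |S(x) - (x/4 + 1/2)| for x in [-radius, radius]."""
    worst = 0.0
    for i in range(samples):
        x = -radius + 2.0 * radius * i / (samples - 1)
        worst = max(worst, abs(sigmoid(x) - tangent_line(x)))
    return worst


# The gap shrinks quickly as the interval around zero gets smaller.
gaps = {r: linearization_gap(r) for r in (1.0, 0.1, 0.01)}
```

Since the first neglected Taylor term of the sigmoid at zero is cubic ( $-x^{3}/48$ ), the gap is expected to fall roughly like $r^{3}$ , so scaling the inputs into a small interval around zero makes the unit behave almost affinely.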

Remark.

In the classic paper [25] on artificial neural networks, Hopfield also referred to the “linear central region” of $S(x)$ at $x=0$ and used this approximately linear property to transmit information between nonlinear neurons. The idea is similar; however, the details differ because the application background is different.

Corollary 3.

Sigmoid-unit deep learning can exclude a certain region of the input space or of a hidden-layer space with arbitrary precision, so that region dividing in other regions cannot influence it.

Proof.

The sigmoid function $S(x)$ tends to zero as $x\to-\infty$ , which approximately corresponds to the zero-output part of a ReLU. By selecting proper parameters of the sigmoid units, a certain region can be excluded with arbitrary precision, as in the case of ReLUs. ∎
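How large the parameters must be can be estimated for a single unit. The sketch below is an illustration under assumptions: it takes a hypothetical unit with pre-activation `scale * x1`, treats the half-line `x1 <= -margin` as the region to exclude, and solves $S(-s\cdot\text{margin})=\varepsilon$ for the weight scale $s$ . The names `required_scale` and `unit_output` are not from the paper.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def required_scale(eps, margin):
    """Weight scale s with sigmoid(-s * margin) = eps, for 0 < eps < 0.5 and margin > 0."""
    return math.log(1.0 / eps - 1.0) / margin


def unit_output(x1, scale):
    # Hypothetical unit whose pre-activation is scale * x1, so points with
    # x1 <= -margin lie in the region that should produce (almost) zero output.
    return sigmoid(scale * x1)


eps, margin = 1e-6, 0.1
scale = required_scale(eps, margin)
edge_output = unit_output(-margin, scale)    # worst case inside the excluded region
deep_output = unit_output(-0.5, scale)       # farther from the boundary
```

Any point farther from the boundary than `margin` gives an even smaller output, and a larger scale tightens the tolerance further; unlike a ReLU, the output is never exactly zero.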

Remark.

The above three corollaries suggest that sigmoid-unit deep learning can realize the function of ReLU deep learning to some extent.

7 Summary

The region-dividing property of deep learning in Theorem 3 is general. On the basis of this property, we established the relationships between deep learning and forests, as well as between deep learning and Haar wavelets, by which the multi-category classification and function approximation abilities of deep learning were discussed.

All topics mentioned above are related to the “black-box” problem of deep learning, which is important both in theory and engineering. We hope that this paper will be helpful to this theme.

Bibliography

[1] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine 29 (2012).
[2] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition. Proceedings of the IEEE 86.11, 2278-2324 (1998).
[3] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. Driessche, T. Graepel, D. Hassabis, Mastering the game of Go without human knowledge. Nature 550.7676, 354 (2017).
[4] K. Fukushima, S. Miyake, Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition 15.6, 455-469 (1982).
[5] O. Delalleau, Y. Bengio, Shallow vs. deep sum-product networks. Advances in Neural Information Processing Systems, 666-674 (2011).
[6] R. Eldan, O. Shamir, The power of depth for feedforward neural networks. Conference on Learning Theory, 907-940 (2016).
[7] B. McCane, L. Szymanski, Deep networks are efficient for circular manifolds. 23rd International Conference on Pattern Recognition (ICPR). IEEE, 3464-3469 (2016).
[8] D. Rolnick, M. Tegmark, The power of deeper networks for expressing natural functions. arXiv preprint arXiv:1705.05502v2 (2018).
