Deep Representation with ReLU Neural Networks
Andreas Heinecke, Wen-Liang Hwang

TL;DR
This paper analyzes deep ReLU neural networks from a signal processing perspective, describing their affine linear regions and atomic decompositions to better understand their representations and stability.
Contribution
It provides a detailed description of the affine linear regions in ReLU networks and proposes conditions for stabilizing learning independent of network depth.
Findings
Characterization of affine linear regions in ReLU networks
Atomic decomposition of neural representations
Conditions for learning stability
Abstract
We consider deep feedforward neural networks with rectified linear units from a signal processing perspective. In this view, such representations mark the transition from using a single (data-driven) linear representation to utilizing a large collection of affine linear representations tailored to particular regions of the signal space. This paper provides a precise description of the individual affine linear representations and corresponding domain regions that the (data-driven) neural network associates to each signal of the input space. In particular, we describe atomic decompositions of the representations and, based on estimating their Lipschitz regularity, suggest some conditions that can stabilize learning independent of the network depth. Such an analysis may promote further theoretical insight from both the signal processing and machine learning communities.
Andreas Heinecke
Yale-NUS College, Singapore 138527, Singapore
Wen-Liang Hwang (corresponding author)
Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan
1 Introduction
After having brought about impressive and revolutionary results in machine learning tasks from computer vision, speech recognition or machine translation, deep neural networks (DNNs) have also been entering the realm of signal processing. Deep feedforward neural networks can be viewed as a cascade of affine linear transforms and nonlinear activation functions, producing representations of given data. In this view, best visualized via a graph representing the network, the DNN iteratively computes each layer by transforming the output of the previous layer with an affine linear operator and a componentwise acting non-linear activation. From another angle, initiated by the universality theory of shallow neural networks, starting with [1, 2], and of deep neural networks, see e.g. [3], DNNs with piecewise linear activation functions can be viewed as piecewise affine linear functions, affine linear on polytopes that partition the input space [4], that can approximate any function in $L^p(\mathbb{R}^n)$ arbitrarily well. However, the expressive power of a DNN cannot be fully leveraged in signal processing without explicit expressions of the affine linear operators, their domains, ranges, and composition from the weight and bias parameters of the network. This paper addresses the expressive power of a DNN by providing an explicit formulation of each affine linear mapping and its domain for the case of rectifier activations. In Section 2, we discuss how DNNs with piecewise linear activations may be considered a most significant modern advancement in the long history of signal processing via linear transforms, marking the transition from universal and data-driven linear transforms to data-driven piecewise linear transforms. In Section 3, we provide a detailed analysis of those piecewise linear transforms for deep feedforward rectifier neural networks.
The main contributions of this paper are a configuration expression that explicitly specifies the hyperplane constraints bounding the domain of each affine linear map and how these domains refine the input space as the number of layers increases, as well as an atomic decomposition (Theorem 5) of the respective affine maps. This characterization of the affine linear pieces unravels precisely how, depending on the region of the input space, the in- and output layers of the network determine the atoms of the representation, and how those atoms are linearly combined over many possible paths through the hidden weights of the network. The precise domain specification and atomic decomposition may facilitate new analytic insight into architectural questions, but also into optimization procedures and empirically successful methods, such as BatchNorm [5], dropout [6] or residual learning [7]. As an indication we give an estimate of the Lipschitz regularity of the atomic decomposition. While this regularity is important as a characteristic of the representation itself, we also relate it to the smoothness of the gradient of the network's loss function that governs gradient based training algorithms for DNNs.
2 From orthonormal bases to data-driven representations and deep neural networks
Many problems of science and engineering can be described by the model $\mathbf{y} = M(\mathbf{x})$ with input data/signals $\mathbf{x}$, output data/signals $\mathbf{y}$ and a linear or non-linear operator $M$ modelling some process. Among its many instances, it may for example describe an ill-posed inverse problem where one wishes to reconstruct a certain well structured $\mathbf{x}$ from an observed $\mathbf{y}$; or a transform, where one wishes to derive some “good” representation $\mathbf{y}$ of the data $\mathbf{x}$. The measurement process or transform often contains a linear/non-linear component subject to constraints stemming from, say, physics or engineering. A classic instance is the phaseless reconstruction problem, in which one observes only the modulus of linear Fourier coefficients, making it an inverse problem that consists of an analysis followed by a non-linear measurement process. Another instance is synthesis of linear measurements with prior information, e.g., sparsity of wavelet frame coefficients in imaging. One may also wish to design $M$ such that the reconstruction becomes possible, stable and/or fast. For example, in compressed sensing [8, 9] one is interested in designing sensing matrices that allow the recovery of sparse vectors from significantly fewer linear measurements than the signal dimension.
Orthonormal bases and frames: For centuries, conventional wisdom suggested that, whenever possible, one should use an orthonormal basis to represent signals. Different orthonormal bases may allow for sparse representations of certain classes of data. The most classic example is the Fourier basis, given by the columns of the matrix
$$\mathbf{F} = \frac{1}{\sqrt{n}} \left( e^{2\pi i jk/n} \right)_{j,k=0}^{n-1},$$
with the help of which many oscillatory signals become sparsely represented, allowing insight into many phenomena of physics and chemistry. As the Fourier basis is orthonormal, the coefficients $\mathbf{c}$ of the representation $\mathbf{x} = \mathbf{F}\mathbf{c}$, simply given via conjugate transpose, are $\mathbf{c} = \mathbf{F}^* \mathbf{x}$.
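As a small illustration of sparsity in the Fourier basis, the following sketch builds a unitary DFT matrix and analyzes a sampled cosine. The sign and normalization convention, the signal length `n = 16` and the test signal are assumptions chosen for the example, not taken from the paper.

```python
import cmath
import math

def dft_matrix(n):
    """Unitary DFT matrix; sign and normalization follow one common convention."""
    scale = 1.0 / math.sqrt(n)
    return [[scale * cmath.exp(2j * math.pi * j * k / n) for k in range(n)]
            for j in range(n)]

def analysis(F, x):
    """Coefficients c = F^* x, i.e. inner products with the columns of F."""
    n = len(x)
    return [sum(F[j][k].conjugate() * x[j] for j in range(n)) for k in range(n)]

def synthesis(F, c):
    """Reconstruction x = F c."""
    n = len(c)
    return [sum(F[j][k] * c[k] for k in range(n)) for j in range(n)]

n = 16
F = dft_matrix(n)
# Hypothetical test signal: a cosine with 3 cycles over n samples.
x = [math.cos(2 * math.pi * 3 * t / n) for t in range(n)]
c = analysis(F, x)
support = [k for k in range(n) if abs(c[k]) > 1e-9]
x_rec = synthesis(F, c)
recon_error = max(abs(x_rec[t] - x[t]) for t in range(n))
print(support, recon_error)
```

Only the two coefficients at the frequencies 3 and n - 3 are nonzero, and synthesis with the same matrix recovers the signal up to rounding.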
In many situations orthonormal bases are far from the ideal choice for a representation and it can have great advantages to give up the linear independence imposed on the elements of orthonormal bases. Frames [10] are advancements of orthonormal bases, derived by relaxing Parseval’s identity to a pair of inequalities: A matrix $\mathbf{F} \in \mathbb{C}^{n \times m}$ is the synthesis matrix of a frame, if there are constants $0 < A \le B < \infty$ such that
$$A \|\mathbf{x}\|_2^2 \le \|\mathbf{F}^* \mathbf{x}\|_2^2 \le B \|\mathbf{x}\|_2^2 \quad \text{for all } \mathbf{x} \in \mathbb{C}^n. \tag{1}$$
Frames are thus precisely those systems for which signals can be stably reconstructed from the linear measurements $\mathbf{F}^* \mathbf{x}$. For any frame $\mathbf{F}$ there are, in general many, dual frames $\tilde{\mathbf{F}}$, which provide perfect reconstruction of the signal from the linear measurements in the sense that $\tilde{\mathbf{F}} \mathbf{F}^* \mathbf{x} = \mathbf{x}$ for all $\mathbf{x}$. Dual frames can be derived via different incarnations of a duality principle that hinges on exploiting the adjoint nature of the involved operators. In case of tight frames, i.e., if (1) holds with equality ($A = B$), it is possible to choose $\tilde{\mathbf{F}} = A^{-1} \mathbf{F}$, but different dual frames can be chosen to optimally adapt to practical considerations such as, say, minimization of quantization errors.
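The frame inequalities and perfect reconstruction via a dual frame can be checked on a small example. The sketch below uses three unit vectors in the plane, 120 degrees apart (often called the Mercedes-Benz frame); this particular frame, the test vector and all function names are choices made for illustration only.

```python
import math

# Frame vectors (columns of a 2 x 3 synthesis matrix): unit vectors 120 degrees apart.
angles = [math.pi / 2 + 2 * math.pi * k / 3 for k in range(3)]
frame = [(math.cos(a), math.sin(a)) for a in angles]

def analyze(x, vectors):
    """Linear measurements: inner products of x with each frame vector."""
    return [v[0] * x[0] + v[1] * x[1] for v in vectors]

def synthesize(coeffs, vectors):
    """Linear combination of the given vectors with the given coefficients."""
    return tuple(sum(c * v[i] for c, v in zip(coeffs, vectors)) for i in range(2))

# This frame is tight with A = B = 3/2, so the canonical dual is the frame scaled by 1/A.
A = 1.5
dual = [(v[0] / A, v[1] / A) for v in frame]

x = (0.7, -1.2)  # arbitrary test signal
measurements = analyze(x, frame)
energy_ratio = sum(m * m for m in measurements) / (x[0] ** 2 + x[1] ** 2)
x_rec = synthesize(measurements, dual)
print(energy_ratio, x_rec)
```

The energy ratio equals the frame bound 3/2 for every nonzero input, and the three redundant measurements reconstruct the two-dimensional signal exactly up to rounding.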
A major advantage of frames is that signals from large classes of data may have common structural features that often translate to the fact that choosing an appropriate frame can force a dimensionality reduction in the sense that the data is sparsely representable via the frame. In audio processing, time-varying frequencies are captured sparsely via Gabor tight frames [11], comprised of translations and modulations of a window function. In image processing, wavelet frames [12, 13] of shifts and dilations of fast oscillating zero-mean functions can be used to compress and process piecewise smooth images using very few significant coefficients. In both examples, orthonormality is usually given up to gain desired properties, e.g., joint time-frequency localization of the generator in case of Gabor frames, or joint smoothness, symmetry and compact support in the case of the generators of wavelet frames.
Sparse representation and dictionary learning: Frames that enable sparse representations of signals yield great advantages, for instance in the interpretation and estimation of the main subcomponents in signals. While particular frames are predestined for certain signal classes, there remain classes of signals that cannot be sparsely represented with off-the-shelf frames, say, comprised of dilations/modulations and translations of a generator. The sparse representation problem focuses on the synthesis of signals $\mathbf{y}$ from the span of some overcomplete dictionary $\mathbf{D}$, derived from signal domain knowledge, via the sparsest coefficient vector $\mathbf{x}$ [14]. Formulated as an optimization problem, the task is to
$$\min_{\mathbf{x}} \|\mathbf{x}\|_0 \quad \text{subject to} \quad \mathbf{y} = \mathbf{D}\mathbf{x}, \tag{2}$$
where $\|\mathbf{x}\|_0$ returns the number of nonzero entries of $\mathbf{x}$. To overcome its NP-hardness, this problem is usually relaxed to a convex optimization problem using the $\ell_1$-norm [15]. Based on this approach many algorithms have been proposed to iteratively approximate solutions of (2); for an overview see [16].
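One widely used iterative scheme for the $\ell_1$ relaxation is iterative soft thresholding (ISTA), applied here to the penalized form $\tfrac12\|\mathbf{D}\mathbf{x}-\mathbf{y}\|_2^2 + \lambda\|\mathbf{x}\|_1$. The following is a minimal sketch, not the method of any particular reference cited above; the dictionary, the observation, the value of `lam` and the step size rule are assumptions for the example.

```python
def soft_threshold(v, t):
    """Proximal map of t * |.|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def objective(D, y, x, lam):
    """0.5 * ||D x - y||^2 + lam * ||x||_1."""
    m, n = len(D), len(D[0])
    r = [sum(D[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
    return 0.5 * sum(v * v for v in r) + lam * sum(abs(v) for v in x)

def ista(D, y, lam, steps=2000):
    m, n = len(D), len(D[0])
    # Squared Frobenius norm upper-bounds the largest eigenvalue of D^T D,
    # so 1 / step_bound is a safe (conservative) step size.
    step_bound = sum(D[i][j] ** 2 for i in range(m) for j in range(n))
    x = [0.0] * n
    for _ in range(steps):
        r = [sum(D[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        g = [sum(D[i][j] * r[i] for i in range(m)) for j in range(n)]
        x = [soft_threshold(x[j] - g[j] / step_bound, lam / step_bound)
             for j in range(n)]
    return x

# Hypothetical overcomplete dictionary in R^2 with three unit-norm atoms;
# the observation is exactly twice the third atom.
D = [[1.0, 0.0, 0.6],
     [0.0, 1.0, 0.8]]
y = [1.2, 1.6]
lam = 0.1
x_hat = ista(D, y, lam)
print(x_hat)
```

In this toy instance the observation can be synthesized either densely from the first two atoms or sparsely from the third one alone; the iteration concentrates the weight on the third atom.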
The migration of the sparse representation problem (2) to the era of data-driven methods may be marked by the introduction of K-SVD [17, 18], where a dictionary and sparse coefficients are learned simultaneously for a set of observations.
Transition to deep neural networks: There are many ideas and applications in which neural networks have entered into different aspects of signal processing, see, e.g., [19] for an overview of applications to inverse problems in imaging. One example is the question whether approximate solutions to the sparse representation problem (2) can be derived without using computationally expensive iterative algorithms. To this end, [20] treats the inverse problem (2) as a regression problem based on a deep neural network that is trained on supervised examples of observations and their sparse representations. After training the network, estimates of sparse representations are calculated by a forward pass of new observations through the network. To give a second example, the DNN method has also been used in compressed sensing. In [21] a -sparse solution is estimated from noisy measurements obtained via a Gaussian sensing matrix by solving the problem where is a trained DNN.
Deep representations: Approaches as described in the previous paragraph suggest that signal representations should further leverage the data-driven approach in order to obtain representations with better estimation and interpretation properties. Deep neural networks may be considered a next step in the historical development of signal representation described above, in the sense that data is no longer represented via a single linear representation, like an orthonormal basis, a frame or data-driven dictionary, but via an entire collection of affine linear representations. In the case of piecewise linear activations each individual representation of the collection is used for one particular region of a polytope partition of the signal space. In the remainder of this paper we study deep feedforward rectifier neural networks from this angle. Specifically, the architecture we consider is as follows. For a number $L$ of layers of widths $n_1, \dots, n_L$, a collection of affine linear operators $M_l : \mathbb{R}^{n_{l-1}} \to \mathbb{R}^{n_l}$ and componentwise acting nonlinear activation functions $\rho_l$, $l = 1, \dots, L$, we consider the map $M : \mathbb{R}^{n_0} \to \mathbb{R}^{n_L}$ defined by
$$M(\mathbf{x}) = \rho_L \circ M_L \circ \rho_{L-1} \circ M_{L-1} \circ \cdots \circ \rho_1 \circ M_1(\mathbf{x}),$$
which we will refer to as an $L$-layer deep representation. The affine linear map at the $l$-th layer is given by $M_l(\mathbf{z}) = \mathbf{W}_l \mathbf{z} + \mathbf{b}_l$, with linear part given by a weight matrix $\mathbf{W}_l \in \mathbb{R}^{n_l \times n_{l-1}}$, representing edge weights in the graph interpretation of $M$ as a feedforward neural network, and affine shift $\mathbf{b}_l \in \mathbb{R}^{n_l}$, called bias, representing the offsets of the neurons. We refer to $\mathbb{R}^{n_0}$ as the input space.
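The layered composition described above can be written as a short loop. The sketch below assumes that a rectifier is applied after every affine map, including the last one, and represents each layer as a hypothetical pair `(W, b)` of a weight matrix (list of rows) and a bias vector; the numerical values are arbitrary.

```python
def affine(W, b, x):
    """Affine linear map x -> W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(v):
    """Componentwise rectifier."""
    return [max(0.0, t) for t in v]

def deep_representation(layers, x):
    """Apply affine map and rectifier layer by layer."""
    for W, b in layers:
        x = relu(affine(W, b, x))
    return x

# Hypothetical 2-layer example: R^2 -> R^3 -> R^2.
layers = [
    ([[1.0, -1.0], [0.5, 2.0], [-1.0, 0.0]], [0.0, -1.0, 0.5]),
    ([[1.0, 1.0, -1.0], [-2.0, 0.0, 1.0]], [0.0, 0.5]),
]
out = deep_representation(layers, [1.0, 1.0])
print(out)  # [1.5, 0.5]
```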
Notation:
We denote matrices bold upper case, vectors bold lower case and scalars in normal font. Moreover, we denote by , or , the -th entry of the vector , by the -th column and by the -th row of . The rank one matrix given by the outer product of the column vector and the row vector is denoted by . We use to denote the cardinality of a set, to denote the support of a vector , to denote the identity matrix and to denote the pointwise semi-order on . Finally, for subsets of we shorten notation by denoting a set of the form simply by . Throughout, will be short for .
3 Data-driven expression for ReLU representations
One of the most effective and widely used non-linear activations is the pointwise acting rectifier $\rho(x) = \max\{0, x\}$ for $x \in \mathbb{R}$ [22, 23]. We will refer to a deep representation as a rectified linear unit (ReLU) representation if all its activations are set to be this rectifier. To begin, consider a $2$-layer representation
[TABLE]
with and denote by
[TABLE]
the output and input of the -th rectifier, . Representing the input in terms of the output we have
[TABLE]
The non-linearity (4) can be replaced by using a data-dependent diagonal matrix whose -th diagonal entry is defined as
[TABLE]
The first formulation in (6) captures how functions as processing a rectifier backward from , killing the set-valued entries of in (5) and preserving the other entries. The second formulation states how functions as processing a rectifier forward from its input , letting the positive entries of pass, while setting to zeros the negative entries. There is an ambiguity in how to set the diagonal entry if and in our definition in this case the diagonal entry is set to zero. Note that for a -entry diagonal matrix , (6) is equivalent to imposing the conditions
[TABLE]
While (7) excludes the case that and ; (8) excludes the case that and . Hence, there does not exist a , such that for any of its components . Meanwhile, and if and only if ; as well as and if and only if . We impose (9), which thus happens if and only if and . Hereafter, we keep in mind that the diagonal entry corresponding to is set to zero and neglect (9) to simplify the notation. (Footnote 1: This choice is rendered irrelevant, since it concerns the hyperplane boundary between two regions on which the representation acts affine linearly. By continuity of the representation, the respective affine linear pieces coincide on those boundaries.)
Working backwards through the non-linearities of the representation, i.e., starting with and using , we have , where . Thus, successively expressing the non-linear relation between out- and inputs of the rectifiers using data-dependent -entry diagonal matrices, the representation (3) becomes
[TABLE]
or equivalently
[TABLE]
The general -layer ReLU representation can be expressed as a collection of data-driven affine linear representations:
[TABLE]
We stress that the diagonal matrices are not pre-determined; they are functions of , i.e., depending on the data . The non-linear operator is thus expressed as a set of affine linear operators, each of which is determined by the diagonal matrices , or equivalently by the sign patterns of the input vectors .
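To make the role of the data-dependent diagonal matrices concrete, the following sketch records the 0/1 diagonal of each layer for a given input and then composes the affine linear map that this pattern selects. The function names, the `(W, b)` layer format and the numbers are hypothetical; a diagonal entry is set to zero when the rectifier input is exactly zero, following the convention stated above.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward_with_configuration(layers, x):
    """Run the network and record the 0/1 diagonal of each layer."""
    config = []
    for W, b in layers:
        pre = [s + bi for s, bi in zip(matvec(W, x), b)]
        d = [1 if t > 0 else 0 for t in pre]  # 0 when the rectifier input is 0
        config.append(d)
        x = [di * t for di, t in zip(d, pre)]
    return config, x

def affine_piece(layers, config):
    """Return (A, c) such that A x + c is the composition of all layers
    with each rectifier replaced by its fixed 0/1 diagonal matrix."""
    n0 = len(layers[0][0][0])
    A = [[1.0 if i == j else 0.0 for j in range(n0)] for i in range(n0)]
    c = [0.0] * n0
    for (W, b), d in zip(layers, config):
        A = [[d[i] * sum(W[i][k] * A[k][j] for k in range(len(A)))
              for j in range(n0)] for i in range(len(W))]
        c = [d[i] * (sum(W[i][k] * c[k] for k in range(len(c))) + b[i])
             for i in range(len(W))]
    return A, c

layers = [
    ([[1.0, -1.0], [0.5, 2.0], [-1.0, 0.0]], [0.0, -1.0, 0.5]),
    ([[1.0, 1.0, -1.0], [-2.0, 0.0, 1.0]], [0.0, 0.5]),
]
x = [1.0, 1.0]
config, out = forward_with_configuration(layers, x)
A, c = affine_piece(layers, config)
affine_out = [s + ci for s, ci in zip(matvec(A, x), c)]
print(config, out, affine_out)
```

For this input only the second hidden neuron is active, and evaluating the selected affine map at the same input reproduces the network output.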
Configuration expression:
The above description motivates the following terminology and definitions. In slight abuse of notation, we call any vector derived from the concatenation of certain , , a (diagonal) configuration of the ReLU representation , if the polytope
[TABLE]
is non-empty. For a given configuration of , we define the affine linear map
[TABLE]
with domain . We will also say that induces a configuration, if is non-empty. Then on the restriction to the ReLU representation and the affine linear operator coincide.
Example 1.
*Let be a configuration of a -layer ReLU representation . Then the affine linear operator coincides with on the convex polytope *
[TABLE]
i.e., on the set of all that satisfy
[TABLE]
If and if is a configuration, then defines
[TABLE]
*on the polytope *
[TABLE]
In the remainder of this section we recall how the configurations of a ReLU representation partition the input space in increasingly finer polytopes, before describing in detail the affine linear maps.
3.1 Input space partition
Given a ReLU representation , denote by , for , the set of all configurations of . Then every configuration in is derived from a configuration in via concatenation with a vector from . Note however that not all possible vectors are part of a configuration . Lower estimates of the size of are given in [4]. Whether or not a certain is a configuration depends on . If, say, for some ReLU representation, then must be the zero vector for all and hence for such a deep representation there is only one configuration possible. We consider a slightly less trivial example in more detail.
Example 2.
*Consider a ReLU representation on where is surjective and . Then *
[TABLE]
and the input space is first partitioned by the configurations from into the polygons
[TABLE]
These polygons are further partitioned by the second layer. Since , the only diagonal configuration that can be achieved via a concatenation from to is and thus is not further partitioned. The partitions of , for , are derived depending on the affine transforms . The polygon is partitioned into the union of the two polygons corresponding to and corresponding to , unless one of those sets is empty, in which case the corresponding vector is not a configuration. Altogether, is partitioned into potentially up to convex regions, as illustrated in Figure 1, corresponding to the configurations
[TABLE]
but, depending on the actual parameters, a smaller is possible. Each configuration is associated to an affine linear map via (10), to which is equal to when restricted to the corresponding polytope. The non-linear operator is piecewise affine linear, comprised of the (up to) affine linear maps , . Note that if the bias vectors and are zero, then and , and thus the regions are convex cones arising from halfspace intersections through the origin.
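The dependence of the number of regions on the actual parameters can be explored numerically by sampling the input space and collecting the configurations that occur. The sketch below uses two hypothetical instances on the plane with two first-layer neurons and a single second-layer neuron; the weights, the sampling grid and the function names are assumptions for illustration, and sampling can of course miss regions that contain no grid point.

```python
def configuration(layers, x):
    """0/1 activation pattern of all layers for input x (0 at rectifier input 0)."""
    config = []
    for W, b in layers:
        pre = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
        d = tuple(1 if t > 0 else 0 for t in pre)
        config.append(d)
        x = [di * t for di, t in zip(d, pre)]
    return tuple(config)

def sampled_configurations(layers, lo=-2.0, hi=2.0, step=0.25):
    """Set of configurations observed on a regular grid over [lo, hi]^2."""
    found = set()
    n = int(round((hi - lo) / step)) + 1
    for i in range(n):
        for j in range(n):
            found.add(configuration(layers, [lo + i * step, lo + j * step]))
    return found

identity_layer = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
layers_a = [identity_layer, ([[1.0, 1.0]], [-0.5])]
layers_b = [identity_layer, ([[1.0, -1.0]], [-0.5])]

regions_a = sampled_configurations(layers_a)
regions_b = sampled_configurations(layers_b)
print(len(regions_a), len(regions_b))  # 7 6
```

In both instances the quadrant on which no first-layer neuron is active is not split further, since the second-layer rectifier input is constant there. In the second instance one additional quadrant is not split either, which gives a smaller number of regions.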
We record in the following result how the consecutive layers of an $L$-layer ReLU representation define increasingly finer partitions of the input space. (Footnote 2: To be precise (see the comment on (9)), here partition has to be understood in the sense that the interiors of the participating sets have empty intersection.) Restricted to each polytope of the final partition, the representation is equal to an affine linear operator specified by the diagonal configuration corresponding to that region.
Proposition 3.
Let be a ReLU representation, and the set of configurations of .
- (i)
If , then on the representation coincides with
[TABLE]
- (ii)
Define . Then is a partition of and is a refinement of .
Proof.
The first claim follows by induction on the layers from the construction; and so does the second. Indeed, is a partition of . Suppose partitions , let and denote the domain of the affine map . The entirety of the regions of the configurations induced by partition . Since is defined as the union of the configurations induced by all with , the collection is a refinement of . ∎
In our terminology Proposition 3 reads as the following qualitative result, well known in the literature, e.g. [4], and further illustrated in Figure 2.
Corollary 4.
*(i) Every ReLU representation is a piecewise affine linear operator with respect to a partition of the input space into convex polytopes (on each of which it is affine linear). The number of polytopes is equal to the number of diagonal configurations of the representation.
(ii) If the biases of all layers vanish, then the representation is piecewise linear with respect to a partition of the input space into convex cones.*
3.2 Affine linear maps
We now give a precise characterization in terms of an atomic decomposition for the affine transform induced by a configuration. Here we refer to a rank one matrix (the outer product of two vectors) as an atom. We show that the atoms that linearly combine the linear part of the affine transform induced by the configuration are exclusively determined by the Kronecker product of and . Thus, increasing the number of rows of and the number of columns of (i.e., the widths of layers and ) increases the number of atoms in expressing all affine transform pieces of . The coefficients in the linear combination of the atoms are sums of weight products over paths between those layers. Each path is obtained by taking one entry from one nonvanishing column in for . Increasing the widths and the number of intermediate layers, in different ways, increases the number of paths contributing to a coefficient.
Theorem 5.
Let be a configuration of an -layer ReLU representation . Then the linear part of the affine linear transform induced by is a linear combination of atoms of the form . Specifically:
- (i)
For the linear part of is the sum of atoms.
- (ii)
For the linear part of is a linear combination of atoms and is the coefficient for atom .
- (iii)
For the linear part of the linear combination
[TABLE]
of at most atoms, with coefficients
[TABLE]
each of which is the sum of products consisting of at most one weight from each layer along possible paths.
Proof.
(i) The linear part of is the sum of atoms, namely
[TABLE]
(ii) The -th row of is therefore and thus the linear part of is
[TABLE]
(iii) The -th row of is and thus the linear part of is
[TABLE]
Successively continuing, the linear part of is (11). ∎
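The decomposition can be verified numerically for a small network. The sketch below assumes a 4-layer example with hypothetical integer weights and a fixed 0/1 pattern per layer. It forms each atom as the outer product of a column of the (masked) last weight matrix and a row of the first weight matrix, computes each coefficient as a sum over paths through the hidden layers, and compares the result with the direct matrix product. Placing the 0/1 factors of the intermediate layers inside the coefficients is a bookkeeping choice made here and may differ from the paper's convention. The identity is purely algebraic, so the pattern is not checked for being an actual configuration.

```python
from itertools import product

def diag_left(d, W):
    """Rows of W scaled by the 0/1 entries of d, i.e. the product D W."""
    return [[d[i] * w for w in row] for i, row in enumerate(W)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def path_coefficient(mid, ds, j, k):
    """Sum over all paths from neuron k (layer 1) to neuron j (layer L-1) of the
    products of hidden weights, counting only neurons that are active.
    mid = [W_2, ..., W_{L-1}], ds = [d_1, ..., d_{L-1}]."""
    inner_ranges = [range(len(W)) for W in mid[:-1]]  # neurons of layers 2..L-2
    total = 0.0
    for inner in product(*inner_ranges):
        nodes = (k,) + inner + (j,)
        term = 1.0
        for l, W in enumerate(mid):
            term *= W[nodes[l + 1]][nodes[l]]
        for d, v in zip(ds, nodes):
            term *= d[v]
        total += term
    return total

# Hypothetical weights for widths 2 -> 3 -> 2 -> 3 -> 1 and a fixed 0/1 pattern.
W1 = [[1, 2], [0, 1], [-1, 1]]
W2 = [[1, 0, 2], [-1, 1, 0]]
W3 = [[2, 1], [0, -1], [1, 1]]
W4 = [[1, -1, 2]]
weights = [W1, W2, W3, W4]
config = [[1, 0, 1], [1, 1], [1, 1, 0], [1]]

# Direct product D_4 W_4 D_3 W_3 D_2 W_2 D_1 W_1.
direct = diag_left(config[0], W1)
for W, d in zip(weights[1:], config[1:]):
    direct = diag_left(d, matmul(W, direct))

# Atomic form: sum over (j, k) of coefficient * (column j of D_4 W_4) (row k of W_1).
DW_last = diag_left(config[-1], W4)
n_out, n_in = len(W4), len(W1[0])
atomic = [[0.0] * n_in for _ in range(n_out)]
for j in range(len(W4[0])):
    for k in range(len(W1)):
        coeff = path_coefficient(weights[1:-1], config[:-1], j, k)
        for r in range(n_out):
            for s in range(n_in):
                atomic[r][s] += coeff * DW_last[r][j] * W1[k][s]

print(direct, atomic)
```

Widening the first or the second-to-last layer adds atoms, while widening or adding intermediate layers adds paths to each coefficient, which is the mechanism the theorem describes.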
On the polytope we therefore obtain the following expression for :
[TABLE]
where
[TABLE]
and where the coefficients associated with the column of are
[TABLE]
in the case of layers;
[TABLE]
in the case of layers; and
[TABLE]
in the general case of layers. In particular, the affine transform maps its domain into
[TABLE]
As an immediate application we estimate a Lipschitz bound for the affine map induced by a configuration, which can be interpreted as a measure of the gain from local input perturbations to the resulting output perturbations of the representation on the corresponding polytope. The bound depends on the number of activated rectifiers. Given the atomic representation, it may be beneficial to normalize the columns of the last weight matrix and the rows of the first weight matrix, depending, e.g., on whether the model is used for signal analysis or synthesis. The following result can easily be modified for the case without this normalization assumption.
Theorem 6.
Let be a configuration of an -layer ReLU representation with , and let .
- (i)
Suppose that has normalized columns, that has normalized rows, and let be the maximum of the absolute value of all weights in . Then
[TABLE]
If then is a global Lipschitz bound for .
- (ii)
If is the maximum of the spectral norms of the weight matrices, then
[TABLE]
Proof.
Under the assumptions of (i), for we get
[TABLE]
For the global Lipschitz estimate note that, since , we have
[TABLE]
Part (ii) follows directly from (12). ∎
It is clear that the global Lipschitz bound derived for the representation via the above crude estimate from its affine linear pieces is far from being optimal. As such, Theorem 6 can be regarded as a refinement of a similar global Lipschitz bound derived in [21]. The fact that increasing the number of layers of the representation refines the partitioning of the input space implies that, in order to keep stability, the Lipschitz bound for the representation should be a non-increasing function of the number of layers $L$; otherwise a tiny part of the input space could cause instability of the representation. We are thus particularly interested in deriving a sufficient condition for the Lipschitz bound to not be an increasing function of $L$. Achieving this for the bound in (i) requires a sufficiently small maximum weight magnitude, i.e., achieving a stable representation regardless of the number of layers requires the mean and variance of the weight coefficients to be very small at large $L$. This might be related to the batch normalization technique in learning DNNs [5, 24]. On the other hand, the sufficient condition of having the spectral norms of the weight matrices not exceed $1$ can be achieved via optimization techniques by constraining the Frobenius norms of the weight matrices to not exceed $1$.
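Since the spectral norm of a matrix never exceeds its Frobenius norm and the rectifier is 1-Lipschitz, the product of the Frobenius norms of the weight matrices is a crude global Lipschitz bound. The following sketch compares this bound with empirically observed gains on random input pairs, before and after rescaling every weight matrix to Frobenius norm 1. The architecture, the random weights and the sampling scheme are assumptions for illustration only.

```python
import math
import random

def frobenius(W):
    return math.sqrt(sum(w * w for row in W for w in row))

def forward(layers, x):
    """ReLU network with a rectifier after every affine map."""
    for W, b in layers:
        x = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + bi)
             for row, bi in zip(W, b)]
    return x

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def worst_observed_gain(layers, dim, trials, rng):
    """Largest ratio |M(x) - M(y)| / |x - y| over random pairs (a lower estimate)."""
    worst = 0.0
    for _ in range(trials):
        x = [rng.uniform(-1, 1) for _ in range(dim)]
        y = [rng.uniform(-1, 1) for _ in range(dim)]
        gain = distance(forward(layers, x), forward(layers, y)) / distance(x, y)
        worst = max(worst, gain)
    return worst

rng = random.Random(0)
widths = [3, 5, 5, 5, 2]  # hypothetical architecture
layers = []
for n_in, n_out in zip(widths, widths[1:]):
    W = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
    b = [rng.gauss(0, 1) for _ in range(n_out)]
    layers.append((W, b))

bound = 1.0
for W, _ in layers:
    bound *= frobenius(W)

# Rescale each weight matrix to Frobenius norm 1: the bound is then 1 at any depth.
normalized = [([[w / frobenius(W) for w in row] for row in W], b) for W, b in layers]

gain = worst_observed_gain(layers, widths[0], 500, rng)
gain_normalized = worst_observed_gain(normalized, widths[0], 500, rng)
print(gain, bound, gain_normalized)
```

The observed gains are only lower estimates of the true Lipschitz constant, so the comparison illustrates how loose such product bounds can be rather than measuring their sharpness.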
With regard to Theorem 6(i), we remark on two observations further suggesting that asymptotic stability of the Lipschitz bound for the representation for a large number of layers plays a role in the learning process of function approximation via deep feedforward neural networks. The back-propagation algorithm, designed to carry out the learning task, is based on (sub)gradient descent in the landscape of a loss function in the network parameter space. It is believed that in the course of training, both the maximum magnitude component and the smoothness of the gradient of the loss function affect the learning performance.
We first consider the maximum magnitude component of the gradient. Let denote the restriction of the loss function to and for denote and . Then
[TABLE]
where, for each layer, the accompanying diagonal matrix has entries $0$ or $1$, corresponding to the value of the directional derivative of the rectifier function, which is $1$ if its argument is nonnegative and $0$ otherwise. Note that this value is $1$ at $0$, since the subdifferential of the one-dimensional rectifier at $0$ is the interval $[0, 1]$ and the directional derivative of a one-dimensional convex function is the maximum of the subdifferential. Using the bound on the absolute values of the weights from Theorem 6(i), the maximum magnitude entry of this gradient is bounded by
[TABLE]
If , then increases when decreases. This implies that the maximum magnitude entries of the gradient at early layers can potentially have larger variations, which would hamper the learning performance.
Our second consideration concerns the smoothness of the gradient of the loss function. Globally this gradient is notoriously nonsmooth and thus again we restrict to the individual polytope regions . Assume that there the gradient of the loss function is -smooth, i.e., suppose that for all and , where , the estimate
[TABLE]
holds. For any layer , (13) implies
[TABLE]
Similar to (3.2), here a condition like is needed to guarantee to avoid blowing up of the Lipschitz parameters at early layers during learning.
We hope that having precise expressions such as (11) for deep representations can contribute to paving a way to develop new and to better understand existing regularization techniques such as batch normalization, dropout [6, 25, 26] or deep residual learning [7].
We finally would like to make one more signal processing related remark. The smoothness of a loss function not only relates to the stability in learning a deep representation, it also relates to deriving local minimizers of the loss over the input space using gradient descent. Following (15), we have
[TABLE]
Since
[TABLE]
this implies
[TABLE]
i.e., here again is sufficient to stabilize the smoothness of the gradient for large .
References
- [1] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, 1989.
- [2] K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, vol. 4, no. 2, pp. 251–257, 1991.
- [3] R. Arora, A. Basu, P. Mianjy, and A. Mukherjee, “Understanding deep neural networks with rectified linear units,” in International Conference on Learning Representations (ICLR), 2018.
- [4] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio, “On the number of linear regions of deep neural networks,” in Advances in Neural Information Processing Systems 27, pp. 2924–2932, 2014.
- [5] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 448–456, 2015.
- [6] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” CoRR, 2012.
- [7] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
- [8] E. J. Candès, J. K. Romberg, and T. Tao, “Stable signal recovery from incomplete and inaccurate measurements,” Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.
