Deep Representation with ReLU Neural Networks

Andreas Heinecke; Wen-Liang Hwang

arXiv:1903.12384·cs.LG·April 1, 2019

TL;DR

This paper analyzes deep ReLU neural networks from a signal processing perspective, describing their affine linear regions and atomic decompositions to better understand their representations and stability.

Contribution

It provides a detailed description of the affine linear regions in ReLU networks and proposes conditions for stabilizing learning independent of network depth.

Findings

1. Characterization of affine linear regions in ReLU networks
2. Atomic decomposition of neural representations
3. Conditions for learning stability

Abstract

We consider deep feedforward neural networks with rectified linear units from a signal processing perspective. In this view, such representations mark the transition from using a single (data-driven) linear representation to utilizing a large collection of affine linear representations tailored to particular regions of the signal space. This paper provides a precise description of the individual affine linear representations and corresponding domain regions that the (data-driven) neural network associates to each signal of the input space. In particular, we describe atomic decompositions of the representations and, based on estimating their Lipschitz regularity, suggest some conditions that can stabilize learning independent of the network depth. Such an analysis may promote further theoretical insight from both the signal processing and machine learning communities.

Equations

$$\mathbf{F}=N^{-1/2}\left(\exp(-2\pi i\,jk/N)\right)_{j,k=0,\ldots,N-1},$$

$$A\|\mathbf{y}\|_{2}\leq\|\mathbf{F}^{*}\mathbf{y}\|_{2}\leq B\|\mathbf{y}\|_{2}\quad\text{for all }\mathbf{y}\in\mathbb{R}^{m}.$$

$$\text{minimize }\|\mathbf{x}\|_{0}\quad\text{subject to }\mathbf{y}=\mathbf{D}\mathbf{x},$$

$$\mathcal{M}_{L}(\mathbf{x})=M_{L}\circ\rho_{L-1}\circ M_{L-1}\circ\cdots\circ\rho_{1}\circ M_{1}(\mathbf{x}),$$

$$\mathcal{M}_{3}=M_{3}\rho_{2}M_{2}\rho_{1}M_{1}$$

$$\mathbf{a}_{k}=\rho_{k}\mathbf{y}_{k}$$

$$y_{k,i}=\begin{cases}a_{k,i}&\text{if }a_{k,i}>0,\\(-\infty,0]&\text{if }a_{k,i}=0.\end{cases}$$

$$d_{k,i}=\begin{cases}1&\text{if }a_{k,i}>0\\0&\text{if }a_{k,i}=0\end{cases}=\begin{cases}1&\text{if }y_{k,i}>0\\0&\text{else.}\end{cases}$$

$$\mathbf{0}\leq\mathbf{D}_{k}\mathbf{y}_{k},$$

$$(\mathbf{I}-\mathbf{D}_{k})\mathbf{y}_{k}\leq\mathbf{0}\quad\text{and}$$

$$d_{k,i}=0\text{ if }y_{k,i}=0.$$

$$\begin{cases}\mathbf{y}=M_{3}\mathbf{D}_{2}M_{2}\mathbf{D}_{1}M_{1}\mathbf{x}\\\text{with }d_{1,i}=\begin{cases}1&\text{if }y_{1,i}=(M_{1}\mathbf{x})_{i}>0\\0&\text{else}\end{cases}\text{ and }d_{2,i}=\begin{cases}1&\text{if }y_{2,i}=(M_{2}\mathbf{D}_{1}M_{1}\mathbf{x})_{i}>0\\0&\text{else.}\end{cases}\end{cases}$$

$$\begin{cases}\mathbf{y}=M_{3}\mathbf{D}_{2}M_{2}\mathbf{D}_{1}M_{1}\mathbf{x}\\\text{subject to }\mathbf{0}\leq\mathbf{D}_{k}\mathbf{y}_{k},\ (\mathbf{I}-\mathbf{D}_{k})\mathbf{y}_{k}\leq\mathbf{0};\\\text{for }\mathbf{y}_{k}=M_{k}\mathbf{D}_{k-1}M_{k-1}\cdots M_{1}\mathbf{x}\text{ and }k=1,2.\end{cases}$$

$$\begin{cases}\mathbf{y}=M_{L}\mathbf{D}_{L-1}\cdots M_{2}\mathbf{D}_{1}M_{1}\mathbf{x}\\\text{subject to }\mathbf{0}\leq\mathbf{D}_{k}\mathbf{y}_{k},\ (\mathbf{I}-\mathbf{D}_{k})\mathbf{y}_{k}\leq\mathbf{0};\\\text{for }\mathbf{y}_{k}=M_{k}\mathbf{D}_{k-1}M_{k-1}\cdots M_{1}\mathbf{x}\text{ and }k=1,\ldots,L-1.\end{cases}$$

$$R^{\theta}:=\bigcap_{k=1}^{L-1}\{\mathbf{x}\in\mathcal{X}:$$

$$\mathcal{M}_{L}^{\theta}:=M_{L}\operatorname{diag}(\theta_{L-1})\cdots M_{2}\operatorname{diag}(\theta_{1})M_{1}$$

$$R^{\theta}=\{\operatorname{diag}(\theta_{1})M_{1}\mathbf{x}\geq\mathbf{0}\}\cap\{(\mathbf{I}-\operatorname{diag}(\theta_{1}))M_{1}\mathbf{x}\leq\mathbf{0}\}\cap\{\operatorname{diag}(\theta_{2})\mathcal{M}_{2}^{\theta_{1}}\mathbf{x}\geq\mathbf{0}\}\cap\{(\mathbf{I}-\operatorname{diag}(\theta_{2}))\mathcal{M}_{2}^{\theta_{1}}\mathbf{x}\leq\mathbf{0}\},$$

$$(M_{1}\mathbf{x})_{i}$$

$$\mathcal{M}_{3}^{\theta}=M_{3}\operatorname{diag}(1,0)M_{2}\operatorname{diag}(0,1)M_{1}$$

$$\{(M_{1}\mathbf{x}$$

$$\Theta_{1}=\left\{\theta_{1}^{0}=\begin{bmatrix}0\\0\end{bmatrix},\ \theta_{1}^{1}=\begin{bmatrix}1\\0\end{bmatrix},\ \theta_{1}^{2}=\begin{bmatrix}0\\1\end{bmatrix},\ \theta_{1}^{3}=\begin{bmatrix}1\\1\end{bmatrix}\right\},$$

$$R^{\theta_{1}^{0}}$$

$$R^{\theta_{1}^{2}}$$

$$\Theta_{2}=\left\{\begin{bmatrix}0\\0\\0\end{bmatrix},\begin{bmatrix}0\\1\\0\end{bmatrix},\begin{bmatrix}0\\1\\1\end{bmatrix},\begin{bmatrix}1\\0\\0\end{bmatrix},\begin{bmatrix}1\\0\\1\end{bmatrix},\begin{bmatrix}1\\1\\0\end{bmatrix},\begin{bmatrix}1\\1\\1\end{bmatrix}\right\},$$

$$\mathcal{M}_{k+1}^{\theta}=M_{k+1}\operatorname{diag}(\theta_{k})\cdots M_{2}\operatorname{diag}(\theta_{1})M_{1}.$$

$$\sum_{i_{L-1}\in\operatorname{spt}\theta_{L-1},\,i_{1}\in\operatorname{spt}\theta_{1}}c_{i_{L-1},i_{1}}\,\mathbf{w}_{L,:i_{L-1}}\otimes\mathbf{w}_{1,i_{1}:},$$

$$c_{i_{L-1},i_{1}}=\sum_{i_{L-2}\in\operatorname{spt}\theta_{L-2},\ldots,i_{2}\in\operatorname{spt}\theta_{2}}w_{L-1,i_{L-1}i_{L-2}}\cdots w_{2,i_{2}i_{1}},$$

$$\mathbf{W}_{2}\operatorname{diag}(\theta_{1})\mathbf{W}_{1}$$

$$\mathbf{W}_{3}\operatorname{diag}(\theta_{2})\mathbf{W}_{2}\operatorname{diag}(\theta_{1})\mathbf{W}_{1}=\sum_{i_{2}\in\operatorname{spt}\theta_{2}}\sum_{i_{1}\in\operatorname{spt}\theta_{1}}w_{2,i_{2}i_{1}}\,\mathbf{w}_{3,:i_{2}}\otimes\mathbf{w}_{1,i_{1}:}.$$

$$\mathbf{W}_{4}\operatorname{diag}(\theta_{3})\mathbf{W}_{3}\operatorname{diag}(\theta_{2})\mathbf{W}_{2}\operatorname{diag}(\theta_{1})\mathbf{W}_{1}=\sum_{i_{3}\in\operatorname{spt}\theta_{3},\,i_{1}\in\operatorname{spt}\theta_{1}}\ \sum_{i_{2}\in\operatorname{spt}\theta_{2}}w_{3,i_{3}i_{2}}w_{2,i_{2}i_{1}}\,\mathbf{w}_{4,:i_{3}}\otimes\mathbf{w}_{1,i_{1}:}.$$

$$\mathcal{M}_{L}^{\theta}\mathbf{x}=\sum_{i_{L-1}\in\operatorname{spt}\theta_{L-1}}\alpha_{L,i_{L-1}}(\mathbf{x})\,\mathbf{w}_{L,:i_{L-1}}+\mathbf{b},$$

Taxonomy

Topics: Neural Networks and Applications · Sparse and Compressive Sensing Techniques · Geophysical and Geoelectrical Methods

Full text

Andreas Heinecke

Yale-NUS College, Singapore 138527, Singapore

Wen-Liang Hwang (corresponding author)

Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan

Email addresses: [email protected] (Wen-Liang Hwang), [email protected] (Andreas Heinecke)

1 Introduction

After bringing about impressive and revolutionary results in machine learning tasks in computer vision, speech recognition, and machine translation, deep neural networks (DNNs) have also entered the realm of signal processing. Deep feedforward neural networks can be viewed as a cascade of affine linear transforms and nonlinear activation functions, producing representations of given data. In this view, best visualized via a graph representing the network, the DNN iteratively computes each layer by transforming the output of the previous layer with an affine linear operator and a componentwise acting non-linear activation. From another angle, initiated by the universality theory of shallow neural networks, starting with [1, 2], and of deep neural networks, see e.g. [3], DNNs with piecewise linear activation functions can be viewed as piecewise affine linear functions, affine linear on polytopes that partition the input space [4], that can approximate any function in $L^{p}(\mathbb{R}^{n})$ ($1\leq p\leq\infty$) arbitrarily well. However, the expressive power of a DNN cannot be fully leveraged in signal processing without explicit expressions of the affine linear operators, their domains, ranges, and composition in terms of the weight and bias parameters of the network. This paper addresses the expressive power of a DNN by providing an explicit formulation of each affine linear mapping and its domain for the case of rectifier activations. In Section 2, we discuss how DNNs with piecewise linear activations may be considered a significant modern advancement in the long history of signal processing via linear transforms, marking the transition from universal and data-driven linear transforms to data-driven piecewise linear transforms. In Section 3, we provide a detailed analysis of those piecewise linear transforms for deep feedforward rectifier neural networks.

The main contributions of this paper are a configuration expression that specifies explicitly the hyperplane constraints bounding the domain of each affine linear map, and how those constraints refine the partition of the input space as the number of layers increases, as well as an atomic decomposition (Theorem 5) of the respective affine maps. This characterization of the affine linear pieces unravels precisely how, depending on the region of the input space, the input and output layers of the network determine the atoms of the representation, and how those atoms are linearly combined over many possible paths through the hidden weights of the network. The precise domain specification and atomic decomposition may facilitate new analytic insight into architectural questions, but also into optimization procedures and empirically successful methods, such as BatchNorm [5], dropout [6] or residual learning [7]. As an indication, we give an estimate of the Lipschitz regularity of the atomic decomposition. While it is important as a characteristic of the representation itself, we also relate it to the smoothness of the gradient of the network's loss function, which governs gradient-based training algorithms for DNNs.

2 From orthonormal bases to data-driven representations and deep neural networks

Many problems of science and engineering can be described by the model $\mathbf{y}=\mathcal{M}(\mathbf{x}),$ with input data/signals $\mathbf{x}\in\mathbb{R}^{n}$ , output data/signals $\mathbf{y}\in\mathbb{R}^{m}$ and a linear or non-linear operator $\mathcal{M}$ modelling some process. Among its many instances, it may for example describe an ill-posed inverse problem where one wishes to reconstruct a certain well structured $\mathbf{x}$ from an observed $\mathbf{y}$ ; or a transform, where one wishes to derive some “good” representation $\mathbf{y}$ of the data $\mathbf{x}$ . The measurement process or transform $\mathcal{M}$ often contains a linear/non-linear component subject to constraints stemming from, say, physics or engineering. A classic instance is the phaseless reconstruction problem, in which one observes only the modulus of linear Fourier coefficients, thus being an inverse problem consisting of an analysis with a non-linear measurement process. Another instance is synthesis of linear measurements with prior information, e.g., sparsity of wavelet frame coefficients in imaging. One may also wish to design $\mathcal{M}$ such that the reconstruction $\mathbf{y}\mapsto\mathbf{x}$ becomes possible, stable and/or fast. For example, in compressed sensing [8, 9] one is interested in designing sensing matrices that allow the recovery of sparse vectors from significantly fewer linear measurements than the signal dimension.

Orthonormal bases and frames: For centuries, conventional wisdom suggested that, whenever possible, one should use an orthonormal basis to represent signals. Different orthonormal bases may allow for sparse representations of certain classes of data. The most classic example is the Fourier basis, given by the columns of the matrix

$$\mathbf{F}=N^{-1/2}\left(\exp(-2\pi i\,jk/N)\right)_{j,k=0,\ldots,N-1},$$

with the help of which many oscillatory signals become sparsely represented, allowing insight into many phenomena of physics and chemistry. As the Fourier basis is orthonormal, the coefficients of the representation $\mathbf{y}=\mathbf{F}\mathbf{x}$ , simply given via conjugate transpose, are $\mathbf{x}=\mathbf{F}^{*}\mathbf{y}$ .
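As a small numerical illustration of the statements above, the following NumPy sketch builds the unitary Fourier matrix $\mathbf{F}=N^{-1/2}(\exp(-2\pi i\,jk/N))_{j,k}$, checks its orthonormality, and shows that a pure oscillation has a single significant coefficient. The size `N = 8`, the test frequency, and the threshold `1e-9` are arbitrary choices made for this example and do not come from the paper.

```python
import numpy as np

def fourier_matrix(N):
    """Unitary DFT matrix F = N^{-1/2} (exp(-2*pi*i*j*k/N)) for j, k = 0..N-1."""
    idx = np.arange(N)
    return np.exp(-2j * np.pi * np.outer(idx, idx) / N) / np.sqrt(N)

N = 8  # arbitrary size chosen for the illustration
F = fourier_matrix(N)

# Orthonormality: F^* F should be the identity up to rounding.
unitary_error = np.max(np.abs(F.conj().T @ F - np.eye(N)))

# A pure oscillation at frequency 2 (hypothetical test signal).
x = np.exp(2j * np.pi * 2 * np.arange(N) / N)
y = F @ x                    # representation y = F x
x_rec = F.conj().T @ y       # recovery via the conjugate transpose, x = F^* y

# Number of coefficients that are not numerically zero.
num_significant = int(np.sum(np.abs(y) > 1e-9))
```

Here `num_significant` counts the nonzero Fourier coefficients of the oscillation, which is the sense in which such signals are sparsely represented.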

In many situations orthonormal bases are far from the ideal choice for a representation and it can have great advantages to give up the linear independence imposed on the elements of orthonormal bases. Frames [10] are advancements of orthonormal bases, derived by relaxing Parseval’s identity to a pair of inequalities: A matrix $\mathbf{F}$ is the synthesis matrix of a frame, if there are constants $0<A\leq B$ such that

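$$A\|\mathbf{y}\|_{2}\leq\|\mathbf{F}^{*}\mathbf{y}\|_{2}\leq B\|\mathbf{y}\|_{2}\quad\text{for all }\mathbf{y}\in\mathbb{R}^{m}.\tag{1}$$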

Frames are thus precisely those systems for which signals can be stably reconstructed from linear measurements. For any frame there are, in general many, dual frames $\mathbf{G}$ , which provide perfect reconstruction of the signal from the linear measurements in the sense that $\mathbf{y}=\mathbf{F}\mathbf{G}^{*}\mathbf{y}$ for all $\mathbf{y}$ . Dual frames can be derived via different incarnations of a duality principle that hinges on exploiting the adjoint nature of the involved operators. In case of tight frames, i.e., if (1) holds with equality, it is possible to choose $\mathbf{G}=\mathbf{F}$ , but different dual frames can be chosen to optimally adapt to practical considerations such as, say, minimization of quantization errors.

A major advantage of frames is that signals from large classes of data may have common structural features that often translate to the fact that choosing an appropriate frame can force a dimensionality reduction in the sense that the data is sparsely representable via the frame. In audio processing, time-varying frequencies are captured sparsely via Gabor tight frames [11], comprised of translations and modulations of a window function. In image processing, wavelet frames [12, 13] of shifts and dilations of fast oscillating zero-mean functions can be used to compress and process piecewise smooth images using very few significant coefficients. In both examples, orthonormality is usually given up to gain desired properties, e.g., joint time-frequency localization of the generator in case of Gabor frames, or joint smoothness, symmetry and compact support in the case of the generators of wavelet frames.

Sparse representation and dictionary learning: Frames that enable sparse representations of signals yield great advantages, for instance in the interpretation and estimation of the main subcomponents in signals. While particular frames are predestined for certain signal classes, there remain classes of signals that cannot be sparsely represented with off-the-shelf frames, say, comprised of dilations/modulations and translations of a generator. The sparse representation problem focuses on the synthesis of signals $\mathbf{y}$ from the span of some overcomplete dictionary $\mathbf{D}$ , derived from signal domain knowledge, via the sparsest coefficient vector $\mathbf{x}$ , [14]. Formulated as an optimization problem, the task is to

$$\text{minimize }\|\mathbf{x}\|_{0}\quad\text{subject to }\mathbf{y}=\mathbf{D}\mathbf{x},\tag{2}$$

where $\|\mathbf{x}\|_{0}$ returns the number of nonzero entries of $\mathbf{x}$ . To overcome its NP-hardness, this problem is usually relaxed to a convex optimization problem using the $\ell_{1}$ -norm [15]. Based on this approach many algorithms have been proposed to iteratively approximate solutions of (2), for an overview see [16].
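For concreteness, the convex relaxation mentioned above is commonly written in the following form, often called basis pursuit, in which the count of nonzero entries is replaced by the $\ell_{1}$-norm. The display is a standard formulation added here for illustration; it is not reproduced from this paper.

```latex
\text{minimize } \|\mathbf{x}\|_{1}
\quad \text{subject to } \mathbf{y} = \mathbf{D}\mathbf{x},
\qquad \text{where } \|\mathbf{x}\|_{1} = \sum_{i} |x_{i}|.
```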

The migration of the sparse representation problem (2) to the era of data-driven methods may be marked with the introduction of K-SVD, [17, 18], where a dictionary and sparse coefficients are being simultaneously learned for a set of observations.

Transition to deep neural networks: There are many ideas and applications in which neural networks have entered into different aspects of signal processing, see, e.g., [19] for an overview of applications to inverse problems in imaging. One example is the question whether approximate solutions to the sparse representation problem (2) can be derived without using computationally expensive iterative algorithms. To this end, [20] treats the inverse problem (2) as a regression problem based on a deep neural network that is trained on supervised examples of observations and their sparse representations. After training the network, estimates of sparse representations are calculated by a forward pass of new observations through the network. To give a second example, the DNN method has also been used in compressed sensing. In [21] a $k$ -sparse solution is estimated from noisy measurements $\mathbf{y}=\mathbf{A}\mathbf{x}$ obtained via a Gaussian sensing matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$ by solving the problem $\min_{\mathbf{x}^{*}}\|\mathbf{A}\mathbf{G}(\mathbf{x}^{*})-\mathbf{y}\|_{2},$ where $\mathbf{G}\colon\mathbb{R}^{k}\to\mathbb{R}^{n}$ is a trained DNN.

Deep representations: Approaches as described in the previous paragraph suggest that signal representations should further leverage the data-driven approach in order to obtain representations with better estimation and interpretation properties. Deep neural networks may be considered a next step in the historical development of signal representation described above, in the sense that data is no longer represented via a single linear representation, like an orthonormal basis, a frame or data driven dictionary, but via an entire collection of affine linear representations. In the case of piecewise linear activations each individual representation of the collection is used for one particular region of a polytope partition of the signal space. In the remainder of this paper we study deep feedforward rectifier neural networks from this angle. Specifically, the architecture we consider is as follows. For a number $L\in\mathbb{N}$ of layers of widths $N_{0},\ldots,N_{L}\in\mathbb{N}$ , a collection of affine linear operators $\{M_{\ell}\colon\mathbb{R}^{N_{\ell-1}}\to\mathbb{R}^{N_{\ell}}\}_{\ell=1}^{L}$ and componentwise acting nonlinear activation functions $\{\rho_{\ell}\}_{\ell=1}^{L-1}$ we consider the map $\mathcal{M}_{L}\colon\mathbb{R}^{N_{0}}\to\mathbb{R}^{N_{L}}$ defined by

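$$\mathcal{M}_{L}(\mathbf{x})=M_{L}\circ\rho_{L-1}\circ M_{L-1}\circ\cdots\circ\rho_{1}\circ M_{1}(\mathbf{x}),$$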

which we refer to as an $L$-layer deep representation. The affine linear map at the $\ell$-th layer is given by $M_{\ell}(\mathbf{x})=\mathbf{W}_{\ell}\mathbf{x}+\mathbf{b}_{\ell}$, with linear part given by a weight matrix $\mathbf{W}_{\ell}\in\mathbb{R}^{N_{\ell}\times N_{\ell-1}}$, representing edge weights in the graph interpretation of $\mathcal{M}_{L}$ as a feedforward neural network, and affine shift $\mathbf{b}_{\ell}$, called bias, representing the offsets of the neurons. We refer to $\mathcal{X}=\mathbb{R}^{N_{0}}$ as the input space.
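The composition defining $\mathcal{M}_{L}$ can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the activation is taken to be the rectifier studied later in the paper, and the function name `deep_representation`, the layer widths, and the random parameters are hypothetical.

```python
import numpy as np

def relu(t):
    """Componentwise rectifier, used here as the activation rho."""
    return np.maximum(0.0, t)

def deep_representation(x, weights, biases):
    """Evaluate M_L(rho(M_{L-1}(... rho(M_1(x)) ...))), with M_l(x) = W_l x + b_l.

    No activation is applied after the last affine map M_L.
    """
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)
    return weights[-1] @ a + biases[-1]

# Hypothetical widths N_0, ..., N_L for L = 3 layers and random parameters.
rng = np.random.default_rng(0)
widths = [3, 5, 4, 2]
weights = [rng.standard_normal((m, n)) for n, m in zip(widths[:-1], widths[1:])]
biases = [rng.standard_normal(m) for m in widths[1:]]

x = rng.standard_normal(widths[0])
y = deep_representation(x, weights, biases)  # vector in R^{N_L}
```

Each weight matrix has shape $N_{\ell}\times N_{\ell-1}$, so the output `y` lies in $\mathbb{R}^{N_{L}}$.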

Notation:

We denote matrices in bold upper case, vectors in bold lower case and scalars in normal font. Moreover, we denote by $x_{i}$, or $(\mathbf{x})_{i}$, the $i$-th entry of the vector $\mathbf{x}$, by $\mathbf{w}_{k,:i}$ the $i$-th column and by $\mathbf{w}_{k,j:}$ the $j$-th row of $\mathbf{W}_{k}$. The rank one matrix given by the outer product of the column vector $\mathbf{w}_{k,:i}$ and the row vector $\mathbf{w}_{k,j:}$ is denoted by $\mathbf{w}_{k,:i}\otimes\mathbf{w}_{k,j:}$. We use $|\cdot|$ to denote the cardinality of a set, $\operatorname{spt}\mathbf{x}$ to denote the support of a vector $\mathbf{x}$, $\mathbf{I}$ to denote the identity matrix and $\leq$ to denote the pointwise semi-order on $\mathbb{R}^{n}$. Finally, for subsets of $\mathbb{R}^{n}$ we shorten notation by denoting a set of the form $\{\mathbf{x}\in\mathbb{R}^{n}\colon M_{\ell}\mathbf{x}\geq\mathbf{0}\}$ simply by $\{M_{\ell}\mathbf{x}\geq\mathbf{0}\}$. Throughout, $M_{\ell}\mathbf{x}$ is short for $M_{\ell}(\mathbf{x})$.

3 Data-driven expression for ReLU representations

One of the most effective and widely used non-linear activations is the pointwise acting rectifier $\rho(t):=\max(0,t)$ for $t\in\mathbb{R}$ , [22, 23]. We will refer to $\mathcal{M}_{L}$ as a rectified linear unit (ReLU) representation if all its activations $\rho_{k}$ are set to be this rectifier. To begin, consider a $3$ -layer representation

$$\mathcal{M}_{3}=M_{3}\rho_{2}M_{2}\rho_{1}M_{1}\tag{3}$$

with $\rho_{1}=\rho_{2}=\rho$ and denote by

$$\mathbf{a}_{k}=\rho_{k}\mathbf{y}_{k}\tag{4}$$

the output and input of the $k$ -th rectifier, $k=1,2$ . Representing the input in terms of the output we have

$$y_{k,i}=\begin{cases}a_{k,i}&\text{if }a_{k,i}>0,\\(-\infty,0]&\text{if }a_{k,i}=0.\end{cases}\tag{5}$$

The non-linearity (4) can be replaced by $\mathbf{a}_{k}=\mathbf{D}_{k}\mathbf{y}_{k},$ using a data-dependent diagonal matrix $\mathbf{D}_{k}$ whose $i$ -th diagonal entry is defined as

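$$d_{k,i}=\begin{cases}1&\text{if }a_{k,i}>0\\0&\text{if }a_{k,i}=0\end{cases}=\begin{cases}1&\text{if }y_{k,i}>0\\0&\text{else.}\end{cases}\tag{6}$$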

The first formulation in (6) captures how $\mathbf{D}_{k}$ functions as processing a rectifier backward from $\mathbf{a}_{k}$ , killing the set-valued entries of $\mathbf{y}_{k}$ in (5) and preserving the other entries. The second formulation states how $\mathbf{D}_{k}$ functions as processing a rectifier forward from its input $\mathbf{y}_{k}$ , letting the positive entries of $\mathbf{y}_{k}$ pass, while setting to zeros the negative entries. There is an ambiguity in how to set the diagonal entry if $y_{k,i}=0$ and in our definition in this case the diagonal entry is set to zero. Note that for a $\{0,1\}$ -entry diagonal matrix $\mathbf{D}_{k}$ , (6) is equivalent to imposing the conditions

$$\mathbf{0}\leq\mathbf{D}_{k}\mathbf{y}_{k},\tag{7}$$

$$(\mathbf{I}-\mathbf{D}_{k})\mathbf{y}_{k}\leq\mathbf{0}\quad\text{and}\tag{8}$$

$$d_{k,i}=0\text{ if }y_{k,i}=0.\tag{9}$$

Condition (7) excludes the case that $y_{k,i}<0$ and $d_{k,i}=1$, while (8) excludes the case that $y_{k,i}>0$ and $d_{k,i}=0$. Hence, there is no $\mathbf{y}_{k}$ for which some component satisfies both $0<(\mathbf{D}_{k}\mathbf{y}_{k})_{i}$ and $((\mathbf{I}-\mathbf{D}_{k})\mathbf{y}_{k})_{i}<0$. Meanwhile, $y_{k,i}<0$ and $d_{k,i}=0$ if and only if $0=(\mathbf{D}_{k}\mathbf{y}_{k})_{i}$ and $((\mathbf{I}-\mathbf{D}_{k})\mathbf{y}_{k})_{i}<0$; likewise, $y_{k,i}>0$ and $d_{k,i}=1$ if and only if $0<(\mathbf{D}_{k}\mathbf{y}_{k})_{i}$ and $((\mathbf{I}-\mathbf{D}_{k})\mathbf{y}_{k})_{i}=0$. The remaining case $y_{k,i}=0$ occurs if and only if $(\mathbf{D}_{k}\mathbf{y}_{k})_{i}=0$ and $((\mathbf{I}-\mathbf{D}_{k})\mathbf{y}_{k})_{i}=0$, and for this case we impose (9). Hereafter, we keep in mind that the diagonal entry corresponding to $y_{k,i}=0$ is set to zero and neglect (9) to simplify the notation.¹

¹ This choice will be rendered irrelevant since it concerns the hyperplane boundary between two regions on which the representation acts affine linearly. By continuity of the representation, the respective affine linear pieces coincide on those boundaries.

Working backwards through the non-linearities of the representation, i.e., starting with $\mathbf{y}=M_{3}\mathbf{a}_{2}$ and using $\mathbf{a}_{2}=\rho_{2}\mathbf{y}_{2}=\mathbf{D}_{2}\mathbf{y}_{2}$ , we have $\mathbf{y}=M_{3}\mathbf{D}_{2}\mathbf{y}_{2}=M_{3}\mathbf{D}_{2}M_{2}\mathbf{a}_{1}$ , where $\mathbf{a}_{1}=\rho_{1}\mathbf{y}_{1}$ . Thus, successively expressing the non-linear relation between out- and inputs of the rectifiers using data-dependent $\{0,1\}$ -entry diagonal matrices, the representation (3) becomes

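$$\begin{cases}\mathbf{y}=M_{3}\mathbf{D}_{2}M_{2}\mathbf{D}_{1}M_{1}\mathbf{x}\\\text{with }d_{1,i}=\begin{cases}1&\text{if }y_{1,i}=(M_{1}\mathbf{x})_{i}>0\\0&\text{else}\end{cases}\text{ and }d_{2,i}=\begin{cases}1&\text{if }y_{2,i}=(M_{2}\mathbf{D}_{1}M_{1}\mathbf{x})_{i}>0\\0&\text{else,}\end{cases}\end{cases}$$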

or equivalently

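$$\begin{cases}\mathbf{y}=M_{3}\mathbf{D}_{2}M_{2}\mathbf{D}_{1}M_{1}\mathbf{x}\\\text{subject to }\mathbf{0}\leq\mathbf{D}_{k}\mathbf{y}_{k},\ (\mathbf{I}-\mathbf{D}_{k})\mathbf{y}_{k}\leq\mathbf{0};\\\text{for }\mathbf{y}_{k}=M_{k}\mathbf{D}_{k-1}M_{k-1}\cdots M_{1}\mathbf{x}\text{ and }k=1,2.\end{cases}$$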

The general $L$ -layer ReLU representation $\mathcal{M}_{L}$ can be expressed as a collection of data-driven affine linear representations:

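$$\begin{cases}\mathbf{y}=M_{L}\mathbf{D}_{L-1}\cdots M_{2}\mathbf{D}_{1}M_{1}\mathbf{x}\\\text{subject to }\mathbf{0}\leq\mathbf{D}_{k}\mathbf{y}_{k},\ (\mathbf{I}-\mathbf{D}_{k})\mathbf{y}_{k}\leq\mathbf{0};\\\text{for }\mathbf{y}_{k}=M_{k}\mathbf{D}_{k-1}M_{k-1}\cdots M_{1}\mathbf{x}\text{ and }k=1,\ldots,L-1.\end{cases}$$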

We stress that the diagonal matrices are not pre-determined; they are functions of $\mathbf{y}_{k}$ , i.e., depending on the data $\mathbf{x}$ . The non-linear operator $\mathcal{M}_{L}$ is thus expressed as a set of affine linear operators, each of which is determined by the diagonal matrices $\mathbf{D}_{1},\ldots,\mathbf{D}_{L-1}$ , or equivalently by the sign patterns of the input vectors $\mathbf{y}_{k}$ .
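The replacement of each rectifier by a data-dependent diagonal matrix can be checked numerically. The sketch below is an illustration under stated assumptions: the function names, layer widths, and random parameters are hypothetical, and the diagonal entry is set to $1$ exactly when $y_{k,i}>0$, following the convention described above.

```python
import numpy as np

def activation_matrices(x, weights, biases):
    """Return D_1, ..., D_{L-1} for input x, with d_{k,i} = 1 iff y_{k,i} > 0."""
    Ds, a = [], x
    for W, b in zip(weights[:-1], biases[:-1]):
        y_k = W @ a + b                       # input of the k-th rectifier
        D_k = np.diag((y_k > 0).astype(float))
        Ds.append(D_k)
        a = D_k @ y_k                         # equals rho(y_k)
    return Ds

def evaluate_with_matrices(x, weights, biases, Ds):
    """Evaluate M_L D_{L-1} ... M_2 D_1 M_1 x for given diagonal matrices."""
    y = weights[0] @ x + biases[0]
    for D_k, W, b in zip(Ds, weights[1:], biases[1:]):
        y = W @ (D_k @ y) + b
    return y

def evaluate_with_relu(x, weights, biases):
    """Reference evaluation that applies the rectifier directly."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.maximum(0.0, W @ a + b)
    return weights[-1] @ a + biases[-1]

# Hypothetical 3-layer network with random parameters.
rng = np.random.default_rng(1)
widths = [4, 6, 5, 3]
weights = [rng.standard_normal((m, n)) for n, m in zip(widths[:-1], widths[1:])]
biases = [rng.standard_normal(m) for m in widths[1:]]

x = rng.standard_normal(widths[0])
Ds = activation_matrices(x, weights, biases)
y_matrix_form = evaluate_with_matrices(x, weights, biases, Ds)
y_relu_form = evaluate_with_relu(x, weights, biases)
```

The two evaluations agree for the input that generated `Ds`; for a different input the matrices generally have to be recomputed, which reflects the dependence of the $\mathbf{D}_{k}$ on the data.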

Configuration expression:

The above description motivates the following terminology and definitions. In slight abuse of notation, we call any vector $\theta=[\theta_{1}^{\top},\ldots,\theta_{L-1}^{\top}]^{\top}\in\{0,1\}^{N_{1}+\cdots+N_{L-1}}$ derived from the concatenation of certain $\theta_{k}\in\{0,1\}^{N_{k}}$ , $k=1\ldots,L-1$ , a (diagonal) configuration of the ReLU representation $\mathcal{M}_{L}$ , if the polytope

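$$R^{\theta}:=\bigcap_{k=1}^{L-1}\left\{\mathbf{x}\in\mathcal{X}\colon\ \mathbf{0}\leq\operatorname{diag}(\theta_{k})\mathbf{y}_{k},\ (\mathbf{I}-\operatorname{diag}(\theta_{k}))\mathbf{y}_{k}\leq\mathbf{0}\ \text{ for }\ \mathbf{y}_{k}=M_{k}\operatorname{diag}(\theta_{k-1})M_{k-1}\cdots\operatorname{diag}(\theta_{1})M_{1}\mathbf{x}\right\}$$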

is non-empty. For a given configuration $\theta$ of $\mathcal{M}_{L}$ , we define the affine linear map

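$$\mathcal{M}_{L}^{\theta}:=M_{L}\operatorname{diag}(\theta_{L-1})\cdots M_{2}\operatorname{diag}(\theta_{1})M_{1}\tag{10}$$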

with domain $R^{\theta}$ . We will also say that $\mathcal{M}_{L}^{\theta}$ induces a configuration, if $R^{\theta}$ is non-empty. Then on the restriction to $R^{\theta}$ the ReLU representation $\mathcal{M}_{L}$ and the affine linear operator $\mathcal{M}_{L}^{\theta}$ coincide.
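To make the terminology concrete, the following sketch computes the configuration of a given input, assembles the induced affine map as an explicit pair $(\mathbf{A},\mathbf{c})$ with $\mathcal{M}_{L}^{\theta}\mathbf{x}=\mathbf{A}\mathbf{x}+\mathbf{c}$, and tests membership in the polytope through the sign constraints $\mathbf{0}\leq\operatorname{diag}(\theta_{k})\mathbf{y}_{k}$ and $(\mathbf{I}-\operatorname{diag}(\theta_{k}))\mathbf{y}_{k}\leq\mathbf{0}$. All function names, the tolerance `tol`, and the network sizes are assumptions made for this example.

```python
import numpy as np

def configuration(x, weights, biases):
    """Return [theta_1, ..., theta_{L-1}], the 0/1 activation pattern of x."""
    thetas, a = [], x
    for W, b in zip(weights[:-1], biases[:-1]):
        y_k = W @ a + b
        theta_k = (y_k > 0).astype(float)
        thetas.append(theta_k)
        a = theta_k * y_k
    return thetas

def affine_map(thetas, weights, biases):
    """Return (A, c) such that M_L diag(theta_{L-1}) ... diag(theta_1) M_1 x = A x + c."""
    A, c = weights[0], biases[0]
    for theta_k, W, b in zip(thetas, weights[1:], biases[1:]):
        A = W @ (theta_k[:, None] * A)   # W diag(theta_k) A
        c = W @ (theta_k * c) + b
    return A, c

def in_region(x, thetas, weights, biases, tol=1e-12):
    """Check the hyperplane constraints that define the polytope of the configuration."""
    A, c = weights[0], biases[0]
    for theta_k, W, b in zip(thetas, weights[1:], biases[1:]):
        y_k = A @ x + c                   # input of the k-th rectifier under theta
        if np.any(theta_k * y_k < -tol) or np.any((1.0 - theta_k) * y_k > tol):
            return False
        A, c = W @ (theta_k[:, None] * A), W @ (theta_k * c) + b
    return True

# Hypothetical network with random parameters.
rng = np.random.default_rng(2)
widths = [2, 3, 3, 1]
weights = [rng.standard_normal((m, n)) for n, m in zip(widths[:-1], widths[1:])]
biases = [rng.standard_normal(m) for m in widths[1:]]

x = rng.standard_normal(widths[0])
thetas = configuration(x, weights, biases)
A, c = affine_map(thetas, weights, biases)

# Direct evaluation with the rectifier, for comparison.
a = x
for W, b in zip(weights[:-1], biases[:-1]):
    a = np.maximum(0.0, W @ a + b)
y_direct = weights[-1] @ a + biases[-1]

agrees = np.allclose(A @ x + c, y_direct)
inside = in_region(x, thetas, weights, biases)
```

Storing the map as $(\mathbf{A},\mathbf{c})$ separates the linear part from the affine shift, which is convenient when only the linear part is of interest.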

Example 1.

Let $\theta=[\theta_{1}^{\top},\theta_{2}^{\top}]^{\top}$ be a configuration of a $3$-layer ReLU representation $\mathcal{M}_{3}$. Then the affine linear operator $\mathcal{M}_{3}^{\theta}=M_{3}\operatorname{diag}(\theta_{2})M_{2}\operatorname{diag}(\theta_{1})M_{1}$ coincides with $\mathcal{M}_{3}$ on the convex polytope

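$$R^{\theta}=\{\operatorname{diag}(\theta_{1})M_{1}\mathbf{x}\geq\mathbf{0}\}\cap\{(\mathbf{I}-\operatorname{diag}(\theta_{1}))M_{1}\mathbf{x}\leq\mathbf{0}\}\cap\{\operatorname{diag}(\theta_{2})\mathcal{M}_{2}^{\theta_{1}}\mathbf{x}\geq\mathbf{0}\}\cap\{(\mathbf{I}-\operatorname{diag}(\theta_{2}))\mathcal{M}_{2}^{\theta_{1}}\mathbf{x}\leq\mathbf{0}\},$$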

i.e., on the set of all $\mathbf{x}\in\mathcal{X}$ that satisfy

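$$\begin{aligned}(M_{1}\mathbf{x})_{i}&\geq 0\ \text{ for }i\in\operatorname{spt}\theta_{1},&(M_{1}\mathbf{x})_{i}&\leq 0\ \text{ for }i\notin\operatorname{spt}\theta_{1},\\(M_{2}\operatorname{diag}(\theta_{1})M_{1}\mathbf{x})_{i}&\geq 0\ \text{ for }i\in\operatorname{spt}\theta_{2},&(M_{2}\operatorname{diag}(\theta_{1})M_{1}\mathbf{x})_{i}&\leq 0\ \text{ for }i\notin\operatorname{spt}\theta_{2}.\end{aligned}$$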

If $\mathcal{X}=\mathbb{R}^{2}$ and if $\theta=[0,1,1,0]^{\top}\in\mathbb{R}^{2+2}$ is a configuration, then $\theta$ defines

$$\mathcal{M}_{3}^{\theta}=M_{3}\operatorname{diag}(1,0)M_{2}\operatorname{diag}(0,1)M_{1}$$

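on the polytope

$$\{(M_{1}\mathbf{x})_{1}\leq 0\}\cap\{(M_{1}\mathbf{x})_{2}\geq 0\}\cap\{(M_{2}\operatorname{diag}(0,1)M_{1}\mathbf{x})_{1}\geq 0\}\cap\{(M_{2}\operatorname{diag}(0,1)M_{1}\mathbf{x})_{2}\leq 0\}.$$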

In the remainder of this section we recall how the configurations of a ReLU representation partition the input space in increasingly finer polytopes, before describing in detail the affine linear maps.

3.1 Input space partition

Given a ReLU representation $\mathcal{M}_{L}$ , denote by $\Theta_{k}$ , for $k=1,\ldots,L-1$ , the set of all configurations of $\mathcal{M}_{k+1}$ . Then every configuration in $\Theta_{k}$ is derived from a configuration in $\Theta_{k-1}$ via concatenation with a vector from $\{0,1\}^{N_{k}}$ . Note however that not all $2^{N_{k}}$ possible vectors $\theta_{k}\in\{0,1\}^{N_{k}}$ are part of a configuration $[\theta_{1}^{\top},\ldots,\theta_{k}^{\top}]^{\top}\in\Theta_{k}$ . Lower estimates of the size of $\Theta_{k}$ are given in [4]. Whether or not a certain $[\theta_{1}^{\top},\ldots,\theta_{k}^{\top}]^{\top}$ is a configuration depends on $\mathcal{M}_{k}^{[\theta_{1}^{\top},\ldots,\theta_{k-1}^{\top}]^{\top}}$ . If, say, $M_{1}=0$ for some ReLU representation, then $\theta_{k}$ must be the zero vector for all $k=1,\ldots,L-1$ and hence for such a deep representation there is only one configuration $\theta=[0,\ldots,0]^{\top}$ possible. We consider a slightly less trivial example in more detail.

Example 2.

Consider a ReLU representation $\mathcal{M}_{3}$ on $\mathcal{X}=\mathbb{R}^{2}$ where $M_{1}\colon\mathbb{R}^{2}\to\mathbb{R}^{2}$ is surjective and $M_{2}\colon\mathbb{R}^{2}\to\mathbb{R}$. Then

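$$\Theta_{1}=\left\{\theta_{1}^{0}=\begin{bmatrix}0\\0\end{bmatrix},\ \theta_{1}^{1}=\begin{bmatrix}1\\0\end{bmatrix},\ \theta_{1}^{2}=\begin{bmatrix}0\\1\end{bmatrix},\ \theta_{1}^{3}=\begin{bmatrix}1\\1\end{bmatrix}\right\},$$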

and the input space $\mathcal{X}$ is first partitioned by the configurations from $\Theta_{1}$ into the polygons

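$$\begin{aligned}R^{\theta_{1}^{0}}&=\{(M_{1}\mathbf{x})_{1}\leq 0\}\cap\{(M_{1}\mathbf{x})_{2}\leq 0\},&R^{\theta_{1}^{1}}&=\{(M_{1}\mathbf{x})_{1}\geq 0\}\cap\{(M_{1}\mathbf{x})_{2}\leq 0\},\\R^{\theta_{1}^{2}}&=\{(M_{1}\mathbf{x})_{1}\leq 0\}\cap\{(M_{1}\mathbf{x})_{2}\geq 0\},&R^{\theta_{1}^{3}}&=\{(M_{1}\mathbf{x})_{1}\geq 0\}\cap\{(M_{1}\mathbf{x})_{2}\geq 0\}.\end{aligned}$$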

These polygons are further partitioned by the second layer. Since $M_{2}\operatorname{diag}(\theta_{1}^{0})M_{1}=0$ , the only diagonal configuration that can be achieved via a concatenation from $\{0,1\}$ to $\theta_{1}^{0}$ is $\theta_{2}^{0}=[0,0,0]^{\top}$ and thus $R^{\theta_{1}^{0}}$ is not further partitioned. The partitions of $R^{\theta_{1}^{j}}$ , for $j=1,2,3$ , are derived depending on the affine transforms $M_{2}\operatorname{diag}(\theta_{1}^{j})M_{1}$ . The polygon $R^{\theta_{1}^{j}}$ is partitioned into the union of the two polygons $\{\mathbf{x}\in R^{\theta_{1}^{j}}\colon M_{2}\operatorname{diag}(\theta_{1}^{j})M_{1}\mathbf{x}>0\}$ corresponding to $[(\theta_{1}^{j})^{\top},1]^{\top}$ and $\{\mathbf{x}\in R^{\theta_{1}^{j}}\colon M_{2}\operatorname{diag}(\theta_{1}^{j})M_{1}\mathbf{x}\leq 0\}$ corresponding to $[(\theta_{1}^{j})^{\top},0]^{\top}$ , unless one of those sets is empty, in which case the corresponding vector is not a configuration. Altogether, $\mathcal{X}$ is partitioned into potentially up to $7$ convex regions, as illustrated in Figure 1, corresponding to the configurations

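$$\Theta_{2}=\left\{\begin{bmatrix}0\\0\\0\end{bmatrix},\begin{bmatrix}0\\1\\0\end{bmatrix},\begin{bmatrix}0\\1\\1\end{bmatrix},\begin{bmatrix}1\\0\\0\end{bmatrix},\begin{bmatrix}1\\0\\1\end{bmatrix},\begin{bmatrix}1\\1\\0\end{bmatrix},\begin{bmatrix}1\\1\\1\end{bmatrix}\right\},$$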

but, depending on the actual parameters, a smaller $\Theta_{2}$ is possible. Each configuration is associated with an affine linear map via (10), to which $\mathcal{M}_{3}$ is equal when restricted to the corresponding polytope. The non-linear operator $\mathcal{M}_{3}$ is piecewise affine linear, comprised of the (up to) $7$ affine linear maps $\mathcal{M}_{3}^{\theta}$, $\theta\in\Theta_{2}$. Note that if the bias vectors $\mathbf{b}_{1}$ and $\mathbf{b}_{2}$ are zero, then $M_{1}=\mathbf{W}_{1}$ and $M_{2}=\mathbf{W}_{2}$, and thus the regions are convex cones arising from intersections of halfspaces whose bounding hyperplanes pass through the origin.
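The partition in this example can be explored by sampling. The sketch below uses one hypothetical choice of parameters, $\mathbf{W}_{1}=\mathbf{I}$, $\mathbf{b}_{1}=\mathbf{0}$, $\mathbf{W}_{2}=[1,1]$ and $\mathbf{b}_{2}=-1$, selected only so that all seven configurations occur; the negative second-layer bias is an assumption that keeps the second rectifier inactive wherever both first-layer units are inactive. The grid and its resolution are likewise arbitrary.

```python
import numpy as np

# Hypothetical parameters for a network with N_0 = 2, N_1 = 2, N_2 = 1.
W1, b1 = np.eye(2), np.zeros(2)
W2, b2 = np.array([[1.0, 1.0]]), np.array([-1.0])

def configuration(x):
    """Return the concatenated pattern (theta_1, theta_2) of x as a tuple of 0/1."""
    y1 = W1 @ x + b1
    theta1 = (y1 > 0).astype(int)
    y2 = W2 @ (theta1 * y1) + b2
    theta2 = (y2 > 0).astype(int)
    return tuple(theta1.tolist() + theta2.tolist())

# Sample the square [-2, 2]^2 on a regular grid and collect the patterns seen.
grid = np.linspace(-2.0, 2.0, 41)
configs = {configuration(np.array([s, t])) for s in grid for t in grid}
num_regions = len(configs)
```

For these parameters the pattern $(0,0,1)$ never occurs, since the second-layer input is constantly $-1$ on the region where both first-layer units are inactive. With other parameters fewer patterns may appear, in line with the remark that a smaller $\Theta_{2}$ is possible.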

We record in the following result how the consecutive layers of an $L$-layer ReLU representation $\mathcal{M}_{L}$ define increasingly finer partitions of the input space.² Restricted to each polytope of the final partition, $\mathcal{M}_{L}$ is equal to an affine linear operator specified by the diagonal configuration corresponding to that region.

² To be precise (see the comment on (9)), here partition has to be understood in the sense that the interiors of the participating sets have empty intersection.

Proposition 3.

Let $\mathcal{M}_{L}$ be a ReLU representation, $k\in\{1,\ldots,L-1\}$ and $\Theta_{k}$ the set of configurations of $\mathcal{M}_{k+1}$.

(i)

If $\theta=[\theta_{1}^{\top},\ldots,\theta_{k}^{\top}]^{\top}\in\Theta_{k}$ , then on $R^{\theta}$ the representation $\mathcal{M}_{k+1}$ coincides with

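$$\mathcal{M}_{k+1}^{\theta}=M_{k+1}\operatorname{diag}(\theta_{k})\cdots M_{2}\operatorname{diag}(\theta_{1})M_{1}.$$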

(ii)

Define $R^{\Theta_{k}}=\{R^{\theta}\colon\theta\in\Theta_{k}\}$ . Then $R^{\Theta_{k}}$ is a partition of $\mathcal{X}$ and $R^{\Theta_{k+1}}$ is a refinement of $R^{\Theta_{k}}$ .

Proof.

The first claim follows from the construction by induction on the layers, and so does the second. Indeed, $R^{\Theta_{1}}$ is a partition of $\mathcal{X}$. Suppose $R^{\Theta_{k}}$ partitions $\mathcal{X}$, let $\theta\in\Theta_{k}$ and denote by $R^{\theta}$ the domain of the affine map $\mathcal{M}_{k+1}^{\theta}$. The regions of the configurations induced by $\mathcal{M}_{k+1}^{\theta}$ together partition $R^{\theta}$. Since $\Theta_{k+1}$ is defined as the union of the configurations induced by all $\mathcal{M}_{k+1}^{\theta}$ with $\theta\in\Theta_{k}$, the collection $R^{\Theta_{k+1}}$ is a refinement of $R^{\Theta_{k}}$. ∎

In our terminology, Proposition 3 reads as the following qualitative result, well known in the literature, e.g. [4], and further illustrated in Figure 2.

Corollary 4.

(i) Every ReLU representation $\mathcal{M}_{L}$ is a piecewise affine linear operator with respect to a partition of the input space $\mathcal{X}$ into convex polytopes (on each of which $\mathcal{M}_{L}$ is affine linear). The number of polytopes is equal to the number of diagonal configurations of $\mathcal{M}_{L}$.

(ii) If the biases of all layers of $\mathcal{M}_{L}$ vanish, then $\mathcal{M}_{L}$ is piecewise linear with respect to a partition of $\mathcal{X}$ into convex cones.

3.2 Affine linear maps

We now give a precise characterization, in terms of an atomic decomposition, of the affine transform induced by a configuration. Here we refer to a rank one matrix (the outer product of two vectors) as an atom. We show that the atoms whose linear combination forms the linear part of the affine transform induced by the configuration $\theta$ are exclusively determined by the Kronecker product of $\operatorname{diag}(\theta_{1})\mathbf{W}_{1}$ and $\mathbf{W}_{L}\operatorname{diag}(\theta_{L-1})$. Thus, increasing the number of rows of $\mathbf{W}_{1}$ and the number of columns of $\mathbf{W}_{L}$ (i.e., the widths of layers $1$ and $L-1$) increases the number of atoms available for expressing the affine transform pieces of $\mathcal{M}_{L}$. The coefficients in the linear combination of the atoms are sums of weight products over paths between those layers. Each path is obtained by taking one entry from one nonvanishing column of $\mathbf{W}_{j}\operatorname{diag}(\theta_{j-1})$ for $j=2,\ldots,L-1$. Increasing the widths and the number of intermediate layers increases, in different ways, the number of paths contributing to a coefficient.

Theorem 5.

Let $\theta$ be a configuration of an $L$ -layer ReLU representation $\mathcal{M}_{L}$ . Then the linear part of the affine linear transform $\mathcal{M}_{L}^{\theta}$ induced by $\theta$ is a linear combination of atoms of the form $\{\mathbf{w}_{L,:i_{L-1}}\otimes\mathbf{w}_{1,i_{1}:}\}_{i_{L-1}\in\operatorname{spt}\theta_{L-1},i_{1}\in\operatorname{spt}\theta_{1}}$ . Specifically:

(i)

For $L=2$ the linear part of $\mathcal{M}_{L}^{\theta}$ is the sum of $|\operatorname{spt}\theta_{1}|$ atoms.

(ii)

For $L=3$ the linear part of $\mathcal{M}_{L}^{\theta}$ is a linear combination of $|\operatorname{spt}\theta_{1}||\operatorname{spt}\theta_{2}|$ atoms and $w_{2,i_{2}i_{1}}$ is the coefficient for atom $\mathbf{w}_{3,:i_{2}}\otimes\mathbf{w}_{1,i_{1}:}$ .

(iii)

For $L>3$ the linear part of $\mathcal{M}_{L}^{\theta}$ is the linear combination

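$$\sum_{i_{L-1}\in\operatorname{spt}\theta_{L-1}}\ \sum_{i_{1}\in\operatorname{spt}\theta_{1}}c_{i_{L-1},i_{1}}\,\mathbf{w}_{L,:i_{L-1}}\otimes\mathbf{w}_{1,i_{1}:}\tag{11}$$

of at most $|\operatorname{spt}\theta_{1}||\operatorname{spt}\theta_{L-1}|$ atoms, with coefficients

$$c_{i_{L-1},i_{1}}=\sum_{i_{2}\in\operatorname{spt}\theta_{2}}\cdots\sum_{i_{L-2}\in\operatorname{spt}\theta_{L-2}}w_{L-1,i_{L-1}i_{L-2}}\cdots w_{2,i_{2}i_{1}},$$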

each of which is the sum of products consisting of at most one weight from each layer along $\prod_{j=2}^{L-2}|\operatorname{spt}\theta_{j}|$ possible paths.

Proof.

(i) The linear part of $\mathcal{M}_{2}^{\theta}$ is the sum of $|\operatorname{spt}\theta_{1}|$ atoms, namely

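$$\mathbf{W}_{2}\operatorname{diag}(\theta_{1})\mathbf{W}_{1}=\sum_{i_{1}\in\operatorname{spt}\theta_{1}}\mathbf{w}_{2,:i_{1}}\otimes\mathbf{w}_{1,i_{1}:}.$$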

(ii) The $i_{2}$ -th row of $\mathbf{W}_{2}\operatorname{diag}(\theta_{1})\mathbf{W}_{1}$ is therefore $\sum_{i_{1}\in\operatorname{spt}\theta_{1}}w_{2,i_{2}i_{1}}\mathbf{w}_{1,i_{1}:}$ and thus the linear part of $\mathcal{M}_{3}^{\theta}$ is

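$$\mathbf{W}_{3}\operatorname{diag}(\theta_{2})\mathbf{W}_{2}\operatorname{diag}(\theta_{1})\mathbf{W}_{1}=\sum_{i_{2}\in\operatorname{spt}\theta_{2}}\ \sum_{i_{1}\in\operatorname{spt}\theta_{1}}w_{2,i_{2}i_{1}}\,\mathbf{w}_{3,:i_{2}}\otimes\mathbf{w}_{1,i_{1}:}.$$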

(iii) The $i_{3}$ -th row of $\mathbf{W}_{3}\operatorname{diag}(\theta_{2})\mathbf{W}_{2}\operatorname{diag}(\theta_{1})\mathbf{W}_{1}$ is $\sum_{i_{1}\in\operatorname{spt}\theta_{1},i_{2}\in\operatorname{spt}\theta_{2}}w_{2,i_{2}i_{1}}w_{3,i_{3}i_{2}}\mathbf{w}_{1,i_{1}:}$ and thus the linear part of $\mathcal{M}_{4}^{\theta}$ is

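$$\mathbf{W}_{4}\operatorname{diag}(\theta_{3})\mathbf{W}_{3}\operatorname{diag}(\theta_{2})\mathbf{W}_{2}\operatorname{diag}(\theta_{1})\mathbf{W}_{1}=\sum_{i_{3}\in\operatorname{spt}\theta_{3}}\ \sum_{i_{1}\in\operatorname{spt}\theta_{1}}\Big(\sum_{i_{2}\in\operatorname{spt}\theta_{2}}w_{3,i_{3}i_{2}}w_{2,i_{2}i_{1}}\Big)\,\mathbf{w}_{4,:i_{3}}\otimes\mathbf{w}_{1,i_{1}:}.$$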

Successively continuing, the linear part of $\mathcal{M}_{L}^{\theta}$ is (11). ∎
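The case $L=4$ of the proof can be verified numerically. The sketch below is illustrative only: the layer widths, the random weights, and the fixed configuration `theta1`, `theta2`, `theta3` are hypothetical, and the configuration is chosen by hand rather than induced by an input. It compares the matrix product $\mathbf{W}_{4}\operatorname{diag}(\theta_{3})\mathbf{W}_{3}\operatorname{diag}(\theta_{2})\mathbf{W}_{2}\operatorname{diag}(\theta_{1})\mathbf{W}_{1}$ with the sum of atoms whose coefficients are sums of weight products over the active paths through layer $2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1, n2, n3, n4 = 3, 4, 5, 4, 2          # hypothetical layer widths
W1 = rng.standard_normal((n1, n0))
W2 = rng.standard_normal((n2, n1))
W3 = rng.standard_normal((n3, n2))
W4 = rng.standard_normal((n4, n3))

# A hand-picked configuration (0/1 activation pattern per hidden layer).
theta1 = np.array([1, 0, 1, 1])
theta2 = np.array([0, 1, 1, 0, 1])
theta3 = np.array([1, 1, 0, 1])

# Linear part of the affine map as a plain matrix product.
linear_part = (W4 @ np.diag(theta3) @ W3 @ np.diag(theta2)
               @ W2 @ np.diag(theta1) @ W1)

# Supports of the configuration: indices of activated rectifiers.
spt1, spt2, spt3 = (np.flatnonzero(t) for t in (theta1, theta2, theta3))

# Same matrix as a linear combination of rank-one atoms.
atomic_sum = np.zeros((n4, n0))
for i3 in spt3:
    for i1 in spt1:
        # Coefficient: sum of weight products over all active paths i1 -> i2 -> i3.
        coeff = sum(W3[i3, i2] * W2[i2, i1] for i2 in spt2)
        # Atom: column i3 of W4 times row i1 of W1.
        atomic_sum += coeff * np.outer(W4[:, i3], W1[i1, :])

num_atoms = len(spt1) * len(spt3)            # |spt theta_1| * |spt theta_3|
paths_per_coefficient = len(spt2)            # |spt theta_2| for L = 4
decomposition_matches = np.allclose(linear_part, atomic_sum)
```

Only the first and last weight matrices enter the atoms; the intermediate layers contribute solely through the scalar coefficients.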

On the polytope $R^{\theta}$ we therefore obtain the following expression for $\mathcal{M}_{L}$ :

[TABLE]

where

[TABLE]

and where the coefficients $\alpha_{L,i_{L-1}}(\mathbf{x})$ associated with the column $\mathbf{w}_{L,:i_{L-1}}$ of $\mathbf{W}_{L}\operatorname{diag}(\theta_{L-1})$ are

[TABLE]

in the case of $L=2$ layers;

[TABLE]

in the case of $L=3$ layers; and

[TABLE]

in the general case of $L>3$ layers. In particular, the affine transform $\mathcal{M}_{L}^{\theta}$ maps its domain $R^{\theta}$ into

[TABLE]

As an immediate application we estimate a Lipschitz bound for $\mathcal{M}_{L}^{\theta}$, which can be interpreted as a measure of the gain from local input perturbations to the resulting output perturbations of $\mathcal{M}_{L}$ on $R^{\theta}$. The bound depends on the number of activated rectifiers. Given the atomic representation, it may be beneficial to normalize the columns of $\mathbf{W}_{L}$ and the rows of $\mathbf{W}_{1}$, depending, e.g., on whether the model is used for signal analysis or synthesis. The following result can easily be modified for the case without this normalization assumption.

Theorem 6.

Let $\theta$ be a configuration of an $L$ -layer ReLU representation $\mathcal{M}_{L}$ with $L>3$ , and let $\mathbf{x}_{1},\mathbf{x}_{2}\in R^{\theta}$ .

(i)

Suppose that $\mathbf{W}_{L}$ has normalized columns, that $\mathbf{W}_{1}$ has normalized rows, and let $C$ be the maximum of the absolute value of all weights in $\mathbf{W}_{2},\ldots,\mathbf{W}_{L-1}$ . Then

[TABLE]

If $N=\max\{N_{1},\ldots,N_{L-1}\}$ then $(CN)^{L-2}N$ is a global Lipschitz bound for $\mathcal{M}_{L}$ .

(ii)

If $\sigma$ is the maximum of the spectral norms of the weight matrices, then

[TABLE]

Proof.

Under the assumptions of (i), for $\mathbf{x}_{1},\mathbf{x}_{2}\in R^{\theta}$ we get

[TABLE]

For the global Lipschitz estimate note that, since $|\operatorname{spt}\theta_{k}|\leq N_{k}$ , we have

[TABLE]

Part (ii) follows directly from (12). ∎

It is clear that the global Lipschitz bound $(CN)^{L-2}N$ derived for $\mathcal{M}_{L}$ via the above crude estimate from its affine linear pieces is far from optimal. As such, Theorem 6 can be regarded as a refinement of a similar global Lipschitz bound derived in [21]. Because increasing the number $L$ of layers of the representation refines the partitioning of the input space, the Lipschitz bound for $\mathcal{M}_{L}$ should be a non-increasing function of $L$ in order to maintain stability; otherwise a tiny part of the input space could cause instability of the representation. We are thus particularly interested in deriving a sufficient condition for the Lipschitz bound not to be an increasing function of $L$. Achieving this for the bound in (i) requires $C\leq 1/N$, i.e., a stable representation regardless of the number of layers requires the mean and variance of the weight coefficients to be very small for large $N$. This might be related to the batch normalization technique in learning DNNs [5, 24]. On the other hand, the sufficient condition that the spectral norms of the weight matrices not exceed $1$ can be achieved via optimization techniques by constraining the Frobenius norms of the weight matrices not to exceed $1$, since the spectral norm of a matrix is bounded by its Frobenius norm.
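The role of the spectral norms can be illustrated with a short numerical sketch. Everything in it is an assumption made for illustration: the widths, the random weights, the randomly drawn configuration, and in particular the helper `clip_frobenius`, which simply rescales a matrix so that its Frobenius norm is at most $1$ (one possible way of imposing the constraint, not a method taken from the text above). The check relies only on submultiplicativity of the spectral norm and on $\|\operatorname{diag}(\theta_{k})\|\leq 1$.

```python
import numpy as np

def spectral_norm(W):
    """Largest singular value of W."""
    return np.linalg.norm(W, 2)

def clip_frobenius(W, max_norm=1.0):
    """Rescale W so that its Frobenius norm does not exceed max_norm (hypothetical helper)."""
    fro = np.linalg.norm(W, "fro")
    return W if fro <= max_norm else W * (max_norm / fro)

rng = np.random.default_rng(1)
widths = [6, 8, 8, 8, 4]                      # hypothetical N_0, ..., N_4 (L = 4 layers)
weights = [rng.standard_normal((m, n)) for n, m in zip(widths[:-1], widths[1:])]
clipped = [clip_frobenius(W) for W in weights]

# An arbitrary 0/1 configuration for the hidden layers.
config = [rng.integers(0, 2, size=W.shape[0]) for W in clipped[:-1]]

# Linear part of the affine piece: W_L diag(theta_{L-1}) ... diag(theta_1) W_1.
A = clipped[0]
for theta, W in zip(config, clipped[1:]):
    A = W @ (theta[:, None] * A)

sigma = max(spectral_norm(W) for W in clipped)
local_lipschitz = spectral_norm(A)            # Lipschitz constant of this affine piece
depth_bound = sigma ** len(clipped)           # sigma^L
```

Since `sigma` is at most $1$ after clipping, `depth_bound` cannot grow with the number of layers, whichever configuration is drawn.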

With regard to Theorem 6(i), we remark on two observations further suggesting that asymptotic stability of the Lipschitz bound for $\mathcal{M}_{L}$ for a large number of layers plays a role in the learning process of function approximation via deep feedforward neural networks. The back-propagation algorithm, designed to carry out the learning task, is based on (sub)gradient descent in the landscape of a loss function $\mathcal{L}$ in the network parameter space. It is believed that in the course of training, both the maximum magnitude component and the smoothness of the gradient of $\mathcal{L}$ affect the learning performance.

We first consider the maximum magnitude component of the gradient. Let $\mathcal{L}^{\theta}$ denote the restriction of the loss function to $R^{\theta}$ and for $\mathbf{x}\in R^{\theta}$ denote $\mathbf{y}_{k}=\mathcal{M}_{k}^{\theta}(\mathbf{x})$ and $\mathbf{a}_{k}=\operatorname{diag}(\theta_{k})\mathbf{y}_{k}$ . Then

[TABLE]

where $\Sigma^{\theta_{l}}(\mathbf{y}_{l})$, for $l=k,\ldots,L-1$, is a diagonal matrix with entries $0$ or $1$, corresponding to the value of the directional derivative of the rectifier function, which is $\partial{a_{l,i}}/\partial y_{l,i}=1$ if $y_{l,i}\geq 0$ and $0$ otherwise. Note that this value is $1$ at $y_{l,i}=0$, since the subdifferential of the one-dimensional rectifier at $y_{l,i}=0$ is the interval $[0,1]$ and the directional derivative of a one-dimensional convex function is the maximum of the subdifferential. Using the estimate $\|B\|\leq\sqrt{mn}\|B\|_{\max}$ for $B\in\mathbb{R}^{m\times n}$ on the weight matrices, the maximum magnitude entry of $\partial\mathcal{L}^{\theta}/\partial\mathbf{y}_{k}$ is bounded by

[TABLE]

If $NC>1$ , then $(NC)^{L-k}$ increases when $k$ decreases. This implies that the maximum magnitude entries of the gradient at early layers can potentially have larger variations, which would hamper the learning performance.
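The two ingredients of this argument, the estimate $\|B\|\leq\sqrt{mn}\|B\|_{\max}$ and the growth of the factor $(NC)^{L-k}$ toward early layers, can be illustrated with a few lines of arithmetic. This is a sketch only: the helper `depth_factor` and the values of $N$, $C$, and $L$ are hypothetical.

```python
import numpy as np

def depth_factor(N, C, L):
    """Return the factors (N*C)**(L-k) for k = 1, ..., L-1 (early layers first)."""
    return [(N * C) ** (L - k) for k in range(1, L)]

# Check the spectral-norm estimate on a random matrix with bounded entries.
rng = np.random.default_rng(2)
B = rng.uniform(-1.0, 1.0, size=(5, 7))
spectral = np.linalg.norm(B, 2)                   # spectral norm of B
entrywise_bound = np.sqrt(B.size) * np.abs(B).max()  # sqrt(m*n) * max |B_ij|

# N*C > 1: the factor is largest at the earliest layer (k = 1).
growing = depth_factor(N=10, C=0.2, L=6)
# N*C = 1, i.e. C = 1/N: the factor stays at 1 for every layer.
stable = depth_factor(N=10, C=0.1, L=6)
```

With $C=1/N$ the factor is independent of both $k$ and $L$, which matches the sufficient condition $C\leq 1/N$ discussed for Theorem 6(i).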

Our second consideration concerns the smoothness of the gradient of the loss function. Globally this gradient is notoriously nonsmooth and thus again we restrict to the individual polytope regions $R^{\theta}$. Assume that on such a region the gradient of the loss function is $\beta_{\theta}$-smooth, i.e., suppose that for all $\mathbf{y}_{k}=\mathcal{M}_{k}^{\theta}(\mathbf{x})$ and $\mathbf{y}^{\prime}_{k}=\mathcal{M}_{k}^{\theta}(\mathbf{x}^{\prime})$, where $\mathbf{x},\mathbf{x}^{\prime}\in R^{\theta}$, the estimate

[TABLE]

holds. For any layer $k$ , (13) implies

[TABLE]

Similar to (3.2), a condition like $C\leq 1/N$ is needed here to prevent the Lipschitz parameters from blowing up at early layers during learning.

We hope that having precise expressions such as (11) for deep representations can help in developing new regularization techniques and in better understanding existing ones, such as batch normalization, dropout [6, 25, 26], or deep residual learning [7].

We finally would like to make one more signal processing related remark. The smoothness of a loss function not only relates to the stability in learning a deep representation, it also relates to deriving local minimizers of $\mathcal{L}$ over the input space $\mathcal{X}$ using gradient descent. Following (15), we have

[TABLE]

Since

[TABLE]

this implies

[TABLE]

i.e., here again $C\leq 1/N$ is sufficient to stabilize the smoothness of the gradient for large $L$ .

Bibliography

[1] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, 1989.
[2] K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, vol. 4, no. 2, pp. 251–257, 1991.
[3] R. Arora, A. Basu, P. Mianjy, and A. Mukherjee, “Understanding deep neural networks with rectified linear units,” in International Conference on Learning Representations (ICLR), 2018.
[4] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio, “On the number of linear regions of deep neural networks,” in Advances in Neural Information Processing Systems 27, pp. 2924–2932, 2014.
[5] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 448–456, 2015.
[6] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” CoRR, 2012.
[7] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[8] E. J. Candès, J. K. Romberg, and T. Tao, “Stable signal recovery from incomplete and inaccurate measurements,” Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.
