Lipschitz Certificates for Layered Network Structures Driven by Averaged Activation Operators

Patrick L. Combettes; Jean-Christophe Pesquet

arXiv:1903.01014 · math.OC · June 23, 2020 · SIAM J. Math. Data Sci.


TL;DR

This paper develops a method to compute tight Lipschitz constants for layered neural networks using averaged operators, improving robustness assessment by capturing layer interactions more accurately.

Contribution

It introduces a novel framework for deriving sharp Lipschitz bounds for layered networks with averaged operators, surpassing traditional product-based estimates.

Findings

01. Tighter Lipschitz constants than traditional bounds.

02. Applicable to standard convolutional neural networks.

03. Enhanced robustness evaluation for neural network models.

Abstract

Obtaining sharp Lipschitz constants for feed-forward neural networks is essential to assess their robustness in the face of perturbations of their inputs. We derive such constants in the context of a general layered network model involving compositions of nonexpansive averaged operators and affine operators. By exploiting this architecture, our analysis finely captures the interactions between the layers, yielding tighter Lipschitz constants than those resulting from the product of individual bounds for groups of layers. The proposed framework is shown to cover in particular many practical instances encountered in feed-forward neural networks. Our Lipschitz constant estimates are further improved in the case of structures employing scalar nonlinear functions, which include standard convolutional networks as special cases.

Equations

\|T(x+z)-Tx\|\leqslant\theta\|z\|.

R=(1-\alpha)\operatorname{Id}+\alpha Q.

T=T_{m}\circ\cdots\circ T_{1},\quad\text{where}\quad(\forall i\in\{1,\ldots,m\})\quad T_{i}\colon{\mathcal{H}}_{i-1}\to{\mathcal{H}}_{i}\colon x\mapsto R_{i}(W_{i}x+b_{i}).

\theta_{m}=\prod_{i=1}^{m}\|W_{i}\|.

\big(\forall x=(\xi_{k})_{1\leqslant k\leqslant N_{i}}\in\mathbb{R}^{N_{i}}\big)\quad R_{i}x=\big(\rho(\xi_{k})\big)_{1\leqslant k\leqslant N_{i}},\quad\text{where}\quad\rho\colon\xi\mapsto\max\{0,\xi\}.

\theta_{3}=\frac{1}{4}\big{(}\|W_{3}W_{2}W_{1}\|+\|W_{3}W_{2}\|\,\|W_{1}\|+\|W_{3}\|\,\|W_{2}W_{1}\|+\|W_{3}\|\,\|W_{2}\|\,\|W_{1}\|\big{)}

\vartheta_{3}=\sup_{\Lambda_{1}\in\mathscr{D}_{\{0,1\}}^{(N_{1})},\,\Lambda_{2}\in\mathscr{D}_{\{0,1\}}^{(N_{2})}}\|W_{3}\Lambda_{2}W_{2}\Lambda_{1}W_{1}\|,

\|W_{3}W_{2}W_{1}\|\leqslant\vartheta_{3}\leqslant\theta_{3}\leqslant\|W_{3}\|\,\|W_{2}\|\,\|W_{1}\|.

(\forall(x,u)\in\text{\rm gra}\,A)(\forall(y,v)\in\text{\rm gra}\,A)\quad{\left\langle{{x-y}\mid{u-v}}\right\rangle}\geqslant 0,

\Gamma_{0}({\mathcal{H}})\ni f^{*}\colon u\mapsto\sup_{x\in{\mathcal{H}}}\big({\left\langle{{x}\mid{u}}\right\rangle}-f(x)\big)

\partial f\colon{\mathcal{H}}\to 2^{{\mathcal{H}}}\colon x\mapsto\big{\{}{u\in{\mathcal{H}}}~{}\big{|}~{}{(\forall y\in{\mathcal{H}})\;\>{\left\langle{{y-x}\mid{u}}\right\rangle}+f(x)\leqslant f(y)}\big{\}}.

R=\operatorname{Id}+\lambda(\text{\rm prox}_{\phi}-\operatorname{Id}),\quad\text{where}\quad\phi\in\Gamma_{0}(\mathbb{R})\quad\text{and}\quad\lambda\in[0,2].

(\forall x\in\mathbb{R})\quad R(x)=\text{\rm prox}_{\iota_{[0,\beta]}}(x)=\min\{\max\{x,0\},\beta\},

(\forall x\in\mathbb{R})\quad R(x)=\begin{cases}x,&\text{if}\;\;x\geqslant 0;\\ \beta\big{(}\exp(x)-1\big{)},&\text{if}\;\;x<0.\end{cases}

(\forall x\in\mathbb{R})\quad\phi(x)=\begin{cases}0&\text{if}\;\;x\geqslant 0;\\ (x+\beta)\ln\bigg{(}\dfrac{x+\beta}{\beta}\bigg{)}-x-\dfrac{x^{2}}{2},&\text{if}\;\;-\beta<x<0;\\ \beta-\dfrac{\beta^{2}}{2},&\text{if}\;\;x=-\beta;\\ {+\infty},&\text{if}\;\;x<-\beta.\end{cases}

(\forall x\in\mathbb{R})\quad R(x)=\dfrac{\mu\,\text{\rm sign}(x)\,x^{2}}{1+x^{2}},\quad\text{where}\quad\mu=\dfrac{8}{3\sqrt{3}},

(\forall x\in\mathbb{R})\quad\psi^{*}(x)=\begin{cases}\arctan\sqrt{\dfrac{|x|}{1-|x|}}-\sqrt{|x|(1-|x|)},&\text{if}\;\;|x|<1;\\ \dfrac{\pi}{2},&\text{if}\;\;|x|=1;\\ {+\infty},&\text{otherwise.}\end{cases}

\phi=\mu\psi^{*}\bigg{(}\dfrac{\cdot}{\mu}\bigg{)}-\dfrac{|\cdot|^{2}}{2}\colon x\mapsto\begin{cases}\mu\arctan\sqrt{\dfrac{|x|}{\mu-|x|}}-\sqrt{|x|(\mu-|x|)}-\dfrac{x^{2}}{2},&\text{if}\;\;|x|<\mu;\\ \dfrac{\mu(\pi-\mu)}{2},&\text{if}\;\;|x|=\mu;\\ {+\infty},&\text{otherwise.}\end{cases}

(\forall x\in\mathbb{R})\quad R(x)=\text{\rm proj}_{[0,1]}|x|=\begin{cases}|x|,&\text{if}\;\;|x|<1;\\ 1,&\text{otherwise,}\end{cases}

(\forall x\in\mathbb{R})\quad R(x)=\dfrac{10x}{11(1+\exp(-x))},

(\forall x\in\mathbb{R})\quad R(x)=\dfrac{10}{11}\times\begin{cases}\dfrac{x}{1+\exp(-x)},&\text{if}\;\;x\geqslant 0;\\ \dfrac{\exp(x)-1}{1+\exp(-x)},&\text{if}\;\;x<0,\end{cases}

(\forall x\in{\mathcal{H}})\quad Rx=\begin{cases}(1-\lambda)x+\dfrac{\lambda\,\text{\rm prox}_{\phi}d_{C}(x)}{d_{C}(x)}(x-\text{\rm proj}_{C}x),&\text{if}\;\;x\notin C;\\ (1-\lambda)x,&\text{if}\;\;x\in C.\end{cases}

R\colon x\mapsto\dfrac{\mu\|x\|}{1+\|x\|^{2}}x

R\colon\mathbb{R}^{N}\to\mathbb{R}^{N}\colon(\xi_{k})_{1\leqslant k\leqslant N}\mapsto\omega\big{(}\xi_{k}^{\uparrow}\big{)}_{1\leqslant k\leqslant N}+(1-\omega)\text{\rm proj}_{C}(\xi_{k})_{1\leqslant k\leqslant N},

R\colon\mathbb{R}^{N}\to\mathbb{R}^{N}\colon(\xi_{k})_{1\leqslant k\leqslant N}\mapsto\Bigg{(}\omega\xi_{k}^{\uparrow}+\frac{1-\omega}{N}\sum_{j=1}^{N}\xi_{j}\Bigg{)}_{1\leqslant k\leqslant N}.

R\colon\mathbb{R}^{N-1}\to\mathbb{R}^{N-1}\colon(\xi_{k})_{1\leqslant k\leqslant N-1}\mapsto US\Big{(}[\tau_{1}\xi_{1},\ldots,\tau_{N-1}\xi_{N-1},\theta]^{\top}\Big{)},

\mathbb{J}_{m,k}=\begin{cases}\big{\{}{(j_{1},\ldots,j_{k})\in\mathbb{N}^{k}}~{}\big{|}~{}{1\leqslant j_{1}<\cdots<j_{k}\leqslant m-1}\big{\}},&\text{if}\;\;k>1;\\ \{1,\ldots,m-1\},&\text{if}\;\;k=1\end{cases}

\sigma_{m;\{j_{1},\ldots,j_{k}\}}=\|W_{m}\circ\cdots\circ W_{j_{k}+1}\|\,\|W_{j_{k}}\circ\cdots\circ W_{j_{k-1}+1}\|\cdots\|W_{j_{1}}\circ\cdots\circ W_{1}\|.

(\forall\,\mathbb{J}\subset\{1,\ldots,m-1\})\quad\beta_{m;\mathbb{J}}=\Bigg{(}\prod_{j\in\mathbb{J}}\alpha_{j}\Bigg{)}\prod_{j\in\{1,\ldots,m-1\}\smallsetminus\mathbb{J}}(1-\alpha_{j})

\theta_{m}=\beta_{m;\varnothing}\|W_{m}\circ\cdots\circ W_{1}\|+\sum_{k=1}^{m-1}\sum_{(j_{1},\ldots,j_{k})\in\mathbb{J}_{m,k}}\beta_{m;\{j_{1},\ldots,j_{k}\}}\,\sigma_{m;\{j_{1},\ldots,j_{k}\}}.

Taxonomy

Topics: Neural Networks and Applications · Machine Learning and ELM · Stochastic Gradient Optimization Techniques

Full text

Lipschitz Certificates for Layered Network Structures Driven by Averaged Activation Operators

Contact author: P. L. Combettes, [email protected], phone: +1 919 515 2671. The work of P. L. Combettes was supported by the National Science Foundation under grant CCF-1715671. The work of J.-C. Pesquet was supported by Institut Universitaire de France.

Patrick L. Combettes$^{1}$ and Jean-Christophe Pesquet$^{2}$

$^{1}$ North Carolina State University, Department of Mathematics, Raleigh, NC 27695-8205, USA

[email protected]

$^{2}$ CentraleSupélec, Inria, Université Paris-Saclay, Center for Visual Computing, 91190 Gif sur Yvette, France

[email protected]

1 Introduction

Artificial neural networks are becoming increasingly central tools in tasks such as learning, modeling, data processing, and decision making. As first noted in [52], neural networks are vulnerable to adversarial examples which, though close to other data inputs, lead to very different outputs. This potential lack of stability makes the networks vulnerable and unreliable in key application areas; see, for instance, [1, 30, 35] and the references therein. To protect networks against such instabilities various techniques have been explored [39, 43, 44, 54]. Although these defense strategies may be effective in certain scenarios, they do not provide formal guarantees of robustness for general networks and they have been shown to be breakable by new attacks; see, for instance, [3, 18].

It has been acknowledged for some time that the Lipschitz behavior of a network plays a key role in the analysis of its robustness [52]. Simply put, if a layered network is modeled by an operator $T$ acting between normed spaces, with Lipschitz constant $\theta$ , given an input $x$ and a perturbation $z$ , we can majorize the perturbation on the output via the inequality

$$\|T(x+z)-Tx\|\leqslant\theta\|z\|.\tag{1.1}$$

Thus $\theta$ can be used as a certificate of robustness of the network provided that it is tightly estimated. Lipschitz regularity is also an important ingredient in the derivation of generalization bounds and approximation bounds [6, 11, 50], and of reachability conditions [47]. In [52] the estimation of $\theta$ is performed by evaluating the Lipschitz constant of the layers individually and then defining $\theta$ as the product of these constants, which typically yields pessimistic bounds. Lipschitz constants have also been computed for specific situations, e.g., [5, 33, 49, 53]. Overall, however, deriving analytically accurate constants for general contexts remains an open problem. The objective of the present paper is to address this question for a general class of layered networks. Mathematically, our network model is described as an alternation of affine and nonlinear operators. This type of structure also arises in variational and equilibrium problems, as well as in network science, e.g., [16, 24, 27, 56]. Adopting the same terminology as in the neural network literature, where they model the activity of neurons, the nonlinear operators will be called activation operators. Our stability analysis focuses on the following $m$ -layer model, in which the activation operators are averaged nonexpansive operators (see Fig. 1). Recall that an operator $R\colon{\mathcal{H}}\to{\mathcal{H}}$ acting on a Hilbert space ${\mathcal{H}}$ is $\alpha$ -averaged for some $\alpha\in[0,1]$ if there exists a nonexpansive (i.e., $1$ -Lipschitzian) operator $Q\colon{\mathcal{H}}\to{\mathcal{H}}$ such that

$$R=(1-\alpha)\operatorname{Id}+\alpha Q.\tag{1.2}$$

In other words, $R=\operatorname{Id}+\alpha(Q-\operatorname{Id})$ is an underrelaxation of a nonexpansive operator (see [8] for a detailed account). This class of operators was introduced in [4] and shown in [21] to model various problems in nonlinear analysis as it includes common operators such as projection operators, proximity operators, resolvents of monotone operators, reflection operators, gradient step operators, and various combinations thereof. Recent theoretical developments and applications to data science include [9, 10, 12, 13, 15, 22, 26, 34, 41, 51, 55, 56].
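As a quick numerical illustration of the definition of an averaged operator, the sketch below takes the ReLU function $R=\max\{0,\cdot\}$ with $\alpha=1/2$, forms $Q=\operatorname{Id}+\alpha^{-1}(R-\operatorname{Id})$, and checks on random pairs that $Q$ is nonexpansive (in this case $Q=|\cdot|$). The function names, the sampling interval, and the seed are illustrative assumptions, and a random test of this kind is a sanity check rather than a proof.

```python
import random

def relu(x):
    return max(0.0, x)

def relaxed_partner(R, alpha):
    """Return Q = Id + (1/alpha)(R - Id), so that R = (1 - alpha) Id + alpha Q."""
    return lambda x: x + (R(x) - x) / alpha

def max_secant_slope(Q, trials=10_000, lo=-5.0, hi=5.0, seed=0):
    """Largest observed |Q(x) - Q(y)| / |x - y| over random pairs (x, y)."""
    rng = random.Random(seed)  # arbitrary seed (assumption)
    worst = 0.0
    for _ in range(trials):
        x, y = rng.uniform(lo, hi), rng.uniform(lo, hi)
        if x != y:
            worst = max(worst, abs(Q(x) - Q(y)) / abs(x - y))
    return worst

Q = relaxed_partner(relu, alpha=0.5)   # for ReLU this is the absolute value
slope = max_secant_slope(Q)
print(f"largest observed slope of Q: {slope:.6f}")
```

An observed slope of at most one is consistent with ReLU being $1/2$-averaged.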

Model 1.1

Let $m\geqslant 1$ be an integer and let $({\mathcal{H}}_{i})_{0\leqslant i\leqslant m}$ be nonzero real Hilbert spaces. For every $i\in\{1,\ldots,m\}$ , let $W_{i}\colon{\mathcal{H}}_{i-1}\to{\mathcal{H}}_{i}$ be a bounded linear operator, let $b_{i}\in{\mathcal{H}}_{i}$ , let $\alpha_{i}\in [0,1]$ , and let $R_{i}\colon{\mathcal{H}}_{i}\to{\mathcal{H}}_{i}$ be an $\alpha_{i}$ -averaged operator. Set

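$$T=T_{m}\circ\cdots\circ T_{1},\quad\text{where}\quad(\forall i\in\{1,\ldots,m\})\quad T_{i}\colon{\mathcal{H}}_{i-1}\to{\mathcal{H}}_{i}\colon x\mapsto R_{i}(W_{i}x+b_{i}).\tag{1.3}$$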

Since the operators $(R_{i})_{1\leqslant i\leqslant m}$ are nonexpansive, a Lipschitz constant for $T$ in (1.3) is

$$\theta_{m}=\prod_{i=1}^{m}\|W_{i}\|.\tag{1.4}$$

However, as already mentioned, this constant is usually quite loose and of limited use to assess the actual stability of the network. A novelty of our approach is to take into account the averagedness properties of the individual activation operators to capture more sharply the overall interactions between the layers, yielding tighter constants than those provided by computing bounds for groups of layers. Our specific contributions are the following:

•

We show that the most common activation operators used in neural networks are averaged operators. This not only provides an a posteriori justification for Model 1.1, but also indicates that this highly structured framework should be of interest in the analysis of other properties of layered networks beyond stability.

•

We derive a general expression for a Lipschitz constant of $T$ in terms of the averagedness constants of the activation operators $(R_{i})_{1\leqslant i\leqslant m}$ and the norms of certain compositions of the linear operators $(W_{i})_{1\leqslant i\leqslant m}$. This Lipschitz constant is shown to lie between the simple upper bound (1.4) and the lower bound $\|W_{m}\circ\cdots\circ W_{1}\|$ corresponding to a purely linear network. Our analysis applies to any type of linear operator, including convolutive ones, and it requires no additional assumptions on the activation operators. In particular, differentiability is not assumed, so our results cover networks using the rectified linear unit (ReLU) and max-pooling operations.

•

In the common situation when the activation operators are separable, we obtain tighter Lipschitz constants for various norms.

•

Under some positivity condition, we prove that a Lipschitz constant of the network reduces to that of the associated purely linear network obtained by removing the nonlinear operators.

In [24], we investigated the special case of Model 1.1 in which the activation operators $(R_{i})_{1\leqslant i\leqslant m}$ are proximity operators, hence $1/2$ -averaged (see Section 3.1). The objective there was to study the asymptotic behavior of deep network structures rather than their stability.

The remainder of the paper is organized as follows. In Section 2 we present an illustration of our main result in a simple special case. In Section 3.1 we provide the necessary nonlinear analysis background. In Section 3.2 we show that a wide array of activation operators used in neural networks are indeed averaged operators. In Section 4 we derive general results concerning Lipschitz constants for Model 1.1. Section 5 refines this analysis in the case of separable activation operators.

2 Preview of the main results in a simple scenario

We illustrate on a simple instance the main results of the paper. More precisely, we consider a three-layer ( $m=3$ ) network where, for every $i\in\{0,1,2,3\}$ , ${\mathcal{H}}_{i}$ is the standard Euclidean space $\mathbb{R}^{N_{i}}$ . In this case, each linear operator $W_{i}$ is identified with a matrix in $\mathbb{R}^{N_{i}\times N_{i-1}}$ . To further simplify our setting, we assume that the operators $R_{1}$ , $R_{2}$ , and $R_{3}$ correspond to ReLU layers, that is, for each $i\in\{1,2,3\}$ ,

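$$\big(\forall x=(\xi_{k})_{1\leqslant k\leqslant N_{i}}\in\mathbb{R}^{N_{i}}\big)\quad R_{i}x=\big(\rho(\xi_{k})\big)_{1\leqslant k\leqslant N_{i}},\quad\text{where}\quad\rho\colon\xi\mapsto\max\{0,\xi\}.\tag{2.1}$$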

In view of (1.2), $\rho=(1/2)\operatorname{Id}+(1/2)|\cdot|$ is $1/2$ -averaged since $|\cdot|$ has Lipschitz constant 1. This implies that the operators $R_{1}$ , $R_{2}$ , and $R_{3}$ are also $1/2$ -averaged [24]. Let us now introduce two parameters which will play a central role in our analysis, namely,

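$$\theta_{3}=\frac{1}{4}\big(\|W_{3}W_{2}W_{1}\|+\|W_{3}W_{2}\|\,\|W_{1}\|+\|W_{3}\|\,\|W_{2}W_{1}\|+\|W_{3}\|\,\|W_{2}\|\,\|W_{1}\|\big)\tag{2.2}$$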

and

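$$\vartheta_{3}=\sup_{\Lambda_{1}\in\mathscr{D}_{\{0,1\}}^{(N_{1})},\,\Lambda_{2}\in\mathscr{D}_{\{0,1\}}^{(N_{2})}}\|W_{3}\Lambda_{2}W_{2}\Lambda_{1}W_{1}\|,\tag{2.3}$$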

where $\|\cdot\|$ is the spectral norm and, for each $i\in\{1,2\}$ , $\mathscr{D}_{\{0,1\}}^{(N_{i})}$ denotes the set of $N_{i}\times N_{i}$ diagonal matrices with entries in $\{0,1\}$ . In this context, our main result states that both $\theta_{3}$ and $\vartheta_{3}$ are Lipschitz constants of the network, and that

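$$\|W_{3}W_{2}W_{1}\|\leqslant\vartheta_{3}\leqslant\theta_{3}\leqslant\|W_{3}\|\,\|W_{2}\|\,\|W_{1}\|.\tag{2.4}$$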

In addition, if the entries of the matrices $(W_{i})_{1\leqslant i\leqslant 3}$ are in $\left[0,+\infty\right[$ , then a Lipschitz constant of the network is $\|W_{3}W_{2}W_{1}\|$ .
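The two constants $\theta_{3}$ and $\vartheta_{3}$ can be evaluated directly for small matrices. The following sketch is one possible way to do so with NumPy; the layer sizes, the seed, and the helper names `spec`, `theta3`, and `vartheta3` are assumptions made for illustration. The supremum defining $\vartheta_{3}$ is computed by brute-force enumeration of all $2^{N_{1}}\cdot 2^{N_{2}}$ pairs of diagonal 0/1 matrices, which is only practical for small layers.

```python
import itertools
import numpy as np

def spec(M):
    """Spectral norm (largest singular value) of a matrix."""
    return np.linalg.norm(M, 2)

def theta3(W1, W2, W3):
    """Average of the four norm products, for three 1/2-averaged layers."""
    return 0.25 * (spec(W3 @ W2 @ W1)
                   + spec(W3 @ W2) * spec(W1)
                   + spec(W3) * spec(W2 @ W1)
                   + spec(W3) * spec(W2) * spec(W1))

def vartheta3(W1, W2, W3):
    """Brute-force sup of ||W3 L2 W2 L1 W1|| over diagonal 0/1 matrices L1, L2."""
    n1, n2 = W1.shape[0], W2.shape[0]
    best = 0.0
    for d1 in itertools.product((0.0, 1.0), repeat=n1):
        inner = W2 @ (np.array(d1)[:, None] * W1)        # W2 L1 W1
        for d2 in itertools.product((0.0, 1.0), repeat=n2):
            best = max(best, spec(W3 @ (np.array(d2)[:, None] * inner)))
    return best

rng = np.random.default_rng(0)          # arbitrary seed (assumption)
W1 = rng.standard_normal((5, 4))        # hypothetical sizes N0=4, N1=5
W2 = rng.standard_normal((4, 5))        # N2=4
W3 = rng.standard_normal((3, 4))        # N3=3
lower = spec(W3 @ W2 @ W1)
upper = spec(W3) * spec(W2) * spec(W1)
t3, v3 = theta3(W1, W2, W3), vartheta3(W1, W2, W3)
print(f"{lower:.4f} <= {v3:.4f} <= {t3:.4f} <= {upper:.4f}")
```

The printed values should appear in nondecreasing order, in line with the chain of inequalities stated above.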

Example 2.1

To illustrate the improvement of the proposed bound over the classical product norm estimate, we consider a fully connected network with $N_{0}=8$ , $N_{1}=10$ , $N_{2}=6$ , and $N_{3}=3$ . The entries of the matrices $(W_{i})_{1\leqslant i\leqslant 3}$ are generated randomly and independently according to a normal distribution. We evaluate the Lipschitz constant estimate $\theta_{3}$ provided by (2.2) and the lower bound in (2.4). The average (resp. minimal) value of $\theta_{3}/(\|W_{1}\|\,\|W_{2}\|\,\|W_{3}\|)$ computed over 1000 realizations is approximately equal to $0.6699$ (resp. $0.5112$ ), while the average (resp. minimal) value of $\|W_{3}W_{2}W_{1}\|/(\|W_{1}\|\,\|W_{2}\|\,\|W_{3}\|)$ is approximately equal to $0.3747$ (resp. $0.1208$ ). In addition, the average (resp. minimal) value of $\vartheta_{3}/(\|W_{1}\|\,\|W_{2}\|\,\|W_{3}\|)$ computed over 1000 realizations is approximately equal to $0.4528$ (resp. $0.2424$ ). In agreement with (2.4), this estimation of the Lipschitz constant is better than $\theta_{3}$ and significantly sharper than $\|W_{1}\|\,\|W_{2}\|\,\|W_{3}\|$ .
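A minimal sketch of this kind of experiment is given below for the two ratios that are cheap to compute, namely $\theta_{3}$ and $\|W_{3}W_{2}W_{1}\|$ relative to the product bound. It assumes independent standard normal entries (zero mean, unit variance) and an arbitrary seed, neither of which is specified above, so the printed averages need not coincide with the reported figures. The ratio involving $\vartheta_{3}$ is omitted because it requires a supremum over $2^{16}$ pairs of diagonal matrices per realization.

```python
import numpy as np

def spec(M):
    """Spectral norm (largest singular value) of a matrix."""
    return np.linalg.norm(M, 2)

def ratios(W1, W2, W3):
    """Return theta_3 / product bound and ||W3 W2 W1|| / product bound."""
    n1, n2, n3 = spec(W1), spec(W2), spec(W3)
    linear = spec(W3 @ W2 @ W1)
    theta3 = 0.25 * (linear + spec(W3 @ W2) * n1 + n3 * spec(W2 @ W1) + n3 * n2 * n1)
    product = n1 * n2 * n3
    return theta3 / product, linear / product

rng = np.random.default_rng(0)        # arbitrary seed (assumption)
N0, N1, N2, N3 = 8, 10, 6, 3          # layer sizes of Example 2.1
samples = np.array([
    ratios(rng.standard_normal((N1, N0)),
           rng.standard_normal((N2, N1)),
           rng.standard_normal((N3, N2)))
    for _ in range(1000)
])
theta_ratio, linear_ratio = samples[:, 0], samples[:, 1]
print("theta_3 ratio: mean %.4f, min %.4f" % (theta_ratio.mean(), theta_ratio.min()))
print("linear  ratio: mean %.4f, min %.4f" % (linear_ratio.mean(), linear_ratio.min()))
```

By construction the first ratio always lies in $[1/4,1]$, since the product of the three norms is one of the four terms averaged in $\theta_{3}$.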

In the remainder of this paper, we show that the above results hold in a much more general context (for an arbitrary number of layers $m$ , arbitrary Hilbert spaces, and a wide class of activation operators), and that some of them can be extended to non-Euclidean norms. To establish these results, we need to introduce suitable mathematical tools in the next section.

3 Nonexpansive averaged activation operators

3.1 Nonlinear analysis tools and notation

We review some key facts and definitions which will be used subsequently; see [8] for further information. Throughout, ${\mathcal{H}}$ is a real Hilbert space with power set $2^{\mathcal{H}}$ , scalar product ${\left\langle{{\cdot}\mid{\cdot}}\right\rangle}$ , and associated norm $\|\cdot\|$ .

Let $R\colon{\mathcal{H}}\to{\mathcal{H}}$ be an operator and let $\alpha\in[0,1]$ . Then $R$ is nonexpansive if it is $1$ -Lipschitzian, $\alpha$ -averaged if there exists a nonexpansive operator $Q\colon{\mathcal{H}}\to{\mathcal{H}}$ such that $R=(1-\alpha)\operatorname{Id}+\alpha Q$ , and firmly nonexpansive if it is $1/2$ -averaged. Let $A\colon{\mathcal{H}}\to 2^{{\mathcal{H}}}$ be a set-valued operator. We denote by $\text{\rm gra}\,A=\big{\{}{(x,u)\in{\mathcal{H}}\times{\mathcal{H}}}~{}\big{|}~{}{u\in Ax}\big{\}}$ the graph of $A$ and by $A^{-1}$ the inverse of $A$ , i.e., the operator with graph $\big{\{}{(u,x)\in{\mathcal{H}}\times{\mathcal{H}}}~{}\big{|}~{}{u\in Ax}\big{\}}$ . In addition, $A$ is monotone if

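$$(\forall(x,u)\in\text{\rm gra}\,A)(\forall(y,v)\in\text{\rm gra}\,A)\quad{\left\langle{{x-y}\mid{u-v}}\right\rangle}\geqslant 0,\tag{3.1}$$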

and maximally monotone if there exists no monotone operator $B\colon{\mathcal{H}}\to 2^{{\mathcal{H}}}$ such that $\text{\rm gra}\,A\subset\text{\rm gra}\,B\neq\text{\rm gra}\,A$ . If $A$ is maximally monotone, then its resolvent $J_{A}=(\operatorname{Id}+A)^{-1}$ is firmly nonexpansive. We denote by $\Gamma_{0}({\mathcal{H}})$ the class of proper lower semicontinuous convex functions from ${\mathcal{H}}$ to $\left]-\infty,+\infty\right]$ . Let $f\in\Gamma_{0}({\mathcal{H}})$ . The conjugate of $f$ is

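$$\Gamma_{0}({\mathcal{H}})\ni f^{*}\colon u\mapsto\sup_{x\in{\mathcal{H}}}\big({\left\langle{{x}\mid{u}}\right\rangle}-f(x)\big)\tag{3.2}$$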

and the subdifferential of $f$ is the maximally monotone operator

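$$\partial f\colon{\mathcal{H}}\to 2^{{\mathcal{H}}}\colon x\mapsto\big\{{u\in{\mathcal{H}}}~\big|~{(\forall y\in{\mathcal{H}})\;\;{\left\langle{{y-x}\mid{u}}\right\rangle}+f(x)\leqslant f(y)}\big\}.\tag{3.3}$$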

For every $x\in{\mathcal{H}}$ , the unique minimizer of $f+\|x-\cdot\|^{2}/2$ is denoted by $\text{\rm prox}_{f}x$ . We have $\text{\rm prox}_{f}=J_{\partial f}$ and $\text{\rm prox}_{f}$ is therefore firmly nonexpansive.

Let $C$ be a nonempty convex subset of ${\mathcal{H}}$. Then $\iota_{C}$ is the indicator function of $C$ (it takes the value $0$ on $C$ and ${+\infty}$ on its complement) and $d_{C}\colon x\mapsto\inf_{y\in C}\|x-y\|$ is its distance function. If $C$ is closed, its projection operator is $\text{\rm proj}_{C}=\text{\rm prox}_{\iota_{C}}$.
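To make the identity $\text{\rm proj}_{C}=\text{\rm prox}_{\iota_{C}}$ concrete on the real line, the sketch below approximates the proximity operator of the indicator of an interval $[0,\beta]$ by minimizing $y\mapsto\iota_{[0,\beta]}(y)+|x-y|^{2}/2$ over a fine grid, and compares the result with the clipping formula $\min\{\max\{x,0\},\beta\}$. The value of $\beta$, the grid size, and the function names are hypothetical choices for illustration, and the grid search is only accurate up to the grid spacing.

```python
def prox_indicator_interval(x, beta, n=10_001):
    """Approximate prox of the indicator of [0, beta] at x by grid search.

    The indicator is +infinity outside [0, beta], so only points of the
    interval compete; among them we minimize (x - y)^2 / 2.
    """
    grid = [beta * k / (n - 1) for k in range(n)]
    return min(grid, key=lambda y: 0.5 * (x - y) ** 2)

def clip_to_interval(x, beta):
    """Closed-form projection onto [0, beta]."""
    return min(max(x, 0.0), beta)

beta = 2.0                                   # hypothetical interval length
points = (-1.5, 0.3, 1.0, 2.7)               # hypothetical test inputs
for x in points:
    print(x, prox_indicator_interval(x, beta), clip_to_interval(x, beta))
```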

3.2 Activators as averaged operators

We show via various illustrations that the assumption made in Model 1.1 on the activation operators covers many existing instances of feed-forward neural networks. Let us start with some key properties.

Proposition 3.1

Let ${\mathcal{H}}$ be a real Hilbert space, let $\alpha\in\left[0,1\right]$ , and let $R\colon{\mathcal{H}}\to{\mathcal{H}}$ be $\alpha$ -averaged. Then the following hold:

(i)

There exist a maximally monotone operator $A\colon{\mathcal{H}}\to 2^{{\mathcal{H}}}$ and a constant $\lambda\in\left[0,2\right]$ such that $R=\operatorname{Id}+\lambda(J_{A}-\operatorname{Id})$. Furthermore, if $\lambda\leqslant 1$, then $R$ is firmly nonexpansive.

(ii)

Suppose that ${\mathcal{H}}=\mathbb{R}$. Then there exist a function $\phi\in\Gamma_{0}(\mathbb{R})$ and a constant $\lambda\in\left[0,2\right]$ such that $R=\operatorname{Id}+\lambda(\text{\rm prox}_{\phi}-\operatorname{Id})$. Furthermore, $R$ is increasing if $\lambda\leqslant 1$ and $R$ is odd if $\phi$ is even.

(iii)

Suppose that ${\mathcal{H}}=\mathbb{R}$ and that $R$ is increasing. Then there exists $\phi\in\Gamma_{0}(\mathbb{R})$ such that $R=\text{\rm prox}_{\phi}$ .

Next, we illustrate the pervasiveness of nonexpansive averaged activation operators in practice, starting with activation operators on the real line.

Example 3.2

Proposition 3.1 (ii) states that activation functions on the real line can be expressed in the generic form

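$$R=\operatorname{Id}+\lambda(\text{\rm prox}_{\phi}-\operatorname{Id}),\quad\text{where}\quad\phi\in\Gamma_{0}(\mathbb{R})\quad\text{and}\quad\lambda\in[0,2].\tag{3.4}$$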

Here are a few explicit instantiations of this proximal representation.

(i)

If $\lambda=1$ , we obtain the class of proximal activation functions discussed in [24] and which was seen there to include standard instances such as the unimodal sigmoid activation function [24, Example 2.13], the saturated linear activation function [24, Example 2.5], the ReLU activation function [24, Example 2.6], the inverse square root unit activation function [24, Example 2.9], the hyperbolic tangent activation function [24, Example 2.12], and the Elliot activation function [24, Example 2.15]. Additional examples in this category are the following. Given $\beta\in\left]0,+\infty\right[$ , the capped ReLU activation function [36] is

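$$(\forall x\in\mathbb{R})\quad R(x)=\text{\rm prox}_{\iota_{[0,\beta]}}(x)=\min\{\max\{x,0\},\beta\},\tag{3.5}$$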

and, for $\beta\leqslant 1$ , the exponential linear unit (ELU) function [20] is

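$$(\forall x\in\mathbb{R})\quad R(x)=\begin{cases}x,&\text{if}\;\;x\geqslant 0;\\ \beta\big(\exp(x)-1\big),&\text{if}\;\;x<0.\end{cases}\tag{3.6}$$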

It follows from [8, Cor. 24.5, Prop. 24.32, and Exa. 13.2(v)] that $R=\text{\rm prox}_{\phi}$ , where

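$$(\forall x\in\mathbb{R})\quad\phi(x)=\begin{cases}0,&\text{if}\;\;x\geqslant 0;\\ (x+\beta)\ln\bigg(\dfrac{x+\beta}{\beta}\bigg)-x-\dfrac{x^{2}}{2},&\text{if}\;\;-\beta<x<0;\\ \beta-\dfrac{\beta^{2}}{2},&\text{if}\;\;x=-\beta;\\ {+\infty},&\text{if}\;\;x<-\beta.\end{cases}\tag{3.7}$$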

The softplus activation function [29] $R\colon x\mapsto\ln((1+e^{x})/2)$ is also a proximity operator since it is nonexpansive and increasing (see Proposition 3.1 (iii)).

(ii)

The Geman–McClure function [28]

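$$(\forall x\in\mathbb{R})\quad R(x)=\dfrac{\mu\,\text{\rm sign}(x)\,x^{2}}{1+x^{2}},\quad\text{where}\quad\mu=\dfrac{8}{3\sqrt{3}},\tag{3.8}$$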

will be employed in Example 3.3. Set $\psi=|\cdot|-\arctan|\cdot|\in\Gamma_{0}(\mathbb{R})$ . Then $R$ is nonexpansive and $R=\mu\psi^{\prime}$ . The conjugate of $\mu\psi$ is 1-strongly convex and given by $\mu\psi^{*}(\cdot/\mu)$ , where

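$$(\forall x\in\mathbb{R})\quad\psi^{*}(x)=\begin{cases}\arctan\sqrt{\dfrac{|x|}{1-|x|}}-\sqrt{|x|(1-|x|)},&\text{if}\;\;|x|<1;\\ \dfrac{\pi}{2},&\text{if}\;\;|x|=1;\\ {+\infty},&\text{otherwise.}\end{cases}\tag{3.9}$$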

It follows from [8, Cor. 24.5] that $R=\text{\rm prox}_{\phi}$ with (see Fig. 2)

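$$\phi=\mu\psi^{*}\bigg(\dfrac{\cdot}{\mu}\bigg)-\dfrac{|\cdot|^{2}}{2}\colon x\mapsto\begin{cases}\mu\arctan\sqrt{\dfrac{|x|}{\mu-|x|}}-\sqrt{|x|(\mu-|x|)}-\dfrac{x^{2}}{2},&\text{if}\;\;|x|<\mu;\\ \dfrac{\mu(\pi-\mu)}{2},&\text{if}\;\;|x|=\mu;\\ {+\infty},&\text{otherwise.}\end{cases}\tag{3.10}$$

(iii)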

Take $\phi=\iota_{\left[0,+\infty\right[}$. Then we obtain the leaky ReLU activation function [38] for $0<\lambda<1$, the ReLU activation function for $\lambda=1$, and the absolute value activation function [17] for $\lambda=2$.

(iv)

The use of nonmonotonic activation functions has been advocated in various papers. They turn out to be $\alpha$-averaged (alternatively, in view of Proposition 3.1 (ii), they are of the form (3.4) with $\lambda\in\left]1,2\right]$). To compute the averagedness constant of a nonexpansive operator $R\colon\mathbb{R}\to\mathbb{R}$, one can proceed as follows. According to (1.2), we must find the smallest $\alpha\in\left]0,1\right]$ such that $Q=\operatorname{Id}+\alpha^{-1}(R-\operatorname{Id})$ remains nonexpansive. This means that the supremum over $\mathbb{R}$ of the modulus of the one-sided derivatives of $Q$ (the derivatives if they exist) should be one. Thus, we obtain $\alpha=1$ for the sine activation function $R=\sin$ [42], as well as for the absolute value function $R=|\cdot|$ [17] and the mirrored ReLU activation function [58]

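$$(\forall x\in\mathbb{R})\quad R(x)=\text{\rm proj}_{[0,1]}|x|=\begin{cases}|x|,&\text{if}\;\;|x|<1;\\ 1,&\text{otherwise,}\end{cases}\tag{3.11}$$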

$\alpha\approx 0.546$ for the swish activation function [45]

$$(\forall x\in\mathbb{R})\quad R(x)=\dfrac{10x}{11(1+\exp(-x))},\tag{3.12}$$

$\alpha\approx 0.536$ for the exponential linear squashing (ELiSH) function [7]

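$$(\forall x\in\mathbb{R})\quad R(x)=\dfrac{10}{11}\times\begin{cases}\dfrac{x}{1+\exp(-x)},&\text{if}\;\;x\geqslant 0;\\ \dfrac{\exp(x)-1}{1+\exp(-x)},&\text{if}\;\;x<0,\end{cases}\tag{3.13}$$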

and $\alpha=(1+\sqrt{2/e})/2$ for the Gaussian activation function $R\colon x\mapsto\exp(-x^{2})$ [40].
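The procedure described in item (iv) can be sketched numerically. If $s$ denotes a slope of a nonexpansive scalar function $R$, the corresponding slope of $Q=\operatorname{Id}+\alpha^{-1}(R-\operatorname{Id})$ is $1+(s-1)/\alpha$, which stays above $-1$ exactly when $\alpha\geqslant(1-s)/2$. The code below therefore estimates $\alpha$ as $(1-\inf s)/2$, approximating the infimum by secant slopes on a uniform grid. The grid bounds, the resolution, and the function names are assumptions, and the estimate is only as accurate as the grid allows.

```python
import math

def averagedness_constant(R, lo=-10.0, hi=10.0, n=20_001):
    """Estimate the smallest alpha such that Q = Id + (R - Id)/alpha is nonexpansive.

    Assumes R is nonexpansive on the real line and that its smallest slope
    is attained (or approached) inside [lo, hi].
    """
    h = (hi - lo) / (n - 1)
    ys = [R(lo + k * h) for k in range(n)]
    min_slope = min((ys[k + 1] - ys[k]) / h for k in range(n - 1))
    return (1.0 - min_slope) / 2.0

def relu(x):
    return max(0.0, x)

def swish(x):
    return 10.0 * x / (11.0 * (1.0 + math.exp(-x)))

def elish(x):
    num = x if x >= 0 else math.exp(x) - 1.0
    return 10.0 * num / (11.0 * (1.0 + math.exp(-x)))

def gauss(x):
    return math.exp(-x * x)

for name, R in [("sin", math.sin), ("relu", relu), ("swish", swish),
                ("elish", elish), ("gauss", gauss)]:
    print(f"{name:6s} alpha ~ {averagedness_constant(R):.4f}")
```

For the Gaussian activation the estimate can be compared with the closed form $(1+\sqrt{2/e})/2$ given above, and for ReLU with the value $1/2$.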

Next, we describe a technique for lifting a proximal activation operator from $\mathbb{R}$ to a Hilbert space ${\mathcal{H}}$.

Example 3.3

Let ${\mathcal{H}}$ be a real Hilbert space, let $\lambda\in[0,2]$, let $C$ be a nonempty closed convex subset of ${\mathcal{H}}$, and let $\phi\in\Gamma_{0}(\mathbb{R})$ be an even function such that $\phi^{*}$ is differentiable on $\mathbb{R}\smallsetminus\{0\}$ with $0$ as its unique minimizer. Set

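$$(\forall x\in{\mathcal{H}})\quad Rx=\begin{cases}(1-\lambda)x+\dfrac{\lambda\,\text{\rm prox}_{\phi}d_{C}(x)}{d_{C}(x)}(x-\text{\rm proj}_{C}x),&\text{if}\;\;x\notin C;\\ (1-\lambda)x,&\text{if}\;\;x\in C.\end{cases}\tag{3.14}$$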

Then $R$ is $\lambda/2$ -averaged. In particular, set $\lambda=1$ , $C=\{0\}$ , $\mu=8/(3\sqrt{3})$ and define $\phi$ as in (3.10). Then we infer that the squashing function

$$R\colon x\mapsto\dfrac{\mu\|x\|}{1+\|x\|^{2}}x\tag{3.15}$$

used in capsule networks [48] is a proximal activation operator.
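Since a proximity operator is firmly nonexpansive, the squashing function should satisfy the standard characterization ${\left\langle{{Rx-Ry}\mid{x-y}}\right\rangle}\geqslant\|Rx-Ry\|^{2}$ for all $x$ and $y$. The sketch below tests this inequality on random pairs in $\mathbb{R}^{4}$ with $\mu=8/(3\sqrt{3})$. The dimension, the sampling distribution, the seed, and the function names are illustrative assumptions, and the random test is a sanity check rather than a proof.

```python
import math
import random

MU = 8.0 / (3.0 * math.sqrt(3.0))

def squash(x):
    """Squashing function x -> mu * ||x|| / (1 + ||x||^2) * x."""
    nrm = math.sqrt(sum(c * c for c in x))
    scale = MU * nrm / (1.0 + nrm * nrm)
    return [scale * c for c in x]

def firm_gap(x, y):
    """<Rx - Ry | x - y> - ||Rx - Ry||^2, nonnegative if R is firmly nonexpansive."""
    rx, ry = squash(x), squash(y)
    d_out = [a - b for a, b in zip(rx, ry)]
    d_in = [a - b for a, b in zip(x, y)]
    return sum(a * b for a, b in zip(d_out, d_in)) - sum(a * a for a in d_out)

rng = random.Random(0)                         # arbitrary seed (assumption)

def sample(dim=4, scale=2.0):
    return [rng.gauss(0.0, scale) for _ in range(dim)]

worst = min(firm_gap(sample(), sample()) for _ in range(20_000))
print(f"smallest observed gap: {worst:.3e}")
```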

Another construction that builds on activation functions on the real line is the following, which is reminiscent of the original multilayer perceptrons [46].

Example 3.4

Let ${\mathcal{H}}$ be a separable real Hilbert space, let ${\varnothing}\neq\mathbb{K}\subset\mathbb{N}$ , let $(e_{k})_{k\in\mathbb{K}}$ be an orthonormal basis of ${\mathcal{H}}$ , and let $\alpha\in[0,1]$ . For every $k\in\mathbb{K}$ , let $\varrho_{k}\colon\mathbb{R}\to\mathbb{R}$ be $\alpha$ -averaged and such that $\varrho_{k}(0)=0$ . Define $R\colon{\mathcal{H}}\to{\mathcal{H}}\colon x\mapsto\sum_{k\in\mathbb{K}}\varrho_{k}({\left\langle{{x}\mid{e_{k}}}\right\rangle})e_{k}$ . Then $R$ is $\alpha$ -averaged.

Example 3.5

Let $N$ be a strictly positive integer, let $\omega\in\left[0,1\right]$ , and let $C$ be a nonempty closed convex subset of $\mathbb{R}^{N}$ . Set

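$$R\colon\mathbb{R}^{N}\to\mathbb{R}^{N}\colon(\xi_{k})_{1\leqslant k\leqslant N}\mapsto\omega\big(\xi_{k}^{\uparrow}\big)_{1\leqslant k\leqslant N}+(1-\omega)\text{\rm proj}_{C}(\xi_{k})_{1\leqslant k\leqslant N},\tag{3.16}$$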

where $(\xi_{k}^{\uparrow})_{1\leqslant k\leqslant N}$ denotes the vector obtained by sorting the components of $(\xi_{k})_{1\leqslant k\leqslant N}$ in ascending order. Then $R$ is $(1+\omega)/2$ -averaged.

Remark 3.6

Set $C=\big{\{}{(\xi_{k})_{1\leqslant k\leqslant N}\in\mathbb{R}^{N}}~{}\big{|}~{}{\xi_{1}=\cdots=\xi_{N}}\big{\}}$ in Example 3.5. Then

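$$R\colon\mathbb{R}^{N}\to\mathbb{R}^{N}\colon(\xi_{k})_{1\leqslant k\leqslant N}\mapsto\Bigg(\omega\xi_{k}^{\uparrow}+\frac{1-\omega}{N}\sum_{j=1}^{N}\xi_{j}\Bigg)_{1\leqslant k\leqslant N}.\tag{3.17}$$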

Now set $W\colon\mathbb{R}^{N}\to\mathbb{R}\colon(\xi_{k})_{1\leqslant k\leqslant N}\mapsto\xi_{N}$ . Then $W\circ R$ corresponds to the max-average pooling performed on a block of size $N$ [37]. When $\omega=0$ , the standard average-pooling operation is obtained, which is associated with the activation operator $\text{\rm proj}_{C}$ . When $\omega=1$ , we recover the standard max-pooling operation [14], which is the main building block of maxout layers [31]. The max-pooling operator is nonexpansive.
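The construction of this remark is short enough to spell out in code. The sketch below implements the activation operator $R$ (sorting in ascending order, then mixing with the block average) and the linear map $W$ that keeps the last component, so that $W\circ R$ returns $\omega\max+(1-\omega)\,\text{mean}$. The sample block and the function names are hypothetical.

```python
def max_average_activation(x, omega):
    """R: omega * (x sorted in ascending order) + (1 - omega) * mean(x) in each slot."""
    mean = sum(x) / len(x)
    return [omega * s + (1.0 - omega) * mean for s in sorted(x)]

def pool(x, omega):
    """W o R, where W keeps the last (largest) component."""
    return max_average_activation(x, omega)[-1]

block = [0.2, -1.0, 3.5, 0.7]        # hypothetical pooling block, N = 4
print("omega = 1   (max pooling):    ", pool(block, 1.0))
print("omega = 0   (average pooling):", pool(block, 0.0))
print("omega = 0.5 (max-average):    ", pool(block, 0.5))
```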

Example 3.7

Let $2\leqslant N\in\mathbb{N}$ , let $\{\tau_{k}\}_{1\leqslant k\leqslant N-1}\subset\left]-1,1\right[$ , and let $\theta\in\mathbb{R}$ . Set

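$$R\colon\mathbb{R}^{N-1}\to\mathbb{R}^{N-1}\colon(\xi_{k})_{1\leqslant k\leqslant N-1}\mapsto US\Big([\tau_{1}\xi_{1},\ldots,\tau_{N-1}\xi_{N-1},\theta]^{\top}\Big),\tag{3.18}$$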

where $U\in\mathbb{R}^{(N-1)\times N}$ is the matrix obtained by retaining the first $(N-1)$ rows of the identity matrix of size $N\times N$ , and $S\colon\mathbb{R}^{N}\to\mathbb{R}^{N}\colon(\xi_{k})_{1\leqslant k\leqslant N}\mapsto(\xi_{k}^{\uparrow})_{1\leqslant k\leqslant N}$ . Then $R$ is $(1+\max\{|\tau_{1}|,\ldots,|\tau_{N-1}|\})/2$ -averaged.

Remark 3.8

Let $N\geqslant 3$ be an odd integer, let $(\tau_{k})_{1\leqslant k\leqslant N-1}\in\left]-1,1\right[^{N-1}$ , let $\theta\in\mathbb{R}$ , let $R$ be the activation operator defined in Example 3.7, and set $W\colon\mathbb{R}^{N-1}\to\mathbb{R}\colon(\xi_{k})_{1\leqslant k\leqslant N-1}\mapsto\xi_{\frac{N+1}{2}}$ . Then, for every $x=(\xi_{k})_{1\leqslant k\leqslant N-1}\in\mathbb{R}^{N-1}$ , $(W\circ R)x=\text{\rm median}\{\tau_{1}\xi_{1},\ldots,\tau_{N-1}\xi_{N-1},\theta\}$ . This corresponds to the median neuron model introduced in [2].
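The median neuron can be traced step by step: stack the weighted inputs with the threshold $\theta$, sort ($S$), drop the last component ($U$), and read off component $(N+1)/2$ ($W$). The sketch below follows these steps and compares the result with the median computed by the Python standard library. The sample weights and inputs and the function name are hypothetical.

```python
import statistics

def median_neuron(xi, tau, theta):
    """(W o R) x for x = xi in R^(N-1), with N odd."""
    n = len(xi) + 1                                  # N
    if n % 2 == 0 or len(tau) != len(xi):
        raise ValueError("need N odd and one weight per input")
    stacked = [t * v for t, v in zip(tau, xi)] + [theta]   # vector in R^N
    sorted_vec = sorted(stacked)                     # S: ascending sort
    truncated = sorted_vec[: n - 1]                  # U: keep the first N-1 components
    return truncated[(n + 1) // 2 - 1]               # W: component (N+1)/2, 1-based

xi = [1.0, -2.0, 0.5, 4.0]           # hypothetical inputs, N = 5
tau = [0.5, -0.25, 0.9, 0.1]         # hypothetical weights in ]-1, 1[
theta = 0.3                          # hypothetical threshold
value = median_neuron(xi, tau, theta)
reference = statistics.median([t * v for t, v in zip(tau, xi)] + [theta])
print(value, reference)
```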

Remark 3.9

Multi-component averaged activation operators can be derived from the above examples. Indeed, let $({\mathcal{H}}_{i})_{1\leqslant i\leqslant M}$ be real Hilbert spaces and let ${\mathcal{H}}={\mathcal{H}}_{1}\oplus\cdots\oplus{\mathcal{H}}_{M}$ be their Hilbert direct sum. For every $i\in\{1,\ldots,M\}$, let $\alpha_{i}\in[0,1]$ and let $R_{i}\colon{\mathcal{H}}_{i}\to{\mathcal{H}}_{i}$ be $\alpha_{i}$-averaged. Then $R\colon{\mathcal{H}}\to{\mathcal{H}}\colon(x_{i})_{1\leqslant i\leqslant M}\mapsto(R_{i}x_{i})_{1\leqslant i\leqslant M}$ is $\alpha$-averaged with $\alpha=\max_{1\leqslant i\leqslant M}\alpha_{i}$.

4 Lipschitz constants for layered networks

The objective of this section is to derive Lipschitz constants for networks conforming to Model 1.1. Note that, if $m=1$ , a Lipschitz constant of $T$ is clearly $\theta_{1}=\|W_{1}\|$ since $R_{1}$ is nonexpansive. We shall therefore focus henceforth on the case $m\geqslant 2$ . Throughout, the following notation is employed.

Notation 4.1

Let $2\leqslant m\in\mathbb{N}$ and $k\in\{1,\ldots,m-1\}$ . Then

[TABLE]

and, for every $(j_{1},\ldots,j_{k})\in\mathbb{J}_{m,k}$ ,

[TABLE]

Theorem 4.2

Consider the setting of Model 1.1 with $m\geqslant 2$ . Set

[TABLE]

and

[TABLE]

Then $\theta_{m}$ is a Lipschitz constant of $T$ .
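A minimal numerical sketch of how $\theta_{m}$ could be evaluated for matrices follows. It rests on two assumptions about the displays (4.1)–(4.4): that $\beta_{m;\mathbb{J}}$ is the product of $\alpha_{j}$ over $j\in\mathbb{J}$ and of $1-\alpha_{j}$ over the remaining indices (the Bernoulli weights mentioned in the proof of Proposition 4.3), and that $\sigma_{m;\{j_{1},\ldots,j_{k}\}}$ is the product of the operator norms of the blocks of consecutive weight operators delimited by $j_{1}<\cdots<j_{k}$. The function names are hypothetical.

```python
from itertools import combinations
import numpy as np

def spectral_norm(a):
    return np.linalg.norm(a, 2)

def chain(ws, lo, hi):
    """Matrix of W_hi o ... o W_lo (1-indexed, inclusive)."""
    prod = ws[lo - 1]
    for w in ws[lo:hi]:
        prod = w @ prod
    return prod

def theta_m(ws, alphas):
    """Assumed form of (4.4): sum over subsets J of {1,...,m-1} of beta_J * sigma_J."""
    m = len(ws)
    total = 0.0
    for k in range(m):
        for js in combinations(range(1, m), k):
            beta = 1.0
            for j in range(1, m):
                beta *= alphas[j - 1] if j in js else 1.0 - alphas[j - 1]
            cuts = (0,) + js + (m,)       # block boundaries
            sigma = 1.0
            for a, b in zip(cuts[:-1], cuts[1:]):
                sigma *= spectral_norm(chain(ws, a + 1, b))
            total += beta * sigma
    return total

rng = np.random.default_rng(0)
ws = [rng.standard_normal((4, 3)),        # W_1
      rng.standard_normal((5, 4)),        # W_2
      rng.standard_normal((2, 5))]        # W_3
lower = spectral_norm(chain(ws, 1, 3))    # norm of the linear network
upper = float(np.prod([spectral_norm(w) for w in ws]))
theta = theta_m(ws, [0.5, 0.5])           # firmly nonexpansive activations
```

The enumeration visits $2^{m-1}$ subsets, so this brute-force form is only practical for shallow networks.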

The following proposition features some important special cases.

Proposition 4.3

Consider the setting of Model 1.1 with $m\geqslant 2$ , and let $\theta_{m}$ be defined as in (4.4). Then the following hold:

(i) $\|W_{m}\circ\cdots\circ W_{1}\|\leqslant\theta_{m}\leqslant\prod_{i=1}^{m}\|W_{i}\|$ .

(ii) Suppose that, for every $i\in\{1,\ldots,m-1\}$ , $R_{i}=\operatorname{Id}$ . Then $\theta_{m}=\|W_{m}\circ\cdots\circ W_{1}\|$ .

(iii) Suppose that, for every $i\in\{1,\ldots,m-1\}$ , $R_{i}$ is purely nonexpansive in the sense that $\alpha_{i}=1$ is its smallest averaging constant. Then $\theta_{m}=\prod_{i=1}^{m}\|W_{i}\|$ .

(iv) Suppose that, for every $i\in\{1,\ldots,m-1\}$ , $R_{i}$ is firmly nonexpansive. Then

[TABLE]

(v) Set $\alpha_{0}=\theta_{0}=1$ . Then

[TABLE]

Remark 4.4

Proposition 4.3 (i)–(iii) shows that the tightest bound in terms of stability corresponds to a linear network, while the loosest corresponds to a network with nonlinearities having no stronger property than nonexpansiveness.

We close this section by observing that the Lipschitz constant exhibited in Theorem 4.2 is a componentwise increasing function of the averagedness constants of the activation operators.

Proposition 4.5

Consider the setting of Model 1.1 with $m\geqslant 2$ . Make the Lipschitz constant $\theta_{m}$ in Theorem 4.2 a function of $(\alpha_{1},\ldots,\alpha_{m-1})\in[0,1]^{m-1}$ . Let $(\alpha_{i})_{1\leqslant i\leqslant m-1}\in[0,1]^{m-1}$ and $(\alpha^{\prime}_{i})_{1\leqslant i\leqslant m-1}\in[0,1]^{m-1}$ be such that $(\forall i\in\{1,\ldots,m-1\})$ $\alpha_{i}\leqslant\alpha^{\prime}_{i}$ . Then $\theta_{m}(\alpha_{1},\ldots,\alpha_{m-1})\leqslant\theta_{m}(\alpha^{\prime}_{1},\ldots,\alpha^{\prime}_{m-1})$ .

Remark 4.6

Proposition 4.5 suggests that, in terms of stability, it is better to use proximal activation operators, such as those listed in Example 3.2 (i)–(ii), than $\alpha$ -averaged activation operators for which $\alpha>1/2$ , such as those mentioned in Example 3.2 (iv).
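The monotonicity in Proposition 4.5 can be illustrated on a two-layer network. The sketch assumes that, for $m=2$, the constant of Theorem 4.2 takes the form $\theta_{2}=(1-\alpha_{1})\|W_{2}\circ W_{1}\|+\alpha_{1}\|W_{2}\|\,\|W_{1}\|$, which is what the Bernoulli weighting in the proof of Proposition 4.3 suggests; the matrices are arbitrary illustrative choices.

```python
import numpy as np

def theta_2(w1, w2, alpha):
    """Assumed two-layer constant: convex mix of the joint and split norms."""
    joint = np.linalg.norm(w2 @ w1, 2)
    split = np.linalg.norm(w2, 2) * np.linalg.norm(w1, 2)
    return (1.0 - alpha) * joint + alpha * split

w1 = np.array([[1.0, 0.0], [0.0, 0.5]])
w2 = np.array([[0.5, 0.0], [0.0, 1.0]])
alphas = np.linspace(0.0, 1.0, 5)         # 0, 0.25, 0.5, 0.75, 1
values = [theta_2(w1, w2, a) for a in alphas]
```

Here the joint norm is 0.5 and the product of norms is 1, so a proximal activation ($\alpha_{1}=1/2$) gives 0.75 whereas a merely nonexpansive one ($\alpha_{1}=1$) gives 1.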

5 Networks using separable activation operators

We show that sharper Lipschitz constants can be derived in the case of networks featuring the type of separable structure described in Example 3.4. Note that this class of networks is the most commonly used, standard convnets being special cases. The following notation will be used.

Notation 5.1

Let ${\mathcal{H}}$ be a separable real Hilbert space, let ${\varnothing}\neq\mathbb{K}\subset\mathbb{N}$ , let $E=(e_{k})_{k\in\mathbb{K}}$ be an orthonormal basis of ${\mathcal{H}}$ , and let $I$ be a nonempty bounded subset of $\mathbb{R}$ . Then

[TABLE]

5.1 General results

Theorem 5.2

Consider the setting of Model 1.1 with $m\geqslant 2$ . For every $i\in\{1,\ldots,m-1\}$ , suppose that ${\mathcal{H}}_{i}$ is separable, let ${\varnothing}\neq\mathbb{K}_{i}\subset\mathbb{N}$ , let ${E_{i}}=(e_{i,k})_{k\in\mathbb{K}_{i}}$ be an orthonormal basis of ${\mathcal{H}}_{i}$ , and, for every $k\in\mathbb{K}_{i}$ , let $\varrho_{i,k}\colon\mathbb{R}\to\mathbb{R}$ be $\alpha_{i}$ -averaged and such that $\varrho_{i,k}(0)=0$ . Assume that

[TABLE]

and define

[TABLE]

Then the following hold:

(i) $\vartheta_{m}$ is a Lipschitz constant of the operator $T$ of (1.3).

(ii) Define $\theta_{m}$ as in (4.4). Then $\|W_{m}\circ\cdots\circ W_{1}\|\leqslant\vartheta_{m}\leqslant\theta_{m}$ .
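For small finite-dimensional layers, the constant $\vartheta_{m}$ can be explored by brute force. The sketch below assumes, following the proof of Theorem 5.2, that $\vartheta_{2}$ is the supremum of $\|W_{2}\circ\Lambda_{1}\circ W_{1}\|$ over diagonal operators $\Lambda_{1}$ with entries in $\{1-2\alpha_{1},1\}$, and it compares the result with the two-layer form $\theta_{2}=(1-\alpha_{1})\|W_{2}\circ W_{1}\|+\alpha_{1}\|W_{2}\|\,\|W_{1}\|$, which is itself an assumption about (4.4). The function name `vartheta_2` is hypothetical.

```python
from itertools import product
import numpy as np

def vartheta_2(w1, w2, alpha):
    """Max of ||W2 Lambda W1|| over diagonal Lambda with entries in {1-2*alpha, 1}."""
    n = w1.shape[0]
    best = 0.0
    for diag in product((1.0 - 2.0 * alpha, 1.0), repeat=n):
        best = max(best, np.linalg.norm(w2 @ np.diag(diag) @ w1, 2))
    return best

w1 = np.array([[1.0, -2.0], [0.5, 1.0], [-1.0, 0.0]])
w2 = np.array([[1.0, 1.0, -1.0], [0.0, 2.0, 1.0]])
alpha = 0.5   # firmly nonexpansive scalar activations such as ReLU or tanh

lower = np.linalg.norm(w2 @ w1, 2)
sep = vartheta_2(w1, w2, alpha)
theta = (1.0 - alpha) * lower + alpha * np.linalg.norm(w2, 2) * np.linalg.norm(w1, 2)
```

The enumeration has $2^{N_{1}}$ terms, so it serves only to illustrate the ordering in (ii) on toy examples.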

Remark 5.3

An expression similar to (5.3) was proposed empirically in [49] for a multilayer perceptron operating on finite-dimensional spaces under the additional assumption that the activation operators are continuously differentiable and firmly nonexpansive.

Remark 5.4

In Theorem 5.2, make the additional assumption that, for some $i\in\{1,\ldots,m-1\}$ , the functions $(\varrho_{i,k})_{k\in\mathbb{K}_{i}}$ are increasing. Then it follows from Proposition 3.1 (iii) that there exist functions $(\phi_{i,k})_{k\in\mathbb{K}_{i}}$ in $\Gamma_{0}(\mathbb{R})$ such that $(\forall k\in\mathbb{K}_{i})$ $\varrho_{i,k}=\text{\rm prox}_{\phi_{i,k}}$ . In addition, for every $k\in\mathbb{K}_{i}$ , since $\varrho_{i,k}(0)=0$ and since the set of minimizers of $\phi_{i,k}$ coincides with the set of fixed points of $\text{\rm prox}_{\phi_{i,k}}$ [8, Proposition 12.29], we deduce that $\phi_{i,k}$ is minimized at $0$ . Furthermore, $\alpha_{i}=1/2$ and $R_{i}=\text{\rm prox}_{\varphi_{i}}$ , where $(\forall x\in{\mathcal{H}}_{i})$ $\varphi_{i}(x)=\sum_{k\in\mathbb{K}_{i}}\phi_{i,k}({\left\langle{{x}\mid{e_{i,k}}}\right\rangle})$ . Such a construction is used in [23, 25].

As in Proposition 4.5, the Lipschitz constant exhibited in Theorem 5.2 turns out to be a componentwise increasing function of the averagedness constants of the activation operators.

Proposition 5.5

Consider the setting of Model 1.1 with $m\geqslant 2$ . For every $i\in\{1,\ldots,m-1\}$ , suppose that ${\mathcal{H}}_{i}$ is separable, let ${\varnothing}\neq\mathbb{K}_{i}\subset\mathbb{N}$ , and let ${E_{i}}=(e_{i,k})_{k\in\mathbb{K}_{i}}$ be an orthonormal basis of ${\mathcal{H}}_{i}$ . Define $\vartheta_{m}\colon[0,1]^{m-1}\to\left[0,+\infty\right[$ by

[TABLE]

Let $(\alpha_{i})_{1\leqslant i\leqslant m-1}\in[0,1]^{m-1}$ and $(\alpha^{\prime}_{i})_{1\leqslant i\leqslant m-1}\in[0,1]^{m-1}$ be such that $(\forall i\in\{1,\ldots,m-1\})$ $\alpha_{i}\leqslant\alpha^{\prime}_{i}$ . Then $\vartheta_{m}(\alpha_{1},\ldots,\alpha_{m-1})\leqslant\vartheta_{m}(\alpha^{\prime}_{1},\ldots,\alpha^{\prime}_{m-1})$ .

5.2 Extension to non-Hilbertian norms

In certain applications, Hilbertian norms may not be the most relevant measures to quantify errors. We now state a variant of Theorem 5.2 which holds for alternative norms. It involves embeddings of Hilbert spaces; standard examples can be found in [57]. Let us also point out that these embedding conditions are automatically satisfied if the spaces are finite-dimensional.

Proposition 5.6

Consider the setting of Model 1.1 with $m\geqslant 2$ . For every $i\in\{1,\ldots,m\}$ , suppose that ${\mathcal{H}}_{i}$ is separable, let ${\varnothing}\neq\mathbb{K}_{i}\subset\mathbb{N}$ , let ${E_{i}}=(e_{i,k})_{k\in\mathbb{K}_{i}}$ be an orthonormal basis of ${\mathcal{H}}_{i}$ , and, for every $k\in\mathbb{K}_{i}$ , let $\varrho_{i,k}\colon\mathbb{R}\to\mathbb{R}$ be $\alpha_{i}$ -averaged and such that $\varrho_{i,k}(0)=0$ . Let ${\mathcal{G}}_{0}$ be the normed space obtained by equipping the vector space underlying ${\mathcal{H}}_{0}$ with a norm for which ${\mathcal{G}}_{0}$ is continuously embedded in ${\mathcal{H}}_{0}$ , and let ${\mathcal{G}}_{m}$ be the normed space obtained by equipping the vector space underlying ${\mathcal{H}}_{m}$ with a norm for which ${\mathcal{H}}_{m}$ is continuously embedded in ${\mathcal{G}}_{m}$ . Assume that

[TABLE]

Then

[TABLE]

is a Lipschitz constant of $T\colon{\mathcal{G}}_{0}\to{\mathcal{G}}_{m}$ .

Corollary 5.7

Consider the setting of Model 1.1 with $m\geqslant 2$ . Define ${\mathcal{G}}_{0}$ and $(R_{i})_{1\leqslant i\leqslant m}$ as in Proposition 5.6, let $p\in\left[1,{+\infty}\right]$ , and let $\{\omega_{k}\}_{k\in\mathbb{K}_{m}}\subset\left]0,+\infty\right[$ be such that one of the following holds:

(i) $p\in\left[1,2\right[$ and $\sum_{k\in\mathbb{K}_{m}}\omega_{k}^{2/(2-p)}<{+\infty}$ .

(ii) $p\in\left[2,{+\infty}\right]$ and $\sup_{k\in\mathbb{K}_{m}}\omega_{k}<{+\infty}$ .

Let ${\mathcal{G}}_{m}$ be the normed space obtained by equipping the vector space underlying ${\mathcal{H}}_{m}$ with the norm

[TABLE]

Then a Lipschitz constant of $T\colon{\mathcal{G}}_{0}\to{\mathcal{G}}_{m}$ is

[TABLE]

5.3 Networks with positive weights

Under certain positivity assumptions, the constant $\vartheta_{m}$ of (5.3) and (5.8) can be simplified.

Assumption 5.8

Consider the setting of Model 1.1 with $m\geqslant 2$ . For every $i\in\{0,\ldots,m\}$ , suppose that ${\mathcal{H}}_{i}$ is separable, let ${\varnothing}\neq\mathbb{K}_{i}\subset\mathbb{N}$ , and let ${E_{i}}=(e_{i,k})_{k\in\mathbb{K}_{i}}$ be an orthonormal basis of ${\mathcal{H}}_{i}$ . For every $(k_{0},\ldots,k_{m})\in\mathbb{K}_{0}\times\cdots\times\mathbb{K}_{m}$ , set

[TABLE]

We suppose that

[TABLE]

Example 5.9

Consider the particular case of Model 1.1 in which, for every $i\in\{0,\ldots,m\}$ , $N_{i}\in\mathbb{N}\smallsetminus\{0\}$ , ${\mathcal{H}}_{i}=\mathbb{R}^{N_{i}}$ , $E_{i}$ is the canonical basis of $\mathbb{R}^{N_{i}}$ and, for every $k\in\{1,\ldots,N_{i}\}$ , $\chi_{i,k}\in\{-1,1\}$ with the additional condition that, for every $l\in\{1,\ldots,N_{0}\}$ , $\chi_{0,k}=\chi_{0,l}$ . Further, for every $i\in\{1,\ldots,m\}$ , the matrix $W_{i}=[w_{i,k,l}]_{1\leqslant k\leqslant N_{i},1\leqslant l\leqslant N_{i-1}}\in\mathbb{R}^{N_{i}\times N_{i-1}}$ satisfies

[TABLE]

Then Assumption 5.8 holds. This is true in particular if, for every $i\in\{1,\ldots,m\}$ , $\{w_{i,k,l}\}_{1\leqslant k\leqslant N_{i},1\leqslant l\leqslant N_{i-1}}\subset\left[0,+\infty\right[$ , which corresponds to positively weighted networks. See [19] for the design of such networks.

In the following result, a Lipschitz constant of the network (1.3) coincides with that of the linear network $W_{m}\circ\cdots\circ W_{1}$ for standard choices of norms.

Proposition 5.10

Suppose that the assumptions of Corollary 5.7 are satisfied, that

[TABLE]

and that Assumption 5.8 holds. Then the Lipschitz constant $\vartheta_{m}$ of $T\colon{\mathcal{G}}_{0}\to{\mathcal{G}}_{m}$ in (5.8) reduces to $\vartheta_{m}=\|W_{m}\circ\cdots\circ W_{1}\|_{{\mathcal{G}}_{0},{\mathcal{G}}_{m}}$ .

We show below that the Lipschitz constant of a positively weighted network associated with weight operators $(W_{i})_{1\leqslant i\leqslant m}$ and nonseparable activation operators is not necessarily $\|W_{m}\circ\cdots\circ W_{1}\|$ .

Example 5.11

Consider the toy version of Model 1.1 in which $m=2$ , ${\mathcal{H}}_{0}={\mathcal{H}}_{1}={\mathcal{H}}_{2}=\mathbb{R}^{2}$ . Set $\varphi\colon x=(\xi_{1},\xi_{2})\mapsto\phi(\xi_{1})+\phi(\xi_{2})$ , where

[TABLE]

Let $\xi\in\left]-1,1\right[=\text{\rm dom}\,\phi^{\prime}=\text{\rm dom}\,(\operatorname{Id}+\phi^{\prime})=\text{\rm ran}\,\text{\rm prox}_{\phi}$ . Then $\xi+\phi^{\prime}(\xi)=\text{arctanh}(\xi)$ and therefore $\varrho=(\operatorname{Id}+\phi^{\prime})^{-1}=\text{tanh}$ . Consequently, we derive from [24, Example 2.13] that $(\forall x=(\xi_{1},\xi_{2})\in\mathbb{R}^{2})$ $\text{\rm prox}_{\varphi}x=\big{(}\text{tanh}(\xi_{1}),\text{tanh}(\xi_{2})\big{)}$ . Now set

[TABLE]

$R_{1}=\text{\rm prox}_{\varphi\circ U}=U\circ\text{\rm prox}_{\varphi}\circ U$ [25, Lemma 2.8], and $R_{2}=\operatorname{Id}$ . Then $\|W_{2}W_{1}\|\approx 54.72$ . If the input $x=(-3.4,2)$ is perturbed by $z=10^{-4}\times(1,\sqrt{3})$ , we get ${\|T(x+z)-Tx\|}/{\|z\|}\approx 58.18$ , which shows that, although $W_{1}$ and $W_{2}$ have strictly positive entries, the Lipschitz constant is larger than $\|W_{2}W_{1}\|$ . Note that, in this scenario, the constant of (4.4) is

[TABLE]

A sharper Lipschitz constant can be obtained by noticing that this network is equivalent to a network in which $W_{1}$ , $W_{2}$ , and $R_{1}$ are replaced by $W_{1}^{\prime}=UW_{1}$ , $W_{2}^{\prime}=W_{2}U$ , and $R_{1}^{\prime}=\text{\rm prox}_{\varphi}$ . Since $R_{1}^{\prime}$ is separable, the constant of (5.4) is $\vartheta_{2}\approx 59.54$ . In contrast, the naive bound of (1.4) is about $66.29$ .

For separable activators in finite-dimensional spaces, we have the following result, which does not require Assumption 5.8.

Proposition 5.12

Consider the setting of Model 1.1 with $m\geqslant 2$ . Suppose that the assumptions of Corollary 5.7 hold and that $\|\cdot\|_{{\mathcal{G}}_{0}}$ satisfies (5.12). In addition, assume that, for every $i\in\{0,\ldots,m\}$ , ${\mathcal{H}}_{i}=\mathbb{R}^{N_{i}}$ and $E_{i}$ is the canonical basis of $\mathbb{R}^{N_{i}}$ . For every $i\in\{1,\ldots,m\}$ , let $A_{i}$ denote the matrix obtained by taking the absolute values of the entries of the matrix $W_{i}$ . Then the Lipschitz constant $\vartheta_{m}$ of $T\colon{\mathcal{G}}_{0}\to{\mathcal{G}}_{m}$ in (5.8) satisfies $\vartheta_{m}\leqslant\|A_{m}\cdots A_{1}\|_{{\mathcal{G}}_{0},{\mathcal{G}}_{m}}$ .
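The bound of Proposition 5.12 can be checked on a toy example. The sketch assumes Euclidean norms on the input and output spaces, taken here as an instance of Corollary 5.7 with $p=2$ and unit weights, and it assumes that $\vartheta_{2}$ is the supremum of $\|W_{2}\circ\Lambda_{1}\circ W_{1}\|$ over diagonal $\Lambda_{1}$ with entries in $\{1-2\alpha_{1},1\}$; the random matrices are illustrative only.

```python
from itertools import product
import numpy as np

rng = np.random.default_rng(1)
w1 = rng.standard_normal((3, 2))
w2 = rng.standard_normal((2, 3))
alpha = 0.5

# Brute-force value of the separable constant for this two-layer toy network.
sep = max(np.linalg.norm(w2 @ np.diag(d) @ w1, 2)
          for d in product((1.0 - 2.0 * alpha, 1.0), repeat=3))

# Norm of the product of the entrywise absolute-value matrices A_2 A_1.
abs_bound = np.linalg.norm(np.abs(w2) @ np.abs(w1), 2)
```

Unlike the brute-force supremum, the right-hand side requires a single norm evaluation, which is what makes the bound attractive when the layers are wide.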

6 Conclusion

Using advanced tools from nonlinear analysis, we have derived sharp Lipschitz constants for layered network structures involving compositions of nonexpansive averaged operators and affine operators. This framework has been shown to model feed-forward neural networks having a chain graph structure. Extending these results to networks having a more general directed acyclic graph (DAG) structure would be of interest. Among the many avenues of future research that this work suggests, it would be interesting to exploit it to devise training strategies to achieve better robustness. The proposed nonexpansive operator machinery could also be used to design network architectures with smaller Lipschitz constants. Finally, computing local Lipschitz constants could be of interest in practice and constitutes an important topic of future research.

Appendix A Technical lemmas

Lemma A.1

[23, Proposition 2.4]

Let $R$ be a function defined from $\mathbb{R}$ to $\mathbb{R}$ . Then $R$ is the proximity operator of a function in $\Gamma_{0}(\mathbb{R})$ if and only if it is nonexpansive and increasing.

Lemma A.2

Let $q\in\mathbb{N}\smallsetminus\{0\}$ and, for every $i\in\{1,\ldots,q\}$ , let $S_{i}$ be a nonempty subset of a real vector space $\mathcal{X}_{i}$ . Let $\psi\colon\mathcal{X}_{1}\times\cdots\times\mathcal{X}_{q}\to\mathbb{R}$ be a function which is convex with respect to each of its $q$ coordinates. Set $\boldsymbol{S}=S_{1}\times\cdots\times S_{q}$ and let $\text{\rm conv}\,\boldsymbol{S}$ be its convex hull. Then $\sup\psi(\boldsymbol{S})=\sup\psi(\text{\rm conv}\,\boldsymbol{S})$ .

Proof. Set $\mu=\sup\psi(\boldsymbol{S})$ . Clearly, $\mu\leqslant\sup\psi(\text{\rm conv}\,\boldsymbol{S})$ . Now take $\boldsymbol{x}\in\text{\rm conv}\,\boldsymbol{S}$ . Then $\boldsymbol{x}=\sum_{j\in I}\alpha_{j}\boldsymbol{x}_{j}$ , where $(\alpha_{j})_{j\in I}$ is a finite family in $\left]0,1\right]$ such that $\sum_{j\in I}\alpha_{j}=1$ and, for every $j\in I$ , $\boldsymbol{x}_{j}=(x_{j,i})_{1\leqslant i\leqslant q}$ , with $(\forall i\in\{1,\ldots,q\})$ $x_{j,i}\in S_{i}$ . Note that $(\forall(j_{1},\ldots,j_{q})\in I^{q})$ $(x_{j_{1},1},\ldots,x_{j_{q},q})\in\boldsymbol{S}$ . Therefore,

[TABLE]

Hence, $\sup\psi(\text{\rm conv}\,\boldsymbol{S})=\sup_{\boldsymbol{x}\in\text{\rm conv}\,\boldsymbol{S}}\psi(\boldsymbol{x})\leqslant\mu$ .
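The lemma can be illustrated on the kind of function to which it is applied in the proof of Theorem 5.2. The sketch takes $\psi(\lambda)=\|W_{2}\operatorname{Diag}(\lambda)W_{1}\|$, which is convex in each coordinate of $\lambda$, and compares its maximum over the vertices of a box with its values at random points of the box. The matrices, the box, and the sample size are illustrative assumptions.

```python
from itertools import product
import numpy as np

rng = np.random.default_rng(2)
w1 = rng.standard_normal((3, 3))
w2 = rng.standard_normal((2, 3))
low = -0.4   # plays the role of 1 - 2*alpha with alpha = 0.7

def psi(lam):
    """Spectral norm of W2 Diag(lam) W1, convex in each coordinate of lam."""
    return np.linalg.norm(w2 @ np.diag(lam) @ w1, 2)

# Supremum over the product set S = {low, 1}^3 (the vertices of the box).
vertex_sup = max(psi(v) for v in product((low, 1.0), repeat=3))

# Values at random points of conv S = [low, 1]^3 never exceed the vertex supremum.
samples = rng.uniform(low, 1.0, size=(1000, 3))
interior_sup = max(psi(s) for s in samples)
```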

Lemma A.3

Let ${\mathcal{H}}$ be a separable real Hilbert space, let ${\varnothing}\neq\mathbb{K}\subset\mathbb{N}$ , let $E=(e_{k})_{k\in\mathbb{K}}$ be an orthonormal basis of ${\mathcal{H}}$ , and let $\alpha\in[0,1]$ . For every $k\in\mathbb{K}$ , let $\varrho_{k}\colon\mathbb{R}\to\mathbb{R}$ be $\alpha$ -averaged and such that $\varrho_{k}(0)=0$ . Define $R\colon{\mathcal{H}}\to{\mathcal{H}}\colon x\mapsto\sum_{k\in\mathbb{K}}\varrho_{k}({\left\langle{{x}\mid{e_{k}}}\right\rangle})e_{k}$ , and fix $x$ and $y$ in ${\mathcal{H}}$ . Then there exists $\Lambda\in\mathscr{D}_{[1-2\alpha,1]}({E})$ such that $Rx-Ry=\Lambda(x-y)$ .

Proof. We saw in Example 3.4 that $R$ is well defined. We have

[TABLE]

For every $k\in\mathbb{K}$ , there exists a nonexpansive $\theta_{k}\colon\mathbb{R}\to\mathbb{R}$ such that $\varrho_{k}=(1-\alpha)\operatorname{Id}+\alpha\theta_{k}$ and, therefore,

[TABLE]

Consequently, for every $k\in\mathbb{K}$ , there exists $\lambda_{k}\in[1-2\alpha,1]$ such that

[TABLE]

We deduce from (A.2) that $Rx-Ry=\sum_{k\in\mathbb{K}}\lambda_{k}({\left\langle{{x}\mid{e_{k}}}\right\rangle}-{\left\langle{{y}\mid{e_{k}}}\right\rangle})e_{k}$ , as claimed.
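The coefficients $\lambda_{k}$ of the lemma are difference quotients of the scalar activations, which can be observed numerically. The sketch uses the illustrative choice $\varrho=(1-\alpha)\operatorname{Id}+\alpha q$ with $q=-|\cdot|$ and $\alpha=0.8$, an assumption made only so that the lower end $1-2\alpha$ of the interval is attained.

```python
import numpy as np

alpha = 0.8

def rho(t):
    # (1 - alpha) * Id + alpha * q with the nonexpansive choice q(t) = -|t|,
    # so rho(0) = 0 and rho has slopes 1 - 2*alpha (t > 0) and 1 (t < 0).
    return (1.0 - alpha) * t - alpha * np.abs(t)

rng = np.random.default_rng(3)
x = rng.standard_normal(8)
y = rng.standard_normal(8)

# Coordinatewise difference quotients: the diagonal entries of Lambda.
lam = (rho(x) - rho(y)) / (x - y)
```

Each entry of `lam` lies in $[1-2\alpha,1]=[-0.6,1]$ and satisfies $\varrho(\xi_{k})-\varrho(\eta_{k})=\lambda_{k}(\xi_{k}-\eta_{k})$ by construction.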

Appendix B Proofs of main results

B.1 Proof of Proposition 3.1

(i): As seen in (1.2), there exists a nonexpansive operator $Q\colon{\mathcal{H}}\to{\mathcal{H}}$ such that $R=(1-\alpha)\operatorname{Id}+\alpha Q$ . However, by [8, Prop. 4.4 and Cor. 23.9], there exists a maximally monotone operator $A\colon{\mathcal{H}}\to 2^{{\mathcal{H}}}$ such that $Q=2J_{A}-\operatorname{Id}$ . Hence, $R=\operatorname{Id}+\lambda(J_{A}-\operatorname{Id})$ , where $\lambda=2\alpha\in\left[0,2\right]$ . For the last claim, notice that, since $J_{A}$ is firmly nonexpansive [8, Cor. 23.9], so is $R=(1-\lambda)\operatorname{Id}+\lambda J_{A}$ as a convex combination of two firmly nonexpansive operators [8, Exa. 4.7]. (ii) $\Rightarrow$ (i): It follows from [8, Cor. 22.23] that there exists $\phi\in\Gamma_{0}(\mathbb{R})$ such that $A=\partial\phi$ , which provides the expression for $R$ . The increasingness claim follows from Lemma A.1. Finally, if $\phi$ is even, then $\text{\rm prox}_{\phi}$ is odd [8, Prop. 24.10] and so is $R$ . (iii): This follows from Lemma A.1.

B.2 Proof of Example 3.3

Let $\sigma_{C}$ be the support function of $C$ and set $f=\sigma_{C}+\phi\circ\|\cdot\|\in\Gamma_{0}({\mathcal{H}})$ . Then it follows from [8, Prop. 24.30] and (3.14) that $R=\operatorname{Id}+\lambda(\text{\rm prox}_{f}-\operatorname{Id})$ . However, since $\text{\rm prox}_{f}$ is firmly nonexpansive, it is $1/2$ -averaged, which makes $R$ a $\lambda/2$ -averaged operator. Now consider the function $\phi$ of (3.10). Then it is an even function in $\Gamma_{0}(\mathbb{R})$ with $0$ as its unique minimizer. Next, set $\psi=|\cdot|-\arctan|\cdot|$ . As seen in Example 3.2 (ii), $\phi=\mu\psi^{*}(\cdot/\mu)-|\cdot|^{2}/2$ and $\text{\rm dom}\,\psi^{*}$ is bounded. Therefore $\text{\rm dom}\,\phi=\mu\text{\rm dom}\,\psi^{*}$ is bounded. In turn, $\phi$ is supercoercive and we derive from [8, Prop. 14.15] that $\text{\rm dom}\,\phi^{*}=\mathbb{R}$ . Hence, since $\phi=\phi^{**}$ is strictly convex, it follows from [8, Prop. 18.9] that $\phi^{*}$ is differentiable on $\mathbb{R}$ . In addition, $d_{C}=\|\cdot\|$ . Altogether, (3.14) reduces to

[TABLE]

and hence, in view of Example 3.2 (ii), to (3.15).

B.3 Proof of Example 3.4

Let $x\in{\mathcal{H}}$ and $y\in{\mathcal{H}}$ . It follows from the nonexpansiveness of the functions $(\varrho_{k})_{k\in\mathbb{K}}$ that

[TABLE]

Hence, $R$ is well defined. For every $k\in\mathbb{K}$ , by (1.2) there exists a nonexpansive function $\theta_{k}\colon\mathbb{R}\to\mathbb{R}$ such that $\varrho_{k}=(1-\alpha)\operatorname{Id}+\alpha\theta_{k}$ . Hence, $Rx=(1-\alpha)x+\alpha Qx$ , where $Qx=\sum_{k\in\mathbb{K}}\theta_{k}({\left\langle{{x}\mid{e_{k}}}\right\rangle})e_{k}$ . Therefore,

[TABLE]

This shows that $Q$ is nonexpansive and hence that $R$ is $\alpha$ -averaged.

B.4 Proof of Example 3.5

Let $S$ be the sorting operator of Example 3.7. Then

[TABLE]

where (B.4) follows from [32, Thm. 368]. This shows that $S$ is nonexpansive. Furthermore, $Q=2\text{\rm proj}_{C}-\operatorname{Id}$ is nonexpansive [8, Cor. 4.18]. Note that

[TABLE]

Since $((1-\omega)Q+2\omega S)/(1+\omega)$ is nonexpansive as a convex combination of nonexpansive operators, the operator $(1-\omega)\text{\rm proj}_{C}+\omega S$ is $(1+\omega)/2$ -averaged.
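The averagedness claim can be probed numerically by testing that the operator $((1-\omega)Q+2\omega S)/(1+\omega)$ does not expand distances. The sketch assumes that $C$ is the subspace of constant vectors, so that $\text{\rm proj}_{C}$ replaces every entry by the mean; the function names, the dimension, and the value of $\omega$ are illustrative.

```python
import numpy as np

def pooling_activation(x, omega):
    """R = (1 - omega) * proj_C + omega * S, with C assumed to be the constant vectors."""
    constant_part = np.full_like(x, x.mean())
    return (1.0 - omega) * constant_part + omega * np.sort(x)

def reflected(x, omega):
    """(R - (1 - alpha) * Id) / alpha with alpha = (1 + omega) / 2."""
    alpha = (1.0 + omega) / 2.0
    return (pooling_activation(x, omega) - (1.0 - alpha) * x) / alpha

rng = np.random.default_rng(4)
omega = 0.6
worst_ratio = 0.0
for _ in range(2000):
    x = rng.standard_normal(5)
    y = rng.standard_normal(5)
    ratio = (np.linalg.norm(reflected(x, omega) - reflected(y, omega))
             / np.linalg.norm(x - y))
    worst_ratio = max(worst_ratio, ratio)
```

A random search of this kind cannot prove nonexpansiveness, but a ratio above 1 would contradict the claimed constant $(1+\omega)/2$.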

B.5 Proof of Example 3.7

Set $A=\operatorname{Diag}(\tau_{1},\ldots,\tau_{N-1})$ . Let $x$ and $y$ be in $\mathbb{R}^{N-1}$ , and define $\widetilde{x}=[(Ax)^{\top},\theta]^{\top}$ and $\widetilde{y}=[(Ay)^{\top},\theta]^{\top}$ . As seen in (B.5), $S$ is nonexpansive. Consequently,

[TABLE]

This shows that $R$ is Lipschitzian with constant $\max\{|\tau_{1}|,\ldots,|\tau_{N-1}|\}<1$ . It is thus $\alpha$ -averaged with $\alpha=(1+\max\{|\tau_{1}|,\ldots,|\tau_{N-1}|\})/2$ [8, Prop. 4.38].

B.6 Proof of Theorem 4.2

For every $i\in\{1,\ldots,m\}$ , $P_{i}=R_{i}(\cdot+b_{i})$ is $\alpha_{i}$ -averaged and, therefore, there exists a nonexpansive operator $Q_{i}\colon{\mathcal{H}}_{i}\to{\mathcal{H}}_{i}$ such that $P_{i}=(1-\alpha_{i})\operatorname{Id}+\alpha_{i}Q_{i}$ . Since $T=P_{m}\circ W_{m}\circ\cdots\circ P_{1}\circ W_{1}$ and $P_{m}$ is nonexpansive, it suffices to show that

[TABLE]

Let us prove this result by induction. Let $x\in{\mathcal{H}}_{0}$ and $y\in{\mathcal{H}}_{0}$ . If $m=2$ , we derive from the nonexpansiveness of $Q_{1}$ that

[TABLE]

Hence, $T$ is Lipschitzian with constant

[TABLE]

Now assume that $m>2$ and that (B.8) holds at order $m-1$ . Then

[TABLE]

Hence, the nonexpansiveness of $Q_{m-1}$ yields

[TABLE]

On the other hand, the induction hypothesis yields

[TABLE]

Similarly, replacing $W_{m-1}$ by $W_{m}\circ W_{m-1}$ above, we get

[TABLE]

Combining the bound obtained from the nonexpansiveness of $Q_{m-1}$ with the two estimates derived from the induction hypothesis, we obtain

[TABLE]

Furthermore, we deduce from (4.3) that

[TABLE]

Therefore

[TABLE]

which implies that, if $m-1\not\in\{j_{1},\ldots,j_{k}\}$ , then $\beta_{m;\{j_{1},\ldots,j_{k},m-1\}}=\alpha_{m-1}\beta_{m-1;\{j_{1},\ldots,j_{k}\}}$ . Hence, (B.14) yields

[TABLE]

Thus, we obtain

[TABLE]

which establishes (B.8).

B.7 Proof of Proposition 4.3

Define $(\beta_{m;\mathbb{J}})_{\mathbb{J}\subset\{1,\ldots,m-1\}}$ as in (4.3). (i): For every $k\in\{1,\ldots,m-1\}$ and every $(j_{1},\ldots,j_{k})\in\mathbb{J}_{m,k}$ , (4.2) yields

[TABLE]

Consequently, it follows from (4.4) that

[TABLE]

In view of (4.3), $(\beta_{m;\mathbb{J}})_{\mathbb{J}\subset\{1,\ldots,m-1\}}$ is the discrete probability distribution of a vector of $m-1$ independent Bernoulli random variables. Hence, $\sum_{\mathbb{J}\subset\{1,\ldots,m-1\}}\beta_{m;\mathbb{J}}=1$ in (B.20). (ii): For every $i\in\{1,\ldots,m-1\}$ , $\alpha_{i}=0$ . Therefore, in view of (4.3),

[TABLE]

Hence, the result follows from (4.4). (iii): For every $i\in\{1,\ldots,m-1\}$ , $\alpha_{i}=1$ . Therefore, in view of (4.3),

[TABLE]

Invoking (4.4) allows us to conclude. (iv): For every $i\in\{1,\ldots,m-1\}$ , $\alpha_{i}=1/2$ . Hence, (4.3) yields $(\forall\mathbb{J}\subset\{1,\ldots,m-1\})$ $\beta_{m;\mathbb{J}}=2^{1-m}$ . Invoking once again (4.4) yields the result. (v): It follows from (4.2) that

[TABLE]

We decompose this expression into a sum of terms depending on the value $i$ taken by $j_{k}$ , namely,

[TABLE]

In addition, for every $(j_{1},\ldots,j_{k-1})\in\mathbb{J}_{i,k-1}$ , we derive from (4.3) that

[TABLE]

Using the above equality in (B.24), factoring out the common terms, and invoking (4.4) yields

[TABLE]

and we obtain (4.6).
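Two facts used in this proof, namely that the weights $(\beta_{m;\mathbb{J}})_{\mathbb{J}\subset\{1,\ldots,m-1\}}$ sum to 1 and that they all equal $2^{1-m}$ when every $\alpha_{i}$ is $1/2$, can be checked directly. The sketch assumes that (4.3) defines $\beta_{m;\mathbb{J}}$ as the product of $\alpha_{j}$ over $j\in\mathbb{J}$ and of $1-\alpha_{j}$ over the remaining indices, which is the Bernoulli reading given in (i); the helper name `beta` is hypothetical.

```python
from itertools import combinations
import numpy as np

def beta(alphas, subset):
    """Assumed form of (4.3): alpha_j for j in J, (1 - alpha_j) otherwise."""
    return float(np.prod([a if j in subset else 1.0 - a
                          for j, a in enumerate(alphas, start=1)]))

alphas = [0.5, 0.8, 0.3]                  # alpha_1, ..., alpha_{m-1} with m = 4
indices = range(1, len(alphas) + 1)

# Sum of the weights over all subsets J of {1, ..., m-1}.
total = sum(beta(alphas, set(js))
            for k in range(len(alphas) + 1)
            for js in combinations(indices, k))

# Firmly nonexpansive case: every weight equals 2^(1-m) = 1/8 for m = 4.
firm_weight = beta([0.5, 0.5, 0.5], {1, 3})
```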

B.8 Proof of Proposition 4.5

Let $l\in\{1,\ldots,m-1\}$ and set

[TABLE]

For every $k\in\{1,\ldots,m-1\}$ and every $(j_{1},\ldots,j_{k})\in\mathbb{J}_{m,k}$ , (4.2) yields

[TABLE]

We infer from (4.4) that

[TABLE]

In view of (B.28) we conclude that

[TABLE]

B.9 Proof of Theorem 5.2

(i): For every $i\in\{1,\ldots,m\}$ , set $P_{i}=R_{i}(\cdot+b_{i})$ and $(\forall k\in\mathbb{K}_{i})$ $\pi_{i,k}=$ $\varrho_{i,k}(\cdot+{\left\langle{{b_{i}}\mid{e_{i,k}}}\right\rangle})$ . Note that, for every $i\in\{1,\ldots,m\}$ and every $k\in\mathbb{K}_{i}$ , $\pi_{i,k}$ is $\alpha_{i}$ -averaged. Furthermore, $(\forall i\in\{1,\ldots,m-1\})(\forall x\in{\mathcal{H}}_{i})$ $P_{i}x=\sum_{k\in\mathbb{K}_{i}}\pi_{i,k}({\left\langle{{x}\mid{e_{i,k}}}\right\rangle})e_{i,k}$ . Now fix $x$ and $y$ in ${\mathcal{H}}_{0}$ . It follows from (1.3) and the nonexpansiveness of $P_{m}$ that

[TABLE]

In view of Lemma A.3, for every $i\in\{1,\ldots,m-1\}$ , there exists $\Lambda_{i}\in\mathscr{D}_{[1-2\alpha_{i},1]}(E_{i})$ such that

[TABLE]

Recursive application of this identity yields

[TABLE]

This implies that $\|Tx-Ty\|\leqslant\|W_{m}\circ\Lambda_{m-1}\circ\cdots\circ\Lambda_{1}\circ W_{1}\|\,\|x-y\|$ . Thus,

[TABLE]

is a Lipschitz constant of $T$ . Set $S=\{1-2\alpha_{1},1\}^{\mathbb{K}_{1}}\times\cdots\times\{1-2\alpha_{m-1},1\}^{\mathbb{K}_{m-1}}$ and $C=[1-2\alpha_{1},1]^{\mathbb{K}_{1}}\times\cdots\times[1-2\alpha_{m-1},1]^{\mathbb{K}_{m-1}}$ . For every $i\in\{1,\ldots,m-1\}$ , $\Lambda_{i}\colon{\mathcal{H}}_{i}\to{\mathcal{H}}_{i}$ is generated from a sequence $(\lambda_{i,k})_{k\in\mathbb{K}_{i}}$ in $[1-2\alpha_{i},1]$ via the construction of (5.1). The function

[TABLE]

is convex with respect to each of its coordinates. Hence, we deduce from Lemma A.2 that $\sup\psi(C)=\sup\psi(\text{\rm conv}\,S)=\sup\psi(S)$ , as claimed.

(ii): For every $i\in\{1,\ldots,m-1\}$ , the identity operator $\operatorname{Id}_{i}$ of ${\mathcal{H}}_{i}$ lies in $\mathscr{D}_{\{1-2\alpha_{i},1\}}(E_{i})$ . Hence, $\vartheta_{m}\geqslant\|W_{m}\circ\operatorname{Id}_{m-1}\circ\cdots\circ\operatorname{Id}_{1}\circ W_{1}\|=\|W_{m}\circ\cdots\circ W_{1}\|$ . For every $i\in\{1,\ldots,m-1\}$ , let $\Lambda_{i}\in\mathscr{D}_{\{1-2\alpha_{i},1\}}(E_{i})$ and note that the linear operator

[TABLE]

is nonexpansive. Using the same kind of decomposition as in the proof of Theorem 4.2 yields

[TABLE]

and allows us to conclude that $\vartheta_{m}\leqslant\theta_{m}$ .

B.10 Proof of Proposition 5.5

It follows from (B.34) that

[TABLE]

B.11 Proof of Proposition 5.6

Let us first note that, because of the embeddings, $W_{1}\colon{\mathcal{G}}_{0}\to{\mathcal{H}}_{1}$ is continuous and, likewise, every $\Lambda_{m}\in\mathscr{D}_{[1-2\alpha_{m},1]}(E_{m})$ is continuous from ${\mathcal{H}}_{m}$ to ${\mathcal{G}}_{m}$ . Hence, for every $(\Lambda_{i})_{1\leqslant i\leqslant m}\in\mathscr{D}_{[1-2\alpha_{1},1]}(E_{1})\times\cdots\times\mathscr{D}_{[1-2\alpha_{m},1]}(E_{m})$ , $\Lambda_{m}\circ W_{m}\circ\cdots\circ\Lambda_{1}\circ W_{1}\colon{\mathcal{G}}_{0}\to{\mathcal{G}}_{m}$ is continuous. We now follow the same argument as in the proof of Theorem 5.2. Let $x$ and $y$ be in ${\mathcal{G}}_{0}$ . For every $i\in\{1,\ldots,m\}$ , there exists $\Lambda_{i}\in\mathscr{D}_{[1-2\alpha_{i},1]}(E_{i})$ such that $Tx-Ty=(\Lambda_{m}\circ W_{m}\circ\Lambda_{m-1}\circ\cdots\circ\Lambda_{1}\circ W_{1})(x-y)$ . Thus, $\|Tx-Ty\|_{{\mathcal{G}}_{m}}\leqslant\|\Lambda_{m}\circ W_{m}\circ\Lambda_{m-1}\circ\cdots\circ\Lambda_{1}\circ W_{1}\|_{{\mathcal{G}}_{0},{\mathcal{G}}_{m}}\,\|x-y\|_{{\mathcal{G}}_{0}}$ , which leads to (5.6).

B.12 Proof of Corollary 5.7

Since, for every $x\in{\mathcal{H}}_{m}$ , $({\left\langle{{x}\mid{e_{m,k}}}\right\rangle})_{k\in\mathbb{K}_{m}}\in\ell^{2}(\mathbb{K}_{m})$ , it follows from Hölder’s inequality that $\|\cdot\|_{{\mathcal{G}}_{m}}$ in (5.7) is well defined and does provide a continuous embedding of ${\mathcal{H}}_{m}$ in ${\mathcal{G}}_{m}$ . As in the proof of Theorem 5.2, it is enough to take the supremum in (5.8) over $\boldsymbol{D}=\mathscr{D}_{[1-2\alpha_{1},1]}(E_{1})\times\cdots\times\mathscr{D}_{[1-2\alpha_{m-1},1]}(E_{m-1})$ . For every $i\in\{1,\ldots,m\}$ , let $\Lambda_{i}\in\mathscr{D}_{[1-2\alpha_{i},1]}(E_{i})$ . Then

[TABLE]

Let us designate by $(\lambda_{m,k})_{k\in\mathbb{K}_{m}}$ the sequence in $[1-2\alpha_{m},1]$ involved in the construction of $\Lambda_{m}$ in (5.1). If $p<{+\infty}$ , then

[TABLE]

which shows that $\|\Lambda_{m}\|_{{\mathcal{G}}_{m},{\mathcal{G}}_{m}}\leqslant 1$ . This inequality holds analogously if $p={+\infty}$ . We then deduce from (B.38) that $\vartheta_{m}\leqslant\sup_{(\Lambda_{1},\ldots,\Lambda_{m-1})\in\boldsymbol{D}}\|W_{m}\circ\Lambda_{m-1}\circ\cdots\circ\Lambda_{1}\circ W_{1}\|_{{\mathcal{G}}_{0},{\mathcal{G}}_{m}}$ . On the other hand, it follows from (5.6) that

[TABLE]

which concludes the proof.

B.13 Proof of Proposition 5.10

For every $i\in\{1,\ldots,m-1\}$ , let $\Lambda_{i}\in\mathscr{D}_{\{1-2\alpha_{i},1\}}(E_{i})$ and let $(\lambda_{i,k})_{k\in\mathbb{K}_{i}}$ be the associated sequence in (5.1). Define

[TABLE]

and set $\Lambda_{m}\colon{\mathcal{H}}_{m}\to{\mathcal{H}}_{m}\colon x\mapsto\sum_{k\in\mathbb{K}_{m}}\lambda_{m,k}{\left\langle{{x}\mid{e_{m,k}}}\right\rangle}e_{m,k}$ and $V_{m}=\Lambda_{m}W_{m}$ . Then, by (5.10),

[TABLE]

In addition, it follows from (5.7) and (B.41) that

[TABLE]

Therefore, without loss of generality, we assume that

[TABLE]

Let us now show that

[TABLE]

Let $\varepsilon\in\left]0,+\infty\right[$ . Then there exists $x\in{\mathcal{H}}_{0}$ such that $\|x\|_{{\mathcal{G}}_{0}}=1$ and

[TABLE]

If $p<+\infty$ in (5.7), this yields

[TABLE]

On the other hand,

[TABLE]

which, in view of (5.1), implies that

[TABLE]

Using (5.9) recursively yields

[TABLE]

We then deduce from (B.44) that

[TABLE]

Set $y=\sum_{k_{0}\in\mathbb{K}_{0}}\left|\left\langle{{x}\mid{e_{0,k_{0}}}}\right\rangle\right|e_{0,k_{0}}$ . In view of (5.12), $\|y\|_{{\mathcal{G}}_{0}}=\|x\|_{{\mathcal{G}}_{0}}=1$ . Thus, (B.13) yields

[TABLE]

It then follows from (B.46) and the fact that $\|y\|_{{\mathcal{G}}_{0}}=1$ that

[TABLE]

The same inequality is obtained similarly for $p=+\infty$ . This establishes (B.45), which leads to

[TABLE]

Since the converse inequality holds straightforwardly, the proof is complete.

B.14 Proof of Proposition 5.12

We use arguments similar to those of the proof of Proposition 5.10. For every $i\in\{1,\ldots,m-1\}$ , let $\Lambda_{i}\in\mathscr{D}_{\{1-2\alpha_{i},1\}}(E_{i})$ . There exists $x\in{\mathcal{H}}_{0}$ such that $\|x\|_{{\mathcal{G}}_{0}}=1$ and

[TABLE]

On the other hand, for every $k_{m}\in\mathbb{K}_{m}$ ,

[TABLE]

Setting $y=\sum_{k_{0}\in\mathbb{K}_{0}}\left|\left\langle{{x}\mid{e_{0,k_{0}}}}\right\rangle\right|e_{0,k_{0}}$ yields $\big{|}{\left\langle{{W_{m}\Lambda_{m-1}\cdots\Lambda_{1}W_{1}x}\mid{e_{m,k_{m}}}}\right\rangle}\big{|}\leqslant{\left\langle{{(A_{m}\cdots A_{1})y}\mid{e_{m,k_{m}}}}\right\rangle}$ , and (B.54) implies that $\|W_{m}\Lambda_{m-1}\cdots\Lambda_{1}W_{1}\|_{{\mathcal{G}}_{0},{\mathcal{G}}_{m}}\leqslant\|A_{m}\cdots A_{1}y\|_{{\mathcal{G}}_{m}}\leqslant\|A_{m}\cdots A_{1}\|_{{\mathcal{G}}_{0},{\mathcal{G}}_{m}}$ , which concludes the proof.

Bibliography

[1] N. Akhtar and A. Mian, Threat of adversarial attacks on deep learning in computer vision: A survey, IEEE Access, vol. 6, pp. 14410–14430, 2018.
[2] C. H. Aladag, E. Egrioglu, and U. Yolcu, Robust multilayer neural network based on median neuron model, Neural Comput. Appl., vol. 24, pp. 945–956, 2014.
[3] A. Athalye, N. Carlini, and D. Wagner, Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples, Proc. Intl. Conf. Machine Learn., pp. 274–283, 2018.
[4] J.-B. Baillon, R. E. Bruck, and S. Reich, On the asymptotic behavior of nonexpansive mappings and semigroups in Banach spaces, Houston J. Math., vol. 4, pp. 1–9, 1978.
[5] R. Balan, M. Singh, and D. Zou, Lipschitz properties for deep convolutional networks, 2017. https://arxiv.org/abs/1701.05217.pdf
[6] P. L. Bartlett, D. J. Foster, and M. J. Telgarsky, Spectrally-normalized margin bounds for neural networks, Adv. Neural Inform. Process. Syst., vol. 30, pp. 6240–6249, 2017.
[7] M. Basirat and P. M. Roth, The quest for the golden activation function, arXiv, 2018. https://arxiv.org/pdf/1808.00783
[8] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd ed., corrected reprint. Springer, New York, 2019.