Lipschitz Certificates for Layered Network Structures Driven by Averaged Activation Operators
Patrick L. Combettes, Jean-Christophe Pesquet

TL;DR
This paper develops a method to compute tight Lipschitz constants for layered neural networks using averaged operators, improving robustness assessment by capturing layer interactions more accurately.
Contribution
It introduces a novel framework for deriving sharp Lipschitz bounds for layered networks with averaged operators, surpassing traditional product-based estimates.
Findings
Tighter Lipschitz constants than traditional bounds.
Applicable to standard convolutional neural networks.
Enhanced robustness evaluation for neural network models.
Abstract
Obtaining sharp Lipschitz constants for feed-forward neural networks is essential to assess their robustness in the face of perturbations of their inputs. We derive such constants in the context of a general layered network model involving compositions of nonexpansive averaged operators and affine operators. By exploiting this architecture, our analysis finely captures the interactions between the layers, yielding tighter Lipschitz constants than those resulting from the product of individual bounds for groups of layers. The proposed framework is shown to cover in particular many practical instances encountered in feed-forward neural networks. Our Lipschitz constant estimates are further improved in the case of structures employing scalar nonlinear functions, which include standard convolutional networks as special cases.
Patrick L. Combettes¹ and Jean-Christophe Pesquet²
¹ North Carolina State University, Department of Mathematics, Raleigh, NC 27695-8205, USA
² CentraleSupélec, Inria, Université Paris-Saclay, Center for Visual Computing, 91190 Gif sur Yvette, France
Contact author: P. L. Combettes, [email protected], phone: +1 919 515 2671. The work of P. L. Combettes was supported by the National Science Foundation under grant CCF-1715671. The work of J.-C. Pesquet was supported by Institut Universitaire de France.
1 Introduction
Artificial neural networks are becoming increasingly central tools in tasks such as learning, modeling, data processing, and decision making. As first noted in [52], neural networks are vulnerable to adversarial examples which, though close to other data inputs, lead to very different outputs. This potential lack of stability makes the networks vulnerable and unreliable in key application areas; see, for instance, [1, 30, 35] and the references therein. To protect networks against such instabilities various techniques have been explored [39, 43, 44, 54]. Although these defense strategies may be effective in certain scenarios, they do not provide formal guarantees of robustness for general networks and they have been shown to be breakable by new attacks; see, for instance, [3, 18].
It has been acknowledged for some time that the Lipschitz behavior of a network plays a key role in the analysis of its robustness [52]. Simply put, if a layered network is modeled by an operator $T$ acting between normed spaces, with Lipschitz constant $\theta$, then, given an input $x$ and a perturbation $z$, we can majorize the perturbation on the output via the inequality
$$\|T(x+z)-Tx\|\leqslant\theta\|z\|. \tag{1.1}$$
Thus, $\theta$ can be used as a certificate of robustness of the network provided that it is tightly estimated. Lipschitz regularity is also an important ingredient in the derivation of generalization bounds and approximation bounds [6, 11, 50], and of reachability conditions [47]. In [52] the estimation of $\theta$ is performed by evaluating the Lipschitz constant of the layers individually and then defining $\theta$ as the product of these constants, which typically yields pessimistic bounds. Lipschitz constants have also been computed for specific situations, e.g., [5, 33, 49, 53]. Overall, however, deriving analytically accurate constants for general contexts remains an open problem.
The objective of the present paper is to address this question for a general class of layered networks. Mathematically, our network model is described as an alternation of affine and nonlinear operators. This type of structure also arises in variational and equilibrium problems, as well as in network science, e.g., [16, 24, 27, 56]. Adopting the same terminology as in the neural network literature, where they model the activity of neurons, the nonlinear operators will be called activation operators. Our stability analysis focuses on the following $m$-layer model, in which the activation operators are averaged nonexpansive operators (see Fig. 1). Recall that an operator $R$ acting on a Hilbert space $\mathcal{H}$ is $\alpha$-averaged for some $\alpha\in[0,1]$ if there exists a nonexpansive (i.e., $1$-Lipschitzian) operator $Q\colon\mathcal{H}\to\mathcal{H}$ such that
$$R=(1-\alpha)\,\mathrm{Id}+\alpha Q. \tag{1.2}$$
In other words, $R$ is an underrelaxation of a nonexpansive operator (see [8] for a detailed account). This class of operators was introduced in [4] and shown in [21] to model various problems in nonlinear analysis as it includes common operators such as projection operators, proximity operators, resolvents of monotone operators, reflection operators, gradient step operators, and various combinations thereof. Recent theoretical developments and applications to data science include [9, 10, 12, 13, 15, 22, 26, 34, 41, 51, 55, 56].
Model 1.1
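Let $m\geqslant 1$ be an integer and let $(\mathcal{H}_i)_{0\leqslant i\leqslant m}$ be nonzero real Hilbert spaces. For every $i\in\{1,\ldots,m\}$, let $W_i\colon\mathcal{H}_{i-1}\to\mathcal{H}_i$ be a bounded linear operator, let $b_i\in\mathcal{H}_i$, let $\alpha_i\in[0,1]$, and let $R_i\colon\mathcal{H}_i\to\mathcal{H}_i$ be an $\alpha_i$-averaged operator. Set
$$T=T_m\circ\cdots\circ T_1,\quad\text{where}\quad(\forall i\in\{1,\ldots,m\})\quad T_i\colon\mathcal{H}_{i-1}\to\mathcal{H}_i\colon x\mapsto R_i(W_ix+b_i). \tag{1.3}$$
Since the operators $(R_i)_{1\leqslant i\leqslant m}$ are nonexpansive, a Lipschitz constant for $T$ in (1.3) is
$$\prod_{i=1}^{m}\|W_i\|. \tag{1.4}$$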
However, as already mentioned, this constant is usually quite loose and of limited use to assess the actual stability of the network. A novelty of our approach is to take into account the averagedness properties of the individual activation operators to capture more sharply the overall interactions between the layers, yielding tighter constants than those provided by computing bounds for groups of layers. Our specific contributions are the following:
- We show that the most common activation operators used in neural networks are averaged operators. This not only provides an a posteriori justification for Model 1.1, but also indicates that this highly structured framework should be of interest in the analysis of other properties of layered networks beyond stability.
- We derive a general expression for a Lipschitz constant of $T$ in terms of the averagedness constants of the activation operators and the norms of certain compositions of the linear operators $(W_i)_{1\leqslant i\leqslant m}$. This Lipschitz constant is shown to lie between the simple upper bound (1.4) and the lower bound corresponding to a purely linear network. Our analysis applies to any type of linear operator, in particular convolutive ones, and it requires no additional assumptions on the activation operators. Notably, differentiability is not assumed, so our results cover networks using the rectified linear unit (ReLU) and max-pooling operations.
- In the common situation when the activation operators are separable, we obtain tighter Lipschitz constants for various norms.
- Under some positivity condition, we prove that a Lipschitz constant of the network reduces to that of the associated purely linear network obtained by removing the nonlinear operators.
In [24], we investigated the special case of Model 1.1 in which the activation operators are proximity operators, hence $1/2$-averaged (see Section 3.1). The objective there was to study the asymptotic behavior of deep network structures rather than their stability.
The remainder of the paper is organized as follows. In Section 2 we present an illustration of our main result in a simple special case. In Section 3.1 we provide the necessary nonlinear analysis background. In Section 3.2 we show that a wide array of activation operators used in neural networks are indeed nonexpansive. In Section 4 we derive general results concerning Lipschitz constants for Model 1.1. Section 5 refines this analysis in the case of separable activation operators.
2 Preview of the main results in a simple scenario
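We illustrate on a simple instance the main results of the paper. More precisely, we consider a three-layer ($m=3$) network where, for every $i\in\{0,1,2,3\}$, $\mathcal{H}_i$ is the standard Euclidean space $\mathbb{R}^{N_i}$. In this case, each linear operator $W_i$ is identified with a matrix in $\mathbb{R}^{N_i\times N_{i-1}}$. To further simplify our setting, we assume that the operators $R_1$, $R_2$, and $R_3$ correspond to ReLU layers, that is, for each $i\in\{1,2,3\}$,
$$R_i\colon(\xi_k)_{1\leqslant k\leqslant N_i}\mapsto\big(\rho(\xi_k)\big)_{1\leqslant k\leqslant N_i},\quad\text{where}\quad\rho\colon\xi\mapsto\max\{\xi,0\}. \tag{2.1}$$
In view of (1.2), $\rho$ is $1/2$-averaged since $2\rho-\mathrm{Id}=|\cdot|$ has Lipschitz constant 1. This implies that the operators $R_1$, $R_2$, and $R_3$ are also $1/2$-averaged [24]. Let us now introduce two parameters which will play a central role in our analysis, namely,
$$\theta_3=\frac{1}{4}\Big(\|W_3W_2W_1\|+\|W_3W_2\|\,\|W_1\|+\|W_3\|\,\|W_2W_1\|+\|W_3\|\,\|W_2\|\,\|W_1\|\Big) \tag{2.2}$$
and
$$\vartheta_3=\sup_{\Lambda_1\in\mathscr{D}_1,\;\Lambda_2\in\mathscr{D}_2}\|W_3\Lambda_2W_2\Lambda_1W_1\|, \tag{2.3}$$
where $\|\cdot\|$ is the spectral norm and, for each $i\in\{1,2\}$, $\mathscr{D}_i$ denotes the set of $N_i\times N_i$ diagonal matrices with entries in $\{0,1\}$. In this context, our main result states that both $\theta_3$ and $\vartheta_3$ are Lipschitz constants of the network, and that
$$\|W_3W_2W_1\|\leqslant\vartheta_3\leqslant\theta_3\leqslant\|W_3\|\,\|W_2\|\,\|W_1\|. \tag{2.4}$$
In addition, if the entries of the matrices $(W_i)_{1\leqslant i\leqslant 3}$ are in $[0,+\infty[$, then a Lipschitz constant of the network is $\|W_3W_2W_1\|$.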
Example 2.1
To illustrate the improvement of the proposed bound over the classical product norm estimate, we consider a fully connected network with , , , and . The entries of the matrices are generated randomly and independently according to a normal distribution. We evaluate the Lipschitz constant estimate provided by (2.2) and the lower bound in (2.4). The average (resp. minimal) value of computed over 1000 realizations is approximately equal to (resp. ), while the average (resp. minimal) value of is approximately equal to (resp. ). In addition, the average (resp. minimal) value of computed over 1000 realizations is approximately equal to (resp. ). In agreement with (2.4), this estimation of the Lipschitz constant is better than and significantly sharper than .
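The comparison carried out in this example can be sketched numerically. The block below is only an illustration under stated assumptions: it takes ReLU ($1/2$-averaged) activations, assumes that the averaged bound has the form $\frac14(\|W_3W_2W_1\|+\|W_3W_2\|\|W_1\|+\|W_3\|\|W_2W_1\|+\|W_3\|\|W_2\|\|W_1\|)$, and evaluates the separable bound by brute-force enumeration of diagonal $0/1$ matrices. The function name `three_layer_bounds` and the layer sizes are hypothetical choices, not the sizes used in the paper.

```python
import itertools
import numpy as np


def spectral_norm(A):
    """Largest singular value of the matrix A."""
    return np.linalg.norm(A, 2)


def three_layer_bounds(W1, W2, W3):
    """Return (linear, vartheta3, theta3, product) for a 3-layer ReLU network.

    linear    : ||W3 W2 W1||, the constant of the purely linear network
    vartheta3 : sup over 0/1 diagonal matrices of ||W3 L2 W2 L1 W1||
    theta3    : averaged bound for 1/2-averaged activations (assumed form)
    product   : ||W3|| ||W2|| ||W1||, the classical product bound
    """
    n = spectral_norm
    linear = n(W3 @ W2 @ W1)
    product = n(W3) * n(W2) * n(W1)
    theta3 = 0.25 * (linear + n(W3 @ W2) * n(W1) + n(W3) * n(W2 @ W1) + product)

    N1, N2 = W1.shape[0], W2.shape[0]
    vartheta3 = 0.0
    # Brute force: 2**(N1 + N2) diagonal patterns, so keep N1 and N2 small.
    for d1 in itertools.product((0.0, 1.0), repeat=N1):
        inner = W2 @ (np.array(d1)[:, None] * W1)  # W2 L1 W1
        for d2 in itertools.product((0.0, 1.0), repeat=N2):
            candidate = n(W3 @ (np.array(d2)[:, None] * inner))  # W3 L2 W2 L1 W1
            vartheta3 = max(vartheta3, candidate)
    return linear, vartheta3, theta3, product


# Small random instance with normally distributed entries (sizes are illustrative).
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((4, 4))
W3 = rng.standard_normal((2, 4))
linear, vartheta3, theta3, product = three_layer_bounds(W1, W2, W3)
print(f"linear={linear:.3f} vartheta3={vartheta3:.3f} "
      f"theta3={theta3:.3f} product={product:.3f}")
```

The enumeration cost grows exponentially with the hidden layer widths, so this sketch is only meant for checking the ordering of the four quantities on toy instances.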
In the remainder of this paper, we show that the above results hold in a much more general context (for an arbitrary number of layers , arbitrary Hilbert spaces, and a wide class of activation operators), and that some of them can be extended to non-Euclidean norms. To establish these results, we need to introduce suitable mathematical tools in the next section.
3 Nonexpansive averaged activation operators
3.1 Nonlinear analysis tools and notation
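We review some key facts and definitions which will be used subsequently; see [8] for further information. Throughout, $\mathcal{H}$ is a real Hilbert space with power set $2^{\mathcal{H}}$, scalar product $\langle\cdot\mid\cdot\rangle$, and associated norm $\|\cdot\|$.
Let $R\colon\mathcal{H}\to\mathcal{H}$ be an operator and let $\alpha\in[0,1]$. Then $R$ is nonexpansive if it is $1$-Lipschitzian, $\alpha$-averaged if there exists a nonexpansive operator $Q\colon\mathcal{H}\to\mathcal{H}$ such that $R=(1-\alpha)\,\mathrm{Id}+\alpha Q$, and firmly nonexpansive if it is $1/2$-averaged. Let $A\colon\mathcal{H}\to 2^{\mathcal{H}}$ be a set-valued operator. We denote by $\operatorname{gra}A=\{(x,u)\in\mathcal{H}\times\mathcal{H}\mid u\in Ax\}$ the graph of $A$ and by $A^{-1}$ the inverse of $A$, i.e., the operator with graph $\{(u,x)\in\mathcal{H}\times\mathcal{H}\mid u\in Ax\}$. In addition, $A$ is monotone if
$$(\forall(x,u)\in\operatorname{gra}A)(\forall(y,v)\in\operatorname{gra}A)\quad\langle x-y\mid u-v\rangle\geqslant 0, \tag{3.1}$$
and maximally monotone if there exists no monotone operator $B\colon\mathcal{H}\to 2^{\mathcal{H}}$ such that $\operatorname{gra}A\subset\operatorname{gra}B\neq\operatorname{gra}A$. If $A$ is maximally monotone, then its resolvent $J_A=(\mathrm{Id}+A)^{-1}$ is firmly nonexpansive. We denote by $\Gamma_0(\mathcal{H})$ the class of proper lower semicontinuous convex functions from $\mathcal{H}$ to $]-\infty,+\infty]$. Let $f\in\Gamma_0(\mathcal{H})$. The conjugate of $f$ is
$$f^*\colon\mathcal{H}\to\,]-\infty,+\infty]\colon u\mapsto\sup_{x\in\mathcal{H}}\big(\langle x\mid u\rangle-f(x)\big) \tag{3.2}$$
and the subdifferential of $f$ is the maximally monotone operator
$$\partial f\colon\mathcal{H}\to 2^{\mathcal{H}}\colon x\mapsto\big\{u\in\mathcal{H}\mid(\forall y\in\mathcal{H})\;\langle y-x\mid u\rangle+f(x)\leqslant f(y)\big\}. \tag{3.3}$$
For every $x\in\mathcal{H}$, the unique minimizer of $f+\|x-\cdot\|^2/2$ is denoted by $\operatorname{prox}_fx$. We have $\operatorname{prox}_f=J_{\partial f}$, and $\operatorname{prox}_f$ is therefore firmly nonexpansive.
Let $C$ be a nonempty convex subset of $\mathcal{H}$. Then $\iota_C$ is the indicator function of $C$ (it takes values $0$ on $C$ and $+\infty$ on its complement) and $d_C$ is its distance function. If $C$ is closed, its projection operator is $\operatorname{proj}_C=\operatorname{prox}_{\iota_C}$.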
3.2 Activators as averaged operators
We show via various illustrations that the assumption made in Model 1.1 on the activation operators covers many existing instances of feed-forward neural networks. Let us start with some key properties.
Proposition 3.1
Let be a real Hilbert space, let , and let be -averaged. Then the following hold:
- (i)
There exist a maximally monotone operator and a constant such that . Furthermore, if , then is firmly nonexpansive.
- (ii)
Suppose that . Then there exist a function and a constant such that . Furthermore, is increasing if and is odd if is even.
- (iii)
Suppose that and that is increasing. Then there exists such that .
Next, we illustrate the pervasiveness of nonexpansive averaged activation operators in practice, starting with activation operators on the real line.
Example 3.2
Proposition 3.1(ii) states that activation functions on the real line can be expressed in the generic form
[TABLE]
Here are a few explicit instantiations of this proximal representation.
- (i)
If , we obtain the class of proximal activation functions discussed in [24] and which was seen there to include standard instances such as the unimodal sigmoid activation function [24, Example 2.13], the saturated linear activation function [24, Example 2.5], the ReLU activation function [24, Example 2.6], the inverse square root unit activation function [24, Example 2.9], the hyperbolic tangent activation function [24, Example 2.12], and the Elliot activation function [24, Example 2.15]. Additional examples in this category are the following. Given , the capped ReLU activation function [36] is
[TABLE]
and, for , the exponential linear unit (ELU) function [20] is
[TABLE]
It follows from [8, Cor. 24.5, Prop. 24.32, and Exa. 13.2(v)] that , where
[TABLE]
The softplus activation function [29] is also a proximity operator since it is nonexpansive and increasing (see Proposition 3.1(iii)).
- (ii)
The Geman–McClure function [28]
[TABLE]
will be employed in Example 3.3. Set . Then is nonexpansive and . The conjugate of is 1-strongly convex and given by , where
[TABLE]
It follows from [8, Cor. 24.5] that with (see Fig. 2)
[TABLE]
- (iii)
Take . Then we obtain the leaky ReLU activation function [38] for , the ReLU activation function for , and the absolute value activation function [17] for .
- (iv)
The use of nonmonotonic activation functions has been advocated in various papers. They turn out to be -averaged (alternatively, in view of Proposition 3.1(ii), they are of the form (3.4) with ). To compute the averagedness constant of a nonexpansive operator , one can proceed as follows. According to (1.2), we must find the smallest such that remains nonexpansive. This means that the supremum of the modulus of the one-sided derivatives (the derivatives if they exist) over should be one. Thus, we obtain for the sine activation function [42], as well as for the absolute value function [17] and the mirrored ReLU activation function [58]
[TABLE]
for the swish activation function [45]
[TABLE]
for the exponential linear squashing (ELiSH) function [7]
[TABLE]
and for the Gaussian activation function [40].
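The recipe of item (iv) can be sketched numerically for scalar activation functions. The following is a minimal sketch, assuming that the relevant operator is $Q=\mathrm{Id}+(R-\mathrm{Id})/\alpha$ as in (1.2): its slopes lie in $[-1,1]$ exactly when the slopes of $R$ lie in $[1-2\alpha,1]$, so the smallest admissible constant is $\alpha=(1-\inf R')/2$ whenever $R$ is nonexpansive. The helper name `averagedness_constant`, the grid, and the tolerance are hypothetical choices, and the result is only a grid-based estimate.

```python
import numpy as np


def averagedness_constant(rho, lo=-20.0, hi=20.0, num=400001, tol=1e-6):
    """Grid estimate of the smallest alpha such that rho is alpha-averaged.

    Finite-difference slopes between consecutive grid points stand in for the
    one-sided derivatives of rho. The estimate is (1 - min slope) / 2, which
    is valid only if rho is nonexpansive (all slopes in [-1, 1]).
    """
    xi = np.linspace(lo, hi, num)
    slopes = np.diff(rho(xi)) / np.diff(xi)
    if slopes.max() > 1.0 + tol or slopes.min() < -1.0 - tol:
        raise ValueError("rho does not look nonexpansive on the grid")
    return max(0.0, (1.0 - slopes.min()) / 2.0)


alpha_relu = averagedness_constant(lambda x: np.maximum(x, 0.0))
alpha_leaky = averagedness_constant(lambda x: np.where(x > 0, x, 0.1 * x))
alpha_tanh = averagedness_constant(np.tanh)
alpha_sine = averagedness_constant(np.sin)
alpha_abs = averagedness_constant(np.abs)
print(alpha_relu, alpha_leaky, alpha_tanh, alpha_sine, alpha_abs)
```

On this grid the monotone examples (ReLU, hyperbolic tangent) give values close to $1/2$, while the nonmonotonic ones (sine, absolute value) give values close to $1$, in line with the discussion above.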
Next, we describe a technique for lifting a proximal activation operator from the real line to a Hilbert space.
Example 3.3
Let be a real Hilbert space, let , let be a nonempty closed convex subset of , let be an even function such that is differentiable on with $0$ as its unique minimizer. Set
[TABLE]
Then is -averaged. In particular, set , , and define as in (3.10). Then we infer that the squashing function
[TABLE]
used in capsule networks [48] is a proximal activation operator.
Another construction that builds on activation functions on the real line is the following, which is reminiscent of the original multilayer perceptrons [46].
Example 3.4
Let be a separable real Hilbert space, let , let be an orthonormal basis of , and let . For every , let be -averaged and such that . Define . Then is -averaged.
Example 3.5
Let be a strictly positive integer, let , and let be a nonempty closed convex subset of . Set
[TABLE]
where denotes the vector obtained by sorting the components of in ascending order. Then is -averaged.
Remark 3.6
Set $C=\big\{(\xi_k)_{1\leqslant k\leqslant N}\in\mathbb{R}^N \mid \xi_1=\cdots=\xi_N\big\}$ in Example 3.5. Then
[TABLE]
Now set . Then corresponds to the max-average pooling performed on a block of size [37]. When , the standard average-pooling operation is obtained, which is associated with the activation operator . When , we recover the standard max-pooling operation [14], which is the main building block of maxout layers [31]. The max-pooling operator is nonexpansive.
Example 3.7
Let , let , and let . Set
[TABLE]
where is the matrix obtained by retaining the first rows of the identity matrix of size , and . Then is -averaged.
Remark 3.8
Let be an odd integer, let , let , let be the activation operator defined in Example 3.7, and set . Then, for every , . This corresponds to the median neuron model introduced in [2].
Remark 3.9
Multi-component averaged activation operators can be derived from the above examples. Indeed, let be real Hilbert spaces and let be their Hilbert direct sum. For every , let and let be -averaged. Then is -averaged with .
4 Lipschitz constants for layered networks
The objective of this section is to derive Lipschitz constants for networks conforming to Model 1.1. Note that, if , a Lipschitz constant of is clearly since is nonexpansive. We shall therefore focus henceforth on the case . Throughout, the following notation is employed.
Notation 4.1
Let and . Then
[TABLE]
and, for every ,
[TABLE]
Theorem 4.2
Consider the setting of Model 1.1 with . Set
[TABLE]
and
[TABLE]
Then is a Lipschitz constant of .
The following proposition features some important special cases.
Proposition 4.3
Consider the setting of Model 1.1 with , and let be defined as in (4.4). Then the following hold:
- (i)
.
- (ii)
Suppose that, for every , . Then .
- (iii)
Suppose that, for every , is purely nonexpansive in the sense that is its smallest averaging constant. Then .
- (iv)
Suppose that, for every , is firmly nonexpansive. Then
[TABLE]
- (v)
Set . Then
[TABLE]
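The special cases listed in this proposition can be checked on a small example. The sketch below is written under an explicit assumption about the form of the constant of Theorem 4.2: each activation $R_i=(1-\alpha_i)\mathrm{Id}+\alpha_iQ_i$ with $1\leqslant i\leqslant m-1$ either passes the linear chain through (weight $1-\alpha_i$) or cuts it at a nonexpansive operator (weight $\alpha_i$), and the constant is the weighted sum, over all cut patterns, of the products of the spectral norms of the resulting blocks of consecutive matrices. The function name `theta` and its arguments are hypothetical.

```python
import itertools
import numpy as np


def theta(weights, alphas):
    """Averaged Lipschitz bound for weights = [W_1, ..., W_m].

    alphas = [alpha_1, ..., alpha_{m-1}] are the averagedness constants of the
    first m - 1 activation operators (the last one is only used through its
    nonexpansiveness). Assumed form: sum over subsets of cut positions.
    """
    m = len(weights)
    if len(alphas) != m - 1:
        raise ValueError("need exactly m - 1 averagedness constants")
    total = 0.0
    for cuts in itertools.product((False, True), repeat=m - 1):
        coeff, sigma = 1.0, 1.0
        block = weights[0]  # current block of consecutive matrices
        for i, cut in enumerate(cuts):
            coeff *= alphas[i] if cut else 1.0 - alphas[i]
            if cut:
                sigma *= np.linalg.norm(block, 2)  # close the block
                block = weights[i + 1]             # start a new one
            else:
                block = weights[i + 1] @ block     # extend the block
        sigma *= np.linalg.norm(block, 2)
        total += coeff * sigma
    return total


rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 3)),
      rng.standard_normal((5, 4)),
      rng.standard_normal((2, 5))]
theta_linear = theta(Ws, [0.0, 0.0])  # all constants 0: purely linear network
theta_firm = theta(Ws, [0.5, 0.5])    # firmly nonexpansive activations
theta_loose = theta(Ws, [1.0, 1.0])   # merely nonexpansive activations
print(theta_linear, theta_firm, theta_loose)
```

With all constants equal to $0$ the sum collapses to the norm of the full product of matrices, and with all constants equal to $1$ it collapses to the product of the individual norms, which matches items (ii) and (iii).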
Remark 4.4
Proposition 4.3(i)–4.3(iii) show that the tightest bound in terms of stability corresponds to a linear network, while the loosest corresponds to a network with nonlinearities having no stronger property than nonexpansiveness.
We close this section by observing that the Lipschitz constant exhibited in Theorem 4.2 is a componentwise increasing function of the averagedness constants of the activation operators.
Proposition 4.5
Consider the setting of Model 1.1 with . Make the Lipschitz constant in Theorem 4.2 a function of . Let and be such that . Then .
Remark 4.6
Proposition 4.5 suggests that, in terms of stability, it is better to use proximal activation operators, such as those listed in Example 3.2(i)–(ii), than -averaged activation operators for which , such as those mentioned in Example 3.2(iv).
5 Networks using separable activation operators
We show that sharper Lipschitz constants can be derived in the case of networks featuring the type of separable structure described in Example 3.4. Note that this class of networks is the most commonly used, standard convnets being special cases. The following notation will be used.
Notation 5.1
Let be a separable real Hilbert space, let , let be an orthonormal basis of , and let be a nonempty bounded subset of . Then
[TABLE]
5.1 General results
Theorem 5.2
Consider the setting of Model 1.1 with . For every , suppose that is separable, let , let be an orthonormal basis of , and, for every , let be -averaged and such that . Assume that
[TABLE]
and define
[TABLE]
Then the following hold:
- (i)
is a Lipschitz constant of the operator of (1.3).
- (ii)
Define as in (4.4). Then .
Remark 5.3
An expression similar to (5.3) was proposed empirically in [49] for a multilayer perceptron operating on finite-dimensional spaces under the additional assumption that the activation operators are continuously differentiable and firmly nonexpansive.
Remark 5.4
In Theorem 5.2, make the additional assumption that, for some , the functions are increasing. Then it follows from Proposition 3.1(iii) that there exist functions in such that . In addition, for every , since and since the set of minimizers of coincides with the set of fixed points of [8, Proposition 12.29], we deduce that is minimized at $0$. Furthermore, and , where . Such a construction is used in [23, 25].
As in Proposition 4.5, the Lipschitz constant exhibited in Theorem 5.2 turns out to be a componentwise increasing function of the averagedness constants of the activation operators.
Proposition 5.5
Consider the setting of Model 1.1 with . For every , suppose that is separable, let , and let be an orthonormal basis of . Define by
[TABLE]
Let and be such that . Then .
5.2 Extension to non-Hilbertian norms
In certain applications, Hilbertian norms may not be the most relevant measures to quantify errors. We now state a variant of Theorem 5.2 which holds for alternative norms. It involves embeddings of Hilbert spaces; standard examples can be found in [57]. Let us also point out that these embedding conditions are automatically satisfied if the spaces are finite-dimensional.
Proposition 5.6
Consider the setting of Model 1.1 with . For every , suppose that is separable, let , let be an orthonormal basis of , and, for every , let be -averaged and such that . Let be the normed space obtained by equipping the vector space underlying with a norm for which is continuously embedded in , and let be the normed space obtained by equipping the vector space underlying with a norm for which is continuously embedded in . Assume that
[TABLE]
Then
[TABLE]
is a Lipschitz constant of .
Corollary 5.7
Consider the setting of Model 1.1 with . Define and as in Proposition 5.6, let , and let be such that one of the following holds:
- (i)
and .
- (ii)
and .
Let be the normed space obtained by equipping the vector space underlying with the norm
[TABLE]
Then a Lipschitz constant of is
[TABLE]
5.3 Networks with positive weights
Under certain positivity assumptions, the constant of (5.3) and (5.8) can be simplified.
Assumption 5.8
Consider the setting of Model 1.1 with . For every , suppose that is separable, let , and let be an orthonormal basis of . For every , set
[TABLE]
We suppose that
[TABLE]
Example 5.9
Consider the particular case of Model 1.1 in which, for every , , , is the canonical basis of and, for every , with the additional condition that, for every , . Further, for every , the matrix satisfies
[TABLE]
Then Assumption 5.8 holds. This is true in particular if, for every ,, which corresponds to positively weighted networks. See [19] for the design of such networks.
In the following result, a Lipschitz constant of the network (1.3) coincides with that of the linear network for standard choices of norms.
Proposition 5.10
Suppose that the assumptions of Corollary 5.7 are satisfied, that
[TABLE]
and that Assumption 5.8 holds. Then the Lipschitz constant of in (5.8) reduces to .
We show below that the Lipschitz constant of a positively weighted network associated with weight operators and nonseparable activation operators is not necessarily .
Example 5.11
Consider the toy version of Model 1.1 in which , . Set , where
[TABLE]
Let . Then and therefore . Consequently, we derive from [24, Example 2.13] that $\operatorname{prox}_{\varphi}x=\big(\tanh(\xi_1),\tanh(\xi_2)\big)$. Now set
[TABLE]
[25, Lemma 2.8], and . Then . If the input is perturbed by , we get , which shows that, although and have strictly positive entries, the Lipschitz constant is larger than . Note that, in this scenario, the constant of (4.4) is
[TABLE]
A sharper Lipschitz constant can be obtained by noticing that this network is equivalent to a network in which , , and are replaced by , , and . Since is separable, the constant of (5.4) is . In contrast, the naive bound of (1.4) is about .
For separable activators in finite-dimensional spaces, we have the following result, which does not require Assumption 5.8.
Proposition 5.12
Consider the setting of Model 1.1 with . Suppose that the assumptions of Corollary 5.7 hold and that satisfies (5.12). In addition, assume that, for every , and is the canonical basis of . For every , let denote the matrix obtained by taking the absolute values of the entries of the matrix . Then the Lipschitz constant of in (5.8) satisfies .
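The two statements of this subsection can be illustrated in the Euclidean setting with the spectral norm. The sketch below assumes ReLU-type separable activations, so that the diagonal matrices have entries in $\{0,1\}$; the helper `sup_over_diagonals` is a hypothetical brute-force routine, not a construction from the paper. It checks that nonnegative weights make the supremum coincide with the norm of the linear network, and that in general the supremum is dominated by the norm of the product of the entrywise absolute values.

```python
import itertools
import numpy as np


def sup_over_diagonals(weights, low=0.0):
    """Brute-force sup of ||W_m L_{m-1} ... L_1 W_1||_2.

    Each L_i ranges over diagonal matrices with entries in {low, 1}. The cost
    is exponential in the hidden dimensions, so use tiny layers only.
    """
    def recurse(prefix, i):
        # prefix = L_{i-1}-free partial product W_i L_{i-1} ... L_1 W_1
        if i == len(weights):
            return np.linalg.norm(prefix, 2)
        best = 0.0
        for d in itertools.product((low, 1.0), repeat=prefix.shape[0]):
            scaled = np.array(d)[:, None] * prefix  # apply the diagonal matrix
            best = max(best, recurse(weights[i] @ scaled, i + 1))
        return best

    return recurse(weights[0], 1)


rng = np.random.default_rng(0)
shapes = [(3, 2), (3, 3), (2, 3)]

# Nonnegative weights: the supremum is attained with identity diagonals.
pos = [np.abs(rng.standard_normal(s)) for s in shapes]
sup_pos = sup_over_diagonals(pos)
lin_pos = np.linalg.norm(pos[2] @ pos[1] @ pos[0], 2)

# Signed weights: bound by the product of entrywise absolute values.
gen = [rng.standard_normal(s) for s in shapes]
sup_gen = sup_over_diagonals(gen)
lin_gen = np.linalg.norm(gen[2] @ gen[1] @ gen[0], 2)
abs_gen = np.linalg.norm(np.abs(gen[2]) @ np.abs(gen[1]) @ np.abs(gen[0]), 2)
print(sup_pos, lin_pos, sup_gen, lin_gen, abs_gen)
```

The first comparison relies on the fact that zeroing rows of a nonnegative matrix product can only decrease its entries, and hence its spectral norm.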
6 Conclusion
Using advanced tools from nonlinear analysis, we have derived sharp Lipschitz constants for layered network structures involving compositions of nonexpansive averaged operators and affine operators. This framework has been shown to model feed-forward neural networks having a chain graph structure. Extending these results to networks having a more general directed acyclic graph (DAG) structure would be of interest. Among the many avenues of future research that this work suggests, it would be interesting to exploit it to devise training strategies that achieve better robustness. The proposed nonexpansive operator machinery could also be used to design network architectures with smaller Lipschitz constants. Finally, computing local Lipschitz constants could be of interest in practice and constitutes an important topic of future research.
Appendix A Technical lemmas
Lemma A.1
[23, Proposition 2.4]
Let be a function defined from to . Then is the proximity operator of a function in if and only if it is nonexpansive and increasing.
Lemma A.2
Let and, for every , let be a nonempty subset of a real vector space . Let be a function which is convex with respect to each of its coordinates. Set and let be its convex envelope. Then .
Proof. Set . Clearly, . Now take . Then , where is a finite family in such that and, for every , , with . Note that . Therefore,
[TABLE]
Hence, .
Lemma A.3
Let be a separable real Hilbert space, let , let be an orthonormal basis of , and let . For every , let be -averaged and such that . Define , and fix and in . Then there exists such that .
Proof. We saw in Example 3.4 that is well defined. We have
[TABLE]
For every , there exists a nonexpansive such that and, therefore,
[TABLE]
Consequently, for every , there exists such that
[TABLE]
We deduce from (A.2) that , as claimed.
Appendix B Proofs of main results
B.1 Proof of Proposition 3.1
(i): As seen in (1.2), there exists a nonexpansive operator such that . However, by [8, Prop. 4.4 and Cor. 23.9], there exists a maximally monotone operator such that . Hence, , where . For the last claim, notice that, since is firmly nonexpansive [8, Cor. 23.9], so is as a convex combination of two firmly nonexpansive operators [8, Exa. 4.7]. (ii)(i): It follows from [8, Cor. 22.23] that there exists such that , which provides the expression for . The increasingness claim follows from Lemma A.1. Finally, if is even, then is odd [8, Prop. 24.10] and so is . (iii): This follows from Lemma A.1.
B.2 Proof of Example 3.3
Let be the support function of and set . Then it follows from [8, Prop. 24.30] and (3.14) that . However, since is firmly nonexpansive, it is -averaged, which makes a -averaged operator. Now consider the function of (3.10). Then it is an even function in with $0$ as its unique minimizer. Next, set . As seen in Example 3.2(ii), and is bounded. Therefore is bounded. In turn, is supercoercive and we derive from [8, Prop. 14.15] that . Hence, since is strictly convex, it follows from [8, Prop. 18.9] that is differentiable on . In addition, . Altogether, (3.14) reduces to
[TABLE]
and hence, in view of Example 3.2(ii), to (3.15).
B.3 Proof of Example 3.4
Let and . It follows from the nonexpansiveness of the functions that
[TABLE]
Hence, is well defined. For every , by (1.2) there exists a nonexpansive function such that . Hence, , where . Therefore,
[TABLE]
This shows that is nonexpansive and hence that is -averaged.
B.4 Proof of Example 3.5
Let be the sorting operator of Example 3.7. Then
[TABLE]
where (B.4) follows from [32, Thm. 368]. This shows that is nonexpansive. Furthermore, is nonexpansive [8, Cor. 4.18]. Note that
[TABLE]
Since is nonexpansive as a convex combination of nonexpansive operators, the operator is -averaged.
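The nonexpansiveness of the sorting operator used in this proof can be probed empirically. The following is a small sanity check rather than a proof: it draws random pairs of vectors and records the largest observed ratio $\|\mathrm{sort}(x)-\mathrm{sort}(y)\|/\|x-y\|$ in the Euclidean norm. The function name `worst_sorting_ratio`, the dimension, and the number of trials are hypothetical choices.

```python
import numpy as np


def worst_sorting_ratio(dim=8, trials=10000, seed=1):
    """Largest observed ratio ||sort(x) - sort(y)|| / ||x - y|| over random pairs.

    Components are sorted in ascending order along each row. A value above 1
    (beyond rounding error) would contradict nonexpansiveness.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((trials, dim))
    y = rng.standard_normal((trials, dim))
    num = np.linalg.norm(np.sort(x, axis=1) - np.sort(y, axis=1), axis=1)
    den = np.linalg.norm(x - y, axis=1)
    return float(np.max(num / den))


ratio = worst_sorting_ratio()
print(ratio)
```

Random sampling only gives a lower estimate of the true Lipschitz constant, so this check can refute nonexpansiveness but cannot establish it.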
B.5 Proof of Example 3.7
Set . Let and be in , and define and . As seen in (B.5), is nonexpansive. Consequently,
[TABLE]
This shows that is Lipschitzian with constant . It is thus -averaged with [8, Prop. 4.38].
B.6 Proof of Theorem 4.2
For every , is -averaged and, therefore, there exists a nonexpansive operator such that . Since and is nonexpansive, it suffices to show that
[TABLE]
Let us prove this result by induction. Let and . If , we derive from the nonexpansiveness of that
[TABLE]
Hence, is Lipschitzian with constant
[TABLE]
Now assume that and that (B.8) holds at order . Then
[TABLE]
Hence, the nonexpansiveness of yields
[TABLE]
On the other hand, the induction hypothesis yields
[TABLE]
Similarly, replacing by above, we get
[TABLE]
Using (LABEL:e:21), and then inserting (LABEL:e:23) and (LABEL:e:22), we obtain
[TABLE]
Furthermore, we deduce from (4.3) that
[TABLE]
Therefore
[TABLE]
which implies that, if , then . Hence, (B.14) yields
[TABLE]
Thus, we obtain
[TABLE]
which establishes (B.8).
B.7 Proof of Proposition 4.3
Define as in (4.3). (i): For every and every , (4.2) yields
[TABLE]
Consequently, it follows from (4.4) that
[TABLE]
In view of (4.3), is the discrete probability distribution of a vector of independent Bernoulli random variables. Hence, in (B.20). (ii): For every , . Therefore, in view of (4.3),
[TABLE]
Hence, the result follows from (4.4). (iii): For every , . Therefore, in view of (4.3),
[TABLE]
Invoking (4.4) allows us to conclude. (iv): For every . Hence, (4.3) yields . Invoking once again (4.4) yields the result. (v): It follows from (4.2) that
[TABLE]
We decompose this expression in a sum of terms depending on the value taken by , namely,
[TABLE]
In addition, for every , we derive from (4.3) that
[TABLE]
Using the above equality in (B.24), factorizing common factors, and invoking (4.4) yields
[TABLE]
and we obtain (4.6).
B.8 Proof of Proposition 4.5
Let and set
[TABLE]
For every and every , (4.2) yields
[TABLE]
We infer from (4.4) that
[TABLE]
In view of (B.28) we conclude that
[TABLE]
B.9 Proof of Theorem 5.2
(i): For every , set and . Note that, for every and every , is -averaged. Furthermore, . Now fix and in . It follows from (1.3) and the nonexpansiveness of that
[TABLE]
In view of Lemma A.3, for every , there exists such that
[TABLE]
Recursive application of this identity yields
[TABLE]
This implies that . Thus,
[TABLE]
is a Lipschitz constant of . Set and . For every , is generated from a sequence in via the construction of (5.1). The function
[TABLE]
is convex with respect to each of its coordinates. Hence, we deduce from Lemma A.2 that , as claimed.
(ii): For every , the identity operator of lies in . Hence, . For every , let and note that the linear operator
[TABLE]
is nonexpansive. Using the same kind of decomposition as in the proof of Theorem 4.2 yields
[TABLE]
and allows us to conclude that .
B.10 Proof of Proposition 5.5
It follows from (B.34) that
[TABLE]
B.11 Proof of Proposition 5.6
Let us first note that, because of the embeddings, is continuous and, likewise, every is continuous from to . Hence, for every , is continuous. We now follow the same argument as in the proof of Theorem 5.2. Let and be in . For every , there exists such that . Thus, , which leads to (5.6).
B.12 Proof of Corollary 5.7
Since, for every , , it follows from Hölder’s inequality that in (5.7) is well defined and does provide a continuous embedding of in . As in the proof of Theorem 5.2, it is enough to take the supremum in (5.8) over . For every , let . Then
[TABLE]
Let us designate by the sequence in involved in the construction of in (5.1). If , then
[TABLE]
which shows that . This inequality holds analogously if . We then deduce from (B.38) that . On the other hand, it follows from (5.6) that
[TABLE]
which concludes the proof.
B.13 Proof of Proposition 5.10
For every , let and let be the associated sequence in (5.1). Define
[TABLE]
and set and . Then, by (5.10),
[TABLE]
In addition, it follows from (5.7) and (B.41) that
[TABLE]
Therefore, without loss of generality, we assume that
[TABLE]
Let us now show that
[TABLE]
Let . Then there exists such that and
[TABLE]
If in (5.7), this yields
[TABLE]
On the other hand,
[TABLE]
which, in view of (5.1), implies that
[TABLE]
Using (5.9) recursively yields
[TABLE]
We then deduce from (B.44) that
[TABLE]
Set . In view of (5.12), . Thus, (B.13) yields
[TABLE]
It then follows from (B.46) and the fact that that
[TABLE]
The same inequality is obtained similarly for . This establishes (B.45), which leads to
[TABLE]
Since the converse inequality holds straightforwardly, the proof is complete.
B.14 Proof of Proposition 5.12
We use arguments similar to those of the proof of Proposition 5.10. For every , let . There exists such that and
[TABLE]
On the other hand, for every ,
[TABLE]
Setting yields $\big|\langle W_m\Lambda_{m-1}\cdots\Lambda_1W_1x\mid e_{m,k_m}\rangle\big|\leqslant\langle(A_m\cdots A_1)y\mid e_{m,k_m}\rangle$, and (B.54) implies that $\|W_m\Lambda_{m-1}\cdots\Lambda_1W_1\|_{\mathcal{G}_0,\mathcal{G}_m}\leqslant\|A_m\cdots A_1y\|_{\mathcal{G}_m}\leqslant\|A_m\cdots A_1\|_{\mathcal{G}_0,\mathcal{G}_m}$, which concludes the proof.
References
- [1] N. Akhtar and A. Mian, Threat of adversarial attacks on deep learning in computer vision: A survey, IEEE Access, vol. 6, pp. 14410–14430, 2018.
- [2] C. H. Aladag, E. Egrioglu, and U. Yolcu, Robust multilayer neural network based on median neuron model, Neural Comput. Appl., vol. 24, pp. 945–956, 2014.
- [3] A. Athalye, N. Carlini, and D. Wagner, Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples, Proc. Intl. Conf. Machine Learn., pp. 274–283, 2018.
- [4] J.-B. Baillon, R. E. Bruck, and S. Reich, On the asymptotic behavior of nonexpansive mappings and semigroups in Banach spaces, Houston J. Math., vol. 4, pp. 1–9, 1978.
- [5] R. Balan, M. Singh, and D. Zou, Lipschitz properties for deep convolutional networks, 2017. https://arxiv.org/abs/1701.05217.pdf
- [6] P. L. Bartlett, D. J. Foster, and M. J. Telgarsky, Spectrally-normalized margin bounds for neural networks, Adv. Neural Inform. Process. Syst., vol. 30, pp. 6240–6249, 2017.
- [7] M. Basirat and P. M. Roth, The quest for the golden activation function, arXiv, 2018. https://arxiv.org/pdf/1808.00783
- [8] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd ed., corrected reprint. Springer, New York, 2019.
