Testing Stationarity Concepts for ReLU Networks: Hardness, Regularity, and Robust Algorithms
Lai Tian, Anthony Man-Cho So

TL;DR
This paper investigates the computational complexity of stationarity testing in ReLU neural networks, establishing hardness results, providing a regularity condition, and proposing a robust algorithm for near-approximate stationarity testing.
Contribution
It proves the co-NP-hardness of certain stationarity tests, introduces a simple regularity condition for subdifferential chain rules, and develops a practical algorithm for robust stationarity testing in ReLU networks.
Findings
Testing first-order stationarity is co-NP-hard.
A simple regularity condition for subdifferential chain rule validity.
A robust algorithm for near-approximate stationarity testing.
Lai Tian, Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Sha Tin, N.T., Hong Kong SAR.
Anthony Man-Cho So, Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Sha Tin, N.T., Hong Kong SAR.
Abstract
We study the computational problem of the stationarity test for the empirical loss of neural networks with ReLU activation functions. Our contributions are:
1. Hardness: We show that checking a certain first-order approximate stationarity concept for a piecewise linear function is co-NP-hard. This implies that testing a certain stationarity concept for a modern nonsmooth neural network is in general computationally intractable. As a corollary, we prove that testing so-called first-order minimality for functions in abs-normal form is co-NP-complete, which was conjectured by Griewank and Walther (2019, SIAM J. Optim., vol. 29, p284).
2. Regularity: We establish a necessary and sufficient condition for the validity of an equality-type subdifferential chain rule in terms of Clarke, Fréchet, and limiting subdifferentials of the empirical loss of two-layer ReLU networks. This new condition is simple and efficiently checkable.
3. Robust algorithms: We introduce an algorithmic scheme to test near-approximate stationarity in terms of both Clarke and Fréchet subdifferentials. Our scheme makes no false positive or false negative error when the tested point is sufficiently close to a stationary one and a certain qualification is satisfied. This is the first practical and robust stationarity test approach for two-layer ReLU networks.
1 Introduction
The theoretical analysis of ReLU neural network training is challenging from the optimization perspective, though the empirical performance of various “gradient”-based algorithms is surprisingly good. A key difficulty comes from the entanglement of nonconvexity and nonsmoothness in the objective function of the empirical loss, which renders not only the notion of gradient from classical analysis meaningless, but also the subdifferential set from convex analysis vacuous. Consequently, the study of such a nonconvex nondifferentiable function requires the use of tools from variational analysis Rockafellar and Wets (2009).
For a continuously differentiable function $f$, a point $\bm{x}$ is called stationary (or critical) if $\nabla f(\bm{x}) = \bm{0}$. However, the situation is much more complicated when $f$ is nondifferentiable at $\bm{x}$. Indeed, there are many different stationarity concepts (see Definition 6) for nonsmooth functions Li et al. (2020); Cui and Pang (2021). For general Lipschitz functions, recently, under the oracle complexity framework of Nemirovskij and Yudin (1983), substantial progress has been made on the design of provable algorithms for finding approximately stationary (in the sense of perturbed) points Zhang et al. (2020); Tian et al. (2022); Davis et al. (2022); Lin et al. (2022); Metel and Takeda (2022); Kong and Lewis (2022) and also on establishing the hardness of computing such approximate stationary points Kornowski and Shamir (2022a); Tian and So (2022); Kornowski and Shamir (2022b); Jordan et al. (2022).
As a complement to these developments, in this paper, we consider the complexity of and robust algorithms for checking whether a given neural network is an (approximately) stationary one with respect to the empirical loss. This is a task already considered by Yun et al. (2018). We emphasize that “checking” and “finding” are two very different computational problems. While the co-NP-hardness of checking the local optimality of a given point in smooth nonconvex programming was shown by Murty and Kabadi (1987), the complexity of “finding” a local minimizer remained an open question after it was posed by Pardalos and Vavasis (1992), and was only recently settled by Ahmadi and Zhang (2022).
Given a neural network with smooth elemental components, testing the (approximate) stationarity of a point is simply an application of the classic gradient chain rule. In a modern computational environment, this is usually done by using Algorithmic Differentiation (AD) Griewank and Walther (2008) software, e.g., PyTorch and TensorFlow. A natural question that arises is whether testing the stationarity for a piecewise smooth function (e.g., empirical loss of a ReLU network) is as easy as testing for a smooth one. Surprisingly, we show (in Theorem 10) that such testing is, in general, computationally intractable.
The difficulty here is due to the failure of an exact (equality-type) subdifferential chain rule. For a general locally Lipschitz function, the calculus rules are only known to hold in the form of set inclusions rather than equalities, except in several special cases (see Fact 8). This prevents one from computing the subdifferential set of the empirical loss from those of its elemental components. Thus, to facilitate the tractability of stationarity testing, it is of interest to find a condition under which an equality-type chain rule holds and the subdifferential set of the empirical loss can be characterized. By contrast, given a first-order oracle providing the whole generalized subdifferential set at the queried point in the oracle framework Kornowski and Shamir (2022a); Tian and So (2022); Kornowski and Shamir (2022b); Jordan et al. (2022), the stationarity testing task reduces to a simple linear program, which can be solved by interior-point methods in polynomial time. However, in practice, even computing an element in the generalized subdifferential for a nonsmooth function can be highly non-trivial Burke et al. (2002); Nesterov (2005); Huang and Ma (2010); Khan and Barton (2013). Therefore, a condition for the validity of the exact chain rule could be useful for subgradient computation and stationarity testing and analysis.
The most closely related work to ours is the one by Yun et al. (2018). They considered a two-layer ReLU network and introduced a theoretical algorithm to sequentially check Clarke stationarity (see Definition 6), Fréchet stationarity, and a certain second-order optimality condition. For Fréchet stationarity testing, they proposed to verify the nonnegativity of a directional derivative in every possible direction, for which a trivial test in the worst case requires checking exponentially many inequalities. By exploiting polyhedral geometry, they showed that it suffices to check only extreme rays, which can be done in polynomial time. A limitation of the work Yun et al. (2018) (see also the discussion in (Yun et al., 2018, Section 5)) is that the algorithm therein can only perform exact stationarity testing (see Section 5.1). That is to say, if the objective function is , then the algorithm in Yun et al. (2018) will certify stationarity if and only if . However, as pointed out by Yun et al. (2018, Section 5), in practice, such an exact nondifferentiable point is almost impossible to reach. Therefore, it is desirable to have a robust stationarity testing algorithm that works for points sufficiently close to a stationary one. In other words, we are interested in testing so-called near-approximate stationarity (see Definition 25). We mention that, without exploiting structures in the nonsmooth objective function, such robust testing is impossible in general (Tian and So, 2022, Theorem 2.7).
1.1 Our Results and Techniques
Hardness.
Our first main result shows that checking a certain first-order approximate stationarity concept for an unconstrained piecewise differentiable function is co-NP-hard (see Theorem 10). This implies that testing a certain stationarity concept for a shallow modern convolutional neural network is co-NP-hard (see Corollary 12). Our reduction is from 3-satisfiability (3SAT) to a stationarity testing problem. As a corollary, we prove that testing so-called first-order minimality (FOM) for functions in abs-normal form is co-NP-complete (see Corollary 11) and give an affirmative answer to a conjecture of Griewank and Walther (2019, SIAM J. Optim., vol. 29, p284).
Our other results concern the empirical loss of a two-layer ReLU network, which was also studied by Yun et al. (2018). Given the training data with the parametrization, we first make the following blanket assumptions.
Assumption 1 (Blanket assumptions).
The loss function is smooth and has locally Lipschitz gradient. For simplicity of notation, we write for . For any , we assume , which is superfluous for the parametrization.
The empirical loss of a two-layer ReLU neural network with hidden nodes can be written as
[TABLE]
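The displayed formula did not survive extraction, so the following is only a minimal sketch of the kind of objective being discussed. It assumes (these are assumptions, not taken from the source) that the network output is a sum over hidden nodes of `u_k * max(w_k . x_i, 0)`, that the empirical loss is an unnormalized sum over data points, and that the smooth loss is the squared loss. The helper `activation_pattern` is a hypothetical name for recording which ReLUs are active, inactive, or exactly at a kink.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean inner product of two equally long vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def relu(t: float) -> float:
    """ReLU activation max(t, 0)."""
    return t if t > 0.0 else 0.0


def network_output(u: Sequence[float], W: Sequence[Sequence[float]],
                   x: Sequence[float]) -> float:
    """Two-layer ReLU network: sum_k u_k * max(w_k . x, 0) (assumed form)."""
    return sum(u_k * relu(dot(w_k, x)) for u_k, w_k in zip(u, W))


def empirical_loss(u, W, X, y) -> float:
    """Sum of squared losses; the squared loss is an assumed smooth choice."""
    return sum(0.5 * (network_output(u, W, x_i) - y_i) ** 2
               for x_i, y_i in zip(X, y))


def activation_pattern(W, X) -> List[List[int]]:
    """Entry [k][i] is +1 (active), -1 (inactive), or 0 (on the kink)."""
    def sign(t: float) -> int:
        return (t > 0) - (t < 0)
    return [[sign(dot(w_k, x_i)) for x_i in X] for w_k in W]
```

The zero entries of the activation pattern mark the data points at which the loss may fail to be differentiable in the corresponding hidden weights, which is where the chain-rule questions studied in this paper arise.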
Regularity.
By naïvely abusing the convex subdifferential chain rule for , we consider the following “generalized subdifferential” of the empirical loss as
[TABLE]
with . This “generalized subdifferential” is popular in practical computation and theoretical analysis. For example, see (Wang et al., 2019, Equation (9)), (Arora et al., 2019, Section 3.1), and (Safran et al., 2022, Equations (5) and (6)). However, as is nonconvex and nonsmooth, we can only assert a fuzzy chain rule (see (Clarke, 1990, Section 2.3)) for the Clarke subdifferential of , which is a set inclusion rather than an equation.
Our second main result is a necessary and sufficient condition for the validity of a series of equality-type subdifferential chain rules for the empirical loss of this shallow ReLU network. We show that, under this regularity condition, exact chain rules hold for three commonly used generalized subdifferentials, i.e., Clarke (see Definition 2 and Theorem 14), limiting (see Definition 4 and Theorem 16), and Fréchet (see Definition 3 and Theorem 17). It is notable that while sufficient conditions for the equality-type calculus rules are rather rich in the literature (see (Rockafellar and Wets, 2009, Chapter 10)), a necessary condition is rarely seen, let alone an efficiently computable, necessary and sufficient condition as in our Theorem 14.
Robust algorithms.
Our third main result is an algorithmic scheme to test the so-called near-approximate stationarity (see Definition 25) in terms of both Clarke and Fréchet subdifferentials. We show that, for an approximate stationary point , any point that is sufficiently close to can be certified (with Algorithm 4) as near-approximate stationary. Our technique is a new rounding scheme (see Algorithm 3) motivated by the notion of active manifold identification Lewis (2002); Lemaréchal et al. (2000) in the literature. This new rounding scheme is capable of identifying the activation pattern of the target stationary point and finding a nearby point with the same pattern. One notable application of such a near-approximate stationarity test is to obtain a termination criterion for algorithms that only have asymptotic convergence results. For example, every limit point of the sequence generated by the stochastic subgradient method has been shown to be Clarke stationary (see Definition 6) by Davis et al. (2020, Corollary 5.11), but it is still unclear when to terminate the algorithm, and how to certify that the obtained point is at least close to some Clarke stationary point, as the norm of any vector in the subdifferential is almost surely bounded away from zero during the entire trajectory (consider running the subgradient method on ).
Notation.
Scalars, vectors and matrices are denoted by lowercase letters, boldface lower case letters, and boldface uppercase letters, respectively. The notation used in this paper is mostly standard: (we may write to emphasize the dimension); for a closed set , which is defined as if the set ; denotes the convex hull of the set ; the vector denotes the -th column of identity matrix ; ; denotes the project to the -th argument operator; i.e., for sets ; the extended-real is defined as ; the addition of two sets is always understood in the sense of Minkowski; ; for any integer .
Organization.
We introduce the background on generalized differentiation theory and formal definitions of stationarity concepts in Section 2. Then, in Section 3, we present our main hardness results. The necessary and sufficient condition of the validity of chain rule in terms of various subdifferential constructions is presented in Section 4. We discuss the robust algorithms to test near-approximate stationarity concepts in Section 5. All proofs are deferred to the Appendices.
2 Preliminaries
The following construction of subdifferential by Clarke (1990, Theorem 2.5.1) is classic.
Definition 2 (Clarke subdifferential).
Given a point , the Clarke subdifferential of a locally Lipschitz function at is defined by
[TABLE]
For a locally Lipschitz function, the Clarke subdifferential is always nonempty, convex, and compact (Clarke, 1990, Proposition 2.1.2(a)). The following set generated by a directional derivative is known as the Fréchet subdifferential of (Rockafellar and Wets, 2009, Exercise 8.4).
Definition 3 (Fréchet subdifferential).
Given a point , the Fréchet subdifferential of a locally Lipschitz and directionally differentiable function at is defined by
[TABLE]
The set-valued mapping of Fréchet subdifferential of is not outer semicontinuous (see (Rockafellar and Wets, 2009, Definition 5.4)), which means that given with , we cannot assert . The following limiting subdifferential (or the Mordukhovich subdifferential) (Rockafellar and Wets, 2009, Definition 8.3(b)) is more robust for analysis.
Definition 4 (Limiting subdifferential).
Given a point , the limiting subdifferential of a locally Lipschitz and directionally differentiable function at is defined by
[TABLE]
where the outer limit is taken in the sense of Kuratowski (see, e.g., (Rockafellar and Wets, 2009, p152, Equation 5(1))).
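The displayed formulas in Definitions 2 to 4 were lost in extraction. For orientation only, the standard textbook forms of the three constructions for a locally Lipschitz and directionally differentiable function $f:\mathbb{R}^d\to\mathbb{R}$ are sketched below. They follow the cited sources (Clarke, 1990, Theorem 2.5.1) and (Rockafellar and Wets, 2009, Exercise 8.4, Definition 8.3(b)); the notation is an assumption and may differ from the paper's own displays.

```latex
% Standard forms; f'(x; d) denotes the one-sided directional derivative.
\begin{align*}
\partial_{C} f(\bm{x}) &= \operatorname{conv}\Big\{ \bm{s} : \exists\, \bm{x}_k \to \bm{x},\ \nabla f(\bm{x}_k) \text{ exists},\ \nabla f(\bm{x}_k) \to \bm{s} \Big\}, \\
\widehat{\partial} f(\bm{x}) &= \Big\{ \bm{s} : \langle \bm{s}, \bm{d} \rangle \leqslant f'(\bm{x}; \bm{d}) \ \text{ for all } \bm{d} \in \mathbb{R}^d \Big\}, \\
\partial f(\bm{x}) &= \limsup_{\bm{x}' \to \bm{x}} \widehat{\partial} f(\bm{x}').
\end{align*}
```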
In the following result, we record a generalized Fermat’s rule for optimality conditions and the relationship among the aforementioned three subdifferentials.
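Fact 5 (Rockafellar and Wets (2009, Theorem 8.6, 8.49, 10.1)).
Given a locally Lipschitz function $f$ and a point $\bm{x}$, we have $\widehat{\partial}f(\bm{x}) \subseteq \partial f(\bm{x}) \subseteq \partial_{C}f(\bm{x})$. If the point $\bm{x}$ is a local minimizer of the function $f$, then it holds that $\bm{0} \in \widehat{\partial}f(\bm{x})$.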
We are now ready to state the definitions of various stationarity concepts.
Definition 6 (Stationarity concepts).
Given a locally Lipschitz function $f$, we say that the point $\bm{x}$ is an
- $\varepsilon$-Clarke stationary point if $\operatorname{dist}\big(\bm{0},\partial_{C}f(\bm{x})\big)\leqslant\varepsilon$;
- $\varepsilon$-Fréchet stationary point if $\operatorname{dist}\big(\bm{0},\widehat{\partial}f(\bm{x})\big)\leqslant\varepsilon$;
- $\varepsilon$-limiting stationary point if $\operatorname{dist}\big(\bm{0},\partial f(\bm{x})\big)\leqslant\varepsilon$.
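To make the difference between these concepts concrete, here is a small one-dimensional sketch (an illustration added for this note, not code from the paper). For a univariate piecewise linear function with slope `left_slope` just to the left of a point and `right_slope` just to the right, the Clarke subdifferential is the interval spanned by the two slopes, while the Fréchet subdifferential is `[left_slope, right_slope]` when that interval is nonempty and the empty set otherwise. The function names are hypothetical.

```python
import math
from typing import Optional, Tuple

# An interval (lo, hi), or None for the empty set.
Interval = Optional[Tuple[float, float]]


def clarke_1d(left_slope: float, right_slope: float) -> Interval:
    """Clarke subdifferential: convex hull of the two one-sided slopes."""
    return (min(left_slope, right_slope), max(left_slope, right_slope))


def frechet_1d(left_slope: float, right_slope: float) -> Interval:
    """Frechet subdifferential: all s with left_slope <= s <= right_slope."""
    if left_slope <= right_slope:
        return (left_slope, right_slope)
    return None  # empty at a concave kink


def dist_zero(interval: Interval) -> float:
    """Distance from 0 to the interval; infinity for the empty set."""
    if interval is None:
        return math.inf
    lo, hi = interval
    return max(lo, -hi, 0.0)


# f(x) = -|x| at x = 0: slope +1 on the left, slope -1 on the right.
clarke_dist = dist_zero(clarke_1d(1.0, -1.0))    # 0 is Clarke stationary
frechet_dist = dist_zero(frechet_1d(1.0, -1.0))  # Frechet set is empty
```

For `f(x) = -|x|` the origin is Clarke stationary although it is a strict local maximizer, and it is not Fréchet stationary for any finite tolerance, which is why the Fréchet concept is the sharper necessary condition for local minimality.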
The following Clarke regularity for locally Lipschitz and directionally differentiable functions is a classic notion related to the validity of various subdifferential calculus rules; see (Clarke, 1990, Definition 2.3.4) and (Rockafellar and Wets, 2009, Corollary 8.11).
Definition 7 (Clarke regularity).
For a locally Lipschitz and directionally differentiable function and a point , one has is Clarke regular at if .
We record some basic equality-type calculus rules for Clarke subdifferential as follows; see (Clarke, 1990, Proposition 2.3.3, Theorem 2.3.10), and (Rockafellar, 1985, Proposition 2.5). We refer the reader to (Rockafellar and Wets, 2009, Chapter 10) for similar calculus rules for Fréchet and limiting subdifferentials.
Fact 8 (Calculus rules).
Let be two locally Lipschitz functions.
- If is strictly differentiable at , then ;
- If , then ;
- Given a strictly differentiable mapping and a point , if the function (or ) is Clarke regular at , then (or ) is Clarke regular at and , where is the Jacobian of mapping . The equality also holds when is surjective.
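Without a regularity assumption, such rules generally hold only as inclusions. A standard one-dimensional illustration, chosen for this note and not necessarily the example used in the paper, is the sum of $|x|$ and $-|x|$, where the second summand is not Clarke regular at the origin:

```latex
\begin{align*}
f_1(x) &= |x|, \qquad f_2(x) = -|x|, \qquad (f_1 + f_2)(x) = 0 \ \text{ for all } x, \\
\partial_{C}(f_1 + f_2)(0) &= \{0\} \subsetneq [-2, 2] = [-1, 1] + [-1, 1] = \partial_{C} f_1(0) + \partial_{C} f_2(0).
\end{align*}
```

Here the sum of the individual Clarke subdifferentials strictly overestimates the Clarke subdifferential of the sum, so a stationarity test based on the right-hand side alone could not distinguish this constant function from one with a genuine kink.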
Remark 9.
The sum rule is a special case of the chain rule, which does not hold for Lipschitz functions trivially. For example, consider . The empirical loss of a ReLU network is in general not Clarke regular. To see this, let . It is easy to verify neither nor is Clarke regular. Another remark here is on the notion of partial subdifferentiation; see (Rockafellar and Wets, 2009, Corollary 10.11) and (Clarke, 1990, p48). In general, we cannot say much about the relationship between and (see (Clarke, 1990, Example 2.5.2)), except the following inclusion (Clarke, 1990, Proposition 2.3.16):
3 Hardness of Stationarity Testing
For smooth nonconvex programming, co-NP-hardness has been shown for local optimality testing (Murty and Kabadi, 1987, Theorem 2) and second-order sufficient condition testing (Murty and Kabadi, 1987, Theorem 4). However, in the nonsmooth case, we show that checking a first-order necessary condition approximately in terms of certain subdifferential is already co-NP-hard.
Theorem 10 (Testing of piecewise linear functions).
Given a -Lipschitz piecewise linear function in the form of max–min representation (footnote 1: any piecewise linear function can be written using a max–min representation as where is a finite index set; see (Scholtes, 2012, Proposition 2.2.2)). The input data are and with integer data. For any , checking whether the point satisfying $\operatorname{dist}\big(\bm{0},\widehat{\partial}f(\bm{0})\big)\leqslant 1/\sqrt{\eta}$ is co-NP-hard, and checking whether is strongly co-NP-hard.
We compare Theorem 10 with the classic hardness result of Murty and Kabadi (1987). In Murty and Kabadi (1987), checking the local optimality of a simply constrained indefinite quadratic problem (Murty and Kabadi, 1987, Problem 1) and of an unconstrained quartic polynomial objective (Murty and Kabadi, 1987, Problem 11) are both co-NP-complete. However, these hardness results are inapplicable for checking first-order necessary conditions. In fact, for any hard construction in Murty and Kabadi (1987) and a given point , testing can be done in polynomial time with respect to the input size. In Theorem 10, we show that for a class of simple unconstrained piecewise differentiable functions, even an approximate test of the first-order necessary condition for a certain point is already computationally intractable.
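As a small illustration of the input format, the sketch below evaluates a function given in the standard max–min form, a maximum over index sets of the minimum of affine pieces. The names `A`, `b`, and `index_sets` are assumptions made for this example; the paper's own notation for the representation was lost in extraction. When every offset is zero the function is positively homogeneous, so its directional derivative at the origin in direction `d` equals its value at `d`, and a single direction with a negative value certifies that the origin is not Fréchet stationary.

```python
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean inner product of two equally long vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def max_min_eval(x, A, b, index_sets) -> float:
    """Evaluate max over sets M of min over j in M of (A[j] . x + b[j])."""
    return max(min(dot(A[j], x) + b[j] for j in M) for M in index_sets)


# Illustrative instance: f(x1, x2) = max(min(x1, x2), min(-x1, -x2)).
A = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b = [0.0, 0.0, 0.0, 0.0]
index_sets = [[0, 1], [2, 3]]

# The direction (1, -1) gives a negative value, hence a descent direction at 0.
descent_value = max_min_eval([1.0, -1.0], A, b, index_sets)
```

Verifying one such direction is cheap, whereas certifying that no descent direction exists is the side of the problem that the hardness result concerns.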
Nonsmooth functions in real-world applications usually contain structures that can be exploited in theoretical analysis and algorithmic design. A subclass of piecewise differentiable functions, termed or functions representable in abs-normal form, and defined as the composition of smooth functions and the absolute value function, is introduced by Griewank (2013); see Appendix A for a brief introduction and (Griewank and Walther, 2019, Definition 2.1) for details. An important corollary of our hard construction concerns the complexity of checking an optimality condition for functions in . The following result gives an affirmative answer to a conjecture of Griewank and Walther (2019, p284):
Corollary 11 (Testing of abs-normal form).
Testing first-order minimality (FOM) for a piecewise differentiable function given in the abs-normal form is co-NP-complete.
Now, we report another notable corollary about the complexity of testing a certain stationarity concept for the empirical loss of a modern convolutional neural network.
Corollary 12 (Testing of loss of nonsmooth networks).
Let be the empirical loss function of a shallow neural network with ReLU activation function, max-pooling operator, and convolution operator. Suppose the width of the first layer is . Then, for any , testing the -Fréchet stationarity $\operatorname{dist}\big(\bm{0},\widehat{\partial}f(\bm{\theta})\big)\leqslant 1/\sqrt{\eta}$ for a certain is co-NP-hard, and testing for is strongly co-NP-hard.
Corollary 12 shows a computational tractability separation for the stationarity test between smooth and nonsmooth networks. In the smooth setting, given the gradient of every component function, we can compute the gradient norm of the loss function by iteratively applying the chain rule. But in the nonsmooth case, while the subdifferential of every elemental function can be computed easily, the validity of subdifferential chain rules like those in Fact 8 is not justified, which turns out to cause a serious computational hurdle in the stationarity test (strong co-NP-hardness).
4 Regularity Conditions
In this section, we study the regularity conditions for the validity of the equality-type chain rule in terms of Clarke, Fréchet, and limiting subdifferentials of the empirical loss of two-layer ReLU networks.
4.1 Setup
For simplicity of reference, we introduce the following notation, which will be used in various subdifferential constructions of the empirical loss .
Definition 13.
Let the parameters be given. We define the following shorthands:
- (a) We write constants for any .
- (b) For any and , we define the following two index sets:
[TABLE]
We may write and when the reference point is clear from the context.
- (c) For any , we define the following nonempty convex compact set related to the Clarke subdifferential:
[TABLE]
- (d) For any , we define the following nonempty compact set related to the limiting subdifferential:
[TABLE]
- (e) For any , we define the following convex compact set related to the Fréchet subdifferential:
[TABLE]
- (f) If an equation holds for all the three subdifferentials, i.e., Clarke/limiting/Fréchet subdifferentials (), we will write the equation simply with and also (for ). For example, if the equation holds , then we get , and .
4.2 Main Results
Theorem 14 (Clarke chain rule).
Under Assumption 1, we claim that the exact Clarke subdifferential chain rule holds for at a given point , that is
[TABLE]
if and only if the data points satisfy the following Span Qualification (SQ):
[TABLE]
Remark 15.
Note that for any , the index sets and can be computed in . Then, checking SQ is no harder than checking the Linear Independence Constraint Qualification (LICQ) in nonlinear programming and can be done with, e.g., the Zassenhaus algorithm.
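Since the exact SQ formula is not reproduced in this text, the sketch below only illustrates the linear-algebra primitive mentioned in the remark: the Zassenhaus algorithm, which computes bases of the sum and the intersection of two subspaces from spanning sets by a single row reduction of a block matrix. Exact rational arithmetic via `fractions.Fraction` is a choice made here to avoid floating-point rank decisions. How the subspaces are built from the data and index sets is not shown and would follow the SQ definition in the paper.

```python
from fractions import Fraction
from typing import List, Sequence, Tuple

Row = List[Fraction]


def rref(rows: Sequence[Sequence[int]]) -> List[Row]:
    """Reduced row echelon form over the rationals; zero rows are dropped."""
    mat = [[Fraction(v) for v in row] for row in rows]
    n_rows = len(mat)
    n_cols = len(mat[0]) if mat else 0
    pivot_row = 0
    for col in range(n_cols):
        if pivot_row == n_rows:
            break
        sel = next((r for r in range(pivot_row, n_rows) if mat[r][col] != 0), None)
        if sel is None:
            continue  # no pivot in this column
        mat[pivot_row], mat[sel] = mat[sel], mat[pivot_row]
        piv = mat[pivot_row][col]
        mat[pivot_row] = [v / piv for v in mat[pivot_row]]
        for r in range(n_rows):
            if r != pivot_row and mat[r][col] != 0:
                factor = mat[r][col]
                mat[r] = [a - factor * p for a, p in zip(mat[r], mat[pivot_row])]
        pivot_row += 1
    return [row for row in mat if any(v != 0 for v in row)]


def zassenhaus(U, W) -> Tuple[List[Row], List[Row]]:
    """Return (basis of span(U) + span(W), basis of their intersection)."""
    n = len(U[0])
    # Block matrix with rows (u, u) for u in U and (w, 0) for w in W.
    block = [list(u) + list(u) for u in U] + [list(w) + [0] * n for w in W]
    reduced = rref(block)
    sum_basis = [row[:n] for row in reduced if any(v != 0 for v in row[:n])]
    int_basis = [row[n:] for row in reduced if all(v == 0 for v in row[:n])]
    return sum_basis, int_basis
```

For example, with `U = [[1, 0, 0], [0, 1, 0]]` and `W = [[0, 1, 0], [0, 0, 1]]` the sum has dimension three and the intersection is spanned by the second coordinate vector.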
Theorem 16 (Limiting chain rule).
Under Assumption 1, we claim that the exact limiting subdifferential chain rule holds for at a given point , that is
[TABLE]
if and only if the data points satisfy SQ.
For Fréchet subdifferential, the situation is different as the default chain rule is the reverse set inclusion ; see (Rockafellar and Wets, 2009, Corollary 10.9, Theorem 10.49). If , we have the exact chain rule trivially, as can only be the empty set. Therefore, the interesting case is when the Fréchet subdifferential is nonempty.
Theorem 17 (Fréchet chain rule).
Under Assumption 1, for any given point such that the subdifferential , we have the following exact chain rule for the empirical loss
[TABLE]
if and only if the data points satisfy SQ.
4.3 Discussion
There are several existing regularity conditions related to the validity of exact chain rule of the empirical loss. We briefly introduce them here and defer the details to the 54 in Section C.5.
Definition 18 (Regularities).
We consider the following regularity conditions:
- General position data: (Montufar et al., 2014, Section 2.2), (Yun et al., 2018, Assumption 2), and Bubeck et al. (2020);
- Linear Independence Kink Qualification (LIKQ): (Griewank and Walther, 2019, Definition 2.6) and (Griewank and Walther, 2016, Definition 2);
- Linearly Independent Activated Data (LIAD): Let the index set . For any fixed , the data points are linearly independent.
The general position assumption is from the study of hyperplane arrangement. If the data points are generated from an absolutely continuous probability measure (with respect to the Lebesgue measure), then they are in general position almost surely. The LIKQ is introduced by Griewank and Walther (2016, Definition 2) to ensure an efficient Fréchet stationarity test for piecewise differentiable functions represented in abs-normal form. See Appendix A for a brief introduction. The LIAD condition is natural and equivalent to the surjectivity condition in Fact 8. Let us present the following result, in which we establish the relationship among SQ and the three other regularity conditions in Definition 18.
Proposition 19 (Regularity comparison).
For the empirical loss of a shallow ReLU network under Assumption 1, we have the following relationship:
[TABLE]
We exhibit two examples to show that the one-sided arrows in Proposition 19 are strict.
Example 20 (SQ $\nRightarrow$ LIAD).
Let the function be given as
[TABLE]
Consider . It is easy to verify that SQ is satisfied but not LIAD. Besides, is nonconvex, nonsmooth, and non-separable. Neither nor is Clarke regular. But by Theorem 14, the equality-type subdifferential sum rule still holds.
Example 21 (LIAD $\nRightarrow$ general position).
Let the function be given as
[TABLE]
Consider and . LIAD is satisfied, but the data is not in general position.
In practice, for data , if the features of data include a discrete-valued component, e.g., , then the points are rarely in general position, as at least half of them must lie in the same affine hyperplane or .
Remark 22 ( for general position data).
Besides, if the data points are in general position, we have the following compact representation for
[TABLE]
The following corollary concerning the Clarke regularity of all local minimizers could be of independent interest.
Corollary 23.
If at a point, SQ is satisfied and the empirical loss function has nonempty Fréchet subdifferential there, then the function is Clarke regular at that point. Consequently, with data in general position, is Clarke regular at every local minimizer.
5 Testing of Stationarity Concepts
To perform the stationarity test, we need the following quantitative regularities to characterize the curvature of the pieces in the empirical loss.
Assumption 24.
In this section, we further assume that for any , the norm of data and the function is -Lipschitz continuous with an -Lipschitz continuous gradient.
5.1 Exact Stationarity Test
As an immediate illustration of the results in Section 4, we record the following exact testing schemes for Clarke and Fréchet stationary points. Compared with the developments in Yun et al. (2018), which check Fréchet stationarity from the primal perspective and use polyhedral geometry to avoid redundant computation, our treatment of Fréchet stationarity via Theorem 17 is transparent and its correctness is self-evident.
Clarke stationarity.
Suppose that SQ is satisfied at the point . By Theorem 14, it is a Clarke stationary point of if and only if, for any ,
- (a) ;
- (b) .
Condition (a) is a simple equality test and condition (b) can be checked by solving a linear programming problem. Algorithm 1 is for testing -Clarke stationary points.
Fréchet stationarity.
Suppose that SQ is satisfied at the point . By Theorem 17, it is a Fréchet stationary point of if and only if, for any ,
- (a) ;
- (b) ;
- (c) .
Similarly, all of the above conditions can be checked in polynomial time with Algorithm 2.
5.2 Robust Stationarity Test
In this subsection, we introduce our main algorithmic results. First, we formally define the notion of stationarities that we are aiming to check; see Davis and Drusvyatskiy (2019); Kornowski and Shamir (2022a); Tian et al. (2022) for results on finding near-approximately stationary points for Lipschitz functions.
Definition 25 (Near-Approximate Stationarity, NAS).
Given a locally Lipschitz function $f$, we say that the point $\bm{x}$ is an
- $(\varepsilon,\delta)$-Clarke NAS point, if $\operatorname{dist}\Big(\bm{0},\cup_{\bm{y}\in\mathbb{B}_{\delta}(\bm{x})}\partial_{C}f(\bm{y})\Big)\leqslant\varepsilon$;
- $(\varepsilon,\delta)$-Fréchet NAS point, if $\operatorname{dist}\Big(\bm{0},\cup_{\bm{y}\in\mathbb{B}_{\delta}(\bm{x})}\widehat{\partial}f(\bm{y})\Big)\leqslant\varepsilon$.
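The Clarke version of this definition can be made concrete for a univariate piecewise linear function, where the union of Clarke subdifferentials over a ball is easy to enumerate. The sketch below is an illustration added for this note, not one of the paper's algorithms. It assumes a closed ball, sorted breakpoints, and one slope per linear piece; the function name is hypothetical.

```python
import math
from typing import Sequence


def clarke_nas_distance(breakpoints: Sequence[float], slopes: Sequence[float],
                        x: float, delta: float) -> float:
    """Distance from 0 to the union of Clarke subdifferentials over
    the closed ball [x - delta, x + delta].

    breakpoints: sorted kink locations b_0 < ... < b_{m-1}
    slopes: m + 1 slopes; slopes[k] is the slope on the k-th open piece
    """
    m = len(breakpoints)
    best = math.inf
    # Interior of each linear piece: the subdifferential is the single slope.
    for k, s in enumerate(slopes):
        lo = breakpoints[k - 1] if k > 0 else -math.inf
        hi = breakpoints[k] if k < m else math.inf
        if lo < x + delta and x - delta < hi:
            best = min(best, abs(s))
    # Kinks inside the ball: the subdifferential is the slope interval.
    for k, bp in enumerate(breakpoints):
        if abs(bp - x) <= delta:
            a, c = sorted((slopes[k], slopes[k + 1]))
            best = min(best, max(a, -c, 0.0))
    return best


# f(x) = |x|: one kink at 0 with slopes -1 and +1.
near = clarke_nas_distance([0.0], [-1.0, 1.0], x=0.05, delta=0.1)
far = clarke_nas_distance([0.0], [-1.0, 1.0], x=0.05, delta=0.01)
```

For the absolute value function, the point `0.05` has a subgradient of norm one, yet it qualifies as a Clarke NAS point with tolerance zero once the radius reaches the kink at the origin, and fails to qualify for any tolerance below one when the radius is smaller.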
We consider a constructive approach, that is, we certify the -Clarke NAS of a point for the function only if we find a point satisfying . Note that, at any time, if a point passes the exact stationarity test, say, with Algorithm 1, then must be an -Clarke NAS point. In other words, there is no false positive in the test. The question is whether, if is sufficiently close to a Clarke stationary point, we can always find a point near such that is -Clarke stationary. That is to say, we need to control the false negatives of our robust test. Without exploiting structures in the objective function, finding such a point is impossible in general (Tian and So, 2022, Theorem 2.7). Our technique is a new rounding scheme (see Algorithm 3), which is motivated by the notion of active manifold identification Lewis (2002); Lemaréchal et al. (2000) in the literature. This new rounding scheme is capable of identifying the activation pattern of the target stationary point that is sufficiently close to.
Now, suppose that is -smooth and a point satisfies . Without knowing the concrete structure of , what we can say for any point is that , which is the best result we can hope for our test, as we do not assume any concrete structure in the loss except their smoothness. Such an estimation cannot hold trivially for a nonsmooth function. Consider and . For any and , we have .
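The failure of the smooth-style estimate can be seen numerically on the absolute value function. The following sketch (an illustration added for this note) runs a subgradient method with step sizes `1/(t+1)` on `f(x) = |x|`. The starting point `1426/1009` and the use of exact rational arithmetic are choices made here so that the iterate provably never lands exactly on the kink within 1000 steps, because the prime `1009` cannot divide the denominator of any signed sum of `1/k` with `k` up to 1000.

```python
from fractions import Fraction
from typing import List, Tuple


def subgradient_abs(x: Fraction) -> int:
    """One Clarke subgradient of |x|; any value in [-1, 1] is valid at 0."""
    if x > 0:
        return 1
    if x < 0:
        return -1
    return 0


def run_subgradient_method(x0: Fraction, num_steps: int) -> Tuple[Fraction, List[int]]:
    """Subgradient method with diminishing steps 1 / (t + 1)."""
    x = Fraction(x0)
    subgradient_norms = []
    for t in range(num_steps):
        g = subgradient_abs(x)
        subgradient_norms.append(abs(g))
        x -= Fraction(g, t + 1)
    return x, subgradient_norms


x_final, norms = run_subgradient_method(Fraction(1426, 1009), 1000)
# The iterate ends within 1/1000 of the minimizer 0, but every computed
# subgradient along the way has norm exactly 1.
```

The iterates approach the stationary point while the observed subgradient norm never decreases, so a termination rule based on a small subgradient norm at the current iterate would never fire. This is the gap that a near-approximate stationarity test is meant to close.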
5.2.1 Testing Clarke NAS
We define two constants that will be used in the analysis.
Definition 26 (Clarke).
Given a point with a Euclidean norm , we define the following constants about the separation and curvature of pieces around this point:
- Separation:
- Curvature: . (Footnote 2: See Section D.1 for the exact value.)
Remark 27.
If for any and , it holds , then we define the separation constant , as in the optimization of extended-real-valued functions, . It is notable that, while the separation constant is usually unknown when running the testing algorithm, the curvature constant can be easily estimated when the candidate network and the radius are given.
Theorem 28 (Robust Clarke test).
Let an -Clarke stationary point satisfying SQ be given. For any and any
[TABLE]
if the output point of Algorithm 3 satisfies SQ, then we have
[TABLE]
In Theorem 28, we show that for a point that is sufficiently close to an -Clarke stationary one, and a properly chosen parameter , one can correctly certify the near-approximate stationarity of this point in the style as if the function were smooth by calling Algorithm 4 with . A natural question here is how to choose a proper parameter , as the separation constant is usually unknown. It turns out that a simple line search will work for that.
Remark 29 (Line search).
Set the initial value of radius to, say, . Then, in the -th iteration, run Algorithm 4 with parameter and set . Note that for a sufficiently small , the rounding scheme in Algorithm 3 becomes superfluous, as for any and such that , we have for a small . Therefore, we can stop the line search within at most
[TABLE]
iterations. It is immediate that, if $(u_{1},\bm{w}_{1},\dots,u_{H},\bm{w}_{H})\in\mathbb{B}_{C_{\tau}^{\textnormal{Clarke}}/2}\big((u_{1}^{*},\bm{w}_{1}^{*},\dots,u_{H}^{*},\bm{w}_{H}^{*})\big)$, then there exists a radius in the iteration sequence such that
[TABLE]
This search scheme also works for the Fréchet NAS test and we will not repeat that.
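The line search in Remark 29 can be sketched as a simple shrinking loop. This is a minimal illustration under stated assumptions: the geometric shrinking factor, the iteration cap, and the callback `run_test` (a hypothetical stand-in for running the robust test at a given radius) are choices made for this example and are not taken from the paper.

```python
def line_search(run_test, delta0=1.0, shrink=0.5, max_iters=50):
    """Shrink the rounding radius until the test certifies, or give up.

    run_test(delta) -> bool is a hypothetical stand-in for calling the
    robust test (rounding followed by an exact stationarity check).
    Returns (iteration index, radius) on success and None otherwise.
    """
    delta = delta0
    for t in range(max_iters):
        if run_test(delta):
            return t, delta
        delta *= shrink  # assumed schedule: geometric decrease of the radius
    return None
```

Because the exact test has no false positives, stopping at the first radius that certifies is safe; the iteration cap only bounds the work spent on a point that is not near any stationary point.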
5.2.2 Testing Fréchet NAS
Unlike the Clarke case, we need the following extra nondegeneracy condition on to identify the pattern of and avoid the Fréchet subdifferential being empty.
**Assumption 30.**
Given a point , we assume that for any such that , we have .
The following two constants will be used in the analysis.
**Definition 31 (Fréchet).**
Given a point with a Euclidean norm , we define two constants concerning the separation and curvature of pieces around this point:
- Separation:
- Curvature: . (See Section D.2 for the exact value.)
Then, for the Fréchet NAS test, we have the following result similar to Theorem 28.
**Theorem 32 (Robust Fréchet test).**
Let an -Fréchet stationary point satisfying SQ be given. For any and any
[TABLE]
if the output point of Algorithm 5 satisfies SQ, then we have
[TABLE]
Appendix A Abs-Normal Form of Piecewise Differentiable Functions
We briefly review the abs-normal representation of a subclass of piecewise differentiable functions. See Griewank (2013); Griewank and Walther (2016) for details.
A.1 The General Framework
The abs-normal representation (Griewank, 2013) is a piecewise linearization scheme concerning a certain subclass of piecewise differentiable functions in the sense of Scholtes (2012). In this subclass, functions are defined as compositions of smooth functions and the absolute value function. By the identities and , compositions with these nonsmooth elemental functions can also be represented in the abs-normal form.
Let be a function in such a subclass. By numbering all inputs to the absolute value functions in the evaluation order as “switching variables” for , the function can be written in the following abs-normal form:
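As a quick sanity check of how other nonsmooth elemental functions reduce to the absolute value, the sketch below verifies the standard identities for max and min, with ReLU as the special case max(t, 0). It assumes that these standard identities are the ones intended here; the function names are hypothetical.

```python
def max_via_abs(a, b):
    """max(a, b) written with a single absolute value."""
    return (a + b + abs(a - b)) / 2


def min_via_abs(a, b):
    """min(a, b) written with a single absolute value."""
    return (a + b - abs(a - b)) / 2


def relu_via_abs(t):
    """ReLU as the special case max(t, 0)."""
    return max_via_abs(t, 0.0)
```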
[TABLE]
where , the smooth mapping , and the smooth function . As the numbering of is in the evaluation order, is a function of only if . In sum, we have
[TABLE]
where a successive evaluation of with given . To see such an evaluation of is well-defined, note that and for any ,
[TABLE]
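The successive evaluation of the switching variables can be sketched as follows. The example function y = | |x1| - x2 | and all names are assumptions chosen for illustration; the only structural point taken from the text is that each switching variable may depend on the absolute values of earlier switching variables only.

```python
def eval_abs_normal(switch_fns, out_fn, x):
    """Evaluate y = f(x, |z|) with z_i = F_i(x, |z_1|, ..., |z_{i-1}|)."""
    abs_z = []
    for F_i in switch_fns:
        z_i = F_i(x, abs_z)      # reads only the entries computed so far
        abs_z.append(abs(z_i))
    return out_fn(x, abs_z)


# Hypothetical example: y = | |x1| - x2 | with two switching variables.
switch_fns = [
    lambda x, a: x[0],           # z_1 = x1
    lambda x, a: a[0] - x[1],    # z_2 = |z_1| - x2
]
out_fn = lambda x, a: a[1]       # y = |z_2|
```

Evaluating in list order is well-defined precisely because of the triangular dependence: when `F_i` is called, every absolute value it may read has already been computed.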
We remark that, similar to the Difference of Convex (DC) decomposition in DC programming, the function may have many different abs-normal decompositions. The following vectors and matrices are useful when studying the function in abs-normal form:
[TABLE]
For any , we will denote by . Let us define (see also (Griewank and Walther, 2016, Equation (11)))
[TABLE]
which will play a key role in the definition of LIKQ (see Definition 54).
A.2 Abs-Normal Form of Shallow ReLU Networks
We rewrite the empirical loss of the shallow ReLU network with absolute value functions as
[TABLE]
Then, as there are absolute value evaluations in total, we define the switching variable and the smooth mapping as
[TABLE]
The smooth function in the abs-normal form can be written as
[TABLE]
where . Consequently, the matrix , which implies the function is “simply switched” in the sense of Griewank and Walther (2016). For the matrix and any , the -th row of can be written as
[TABLE]
Appendix B Proofs for Section 3
B.1 The Problems
**Problem 33 (3SAT).**
Given a collection of clauses on Boolean variables such that clause is limited to a disjunction of at most three literals for any . Let the following formula of in conjunctive normal form be given
[TABLE]
Is there an satisfying ?
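For concreteness, a brute-force decision procedure for 3SAT is sketched below. The encoding of literals as signed integers and the function name are assumptions made for this example. The enumeration takes time exponential in the number of variables, which is consistent with 3SAT being NP-complete.

```python
from itertools import product


def is_satisfiable(clauses, num_vars):
    """Brute-force 3SAT check.

    Each clause is a tuple of nonzero ints: +i stands for variable x_i
    and -i for its negation (an encoding assumed for this sketch).
    """
    for bits in product([False, True], repeat=num_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False
```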
**Problem 34 (Piecewise Linear Test, PLT).**
Suppose and the input data be given. Let us define a function as
[TABLE]
Is there a vector satisfying and
[TABLE]
Its complement is given by
[TABLE]
**Problem 35 (Neural Network Test, NNT).**
Suppose . Let the input data $\bm{Y}=\left[\begin{array}{c|c|c}\bm{y}_{1}&\cdots&\bm{y}_{3n}\end{array}\right]\subseteq\mathbb{Z}^{m\times 3n}$ be given. Let us define as
[TABLE]
Is an -Fréchet stationary point of , i.e., $\textnormal{dist}\big(\bm{0},\widehat{\partial}f_{\textsf{NNT}}(-\mathbf{1}_{3n},\bm{0}_{m})\big)\leqslant\varepsilon$?
**Problem 36 (Abs-Normal Form Test, ANFT).**
Suppose a piecewise linear function is given in the abs-linear form with vectors and matrices . Is there a definite signature vector such that the following system with respect to is incompatible
[TABLE]
B.2 Hardness of Piecewise Linear Test
**Lemma 37.**
*Problem 34 (PLT) is co-NP-hard.*
Proof.
We have to show that is NP-hard. 3SAT in Problem 33 is known to be strongly NP-complete (Garey and Johnson, 1979). We give a polynomial-time reduction from 3SAT to . Given any instance of 3SAT, we get clauses for . We will refer to literals in by their positions. For example, given , we say the literal occurs in at position , the literal occurs in at position , and the literal occurs in at position . We construct the data as follows
[TABLE]
Note the following positive -homogeneous function in the construction of PLT
[TABLE]
Suppose that for any , there exists such that . We will exhibit an such that the given 3SAT is satisfied. Let and there exists such that . For any , let
[TABLE]
We show . By , we get for any
[TABLE]
which implies that there exists a such that . Let the index of the Boolean literal that occurs in at position be . Now we consider two cases. If occurs in at position , then . We get . So, by definition, which implies . Otherwise, if occurs in at position , then . We get . So, by definition, which implies . This shows that and the given 3SAT instance is satisfied.
Conversely, we show that if there exists a vector such that and , then 3SAT cannot be satisfied. Suppose to the contrary that there exists such that . For any , let
[TABLE]
As , for any , there exists a literal of clause that is satisfied. Let the index of this literal be and the position of it in be . We consider two cases. If literal occurs in at position , then . As due to literal , we get and by definition. Then, for such , we get
[TABLE]
Otherwise, if literal occurs in at position , then . As due to literal , we get and by definition. Then, for any , we get
[TABLE]
This gives
[TABLE]
a contradiction. Hence Problem 34 is co-NP-hard. ∎
While it is not clear whether Problem 34 with a positive is an element of the complexity class co-NP, we show that, when , Problem 34 is in co-NP.
**Lemma 38.**
If , then Problem 34 is in the complexity class co-NP.
Proof.
For , we only need to test . Given any , checking whether can be done in time. If the answer to Problem 34 is yes, by homogeneity in , there exist a direction and a vector such that and for any . There are only elements in the set and all resulting $\left[\begin{array}{c|c|c}\bm{y}_{s_{1}}&\cdots&\bm{y}_{3n-3+s_{n}}\end{array}\right]$ are integer matrices of polynomial length relative to the input size of Problem 34. So the certificate can be obtained by solving a linear program in polynomial time. Therefore, if there exists such that , then a nondeterministic algorithm can find and satisfying in polynomial time. Thus, Problem 34 with is an element of the complexity class co-NP. ∎
Proof of 10.
We first note that Problem 34 can be written in the standard max-min form in polynomial time by the following elementary identity:
[TABLE]
Besides, it holds . By 3, we know $\textnormal{dist}\big(\bm{0},\widehat{\partial}f_{\textsf{PLT}}(\bm{0})\big)\leqslant\varepsilon$ if and only if there exists a vector satisfying and , which is the definition of Problem 34. Note that if , in the reduction from 3SAT in the proof of Lemma 37, all numerical parameters are bounded by a polynomial in the input size. The proof is completed by Lemma 37. ∎
B.3 Hardness of Abs-Normal Form Test
Proof of 11.
We first show that PLT in Problem 34 can be written in the abs-normal form in polynomial time. For ease of notation, let for any . Then, we can rewrite every in the abs-linear form as
[TABLE]
Note that the function can be expressed as
[TABLE]
which can be written in abs-normal form as
[TABLE]
In sum, we have
[TABLE]
Then, we know
[TABLE]
Then, the matrices can be computed in polynomial time.
We note that if and only if the function is first-order minimal in abs-normal form; this is shown in the discussion below (Griewank and Walther, 2019, Equation (2)) (see also (Griewank and Walther, 2016, p3)). Then, the answer of ANFT in Problem 36 for the abs-normal form of is No if and only if is a Fréchet stationary point of . Then, by Lemma 37, ANFT in Problem 36 is NP-hard. To see ANFT is in NP, for any given , the computation of the vector $\bm{a}^{\top}+\bm{b}^{\top}\big(\mathop{\textnormal{Diag}}(\bm{\sigma})-\bm{L}\big)^{-1}\bm{Z}$ and the matrix $\big(\mathop{\textnormal{Diag}}(\bm{\sigma})-\bm{L}\big)^{-1}\bm{Z}$ can be done in polynomial time. Then, ANFT for a given reduces to checking the infeasibility of a linear system, which is in P. In sum, we have shown that ANFT in Problem 36 is NP-complete, which implies that a general test of FOM without kink qualification in (Griewank and Walther, 2019, Theorem 4.1) is co-NP-complete. ∎
B.4 Hardness of Neural Network Test
**Lemma 39.**
*Problem 35 (NNT) is co-NP-hard. If , Problem 35 is co-NP-complete.*
Proof.
We first prove that is an -Fréchet stationary point of if and only if there exists such that with the same input data . By (Rockafellar and Wets, 2009, Exercise 8.4) and the fact that is B-differentiable (see (Cui and Pang, 2021, Definition 4.1.1)), we get $\textnormal{dist}\big(\bm{0},\widehat{\partial}f_{\textsf{NNT}}(-\mathbf{1}_{3n},\bm{0}_{m})\big)\leqslant\varepsilon$ if and only if there exists such that
[TABLE]
Using the chain rule for directional derivatives of B-differentiable functions (Cui and Pang, 2021, Proposition 4.1.2(a)), we have
[TABLE]
For any , consider and . We get that Section B.4 holds if and only if and , which completes the proof by the co-NP-hardness of Problem 34 in Lemma 37 and the co-NP-completeness if in Lemma 38. ∎
Proof of 12.
Note that Problem 35 can be represented by the empirical loss of a convolutional neural network with and architecture
[TABLE]
where and $\bm{Y}=\left[\begin{array}{c|c|c}\bm{y}_{1}&\cdots&\bm{y}_{3n}\end{array}\right]\subseteq\mathbb{Z}^{m\times 3n}$. If , in the reduction from 3SAT to PLT, then to NNT, all numerical parameters are bounded by a polynomial in the input size. The proof is completed by Lemma 39. ∎
Appendix C Proofs for Section 4
C.1 Proof Roadmap
Recall the loss function of shallow ReLU neural network:
[TABLE]
Set constants for any . Let us first consider a partially linearized loss function defined by
[TABLE]
By exploiting the smoothness of and a Lagrange scalarization technique in Theorem 46, we will show that
[TABLE]
Then, we focus on the linearized . By separation of and using again the Lagrange scalarization technique in the form of Corollary 48, we have
[TABLE]
where (a) is due to (Rockafellar, 1985, Proposition 2.5) and (Rockafellar and Wets, 2009, Proposition 10.5); (b) is by Corollary 48. Therefore, it holds
[TABLE]
which implies that the validity of the exact chain rule of relies on a careful study of . In particular, if we have the exact chain rule for any as follows
[TABLE]
then we get the validity of the exact chain rule for . That is
[TABLE]
To prove Equation 1, we need a fine-grained analysis of . First, we isolate the nonsmooth part by rewriting
[TABLE]
where we define
[TABLE]
What remains is to study the subdifferential of this non-separable piecewise linear function for any and figure out conditions under which
[TABLE]
This will be done in Section C.4.
C.2 Technical Lemmas
**Lemma 40 (Gordan, cf. (Bertsimas and Tsitsiklis, 1997, Exercise 4.26)).**
Let be given. Then, exactly one of the following statements is true:
- There exists an such that .
- There exists a such that with .
**Lemma 41.**
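The two alternatives can be illustrated with small certificate checkers. The sketch below assumes the common form of Gordan's theorem for a matrix A with rows a_1, ..., a_m: either some x satisfies <a_i, x> < 0 for all i, or some nonzero y >= 0 satisfies sum_i y_i a_i = 0. The sign convention and normalization in the lemma may differ, and the function names are hypothetical.

```python
def certifies_strict(A, x):
    """True if every row a of A satisfies <a, x> < 0."""
    return all(sum(a_j * x_j for a_j, x_j in zip(a, x)) < 0 for a in A)


def certifies_combination(A, y, tol=1e-12):
    """True if y >= 0, y != 0, and sum_i y_i * A[i] is the zero vector."""
    if any(y_i < 0 for y_i in y) or all(y_i == 0 for y_i in y):
        return False
    n = len(A[0])
    return all(abs(sum(y_i * a[j] for y_i, a in zip(y, A))) <= tol
               for j in range(n))


# Two tiny instances, one for each alternative.
A_first = [[1.0, 0.0], [0.0, 1.0]]     # x = (-1, -1) certifies the first alternative
A_second = [[1.0, 0.0], [-1.0, 0.0]]   # y = (1, 1) certifies the second alternative
```

The checkers only verify a given certificate; finding one in general amounts to solving a linear program.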
Let be sets in . Suppose further that is convex and closed, and is nonempty and bounded. If the strict inclusion holds, then we can assert .
Proof.
Let . The claim is trivial when . Choose and set . As is closed, the following is well-defined
[TABLE]
Let . As and is closed, we know . By the optimality condition and convexity of , we know , which implies . As is bounded, we know . Let be
[TABLE]
where . We claim . Suppose not. Therefore, there exist such that . However, we compute
[TABLE]
which gives the contradiction. ∎
**Remark 42.**
Though the claim seems straightforward, Lemma 41 is indeed non-trivial. We record the following counterexamples when different conditions are removed.
- is empty: .
- is unbounded: if and are nonempty, then .
- is nonconvex: if , then .
- is not closed: if , then .
**Lemma 43.**
Let be linearly independent. Define a convex set . For any , the point is an extreme point of .
Proof.
Suppose not and with and . We know and for any by definition. Thus, it holds
[TABLE]
As are linearly independent, we know that, for any , it holds . If , we have . Meanwhile, we know if . Therefore, it holds , a contradiction. ∎
**Lemma 44.**
Let a function be . If there exists such that and , then we have .
Proof.
Suppose not and let . We write
[TABLE]
where we define . Then, by (Rockafellar and Wets, 2009, Exercise 8.8(c)), we have
[TABLE]
Let and we know . By (Rockafellar and Wets, 2009, Exercise 8.4), for any , it holds
[TABLE]
Let and we know . Thus, . Let be any such that and . Then, we have
[TABLE]
a contradiction. ∎
**Definition 45 (Bouligand subdifferential, cf. (Cui and Pang, 2021, Definition 4.3.1)).**
Given a point , the Bouligand subdifferential of a locally Lipschitz function at is defined by
[TABLE]
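As a small illustration, the sketch below collects limits of gradients of |x| along sequences of differentiable points converging to 0, which yields the set {-1, +1}. The example function and the helper names are assumptions for this sketch, not part of the paper.

```python
def grad_abs(x):
    """Gradient of |x| at a point of differentiability (x != 0)."""
    if x == 0:
        raise ValueError("|x| is not differentiable at 0")
    return 1.0 if x > 0 else -1.0


def bouligand_abs_at_zero(num_scales=20):
    """Collect limits of gradients along sequences x_k -> 0 from both sides."""
    limits = set()
    for sign in (-1.0, 1.0):
        grads = [grad_abs(sign * 2.0 ** (-k)) for k in range(1, num_scales + 1)]
        # The gradient sequence is constant on each side, so its last
        # entry equals its limit.
        limits.add(grads[-1])
    return limits
```

The convex hull of this two-point set is the interval [-1, 1], which is the Clarke subdifferential of |x| at 0.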
C.3 Partial Linearization via Lagrange Scalarization
The following theorem is a powerful and general principle.
**Theorem 46 (Partial linearization).**
Let a point and a locally Lipschitz be given in the form of a composition , where the gradient of is locally Lipschitz near and is locally Lipschitz near . Suppose and are directionally differentiable. Then, we have
[TABLE]
Proof.
Let the partially linearized at be defined as
[TABLE]
For the limiting subdifferential version, the claim directly follows from a margin function chain rule (Mordukhovich and Shao, 1996, Theorem 6.5). The Clarke subdifferential version directly follows from the relation between Clarke and limiting subdifferentials (Rockafellar and Wets, 2009, Theorem 8.49) and (Mordukhovich and Shao, 1996, Theorem 6.5). However, as the proof of (Mordukhovich and Shao, 1996, Theorem 6.5) uses a perturbation argument to approximate with -Fréchet subdifferential, the machinery is somewhat complicated. Here we give an elementary proof for the Clarke version from the primal perspective using tools from convex analysis. We show for any . Note that the Clarke generalized subderivative can be written as
[TABLE]
where the difference quotient function of at and direction is defined by
[TABLE]
We assume is -smooth near and is -Lipschitz near . We will use the following estimation (see (Nesterov, 2003, Lemma 1.2.3)) if is -smooth at :
[TABLE]
To prove , we compute as follows
[TABLE]
Therefore, for any , we know
[TABLE]
where in we use . For the converse direction , we just compute similarly. We have proved . The claim follows from the correspondence between sublinear and convex (Clarke, 1990, Proposition 2.1.5).
Now we show the relation holds for the Fréchet subdifferential. As and are locally Lipschitz and directionally differentiable, they are Bouligand-differentiable (B-differentiable) according to (Cui and Pang, 2021, Definition 4.1.1). Then, by (Cui and Pang, 2021, Proposition 4.1.2(a)), we know that
[TABLE]
where the directional derivative is defined element-wise as according to (Cui and Pang, 2021, Definition 1.1.4). Thus, combined with $\bar{f}^{\prime}(\bm{x};\bm{d})=\left\langle\nabla h\big(G(\bm{x})\big),G^{\prime}(\bm{x};\bm{d})\right\rangle$, we have shown for any , which implies
[TABLE]
by (Rockafellar and Wets, 2009, Exercise 8.4) (note that for B-differentiable , the subderivative in (Rockafellar and Wets, 2009, Exercise 8.4) is equal to the directional derivative by (Rockafellar and Wets, 2009, Exercise 9.15)). ∎
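The equality of directional derivatives used in this proof can be checked numerically on a toy instance. The sketch below assumes h(u) = (u - 1)^2 and G(x) = |x| at the point x = 0; these choices and the helper names are illustrative only. It compares one-sided difference quotients of the composition h(G(x)) and of the partially linearized function y -> h'(G(0)) * G(y).

```python
def h(u):
    """Smooth outer function (assumed example)."""
    return (u - 1.0) ** 2


def dh(u):
    """Derivative of h."""
    return 2.0 * (u - 1.0)


def G(x):
    """Nonsmooth inner function (assumed example)."""
    return abs(x)


def dir_deriv(func, x, d, t=1e-6):
    """One-sided difference quotient approximating func'(x; d)."""
    return (func(x + t * d) - func(x)) / t


def linearized(x0):
    """Partially linearized function y -> h'(G(x0)) * G(y)."""
    slope = dh(G(x0))
    return lambda y: slope * G(y)
```

On this instance both directional derivatives at 0 equal -2|d|, so the two functions agree to first order at the point even though neither is differentiable there.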
**Remark 47.**
*Theorem 46 is fundamentally different from the classic exact chain rule, as the exact chain rule does not hold even for very simple functions. Consider and . We have . In contrast, by Theorem 46, we have . One should compare Theorem 46 with (Clarke, 1990, Theorem 2.3.9, Theorem 2.3.10). Besides, Theorem 46 implies (Clarke, 1990, Theorem 2.3.9(ii)).*
**Corollary 48.**
Let be , where is a Lipschitz function. Then, we have .
Proof.
Let be . It is easy to see is smooth at any . Let and . As , by Theorem 46, we know
[TABLE]
as required. ∎
C.4 Exact Chain Rule of a Non-Separable Piecewise Linear Function
In this section, we consider the validity of the exact subdifferential chain rule of a simple piecewise-linear function, which is defined by
[TABLE]
C.4.1 Chain Rule for Clarke Subdifferential
**Theorem 49 (Clarke).**
Suppose for any . We have the exact Clarke subdifferential chain rule
[TABLE]
if and only if $\textnormal{span}\big(\left\{\bm{x}_{i}\right\}_{i=1}^{n}\big)\cap\textnormal{span}\big(\left\{\bm{y}_{j}\right\}_{j=1}^{m}\big)=\{\bm{0}\}$.
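The span condition is easy to test in practice, because two subspaces intersect only at the origin exactly when the rank of the stacked vectors equals the sum of the individual ranks. The sketch below uses exact rational arithmetic to avoid tolerance issues; the function names are hypothetical and the routine is an illustration, not the paper's algorithm.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a list of vectors via exact Gaussian elimination."""
    M = [[Fraction(v) for v in row] for row in rows]
    r = 0
    ncols = len(M[0]) if M else 0
    for c in range(ncols):
        pivot = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, len(M)):
            factor = M[i][c] / M[r][c]
            M[i] = [a - factor * b for a, b in zip(M[i], M[r])]
        r += 1
    return r


def spans_intersect_trivially(xs, ys):
    """True iff span(xs) and span(ys) share only the zero vector."""
    return rank(list(xs) + list(ys)) == rank(list(xs)) + rank(list(ys))
```

For floating-point data one would typically replace the exact elimination with a numerically stable rank estimate, at the cost of choosing a tolerance.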
Proof.
We have divided the proof into Lemma 50 and Lemma 51. ∎
**Lemma 50 (Necessary).**
If there exists such that
[TABLE]
then
Proof.
We first prove that assuming certain regularity on and is without loss of generality. Let the index set be a selection from such that are linearly independent and satisfy
[TABLE]
Similarly, we define for . Then, we write
[TABLE]
where
[TABLE]
By the fuzzy sum rule (Clarke, 1990, Proposition 2.3.3), we know
[TABLE]
where we define . Thus, to prove , by Lemma 41 and (Clarke, 1990, Proposition 2.1.2(a)), we only need to show . So, by abuse of notation and focusing on , we assume are linearly independent. Similarly, we assume are linearly independent. In the following, we may use Lemma 41 and the above argument implicitly to assume regularity for simplicity. As $\bm{v}\in\textnormal{span}\big(\left\{\bm{x}_{i}\right\}_{i=1}^{n}\big)\cap\textnormal{span}\big(\left\{\bm{y}_{j}\right\}_{j=1}^{m}\big)$, we write
[TABLE]
It is safe to assume . Fix . We can further assume are linearly independent. To see this, suppose to the contrary that are not linearly independent. Then we get . We know that there exist such that , as otherwise by linear independence of , for any , it holds that , hence that are linearly independent. As , we have
[TABLE]
Plug in to Equation 2 and is removed. Repeat this procedure and, by abuse of notation, we have that are linearly independent. After that, we examine and . We remove if and remove if , which is without loss of generality by Lemma 41. It is possible that all are removed and we get and $\bm{y}_{m}\in\textnormal{span}\big(\{\bm{x}_{i}\}_{i=1}^{n}\big)$. But as , we always have . Then, we can write
[TABLE]
with for any . Note that, for such and , we have the exact chain rule
[TABLE]
by using (Clarke, 1990, Theorem 2.3.10) and linear independence. We proceed to show that by exhibiting an element in . Let and we define
[TABLE]
Note that
[TABLE]
By Gordan’s Theorem in Lemma 40, we have certified the nonexistence of direction such that
[TABLE]
By , similarly, we certify the nonexistence of direction such that
[TABLE]
Let the Bouligand subdifferential of at be ; see (Cui and Pang, 2021, Definition 4.3.1). Define
[TABLE]
By (Cui and Pang, 2021, Proposition 4.4.8(c)) and the nonexistence of for (4) and (5), we have proved that . Let us define a set
[TABLE]
Besides, using (Cui and Pang, 2021, Proposition 4.4.8(c)), we have . Then, with (Rockafellar and Wets, 2009, Theorem 9.61), it follows that
[TABLE]
Therefore, to prove , we only need to show
[TABLE]
To this end, we define two sets satisfying as
[TABLE]
Thus, we can write . It is evident that . If , we have
[TABLE]
We now show that it must be by considering three cases:
Case 1.
. Without loss of generality, we assume . Note that for any , using the representation of in Equation 3, we have
[TABLE]
where . Similarly, we write as
[TABLE]
where . Therefore, we know
[TABLE]
As are linearly independent, it holds
[TABLE]
If , we have
[TABLE]
which gives the contradiction.
Case 2.
but . Suppose . Then, we write
[TABLE]
Note that are linearly independent. By abuse of notation and swapping and , we still write . Then, we have and the situation reduces to the Case 1.
Case 3.
. In that case, we have . By a manipulation similar to that in Case 1, we have
[TABLE]
As are linearly independent, it holds
[TABLE]
If , we have
[TABLE]
which gives the contradiction.
Therefore, we have shown which implies . However, as are linearly independent, is an extreme point of by Lemma 43. Thus, we know by definition, a contradiction. ∎
**Lemma 51 (Sufficient).**
If the following condition holds
[TABLE]
then
Proof.
We first do a general preparation that will be reused in other developments. Let $\bm{X}=\left[\begin{array}{c|c|c}\bm{x}_{1}&\cdots&\bm{x}_{n}\end{array}\right]\in\mathbb{R}^{d\times n}$ and $\bm{Y}=\left[\begin{array}{c|c|c}\bm{y}_{1}&\cdots&\bm{y}_{m}\end{array}\right]\in\mathbb{R}^{d\times m}$ be given. The thin SVD of can be written as with , and . Similarly, for , we have with , and . As $\textnormal{span}\big(\left\{\bm{x}_{i}\right\}_{i=1}^{n}\big)\cap\textnormal{span}\big(\left\{\bm{y}_{j}\right\}_{j=1}^{m}\big)=\{\bm{0}\}$, we know . Therefore, we can write
[TABLE]
where and $\bm{U}\coloneqq\left[\begin{array}{c|c}\bm{U}_{x}&\bm{U}_{y}\end{array}\right]\in\mathop{\textnormal{St}}(d,r_{x}+r_{y})$.
Let an auxiliary function be
[TABLE]
As is separable with respect to and , by (Rockafellar, 1985, Proposition 2.5) and (Rockafellar and Wets, 2009, Proposition 10.5), we know
[TABLE]
Note that . We compute
[TABLE]
where (a) uses (Clarke, 1990, Theorem 2.3.10), (Rockafellar and Wets, 2009, Exercise 10.7), and the fact that has full column rank; (b) is from ; (c) uses the reasoning in (a) for separately.
In particular for Clarke subdifferential, we know using (Clarke, 1990, Proposition 2.3.1). As is convex, is equal to the convex subdifferential of by (Clarke, 1990, Proposition 2.2.7). Then, by (Hiriart-Urruty and Lemaréchal, 2004, §D, Corollary 4.3.2), a direct computation gives
[TABLE]
as required. ∎
Proof of 14.
According to the argument in Section C.1, we only need to consider the Clarke subdifferential for every . It is shown in Theorem 49 that we have
[TABLE]
if and only if the following span qualification is satisfied:
[TABLE]
Then, putting all cases together, 14 is proved. ∎
C.4.2 Chain Rule for Limiting Subdifferential
**Theorem 52 (Limiting).**
Suppose and for any . We have the exact limiting subdifferential chain rule
[TABLE]
if and only if $\textnormal{span}\big(\left\{\bm{x}_{i}\right\}_{i=1}^{n}\big)\cap\textnormal{span}\big(\left\{\bm{y}_{j}\right\}_{j=1}^{m}\big)=\{\bm{0}\}$.
Proof.
(Sufficient) We begin with the general argument in the proof of Lemma 51 until Equation . After that, we will focus on the proof of
[TABLE]
For ease of notation, we denote . Note that by the definition of limiting subdifferential (see 4), we have
[TABLE]
Let . Then, there exist and such that and . We can assume for any and any , we have , as otherwise, by Lemma 44, and is undefined. Then, for any , the function is strictly differentiable at , which implies
[TABLE]
As is a finite set, it is trivially closed with the usual Euclidean metric. We have . For the reverse direction, let . Then, there exists such that
[TABLE]
with for any . Let . We get for any . Then, we know the function is strictly differentiable at and . Thus, for any , we get . Consequently, we get and .
(Necessary) Suppose $\bm{0}\neq\bm{v}\in\textnormal{span}\big(\left\{\bm{x}_{i}\right\}_{i=1}^{n}\big)\cap\textnormal{span}\big(\left\{\bm{y}_{j}\right\}_{j=1}^{m}\big)$ and
[TABLE]
Then, by taking the convex hull on both sides and using (Rockafellar and Wets, 2009, Theorem 8.49), we get
[TABLE]
which is a contradiction to Lemma 50. ∎
Proof of 16.
According to the argument in Section C.1, we only need to consider the limiting subdifferential for every . It is shown in Theorem 52 that we have
[TABLE]
if and only if the following span qualification is satisfied:
[TABLE]
Then, putting all cases together, 16 is proved. ∎
C.4.3 Chain Rule for Fréchet Subdifferential
**Theorem 53 (Fréchet).**
Suppose and for any . For any given such that , we have the following exact chain rule
[TABLE]
if and only if $\textnormal{span}\big(\left\{\bm{x}_{i}\right\}_{i=1}^{n}\big)\cap\textnormal{span}\big(\left\{\bm{y}_{j}\right\}_{j=1}^{m}\big)=\{\bm{0}\}$.
Proof.
(Sufficient) We begin with the general argument in the proof of Lemma 51 until Equation . We will focus on the proof of
[TABLE]
For ease of notation, we denote . Then, by Lemma 44, we know that if there exists such that and , then we have . If , then and . The claim follows trivially.
(Necessary) Suppose $\bm{0}\neq\bm{v}\in\textnormal{span}\big(\left\{\bm{x}_{i}\right\}_{i=1}^{n}\big)\cap\textnormal{span}\big(\left\{\bm{y}_{j}\right\}_{j=1}^{m}\big)$. There exists as otherwise $\bm{v}\notin\{\bm{0}\}\supseteq\textnormal{span}\big(\left\{\bm{y}_{j}\right\}_{j=1}^{m}\big)$. Then, we get and . Thus, from the assumption that , we know by definition. ∎
Proof of 17.
According to the argument in Section C.1, we only need to consider the Fréchet subdifferential for every . It is shown in Theorem 53 that we have
[TABLE]
if and only if the following span qualification is satisfied:
[TABLE]
Then, putting all cases together, 17 is proved. ∎
C.5 Proofs for Section 4.3
**Definition 54 (Regularities).**
We consider the following regularity conditions:
- General position data (Yun et al., 2018, Assumption 2): No data points lie on the same affine hyperplane, which is equivalent to the nonexistence of and index set with such that for any .
- Linear Independence Kink Qualification (LIKQ) (Griewank and Walther, 2016, Definition 2), (Griewank and Walther, 2019, Definition 2.6): Let the -th row of the matrix in Appendix A be . We define the following index set
  [TABLE]
  LIKQ is satisfied if the vectors are linearly independent.
- Linearly Independent Activated Data (LIAD): Let the index set . For any fixed , the data points are linearly independent.
Proof of 19.
For the relation general position LIAD, it directly follows from (Yun et al., 2018, Lemma 1). By the analysis in Section A.2, we know LIKQ is satisfied for the empirical loss of two-layer ReLU network if and only if
[TABLE]
are linearly independent. It is easy to see that LIKQ holds if and only if, for any given , the data points are linearly independent. Thus, we have the relation LIAD LIKQ. Note that . If are linearly independent, then it is evident that
[TABLE]
which implies LIAD SQ. ∎
Proof of 23.
Under SQ, if the Fréchet subdifferential is nonempty, we get for any . By 14 and 17, we have and are equal at that point. Then, Clarke regularity follows from 7. By 19, if the data points are in general position, then they satisfy SQ. Using (Rockafellar and Wets, 2009, Theorem 10.1), the Fréchet subdifferential is nonempty at every local minimizer, which completes the proof. ∎
Appendix D Proofs for Section 5
D.1 Testing Clarke NAS
Proof of 28.
We consider an -Clarke stationary point with
[TABLE]
By 14, we know there exists such that
[TABLE]
Note that for any in the returned vector of Algorithm 3. In this subsection, we will write rather than for simplicity. Given a positive radius , we aim to show that, for any
[TABLE]
we can certify that the rounded point returned by Algorithm 3 satisfies
[TABLE]
where is a constant depending on the curvature that we will discuss later.
We define the following shorthands for convenience
[TABLE]
Recall the definition of the rounded , and define the index sets as
[TABLE]
We consider the following quantity related to the point for any :
[TABLE]
Note that . For any such that , we have
[TABLE]
Thus, we know . Similarly, for any such that , we have
[TABLE]
which implies . We have, for any such that , it holds
[TABLE]
So, we know . As are disjoint and , we know
[TABLE]
Meanwhile, as is feasible to the quadratic program in Algorithm 3, we get
[TABLE]
which implies and for any . It is evident that
[TABLE]
as, for any , is feasible to the quadratic program for computing in Algorithm 3. Therefore, we know
[TABLE]
By triangle inequality, it holds that
[TABLE]
Using 14, we get
[TABLE]
where we define . We first compute
[TABLE]
We now upper bound the second term . Note that
[TABLE]
Now we estimate these two terms. For , we compute
[TABLE]
For , we see that
[TABLE]
Summarizing, we have
[TABLE]
We proceed to upper bound $\sum_{k=1}^{H}\textnormal{dist}\Big(\bm{g}^{*}_{k},\partial_{C}\overline{L}_{k}(\widehat{\bm{w}}_{k})\Big)$.
By 14, we know that there exist such that the Clarke subgradient can be written as
[TABLE]
Now, we are well prepared to upper bound $\textnormal{dist}\big(\bm{g}_{k}^{*},\partial_{C}\overline{L}_{k}(\widehat{\bm{w}}_{k})\big)$. Let
[TABLE]
which, by 14, belongs to the Clarke subdifferential . We upper bound
[TABLE]
with
[TABLE]
Then, we have
[TABLE]
In sum, we have proved that
[TABLE]
where . ∎
D.2 Testing Fréchet NAS
Proof of 32.
Some steps in the computation are similar to those in the proof of Theorem 28 in Section D.1, and we may skip them for simplicity. We consider an -Fréchet stationary point with . By 17, there exists a regular subgradient such that
[TABLE]
Given a positive radius , we aim to show that, for any
[TABLE]
we can certify that the rounded point returned by Algorithm 5 satisfies
[TABLE]
where is a constant depending on the curvature that we will discuss later.
Similar to Section D.1, we define the following shorthands for convenience
[TABLE]
We consider the following quantity related to the point for any :
[TABLE]
We use the same index sets for computing the rounded as those in Section D.1. Note that . The argument in Section D.1 shows that
[TABLE]
Meanwhile, as is feasible to the quadratic program in Algorithm 5, we get
[TABLE]
which implies and for any . It is evident that
[TABLE]
as, for any , is feasible to the quadratic program for computing in Algorithm 5.
However, the identification of is not sufficient to bound $\textnormal{dist}\Big(\bm{g}^{*}_{k},\widehat{\partial}\overline{L}_{k}(\widehat{\bm{w}}_{k})\Big)$, as the index set may not be an empty set, which, by 17, implies and $\textnormal{dist}\Big(\bm{g}^{*}_{k},\widehat{\partial}\overline{L}_{k}(\widehat{\bm{w}}_{k})\Big)=+\infty$. We are thus looking for the identification of . We define a constant and consider the following quantity related to the point for any :
[TABLE]
Note that . Fix any and we consider two cases. If , by 17 and Assumption 30, we know for any . Then, for any , we have
[TABLE]
which, by the rounding step of in Algorithm 5, implies that if , then and
[TABLE]
If , we can see that
[TABLE]
which implies by the rounding step of in Algorithm 5. Thus, we have proved that for any and , we get , hence that , and finally that . By 17, we conclude that .
Summarizing, we have
[TABLE]
This shows by triangle inequality that
[TABLE]
Using 17, we get
[TABLE]
where we define . We first compute
[TABLE]
We now upper bound the second term . A computation similar to that in Section D.1 shows that
[TABLE]
where and .
We proceed to upper bound $\sum_{k=1}^{H}\textnormal{dist}\Big(\bm{g}^{*}_{k},\widehat{\partial}\overline{L}_{k}(\widehat{\bm{w}}_{k})\Big)$. By 17, we know that there exist such that can be written as
[TABLE]
Now, we are well prepared to upper bound $\textnormal{dist}\big(\bm{g}_{k}^{*},\widehat{\partial}\overline{L}_{k}(\widehat{\bm{w}}_{k})\big)$. Let
[TABLE]
which, by 17, belongs to the Fréchet subdifferential . We proceed to upper bound $\textnormal{dist}\big(\bm{g}_{k}^{*},\widehat{\partial}\overline{L}_{k}(\widehat{\bm{w}}_{k})\big)\leqslant\|\widehat{\bm{g}}_{k}-\bm{g}_{k}^{*}\|$ with
[TABLE]
Then, we have
[TABLE]
In sum, we have proved that
[TABLE]
where . ∎
References
- Ahmadi and Zhang (2022) A. A. Ahmadi and J. Zhang. On the complexity of finding a local minimizer of a quadratic function over a polytope. Mathematical Programming, 195(1-2):783–792, 2022.
- Arora et al. (2019) S. Arora, S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322–332. PMLR, 2019.
- Bertsimas and Tsitsiklis (1997) D. Bertsimas and J. N. Tsitsiklis. Introduction to Linear Optimization, volume 6. Athena Scientific, Belmont, MA, 1997.
- Bubeck et al. (2020) S. Bubeck, R. Eldan, Y. T. Lee, and D. Mikulincer. Network size and size of the weights in memorization with two-layers neural networks. In Advances in Neural Information Processing Systems, volume 33, pages 4977–4986, 2020.
- Burke et al. (2002) J. V. Burke, A. S. Lewis, and M. L. Overton. Approximating subdifferentials by random sampling of gradients. Mathematics of Operations Research, 27(3):567–584, 2002.
- Clarke (1990) F. H. Clarke. Optimization and Nonsmooth Analysis. SIAM, 1990.
- Cui and Pang (2021) Y. Cui and J.-S. Pang. Modern Nonconvex Nondifferentiable Optimization. SIAM, 2021.
- Davis and Drusvyatskiy (2019) D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019.
