Testing Stationarity Concepts for ReLU Networks: Hardness, Regularity, and Robust Algorithms

Lai Tian; Anthony Man-Cho So

arXiv:2302.12261·math.OC·February 27, 2023

TL;DR

This paper investigates the computational complexity of stationarity testing in ReLU neural networks, establishing hardness results, providing a regularity condition, and proposing a robust algorithm for near-approximate stationarity testing.

Contribution

It proves the co-NP-hardness of certain stationarity tests, introduces a simple regularity condition for subdifferential chain rules, and develops a practical algorithm for robust stationarity testing in ReLU networks.

Findings

1. Testing a certain first-order approximate stationarity concept is co-NP-hard.

2. A simple, efficiently checkable regularity condition characterizes the validity of the equality-type subdifferential chain rule.

3. A robust algorithm tests near-approximate stationarity.

Equations

$$L(u_{1},\bm{w}_{1},\dots,u_{H},\bm{w}_{H})\coloneqq\sum_{i=1}^{N}\ell_{i}\left(\sum_{k=1}^{H}u_{k}\cdot\max\left\{\bm{w}_{k}^{\top}\bm{x}_{i},0\right\}\right).$$

$$\tilde{G}\coloneqq\sum_{i=1}^{N}\rho_{i}\cdot\prod_{k=1}^{H}\left\{\max\left\{\bm{w}_{k}^{\top}\bm{x}_{i},0\right\}\right\}\times\left\{\begin{array}{rcl}\left\{u_{k}\cdot\bm{x}_{i}\cdot\mathbf{1}_{\bm{w}_{k}^{\top}\bm{x}_{i}>0}\right\}&\mbox{if}&\bm{w}_{k}^{\top}\bm{x}_{i}\neq 0,\\ u_{k}\cdot\bm{x}_{i}\cdot[0,1]&\mbox{if}&\bm{w}_{k}^{\top}\bm{x}_{i}=0,\end{array}\right.$$

$$\partial_{C}f(\bm{x})\coloneqq\textnormal{Conv}\big\{\bm{s}:\exists\bm{x}^{\prime}\rightarrow\bm{x},\nabla f(\bm{x}^{\prime})\textnormal{ exists},\nabla f(\bm{x}^{\prime})\rightarrow\bm{s}\big\}.$$

$$\widehat{\partial}f(\bm{x})\coloneqq\big\{\bm{s}:\bm{s}^{\top}\bm{d}\leqslant f^{\prime}(\bm{x};\bm{d})\text{ for all }\bm{d}\big\}.$$

$$\partial f(\bm{x})\coloneqq\limsup_{\bm{x}^{\prime}\rightarrow\bm{x}}\widehat{\partial}f(\bm{x}^{\prime}),$$

$$\mathcal{I}_{k}^{+}(\bm{w}_{k})$$

$$\mathcal{I}_{k}^{-}(\bm{w}_{k})$$

$$G_{k}^{C}\coloneqq\sum_{i\in[N]\backslash(\mathcal{I}_{k}^{+}\cup\mathcal{I}_{k}^{-})}u_{k}\rho_{i}\cdot\mathbf{1}_{\bm{w}_{k}^{\top}\bm{x}_{i}>0}\cdot\bm{x}_{i}+\sum_{j\in\mathcal{I}_{k}^{+}\cup\mathcal{I}_{k}^{-}}u_{k}\rho_{j}\cdot\bm{x}_{j}\cdot[0,1].$$

$$G_{k}^{L}\coloneqq\sum_{i\in[N]\backslash(\mathcal{I}_{k}^{+}\cup\mathcal{I}_{k}^{-})}u_{k}\rho_{i}\cdot\mathbf{1}_{\bm{w}_{k}^{\top}\bm{x}_{i}>0}\cdot\bm{x}_{i}+\sum_{j\in\mathcal{I}_{k}^{+}}u_{k}\rho_{j}\bm{x}_{j}\cdot[0,1]+\left\{\sum_{j\in\mathcal{I}_{k}^{-}}u_{k}\rho_{j}\cdot\mathbf{1}_{\bm{d}^{\top}\bm{x}_{j}>0}\cdot\bm{x}_{j}:\exists\bm{d}\in\mathbb{R}^{d},\min_{t\in\mathcal{I}_{k}^{-}}\left|\bm{x}_{t}^{\top}\bm{d}\right|>0\right\}.$$

$$G^{F}_{k}\coloneqq\sum_{i\in[N]\backslash(\mathcal{I}_{k}^{+}\cup\mathcal{I}_{k}^{-})}u_{k}\rho_{i}\cdot\mathbf{1}_{\bm{w}_{k}^{\top}\bm{x}_{i}>0}\cdot\bm{x}_{i}+\sum_{j\in\mathcal{I}_{k}^{+}}u_{k}\rho_{j}\bm{x}_{j}\cdot[0,1]+\left\{\begin{array}{rcl}\emptyset&\mbox{if}&\left|\mathcal{I}_{k}^{-}\right|>0,\\ \bm{0}&\mbox{if}&\left|\mathcal{I}_{k}^{-}\right|=0.\end{array}\right.$$

$$\partial_{C}L(u_{1},\bm{w}_{1},\dots,u_{H},\bm{w}_{H})=\prod_{k=1}^{H}\left\{\sum_{i=1}^{N}\rho_{i}\cdot\max\left\{\bm{w}_{k}^{\top}\bm{x}_{i},0\right\}\right\}\times G_{k}^{C},$$

$$\bigcup_{1\leqslant k\leqslant H}\textnormal{span}\left(\{\bm{x}_{i}\}_{i\in\mathcal{I}_{k}^{+}}\right)\cap\textnormal{span}\left(\{\bm{x}_{j}\}_{j\in\mathcal{I}_{k}^{-}}\right)=\{\bm{0}\}.$$

$$\partial L(u_{1},\bm{w}_{1},\dots,u_{H},\bm{w}_{H})=\prod_{k=1}^{H}\left\{\sum_{i=1}^{N}\rho_{i}\cdot\max\left\{\bm{w}_{k}^{\top}\bm{x}_{i},0\right\}\right\}\times G_{k}^{L},$$

$$\widehat{\partial}L(u_{1},\bm{w}_{1},\dots,u_{H},\bm{w}_{H})=\prod_{k=1}^{H}\left\{\sum_{i=1}^{N}\rho_{i}\cdot\max\left\{\bm{w}_{k}^{\top}\bm{x}_{i},0\right\}\right\}\times G_{k}^{F},$$

$$\textnormal{general position}\Longrightarrow\textnormal{LIKQ}\Longleftrightarrow\textnormal{LIAD}\Longrightarrow\textnormal{SQ}.$$

$$f(x,y,z,b)\coloneqq\max\{2y+b,0\}+\max\{2x+2z+b,0\}+\max\{x+y+z+b,0\}-\max\{x-z+b,0\}.$$

$$f(x,y,b)\coloneqq\max\{-2y+b,0\}+\max\{-y+b,0\}+\max\{x+b,0\}-\max\{y+b,0\}.$$

$$G_{k}^{L}=\sum_{i\in[N]\backslash(\mathcal{I}_{k}^{+}\cup\mathcal{I}_{k}^{-})}u_{k}\rho_{i}\cdot\mathbf{1}_{\bm{w}_{k}^{\top}\bm{x}_{i}>0}\cdot\bm{x}_{i}+\sum_{j\in\mathcal{I}_{k}^{+}}u_{k}\rho_{j}\bm{x}_{j}\cdot[0,1]+\sum_{j^{\prime}\in\mathcal{I}_{k}^{-}}u_{k}\rho_{j^{\prime}}\bm{x}_{j^{\prime}}\cdot\{0,1\}.$$

$$\bm{w}_{k}=\textnormal{argmin}_{\bm{z}\in\mathbb{R}^{d}}$$

$$\bm{z}^{\top}\bm{x}_{i}\geqslant 2R\cdot\delta,$$

$$\bm{z}^{\top}\bm{x}_{i}\leqslant-2R\cdot\delta,$$

$$\bm{z}^{\top}\bm{x}_{i}=0,$$

$$(u_{1},\bm{w}_{1},\dots,u_{H},\bm{w}_{H})\in\mathbb{B}_{\delta}\big((u_{1}^{*},\bm{w}_{1}^{*},\dots,u_{H}^{*},\bm{w}_{H}^{*})\big),$$

$$\textnormal{dist}\Big(\bm{0},\partial_{C}L(\widehat{u}_{1},\widehat{\bm{w}}_{1},\dots,\widehat{u}_{H},\widehat{\bm{w}}_{H})\Big)\leqslant\varepsilon+C_{\mu}^{\textnormal{Clarke}}\cdot\delta.$$

$$\left\lceil\log_{2}\left(2R/\min\left\{\left|\bm{x}_{i}^{\top}\bm{w}_{k}\right|:i\in[N],k\in[H],\bm{x}_{i}^{\top}\bm{w}_{k}\neq 0\right\}\right)\right\rceil$$

$$\textnormal{dist}\Big(\bm{0},\partial_{C}L(\widehat{u}_{1},\widehat{\bm{w}}_{1},\dots,\widehat{u}_{H},\widehat{\bm{w}}_{H})\Big)\leqslant\varepsilon+C_{\mu}^{\textnormal{Clarke}}\cdot\delta_{t}.$$

$$\textnormal{dist}\Big(\bm{0},\widehat{\partial}L(\widehat{u}_{1},\widehat{\bm{w}}_{1},\dots,\widehat{u}_{H},\widehat{\bm{w}}_{H})\Big)\leqslant\varepsilon+C_{\mu}^{\textnormal{Fr\'{e}chet}}\cdot\delta.$$

$$z=F(x,p),\quad y=f(x,p),$$
Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Adversarial Robustness in Machine Learning · Advanced Neural Network Applications

Methods: Test

Full text

Lai Tian, Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Sha Tin, N.T., Hong Kong SAR.

Anthony Man-Cho So, Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Sha Tin, N.T., Hong Kong SAR.

Abstract

We study the computational problem of the stationarity test for the empirical loss of neural networks with ReLU activation functions. Our contributions are:

1. Hardness: We show that checking a certain first-order approximate stationarity concept for a piecewise linear function is co-NP-hard. This implies that testing a certain stationarity concept for a modern nonsmooth neural network is in general computationally intractable. As a corollary, we prove that testing so-called first-order minimality for functions in abs-normal form is co-NP-complete, which was conjectured by Griewank and Walther (2019, SIAM J. Optim., vol. 29, p284).

2. Regularity: We establish a necessary and sufficient condition for the validity of an equality-type subdifferential chain rule in terms of Clarke, Fréchet, and limiting subdifferentials of the empirical loss of two-layer ReLU networks. This new condition is simple and efficiently checkable.

3. Robust algorithms: We introduce an algorithmic scheme to test near-approximate stationarity in terms of both Clarke and Fréchet subdifferentials. Our scheme makes no false positive or false negative error when the tested point is sufficiently close to a stationary one and a certain qualification is satisfied. This is the first practical and robust stationarity test approach for two-layer ReLU networks.

1 Introduction

The theoretical analysis of ReLU neural network training is challenging from the optimization perspective, even though the empirical performance of various “gradient”-based algorithms is surprisingly good. A key difficulty comes from the entanglement of nonconvexity and nonsmoothness in the empirical loss, which renders the notion of gradient from classical analysis meaningless and the subdifferential set from convex analysis vacuous. Consequently, the study of such a nonconvex nondifferentiable function requires the use of tools from variational analysis Rockafellar and Wets (2009).

For a continuously differentiable function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ , a point $\bm{x}\in\mathbb{R}^{d}$ is called stationary (or critical) if $\nabla f(\bm{x})=\bm{0}$ . However, the situation is much more complicated when $f$ is nondifferentiable at $\bm{x}$ . Indeed, there are many different stationarity concepts (see Definition 6) for nonsmooth functions Li et al. (2020); Cui and Pang (2021). For general Lipschitz functions, substantial progress has recently been made, under the oracle complexity framework of Nemirovskij and Yudin (1983), on the design of provable algorithms for finding approximately stationary points (in a perturbed sense) Zhang et al. (2020); Tian et al. (2022); Davis et al. (2022); Lin et al. (2022); Metel and Takeda (2022); Kong and Lewis (2022) and also on establishing the hardness of computing such approximately stationary points Kornowski and Shamir (2022a); Tian and So (2022); Kornowski and Shamir (2022b); Jordan et al. (2022).

As a complement to these developments, in this paper, we consider the complexity of and robust algorithms for checking whether a given neural network is an (approximately) stationary one with respect to the empirical loss. This is a task already considered by Yun et al. (2018). We emphasize that “checking” and “finding” are two very different computational problems. While the co-NP-hardness of checking the local optimality of a given point in smooth nonconvex programming was shown by Murty and Kabadi (1987), the complexity of “finding” a local minimizer was an open question posed by Pardalos and Vavasis (1992) and was settled only recently by Ahmadi and Zhang (2022).

Given a neural network with smooth elemental components, testing the (approximate) stationarity of a point is simply an application of the classic gradient chain rule. In a modern computational environment, this is usually done by using Algorithmic Differentiation (AD) Griewank and Walther (2008) software, e.g., PyTorch and TensorFlow. A natural question that arises is whether testing stationarity for a piecewise smooth function (e.g., the empirical loss of a ReLU network) is as easy as testing for a smooth one. Surprisingly, we show (in Theorem 10) that such testing is, in general, computationally intractable.

The difficulty here is due to the failure of an exact (equality-type) subdifferential chain rule. For a general locally Lipschitz function, the calculus rules are only known to hold in the form of set inclusions rather than equalities, except in several special cases (see Fact 8). This prevents one from computing the subdifferential set of the empirical loss from those of its elemental components. Thus, to facilitate the tractability of stationarity testing, it is of interest to find a condition under which an equality-type chain rule holds and the subdifferential set of the empirical loss can be characterized. By contrast, given a first-order oracle providing the whole generalized subdifferential set at the queried point in the oracle framework Kornowski and Shamir (2022a); Tian and So (2022); Kornowski and Shamir (2022b); Jordan et al. (2022), the stationarity testing task reduces to a simple linear program, which can be solved by interior-point methods in polynomial time. However, in practice, even computing an element in the generalized subdifferential for a nonsmooth function can be highly non-trivial Burke et al. (2002); Nesterov (2005); Huang and Ma (2010); Khan and Barton (2013). Therefore, a condition for the validity of the exact chain rule could be useful for subgradient computation and for stationarity testing and analysis.
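To make the failure of the equality-type chain rule concrete, the following minimal sketch applies the chain rule term by term to $f(x)=\max\{x,0\}-\max\{-x,0\}$ , which equals $x$ everywhere. The function names and the parameter `at_zero` (the slope assigned to ReLU at its kink) are illustrative assumptions, not part of the paper; choosing `at_zero=0.0` mimics a common AD convention.

```python
def relu(t):
    return max(t, 0.0)

def relu_slope(t, at_zero=0.0):
    # Formal "derivative" of ReLU; at_zero is the (assumed) value picked at the kink t = 0.
    if t > 0:
        return 1.0
    if t < 0:
        return 0.0
    return at_zero

def f(x):
    # f(x) = relu(x) - relu(-x), which equals x for every real x, so f'(x) = 1 everywhere.
    return relu(x) - relu(-x)

def formal_derivative(x, at_zero=0.0):
    # Chain rule applied to each term separately: d/dx relu(x) - d/dx relu(-x).
    return relu_slope(x, at_zero) * 1.0 - relu_slope(-x, at_zero) * (-1.0)

# Away from the kink the formal rule is exact.
away = formal_derivative(1.0)
# At the kink, every choice in [0, 1] yields a value in [0, 2]; only at_zero = 0.5 recovers f'(0) = 1.
at_kink_low = formal_derivative(0.0, at_zero=0.0)
at_kink_high = formal_derivative(0.0, at_zero=1.0)
```

At $x=0$ the term-by-term rule produces the interval $[0,2]$ , which strictly contains $\partial_{C}f(0)=\{1\}$ . Since $0\in[0,2]$ , a test based on the formal rule would wrongly report $0$ as stationary.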

The most closely related work to ours is that of Yun et al. (2018). They considered a two-layer ReLU network and introduced a theoretical algorithm to sequentially check Clarke stationarity (see Definition 6), Fréchet stationarity, and a certain second-order optimality condition. For Fréchet stationarity testing, they proposed to verify the nonnegativity of a directional derivative in every possible direction, for which a trivial test in the worst case requires checking exponentially many inequalities. By exploiting polyhedral geometry, they showed that it suffices to check only extreme rays, which can be done in polynomial time. A limitation of the work of Yun et al. (2018) (see also the discussion in (Yun et al., 2018, Section 5)) is that the algorithm therein can only perform exact stationarity testing (see Section 5.1). That is to say, if the objective function is $x\mapsto|x|$ , then the algorithm in Yun et al. (2018) will certify stationarity if and only if $x=0$ . However, as pointed out by Yun et al. (2018, Section 5), in practice, such an exact nondifferentiable point is almost impossible to reach. Therefore, it is desirable to have a robust stationarity testing algorithm that works for points sufficiently close to a stationary one. In other words, we are interested in testing so-called near-approximate stationarity (see 25). We mention that, without exploiting structures in the nonsmooth objective function, such robust testing is impossible in general (Tian and So, 2022, Theorem 2.7).
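The gap between exact and robust testing can be seen on $x\mapsto|x|$ with a small sketch. The helper names below are hypothetical, and the "near" test is a toy specialized to this one function under the assumption that near-approximate stationarity means "some point within distance `delta` is `eps`-Clarke stationary"; it is not the rounding scheme of the paper.

```python
def clarke_abs(x):
    """Clarke subdifferential of x -> |x|, returned as an interval (lo, hi)."""
    if x > 0:
        return (1.0, 1.0)
    if x < 0:
        return (-1.0, -1.0)
    return (-1.0, 1.0)

def dist_zero(interval):
    # Distance from 0 to a closed interval [lo, hi].
    lo, hi = interval
    if lo <= 0.0 <= hi:
        return 0.0
    return min(abs(lo), abs(hi))

def exact_test(x, eps):
    # eps-Clarke stationarity evaluated at x itself.
    return dist_zero(clarke_abs(x)) <= eps

def near_test(x, eps, delta):
    # Toy robust test: the only kink of |x| is 0, so round x to 0 when |x| <= delta.
    if abs(x) <= delta:
        return exact_test(0.0, eps)
    return exact_test(x, eps)

exact_result = exact_test(1e-9, 0.1)       # the subgradient at 1e-9 is 1, so the exact test fails
near_result = near_test(1e-9, 0.1, 1e-6)   # the point is within delta of the stationary point 0
far_result = near_test(0.5, 0.1, 1e-6)     # no stationary point within delta of 0.5
```

The exact test rejects every nonzero point no matter how close it is to $0$ , whereas the rounded test accepts points within `delta` of the kink.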

1.1 Our Results and Techniques

Hardness.

Our first main result shows that checking a certain first-order approximate stationarity concept for an unconstrained piecewise differentiable function is co-NP-hard (see Theorem 10). This implies that testing a certain stationarity concept for a shallow modern convolutional neural network is co-NP-hard (see Corollary 12). Our reduction is from 3-satisfiability (3SAT) to a stationarity testing problem. As a corollary, we prove that testing so-called first-order minimality (FOM) for functions in abs-normal form is co-NP-complete (see Corollary 11), which gives an affirmative answer to a conjecture of Griewank and Walther (2019, SIAM J. Optim., vol. 29, p284).

Our other results concern the empirical loss of a two-layer ReLU network, which was also studied by Yun et al. (2018). Given the training data $\{(\bm{x}_{i},y_{i})\}_{i=1}^{N}\subseteq\mathbb{R}^{d}\times\mathbb{R}$ with the $\bm{x}=(\tilde{\bm{x}},1)$ parametrization, we first make the following blanket assumptions.

Assumption 1 (Blanket assumptions).

The loss function $\ell:\mathbb{R}\times\mathbb{R}\rightarrow\mathbb{R}$ is smooth and has locally Lipschitz gradient. For simplicity of notation, we write $\ell_{i}(\cdot)$ for $\ell(\cdot,y_{i})$ . For any $i\in[N]$ , we assume $\bm{x}_{i}\neq\bm{0}$ , which is superfluous for the $\bm{x}=(\tilde{\bm{x}},1)$ parametrization.

The empirical loss of a two-layer ReLU neural network with $H$ hidden nodes can be written as

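$$L(u_{1},\bm{w}_{1},\dots,u_{H},\bm{w}_{H})\coloneqq\sum_{i=1}^{N}\ell_{i}\left(\sum_{k=1}^{H}u_{k}\cdot\max\left\{\bm{w}_{k}^{\top}\bm{x}_{i},0\right\}\right).$$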
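As a minimal sketch of how this empirical loss is evaluated, the snippet below computes $L$ for a toy network in plain Python. The squared loss, the function names, and the data are illustrative assumptions; the paper only requires $\ell$ to be smooth with locally Lipschitz gradient.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def squared_loss(z, y):
    # Hypothetical choice of a smooth loss l(z, y).
    return 0.5 * (z - y) ** 2

def empirical_loss(us, ws, xs, ys, loss):
    """L(u_1, w_1, ..., u_H, w_H) = sum_i loss(sum_k u_k * max(w_k^T x_i, 0), y_i)."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = sum(u * max(dot(w, x), 0.0) for u, w in zip(us, ws))
        total += loss(pred, y)
    return total

# Inputs follow the x = (x_tilde, 1) parametrization: the last entry is the bias feature.
xs = [(1.0, 1.0), (-2.0, 1.0)]
ys = [1.0, 0.0]
us = [1.0, -1.0]                 # output weights u_k, H = 2 hidden nodes
ws = [(1.0, 0.0), (0.0, 1.0)]    # hidden weights w_k
value = empirical_loss(us, ws, xs, ys, squared_loss)
```

For this toy data the two predictions are $0$ and $-1$ , so each sample contributes $0.5$ to the loss.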

Regularity.

By naïvely abusing the convex subdifferential chain rule for $L$ , we consider the following “generalized subdifferential” of the empirical loss $L$ as

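$$\tilde{G}\coloneqq\sum_{i=1}^{N}\rho_{i}\cdot\prod_{k=1}^{H}\left\{\max\left\{\bm{w}_{k}^{\top}\bm{x}_{i},0\right\}\right\}\times\left\{\begin{array}{rcl}\left\{u_{k}\cdot\bm{x}_{i}\cdot\mathbf{1}_{\bm{w}_{k}^{\top}\bm{x}_{i}>0}\right\}&\mbox{if}&\bm{w}_{k}^{\top}\bm{x}_{i}\neq 0,\\ u_{k}\cdot\bm{x}_{i}\cdot[0,1]&\mbox{if}&\bm{w}_{k}^{\top}\bm{x}_{i}=0,\end{array}\right.$$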

with $\rho_{i}\coloneqq\ell_{i}^{\prime}\left(\sum_{k=1}^{H}u_{k}\cdot\max\left\{\bm{w}_{k}^{\top}\bm{x}_{i},0\right\}\right),\forall i\in[N]$ . This “generalized subdifferential” is popular in practical computation and theoretical analysis. For example, see (Wang et al., 2019, Equation (9)), (Arora et al., 2019, Section 3.1), and (Safran et al., 2022, Equations (5) and (6)). However, as $L$ is nonconvex and nonsmooth, we can only assert a fuzzy chain rule (see (Clarke, 1990, Section 2.3)) for the Clarke subdifferential $\partial_{C}L$ of $L$ , which is a set inclusion $\partial_{C}L(u_{1},\bm{w}_{1},\dots,u_{H},\bm{w}_{H})\subseteq\tilde{G}$ rather than an equation.
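A minimal sketch of how one element of $\tilde{G}$ can be formed is given below. The function name `formal_element`, the derivative callback `dloss`, and the parameter `tie` (the single value taken from $[0,1]$ at every index with $\bm{w}_{k}^{\top}\bm{x}_{i}=0$ ) are assumptions made for illustration. Because only the inclusion $\partial_{C}L\subseteq\tilde{G}$ is guaranteed, an element produced this way need not belong to $\partial_{C}L$ in general.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def formal_element(us, ws, xs, ys, dloss, tie=0.0):
    """Return one element of G-tilde as a list of (du_k, dw_k) blocks.

    tie in [0, 1] is the value selected from [0, 1] whenever w_k^T x_i == 0.
    """
    # rho_i = l_i'(sum_k u_k * max(w_k^T x_i, 0))
    rhos = []
    for x, y in zip(xs, ys):
        pred = sum(u * max(dot(w, x), 0.0) for u, w in zip(us, ws))
        rhos.append(dloss(pred, y))
    blocks = []
    for u, w in zip(us, ws):
        du = 0.0
        dw = [0.0] * len(w)
        for rho, x in zip(rhos, xs):
            pre = dot(w, x)
            du += rho * max(pre, 0.0)
            slope = 1.0 if pre > 0 else (tie if pre == 0 else 0.0)
            for j, xj in enumerate(x):
                dw[j] += rho * u * slope * xj
        blocks.append((du, dw))
    return blocks

# One sample sitting exactly on the kink of the single hidden node: w^T x = 0.
def dsq(z, y):
    return z - y   # derivative of the (assumed) squared loss 0.5 * (z - y)^2

low = formal_element([2.0], [(1.0, -1.0)], [(1.0, 1.0)], [-1.0], dsq, tie=0.0)
high = formal_element([2.0], [(1.0, -1.0)], [(1.0, 1.0)], [-1.0], dsq, tie=1.0)
```

The two calls return the endpoints of the segment $u_{k}\rho_{i}\bm{x}_{i}\cdot[0,1]$ for the weight block, while the block for $u_{k}$ is the singleton $\{0\}$ in both cases.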

Our second main result is a necessary and sufficient condition for the validity of a series of equality-type subdifferential chain rules for the empirical loss of this shallow ReLU network. We show that, under this regularity condition, exact chain rules hold for three commonly used generalized subdifferentials, i.e., Clarke (see 2 and 14), limiting (see 4 and 16), and Fréchet (see 3 and 17). It is notable that while sufficient conditions for the equality-type calculus rules are rather rich in the literature (see (Rockafellar and Wets, 2009, Chapter 10)), a necessary condition is rarely seen, let alone an efficiently computable, necessary and sufficient condition in our 14.

Robust algorithms.

Our third main result is an algorithmic scheme to test the so-called near-approximate stationarity (see 25) in terms of both Clarke and Fréchet subdifferentials. We show that, for an approximate stationary point $\bm{x}^{*}$ , any point that is sufficiently close to $\bm{x}^{*}$ can be certified (with Algorithm 4) as near-approximate stationary. Our technique is a new rounding scheme (see Algorithm 3) motivated by the notion of active manifold identification Lewis (2002); Lemaréchal et al. (2000) in the literature. This new rounding scheme is capable of identifying the activation pattern of the target stationary point and finding a nearby point with the same pattern. One notable application of such a near-approximate stationarity test is to obtain a termination criterion for algorithms that only have asymptotic convergence results. For example, every limit point of the sequence generated by the stochastic subgradient method has been shown to be Clarke stationary (see Definition 6) by Davis et al. (2020, Corollary 5.11), but it is still unclear when to terminate the algorithm and how to certify that the obtained point is at least close to some Clarke stationary point, as the norm of any vector in the subdifferential is almost surely bounded away from zero during the entire trajectory (consider running the subgradient method on $x\mapsto|x|$ ).
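The remark about $x\mapsto|x|$ can be checked with a short sketch. The step size $1/t$ , the starting point, and the function names are illustrative assumptions rather than the setting analyzed in the paper.

```python
def subgradient_abs(x):
    # One element of the Clarke subdifferential of |x|; 0 is returned only at the kink x = 0.
    return (x > 0) - (x < 0)

def run(x0, steps):
    """Subgradient method x_{t} = x_{t-1} - g_t / t on |x|; records |g_t| along the way."""
    x = x0
    norms = []
    for t in range(1, steps + 1):
        g = subgradient_abs(x)
        norms.append(abs(g))
        x -= g / t   # diminishing step size 1/t
    return x, norms

x_final, norms = run(0.3, 10)
```

In this run the iterates move toward $0$ , yet every recorded subgradient has norm $1$ , so a stopping rule based on a small subgradient norm never triggers.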

Notation.

Scalars, vectors, and matrices are denoted by lowercase letters, boldface lowercase letters, and boldface uppercase letters, respectively. The notation used in this paper is mostly standard: $\mathbb{B}_{\varepsilon}(\bm{x})\coloneqq\{\bm{v}:\|\bm{v}-\bm{x}\|\leqslant\varepsilon\}$ (we may write $\mathbb{B}_{\varepsilon}^{d}(\bm{x})$ to emphasize the dimension); $\textnormal{dist}(\bm{x},S)\coloneqq\inf_{\bm{v}\in S}\|\bm{v}-\bm{x}\|$ for a closed set $S$ , which is defined as $+\infty$ if the set $S=\emptyset$ ; $\textnormal{Conv}(S)$ denotes the convex hull of the set $S$ ; the vector $\bm{e}_{i}$ denotes the $i$ -th column of the identity matrix $\bm{I}$ ; $\mathbb{R}_{+}\coloneqq\{x\in\mathbb{R}:x\geqslant 0\}$ ; $\pi_{i}$ denotes the projection onto the $i$ -th argument, i.e., $\pi_{i}\left(\prod_{j=1}^{n}S_{j}\right)\coloneqq S_{i}$ for sets $\{S_{i}\}_{i=1}^{n}$ ; the extended-real line $\overline{\mathbb{R}}$ is defined as $\mathbb{R}\cup\{-\infty,+\infty\}$ ; the addition of two sets is always understood in the sense of Minkowski; $\overline{\mathbb{Z}}\coloneqq\mathbb{Z}\cup\{-\infty,\infty\}$ ; $[m]\coloneqq\{1,\dots,m\}$ for any integer $m\geqslant 1$ .

Organization.

We introduce the background on generalized differentiation theory and formal definitions of stationarity concepts in Section 2. Then, in Section 3, we present our main hardness results. The necessary and sufficient condition of the validity of chain rule in terms of various subdifferential constructions is presented in Section 4. We discuss the robust algorithms to test near-approximate stationarity concepts in Section 5. All proofs are deferred to the Appendices.

2 Preliminaries

The following construction of subdifferential by Clarke (1990, Theorem 2.5.1) is classic.

Definition 2 (Clarke subdifferential).

Given a point $\bm{x}$ , the Clarke subdifferential of a locally Lipschitz function $f$ at $\bm{x}$ is defined by

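$$\partial_{C}f(\bm{x})\coloneqq\textnormal{Conv}\big\{\bm{s}:\exists\bm{x}^{\prime}\rightarrow\bm{x},\nabla f(\bm{x}^{\prime})\textnormal{ exists},\nabla f(\bm{x}^{\prime})\rightarrow\bm{s}\big\}.$$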

For a locally Lipschitz function, the Clarke subdifferential is always nonempty, convex, and compact (Clarke, 1990, Proposition 2.1.2(a)). The following set generated by a directional derivative $f^{\prime}$ is known as the Fréchet subdifferential of $f$ (Rockafellar and Wets, 2009, Exercise 8.4).

Definition 3 (Fréchet subdifferential).

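Given a point $\bm{x}$ , the Fréchet subdifferential of a locally Lipschitz and directionally differentiable function $f$ at $\bm{x}$ is defined by

$$\widehat{\partial}f(\bm{x})\coloneqq\big\{\bm{s}:\bm{s}^{\top}\bm{d}\leqslant f^{\prime}(\bm{x};\bm{d})\text{ for all }\bm{d}\big\}.$$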

The set-valued mapping $\widehat{\partial}f$ of Fréchet subdifferential of $f$ is not outer semicontinuous (see (Rockafellar and Wets, 2009, Definition 5.4)), which means that given $\bm{x}_{\nu}\rightarrow\bm{x},\bm{g}_{\nu}\rightarrow\bm{g}$ with $\bm{g}_{\nu}\in\widehat{\partial}f(\bm{x}_{\nu})$ , we cannot assert $\bm{g}\in\widehat{\partial}f(\bm{x})$ . The following limiting subdifferential (or the Mordukhovich subdifferential) (Rockafellar and Wets, 2009, Definition 8.3(b)) is more robust for analysis.

Definition 4 (Limiting subdifferential).

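Given a point $\bm{x}$ , the limiting subdifferential of a locally Lipschitz and directionally differentiable function $f$ at $\bm{x}$ is defined by

$$\partial f(\bm{x})\coloneqq\limsup_{\bm{x}^{\prime}\rightarrow\bm{x}}\widehat{\partial}f(\bm{x}^{\prime}),$$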

where the outer limit is taken in the sense of Kuratowski (see, e.g., (Rockafellar and Wets, 2009, p152, Equation 5(1))).

In the following result, we record a generalized Fermat’s rule for optimality conditions and the relationship among the aforementioned three subdifferentials.

Fact 5 (Rockafellar and Wets (2009, Theorem 8.6, 8.49, 10.1)).

Given a locally Lipschitz function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ and a point $\bm{x}\in\mathbb{R}^{d}$ , then we have $\widehat{\partial}f(\bm{x})\subseteq\partial f(\bm{x})\subseteq\partial_{C}f(\bm{x}).$ If the point $\bm{x}$ is a local minimizer of the function $f$ , then it holds that $\bm{0}\in\widehat{\partial}f(\bm{x})$ .

We are now ready to state the definitions of various stationarity concepts.

Definition 6 (Stationarity concepts).

Given a locally Lipschitz function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ , we say that the point $\bm{x}\in\mathbb{R}^{d}$ is an

• $\varepsilon$ -Clarke stationary point if $\textnormal{dist}\big{(}\bm{0},\partial_{C}f(\bm{x})\big{)}\leqslant\varepsilon$ ;

• $\varepsilon$ -Fréchet stationary point if $\textnormal{dist}\big{(}\bm{0},\widehat{\partial}f(\bm{x})\big{)}\leqslant\varepsilon$ ;

• $\varepsilon$ -limiting stationary point if $\textnormal{dist}\big{(}\bm{0},\partial f(\bm{x})\big{)}\leqslant\varepsilon$ .
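A short worked example, chosen here for illustration and not taken from the paper, shows that the three concepts can disagree at the same point. Take $f(x)=-|x|$ at $x=0$ and recall the convention that the distance to an empty set is $+\infty$ :

```latex
\begin{align*}
f'(0;d) &= -|d|, \\
\widehat{\partial}f(0) &= \{s : s\,d \leqslant -|d| \text{ for all } d\} = \emptyset
  && (d=1 \text{ forces } s\leqslant -1,\ d=-1 \text{ forces } s\geqslant 1), \\
\partial f(0) &= \{-1,\,1\}
  && (\widehat{\partial}f(x)=\{-1\} \text{ for } x>0,\ \widehat{\partial}f(x)=\{1\} \text{ for } x<0), \\
\partial_{C}f(0) &= \textnormal{Conv}\{-1,\,1\} = [-1,1], \\
\textnormal{dist}\big(0,\partial_{C}f(0)\big) &= 0, \qquad
\textnormal{dist}\big(0,\partial f(0)\big) = 1, \qquad
\textnormal{dist}\big(0,\widehat{\partial}f(0)\big) = +\infty .
\end{align*}
```

So $0$ is Clarke stationary for $f$ , but it is neither $\varepsilon$ -limiting stationary for any $\varepsilon<1$ nor $\varepsilon$ -Fréchet stationary for any finite $\varepsilon$ , which is consistent with $0$ being a local maximizer rather than a local minimizer.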

The following Clarke regularity for locally Lipschitz and directionally differentiable functions is a classic notion related to the validity of various subdifferential calculus rules; see (Clarke, 1990, Definition 2.3.4) and (Rockafellar and Wets, 2009, Corollary 8.11).

Definition 7 (Clarke regularity).

For a locally Lipschitz and directionally differentiable function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ and a point $\bm{x}$ , we say that $f$ is Clarke regular at $\bm{x}$ if $\partial_{C}f(\bm{x})=\widehat{\partial}f(\bm{x})$ .

We record some basic equality-type calculus rules for Clarke subdifferential as follows; see (Clarke, 1990, Proposition 2.3.3, Theorem 2.3.10), and (Rockafellar, 1985, Proposition 2.5). We refer the reader to (Rockafellar and Wets, 2009, Chapter 10) for similar calculus rules for Fréchet and limiting subdifferentials.

Fact 8 (Calculus rules).

Let $f:\mathbb{R}^{d}\rightarrow\mathbb{R},g:\mathbb{R}^{d}\rightarrow\mathbb{R}$ be two locally Lipschitz functions.

• If $f$ is strictly differentiable at $\bm{x}$ , then $\partial_{C}(f+g)(\bm{x})=\nabla f(\bm{x})+\partial_{C}g(\bm{x})$ ;

• If $h(\bm{x},\bm{y})=f(\bm{x})+g(\bm{y})$ , then $\partial_{C}h(\bm{x},\bm{y})=\partial_{C}f(\bm{x})\times\partial_{C}g(\bm{y})$ ;

• Given a strictly differentiable mapping $G:\mathbb{R}^{n}\rightarrow\mathbb{R}^{d}$ and a point $\bm{y}\in\mathbb{R}^{n}$ , if the function $f$ (or $-f$ ) is Clarke regular at $G(\bm{y})$ , then $f\circ G$ (or $-f\circ G$ ) is Clarke regular at $\bm{y}$ and $\partial_{C}[f\circ G](\bm{y})=(JG(\bm{y}))^{\top}\partial_{C}f(G(\bm{y}))$ , where $JG$ is the Jacobian of mapping $G$ . The equality also holds when $JG$ is surjective.

Remark 9.

The sum rule is a special case of the chain rule, and it does not hold trivially for Lipschitz functions. For example, consider $\partial_{C}[|\cdot|-|\cdot|](0)=\{0\}\subsetneq\partial_{C}[|\cdot|](0)+(-\partial_{C}[|\cdot|](0))=[-2,2]$ . The empirical loss of a ReLU network is in general not Clarke regular. To see this, let $f(x,y)=\max\{x,0\}-\max\{y,0\}$ . It is easy to verify that neither $f$ nor $-f$ is Clarke regular. Another remark here is on the notion of partial subdifferentiation; see (Rockafellar and Wets, 2009, Corollary 10.11) and (Clarke, 1990, p48). In general, we cannot say much about the relationship between $\partial f(\bm{x},\bm{y})$ and $\partial_{\bm{x}}f(\bm{x},\bm{y})\times\partial_{\bm{y}}f(\bm{x},\bm{y})$ (see (Clarke, 1990, Example 2.5.2)), except the following inclusion (Clarke, 1990, Proposition 2.3.16): $\partial_{\bm{x}}f(\bm{x},\bm{y})\times\partial_{\bm{y}}f(\bm{x},\bm{y})\subseteq\pi_{1}\partial f(\bm{x},\bm{y})\times\pi_{2}\partial f(\bm{x},\bm{y}).$
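The claim about $f(x,y)=\max\{x,0\}-\max\{y,0\}$ can be verified at the origin directly from Definitions 3 and 7 and the separable rule in Fact 8. The derivation below is a sketch added for illustration:

```latex
\begin{align*}
f'\big((0,0);(d_{1},d_{2})\big) &= \max\{d_{1},0\}-\max\{d_{2},0\}, \\
\partial_{C}f(0,0) &= \partial_{C}[\max\{\cdot,0\}](0)\times\partial_{C}[-\max\{\cdot,0\}](0) = [0,1]\times[-1,0], \\
\widehat{\partial}f(0,0) &= \emptyset
  && \big(d=(0,1) \text{ forces } s_{2}\leqslant -1,\ d=(0,-1) \text{ forces } s_{2}\geqslant 0\big), \\
\partial_{C}(-f)(0,0) &= [-1,0]\times[0,1], \\
\widehat{\partial}(-f)(0,0) &= \emptyset
  && \big(d=(1,0) \text{ forces } s_{1}\leqslant -1,\ d=(-1,0) \text{ forces } s_{1}\geqslant 0\big).
\end{align*}
```

In both cases the Clarke subdifferential is nonempty while the Fréchet subdifferential is empty, so the equality required by Definition 7 fails for $f$ and for $-f$ .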

3 Hardness of Stationarity Testing

For smooth nonconvex programming, co-NP-hardness has been shown for local optimality testing (Murty and Kabadi, 1987, Theorem 2) and second-order sufficient condition testing (Murty and Kabadi, 1987, Theorem 4). However, in the nonsmooth case, we show that approximately checking a first-order necessary condition in terms of a certain subdifferential is already co-NP-hard.

Theorem 10 (Testing of piecewise linear functions).

Let $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ be a $3\sqrt{d}$ -Lipschitz piecewise linear function given in the form of a max–min representation (see footnote 1) with integer data. For any $\eta\in(d,+\infty]\cap\overline{\mathbb{Z}}$ , checking whether the point $\bm{0}\in\mathbb{Z}^{d}$ satisfies $\textnormal{dist}\big{(}\bm{0},\widehat{\partial}f(\bm{0})\big{)}\leqslant\nicefrac{{1}}{{\sqrt{\eta}}}$ is co-NP-hard, and checking whether $\bm{0}\in\widehat{\partial}f(\bm{0})$ is strongly co-NP-hard.

Footnote 1: Any piecewise linear function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ can be written using a max–min representation as $f(\bm{x})=\max_{1\leqslant i\leqslant l}\min_{j\in M_{i}}\bm{a}_{j}^{\top}\bm{x}+b_{j},$ where $M_{i}\subseteq[m]$ is a finite index set; see (Scholtes, 2012, Proposition 2.2.2). The input data are $d\in\mathbb{N},m\in\mathbb{N},l\in\mathbb{N},\{(\bm{a}_{j},b_{j})\}_{j=1}^{m},$ and $\{M_{i}\}_{i=1}^{l}$ .
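The following sketch evaluates a function given in max–min representation and shows why a "no" answer to the exact test is easy to verify once a witness is available. It assumes every $b_{j}=0$ , so that $f$ is positively homogeneous and $f^{\prime}(\bm{0};\bm{d})=f(\bm{d})$ ; the function names and the toy instance are illustrative and are not the hard instance constructed in the paper.

```python
def max_min(x, pieces, groups):
    """f(x) = max_i min_{j in M_i} (a_j^T x + b_j).

    pieces: list of (a_j, b_j); groups: list of index lists M_i (0-based indices).
    """
    def affine(j):
        a, b = pieces[j]
        return sum(ai * xi for ai, xi in zip(a, x)) + b
    return max(min(affine(j) for j in M) for M in groups)

def refutes_frechet_stationarity(d, pieces, groups):
    # Assumes all b_j = 0, hence f'(0; d) = f(d).
    # By Definition 3, 0 lies in the Frechet subdifferential at 0 iff f'(0; d) >= 0 for all d,
    # so a single direction d with f(d) < 0 certifies that it does not.
    return max_min(d, pieces, groups) < 0

# Toy instance: f(x) = max(min(x1, x2), min(-x1, -x2)).
pieces = [((1.0, 0.0), 0.0), ((0.0, 1.0), 0.0), ((-1.0, 0.0), 0.0), ((0.0, -1.0), 0.0)]
groups = [[0, 1], [2, 3]]
witness_found = refutes_frechet_stationarity((1.0, -1.0), pieces, groups)   # f(1, -1) = -1
no_witness_here = refutes_frechet_stationarity((1.0, 1.0), pieces, groups)  # f(1, 1) = 1
```

Checking one candidate direction costs a single function evaluation. The difficulty lies in deciding that no refuting direction exists, since the number of linear pieces to examine can grow exponentially with the input size.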

We compare Theorem 10 with the classic hardness result of Murty and Kabadi (1987). In Murty and Kabadi (1987), checking the local optimality of a simply constrained indefinite quadratic problem (Murty and Kabadi, 1987, Problem 1) and of an unconstrained quartic polynomial objective (Murty and Kabadi, 1987, Problem 11) are both co-NP-complete. However, these hardness results are inapplicable for checking first-order necessary conditions. In fact, for any hard construction $f:\mathbb{R}^{n}\rightarrow\overline{\mathbb{R}}$ in Murty and Kabadi (1987) and a given point $\bm{x}\in\mathbb{Q}^{n}$ , testing $\bm{0}\in\widehat{\partial}f(\bm{x})$ can be done in polynomial time with respect to the input size. In Theorem 10, we show that for a class of simple unconstrained piecewise differentiable functions, even an approximate test of the first-order necessary condition $\bm{0}\in\widehat{\partial}f(\bm{x})$ for a certain point $\bm{x}$ is already computationally intractable.

Nonsmooth functions in real-world applications usually contain structures that can be exploited in theoretical analysis and algorithmic design. A subclass of piecewise differentiable functions, termed $C^{d}_{\textnormal{abs}}$ or functions representable in abs-normal form, and defined as the composition of smooth functions and the absolute value function, is introduced by Griewank (2013); see Appendix A for a brief introduction and (Griewank and Walther, 2019, Definition 2.1) for details. An important corollary of our hard construction concerns the complexity of checking an optimality condition for functions in $C^{d}_{\textnormal{abs}}$ . The following result gives an affirmative answer to a conjecture of Griewank and Walther (2019, p284):

Corollary 11 (Testing of abs-normal form).

Testing first-order minimality (FOM) for a piecewise differentiable function given in the abs-normal form is co-NP-complete.

Now, we report another notable corollary about the complexity of testing a certain stationarity concept for the empirical loss of a modern convolutional neural network.

Corollary 12 (Testing of loss of nonsmooth networks).

Let $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ be the empirical loss function of a shallow neural network with ReLU activation function, max-pooling operator, and convolution operator. Suppose the width of the first layer is $m$ . Then, for any $\eta\in(m,+\infty]\cap\overline{\mathbb{Z}}$ , testing the $\nicefrac{{1}}{{\sqrt{\eta}}}$ -Fréchet stationarity $\textnormal{dist}\big{(}\bm{0},\widehat{\partial}f(\bm{\theta})\big{)}\leqslant\nicefrac{{1}}{{\sqrt{\eta}}}$ for a certain $\bm{\theta}\in\mathbb{Q}^{d}$ is co-NP-hard, and testing $\bm{0}\in\widehat{\partial}f(\bm{\theta})$ for $\bm{\theta}$ is strongly co-NP-hard.

Corollary 12 shows a computational tractability separation for the stationarity test between smooth and nonsmooth networks. In the smooth setting, given the gradient of every component function, we can compute the gradient norm of the loss function by iteratively applying the chain rule. But in the nonsmooth case, while the subdifferential of every elemental function can be computed easily, the validity of a subdifferential chain rule like those in 8 is not justified, which turns out to cause a serious computational hurdle in the stationarity test (strong co-NP-hardness).

4 Regularity Conditions

In this section, we study the regularity conditions for the validity of the equality-type chain rule in terms of Clarke, Fréchet, and limiting subdifferentials of the empirical loss of two-layer ReLU networks.

4.1 Setup

For simplicity of reference, we introduce the following notation, which will be used in various subdifferential constructions of the empirical loss $L$ .

Definition 13.

Let the parameters $\{(u_{k},\bm{w}_{k})\}_{k=1}^{H}$ be given. We define the following shorthands:

(a) We write constants $\rho_{i}\coloneqq\ell_{i}^{\prime}\left(\sum_{k=1}^{H}u_{k}\cdot\max\left\{\bm{w}_{k}^{\top}\bm{x}_{i},0\right\}\right)$ for any $i\in[N]$ .

(b) For any $k\in[H]$ and $\bm{w}_{k}\in\mathbb{R}^{d}$ , we define the following two index sets:

[TABLE]

We may write $\mathcal{I}_{k}^{+}$ and $\mathcal{I}_{k}^{-}$ when the reference point $\bm{w}_{k}$ is clear from the context.

(c) For any $k\in[H]$ , we define the following nonempty convex compact set $G^{C}_{k}\subseteq\mathbb{R}^{d}$ related to the Clarke subdifferential:

[TABLE]

(d) For any $k\in[H]$ , we define the following nonempty compact set $G^{L}_{k}\subseteq\mathbb{R}^{d}$ related to the limiting subdifferential:

[TABLE]

(e) For any $k\in[H]$ , we define the following convex compact set $G^{F}_{k}\subseteq\mathbb{R}^{d}$ related to the Fréchet subdifferential:

[TABLE]

(f) If an equation holds for all three subdifferentials, i.e., the Clarke/limiting/Fréchet subdifferentials ( $\partial_{C}f/\partial f/\widehat{\partial}f$ ), we will write the equation simply with $\partial_{\triangleleft}f$ and also $G_{k}^{\triangleleft}$ (for $G_{k}^{C}/G_{k}^{L}/G_{k}^{F}$ ). For example, if the equation $\partial_{\triangleleft}f_{k}(\bm{w}_{k})=G_{k}^{\triangleleft}$ holds, then we get $\partial_{C}f_{k}(\bm{w}_{k})=G_{k}^{C},\partial f_{k}(\bm{w}_{k})=G_{k}^{L}$ , and $\widehat{\partial}f_{k}(\bm{w}_{k})=G_{k}^{F}$ .

4.2 Main Results

Theorem 14 (Clarke chain rule).

Under Assumption 1, we claim that the exact Clarke subdifferential chain rule holds for $L$ at a given point $(u_{1},\bm{w}_{1},\dots,u_{H},\bm{w}_{H})$ , that is

[TABLE]

if and only if the data points $\{\bm{x}_{i}\}_{i=1}^{N}$ satisfy the following Span Qualification (SQ):

[TABLE]

Remark 15.

Note that for any $k\in[H]$ , the index sets $\mathcal{I}_{k}^{-}$ and $\mathcal{I}_{k}^{+}$ can be computed in $O(Nd)$ time. Then, checking SQ is no harder than checking the Linear Independence Constraint Qualification (LICQ) in nonlinear programming and can be done with, e.g., the Zassenhaus algorithm.
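As a rough illustration of the kind of computation Remark 15 refers to, the sketch below tests whether the spans of two finite sets of vectors intersect only at the origin, using the dimension formula $\dim(U\cap V)=\dim U+\dim V-\dim(U+V)$ with exact rational arithmetic. It rests on an assumption: that SQ can be phrased as such a trivial-intersection condition between the spans of two groups of data points (the displayed definition of SQ is not reproduced in this excerpt). The function names `rank` and `spans_intersect_trivially` are hypothetical, and this is a plain rank computation rather than the Zassenhaus algorithm itself.

```python
from fractions import Fraction


def rank(vectors):
    """Exact rank of a list of equal-length integer or rational vectors."""
    rows = [[Fraction(v) for v in vec] for vec in vectors]
    ncols = len(rows[0]) if rows else 0
    r = 0  # number of pivots found so far
    for c in range(ncols):
        # Look for a pivot in column c among the rows not yet used.
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def spans_intersect_trivially(A, B):
    """True iff span(A) and span(B) share only the zero vector.

    ASSUMPTION: SQ is taken to be a condition of this form; A and B stand
    for the data points indexed by two index sets.
    """
    # dim(U ∩ V) = dim U + dim V - dim(U + V)
    return rank(A) + rank(B) - rank(A + B) == 0
```

With rational data the test is exact; with floating-point data one would need a tolerance-based rank instead.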

Theorem 16 (Limiting chain rule).

Under Assumption 1, we claim that the exact limiting subdifferential chain rule holds for $L$ at a given point $(u_{1},\bm{w}_{1},\dots,u_{H},\bm{w}_{H})$ , that is

[TABLE]

if and only if the data points $\{\bm{x}_{i}\}_{i=1}^{N}$ satisfy SQ.

For the Fréchet subdifferential, the situation is different, as the default chain rule is the reverse set inclusion $\widehat{\partial}L(u_{1},\bm{w}_{1},\dots,u_{H},\bm{w}_{H})\supseteq\prod_{k=1}^{H}\left\{\sum_{i=1}^{N}\rho_{i}\cdot\max\left\{\bm{w}_{k}^{\top}\bm{x}_{i},0\right\}\right\}\times G^{F}_{k}$ ; see (Rockafellar and Wets, 2009, Corollary 10.9, Theorem 10.49). If $\widehat{\partial}L(u_{1},\bm{w}_{1},\dots,u_{H},\bm{w}_{H})=\emptyset$ , the exact chain rule holds trivially, as $G^{F}_{k}$ can only be the empty set. Therefore, the interesting case is when the Fréchet subdifferential is nonempty.

Theorem 17 (Fréchet chain rule).

Under Assumption 1, for any given point such that the subdifferential $\widehat{\partial}L(u_{1},\bm{w}_{1},\dots,u_{H},\bm{w}_{H})\neq\emptyset$ , we have the following exact chain rule for the empirical loss $L$

[TABLE]

if and only if the data points $\{\bm{x}_{i}\}_{i=1}^{N}$ satisfy SQ.

4.3 Discussion

There are several existing regularity conditions related to the validity of the exact chain rule for the empirical loss. We briefly introduce them here and defer the details to Definition 54 in Section C.5.

Definition 18 (Regularities).

We consider the following regularity conditions:

•

General position data: (Montufar et al., 2014, Section 2.2), (Yun et al., 2018, Assumption 2), and Bubeck et al. (2020);

•

Linear Independence Kink Qualification (LIKQ): (Griewank and Walther, 2019, Definition 2.6) and (Griewank and Walther, 2016, Definition 2);

•

Linearly Independent Activated Data (LIAD): Let the index set $\mathcal{J}_{k}\coloneqq\{j:\bm{w}_{k}^{\top}\bm{x}_{j}=0\}$ . For any fixed $k\in[H]$ , the data points $\{\bm{x}_{i}\}_{i\in\mathcal{J}_{k}}$ are linearly independent.

The general position assumption is from the study of hyperplane arrangements. If the data points are generated from an absolutely continuous probability measure (with respect to the Lebesgue measure), then they are in general position almost surely. The LIKQ is introduced by Griewank and Walther (2016, Definition 2) to ensure an efficient Fréchet stationarity test for piecewise differentiable functions represented in abs-normal form. See Appendix A for a brief introduction. The LIAD condition is natural and equivalent to the surjectivity condition in 8. Let us present the following result, in which we establish the relationship among SQ and the three other regularity conditions in Definition 18.

Proposition 19 (Regularity comparison).

For the empirical loss of a shallow ReLU network under Assumption 1, we have the following relationship:

[TABLE]

We exhibit two examples to show that the one-sided arrows in Proposition 19 are strict.

Example 20 (SQ $\nRightarrow$ LIAD).

Let the function $f:\mathbb{R}^{4}\rightarrow\mathbb{R}$ be given as

[TABLE]

Consider $x=y=z=b=0$ . It is easy to verify that SQ is satisfied but LIAD is not. Besides, $f$ is nonconvex, nonsmooth, and non-separable. Neither $f$ nor $-f$ is Clarke regular. But by Theorem 14, the equality-type subdifferential sum rule still holds.

Example 21 (LIAD $\nRightarrow$ general position).

Let the function $f:\mathbb{R}^{3}\rightarrow\mathbb{R}$ be given as

[TABLE]

Consider $x=y=1$ and $b=-1$ . LIAD is satisfied, but the data is not in general position.

In practice, for data $\bm{x}\in\mathbb{R}^{d}$ , if the features of data include a discrete-valued component, e.g., $x_{1}\in\{-1,+1\}$ , then the points $\{\bm{x}_{i}\}_{i=1}^{N}$ are rarely in general position, as at least half of them must lie in the same affine hyperplane $\{\bm{y}:\bm{e}_{1}^{\top}\bm{y}=1\}$ or $\{\bm{y}:\bm{e}_{1}^{\top}\bm{y}=-1\}$ .

Remark 22 ( $G^{L}_{k}$ for general position data).

Besides, if the data points are in general position, we have the following compact representation for $G_{k}^{L}$

[TABLE]

The following corollary concerning the Clarke regularity of all local minimizers could be of independent interest.

Corollary 23.

If, at a point, SQ is satisfied and the empirical loss function $L$ has a nonempty Fréchet subdifferential there, then the function $L$ is Clarke regular at that point. Consequently, with data in general position, $L$ is Clarke regular at every local minimizer.

5 Testing of Stationarity Concepts

To perform the stationarity test, we need the following quantitative regularities to characterize the curvature of the pieces in the empirical loss.

Assumption 24.

In this section, we further assume that for any $i\in[N]$ , the norm of data $\|\bm{x}_{i}\|_{2}\leqslant R$ and the function $\ell_{i}$ is $L_{\ell}$ -Lipschitz continuous with an $L_{\ell^{\prime}}$ -Lipschitz continuous gradient $\ell^{\prime}_{i}$ .

5.1 Exact Stationarity Test

As an immediate illustration of the results in Section 4, we record the following exact testing schemes for Clarke and Fréchet stationary points. Compared with the developments in Yun et al. (2018), which check the Fréchet stationarity from the primal perspective and use polyhedral geometry to avoid redundant computation, our treatment of Fréchet stationarity, which uses Theorem 17, is transparent and its correctness is self-evident.

Clarke stationarity.

Suppose that SQ is satisfied at the point $(u_{1},\bm{w}_{1},\dots,u_{H},\bm{w}_{H})$ . By Theorem 14, it is a Clarke stationary point of $L$ if and only if, for any $k\in[H]$ ,

(a) $0=\sum_{i=1}^{N}\rho_{i}\cdot\max\left\{\bm{w}_{k}^{\top}\bm{x}_{i},0\right\}$ ;

(b) $\bm{0}\in\sum_{i\in[N]\backslash(\mathcal{I}_{k}^{+}\cup\mathcal{I}_{k}^{-})}u_{k}\rho_{i}\cdot\mathbf{1}_{\bm{w}_{k}^{\top}\bm{x}_{i}>0}\cdot\bm{x}_{i}+\sum_{j\in\mathcal{I}_{k}^{+}\cup\mathcal{I}_{k}^{-}}u_{k}\rho_{j}\bm{x}_{j}\cdot[0,1]$ .

Condition (a) is a simple equality test and condition (b) can be checked by solving a linear programming problem. Algorithm 1 is for testing $\varepsilon$ -Clarke stationary points.
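To spell out why the Clarke condition (b) is a linear program, one possible way to write the membership test is as a feasibility problem in the interval coefficients. The shorthands $\mathcal{A}_{k}$ and $\bm{c}_{k}$ are introduced here only for this illustration and do not appear in the original text.

```latex
\mathcal{A}_{k}\coloneqq\mathcal{I}_{k}^{+}\cup\mathcal{I}_{k}^{-},
\qquad
\bm{c}_{k}\coloneqq\sum_{i\in[N]\backslash\mathcal{A}_{k}}u_{k}\rho_{i}\cdot\mathbf{1}_{\bm{w}_{k}^{\top}\bm{x}_{i}>0}\cdot\bm{x}_{i},
\qquad
\text{find } \bm{t}\in[0,1]^{\mathcal{A}_{k}}
\ \text{ such that }\
\bm{c}_{k}+\sum_{j\in\mathcal{A}_{k}}t_{j}\,u_{k}\rho_{j}\,\bm{x}_{j}=\bm{0}.
```

This system has $|\mathcal{A}_{k}|$ bounded variables and $d$ linear equality constraints, so any LP solver run with a zero objective decides it.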

Fréchet stationarity.

Suppose that SQ is satisfied at the point $(u_{1},\bm{w}_{1},\dots,u_{H},\bm{w}_{H})$ . By Theorem 17, it is a Fréchet stationary point of $L$ if and only if, for any $k\in[H]$ ,

(a) $0=\sum_{i=1}^{N}\rho_{i}\cdot\max\left\{\bm{w}_{k}^{\top}\bm{x}_{i},0\right\}$ ;

(b) $\mathcal{I}_{k}^{-}=\emptyset$ ;

(c) $\bm{0}\in\sum_{i\in[N]\backslash(\mathcal{I}_{k}^{+}\cup\mathcal{I}_{k}^{-})}u_{k}\rho_{i}\cdot\mathbf{1}_{\bm{w}_{k}^{\top}\bm{x}_{i}>0}\cdot\bm{x}_{i}+\sum_{j\in\mathcal{I}_{k}^{+}}u_{k}\rho_{j}\bm{x}_{j}\cdot[0,1]$ .

Similarly, all above conditions can be checked in polynomial time with Algorithm 2.

5.2 Robust Stationarity Test

In this subsection, we introduce our main algorithmic results. First, we formally define the notion of stationarities that we are aiming to check; see Davis and Drusvyatskiy (2019); Kornowski and Shamir (2022a); Tian et al. (2022) for results on finding near-approximately stationary points for Lipschitz functions.

Definition 25 (Near-Approximate Stationarity, NAS).

Given a locally Lipschitz function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ , we say that the point $\bm{x}\in\mathbb{R}^{d}$ is an

•

$(\varepsilon,\delta)$ -Clarke NAS point, if $\textnormal{dist}\Big{(}\bm{0},\cup_{\bm{y}\in\mathbb{B}_{\delta}(\bm{x})}\partial_{C}f(\bm{y})\Big{)}\leqslant\varepsilon$ ;

•

$(\varepsilon,\delta)$ -Fréchet NAS point, if $\textnormal{dist}\Big{(}\bm{0},\cup_{\bm{y}\in\mathbb{B}_{\delta}(\bm{x})}\widehat{\partial}f(\bm{y})\Big{)}\leqslant\varepsilon$ .

We consider a constructive approach, that is, we certify the $(\varepsilon,\delta)$ -Clarke NAS of a point $\bm{x}$ for the function $f$ only if we find a point $\bm{y}\in\mathbb{B}_{\delta}(\bm{x})$ satisfying $\textnormal{dist}(\bm{0},\partial_{C}f(\bm{y}))\leqslant\varepsilon$ . Note that, at any time, if a point $\bm{y}\in\mathbb{B}_{\delta}(\bm{x})$ passes the exact stationarity test, say, with Algorithm 1, then $\bm{x}$ must be an $(\varepsilon,\delta)$ -Clarke NAS point. In other words, the test produces no false positives. The question is: if $\bm{x}$ is sufficiently close to a Clarke stationary point, can we always find a point $\bm{y}$ near $\bm{x}$ such that $\bm{y}$ is $\varepsilon$ -Clarke stationary? That is to say, we need to control the false negatives of our robust test. Without exploiting structures in the objective function, finding such a point is impossible in general (Tian and So, 2022, Theorem 2.7). Our technique is a new rounding scheme (see Algorithm 3), which is motivated by the notion of active manifold identification (Lewis, 2002; Lemaréchal et al., 2000) in the literature. This new rounding scheme is capable of identifying the activation pattern of the target stationary point that $\bm{x}$ is sufficiently close to.

Now, suppose that $f$ is $L$ -smooth and a point $\bm{x}^{*}$ satisfies $\|\nabla f(\bm{x}^{*})\|\leqslant\varepsilon$ . Without knowing the concrete structure of $f$ , what we can say for any point $\bm{y}\in\mathbb{B}_{\delta}(\bm{x}^{*})$ is that $\|\nabla f(\bm{y})\|\leqslant\varepsilon+L\cdot\delta$ , which is the best result we can hope for from our test, as we do not assume any concrete structure in the losses $\ell_{i}$ except their smoothness. Such an estimate fails in general for a nonsmooth function. Consider $f(x)=|x|$ and $x^{*}=0$ . For any $\delta>0$ and $0\neq y\in\mathbb{B}_{\delta}(x^{*})$ , we have $|f^{\prime}(y)|=1$ .
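The contrast between the two cases can be checked numerically. The following minimal sketch compares the smooth function $f(y)=y^{2}/2$ (chosen here only as an example with $L=1$ and stationary point $0$, so $\varepsilon=0$) with $f(y)=|y|$ at a few points near $0$. The variable names are illustrative.

```python
def smooth_grad(y):
    """Derivative of f(y) = y**2 / 2, which is 1-smooth (L = 1)."""
    return y


def abs_grad(y):
    """Derivative of f(y) = |y| at a point y != 0."""
    return 1.0 if y > 0 else -1.0


delta = 1e-3
eps, L = 0.0, 1.0  # x* = 0 is exactly stationary for both functions
nearby = [-delta, -delta / 2, delta / 2, delta]  # nonzero points in the delta-ball

# Smooth case: the gradient norm stays within eps + L * delta.
smooth_bound_ok = all(abs(smooth_grad(y)) <= eps + L * delta for y in nearby)

# Nonsmooth case: the derivative has magnitude 1 no matter how small delta is.
abs_grad_norms = [abs(abs_grad(y)) for y in nearby]
```

Shrinking `delta` tightens the smooth bound but leaves `abs_grad_norms` unchanged, which is the obstruction the rounding scheme is meant to address.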

5.2.1 Testing Clarke NAS

We define two constants that will be used in the analysis.

Definition 26 (Clarke).

Given a point $(u_{1}^{*},\bm{w}_{1}^{*},\dots,u_{H}^{*},\bm{w}_{H}^{*})$ with a Euclidean norm $B\in[0,+\infty)$ , we define the following constants about the separation and curvature of pieces around this point:

•

Separation: $C_{\tau}^{\textnormal{Clarke}}\coloneqq\frac{1}{4R}\cdot\min\left\{\left|\bm{x}_{i}^{\top}\bm{w}^{*}_{k}\right|:i\in[N],k\in[H],\bm{x}_{i}^{\top}\bm{w}_{k}^{*}\neq 0\right\};$

•

Curvature: $C_{\mu}^{\textnormal{Clarke}}\coloneqq\textnormal{poly}(B,R,L_{\ell},L_{\ell^{\prime}},N,H)$ ; see Section D.1 for the exact value.

Remark 27.

If for any $i\in[N]$ and $k\in[H]$ , it holds $\bm{x}_{i}^{\top}\bm{w}_{k}^{*}=0$ , then we define the separation constant $C_{\tau}^{\textnormal{Clarke}}\coloneqq+\infty$ , as in the optimization of extended-real-valued functions, $\inf\emptyset=+\infty$ . It is notable that, while the separation constant $C_{\tau}^{\textnormal{Clarke}}$ is usually unknown when running the testing algorithm, the curvature constant $C_{\mu}^{\textnormal{Clarke}}$ can be easily estimated when the candidate network and the radius $\delta$ are given.
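For a known reference point, the separation constant in Definition 26 is a direct computation. The sketch below is one way to evaluate it for data `X` (a list of vectors $\bm{x}_{i}$), weights `W` (a list of vectors $\bm{w}_{k}^{*}$), and the norm bound `R`; the function name and argument layout are assumptions made for this illustration. It uses an exact comparison with zero, which suits integer or rational inputs; floating-point inputs would need a tolerance.

```python
import math


def clarke_separation(X, W, R):
    """(1 / 4R) * min |x_i^T w_k| over pairs with a nonzero inner product."""
    products = [sum(a * b for a, b in zip(x, w)) for x in X for w in W]
    nonzero = [abs(p) for p in products if p != 0]
    # Remark 27: the minimum over an empty set is taken to be +infinity.
    return min(nonzero) / (4 * R) if nonzero else math.inf
```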

Theorem 28 (Robust Clarke test).

Let an $\varepsilon$ -Clarke stationary point $(u_{1}^{*},\bm{w}_{1}^{*},\dots,u_{H}^{*},\bm{w}_{H}^{*})$ satisfying SQ be given. For any $0<\delta\leqslant C_{\tau}^{\textnormal{Clarke}}$ and any

[TABLE]

if the output point $(\widehat{u}_{1},\widehat{\bm{w}}_{1},\dots,\widehat{u}_{H},\widehat{\bm{w}}_{H})$ of Algorithm 3 satisfies SQ, then we have

[TABLE]

In Theorem 28, we show that for a point that is sufficiently close to an $\varepsilon$ -Clarke stationary one, and a properly chosen parameter $\delta>0$ , one can correctly certify the near-approximate stationarity of this point in the same style as if the function $L$ were smooth, by calling Algorithm 4 with $\textsc{RTest}(\textsc{ETest-C},\textsc{Rnd-C},\cdots)$ . A natural question here is how to choose a proper parameter $\delta$ , as the separation constant $C_{\tau}^{\textnormal{Clarke}}$ is usually unknown. It turns out that a simple line search will work for that.

Remark 29 (Line search).

Set the initial value of the radius $\delta$ to, say, $\delta_{0}=1$ . Then, in the $t$ -th iteration, run Algorithm 4 with parameter $\delta_{t}$ and set $\delta_{t+1}=\delta_{t}/2$ . Note that for a sufficiently small $\delta$ , the rounding scheme in Algorithm 3 becomes superfluous, as for any $i\in[N]$ and $k\in[H]$ such that $\bm{x}^{\top}_{i}\bm{w}_{k}\neq 0$ , we have $|\bm{x}_{i}^{\top}\bm{w}_{k}|>2R\cdot\delta$ for a small $\delta$ . Therefore, we can stop the line search within at most

[TABLE]

iterations. It is immediate that, if $(u_{1},\bm{w}_{1},\dots,u_{H},\bm{w}_{H})\in\mathbb{B}_{C_{\tau}^{\textnormal{Clarke}}/2}\big{(}(u_{1}^{*},\bm{w}_{1}^{*},\dots,u_{H}^{*},\bm{w}_{H}^{*})\big{)}$ , then there exists a radius $\delta_{t}\in[C_{\tau}^{\textnormal{Clarke}}/2,C_{\tau}^{\textnormal{Clarke}}]$ in the iteration sequence such that

[TABLE]

This search scheme also works for the Fréchet NAS test and we will not repeat that.
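The halving scheme of Remark 29 can be sketched as a short loop. Here `robust_test` is a hypothetical callback standing in for one call to Algorithm 4 with radius `delta` (returning `True` when a certificate is found), and `max_iters` is a placeholder for the iteration bound stated in the remark, which is not reproduced here.

```python
def halving_line_search(robust_test, delta0=1.0, max_iters=50):
    """Try radii delta0, delta0/2, delta0/4, ... until robust_test succeeds.

    Returns (delta, t) for the first successful iteration t,
    or (None, max_iters) if no radius in the sequence succeeds.
    """
    delta = delta0
    for t in range(max_iters):
        if robust_test(delta):
            return delta, t
        delta /= 2  # delta_{t+1} = delta_t / 2
    return None, max_iters
```

For instance, if the test succeeds exactly when `delta <= 0.3`, the loop tries 1, 0.5, and then 0.25, which lies between half the threshold and the threshold itself.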

5.2.2 Testing Fréchet NAS

Unlike the Clarke case, we need the following extra nondegeneracy condition on $\ell_{i}$ to identify the pattern of $\{u_{k}^{*}\}_{k}$ and avoid the Fréchet subdifferential being empty.

Assumption 30.

Given a point $(u_{1}^{*},\bm{w}_{1}^{*},\dots,u_{H}^{*},\bm{w}_{H}^{*})$ , we assume that for any $i\in[N]$ such that $\min_{k\in[H]}|\bm{x}_{i}^{\top}\bm{w}_{k}^{*}|=0$ , we have $\ell_{i}^{\prime}\left(\sum_{k=1}^{H}u_{k}^{*}\cdot\max\left\{(\bm{w}_{k}^{*})^{\top}\bm{x}_{i},0\right\}\right)\neq 0$ .

The following two constants will be used in the analysis.

Definition 31 (Fréchet).

Given a point $(u_{1}^{*},\bm{w}_{1}^{*},\dots,u_{H}^{*},\bm{w}_{H}^{*})$ with a Euclidean norm $B\in[0,+\infty)$ , we define two constants concerning the separation and curvature of pieces around this point:

•

Separation: $C_{\tau}^{\textnormal{Fr\'{e}chet}}\coloneqq\min\left\{\min_{\begin{subarray}{c}i\in[N],k\in[H],\\ \bm{x}_{i}^{\top}\bm{w}_{k}^{*}\neq 0\\ \end{subarray}}\frac{\left|\bm{x}_{i}^{\top}\bm{w}^{*}_{k}\right|}{4R},\min_{\begin{subarray}{c}i\in[N],k\in[H],\\ \bm{x}_{i}^{\top}\bm{w}_{k}^{*}=0,u_{k}^{*}\cdot\rho_{i}^{*}>0\end{subarray}}\frac{u_{k}^{*}\cdot\rho_{i}^{*}}{L_{\ell^{\prime}}(4HRB^{2}+1)}\right\};$

•

Curvature: $C_{\mu}^{\textnormal{Fr\'{e}chet}}\coloneqq\textnormal{poly}(B,R,L_{\ell},L_{\ell^{\prime}},N,H)$ ; see Section D.2 for the exact value.

Then, for the Fréchet NAS test, we have the following result similar to Theorem 28.

Theorem 32 (Robust Fréchet test).

Let an $\varepsilon$ -Fréchet stationary point $(u_{1}^{*},\bm{w}_{1}^{*},\dots,u_{H}^{*},\bm{w}_{H}^{*})$ satisfying SQ be given. For any $0<\delta\leqslant C_{\tau}^{\textnormal{Fr\'{e}chet}}$ and any

[TABLE]

if the output point $(\widehat{u}_{1},\widehat{\bm{w}}_{1},\dots,\widehat{u}_{H},\widehat{\bm{w}}_{H})$ of Algorithm 5 satisfies SQ, then we have

[TABLE]

Appendix A Abs-Normal Form of Piecewise Differentiable Functions

We briefly review the abs-normal representation of a subclass of piecewise differentiable functions. See Griewank (2013); Griewank and Walther (2016) for details.

A.1 The General Framework

The abs-normal representation (Griewank, 2013) is a piecewise linearization scheme concerning a certain subclass of piecewise differentiable functions in the sense of Scholtes (2012). In this subclass, functions are defined as compositions of smooth functions and the absolute value function. By the identities $\max\{a,b\}=(a+b)/2+|a-b|/2,\min\{a,b\}=(a+b)/2-|a-b|/2,$ and $\max\{x,0\}=x/2+|x|/2$ , compositions with these nonsmooth elemental functions can also be represented in the abs-normal form.
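The three identities that reduce max, min, and ReLU to the absolute value function are easy to check numerically. The helper names below are illustrative only.

```python
def max_via_abs(a, b):
    """max{a, b} = (a + b)/2 + |a - b|/2."""
    return (a + b) / 2 + abs(a - b) / 2


def min_via_abs(a, b):
    """min{a, b} = (a + b)/2 - |a - b|/2."""
    return (a + b) / 2 - abs(a - b) / 2


def relu_via_abs(x):
    """max{x, 0} = x/2 + |x|/2."""
    return x / 2 + abs(x) / 2
```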

Let $\varphi:\mathbb{R}^{d}\rightarrow\mathbb{R}$ be a function in this subclass. By numbering all inputs to the absolute value functions in the evaluation order as “switching variables” $z_{i}$ for $i\in\{1,\dots,s\}$ , the function $\bm{x}\mapsto y=\varphi(\bm{x})$ can be written in the following abs-normal form:

[TABLE]

where $\bm{x}\in\mathbb{R}^{d},\bm{p}\in\mathbb{R}_{+}^{s}$ , the smooth mapping $F:\mathbb{R}^{d}\times\mathbb{R}^{s}_{+}\rightarrow\mathbb{R}^{s}$ , and the smooth function $f:\mathbb{R}^{d}\times\mathbb{R}^{s}\rightarrow\mathbb{R}$ . As the numbering of $\{z_{i}\}_{i}$ is in the evaluation order, $z_{i}$ is a function of $z_{j}$ only if $j<i$ . In sum, we have

[TABLE]

where $\bm{z}(\bm{x})$ is a successive evaluation of $\{z_{i}\}_{i=1}^{s}$ with given $\bm{x}$ . To see that such an evaluation of $\bm{z}(\bm{x})$ is well-defined, note that $z_{1}=F_{1}(\bm{x})$ and for any $1<i\leqslant s$ ,

[TABLE]

We remark that, similar to the Difference of Convex (DC) decomposition in DC programming, the function $\varphi$ may have many different abs-normal decompositions. The following vectors and matrices are useful when studying a function in abs-normal form:

[TABLE]

For any $\bm{\sigma}\in\{-1,1\}^{s}$ , we will denote by $\bm{\Sigma}\coloneqq\mathop{\textnormal{Diag}}(\bm{\sigma})\in\{-1,0,1\}^{s\times s}$ . Let us define (see also (Griewank and Walther, 2016, Equation (11)))

[TABLE]

which will play a key role in the definition of LIKQ (see Definition 54).

A.2 Abs-Normal Form of Shallow ReLU Networks

We rewrite the empirical loss of the shallow ReLU network with absolute value functions as

[TABLE]

Then, as there are $N\cdot H$ absolute value evaluations in total, we define the switching variable $\bm{z}\in\mathbb{R}^{NH}$ and the smooth mapping $F$ as

[TABLE]

The smooth function $f$ in the abs-normal form can be written as

[TABLE]

where $\bm{p}\in\mathbb{R}_{+}^{NH}$ . Consequently, the matrix $\bm{L}=\bm{0}$ , which implies the function $L$ is “simply switched” in the sense of Griewank and Walther (2016). For the matrix $\bm{Z}$ and any $k\in[H],i\in[N]$ , the $(N(k-1)+i)$ -th row of $\bm{Z}\in\mathbb{R}^{NH\times H(d+1)}$ can be written as

[TABLE]

Appendix B Proofs for Section 3

B.1 The Problems

Problem 33 (3SAT).

Let a collection of clauses $\{C_{i}(\bm{x})\}_{i=1}^{n}$ on Boolean variables $\bm{x}\in\{0,1\}^{m}$ be given, where each clause $C_{i}(\bm{x})$ is a disjunction of at most three literals for any $1\leqslant i\leqslant n$ . Consider the following formula $C(\bm{x})$ in conjunctive normal form:

$$C(\bm{x})\coloneqq\bigwedge_{i=1}^{n}C_{i}(\bm{x}).$$

Is there an $\bm{x}\in\{0,1\}^{m}$ satisfying $C(\bm{x})=1$ ?

Problem 34 (Piecewise Linear Test, PLT).

Let $\varepsilon\in[0,\frac{1}{\sqrt{m}})$ and the input data $\{\bm{y}_{i}\}_{i=1}^{3n}\subseteq\mathbb{Z}^{m}$ be given. Let us define a function $f_{\textsf{PLT}}:\mathbb{R}^{m}\rightarrow\mathbb{R}$ as

[TABLE]

Is there a vector $\bm{g}\in\mathbb{R}^{m}$ satisfying $\|\bm{g}\|\leqslant\varepsilon$ and

[TABLE]

Its complement is given by

[TABLE]

Problem 35 (Neural Network Test, NNT).

Suppose $\varepsilon\in[0,\frac{1}{\sqrt{m}}]$ . Let the input data $\bm{Y}=\left[\begin{array}[]{c|c|c}\bm{y}_{1}&\cdots&\bm{y}_{3n}\end{array}\right]\subseteq\mathbb{Z}^{m\times 3n}$ be given. Let us define $f_{\textsf{NNT}}:\mathbb{R}^{3n}\times\mathbb{R}^{m}\rightarrow\mathbb{R}$ as

[TABLE]

Is $(-\mathbf{1}_{3n},\bm{0}_{m})$ an $\varepsilon$ -Fréchet stationary point of $f_{\textsf{NNT}}$ , i.e., $\textnormal{dist}\big{(}\bm{0},\widehat{\partial}f_{\textsf{NNT}}(-\mathbf{1}_{3n},\bm{0}_{m})\big{)}\leqslant\varepsilon$ ?

Problem 36 (Abs-Normal Form Test, ANFT).

Suppose a piecewise linear function is given in the abs-linear form with vectors and matrices $\bm{a}\in\mathbb{R}^{n},\bm{b}\in\mathbb{R}^{s},\bm{Z}\in\mathbb{R}^{s\times n},\bm{L}\in\mathbb{R}^{s\times s}$ . Is there a definite signature vector $\bm{\sigma}\in\{-1,1\}^{s}$ such that the following system with respect to $\bm{\mu}_{\sigma}\in\mathbb{R}^{s}$ is incompatible

[TABLE]

B.2 Hardness of Piecewise Linear Test

Lemma 37.

Problem 34 (PLT) is co-NP-hard.

Proof.

We have to show that $\overline{\textnormal{PLT}}$ is NP-hard. 3SAT in Problem 33 is known to be strongly NP-complete (Garey and Johnson, 1979). We give a polynomial-time reduction from 3SAT to $\overline{\textnormal{PLT}}$ . Given any instance of 3SAT, we get clauses $\{C_{i}(\bm{x})\}_{i=1}^{n}$ for $\bm{x}\in\{0,1\}^{m}$ . We will refer to literals in $C_{t}(\bm{x})$ by their positions. For example, given $C_{t}(\bm{x})=x_{i}\vee(^{\neg}x_{j})\vee x_{k}$ , we say the literal $x_{i}$ occurs in $C_{t}(\bm{x})$ at position $1$ , the literal ${}^{\neg}x_{j}$ occurs in $C_{t}(\bm{x})$ at position $2$ , and the literal $x_{k}$ occurs in $C_{t}(\bm{x})$ at position $3$ . We construct the data $\{\bm{y}_{i}\}_{i=1}^{3n}\subseteq\mathbb{Z}^{m}$ as follows

[TABLE]

Note the following positive $1$ -homogeneous function in the construction of PLT

[TABLE]

Suppose that for any $\bm{g}$ with $0\leqslant\|\bm{g}\|\leqslant\varepsilon$ , there exists $\bm{d}\in\mathbb{R}^{m}$ such that $f_{\textsf{PLT}}(\bm{d})<\langle\bm{g},\bm{d}\rangle$ . We will exhibit an $\bm{x}\in\{0,1\}^{m}$ such that the given 3SAT instance is satisfied. Taking $\bm{g}=\bm{0}$ , there exists $\bm{d}\in\mathbb{R}^{m}$ such that $f_{\textsf{PLT}}(\bm{d})<0$ . For any $i\in[m]$ , let

[TABLE]

We show $C(\bm{x})=1$ . By $f_{\textsf{PLT}}(\bm{d})<0$ , we get for any $i\in[n]$

[TABLE]

which implies that there exists a $j^{\prime}\in\{1,2,3\}$ such that $\bm{d}^{\top}\bm{y}_{3(i-1)+j^{\prime}}>0$ . Let $k$ be the index of the Boolean variable whose literal occurs in $C_{i}(\bm{x})$ at position $j^{\prime}$ . Now we consider two cases. If $x_{k}$ occurs in $C_{i}(\bm{x})$ at position $j^{\prime}$ , then $\bm{y}_{3(i-1)+j^{\prime}}=\bm{e}_{k}$ . We get $\bm{d}^{\top}\bm{y}_{3(i-1)+j^{\prime}}=\bm{d}^{\top}\bm{e}_{k}=d_{k}>0$ . So, by definition, $x_{k}=1$ , which implies $C_{i}(\bm{x})=1$ . Otherwise, if ${}^{\neg}x_{k}$ occurs in $C_{i}(\bm{x})$ at position $j^{\prime}$ , then $\bm{y}_{3(i-1)+j^{\prime}}=-\bm{e}_{k}$ . We get $\bm{d}^{\top}\bm{y}_{3(i-1)+j^{\prime}}=-\bm{d}^{\top}\bm{e}_{k}=-d_{k}>0$ . So ${}^{\neg}x_{k}=1$ by definition, which implies $C_{i}(\bm{x})=1$ . This shows that $C(\bm{x})=\bigwedge_{i=1}^{n}C_{i}(\bm{x})=1$ and the given 3SAT instance is satisfied.

Conversely, we show that if there exists a vector $\bm{g}$ such that $0\leqslant\|\bm{g}\|\leqslant\varepsilon$ and $f_{\textsf{PLT}}(\bm{d})\geqslant\langle\bm{g},\bm{d}\rangle$ for all $\bm{d}\in\mathbb{R}^{m}$ , then the 3SAT instance cannot be satisfied. Suppose to the contrary that there exists $\bm{x}\in\{0,1\}^{m}$ such that $C(\bm{x})=1$ . For any $i\in[m]$ , let

[TABLE]

As $\bigwedge_{i=1}^{n}C_{i}(\bm{x})=1$ , for any $i\in[n]$ , there exists a literal of clause $C_{i}(\bm{x})$ that is satisfied. Let the index of this literal be $k^{\prime}$ and the position of it in $C_{i}(\bm{x})$ be $j^{\prime}$ . We consider two cases. If literal $x_{k^{\prime}}$ occurs in $C_{i}(\bm{x})$ at position $j^{\prime}$ , then $\bm{y}_{3(i-1)+j^{\prime}}=\bm{e}_{k^{\prime}}$ . As $C_{i}(\bm{x})=1$ due to literal $x_{k^{\prime}}$ , we get $x_{k^{\prime}}=1$ and $d_{k^{\prime}}=1$ by definition. Then, for such $i\in[n]$ , we get

[TABLE]

Otherwise, if literal ${}^{\neg}x_{k^{\prime}}$ occurs in $C_{i}(\bm{x})$ at position $j^{\prime}$ , then $\bm{y}_{3(i-1)+j^{\prime}}=-\bm{e}_{k^{\prime}}$ . As $C_{i}(\bm{x})=1$ due to literal ${}^{\neg}x_{k^{\prime}}$ , we get $x_{k^{\prime}}=0$ and $d_{k^{\prime}}=-1$ by definition. Then, for any $i\in[n]$ , we get

[TABLE]

This gives

[TABLE]

a contradiction. Hence Problem 34 is co-NP-hard. ∎
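The mechanics of the reduction can be traced on a small instance. The sketch below encodes clauses as tuples of signed 1-based variable indices (a DIMACS-style convention chosen here, not taken from the text), maps a literal $x_{k}$ to $\bm{e}_{k}$ and ${}^{\neg}x_{k}$ to $-\bm{e}_{k}$ as in the proof, and uses the two conversions between assignments and directions described there. Two points are assumptions: every clause is taken to have exactly three literals, and $f_{\textsf{PLT}}(\bm{d})$ is taken to be $\max_{i}q_{i}(\bm{d})$ with $q_{i}(\bm{d})=-\sum_{j=1}^{3}\max\{\bm{d}^{\top}\bm{y}_{3(i-1)+j},0\}$ , which matches the way the proof uses $f_{\textsf{PLT}}(\bm{d})<0$ but is not copied from the displayed definition.

```python
def build_data(clauses, m):
    """Literal +k -> e_k, literal -k -> -e_k (k is 1-based)."""
    Y = []
    for clause in clauses:
        if len(clause) != 3:
            raise ValueError("this sketch assumes exactly three literals per clause")
        for lit in clause:
            y = [0] * m
            y[abs(lit) - 1] = 1 if lit > 0 else -1
            Y.append(y)
    return Y


def q(i, d, Y):
    """q_i(d) = -sum_j max(d^T y_{3i+j}, 0) for a 0-based clause index i."""
    dots = (sum(a * b for a, b in zip(d, Y[3 * i + j])) for j in range(3))
    return -sum(max(v, 0) for v in dots)


def f_plt(d, Y):
    # ASSUMPTION: f_PLT(d) = max_i q_i(d); see the lead-in.
    return max(q(i, d, Y) for i in range(len(Y) // 3))


def direction_from_assignment(x):
    """d_k = 1 if x_k = 1, and d_k = -1 if x_k = 0."""
    return [1 if xi == 1 else -1 for xi in x]


def assignment_from_direction(d):
    """x_k = 1 if d_k > 0, and x_k = 0 otherwise."""
    return [1 if di > 0 else 0 for di in d]


def satisfies(clauses, x):
    """Evaluate the CNF formula at the 0/1 assignment x."""
    return all(any((x[abs(l) - 1] == 1) == (l > 0) for l in clause)
               for clause in clauses)
```

For the clauses `[(1, -2, 3), (-1, 2, 3)]` and the satisfying assignment `[1, 1, 0]`, the induced direction gives a strictly negative value of `f_plt`, and mapping that direction back recovers the same assignment.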

While it is not clear whether Problem 34 with a positive $\varepsilon$ is in the complexity class co-NP, we show that, when $\varepsilon=0$ , Problem 34 is in co-NP.

Lemma 38.

If $\varepsilon=0$ , then Problem 34 is in the complexity class co-NP.

Proof.

For $\varepsilon=0$ , we only need to test $f_{\textsf{PLT}}(\bm{d})\geqslant 0,\forall\bm{d}\in\mathbb{R}^{m}$ . Given any $\bm{d}\in\mathbb{R}^{m}$ , checking whether $f_{\textsf{PLT}}(\bm{d})<0$ can be done in $O(mn\log n)$ time. If the answer to Problem 34 is no, then by the homogeneity of $f_{\textsf{PLT}}$ , there exist a direction $\bm{d}$ and a vector $\bm{s}\in\{1,2,3\}^{n}$ such that $f_{\textsf{PLT}}(\bm{d})\leqslant-1$ and $\bm{d}^{\top}\bm{y}_{3(i-1)+s_{i}}\geqslant 1$ for any $i\in[n]$ . There are only $3^{n}$ elements in the set $\{1,2,3\}^{n}$ and all resulting matrices $\left[\begin{array}[]{c|c|c}\bm{y}_{s_{1}}&\cdots&\bm{y}_{3n-3+s_{n}}\end{array}\right]$ are integer matrices of polynomial length relative to the input size of Problem 34. So, for a given $\bm{s}$ , the certificate $\bm{d}$ can be obtained by solving a linear program in polynomial time. Therefore, if there exists $\bm{d}\in\mathbb{R}^{m}$ such that $f_{\textsf{PLT}}(\bm{d})<0$ , then a nondeterministic algorithm can find $\bm{s}\in\{1,2,3\}^{n}$ and $\bm{d}^{\prime}\in\mathbb{Q}^{m}$ satisfying $f_{\textsf{PLT}}(\bm{d}^{\prime})\leqslant-1<0$ in polynomial time. Thus, Problem 34 with $\varepsilon=0$ is an element of the complexity class co-NP. ∎

Proof of 10.

We first note that the function in Problem 34 can be written in the standard max–min form in polynomial time by the following elementary identity:

[TABLE]

Besides, it holds that $f_{\textsf{PLT}}(\bm{d})=f_{\textsf{PLT}}(\bm{0})+f_{\textsf{PLT}}^{\prime}(\bm{0};\bm{d})=f_{\textsf{PLT}}^{\prime}(\bm{0};\bm{d})$ . By 3, we know $\textnormal{dist}\big{(}\bm{0},\widehat{\partial}f_{\textsf{PLT}}(\bm{0})\big{)}\leqslant\varepsilon$ if and only if there exists a vector $\bm{g}\in\mathbb{R}^{m}$ satisfying $0\leqslant\|\bm{g}\|\leqslant\varepsilon$ and $f_{\textsf{PLT}}(\bm{d})\geqslant\langle\bm{g},\bm{d}\rangle,\forall\bm{d}\in\mathbb{R}^{m}$ , which is the definition of Problem 34. Note that if $\varepsilon=0$ , in the reduction from 3SAT in the proof of Lemma 37, all numerical parameters are bounded by a polynomial of the input size. Lemma 37 then completes the proof.∎

B.3 Hardness of Abs-Normal Form Test

Proof of 11.

We first show that PLT in Problem 34 can be written in the abs-normal form in polynomial time. For ease of notation, let $q_{i}(\bm{d})\coloneqq-\sum_{j=1}^{3}\max\left\{\bm{d}^{\top}\bm{y}_{3(i-1)+j},0\right\}$ for any $i\in[n]$ . Then, we can rewrite every $q_{i}$ in the abs-linear form as

[TABLE]

Note that the function $f_{\textsf{PLT}}$ can be expressed as

[TABLE]

which can be written in abs-normal form as

[TABLE]

In sum, we have

[TABLE]

Then, we know

[TABLE]

Then, the matrices $\bm{L},\bm{Z},\bm{a},\bm{b}$ can be computed in polynomial time.

We note that $\inf_{\bm{d}}f_{\textsf{PLT}}(\bm{d})\geqslant 0$ if and only if the function $f_{\textsf{PLT}}$ is first-order minimal in abs-normal form; this is shown in the discussion below (Griewank and Walther, 2019, Equation (2)) (see also (Griewank and Walther, 2016, p3)). Then, the answer to ANFT in Problem 36 for the abs-normal form of $f_{\textsf{PLT}}$ is No if and only if $\bm{0}$ is a Fréchet stationary point of $f_{\textsf{PLT}}$ . Then, by Lemma 37, ANFT in Problem 36 is NP-hard. To see that ANFT is in NP, note that for any given $\bm{\sigma}\in\{-1,1\}^{s}$ , the computation of the vector $\bm{a}^{\top}+\bm{b}^{\top}\big{(}\mathop{\textnormal{Diag}}(\bm{\sigma})-\bm{L}\big{)}^{-1}\bm{Z}$ and the matrix $\big{(}\mathop{\textnormal{Diag}}(\bm{\sigma})-\bm{L}\big{)}^{-1}\bm{Z}$ can be done in polynomial time. Then, ANFT for a given $\bm{\sigma}$ reduces to checking the infeasibility of a linear system, which is in P. In sum, we have shown that ANFT in Problem 36 is NP-complete, which implies that a general test of FOM without kink qualification in (Griewank and Walther, 2019, Theorem 4.1) is co-NP-complete. ∎

B.4 Hardness of Neural Network Test

Lemma 39.

Problem 35 (NNT) is co-NP-hard. If $\varepsilon=0$ , Problem 35 is co-NP-complete.

Proof.

We first prove that $(-\mathbf{1}_{3n},\mathbf{0}_{m})$ is an $\varepsilon$ -Fréchet stationary point of $f_{\textsf{NNT}}$ if and only if there exists $\bm{g}^{w}\in\mathbb{B}_{\varepsilon}^{m}(\bm{0})$ such that $f_{\textsf{PLT}}(\bm{d})\geqslant\langle\bm{g}^{w},\bm{d}\rangle$ for all $\bm{d}\in\mathbb{R}^{m}$ , with the same input data $\{\bm{y}_{i}\}_{i=1}^{3n}\subseteq\mathbb{Z}^{m}$ . By (Rockafellar and Wets, 2009, Exercise 8.4) and the fact that $f_{\textsf{NNT}}$ is B-differentiable (see (Cui and Pang, 2021, Definition 4.1.1)), we get $\textnormal{dist}\big{(}\bm{0},\widehat{\partial}f_{\textsf{NNT}}(-\mathbf{1}_{3n},\bm{0}_{m})\big{)}\leqslant\varepsilon$ if and only if there exists $(\bm{g}^{u},\bm{g}^{w})\in\mathbb{B}_{\varepsilon}^{3n+m}(\bm{0}_{3n+m})$ such that

[TABLE]

Using the chain rule of directional derivative for B-differentiable function (Cui and Pang, 2021, Proposition 4.1.2(a)), we have

[TABLE]

For any $\bm{g}^{u},\bm{g}^{w}$ , consider $\bm{d}^{w}=\bm{0}$ and $\bm{d}^{u}=\bm{g}^{u}$ . We get that Section B.4 holds if and only if $\bm{g}^{u}=\bm{0}_{3n}$ and $\inf_{\bm{d}^{w}\in\mathbb{R}^{m}}f_{\textsf{PLT}}(\bm{d}^{w})\geqslant\langle\bm{g}^{w},\bm{d}^{w}\rangle$ , which completes the proof by the co-NP-hardness of 34 in 37 and co-NP-completeness if $\varepsilon=0$ in 38. ∎

Proof of 12.

Note that 35 can be represented by the empirical loss of a convolutional neural network with $N=1$ and architecture

[TABLE]

where $\ell_{1}(t)=t$ and $\bm{Y}=\left[\begin{array}[]{c|c|c}\bm{y}_{1}&\cdots&\bm{y}_{3n}\end{array}\right]\in\mathbb{Z}^{m\times 3n}$ . If $\varepsilon=0$ , in the reduction from 3SAT to PLT, and then to NNT, all numerical parameters are bounded by a polynomial in the input size. The proof is completed by 39. ∎

Appendix C Proofs for Section 4

C.1 Proof Roadmap

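Recall the loss function $L$ of a shallow ReLU neural network:

$$L(u_{1},\bm{w}_{1},\dots,u_{H},\bm{w}_{H})\coloneqq\sum_{i=1}^{N}\ell_{i}\left(\sum_{k=1}^{H}u_{k}\cdot\max\left\{\bm{w}_{k}^{\top}\bm{x}_{i},0\right\}\right).$$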

Set constants $\rho_{i}\coloneqq\ell_{i}^{\prime}\left(\sum_{k=1}^{H}u_{k}\cdot\max\left\{\bm{w}_{k}^{\top}\bm{x}_{i},0\right\}\right)$ for any $i\in[N]$ . Let us first consider a partially linearized loss function $\overline{L}$ defined by

[TABLE]

By exploiting the smoothness of $\{\ell_{i}\}_{i=1}^{N}$ and a Lagrange scalarization technique in 46, we will show that

[TABLE]

Then, we focus on the linearized $\overline{L}$ . By separation of $\{(u_{k},\bm{w}_{k})\}_{k}$ and using again the Lagrange scalarization technique in form of 48, we have

[TABLE]

where (a) is due to (Rockafellar, 1985, Proposition 2.5) and (Rockafellar and Wets, 2009, Proposition 10.5); (b) is by 48. Therefore, it holds

[TABLE]

which implies that the validity of the exact chain rule for $L$ relies on a careful study of $\overline{L}_{k}(u_{k},\cdot)$ . In particular, if we have the exact chain rule for any $k\in[H]$ as follows

[TABLE]

then we get the validity of exact chain rule for $L$ . That is $\partial_{\triangleleft}L(u_{1},\bm{w}_{1},\dots,u_{H},\bm{w}_{H})=$

[TABLE]

To prove Equation 1, we need a fine-grained analysis of $\overline{L}_{k}(u_{k},\cdot)$ . First, we isolate the nonsmooth part by rewriting

[TABLE]

where we define

[TABLE]

What remains is to study the subdifferential of this non-separable piecewise linear function $f_{k}$ for any $k\in[H]$ and to identify conditions under which

[TABLE]

This will be done in Section C.4.

C.2 Technical Lemmas

Lemma 40 (Gordan, cf. (Bertsimas and Tsitsiklis, 1997, Exercise 4.26)).

Let $\bm{A}\in\mathbb{R}^{n\times m}$ be given. Then, exactly one of the following statements is true:

•

There exists an $\bm{x}\in\mathbb{R}^{m}$ such that $\bm{Ax}<\bm{0}$ .

•

There exists a $\bm{y}\in\mathbb{R}^{n}$ such that $\bm{A}^{\top}\bm{y}=\bm{0}$ with $\bm{y}\geqslant\bm{0},\bm{y}\neq\bm{0}$ .
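To see why the two alternatives in 40 cannot hold simultaneously, the following one-line computation may help; it is a standard observation recorded here only for orientation and is not part of the cited exercise. Suppose $\bm{Ax}<\bm{0}$ and $\bm{A}^{\top}\bm{y}=\bm{0}$ with $\bm{y}\geqslant\bm{0},\bm{y}\neq\bm{0}$ . Then

```latex
\[
0=\langle\bm{A}^{\top}\bm{y},\bm{x}\rangle
 =\langle\bm{y},\bm{A}\bm{x}\rangle
 =\sum_{i=1}^{n}y_{i}\,(\bm{A}\bm{x})_{i}<0 ,
\]
```

where the strict inequality holds because every $(\bm{Ax})_{i}$ is negative and at least one $y_{i}$ is positive. This is a contradiction, so at most one alternative is true; the nontrivial part of the lemma is that at least one always holds.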

Lemma 41.

Let $A,B,C$ be sets in $\mathbb{R}^{n}$ . Suppose further that $A$ is convex and closed, and $C$ is nonempty and bounded. If the strict inclusion $A\subsetneq B$ holds, then we can assert $A+C\subsetneq B+C$ .

Proof.

Let $\bm{x}_{b}\in B\backslash A$ . The claim is trivial when $A=\emptyset$ . Choose $\bm{x}_{a}^{\prime}\in A$ and set $\delta\coloneqq\|\bm{x}_{b}-\bm{x}_{a}^{\prime}\|$ . As $A$ is closed, the following $\bm{x}_{a}$ is well-defined

[TABLE]

Let $\bm{d}\coloneqq\bm{x}_{b}-\bm{x}_{a}$ . As $\bm{x}_{b}\notin A$ and $A$ is closed, we know $\|\bm{d}\|>0$ . By the optimality condition and convexity of $A$ , we know $\langle\bm{a}-\bm{x}_{a},\bm{d}\rangle\leqslant 0,\forall\bm{a}\in A$ , which implies $\langle\bm{d},\bm{a}\rangle\leqslant\langle\bm{d},\bm{x}_{a}\rangle,\forall\bm{a}\in A$ . As $C$ is bounded, we know $\langle\bm{c},\bm{d}\rangle\leqslant\|\bm{c}\|\cdot\|\bm{d}\|<+\infty,\forall\bm{c}\in C$ . Let $\bm{x}_{c}$ be

[TABLE]

where $0<\varepsilon<\|\bm{d}\|^{2}$ . We claim $\bm{x}_{b}+\bm{x}_{c}\notin A+C$ . Suppose not. Therefore, there exist $\bm{y}_{a}\in A,\bm{y}_{c}\in C$ such that $\bm{y}_{a}+\bm{y}_{c}=\bm{x}_{b}+\bm{x}_{c}$ . However, we compute

[TABLE]

which gives the contradiction. ∎

Remark 42.

Though the claim seems straightforward, 41 is indeed non-trivial. We record the following counterexamples when different conditions are removed.

•

$C$ is empty: $A+C=B+C=\emptyset$ .

•

$C$ is unbounded: if $C=\mathbb{R}^{n}$ and $A,B$ are nonempty, then $A+C=B+C=\mathbb{R}^{n}$ .

•

$A$ is nonconvex: if $A=\mathbb{B}\backslash\mathbb{B}_{1/4},B=\mathbb{B},C=\mathbb{B}$ , then $A+C=B+C=\mathbb{B}_{2}$ .

•

$A$ is not closed: if $A=\mathbb{B}^{\circ},B=\mathbb{B},C=\mathbb{B}^{\circ}$ , then $A+C=B+C=\mathbb{B}^{\circ}_{2}$ .
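The role of the hypotheses in 41 can be checked numerically on the real line. The sketch below is only an illustration under simplifying assumptions: sets are finite unions of closed intervals, the helper `minkowski_sum` is a hypothetical name, and the nonconvex example is a closed one-dimensional analogue of the annulus counterexample rather than the exact set used in the remark.

```python
def minkowski_sum(P, Q):
    """Minkowski sum of two finite unions of closed intervals on the real line.

    Each set is a list of (lo, hi) pairs; the result is returned as a sorted
    list of disjoint closed intervals. An empty list encodes the empty set.
    """
    pieces = sorted((a + c, b + d) for (a, b) in P for (c, d) in Q)
    merged = []
    for lo, hi in pieces:
        if merged and lo <= merged[-1][1]:
            # Overlapping or touching closed intervals are merged.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


# Hypotheses of the lemma hold: A convex and closed, C nonempty and bounded.
A, B, C = [(0, 1)], [(0, 2)], [(0, 1)]
strict_case = (minkowski_sum(A, C), minkowski_sum(B, C))

# A is nonconvex (a 1-D "annulus"): strictness of the inclusion is lost.
A_nonconvex = [(-1, -0.25), (0.25, 1)]
ball = [(-1, 1)]
nonconvex_case = (minkowski_sum(A_nonconvex, ball), minkowski_sum(ball, ball))

# C is empty: both sums are empty.
empty_case = (minkowski_sum(A, []), minkowski_sum(B, []))

print(strict_case, nonconvex_case, empty_case)
```

In the first case the two sums differ, while in the nonconvex and empty cases they coincide, in line with the counterexamples recorded above.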

Lemma 43.

Let $\{\bm{p}_{i}\}_{i=1}^{n}$ be linearly independent. Define a convex set $C=\sum_{i=1}^{n}\bm{p}_{i}\cdot[0,1]$ . For any $\bm{s}\in\{0,1\}^{n}$ , the point $\bm{p}=\sum_{i=1}^{n}s_{i}\cdot\bm{p}_{i}$ is an extreme point of $C$ .

Proof.

Suppose not and $\bm{p}=\frac{1}{2}\bm{x}_{1}+\frac{1}{2}\bm{x}_{2}=\sum_{i=1}^{n}s_{i}\cdot\bm{p}_{i}$ with $\bm{p}\neq\bm{x}_{1}=\sum_{i=1}^{n}\alpha_{i}\cdot\bm{p}_{i}\in C$ and $\bm{p}\neq\bm{x}_{2}=\sum_{i=1}^{n}\beta_{i}\cdot\bm{p}_{i}\in C$ . We know $\alpha_{i}\in[0,1]$ and $\beta_{i}\in[0,1]$ for any $i\in[n]$ by definition. Thus, it holds

[TABLE]

As $\{\bm{p}_{i}\}_{i=1}^{n}$ are linearly independent, we know that, for any $i\in[n]$ , it holds $s_{i}=\left(\frac{\alpha_{i}+\beta_{i}}{2}\right)\in\{0,1\}$ . If $s_{i}=0$ , we have $\alpha_{i}=\beta_{i}=0$ . Meanwhile, we know $\alpha_{i}=\beta_{i}=1$ if $s_{i}=1$ . Therefore, it holds $\bm{x}_{1}=\bm{x}_{2}=\bm{p}$ , a contradiction. ∎

Lemma 44.

Let a function $g:\mathbb{R}^{d}\rightarrow\mathbb{R}$ be $\bm{w}\mapsto-\sum_{j=1}^{m}\max\{\bm{y}_{j}^{\top}\bm{w},0\}$ . If there exists $j\in[m]$ such that $\bm{w}^{\top}\bm{y}_{j}=0$ and $\bm{y}_{j}\neq\bm{0}$ , then we have $\widehat{\partial}g(\bm{w})=\emptyset$ .

Proof.

Suppose not and let $\bm{u}\in\widehat{\partial}g(\bm{w})$ . We write

[TABLE]

where we define $g_{0}(\bm{w})\coloneqq-\sum_{k:\bm{w}^{\top}\bm{y}_{k}=0}\max\{\bm{w}^{\top}\bm{y}_{k},0\}$ . Then, by (Rockafellar and Wets, 2009, Exercise 8.8(c)), we have

[TABLE]

Let $\bm{u}^{\prime}=\bm{u}+\sum_{j:\bm{w}^{\top}\bm{y}_{j}\neq 0}\mathbf{1}_{\bm{w}^{\top}\bm{y}_{j}>0}\cdot\bm{y}_{j}$ and we know $\bm{u}^{\prime}\in\widehat{\partial}g_{0}(\bm{w})$ . By (Rockafellar and Wets, 2009, Exercise 8.4), for any $\bm{d}\in\mathbb{R}^{d}$ , it holds

[TABLE]

Let $\bm{d}=\bm{u}^{\prime}$ and we know $\|\bm{u}^{\prime}\|^{2}\leqslant g_{0}^{\prime}(\bm{w};\bm{u}^{\prime})\leqslant 0$ . Thus, $\bm{u}^{\prime}=\bm{0}$ . Let $\bm{d}$ be any $\bm{y}_{j}$ such that $\bm{w}^{\top}\bm{y}_{j}=0$ and $\bm{y}_{j}\neq 0$ . Then, we have

[TABLE]

a contradiction. ∎
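For intuition, the simplest instance of 44 can be worked out by hand. The following sketch takes $d=m=1$ , $y_{1}=1$ , and $w=0$ , so that $g(w)=-\max\{w,0\}$ ; these choices are ours and serve only as an illustration. By the characterization in (Rockafellar and Wets, 2009, Exercise 8.4) used in the proof, $u\in\widehat{\partial}g(0)$ would require $u\cdot d\leqslant g^{\prime}(0;d)$ for all $d\in\mathbb{R}$ :

```latex
\[
g^{\prime}(0;d)=-\max\{d,0\},\qquad
d=1:\ u\leqslant -1,\qquad
d=-1:\ -u\leqslant 0 \iff u\geqslant 0 .
\]
```

No real number $u$ satisfies both inequalities, hence $\widehat{\partial}g(0)=\emptyset$ , which matches the conclusion of the lemma.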

Definition 45 (Bouligand subdifferential, cf. (Cui and Pang, 2021, Definition 4.3.1)).

Given a point $\bm{x}$ , the Bouligand subdifferential of a locally Lipschitz function $f$ at $\bm{x}$ is defined by

[TABLE]

C.3 Partial Linearization via Lagrange Scalarization

The following theorem is a powerful and general principle.

Theorem 46 (Partial linearization).

Let a point $\bm{x}\in\mathbb{R}^{d}$ and a locally Lipschitz $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ be given in the form of a composition $f(\bm{x})=h\circ G(\bm{x})$ , where the gradient of $h:\mathbb{R}^{n}\rightarrow\mathbb{R}$ is locally Lipschitz near $G(\bm{x})$ and $G:\mathbb{R}^{d}\rightarrow\mathbb{R}^{n}$ is locally Lipschitz near $\bm{x}$ . Suppose $h$ and $G$ are directionally differentiable. Then, we have

[TABLE]

Proof.

Let the partially linearized $f$ at $\bm{x}$ be $\bar{f}:\mathbb{R}^{d}\rightarrow\mathbb{R}$ defined as

[TABLE]

For the limiting subdifferential version, the claim directly follows from a margin function chain rule (Mordukhovich and Shao, 1996, Theorem 6.5). The Clarke subdifferential version directly follows from the relation between Clarke and limiting subdifferential (Rockafellar and Wets, 2009, Theorem 8.49) and (Mordukhovich and Shao, 1996, Theorem 6.5). However, as the proof of (Mordukhovich and Shao, 1996, Theorem 6.5) uses a perturbation argument to approximate $\partial f$ with $\varepsilon$ -Fréchet subdifferential, the machinery is somewhat complicated. Here we give an elementary proof for the Clarke version from the primal perspective using tools from convex analysis. We show $f^{\circ}(\bm{x};\bm{v})=\bar{f}^{\circ}(\bm{x};\bm{v})$ for any $\bm{v}\in\mathbb{R}^{d}$ . Note that the Clarke generalized subderivative can be written as

[TABLE]

where the difference quotient function $\Delta_{t}f(\bm{x}^{\prime}):\mathbb{R}^{d}\rightarrow\mathbb{R}$ of $f$ at $\bm{x}^{\prime}$ and direction $\bm{v}$ is defined by

[TABLE]

We assume $h$ is $L_{h}$ -smooth near $G(\bm{x})$ and $G$ is $L_{G}$ -Lipschitz near $\bm{x}$ . We will use the following estimation (see (Nesterov, 2003, Lemma 1.2.3)) if $h$ is $L_{h}$ -smooth at $\bm{z}\in\mathbb{R}^{n}$ :

[TABLE]

To prove $f^{\circ}(\bm{x};\bm{v})\geqslant\bar{f}^{\circ}(\bm{x};\bm{v})$ , we compute as follows

[TABLE]

Therefore, for any $\bm{v}\in\mathbb{R}^{d}$ , we know

[TABLE]

where in $(i)$ we use $\sup(f-g)\geqslant\sup f-\sup g$ . For the converse direction $f^{\circ}(\bm{x};\bm{v})\leqslant\bar{f}^{\circ}(\bm{x};\bm{v})$ , the computation is similar. We have proved $f^{\circ}(\bm{x};\bm{v})=\bar{f}^{\circ}(\bm{x};\bm{v}),\forall\bm{v}\in\mathbb{R}^{d}$ . The claim follows from the correspondence between sublinear $f^{\circ}$ and convex $\partial_{C}f$ (Clarke, 1990, Proposition 2.1.5).

Now we show that the relation holds for the Fréchet subdifferential. As $h$ and $G$ are locally Lipschitz and directionally differentiable, they are Bouligand-differentiable (B-differentiable) according to (Cui and Pang, 2021, Definition 4.1.1). Then, by (Cui and Pang, 2021, Proposition 4.1.2(a)), we know that

[TABLE]

where the directional derivative $G^{\prime}(\bm{x};\bm{d})$ is defined element-wise as $\left(G^{\prime}_{i}(\bm{x};\bm{d})\right)_{i=1}^{n}$ according to (Cui and Pang, 2021, Definition 1.1.4). Thus, combined with $\bar{f}^{\prime}(\bm{x};\bm{d})=\left\langle\nabla h\big{(}G(\bm{x})\big{)},G^{\prime}(\bm{x};\bm{d})\right\rangle$ , we have shown $\bar{f}^{\prime}(\bm{x};\bm{d})=f^{\prime}(\bm{x};\bm{d})$ for any $\bm{d}$ , which implies

[TABLE]

by (Rockafellar and Wets, 2009, Exercise 8.4) (note that for B-differentiable $f$ , the subderivative $df(\bm{x})(\bm{d})$ in (Rockafellar and Wets, 2009, Exercise 8.4) is equal to the directional derivative $f^{\prime}(\bm{x};\bm{d})$ by (Rockafellar and Wets, 2009, Exercise 9.15)). ∎

Remark 47.

46 is fundamentally different from the classic exact chain rule, as the exact chain rule does not hold even for very simple functions. Consider $h(a,b)=a-b$ and $G(x)=(|x|,|x|)$ . We have $\partial_{C}[h\circ G](0)=\{0\}\subsetneq[-1,1]+[-1,1]=[-2,2]$ . In contrast, by 46, we have $\partial_{C}[h\circ G](0)=\partial_{C}[|\cdot|-|\cdot|](0)=\{0\}$ . One should compare 46 with (Clarke, 1990, Theorem 2.3.9, Theorem 2.3.10). Besides, 46 implies (Clarke, 1990, Theorem 2.3.9(ii)).
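The example in the remark can be reproduced with a few lines of Python. This is only a sanity check under our own encoding: intervals are represented as `(lo, hi)` pairs, and the helper names `scale` and `add` are hypothetical.

```python
def h(a, b):
    return a - b


def G(x):
    return (abs(x), abs(x))


def f(x):
    # The composition h(G(x)) = |x| - |x|, which vanishes identically.
    return h(*G(x))


# Sample f on a grid around 0; every value should be exactly zero.
grid = [k / 100.0 for k in range(-300, 301)]
max_abs_value = max(abs(f(x)) for x in grid)


def scale(c, interval):
    """Multiply a closed interval by the scalar c."""
    lo, hi = c * interval[0], c * interval[1]
    return (min(lo, hi), max(lo, hi))


def add(p, q):
    """Minkowski sum of two closed intervals."""
    return (p[0] + q[0], p[1] + q[1])


# Naive chain-rule set: (dh/da) * d|.|(0) + (dh/db) * d|.|(0), with the
# Clarke subdifferential of |.| at 0 equal to [-1, 1].
abs_subdiff_at_0 = (-1.0, 1.0)
naive_set = add(scale(1.0, abs_subdiff_at_0), scale(-1.0, abs_subdiff_at_0))

print(max_abs_value, naive_set)
```

Since the composition is the zero function, its Clarke subdifferential at $0$ is $\{0\}$ , whereas the set produced by the naive chain rule is the strictly larger interval $[-2,2]$ .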

Corollary 48.

Let $f:\mathbb{R}\times\mathbb{R}^{d}\rightarrow\mathbb{R}$ be $f(u,\bm{x})=u\cdot g(\bm{x})$ , where $g:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is a Lipschitz function. Then, we have $\partial_{\triangleleft}f(u,\bm{x})=\{g(\bm{x})\}\times\partial_{\triangleleft}[u\cdot g](\bm{x})$ .

Proof.

Let $h:\mathbb{R}\times\mathbb{R}\rightarrow\mathbb{R}$ be $h(a,b)=a\cdot b$ . It is easy to see $h$ is smooth at any $(a,b)$ . Let $C(u,\bm{x})=(u,g(\bm{x})),\bar{u}=u$ and $\bar{\bm{x}}=\bm{x}$ . As $f(u,\bm{x})=h\circ C(u,\bm{x})$ , by 46, we know

[TABLE]

as required. ∎

C.4 Exact Chain Rule of a Non-Separable Piecewise Linear Function

In this section, we consider the validity of the exact subdifferential chain rule of a simple piecewise-linear function, which is defined by

[TABLE]

C.4.1 Chain Rule for Clarke Subdifferential

Theorem 49 (Clarke).

Suppose $\bm{x}_{i}^{\top}\bm{w}=\bm{y}_{j}^{\top}\bm{w}=0$ for any $i\in[n],j\in[m]$ . We have the exact Clarke subdifferential chain rule

[TABLE]

if and only if $\textnormal{span}\big{(}\left\{\bm{x}_{i}\right\}_{i=1}^{n}\big{)}\cap\textnormal{span}\big{(}\left\{\bm{y}_{j}\right\}_{j=1}^{m}\big{)}=\{\bm{0}\}.$

Proof.

We have divided the proof into 50 and 51. ∎
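The span condition in 49 is easy to test in practice. A minimal sketch, assuming rational (or exactly representable) data: since $\dim(U\cap V)=\dim U+\dim V-\dim(U+V)$ for subspaces $U,V$ , the intersection of the two spans is trivial exactly when the rank of the stacked vectors equals the sum of the individual ranks. The function names below are hypothetical, and exact arithmetic via `fractions` is used to avoid tolerance issues.

```python
from fractions import Fraction


def rank(vectors):
    """Exact rank of a list of equal-length vectors via Gaussian elimination."""
    rows = [[Fraction(entry) for entry in vec] for vec in vectors]
    ncols = len(rows[0]) if rows else 0
    r = 0  # number of pivots found so far
    for col in range(ncols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
    return r


def spans_intersect_trivially(xs, ys):
    """True iff span(xs) and span(ys) meet only at the origin."""
    return rank(xs + ys) == rank(xs) + rank(ys)


# Condition holds: span{e1, e2} and span{e3} in R^3.
ok = spans_intersect_trivially([(1, 0, 0), (0, 1, 0)], [(0, 0, 1)])
# Condition fails: (1, 1) lies in span{e1, e2} in R^2.
bad = spans_intersect_trivially([(1, 0), (0, 1)], [(1, 1)])
print(ok, bad)
```

For floating-point data one would instead compare numerical ranks with a tolerance, which makes the test sensitive to the chosen threshold.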

Lemma 50 (Necessary).

If there exists $\bm{v}\in\mathbb{R}^{d}$ such that

[TABLE]

then $\partial_{C}f_{\textsf{PL}}(\bm{w})\subsetneq G_{\textsf{PL}}^{C}.$

Proof.

We first prove that assuming certain regularity on $\{\bm{x}_{i}\}_{i=1}^{n}$ and $\{\bm{y}_{j}\}_{j=1}^{m}$ is without loss of generality. Let the indices set $\mathcal{J}_{x}\subseteq[n]$ be a selection from $\{\bm{x}_{i}\}_{i=1}^{n}$ such that $\left\{\bm{x}_{i}\right\}_{i\in\mathcal{J}_{x}}$ are linearly independent and satisfy

[TABLE]

Similarly, we define $\mathcal{J}_{y}\subseteq[m]$ for $\{\bm{y}_{j}\}_{j=1}^{m}$ . Then, we write

[TABLE]

where

[TABLE]

By the fuzzy sum rule (Clarke, 1990, Proposition 2.3.3), we know

[TABLE]

where we define $G_{\textsf{PL}2}^{C}\coloneqq\sum_{i\in\mathcal{J}_{x}}\bm{x}_{i}\cdot[0,1]+\sum_{j\in\mathcal{J}_{y}}(-\bm{y}_{j})\cdot[0,1]$ . Thus, to prove $\partial_{C}f_{\textsf{PL}}(\bm{w})\subsetneq G_{\textsf{PL}}^{C}$ , by 41 and (Clarke, 1990, Proposition 2.1.2(a)), we only need to show $\partial_{C}f_{\textsf{PL}2}(\bm{w})\subsetneq G_{\textsf{PL}2}^{C}$ . So, by abuse of notation and focusing on $f_{\textsf{PL}2}$ , we assume $\{\bm{x}_{i}\}_{i=1}^{n}$ are linearly independent. Similarly, we assume $\{\bm{y}_{j}\}_{j=1}^{m}$ are linearly independent. In the following, we may use 41 and the above argument implicitly to assume regularity for simplicity. As $\bm{v}\in\textnormal{span}\big{(}\left\{\bm{x}_{i}\right\}_{i=1}^{n}\big{)}\cap\textnormal{span}\big{(}\left\{\bm{y}_{j}\right\}_{j=1}^{m}\big{)}$ , we write

[TABLE]

It is safe to assume $a_{i}\neq 0,b_{j}\neq 0,\forall i\in[n],j\in[m]$ . Fix $\bm{y}_{m}$ . We can further assume $\{\bm{x}_{i}\}_{i=1}^{n}\cup\{\bm{y}_{j}\}_{j=1}^{m-1}$ are linearly independent. To see this, suppose to the contrary that $\{\bm{x}_{i}\}_{i=1}^{n}\cup\{\bm{y}_{j}\}_{j=1}^{m-1}$ are not linearly independent. Then, we get $\bm{0}=\sum_{i=1}^{n}p_{i}\cdot\bm{x}_{i}+\sum_{j=1}^{m-1}q_{j}\cdot\bm{y}_{j}$ with coefficients that are not all zero. We know that there exists $j^{\prime}\in[m-1]$ such that $q_{j^{\prime}}\neq 0$ , as otherwise, by linear independence of $\{\bm{x}_{i}\}_{i=1}^{n}$ , it holds that $p_{i}=0$ for any $i\in[n]$ , hence that $\{\bm{x}_{i}\}_{i=1}^{n}\cup\{\bm{y}_{j}\}_{j=1}^{m-1}$ are linearly independent. As $q_{j^{\prime}}\neq 0$ , we have

[TABLE]

Plugging this into Equation 2 removes $\bm{y}_{j^{\prime}}$ . Repeating this procedure, and by abuse of notation, we have that $\{\bm{x}_{i}\}_{i=1}^{n}\cup\{\bm{y}_{j}\}_{j=1}^{m-1}$ are linearly independent. After that, we examine $\{a_{i}\}_{i}$ and $\{b_{j}\}_{j}$ . We remove $\bm{x}_{i}$ if $a_{i}=0$ and remove $\bm{y}_{j}$ if $b_{j}=0$ , which is without loss of generality by 41. It is possible that all $\{\bm{y}_{j}\}_{j=1}^{m-1}$ are removed and we get $m=1$ and $\bm{y}_{m}\in\textnormal{span}\big{(}\{\bm{x}_{i}\}_{i=1}^{n}\big{)}$ . But as $\bm{y}_{m}\neq\bm{0}$ , we always have $n\geqslant 1$ . Then, we can write

[TABLE]

with $\alpha_{i}\neq 0,\beta_{j}\neq 0$ for any $i\in[n],j\in[m-1]$ . Note that, for such $\{\bm{x}_{i}\}_{i=1}^{n}$ and $\{\bm{y}_{j}\}_{j=1}^{m-1}$ , we have the exact chain rule

[TABLE]

by using (Clarke, 1990, Theorem 2.3.10) and linear independence. We proceed to show that $\partial_{C}f_{\textsf{PL}}(\bm{w})\subsetneq G_{\textsf{PL}}^{C}$ by exhibiting an element in $G_{\textsf{PL}}^{C}\backslash\partial_{C}f_{\textsf{PL}}(\bm{w})$ . Let $\bm{\theta}\in\mathbb{R}_{+}^{n+m}$ and we define

[TABLE]

Note that

[TABLE]

By Gordan’s Theorem in 40, we have certified the nonexistence of direction $\bm{d}\in\mathbb{R}^{d}$ such that

[TABLE]

By $-\bm{A}^{\top}\bm{\theta}=\bm{0}$ , similarly, we certify the nonexistence of direction $\bm{d}\in\mathbb{R}^{d}$ such that

[TABLE]

Let the Bouligand subdifferential of $f_{\textsf{PL}}$ at $\bm{w}$ be $\partial_{B}f_{\textsf{PL}}(\bm{w})$ ; see (Cui and Pang, 2021, Definition 4.3.1). Define

[TABLE]

By (Cui and Pang, 2021, Proposition 4.4.8(c)) and the nonexistences of $\bm{d}$ for (4) and (5), we have proved that $\bm{\nabla}_{1},\bm{\nabla}_{2}\notin\partial_{B}f_{\textsf{PL}}(\bm{w})$ . Let us define a set

[TABLE]

Besides, using (Cui and Pang, 2021, Proposition 4.4.8(c)), we have $\partial_{B}f_{\textsf{PL}}(\bm{w})\subseteq\overline{G}_{\textsf{PL}}^{C}\backslash\{\bm{\nabla}_{1},\bm{\nabla}_{2}\}$ . Then, with (Rockafellar and Wets, 2009, Theorem 9.61), it follows that

[TABLE]

Therefore, to prove $\partial_{C}f_{\textsf{PL}}(\bm{w})\subsetneq G_{\textsf{PL}}^{C}$ , we only need to show

[TABLE]

To this end, we define two sets satisfying $\overline{G}_{\textsf{PL}}^{C}=P_{1}\cup P_{2}$ as

[TABLE]

Thus, we can write $\overline{G}_{\textsf{PL}}^{C}\backslash\{\bm{\nabla}_{1},\bm{\nabla}_{2}\}\subseteq\left(P_{1}\backslash\{\bm{\nabla}_{1}\}\right)\cup\left(P_{2}\backslash\{\bm{\nabla}_{2}\}\right)$ . It is evident that $\bm{\nabla}_{1}\in G_{\textsf{PL}}^{C}$ . If $\bm{\nabla}_{1}\in\textnormal{Conv}\left(\overline{G}_{\textsf{PL}}^{C}\backslash\{\bm{\nabla}_{1},\bm{\nabla}_{2}\}\right)$ , we have

[TABLE]

We now show that it must be $\lambda=1$ by considering three cases:

Case 1.

$\exists i\in[n]:\alpha_{i}>0$ . Without loss of generality, we assume $\alpha_{1}>0$ . Note that for any $\bm{g}^{P_{2}}\in\textnormal{Conv}\left(P_{2}\backslash\{\bm{\nabla}_{2}\}\right)$ , using the representation of $\bm{y}_{m}$ in Equation 3, we have

[TABLE]

where $\gamma_{k}\in[0,1],\forall k\in[n+m-1]$ . Similarly, we write $\bm{g}^{P_{1}}\in\textnormal{Conv}\left(P_{1}\backslash\{\bm{\nabla}_{1}\}\right)$ as

[TABLE]

where $\mu_{k}\in[0,1],\forall k\in[n+m-1]$ . Therefore, we know

[TABLE]

As $\{\bm{x}_{i}\}_{i=1}^{n}\cup\{\bm{y}_{j}\}_{j=1}^{m-1}$ are linearly independent, it holds

[TABLE]

If $0\leqslant\lambda<1$ , we have

[TABLE]

which gives the contradiction.

Case 2.

$\forall i\in[n]:\alpha_{i}<0$ but $\exists j\in[m-1]:\beta_{j}>0$ . Suppose $\beta_{1}>0$ . Then, we write

[TABLE]

Note that $\{\bm{x}_{i}\}_{i=1}^{n}\cup\{\bm{y}_{j}\}_{j=1}^{m}$ are linearly independent. By abuse of notation and swapping $\bm{y}_{m}$ and $\bm{y}_{1}$ , we still write $\bm{y}_{m}=\sum_{i=1}^{n}\alpha_{i}\bm{x}_{i}+\sum_{j=1}^{m-1}\beta_{j}\bm{y}_{j}$ . Then, we have $\forall i\in[n]:\alpha_{i}>0$ and the situation reduces to the Case 1.

Case 3.

$\forall i\in[n],j\in[m-1]:\alpha_{i}<0,\beta_{j}<0$ . In that case, we have $\bm{\nabla}_{1}=\bm{0}$ . By a manipulation similar to that in Case 1, we have

[TABLE]

As $\{\bm{x}_{i}\}_{i=1}^{n}\cup\{\bm{y}_{j}\}_{j=1}^{m-1}$ are linearly independent, it holds

[TABLE]

If $0\leqslant\lambda<1$ , we have

[TABLE]

which gives the contradiction.

Therefore, we have shown $\lambda=1$ which implies $\bm{\nabla}_{1}\in\textnormal{Conv}\left(P_{1}\backslash\{\bm{\nabla}_{1}\}\right).$ However, as $\{\bm{x}_{i}\}_{i=1}^{n}\cup\{\bm{y}_{j}\}_{j=1}^{m-1}$ are linearly independent, $\bm{\nabla}_{1}$ is an extreme point of $\textnormal{Conv}(P_{1})$ by 43. Thus, we know $\bm{\nabla}_{1}\notin\textnormal{Conv}\left(P_{1}\backslash\{\bm{\nabla}_{1}\}\right)$ by definition, a contradiction. ∎

Lemma 51 (Sufficient).

If the following condition holds

[TABLE]

then $\partial_{C}f_{\textsf{PL}}(\bm{w})=G_{\textsf{PL}}^{C}.$

Proof.

We first do a general preparation that will be reused in other developments. Let $\bm{X}=\left[\begin{array}[]{c|c|c}\bm{x}_{1}&\cdots&\bm{x}_{n}\end{array}\right]\in\mathbb{R}^{d\times n}$ and $\bm{Y}=\left[\begin{array}[]{c|c|c}\bm{y}_{1}&\cdots&\bm{y}_{m}\end{array}\right]\in\mathbb{R}^{d\times m}$ be given. The thin-SVD of $\bm{X}$ can be written as $\bm{X}=\bm{U}_{x}\bm{\Sigma}_{x}\bm{V}_{x}^{\top}$ with $\bm{U}_{x}\in\mathop{\textnormal{St}}(d,r_{x}),\bm{\Sigma}_{x}\in\mathbb{R}^{r_{x}\times r_{x}},\bm{V}_{x}\in\mathop{\textnormal{St}}(n,r_{x})$ , and $r_{x}=\mathop{\textnormal{rank}}(\bm{X})$ . Similarly, for $\bm{Y}$ , we have $\bm{Y}=\bm{U}_{y}\bm{\Sigma}_{y}\bm{V}_{y}^{\top}$ with $\bm{U}_{y}\in\mathop{\textnormal{St}}(d,r_{y}),\bm{\Sigma}_{y}\in\mathbb{R}^{r_{y}\times r_{y}},\bm{V}_{y}\in\mathop{\textnormal{St}}(m,r_{y})$ , and $r_{y}=\mathop{\textnormal{rank}}(\bm{Y})$ . As $\textnormal{span}\big{(}\left\{\bm{x}_{i}\right\}_{i=1}^{n}\big{)}\cap\textnormal{span}\big{(}\left\{\bm{y}_{j}\right\}_{j=1}^{m}\big{)}=\{\bm{0}\}$ , we know $\bm{U}_{x}^{\top}\bm{U}_{y}=\bm{0}$ . Therefore, we can write

[TABLE]

where $\bm{z}=\bm{U}^{\top}\bm{w}$ and $\bm{U}\coloneqq\left[\begin{array}[]{c|c}\bm{U}_{x}&\bm{U}_{y}\end{array}\right]\in\mathop{\textnormal{St}}(d,r_{x}+r_{y}).$

Let an auxiliary function $h_{\textsf{PL}}:\mathbb{R}^{r_{x}}\times\mathbb{R}^{r_{y}}\rightarrow\mathbb{R}$ be

[TABLE]

As $h_{\textsf{PL}}$ is separable with respect to $\bm{z}_{1}$ and $\bm{z}_{2}$ , by (Rockafellar, 1985, Proposition 2.5) and (Rockafellar and Wets, 2009, Proposition 10.5), we know

[TABLE]

Note that $f_{\textsf{PL}}(\bm{w})=h_{\textsf{PL}}(\bm{U}_{x}^{\top}\bm{w},\bm{U}_{y}^{\top}\bm{w})$ . We compute

[TABLE]

where (a) uses (Clarke, 1990, Theorem 2.3.10), (Rockafellar and Wets, 2009, Exercise 10.7), and the fact that $\bm{U}$ has full column rank; (b) is from $\partial_{\triangleleft}h_{\textsf{PL}}(\bm{z}_{1},\bm{z}_{2})=\partial_{\triangleleft}h_{\textsf{PL}1}(\bm{z}_{1})\times\partial_{\triangleleft}[-h_{\textsf{PL}2}](\bm{z}_{2})$ ; (c) uses the reasoning in (a) for $h_{\textsf{PL}1},h_{\textsf{PL}2}$ separately.

In particular for Clarke subdifferential, we know $\partial_{C}[-h_{\textsf{PL}2}]=-\partial_{C}[h_{\textsf{PL}2}]$ using (Clarke, 1990, Proposition 2.3.1). As $h_{\textsf{PL}2}$ is convex, $\partial_{C}[h_{\textsf{PL}2}]$ is equal to the convex subdifferential of $h_{\textsf{PL}2}$ by (Clarke, 1990, Proposition 2.2.7). Then, by (Hiriart-Urruty and Lemaréchal, 2004, §D, Corollary 4.3.2), a direct computation gives

[TABLE]

as required. ∎

Proof of 14.

According to the argument in Section C.1, we only need to consider the Clarke subdifferential $\partial_{C}f_{k}(\bm{w}_{k})$ for every $k\in[H]$ . It is shown in 49 that we have

[TABLE]

if and only if the following span qualification is satisfied:

[TABLE]

Then, put all $k\in[H]$ cases together, and 14 is proved. ∎

C.4.2 Chain Rule for Limiting Subdifferential

Theorem 52 (Limiting).

Suppose $\bm{x}_{i}^{\top}\bm{w}=\bm{y}_{j}^{\top}\bm{w}=0$ and $\bm{y}_{j}\neq\bm{0}$ for any $i\in[n],j\in[m]$ . We have the exact limiting subdifferential chain rule

[TABLE]

if and only if $\textnormal{span}\big{(}\left\{\bm{x}_{i}\right\}_{i=1}^{n}\big{)}\cap\textnormal{span}\big{(}\left\{\bm{y}_{j}\right\}_{j=1}^{m}\big{)}=\{\bm{0}\}.$

Proof.

(Sufficient) We begin with the general argument in the proof of 51 until Equation $\diamondsuit$ . After that, we will focus on the proof of

[TABLE]

For the ease of notation, we denote $q(\bm{w})\coloneqq-h_{\textsf{PL}2}\left(\bm{U}_{y}^{\top}\bm{w}\right)=-\sum_{j=1}^{m}\max\{\bm{y}_{j}^{\top}\bm{w},0\}$ . Note that by the definition of limiting subdifferential (see 4), we have

[TABLE]

Let $\bm{g}\in\partial q(\bm{w})$ . Then, there exist $\{\bm{w}_{\nu}\}_{\nu}$ and $\{\bm{g}_{\nu}\}_{\nu}$ such that $\bm{w}_{\nu}\rightarrow\bm{w},\bm{g}_{\nu}\in\widehat{\partial}q(\bm{w}_{\nu}),$ and $\bm{g}_{\nu}\rightarrow\bm{g}$ . We can assume that, for any $\nu$ and any $j\in[m]$ , we have $\bm{w}_{\nu}^{\top}\bm{y}_{j}\neq 0$ , as otherwise, by 44, $\widehat{\partial}q(\bm{w}_{\nu})=\emptyset$ and $\bm{g}_{\nu}$ is undefined. Then, for any $\nu$ , the function $q$ is strictly differentiable at $\bm{w}_{\nu}$ , which implies

[TABLE]

As $G_{\textsf{PL}2}^{L}$ is a finite set, it is trivially closed with the usual Euclidean metric. We have $\partial q(\bm{w})\subseteq G_{\textsf{PL}2}^{L}$ . For the reverse direction, let $\bm{g}^{\prime}\in G_{\textsf{PL}2}^{L}$ . Then, there exists $\bm{d}$ such that

[TABLE]

with $\bm{d}^{\top}\bm{y}_{j}\neq 0$ for any $j\in[m]$ . Let $\bm{w}_{\nu}=\bm{w}+\bm{d}/\nu$ . We get $\bm{w}_{\nu}^{\top}\bm{y}_{j}=\nu^{-1}\bm{d}^{\top}\bm{y}_{j}\neq 0$ for any $j\in[m]$ . Then, we know the function $q$ is strictly differentiable at $\bm{w}_{\nu}$ and $\{\bm{g}_{\nu}\}=\widehat{\partial}q(\bm{w}_{\nu})$ . Thus, for any $\nu$ , we get $\bm{g}_{\nu}=\bm{g}^{\prime}$ . Consequently, passing to the limit $\nu\rightarrow\infty$ , we get $\bm{g}^{\prime}\in\partial q(\bm{w})$ and $G_{\textsf{PL}2}^{L}\subseteq\partial q(\bm{w})$ .

(Necessary) Suppose $\bm{0}\neq\bm{v}\in\textnormal{span}\big{(}\left\{\bm{x}_{i}\right\}_{i=1}^{n}\big{)}\cap\textnormal{span}\big{(}\left\{\bm{y}_{j}\right\}_{j=1}^{m}\big{)}$ and

[TABLE]

Then, by taking the convex hull on both sides and using (Rockafellar and Wets, 2009, Theorem 8.49), we get

[TABLE]

which is a contradiction to 50. ∎

Proof of 16.

According to the argument in Section C.1, we only need to consider the limiting subdifferential $\partial f_{k}(\bm{w}_{k})$ for every $k\in[H]$ . It is shown in 52 that we have

[TABLE]

if and only if the following span qualification is satisfied:

[TABLE]

Then, put all $k\in[H]$ cases together, and 16 is proved. ∎

C.4.3 Chain Rule for Fréchet Subdifferential

Theorem 53 (Fréchet).

Suppose $\bm{x}_{i}^{\top}\bm{w}=\bm{y}_{j}^{\top}\bm{w}=0$ and $\bm{y}_{j}\neq\bm{0}$ for any $i\in[n],j\in[m]$ . For any given $\bm{w}$ such that $\widehat{\partial}f_{\textsf{PL}}(\bm{w})\neq\emptyset$ , we have the following exact chain rule

[TABLE]

if and only if $\textnormal{span}\big{(}\left\{\bm{x}_{i}\right\}_{i=1}^{n}\big{)}\cap\textnormal{span}\big{(}\left\{\bm{y}_{j}\right\}_{j=1}^{m}\big{)}=\{\bm{0}\}.$

Proof.

(Sufficient) We begin with the general argument in the proof of 51 until Equation $\diamondsuit$ . We will focus on the proof of

[TABLE]

For the ease of notation, we denote $q(\bm{w})\coloneqq-h_{\textsf{PL}2}\left(\bm{U}_{y}^{\top}\bm{w}\right)=-\sum_{j=1}^{m}\max\{\bm{y}_{j}^{\top}\bm{w},0\}$ . Then, by 44, we know that if there exists $j\in[m]$ such that $\bm{w}^{\top}\bm{y}_{j}=0$ and $\bm{y}_{j}\neq\bm{0}$ , then we have $\widehat{\partial}q(\bm{w})=\emptyset$ . If $m=0$ , then $q(\bm{w})=0$ and $G_{\textsf{PL}2}^{F}=\{\bm{0}\}$ . The claim follows trivially.

(Necessary) Suppose $\bm{0}\neq\bm{v}\in\textnormal{span}\big{(}\left\{\bm{x}_{i}\right\}_{i=1}^{n}\big{)}\cap\textnormal{span}\big{(}\left\{\bm{y}_{j}\right\}_{j=1}^{m}\big{)}$ . There exists $\bm{y}_{j}\neq\bm{0}$ as otherwise $\bm{v}\notin\{\bm{0}\}\supseteq\textnormal{span}\big{(}\left\{\bm{y}_{j}\right\}_{j=1}^{m}\big{)}$ . Then, we get $m>0$ and $G_{\textsf{PL}}^{F}=\emptyset$ . Thus, from the assumption that $\widehat{\partial}f_{\textsf{PL}}(\bm{w})\neq\emptyset$ , we know $\widehat{\partial}f_{\textsf{PL}}(\bm{w})\supsetneq G_{\textsf{PL}}^{F}=\emptyset$ by definition. ∎

Proof of 17.

According to the argument in Section C.1, we only need to consider the Fréchet subdifferential $\widehat{\partial}f_{k}(\bm{w}_{k})$ for every $k\in[H]$ . It is shown in 53 that we have

[TABLE]

if and only if the following span qualification is satisfied:

[TABLE]

Then, put all $k\in[H]$ cases together, and 17 is proved. ∎

C.5 Proofs for Section 4.3

Definition 54 (Regularities).

We consider the following regularity conditions:

•

General position data (Yun et al., 2018, Assumption 2): No $d$ data points $\{\tilde{\bm{x}}_{i}\}_{i}\subseteq\mathbb{R}^{d-1}$ lie on the same affine hyperplane, which is equivalent to the nonexistence of $\bm{w}\in\mathbb{R}^{d}$ and index set $\mathcal{J}\subseteq[N]$ with $|\mathcal{J}|\geqslant d$ such that $\bm{w}^{\top}\bm{x}_{j}=0$ for any $j\in\mathcal{J}$ .

•

Linear Independence Kink Qualification (LIKQ) (Griewank and Walther, 2016, Definition 2), (Griewank and Walther, 2019, Definition 2.6): Let the $j$ -th row of the matrix $\nabla\bm{z}^{\sigma}$ in Appendix A be $\bm{v}_{j}^{\top}$ . We define the following index set

[TABLE]

LIKQ is satisfied if the vectors $\{\bm{v}_{i}\}_{i\in\alpha}$ are linearly independent.

•

Linearly Independent Activated Data (LIAD): Let the index set $\mathcal{J}_{k}\coloneqq\{j:\bm{w}_{k}^{\top}\bm{x}_{j}=0\}$ . For any fixed $k\in[H]$ , the data points $\{\bm{x}_{i}\}_{i\in\mathcal{J}_{k}}$ are linearly independent.

Proof of 19.

For the relation general position $\Longrightarrow$ LIAD, it directly follows from (Yun et al., 2018, Lemma 1). By the analysis in Section A.2, we know LIKQ is satisfied for the empirical loss of two-layer ReLU network if and only if

[TABLE]

are linearly independent. It is easy to see that LIKQ holds if and only if, for any given $k\in[H]$ , the data points $\{\bm{x}_{i}\}_{i\in\{j:\bm{w}_{k}^{\top}\bm{x}_{j}=0\}}$ are linearly independent. Thus, we have the relation LIAD $\iff$ LIKQ. Note that $\mathcal{I}_{k}^{+}\cup\mathcal{I}_{k}^{-}=\{j:\bm{w}_{k}^{\top}\bm{x}_{j}=0\}$ . If $\{\bm{x}_{j}\}_{j\in\mathcal{I}_{k}^{+}\cup\mathcal{I}_{k}^{-}}$ are linearly independent, then it is evident that

[TABLE]

which implies LIAD $\Longrightarrow$ SQ. ∎
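The step LIAD $\Longrightarrow$ SQ in the proof above can be spelled out as follows. This is a short sketch under the assumption that the index sets $\mathcal{I}_{k}^{+}$ and $\mathcal{I}_{k}^{-}$ are disjoint; the coefficients $a_{i},b_{j}$ are notation introduced here for illustration. Take any $\bm{v}$ in the intersection of the two spans and write it in both ways:

```latex
\[
\bm{v}=\sum_{i\in\mathcal{I}_{k}^{+}}a_{i}\,\bm{x}_{i}
      =\sum_{j\in\mathcal{I}_{k}^{-}}b_{j}\,\bm{x}_{j}
\quad\Longrightarrow\quad
\sum_{i\in\mathcal{I}_{k}^{+}}a_{i}\,\bm{x}_{i}
-\sum_{j\in\mathcal{I}_{k}^{-}}b_{j}\,\bm{x}_{j}=\bm{0} .
\]
```

If $\{\bm{x}_{j}\}_{j\in\mathcal{I}_{k}^{+}\cup\mathcal{I}_{k}^{-}}$ are linearly independent, all coefficients $a_{i}$ and $b_{j}$ must vanish, so $\bm{v}=\bm{0}$ and the two spans intersect only at the origin.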

Proof of 23.

Under SQ, if the Fréchet subdifferential is nonempty, we get $\mathcal{I}_{k}^{-}=\emptyset$ for any $k\in[H]$ . By 14 and 17, we have $\partial_{C}L$ and $\widehat{\partial}L$ are equal at that point. Then, Clarke regularity follows from 7. By 19, if the data points are in general position, then they satisfy SQ. Using (Rockafellar and Wets, 2009, Theorem 10.1), the Fréchet subdifferential is nonempty at every local minimizer, which completes the proof. ∎

Appendix D Proofs for Section 5

D.1 Testing Clarke NAS

Proof of 28.

We consider an $\varepsilon$ -Clarke stationary point $(u_{1}^{*},\bm{w}_{1}^{*},\dots,u_{H}^{*},\bm{w}_{H}^{*})$ with

[TABLE]

By 14, we know there exists $\bm{g}^{*}\in\partial_{C}L(u_{1}^{*},\bm{w}_{1}^{*},\dots,u_{H}^{*},\bm{w}_{H}^{*})$ such that

[TABLE]

Note that $\widehat{u}_{i}=u_{i}$ for any $i\in[H]$ in the returned vector $(\widehat{u}_{1},\widehat{\bm{w}}_{1},\dots,\widehat{u}_{H},\widehat{\bm{w}}_{H})$ of Algorithm 3. In this subsection, we will write $u_{i}$ rather than $\widehat{u}_{i}$ for simplicity. Given a positive radius $\delta\in(0,C_{\tau}^{\textnormal{Clarke}}]$ , we aim to show that, for any

[TABLE]

we can certify that the rounded point returned by Algorithm 3 satisfies

[TABLE]

where $C_{\mu}^{\textnormal{Clarke}}<+\infty$ is a constant depending on the curvature that we will discuss later.

We define the following shorthands for convenience

[TABLE]

Recall the definition of the rounded $\{\widehat{\bm{w}}_{k}\}_{k}$ and we define indices sets $\mathcal{J}_{k}^{<},\mathcal{J}_{k}^{=},\mathcal{J}_{k}^{>}$ as

[TABLE]

We consider the following quantity related to the point $\bm{w}_{k}^{*}$ for any $k\in[H]$ :

[TABLE]

Note that $0<\delta\leqslant C_{\tau}^{\textnormal{Clarke}}\leqslant\frac{\tau_{k}}{4R}$ . For any $i\in[N]$ such that $\bm{x}_{i}^{\top}\bm{w}_{k}^{*}>0$ , we have

[TABLE]

Thus, we know $\left\{i:\bm{x}_{i}^{\top}\bm{w}_{k}^{*}>0\right\}\subseteq\mathcal{J}_{k}^{>}$ . Similarly, for any $i\in[N]$ such that $\bm{x}_{i}^{\top}\bm{w}_{k}^{*}<0$ , we have

[TABLE]

which implies $\left\{i:\bm{x}_{i}^{\top}\bm{w}_{k}^{*}<0\right\}\subseteq\mathcal{J}_{k}^{<}$ . We have, for any $i\in[N]$ such that $\bm{x}_{i}^{\top}\bm{w}_{k}^{*}=0$ , it holds

[TABLE]

So, we know $\left\{i:\bm{x}_{i}^{\top}\bm{w}_{k}^{*}=0\right\}\subseteq\mathcal{J}_{k}^{=}$ . As $\mathcal{J}_{k}^{<},\mathcal{J}_{k}^{=},\mathcal{J}_{k}^{>}$ are disjoint and $[N]=\mathcal{J}_{k}^{<}\sqcup\mathcal{J}_{k}^{=}\sqcup\mathcal{J}_{k}^{>}$ , we know

[TABLE]

Meanwhile, as $\widehat{\bm{w}}_{k}$ is feasible to the quadratic program in Algorithm 3, we get

[TABLE]

which implies $\mathcal{I}_{k}^{+}(\bm{w}_{k}^{*})\cup\mathcal{I}_{k}^{-}(\bm{w}_{k}^{*})=\mathcal{J}_{k}^{=}=\mathcal{I}_{k}^{+}(\widehat{\bm{w}}_{k})\cup\mathcal{I}_{k}^{-}(\widehat{\bm{w}}_{k})$ and $\mathbf{1}_{\bm{x}_{i}^{\top}\bm{w}_{k}^{*}>0}=\mathbf{1}_{i\in\mathcal{J}_{k}^{>}}=\mathbf{1}_{\bm{x}_{i}^{\top}\widehat{\bm{w}}_{k}>0}$ for any $k\in[H]$ . It is evident that

[TABLE]

as, for any $k\in[H]$ , $\bm{w}_{k}^{*}$ is feasible for the quadratic program that computes $\widehat{\bm{w}}_{k}$ in Algorithm 3. Therefore, we know

[TABLE]
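The feasibility argument used here is the usual nonexpansiveness of a least-distance rounding step. As a minimal sketch, assume the quadratic program minimizes $\|\bm{z}-\bm{w}_{k}\|^{2}$ subject only to the equality constraints $\bm{x}_{i}^{\top}\bm{z}=0$ for $i\in\mathcal{J}_{k}^{=}$; this simplification, the function name `project_onto_kernel`, and the toy numbers are our assumptions, and the program in Algorithm 3 may carry further constraints. In this simplified setting the minimizer is an orthogonal projection, and any feasible point, such as $\bm{w}_{k}^{*}$, is at least as far from $\bm{w}_{k}$ as the minimizer is.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def sub(a, b):
    return [p - q for p, q in zip(a, b)]

def scale(c, a):
    return [c * p for p in a]

def norm(a):
    return math.sqrt(dot(a, a))

def project_onto_kernel(w, rows, tol=1e-12):
    """Euclidean projection of w onto {z : x^T z = 0 for every x in rows}."""
    basis = []                      # orthonormal basis of span(rows)
    for x in rows:
        v = list(x)
        for b in basis:             # Gram-Schmidt step
            v = sub(v, scale(dot(v, b), b))
        n = norm(v)
        if n > tol:                 # skip linearly dependent rows
            basis.append(scale(1.0 / n, v))
    z = list(w)
    for b in basis:                 # remove the components along span(rows)
        z = sub(z, scale(dot(z, b), b))
    return z

# Hypothetical toy instance: one active constraint x^T z = 0.
rows = [(1.0, 0.0, 0.0)]
w_star = [0.0, 1.0, 0.0]            # feasible reference point
w = [0.05, 0.98, 0.03]              # point to be rounded

w_hat = project_onto_kernel(w, rows)
d_hat = norm(sub(w_hat, w))         # distance moved by the rounding
d_star = norm(sub(w_star, w))       # distance from w to the feasible w*
print(w_hat, d_hat, d_star)
```

Combined with the triangle inequality, $\|\widehat{\bm{w}}_{k}-\bm{w}_{k}\|\leqslant\|\bm{w}_{k}^{*}-\bm{w}_{k}\|\leqslant\delta$ gives $\|\widehat{\bm{w}}_{k}-\bm{w}_{k}^{*}\|\leqslant 2\delta$ in this sketch.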

By the triangle inequality, it holds that

[TABLE]

Using Theorem 14, we get

[TABLE]

where we define $\overline{L}_{k}(\widehat{\bm{w}}_{k})=\sum_{i=1}^{N}u_{k}\widehat{\rho}_{i}\cdot\max\{\bm{x}_{i}^{\top}\widehat{\bm{w}}_{k},0\}$ . We first compute

[TABLE]

We now upper bound the second term $\sum_{k=1}^{H}\left|g_{k}^{*}-\sum_{i=1}^{N}\widehat{\rho}_{i}\cdot\max\left\{\widehat{\bm{w}}_{k}^{\top}\bm{x}_{i},0\right\}\right|$ . Note that

[TABLE]

Now we estimate these two terms. For $T_{1}^{k}$ , we compute

[TABLE]

For $T_{2}^{k}$ , we see that

[TABLE]

Summarizing, we have

[TABLE]

We proceed to upper bound $\sum_{k=1}^{H}\textnormal{dist}\Big{(}\bm{g}^{*}_{k},\partial_{C}\overline{L}_{k}(\widehat{\bm{w}}_{k})\Big{)}$ .

By Theorem 14, we know that there exist $\xi_{j}\in[0,1],\forall j\in\mathcal{I}_{k}^{+}(\bm{w}_{k}^{*})\cup\mathcal{I}_{k}^{-}(\bm{w}_{k}^{*})$ such that the Clarke subgradient $\bm{g}_{k}^{*}\in\partial_{C}\overline{L}_{k}(\bm{w}_{k}^{*})$ can be written as

[TABLE]

Now, we are well prepared to upper bound $\textnormal{dist}\big{(}\bm{g}_{k}^{*},\partial_{C}\overline{L}_{k}(\widehat{\bm{w}}_{k})\big{)}$ . Let

[TABLE]

which, by Theorem 14, belongs to the Clarke subdifferential $\partial_{C}\overline{L}_{k}(\widehat{\bm{w}}_{k})$ . We upper bound

[TABLE]

with

[TABLE]

Then, we have

[TABLE]
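To illustrate why a matched activation pattern makes this bound routine, the sketch below builds an element of the form $\sum_{i:\bm{x}_{i}^{\top}\bm{w}>0}u\rho_{i}\bm{x}_{i}+\sum_{j:\bm{x}_{j}^{\top}\bm{w}=0}\xi_{j}u\rho_{j}\bm{x}_{j}$ with $\xi_{j}\in[0,1]$ at two points that share the same pattern and the same $\xi_{j}$. This form is an assumption made for illustration and should be checked against the statement of Theorem 14; the function name and the toy numbers are hypothetical. With identical patterns, the two elements differ only through $\rho_{i}^{*}$ versus $\widehat{\rho}_{i}$, so the triangle inequality gives $\|\widehat{\bm{g}}_{k}-\bm{g}_{k}^{*}\|\leqslant|u_{k}|\sum_{i}|\widehat{\rho}_{i}-\rho_{i}^{*}|\cdot\|\bm{x}_{i}\|$.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def subgradient_element(u, rho, xs, w, xi):
    """Assumed form: active terms get weight u*rho_i, kink terms xi_j*u*rho_j."""
    g = [0.0] * len(w)
    for i, x in enumerate(xs):
        v = dot(x, w)
        if v > 0:
            coef = u * rho[i]                    # neuron strictly active
        elif v == 0:
            coef = xi.get(i, 0.0) * u * rho[i]   # neuron on the kink
        else:
            coef = 0.0                           # neuron inactive
        g = [gj + coef * xj for gj, xj in zip(g, x)]
    return g

# Hypothetical toy instance; both points have pattern (kink, active, active).
xs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
u = 0.5
w_star, rho_star = (0.0, 2.0), [1.0, -2.0, 0.5]
w_hat, rho_hat = (0.0, 1.9), [1.1, -1.9, 0.4]
xi = {0: 0.3}                                    # shared multiplier on the kink

g_star = subgradient_element(u, rho_star, xs, w_star, xi)
g_hat = subgradient_element(u, rho_hat, xs, w_hat, xi)

gap = math.sqrt(sum((a - b) ** 2 for a, b in zip(g_hat, g_star)))
bound = abs(u) * sum(abs(rh - rs) * math.sqrt(dot(x, x))
                     for rh, rs, x in zip(rho_hat, rho_star, xs))
print(gap, bound)
```

If the patterns at the two points differed, an extra term of size $|u_{k}\rho_{i}|\cdot\|\bm{x}_{i}\|$ that does not shrink with $\delta$ could appear, which is why the identification step precedes this estimate.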

In sum, we have proved that

[TABLE]

where $C_{\mu}^{\textnormal{Clarke}}\coloneqq C_{4}+C_{5}=\textnormal{poly}(B,R,L_{\ell},L_{\ell^{\prime}},N,H)$ . ∎

D.2 Testing Fréchet NAS

Proof of Theorem 32.

Some steps of the computation are similar to those in the proof of Theorem 28 in Section D.1, and we skip them for brevity. We consider an $\varepsilon$ -Fréchet stationary point $(u_{1}^{*},\bm{w}_{1}^{*},\dots,u_{H}^{*},\bm{w}_{H}^{*})$ with $\left\|(u_{1}^{*},\bm{w}_{1}^{*},\dots,u_{H}^{*},\bm{w}_{H}^{*})\right\|\leqslant B$ . By Theorem 17, there exists a regular subgradient $\bm{g}^{*}\in\widehat{\partial}L(u_{1}^{*},\bm{w}_{1}^{*},\dots,u_{H}^{*},\bm{w}_{H}^{*})$ such that

[TABLE]

Given a positive radius $\delta\in(0,C_{\tau}^{\textnormal{Fr\'{e}chet}}]$ , we aim to show that, for any

[TABLE]

we can certify that the rounded point returned by Algorithm 5 satisfies

[TABLE]

where $C_{\mu}^{\textnormal{Fr\'{e}chet}}<+\infty$ is a constant depending on the curvature that we will discuss later.

Similar to Section D.1, we define the following shorthands for convenience:

[TABLE]

We consider the following quantity related to the point $\bm{w}_{k}^{*}$ for any $k\in[H]$ :

[TABLE]

We use the same index sets $\mathcal{J}_{k}^{<},\mathcal{J}_{k}^{=},\mathcal{J}_{k}^{>}$ for computing the rounded $\{\widehat{\bm{w}}_{k}\}_{k}$ as those in Section D.1. Note that $0<\delta\leqslant C_{\tau}^{\textnormal{Fr\'{e}chet}}\leqslant\frac{\tau_{k}}{4R}$ . The argument in Section D.1 shows that

[TABLE]

Meanwhile, as $\widehat{\bm{w}}_{k}$ is feasible for the quadratic program in Algorithm 5, we get

[TABLE]

which implies $\mathcal{I}_{k}^{+}(\bm{w}_{k}^{*})\cup\mathcal{I}_{k}^{-}(\bm{w}_{k}^{*})=\mathcal{J}_{k}^{=}=\mathcal{I}_{k}^{+}(\widehat{\bm{w}}_{k})\cup\mathcal{I}_{k}^{-}(\widehat{\bm{w}}_{k})$ and $\mathbf{1}_{\bm{x}_{i}^{\top}\bm{w}_{k}^{*}>0}=\mathbf{1}_{i\in\mathcal{J}_{k}^{>}}=\mathbf{1}_{\bm{x}_{i}^{\top}\widehat{\bm{w}}_{k}>0}$ for any $k\in[H]$ . It is evident that

[TABLE]

as, for any $k\in[H]$ , $\bm{w}_{k}^{*}$ is feasible for the quadratic program that computes $\widehat{\bm{w}}_{k}$ in Algorithm 5.
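The Fréchet case needs one extra ingredient because, unlike the Clarke subdifferential, the Fréchet subdifferential of a locally Lipschitz function can be empty. A one-dimensional illustration, which is ours and not taken from the paper, is $f(t)=-\max\{t,0\}$ at $t=0$ ; it mimics a neuron with a negative coefficient sitting exactly on its kink:

```latex
\begin{align*}
v\in\widehat{\partial}f(0)
  &\iff \liminf_{t\to 0,\ t\neq 0}\frac{f(t)-f(0)-vt}{|t|}\geqslant 0,\\
t\downarrow 0:&\quad \frac{-t-vt}{t}=-1-v\geqslant 0
  \ \Longrightarrow\ v\leqslant -1,\\
t\uparrow 0:&\quad \frac{-vt}{|t|}=v\geqslant 0 .
\end{align*}
```

The two requirements are incompatible, so $\widehat{\partial}f(0)=\emptyset$ , whereas $\partial_{C}f(0)=[-1,0]$ .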

However, identifying $\bm{w}_{k}^{*}$ alone is not sufficient to bound $\textnormal{dist}\Big{(}\bm{g}^{*}_{k},\widehat{\partial}\overline{L}_{k}(\widehat{\bm{w}}_{k})\Big{)}$ , because the index set $\mathcal{I}_{k}^{-}(\widehat{\bm{w}}_{k})$ may be nonempty, which, by Theorem 17, implies $G_{k}^{F}=\emptyset$ and $\textnormal{dist}\Big{(}\bm{g}^{*}_{k},\widehat{\partial}\overline{L}_{k}(\widehat{\bm{w}}_{k})\Big{)}=+\infty$ . We therefore also need to identify $\{u_{k}^{*}\}_{k=1}^{H}$ . We define a constant $C_{u}\coloneqq L_{\ell^{\prime}}(4HRB^{2}+1)$ and consider the following quantity related to the point $u_{k}^{*}$ for any $k\in[H]$ :

[TABLE]

Note that $0<\delta\leqslant C_{\tau}^{\textnormal{Fr\'{e}chet}}\leqslant\frac{\tau_{k}^{\prime}}{4C_{u}}$ . Fix any $k\in[H]$ . We consider two cases. If $u_{k}^{*}\neq 0$ , by Theorem 17 and Assumption 30, we know $u_{k}^{*}\cdot\rho_{i}^{*}>0$ for any $i\in\mathcal{J}_{k}^{=}$ . Then, for any $i\in\mathcal{J}_{k}^{=}$ , we have

[TABLE]

which, by the rounding step for $\widehat{u}_{k}$ in Algorithm 5, implies that if $u_{k}^{*}\neq 0$ , then $\widehat{u}_{k}={u}_{k}$ and

[TABLE]

If $u_{k}^{*}=0$ , we can see that

[TABLE]

which implies $\widehat{u}_{k}=0$ by the rounding step for $\widehat{u}_{k}$ in Algorithm 5. Thus, we have proved that $\widehat{u}_{k}\cdot\widehat{\rho}_{i}\geqslant 0$ for any $k\in[H]$ and $i\in\mathcal{J}_{k}^{=}$ , hence that $\mathcal{I}_{k}^{-}(\widehat{\bm{w}}_{k})=\mathcal{I}_{k}^{-}(\bm{w}_{k}^{*})=\emptyset$ , and finally that $\mathcal{I}_{k}^{+}(\widehat{\bm{w}}_{k})=\mathcal{I}_{k}^{+}(\bm{w}_{k}^{*})$ . By Theorem 17, we conclude that $\widehat{\partial}L(\widehat{u}_{1},\widehat{\bm{w}}_{1},\dots,\widehat{u}_{H},\widehat{\bm{w}}_{H})\neq\emptyset$ .

Summarizing, we have

[TABLE]

By the triangle inequality, this shows that

[TABLE]

Using Theorem 17, we get

[TABLE]

where we define $\overline{L}_{k}(\widehat{\bm{w}}_{k})=\sum_{i=1}^{N}\widehat{u}_{k}\widehat{\rho}_{i}\cdot\max\{\bm{x}_{i}^{\top}\widehat{\bm{w}}_{k},0\}$ . We first compute

[TABLE]

We now upper bound the second term $\sum_{k=1}^{H}\left|g_{k}^{*}-\sum_{i=1}^{N}\widehat{\rho}_{i}\cdot\max\left\{\widehat{\bm{w}}_{k}^{\top}\bm{x}_{i},0\right\}\right|$ . A computation similar to that in Section D.1 shows that

[TABLE]

where $C_{2}\coloneqq BR\cdot C_{1}$ and $C_{3}\coloneqq 2L_{\ell^{\prime}}NR$ .

We proceed to upper bound $\sum_{k=1}^{H}\textnormal{dist}\Big{(}\bm{g}^{*}_{k},\widehat{\partial}\overline{L}_{k}(\widehat{\bm{w}}_{k})\Big{)}$ . By Theorem 17, we know that there exist $\xi_{j}\in[0,1],\forall j\in\mathcal{I}_{k}^{+}(\bm{w}_{k}^{*})$ such that $\bm{g}_{k}^{*}\in\widehat{\partial}\overline{L}_{k}(\bm{w}_{k}^{*})$ can be written as

[TABLE]

Now, we are well prepared to upper bound $\textnormal{dist}\big{(}\bm{g}_{k}^{*},\widehat{\partial}\overline{L}_{k}(\widehat{\bm{w}}_{k})\big{)}$ . Let

[TABLE]

which, by Theorem 17, belongs to the Fréchet subdifferential $\widehat{\partial}\overline{L}_{k}(\widehat{\bm{w}}_{k})$ . We proceed to upper bound $\textnormal{dist}\big{(}\bm{g}_{k}^{*},\widehat{\partial}\overline{L}_{k}(\widehat{\bm{w}}_{k})\big{)}\leqslant\|\widehat{\bm{g}}_{k}-\bm{g}_{k}^{*}\|$ with

[TABLE]

Then, we have

[TABLE]

In sum, we have proved that

[TABLE]

where $C_{\mu}^{\textnormal{Fr\'{e}chet}}\coloneqq C_{4}+C_{5}=\textnormal{poly}(B,R,L_{\ell},L_{\ell^{\prime}},N,H)$ . ∎

Bibliography


A. A. Ahmadi and J. Zhang. On the complexity of finding a local minimizer of a quadratic function over a polytope. Mathematical Programming, 195(1-2):783–792, 2022.
S. Arora, S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322–332. PMLR, 2019.
D. Bertsimas and J. N. Tsitsiklis. Introduction to Linear Optimization, volume 6. Athena Scientific, Belmont, MA, 1997.
S. Bubeck, R. Eldan, Y. T. Lee, and D. Mikulincer. Network size and size of the weights in memorization with two-layers neural networks. In Advances in Neural Information Processing Systems, volume 33, pages 4977–4986, 2020.
J. V. Burke, A. S. Lewis, and M. L. Overton. Approximating subdifferentials by random sampling of gradients. Mathematics of Operations Research, 27(3):567–584, 2002.
F. H. Clarke. Optimization and Nonsmooth Analysis. SIAM, 1990.
Y. Cui and J.-S. Pang. Modern Nonconvex Nondifferentiable Optimization. SIAM, 2021.
D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019.


