Composite Optimization Algorithms for Sigmoid Networks

Huixiong Chen; Qi Ye

arXiv:2303.00589·math.OC·July 10, 2023


TL;DR

This paper introduces composite optimization algorithms tailored for sigmoid networks, transforming the training process into a convex composite optimization problem, with proven convergence guarantees and practical effectiveness demonstrated through numerical experiments.

Contribution

It develops novel composite optimization algorithms based on linearized proximal methods and ADMM for sigmoid networks, ensuring convergence even in non-convex, non-smooth cases.

Findings

01

Algorithms converge to global optima under certain conditions.

02

Numerical results show robust performance on function fitting and digit recognition.

03

Provides guidelines for network size based on training data.


Tables

Table 1: The performance of regression on Franke’s function (using quadratic loss).

	LPA		GLPA
	RMS-error	Max-error	RMS-error	Max-error
No noise	2.9525e-3	1.4736e-2	2.7790e-3	1.1547e-2
Gaussian noise	3.4364e-3	1.2678e-2	3.7613e-3	1.5765e-2

Table 2: The performance of regression on Franke’s function (using absolute loss).

	GLPA
	RMS-error	Max-error
No noise	2.2093e-4	8.4516e-4
Gaussian noise	8.4138e-4	4.3988e-3

Table 3: The performance of classification on handwritten digits (using hinge loss).

Classified digits	GLPA		SGDM (RMSProp, Adam)
	Training errors	Test errors	Training errors	Test errors
0 - 1	0 / 252	0 / 108	0 / 252	0 / 108
2 - 5	0 / 251	0 / 108	0 / 251	0 / 108
3 - 7	0 / 253	0 / 109	0 / 253	0 / 109
6 - 9	0 / 252	1 / 109	0 / 252	1 / 109

Equations

$$\sigma(a)=\frac{1}{1+e^{-a}},$$

$$f(\boldsymbol{x})=\sum_{i=1}^{q}w_{i}\,\sigma(\boldsymbol{v}_{i}\cdot\boldsymbol{x}+u_{i})+w_{0},$$

$$\boldsymbol{\theta}=(\boldsymbol{w},\boldsymbol{v},\boldsymbol{u},w_{0})^{T}\in\mathbb{R}^{n},$$

$$D=\{(\boldsymbol{x}_{1},y_{1}),(\boldsymbol{x}_{2},y_{2}),\dots,(\boldsymbol{x}_{m},y_{m})\mid\boldsymbol{x}_{i}\in\mathbb{R}^{d},\ y_{i}\in\mathbb{R}\},$$

$$\min_{\boldsymbol{\theta}\in\mathbb{R}^{n}}E(\boldsymbol{\theta}):=\frac{1}{m}\sum_{i=1}^{m}L(f(\boldsymbol{x}_{i};\boldsymbol{\theta}),y_{i}),\tag{2.1}$$

$$\min_{\boldsymbol{\theta}\in\mathbb{R}^{n}}E(\boldsymbol{\theta})=\mathbb{L}(F(\boldsymbol{\theta})),\tag{2.2}$$

$$F(\boldsymbol{\theta})=\begin{pmatrix}f(\boldsymbol{x}_{1};\boldsymbol{\theta})-y_{1}\\ f(\boldsymbol{x}_{2};\boldsymbol{\theta})-y_{2}\\ \vdots\\ f(\boldsymbol{x}_{m};\boldsymbol{\theta})-y_{m}\end{pmatrix},\qquad\mathbb{L}(\boldsymbol{z})=\frac{1}{m}\|\boldsymbol{z}\|_{p}^{p},\quad\text{where }p=1\text{ or }2.$$

$$F(\boldsymbol{\theta})=\begin{pmatrix}y_{1}f(\boldsymbol{x}_{1};\boldsymbol{\theta})\\ y_{2}f(\boldsymbol{x}_{2};\boldsymbol{\theta})\\ \vdots\\ y_{m}f(\boldsymbol{x}_{m};\boldsymbol{\theta})\end{pmatrix},\qquad\mathbb{L}(\boldsymbol{z})=\frac{1}{m}\sum_{i=1}^{m}(1-z_{i})_{+}=\frac{1}{m}\|(\boldsymbol{1}-\boldsymbol{z})_{+}\|_{1},$$

$$\mathbb{L}(\boldsymbol{z})=\frac{1}{m}\sum_{i=1}^{m}\mathtt{L}(z_{i}),$$

$$\Delta\boldsymbol{\theta}_{k}:=\operatorname*{arg\,min}_{\Delta\boldsymbol{\theta}\in\mathbb{R}^{n}}\left\{\mathbb{L}\big(F(\boldsymbol{\theta}_{k})+F^{\prime}(\boldsymbol{\theta}_{k})\Delta\boldsymbol{\theta}\big)+\frac{1}{2t}\|\Delta\boldsymbol{\theta}\|^{2}\right\},\tag{3.1}$$

$$\min_{\boldsymbol{\theta}\in\mathbb{R}^{n}}E(\boldsymbol{\theta})=\frac{1}{m}\|F(\boldsymbol{\theta})\|^{2}.$$

$$\min_{\Delta\boldsymbol{\theta}\in\mathbb{R}^{n}}\frac{1}{m}\|F(\boldsymbol{\theta}_{k})+F^{\prime}(\boldsymbol{\theta}_{k})\Delta\boldsymbol{\theta}\|^{2}+\frac{1}{2t}\|\Delta\boldsymbol{\theta}\|^{2},$$

$$\frac{2}{m}F^{\prime}(\boldsymbol{\theta}_{k})^{T}\big(F(\boldsymbol{\theta}_{k})+F^{\prime}(\boldsymbol{\theta}_{k})\Delta\boldsymbol{\theta}_{k}\big)+\frac{1}{t}\Delta\boldsymbol{\theta}_{k}=\boldsymbol{0}.$$

$$\Delta\boldsymbol{\theta}_{k}=-\left(\frac{2}{m}F^{\prime}(\boldsymbol{\theta}_{k})^{T}F^{\prime}(\boldsymbol{\theta}_{k})+\frac{1}{t}I\right)^{-1}\left(\frac{2}{m}F^{\prime}(\boldsymbol{\theta}_{k})^{T}F(\boldsymbol{\theta}_{k})\right)\triangleq-B_{k}^{-1}\nabla E(\boldsymbol{\theta}_{k}),$$

$$\Delta\boldsymbol{\theta}_{k}=A(\mathbb{L},F,F^{\prime},t,\boldsymbol{\theta}_{k}).$$

$$\min_{\boldsymbol{\mu}\in\mathbb{R}^{m},\,\Delta\boldsymbol{\theta}\in\mathbb{R}^{n}}\ \mathbb{L}(\boldsymbol{\mu})+\frac{1}{2t}\|\Delta\boldsymbol{\theta}\|^{2}\quad\text{s.t.}\quad\boldsymbol{\mu}-F(\boldsymbol{\theta}_{k})-F^{\prime}(\boldsymbol{\theta}_{k})\Delta\boldsymbol{\theta}=\boldsymbol{0}.$$

$$\mathcal{L}_{\rho}(\boldsymbol{\mu},\Delta\boldsymbol{\theta},\boldsymbol{\lambda})=\mathbb{L}(\boldsymbol{\mu})+\frac{1}{2t}\|\Delta\boldsymbol{\theta}\|^{2}+\boldsymbol{\lambda}^{T}\big(\boldsymbol{\mu}-F(\boldsymbol{\theta}_{k})-F^{\prime}(\boldsymbol{\theta}_{k})\Delta\boldsymbol{\theta}\big)+\frac{\rho}{2}\|\boldsymbol{\mu}-F(\boldsymbol{\theta}_{k})-F^{\prime}(\boldsymbol{\theta}_{k})\Delta\boldsymbol{\theta}\|^{2},$$

$$\begin{aligned}\boldsymbol{\mu}^{i+1}&=\operatorname*{arg\,min}_{\boldsymbol{\mu}\in\mathbb{R}^{m}}\mathcal{L}_{\rho}(\boldsymbol{\mu},\Delta\boldsymbol{\theta}^{i},\boldsymbol{\lambda}^{i}),\\ \Delta\boldsymbol{\theta}^{i+1}&=\operatorname*{arg\,min}_{\Delta\boldsymbol{\theta}\in\mathbb{R}^{n}}\mathcal{L}_{\rho}(\boldsymbol{\mu}^{i+1},\Delta\boldsymbol{\theta},\boldsymbol{\lambda}^{i}),\\ \boldsymbol{\lambda}^{i+1}&=\boldsymbol{\lambda}^{i}+\rho\big(\boldsymbol{\mu}^{i+1}-F(\boldsymbol{\theta}_{k})-F^{\prime}(\boldsymbol{\theta}_{k})\Delta\boldsymbol{\theta}^{i+1}\big).\end{aligned}\tag{3.2}$$

$$\begin{aligned}\boldsymbol{\mu}^{i+1}&=\operatorname*{arg\,min}_{\boldsymbol{\mu}\in\mathbb{R}^{m}}\left\{\mathbb{L}(\boldsymbol{\mu})+\frac{\rho}{2}\|\boldsymbol{\mu}-\boldsymbol{a}^{i}\|^{2}\right\}\\ &=\operatorname*{arg\,min}_{\boldsymbol{\mu}\in\mathbb{R}^{m}}\left\{\sum_{j=1}^{m}\left(\frac{1}{m}\mathtt{L}(\mu_{j})+\frac{\rho}{2}(\mu_{j}-a_{j}^{i})^{2}\right)\right\}\\ &=\left(\operatorname*{arg\,min}_{\mu_{j}\in\mathbb{R}}\left\{\frac{1}{m}\mathtt{L}(\mu_{j})+\frac{\rho}{2}(\mu_{j}-a_{j}^{i})^{2}\right\}\right)_{j=1}^{m}\\ &=\left(\Phi_{1/(m\rho)}(a_{j}^{i})\right)_{j=1}^{m},\end{aligned}$$

$$\Phi_{\kappa}(a)=\begin{cases}a-\kappa,&a>\kappa,\\ 0,&|a|\leq\kappa,\\ a+\kappa,&a<-\kappa.\end{cases}$$

$$\Phi_{\kappa}(a)=\begin{cases}a,&a>1,\\ 1,&1-\kappa\leq a\leq 1,\\ a+\kappa,&a<1-\kappa.\end{cases}$$

$$\Delta\boldsymbol{\theta}^{i+1}=\operatorname*{arg\,min}_{\Delta\boldsymbol{\theta}\in\mathbb{R}^{n}}\left\{\frac{1}{2t}\|\Delta\boldsymbol{\theta}\|^{2}+\frac{\rho}{2}\left\|\boldsymbol{\mu}^{i+1}-F(\boldsymbol{\theta}_{k})-F^{\prime}(\boldsymbol{\theta}_{k})\Delta\boldsymbol{\theta}+\frac{1}{\rho}\boldsymbol{\lambda}^{i}\right\|^{2}\right\},$$

$$\Delta\boldsymbol{\theta}^{i+1}=\left(\rho F^{\prime}(\boldsymbol{\theta}_{k})^{T}F^{\prime}(\boldsymbol{\theta}_{k})+\frac{1}{t}I\right)^{-1}\left(\rho F^{\prime}(\boldsymbol{\theta}_{k})^{T}\left(\boldsymbol{\mu}^{i+1}-F(\boldsymbol{\theta}_{k})+\frac{1}{\rho}\boldsymbol{\lambda}^{i}\right)\right).\tag{3.4}$$

$$\boldsymbol{r}^{i}=\boldsymbol{\mu}^{i}-F(\boldsymbol{\theta}_{k})-F^{\prime}(\boldsymbol{\theta}_{k})\Delta\boldsymbol{\theta}^{i},\tag{3.5}$$

$$\boldsymbol{s}^{i}=\rho F^{\prime}(\boldsymbol{\theta}_{k})(\Delta\boldsymbol{\theta}^{i}-\Delta\boldsymbol{\theta}^{i-1}),\tag{3.6}$$

$$\mathbb{L}(F(\boldsymbol{\theta}_{k}+\eta\Delta\boldsymbol{\theta}_{k}))-\mathbb{L}(F(\boldsymbol{\theta}_{k}))\leq c\,\eta\left(\mathbb{L}\big(F(\boldsymbol{\theta}_{k})+F^{\prime}(\boldsymbol{\theta}_{k})\Delta\boldsymbol{\theta}_{k}\big)+\frac{1}{2t}\|\Delta\boldsymbol{\theta}_{k}\|^{2}-\mathbb{L}(F(\boldsymbol{\theta}_{k}))\right);\tag{3.7}$$


Taxonomy

Topics: Sparse and Compressive Sensing Techniques · Face and Expression Recognition · Machine Learning and ELM

Full text

Composite Optimization Algorithms for Sigmoid Networks

Huixiong Chen

School of Mathematical Sciences, South China Normal University, Guangzhou 510631, China

Qi Ye

School of Mathematical Sciences, South China Normal University, Guangzhou 510631, China

Abstract

In this paper, we use composite optimization algorithms to solve sigmoid networks. We equivalently transfer the sigmoid networks to a convex composite optimization and propose the composite optimization algorithms based on the linearized proximal algorithms and the alternating direction method of multipliers. Under the assumptions of the weak sharp minima and the regularity condition, the algorithm is guaranteed to converge to a globally optimal solution of the objective function even in the case of non-convex and non-smooth problems. Furthermore, the convergence results can be directly related to the amount of training data and provide a general guide for setting the size of sigmoid networks. Numerical experiments on Franke’s function fitting and handwritten digit recognition show that the proposed algorithms perform satisfactorily and robustly.

Keywords: sigmoid network, composite optimization, non-convex non-smooth algorithm, global convergence, adaptive network size

1 Introduction

The neural network is an important and popular branch of machine learning. People have already developed many useful and well-studied neural network models, such as artificial neural networks, convolutional neural networks, recurrent neural networks, and deep neural networks. Neural networks have been widely used in pattern recognition, image processing, computer vision, neuroinformatics, bioinformatics, and other various fields with great success (LeCun et al. 2015; Abiodun et al. 2018).

In practice, neural networks are commonly trained by the error backpropagation (BP) algorithm, the most prominent and successful neural network learning algorithm to date. The BP algorithm is based on gradient descent, which updates the parameters along the negative gradient direction of the objective. To accelerate learning, stochastic gradient descent (SGD) with momentum and adaptive methods such as adaptive gradient (AdaGrad), root mean square propagation (RMSProp), and adaptive moment estimation (Adam) have emerged one after another and made a huge impact. Most of these first-order methods are known to converge to a critical point only if the objective function is convex or smooth. For non-convex and non-smooth functions, it remains unclear how to guarantee convergence even to first- or second-order critical points (Burke et al. 2005). Typical cases are sigmoid networks with absolute or hinge loss functions. The BP algorithm can still be applied to these non-convex and non-smooth problems, but they fall outside the setting covered by its convergence theory. Moreover, finding globally optimal solutions remains non-trivial for traditional neural network algorithms. Take the state-of-the-art Adam as an example: its theory is poorly understood in the literature, and it suffers from several deficiencies. For instance, Adam may miss globally optimal solutions (Wilson et al. 2017), and it fails to converge on some simple test problems (Reddi et al. 2018).

In this paper, we use composite optimization algorithms to solve sigmoid networks; see Algorithms 1 and 2 for details. The algorithms are guaranteed to converge, even globally, to a globally optimal solution of the objective function even for non-convex and non-smooth problems. That is the main contribution of this paper. Our work starts from the observation that sigmoid networks (2.1) can be equivalently transformed into a convex composite optimization problem (2.2), where the inner function is smooth and the outer function is convex. This provides a new perspective on sigmoid networks. In fact, composite optimization problems arise in many engineering applications, such as compressed sensing, image processing, machine learning, and artificial intelligence (Boyd et al. 2011; Hong et al. 2017). Composite optimization is at the cutting edge of mathematical optimization, and how to solve composite optimization problems efficiently has been a popular subject. For sigmoid networks with the structure (2.2), traditional first-order methods do not take advantage of the convexity of the outer function, which can limit them in practical applications. Composite optimization methods, by contrast, can fully exploit this structure in algorithm design. There are many iterative algorithms with theoretical foundations for the optimization problem (2.2), such as the famous Gauss-Newton method (GNM, Burke and Ferris 1995), the proximal descent algorithm (ProxDescent, Lewis and Wright 2016), and the linearized proximal algorithms (LPA, Hu et al. 2016). The basic idea of these algorithms is to transform a complex optimization problem into a sequence of simple optimization problems whose optimal solutions are easy to compute or have explicit formulas. The LPA is one of the most advanced algorithms in convex composite optimization. It transforms a non-convex and possibly non-smooth problem into a series of unconstrained strongly convex optimization subproblems, which gives it an attractive computational advantage. The LPA has also been applied with great success to sensor network localization, gene regulatory network inference, and other engineering problems (Hu et al. 2016, 2020; Wang et al. 2017). Therefore, we use the LPA to solve sigmoid networks in this paper.

Under the assumptions of weak sharp minima and the regularity condition, we establish the convergence behavior of the algorithms for sigmoid networks; see Theorems 3 and 5 for details. Furthermore, we prove that the weak sharp minima property is often satisfied for sigmoid networks, and that the full row rank of the Jacobian matrix of the inner function, namely $\text{rank}(F^{\prime}(\bar{\boldsymbol{\theta}}))=m$, where $m$ is the amount of training data, is a sufficient condition for the regularity condition. Hence the convergence results can be directly related to the amount of training data; see Corollaries 4 and 6 for details. This conclusion is of theoretical and practical significance, especially since it provides a general guide for setting the size of sigmoid networks. From the full row rank of the Jacobian matrix, we obtain a lower bound on the network size in Corollary 8, which we call the “adaptive network size”. Our numerical experiments verify that the adaptive network size is sufficient to construct a sigmoid network that solves the problem effectively, so Corollary 8 does provide a good guide for setting the size of sigmoid networks. The essence is to guarantee that the number of parameters in the network is not smaller than the amount of training data, since a sufficient number of parameters ensures the feasibility of the network. This can also serve as a general guide for setting the size of neural networks. That is another contribution of this paper.
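The parameter-count argument can be illustrated with a small calculation. The sketch below assumes the parameter layout introduced in Section 2, so that a network with $q$ hidden units on $d$-dimensional inputs has $n=q(d+2)+1$ parameters. The function names are hypothetical, and the bound $n\geq m$ used here is only the necessary condition for an $m\times n$ Jacobian to have full row rank; it is not a restatement of Corollary 8.

```python
def num_parameters(q: int, d: int) -> int:
    """Count parameters of theta = (w, v, u, w0) for q hidden units, d inputs."""
    # q output weights + q*d input weights + q input offsets + 1 output offset
    return q * (d + 2) + 1


def min_hidden_units(m: int, d: int) -> int:
    """Smallest q with num_parameters(q, d) >= m (necessary for rank(F') = m)."""
    if m < 1 or d < 1:
        raise ValueError("m and d must be positive integers")
    # ceil((m - 1) / (d + 2)) in integer arithmetic, keeping at least one unit
    return max(1, -(-(m - 1) // (d + 2)))


if __name__ == "__main__":
    m, d = 100, 2  # illustrative sizes, not taken from the experiments
    q = min_hidden_units(m, d)
    print(q, num_parameters(q, d))
```

Whether this many units also suffice for the rank condition depends on the data and on the parameter values, which is why the count is only a starting point.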

Our work is also motivated by the lack of convex composite optimization algorithms and related software packages for neural networks. To the best of our knowledge, the introduction of convex composite optimization into the area of neural networks has not been addressed in the literature before. This paper is the first piece of work combining neural networks and convex composite optimization.

We organize the paper as follows. In section 2, we introduce the three-layer sigmoid networks and transfer the problem to a convex composite optimization. In section 3, we use the LPA-type algorithms to solve sigmoid networks and employ the alternating direction method of multipliers (ADMM) to solve the non-smooth convex subproblems. In section 4, we prove some convergence properties of the proposed algorithms. In section 5, the numerical experiments are demonstrated including Franke’s function fitting and handwritten digit recognition. Finally, we conclude with an outlook in section 6.

2 Sigmoid Networks

To begin with, we introduce the two-layer real-output sigmoid network, which is known to be a universal approximator (Anthony and Bartlett 1999). Using the standard sigmoid function $\sigma:\mathbb{R}\rightarrow(0,1)$ of the form

$$\sigma(a)=\frac{1}{1+e^{-a}},$$

the sigmoid network computes a function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ of the form

$$f(\boldsymbol{x})=\sum_{i=1}^{q}w_{i}\,\sigma(\boldsymbol{v}_{i}\cdot\boldsymbol{x}+u_{i})+w_{0},$$

where $w_{i}\in\mathbb{R}~(i=0,1,\dots,q)$ are the output weights, and $\boldsymbol{v}_{i}\in\mathbb{R}^{d}$ and $u_{i}\in\mathbb{R}~(i=1,2,\dots,q)$ are the input weights. We collect these adjustable parameters in the vector

$$\boldsymbol{\theta}=(\boldsymbol{w},\boldsymbol{v},\boldsymbol{u},w_{0})^{T}\in\mathbb{R}^{n},$$

where $\boldsymbol{w}=(w_{1},w_{2},\dots,w_{q})$, $\boldsymbol{v}=(\boldsymbol{v}_{1},\boldsymbol{v}_{2},\dots,\boldsymbol{v}_{q})$, and $\boldsymbol{u}=(u_{1},u_{2},\dots,u_{q})$. From now on, we write $f(\boldsymbol{x};\boldsymbol{\theta})$ instead of $f(\boldsymbol{x})$. Given a training dataset

$$D=\{(\boldsymbol{x}_{1},y_{1}),(\boldsymbol{x}_{2},y_{2}),\dots,(\boldsymbol{x}_{m},y_{m})\mid\boldsymbol{x}_{i}\in\mathbb{R}^{d},\ y_{i}\in\mathbb{R}\},$$

the goal of using this network for a supervised learning problem is to find parameters that minimize some measure of the error of the network output over the training dataset, that is,

$$\min_{\boldsymbol{\theta}\in\mathbb{R}^{n}}E(\boldsymbol{\theta}):=\frac{1}{m}\sum_{i=1}^{m}L(f(\boldsymbol{x}_{i};\boldsymbol{\theta}),y_{i}),\tag{2.1}$$

where $L:\mathbb{R}\times\mathbb{R}\rightarrow[0,\infty)$ is a loss function. To simplify the discussions, we focus on three convex loss functions including the quadratic loss function, $(f(\boldsymbol{x};\boldsymbol{\theta}),y)\mapsto(f(\boldsymbol{x};\boldsymbol{\theta})-y)^{2}$ , the absolute loss function, $(f(\boldsymbol{x};\boldsymbol{\theta}),y)\mapsto|f(\boldsymbol{x};\boldsymbol{\theta})-y|$ , and the hinge loss function, $(f(\boldsymbol{x};\boldsymbol{\theta}),y)\mapsto(1-yf(\boldsymbol{x};\boldsymbol{\theta}))_{+}$ .
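The model and the empirical risk described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the `LOSSES` table, and the flat storage order of $\boldsymbol{\theta}$ (first $\boldsymbol{w}$, then the rows $\boldsymbol{v}_{i}$, then $\boldsymbol{u}$, then $w_{0}$) are assumptions made here.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def unpack(theta, d, q):
    """Split a flat theta into (w, V, u, w0); the storage order is an assumption."""
    w = theta[:q]
    V = theta[q:q + q * d].reshape(q, d)   # row i holds v_i
    u = theta[q + q * d:q + q * d + q]
    w0 = theta[-1]
    return w, V, u, w0


def network(X, theta, d, q):
    """Evaluate f(x; theta) for every row x of X (shape (m, d))."""
    w, V, u, w0 = unpack(theta, d, q)
    return sigmoid(X @ V.T + u) @ w + w0


# The three convex losses considered in the text, applied elementwise.
LOSSES = {
    "quadratic": lambda f, y: (f - y) ** 2,
    "absolute": lambda f, y: np.abs(f - y),
    "hinge": lambda f, y: np.maximum(0.0, 1.0 - y * f),
}


def empirical_risk(theta, X, y, d, q, loss="quadratic"):
    """E(theta): the mean loss over the training set."""
    return float(np.mean(LOSSES[loss](network(X, theta, d, q), y)))
```

With all parameters set to zero the network outputs zero for every input, which gives a quick sanity check of the three losses.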

The model of the sigmoid network is usually non-convex and non-smooth. Interestingly, we discover that this problem can be seen as a convex composite optimization problem of the form

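$$\min_{\boldsymbol{\theta}\in\mathbb{R}^{n}}E(\boldsymbol{\theta})=\mathbb{L}(F(\boldsymbol{\theta})),\tag{2.2}$$

where the inner function $F:\mathbb{R}^{n}\rightarrow\mathbb{R}^{m}$ is smooth, and the outer function $\mathbb{L}:\mathbb{R}^{m}\rightarrow\mathbb{R}$ is convex. Specifically, for the absolute or quadratic loss functions, we can set

$$F(\boldsymbol{\theta})=\begin{pmatrix}f(\boldsymbol{x}_{1};\boldsymbol{\theta})-y_{1}\\ f(\boldsymbol{x}_{2};\boldsymbol{\theta})-y_{2}\\ \vdots\\ f(\boldsymbol{x}_{m};\boldsymbol{\theta})-y_{m}\end{pmatrix},\qquad\mathbb{L}(\boldsymbol{z})=\frac{1}{m}\|\boldsymbol{z}\|_{p}^{p},\quad\text{where }p=1\text{ or }2.$$

In what follows, we write $\|\cdot\|$ for $\|\cdot\|_{2}$. For the hinge loss function, we can set

$$F(\boldsymbol{\theta})=\begin{pmatrix}y_{1}f(\boldsymbol{x}_{1};\boldsymbol{\theta})\\ y_{2}f(\boldsymbol{x}_{2};\boldsymbol{\theta})\\ \vdots\\ y_{m}f(\boldsymbol{x}_{m};\boldsymbol{\theta})\end{pmatrix},\qquad\mathbb{L}(\boldsymbol{z})=\frac{1}{m}\sum_{i=1}^{m}(1-z_{i})_{+}=\frac{1}{m}\|(\boldsymbol{1}-\boldsymbol{z})_{+}\|_{1},$$

where $\boldsymbol{z}_{+}$ denotes the componentwise non-negative part of $\boldsymbol{z}$. As we can see, all the outer functions are separable and have the form

$$\mathbb{L}(\boldsymbol{z})=\frac{1}{m}\sum_{i=1}^{m}\mathtt{L}(z_{i}),$$

where $\mathtt{L}:\mathbb{R}\rightarrow[0,\infty)$ is a convex function. This separability is the special property of $\mathbb{L}$ in sigmoid networks.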
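The composite structure can be checked numerically. The sketch below, with hypothetical helper names, takes a vector of network outputs $f(\boldsymbol{x}_{i};\boldsymbol{\theta})$ as given, builds the inner value $F(\boldsymbol{\theta})$ and applies the outer function; the result should agree with the averaged pointwise loss.

```python
import numpy as np


def inner_residual(f, y):
    """F(theta) for the quadratic and absolute losses: residuals f_i - y_i."""
    return f - y


def inner_margin(f, y):
    """F(theta) for the hinge loss: margins y_i * f_i."""
    return y * f


def outer_p(z, p):
    """Outer function (1/m) * ||z||_p^p for p = 1 or 2."""
    return float(np.sum(np.abs(z) ** p) / z.size)


def outer_hinge(z):
    """Outer function (1/m) * sum_i (1 - z_i)_+ ."""
    return float(np.sum(np.maximum(0.0, 1.0 - z)) / z.size)


if __name__ == "__main__":
    f = np.array([0.5, -2.0, 1.0])   # illustrative network outputs
    y = np.array([1.0, -1.0, 1.0])   # illustrative targets / labels
    print(outer_p(inner_residual(f, y), 2), np.mean((f - y) ** 2))
    print(outer_p(inner_residual(f, y), 1), np.mean(np.abs(f - y)))
    print(outer_hinge(inner_margin(f, y)), np.mean(np.maximum(0.0, 1.0 - y * f)))
```

Each printed pair shows the composite value next to the directly averaged loss.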

3 Composite Optimization Algorithms for Sigmoid Networks

In this section, we show how to solve the sigmoid networks based on the composite optimization algorithms including the linearized proximal algorithms (LPA) and the alternating direction method of multipliers (ADMM).

3.1 LPA for Sigmoid Networks. The LPA is one of the most advanced algorithms in convex composite optimization. It was inspired by the GNM and the proximal point algorithm (PPA); it maintains the same convergence rate as these methods while overcoming some of their disadvantages. Each subproblem of the LPA is constructed from a linearized approximation to the composite function and a regularization term at the current iterate. Since the subproblem is an unconstrained strongly convex optimization problem whose optimal solution is global and unique, it is easier to solve than that of the GNM. Consequently, the LPA has an attractive computational advantage, although it is generally not a descent algorithm. Moreover, the LPA is connected to other algorithms mentioned in this paper. The ProxDescent for solving (2.2) is a special case of the LPA; since it uses descent directions, the ProxDescent is a descent algorithm. The case in which the inner function is simply the identity mapping has a long history: the iteration $\boldsymbol{\theta}_{k+1}=\boldsymbol{\theta}_{k}+\Delta\boldsymbol{\theta}_{k}$, where $\Delta\boldsymbol{\theta}_{k}$ minimizes the function $\Delta\boldsymbol{\theta}\mapsto h(\boldsymbol{\theta}_{k}+\Delta\boldsymbol{\theta})+\frac{1}{2t}\|\Delta\boldsymbol{\theta}\|^{2}$ for a convex function $h$, is the well-known PPA.

Applying the LPA directly to (2.2), we get the following algorithm for sigmoid networks.

Algorithm 1. LPA for sigmoid networks.

Input: Model $f$, training dataset $D=\{(\boldsymbol{x}_{i},y_{i})\}_{i=1}^{m}$, outer function $\mathbb{L}$, inner function $F$.

1: Initialization: $t>0$ , $\boldsymbol{\theta}_{0}\in\mathbb{R}^{n}$ , $k\leftarrow 0$ , accept $\leftarrow$ false;

2: while not accept do

3: calculate the search direction

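$$\Delta\boldsymbol{\theta}_{k}:=\operatorname*{arg\,min}_{\Delta\boldsymbol{\theta}\in\mathbb{R}^{n}}\left\{\mathbb{L}\big(F(\boldsymbol{\theta}_{k})+F^{\prime}(\boldsymbol{\theta}_{k})\Delta\boldsymbol{\theta}\big)+\frac{1}{2t}\|\Delta\boldsymbol{\theta}\|^{2}\right\},\tag{3.1}$$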

where $F^{\prime}(\boldsymbol{\theta})=\left(\nabla^{T}_{\boldsymbol{\theta}}f(\boldsymbol{x}_{i};\boldsymbol{\theta})\right)_{i=1}^{m}\in\mathbb{R}^{m\times n}$ is the Jacobian matrix of $F(\boldsymbol{\theta})$ ;

4: if $\Delta\boldsymbol{\theta}_{k}=\boldsymbol{0}$ then

5: accept $\leftarrow$ true;

6: end if

7: $\boldsymbol{\theta}_{k+1}\leftarrow\boldsymbol{\theta}_{k}+\Delta\boldsymbol{\theta}_{k}$ ;

8: $k\leftarrow k+1$ ;

9: end while

10: $\boldsymbol{\theta}^{*}\leftarrow\boldsymbol{\theta}_{k}$.

Output: $f(\boldsymbol{x};\boldsymbol{\theta}^{*})$.

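The key issue in Algorithm 1 is how to solve the subproblem (3.1) accurately and efficiently. We now discuss numerical algorithms for the specific loss functions.

For the quadratic loss function, Algorithm 1 reduces to the well-known Levenberg-Marquardt method for solving the nonlinear least squares problem

$$\min_{\boldsymbol{\theta}\in\mathbb{R}^{n}}E(\boldsymbol{\theta})=\frac{1}{m}\|F(\boldsymbol{\theta})\|^{2}.$$

The smooth convex subproblem can be written as

$$\min_{\Delta\boldsymbol{\theta}\in\mathbb{R}^{n}}\frac{1}{m}\|F(\boldsymbol{\theta}_{k})+F^{\prime}(\boldsymbol{\theta}_{k})\Delta\boldsymbol{\theta}\|^{2}+\frac{1}{2t}\|\Delta\boldsymbol{\theta}\|^{2},$$

and its necessary and sufficient optimality condition reads

$$\frac{2}{m}F^{\prime}(\boldsymbol{\theta}_{k})^{T}\big(F(\boldsymbol{\theta}_{k})+F^{\prime}(\boldsymbol{\theta}_{k})\Delta\boldsymbol{\theta}_{k}\big)+\frac{1}{t}\Delta\boldsymbol{\theta}_{k}=\boldsymbol{0}.$$

Hence the search direction has the closed form

$$\Delta\boldsymbol{\theta}_{k}=-\left(\frac{2}{m}F^{\prime}(\boldsymbol{\theta}_{k})^{T}F^{\prime}(\boldsymbol{\theta}_{k})+\frac{1}{t}I\right)^{-1}\left(\frac{2}{m}F^{\prime}(\boldsymbol{\theta}_{k})^{T}F(\boldsymbol{\theta}_{k})\right)\triangleq-B_{k}^{-1}\nabla E(\boldsymbol{\theta}_{k}),$$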

where $B_{k}$ is always a positive-definite and hence invertible matrix, and $\nabla E(\boldsymbol{\theta})$ is the gradient of the objective function in the original problem. Thus, the iteration $\boldsymbol{\theta}_{k+1}=\boldsymbol{\theta}_{k}-B_{k}^{-1}\nabla E(\boldsymbol{\theta}_{k})$ can be regarded as a variant of the gradient descent algorithm in which $B_{k}^{-1}$ acts as an adaptive learning rate. Moreover, the stopping criterion $\Delta\boldsymbol{\theta}_{k}=\boldsymbol{0}$ implies that $\nabla E(\boldsymbol{\theta}_{k})=\boldsymbol{0}$, which is the first-order necessary condition of the original problem. In section 4, we will give a first-order sufficient condition of the original problem in Theorem 7.
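For the quadratic loss, one step of the iteration can be sketched as follows. The sketch assumes the flat parameter order $(\boldsymbol{w},\boldsymbol{v},\boldsymbol{u},w_{0})$ with the $\boldsymbol{v}_{i}$ stored row by row, and the function names are hypothetical. The Jacobian is assembled analytically from $\sigma^{\prime}=\sigma(1-\sigma)$, and the search direction is obtained by solving the linear system $B_{k}\Delta\boldsymbol{\theta}_{k}=-\nabla E(\boldsymbol{\theta}_{k})$ instead of forming $B_{k}^{-1}$.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def residual_and_jacobian(theta, X, y, q):
    """Return F(theta) = f(X; theta) - y and its Jacobian F'(theta) (m x n)."""
    m, d = X.shape
    w = theta[:q]
    V = theta[q:q + q * d].reshape(q, d)
    u = theta[q + q * d:q + q * d + q]
    w0 = theta[-1]
    S = sigmoid(X @ V.T + u)                 # (m, q) hidden activations
    F = S @ w + w0 - y                       # (m,) residuals
    G = S * (1.0 - S) * w                    # (m, q): w_i * sigma'(v_i.x + u_i)
    dV = (G[:, :, None] * X[:, None, :]).reshape(m, q * d)
    J = np.hstack([S, dV, G, np.ones((m, 1))])
    return F, J


def lpa_quadratic_step(F, J, t):
    """Closed-form search direction -B^{-1} grad E for the quadratic loss."""
    m, n = J.shape
    B = (2.0 / m) * J.T @ J + np.eye(n) / t  # positive definite for t > 0
    grad = (2.0 / m) * J.T @ F
    return -np.linalg.solve(B, grad)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 2))              # illustrative data
    y = rng.normal(size=6)
    q = 3
    theta = rng.normal(size=q * (2 + 2) + 1)
    F, J = residual_and_jacobian(theta, X, y, q)
    theta_next = theta + lpa_quadratic_step(F, J, t=1.0)
    print(theta_next.shape)
```

Solving the system directly keeps the cost at one factorization of an $n\times n$ matrix per outer iteration.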

3.2 ADMM for Non-Smooth Convex Subproblems. For the non-smooth convex loss functions, the subproblem of Algorithm 1 is more complex, but it is still convex. Many widely used convex optimization methods and heuristic algorithms can solve it, such as gradient or subgradient methods, approximation or composite optimization methods (Bertsekas 2015), and simulated annealing algorithms. Moreover, there are many related software packages that implement these algorithms, such as CVXPY in Python, qpOASES in C++, and the CVX toolbox in Matlab. So it is not difficult to calculate the search direction from the subproblem. We use a mapping $A$ to represent a specific algorithm for solving the subproblem, so that the search direction can be written as

$$\Delta\boldsymbol{\theta}_{k}=A(\mathbb{L},F,F^{\prime},t,\boldsymbol{\theta}_{k}).$$

Here we use the ADMM to solve the subproblem. The ADMM is a simple scheme that often works well and has a good reliability with a wide range of applications, especially for convex problems. It is also easy to understand and implement for many composite optimization problems with complex structures (Boyd et al. 2011).

The subproblem (3.1) can be seen as an equivalent problem of the form

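$$\min_{\boldsymbol{\mu}\in\mathbb{R}^{m},\,\Delta\boldsymbol{\theta}\in\mathbb{R}^{n}}\ \mathbb{L}(\boldsymbol{\mu})+\frac{1}{2t}\|\Delta\boldsymbol{\theta}\|^{2}\quad\text{s.t.}\quad\boldsymbol{\mu}-F(\boldsymbol{\theta}_{k})-F^{\prime}(\boldsymbol{\theta}_{k})\Delta\boldsymbol{\theta}=\boldsymbol{0}.$$

The augmented Lagrangian function of the above problem is

$$\mathcal{L}_{\rho}(\boldsymbol{\mu},\Delta\boldsymbol{\theta},\boldsymbol{\lambda})=\mathbb{L}(\boldsymbol{\mu})+\frac{1}{2t}\|\Delta\boldsymbol{\theta}\|^{2}+\boldsymbol{\lambda}^{T}\big(\boldsymbol{\mu}-F(\boldsymbol{\theta}_{k})-F^{\prime}(\boldsymbol{\theta}_{k})\Delta\boldsymbol{\theta}\big)+\frac{\rho}{2}\|\boldsymbol{\mu}-F(\boldsymbol{\theta}_{k})-F^{\prime}(\boldsymbol{\theta}_{k})\Delta\boldsymbol{\theta}\|^{2},$$

where $\rho>0$ is the penalty parameter. The ADMM consists of the iterations

$$\begin{aligned}\boldsymbol{\mu}^{i+1}&=\operatorname*{arg\,min}_{\boldsymbol{\mu}\in\mathbb{R}^{m}}\mathcal{L}_{\rho}(\boldsymbol{\mu},\Delta\boldsymbol{\theta}^{i},\boldsymbol{\lambda}^{i}),\\ \Delta\boldsymbol{\theta}^{i+1}&=\operatorname*{arg\,min}_{\Delta\boldsymbol{\theta}\in\mathbb{R}^{n}}\mathcal{L}_{\rho}(\boldsymbol{\mu}^{i+1},\Delta\boldsymbol{\theta},\boldsymbol{\lambda}^{i}),\\ \boldsymbol{\lambda}^{i+1}&=\boldsymbol{\lambda}^{i}+\rho\big(\boldsymbol{\mu}^{i+1}-F(\boldsymbol{\theta}_{k})-F^{\prime}(\boldsymbol{\theta}_{k})\Delta\boldsymbol{\theta}^{i+1}\big).\end{aligned}\tag{3.2}$$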

The calculation for $\boldsymbol{\mu}^{i+1}$ is as follows.

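$$\begin{aligned}\boldsymbol{\mu}^{i+1}&=\operatorname*{arg\,min}_{\boldsymbol{\mu}\in\mathbb{R}^{m}}\left\{\mathbb{L}(\boldsymbol{\mu})+\frac{\rho}{2}\|\boldsymbol{\mu}-\boldsymbol{a}^{i}\|^{2}\right\}\\ &=\operatorname*{arg\,min}_{\boldsymbol{\mu}\in\mathbb{R}^{m}}\left\{\sum_{j=1}^{m}\left(\frac{1}{m}\mathtt{L}(\mu_{j})+\frac{\rho}{2}(\mu_{j}-a_{j}^{i})^{2}\right)\right\}\\ &=\left(\operatorname*{arg\,min}_{\mu_{j}\in\mathbb{R}}\left\{\frac{1}{m}\mathtt{L}(\mu_{j})+\frac{\rho}{2}(\mu_{j}-a_{j}^{i})^{2}\right\}\right)_{j=1}^{m}\\ &=\left(\Phi_{1/(m\rho)}(a_{j}^{i})\right)_{j=1}^{m},\end{aligned}$$

where $\boldsymbol{a}^{i}=F(\boldsymbol{\theta}_{k})+F^{\prime}(\boldsymbol{\theta}_{k})\Delta\boldsymbol{\theta}^{i}-\frac{1}{\rho}\boldsymbol{\lambda}^{i}$, and $\Phi_{\kappa}$ is the proximity operator of $\mathtt{L}$ with the penalty $\frac{1}{\kappa}$ (Boyd et al. 2011). Specifically, for the absolute loss function, the proximity operator $\Phi_{\kappa}$, also called the soft thresholding operator, is defined as

$$\Phi_{\kappa}(a)=\begin{cases}a-\kappa,&a>\kappa,\\ 0,&|a|\leq\kappa,\\ a+\kappa,&a<-\kappa.\end{cases}$$

For the hinge loss function, the proximity operator $\Phi_{\kappa}$ is defined as

$$\Phi_{\kappa}(a)=\begin{cases}a,&a>1,\\ 1,&1-\kappa\leq a\leq 1,\\ a+\kappa,&a<1-\kappa.\end{cases}$$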
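Both proximity operators are simple elementwise maps. A minimal NumPy sketch, with hypothetical function names, is given below; the hinge case is written for $\mathtt{L}(z)=(1-z)_{+}$ as used in the text.

```python
import numpy as np


def prox_absolute(a, kappa):
    """Soft thresholding: argmin_mu kappa*|mu| + 0.5*(mu - a)^2, elementwise."""
    a = np.asarray(a, dtype=float)
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)


def prox_hinge(a, kappa):
    """argmin_mu kappa*(1 - mu)_+ + 0.5*(mu - a)^2, elementwise."""
    a = np.asarray(a, dtype=float)
    # a > 1: unchanged; a < 1 - kappa: shifted up by kappa; otherwise clipped to 1
    return np.where(a > 1.0, a, np.where(a < 1.0 - kappa, a + kappa, 1.0))
```

In the ADMM update for $\boldsymbol{\mu}^{i+1}$ these maps would be called with $\kappa=1/(m\rho)$.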

The calculation for $\Delta\boldsymbol{\theta}^{i+1}$ is as follows. Since

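$$\Delta\boldsymbol{\theta}^{i+1}=\operatorname*{arg\,min}_{\Delta\boldsymbol{\theta}\in\mathbb{R}^{n}}\left\{\frac{1}{2t}\|\Delta\boldsymbol{\theta}\|^{2}+\frac{\rho}{2}\left\|\boldsymbol{\mu}^{i+1}-F(\boldsymbol{\theta}_{k})-F^{\prime}(\boldsymbol{\theta}_{k})\Delta\boldsymbol{\theta}+\frac{1}{\rho}\boldsymbol{\lambda}^{i}\right\|^{2}\right\},$$

by its necessary and sufficient optimality conditions, we obtain that

$$\Delta\boldsymbol{\theta}^{i+1}=\left(\rho F^{\prime}(\boldsymbol{\theta}_{k})^{T}F^{\prime}(\boldsymbol{\theta}_{k})+\frac{1}{t}I\right)^{-1}\left(\rho F^{\prime}(\boldsymbol{\theta}_{k})^{T}\left(\boldsymbol{\mu}^{i+1}-F(\boldsymbol{\theta}_{k})+\frac{1}{\rho}\boldsymbol{\lambda}^{i}\right)\right).\tag{3.4}$$

As we can see, the iterations of the ADMM for the non-smooth convex subproblems have explicit formulas, which is one of the advantages of the ADMM. Defining the primal residual of the optimality conditions at iteration $i$ as

$$\boldsymbol{r}^{i}=\boldsymbol{\mu}^{i}-F(\boldsymbol{\theta}_{k})-F^{\prime}(\boldsymbol{\theta}_{k})\Delta\boldsymbol{\theta}^{i},\tag{3.5}$$

and the dual residual at iteration $i$ as

$$\boldsymbol{s}^{i}=\rho F^{\prime}(\boldsymbol{\theta}_{k})(\Delta\boldsymbol{\theta}^{i}-\Delta\boldsymbol{\theta}^{i-1}),\tag{3.6}$$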

we set the stopping criterion as $\|\boldsymbol{r}^{i}\|\approx 0$ and $\|\boldsymbol{s}^{i}\|\approx 0$ .

Algorithm A∗. ADMM for non-smooth convex subproblems.

Input: Numbers $m$ and $t$, matrices $F(\boldsymbol{\theta}_{k})$ and $F^{\prime}(\boldsymbol{\theta}_{k})$, non-smooth convex function $\mathtt{L}$.

1: Initialization: $\rho>0$ , $\Delta\boldsymbol{\theta}^{0}\in\mathbb{R}^{n}$ , $\boldsymbol{\lambda}^{0}\in\mathbb{R}^{m}$ , $\epsilon>0$ , $i\leftarrow 0$ ;

2: repeat

3: $i\leftarrow i+1$ ;

4: calculate $\boldsymbol{\mu}^{i}$ from (3);

5: calculate $\Delta\boldsymbol{\theta}^{i}$ from (3.4);

6: calculate $\boldsymbol{\lambda}^{i}$ from (3.2);

7: calculate $\boldsymbol{r}^{i}$ and $\boldsymbol{s}^{i}$ from (3.5) and (3.6), respectively;

8: until $\|\boldsymbol{r}^{i}\|<\epsilon$ and $\|\boldsymbol{s}^{i}\|<\epsilon$ ;

9: $\Delta\boldsymbol{\theta}_{k}\leftarrow\Delta\boldsymbol{\theta}^{i}$.

Output: $\Delta\boldsymbol{\theta}_{k}$.
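The loop of Algorithm A∗ can be sketched directly from the explicit update formulas. The version below is an illustration only: the function names, the default values of `rho`, `eps` and `max_iter`, and the zero initialization are assumptions, and the proximity operator is passed in as a callable so that either loss can be used.

```python
import numpy as np


def soft_threshold(a, kappa):
    """Proximity operator of the absolute loss (soft thresholding)."""
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)


def admm_subproblem(F, J, t, prox, rho=1.0, eps=1e-8, max_iter=10000):
    """Approximately minimize L(F + J d) + ||d||^2 / (2t) over d by ADMM.

    F is the vector F(theta_k), J the Jacobian F'(theta_k), and prox(a, kappa)
    the proximity operator of the scalar loss.
    """
    m, n = J.shape
    dtheta = np.zeros(n)
    lam = np.zeros(m)
    M = rho * J.T @ J + np.eye(n) / t            # fixed over the inner loop
    for _ in range(max_iter):
        a = F + J @ dtheta - lam / rho
        mu = prox(a, 1.0 / (m * rho))            # mu-update
        dtheta_new = np.linalg.solve(M, rho * J.T @ (mu - F + lam / rho))
        r = mu - F - J @ dtheta_new              # primal residual
        s = rho * J @ (dtheta_new - dtheta)      # dual residual
        lam = lam + rho * r                      # multiplier update
        dtheta = dtheta_new
        if np.linalg.norm(r) < eps and np.linalg.norm(s) < eps:
            break
    return dtheta


if __name__ == "__main__":
    # One-dimensional check: minimize |3 + d| + d^2 / 2, whose minimizer is d = -1.
    print(admm_subproblem(np.array([3.0]), np.array([[1.0]]), 1.0, soft_threshold))
```

Because the matrix `M` does not change during the inner loop, a factorization could be computed once and reused; the sketch calls `np.linalg.solve` each time only for brevity.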

3.3 A Globalization Strategy for Algorithm 1. We next present the globalized LPA (GLPA), which adopts a backtracking line search as a globalization strategy. The stepsize is chosen by backtracking so that the objective function decreases monotonically at each iteration; as a result, the GLPA is a descent algorithm. In the implementation, the backtracking strategy finds the first point satisfying the inequality (3.7) by shrinking the trial stepsize geometrically, which makes the accepted stepsize as large as possible among the trial values with the descent property.

Algorithm 2. GLPA for sigmoid networks.

Input: Model $f$, training dataset $D=\{(\boldsymbol{x}_{i},y_{i})\}_{i=1}^{m}$, outer function $\mathbb{L}$, inner function $F$.

1: Initialization: $t>0$ , $c,\tau\in(0,1)$ , $\boldsymbol{\theta}_{0}\in\mathbb{R}^{n}$ , $k\leftarrow 0$ , accept $\leftarrow$ false;

2: while not accept do

3: calculate the search direction

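$$\Delta\boldsymbol{\theta}_{k}=\operatorname*{arg\,min}_{\Delta\boldsymbol{\theta}\in\mathbb{R}^{n}}\left\{\mathbb{L}\big(F(\boldsymbol{\theta}_{k})+F^{\prime}(\boldsymbol{\theta}_{k})\Delta\boldsymbol{\theta}\big)+\frac{1}{2t}\|\Delta\boldsymbol{\theta}\|^{2}\right\};$$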

4: if $\Delta\boldsymbol{\theta}_{k}=\boldsymbol{0}$ then

5: accept $\leftarrow$ true;

6: end if

7: $\eta\leftarrow 1/\tau$ ;

8: repeat

9: $\eta\leftarrow\tau\eta$

10: until

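$$\mathbb{L}(F(\boldsymbol{\theta}_{k}+\eta\Delta\boldsymbol{\theta}_{k}))-\mathbb{L}(F(\boldsymbol{\theta}_{k}))\leq c\,\eta\left(\mathbb{L}\big(F(\boldsymbol{\theta}_{k})+F^{\prime}(\boldsymbol{\theta}_{k})\Delta\boldsymbol{\theta}_{k}\big)+\frac{1}{2t}\|\Delta\boldsymbol{\theta}_{k}\|^{2}-\mathbb{L}(F(\boldsymbol{\theta}_{k}))\right);\tag{3.7}$$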

11: $\eta_{k}\leftarrow\eta$ ;

12: $\boldsymbol{\theta}_{k+1}\leftarrow\boldsymbol{\theta}_{k}+\eta_{k}\Delta\boldsymbol{\theta}_{k}$ ;

13: $k\leftarrow k+1$ ;

14: end while

15: $\boldsymbol{\theta}^{*}\leftarrow\boldsymbol{\theta}_{k}$.

Output: $f(\boldsymbol{x};\boldsymbol{\theta}^{*})$.

4 Convergence Analysis

In this section, we prove some convergence properties of the proposed algorithms under the assumptions of the weak sharp minima and the regularity condition or full row rank of the Jacobian matrix, a stronger condition. Before giving the main results, we introduce the following useful definitions and lemmas.

4.1 Theoretical Foundations of LPA-type Algorithms. Here we consider the convex composite optimization of the form

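$$\min_{\boldsymbol{\omega}\in\mathbb{R}^{b}}(h\circ G)(\boldsymbol{\omega}):=h(G(\boldsymbol{\omega})),\tag{4.1}$$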

where the inner function $G:\mathbb{R}^{b}\rightarrow\mathbb{R}^{l}$ is continuously differentiable, and the outer function $h:\mathbb{R}^{l}\rightarrow\mathbb{R}$ is convex. It is a more general mathematical form of the problem (2.2).

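First, we introduce the concept of the Lipschitz continuous gradient, which has played an important role in investigating the convergence behavior of many optimization algorithms. For a differentiable function $G$ and $\Omega\subseteq\mathbb{R}^{b}$, if there exists a $K>0$ such that

$$\|G^{\prime}(\boldsymbol{\omega}_{1})-G^{\prime}(\boldsymbol{\omega}_{2})\|\leq K\|\boldsymbol{\omega}_{1}-\boldsymbol{\omega}_{2}\|\quad\text{for all }\boldsymbol{\omega}_{1},\boldsymbol{\omega}_{2}\in\Omega,$$

we say that $G$ is $K$-smooth or has a Lipschitz continuous gradient with modulus $K$ on $\Omega$.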

Next, we give the notion of the weak sharp minima introduced in (Burke and Ferris 1993), which has far-reaching consequences for the convergence analysis of many iterative procedures. For a function $h$ , the minimum value and the set of minima for $h$ , denoted by $h_{\text{min}}$ and $C_{h}$ , are defined by

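$$h_{\text{min}}:=\min_{\boldsymbol{z}\in\mathbb{R}^{l}}h(\boldsymbol{z}),\qquad C_{h}:=\operatorname*{arg\,min}_{\boldsymbol{z}\in\mathbb{R}^{l}}h(\boldsymbol{z}).$$

Let $C_{h}\subseteq S\subseteq\mathbb{R}^{l}$. If there exist $\alpha>0$ and $\beta\geq 1$ such that

$$h(\boldsymbol{z})\geq h_{\text{min}}+\alpha\,\text{dist}^{\beta}(\boldsymbol{z},C_{h})\quad\text{for all }\boldsymbol{z}\in S,$$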

where $\text{dist}(\boldsymbol{z},C):=\inf_{\boldsymbol{c}\in C}\|\boldsymbol{z}-\boldsymbol{c}\|$ , then we say that $C_{h}$ is the set of weak sharp minima of order $\beta$ for $h$ on $S$ with modulus $\alpha$ .
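As a worked illustration, the three outer functions of Section 2 can be checked against this definition by hand, assuming the weak sharp minima inequality takes the standard form $h(\boldsymbol{z})\geq h_{\text{min}}+\alpha\,\text{dist}^{\beta}(\boldsymbol{z},C_{h})$ for all $\boldsymbol{z}\in S$. The calculation below takes $S=\mathbb{R}^{m}$ and uses only the elementary inequality $\|\boldsymbol{z}\|_{1}\geq\|\boldsymbol{z}\|_{2}$; the orders and moduli it produces are a sketch, not a restatement of the paper's results.

```latex
\begin{align*}
&\text{Quadratic loss: } \mathbb{L}(\boldsymbol{z})=\tfrac{1}{m}\|\boldsymbol{z}\|^{2},\quad
  \mathbb{L}_{\min}=0,\quad C_{\mathbb{L}}=\{\boldsymbol{0}\},\quad
  \operatorname{dist}(\boldsymbol{z},C_{\mathbb{L}})=\|\boldsymbol{z}\|,\\
&\qquad\mathbb{L}(\boldsymbol{z})=\tfrac{1}{m}\operatorname{dist}^{2}(\boldsymbol{z},C_{\mathbb{L}})
  \quad\Longrightarrow\quad \beta=2,\ \alpha=\tfrac{1}{m}.\\[4pt]
&\text{Absolute loss: } \mathbb{L}(\boldsymbol{z})=\tfrac{1}{m}\|\boldsymbol{z}\|_{1},\quad
  \mathbb{L}_{\min}=0,\quad C_{\mathbb{L}}=\{\boldsymbol{0}\},\\
&\qquad\mathbb{L}(\boldsymbol{z})=\tfrac{1}{m}\|\boldsymbol{z}\|_{1}\geq\tfrac{1}{m}\|\boldsymbol{z}\|
  =\tfrac{1}{m}\operatorname{dist}(\boldsymbol{z},C_{\mathbb{L}})
  \quad\Longrightarrow\quad \beta=1,\ \alpha=\tfrac{1}{m}.\\[4pt]
&\text{Hinge loss: } \mathbb{L}(\boldsymbol{z})=\tfrac{1}{m}\|(\boldsymbol{1}-\boldsymbol{z})_{+}\|_{1},\quad
  \mathbb{L}_{\min}=0,\quad C_{\mathbb{L}}=\{\boldsymbol{z}:z_{i}\geq 1\ \forall i\},\quad
  \operatorname{dist}(\boldsymbol{z},C_{\mathbb{L}})=\|(\boldsymbol{1}-\boldsymbol{z})_{+}\|,\\
&\qquad\mathbb{L}(\boldsymbol{z})\geq\tfrac{1}{m}\|(\boldsymbol{1}-\boldsymbol{z})_{+}\|
  =\tfrac{1}{m}\operatorname{dist}(\boldsymbol{z},C_{\mathbb{L}})
  \quad\Longrightarrow\quad \beta=1,\ \alpha=\tfrac{1}{m}.
\end{align*}
```

Under this reading, the two non-smooth losses have weak sharp minima of order $1$ and the quadratic loss of order $2$, which matches the range $\beta\in[1,2]$ used in the convergence results.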

We now introduce the regularity condition proposed in (Burke and Ferris 1995), which is a crucial assumption applied to establish the convergence of several convex composite optimization algorithms. Let $h$ and $G$ be defined by (4.1), then a point $\bar{\boldsymbol{\omega}}\in\mathbb{R}^{b}$ is said to be a regular point of the inclusion $G(\boldsymbol{\omega})\in C_{h}$ if

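$$\ker(G^{\prime}(\bar{\boldsymbol{\omega}})^{T})\cap(C_{h}-G(\bar{\boldsymbol{\omega}}))^{\ominus}=\{\boldsymbol{0}\},$$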

where ker( $W$ ) $:=\{\boldsymbol{y}:W\boldsymbol{y}=\boldsymbol{0}\}$ is the nullspace of $W$ , and $Z^{\ominus}:=\{\boldsymbol{y}:\left<\boldsymbol{y},\boldsymbol{z}\right>\leq 0,\forall\boldsymbol{z}\in Z\}$ is the negative polar of $Z$ .

In the following lemmas, we give the local convergence of the LPA and the global convergence of the GLPA for solving optimization (4.1). They are based on three main conditions including Lipschitz continuous gradient, weak sharp minima and quasi-regularity or regularity condition. Note that the definition of quasi-regularity condition will only be described in the proof of Theorem 3. Since this condition is hard to verify in practice, we replace it with the regularity condition in the related theorem for sigmoid networks.

Lemma 1.

(Hu et al. 2016, Corollary 14) Let $\bar{\boldsymbol{\omega}}\in\mathbb{R}^{b}$ satisfy $G(\bar{\boldsymbol{\omega}})\in C_{h}$, and let $C_{h}$ be the set of weak sharp minima of order $\beta$ for $h$ near $G(\bar{\boldsymbol{\omega}})$ with constant $\alpha$. Suppose that $G$ is continuously differentiable with a Lipschitz continuous gradient $G^{\prime}$ near $\bar{\boldsymbol{\omega}}$, and that $\bar{\boldsymbol{\omega}}$ is a quasi-regular point of the inclusion with constant $\delta$. Suppose further that $\beta\in[1,2)$, or that $\beta=2$ and the stepsize satisfies $t>\frac{2\delta^{2}}{\alpha}$. Then there exists a neighborhood $N(\bar{\boldsymbol{\omega}})$ of $\bar{\boldsymbol{\omega}}$ such that, for any $\boldsymbol{\omega}_{0}\in N(\bar{\boldsymbol{\omega}})$, the sequence $\{\boldsymbol{\omega}_{k}\}$ generated by the LPA with initial point $\boldsymbol{\omega}_{0}$ converges at a rate of $\frac{2}{\beta}$ to a solution $\boldsymbol{\omega}^{*}$ satisfying $G(\boldsymbol{\omega}^{*})\in C_{h}$.

Lemma 2.

(Hu et al. 2016, Theorem 18) Let $\{\boldsymbol{\omega}_{k}\}$ be a sequence generated by the GLPA and assume that $\{\boldsymbol{\omega}_{k}\}$ has a cluster point $\boldsymbol{\omega}^{*}$. Suppose that $\beta\in[1,2)$ and that $C_{h}$ is the set of weak sharp minima of order $\beta$ for $h$ near $G(\boldsymbol{\omega}^{*})$. Suppose further that $G$ is continuously differentiable with a Lipschitz continuous gradient $G^{\prime}$ near $\boldsymbol{\omega}^{*}$, and that $\boldsymbol{\omega}^{*}$ is a regular point of the inclusion. Then $G(\boldsymbol{\omega}^{*})\in C_{h}$, and $\{\boldsymbol{\omega}_{k}\}$ converges to $\boldsymbol{\omega}^{*}$ at a rate of $\frac{2}{\beta}$.

Note that the range $\beta\in[1,2)$ in Lemma 2 is slightly different from $\beta\in[1,2]$ in Lemma 1. In both cases, however, the algorithm finds a globally optimal solution of the optimization problem (4.1), since $G(\boldsymbol{\omega}^{*})\in C_{h}$ or, equivalently, $h(G(\boldsymbol{\omega}^{*}))=h_{\text{min}}={(h\circ G)}_{\text{min}}$.

4.2 Convergence Analysis for Sigmoid Networks. Let $B(\boldsymbol{z},r)$ denote the open ball of radius $r$ centered at $\boldsymbol{z}$. We first establish the local convergence of Algorithm 3 by virtue of Lemma 1.

Theorem 3.

(Local Convergence). Let $\beta\in[1,2]$ and $r>0$ . Let $\{\boldsymbol{\theta}_{k}\}$ be a sequence generated by Algorithm 3, and $\bar{\boldsymbol{\theta}}\in\mathbb{R}^{n}$ be such that $F(\bar{\boldsymbol{\theta}})\in C_{\hskip 0.70004pt\mathbb{L}}$ and $C_{\hskip 0.70004pt\mathbb{L}}$ is the set of weak sharp minima of order $\beta$ for $\mathbb{L}$ on $B(F(\bar{\boldsymbol{\theta}}),r)$ . If $\bar{\boldsymbol{\theta}}$ is a regular point of the inclusion, then there exist $t_{0}\geq 0$ and $\bar{r}>0$ such that for any $t>t_{0}$ and initial point $\boldsymbol{\theta}_{0}\in B(\bar{\boldsymbol{\theta}},\bar{r})$ , the sequence $\{\boldsymbol{\theta}_{k}\}$ converges at a rate of $\frac{2}{\beta}$ to a globally optimal solution $\boldsymbol{\theta}^{*}$ and $F(\boldsymbol{\theta}^{*})\in C_{\hskip 0.70004pt\mathbb{L}}$ .

Proof.

According to the assumptions of Lemma 1, we need to verify the following four conditions.

(i)

Quasi-regularity condition. By Proposition 3.3 in (Burke and Ferris 1995), we know that any regular point of the inclusion $F(\boldsymbol{\theta})\in C_{\hskip 0.70004pt\mathbb{L}}$ is also a quasi-regular point. Since $\bar{\boldsymbol{\theta}}$ is a regular point, $\bar{\boldsymbol{\theta}}$ is also a quasi-regular point of the inclusion $F(\boldsymbol{\theta})\in C_{\hskip 0.70004pt\mathbb{L}}$ , namely there exist $\delta>0$ and $r_{0}>0$ such that

[TABLE]

where $\Pi(\boldsymbol{\theta}):=\{\Delta\boldsymbol{\theta}\in\mathbb{R}^{n}:F(\boldsymbol{\theta})+F^{\prime}(\boldsymbol{\theta})\Delta\boldsymbol{\theta}\in C_{\mathbb{L}}\}$ is the solution set of the linearized inclusion $F(\boldsymbol{\theta})+F^{\prime}(\boldsymbol{\theta})\Delta\boldsymbol{\theta}\in C_{\mathbb{L}}$.

(ii)

Weak sharp minima. In particular, we set $r_{0}\in(0,r)$. Naturally, $C_{\mathbb{L}}$ is the set of local weak sharp minima of order $\beta$ for $\mathbb{L}$ on $B(F(\bar{\boldsymbol{\theta}}),r_{0})$ with constant $\alpha$ for some $\alpha>0$, due to the assumption and definition of the weak sharp minima.

(iii)

Lipschitz continuous gradient. Note that a differentiable function with a Lipschitz continuous gradient is second-order differentiable almost everywhere on $\Omega$. If $G$ is a second-order differentiable function, by the differential mean value theorem, it is obvious that the $K$-smoothness of $G$ is equivalent to the boundedness of $G^{\prime\prime}$, that is, $\|G^{\prime\prime}(\boldsymbol{\omega})\|\leq K$ for each $\boldsymbol{\omega}\in\Omega$. On the other hand, since $F$ defined by (2.2) is smooth on $\mathbb{R}^{n}$, $F^{\prime\prime}$ is continuous on $\mathbb{R}^{n}$. Naturally, $F^{\prime\prime}$ is bounded on the bounded subset $B(\bar{\boldsymbol{\theta}},r_{0})$. Therefore, $F$ is continuously differentiable with a Lipschitz continuous gradient $F^{\prime}$ on $B(\bar{\boldsymbol{\theta}},r_{0})$.

(iv)

Large stepsize. If $\beta=2$ , we set $t_{0}=\frac{2\delta^{2}}{\alpha}$ ; otherwise, set $t_{0}=0$ .

Hence, Lemma 1 is applicable and the conclusion follows. ∎

Furthermore, we analyze the convergence properties of Algorithm 3 for the three common sigmoid networks.

Corollary 4.

Let $\{\boldsymbol{\theta}_{k}\}$ be a sequence generated by Algorithm 3, and $\bar{\boldsymbol{\theta}}\in\mathbb{R}^{n}$ be such that $F(\bar{\boldsymbol{\theta}})\in C_{\mathbb{L}}$. If $F^{\prime}(\bar{\boldsymbol{\theta}})$ has full row rank, then there exists an $\bar{r}>0$ such that for any initial point $\boldsymbol{\theta}_{0}\in B(\bar{\boldsymbol{\theta}},\bar{r})$, we have

(i)

for the sigmoid networks with the quadratic loss function, the sequence $\{\boldsymbol{\theta}_{k}\}$ linearly converges to a globally optimal solution $\boldsymbol{\theta}^{*}$ and $F(\boldsymbol{\theta}^{*})=\boldsymbol{0}$, if $t$ is sufficiently large.

(ii)

for the sigmoid networks with the absolute loss function, the sequence $\{\boldsymbol{\theta}_{k}\}$ quadratically converges to a globally optimal solution $\boldsymbol{\theta}^{*}$ and $F(\boldsymbol{\theta}^{*})=\boldsymbol{0}$.

(iii)

for the sigmoid networks with the hinge loss function, the sequence $\{\boldsymbol{\theta}_{k}\}$ quadratically converges to a globally optimal solution $\boldsymbol{\theta}^{*}$ and $F(\boldsymbol{\theta}^{*})\geq\boldsymbol{1}$.

Proof.

According to the assumptions of Theorem 3, we need to verify the following two conditions.

(a)

Regularity condition. Since the system of linear equations $W\boldsymbol{y}=\boldsymbol{0}$ has only the zero solution if and only if the matrix $W$ has full column rank, the full row rank of $F^{\prime}(\bar{\boldsymbol{\theta}})$ is equivalent to $\text{ker}(F^{\prime}(\bar{\boldsymbol{\theta}})^{T})=\{\boldsymbol{0}\}$. Then, it follows that

[TABLE]

Therefore, the regularity condition is satisfied.

(b)

Weak sharp minima. Note that $\mathbb{L}_{\rm min}=0$; moreover, $C_{\mathbb{L}}=\{\boldsymbol{0}\}$ for the quadratic and absolute loss functions, and $C_{\mathbb{L}}=\{\boldsymbol{z}\in\mathbb{R}^{m}:\boldsymbol{z}\geq\boldsymbol{1}\}$ for the hinge loss function.

(i)

In the case when $\mathbb{L}(\boldsymbol{z})=\frac{1}{m}\|\boldsymbol{z}\|^{2}$ , $\mathbb{L}(\boldsymbol{z})=\mathbb{L}_{\rm min}+\frac{1}{m}\text{dist}^{2}(\boldsymbol{z},C_{\hskip 0.70004pt\mathbb{L}})$ for each $\boldsymbol{z}\in\mathbb{R}^{m}$ . By the definition of weak sharp minima, we know that $C_{\hskip 0.70004pt\mathbb{L}}=\{\boldsymbol{0}\}$ is the set of weak sharp minima of order $2$ for $\mathbb{L}$ on $\mathbb{R}^{m}$ with modulus $\frac{1}{m}$ .

(ii)

In the case when $\mathbb{L}(\boldsymbol{z})=\frac{1}{m}\|\boldsymbol{z}\|_{1}$ , $\mathbb{L}(\boldsymbol{z})\geq\frac{1}{m}\|\boldsymbol{z}\|=\mathbb{L}_{\rm min}+\frac{1}{m}\text{dist}(\boldsymbol{z},C_{\hskip 0.70004pt\mathbb{L}})$ for each $\boldsymbol{z}\in\mathbb{R}^{m}$ . In the same way, it shows that $C_{\hskip 0.70004pt\mathbb{L}}=\{\boldsymbol{0}\}$ is the set of weak sharp minima of order $1$ for $\mathbb{L}$ on $\mathbb{R}^{m}$ with modulus $\frac{1}{m}$ .

(iii)

In the case when $\mathbb{L}(\boldsymbol{z})=\frac{1}{m}\|(\boldsymbol{1}-\boldsymbol{z})_{+}\|_{1}$, $\mathbb{L}(\boldsymbol{z})\geq\frac{1}{m}\|(\boldsymbol{1}-\boldsymbol{z})_{+}\|=\mathbb{L}_{\rm min}+\frac{1}{m}\text{dist}(\boldsymbol{z},C_{\mathbb{L}})$ for each $\boldsymbol{z}\in\mathbb{R}^{m}$, which implies that $C_{\mathbb{L}}=\{\boldsymbol{z}\in\mathbb{R}^{m}:\boldsymbol{z}\geq\boldsymbol{1}\}$ is the set of weak sharp minima of order $1$ for $\mathbb{L}$ on $\mathbb{R}^{m}$ with modulus $\frac{1}{m}$. Therefore, the local weak sharp minima condition is satisfied for the three common sigmoid networks.

Hence, Theorem 3 is applicable and the conclusion follows. ∎
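The three weak sharp minima inequalities used in part (b) can be spot-checked numerically. The following is a minimal sketch, not the authors' code: it samples random vectors $\boldsymbol{z}$ and evaluates the gap $\mathbb{L}(\boldsymbol{z})-\mathbb{L}_{\rm min}-\frac{1}{m}\text{dist}^{\beta}(\boldsymbol{z},C_{\mathbb{L}})$ for each loss. The function names, the sampling range, the dimension $m=5$, and the seed are all illustrative assumptions.

```python
import math
import random


def quadratic_loss(z):
    # L(z) = (1/m) * ||z||_2^2
    return sum(v * v for v in z) / len(z)


def absolute_loss(z):
    # L(z) = (1/m) * ||z||_1
    return sum(abs(v) for v in z) / len(z)


def hinge_loss(z):
    # L(z) = (1/m) * ||(1 - z)_+||_1
    return sum(max(1.0 - v, 0.0) for v in z) / len(z)


def dist_to_origin(z):
    # Euclidean distance from z to C = {0}
    return math.sqrt(sum(v * v for v in z))


def dist_to_at_least_one(z):
    # Euclidean distance from z to C = {z : z >= 1}, which equals ||(1 - z)_+||_2
    return math.sqrt(sum(max(1.0 - v, 0.0) ** 2 for v in z))


def weak_sharp_gap(loss, dist, beta, z):
    # L(z) - L_min - (1/m) * dist(z, C)^beta, with L_min = 0 for all three losses
    return loss(z) - dist(z) ** beta / len(z)


rng = random.Random(0)  # arbitrary seed, chosen only for reproducibility
samples = [[rng.uniform(-3.0, 3.0) for _ in range(5)] for _ in range(1000)]

cases = {
    "quadratic": (quadratic_loss, dist_to_origin, 2),
    "absolute": (absolute_loss, dist_to_origin, 1),
    "hinge": (hinge_loss, dist_to_at_least_one, 1),
}

# Smallest observed gap per loss; a nonnegative value (up to rounding) is
# consistent with the weak sharp minima inequality on the sampled points.
smallest_gap = {
    name: min(weak_sharp_gap(loss, dist, beta, z) for z in samples)
    for name, (loss, dist, beta) in cases.items()
}
```

A sampled check of this kind cannot replace the proof; it only illustrates that the quadratic loss satisfies the inequality with equality (order $2$), while the absolute and hinge losses satisfy it with order $1$ because $\|\cdot\|_{1}\geq\|\cdot\|$.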

As we have seen, the weak sharp minima condition is often satisfied for sigmoid networks, and its order determines the convergence rate of the algorithm. To our surprise, a first-order algorithm even attains a second-order convergence rate. In the following paragraphs, we establish the global convergence of Algorithm 3 by virtue of Lemma 2.

Theorem 5.

(Global Convergence). Let $\beta\in[1,2)$ and $r>0$. Let $\{\boldsymbol{\theta}_{k}\}$ be a sequence generated by Algorithm 3, and suppose that $\{\boldsymbol{\theta}_{k}\}$ has a cluster point $\boldsymbol{\theta}^{*}$ such that $C_{\mathbb{L}}$ is the set of weak sharp minima of order $\beta$ for $\mathbb{L}$ on $B(F(\boldsymbol{\theta}^{*}),r)$. If $\boldsymbol{\theta}^{*}$ is a regular point of the inclusion, then $\{\boldsymbol{\theta}_{k}\}$ converges at a rate of $\frac{2}{\beta}$ to the globally optimal solution $\boldsymbol{\theta}^{*}$ and $F(\boldsymbol{\theta}^{*})\in C_{\mathbb{L}}$.

Proof.

According to the assumptions of Lemma 2, we need to verify the following three conditions.

(i)

Regularity condition. Since the cluster point $\boldsymbol{\theta}^{*}$ is a regular point of the inclusion $F(\boldsymbol{\theta})\in C_{\mathbb{L}}$, the regularity condition is satisfied.

(ii)

Weak sharp minima. Since $C_{\mathbb{L}}$ is the set of weak sharp minima of order $\beta$ for $\mathbb{L}$ on $B(F(\boldsymbol{\theta}^{*}),r)$ for some $r>0$ and $\beta\in[1,2)$, the local weak sharp minima condition is satisfied.

(iii)

Lipschitz continuous gradient. By (iii) in the proof of Theorem 3, we know that $F$ is continuously differentiable with a Lipschitz continuous gradient $F^{\prime}$ on $B(\boldsymbol{\theta}^{*},r)$ .

Hence, Lemma 2 is applicable and the conclusion follows. ∎

We can see that the GLPA-based algorithm reaches the same conclusion and convergence rate as the LPA-based algorithm under the same assumptions. Next, we show the global convergence of Algorithm 3 for two non-convex and non-smooth sigmoid networks.

Corollary 6.

Let $\{\boldsymbol{\theta}_{k}\}$ be a sequence generated by Algorithm 3 for the sigmoid networks with absolute or hinge loss functions, and suppose that $\{\boldsymbol{\theta}_{k}\}$ has a cluster point $\boldsymbol{\theta}^{*}$. If $F^{\prime}(\boldsymbol{\theta}^{*})$ has full row rank, then $\{\boldsymbol{\theta}_{k}\}$ quadratically converges to the globally optimal solution $\boldsymbol{\theta}^{*}$ and $F(\boldsymbol{\theta}^{*})\in C_{\mathbb{L}}$.

Proof.

According to the assumptions of Theorem 5, we need to verify the following two conditions.

(a)

Regularity condition. By (a) in the proof of Corollary 4, the full row rank of $F^{\prime}(\boldsymbol{\theta}^{*})$ implies that the cluster point $\boldsymbol{\theta}^{*}$ is a regular point of the inclusion. Therefore, the regularity condition is satisfied.

(b)

Weak sharp minima. By (b) in the proof of Corollary 4, we know that $C_{\mathbb{L}}$ is the set of weak sharp minima of order 1 for $\mathbb{L}$ on $\mathbb{R}^{m}$ with modulus $\frac{1}{m}$. Therefore, the local weak sharp minima condition is satisfied for the two sigmoid networks.

Hence, Theorem 5 is applicable and the conclusion follows. ∎

Note that the full row rank of $F^{\prime}(\bar{\boldsymbol{\theta}})$, namely $\text{\rm rank}(F^{\prime}(\bar{\boldsymbol{\theta}}))=m$, where $m$ is the amount of training data, is a sufficient condition for the regularity condition; it is also a necessary condition when $C_{\mathbb{L}}$ is a singleton set and $F(\bar{\boldsymbol{\theta}})\in C_{\mathbb{L}}$. Hence the convergence results can be directly related to the amount of training data. Next, we show the following convergence property of the LPA-type algorithms in a finite number of iterations.

Theorem 7.

(Sufficient Condition). If the LPA-type algorithm stops at the $k$ th iteration with $\text{\rm rank}(G^{\prime}(\boldsymbol{\omega}_{k}))=l$ , then $\boldsymbol{\omega}_{k}$ is a globally optimal solution to the convex composite optimization (4.1) and $G(\boldsymbol{\omega}_{k})\in C_{h}$ .

Proof.

Since the subproblem of the LPA-type algorithms is an unconstrained convex optimization problem, its necessary and sufficient optimality conditions imply that

[TABLE]

where $\partial h(\boldsymbol{z})$ is the subdifferential of the convex function $h(\boldsymbol{z})$ . The stopping criterion $\Delta\boldsymbol{\omega}_{k}=\boldsymbol{0}$ of the algorithms shows that

[TABLE]

By $\text{rank}(G^{\prime}(\boldsymbol{\omega}_{k}))=l$ , equivalently, the full column rank of $G^{\prime}(\boldsymbol{\omega}_{k})^{T}$ , it follows that

[TABLE]

By the necessary and sufficient optimality conditions of the convex optimization, it shows that $G(\boldsymbol{\omega}_{k})$ is a globally optimal solution to $h$ , equivalently, $G(\boldsymbol{\omega}_{k})\in C_{h}$ . Hence the proof is complete. ∎
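The three steps of this proof can be summarized as the chain below. This is a sketch under a stated assumption, not a transcription of the displayed equations: the subproblem is taken to be the linearized model $h(G(\boldsymbol{\omega}_{k})+G^{\prime}(\boldsymbol{\omega}_{k})\Delta\boldsymbol{\omega})$ plus a smooth proximal term $p_{t}(\Delta\boldsymbol{\omega})$ whose gradient vanishes at $\Delta\boldsymbol{\omega}=\boldsymbol{0}$. The symbol $p_{t}$ is introduced here for illustration only, and $h$ is assumed finite-valued so that the subdifferential chain rule applies.

```latex
\begin{aligned}
&\boldsymbol{0}\in G^{\prime}(\boldsymbol{\omega}_{k})^{T}\,
  \partial h\bigl(G(\boldsymbol{\omega}_{k})+G^{\prime}(\boldsymbol{\omega}_{k})\Delta\boldsymbol{\omega}_{k}\bigr)
  +\nabla p_{t}(\Delta\boldsymbol{\omega}_{k})
  && \text{(optimality of the subproblem)}\\
&\Delta\boldsymbol{\omega}_{k}=\boldsymbol{0}
  \;\Longrightarrow\;
  \boldsymbol{0}\in G^{\prime}(\boldsymbol{\omega}_{k})^{T}\,\partial h\bigl(G(\boldsymbol{\omega}_{k})\bigr)
  && \text{(stopping criterion)}\\
&\ker\bigl(G^{\prime}(\boldsymbol{\omega}_{k})^{T}\bigr)=\{\boldsymbol{0}\}
  \;\Longrightarrow\;
  \boldsymbol{0}\in\partial h\bigl(G(\boldsymbol{\omega}_{k})\bigr)
  && \text{(full column rank of } G^{\prime}(\boldsymbol{\omega}_{k})^{T}\text{)}
\end{aligned}
```

In the last step, the inclusion gives some $\boldsymbol{v}\in\partial h(G(\boldsymbol{\omega}_{k}))$ with $G^{\prime}(\boldsymbol{\omega}_{k})^{T}\boldsymbol{v}=\boldsymbol{0}$, and the trivial kernel forces $\boldsymbol{v}=\boldsymbol{0}$.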

Theorem 7 also shows that $\text{rank}(F^{\prime}(\boldsymbol{\theta}_{k}))=m$ is the first-order sufficient condition of sigmoid networks when the LPA-type algorithm stops at the $k$ th iteration. It is no surprise that there is a unified conclusion on the non-convex and possibly non-smooth sigmoid networks, thanks to the unified composite optimization framework and the convex subproblem.

We have seen that the full row rank is a critical condition for the convergence analysis of sigmoid networks. This condition is of great theoretical and applied significance, especially since it can provide a general guide for setting the network size. In order to guarantee the reliability of the algorithm, we can ensure that $F^{\prime}(\bar{\boldsymbol{\theta}})\in\mathbb{R}^{m\times n}$ is of full row rank, which implies that $n=(d+2)q+1\geq m$ , where $d$ is the dimension of the input, and $q$ is the number of hidden neurons. So we have the following corollary.

Corollary 8.

If $\text{rank}(F^{\prime}(\bar{\boldsymbol{\theta}}))=m$, then we have a lower bound on the network size given by

$q\geq\frac{m-1}{d+2}.$    (4.2)

Clearly, the lower bound on the network size is directly proportional to the amount of training data and inversely proportional to the dimension of the input. That is, the lower bound on the network size is adapted to the problem size, so we call this lower bound the “adaptive network size”. Moreover, each row of the Jacobian matrix $F^{\prime}(\boldsymbol{\theta})$ is the gradient of the fitting function $f(\boldsymbol{x};\boldsymbol{\theta})$ at the corresponding data point. In a general sense, as the number of hidden neurons increases, the information contained in the gradient increases. As a result, the rank of the Jacobian matrix will also increase or be equal to $m$ . Thus, the full row rank of $F^{\prime}(\bar{\boldsymbol{\theta}})$ can be satisfied in a theoretical sense by choosing the network size sufficiently large. In conclusion, the LPA-type algorithms are almost always reliable.
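The adaptive network size follows from the inequality $n=(d+2)q+1\geq m$ stated before Corollary 8. The helper below is a minimal sketch of that arithmetic; the function names are hypothetical, and rounding up to the next integer is an assumption made because $q$ counts neurons.

```python
def parameter_count(d, q):
    # Number of parameters n = (d + 2) * q + 1 for input dimension d
    # and q hidden neurons, as stated in the text.
    return (d + 2) * q + 1


def adaptive_network_size(m, d):
    # Smallest integer q such that (d + 2) * q + 1 >= m,
    # computed with integer ceiling division to avoid float rounding.
    return -(-(m - 1) // (d + 2))


# Regression setting from Section 5.1: m = 289 training points, d = 2 inputs.
q_regression = adaptive_network_size(289, 2)
n_regression = parameter_count(2, q_regression)
```

For the regression experiment this gives $q=72$ and $n=289$, so the number of parameters exactly matches the number of training points, in agreement with the choice $q\geq 72$ reported in Section 5.1.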

5 Numerical Experiment

Sigmoid networks are often used to solve regression and classification tasks, so we shall use our algorithms for both tasks. We train the sigmoid networks on the training dataset and demonstrate the performance on the test dataset. Note that we will use the adaptive network size, namely the lower bound on the network size given by Corollary 8, to build the sigmoid networks, which is sufficient to solve problems effectively.

5.1 Regression on Scattered Data. Franke’s function is a standard test function for 2D scattered data fitting of the form

[TABLE]

and its graph in the unit square in $\mathbb{R}^{2}$ is shown on the left of Figure LABEL:Haltonp. One can see that Franke’s function is a complex function with two Gaussian peaks and a small trough. We generate 289 training data points and 121 test data points using the Halton sequence. The points are uniformly distributed in the unit square in $\mathbb{R}^{2}$ , and the result is shown on the right of Figure LABEL:Haltonp.
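The data generation can be sketched as follows. Two assumptions are made and should be checked against the paper's own definition: the formula is the commonly quoted form of Franke's function, and the Halton points use bases 2 and 3 with the test points taken as a continuation of the same sequence. The source does not say how the training and test points were separated, so the `start` offset is purely illustrative.

```python
import math


def franke(x, y):
    # Commonly quoted form of Franke's test function on the unit square
    # (assumed here; compare with the definition displayed in the paper).
    t1 = 0.75 * math.exp(-((9 * x - 2) ** 2 + (9 * y - 2) ** 2) / 4.0)
    t2 = 0.75 * math.exp(-((9 * x + 1) ** 2) / 49.0 - (9 * y + 1) / 10.0)
    t3 = 0.5 * math.exp(-((9 * x - 7) ** 2 + (9 * y - 3) ** 2) / 4.0)
    t4 = -0.2 * math.exp(-((9 * x - 4) ** 2) - (9 * y - 7) ** 2)
    return t1 + t2 + t3 + t4


def radical_inverse(i, base):
    # Van der Corput radical inverse of the integer i in the given base.
    result, f = 0.0, 1.0 / base
    while i > 0:
        result += f * (i % base)
        i //= base
        f /= base
    return result


def halton_2d(n_points, start=1):
    # 2D Halton points with bases 2 and 3, indices start .. start + n_points - 1.
    return [
        (radical_inverse(i, 2), radical_inverse(i, 3))
        for i in range(start, start + n_points)
    ]


train_points = halton_2d(289)
test_points = halton_2d(121, start=290)  # assumption: continue the sequence
train_values = [franke(x, y) for x, y in train_points]
```

Halton points are low-discrepancy rather than random, which is why they cover the unit square evenly with only a few hundred samples.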

Considering the observational errors, we also add small white Gaussian noise to the training data to reflect the real case, that is, $y_{i}=g(x^{1}_{i},x^{2}_{i})+|\xi_{i}|$ and $\xi_{i}\sim N(0,\tilde{\sigma}^{2})$, where $N(0,\tilde{\sigma}^{2})$ is a Gaussian distribution with a mean of $0$ and a standard deviation of $\tilde{\sigma}$. All numerical experiments are implemented in Python 3.9. We generate the positive Gaussian noise using $\frac{1}{\sqrt{2\pi}\tilde{\sigma}}\cdot\text{uniform}(0,1)$. The performance measure we choose for the regression task is the root mean squared error (RMS-error):

[TABLE]

where $\tilde{y}_{i}$ is the predicted value and $y_{i}$ is the actual value.
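The two error measures reported in Table 1 can be computed as in the sketch below. The RMS-error follows the standard root-mean-square definition; the Max-error is assumed to be the largest absolute deviation, since the text does not define it. Function names are illustrative.

```python
import math


def rms_error(predicted, actual):
    # Root mean squared error over paired predicted and actual values.
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)


def max_error(predicted, actual):
    # Largest absolute deviation (assumed definition of "Max-error").
    return max(abs(p - a) for p, a in zip(predicted, actual))


# Small usage example with made-up numbers.
example_rms = rms_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
example_max = max_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```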

When implementing the LPA-type algorithms for the sigmoid networks with a quadratic loss function, we set $\tilde{\sigma}=100$, $\boldsymbol{\theta}_{0}=\boldsymbol{0}$, and the stopping criterion as $\|\Delta\boldsymbol{\theta}_{k}\|<$ 1e-2. For the inequality (3.7) in Algorithm 3, we set $\tau=0.5$, $c=$ 1e-3, and the maximum number of iterations for the backtracking line-search as $10$ (indeed, one iteration is enough in most cases, that is, $\eta_{k}=1$ is often used). According to (4.2), we can set $q\geq 72$ to guarantee the reliability of the algorithms. For the case when $q=72$ and $t=$ 1e5, the performance of the algorithms is shown in Table 1 and Figure 2.

As we can see, the LPA-type algorithms solve the regression tasks well, and they are robust even when the data is perturbed by the noise with a mean of 2.0094e-3 and a maximum of 3.9894e-3. The results show that the training loss is less than 5.2940e-6 for all test cases. In other words, our algorithms can obtain an ideal solution for this task. We find that the monotonic decrease of the objective function occurs at almost every iteration of the LPA. It is almost a descent algorithm. Through multiple experiments, we also find that the performance of the LPA depends on the choice of the initial point, but the GLPA is not affected by this. Thus, we conjecture that the GLPA for sigmoid networks with the quadratic loss function can converge globally under certain conditions. This will be explored in our future work.
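The reported noise magnitudes can be related to the generation rule $\frac{1}{\sqrt{2\pi}\tilde{\sigma}}\cdot\text{uniform}(0,1)$ with $\tilde{\sigma}=100$. The sketch below reproduces that rule using the standard library; the function name, the seed, and the use of `random.Random` are assumptions, so the sampled values will not match the authors' draws.

```python
import math
import random


def positive_noise(n, sigma, rng):
    # Noise rule quoted in the text: uniform(0, 1) scaled by 1 / (sqrt(2*pi) * sigma).
    scale = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return [scale * rng.uniform(0.0, 1.0) for _ in range(n)]


rng = random.Random(0)  # arbitrary seed for reproducibility
noise = positive_noise(289, 100.0, rng)

# Upper bound of the noise; the expected mean is half of this value.
upper_bound = 1.0 / (math.sqrt(2.0 * math.pi) * 100.0)
sample_mean = sum(noise) / len(noise)
```

The upper bound evaluates to about 3.9894e-3, which agrees with the reported maximum, and half of it (about 1.9947e-3) is close to the reported mean of 2.0094e-3.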

Indeed, the LPA-type algorithms using small-scale networks can solve the problem as well. The illustration is shown on the left of Figure 2. Moreover, the performance of the algorithms is also affected by the stepsize of the subproblem. This is shown on the right of Figure 2.

Corollary 6 shows that Algorithm 3 using absolute or hinge loss functions can converge globally. For simplicity, the rest of this section is devoted to demonstrating the performance of Algorithm 3. When implementing the GLPA for the sigmoid networks with an absolute loss function, we still use the same parameter values as in the previous experiments. For Algorithm 3, we set $\epsilon=\rho=$ 1e-2, $\Delta\boldsymbol{\theta}^{0}=\boldsymbol{0}$ , $\boldsymbol{\lambda}^{0}=\boldsymbol{0}$ , and the maximum number of ADMM iterations as 20. For the case when $q=72$ and $t=$ 1e5, the performance of the algorithm is shown in Table 2 and Figure 4.

The training loss in both experiments is less than 7.7930e-7, which shows that the GLPA obtains a better solution for sigmoid networks. Obviously, this result is more in line with the actual needs of regression tasks.

5.2 Classification on Handwritten Digits. The digits dataset from scikit-learn contains 1797 samples, each with 64 elements corresponding to an image of 8 $\times$ 8 pixels, and with target attribute 0, 1, $\dots$ , 9. Some of the samples are shown in Figure 5.

We create four binary classification tasks, each to classify two digits: 0 and 1; 2 and 5; 3 and 7; 6 and 9. For each task, we take 70% of the selected samples as the training data and the rest as the test data. Here we run four algorithms on these tasks, including the GLPA and three other popular and practical tools in the machine learning community, SGDM, RMSProp and Adam. We also use the same parameter settings for the GLPA as the previous experiments. The only difference is that we set $q=4$ by Corollary 8 and the number of ADMM iterations does not exceed 10. For the other algorithms, implemented with PyTorch, we set the learning rate as 1e-3, the momentum as 0.9, and the number of iterations as 1000. For the case when $q=4$ , the running results of the four algorithms are shown in Table 3 and Figure LABEL:hinge_loss.

Three observations are indicated by the running results: (i) The small training loss shows that the GLPA can obtain excellent solutions to classification problems, and the training loss of the GLPA is generally smaller than the other algorithms. (ii) The GLPA has a much smaller number of iterations, thanks to its quadratic convergence rate in this case. It is striking that a first-order algorithm (GLPA) even has a second-order convergence rate. (iii) The adaptive network size given by Corollary 8 is sufficient to construct an ideal sigmoid network that solves the problem effectively. Hence Corollary 8 does provide a good guide for setting the size of sigmoid networks.

The essence of Corollary 8 is to guarantee that the number of parameters in the neural network is not smaller than the amount of training data, so that a sufficient number of parameters ensures the feasibility of the network. In our view, it is as if the information of a data point could be extracted by a single parameter in the model. Inspired by this, we think it can also serve as a general guide for setting the size of neural networks. Setting the number of hidden neurons in neural networks is well known to be an open problem, and the number is usually adjusted by trial and error in practice. As stated above, we suggest that the number of hidden neurons be specified by trial and error starting from the adaptive network size, which can avoid certain blindness at the beginning of the trial. This general rule deserves to be tried and further verified in practice.

6 Future Work

Although we only show the composite optimization algorithms for the three-layer sigmoid networks, our algorithms are also applicable to more complex sigmoid networks, such as sigmoid networks with multiple hidden layers, with multiple outputs, and with output layer neurons that are processed with sigmoid functions. In the design of model (2.2), the convexity of the outer function $\mathbb{L}$ is due to the convex loss function $L$, and the smoothness of the inner function $F$ is due to the smooth fitting function $f$. So the algorithms can be used to solve the sigmoid networks whenever we maintain the convexity of $L$ and the smoothness of $f$ (note that $f$ is always smooth in sigmoid networks). It is not difficult to solve the general sigmoid networks with convex loss functions using our algorithms by setting the same form of $\mathbb{L}$ and $F$ as in the case of one hidden layer. As a matter of fact, the composite structure (2.2) can provide a unified framework for the development and analysis of sigmoid networks, especially for the non-convex and non-smooth optimization problems. Moreover, the various composite structures in neural networks pose more challenges for the study of composite optimization algorithms. Breakthroughs in composite optimization algorithms will also drive the development of neural network learning algorithms. Last but not least, the convergence results for convex composite optimization (4.1) in the literature all seem to be established under $G(\boldsymbol{\omega}^{*})\in C_{h}$. More general convergence theorems, possibly covering the case $G(\boldsymbol{\omega}^{*})\notin C_{h}$, remain an open problem in the area of composite optimization. In view of this, we will explore this issue further.

Acknowledgments

The research was supported in part by the National Natural Science Foundation of China under grants 12071157 and 12026602, and the Natural Science Foundation of Guangdong 2020B1515310013. Qi Ye is the corresponding author.

Bibliography

[Abiodun et al. 2018] Oludare I. Abiodun, Aman Jantan, Abiodun E. Omolara, Kemi V. Dada, Nachaat A. Mohamed, and Humaira Arshad. State-of-the-art in artificial neural network applications: A survey. Heliyon, 4(11):e00938, 2018.
[Anthony and Bartlett 1999] Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, 1999.
[Bertsekas 2015] Dimitri P. Bertsekas. Convex Optimization Algorithms. Athena Scientific, Belmont, MA, 2015.
[Boyd et al. 2011] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[Burke and Ferris 1993] James V. Burke and Michael C. Ferris. Weak sharp minima in mathematical programming. SIAM Journal on Control and Optimization, 31(5):1340–1359, 1993.
[Burke and Ferris 1995] James V. Burke and Michael C. Ferris. A Gauss-Newton method for convex composite optimization. Mathematical Programming, 71(2):179–194, 1995.
[Burke et al. 2005] James V. Burke, Adrian S. Lewis, and Michael L. Overton. A robust gradient sampling algorithm for nonsmooth, nonconvex optimization. SIAM Journal on Optimization, 15(3):751–779, 2005.
[Hong et al. 2017] Byung-Woo Hong, Ja-Keoung Koo, Hendrik Dirks, and Martin Burger. Adaptive regularization in convex composite optimization for variational imaging problems. In Pattern Recognition: 39th German Conference, GCPR 2017, Basel, Switzerland, September 12–15, 2017, Proceedings, pages 268–280. Springer, 2017.
