Composite Optimization Algorithms for Sigmoid Networks
Huixiong Chen, Qi Ye

TL;DR
This paper introduces composite optimization algorithms tailored for sigmoid networks, transforming the training process into a convex composite optimization problem, with proven convergence guarantees and practical effectiveness demonstrated through numerical experiments.
Contribution
It develops novel composite optimization algorithms based on linearized proximal methods and ADMM for sigmoid networks, ensuring convergence even in non-convex, non-smooth cases.
Findings
Algorithms converge to global optima under certain conditions.
Numerical results show robust performance on function fitting and digit recognition.
Provides guidelines for network size based on training data.
Abstract
In this paper, we use composite optimization algorithms to train sigmoid networks. We equivalently transform the training of sigmoid networks into a convex composite optimization problem and propose composite optimization algorithms based on the linearized proximal algorithms and the alternating direction method of multipliers. Under the assumptions of weak sharp minima and the regularity condition, the algorithms are guaranteed to converge to a globally optimal solution of the objective function even for non-convex and non-smooth problems. Furthermore, the convergence results can be directly related to the amount of training data and provide a general guide for setting the size of sigmoid networks. Numerical experiments on Franke's function fitting and handwritten digit recognition show that the proposed algorithms perform satisfactorily and robustly.

| | LPA RMS-error | LPA Max-error | GLPA RMS-error | GLPA Max-error |
|---|---|---|---|---|
| No noise | 2.9525e-3 | 1.4736e-2 | 2.7790e-3 | 1.1547e-2 |
| Gaussian noise | 3.4364e-3 | 1.2678e-2 | 3.7613e-3 | 1.5765e-2 |

| | GLPA RMS-error | GLPA Max-error |
|---|---|---|
| No noise | 2.2093e-4 | 8.4516e-4 |
| Gaussian noise | 8.4138e-4 | 4.3988e-3 |

| Classified digits | GLPA training errors | GLPA test errors | SGDM (RMSProp, Adam) training errors | SGDM (RMSProp, Adam) test errors |
|---|---|---|---|---|
| 0 - 1 | 0 / 252 | 0 / 108 | 0 / 252 | 0 / 108 |
| 2 - 5 | 0 / 251 | 0 / 108 | 0 / 251 | 0 / 108 |
| 3 - 7 | 0 / 253 | 0 / 109 | 0 / 253 | 0 / 109 |
| 6 - 9 | 0 / 252 | 1 / 109 | 0 / 252 | 1 / 109 |
Huixiong Chen
School of Mathematical Sciences
South China Normal University
Guangzhou 510631, China
Qi Ye
School of Mathematical Sciences
South China Normal University
Guangzhou 510631, China
Keywords: sigmoid network, composite optimization, non-convex non-smooth algorithm, global convergence, adaptive network size
1 Introduction
The neural network is an important and popular branch of machine learning. People have already developed many useful and well-studied neural network models, such as artificial neural networks, convolutional neural networks, recurrent neural networks, and deep neural networks. Neural networks have been widely used in pattern recognition, image processing, computer vision, neuroinformatics, bioinformatics, and other various fields with great success (LeCun et al. 2015; Abiodun et al. 2018).
In practical tasks, neural networks are commonly trained by the error backpropagation (BP) algorithm, the most widely used and successful neural network learning algorithm to date. The BP algorithm is based on the gradient descent strategy, which updates the parameters in the direction of the negative gradient of the objective. To accelerate learning, stochastic gradient descent (SGD) with momentum and adaptive methods such as adaptive gradient (AdaGrad), root mean square propagation (RMSProp), and adaptive moment estimation (Adam) have emerged one after another and made a huge impact. Most of these first-order methods are known to converge to a critical point only if the objective function is convex or smooth. For non-convex and non-smooth functions, it remains unclear how to guarantee convergence even to first- or second-order critical points (Burke et al. 2005). Typical cases are sigmoid networks with absolute or hinge loss functions. The BP algorithm can be applied to these non-convex and non-smooth problems as well, but its convergence theory does not cover them. Moreover, finding globally optimal solutions remains non-trivial for traditional neural network algorithms. Take the state-of-the-art Adam as an example. Its theory is poorly understood in the literature, and it suffers from several deficiencies. For instance, Adam may miss globally optimal solutions (Wilson et al. 2017), and it can fail to converge on some simple test problems (Reddi et al. 2018).
In this paper, we use composite optimization algorithms to train sigmoid networks; see Algorithms 1 and 2 for details. The algorithms are guaranteed to converge, locally or even globally, to a globally optimal solution of the objective function even for non-convex and non-smooth problems. That is the main contribution of this paper. Our work starts from the observation that the sigmoid network problem (2.1) can be equivalently transformed into a convex composite optimization problem (2.2), where the inner function is smooth and the outer function is convex. This provides a new perspective on sigmoid networks. Composite optimization problems arise in many applications in engineering, such as compressed sensing, image processing, machine learning, and artificial intelligence (Boyd et al. 2011; Hong et al. 2017). Composite optimization is an area at the cutting edge of mathematical optimization, and how to solve composite optimization problems efficiently has been a popular subject. For sigmoid networks with the structure (2.2), traditional first-order methods do not take advantage of the convexity of the outer function, which limits them in some practical applications. Composite optimization methods, by contrast, can fully exploit this structure in algorithm design. There are many iterative algorithms with theoretical foundations for the optimization problem (2.2), such as the Gauss-Newton method (GNM, Burke and Ferris 1995), the proximal descent algorithm (ProxDescent, Lewis and Wright 2016), and the linearized proximal algorithms (LPA, Hu et al. 2016). The basic idea of these algorithms is to transform a complex optimization problem into a sequence of simple optimization problems whose optimal solutions are easy to compute or have explicit formulas. The LPA is one of the most advanced algorithms in convex composite optimization. It transforms a non-convex and possibly non-smooth problem into a series of unconstrained strongly convex optimization subproblems, which gives it an attractive computational advantage. The LPA has also been applied with great success to sensor network localization, gene regulatory network inference, and other engineering problems (Hu et al. 2016, 2020; Wang et al. 2017). Therefore, we use the LPA to train sigmoid networks in this paper.
Under the assumptions of weak sharp minima and the regularity condition, we establish the convergence behavior of the algorithms for sigmoid networks; see Theorems 3 and 5 for details. Furthermore, we prove that the weak sharp minima property is often satisfied for sigmoid networks, and that full row rank of the Jacobian matrix of the inner function, that is, a rank equal to the amount of training data, is a sufficient condition for the regularity condition. Hence the convergence results can be directly related to the amount of training data; see Corollaries 4 and 6 for details. This conclusion is of great theoretical and applied significance, especially since it can provide a general guide for setting the size of sigmoid networks. From the full row rank of the Jacobian matrix, we obtain a lower bound on the network size in Corollary 8. We call this lower bound the “adaptive network size”. Our numerical experiments verify that the adaptive network size is sufficient to construct a sigmoid network that solves the problem effectively. Hence Corollary 8 provides a good guide for setting the size of sigmoid networks. The essence is to guarantee that the number of parameters in the network is not smaller than the amount of training data, since a sufficient number of parameters ensures the feasibility of the network. It can also serve as a general guide for setting the size of neural networks. That is another contribution of this paper.
Our work is also motivated by the lack of convex composite optimization algorithms and related software packages for neural networks. To the best of our knowledge, the introduction of convex composite optimization into the area of neural networks has not been addressed in the literature before. This paper is the first piece of work combining neural networks and convex composite optimization.
We organize the paper as follows. In section 2, we introduce the three-layer sigmoid networks and transform the training problem into a convex composite optimization problem. In section 3, we use the LPA-type algorithms to train sigmoid networks and employ the alternating direction method of multipliers (ADMM) to solve the non-smooth convex subproblems. In section 4, we prove some convergence properties of the proposed algorithms. In section 5, we present numerical experiments on Franke’s function fitting and handwritten digit recognition. Finally, we conclude with an outlook in section 6.
2 Sigmoid Networks
To begin with, we introduce the two-layer real-output sigmoid network, which is known to be a universal approximator (Anthony and Bartlett 1999). Using the standard sigmoid function of the form
$$\sigma(t) = \frac{1}{1 + e^{-t}},$$
the sigmoid network computes a function of the form
[TABLE]
where are the output weights, and are the input weights. We define these adjustable parameters by
[TABLE]
where . In the following paragraphs, we replace with . Given a training dataset
[TABLE]
the goal of using this network for a supervised learning problem is to find parameters that minimize some measure of the error of the network output over the training dataset, that is,
[TABLE]
where is a loss function. To simplify the discussions, we focus on three convex loss functions including the quadratic loss function, , the absolute loss function, , and the hinge loss function, .
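For orientation, the model and training problem described in this section can be written out as follows. This is a sketch in assumed notation, not necessarily the authors' own: the symbols $d$ (input dimension), $k$ (number of hidden neurons), $n$ (number of training points), the bias terms $w_0$ and $v_{i,0}$, and the $1/n$ scaling are choices made here, following the standard two-layer form in Anthony and Bartlett (1999).

```latex
% Assumed notation: x in R^d, k hidden neurons, n training pairs (x_j, y_j).
\begin{aligned}
N(x;\theta) &= w_0 + \sum_{i=1}^{k} w_i\,\sigma\!\left(v_i^{\top}x + v_{i,0}\right),
\qquad \theta = (w_0, w_1,\dots,w_k,\ v_{1,0}, v_1,\dots,v_{k,0}, v_k),\\
&\min_{\theta}\ \frac{1}{n}\sum_{j=1}^{n} L\big(N(x_j;\theta),\,y_j\big),\\
L_{\mathrm{quad}}(\hat y,y) &= (\hat y-y)^2,\qquad
L_{\mathrm{abs}}(\hat y,y) = |\hat y-y|,\qquad
L_{\mathrm{hinge}}(\hat y,y) = \max\{0,\ 1-y\hat y\}.
\end{aligned}
```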
The model of the sigmoid network is usually non-convex and non-smooth. Interestingly, we discover that this problem can be seen as a convex composite optimization problem of the form
[TABLE]
where the inner function is smooth, and the outer function is convex. Specifically, for the absolute or quadratic loss functions, we can set
[TABLE]
In general, we replace with . For the hinge loss function, we can set
[TABLE]
where denotes the componentwise non-negative part of . As we can see, all the outer functions are separable and have the form
[TABLE]
where each summand is a convex function of a single variable. This separable structure is a special property of the outer function in sigmoid networks.
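The composite structure described above can be illustrated with a small sketch: a smooth inner map built from the network output, and a convex, separable outer function applied to it. The parameter layout in `network`, the function names, and the `1/n` scaling of the outer functions are assumptions made for this illustration, not the authors' implementation.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def network(theta, X, k):
    """Two-layer sigmoid network output for each row of X.

    Assumed (hypothetical) parameter layout: k output weights, one output
    bias, k*d input weights, then k hidden biases.
    """
    d = X.shape[1]
    w, w0 = theta[:k], theta[k]
    V = theta[k + 1:k + 1 + k * d].reshape(k, d)
    v0 = theta[k + 1 + k * d:]
    return sigmoid(X @ V.T + v0) @ w + w0

def inner_residual(theta, X, y, k):
    """Smooth inner map for the quadratic and absolute losses."""
    return network(theta, X, k) - y

def inner_margin(theta, X, y, k):
    """Smooth inner map for the hinge loss (labels y in {-1, +1})."""
    return 1.0 - y * network(theta, X, k)

# Convex, separable outer functions (the 1/n scaling is an assumption).
def outer_quadratic(z):
    return np.mean(z ** 2)

def outer_absolute(z):
    return np.mean(np.abs(z))

def outer_hinge(z):
    return np.mean(np.maximum(z, 0.0))

X = np.zeros((4, 2))
y = np.array([1.0, -1.0, 1.0, -1.0])
theta = np.zeros(3 * (2 + 2) + 1)   # k = 3 hidden neurons, d = 2 inputs
loss_abs = outer_absolute(inner_residual(theta, X, y, 3))
loss_hinge = outer_hinge(inner_margin(theta, X, y, 3))
```

With all parameters at zero the network outputs zero, so both composite losses reduce to simple averages over the labels. The non-convexity sits entirely in the inner map, while each outer function is convex.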
3 Composite Optimization Algorithms for Sigmoid Networks
In this section, we show how to solve the sigmoid networks based on the composite optimization algorithms including the linearized proximal algorithms (LPA) and the alternating direction method of multipliers (ADMM).
3.1 LPA for Sigmoid Networks. The LPA is one of the most advanced algorithms in convex composite optimization. It was proposed under the inspiration of the GNM and the proximal point algorithm (PPA); it maintains the same convergence rate as those methods while overcoming some of their disadvantages. Each subproblem of the LPA is constructed from a linearized approximation to the composite function and a regularization term at the current iterate. Since the subproblem is an unconstrained strongly convex optimization problem whose optimal solution is global and unique, it is easier to solve than that of the GNM. Consequently, the LPA has an attractive computational advantage, although it is generally not a descent algorithm. The LPA is also connected to other algorithms mentioned in this paper. The ProxDescent for solving (2.2) is a special case of the LPA; because it uses descent directions, the ProxDescent is a descent algorithm. The case in which the inner function is simply the identity mapping has a long history: there, each iterate is obtained by adding to the current point the step that minimizes the outer function plus the proximal regularization term, which is the well-known PPA.
Applying the LPA directly to (2.2), we get the following algorithm for sigmoid networks.
Algorithm 1 . LPA for sigmoid networks.
0: Model , training dataset , outer function , inner function .
1: Initialization: , , , accept false;
2: while not accept do
3: calculate the search direction
[TABLE]
where is the Jacobian matrix of ;
4: if then
5: accept true;
6: end if
7: ;
8: ;
9: end while
10: .
The key step of Algorithm 1 is solving the subproblem (3.1) accurately and efficiently. We now discuss numerical methods for the specific loss functions.
For the quadratic loss function, Algorithm 1 reduces to the well-known Levenberg-Marquardt method for solving the following nonlinear least squares problem of the form
[TABLE]
The smooth convex subproblem can be written as
[TABLE]
and its necessary and sufficient optimality conditions imply that
[TABLE]
Hence the closed formula of the search direction is given by
[TABLE]
where is always a positive-definite and invertible matrix, and is the gradient of the objective function in the original problem. Thus, the iteration can be regarded as a variant of gradient descent algorithm, where is an adaptive learning rate. Moreover, the stopping criterion shows that , which is the first-order necessary condition of the original problem. In section 4, we will give a first-order sufficient condition of the original problem in Theorem 7.
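The closed-form step for the quadratic loss can be sketched in a few lines. This is a minimal illustration under stated assumptions: the outer function is taken as `0.5*||z||^2`, the proximal term as `||d||^2/(2v)` with a fixed `v`, and the names `lpa_quadratic`, `F`, `J`, `v`, and `eps` are hypothetical. The paper's exact scaling and parameter updates may differ.

```python
import numpy as np

def lpa_quadratic(F, J, theta0, v=1.0, eps=1e-8, max_iter=500):
    """LPA-style iteration for min 0.5*||F(theta)||^2 (assumed scaling)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        r, A = F(theta), J(theta)
        grad = A.T @ r                       # gradient of 0.5*||F||^2
        # Minimizer of 0.5*||r + A d||^2 + ||d||^2 / (2 v); the matrix
        # A^T A + I/v is positive definite, so the solve is well posed.
        d = -np.linalg.solve(A.T @ A + np.eye(theta.size) / v, grad)
        theta = theta + d
        if np.linalg.norm(d) <= eps:         # stopping test on the step
            break
    return theta

# Toy inner map F(theta) = (theta_0^2 - 1, theta_1 + 2) with its Jacobian.
F = lambda th: np.array([th[0] ** 2 - 1.0, th[1] + 2.0])
J = lambda th: np.array([[2.0 * th[0], 0.0], [0.0, 1.0]])
theta_hat = lpa_quadratic(F, J, [2.0, 0.0])
```

The step is the negative gradient premultiplied by a positive-definite matrix, which is why the iteration can be read as gradient descent with an adaptive, matrix-valued learning rate.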
3.2 ADMM for Non-Smooth Convex Subproblems. For the non-smooth convex loss functions, the subproblem of Algorithm 1 is more complex, but it remains convex. Many widely used convex optimization methods and heuristic algorithms can solve it, such as gradient or subgradient methods, approximation or composite optimization methods (Bertsekas 2015), and simulated annealing algorithms. Moreover, there are many related software packages that implement these algorithms, such as CVXPY in Python, qpOASES in C++, and the CVX toolbox in Matlab. So it is not difficult to calculate the search direction from the subproblem. We use a mapping A to represent a specific algorithm for solving the subproblem, so that the search direction can be written as
[TABLE]
Here we use the ADMM to solve the subproblem. The ADMM is a simple scheme that often works well and is reliable across a wide range of applications, especially for convex problems. It is also easy to understand and implement for many composite optimization problems with complex structures (Boyd et al. 2011).
The subproblem (3.1) can be seen as an equivalent problem of the form
[TABLE]
The augmented Lagrangian function of the above problem is
[TABLE]
where is the penalty parameter. The ADMM consists of the iterations
[TABLE]
The calculation for is as follows.
[TABLE]
where , and is the proximity operator of with the penalty (Boyd et al. 2011). Specifically, for the absolute loss function, the proximity operator , also called the soft thresholding operator, is defined as
[TABLE]
For the hinge loss function, the proximity operator is defined as
[TABLE]
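The two proximity operators mentioned above have simple elementwise forms. The sketch below uses the standard definitions: soft thresholding for `kappa*|z|`, and the analogous one-sided shrinkage for `kappa*max(z, 0)`. The function names and the use of `kappa` for the threshold are assumptions of this illustration, and the hinge outer function is assumed to be the componentwise non-negative part.

```python
import numpy as np

def prox_abs(a, kappa):
    """Soft thresholding: argmin_z kappa*|z| + 0.5*(z - a)^2, elementwise."""
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)

def prox_pos(a, kappa):
    """argmin_z kappa*max(z, 0) + 0.5*(z - a)^2, elementwise.

    Shrinks entries above kappa, maps entries in [0, kappa] to zero,
    and leaves negative entries unchanged.
    """
    return np.where(a > kappa, a - kappa, np.minimum(a, 0.0))

shrunk_abs = prox_abs(np.array([3.0, -3.0, 0.5]), 1.0)
shrunk_pos = prox_pos(np.array([3.0, 0.5, -2.0]), 1.0)
```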
The calculation for is as follows. Since
[TABLE]
by its necessary and sufficient optimality conditions, we obtain that
[TABLE]
As we can see, the iterations of the ADMM for the non-smooth convex subproblems have explicit formulas, which is one of the advantages of the ADMM. Defining the primal residual of the optimality conditions at iteration as
[TABLE]
and the dual residual at iteration as
[TABLE]
we set the stopping criterion as and .
Algorithm A∗ . ADMM for non-smooth convex subproblems.
0: Numbers and , matrices and , non-smooth convex function .
1: Initialization: , , , , ;
2: repeat
3: ;
4: calculate from (3.3);
5: calculate from (3.4);
6: calculate from (3.2);
7: calculate and from (3.5) and (3.6), respectively;
8: until and ;
9: .
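The ADMM loop above can be sketched for the absolute loss as follows. This is an illustration under assumptions, not the authors' code: the subproblem is taken as `min_d ||r + A d||_1 + ||d||^2/(2v)` with the splitting `z = r + A d`, the multiplier is kept in scaled form, and the residuals follow the generic definitions in Boyd et al. (2011), which may differ in detail from (3.5) and (3.6). The names `admm_l1_subproblem`, `rho`, `v`, and `eps` are hypothetical.

```python
import numpy as np

def soft_threshold(a, kappa):
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)

def admm_l1_subproblem(r, A, v=1.0, rho=1.0, eps=1e-8, max_iter=1000):
    """Solve min_d ||r + A d||_1 + ||d||^2/(2v) via the splitting z = r + A d."""
    n, p = A.shape
    d, u = np.zeros(p), np.zeros(n)           # u is the scaled multiplier
    M = np.eye(p) / v + rho * A.T @ A         # positive definite, fixed
    for _ in range(max_iter):
        z = soft_threshold(r + A @ d + u, 1.0 / rho)          # prox step
        d_new = np.linalg.solve(M, rho * A.T @ (z - r - u))   # linear solve
        primal = r + A @ d_new - z                            # primal residual
        dual = rho * A @ (d_new - d)                          # dual residual
        u = u + primal                                        # multiplier update
        d = d_new
        if np.linalg.norm(primal) <= eps and np.linalg.norm(dual) <= eps:
            break
    return d

d_hat = admm_l1_subproblem(np.array([3.0, -0.2]), np.eye(2))
```

Every update has an explicit formula: one soft-thresholding step, one linear solve with a fixed positive-definite matrix, and one addition. In this toy case the problem separates by coordinate, and the minimizer can be checked by hand to be `(-1, 0.2)`.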
3.3 A Globalization Strategy for Algorithm 1. We next present the globalized LPA (GLPA), which adopts a backtracking line search as its globalization strategy. The stepsize is chosen by the backtracking line search, which guarantees a monotone decrease of the objective function at each iteration. As a result, the GLPA is a descent algorithm. In the implementation, the backtracking strategy finds the first point satisfying the inequality (3.7) by decreasing the trial stepsize geometrically, which makes the accepted stepsize as large as possible while retaining the descent property.
Algorithm 2 . GLPA for sigmoid networks.
0: Model , training dataset , outer function , inner function .
1: Initialization: , , , , accept false;
2: while not accept do
3: calculate the search direction
[TABLE]
4: if then
5: accept true;
6: end if
7: ;
8: repeat
9:
10: until
[TABLE]
11: ;
12: ;
13: ;
14: end while
15: .
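The backtracking step of the GLPA can be sketched as below. The acceptance test used here is an Armijo-type condition on the predicted decrease of the linearized model, in the style of Burke and Ferris (1995). It is an assumption standing in for the inequality (3.7), whose exact form is not reproduced here, and the names `gamma`, `c`, and `max_backtracks` are hypothetical.

```python
import numpy as np

def backtracking_stepsize(h, F, theta, d, A, gamma=0.5, c=1e-3, max_backtracks=20):
    """Largest t in {1, gamma, gamma^2, ...} passing an Armijo-type test."""
    r = F(theta)
    # Decrease predicted by the linearized model; it is non-positive when d
    # minimizes h(r + A d) plus a proximal term.
    predicted = h(r + A @ d) - h(r)
    t = 1.0
    for _ in range(max_backtracks):
        if h(F(theta + t * d)) <= h(r) + c * t * predicted:
            break
        t *= gamma
    return t

# Toy case where the full step overshoots: F = arctan, h = l1 norm.
h = lambda z: np.sum(np.abs(z))
F = lambda th: np.arctan(th)
theta = np.array([3.0])
A = np.array([[0.1]])                 # derivative of arctan at 3 is 1/(1+9)
d = -F(theta) / 0.1                   # full Gauss-Newton-type step
t = backtracking_stepsize(h, F, theta, d, A)
```

In this example the trial stepsizes 1 and 0.5 both increase the objective, so the search halves the stepsize twice before accepting it.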
4 Convergence Analysis
In this section, we prove some convergence properties of the proposed algorithms under the assumptions of the weak sharp minima and the regularity condition or full row rank of the Jacobian matrix, a stronger condition. Before giving the main results, we introduce the following useful definitions and lemmas.
4.1 Theoretical Foundations of LPA-type Algorithms. Here we consider the convex composite optimization of the form
[TABLE]
where the inner function is continuously differentiable, and the outer function is convex. It is a more general mathematical form of the problem (2.2).
First, we introduce the concept of the Lipschitz continuous gradient, which has played an important role in investigating the convergence behavior of many optimization algorithms. For a differentiable function and , if there exists an such that
[TABLE]
we say that is K-smooth or has a Lipschitz continuous gradient with modulus on .
Next, we give the notion of the weak sharp minima introduced in (Burke and Ferris 1993), which has far-reaching consequences for the convergence analysis of many iterative procedures. For a function , the minimum value and the set of minima for , denoted by and , are defined by
[TABLE]
Let , if there exist and such that
[TABLE]
where , then we say that is the set of weak sharp minima of order for on with modulus .
We now introduce the regularity condition proposed in (Burke and Ferris 1995), which is a crucial assumption applied to establish the convergence of several convex composite optimization algorithms. Let and be defined by (4.1), then a point is said to be a regular point of the inclusion if
[TABLE]
where ker() is the nullspace of , and is the negative polar of .
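For reference, the three notions introduced above are commonly stated as follows in the cited literature. This is a sketch in assumed notation (the symbols $D$, $S$, $C$, $K$, $\alpha$, $\eta$, and $\bar{x}$ are choices made here), not a verbatim reproduction of the paper's displays.

```latex
\begin{aligned}
&\text{$K$-smoothness on } D: &&
\|\nabla f(x) - \nabla f(y)\| \le K\,\|x - y\| \quad \text{for all } x, y \in D,\\
&\text{minimum value and minimizers:} &&
h_{\min} = \min_{y} h(y), \qquad C = \operatorname*{argmin}_{y} h(y),\\
&\text{weak sharp minima of order } \alpha: &&
h(y) \ge h_{\min} + \eta\,\operatorname{dist}^{\alpha}(y, C) \quad \text{for all } y \in S,
\quad \operatorname{dist}(y, C) = \inf_{c \in C}\|y - c\|,\\
&\text{regular point } \bar{x} \text{ of } F(x) \in C: &&
\ker\!\big(F'(\bar{x})^{\top}\big) \cap \big(C - F(\bar{x})\big)^{\ominus} = \{0\}.
\end{aligned}
```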
In the following lemmas, we give the local convergence of the LPA and the global convergence of the GLPA for solving optimization (4.1). They are based on three main conditions including Lipschitz continuous gradient, weak sharp minima and quasi-regularity or regularity condition. Note that the definition of quasi-regularity condition will only be described in the proof of Theorem 3. Since this condition is hard to verify in practice, we replace it with the regularity condition in the related theorem for sigmoid networks.
Lemma 1 (Hu et al. 2016, Corollary 14). Let satisfy , and let be the set of weak sharp minima of order for near with constant . Suppose that is continuously differentiable with a Lipschitz continuous gradient near , and that is a quasi-regular point of the inclusion with constant . Suppose further that or the stepsize (if ). Then there exists a neighborhood of such that, for any , the sequence generated by the LPA with initial point converges at a rate of to a solution satisfying .
Lemma 2 (Hu et al. 2016, Theorem 18). Let be a sequence generated by the GLPA and assume that has a cluster point . Suppose that and that is the set of weak sharp minima of order for near . Suppose further that is continuously differentiable with a Lipschitz continuous gradient near , and that is a regular point of the inclusion. Then , and converges to at a rate of .
Note that in Lemma 2 is slightly different from in Lemma 1, but both of them can find a globally optimal solution to optimization (4.1) since , equivalently, .
4.2 Convergence Analysis for Sigmoid Networks. Let denote an open ball of radius centered at , then we establish the local convergence of Algorithm 1 by virtue of Lemma 1.
Theorem 3 (Local Convergence). Let and . Let be a sequence generated by Algorithm 1, and be such that and is the set of weak sharp minima of order for on . If is a regular point of the inclusion, then there exist and such that for any and initial point , the sequence converges at a rate of to a globally optimal solution and .
Proof.
According to the assumptions of Lemma 1, we need to verify the following four conditions.
(i) Quasi-regularity condition. By Proposition 3.3 in (Burke and Ferris 1995), we know that any regular point of the inclusion is also a quasi-regular point. Since is a regular point, is also a quasi-regular point of the inclusion , namely there exist and such that
[TABLE]
where is the solution set of the linearized inclusion .
(ii) Weak sharp minima. In particular, we set . Naturally, is the set of local weak sharp minima of order for on with constant for some , due to the assumption and definition of the weak sharp minima.
(iii) Lipschitz continuous gradient. Note that a differentiable function with a Lipschitz continuous gradient is second-order differentiable almost everywhere on . If is a second-order differentiable function, by the differential mean value theorem, it is obvious that the -smoothness of is equivalent to the boundedness of , that is, for each . On the other hand, since defined by (2.2) is smooth on , is continuous on . Naturally, is bounded on the bounded subset . Therefore, is continuously differentiable with a Lipschitz continuous gradient on .
(iv) Large stepsize. If , we set ; otherwise, set .
Hence, Lemma 1 is applicable and the conclusion follows. ∎
Furthermore, we analyze the convergence properties of Algorithm 1 for the three common sigmoid networks.
Corollary 4. Let be a sequence generated by Algorithm 1, and be such that . If has full row rank, then there exists an such that for any initial point , we have
(i) for the sigmoid networks with the quadratic loss function, the sequence linearly converges to a globally optimal solution and , if is sufficiently large;
(ii) for the sigmoid networks with the absolute loss function, the sequence quadratically converges to a globally optimal solution and ;
(iii) for the sigmoid networks with the hinge loss function, the sequence quadratically converges to a globally optimal solution and .
Proof.
According to the assumptions of Theorem 3, we need to verify the following two conditions.
(a) Regularity condition. Since the system of linear equations has only the zero solution if and only if the matrix has full column rank, with full row rank is equivalent to . Then, it follows that
[TABLE]
Therefore, the regularity condition is satisfied.
(b) Weak sharp minima. Note that ; for the quadratic or absolute loss functions, and for the hinge loss function.
(i) In the case when , for each . By the definition of weak sharp minima, we know that is the set of weak sharp minima of order for on with modulus .
(ii) In the case when , for each . In the same way, it shows that is the set of weak sharp minima of order for on with modulus .
(iii) In the case when , for each , which implies that is the set of weak sharp minima of order for on with modulus . Therefore, the local weak sharp minima is satisfied for the three common sigmoid networks.
Hence, Theorem 3 is applicable and the conclusion follows. ∎
As we have seen, the weak sharp minima property is often satisfied for sigmoid networks, and its order determines the convergence rate of the algorithm. Notably, a first-order algorithm can even attain a second-order convergence rate. In the following paragraphs, we establish the global convergence of Algorithm 2 by virtue of Lemma 2.
Theorem 5 (Global Convergence). Let and . Let be a sequence generated by Algorithm 2, and have a cluster point such that is the set of weak sharp minima of order for on . If is a regular point of the inclusion, then converges at a rate of to a globally optimal solution and .
Proof.
According to the assumptions of Lemma 2, we need to verify the following three conditions.
(i) Regularity condition. Since the cluster point is a regular point of the inclusion , the regularity condition is satisfied.
(ii) Weak sharp minima. Since is the set of weak sharp minima of order for on for some and , the local weak sharp minima is satisfied.
(iii) Lipschitz continuous gradient. By (iii) in the proof of Theorem 3, we know that is continuously differentiable with a Lipschitz continuous gradient on .
Hence, Lemma 2 is applicable and the conclusion follows. ∎
We can see that Algorithm 2 has the same conclusion and convergence rate as Algorithm 1 under the same assumptions. Next, we show the global convergence of Algorithm 2 for two non-convex and non-smooth sigmoid networks.
Corollary 6. Let be a sequence generated by Algorithm 2 for the sigmoid networks with absolute or hinge loss functions, and have a cluster point . If has full row rank, then quadratically converges to a globally optimal solution and .
Proof.
According to the assumptions of Theorem 5, we need to verify the following two conditions.
(a) Regularity condition. By (a) in the proof of Corollary 4, the full row rank of implies that the cluster point is a regular point of the inclusion. Therefore, the regularity condition is satisfied.
(b) Weak sharp minima. By (b) in the proof of Corollary 4, we know that is the set of weak sharp minima of order 1 for on with modulus . Therefore, the local weak sharp minima is satisfied for the two sigmoid networks.
Hence, Theorem 5 is applicable and the conclusion follows. ∎
Note that full row rank of the Jacobian matrix, that is, a rank equal to the amount of training data, is a sufficient condition for the regularity condition; it is also a necessary condition when is a singleton set and . Hence the convergence results can be directly related to the amount of training data. Next, we show the following convergence property of the LPA-type algorithms in a finite number of iterations.
Theorem 7 (Sufficient Condition). If the LPA-type algorithm stops at the th iteration with , then is a globally optimal solution to the convex composite optimization (4.1) and .
Proof.
Since the subproblem of the LPA-type algorithms is an unconstrained convex optimization problem, its necessary and sufficient optimality conditions imply that
[TABLE]
where is the subdifferential of the convex function . The stopping criterion of the algorithms shows that
[TABLE]
By , equivalently, the full column rank of , it follows that
[TABLE]
By the necessary and sufficient optimality conditions of the convex optimization, it shows that is a globally optimal solution to , equivalently, . Hence the proof is complete. ∎
Theorem 7 also shows that is the first-order sufficient condition of sigmoid networks when the LPA-type algorithm stops at the th iteration. It is no surprise that there is a unified conclusion on the non-convex and possibly non-smooth sigmoid networks, thanks to the unified composite optimization framework and the convex subproblem.
We have seen that the full row rank is a critical condition for the convergence analysis of sigmoid networks. This condition is of great theoretical and applied significance, especially since it can provide a general guide for setting the network size. In order to guarantee the reliability of the algorithm, we can ensure that is of full row rank, which implies that , where is the dimension of the input, and is the number of hidden neurons. So we have the following corollary.
Corollary 8. If , then we have a lower bound on the network size given by
[TABLE]
Clearly, the lower bound on the network size is directly proportional to the amount of training data and inversely proportional to the dimension of the input. That is, the lower bound on the network size is adapted to the problem size, so we call this lower bound the “adaptive network size”. Moreover, each row of the Jacobian matrix is the gradient of the fitting function at the corresponding data point. In a general sense, as the number of hidden neurons increases, the information contained in the gradient increases. As a result, the rank of the Jacobian matrix will also increase or be equal to . Thus, the full row rank of can be satisfied in a theoretical sense by choosing the network size sufficiently large. In conclusion, the LPA-type algorithms are almost always reliable.
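The counting argument behind the adaptive network size can be sketched numerically. A Jacobian with one row per training point can have full row rank only if it has at least as many columns (parameters) as rows. The parameter count used below, `k*(d + 2) + 1`, assumes one bias per hidden neuron and one output bias. That count and the function names are assumptions of this illustration, so the resulting bound may differ slightly from (4.2).

```python
import math

def num_parameters(k, d):
    """Parameter count for k hidden neurons and d inputs, assuming one bias
    per hidden neuron and one output bias (hypothetical layout)."""
    return k * (d + 2) + 1

def adaptive_network_size(n, d):
    """Smallest k such that num_parameters(k, d) >= n."""
    return max(1, math.ceil((n - 1) / (d + 2)))

# Franke's function experiment: 289 training points, 2 inputs.
k_franke = adaptive_network_size(289, 2)
```

Under this count, the bound grows linearly with the number of training points and shrinks as the input dimension grows, in line with the discussion above.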
5 Numerical Experiment
Sigmoid networks are often used to solve regression and classification tasks, so we shall use our algorithms for both tasks. We train the sigmoid networks on the training dataset and demonstrate the performance on the test dataset. Note that we will use the adaptive network size, namely the lower bound on the network size given by Corollary 8, to build the sigmoid networks, which is sufficient to solve problems effectively.
5.1 Regression on Scattered Data. Franke’s function is a standard test function for 2D scattered data fitting of the form
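$$f(x_1, x_2) = \frac{3}{4} e^{-\frac{(9x_1-2)^2 + (9x_2-2)^2}{4}} + \frac{3}{4} e^{-\frac{(9x_1+1)^2}{49} - \frac{9x_2+1}{10}} + \frac{1}{2} e^{-\frac{(9x_1-7)^2 + (9x_2-3)^2}{4}} - \frac{1}{5} e^{-(9x_1-4)^2 - (9x_2-7)^2},$$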
and its graph on the unit square is shown on the left of Figure 1. One can see that Franke’s function is a complicated function with two Gaussian peaks and a small trough. We generate 289 training data points and 121 test data points using the Halton sequence. The points are uniformly distributed in the unit square, and the result is shown on the right of Figure 1.
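The data generation described above can be sketched in pure Python. Franke's function is given in its commonly cited form. The Halton points use the plain, unscrambled radical-inverse construction in bases 2 and 3, starting from index 1, which is an assumption here since the exact variant used in the experiments is not stated. The function names are hypothetical.

```python
import math

def franke(x1, x2):
    """Franke's test function on the unit square (commonly cited form)."""
    return (0.75 * math.exp(-((9 * x1 - 2) ** 2 + (9 * x2 - 2) ** 2) / 4)
            + 0.75 * math.exp(-(9 * x1 + 1) ** 2 / 49 - (9 * x2 + 1) / 10)
            + 0.5 * math.exp(-((9 * x1 - 7) ** 2 + (9 * x2 - 3) ** 2) / 4)
            - 0.2 * math.exp(-(9 * x1 - 4) ** 2 - (9 * x2 - 7) ** 2))

def radical_inverse(i, base):
    """Van der Corput radical inverse of the integer i in the given base."""
    x, f = 0.0, 1.0 / base
    while i > 0:
        x += (i % base) * f
        i //= base
        f /= base
    return x

def halton_2d(n):
    """First n points of the 2D Halton sequence (bases 2 and 3)."""
    return [(radical_inverse(i, 2), radical_inverse(i, 3)) for i in range(1, n + 1)]

train_points = halton_2d(289)
train_values = [franke(x1, x2) for x1, x2 in train_points]
```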
Considering the observational errors, we also add small white Gaussian noise to the training data to reflect the real case, that is, , where is a Gaussian distribution with a mean of [math] and a standard deviation of . All numerical experiments are implemented in Python 3.9. We generate the positive Gaussian noise using . The performance measure we choose for the regression task is the root mean squared error (RMS-error):
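$$\text{RMS-error} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^2},$$
where $\hat{y}_i$ is the predicted value, $y_i$ is the actual value, and $m$ is the number of evaluation points.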
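The error measures reported in the tables can be computed as in the sketch below. The RMS-error follows the definition above. The max-error is assumed here to be the largest absolute deviation, since the text does not define it. Function names are hypothetical.

```python
import math

def rms_error(pred, actual):
    """Root mean squared error between predictions and actual values."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))

def max_error(pred, actual):
    """Largest absolute deviation (assumed definition of Max-error)."""
    return max(abs(p - a) for p, a in zip(pred, actual))

rms = rms_error([1.0, 2.0, 4.0], [1.0, 2.0, 1.0])
mx = max_error([1.0, 2.0, 4.0], [1.0, 2.0, 1.0])
```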
When implementing the LPA-type algorithms for the sigmoid networks with a quadratic loss function, we set , , and the stopping criterion as 1e-2. For the inequality (3.7) in Algorithm 2, we set , e-3, and the maximum number of iterations for the backtracking line-search as (indeed, one iteration is enough in most cases, that is, is often used). According to (4.2), we can set to guarantee the reliability of the algorithms. For the case when and 1e5, the performance of the algorithms is shown in Table 1 and Figure 2.
As we can see, the LPA-type algorithms solve the regression tasks well and remain robust when the training data are perturbed by noise with a mean of 2.0094e-3 and a maximum of 3.9894e-3. The training loss is less than 5.2940e-6 in all test cases. In other words, our algorithms can obtain an ideal solution for this task. We find that the objective function decreases monotonically at almost every iteration of the LPA, so the LPA behaves almost like a descent algorithm. Across multiple experiments, we also find that the performance of the LPA depends on the choice of the initial point, whereas the GLPA is not affected by it. We therefore conjecture that the GLPA for sigmoid networks with the quadratic loss function converges globally under certain conditions. This will be explored in our future work.
Indeed, the LPA-type algorithms using small-scale networks can solve the problem as well. The illustration is shown on the left of Figure 2. Moreover, the performance of the algorithms is also affected by the stepsize of the subproblem. This is shown on the right of Figure 2.
Corollary 6 shows that Algorithm 3 using absolute or hinge loss functions can converge globally. For simplicity, the rest of this section is devoted to demonstrating the performance of Algorithm 3. When implementing the GLPA for the sigmoid networks with an absolute loss function, we still use the same parameter values as in the previous experiments. For Algorithm 3, we set 1e-2, , , and the maximum number of ADMM iterations as 20. For the case when and 1e5, the performance of the algorithm is shown in Table 2 and Figure 4.
The training loss in both experiments is less than 7.7930e-7, which shows that the GLPA obtains a better solution for sigmoid networks. Obviously, this result is more in line with the actual needs of regression tasks.
5.2 Classification on Handwritten Digits. The digits dataset from scikit-learn contains 1797 samples, each with 64 elements corresponding to an image of 8×8 pixels, and with target attribute 0, 1, ..., 9. Some of the samples are shown in Figure 5.
We create four binary classification tasks, each to classify two digits: 0 and 1; 2 and 5; 3 and 7; 6 and 9. For each task, we take 70% of the selected samples as the training data and the rest as the test data. Here we run four algorithms on these tasks, including the GLPA and three other popular and practical tools in the machine learning community, SGDM, RMSProp and Adam. We also use the same parameter settings for the GLPA as the previous experiments. The only difference is that we set by Corollary 8 and the number of ADMM iterations does not exceed 10. For the other algorithms, implemented with PyTorch, we set the learning rate as 1e-3, the momentum as 0.9, and the number of iterations as 1000. For the case when , the running results of the four algorithms are shown in Table 3 and Figure LABEL:hinge_loss.
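The construction of one binary task can be sketched as follows. This is only an illustration under stated assumptions: the two digits are relabeled as -1 and +1 (a common convention for the hinge loss), and the 70/30 split is taken after a seeded shuffle. The text specifies neither choice. The function `make_binary_task` is hypothetical and works on any sequences of samples and integer labels.

```python
import random


def make_binary_task(samples, labels, digit_a, digit_b,
                     train_fraction=0.7, seed=0):
    """Keep two digit classes, relabel them -1/+1, and split into train/test."""
    pairs = [(x, -1 if y == digit_a else 1)
             for x, y in zip(samples, labels)
             if y in (digit_a, digit_b)]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(pairs)
    n_train = round(train_fraction * len(pairs))
    return pairs[:n_train], pairs[n_train:]


# Synthetic stand-in for the digits data: 100 samples of 64 features each.
demo_samples = [[float(i)] * 64 for i in range(100)]
demo_labels = [i % 10 for i in range(100)]
train, test = make_binary_task(demo_samples, demo_labels, 0, 1)
```

With the real data, `samples` and `labels` would be the `data` and `target` arrays returned by `sklearn.datasets.load_digits()`.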
The running results indicate three observations: (i) The small training loss shows that the GLPA obtains excellent solutions to classification problems, and its training loss is generally smaller than that of the other algorithms. (ii) The GLPA requires far fewer iterations, thanks to its quadratic convergence rate in this case. It is striking that a first-order algorithm such as the GLPA attains a second-order convergence rate. (iii) The adaptive network size given by Corollary 8 is sufficient to construct an ideal sigmoid network that solves the problem effectively. Hence Corollary 8 does provide a good guide for setting the size of sigmoid networks.
The essence of Corollary 8 is to guarantee that the number of parameters in the neural network is not smaller than the amount of training data, since a sufficient number of parameters ensures the feasibility of the network. In our view, it is as if the information of each data point could be extracted by a single parameter in the model. Inspired by this, we think the corollary can also serve as a general guide for setting the size of neural networks. Choosing the number of hidden neurons is well known to be an open problem, and in practice it is usually adjusted by trial and error. We therefore suggest starting the trial and error from the adaptive network size, which avoids a blind initial guess. This general rule deserves to be tried and further verified in practice.
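The parameter-counting idea can be illustrated with a short sketch. It does not reproduce the exact bound of Corollary 8, which is not restated here. It assumes a single hidden layer in which each neuron carries one weight per input, one bias, and one output weight, with no output bias, and returns the smallest width whose parameter count reaches the number of training points. Both function names are hypothetical.

```python
def parameter_count(hidden, input_dim):
    """Parameter count under the stated assumption.

    Each hidden neuron has input_dim weights, one bias, and one output weight.
    """
    return hidden * (input_dim + 2)


def smallest_hidden_size(n_samples, input_dim):
    """Smallest hidden width with parameter_count >= n_samples."""
    per_neuron = input_dim + 2
    return -(-n_samples // per_neuron)  # integer ceiling division
```

Under this assumed parameterization the resulting width grows linearly with the amount of training data and shrinks as the input dimension grows, which matches the qualitative behavior described for the adaptive network size.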
6 Future Work
Although we only show the composite optimization algorithms for the three-layer sigmoid networks, our algorithms are also applicable to the more complex sigmoid networks, such as the sigmoid networks with multiple hidden layers, with multiple outputs, and with output layer neurons that are processed with sigmoid functions. In the design of model (2.2), the convexity of the outer function is due to the convex loss function , and the smoothness of the inner function is due to the smooth fitting function . So the algorithms can be used to solve the sigmoid networks whenever we maintain the convexity of and the smoothness of (note that is always smooth in sigmoid networks). It is not difficult to solve the general sigmoid networks with convex loss functions using our algorithms by setting the same form of and as the case of one hidden layer. As a matter of fact, the composite structure (2.2) can provide a unified framework for the development and analysis of sigmoid networks, especially for the non-convex and non-smooth optimization problems. Moreover, the various composite structures in neural networks pose more challenges for the study of composite optimization algorithms. The breakthrough of composite optimization algorithms will also drive the development of neural network learning algorithms. Last but not least, the convergence results of convex composite optimization (4.1) in the literature all seem to be established on . While the more general convergence theorems should be established possibly on , which is still an open problem in the area of composite optimization. In view of this, we will explore this issue further.
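The composite structure discussed above, a convex outer loss applied to a smooth inner network map, can be sketched as follows. This is an illustrative reading rather than the exact model (2.2): the parameter layout (a list of `(weights, bias, out_weight)` tuples), the averaging in each loss, and all function names are assumptions.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def network_outputs(params, inputs):
    """Smooth inner map: outputs of a one-hidden-layer sigmoid network.

    params is a list of (weights, bias, out_weight) tuples, one per neuron.
    """
    return [sum(c * sigmoid(sum(w * x for w, x in zip(ws, point)) + b)
                for ws, b, c in params)
            for point in inputs]


# Convex outer functions; the first is smooth, the other two are non-smooth.
def quadratic_loss(outputs, targets):
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(targets)


def absolute_loss(outputs, targets):
    return sum(abs(o - t) for o, t in zip(outputs, targets)) / len(targets)


def hinge_loss(outputs, targets):
    """Targets are assumed to be -1 or +1."""
    return sum(max(0.0, 1.0 - t * o) for o, t in zip(outputs, targets)) / len(targets)


def composite_objective(loss, params, inputs, targets):
    """Objective of the form loss(network(params)): convex outer, smooth inner."""
    return loss(network_outputs(params, inputs), targets)
```

Swapping the loss changes only the outer function, while the inner map and its smoothness are untouched.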
Acknowledgments
The research was supported in part by the National Natural Science Foundation of China under grants 12071157 and 12026602, and the Natural Science Foundation of Guangdong 2020B1515310013. Qi Ye is the corresponding author.
