Bias of Homotopic Gradient Descent for the Hinge Loss
Denali Molitor, Deanna Needell, Rachel Ward

TL;DR
This paper investigates the convergence behavior of a homotopic gradient descent method applied to the hinge loss in linear classifiers, providing explicit rates towards the max-margin solution for separable data.
Contribution
It introduces a homotopic gradient descent approach for the hinge loss and establishes explicit convergence rates to the max-margin solution in linearly separable data.
Findings
Convergence to max-margin solution is achieved with explicit rates.
Homotopic gradient descent effectively handles non-smooth hinge loss.
Theoretical analysis extends understanding of gradient methods for non-smooth losses.
Abstract
Gradient descent is a simple and widely used optimization method for machine learning. For homogeneous linear classifiers applied to separable data, gradient descent has been shown to converge to the maximal margin (or equivalently, the minimal norm) solution for various smooth loss functions. The previous theory does not, however, apply to non-smooth functions such as the hinge loss which is widely used in practice. Here, we study the convergence of a homotopic variant of gradient descent applied to the hinge loss and provide explicit convergence rates to the max-margin solution for linearly separable data.
| | Algorithm 1 | [SHN+18] |
|---|---|---|
| Angle gap | | |
| Margin gap | | |
Bias of Homotopic Gradient Descent for the Hinge Loss
Denali Molitor
Department of Mathematics, University of California at Los Angeles, Los Angeles, California, USA
Deanna Needell
Department of Mathematics, University of California at Los Angeles, Los Angeles, California, USA
Rachel Ward
Department of Mathematics, University of Texas at Austin, Austin, Texas, USA
1 Introduction
Several recent works suggest that the optimization methods used in training models affect the model’s ability to generalize through implicit biases toward certain solutions [ZBH+17, NTS14, HRS16, HHS17, PKL+17, PLM+18, CCS+17, CPS+18]. In order to understand the effects of optimization methods in more complex and often non-convex settings such as for neural networks, it is natural to first understand their behavior in simpler settings, such as for least squares regression, logistic regression, and support vector machines (SVM) [SHN+18, NLG+19, GLSS18]. In particular, gradient descent and its many variants, including the subgradient method, are popular choices for optimizing machine learning models and thus warrant careful study.
It was recently shown that gradient descent applied to the (unregularized) logistic regression problem for linearly separable data converges to the solution with maximal margin, while other choices of optimization method converge to different solutions [SHN+18]. Convergence to the maximal-margin solution is desirable, as the margin is an important quantity for deriving generalization guarantees [BST99, Vap82, Vap99, VC74, Vap13]. The analysis of Soudry et al. [SHN+18] extends to additional loss functions, but requires particular properties, including smoothness and monotonicity. These assumptions do not hold, however, for non-differentiable functions such as the hinge loss objective, which is the loss function used in training SVMs [CV95].
Here, we analyze the convergence to the maximal margin solution of a homotopic subgradient method applied to the non-smooth hinge loss. In particular we consider a method in which a number of subgradient updates are applied to the hinge loss with decreasing regularization. Although it is well known that the exact solutions of the regularized hinge loss converge to the hard-margin SVM solution as the regularization decreases to zero in the linearly separable case [RZH04, HRTZ04], we are unaware of results that provide explicit convergence rates for an iterative optimization algorithm, such as the subgradient method, that converges to the hard-margin SVM solution in a single pass of the regularization parameter . We provide such an analysis here, and demonstrate that the iterates of an averaged subgradient method applied to the regularized SVM loss with shrinking regularization parameters converge to the max-margin solution at a rate of for linearly separable data, where is any small positive constant.
For linearly separable data there exists such that the solution to the hinge loss with regularization parameter is equal to the true, hard-margin solution for all [RZH04, HRTZ04]. While is constant for a fixed problem, knowing its value in advance is typically unrealistic. Additionally, if the data is not well separated, can be very small. The homotopic subgradient method analyzed here depends on the value of and converges at a rate of . If one were to know the appropriate regularization parameter in advance, the averaged subgradient method with appropriate fixed step sizes would converge in error at a rate of . This rate can be improved to by using weighted step sizes that depend on [B+15, LJSB12]. Thus, we pay a small price for the shrinking regularization routine and for not knowing the value of in advance. We additionally provide faster convergence guarantees and improved convergence results for the proposed method on small datasets as compared to gradient descent applied to the logistic loss with fixed step sizes [SHN+18].
1.1 Contributions
While several works analyze the convergence of various optimization methods to the maximal-margin solution for separable data [SHN+18, NLG+19], we are unaware of any works that provide explicit convergence rates for the fundamental subgradient descent method. Convergence of the subgradient method and the stochastic subgradient method has been analyzed for non-smooth convex functions; however, these works only provide convergence guarantees in the loss-function values and not the iterates, as, for general convex functions, the minimizer may not be finite and may not be unique [SZ13, Zha04]. In the context of solving the hard-margin SVM, the restriction to linearly separable data guarantees the existence of a minimizer, and considering the maximal-margin solution ensures uniqueness. Moreover, in the context of general convex functions, previous works often use the projected subgradient method and require knowledge of a bounded domain in which a minimizer exists [SZ13, B+15]. For solving the hard-margin SVM via gradient descent, we show that such a projection is unnecessary.
Here, we provide explicit convergence guarantees for a homotopic subgradient method for optimizing the non-smooth SVM hinge loss. The proposed method uses decreasing regularization parameters and leads to the hard-margin SVM solution. We study the effects of optimization via this method on the generalization ability of the learned solutions through proven convergence rates to the hard-margin SVM solution in terms of error as well as difference in angle and margin from the true solution. We additionally show that these convergence rates to the hard-margin SVM solution outpace recent results such as gradient descent with fixed step sizes applied to the logistic loss [SHN+18, NLG+19]. We demonstrate the convergence of the proposed method on a synthetic dataset.
1.2 Organization
In Section 2, we introduce the specific problem setting, the notation that will be used throughout, and the proposed optimization scheme, Algorithm 1. Section 3 provides the main convergence results for Algorithm 1. An outline for the proof of the main convergence theorem, Theorem 3.1, is provided in Section 4, with additional details in Appendix A. We test convergence properties of Algorithm 1 for a simple synthetic dataset in Section 5. Section 6 provides additional implementation details for Algorithm 1 as well as possible modifications and extensions.
2 Problem Setup
We consider the binary classification problem with data , where are the data points and their labels. We aim to classify the data via a homogeneous linear SVM. Specifically, we wish to identify a weight vector that satisfies
[TABLE]
Throughout, we write . We can equivalently find
[TABLE]
where
[TABLE]
The function is commonly referred to as the hinge loss. We assume throughout that the data is linearly separable, i.e. there exists a vector satisfying as is done in [SHN+18, NLG+19, WGC19, BGMS18, NSS19, RZH04]. This assumption is common and necessary in order to discuss the margin of the approximated solutions. Minimizing the norm of the solution to corresponds to maximizing the margin, that is maximizing the minimal distance between any data point and the separating hyperplane determined by . In this setting, the solution to Equation 1, , is often referred to as the hard-margin SVM solution.
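As a concrete illustration of this setup, the following minimal sketch evaluates the hinge loss on a toy data set and checks the margin constraints of the hard-margin problem. The data `X`, `y`, the two candidate vectors, and the choice to sum (rather than average) the per-point losses are assumptions made for illustration and are not taken from the source.

```python
def dot(u, v):
    """Inner product of two equal-length vectors stored as lists."""
    return sum(a * b for a, b in zip(u, v))


def hinge(z):
    """Hinge loss: zero once the margin z reaches 1, linear below it."""
    return max(0.0, 1.0 - z)


def total_hinge(w, X, y):
    """Summed hinge loss of the homogeneous linear classifier w."""
    return sum(hinge(yi * dot(xi, w)) for xi, yi in zip(X, y))


def satisfies_margin_constraints(w, X, y):
    """True when every point has margin y_i * <x_i, w> >= 1."""
    return all(yi * dot(xi, w) >= 1.0 for xi, yi in zip(X, y))


# Hypothetical toy data, separable by a hyperplane through the origin.
X = [[1.0, 0.0], [-1.0, 0.0], [2.0, 1.0], [-2.0, -1.0]]
y = [1.0, -1.0, 1.0, -1.0]

feasible = [1.0, 0.0]    # meets every margin constraint, so the loss vanishes
too_short = [0.5, 0.0]   # separates the data, but two points fall inside the margin
```

Any vector that meets all the margin constraints has zero hinge loss, which is why the norm must be minimized in addition to the loss to single out one solution.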
The constrained optimization problem in Equation 1 is the primal formulation of an SVM. While solving or approximating the corresponding dual SVM formulation is popular in practice, there are advantages to approximating the primal problem directly [Cha07]. Of particular interest for this work, considering the primal formulation allows for straightforward analysis of the effect of the optimization error on the margin and hyperplane angle.
As an alternative to solving Equation 1 directly, one often looks for a solution to an unconstrained, regularized version. Define the functional:
[TABLE]
For , is strongly convex with strong convexity parameter . We will use to denote the subgradient of . The gradient of exists as long as for all and is given by
[TABLE]
When for some , the subgradient set contains the point
[TABLE]
When the gradient does not exist, we will abuse notation and use Equation 3 in the subgradient method update of Equation 5.
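The subgradient computation can be sketched as follows. This sketch assumes the regularized objective has the form `(lam / 2) * ||w||^2` plus the summed hinge loss; the exact scaling used in Equation 2 is not reproduced here, so the function names, the scaling, and the handling of points that sit exactly on the kink are illustrative assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length vectors stored as lists."""
    return sum(a * b for a, b in zip(u, v))


def subgradient(w, X, y, lam):
    """One element of the subdifferential of the assumed objective
    F(w) = (lam / 2) * ||w||^2 + sum_i max(0, 1 - y_i * <x_i, w>).

    Points with margin exactly 1 contribute zero, which is a valid
    choice from their subdifferential.
    """
    g = [lam * wj for wj in w]
    for xi, yi in zip(X, y):
        if yi * dot(xi, w) < 1.0:          # point violates the margin
            for j, xij in enumerate(xi):
                g[j] -= yi * xij
    return g


def subgradient_step(w, X, y, lam, eta):
    """Single subgradient update with step size eta."""
    g = subgradient(w, X, y, lam)
    return [wj - eta * gj for wj, gj in zip(w, g)]


# Hypothetical toy data, separable by a hyperplane through the origin.
X = [[1.0, 0.0], [-1.0, 0.0], [2.0, 1.0], [-2.0, -1.0]]
y = [1.0, -1.0, 1.0, -1.0]
```

When no point violates the margin, only the regularization term remains and the update simply shrinks the iterate toward the origin.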
Let
[TABLE]
We will refer to as the solution to the regularized subproblem of minimizing Equation 2. A larger regularization parameter encourages a solution with smaller norm at the cost of having some points lie within the margin. For linearly separable data and as approaches 0, the regularized solutions converge to the unregularized solution, . Let be such that for all . Such a is guaranteed to exist for linearly separable data [RZH04, HRTZ04]. This fact suggests solving Equation 2 by using the subgradient method for a sufficiently small value of . Of course, the value of will typically be unknown.
We use the following assumption and definition of throughout.
Assumption 2.1.
The data with labels are linearly separable, i.e. there exists such that for all , . Let be the hard-margin SVM (i.e. solves Equation 1) and be such that for all , .
While in practice, one may be satisfied with the solution for sufficiently small, we are interested in the convergence to the true hard-margin SVM given by . Thus, we instead propose to use a “homotopic” variant of the subgradient method that iteratively approximates the solution to Equation 2 while the regularization parameter and accompanying step size of the subgradient method in Equation 5 decay at prescribed rates. Incorporating a piecewise constant decaying step size is commonly used for large-scale minimization problems, especially when using stochastic gradient descent variants [BCN18].
Recall the subgradient method given by the updates:
[TABLE]
where is the approximate solution at iteration and is a step size. For some number of outer iterations , we choose a regularization parameter , a step size and a number of inner iterations . The regularization parameter and step size are selected such that they decrease to 0 as increases. Let be the current estimate of . We then perform subgradient updates applied to the loss function with initial iterate and step size . The next estimate, , is given by the average of the subgradient iterates. This process is detailed in Algorithm 1. For specific choices of and , Algorithm 1 converges to the hard-margin SVM solution . Convergence guarantees are detailed in Theorem 3.1.
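The outer and inner loops described above can be sketched as below. This is not a transcription of Algorithm 1: the schedules (regularization and step size halving each stage, inner iterations quadrupling) and the names `lam0`, `eta0`, `s0`, `c` are hypothetical choices made so that the sketch runs, and the objective scaling is the same assumption as a summed hinge loss with a `(lam / 2) * ||w||^2` penalty. The parameter choices that carry the convergence guarantee are those of Theorem 3.1.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def subgradient(w, X, y, lam):
    """Subgradient of (lam / 2) * ||w||^2 + summed hinge loss (assumed form)."""
    g = [lam * wj for wj in w]
    for xi, yi in zip(X, y):
        if yi * dot(xi, w) < 1.0:
            for j, xij in enumerate(xi):
                g[j] -= yi * xij
    return g


def homotopic_subgradient(X, y, stages=5, lam0=1.0, eta0=0.1, s0=50, c=2.0):
    """Averaged subgradient method with shrinking regularization.

    Hypothetical schedules: lam_t = lam0 / c**t, eta_t = eta0 / c**t,
    and s_t = s0 * c**(2 * t) inner iterations at stage t.
    """
    w_bar = [0.0] * len(X[0])
    for t in range(stages):
        lam = lam0 / c ** t
        eta = eta0 / c ** t
        steps = int(s0 * c ** (2 * t))
        w = list(w_bar)                 # warm start from the previous average
        total = [0.0] * len(w)
        for _ in range(steps):
            g = subgradient(w, X, y, lam)
            w = [wj - eta * gj for wj, gj in zip(w, g)]
            total = [tj + wj for tj, wj in zip(total, w)]
        w_bar = [tj / steps for tj in total]   # next estimate: stage average
    return w_bar


# Hypothetical toy data whose hard-margin solution is (1, 0).
X = [[1.0, 0.0], [-1.0, 0.0], [2.0, 1.0], [-2.0, -1.0]]
y = [1.0, -1.0, 1.0, -1.0]
```

Calling `homotopic_subgradient(X, y)` on this toy data returns an estimate close to `(1, 0)`. The warm start is the design point of interest: each stage begins from the previous stage's average, so later stages only need to correct a small residual error.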
While the strongly convex functions are not globally Lipschitz, they are Lipschitz functions on bounded domains. Using a projected subgradient method in which iterates are projected onto a bounded domain is a natural strategy for restricting the domain of the iterates. A projection is unnecessary in this setting, however, as the regularization parameter naturally promotes solutions of smaller norm. In fact, the iterates produced by the subgradient method in Algorithm 1 remain bounded in norm with a bound that depends on the current regularization parameter .
Lemma 2.2.
Fix a regularization parameter and step size such that . Define
[TABLE]
If the initial iterate is such that , then each iterate produced by the subgradient method of Equation 3 applied to the function of Equation 2 has . Additionally, .
In summary, if the initial iterate is such that , then the iterates produced by the subgradient method applied to will also have norm less than or equal to .
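The boundedness argument can be checked numerically. Under the assumed objective (a `(lam / 2) * ||w||^2` penalty plus the summed hinge loss), each update has the form `(1 - eta * lam) * w + eta * (sum of violating y_i x_i)`, so its norm is at most `(1 - eta * lam) * ||w|| + eta * sum_i ||x_i||`. The radius `sum_i ||x_i|| / lam` used below is therefore preserved whenever `eta * lam <= 1`. That radius is an illustrative stand-in, not the constant defined in the lemma.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    return math.sqrt(dot(v, v))


def subgradient(w, X, y, lam):
    """Subgradient of (lam / 2) * ||w||^2 + summed hinge loss (assumed form)."""
    g = [lam * wj for wj in w]
    for xi, yi in zip(X, y):
        if yi * dot(xi, w) < 1.0:
            for j, xij in enumerate(xi):
                g[j] -= yi * xij
    return g


def max_iterate_norm(X, y, lam, eta, steps):
    """Largest iterate norm seen when starting the method at the origin."""
    w = [0.0] * len(X[0])
    largest = 0.0
    for _ in range(steps):
        g = subgradient(w, X, y, lam)
        w = [wj - eta * gj for wj, gj in zip(w, g)]
        largest = max(largest, norm(w))
    return largest


# Hypothetical toy data and parameters with eta * lam = 0.1 <= 1.
X = [[1.0, 0.0], [-1.0, 0.0], [2.0, 1.0], [-2.0, -1.0]]
y = [1.0, -1.0, 1.0, -1.0]
lam, eta = 0.5, 0.2
radius = sum(norm(xi) for xi in X) / lam   # assumed stand-in for the lemma's bound
```

Comparing `max_iterate_norm(X, y, lam, eta, 500)` with `radius` confirms that no projection step is needed to keep the iterates in a bounded set.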
Remark.
Using Lemma 2.2, one can show that the functionals are Lipschitz over the domain of iterates produced by Algorithm 1. Specifically, the constant
[TABLE]
bounds the Lipschitz constants of each function restricted to the ball centered at the origin with radius . Lemma 2.2 guarantees that the iterates produced when applying the subgradient method to and for sufficiently small initial iterate remain within this domain. Note that the bound on the Lipschitz constants is independent of the regularization parameter .
3 Main Results
We now provide explicit rates of convergence to the hard-margin SVM solution for Algorithm 1. We provide convergence rates in terms of the error, difference in angle, and difference in margin between the approximation and the true hard-margin solution. The convergence results are stated in terms of , the total number of subgradient updates required. Recall that the approximations are only updated at increments of subgradient updates. Let , so that is the approximation after subgradient calculations.
Theorem 3.1 provides a convergence guarantee for the error of the iterates produced by Algorithm 1. This result will be used to additionally derive convergence guarantees for the angle and margin of the solution in Lemma 3.4. The parameter determines the rate of decay of the regularization and the parameter determines the number of steps used at each fixed level of regularization. The constant is as defined in Equation 7 and is an upper bound on the Lipschitz constants of the functions restricted to the domain of the iterates produced by the subgradient method applied to (Lemma 2.2).
Theorem 3.1.
Consider Algorithm 1 with parameters and such that and . Choose an initial number of inner iterations with . Let as defined in Equation 7. Define
[TABLE]
with Let be the average of the subgradient descent updates calculated to minimize the function with step size , where is the total number of subgradient descent updates calculated. Then for data and satisfying Assumption 2.1,
[TABLE]
Let . Then
[TABLE]
An outline for a proof of Theorem 3.1 can be found in Section 4 with additional details in Appendix A. Note that, for small , the two terms in the bound of Equation 8 will decrease at approximately the same rate if and . Corollary 3.2 gives a simpler, explicit rate of convergence by making this specification and setting .
Corollary 3.2.
Consider Algorithm 1 with parameters , and an initial number of inner iterations with . Let . Define
[TABLE]
with Let be the average of the subgradient descent updates calculated for with step size , where is the total number of subgradient descent updates calculated. Then for data and satisfying Assumption 2.1,
[TABLE]
Choosing , we have , and arrive at the convergence rate
[TABLE]
At least theoretically, sending leads to the best convergence rate guarantee. In fact, the convergence rate provided by Theorem 3.1 can be made arbitrarily close to by choosing , , and sufficiently large. As we will see in Section 5, using extremely large becomes impractical as the number of iterations for each fixed- subproblem becomes extremely large.
For strongly-convex, Lipschitz functions with strong-convexity parameter , one can achieve convergence in at a rate of , using projected averaged gradient descent with fixed step sizes (Theorem 3.2 [Bub14]). Using weighted step sizes, and knowledge of the strong convexity parameter, this rate can be improved to (Theorem 3.9 [Bub14], originally from [LJSB12]). A challenge of solving for the hard-margin SVM is that we do not optimize a strongly convex function. While one could fix a regularization parameter leading to a strongly convex function, there is no guarantee that the minimizer of this function will correspond to the true solution . Since the convergence rate of Algorithm 1 can be made arbitrarily close to we lose very little, only a factor of compared to the convergence rate of projected averaged gradient descent with fixed step sizes, for not knowing in advance and instead incorporating decreasing explicit regularization.
Additionally, in designing Algorithm 1, we aimed for a simple algorithm as opposed to optimizing all possible parameters. One could possibly improve on the rates given here by further optimizing these parameters.
3.1 Convergence rates for angle and margin gaps
The convergence rate in Theorem 3.1 can be used to derive rates of convergence to the angle and margin of the optimal separating hyperplane .
Definition 3.3.
For the hard margin SVM solution and a vector , define
[TABLE]
and
[TABLE]
While it is natural to consider the error of the derived solution, the angle between the true and derived solutions as well as the difference in the size of the margins give a more intuitive interpretation of the effect of that error. For example, an approximate solution that is off by a constant factor, that is , will have an angle gap of zero and non-zero margin gap if . If an approximate solution has a nonzero angle gap, but negligible margin gap, this suggests that the derived solution still separates the data reasonably well.
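The scaling example can be made concrete with a small sketch. Definition 3.3 is not reproduced here, so the code assumes a common convention in which the angle gap is one minus the cosine of the angle between the two vectors. It also reports the unnormalized margin `min_i y_i <x_i, w>` purely to show a margin-type quantity that changes under rescaling. Both function names and the toy data are illustrative assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    return math.sqrt(dot(v, v))


def angle_gap(w, w_star):
    """One minus the cosine of the angle between w and w_star (assumed convention)."""
    return 1.0 - dot(w, w_star) / (norm(w) * norm(w_star))


def unnormalized_margin(w, X, y):
    """Smallest value of y_i * <x_i, w> over the data set."""
    return min(yi * dot(xi, w) for xi, yi in zip(X, y))


# Hypothetical toy data whose hard-margin solution is (1, 0).
X = [[1.0, 0.0], [-1.0, 0.0], [2.0, 1.0], [-2.0, -1.0]]
y = [1.0, -1.0, 1.0, -1.0]
w_star = [1.0, 0.0]

scaled = [3.0, 0.0]   # same direction as w_star, different length
tilted = [1.0, 1.0]   # different direction
```

Here `scaled` has zero angle gap but a different unnormalized margin than `w_star`, while `tilted` has a strictly positive angle gap.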
Convergence rates of Algorithm 1 in terms of the angle and margin gaps are stated in Lemma 3.4 and compared to other recently obtained convergence rates in Table 1. The rates of convergence in these metrics can be derived from Theorem 3.1. These arguments are included in Appendix A.
Lemma 3.4.
Let
[TABLE]
where , and are as given in Theorem 3.1 so that is the exponent in the convergence rate of Theorem 3.1. Let be such that . The value of is positive and can be made arbitrarily close to 0 by choosing sufficiently large and setting and Then for the angle gap,
[TABLE]
For the margin gap,
[TABLE]
The convergence guarantees for the angle and margin gaps for Algorithm 1 are significantly faster than those given in Soudry et al. [SHN+18] for gradient descent with fixed step sizes applied to the logistic loss (see Table 1). Nacson et al. [NLG+19] demonstrate that using aggressive adaptive step sizes for gradient descent applied to the logistic loss leads to a faster convergence rate of . While the convergence guarantees for Algorithm 1 are slower, as , in this paper, we are interested in analyzing convergence guarantees for gradient descent applied to the non-smooth hinge loss.
4 Proof of Theorem 3.1
We prove Theorem 3.1 through a series of lemmas, which are stated in Subsection 4.1 and whose proofs are contained in Appendix A. The proof of Theorem 3.1 is contained in Subsection 4.2.
We briefly summarize each of the lemmas for convenience. Lemma 4.1 provides a modified convergence guarantee for the averaged subgradient method applied to the functions . Lemma 4.2 bounds the distance between minimizers of for different regularization parameters . This result allows for the incorporation of the decreasing regularization in Algorithm 1. Lemma 4.3 makes use of Lemma 4.1 and Lemma 4.2 to bound the initial error of each regularized subproblem as given in Equation 4.
4.1 Useful lemmas
Lemma 4.1 is a modified version of a standard convergence analysis of the averaged subgradient method for convex Lipschitz functions (Theorem 3.2 of [B+15]). This result bounds the distance between the average of the subgradient descent iterates and the minimizer of the functional for a fixed regularization parameter .
Lemma 4.1.
Let
[TABLE]
and . Let the initial iterate be such that and let minimize . Suppose , so that is contained in a ball of radius and center . Let be the average of subgradient method iterates with initial iterate and step size . Then
[TABLE]
Note that Lemma 4.1 also guarantees that
[TABLE]
The next lemma bounds the distance between the minimizers and of the functions and and shows that distance from to the true hard-margin solution , , is proportional to the regularization parameter .
Lemma 4.2.
Let minimize as given in Equation 2 and let solve Equation 1. Let be such that for all and . For and data satisfying Assumption 2.1, we have
[TABLE]
and
[TABLE]
The final lemma bounds the initial error at each fixed level of regularization for the subgradient updates produced when minimizing . In particular, it specifies a bound shrinking in on the distance between the initial iterate and the minimizer of the function . The fact that the initial error for each regularized subproblem goes to zero is crucial for proving the convergence of Algorithm 1 to the hard margin SVM solution.
Lemma 4.3.
Let and . For with , , and , let and . Let
[TABLE]
with
[TABLE]
Let . Then for the averaged subgradient iterates of Algorithm 1,
[TABLE]
Based on Lemma 4.3, for and the radii shrink to 0 as increases.
4.2 Proof of Theorem 3.1.
We now prove Theorem 3.1 using the above lemmas.
Proof.
We use the triangle inequality to bound the error as
[TABLE]
We then bound the terms and using the lemmas of Subsection 4.1.
Let and choose with . Let and . Let
[TABLE]
with
[TABLE]
Let . By Lemma 4.3, considering the first term in the bound of Equation 13,
[TABLE]
Changing the base,
[TABLE]
We now bound the second term of the bound in Equation 13. Let be such that for all . By Lemma 4.2,
[TABLE]
The total number of updates, , used to calculate is bounded by
[TABLE]
Rearranging,
[TABLE]
Writing the bounds in terms of the total number of updates, ,
[TABLE]
and
[TABLE]
Combining these,
[TABLE]
∎
In order to optimize the convergence rate given in Theorem 3.1, we aim to choose parameters and such that
[TABLE]
For small, and lead to a nearly optimal convergence rate of
[TABLE]
The choices and are considered in Corollary 3.2 and an explicit convergence rate is given under these conditions.
5 Experimental Results
We demonstrate the convergence of Algorithm 1 through several experiments on a simple synthetic dataset that is shown in Figure 1. The experiments aim to explore the differences between convergence in theory versus practice and are not intended to be exhaustive or demonstrate superior performance over existing methods. The data includes four support vectors which occur at and . The hard-margin SVM solution is given by . The maximal regularization parameter such that for all is . We fix the parameters and as are considered in Corollary 3.2 and initialize .
We measure convergence in terms of the error as well as the angle and margin gaps of Equation 10. Convergence results for Algorithm 1 with , and varying are shown in Figure 2. In terms of the error, for a fixed number of iterations, there appears to be an optimal choice for the parameter , as choosing performs better than or .
We additionally compare the convergence of Algorithm 1 in terms of the angle gap and margin gap to gradient descent using fixed step sizes applied to the logistic loss. We use step sizes , where is the largest singular value of the data matrix . As can be seen in Figure 3, we find significantly faster convergence via Algorithm 1 as compared to minimization of the logistic loss via gradient descent with fixed step sizes as considered in [SHN+18, NLG+19]. This result is unsurprising, as Algorithm 1 arrives at the SVM solution via controlled explicit regularization as opposed to only implicit regularization via gradient descent.
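A minimal sketch of the logistic-loss baseline is given below. The exact step size used in the experiments is not reproduced here, so the sketch assumes a fixed step of one over the squared largest singular value of the data matrix, estimated by power iteration. The function names, iteration counts, and toy data are illustrative assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def largest_singular_value_sq(X, iters=200):
    """Power iteration on X^T X; estimates the squared largest singular value."""
    d = len(X[0])
    v = [1.0] * d
    est = 0.0
    for _ in range(iters):
        Xv = [dot(xi, v) for xi in X]
        u = [sum(xi[j] * r for xi, r in zip(X, Xv)) for j in range(d)]
        est = math.sqrt(dot(u, u))
        v = [uj / est for uj in u]
    return est


def logistic_loss(w, X, y):
    return sum(math.log1p(math.exp(-yi * dot(xi, w))) for xi, yi in zip(X, y))


def logistic_gradient(w, X, y):
    g = [0.0] * len(w)
    for xi, yi in zip(X, y):
        m = yi * dot(xi, w)
        coeff = -yi / (1.0 + math.exp(m))   # derivative of log(1 + exp(-m)) times y_i
        for j, xij in enumerate(xi):
            g[j] += coeff * xij
    return g


def logistic_gd(X, y, steps):
    """Gradient descent with the assumed fixed step 1 / sigma_max(X)^2."""
    eta = 1.0 / largest_singular_value_sq(X)
    w = [0.0] * len(X[0])
    for _ in range(steps):
        g = logistic_gradient(w, X, y)
        w = [wj - eta * gj for wj, gj in zip(w, g)]
    return w


# Hypothetical toy data, separable by a hyperplane through the origin.
X = [[1.0, 0.0], [-1.0, 0.0], [2.0, 1.0], [-2.0, -1.0]]
y = [1.0, -1.0, 1.0, -1.0]
```

On separable data the logistic loss has no finite minimizer, so the iterate norm keeps growing and only the direction of `w` is meaningful when comparing against the hard-margin solution.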
We additionally consider the performance of Algorithm 1 applied to the data of Figure 1 with the y-values of the data multiplied by 20. This leads to a slightly more challenging problem with less symmetric data. The results are shown in Figure 4. We find that the convergence of Algorithm 1 is slightly slower in terms of error. Gradient descent on the logistic loss converges significantly more slowly in terms of both the angle and margin gaps, whereas the effect on the convergence of Algorithm 1 appears to be minimal.
6 Implementation remarks
As presented, Algorithm 1 is highly adaptable for different loss functions and settings in which one would like to consider a range of regularization parameters or variable regularization. In this section, we present several potential modifications of interest, including adaptive or gradient based step sizes, amenability to using stochastic subgradients, and alternative updates.
6.1 Adaptive step sizes
When the regularization parameter, , or the norm of are small and close to optimal, if an iterate violates one of the hinge loss constraints, this can increase the magnitude of the gradient of the loss significantly, leading to a relatively large jump in the next iterate followed by many smaller steps back toward the optimal solution of smaller norm. Using gradient descent with adaptive or loss-dependent step sizes can minimize the effects of these cycles. For example, we could adjust Algorithm 1 to use step sizes that are normalized by the magnitude of the subgradient,
[TABLE]
With this choice, the magnitude of the update is always and is independent of the magnitude of the gradient of . Cursory experimental results suggest that using adaptive step sizes as in Equation 14, leads to slower convergence to the true solution initially and does not lead to improved convergence overall.
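A normalized update of this kind can be sketched as follows. Equation 14 is not reproduced here, so the sketch assumes the plain form in which the subgradient is divided by its norm before the step is taken. The objective scaling, function names, and toy data are illustrative assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    return math.sqrt(dot(v, v))


def subgradient(w, X, y, lam):
    """Subgradient of (lam / 2) * ||w||^2 + summed hinge loss (assumed form)."""
    g = [lam * wj for wj in w]
    for xi, yi in zip(X, y):
        if yi * dot(xi, w) < 1.0:
            for j, xij in enumerate(xi):
                g[j] -= yi * xij
    return g


def normalized_step(w, X, y, lam, eta):
    """Move a distance of exactly eta along the negative subgradient direction."""
    g = subgradient(w, X, y, lam)
    g_norm = norm(g)
    if g_norm == 0.0:
        return list(w)          # zero subgradient: nothing to normalize
    return [wj - eta * gj / g_norm for wj, gj in zip(w, g)]


# Hypothetical toy data, separable by a hyperplane through the origin.
X = [[1.0, 0.0], [-1.0, 0.0], [2.0, 1.0], [-2.0, -1.0]]
y = [1.0, -1.0, 1.0, -1.0]
```

Because the step length no longer depends on how many margin constraints are violated, a single violated constraint cannot produce a large jump.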
One could also potentially increase the convergence rate guarantees for Algorithm 1 by incorporating aggressive loss-dependent step sizes. In [NLG+19], the authors show that when using Equation 14 with step sizes , gradient descent applied to the logistic loss converges at the nearly optimal rate of . While this strategy provides a faster convergence rate, loss-dependent step sizes are less commonly used in practice as, in the stochastic setting, updating the loss at each iteration is often too expensive. The stochastic setting is discussed further in Subsection 6.3.
6.2 Regularization decay rate
In Algorithm 1, we consider regularization parameters that decay at a rate of for a constant . One might consider other choices for the decay rate of the regularization parameter . For example or for . Recall that in bounding the error we use the decomposition
[TABLE]
The first term converges more quickly when is large while the second term converges more quickly when is small. The decay rate of was chosen to balance the convergence of these terms.
6.3 Stochastic subgradients
Algorithm 1 can be naturally extended to the stochastic subgradient setting, in which one performs updates based on the subgradient of the loss with respect to only a subset of the data points. This is often necessary for large-scale optimization problems. Additionally, although piecewise-constant decaying step sizes are incorporated into Algorithm 1 to account for the introduced regularization, they are also often used in stochastic gradient descent in order to mitigate the effect of noise in the gradient approximation of each update [BCN18]. This commonality suggests that Algorithm 1 may be particularly suited for the stochastic setting.
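One way to form such a stochastic subgradient is sketched below. The source does not specify a sampling scheme, so sampling a mini-batch without replacement and rescaling the hinge term by `n / batch_size` (which makes it an unbiased estimate of the full sum) are assumptions, as are the function name and the toy data.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def minibatch_subgradient(w, X, y, lam, batch_size, rng):
    """Stochastic subgradient of (lam / 2) * ||w||^2 + summed hinge loss.

    A batch of indices is drawn without replacement; the hinge part is
    rescaled so that its expectation equals the full-data hinge subgradient.
    """
    n = len(X)
    batch = rng.sample(range(n), batch_size)
    scale = n / batch_size
    g = [lam * wj for wj in w]
    for i in batch:
        if y[i] * dot(X[i], w) < 1.0:
            for j, xij in enumerate(X[i]):
                g[j] -= scale * y[i] * xij
    return g


# Hypothetical toy data, separable by a hyperplane through the origin.
X = [[1.0, 0.0], [-1.0, 0.0], [2.0, 1.0], [-2.0, -1.0]]
y = [1.0, -1.0, 1.0, -1.0]
rng = random.Random(0)   # seeded generator so runs are repeatable
```

With `batch_size` equal to the number of data points the estimate coincides with the full subgradient, which gives a simple sanity check before using smaller batches.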
6.4 Alternative updates
Lemma 4.1 is the only result that depends on the update given by the fixed- subproblem and, in particular, Theorem 3.1 applies to any update that satisfies for each . Thus, as opposed to using the average of the iterates from each fixed subproblem, one could use alternative updates, such as
[TABLE]
or the iterate that leads to the minimal loss for that subproblem. We refer to this update choice as the best-iterate update and investigate the effects of this choice in Figure 5.
We find that the best-iterate update typically leads to significantly faster convergence in terms of the error. Specifically, choosing the best iterate can alleviate the slow convergence caused by the slow decrease in step size. The two strategies, using the averaged iterate and using the best iterate, perform comparably in terms of the angle gap. Using the best iterate converges somewhat more slowly in terms of the margin gap.
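The two update choices can be compared with a sketch like the one below, which solves one fixed-regularization subproblem and returns both the averaged iterate and the iterate with the smallest objective value. The objective scaling (a `(lam / 2) * ||w||^2` penalty plus the summed hinge loss), the function names, and the toy data are illustrative assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def objective(w, X, y, lam):
    """Assumed regularized objective: (lam / 2) * ||w||^2 + summed hinge loss."""
    hinge_sum = sum(max(0.0, 1.0 - yi * dot(xi, w)) for xi, yi in zip(X, y))
    return 0.5 * lam * dot(w, w) + hinge_sum


def subgradient(w, X, y, lam):
    g = [lam * wj for wj in w]
    for xi, yi in zip(X, y):
        if yi * dot(xi, w) < 1.0:
            for j, xij in enumerate(xi):
                g[j] -= yi * xij
    return g


def solve_subproblem(w0, X, y, lam, eta, steps):
    """Run the subgradient method; return (averaged iterate, best iterate)."""
    w = list(w0)
    total = [0.0] * len(w)
    best, best_val = None, float("inf")
    for _ in range(steps):
        g = subgradient(w, X, y, lam)
        w = [wj - eta * gj for wj, gj in zip(w, g)]
        total = [tj + wj for tj, wj in zip(total, w)]
        val = objective(w, X, y, lam)
        if val < best_val:                 # keep the iterate with minimal loss
            best, best_val = list(w), val
    avg = [tj / steps for tj in total]
    return avg, best


# Hypothetical toy data, separable by a hyperplane through the origin.
X = [[1.0, 0.0], [-1.0, 0.0], [2.0, 1.0], [-2.0, -1.0]]
y = [1.0, -1.0, 1.0, -1.0]
```

Tracking the best iterate requires one objective evaluation per step, a cost the averaged update avoids.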
6.5 Incorporating a bias term
As in [RZH04, SHN+18], we consider the case in which the maximal-margin separating hyperplane intersects the origin. One can allow for more general hyperplanes by learning a bias term for the separating hyperplane. We propose the following method for approximating the bias term
[TABLE]
which is guaranteed to be close to the true max-margin bias when is small. Specifically, one can verify that for the bias as calculated in Equation 15 and the true bias, we have
[TABLE]
Initial experiments with a non-trivial bias demonstrate convergence similar to the zero-bias case.
7 Conclusion
We have shown that, for linearly separable data, the subgradient method converges to the max-margin SVM solution when minimizing the unconstrained regularized SVM, Equation 2, with decreasing regularization parameters, . Under the conditions given in Theorem 3.1, this convergence can be guaranteed to be for any . We compare convergence rates in several metrics to those provided in [SHN+18, NLG+19]. In particular, the convergence rate guarantees for Algorithm 1 are faster than those of [SHN+18, NLG+19] for gradient descent with fixed step sizes. This restriction to fixed or piecewise constant step sizes is a practical choice, especially when working with large-scale optimization problems. We additionally demonstrate the convergence of Algorithm 1 on a simple synthetic dataset.
Although we specifically consider the hinge loss and SVMs, the results and analysis presented here could be extended to more general settings. For example, one could more generally consider settings in which one aims to solve
[TABLE]
where is strongly convex and Lipschitz over bounded domains, is convex and Lipschitz, and the regularization path,
[TABLE]
is Lipschitz in .
Acknowledgments
D. Molitor and D. Needell are grateful to and were partially supported by NSF CAREER DMS #1348721 and NSF BIGDATA DMS #1740325. R. Ward was supported in part by AFOSR MURI Award N00014-17-S-F006.
Appendix A Lemma Proofs
We now present proofs for the lemmas of Sections 2, 3 and 4.
We first prove Lemma 2.2, which gives a bound on the norm of the iterates produced by the subgradient method applied to Equation 2.
Proof of Lemma 2.2
Proof.
Consider the subgradient update for minimizing the function of Equation 2
[TABLE]
with . Suppose that the iterate satisfies . We aim to show that given by the subgradient update also satisfies Taking the norm on both sides of Equation 16,
[TABLE]
Thus the norms of all iterates of the subgradient method applied to the function remain bounded by if the initial iterate has norm at most . The norm of the minimizer of must also satisfy the bound as and so
[TABLE]
∎
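The boundedness argument above can be checked numerically. The sketch below assumes the objective has the form `lam/2 * ||w||^2 + mean(max(0, 1 - y * <x, w>))` with `||x|| <= 1`, a step size `eta <= 1/lam`, and a norm bound of `1/lam`, which is what the triangle inequality yields under these assumptions. The dataset and parameter values are hypothetical.

```python
import math

# Toy data with ||x|| <= 1 (an assumption of this sketch).
DATA = [((0.8, 0.4), 1), ((-0.8, 0.4), -1), ((0.9, 0.3), 1)]

def norm(w):
    return math.sqrt(sum(wj * wj for wj in w))

def subgradient_step(w, data, lam, eta):
    """One subgradient step on lam/2 * ||w||^2 + mean hinge loss."""
    g = [lam * wj for wj in w]
    for x, y in data:
        if y * sum(xj * wj for xj, wj in zip(x, w)) < 1.0:
            for j, xj in enumerate(x):
                g[j] -= y * xj / len(data)
    return [wj - eta * gj for wj, gj in zip(w, g)]

def max_iterate_norm(w0, data, lam, eta, steps):
    """Largest norm seen along the subgradient trajectory, including w0."""
    w = list(w0)
    largest = norm(w)
    for _ in range(steps):
        w = subgradient_step(w, data, lam, eta)
        largest = max(largest, norm(w))
    return largest

lam, eta = 0.5, 0.1
# Start on the boundary ||w0|| = 1/lam, the worst case for the induction.
print(max_iterate_norm((0.0, 2.0), DATA, lam, eta, 200), "vs bound", 1.0 / lam)
```

Each step shrinks the current iterate by the factor `1 - eta * lam` and adds at most `eta` in norm, so an iterate of norm at most `1/lam` is mapped to another iterate of norm at most `1/lam`.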
Proof of Lemma 3.4
Lemma 3.4 uses Theorem 3.1 to derive bounds for the angle and margin gaps.
Proof.
To derive a convergence rate for the angle gap, we use the decomposition
[TABLE]
Dividing by ,
[TABLE]
Note that is necessarily bounded away from 0, since for all . We can bound away from 0 for large using the convergence of to guaranteed by Theorem 3.1. Let
[TABLE]
be the exponent in the convergence rate of and and be defined as in Theorem 3.1. Since
[TABLE]
for constants by Theorem 3.1, then Thus for sufficiently large, we can bound away from 0 and have
[TABLE]
We now consider the margin bound. Let . Since for all , we have that
[TABLE]
Note that
[TABLE]
Assuming the data is finite and linearly separable, by Equation 17 we then have
[TABLE]
∎
Proof of Lemma 4.1
Lemma 4.1 provides a modified convergence guarantee for the averaged subgradient method applied to the functions [Bub14].
Proof.
Let be a strongly convex function with strong convexity parameter and Lipschitz constant on the bounded domain considered. Let be an initial iterate and be the minimizer of . Suppose , so that is contained in a ball of radius and center . Let be the average of subgradient descent iterates with initial iterate and step size . We aim to show that
[TABLE]
The following proof relies heavily on Theorem 3.2 of [Bub14] (see also [B+15]).
Since is the minimizer of , the inequality
[TABLE]
is immediate. Let . Since is convex,
[TABLE]
and thus
[TABLE]
Reorganizing and subtracting ,
[TABLE]
Using the strong convexity of and the proof of Theorem 3.2 of [Bub14],
[TABLE]
Making this substitution into Equation 18,
[TABLE]
Decomposing the sum,
[TABLE]
Making this substitution,
[TABLE]
Since is convex, and
[TABLE]
as desired. ∎
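The two standard ingredients used in this proof can be illustrated on a one-dimensional example. The sketch below is not the lemma itself. It checks the classical averaged-subgradient bound `f(x_bar) - f(x*) <= R * L / sqrt(t)` for step size `R / (L * sqrt(t))`, and the strong convexity inequality `(alpha / 2) * |x_bar - x*|^2 <= f(x_bar) - f(x*)`. The test function `f(x) = alpha/2 * x^2 + |x - 1|`, the starting point, and the constants are assumptions chosen so that the minimizer `x* = 1` is known in closed form.

```python
import math

ALPHA = 0.5   # strong convexity parameter of the assumed test function
X_STAR = 1.0  # minimizer of f when ALPHA <= 1

def f(x):
    return 0.5 * ALPHA * x * x + abs(x - 1.0)

def subgrad(x):
    # sign(x - 1) with the choice 0 at the kink, a valid subgradient of |x - 1|
    sign = (x > 1.0) - (x < 1.0)
    return ALPHA * x + sign

def averaged_subgradient(x1, radius, lipschitz, steps):
    """Average of the iterates x_1, ..., x_t with step size R / (L * sqrt(t))."""
    eta = radius / (lipschitz * math.sqrt(steps))
    x, total = x1, 0.0
    for _ in range(steps):
        total += x
        x -= eta * subgrad(x)
    return total / steps

R, L, T = 2.0, 2.5, 400  # |x1 - x*| <= R, and |subgrad| <= L on [0, 3]
x_bar = averaged_subgradient(3.0, R, L, T)
gap = f(x_bar) - f(X_STAR)
print("function gap:", gap, "bound:", R * L / math.sqrt(T))
print("squared distance:", (x_bar - X_STAR) ** 2, "bound:", 2.0 * gap / ALPHA)
```

Combining the two inequalities turns a bound on the function value gap into a bound on the distance from the averaged iterate to the minimizer, which is the form needed when the averaged iterate is reused as an initial point.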
Proof of Lemma 4.2
We now prove Lemma 4.2, which bounds the distance between minimizers of for different regularization parameters .
Proof.
Let minimize as given in Equation 2. Let be such that for all . For and data satisfying Assumption 2.1, we aim to show that
[TABLE]
and
[TABLE]
The proof of Lemma 4.2 makes use of Lemma 8 of [LS18], which is also stated below.
Lemma A.1.
(Perturbation of strongly convex functions I [LS18]). Let be a non-negative, -strongly convex function. Let be an L-Lipschitz non-negative convex function. For any , let be the minimizer of . Then we have
[TABLE]
Let and . Then is strongly convex with strong convexity parameter and is Lipschitz with a Lipschitz constant bounded by . Note that
[TABLE]
for . Applying Lemma 8 of [LS18],
[TABLE]
Integrating, for any , we have
[TABLE]
As the regularization parameter approaches zero, we will use the following bound. Since for all , , then for , $\big\|\frac{d\bm{w}[\lambda]}{d\lambda}\big\| = 0$. Thus
[TABLE]
This gives the second bound,
[TABLE]
∎
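The behavior of the regularization path used in this proof can be seen on a single-sample toy problem. The sketch below assumes the objective `lam/2 * w^2 + max(0, 1 - w)` (one sample with `x = 1`, `y = 1`), for which the minimizer is `min(1, 1/lam)`. The path is constant and equal to the max-margin solution `w = 1` once `lam <= 1`, and it moves continuously for larger `lam`. The ternary search routine and the grid of `lam` values are illustrative choices, not part of the paper.

```python
def objective(w, lam):
    """Regularized hinge loss for the single sample x = 1, y = 1 (assumed form)."""
    return 0.5 * lam * w * w + max(0.0, 1.0 - w)

def path_point(lam, lo=0.0, hi=2.0, iters=200):
    """Minimize the strongly convex objective on [lo, hi] by ternary search."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1, lam) < objective(m2, lam):
            hi = m2  # the minimizer cannot lie to the right of m2
        else:
            lo = m1  # the minimizer cannot lie to the left of m1
    return 0.5 * (lo + hi)

for lam in (0.1, 0.5, 1.0, 2.0, 4.0):
    print(lam, path_point(lam), min(1.0, 1.0 / lam))
```

In this example the derivative of the path with respect to `lam` vanishes for `lam < 1`, matching the observation that the minimizer stops changing once the regularization parameter is small enough.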
Proof of Lemma 4.3
We finally prove Lemma 4.3, which makes use of Lemma 4.1 and Lemma 4.2 to bound the initial error of each regularized subproblem given in Equation 4.
Proof.
We aim to show with defined below and proceed by induction. For with , , and , let , . Recall that . For some parameter , let
[TABLE]
By Lemma 2.2, and since , we have . Note that and thus the base case, is satisfied.
Suppose that . By the triangle inequality,
[TABLE]
For generated as in Algorithm 1, Lemma 4.1 along with the inductive assumption gives that
[TABLE]
From Equation 11 of Lemma 4.2,
[TABLE]
Combining these,
[TABLE]
Applying a change of base via ,
[TABLE]
To simplify the analysis and remove the dependence of on the iteration number , we use Now, for
[TABLE]
and , we have
[TABLE]
Note that allowing the first term in the upper bound on to increase with leads to smaller bounds . This choice, however, complicates the analysis.
∎
References
- [B+15] S. Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
- [BCN18] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
- [BGMS18] A. Brutzkus, A. Globerson, E. Malach, and S. Shalev-Shwartz. SGD learns over-parameterized networks that provably generalize on linearly separable data. In International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.
- [BST99] P. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. Advances in Kernel Methods: Support Vector Learning, pages 43–54, 1999.
- [Bub14] S. Bubeck. Convex optimization: Algorithms and complexity. arXiv e-prints, page arXiv:1405.4980, May 2014.
- [CCS+17] P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. In International Conference on Learning Representations, 2017.
- [Cha07] O. Chapelle. Training a support vector machine in the primal. Neural Computation, 19:1155–1178, 2007.
- [CPS+18] R. T. d. Combes, M. Pezeshki, S. Shabanian, A. C. Courville, and Y. Bengio. On the learning dynamics of deep neural networks. CoRR, abs/1809.06848, 2018.
