Bias of Homotopic Gradient Descent for the Hinge Loss

Denali Molitor; Deanna Needell; Rachel Ward

arXiv:1907.11746·stat.ML·July 30, 2019

TL;DR

This paper investigates the convergence behavior of a homotopic gradient descent method applied to the hinge loss in linear classifiers, providing explicit rates towards the max-margin solution for separable data.

Contribution

It introduces a homotopic gradient descent approach for the hinge loss and establishes explicit convergence rates to the max-margin solution in linearly separable data.

Findings

1. Convergence to max-margin solution is achieved with explicit rates.

2. Homotopic gradient descent effectively handles non-smooth hinge loss.

3. Theoretical analysis extends understanding of gradient methods for non-smooth losses.

Abstract

Gradient descent is a simple and widely used optimization method for machine learning. For homogeneous linear classifiers applied to separable data, gradient descent has been shown to converge to the maximal margin (or equivalently, the minimal norm) solution for various smooth loss functions. The previous theory does not, however, apply to non-smooth functions such as the hinge loss which is widely used in practice. Here, we study the convergence of a homotopic variant of gradient descent applied to the hinge loss and provide explicit convergence rates to the max-margin solution for linearly separable data.

Tables

Table 1: Comparison of convergence rates for Algorithm 1 with those of [SHN+18] for gradient descent with fixed step sizes applied to the logistic loss.

Metric	Algorithm 1	[SHN+18]
Angle gap	$O\left(k^{-1/3+2\delta}\right)$	$O\left(\left(\frac{\log\log(k)}{\log(k)}\right)^{2}\right)$
Margin gap	$O\left(k^{-1/6+\delta}\right)$	$O\left(\frac{1}{\log(k)}\right)$

Equations

$\bm{w}^{*}=\operatorname*{argmin}_{\bm{w}}\|\bm{w}\|\quad\text{subject to}\quad y_{j}\bm{x}_{j}^{\top}\bm{w}\geq 1\;\;\forall j.$

$\bm{w}^{*}=\operatorname*{argmin}_{\bm{w}}\|\bm{w}\|\quad\text{subject to}\quad\mathcal{L}(\bm{w})=0,$

$\mathcal{L}(\bm{w}):=\frac{1}{n}\sum_{j=1}^{n}h(y_{j}\bm{x}_{j}^{\top}\bm{w})\quad\text{and}\quad h(u):=\max(0,1-u).$

$F_{\lambda}(\bm{w}):=\frac{\lambda}{2}\|\bm{w}\|^{2}+\frac{1}{n}\sum_{j=1}^{n}h(y_{j}\bm{x}_{j}^{\top}\bm{w}).$

$\partial F_{\lambda}(\bm{w})=\lambda\bm{w}-\frac{1}{n}\sum_{j:\,y_{j}\bm{x}_{j}^{\top}\bm{w}<1}y_{j}\bm{x}_{j}.$

$\lambda\bm{w}-\frac{1}{n}\sum_{j:\,y_{j}\bm{x}_{j}^{\top}\bm{w}<1}y_{j}\bm{x}_{j}\in\partial F_{\lambda}(\bm{w}).$

$\bm{w}_{\lambda}^{*}:=\operatorname*{argmin}_{\bm{w}}F_{\lambda}(\bm{w}).$

$\bm{w}_{k+1}=\bm{w}_{k}-\eta_{k}\partial F_{\lambda}(\bm{w}_{k}),$

$B_{\lambda}:=\frac{\sum_{j=1}^{n}\|\bm{x}_{j}\|}{\lambda n}.$

$L:=\frac{2}{n}\sum_{j=1}^{n}\|\bm{x}_{j}\|$

$C=\max\left\{4,\tfrac{1}{2}s_{0}^{p}(s_{0}-1)^{\alpha}\right\}\quad\text{and}\quad\alpha=\min\left(\frac{r-2p}{2(1+\epsilon_{0})},1-p\right),$

$\|\bm{z}_{k}-\bm{w}^{*}\|\leq CL\left((r+1)k\right)^{\frac{-\alpha(1-\epsilon_{0})}{r+1}}+\frac{L}{2(\lambda^{\prime})^{2}}\left((r+1)k\right)^{\frac{-p}{r+1}}.$

$\|\bm{z}_{k}-\bm{w}^{*}\|$

$C=\max\left\{4,\tfrac{1}{2}s_{0}^{p}(s_{0}-1)^{\alpha}\right\}\quad\text{and}\quad\alpha=\min\left(\frac{r-2p}{2(1+\epsilon_{0})},1-p\right),$

$\|\bm{z}_{k}-\bm{w}^{*}\|$

$\|\bm{z}_{k}-\bm{w}^{*}\|\leq 4.17\,L\,k^{\frac{-0.913}{6}}+\frac{0.42\,L\,k^{-1/6}}{(\lambda^{\prime})^{2}}.$

$\text{angle gap}:=1-\frac{\bm{w}^{\top}\bm{w}^{*}}{\|\bm{w}\|\,\|\bm{w}^{*}\|}$

$\text{margin gap}:=\frac{1}{\|\bm{w}^{*}\|}-\min_{i}\frac{y_{i}\bm{x}_{i}^{\top}\bm{w}}{\|\bm{w}\|}.$

$c=\min\left(\frac{(r-2p)(1-\epsilon_{0})}{2(r+1)(1+\epsilon_{0})},\frac{(1-p)(1+\epsilon_{0})}{r+1},\frac{p}{r+1}\right),$

$1-\frac{\bm{w}_{k}^{\top}\bm{w}^{*}}{\|\bm{w}_{k}\|\,\|\bm{w}^{*}\|}=O\left(k^{-1/3+2\delta}\right).$

$\frac{1}{\|\bm{w}^{*}\|}-\min_{i}\frac{y_{i}\bm{x}_{i}^{\top}\bm{w}_{k}}{\|\bm{w}_{k}\|}=O\left(k^{-1/6+\delta}\right).$

$F_{\lambda}(\bm{w})=\frac{\lambda}{2}\|\bm{w}\|^{2}+\frac{1}{n}\sum_{j=1}^{n}\max(0,1-y_{j}\bm{x}_{j}^{\top}\bm{w})$

$0\leq F_{\lambda}(\overline{\bm{w}})-F_{\lambda}(\bm{w}_{\lambda}^{*})\leq\frac{RL}{\sqrt{t}}-\frac{\lambda}{2}\|\overline{\bm{w}}-\bm{w}_{\lambda}^{*}\|^{2}.$

$\|\overline{\bm{w}}-\bm{w}_{\lambda}^{*}\|^{2}\leq\frac{2RL}{\lambda\sqrt{t}}.$

$\|\bm{w}_{\lambda}^{*}-\bm{w}_{\widetilde{\lambda}}^{*}\|\leq\frac{L}{2}\left|\frac{1}{\lambda}-\frac{1}{\widetilde{\lambda}}\right|$

$\|\bm{w}_{\lambda}^{*}-\bm{w}^{*}\|\leq\frac{L\lambda}{2(\lambda^{\prime})^{2}}.$

$R_{s}=CL(s_{0}+s-1)^{-\alpha}\quad\text{for}\quad 0\leq\alpha\leq\min\left(\frac{r-2p}{2(1+\epsilon_{0})},1-p\right),$

$C=\max\left\{4,\frac{1}{2\lambda_{0}}(s_{0}-1)^{\alpha}\right\}\quad\text{and}\quad\epsilon_{0}=\frac{\log(s_{0})-\log(s_{0}-1)}{\log(s_{0})}.$

$\|\overline{\bm{w}}_{s}-\bm{w}_{\lambda_{s}}^{*}\|\leq R_{s}.$

$\|\overline{\bm{w}}_{s}-\bm{w}^{*}\|\leq\|\overline{\bm{w}}_{s}-\bm{w}_{\lambda_{s}}^{*}\|+\|\bm{w}_{\lambda_{s}}^{*}-\bm{w}^{*}\|.$

Full text

Bias of Homotopic Gradient Descent for the Hinge Loss

Denali Molitor

Department of Mathematics, University of California at Los Angeles, Los Angeles, California, USA

Deanna Needell

Department of Mathematics, University of California at Los Angeles, Los Angeles, California, USA

Rachel Ward

Department of Mathematics, University of Texas at Austin, Austin, Texas, USA

1 Introduction

Several recent works suggest that the optimization methods used in training models affect the model’s ability to generalize through implicit biases toward certain solutions [ZBH+17, NTS14, HRS16, HHS17, PKL+17, PLM+18, CCS+17, CPS+18]. In order to understand the effects of optimization methods in more complex and often non-convex settings such as for neural networks, it is natural to first understand their behavior in simpler settings, such as for least squares regression, logistic regression, and support vector machines (SVM) [SHN+18, NLG+19, GLSS18]. In particular, gradient descent and its many variants, including the subgradient method, are popular choices for optimizing machine learning models and thus warrant careful study.

It was recently shown that gradient descent applied to the (unregularized) logistic regression problem for linearly separable data converges to the solution with maximal margin, while other choices of optimization method converge to different solutions [SHN+18]. Convergence to the maximal-margin solution is desirable, as the margin is an important quantity for deriving generalization guarantees [BST99, Vap82, Vap99, VC74, Vap13]. The analysis of Soudry et al. [SHN+18] extends to additional loss functions, but requires particular properties, including smoothness and monotonicity. These assumptions do not hold, however, for non-differentiable functions such as the hinge loss objective, which is the loss function used in training SVM [CV95].

Here, we analyze the convergence to the maximal margin solution of a homotopic subgradient method applied to the non-smooth hinge loss. In particular we consider a method in which a number of subgradient updates are applied to the hinge loss with decreasing regularization. Although it is well known that the exact solutions of the regularized hinge loss converge to the hard-margin SVM solution as the regularization decreases to zero in the linearly separable case [RZH04, HRTZ04], we are unaware of results that provide explicit convergence rates for an iterative optimization algorithm, such as the subgradient method, that converges to the hard-margin SVM solution in a single pass of the regularization parameter $\lambda$ . We provide such an analysis here, and demonstrate that the iterates of an averaged subgradient method applied to the regularized SVM loss with shrinking regularization parameters converge to the max-margin solution at a rate of $O\left(k^{-1/6+\delta}\right)$ for linearly separable data, where $\delta$ is any small positive constant.

For linearly separable data, there exists $\lambda^{\prime}>0$ such that the solution $\bm{w}_{\lambda}^{*}$ to the hinge loss with regularization parameter $\lambda$ is equal to the true, hard-margin solution $\bm{w}^{*}$ for all $\lambda\leq\lambda^{\prime}$ [RZH04, HRTZ04]. While $\lambda^{\prime}$ is constant for a fixed problem, knowing its value in advance is typically unrealistic. Additionally, if the data is not well separated, $\lambda^{\prime}$ can be very small. The homotopic subgradient method analyzed here depends on the value of $\lambda^{\prime}$ and converges at a rate of $O\left((\lambda^{\prime})^{-2}k^{-1/6+\delta}\right)$. If one were to know the appropriate regularization parameter $\lambda^{\prime}$ in advance, the averaged subgradient method with appropriate fixed step sizes would converge in $L_{2}$ error at a rate of $O\left((\lambda^{\prime})^{-1}k^{-1/4}\right)$. This rate can be improved to $O\left((\lambda^{\prime})^{-1}k^{-1/2}\right)$ by using weighted step sizes that depend on $\lambda^{\prime}$ [B+15, LJSB12]. Thus, we pay a small price for the shrinking regularization routine and for not knowing the value of $\lambda^{\prime}$ in advance. We additionally provide faster convergence guarantees and improved convergence results for the proposed method on small datasets as compared to gradient descent applied to the logistic loss with fixed step sizes [SHN+18].

1.1 Contributions

While several works analyze the convergence of various optimization methods to the maximal-margin solution for separable data [SHN+18, NLG+19], we are unaware of any works that provide explicit convergence rates for the fundamental subgradient descent method. Convergence of the subgradient method and the stochastic subgradient method has been analyzed for non-smooth convex functions; however, these works only provide convergence guarantees in the loss-function values and not the iterates, as, for general convex functions, the minimizer may not be finite and may not be unique [SZ13, Zha04]. In the context of solving the hard-margin SVM, the restriction to linearly separable data guarantees the existence of a minimizer and considering the maximal margin solution ensures uniqueness. Moreover, in the context of general convex functions, previous works often use the projected subgradient method and require knowledge of a bounded domain in which a minimizer exists [SZ13, B+15]. For solving the hard-margin SVM via gradient descent, we show that such a projection is unnecessary.

Here, we provide explicit convergence guarantees for a homotopic subgradient method for optimizing the non-smooth SVM hinge loss. The proposed method uses decreasing regularization parameters and leads to the hard-margin SVM solution. We study the effects of optimization via this method on the generalization ability of the learned solutions through proved convergence rates to the hard-margin SVM solution in terms of $L_{2}$ error as well as difference in angle and margin from the true solution. We additionally show that these convergence rates to the hard-margin SVM solution outpace recent results such as gradient descent with fixed step sizes applied to the logistic loss [SHN+18, NLG+19]. We demonstrate the convergence of the proposed method on a synthetic dataset.

1.2 Organization

In Section 2, we introduce the specific problem setting, the notation that will be used throughout, and the proposed optimization scheme, Algorithm 1. Section 3 provides the main convergence results for Algorithm 1. An outline for the proof of the main convergence theorem, Theorem 3.1, is provided in Section 4, with additional details in Appendix A. We test convergence properties of Algorithm 1 for a simple synthetic dataset in Section 5. Section 6 provides additional implementation details for Algorithm 1 as well as possible modifications and extensions.

2 Problem Setup

We consider the binary classification problem with data $\{(\bm{x}_{j},y_{j}):j=1,\ldots,n\}$ , where $\bm{x}_{j}\in\mathbb{R}^{d}$ are the data points and $y_{j}\in\{-1,1\}$ their labels. We aim to classify the data via a homogeneous linear SVM. Specifically, we wish to identify a weight vector $\bm{w}^{*}$ that satisfies

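$\bm{w}^{*}=\operatorname*{argmin}_{\bm{w}}\|\bm{w}\|\quad\text{subject to}\quad y_{j}\bm{x}_{j}^{\top}\bm{w}\geq 1\;\;\forall j.\qquad(1)$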

Throughout, we write $\|\cdot\|=\|\cdot\|_{2}$ . We can equivalently find

$\bm{w}^{*}=\operatorname*{argmin}_{\bm{w}}\|\bm{w}\|\quad\text{subject to}\quad\mathcal{L}(\bm{w})=0,$

where

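$\mathcal{L}(\bm{w}):=\frac{1}{n}\sum_{j=1}^{n}h(y_{j}\bm{x}_{j}^{\top}\bm{w})\quad\text{and}\quad h(u):=\max(0,1-u).$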

The function $h(u)$ is commonly referred to as the hinge loss. We assume throughout that the data is linearly separable, i.e. there exists a vector $\bm{w}$ satisfying $\mathcal{L}(\bm{w})=0$, as is done in [SHN+18, NLG+19, WGC19, BGMS18, NSS19, RZH04]. This assumption is common and necessary in order to discuss the margin of the approximated solutions. Minimizing the norm of the solution $\bm{w}$ to $\mathcal{L}(\bm{w})=0$ corresponds to maximizing the margin, that is, maximizing the minimal distance between any data point and the separating hyperplane determined by $\bm{w}$. In this setting, the solution to Equation 1, $\bm{w}^{*}$, is often referred to as the hard-margin SVM solution.
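As a concrete illustration of the definitions above, the following minimal sketch evaluates the hinge loss $h$ and the empirical loss $\mathcal{L}$ with NumPy. The function names and the four-point toy dataset are assumptions made for this example and do not come from the paper.

```python
import numpy as np

def hinge(u):
    """Hinge loss h(u) = max(0, 1 - u), applied elementwise."""
    return np.maximum(0.0, 1.0 - u)

def empirical_hinge_loss(w, X, y):
    """L(w) = (1/n) * sum_j h(y_j * x_j^T w) for rows x_j of X and labels y_j."""
    return hinge(y * (X @ w)).mean()

# Hypothetical toy data, chosen only so the margins are easy to read off.
X = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Every margin y_j x_j^T w equals 1 here, so the loss vanishes.
loss_separating = empirical_hinge_loss(np.array([0.5, 0.5]), X, y)
# Every margin equals 0.2 here, so each point contributes 1 - 0.2.
loss_too_short = empirical_hinge_loss(np.array([0.1, 0.1]), X, y)
print(loss_separating, loss_too_short)
```

A vector can separate the data in sign (all margins positive) and still have $\mathcal{L}(\bm{w})>0$ when its norm is too small, which is why the constraint $\mathcal{L}(\bm{w})=0$ matches the margin constraints $y_{j}\bm{x}_{j}^{\top}\bm{w}\geq 1$.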

The constrained optimization problem in Equation 1 is the primal formulation of an SVM. While solving or approximating the corresponding dual SVM formulation is popular in practice, there are advantages to approximating the primal problem directly [Cha07]. Of particular interest for this work, considering the primal formulation allows for straightforward analysis of the effect of the optimization error on the margin and hyperplane angle.

As an alternative to solving Equation 1 directly, one often looks for a solution to an unconstrained, regularized version. Define the functional:

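$F_{\lambda}(\bm{w}):=\frac{\lambda}{2}\|\bm{w}\|^{2}+\frac{1}{n}\sum_{j=1}^{n}h(y_{j}\bm{x}_{j}^{\top}\bm{w}).\qquad(2)$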

For $\lambda>0$ , $F_{\lambda}$ is strongly convex with strong convexity parameter $\lambda$ . We will use $\partial F$ to denote the subgradient of $F$ . The gradient of $F_{\lambda}(\bm{w})$ exists as long as $y_{j}\bm{x}_{j}^{\top}\bm{w}\neq 1$ for all $j$ and is given by

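$\partial F_{\lambda}(\bm{w})=\lambda\bm{w}-\frac{1}{n}\sum_{j:\,y_{j}\bm{x}_{j}^{\top}\bm{w}<1}y_{j}\bm{x}_{j}.\qquad(3)$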

When $y_{j}\bm{x}_{j}^{\top}\bm{w}=1$ for some $j$ , the subgradient set $\partial F_{\lambda}(\bm{w})$ contains the point

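$\lambda\bm{w}-\frac{1}{n}\sum_{j:\,y_{j}\bm{x}_{j}^{\top}\bm{w}<1}y_{j}\bm{x}_{j}\in\partial F_{\lambda}(\bm{w}).$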

When the gradient does not exist, we will abuse notation and use Equation 3 in the subgradient method update of Equation 5.
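The subgradient choice described above can be sketched in a few lines. This is an illustrative implementation under assumed names (`F`, `subgradient_F`) and hypothetical random data; the check at the end relies only on the standard fact that a valid subgradient $g$ of a convex function satisfies $F(\bm{v})\geq F(\bm{w})+g^{\top}(\bm{v}-\bm{w})$ for every $\bm{v}$.

```python
import numpy as np

def F(w, X, y, lam):
    """Regularized hinge objective F_lambda(w)."""
    return 0.5 * lam * (w @ w) + np.maximum(0.0, 1.0 - y * (X @ w)).mean()

def subgradient_F(w, X, y, lam):
    """The element lam*w - (1/n) * sum_{j: y_j x_j^T w < 1} y_j x_j of the subdifferential."""
    active = y * (X @ w) < 1.0  # strict inequality, as in the text
    return lam * w - X[active].T @ y[active] / len(y)

# Hypothetical random data, used only to sanity-check the subgradient inequality.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.where(rng.random(20) < 0.5, -1.0, 1.0)
w = rng.normal(size=3)
g = subgradient_F(w, X, y, lam=0.1)

# Smallest value of F(v) - F(w) - g^T (v - w) over random test points v.
worst_slack = min(
    F(v, X, y, 0.1) - F(w, X, y, 0.1) - g @ (v - w)
    for v in rng.normal(size=(200, 3))
)
print(worst_slack)
```

Points with margin exactly 1 are left out of the sum, which corresponds to picking the zero element of the hinge subdifferential at its kink.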

Let

$\bm{w}_{\lambda}^{*}:=\operatorname*{argmin}_{\bm{w}}F_{\lambda}(\bm{w}).\qquad(4)$

We will refer to $\bm{w}_{\lambda}^{*}$ as the solution to the regularized subproblem of minimizing Equation 2. A larger regularization parameter $\lambda$ encourages a solution $\bm{w}_{\lambda}^{*}$ with smaller norm at the cost of having some points lie within the margin. For linearly separable data and as $\lambda$ approaches 0, the regularized solutions $\bm{w}_{\lambda}^{*}$ converge to the unregularized solution, $\bm{w}^{*}$ . Let $\lambda^{\prime}>0$ be such that $\bm{w}_{\lambda}^{*}=\bm{w}^{*}$ for all $\lambda\leq\lambda^{\prime}$ . Such a $\lambda^{\prime}$ is guaranteed to exist for linearly separable data [RZH04, HRTZ04]. This fact suggests solving Equation 2 by using the subgradient method for a sufficiently small value of $\lambda$ . Of course, the value of $\lambda^{\prime}$ will typically be unknown.

We use the following assumption and definition of $\lambda^{\prime}$ throughout.

Assumption 2.1.

The data $\bm{x}_{1},\ldots,\bm{x}_{n}\in\mathbb{R}^{d}$ with labels $y_{1},\ldots,y_{n}\in\{-1,1\}$ are linearly separable, i.e. there exists $\bm{w}$ such that for all $i$ , $y_{i}\bm{w}^{\top}\bm{x}_{i}>0$ . Let $\bm{w}^{*}$ be the hard-margin SVM (i.e. $\bm{w}^{*}$ solves Equation 1) and $\lambda^{\prime}$ be such that for all $\lambda\leq\lambda^{\prime}$ , $\bm{w}_{\lambda}^{*}=\operatorname*{argmin}F_{\lambda}=\bm{w}^{*}$ .

While in practice, one may be satisfied with the solution $\bm{w}_{\lambda}^{*}$ for $\lambda$ sufficiently small, we are interested in the convergence to the true hard-margin SVM given by $\bm{w}^{*}$ . Thus, we instead propose to use a “homotopic” variant of the subgradient method that iteratively approximates the solution to Equation 2 while the regularization parameter $\lambda$ and accompanying step size $\eta$ of the subgradient method in Equation 5 decay at prescribed rates. Incorporating a piecewise constant decaying step size is commonly used for large-scale minimization problems, especially when using stochastic gradient descent variants [BCN18].

Recall the subgradient method given by the updates:

$\bm{w}_{k+1}=\bm{w}_{k}-\eta_{k}\partial F_{\lambda}(\bm{w}_{k}),\qquad(5)$

where $\bm{w}_{k}$ is the approximate solution at iteration $k$ and $\eta_{k}$ is a step size. For some number of outer iterations $s=1,\dots,S$ , we choose a regularization parameter $\lambda_{s}>0$ , a step size $\eta_{s}>0$ and a number of inner iterations $t_{s}$ . The regularization parameter $\lambda_{s}$ and step size $\eta_{s}$ are selected such that they decrease to 0 as $s$ increases. Let $\overline{\bm{w}}_{s-1}$ be the current estimate of $\bm{w}^{*}$ . We then perform $t_{s}$ subgradient updates applied to the loss function $F_{\lambda_{s}}$ with initial iterate $\overline{\bm{w}}_{s-1}$ and step size $\eta_{s}$ . The next estimate, $\overline{\bm{w}}_{s}$ , is given by the average of the $t_{s}$ subgradient iterates. This process is detailed in Algorithm 1. For specific choices of $\lambda_{s},$ $\eta_{s}$ and $t_{s}$ , Algorithm 1 converges to the hard-margin SVM solution $\bm{w}^{*}$ . Convergence guarantees are detailed in Theorem 3.1.
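The outer/inner loop structure just described can be sketched as follows. This is not the authors' reference implementation: the function name is hypothetical, and the schedules $\lambda_{s}=(s_{0}+s)^{-p}$, $t_{s}=(s_{0}+s)^{r}$ and $\eta_{s}=C(s_{0}+s-1)^{-\alpha}/\sqrt{t_{s}}$ are taken from the statements of Theorem 3.1 and Lemma 4.3 later in the text. The demo data consists only of the four support vectors named in Section 5, not the full synthetic dataset, which is not listed in this text.

```python
import numpy as np

def homotopic_subgradient(X, y, S, s0=10, p=0.5, r=2.0):
    """Sketch of Algorithm 1: averaged subgradient steps with shrinking regularization."""
    n, d = X.shape
    eps0 = (np.log(s0) - np.log(s0 - 1)) / np.log(s0)
    alpha = min((r - 2 * p) / (2 * (1 + eps0)), 1 - p)
    C = max(4.0, 0.5 * s0**p * (s0 - 1) ** alpha)
    w_bar = np.zeros(d)                      # initial estimate w_0 = 0
    for s in range(1, S + 1):
        lam = (s0 + s) ** (-p)               # regularization lambda_s
        t = int(round((s0 + s) ** r))        # number of inner iterations t_s
        eta = C * (s0 + s - 1) ** (-alpha) / np.sqrt(t)  # step size eta_s
        w = w_bar.copy()                     # warm start from the previous average
        total = np.zeros(d)
        for _ in range(t):
            active = y * (X @ w) < 1.0
            grad = lam * w - X[active].T @ y[active] / n
            w = w - eta * grad
            total += w
        w_bar = total / t                    # average of the t_s inner iterates
    return w_bar

# Only the four support vectors from Section 5 (an assumption of this demo).
X_sv = np.array([[0.5, 1.5], [1.5, 0.5], [-0.5, -1.5], [-1.5, -0.5]])
y_sv = np.array([1.0, 1.0, -1.0, -1.0])
w_hat = homotopic_subgradient(X_sv, y_sv, S=20)
print(w_hat, np.linalg.norm(w_hat - np.array([0.5, 0.5])))
```

Warm-starting each stage from the previous average is what lets the initial error of each subproblem shrink from stage to stage; restarting from zero at every stage would discard that progress.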

While the strongly convex functions $F_{\lambda}$ are not globally Lipschitz, they are Lipschitz functions on bounded domains. Using a projected subgradient method in which iterates are projected onto a bounded domain is a natural strategy for restricting the domain of the iterates. A projection is unnecessary in this setting, however, as the regularization parameter $\lambda>0$ naturally promotes solutions of smaller norm. In fact, the iterates produced by the subgradient method in Algorithm 1 remain bounded in norm with a bound that depends on the current regularization parameter $\lambda$ .

Lemma 2.2.

Fix a regularization parameter $\lambda>0$ and step size $\eta>0$ such that $\eta\lambda<1$ . Define

$B_{\lambda}:=\frac{\sum_{j=1}^{n}\|\bm{x}_{j}\|}{\lambda n}.\qquad(6)$

If the initial iterate $\bm{w}_{0}$ is such that $\lVert{\bm{w}_{0}}\rVert\leq B_{\lambda}$ , then each iterate $\bm{w}_{k}$ produced by the subgradient method of Equation 3 applied to the function $F_{\lambda}$ of Equation 2 has $\lVert{\bm{w}_{k}}\rVert\leq B_{\lambda}$ . Additionally, $\lVert{\bm{w}^{*}}\rVert\leq B_{\lambda}$ .

In summary, if the initial iterate $\bm{w}_{0}$ is such that $\lVert{\bm{w}_{0}}\rVert\leq B_{\lambda}$ , then the iterates produced by the subgradient method applied to $F_{\lambda}$ will also have norm less than or equal to $B_{\lambda}$ .
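One way to see why the bound of Lemma 2.2 is preserved from one iterate to the next is the short induction step below. It is a sketch and not necessarily the argument the authors use. It writes a single update as $\bm{w}_{k+1}=(1-\eta\lambda)\bm{w}_{k}+\frac{\eta}{n}\sum_{j:\,y_{j}\bm{x}_{j}^{\top}\bm{w}_{k}<1}y_{j}\bm{x}_{j}$, assumes $\|\bm{w}_{k}\|\leq B_{\lambda}$ with $B_{\lambda}=\frac{\sum_{j}\|\bm{x}_{j}\|}{\lambda n}$, and uses $0<\eta\lambda<1$ and $|y_{j}|=1$.

```latex
\begin{align*}
\|\bm{w}_{k+1}\|
  &\leq (1-\eta\lambda)\,\|\bm{w}_{k}\| + \frac{\eta}{n}\sum_{j=1}^{n}\|\bm{x}_{j}\| \\
  &\leq (1-\eta\lambda)\,B_{\lambda} + \eta\lambda\,B_{\lambda}
   = B_{\lambda},
\end{align*}
% the second line uses (1/n) \sum_j \|x_j\| = \lambda B_\lambda.
```

The condition $\eta\lambda<1$ is what keeps the coefficient $(1-\eta\lambda)$ nonnegative, so the update is a convex combination of two quantities that are each bounded by $B_{\lambda}$.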

Remark.

Using Lemma 2.2, one can show that the functionals $F_{\lambda}$ are Lipschitz over the domain of iterates produced by Algorithm 1. Specifically, the constant

$L:=\frac{2}{n}\sum_{j=1}^{n}\|\bm{x}_{j}\|\qquad(7)$

bounds the Lipschitz constants of each function $F_{\lambda}$ restricted to the ball centered at the origin with radius $B_{\lambda}$. Lemma 2.2 guarantees that the iterates produced when applying the subgradient method to $F_{\lambda}$ with a sufficiently small initial iterate remain within this domain. Note that the bound $L$ on the Lipschitz constants is independent of the regularization parameter $\lambda$.

3 Main Results

We now provide explicit rates of convergence to the hard-margin SVM solution for Algorithm 1. We provide convergence rates in terms of the $L_{2}$ error, difference in angle, and difference in margin between the approximation $\overline{\bm{w}}_{S}$ and the true hard-margin solution. The convergence results are stated in terms of $k$ , the total number of subgradient updates required. Recall that the approximations $\overline{\bm{w}}_{s}$ are only updated at increments of $t_{s}$ subgradient updates. Let $\bm{z}_{k}=\overline{\bm{w}}_{s}$ , so that $\bm{z}_{k}$ is the approximation after $k=\sum_{i=1}^{s}t_{i}$ subgradient calculations.

Theorem 3.1 provides a convergence guarantee for the $L_{2}$ error of the iterates produced by Algorithm 1. This result will be used to additionally derive convergence guarantees for the angle and margin of the solution in Lemma 3.4. The parameter $p$ determines the rate of decay of the regularization $\lambda_{s}$ and the parameter $r$ determines the number of steps $t_{s}$ used at each fixed level of regularization. The constant $L$ is as defined in Equation 7 and is an upper bound on the Lipschitz constants of the functions $F_{\lambda}$ restricted to the domain of the iterates produced by the subgradient method applied to $F_{\lambda}$ (Lemma 2.2).

Theorem 3.1.

Consider Algorithm 1 with parameters $r$ and $p$ such that $0<p<1$ and $r>2p$ . Choose an initial number of inner iterations $s_{0}^{r}\in\mathbb{N}$ with $s_{0}>2$ . Let $L=2\frac{\sum_{j=1}^{n}\lVert{\bm{x}_{j}}\rVert}{n}$ as defined in Equation 7. Define

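$C=\max\left\{4,\tfrac{1}{2}s_{0}^{p}(s_{0}-1)^{\alpha}\right\}\quad\text{and}\quad\alpha=\min\left(\frac{r-2p}{2(1+\epsilon_{0})},1-p\right),$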

with $\epsilon_{0}=\frac{\log(s_{0})-\log(s_{0}-1)}{\log(s_{0})}.$ Let $\bm{z}_{k}$ be the average of the $t_{s}$ subgradient descent updates calculated to minimize the function $F_{\lambda_{s}}$ with step size $\eta_{s}=\frac{C(s_{0}+s-1)^{-\alpha}}{\sqrt{t_{s}}}$ , where $k$ is the total number of subgradient descent updates calculated. Then for data and $\lambda^{\prime}{}$ satisfying Assumption 2.1,

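$\|\bm{z}_{k}-\bm{w}^{*}\|\leq CL\left((r+1)k\right)^{\frac{-\alpha(1-\epsilon_{0})}{r+1}}+\frac{L}{2(\lambda^{\prime})^{2}}\left((r+1)k\right)^{\frac{-p}{r+1}}.\qquad(8)$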

Let $c=\min\left(\frac{\alpha(1-\epsilon_{0})}{r+1},\frac{p}{r+1}\right)$ . Then

[TABLE]

An outline for a proof of Theorem 3.1 can be found in Section 4 with additional details in Appendix A. Note that, for small $\epsilon_{0}$ , the two terms in the bound of Equation 8 will decrease at approximately the same rate if $r=2$ and $p=1/2$ . Corollary 3.2 gives a simpler, explicit rate of convergence by making this specification and setting $s_{0}=10$ .

Corollary 3.2.

Consider Algorithm 1 with parameters $r=2$ , $p=1/2$ and an initial number of inner iterations $s_{0}^{r}=s_{0}^{2}\in\mathbb{N}$ with $s_{0}>2$ . Let $L=2\frac{\sum_{j=1}^{n}\lVert{\bm{x}_{j}}\rVert}{n}$ . Define

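$C=\max\left\{4,\tfrac{1}{2}s_{0}^{p}(s_{0}-1)^{\alpha}\right\}\quad\text{and}\quad\alpha=\min\left(\frac{r-2p}{2(1+\epsilon_{0})},1-p\right),$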

with $\epsilon_{0}=\frac{\log(s_{0})-\log(s_{0}-1)}{\log(s_{0})}.$ Let $\bm{z}_{k}$ be the average of the $t_{s}$ subgradient descent updates calculated for $F_{\lambda_{s}}$ with step size $\eta_{s}=\frac{C(s_{0}+s-1)^{-\alpha}}{\sqrt{t_{s}}}$ , where $k$ is the total number of subgradient descent updates calculated. Then for data and $\lambda^{\prime}{}$ satisfying Assumption 2.1,

[TABLE]

Choosing $s_{0}=10$ , we have $\epsilon_{0}<0.046$ , $C<4.9$ and arrive at the convergence rate

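$\|\bm{z}_{k}-\bm{w}^{*}\|\leq 4.17\,L\,k^{\frac{-0.913}{6}}+\frac{0.42\,L\,k^{-1/6}}{(\lambda^{\prime})^{2}}.$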

At least theoretically, sending $s_{0}\to\infty$ leads to the best convergence rate guarantee. In fact, the convergence rate provided by Theorem 3.1 can be made arbitrarily close to $O\left(k^{-1/6}\right)$ by choosing $r=2$ , $p=1/2$ , and $s_{0}$ sufficiently large. As we will see in Section 5, using $s_{0}$ extremely large becomes impractical as the number of iterations for each fixed- $\lambda$ subproblem becomes extremely large.
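The constants quoted for $s_{0}=10$ can be reproduced directly from the definitions of $\epsilon_{0}$, $\alpha$ and $C$ in Theorem 3.1. The helper below is a small sketch with a hypothetical function name; it also returns the exponent $c=\min\left(\frac{\alpha(1-\epsilon_{0})}{r+1},\frac{p}{r+1}\right)$ so that its distance from $1/6$ can be inspected for different $s_{0}$.

```python
import math

def theorem_constants(s0, p=0.5, r=2.0):
    """Return (eps0, alpha, C, c) as defined in Theorem 3.1."""
    eps0 = (math.log(s0) - math.log(s0 - 1)) / math.log(s0)
    alpha = min((r - 2 * p) / (2 * (1 + eps0)), 1 - p)
    C = max(4.0, 0.5 * s0**p * (s0 - 1) ** alpha)
    c = min(alpha * (1 - eps0) / (r + 1), p / (r + 1))
    return eps0, alpha, C, c

eps0, alpha, C, c = theorem_constants(10)
print(f"s0=10:   eps0={eps0:.4f}  alpha={alpha:.4f}  C={C:.3f}  c={c:.4f}")

# Larger s0 pushes the exponent c toward 1/6, at the price of more inner iterations.
_, _, _, c_large = theorem_constants(1000)
print(f"s0=1000: c={c_large:.5f}  (1/6={1/6:.5f})")
```

Since $t_{s}=(s_{0}+s)^{r}$, the very first stage already costs about $s_{0}^{2}$ subgradient evaluations when $r=2$, which is the practical cost of a large $s_{0}$ mentioned above.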

For strongly-convex, Lipschitz functions with strong-convexity parameter $\lambda$ , one can achieve convergence in $\|\bm{w}-\bm{w}^{*}\|$ at a rate of $O\left(\lambda^{-1}k^{-1/4}\right)$ , using projected averaged gradient descent with fixed step sizes (Theorem 3.2 [Bub14]). Using weighted step sizes, and knowledge of the strong convexity parameter, this rate can be improved to $O\left(\lambda^{-1}k^{-1/2}\right)$ (Theorem 3.9 [Bub14], originally from [LJSB12]). A challenge of solving for the hard-margin SVM is that we do not optimize a strongly convex function. While one could fix a regularization parameter $\lambda$ leading to a strongly convex function, there is no guarantee that the minimizer of this function $F_{\lambda}$ will correspond to the true solution $\bm{w}^{*}$ . Since the convergence rate of Algorithm 1 can be made arbitrarily close to $O\left((\lambda^{\prime})^{-2}k^{-1/6}\right)$ we lose very little, only a factor of $O\left((\lambda^{\prime})^{-1}k^{1/12+\delta}\right)$ compared to the convergence rate of projected averaged gradient descent with fixed step sizes, for not knowing $\lambda^{\prime}$ in advance and instead incorporating decreasing explicit regularization.

Additionally, in designing Algorithm 1, we aimed for a simple algorithm as opposed to optimizing all possible parameters. One could possibly improve on the rates given here by further optimizing these parameters.

3.1 Convergence rates for angle and margin gaps

The convergence rate in Theorem 3.1 can be used to derive rates of convergence to the angle and margin of the optimal separating hyperplane $\bm{w}^{*}$ .

Definition 3.3.

For the hard margin SVM solution $\bm{w}^{*}$ and a vector $\bm{w}$ , define

$\text{angle gap}:=1-\frac{\bm{w}^{\top}\bm{w}^{*}}{\|\bm{w}\|\,\|\bm{w}^{*}\|}$

and

$\text{margin gap}:=\frac{1}{\|\bm{w}^{*}\|}-\min_{i}\frac{y_{i}\bm{x}_{i}^{\top}\bm{w}}{\|\bm{w}\|}.$

While it is natural to consider the $L_{2}$ error of the derived solution, the angle between the true and derived solutions as well as the difference in the size of the margins give a more intuitive interpretation of the effect of that error. For example, an approximate solution $\bm{w}$ that is off by a constant factor, that is $\bm{w}=c\bm{w}^{*}$ , will have an angle gap of zero and non-zero margin gap if $c\neq 1$ . If an approximate solution $\bm{w}$ has a nonzero angle gap, but negligible margin gap, this suggests that the derived solution $\bm{w}$ still separates the data reasonably well.
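The two quantities of Definition 3.3 are straightforward to compute. The sketch below uses assumed function names and, as data, only the four support vectors and the solution $\bm{w}^{*}=(0.5,0.5)$ reported for the synthetic example in Section 5; the candidate vector $(1,0)$ is an arbitrary choice for illustration.

```python
import numpy as np

def angle_gap(w, w_star):
    """1 - cos(angle between w and w_star)."""
    return 1.0 - (w @ w_star) / (np.linalg.norm(w) * np.linalg.norm(w_star))

def margin_gap(w, w_star, X, y):
    """Optimal margin 1/||w_star|| minus the normalized margin achieved by w."""
    return 1.0 / np.linalg.norm(w_star) - np.min(y * (X @ w)) / np.linalg.norm(w)

# Support vectors and hard-margin solution from Section 5.
X_sv = np.array([[0.5, 1.5], [1.5, 0.5], [-0.5, -1.5], [-1.5, -0.5]])
y_sv = np.array([1.0, 1.0, -1.0, -1.0])
w_star = np.array([0.5, 0.5])

w_candidate = np.array([1.0, 0.0])  # hypothetical approximate solution
print(angle_gap(w_candidate, w_star), margin_gap(w_candidate, w_star, X_sv, y_sv))
```

The candidate $(1,0)$ still separates the four points, but its normalized margin is $0.5$ rather than the optimal $1/\|\bm{w}^{*}\|=\sqrt{2}$, so both gaps are strictly positive.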

Convergence rates of Algorithm 1 in terms of the angle and margin gaps are stated in Lemma 3.4 and compared to other recently obtained convergence rates in Table 1. The rates of convergence in these metrics can be derived from Theorem 3.1. These arguments are included in Appendix A.

Lemma 3.4.

Let

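$c=\min\left(\frac{(r-2p)(1-\epsilon_{0})}{2(r+1)(1+\epsilon_{0})},\frac{(1-p)(1+\epsilon_{0})}{r+1},\frac{p}{r+1}\right),$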

where $p,r,s_{0}$ , and $\epsilon_{0}$ are as given in Theorem 3.1 so that $c$ is the exponent in the convergence rate of Theorem 3.1. Let $\delta$ be such that $c=1/6-\delta$ . The value of $\delta$ is positive and can be made arbitrarily close to 0 by choosing $s_{0}$ sufficiently large and setting $p=1/2$ and $r=2.$ Then for the angle gap,

$1-\frac{\bm{w}_{k}^{\top}\bm{w}^{*}}{\|\bm{w}_{k}\|\,\|\bm{w}^{*}\|}=O\left(k^{-1/3+2\delta}\right).$

For the margin gap,

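$\frac{1}{\|\bm{w}^{*}\|}-\min_{i}\frac{y_{i}\bm{x}_{i}^{\top}\bm{w}_{k}}{\|\bm{w}_{k}\|}=O\left(k^{-1/6+\delta}\right).$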

The convergence guarantees for the angle and margin gaps for Algorithm 1 are significantly faster than those given in Soudry et al. [SHN+18] for gradient descent with fixed step sizes applied to the logistic loss (see Table 1). Nacson et al. [NLG+19] demonstrate that using aggressive adaptive step sizes for gradient descent applied to the logistic loss leads to a faster convergence rate of $O\left(\frac{\log(t)}{\sqrt{t}}\right)$. While the convergence guarantees for Algorithm 1 are slower, as $c\leq 1/6$, in this paper we are interested in analyzing convergence guarantees for gradient descent applied to the non-smooth hinge loss.

4 Proof of Theorem 3.1

We prove Theorem 3.1 through a series of lemmas, which are stated in Subsection 4.1 and whose proofs are contained in Appendix A. The proof of Theorem 3.1 is contained in Subsection 4.2.

We briefly summarize each of the lemmas for convenience. Lemma 4.1 provides a modified convergence guarantee for the averaged subgradient method applied to the functions $F_{\lambda}$ . Lemma 4.2 bounds the distance between minimizers of $F_{\lambda}$ for different regularization parameters $\lambda$ . This result allows for the incorporation of the decreasing regularization in Algorithm 1. Lemma 4.3 makes use of Lemma 4.1 and Lemma 4.2 to bound the initial error $\lVert{\overline{\bm{w}}_{s}-\bm{w}_{\lambda}^{*}}\rVert$ of each regularized subproblem as given in Equation 4.

4.1 Useful lemmas

Lemma 4.1 is a modified version of a standard convergence analysis of the averaged subgradient method for convex Lipschitz functions (Theorem 3.2 of [B+15]). This result bounds the distance between the average of the subgradient descent iterates $\overline{\bm{w}}$ and the minimizer $\bm{w}_{\lambda}^{*}$ of the functional $F_{\lambda}$ for a fixed regularization parameter $\lambda$.

Lemma 4.1.

Let

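$F_{\lambda}(\bm{w})=\frac{\lambda}{2}\|\bm{w}\|^{2}+\frac{1}{n}\sum_{j=1}^{n}\max(0,1-y_{j}\bm{x}_{j}^{\top}\bm{w})$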

and $L=2\frac{\sum_{j=1}^{n}\lVert{\bm{x}_{j}}\rVert}{n}$ . Let the initial iterate $\bm{w}_{0}$ be such that $\lVert{\bm{w}_{0}}\rVert\leq\frac{L}{2\lambda}$ and let $\bm{w}_{\lambda}^{*}$ minimize $F_{\lambda}$ . Suppose $\|\bm{w}_{0}-\bm{w}_{\lambda}^{*}\|\leq R$ , so that $\bm{w}_{\lambda}^{*}$ is contained in a ball of radius $R$ and center $\bm{w}_{0}$ . Let $\overline{\bm{w}}=\frac{1}{t}\sum_{s=1}^{t}\bm{w}_{s}$ be the average of $t$ subgradient method iterates with initial iterate $\bm{w}_{0}$ and step size $\eta=\frac{R}{L\sqrt{t}}$ . Then

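$0\leq F_{\lambda}(\overline{\bm{w}})-F_{\lambda}(\bm{w}_{\lambda}^{*})\leq\frac{RL}{\sqrt{t}}-\frac{\lambda}{2}\|\overline{\bm{w}}-\bm{w}_{\lambda}^{*}\|^{2}.$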

Note that Lemma 4.1 also guarantees that

$\|\overline{\bm{w}}-\bm{w}_{\lambda}^{*}\|^{2}\leq\frac{2RL}{\lambda\sqrt{t}}.$
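The step from the function-value bound of Lemma 4.1 to the distance bound is a one-line rearrangement. In the sketch below, $\varepsilon_{t}$ is shorthand introduced only for this illustration; it stands for the first term on the right-hand side of the lemma's bound.

```latex
\begin{align*}
0 \;\leq\; \varepsilon_{t} - \frac{\lambda}{2}\,\|\overline{\bm{w}}-\bm{w}_{\lambda}^{*}\|^{2}
\quad\Longrightarrow\quad
\|\overline{\bm{w}}-\bm{w}_{\lambda}^{*}\|^{2} \;\leq\; \frac{2\,\varepsilon_{t}}{\lambda}.
\end{align*}
```

The factor $1/\lambda$ shows why the strong convexity of $F_{\lambda}$ is needed to convert a guarantee on function values into a guarantee on the iterates, and why that conversion degrades as the regularization shrinks.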

The next lemma bounds the distance between the minimizers $\bm{w}_{\lambda}^{*}$ and $\bm{w}_{\widetilde{\lambda}}^{*}$ of the functions $F_{\lambda}$ and $F_{\widetilde{\lambda}}$ and shows that the distance from $\bm{w}_{\lambda}^{*}$ to the true hard-margin solution $\bm{w}^{*}$, $\|\bm{w}_{\lambda}^{*}-\bm{w}^{*}\|$, is at most proportional to the regularization parameter $\lambda$.

Lemma 4.2.

Let $\bm{w}_{\lambda}^{*}$ minimize $F_{\lambda}$ as given in Equation 2 and let $\bm{w}^{*}$ solve Equation 1. Let $\lambda^{\prime}>0$ be such that $\bm{w}_{\lambda}^{*}=\bm{w}^{*}$ for all $\lambda\leq\lambda^{\prime}$ and $L=2\frac{\sum_{j=1}^{n}\lVert{\bm{x}_{j}}\rVert}{n}$ . For $\lambda,\widetilde{\lambda}\geq 0$ and data satisfying Assumption 2.1, we have

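$\|\bm{w}_{\lambda}^{*}-\bm{w}_{\widetilde{\lambda}}^{*}\|\leq\frac{L}{2}\left|\frac{1}{\lambda}-\frac{1}{\widetilde{\lambda}}\right|$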

and

$\|\bm{w}_{\lambda}^{*}-\bm{w}^{*}\|\leq\frac{L\lambda}{2(\lambda^{\prime})^{2}}.$
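One way to connect the two bounds of Lemma 4.2 is sketched below. It is not necessarily the route taken in Appendix A. It takes the first bound, $\|\bm{w}_{\lambda}^{*}-\bm{w}_{\widetilde{\lambda}}^{*}\|\leq\frac{L}{2}\left|\frac{1}{\lambda}-\frac{1}{\widetilde{\lambda}}\right|$, sets $\widetilde{\lambda}=\lambda^{\prime}$ so that $\bm{w}_{\lambda^{\prime}}^{*}=\bm{w}^{*}$, and considers $\lambda>\lambda^{\prime}$ (for $\lambda\leq\lambda^{\prime}$ the left-hand side is zero by the definition of $\lambda^{\prime}$).

```latex
\begin{align*}
\|\bm{w}_{\lambda}^{*}-\bm{w}^{*}\|
  = \|\bm{w}_{\lambda}^{*}-\bm{w}_{\lambda^{\prime}}^{*}\|
  &\leq \frac{L}{2}\left(\frac{1}{\lambda^{\prime}}-\frac{1}{\lambda}\right)
   = \frac{L}{2}\cdot\frac{\lambda-\lambda^{\prime}}{\lambda\,\lambda^{\prime}} \\
  &\leq \frac{L}{2}\cdot\frac{\lambda}{(\lambda^{\prime})^{2}},
\end{align*}
% last step: \lambda'(\lambda - \lambda') \le \lambda'\lambda \le \lambda^2 whenever \lambda \ge \lambda'.
```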

The final lemma bounds the initial error at each fixed level of regularization for the subgradient updates produced when minimizing $F_{\lambda_{s}}$ . In particular, it specifies a bound shrinking in $s$ on the distance between the initial iterate $\overline{\bm{w}}_{s}$ and the minimizer $\bm{w}_{\lambda_{s}}^{*}$ of the function $F_{\lambda_{s}}$ . The fact that the initial error for each regularized subproblem goes to zero is crucial for proving the convergence of Algorithm 1 to the hard margin SVM solution.

Lemma 4.3.

Let $L=2\frac{\sum_{j=1}^{n}\lVert{\bm{x}_{j}}\rVert}{n}$ and $R_{0}=\frac{L}{2\lambda_{0}}$ . For $s_{0}\in\mathbb{N}$ with $s_{0}>2$ , $p\in(0,1)$ , and $r>2p$ , let $\lambda_{s}=(s_{0}+s)^{-p}$ and $t_{s}=(s_{0}+s)^{r}$ . Let

[TABLE]

with

[TABLE]

Let $\eta_{s}=\frac{R_{s}}{L\sqrt{t_{s}}}$ . Then for the averaged subgradient iterates $\overline{\bm{w}}_{s}$ of Algorithm 1,

[TABLE]

Based on Lemma 4.3, for $r>2p$ and $p<1$ the radii $R_{s}$ shrink to 0 as $s$ increases.
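As a small illustration of the schedules used in Lemma 4.3, the sketch below tabulates $\lambda_{s}=(s_{0}+s)^{-p}$ and $t_{s}=(s_{0}+s)^{r}$ together with the cumulative number of subgradient updates through stage $s$. The function name and the rounding of $t_{s}$ up to an integer for non-integer $r$ are assumptions made for the example.

```python
import math


def schedule(s0, p, r, num_stages):
    """Return (lambda_s, t_s, cumulative updates) for s = 0, ..., num_stages - 1."""
    rows, total = [], 0
    for s in range(num_stages):
        lam = (s0 + s) ** (-p)
        t = math.ceil((s0 + s) ** r)  # rounding up is an assumption for non-integer r
        total += t
        rows.append((lam, t, total))
    return rows


rows = schedule(s0=3, p=0.5, r=2, num_stages=5)
for s, (lam, t, total) in enumerate(rows):
    print(f"s={s}  lambda_s={lam:.4f}  t_s={t}  cumulative updates={total}")
```

With $p=1/2$ and $r=2$, the regularization decays slowly while the number of inner iterations grows quadratically, so most of the total update budget is spent at the later, weakly regularized stages.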

4.2 Proof of Theorem 3.1.

We now prove Theorem 3.1 using the above lemmas.

Proof.

We use the triangle inequality to bound the error as

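$$\lVert{\overline{\bm{w}}_{s}-\bm{w}^{*}}\rVert\leq\lVert{\overline{\bm{w}}_{s}-\bm{w}_{\lambda_{s}}^{*}}\rVert+\lVert{\bm{w}_{\lambda_{s}}^{*}-\bm{w}^{*}}\rVert.\qquad(13)$$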

We then bound the terms $\lVert{\overline{\bm{w}}_{s}-\bm{w}_{\lambda_{s}}^{*}}\rVert$ and $\|\bm{w}_{\lambda_{s}}^{*}-\bm{w}^{*}\|$ using the lemmas of Subsection 4.1.

Let $L=2\frac{\sum_{j=1}^{n}\lVert{\bm{x}_{j}}\rVert}{n}$ and choose $s_{0}\in\mathbb{N}$ with $s_{0}>2$ . Let $\lambda_{s}=(s_{0}+s)^{-p}$ and $t_{s}=(s_{0}+s)^{r}$ . Let

[TABLE]

with

[TABLE]

Let $\eta_{s}=\frac{R_{s}}{L\sqrt{t_{s}}}$ . By Lemma 4.3, considering the first term in the bound of Equation 13,

[TABLE]

Changing the base,

[TABLE]

We now bound the second term of the bound in Equation 13. Let $\lambda^{\prime}>0$ be such that $\bm{w}_{\lambda}^{*}=\bm{w}^{*}$ for all $\lambda\leq\lambda^{\prime}$ . By Lemma 4.2,

[TABLE]

The total number of updates, $k$ , used to calculate $\overline{\bm{w}}_{s}$ is bounded by

[TABLE]

Rearranging,

[TABLE]

Writing the bounds in terms of the total number of updates, $k$ ,

[TABLE]

and

[TABLE]

Combining these,

[TABLE]

∎

In order to optimize the convergence rate given in Theorem 3.1, we aim to choose parameters $p$ and $r$ such that

[TABLE]

For $\epsilon_{0}$ small, $p=1/2$ and $r=2$ lead to a nearly optimal convergence rate of

[TABLE]

The choices $p=\tfrac{1}{2}$ and $r=2$ are considered in Corollary 3.2 and an explicit convergence rate is given under these conditions.

5 Experimental Results

We demonstrate the convergence of Algorithm 1 through several experiments on a simple synthetic dataset, shown in Figure 1. The experiments explore the differences between convergence in theory and in practice; they are not intended to be exhaustive or to demonstrate superior performance over existing methods. The data includes four support vectors, located at $\pm(0.5,1.5)$ and $\pm(1.5,0.5)$. The hard-margin SVM solution is $\bm{w}^{*}=(0.5,0.5)$. The maximal regularization parameter $\lambda^{\prime}$ such that $\bm{w}_{\lambda}^{*}=\bm{w}^{*}$ for all $\lambda\leq\lambda^{\prime}$ is $\lambda^{\prime}=0.5$. We fix the parameters $p=1/2$ and $r=2$, as considered in Corollary 3.2, and initialize $\bm{w}_{0}=\bm{0}$.
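The stated solution can be checked directly against the support vectors. The sketch below assumes labels $+1$ for $(0.5,1.5)$ and $(1.5,0.5)$ and $-1$ for their mirror images (the labels are not listed in the text), and verifies that each support vector satisfies $y_{j}\bm{x}_{j}^{\top}\bm{w}^{*}=1$, as expected for support vectors of the hard-margin problem.

```python
import math

# Support vectors from the text; the labels are an assumption.
support = [((0.5, 1.5), 1), ((1.5, 0.5), 1), ((-0.5, -1.5), -1), ((-1.5, -0.5), -1)]
w_star = (0.5, 0.5)

# Functional margins y_j * <x_j, w*>; each should equal 1 for a support vector.
margins = [y * (x[0] * w_star[0] + x[1] * w_star[1]) for x, y in support]

# Geometric margin of the separating hyperplane, 1 / ||w*||.
geometric_margin = 1 / math.hypot(*w_star)

print(margins)
print(geometric_margin)
```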

We measure convergence in terms of the $L_{2}$ error as well as the angle and margin gaps of Equation 10. Convergence results for Algorithm 1 with $p=1/2$ , $r=2$ and varying $s_{0}$ are shown in Figure 2. In terms of the $L_{2}$ error, for a fixed number of iterations, there appears to be an optimal choice for the parameter $s_{0}$ , as choosing $s_{0}=10$ performs better than $s_{0}=3,5$ or $20$ .
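For concreteness, the three metrics could be computed as in the sketch below. Equation 10 is not reproduced in this section, so the forms used here are assumptions: the angle gap is taken as $1-\frac{\bm{w}^{\top}\bm{w}^{*}}{\|\bm{w}\|\|\bm{w}^{*}\|}$ and the margin gap as $\frac{1}{\|\bm{w}^{*}\|}-\min_{j}\frac{y_{j}\bm{x}_{j}^{\top}\bm{w}}{\|\bm{w}\|}$. The data are again the four support vectors with assumed labels.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_error(w, w_star):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(w, w_star)))


def angle_gap(w, w_star):
    """Assumed form: 1 - cos(angle between w and w*)."""
    return 1 - dot(w, w_star) / (math.sqrt(dot(w, w)) * math.sqrt(dot(w_star, w_star)))


def margin_gap(w, w_star, X, Y):
    """Assumed form: optimal margin 1 / ||w*|| minus the normalized margin attained by w."""
    attained = min(y * dot(x, w) for x, y in zip(X, Y)) / math.sqrt(dot(w, w))
    return 1 / math.sqrt(dot(w_star, w_star)) - attained


X = [(0.5, 1.5), (1.5, 0.5), (-0.5, -1.5), (-1.5, -0.5)]
Y = [1, 1, -1, -1]  # assumed labels
w_star = (0.5, 0.5)
w = (1.0, 0.0)  # an arbitrary separating direction that is not max-margin

print(l2_error(w, w_star), angle_gap(w, w_star), margin_gap(w, w_star, X, Y))
```

Both gaps are invariant to rescaling $\bm{w}$, whereas the $L_{2}$ error is not, which is why the three metrics can rank the same iterates differently.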

We additionally compare the convergence of Algorithm 1 in terms of the angle gap and margin gap to gradient descent using fixed step sizes applied to the logistic loss. We use step sizes $\eta=\frac{1}{\sigma_{\max}(\bm{X})}$, where $\sigma_{\max}(\bm{X})$ is the largest singular value of the data matrix $\bm{X}$. As can be seen in Figure 3, we find significantly faster convergence via Algorithm 1 as compared to minimization of the logistic loss via gradient descent with fixed step sizes as considered in [SHN+18, NLG+19]. This result is unsurprising, as Algorithm 1 arrives at the SVM solution via controlled explicit regularization as opposed to only implicit regularization via gradient descent.

We additionally consider the performance of Algorithm 1 applied to the data of Figure 1 with the y-values of the data multiplied by 20. This leads to a slightly more challenging problem with less symmetric data. The results are shown in Figure 4. We find that the convergence of Algorithm 1 is slightly slower in terms of $L_{2}$ error. Gradient descent on the logistic loss converges significantly more slowly in terms of both the angle and margin gaps, whereas the effect on the convergence of Algorithm 1 appears to be minimal.

6 Implementation remarks

As presented, Algorithm 1 is highly adaptable for different loss functions and settings in which one would like to consider a range of regularization parameters or variable regularization. In this section, we present several potential modifications of interest, including adaptive or gradient based step sizes, amenability to using stochastic subgradients, and alternative updates.

6.1 Adaptive step sizes

When the regularization parameter $\lambda$ or the norm of $\bm{w}$ is small and close to optimal, an iterate that violates one of the hinge loss constraints can significantly increase the magnitude of the gradient of the loss $F_{\lambda}$. This leads to a relatively large jump in the next iterate, followed by many smaller steps back toward the optimal solution of smaller norm. Using gradient descent with adaptive or loss-dependent step sizes can reduce the effects of these cycles. For example, we could adjust Algorithm 1 to use step sizes that are normalized by the magnitude of the subgradient,

[TABLE]

With this choice, the magnitude of the update is always $\eta_{k}$ and is independent of the magnitude of the gradient of $F_{\lambda}$. Cursory experimental results suggest that using adaptive step sizes as in Equation 14 leads to slower convergence to the true solution initially and does not lead to improved convergence overall.
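A normalized update of this kind might look like the following sketch. The function name and the convention of leaving the iterate unchanged when the subgradient is zero are assumptions for illustration. The example shows that a small and a large subgradient pointing in the same direction produce the same step of length $\eta_{k}$.

```python
import math


def normalized_step(w, g, eta):
    """Move a distance of exactly eta against the subgradient g (no move if g = 0)."""
    g_norm = math.sqrt(sum(c * c for c in g))
    if g_norm == 0.0:
        return list(w)
    return [wi - eta * gi / g_norm for wi, gi in zip(w, g)]


w = [0.4, 0.7]
small_g, large_g = [0.01, -0.02], [30.0, -60.0]  # same direction, very different size
step_small = normalized_step(w, small_g, eta=0.1)
step_large = normalized_step(w, large_g, eta=0.1)
print(step_small, step_large)
```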

One could also potentially improve the convergence rate guarantees for Algorithm 1 by incorporating aggressive loss-dependent step sizes. In [NLG+19], the authors show that when using Equation 14 with step sizes $\eta_{k}=\frac{1}{L(\bm{w}_{k})}$, gradient descent applied to the logistic loss converges at the nearly optimal rate of $O(t^{-1/2}\log t)$. While this strategy provides a faster convergence rate, loss-dependent step sizes are less commonly used in practice because, in the stochastic setting, recomputing the loss at each iteration is often too expensive. The stochastic setting is discussed further in Subsection 6.3.

6.2 Regularization decay rate

In Algorithm 1, we consider regularization parameters that decay at a rate of $\lambda_{s}=O(s^{-p})$ for a constant $p>0$. One might consider other choices for the decay rate of the regularization parameter $\lambda$, for example $\lambda_{s}=O\left(\tfrac{1}{\log(s)}\right)$ or $\lambda_{s}=O(c^{s})$ for $c\in(0,1)$. Recall that in bounding the error $\|\overline{\bm{w}}_{s}-\bm{w}^{*}\|$ we use the decomposition

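$$\lVert{\overline{\bm{w}}_{s}-\bm{w}^{*}}\rVert\leq\lVert{\overline{\bm{w}}_{s}-\bm{w}_{\lambda_{s}}^{*}}\rVert+\lVert{\bm{w}_{\lambda_{s}}^{*}-\bm{w}^{*}}\rVert.$$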

The first term converges more quickly when $\lambda$ is large while the second term converges more quickly when $\lambda$ is small. The decay rate of $\lambda_{s}=O(s^{-p})$ was chosen to balance the convergence of these terms.
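To give a feel for how differently these schedules behave, the sketch below evaluates the three decay rates mentioned above at a few stage indices. The constants ($p=1/2$, $c=1/2$, and leading constant 1) are illustrative assumptions, not values analyzed in the paper for the logarithmic and geometric cases.

```python
import math


def polynomial(s, p=0.5):
    return s ** (-p)


def logarithmic(s):
    return 1 / math.log(s)


def geometric(s, c=0.5):
    return c ** s


# Start at s = 2 so that log(s) > 0.
table = {s: (logarithmic(s), polynomial(s), geometric(s)) for s in (2, 10, 100, 1000)}
for s, (log_rate, poly_rate, geo_rate) in table.items():
    print(f"s={s:5d}  1/log(s)={log_rate:.4f}  s^-p={poly_rate:.4f}  c^s={geo_rate:.3e}")
```

A logarithmic schedule keeps $\lambda_{s}$ large for a long time, which favors the first term of the decomposition, while a geometric schedule drives $\lambda_{s}$ toward zero almost immediately, which favors the second.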

6.3 Stochastic subgradients

Algorithm 1 can be naturally extended to the stochastic subgradient setting, in which one performs updates based on the subgradient of the loss with respect to only a subset of the data points. This is often necessary for large-scale optimization problems. Additionally, although piecewise-constant decaying step sizes are incorporated into Algorithm 1 to account for the introduced regularization, they are also often used in stochastic gradient descent in order to mitigate the effect of noise in the gradient approximation of each update [BCN18]. This commonality suggests that Algorithm 1 may be particularly suited for the stochastic setting.
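A stochastic variant could replace the full subgradient with one computed on a random subset of the data, as in the sketch below. This is an illustration only: the objective (mean hinge loss plus $\frac{\lambda}{2}\lVert\bm{w}\rVert^{2}$), the toy data, the batch size, and the fixed step size are all assumptions, and no convergence guarantee from the paper is claimed for it.

```python
import random


def minibatch_subgradient(w, lam, X, Y, batch):
    """Subgradient of the regularized hinge loss using only the indices in batch."""
    g = [lam * c for c in w]
    for j in batch:
        margin = Y[j] * sum(a * b for a, b in zip(X[j], w))
        if margin < 1:  # hinge constraint violated for sample j
            for i in range(len(g)):
                g[i] -= Y[j] * X[j][i] / len(batch)
    return g


# Hypothetical toy data: the four support vectors with assumed labels.
X = [(0.5, 1.5), (1.5, 0.5), (-0.5, -1.5), (-1.5, -0.5)]
Y = [1, 1, -1, -1]

rng = random.Random(0)  # seeded for reproducibility
w, lam, eta = [0.0, 0.0], 0.5, 0.05
for _ in range(200):
    batch = rng.sample(range(len(X)), 2)  # two of the four points per update
    g = minibatch_subgradient(w, lam, X, Y, batch)
    w = [wi - eta * gi for wi, gi in zip(w, g)]
print(w)
```

Passing all indices as the batch recovers the full subgradient, so the same routine serves both the deterministic and stochastic settings.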

6.4 Alternative updates

Lemma 4.1 is the only result that depends on the update given by the fixed- $\lambda$ subproblem and, in particular, Theorem 3.1 applies to any update that satisfies $\lVert{\overline{\bm{w}}_{s}-\bm{w}_{\lambda}^{*}}\rVert\leq R_{s}$ for each $s=1,\ldots,S$ . Thus, as opposed to using the average of the iterates from each fixed $\lambda$ subproblem, one could use alternative updates, such as

[TABLE]

or the iterate that leads to the minimal loss for that subproblem. We refer to this update choice as the best-iterate update and investigate the effects of this choice in Figure 5.

We find that the best-iterate update typically leads to significantly faster convergence in terms of the $L_{2}$ error. Specifically, choosing the best iterate can alleviate the slow convergence caused by the slow decrease in step size. The two strategies, using the averaged iterate and using the best iterate, perform comparably in terms of the angle gap. Using the best iterate converges somewhat more slowly in terms of the margin gap.
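The two update choices can be computed side by side within a single fixed-$\lambda$ stage, as in this sketch. The objective form, the toy data, and the step size are assumptions for illustration. Note that tracking the best iterate requires evaluating the full objective at every step, which adds cost relative to averaging.

```python
import math

# Hypothetical toy data: the four support vectors with assumed labels.
X = [(0.5, 1.5), (1.5, 0.5), (-0.5, -1.5), (-1.5, -0.5)]
Y = [1, 1, -1, -1]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def objective(w, lam):
    """Assumed objective: mean hinge loss + (lam / 2) * ||w||^2."""
    hinge = sum(max(0.0, 1 - y * dot(x, w)) for x, y in zip(X, Y)) / len(X)
    return hinge + 0.5 * lam * dot(w, w)


def subgradient(w, lam):
    g = [lam * c for c in w]
    for x, y in zip(X, Y):
        if y * dot(x, w) < 1:
            for i in range(len(g)):
                g[i] -= y * x[i] / len(X)
    return g


def run_stage(w0, lam, eta, t):
    """Return (average of the t iterates, iterate with the smallest objective)."""
    w, total = list(w0), [0.0] * len(w0)
    best, best_val = None, math.inf
    for _ in range(t):
        w = [wi - eta * gi for wi, gi in zip(w, subgradient(w, lam))]
        total = [ti + wi for ti, wi in zip(total, w)]
        val = objective(w, lam)
        if val < best_val:
            best, best_val = list(w), val
    return [ti / t for ti in total], best


avg, best = run_stage([0.0, 0.0], lam=0.5, eta=0.05, t=500)
print(avg, objective(avg, 0.5))
print(best, objective(best, 0.5))
```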

6.5 Incorporating a bias term

As in [RZH04, SHN+18], we consider the case in which the maximal-margin separating hyperplane intersects the origin. One can allow for more general hyperplanes by learning a bias term $b$ for the separating hyperplane. We propose the following method for approximating the bias term $b$

[TABLE]

which is guaranteed to be close to the true max-margin bias $b^{*}$ when $\lVert{\bm{w}-\bm{w}^{*}}\rVert$ is small. Specifically, one can verify that for the bias $b$ as calculated in Equation 15 and $b^{*}$ the true bias, we have

[TABLE]

Initial experiments with a non-trivial bias demonstrate convergence similar to the zero-bias case.

7 Conclusion

We have shown that, for linearly separable data, the subgradient method converges to the max-margin SVM solution when minimizing the unconstrained regularized SVM, Equation 2, with decreasing regularization parameters, $\lambda$. Under the conditions given in Theorem 3.1, this convergence can be guaranteed to be $O\left(k^{-1/6+\delta}\right)$ for any $\delta>0$. We compare convergence rates in several metrics to those provided in [SHN+18, NLG+19]. In particular, the convergence rate guarantees for Algorithm 1 are faster than those of [SHN+18, NLG+19] for gradient descent with fixed step sizes. This restriction to fixed or piecewise constant step sizes is a practical choice, especially when working with large-scale optimization problems. We additionally demonstrate the convergence of Algorithm 1 on a simple synthetic dataset.

Although we specifically consider the hinge loss and SVMs, the results and analysis presented here could be extended to more general settings. For example, one could more generally consider settings in which one aims to solve

[TABLE]

where $g$ is strongly convex and Lipschitz over bounded domains, $f$ is convex and Lipschitz, and the regularization path,

[TABLE]

is Lipschitz in $\lambda$ .

Acknowledgments

D. Molitor and D. Needell are grateful to and were partially supported by NSF CAREER DMS #1348721 and NSF BIGDATA DMS #1740325. R. Ward was supported in part by AFOSR MURI Award N00014-17-S-F006.

Appendix A Lemma Proofs

We now present proofs for the lemmas of Sections 2, 3 and 4.

We first prove Lemma 2.2, which gives a bound on the norm of the iterates produced by the subgradient method applied to Equation 2.

Proof of Lemma 2.2

Proof.

Consider the subgradient update for minimizing the function $F_{\lambda}$ of Equation 2

[TABLE]

with $\eta\lambda<1$. Suppose that the iterate $\bm{w}$ satisfies $\lVert{\bm{w}}\rVert\leq\frac{1}{\lambda n}\sum_{j=1}^{n}\lVert{\bm{x}_{j}}\rVert$. We aim to show that $\bm{w}^{\prime}$ given by the subgradient update also satisfies $\lVert{\bm{w}^{\prime}}\rVert\leq\frac{1}{\lambda n}\sum_{j=1}^{n}\lVert{\bm{x}_{j}}\rVert$. Taking the norm on both sides of Equation 16,

[TABLE]

Thus the norms of all iterates of the subgradient method applied to the function $F_{\lambda}$ remain bounded by $\frac{1}{\lambda n}\sum_{j=1}^{n}\lVert{\bm{x}_{j}}\rVert$ if the initial iterate has norm at most $\frac{1}{\lambda n}\sum_{j=1}^{n}\lVert{\bm{x}_{j}}\rVert$ . The norm of the minimizer $\bm{w}_{\lambda}^{*}$ of $F_{\lambda}$ must also satisfy the bound $\lVert{\bm{w}_{\lambda}^{*}}\rVert\leq\frac{1}{\lambda n}\sum_{j}\lVert{\bm{x}_{j}}\rVert$ as $0\in\partial F_{\lambda}(\bm{w}_{\lambda}^{*})$ and so

[TABLE]

∎
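The invariance established in this proof can be checked numerically. The sketch below assumes the update takes the form $\bm{w}^{\prime}=(1-\eta\lambda)\bm{w}+\frac{\eta}{n}\sum_{j}y_{j}\bm{x}_{j}$, with the sum over violated hinge constraints, and uses hypothetical toy data. It starts on the boundary of the ball of radius $\frac{1}{\lambda n}\sum_{j}\lVert\bm{x}_{j}\rVert$ and records the largest iterate norm observed.

```python
import math

# Hypothetical toy data: four points with assumed labels.
X = [(0.5, 1.5), (1.5, 0.5), (-0.5, -1.5), (-1.5, -0.5)]
Y = [1, 1, -1, -1]


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def subgradient_update(w, lam, eta):
    """Assumed update: (1 - eta * lam) * w + (eta / n) * sum of y_j x_j over violations."""
    new_w = [(1 - eta * lam) * c for c in w]
    for x, y in zip(X, Y):
        if y * sum(a * b for a, b in zip(x, w)) < 1:
            for i in range(len(new_w)):
                new_w[i] += eta * y * x[i] / len(X)
    return new_w


lam, eta = 0.1, 1.0  # eta * lam = 0.1 < 1, as the proof requires
bound = sum(norm(x) for x in X) / (lam * len(X))
w = [-bound, 0.0]  # starts on the boundary of the ball
max_norm = norm(w)
for _ in range(1000):
    w = subgradient_update(w, lam, eta)
    max_norm = max(max_norm, norm(w))
print(bound, max_norm)
```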

Proof of Lemma 3.4

Lemma 3.4 uses Theorem 3.1 to derive bounds for the angle and margin gaps.

Proof.

To derive a convergence rate for the angle gap, we use the decomposition

[TABLE]

Dividing by $2\|\bm{w}_{k}\|\|\bm{w}^{*}\|$ ,

[TABLE]

Note that $\|\bm{w}^{*}\|$ is necessarily bounded away from 0, since $y_{i}\bm{x}_{i}^{\top}\bm{w}^{*}\geq 1$ for all $i$. We can bound $\|\bm{w}_{k}\|$ away from 0 for $k$ large using the convergence of $\bm{w}_{k}$ to $\bm{w}^{*}$ guaranteed by Theorem 3.1. Let

[TABLE]

be the exponent in the convergence rate of $\|\bm{w}-\bm{w}^{*}\|$ and $p,r,$ and $\epsilon_{0}$ be defined as in Theorem 3.1. Since

[TABLE]

for constants $A,c>0$ by Theorem 3.1, we have $\|\bm{w}_{k}\|\geq\|\bm{w}^{*}\|-Ak^{-c}$. Thus for $k$ sufficiently large, we can bound $\|\bm{w}_{k}\|$ away from 0 and have

[TABLE]

We now consider the margin bound. Let $j=\operatorname*{argmin}_{i=1,\ldots n}\frac{y_{i}\bm{x}_{i}^{\top}\bm{w}_{k}}{\|\bm{w}_{k}\|}$ . Since $y_{i}\bm{x}_{i}^{\top}\bm{w}^{*}\geq 1$ for all $i=1,\ldots,n$ , we have that

[TABLE]

Note that

[TABLE]

Assuming the data is finite and linearly separable, by Equation 17 we then have

[TABLE]

∎

Proof of Lemma 4.1

Lemma 4.1 provides a modified convergence guarantee for the averaged subgradient method applied to the functions $F_{\lambda}$ [Bub14].

Proof.

Let $F_{\lambda}$ be a strongly convex function with strong convexity parameter $\lambda$ and Lipschitz constant $L$ on the bounded domain considered. Let $\bm{w}_{0}$ be an initial iterate and $\bm{w}_{\lambda}^{*}$ be the minimizer of $F_{\lambda}$ . Suppose $\|\bm{w}_{0}-\bm{w}_{\lambda}^{*}\|\leq R$ , so that $\bm{w}_{\lambda}^{*}$ is contained in a ball of radius $R$ and center $\bm{w}_{0}$ . Let $\overline{\bm{w}}=\frac{1}{t}\sum_{i=1}^{t}\bm{w}_{i}$ be the average of $t$ subgradient descent iterates with initial iterate $\bm{w}_{0}$ and step size $\eta=\frac{R}{L\sqrt{t}}$ . We aim to show that

[TABLE]

The following proof relies heavily on Theorem 3.2 of [Bub14] (see also [B+15]).

Since $\bm{w}_{\lambda}^{*}$ is the minimizer of $F_{\lambda}$ , the inequality

[TABLE]

is immediate. Let $g(\bm{w})=F_{\lambda}(\bm{w})-\frac{\lambda}{2}||\bm{w}||^{2}$ . Since $g(\bm{w})$ is convex,

[TABLE]

and thus

[TABLE]

Reorganizing and subtracting $F_{\lambda}(\bm{w}_{\lambda}^{*})$ ,

[TABLE]

Using the strong convexity of $F_{\lambda}$ and the proof of Theorem 3.2 of [Bub14],

[TABLE]

Making this substitution into Equation 18,

[TABLE]

Decomposing the sum,

[TABLE]

Making this substitution,

[TABLE]

Since $||\bm{w}||^{2}$ is convex, $\frac{\lambda}{t}\sum\left(||\bm{w}_{i}||^{2}-||\overline{\bm{w}}||^{2}\right)\geq 0$ and

[TABLE]

as desired. ∎

Proof of Lemma 4.2

We now prove Lemma 4.2, which bounds the distance between minimizers of $F_{\lambda}$ for different regularization parameters $\lambda$.

Proof.

Let $\bm{w}_{\lambda}^{*}$ minimize $F_{\lambda}$ as given in Equation 2. Let $\lambda^{\prime}>0$ be such that $\bm{w}_{\lambda}^{*}=\bm{w}^{*}$ for all $\lambda\leq\lambda^{\prime}$ . For $\lambda,\widetilde{\lambda}\geq 0$ and data satisfying Assumption 2.1, we aim to show that

[TABLE]

and

[TABLE]

The proof of Lemma 4.2 makes use of Lemma 8 of [LS18], which is also stated below.

Lemma A.1.

(Perturbation of strongly convex functions I [LS18]). Let $f(\bm{z})$ be a non-negative, $\alpha^{2}$-strongly convex function. Let $g(\bm{z})$ be an $L$-Lipschitz non-negative convex function. For any $\beta\geq 0$, let $\bm{z}[\beta]$ be the minimizer of $f(\bm{z})+\beta g(\bm{z})$. Then we have

[TABLE]

Let $f(\bm{w})=\|\bm{w}\|^{2}$ and $g(\bm{w})=\frac{1}{n}\sum_{j=1}^{n}\max\{0,1-y_{j}\bm{x}_{j}^{\top}\bm{w}\}$ . Then $f$ is strongly convex with strong convexity parameter $2$ and $g$ is Lipschitz with a Lipschitz constant bounded by $\tfrac{1}{n}\sum_{j=1}^{n}\|\bm{x}_{j}\|$ . Note that

[TABLE]

for $\beta(\lambda)=\frac{2}{\lambda}$ . Applying Lemma 8 of [LS18],

[TABLE]

Integrating, for any $\tilde{\lambda}\geq\hat{\lambda}>0$ , we have

[TABLE]

As the regularization parameter $\lambda$ approaches zero, we will use the following bound. Since $\bm{w}[\lambda]=\bm{w}[\lambda^{\prime}]=\bm{w}^{*}$ for all $\lambda<\lambda^{\prime}$, we have $\big\|\frac{d\bm{w}[\lambda]}{d\lambda}\big\|=0$ for $\lambda<\lambda^{\prime}$. Thus

[TABLE]

This gives the second bound,

[TABLE]

∎

Proof of Lemma 4.3

We finally prove Lemma 4.3, which makes use of Lemma 4.1 and Lemma 4.2 to bound the initial error $\lVert{\overline{\bm{w}}_{s}-\bm{w}_{\lambda}^{*}}\rVert$ of each regularized subproblem given in Equation 4.

Proof.

We aim to show $\lVert{\overline{\bm{w}}_{s}-\bm{w}_{\lambda_{s}}^{*}}\rVert\leq R_{s}$ with $R_{s}$ defined below and proceed by induction. For $s_{0}\in\mathbb{N}$ with $s_{0}>2$ , $p\in(0,1)$ , and $r>2p$ , let $\lambda_{s}=(s_{0}+s)^{-p}$ , $t_{s}=(s_{0}+s)^{r}$ . Recall that $L=\tfrac{2}{n}\sum_{j=1}^{n}\lVert{\bm{x}_{j}}\rVert$ . For some parameter $\alpha>0$ , let

[TABLE]

By Lemma 2.2, and since $\overline{\bm{w}}_{0}=\bm{0}$, we have $\lVert{\overline{\bm{w}}_{0}-\bm{w}_{\lambda_{0}}^{*}}\rVert\leq\frac{L}{2\lambda_{0}}$. Note that $R_{0}\geq\frac{L}{2\lambda_{0}}$ and thus the base case, $\lVert{\overline{\bm{w}}_{0}-\bm{w}_{\lambda_{0}}^{*}}\rVert\leq R_{0}$, is satisfied.

Suppose that $\lVert{\overline{\bm{w}}_{s}-\bm{w}_{\lambda_{s}}^{*}}\rVert\leq R_{s}$ . By the triangle inequality,

[TABLE]

For $\overline{\bm{w}}_{s}$ generated as in Algorithm 1, Lemma 4.1 along with the inductive assumption gives that

[TABLE]

From Equation 11 of Lemma 4.2,

[TABLE]

Combining these

[TABLE]

Applying a change of base via $\epsilon\geq\frac{\log(s_{0}+s-1)-\log(s_{0}+s-2)}{\log(s_{0}+s-1)}$ ,

[TABLE]

To simplify the analysis and remove the dependence of $\epsilon$ on the iteration number $s$ , we use $\epsilon_{0}=\frac{\log(s_{0})-\log(s_{0}-1)}{\log(s_{0})}.$ Now, for

[TABLE]

and $p<1$ , we have

[TABLE]

Note that allowing the first term in the upper bound on $\alpha$ to increase with $s$ leads to smaller bounds $R_{s}$ . This choice, however, complicates the analysis.

∎
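The change of base used in this proof rests on the identity $(m-1)^{-q}=m^{-q(1-\epsilon)}$ for $\epsilon=\frac{\log(m)-\log(m-1)}{\log(m)}$, with $m=s_{0}+s-1$. The sketch below checks this identity numerically and confirms that replacing $\epsilon$ by the stage-independent $\epsilon_{0}$ only loosens the bound. The exponent $q$ and the value of $s_{0}$ are arbitrary illustrative choices.

```python
import math


def epsilon(m):
    """Exponent loss incurred when rewriting a power of (m - 1) as a power of m."""
    return (math.log(m) - math.log(m - 1)) / math.log(m)


s0, q = 3, 0.75  # q is an arbitrary illustrative exponent
eps0 = epsilon(s0)

checks = []
for s in range(1, 8):
    m = s0 + s - 1
    eps = epsilon(m)
    exact = (m - 1) ** (-q)                # quantity before the change of base
    rewritten = m ** (-q * (1 - eps))      # same quantity, expressed in base m
    relaxed = m ** (-q * (1 - eps0))       # upper bound using eps0 in place of eps
    checks.append((eps, exact, rewritten, relaxed))
    print(f"s={s}  eps={eps:.4f}  exact={exact:.6f}  relaxed={relaxed:.6f}")
```

Because $\epsilon$ decreases as $s$ grows, $\epsilon_{0}$ is the worst case, and choosing a larger $s_{0}$ makes $\epsilon_{0}$ smaller.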

Bibliography

[B+15] S. Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
[BCN18] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
[BGMS18] A. Brutzkus, A. Globerson, E. Malach, and S. Shalev-Shwartz. SGD learns over-parameterized networks that provably generalize on linearly separable data. In International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.
[BST99] P. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. Advances in Kernel methods—support vector learning, pages 43–54, 1999.
[Bub14] S. Bubeck. Convex optimization: Algorithms and complexity. arXiv e-prints, page arXiv:1405.4980, May 2014.
[CCS+17] P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. In International Conference on Learning Representations, 2017.
[Cha07] O. Chapelle. Training a support vector machine in the primal. Neural Computation, 19:1155–1178, 2007.
[CPS+18] R. T. d. Combes, M. Pezeshki, S. Shabanian, A. C. Courville, and Y. Bengio. On the learning dynamics of deep neural networks. CoRR, abs/1809.06848, 2018.
