SVM via Saddle Point Optimization: New Bounds and Distributed Algorithms

Yifei Jin; Lingxiao Huang; Jian Li

arXiv:1705.07252·cs.LG·January 30, 2018


TL;DR

This paper introduces new saddle point optimization algorithms for SVM variants, achieving faster approximate solutions with nearly linear time complexity and efficient distributed implementation, outperforming previous methods especially in high-dimensional settings.

Contribution

The paper presents the first nearly linear time algorithm for $\nu$-SVM and improved algorithms for hard-margin SVM using saddle point optimization, with theoretical guarantees and distributed efficiency.

Findings

01

Achieves a $(1-\epsilon)$-approximation in $\tilde{O}(nd + n\sqrt{d/\epsilon})$ time.

02

First nearly linear time algorithm for $\nu$-SVM.

03

Distributed algorithms require $\tilde{O}(k(d + \sqrt{d/\epsilon}))$ communication, nearly matching the lower bound.

Abstract

We study two important SVM variants: hard-margin SVM (for linearly separable cases) and $\nu$-SVM (for linearly non-separable cases). We propose new algorithms from the perspective of saddle point optimization. Our algorithms achieve $(1-\epsilon)$-approximations with running time $\tilde{O}(nd+n\sqrt{d/\epsilon})$ for both variants, where $n$ is the number of points and $d$ is the dimensionality. To the best of our knowledge, the current best algorithm for $\nu$-SVM is based on a quadratic programming approach, which requires $\Omega(n^{2}d)$ time in the worst case [23, 36]. In this paper, we provide the first nearly linear time algorithm for $\nu$-SVM. The current best algorithm for hard-margin SVM, the Gilbert algorithm [17], requires $O(nd/\epsilon)$ time. Our algorithm improves the running time by a factor of…

Tables

Table 1: Saddle-SVC vs. Gilbert Algorithm. $\epsilon=0.001$. iris: $n=150$, $d=4$. mushrooms: $n=8124$, $d=112$. Synthetic data: $n=10000$.

data set	Saddle-SVC obj	Saddle-SVC time	Gilbert obj	Gilbert time
iris	0.835	0.152s	0.835	0.0005s
mushrooms	0.516	11.2s	0.517	12.5s

Table 2: The sketch of the data sets. Here $n$ is the number of points, $n_{1}$ is the number of points with label $+1$, and $n_{2}$ is the number of points with label $-1$. $d$ is the dimension of the features and $nnz$ is the ratio of non-zero entries.

data set	$n$	$n_{1}$	$n_{2}$	$d$	$nnz$
a1a	1605	395	1210	119	0.12
a5a	6414	1569	4845	122	0.114
a9a	32561	7841	24720	123	0.113
phishing	11055	6157	4898	68	0.441
mushrooms	8124	3916	4208	112	0.188
iris	150	100	50	4	0.978
gisette	6000	3000	3000	5000	0.99
w8a	49749	1479	48270	300	0.038
ijcnn1	49990	4853	45137	22	0.590
skin_nonskin	245057	50859	194198	3	0.982

Table 3: Experiments on different parameter $\nu$ in $\nu$-SVM. Here $n$ is the number of points, $n_{1}$ is the number of points with label $+1$, and $n_{2}$ is the number of points with label $-1$.

data set	$\alpha$	LIBSVM obj	LIBSVM test acy	Saddle-SVC obj	Saddle-SVC test acy
a9a	0.1	6e-12	0.35	6e-4	0.69
a9a	0.3	6e-13	0.36	7e-4	0.69
a9a	0.5	6e-13	0.71	3e-4	0.70
phishing	0.1	6e-11	0.89	3e-4	0.82
phishing	0.3	0.002	0.93	0.002	0.93
phishing	0.5	0.01	0.92	0.01	0.93
ijcnn1	0.1	2e-12	0.17	0.0039	0.73
ijcnn1	0.3	6e-13	0.17	0.002	0.47
ijcnn1	0.5	3e-13	0.80	0.0004	0.31

Table 4: Saddle-SVC vs. LinearSVC. The parameter $\alpha$ for Saddle-SVC is 0.85. The parameter $C$ for LinearSVC is 8. skin_nonskin: $n=245057$, $d=3$. w8a: $n=49745$, $d=300$. Synthetic data: $n=100000$, $d=128$.

data set	nnz	Saddle-SVC test acy	Saddle-SVC time	LinearSVC test acy	LinearSVC time
skin	0.98	0.931	40.0s	0.913	654s
w8a	0.03	0.984	3075s	0.986	12.5s
synthetic	0.1	0.804	393s	0.830	28.2s
synthetic	0.5	0.844	369s	0.843	214s
synthetic	0.9	0.825	363s	0.828	537s

Equations

\begin{array}{ll}\min\limits_{w,b}&\frac{1}{2}\|w\|^{2}\\ \text{s.t.}&y_{i}(w^{\mathrm{T}}x_{i}-b)\geq 1,\quad\forall i\end{array}

\begin{array}{ll}\min\limits_{\eta,\xi}&\frac{1}{2}\|A\eta-B\xi\|^{2}\\ \text{s.t.}&\|\eta\|_{1}=1,\ \|\xi\|_{1}=1,\ \eta\geq 0,\ \xi\geq 0.\end{array}

\mathsf{OPT}=\max\limits_{w}\min\limits_{\eta\in\Delta_{n_{1}},\,\xi\in\Delta_{n_{2}}}\ w^{\mathrm{T}}A\eta-w^{\mathrm{T}}B\xi-\frac{1}{2}\|w\|^{2}

\max\limits_{w}\min\limits_{\eta\in\Delta_{n_{1}},\,\xi\in\Delta_{n_{2}}}\ w^{\mathrm{T}}A\eta-w^{\mathrm{T}}B\xi+\gamma H(\eta)+\gamma H(\xi)-\frac{1}{2}\|w\|^{2},

g(w):=\min\limits_{\eta\in\Delta_{n_{1}},\,\xi\in\Delta_{n_{2}}}\ w^{\mathrm{T}}A\eta-w^{\mathrm{T}}B\xi-\frac{1}{2}\|w\|^{2}.

\begin{array}{ll}\min\limits_{w,b,\rho,\delta}&\frac{1}{2}\|w\|^{2}-\rho+\frac{\nu}{2}\sum_{i}\delta_{i}\\ \text{s.t.}&y_{i}(w^{\mathrm{T}}x_{i}-b)\geq\rho-\delta_{i},\ \delta_{i}\geq 0,\quad\forall i\end{array}

\begin{array}{ll}\min\limits_{\eta,\xi}&\frac{1}{2}\|A\eta-B\xi\|^{2}\\ \text{s.t.}&\|\eta\|_{1}=1,\ \|\xi\|_{1}=1,\\ &0\leq\eta_{i}\leq\nu,\ 0\leq\xi_{j}\leq\nu,\quad\forall i,j\end{array}

\mathsf{OPT}=\max\limits_{w}\min\limits_{\eta\in\mathcal{D}_{n_{1}},\,\xi\in\mathcal{D}_{n_{2}}}\ w^{\mathrm{T}}A\eta-w^{\mathrm{T}}B\xi-\frac{1}{2}\|w\|^{2}.

\max\limits_{w}\min\limits_{\eta\in\mathcal{D}_{n_{1}},\,\xi\in\mathcal{D}_{n_{2}}}\ w^{\mathrm{T}}A\eta-w^{\mathrm{T}}B\xi+\gamma H(\eta)+\gamma H(\xi)-\frac{1}{2}\|w\|^{2}.

\forall j\in[d],\quad|(WDx_{i})_{j}|\leq O\big(\sqrt{\log n/d}\big).

w_{i^{*}}[t+1]=\arg\max_{w_{i^{*}}}-\big\{-(\delta_{i^{*}}^{+}-\delta_{i^{*}}^{-})w_{i^{*}}+w^{2}_{i^{*}}/2+(w_{i^{*}}-w_{i^{*}}[t])^{2}/2\sigma\big\}

\begin{array}{ll}\eta_{i}[t+1]\leftarrow\Phi(\eta_{i}[t],X^{+})/Z^{+},&\forall i\in[n_{1}],\\ \xi_{j}[t+1]\leftarrow\Phi(\xi_{j}[t],X^{-})/Z^{-},&\forall j\in[n_{2}]\end{array}

\Phi(\lambda_{i},X)=\exp\big\{(\gamma+d\tau^{-1})^{-1}\big(d\tau^{-1}\log\lambda_{i}-y_{i}\cdot\langle w[t]+d(w[t+1]-w[t]),X_{\cdot i}\rangle\big)\big\}

\begin{array}{l}\mathbf{while}\quad\varsigma:=\sum_{\eta_{i}>\nu}(\eta_{i}-\nu)\neq 0:\\ \qquad\Omega=\sum_{\eta_{i}<\nu}\eta_{i}\\ \qquad\forall i,\quad\mathbf{if}\;\eta_{i}\geq\nu,\quad\mathbf{then}\;\eta_{i}=\nu\\ \qquad\forall i,\quad\mathbf{if}\;\eta_{i}<\nu,\quad\mathbf{then}\;\eta_{i}=\eta_{i}(1+\varsigma/\Omega)\end{array}

\nu=1/(\alpha\min(n_{1},n_{2})),

\arg\min_{\eta\in\Delta_{n_{1}}}

+\frac{\gamma}{d}H(\eta)+\frac{1}{\tau}V_{\eta[t]}(\eta)\Big\}

Z^{-1}\exp\big\{

-\langle w[t]+d(w[t+1]-w[t]),X_{\cdot i}\rangle)\big\}

L(\eta,\lambda)=

+\frac{\gamma}{d}H(\eta)+\frac{1}{\tau}V_{\eta[t]}(\eta)+\lambda\Big(\sum_{i}\eta_{i}-1\Big)

\frac{\partial L}{\partial\eta_{i}}=0

+d^{-1}\langle w[t]+d(w[t+1]-w[t]),X_{\cdot i}\rangle

-\tau^{-1}\log\eta_{i}[t]+(\lambda+\tau^{-1}),\quad\forall i

\frac{\partial L}{\partial\lambda}=0

\eta_{i}[t+1]=

\arg\min_{\eta\in\mathcal{S}_{1}}

\forall i,\quad\eta_{i}[t+1]=\left\{\begin{array}{ll}\eta_{i}(1+\varsigma_{i^{*}}/\Omega_{i^{*}}),&\text{if }i<i^{*}\\ \nu,&\text{if }i\geq i^{*}\end{array}\right.



Taxonomy

Methods: Support Vector Machine

Full text

SVM via Saddle Point Optimization: New Bounds and Distributed Algorithms

Yifei Jin

Tsinghua University

Lingxiao Huang

EPFL

Jian Li

Tsinghua University

Abstract

We study two important SVM variants: hard-margin SVM (for linearly separable cases) and $\nu$-SVM (for linearly non-separable cases). We propose new algorithms from the perspective of saddle point optimization. Our algorithms achieve $(1-\epsilon)$-approximations with running time $\tilde{O}(nd+n\sqrt{d/\epsilon})$ for both variants, where $n$ is the number of points and $d$ is the dimensionality. To the best of our knowledge, the current best algorithm for $\nu$-SVM is based on a quadratic programming approach, which requires $\Omega(n^{2}d)$ time in the worst case [23, 36]. In this paper, we provide the first nearly linear time algorithm for $\nu$-SVM. The current best algorithm for hard-margin SVM, the Gilbert algorithm [17], requires $O(nd/\epsilon)$ time. Our algorithm improves the running time by a factor of $\sqrt{d}/\sqrt{\epsilon}$. Moreover, our algorithms can be implemented naturally in the distributed setting. We prove that our algorithms require $\tilde{O}(k(d+\sqrt{d/\epsilon}))$ communication cost, where $k$ is the number of clients, which almost matches the theoretical lower bound. Numerical experiments support our theory and show that our algorithms converge faster on high-dimensional, large, and dense data sets than previous methods.

1 Introduction

Support Vector Machine (SVM) is widely used for classification in numerous applications such as text categorization, image classification, and handwritten character recognition.

In this paper, we focus on binary classification. If the two classes of points are linearly separable, one can use the hard-margin SVM ([6, 10]), which finds a hyperplane that separates the two classes with maximum margin. If the data is not linearly separable, several popular SVM variants have been proposed, such as $l_{2}$-SVM, $C$-SVM and $\nu$-SVM (see e.g., the summary in [17]). The main difference among these variants is that they use different penalty loss functions for the misclassified points. $l_{2}$-SVM, as the name implies, uses the $l_{2}$ penalty loss. $C$-SVM and $\nu$-SVM are two well-known SVM variants using the $l_{1}$-loss. $C$-SVM uses the $l_{1}$-loss with penalty coefficient $C\in[0,\infty)$ [46]. On the other hand, $\nu$-SVM reformulates $C$-SVM by introducing a new regularization parameter $\nu\in(0,1]$ [38]. However, given a $C$-SVM formulation, it is not easy to compute the regularization parameter $\nu$ and obtain an equivalent $\nu$-SVM, because the equivalence depends on a hard-to-compute constant. Compared to $C$-SVM, the parameter $\nu$ in $\nu$-SVM has a clearer geometric interpretation: the objective is to minimize the distance between two reduced polytopes defined based on $\nu$ [11]. However, the best known algorithm for $\nu$-SVM is much worse than that for $C$-SVM in practice (see below).

In general, SVMs can be formulated as convex quadratic programs and solved by quadratic programming in $O(n^{2}d)$ time [23, 36]. However, better algorithms exist for some SVM variants, which we briefly discuss below.

For hard-margin SVM, [17] showed that the Gilbert algorithm [18] achieves a $(1-\epsilon)$-approximation with $O(nd/\epsilon\beta^{2})$ running time, where $\beta$ is the ratio of the minimum distance to the maximum one among the points. $l_{2}$-SVM and $C$-SVM have been studied extensively, and the current best algorithms run in time linear in the number $n$ of data points [39, 15, 12, 2]. However, these techniques cannot be extended to $\nu$-SVM directly, mainly because $\nu$-SVM cannot be transformed into a single-objective unconstrained optimization problem. Apart from the traditional quadratic programming approach, no better algorithm with a provable guarantee is known for $\nu$-SVM. Whether $\nu$-SVM can be solved in nearly linear time is still open.

Distributed SVM has also attracted significant attention in recent years. A number of distributed algorithms for SVM have been obtained in the past [19, 32, 30, 14, 44]. Typically, the communication complexity is one of the key performance measures for distributed algorithms, and it has been studied extensively (see [43, 34, 27]). For hard-margin SVM, Liu et al. [28] recently proposed a distributed algorithm with $O(kd/\epsilon)$ communication cost, where $k$ is the number of clients. Hence, it is natural to ask whether the communication cost of their algorithm can be improved.

1.1 Our Contributions

We summarize our main contributions as follows.

1. Hard-Margin SVM: We provide a new $(1-\epsilon)$-approximation algorithm with running time $\tilde{O}(nd+n\sqrt{d}/\sqrt{\epsilon\beta})$, where $\beta$ is the ratio of the minimum distance to the maximum one among the points (see Theorem 6). (Footnote 1: The $\tilde{O}$ notation hides logarithmic factors such as $\log(n)$, $\log(\beta)$ and $\log(1/\epsilon)$.) Compared to the Gilbert algorithm [17], our algorithm improves the running time by a factor of $\sqrt{d}/\sqrt{\epsilon}$. First, we regard hard-margin SVM as computing the polytope distance between two classes of points. Then we translate the problem to a saddle point optimization problem using the properties of the geometric structures (Lemma 2), and provide an algorithm to solve the saddle point optimization.

2. $\nu$-SVM: Then, we extend our algorithm to $\nu$-SVM and design an $\tilde{O}(nd+n\sqrt{d}/\sqrt{\epsilon\beta})$ time algorithm, which is the most important technical contribution of this paper. To the best of our knowledge, it is the first nearly linear time algorithm for $\nu$-SVM. It is known that $\nu$-SVM is equivalent to computing the distance between two reduced polytopes [5, 11]. The obstacle to providing an efficient algorithm based on the reduced polytopes is that the number of vertices in the reduced polytopes may be exponentially large. However, in our framework, we only need to represent the reduced polytopes implicitly. We show that using a similar saddle point optimization framework, together with a new nontrivial projection method, $\nu$-SVM can be solved efficiently in the same time complexity as in the hard-margin case. Compared with the QP-based algorithms in previous work [23, 36], our algorithm significantly improves the running time, by a factor of $n$.

3. Distributed SVM: Finally, we extend our algorithms for both hard-margin SVM and $\nu$-SVM to the distributed setting. We prove that the communication cost of our algorithm is $\tilde{O}(k(d+\sqrt{d/\epsilon}))$, which is almost optimal according to the lower bound provided in [28]. For the hard-margin SVM, compared with the current best algorithm [28] with $O(kd/\epsilon)$ communication cost, our algorithm is more suitable when $\epsilon$ is small and $d$ is large. For $\nu$-SVM, our algorithm is the first practical distributed algorithm.

In addition, the numerical experiments support our theoretical bounds. We compare our algorithms with the Gilbert algorithm [17] and with NuSVC and LinearSVC in scikit-learn [35]. The experiments show that our algorithms converge faster on high-dimensional, large, and dense data sets.

1.2 Other Related Work

For the hard-margin SVM, there is an alternative to Gilbert's method, called the MDM algorithm, originally proposed by [31]. Recently, López and Dorronsoro [29] proved that the rate of convergence of the MDM algorithm is $O(n^{2}d\log(1/\epsilon))$, which is linear convergence w.r.t. $\epsilon$ but worse than the Gilbert algorithm w.r.t. $n$.

Both $C$-SVM and $l_{2}$-SVM have been studied extensively in the literature. Basically, there are three main algorithmic approaches: the primal gradient-based methods [26, 39, 12, 15, 2], dual quadratic programming methods [24, 40, 22] and dual geometry methods [42, 41]. Recently, [2] provided the current best algorithms, which achieve $O(nd/\sqrt{\epsilon})$ time for $l_{2}$-SVM and $O(nd/\epsilon)$ time for $C$-SVM.

Some sublinear time algorithms for hard-margin SVM and $l_{2}$-SVM have been proposed [9, 21]. These algorithms are sublinear w.r.t. $nd$ (i.e., the size of the input), but have a worse dependency on $1/\epsilon$.

The algorithmic framework for saddle point optimization was first developed by Nesterov for structured nonsmooth optimization problems [33]. He only considered the full gradient in the algorithm. Recently, some studies have extended it to the stochastic gradient setting [45, 3]. The most closely related work is [3], in which the author obtained an $\tilde{O}(nd+n\sqrt{d}/\sqrt{\epsilon})$ algorithm for the minimum enclosing ball problem (MinEB) in Euclidean space, using saddle point optimization. This result also implies an algorithm for $l_{2}$-SVM, by the connection between MinEB and $l_{2}$-SVM (see [42, 20, 41]). However, the implied algorithm is not as efficient. Based on [42, 41], the dual of $l_{2}$-SVM is equivalent to MinEB under a specific feature mapping, which maps a $d$-dimensional point to the $(d+n)$-dimensional space. Thus, after the mapping, it takes quadratic time to solve $l_{2}$-SVM. To avoid this mapping, they designed an algorithm called Core Vector Machine (CVM), which solves $l_{2}$-SVM by solving $O(1/\epsilon)$ MinEB problems sequentially.

2 Formulate SVM as Saddle Point Optimization

In this section, we formulate both hard-margin SVM and $\nu$-SVM, and show that they can be reduced to saddle point optimizations. All vectors in the paper are column vectors by default.

Definition 1 (Hard-margin SVM).

Given $n$ points $x_{i}\in\mathbb{R}^{d}$ for $1\leq i\leq n$, where each $x_{i}$ has a label $y_{i}\in\{\pm 1\}$, the hard-margin SVM can be formalized as the following quadratic program [10].

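$$\begin{array}{ll}\min\limits_{w,b}&\frac{1}{2}\|w\|^{2}\\ \text{s.t.}&y_{i}(w^{\mathrm{T}}x_{i}-b)\geq 1,\quad\forall i\end{array}\tag{1}$$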

The dual problem of (1) is defined as follows, which is equivalent to finding the minimum distance between the two convex hulls of two classes of points [5] when they are linearly separable. We call the problem the C-Hull problem.

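$$\begin{array}{ll}\min\limits_{\eta,\xi}&\frac{1}{2}\|A\eta-B\xi\|^{2}\\ \text{s.t.}&\|\eta\|_{1}=1,\ \|\xi\|_{1}=1,\ \eta\geq 0,\ \xi\geq 0.\end{array}\tag{2}$$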

where $A$ and $B$ are the matrices in which each column represents a vector of a point with label $+1$ or $-1$ respectively.

Denote the set of points with label $+1$ by $\mathcal{P}$ and the set with label $-1$ by $\mathcal{Q}$. Let $n_{1}=|\mathcal{P}|$ and $n_{2}=|\mathcal{Q}|$. Since $\sum_{i}\eta_{i}=1$ and $\eta\geq 0$, we can regard $\eta$ as a probability distribution over the points in $\mathcal{P}$ (similarly, $\xi$ over $\mathcal{Q}$). We denote by $\Delta_{n_{1}}$ the set of $n_{1}$-dimensional probability vectors over $\mathcal{P}$ and by $\Delta_{n_{2}}$ that over $\mathcal{Q}$. Then, we prove in Lemma 2 that the C-Hull problem (2) is equivalent to the following saddle point optimization. We defer the proof to Appendix C.

Lemma 2.

Problem C-Hull (2) is equivalent to the saddle point optimization (3).

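$$\mathsf{OPT}=\max\limits_{w}\min\limits_{\eta\in\Delta_{n_{1}},\,\xi\in\Delta_{n_{2}}}\ w^{\mathrm{T}}A\eta-w^{\mathrm{T}}B\xi-\frac{1}{2}\|w\|^{2}\tag{3}$$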
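A brief sketch of why the two problems coincide may help here. It is an informal argument under standard minimax assumptions, not the proof from Appendix C. Write $z=A\eta-B\xi$ (a shorthand introduced only for this sketch). The objective of (3) is concave in $w$ and linear in $(\eta,\xi)$, and the simplices are compact and convex, so the order of $\max$ and $\min$ can be exchanged (Sion's minimax theorem). The inner maximization over $w$ is then an unconstrained concave quadratic whose maximizer is $w=z$:

```latex
\begin{aligned}
\max_{w}\ \min_{\eta\in\Delta_{n_{1}},\,\xi\in\Delta_{n_{2}}}
  \Big(w^{\mathrm{T}}z-\tfrac{1}{2}\|w\|^{2}\Big)
&=\min_{\eta\in\Delta_{n_{1}},\,\xi\in\Delta_{n_{2}}}\ \max_{w}
  \Big(w^{\mathrm{T}}z-\tfrac{1}{2}\|w\|^{2}\Big)\\
&=\min_{\eta\in\Delta_{n_{1}},\,\xi\in\Delta_{n_{2}}}
  \tfrac{1}{2}\|z\|^{2}
 \qquad\text{(gradient } z-w=0 \text{ gives } w=z\text{)}\\
&=\min_{\eta\in\Delta_{n_{1}},\,\xi\in\Delta_{n_{2}}}
  \tfrac{1}{2}\|A\eta-B\xi\|^{2}.
\end{aligned}
```

The last line is exactly the C-Hull objective (2), and the maximizing direction is $w=A\eta-B\xi$, the vector joining the two closest points of the convex hulls.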

Let $\phi(w,\eta,\xi)=w^{\rm T}A\eta-w^{\rm T}B\xi-\|w\|^{2}/2$. Note that $\phi(w,\eta,\xi)$ is only linear w.r.t. $\eta$ and $\xi$. However, in order to obtain an algorithm that converges faster, we want the objective function to be strongly convex with respect to $\eta$ and $\xi$. For this purpose, we add a small regularization term that makes the objective function strongly convex. This is a commonly used approach in optimization (see [3] for an example). Here, we use the entropy function $H(u):=\sum_{i}u_{i}\log u_{i}$ as the regularization term. The new saddle point optimization problem is as follows.

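$$\max\limits_{w}\min\limits_{\eta\in\Delta_{n_{1}},\,\xi\in\Delta_{n_{2}}}\ w^{\mathrm{T}}A\eta-w^{\mathrm{T}}B\xi+\gamma H(\eta)+\gamma H(\xi)-\frac{1}{2}\|w\|^{2},\tag{4}$$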

where $\gamma=\epsilon\beta/(2\log n)$. The following lemma bounds the error introduced by replacing (3) with the regularized saddle point optimization (4). We defer the proof to Appendix C.

Lemma 3.

Let $(w^{*},\eta^{*},\xi^{*})$ and $(w^{\circ},\eta^{\circ},\xi^{\circ})$ be the optimal solutions of saddle point optimizations (3) and (4) respectively. Define $\mathsf{OPT}$ as in (3). Define

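$$g(w):=\min\limits_{\eta\in\Delta_{n_{1}},\,\xi\in\Delta_{n_{2}}}\ w^{\mathrm{T}}A\eta-w^{\mathrm{T}}B\xi-\frac{1}{2}\|w\|^{2}.$$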

Then $g(w^{*})-g(w^{\circ})\leq\epsilon\mathsf{OPT}$ (note that $g(w^{*})=\mathsf{OPT}$).
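The scaling of $\gamma$ by $1/\log n$ can be motivated by a simple bound, sketched here as an illustration rather than as the authors' proof: on the probability simplex, $H(u)=\sum_{i}u_{i}\log u_{i}$ lies in $[-\log n,0]$, so the two entropy terms in (4) shift the objective of (3) by at most $2\gamma\log n=\epsilon\beta$ in absolute value. The helper names `neg_entropy` and `regularization_shift` and all numeric values below are hypothetical.

```python
import math

def neg_entropy(u):
    """H(u) = sum_i u_i * log(u_i), using the convention 0 * log 0 = 0."""
    return sum(x * math.log(x) for x in u if x > 0.0)

def regularization_shift(eta, xi, gamma):
    """The term gamma*H(eta) + gamma*H(xi) that (4) adds to the objective of (3)."""
    return gamma * (neg_entropy(eta) + neg_entropy(xi))

# Illustrative values only; none of these numbers come from the paper.
n1, n2 = 4, 6
n = n1 + n2
eps, beta = 0.1, 0.5
gamma = eps * beta / (2.0 * math.log(n))

uniform_eta = [1.0 / n1] * n1       # H attains its minimum, -log(n1), here
uniform_xi = [1.0 / n2] * n2        # H attains its minimum, -log(n2), here
vertex_eta = [1.0, 0.0, 0.0, 0.0]   # H attains its maximum, 0, at a vertex

worst_shift = regularization_shift(uniform_eta, uniform_xi, gamma)
print("largest possible shift:", worst_shift)
print("lower bound -eps*beta :", -eps * beta)
```

The shift is never positive and never below $-\epsilon\beta$, which is consistent with the role $\gamma$ plays in Lemma 3.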

We call the saddle point optimization (4) the Hard-Margin Saddle problem, abbreviated as HM-Saddle. Next, we discuss $\nu$-SVM (see [11, 38]) and again provide an equivalent saddle point optimization formulation.

Definition 4 ($\nu$-SVM).

Given $n$ points $x_{i}\in\mathbb{R}^{d}$ for $1\leq i\leq n$, where each $x_{i}$ has a label $y_{i}\in\{+1,-1\}$, $\nu$-SVM is the following quadratic program.

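$$\begin{array}{ll}\min\limits_{w,b,\rho,\delta}&\frac{1}{2}\|w\|^{2}-\rho+\frac{\nu}{2}\sum_{i}\delta_{i}\\ \text{s.t.}&y_{i}(w^{\mathrm{T}}x_{i}-b)\geq\rho-\delta_{i},\ \delta_{i}\geq 0,\quad\forall i\end{array}\tag{5}$$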

[11] presented a geometric interpretation of $\nu$-SVM. They proved that $\nu$-SVM is equivalent to the problem of finding the closest distance between two reduced convex hulls as follows.

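$$\begin{array}{ll}\min\limits_{\eta,\xi}&\frac{1}{2}\|A\eta-B\xi\|^{2}\\ \text{s.t.}&\|\eta\|_{1}=1,\ \|\xi\|_{1}=1,\\ &0\leq\eta_{i}\leq\nu,\ 0\leq\xi_{j}\leq\nu,\quad\forall i,j\end{array}\tag{6}$$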

We call the above problem the Reduced Convex Hull problem, abbreviated as RC-Hull. The difference between C-Hull (2) and RC-Hull (6) is that in the latter, each entry of $\eta$ and $\xi$ has an upper bound $\nu$. Geometrically, this compresses the convex hulls of $\mathcal{P}$ and $\mathcal{Q}$ so that the two reduced convex hulls are linearly separable. We define $\mathcal{D}_{n_{1}}$ to be the domain of $\eta$ in RC-Hull, i.e., $\{\eta\mid\|\eta\|_{1}=1,0\leq\eta_{i}\leq\nu,\forall i\}$, and $\mathcal{D}_{n_{2}}$ to be the domain of $\xi$, i.e., $\{\xi\mid\|\xi\|_{1}=1,0\leq\xi_{j}\leq\nu,\forall j\}$. Similar to Lemma 2, we have the following lemma. The proof is deferred to Appendix C.

Lemma 5.

RC-Hull (6) is equivalent to the following saddle point optimization.

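$$\mathsf{OPT}=\max\limits_{w}\min\limits_{\eta\in\mathcal{D}_{n_{1}},\,\xi\in\mathcal{D}_{n_{2}}}\ w^{\mathrm{T}}A\eta-w^{\mathrm{T}}B\xi-\frac{1}{2}\|w\|^{2}.\tag{7}$$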

Again, we add two entropy terms to make the objective function strongly convex with respect to $\eta$ and $\xi$.

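$$\max\limits_{w}\min\limits_{\eta\in\mathcal{D}_{n_{1}},\,\xi\in\mathcal{D}_{n_{2}}}\ w^{\mathrm{T}}A\eta-w^{\mathrm{T}}B\xi+\gamma H(\eta)+\gamma H(\xi)-\frac{1}{2}\|w\|^{2}.\tag{8}$$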

where $\gamma=\epsilon\beta/(2\log n)$. We call this problem the $\nu$-Saddle problem. Similar to Lemma 3, we can prove that $\nu$-Saddle (8) is a $(1-\epsilon)$-approximation of the saddle point optimization (7). See Lemma 15 in Appendix C for the details.

Overall, we formulate hard-margin SVM and $\nu$-SVM as saddle point problems and prove that by solving HM-Saddle and $\nu$-Saddle, we can solve hard-margin SVM and $\nu$-SVM. (Footnote 2: Some readers may wonder why the formulations of HM-Saddle and $\nu$-Saddle depend only on $(w,\eta,\xi)$ but not on the offset $b$. In fact, since the hyperplane bisects the closest points in the (reduced) convex hulls, it is not difficult to show that $b^{*}=w^{*\rm T}(A\eta^{*}+B\xi^{*})/2$.)
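As a small illustration of the offset formula, the sketch below recovers a separating hyperplane from dual weights on a toy 2D data set. It assumes $w^{*}=A\eta^{*}-B\xi^{*}$, which is the maximizer of the objective of (3) over $w$ for fixed $(\eta,\xi)$. The function names and the toy points are hypothetical, and the dual weights are chosen by hand (for this data the closest hull points are the vertices $(2,0)$ and $(0,0)$).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def convex_combination(points, weights):
    """Return sum_i weights[i] * points[i]; plays the role of A @ eta or B @ xi."""
    dim = len(points[0])
    return [sum(wt * p[j] for wt, p in zip(weights, points)) for j in range(dim)]

def hyperplane_from_dual(P, Q, eta, xi):
    """Recover (w, b) from dual weights, assuming w = A eta - B xi."""
    p = convex_combination(P, eta)   # closest point in the hull of P
    q = convex_combination(Q, xi)    # closest point in the hull of Q
    w = [a - c for a, c in zip(p, q)]
    b = dot(w, [a + c for a, c in zip(p, q)]) / 2.0
    return w, b

# Toy data (hypothetical): positive points P, negative points Q.
P = [[2.0, 0.0], [3.0, 1.0]]
Q = [[0.0, 0.0], [-1.0, 2.0]]
eta = [1.0, 0.0]   # all weight on (2, 0)
xi = [1.0, 0.0]    # all weight on (0, 0)

w, b = hyperplane_from_dual(P, Q, eta, xi)
for x in P + Q:
    print(x, dot(w, x) - b)
```

The decision value $w^{\rm T}x-b$ is positive on $\mathcal{P}$, negative on $\mathcal{Q}$, and zero at the midpoint of the two closest points, as the bisecting property requires.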

3 Saddle Point Optimization Algorithms for SVM

In this section, we propose efficient algorithms to solve the two saddle point optimizations: HM-Saddle (4) and $\nu$-Saddle (8). The framework is inspired by the prior work of [3]. However, their algorithm does not directly imply an effective SVM algorithm, as discussed in Section 1.2. We modify the update rules and introduce new projection methods to adapt the framework to the HM-Saddle and $\nu$-Saddle problems. We highlight that both the new update rules and the projection methods are non-trivial.

First, we introduce a preprocessing step to make the data vectors more homogeneous in each coordinate. Then, we explain the update rules and projection methods of our algorithm: Saddle-SVC.

For convenience, we assume that in the hard-margin case $\|x_{i}\|^{2}\leq 1$ for $1\leq i\leq n$. (Footnote 3: This can be achieved by scaling all data by the factor $1/\max_{i}\|x_{i}\|$ in $O(nd)$ time.) Let $W$ be the $d\times d$ Walsh-Hadamard matrix and $D$ be a $d\times d$ diagonal matrix whose diagonal entries are chosen i.i.d. from $\pm 1$ with equal probability. Then, we transform the data by left-multiplying by the matrix $WD$. With high probability, every point $x_{i}$ then satisfies [1]

$$\forall j\in[d],\quad|(WDx_{i})_{j}|\leq O\big(\sqrt{\log n/d}\big).$$

Let $X^{+}=WDA$ and $X^{-}=WDB$. This means that after the transformation, with high probability, the value of each entry in $X^{+}$ or $X^{-}$ is at most $O(\sqrt{\log n/d})$. This transformation can be completed in $O(nd\log d)$ time by FFT. Note that $WD$ is an invertible matrix which represents a rotation and mirroring operation. Hence, it does not affect the optima of the problem. In fact, the “Hadamard transform trick” has been used in the numerical analysis literature explicitly or implicitly (see e.g., [16, 25, 3]). Roughly speaking, the main purpose of the transform is to make all coordinates of $X$ more uniform, so that the uniform sampling (line 1 in Algorithm 2) is more efficient (otherwise, the large coordinates would have a disproportionate effect on uniform sampling).
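The effect of the transform can be seen on a single vector. The following is a minimal sketch, assuming $d$ is a power of two (padding with zeros otherwise is an assumption of this sketch, not something the text specifies) and a Walsh-Hadamard matrix normalized by $1/\sqrt{d}$ so that $WD$ preserves norms. The function names are hypothetical. A vector with all its mass on one coordinate is spread evenly over all $d$ coordinates.

```python
import math
import random

def fwht(x):
    """Normalized fast Walsh-Hadamard transform in O(d log d); len(x) must be a power of two."""
    x = list(x)
    d = len(x)
    if d == 0 or d & (d - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < d:
        for start in range(0, d, 2 * h):
            for j in range(start, start + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b   # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(d)
    return [v * scale for v in x]

def randomized_hadamard(x, signs):
    """Apply W D x, where D = diag(signs) with entries in {-1, +1}."""
    return fwht([s * v for s, v in zip(signs, x)])

d = 8
rng = random.Random(0)
signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]

spike = [1.0] + [0.0] * (d - 1)        # all mass on one coordinate
flat = randomized_hadamard(spike, signs)
print([round(abs(v), 4) for v in flat])  # every entry has magnitude 1/sqrt(d)
```

Because the normalized transform is orthogonal, Euclidean norms and inner products between points are unchanged, which is why the optimum of the problem is unaffected.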

After the data transformation, we define some necessary parameters. See Line 4 of Algorithm 1 for details. (Footnote 4: Careful readers may notice that $\gamma=\epsilon\beta/(2\log n)$ depends on $\beta$, which is an unknown parameter: the ratio of the minimum distance to the maximum one among the points. The same issue also appears in the previous work [3]. The role of $\beta$ is similar to the step size in the stochastic gradient descent algorithm. In practice, we could try several values $\beta=10^{-k}$ for $k\in\mathbb{Z}$ and choose the best one.)

We use “$\alpha[t]$” to represent the value of variable “$\alpha$” at iteration $t$. For example, $w[0]$, $\eta[0]$, $\xi[0]$ are the initial values of $w,\eta,\xi$ and are defined in Line 5 of Algorithm 1.

Update Rules: In order to unify HM-Saddle and $\nu$-Saddle in the same framework, we use $(\mathcal{S}_{1},\mathcal{S}_{2})$ to represent the domains $(\Delta_{n_{1}},\Delta_{n_{2}})$ in HM-Saddle (see formula (3)) or $(\mathcal{D}_{n_{1}},\mathcal{D}_{n_{2}})$ in $\nu$-Saddle (see formula (7)).

Generally speaking, the update rules alternately maximize the objective with respect to $w$ and minimize it with respect to $\eta$ and $\xi$. See the details in Algorithm 2.

First, we update $w$ according to Line 4 in Algorithm 2. It is equivalent to a variant of the proximal coordinate gradient method with $l_{2}$-norm regularization as follows.

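$$w_{i^{*}}[t+1]=\arg\max_{w_{i^{*}}}-\big\{-(\delta_{i^{*}}^{+}-\delta_{i^{*}}^{-})w_{i^{*}}+w^{2}_{i^{*}}/2+(w_{i^{*}}-w_{i^{*}}[t])^{2}/2\sigma\big\}\tag{9}$$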

We briefly explain the intuition of (9). The term $(\delta_{i^{*}}^{+}-\delta_{i^{*}}^{-})$ in (9) can be viewed as $\langle X^{+}_{i^{*}},\eta[t]\rangle-\langle X^{-}_{i^{*}},\xi[t]\rangle$ with extra momentum terms $\theta(\eta[t]-\eta[t-1])$ and $\theta(\xi[t]-\xi[t-1])$ added for the dual variables $\eta[t]$ and $\xi[t]$ respectively (see Lines 2 and 3 in Algorithm 2). Further, $(\langle X^{+}_{i^{*}},\eta[t]\rangle-\langle X^{-}_{i^{*}},\xi[t]\rangle)w_{i^{*}}-w^{2}_{i^{*}}/2$ is the part of the objective functions (4) and (8) that depends on $w_{i^{*}}$. The term $(w_{i^{*}}-w_{i^{*}}[t])^{2}/2\sigma$ is the $l_{2}$-norm regularization term.

Moreover, rather than updating the whole vector $w$, we randomly select one dimension $i^{*}\in[d]$ and update only the corresponding $w_{i^{*}}$ in each iteration, which reduces the running time per round.
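Since (9) is a one-dimensional concave quadratic, its maximizer has a closed form. The sketch below assumes the proximal term in (9) is read as $(w_{i^{*}}-w_{i^{*}}[t])^{2}/(2\sigma)$ and writes $\delta=\delta_{i^{*}}^{+}-\delta_{i^{*}}^{-}$. Setting the derivative $\delta-w-(w-w[t])/\sigma$ to zero gives $w=(\sigma\delta+w[t])/(\sigma+1)$. The function names and numeric values are hypothetical, and Algorithm 2 itself is not reproduced here.

```python
def coordinate_update(w_t, delta, sigma):
    """Closed-form maximizer of (9) for one coordinate, assuming a 1/(2*sigma) proximal weight."""
    return (sigma * delta + w_t) / (sigma + 1.0)

def objective_9(w, w_t, delta, sigma):
    """The quantity maximized in (9): -( -delta*w + w^2/2 + (w - w_t)^2 / (2*sigma) )."""
    return -(-delta * w + 0.5 * w * w + (w - w_t) ** 2 / (2.0 * sigma))

# Illustrative numbers only.
w_t, delta, sigma = 1.0, 3.0, 0.5
w_next = coordinate_update(w_t, delta, sigma)

print("updated coordinate:", w_next)
print("objective at update:", objective_9(w_next, w_t, delta, sigma))
print("objective nearby   :", objective_9(w_next + 0.01, w_t, delta, sigma))
```

Two limiting cases are a useful sanity check: as $\sigma\to 0$ the update stays at $w[t]$, and as $\sigma\to\infty$ it moves to $\delta$, the unregularized maximizer.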

The update rules for $\eta$ and $\xi$ are listed in Lines 5 and 6 in Algorithm 2, which are the proximal gradient method with a Bregman divergence regularization $V_{x}(y)=H(y)-\langle\nabla H(x),y-x\rangle-H(x)$. Similar to $(\delta_{i^{*}}^{+}-\delta_{i^{*}}^{-})$ in (9), we also add a momentum term $d(w[t+1]-w[t])$ for the primal variable $w$ when updating $\eta$ and $\xi$.
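With $H(u)=\sum_{i}u_{i}\log u_{i}$, this Bregman divergence reduces to the Kullback-Leibler divergence when $x$ and $y$ are probability vectors, since $\nabla H(x)_{i}=\log x_{i}+1$ and the linear terms cancel on the simplex. The following sketch checks this numerically; the function names and test vectors are hypothetical, and $x$ is assumed to have strictly positive entries.

```python
import math

def neg_entropy(u):
    """H(u) = sum_i u_i * log(u_i), using the convention 0 * log 0 = 0."""
    return sum(v * math.log(v) for v in u if v > 0.0)

def bregman_divergence(x, y):
    """V_x(y) = H(y) - <grad H(x), y - x> - H(x), with grad H(x)_i = log(x_i) + 1."""
    grad = [math.log(xi) + 1.0 for xi in x]
    linear = sum(g * (yi - xi) for g, xi, yi in zip(grad, x, y))
    return neg_entropy(y) - linear - neg_entropy(x)

def kl_divergence(y, x):
    """KL(y || x) = sum_i y_i * log(y_i / x_i) for probability vectors."""
    return sum(yi * math.log(yi / xi) for yi, xi in zip(y, x) if yi > 0.0)

x = [0.2, 0.3, 0.5]   # reference point, e.g. eta[t]
y = [0.6, 0.3, 0.1]   # candidate point, e.g. eta[t+1]

print(bregman_divergence(x, y), kl_divergence(y, x))
```

This is why the resulting updates take the exponentiated, multiplicative form rather than an additive gradient step.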

Projection Methods: However, the update rules for $\eta$ and $\xi$ are implicit update rules. We need to show that we can solve the corresponding optimization problems in Lines 5 and 6 of Algorithm 2 efficiently. In fact, for both HM-Saddle and $\nu$-Saddle, we can obtain explicit expressions for the solutions of these two optimization problems using the method of Lagrange multipliers.

First, we can solve the optimization problem for HM-Saddle (in Lines 5 and 6) directly, and the explicit expressions for $\eta$ and $\xi$ are as follows.

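$$\begin{array}{ll}\eta_{i}[t+1]\leftarrow\Phi(\eta_{i}[t],X^{+})/Z^{+},&\forall i\in[n_{1}],\\ \xi_{j}[t+1]\leftarrow\Phi(\xi_{j}[t],X^{-})/Z^{-},&\forall j\in[n_{2}]\end{array}\tag{10}$$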

where $Z^{+}$ and $Z^{-}$ are normalizers that ensure $\sum_{i}\eta_{i}[t+1]=1$ and $\sum_{j}\xi_{j}[t+1]=1$, and

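$$\Phi(\lambda_{i},X)=\exp\big\{(\gamma+d\tau^{-1})^{-1}\big(d\tau^{-1}\log\lambda_{i}-y_{i}\cdot\langle w[t]+d(w[t+1]-w[t]),X_{\cdot i}\rangle\big)\big\}\tag{11}$$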

Note that the factors $Z^{+}$ and $Z^{-}$ are used to project the values $\Phi(\eta_{i}[t],X^{+})$ and $\Phi(\xi_{j}[t],X^{-})$ to the domains $\Delta_{n_{1}}$ and $\Delta_{n_{2}}$. The above update rules of $\eta$ and $\xi$ can also be viewed as the multiplicative weight update method (see [4]).
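The HM-Saddle update can be sketched directly from the expression for $\Phi$: exponentiate, then divide by the sum. The code below is a minimal illustration under stated assumptions, not a reproduction of Algorithm 2: the function names, the argument layout (one column of $X$ per point, a single class label $y$ per call) and all numeric values are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(lam_i, column, label, w_tilde, gamma, d, tau):
    """Phi(lambda_i, X) with w_tilde = w[t] + d * (w[t+1] - w[t])."""
    coef = 1.0 / (gamma + d / tau)
    return math.exp(coef * ((d / tau) * math.log(lam_i) - label * dot(w_tilde, column)))

def mwu_step(lam, columns, label, w_t, w_next, gamma, tau):
    """One multiplicative-weight step for one class, followed by normalization by Z."""
    d = len(w_t)
    w_tilde = [a + d * (b - a) for a, b in zip(w_t, w_next)]
    raw = [phi(l, c, label, w_tilde, gamma, d, tau) for l, c in zip(lam, columns)]
    z = sum(raw)                       # the normalizer Z
    return [r / z for r in raw]

# Toy positive class with two points in d = 2 (illustrative values).
columns = [[1.0, 0.0], [0.0, 0.0]]
lam = [0.5, 0.5]
w = [1.0, 0.0]
new_lam = mwu_step(lam, columns, +1.0, w, w, gamma=0.1, tau=1.0)
print(new_lam)
```

In this toy run the first point has the larger inner product with $w$, so its weight decreases: the minimization over $\eta$ shifts mass toward points of $\mathcal{P}$ that lie closest to the other class along $w$.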

Next, we consider $\nu$-Saddle. Compared to HM-Saddle, $\nu$-Saddle has the extra constraints $\eta_{i},\xi_{j}\leq\nu$. Thus, we need another projection process (12) to ensure that $\eta[t+1]$ and $\xi[t+1]$ lie in the domains $\mathcal{D}_{n_{1}}$ and $\mathcal{D}_{n_{2}}$ respectively. For convenience, we only present the projection for $\eta$ here. The projection for $\xi$ is similar. Let $\eta_{i}$ be $\Phi(\eta_{i}[t],X^{+})/Z^{+}$.

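$$\begin{array}{l}\mathbf{while}\quad\varsigma:=\sum_{\eta_{i}>\nu}(\eta_{i}-\nu)\neq 0:\\ \qquad\Omega=\sum_{\eta_{i}<\nu}\eta_{i}\\ \qquad\forall i,\quad\mathbf{if}\;\eta_{i}\geq\nu,\quad\mathbf{then}\;\eta_{i}=\nu\\ \qquad\forall i,\quad\mathbf{if}\;\eta_{i}<\nu,\quad\mathbf{then}\;\eta_{i}=\eta_{i}(1+\varsigma/\Omega)\end{array}\tag{12}$$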

Note that there are at most $1/\nu$ (a constant) entries $\eta_{i}$ of value $\nu$ during the whole projection process. In each iteration, at least one more entry becomes equal to $\nu$, since we set all entries $\eta_{j}>\nu$ to $\nu$ in that iteration. Thus, the number of iterations in (12) is at most $1/\nu$. By (12), we project $\eta$ and $\xi$ to the domains $\mathcal{D}_{n_{1}}$ and $\mathcal{D}_{n_{2}}$ respectively.

We claim that the result of projection (12) is exactly the optimal solution in Line 5. The proof is deferred to Appendix A. Thus, we need $O(n/\nu)$ time to compute $\eta[t+1]$. Since we assume that $\nu$ is a constant, it only costs linear time. In practice, if $\nu$ is extremely small, we have another update rule to get $\eta[t+1]$ and $\xi[t+1]$ in $O(n\log n)$ time. See Appendix A for details. Finally, we give the main theorem for our algorithm as follows. See the proof in Appendix C.1.
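The projection loop (12) translates almost line by line into code. The sketch below is illustrative and makes three assumptions not spelled out in the text: the input is a probability vector with strictly positive entries (as produced by the exponential update), $\nu n\geq 1$ so that the domain is non-empty, and the exact test $\varsigma\neq 0$ is replaced by a small tolerance for floating-point arithmetic. The function name is hypothetical.

```python
def project_to_capped_simplex(eta, nu, tol=1e-12):
    """Cap entries at nu and redistribute the excess mass proportionally, as in (12)."""
    if nu * len(eta) < 1.0 - tol:
        raise ValueError("infeasible: need nu * n >= 1")
    eta = list(eta)
    while True:
        excess = sum(e - nu for e in eta if e > nu)   # varsigma in (12)
        if excess <= tol:
            return eta
        omega = sum(e for e in eta if e < nu)          # Omega in (12)
        if omega <= 0.0:
            raise ValueError("entries below nu must carry positive mass")
        scale = 1.0 + excess / omega
        # Entries at or above nu are set to nu; the rest are scaled up.
        eta = [nu if e >= nu else e * scale for e in eta]

# Illustrative input that needs two passes of the loop.
eta = [0.6, 0.35, 0.05]
projected = project_to_capped_simplex(eta, nu=0.4)
print(projected, sum(projected))
```

On this input the first pass caps the first entry and pushes the second entry above $\nu$, and the second pass caps it as well, matching the observation that every pass fixes at least one more entry at $\nu$.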

Theorem 6.

Algorithm 2 computes $(1-\epsilon)$ -approximate solutions for HM-Saddle and $\nu$ -Saddle by $\tilde{O}(d+\sqrt{d/\epsilon\beta})$ iterations. Moreover, it takes $O(n)$ time for each iteration.

Combining with Lemmas 2, 3 and 5, we obtain $(1-\epsilon)$ -approximate solutions for C-Hull and RC-Hull problems. Hence by strong duality, we obtain $(1-\epsilon)$ -approximations for hard-margin SVM and $\nu$ -SVM in $\tilde{O}(n(d+\sqrt{d/\epsilon\beta}))$ time.

Theorem 7.

A $(1-\epsilon)$ -approximation for either hard-margin SVM or $\nu$ -SVM can be computed in $\tilde{O}(n(d+\sqrt{d/\epsilon\beta}))$ time.

4 Distributed SVM

Server and Clients Model: We extend Saddle-SVC to the distributed setting and call it Saddle-DSVC. We consider the popular distributed setting: the server and clients model. Denote the server by $S$ . Let $\mathcal{C}$ be the set of clients and $|\mathcal{C}|=k$ . We use the notation $C.\alpha$ to represent any variable $\alpha$ saved in client $C$ and use $S.\alpha$ to represent a variable $\alpha$ saved in the server.

First, we initialize some parameters in each client as in the pre-processing step in Section 3. Each client maintains the same random diagonal matrix $D_{d\times d}$ and the total number of points of each type (i.e., $|\mathcal{P}|=n_{1}$ and $|\mathcal{Q}|=n_{2}$ ); this can be realized using $O(k)$ communication bits. Moreover, each client $C$ applies a Hadamard transformation to its own data and initializes the partial probability vectors $C.\eta$ and $C.\xi$ for its own points.

We first consider HM-Saddle. The interaction between clients and the server can be divided into three rounds in each iteration.

1. In the first round, the server randomly chooses a number $i^{*}\in[d]$ and broadcasts $i^{*}$ to all clients. Each client computes $C.\delta_{i^{*}}^{+}$ and $C.\delta_{i^{*}}^{-}$ and sends them back to the server.

2. In the second round, the server sums up all $C.\delta_{i^{*}}^{+}$ and $C.\delta_{i^{*}}^{-}$ and computes $S.\delta_{i^{*}}^{+}$ and $S.\delta_{i^{*}}^{-}$ . We can see that $S.\delta_{i^{*}}^{+}$ (resp. $S.\delta_{i^{*}}^{-}$ ) is exactly $\delta_{i^{*}}^{+}$ (resp. $\delta_{i^{*}}^{-}$ ) in Algorithm 2. The server broadcasts $S.\delta_{i^{*}}^{+}$ and $S.\delta_{i^{*}}^{-}$ to all clients. Using $S.\delta_{i^{*}}^{+}$ and $S.\delta_{i^{*}}^{-}$ , each client updates $w$ individually. Moreover, each client $C\in\mathcal{C}$ updates its own $C.\eta$ and $C.\xi$ according to the new directional vector $w$ . In order to normalize the probability vectors $\eta$ and $\xi$ , each client sends the sums $C.Z^{+}$ and $C.Z^{-}$ to the server.

3. In the third round, the server computes $(S.Z^{+},S.Z^{-})\leftarrow\sum_{C\in\mathcal{C}}(C.Z^{+},C.Z^{-})$ and broadcasts to all clients the normalization factors $S.Z^{+}$ and $S.Z^{-}$ . Finally, each client updates its partial probability vectors $C.\eta$ and $C.\xi$ based on the normalization factors.

As we discuss in Section 3, for $\nu$ -Saddle, we need another $O(1/\nu)$ rounds to project $\eta$ and $\xi$ to the domains $\mathcal{D}_{n_{1}}$ and $\mathcal{D}_{n_{2}}$ .

4. Each client computes $C.\varsigma^{+},C.\varsigma^{-}$ and $C.\Omega^{+}$ , $C.\Omega^{-}$ according to (12) and sends them to the server. The server sums up all $C.\varsigma^{+},C.\varsigma^{-},C.\Omega^{+},C.\Omega^{-}$ respectively and gets $S.\varsigma^{+},S.\varsigma^{-},$ $S.\Omega^{+},S.\Omega^{-}$ . If both $S.\varsigma^{+}$ and $S.\varsigma^{-}$ are zero, the server stops this iteration. Otherwise, the server broadcasts to all clients the factors $S.\varsigma^{+},S.\varsigma^{-},S.\Omega^{+},S.\Omega^{-}$ . All clients update their $C.\eta$ and $C.\xi$ according to (12) and repeat Step 4 again.
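The normalization round and one projection round can be simulated in a single process as follows. This is only a sketch: there is no networking, each inner list stands for the part of $\eta$ held by one client, and the function names are hypothetical. It illustrates that each client only ever sends a constant number of scalars to the server.

```python
def distributed_normalize(client_weights):
    """Third round: clients send partial sums, server broadcasts the total."""
    partial_sums = [sum(w) for w in client_weights]   # C.Z^+ from every client
    total = sum(partial_sums)                         # S.Z^+ computed by the server
    return [[v / total for v in w] for w in client_weights]

def distributed_cap_round(client_eta, nu):
    """One projection round for nu-Saddle. Returns (new vectors, finished flag)."""
    # Each client reports its local excess above nu and its local mass below nu.
    stats = [
        (sum(v - nu for v in eta if v > nu), sum(v for v in eta if v < nu))
        for eta in client_eta
    ]
    s_excess = sum(s for s, _ in stats)   # S.varsigma^+
    s_mass = sum(m for _, m in stats)     # S.Omega^+
    if s_excess == 0.0 or s_mass == 0.0:
        return client_eta, True           # server stops the projection
    scale = 1.0 + s_excess / s_mass
    updated = [
        [nu if v > nu else (v * scale if v < nu else v) for v in eta]
        for eta in client_eta
    ]
    return updated, False
```

A driver would call `distributed_cap_round` repeatedly until the flag is true, which corresponds to the server observing zero excess.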

We give the pseudocode in Algorithm 4 in Appendix B. By Theorem 6, after $T=\tilde{O}(d+\sqrt{d/\epsilon})$ iterations, all clients compute the same $(1-\epsilon)$ -approximate solution $w=w[T]$ for SVM. W.l.o.g., let the first client send $w$ to the server. With at most $O(n)$ additional communication cost, the server can compute the offset $b$ , the margin for hard-margin SVM and the objective value for $\nu$ -SVM. The correctness of Saddle-DSVC is obvious since we obtain the same $w[t]$ as in Saddle-SVC after each iteration.

Communication Complexity of Saddle-DSVC: Note that in each iteration of Algorithm 4, the server and clients interact three times for hard-margin SVM and $O(1/\nu)$ times for $\nu$ -SVM. Thus, the communication cost of each iteration is $O(k)$ . By Theorem 6, it takes $\tilde{O}(d+\sqrt{d/\epsilon})$ iterations. Thus, we have the following theorem.

Theorem 8.

The communication cost of Saddle-DSVC is $\tilde{O}(k(d+\sqrt{d/\epsilon}))$ .

Liu et al. [28] prove that the lower bound of the communication cost for distributed SVM is $\Omega(k\min\{d,1/\epsilon\})$ .

Theorem 9 (Theorem 6 in [28]).

Consider a set of $d$ -dimension points distributed at $k$ clients. The communication cost to achieve a $(1-\epsilon)$ -approximation of the distributed SVM problem is at least $\Omega(k\min\{d,1/\epsilon\})$ for any $\epsilon>0$ .

If $d=\Theta(1/\epsilon)$ , the communication lower bound is $\Omega(k(d+\sqrt{d/\epsilon}))$ which matches the communication cost of Saddle-DSVC.
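To spell out why the two bounds meet in this regime, one can substitute $d=\Theta(1/\epsilon)$ into both expressions. The short derivation below is a worked restatement of the claim above, not an additional result; the bounds agree up to the logarithmic factors hidden in $\tilde{O}$.

```latex
\begin{aligned}
d=\Theta(1/\epsilon)\;&\Longrightarrow\;\sqrt{d/\epsilon}=\sqrt{d\cdot\Theta(d)}=\Theta(d),\\
\text{upper bound (Theorem 8):}\quad&\tilde{O}\big(k(d+\sqrt{d/\epsilon})\big)=\tilde{O}(kd),\\
\text{lower bound (Theorem 9):}\quad&\Omega\big(k\min\{d,1/\epsilon\}\big)=\Omega(kd).
\end{aligned}
```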

5 Experiments

In this section, we analyze the performance of Saddle-SVC and Saddle-DSVC for both $\nu$ -SVM and hard-margin SVM.

First, we compare Saddle-SVC for $\nu$ -SVM with NuSVC in scikit-learn [35]. The current best $\nu$ -SVM solvers are based on quadratic programming. NuSVC is one of the fastest QP-based implementations and is based on the well-known SVM library LIBSVM [8]. We compare Saddle-SVC with NuSVC and show that when the two reduced polytopes are linearly separable under the parameter $\nu$ , Saddle-SVC converges faster than NuSVC, especially when the data set is large and dense. As a supplement, in Appendix D, we also compare Saddle-SVC for $\nu$ -SVM with LinearSVC in scikit-learn, which is based on LIBLINEAR [13], the current best algorithm for linear kernel $C$ -SVM and $l_{2}$ -SVM. (Note that LinearSVC solves $C$ -SVM and $l_{2}$ -loss SVM, not $\nu$ -SVM or hard-margin SVM, so the objective functions are incomparable; we compare the test accuracy instead of the objective value.) We show that for large and dense data sets, Saddle-SVC is comparable to LinearSVC or even better.

Next, we compare Saddle-SVC for hard-margin SVM with Gilbert algorithm [18]. Gilbert algorithm is the current best algorithm for hard-margin SVM. We show that Saddle-SVC converges faster when the data dimension is large.

We also implement our algorithm in the distributed setting and compare it with the distributed Gilbert algorithm [28] and HOGWILD! [37]. We note that the current best distributed algorithm for hard-margin SVM is the distributed Gilbert algorithm [28]. Our experiments indicate that Saddle-DSVC has lower communication cost in practice. On the other hand, there is no practical distributed algorithm for $\nu$ -SVM so far. Our algorithm is the first distributed algorithm for $\nu$ -SVM. To evaluate the performance of our distributed algorithm, we first show the convergence curve of Saddle-DSVC on some common data sets. As a supplement, in Appendix D, we also compare the convergence rate with HOGWILD! [37] (note that HOGWILD! is used to solve $C$ -SVM or $l_{2}$ -SVM). We show that Saddle-DSVC converges faster than HOGWILD! w.r.t. communication cost.

The CPU of our platform is Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, and the system is CentOS Linux. We use both synthetic and real-world data sets. The real data sets are from [8]. See Appendix D for the way to generate synthetic data. In each experiment, we mainly care about the performance of algorithms w.r.t. $n$ and $d$ since they are data dependent parameters.

Saddle-SVC vs. NuSVC: Note that NuSVC uses another equivalent form of $\nu$ -SVM; the parameter $\mu$ in NuSVC equals $2/n\nu$ for $\nu$ in (5) (see details in Appendix D). Here we use the data sets “a9a”, “ijcnn1”, “phishing”, and “skin_nonskin” from [8]. Note that “a9a” and “ijcnn1” have the corresponding test sets “a9a.t” and “ijcnn1.t”. For “phishing” and “skin_nonskin”, we randomly choose $10\%$ of the data as the test set and let the remaining part be the training set. Let

[TABLE]

and set $\alpha=0.85$ for $\nu$ -SVM. We show the experimental results in Figure 1, where we can see that Saddle-SVC converges faster with similar test accuracy. Our algorithm performs much better when the data size is large. Figure 2 shows the results on synthetic data sets of different sizes sampled from the same distribution.

We discuss the parameter selection in a bit more detail. Chang and Lin [7] show that for $\nu$ -SVM with $\nu\leq 1$ to be feasible, $\nu$ should be larger than $1/\min(n_{1},n_{2})$ , where $n_{1}$ and $n_{2}$ are the numbers of points in the two classes respectively. Moreover, if $\nu$ is too close to $1$ , $\nu$ -SVM has poor prediction ability because the two reduced polytopes may not be separable. We discuss the detailed reasons in Appendix D. We find that, in the experiments, $\alpha>0.7$ usually ensures that the two reduced polytopes are linearly separable, i.e., the objective function converges to a positive number. In Appendix D, we also run experiments for other values of $\alpha$ and show that if $\alpha$ is small, the $\nu$ -SVM model has poor prediction ability.

Saddle-SVC vs. Gilbert Algorithm: For the hard-margin SVM, we compare Saddle-SVC with Gilbert Algorithm. We use the linearly separable data sets “iris” and “mushrooms”. Since it is hard to find a large real data set that is linearly separable, we generate some synthetic data sets and show that Saddle-SVC converges faster when the data dimension is large. We repeat the iterations of Saddle-SVC and compute the objective function every $T$ rounds. If the difference between two consecutive objective values is less than $\epsilon$ , we output the result. See the results in Table 1, in which we can see that Saddle-SVC gets a smaller objective value (the closest distance between the two polytopes) with less running time when the data dimension is large.
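The stopping rule described above can be sketched as a generic driver loop. The callables `step` and `objective`, the `max_checks` safeguard and the function name are placeholders introduced for this illustration; they do not correspond to a published implementation.

```python
def run_until_stable(step, objective, state, T, eps, max_checks=1000):
    """Run `step` in blocks of T rounds and stop once the objective changes by < eps.

    step      : function mapping a solver state to the next state (one iteration)
    objective : function mapping a solver state to its objective value
    """
    previous = objective(state)
    for _ in range(max_checks):
        for _ in range(T):
            state = step(state)
        current = objective(state)
        if abs(current - previous) < eps:
            return state, current
        previous = current
    # Safeguard (an assumption of this sketch): give up after max_checks blocks.
    return state, previous
```

Evaluating the objective only every $T$ rounds keeps the cost of the check small relative to the iterations themselves.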

Saddle-DSVC: For hard-margin SVM, we compare Saddle-DSVC with distributed Gilbert algorithm. We compare the margins w.r.t. the communication cost. We count all information communication between the clients and server as the communication cost. The data sets are “mushrooms” and synthetic data sets with different dimensions. Figure 3 illustrates that Saddle-DSVC converges faster w.r.t. communication cost. The data is distributed to $k=20$ nodes. Note that it takes $kd$ communication cost if each client sends a point to the server. We set one unit of $x$ -coordinate to represent $kd$ communication cost.

For the $\nu$ -SVM, we analyze the convergence property on some common data sets including “phishing”, “a9a”, “gisette”, “madelon” from [8]. We show the details in Figure 4. Besides, we also compare Saddle-DSVC with HOGWILD!. We compare the accuracy instead of the objective value since they solve different SVM variants. We provide the experimental details in Appendix D and show that our algorithm converges faster w.r.t. communication cost.

Appendix A The Equivalence of the Explicit and Implicit Update Rules of $\eta$ and $\xi$

Lemma 10 (Update Rules of HM-Saddle).

The following two update rules are equivalent.

•

$\eta[t+1]:=$

[TABLE]

•

$\eta_{i}[t+1]:=$

[TABLE]

for each $i\in[n_{1}]$ , where $Z=\sum_{i}\eta_{i}$ . (Recall that $X_{\cdot i}$ is the $i$ th column of $X$ .)

Proof.

The Lagrangian function of the first optimization formulation is

[TABLE]

Thus, we have

[TABLE]

Solving the above equalities, we obtain

[TABLE]

∎

Lemma 11 (Update Rules of $\nu$ -Saddle).

The following three update rules are equivalent.

Rule 1: $\eta[t+1]:=$

[TABLE]

Rule 2:

•

Step 1: $\eta_{i}=$

[TABLE]

for each $i\in[n_{1}]$ , where $Z=\sum_{i}\eta_{i}$ .

•

Step 2: Sort $\eta_{i}$ in increasing order. W.l.o.g., assume that $\eta_{1},\ldots,\eta_{n_{1}}$ are in increasing order. Define $\varsigma_{i}=\sum_{j\geq i}(\eta_{j}-\nu)$ and $\Omega_{i}=\sum_{j<i}\eta_{j}$ . Find the largest index $i^{*}\in[n]$ such that $\varsigma_{i^{*}}\geq 0$ and $\eta_{i^{*}-1}(1+\varsigma_{i^{*}}/\Omega_{i^{*}})<\nu$ by binary search.

•

Step 3:

[TABLE]

Rule 3:

•

Step 1: $\eta_{i}:=$

[TABLE]

for each $i\in[n_{1}]$ , where $Z=\sum_{i}\eta_{i}$ .

•

Step 2:

[TABLE]

Proof.

Similar to the proof of Lemma 10, we first give the Lagrangian function of the first optimization formulation as follows.

[TABLE]

By KKT conditions, we have the following.

[TABLE]

We first show the equivalence between Rule 1 and Rule 2. Note that $\eta[t+1]$ in Rule 2 satisfies the second and the fourth KKT conditions. We only need to give all $\alpha_{i}$ and $\lambda$ satisfying other KKT conditions for Rule 2. Let

[TABLE]

Let

[TABLE]

as defined in Step 1 of Rule 2. For $1\leq i\leq i^{*}-1$ , let $\alpha_{i}=0$ . For $i\geq i^{*}$ , let

[TABLE]

The inequality follows from the definition of $i^{*}$ . Note that we only need to prove that $\eta_{i^{*}}(1+\varsigma_{i^{*}}/\Omega_{i^{*}})\geq\nu$ . If $\eta_{i^{*}}\geq\nu$ , then the above inequality holds directly. Otherwise, if $\eta_{i^{*}}<\nu$ and $\eta_{i^{*}}(1+\varsigma_{i^{*}}/\Omega_{i^{*}})<\nu$ , we have that $\varsigma_{i^{*}+1}=\varsigma_{i^{*}}+\nu-\eta_{i^{*}}>0$ and $\Omega_{i^{*}+1}=\Omega_{i^{*}}+\eta_{i^{*}}$ . We also have the following inequality

[TABLE]

which contradicts the definition of $i^{*}$ . Finally, choose an arbitrary index $i$ and let

[TABLE]

By the choice of $\alpha_{i}$ , it is not hard to check that the value of $\lambda$ is the same for any index $i$ . Thus, $\eta_{i}[t+1],\alpha_{i}$ and $\lambda$ are the unique solution of the KKT conditions. So Rule 1 and Rule 2 are equivalent. By a similar argument (defining suitable $\alpha_{i}$ and $\lambda$ ), we can prove that Rule 1 and Rule 3 are equivalent, which finishes the proof. ∎

Remark 12.

We analyze Rule 2 in Lemma 11. Roughly speaking, we find a suitable value $\eta_{i^{*}}$ , set all values $\eta_{j}>\eta_{i^{*}}$ to be $\nu$ , and scale up the other values by the factor $1+\varsigma_{i^{*}}/\Omega_{i^{*}}$ . We can verify that the running time of Rule 2 is $O(n\log n)$ since both the sorting time and the binary search time are $O(n\log n)$ . On the other hand, recall that the running time of Rule 3 is $O(n/\nu)$ (explained in Section 3). Thus, if the parameter $\nu$ is extremely small, we can use Rule 2 in practice.
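A sort-based projection in the spirit of Rule 2 can be sketched as follows. For readability this sketch replaces the binary search with a linear scan over the number of capped entries, so it illustrates the idea rather than the exact procedure of the lemma. It assumes the input is a positive probability vector; the function name and the tolerance are hypothetical.

```python
def project_capped_sorted(eta, nu, tol=1e-12):
    """Cap the largest entries at nu and scale the rest by one common factor."""
    n = len(eta)
    if nu * n < 1.0 - tol:
        raise ValueError("infeasible: n * nu must be at least 1")
    order = sorted(range(n), key=lambda i: eta[i])   # indices by increasing eta
    prefix = [0.0]                                   # prefix[r] = sum of the r smallest
    for i in order:
        prefix.append(prefix[-1] + eta[i])
    capped, scale = n, 1.0                           # fallback: every entry is capped
    for m in range(n):                               # m = number of capped entries
        omega = prefix[n - m]                        # mass of the uncapped entries
        # Equals 1 + varsigma / Omega when eta sums to 1.
        candidate = (1.0 - m * nu) / omega
        if eta[order[n - m - 1]] * candidate <= nu + tol:
            capped, scale = m, candidate
            break
    out = [0.0] * n
    for rank, i in enumerate(order):
        out[i] = nu if rank >= n - capped else eta[i] * scale
    return out
```

After the sort, the scan costs $O(n)$, so the total cost of this variant is still dominated by the $O(n\log n)$ sorting step.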

Appendix B Details for Distributed Algorithms: Saddle-DSVC

This section supplements Section 4. First, we give the pseudocode of Saddle-DSVC. See Algorithm 3 for the pre-processing step for each client. Recall that we assume there are $m_{1}$ points $x^{+}_{1},x^{+}_{2},\ldots,x^{+}_{m_{1}}$ and $m_{2}$ points $x^{-}_{1},x^{-}_{2},\ldots,x^{-}_{m_{2}}$ maintained in $C$ . We use $\mathbf{1}^{m}$ to denote a vector with all components being $1$ . The initialization is as follows.

[TABLE]

Next, see Algorithm 4 for the interactions between the server and clients in every iteration. Note that only $\nu$ -Saddle needs the fourth round in Algorithm 4. We use $\text{flag}_{\nu}\in\{\mathbf{True,False}\}$ to distinguish the two cases. If we consider $\nu$ -Saddle, let $\text{flag}_{\nu}$ be True. Otherwise, let $\text{flag}_{\nu}$ be False.

Then, we analyze the communication cost.

Theorem 13.

The communication cost of Saddle-DSVC is $\tilde{O}(k(d+\sqrt{d/\epsilon}))$ .

Proof.

Note that in each iteration of Algorithm 4, the server and clients interact three times for hard-margin SVM and $O(1/\nu)$ times for $\nu$ -SVM. The communication cost of each iteration is $O(k)$ . By Theorem 6, it takes $\tilde{O}(d+\sqrt{d/\epsilon})$ iterations. Thus, the total communication cost is $\tilde{O}(k(d+\sqrt{d/\epsilon}))$ . ∎

1: for $t\leftarrow 0$ to $T-1$ do

2: # first round

3: Server: Pick an index $i^{*}\in\{1,2,\ldots,d\}$ uniformly at random and send $i^{*}$ to every client.

4: for client $C\in\mathcal{C}$ do

5: $C.\delta_{i^{*}}^{+}\leftarrow$ $\langle C.X^{+}_{i^{*}},C.\eta[t]+\theta(C.\eta[t]-C.\eta[t-1])\rangle$

6: $C.\delta^{-}_{i^{*}}\leftarrow$ $\langle C.X^{-}_{i^{*}},C.\xi[t]+\theta(C.\xi[t]-C.\xi[t-1])\rangle$

7: Send $C.\delta_{i^{*}}^{+}$ and $C.\delta_{i^{*}}^{-}$ to server.

8: end for

9: # second round

10: Server: Let $S.\delta_{i^{*}}^{+}=\sum_{C\in\mathcal{C}}C.\delta_{i^{*}}^{+}$ and $S.\delta_{i^{*}}^{-}=\sum_{C\in\mathcal{C}}C.\delta_{i^{*}}^{-}$ . Broadcast $S.\delta_{i^{*}}^{+}$ and $S.\delta_{i^{*}}^{-}$ .

11: for client $C\in\mathcal{C}$ do

12: $\forall i\in[d],w_{i}[t+1]\leftarrow$ $\left\{\begin{array}{ll}(w_{i}[t]+\sigma(S.\delta_{i}^{+}-S.\delta_{i}^{-}))/(\sigma+1),&\text{if }i=i^{*}\\ w_{i}[t],&\text{if }i\neq i^{*}\end{array}\right.$

13: $\forall j,C.\eta_{j}[t+1]\leftarrow$ $\exp\big\{(\gamma+d\tau^{-1})^{-1}\big(d\tau^{-1}\log C.\eta_{j}[t]-\langle w[t]+d(w[t+1]-w[t]),C.X^{+}_{\cdot j}\rangle\big)\big\}$

14: $\forall j,C.\xi_{j}[t+1]\leftarrow$ $\exp\big\{(\gamma+d\tau^{-1})^{-1}\big(d\tau^{-1}\log C.\xi_{j}[t]+\langle w[t]+d(w[t+1]-w[t]),C.X^{-}_{\cdot j}\rangle\big)\big\}$

15: $C.Z^{+}\leftarrow\sum_{j}C.\eta_{j}[t+1],$ $C.Z^{-}\leftarrow\sum_{j}C.\xi_{j}[t+1]$

16: Send $C.Z^{+}$ and $C.Z^{-}$ to server

17: end for

18: # third round

19: Server: Let $(S.Z^{+},S.Z^{-})\leftarrow\sum_{C\in\mathcal{C}}(C.Z^{+},C.Z^{-})$ , and broadcast $S.Z^{+}$ and $S.Z^{-}$ .

20: for client $C\in\mathcal{C}$ do

21: $\forall j$ , $C.\eta_{j}[t+1]\leftarrow C.\eta_{j}[t+1]/S.Z^{+}$ , $C.\xi_{j}[t+1]\leftarrow C.\xi_{j}[t+1]/S.Z^{-}$

22: end for

23: # fourth round, only for $\nu$ -Saddle. $\text{flag}_{\nu}$ is True when running the $\nu$ -Saddle version

24: if $\text{flag}_{\nu}$ is True then

25: repeat

26: for client $C\in\mathcal{C}$ do

27: $C.\varsigma^{+}=\sum_{\eta_{i}>\nu}(\eta_{i}-\nu)$ , $C.\Omega^{+}=\sum_{\eta_{i}<\nu}\eta_{i}$ .

28: $C.\varsigma^{-}=\sum_{\xi_{j}>\nu}(\xi_{j}-\nu)$ , $C.\Omega^{-}=\sum_{\xi_{j}<\nu}\xi_{j}$ .

29: Send $C.\varsigma^{+},C.\varsigma^{-},C.\Omega^{+},C.\Omega^{-}$ to server.

30: end for

31: Server: $(S.\varsigma^{+},S.\varsigma^{-},S.\Omega^{+},S.\Omega^{-})\leftarrow\sum_{C\in\mathcal{C}}(C.\varsigma^{+},C.\varsigma^{-},C.\Omega^{+},C.\Omega^{-})$ .

32: for client $C\in\mathcal{C}$ do

33: $\forall i$ , $\mathbf{if}\;\eta_{i}>\nu,\mathbf{then}\;\eta_{i}=\nu$ ; $\forall i$ , $\mathbf{if}\;\eta_{i}<\nu,\mathbf{then}\;\eta_{i}=\eta_{i}(1+S.\varsigma^{+}/S.\Omega^{+})$

34: $\forall j$ , $\mathbf{if}\;\xi_{j}>\nu,\mathbf{then}\;\xi_{j}=\nu$ ; $\forall j$ , $\mathbf{if}\;\xi_{j}<\nu,\mathbf{then}\;\xi_{j}=\xi_{j}(1+S.\varsigma^{-}/S.\Omega^{-})$

35: end for

36: until $S.\varsigma^{+}$ and $S.\varsigma^{-}$ are zeroes

37: end if

38: end for
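To make the $O(k)$ per-iteration cost concrete, the scalars exchanged in the pseudocode above can be tallied with a small helper. The count follows the rounds as listed (one index and several real values per client per round) and is an upper estimate; the function name and the choice to count scalars rather than bits are assumptions of this sketch.

```python
def scalars_exchanged(k, iterations, projection_rounds=0):
    """Count scalars sent between the server and k clients.

    projection_rounds : number of fourth-round repetitions per iteration
                        (0 for hard-margin SVM, O(1/nu) for nu-SVM).
    """
    per_iteration = (
        k        # round 1: server sends i* to every client
        + 2 * k  # round 1: every client returns delta^+ and delta^-
        + 2 * k  # round 2: server broadcasts S.delta^+ and S.delta^-
        + 2 * k  # round 2: every client returns Z^+ and Z^-
        + 2 * k  # round 3: server broadcasts S.Z^+ and S.Z^-
    )
    # Fourth round: four scalars up from each client, four scalars back down.
    per_projection = 4 * k + 4 * k
    return iterations * (per_iteration + projection_rounds * per_projection)
```

With $k=20$ clients, one hard-margin iteration moves 180 scalars, independent of $n$ and $d$.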

Liu et al. [28] proved a theoretical lower bound on the communication cost of distributed SVM as follows. Note that the statement of Theorem 14 is not exactly the same as Theorem 6 in [28], because they omit the case $d<1/\epsilon$ . We briefly prove that the two statements are equivalent. Note that if $d=\Theta(1/\epsilon)$ , the communication lower bound is $\Omega(k(d+\sqrt{d/\epsilon}))$ which matches the communication cost of our algorithm Saddle-DSVC.

Theorem 14 (Theorem 6 in [28]).

Consider a set of $d$ -dimension points distributed at $k$ clients. The communication cost to achieve a $(1-\epsilon)$ -approximation of the distributed SVM problem is at least $\Omega(k\min\{d,1/\epsilon\})$ for any $\epsilon>0$ .

Proof Sketch.

In Theorem 6 of [28], the authors obtain a lower bound $\Omega(kd)$ if $\epsilon\leq(\sqrt{17}-4)/(16d)$ . Their proof can be extended to the case $\epsilon\geq(\sqrt{17}-4)/(16d)$ . In this case, we can make a reduction from the $k$ -OR problem in which each client maintains a $((\sqrt{17}-4)/(16\epsilon))$ -bit vector instead of a $d$ -bit vector. As in the proof of Theorem 6 in [28], we can obtain a lower bound $\Omega(k/\epsilon)$ , which proves the theorem. ∎

Appendix C Missing Proofs

Lemma 2 (restated).

Problem C-Hull (2) is equivalent to the saddle point optimization (3).

Proof.

Consider the saddle point optimization (3). First, note that

[TABLE]

The range of the term $(A\eta-B\xi)$ for $\eta\in\Delta_{n_{1}},\xi\in\Delta_{n_{2}}$ is a convex set, denoted by $\mathcal{S}$ . Since the convex hulls of $\mathcal{P}$ and $\mathcal{Q}$ are linearly separable, we have $0\notin\mathcal{S}$ . Denote $\phi(w,z)=w^{\rm T}z-\frac{1}{2}\|w\|^{2}$ for any $w\in\mathbb{R}^{d},z\in\mathcal{S}$ . Then (3) is equivalent to $\max_{w}\min_{z\in\mathcal{S}}\phi(w,z)$ . Note that

[TABLE]

Thus, we only need to consider those directions $w\in\mathbb{R}^{d}$ such that there exists a point $z\in\mathcal{S}$ with $w^{T}z\geq 0$ . We use $\mathcal{W}$ to denote the collection of such directions.

Let $u$ be a unit vector in $\mathcal{W}$ . Denote

[TABLE]

By this definition, $z_{u}$ is the point with smallest projection distance to $u$ among $\mathcal{S}$ (see Figure 5). Observe that if a direction $w=c\cdot u$ ( $c>0$ ), then we have $\arg\min_{z}\phi(w,z)=\arg\min_{z}\phi(u,z)$ . Also note that

[TABLE]

Let

[TABLE]

$w_{u}$ is the projection point of $z_{u}$ to the line $ou$ , where $o$ is the origin. See Figure 5 for an example. Overall, we have

[TABLE]

The last equality is by the Pythagorean theorem. Let $z^{*}$ be the closest point in $\mathcal{S}$ to the origin point. Next, we show that $\max_{u\in\mathcal{W}:\|u\|=1}\|w_{u}\|^{2}=\|z^{*}\|^{2}$ . Given a unit vector $u\in\mathcal{W}$ , define $w^{\prime}$ to be the projection point of $z^{*}$ to the line $ou$ . By the definition of $z_{u}$ and $w_{u}$ , we have that $\max_{u}\|w_{u}\|^{2}\leq\|w^{\prime}\|^{2}\leq\|z^{*}\|^{2}$ . Moreover, let $u=z^{*}/\|z^{*}\|$ . In this case, we have $\|w_{u}\|^{2}=\|z^{*}\|^{2}$ . Thus, we conclude that $\max_{u}\|w_{u}\|^{2}=\|z^{*}\|^{2}$ .

Overall, we prove that

[TABLE]

Thus, C-Hull (2) is equivalent to the saddle point optimization (3). ∎

Lemma 3 (restated).

Let $(w^{*},\eta^{*},\xi^{*})$ and $(w^{\circ},\eta^{\circ},\xi^{\circ})$ be the optimal solution of saddle point optimizations (3) and (4) respectively. Define $\mathsf{OPT}$ as in (3). Define

[TABLE]

Then $g(w^{*})-g(w^{\circ})\leq\epsilon\mathsf{OPT}$ (note that $g(w^{*})=\mathsf{OPT}$ ).

Proof.

Let

[TABLE]

By the definition of saddle points, we have

[TABLE]

Note that entropy function satisfies $0\leq H(u)\leq\log n$ for any $u\in\Delta_{n}$ . Thus, $\gamma H(\tilde{\eta})+\gamma H(\tilde{\xi})\leq\frac{\epsilon\beta}{2\log n}\cdot(\log n_{1}+\log n_{2})\leq\epsilon\mathsf{OPT}$ . Overall, we prove that $g(w^{*})-g(w^{\circ})\leq\epsilon\mathsf{OPT}$ . ∎

Lemma 5 (restated).

RC-Hull (6) is equivalent to the following saddle point optimization.

[TABLE]

Proof.

The proof is almost the same as the proof of Lemma 2. The only difference is that the range of the term $(A\eta-B\xi)$ is another convex set defined by $\eta\in\mathcal{D}_{n_{1}},\xi\in\mathcal{D}_{n_{2}}$ . ∎

Lemma 15.

Let $(w^{*},\eta^{*},\xi^{*})$ and $(w^{\circ},\eta^{\circ},\xi^{\circ})$ be the optimal solution of saddle point optimizations (7) and (8) respectively. Define $\mathsf{OPT}$ as in (7). Define

[TABLE]

Then $g(w^{*})-g(w^{\circ})\leq\epsilon\mathsf{OPT}$ .

Proof.

Note that $\mathcal{D}_{n_{1}}$ is a convex polytope contained in $\Delta_{n_{1}}$ and $\mathcal{D}_{n_{2}}$ is a convex polytope contained in $\Delta_{n_{2}}$ . It is not hard to verify that the proof of Lemma 3 still holds for $\mathcal{D}_{n_{1}}$ and $\mathcal{D}_{n_{2}}$ . ∎

C.1 Proof of Theorem 6

For preparation, we give two useful Lemmas 16 and 17. Recall that $V_{x}(y)$ is the Bregman divergence function which is defined as $H(y)-\langle\nabla H(x),y-x\rangle-H(x)$ .
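As a numerical sanity check of this definition, the sketch below evaluates $V_{x}(y)$ for vectors on the simplex. It assumes $H(y)=\sum_{i}y_{i}\log y_{i}$ up to an additive constant (the constant cancels in $V_{x}(y)$); under that assumption the Bregman divergence coincides with the Kullback-Leibler divergence $\sum_{i}y_{i}\log(y_{i}/x_{i})$. The function names are hypothetical.

```python
import math

def neg_entropy(y):
    """H(y) = sum_i y_i log y_i (assumed form; any additive constant cancels below)."""
    return sum(v * math.log(v) for v in y)

def bregman(x, y):
    """V_x(y) = H(y) - <grad H(x), y - x> - H(x)."""
    grad = [math.log(v) + 1.0 for v in x]
    linear = sum(g * (b - a) for g, a, b in zip(grad, x, y))
    return neg_entropy(y) - linear - neg_entropy(x)

def kl_divergence(y, x):
    """KL(y || x) for strictly positive probability vectors."""
    return sum(b * math.log(b / a) for a, b in zip(x, y))
```

For two probability vectors the two functions agree up to rounding, and `bregman(x, x)` is zero.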

The two lemmas generalize Lemma A.1 and Lemma A.2 in [3] by changing the domain $\Delta_{m}$ to a convex polytope $\mathcal{S}_{m}$ contained in $\Delta_{m}$ . The proofs of Lemma A.1 and Lemma A.2 in [3] still work for this general version.

Lemma 16.

Let $x_{2}=\operatorname{argmin}_{z\in\mathcal{S}_{m}}\left\{\frac{V_{x_{1}}(z)}{\tau}+\gamma H(z)\right\}$ . Let $\mathcal{S}_{m}$ be a convex polytope contained in $\Delta_{m}$ . Then for every $u\in\mathcal{S}_{m}$ , we have

[TABLE]

Lemma 17.

Let $x=\operatorname{argmin}_{z\in\mathcal{S}_{m}}\left\{H(z)\right\}$ . Let $\mathcal{S}_{m}$ be a convex polytope contained in $\Delta_{m}$ . Then for all $u\in\mathcal{S}_{m}$ ,

[TABLE]

Combining the above lemmas with almost the same analysis as in Theorem 2.2 in [3], we obtain the following Theorem 18.

Theorem 18.

After $T$ iterations of Algorithm 2 (both HM-Saddle and $\nu$ -Saddle versions), we obtain a directional vector $w[T]\in\mathbb{R}^{d}$ satisfying that

[TABLE]

where $\tau\leftarrow\frac{1}{2q}\sqrt{\frac{d}{\gamma}},\sigma\leftarrow\frac{1}{2q}\sqrt{d\gamma},\theta\leftarrow 1-\frac{1}{d+q\sqrt{d}/\sqrt{\gamma}}$ , for some $q=O(\sqrt{\log n}).$

Proof Sketch.

The difference between our statement and Theorem 2.2 in [3] is that we update two probability vectors $\eta$ and $\xi$ instead of one in an iteration. Thus, we have two terms $V_{\eta[T]}(\eta^{\circ})$ and $V_{\xi[T]}(\xi^{\circ})$ on the left hand side. Moreover, we care about convex polytopes $\mathcal{S}_{1}\subset\Delta_{n_{1}}$ and $\mathcal{S}_{2}\subset\Delta_{n_{2}}$ instead of $\Delta_{n_{1}}$ and $\Delta_{n_{2}}$ .

However, these differences do not influence the correctness of the proof of Theorem 2.2 in [3]. Note that we replace Lemma A.1 and Lemma A.2 in [3] by Lemma 16 and Lemma 17. It is not hard to verify the proof of Theorem 2.2 in [3] works for our theorem. ∎

We also need the following lemma.

Lemma 19.

Define

[TABLE]

where $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$ are two convex polytopes such that $\mathcal{S}_{1}\subset\Delta_{n_{1}}$ and $\mathcal{S}_{2}\subset\Delta_{n_{2}}$ . For any $u,v\in\mathbb{R}^{d}$ , we have

[TABLE]

Proof.

Denote by $\nabla g(w)$ any subgradient of $g(w)$ at point $w$ . We write $\nabla g(w)=A\tilde{\eta}_{w}-B\tilde{\xi}_{w}-w$ for arbitrary $\tilde{\eta}_{w}\in\mathcal{S}_{1},\tilde{\xi}_{w}\in\mathcal{S}_{2}$ satisfying $g(w)=w^{\rm T}A\tilde{\eta}_{w}-w^{\rm T}B\tilde{\xi}_{w}-\|w\|^{2}$ . Since $A\tilde{\eta}_{w}$ (resp. $B\tilde{\xi}_{w}$ ) can be considered as a weighted combination of all points $x_{i}$ (resp. $x_{i}$ ), we have $\|A\tilde{\eta}_{w}\|\leq 1$ ( $\|B\tilde{\xi}_{w}\|\leq 1$ ) owing to the assumption that every $x_{i}$ satisfies $\|x_{i}\|\leq 1$ . Next, we compute as follows

[TABLE]

∎

Now we are ready to prove Theorem 6 as follows.

Theorem 6 (restated).

Algorithm 2 computes $(1-\epsilon)$ -approximate solutions for HM-Saddle and $\nu$ -Saddle by $\tilde{O}(d+\sqrt{d/\epsilon\beta})$ iterations. Moreover, it takes $O(n)$ time for each iteration.

Proof.

Let

[TABLE]

According to Theorem 18, we have

[TABLE]

In order to get a $(1-\epsilon)$ -approximate solution, according to Lemma 19, it suffices to choose $T$ such that

[TABLE]

Note that $\theta=1-\frac{1}{d+q\sqrt{d}/\sqrt{\gamma}}=1-\frac{1}{d+q\sqrt{d}/\sqrt{\epsilon\beta/2\log n}}$ . Thus, we only need to have

[TABLE]

∎

Appendix D Supplementary Materials of Experiments

Data set: We use both synthetic and real-world data sets. The real data is from [8] including the separable data sets “iris” and “mushrooms” and the non-separable data sets “w8a”, “gisette”, “madelon”, “phishing”, “a1a”, “a5a”, “a9a”, “ijcnn1”, “skin_nonskin”. We summarize the information of the data in Table 2.

Besides the real-world data, we generate some synthetic data sets. There are three types of synthetic data: 1) separable synthetic data, 2) non-separable synthetic data, 3) sparse non-separable synthetic data. We describe how to generate them as follows.

•

Separable synthetic data: we randomly choose a hyperplane $H$ which overlaps with the unit norm ball in $\mathbb{R}^{d}$ space. Then we randomly sample $n$ points in a subset of the unit ball such that the ratio of the maximum distance among the points to $H$ over the minimum distance to $H$ is $\beta_{1}=0.1$ . Let the labels of points above $H$ be $+1$ and let others be $-1$ .

•

Non-separable synthetic data: The difference from the separable synthetic data is that for those points with distance to $H$ smaller than $\beta_{2}=0.1$ , we randomly choose their labels to be $+1$ or $-1$ with equal probability. Moreover, we also use real-world

•

Sparse non-separable synthetic data: First, we set a parameter “nnz” which represents the number of non-zero elements in each point. The only difference from the dense non-separable synthetic data is that we randomly sample $n$ points such that each point has only “nnz” non-zero elements.
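A generator in the spirit of the first two data types can be sketched as follows. The text leaves several details open, so this sketch makes its own choices, all of which are assumptions: points are drawn uniformly from the unit ball, the hyperplane offset is drawn from $[-0.5,0.5]$, separability is enforced by rejecting points closer than `margin` to $H$, and labels within distance `flip_band` of $H$ are randomized. The sparse variant is not covered.

```python
import math
import random

def _unit_vector(d, rng):
    """Uniformly random direction in R^d."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

def make_synthetic(n, d, margin=0.0, flip_band=0.0, seed=0):
    """Return (points, labels, normal, offset) for a hyperplane <normal, x> = offset.

    margin > 0, flip_band = 0 : separable data
    margin = 0, flip_band > 0 : non-separable data (labels near H are random)
    """
    rng = random.Random(seed)
    normal = _unit_vector(d, rng)
    offset = rng.uniform(-0.5, 0.5)           # keeps H inside the unit ball
    points, labels = [], []
    while len(points) < n:
        radius = rng.random() ** (1.0 / d)    # uniform sampling inside the ball
        x = [radius * v for v in _unit_vector(d, rng)]
        signed = sum(a * b for a, b in zip(normal, x)) - offset
        if abs(signed) < margin:
            continue                          # too close to H: reject
        label = 1 if signed > 0 else -1
        if abs(signed) < flip_band:
            label = rng.choice([1, -1])       # noisy label near H
        points.append(x)
        labels.append(label)
    return points, labels, normal, offset
```

Calling `make_synthetic(n, d, margin=0.1)` yields a separable set, while `make_synthetic(n, d, flip_band=0.1)` mimics the label noise with $\beta_{2}=0.1$.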

$\mu$ -SVM form used in NuSVC: The form of the $\mu$ -SVM used in scikit-learn is a variant of the form in the paper. We give the formulation as follows.

[TABLE]

[11] prove that through reparameterizing, the above formulation is equivalent to $\nu$ -SVM (5). Concretely speaking, let

[TABLE]

Then, (14) can be transformed to $\nu$ -SVM (5).

Parameter $\nu$ in $\nu$ -SVM: As discussed in Section 5, $\nu$ -SVM has a feasible solution when $\nu$ belongs to $[1/\min(n_{1},n_{2}),1)$ , where $n_{1}$ is the number of points with positive label and $n_{2}$ is the number of points with negative label. However, not all feasible $\nu$ induce a reasonable prediction model. If $\nu$ is too close to 1, the two reduced polytopes are not separable, and the closest distance between the two reduced polytopes is zero. Note that in general the overlapping points are not unique. Hence the solution is not unique. Moreover, because the solution corresponds to two overlapping points, the vector $w$ (which represents the vector determined by the two points) is not unique and hence is unstable. Overall, here we select a relatively small $\nu$ .

Recall that we let

[TABLE]

We set $\alpha=\{0.1,0.3,0.5\}$ and train the $\nu$ -SVM model on the data set “a9a”, “ijcnn1”, “phishing”. We list the results in Table 3.

Saddle-SVC vs. LinearSVC: As discussed before, they solve different SVM variants. Thus, we use the test accuracy instead of the objective values to evaluate the convergence rate. First, we explain the stopping criterion of Saddle-SVC. In Theorem 7, we prove that Saddle-SVC converges in $\tilde{O}(d+\sqrt{d/\epsilon\beta})$ rounds. Let $T=d+\sqrt{d/\epsilon\beta}$ . We repeat the iterations of Saddle-SVC and compute the objective function every $T$ rounds. If the difference between two consecutive objective values is less than $\epsilon$ , we output the result. We note that LinearSVC is very efficient for sparse data sets, but for dense data sets, Saddle-SVC performs better. In the experiment, we use “nnz” to represent the ratio of non-zero elements to all elements. We show that the parameter nnz significantly affects the efficiency of LinearSVC, while Saddle-SVC is barely affected. We use “skin_nonskin”, “w8a” and synthetic data sets with different values of nnz to evaluate the performance. We list the details in Table 4.

Saddle-DSVC vs. HOGWILD!: Since Saddle-DSVC is the first practical distributed algorithm for $\nu$ -SVM, we use another popular distributed algorithm called HOGWILD! for comparison. Note that HOGWILD! is used to solve $C$ -SVM and $l_{2}$ -SVM but not $\nu$ -SVM. Thus, instead of the objective function, we use the accuracy to evaluate the performance of the algorithms. Here we use HOGWILD! for $C$ -SVM and Saddle-DSVC for $\nu$ -SVM. See the details in Figure 6. For comparison, we also provide the results of the Gilbert Algorithm. Here we choose $\alpha=0.85$ for Saddle-DSVC and $C=32$ for HOGWILD!. We can see that Saddle-DSVC converges faster than HOGWILD! w.r.t. communication cost. Moreover, Saddle-DSVC is more stable than HOGWILD!.

Bibliography

[1] Nir Ailon and Bernard Chazelle. Faster dimension reduction. Communications of the ACM, 53(2):97–104, 2010.
[2] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In STOC, 2016.
[3] Zeyuan Allen-Zhu, Zhenyu Liao, and Yang Yuan. Optimization algorithms for faster computational geometry. In LIPIcs, volume 55, 2016.
[4] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.
[5] Kristin P Bennett and Erin J Bredensteiner. Duality and geometry in SVM classifiers. In ICML, pages 57–64, 2000.
[6] Bernhard E Boser, Isabelle M Guyon, and Vladimir N Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152. ACM, 1992.
[7] Chih-Chung Chang and Chih-Jen Lin. Training ν-support vector classifiers: theory and algorithms. Neural Computation, 13(9):2119–2147, 2001.
[8] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. TIST, 2(3):27, 2011.
