Insensitive Stochastic Gradient Twin Support Vector Machine for Large Scale Problems

Zhen Wang; Yuan-Hai Shao; Lan Bai; Li-Ming Liu; Nai-Yang Deng

arXiv:1704.05596·cs.LG·August 17, 2018

TL;DR

This paper introduces SGTSVM, a stochastic gradient method for twin support vector machines that is less sensitive to sampling variation than PEGASOS, with a convergence proof and stable performance on large datasets.

Contribution

The paper proposes a novel stochastic gradient twin support vector machine (SGTSVM) that is less sensitive to sampling, with theoretical convergence proof and applicability to nonlinear cases.

Findings

01

SGTSVM is proved to converge, whereas PEGASOS is only shown to converge almost surely.

02

SGTSVM demonstrates stable and fast learning on large datasets.

03

Approximation between SGTSVM and twin SVM is established.

Tables

Table 1: Mean accuracy (%) with standard deviation of TWSVM and SGTSVM by 10-fold cross validation.

Data	TWSVM^†	SGTSVM^†	TWSVM^♯	SGTSVM^♯
Cross Planes	96.05 $\pm$ 0.70	97.71 $\pm$ 0.41	99.01 $\pm$ 2.24	98.51 $\pm$ 2.15
Australia	86.87 $\pm$ 0.38	87.34 $\pm$ 0.13	87.10 $\pm$ 0.43	85.21 $\pm$ 0.16
Credit	85.78 $\pm$ 0.32	85.72 $\pm$ 0.23	86.71 $\pm$ 0.33	85.21 $\pm$ 0.45
Hypothyroid	98.21 $\pm$ 0.09	97.28 $\pm$ 0.01	98.08 $\pm$ 0.09	98.07 $\pm$ 0.03

Table 2: The details of the large scale datasets.

Data	Name	No. of samples	Dimension	Ratio
(a)	Skin	245,057	3	0.262
(b)	Gashome	928,990	10	0.578
(c)	Susy	5,000,000	18	0.844
(d)	Kddcup	4,898,432	41	0.248
(e)	Gas	8,386,764	16	0.077
(f)	Hepmass	10,500,000	28	1.000

Table 3: The results on the large scale datasets.

Data		SVM	PEGASOS	SGTSVM^†	SGTSVM^♯
Skin	validation(%)	78.87	82.46	$85.23$	84.70
245,057 $\times$ 3	testing(%)	84.28	85.39	$87.70$	85.34
Gashome	validation(%)	49.11	70.09	67.50	$74.49$
919,438 $\times$ 10	testing(%)	82.57	72.85	76.09	$89.13$
Susy	validation(%)	$78.41$	54.11	76.14	69.90
5,000,000 $\times$ 18	testing(%)	$78.52$	56.44	75.09	68.61
Kddcup	validation(%)	*	$96.39$	95.24	93.19
4,898,432 $\times$ 41	testing(%)	*	96.42	97.45	$99.20$
Gas	validation(%)	*	69.77	89.73	$92.60$
8,386,764 $\times$ 16	testing(%)	*	50.54	92.45	$92.86$
Hepmass	validation(%)	*	80.63	80.80	$82.18$
10,500,000 $\times$ 28	testing(%)	*	80.84	$81.10$	79.59

Table 4: The optimal parameters of SVM, PEGASOS, and SGTSVM.

Data		SVM	PEGASOS	SGTSVM^†	SGTSVM^♯
		c	c	$c_{1} = c_{3}, c_{2} = c_{4}$	$c_{1} = c_{3}, c_{2} = c_{4}, μ$
		$2^{i}$	$2^{i}$	$2^{i}, 2^{j}$	$2^{i}, 2^{j}, 2^{k}$
Skin	validation	-1	-6	0,-5	-6,-5,-3
	testing	-1	-4	1,-6	-1,0,-9
Gashome	validation	0	-6	-4,-5	-3,-5,-2
	testing	-1	-1	-8,-7	-8,-1,-2
Susy	validation	1	0	-2,-6	-3,-1,-4
	testing	0	-7	-1,-3	-3,-3,-3
Kddcup	validation	NA	-6	-8,-4	0,-3,-4
	testing	NA	-2	-8,-4	-6,-1,-8
Gas	validation	NA	-1	-4,0	-1,-1,-6
	testing	NA	1	-3,1	-4,-8,-6
Hepmass	validation	NA	0	-1,-2	-4,-1,-3
	testing	NA	0	0,-2	-4,-2,-3

Equations

w^{⊤} x + b = 0,

\displaystyle\begin{array}[]{ll}\underset{w,b}{\min}~{}~{}~{}~{}\frac{1}{2}||w||^{2}+\frac{c}{m}e^{\top}\xi\\ \hbox{s.t.\ }~{}~{}~{}~{}~{}~{}D(X^{\top}w+b)\geq e-\xi,~{}~{}\xi\geq 0,\end{array}

\begin{array}[]{l}\underset{w}{\min}~{}~{}\frac{1}{2}||w||^{2}+\frac{c}{m}e^{\top}\xi\\ s.t.~{}~{}~{}~{}DX^{\top}w\geq e-\xi,\xi\geq 0,\end{array}

\begin{array}[]{l}\underset{w}{\min}~{}~{}\frac{1}{2}||w||^{2}+\frac{c}{m}e^{\top}(e-DX^{\top}w)_{+},\end{array}

\begin{array}[]{l}g_{t}(w)=\frac{1}{2}||w||^{2}+c(1-y_{t}w^{\top}x_{t})_{+}.\end{array}

\begin{array}[]{l}\nabla_{w_{t}}g_{t}(w)=w_{t}-cy_{t}x_{t}\text{sign}(1-y_{t}w_{t}^{\top}x_{t})_{+}.\end{array}

\begin{array}[]{l}y=\text{sign}(w^{\top}x).\end{array}

\begin{array}[]{l}g(w_{t},b)=\frac{1}{2}||w_{t}||^{2}+C(1-y_{t}(w_{t}^{\top}x_{t}+b))_{+}.\end{array}

w_{1}^{⊤} x + b_{1} = 0 and w_{2}^{⊤} x + b_{2} = 0,

\displaystyle\begin{array}[]{ll}\underset{w_{1},b_{1}}{\min}&\frac{1}{2}(||w_{1}||^{2}+b_{1}^{2})+\frac{c_{1}}{2m_{1}}\|X_{1}^{\top}w_{1}+b_{1}\|^{2}+\frac{c_{2}}{m_{2}}e^{\top}\xi_{1}\\ \hbox{s.t.\ }&X_{2}^{\top}w_{1}+b_{1}-\xi_{1}\leq-e,~{}~{}\xi_{1}\geq 0,\end{array}

\displaystyle\begin{array}[]{ll}\underset{w_{2},b_{2}}{\min}&\frac{1}{2}(||w_{2}||^{2}+b_{2}^{2})+\frac{c_{3}}{2m_{2}}\|X_{2}^{\top}w_{2}+b_{2}\|^{2}+\frac{c_{4}}{m_{1}}e^{\top}\xi_{2}\\ \hbox{s.t.\ }&X_{1}^{\top}w_{2}+b_{2}+\xi_{2}\geq e,~{}~{}\xi_{2}\geq 0,\end{array}

y=\underset{i=1,2}{\arg\min}~\frac{|w_{i}^{\top}x+b_{i}|}{||w_{i}||},

\begin{array}[]{l}\underset{w_{1},b_{1}}{\min}~{}~{}\frac{1}{2}(||w_{1}||^{2}+b_{1}^{2})+\frac{c_{1}}{2m_{1}}||X_{1}^{\top}w_{1}+b_{1}||^{2}+\frac{c_{2}}{m_{2}}e^{\top}(e+X_{2}^{\top}w_{1}+b_{1})_{+},\end{array}

\begin{array}[]{l}\underset{w_{2},b_{2}}{\min}~{}~{}\frac{1}{2}(||w_{2}||^{2}+b_{2}^{2})+\frac{c_{3}}{2m_{2}}||X_{2}^{\top}w_{2}+b_{2}||^{2}+\frac{c_{4}}{m_{1}}e^{\top}(e-X_{1}^{\top}w_{2}-b_{2})_{+},\end{array}

\begin{array}[]{l}f_{1,t}=\frac{1}{2}(||w_{1}||^{2}+b_{1}^{2})+\frac{c_{1}}{2}||w_{1}^{\top}x_{t}+b_{1}||^{2}+c_{2}(1+w_{1}^{\top}\hat{x}_{t}+b_{1})_{+},\end{array}

\begin{array}[]{l}f_{2,t}=\frac{1}{2}(||w_{2}||^{2}+b_{2}^{2})+\frac{c_{3}}{2}||w_{2}^{\top}\hat{x}_{t}+b_{2}||^{2}+c_{4}(1-w_{2}^{\top}x_{t}-b_{2})_{+},\\ \end{array}

\begin{array}[]{l}\nabla_{w_{1,t}}f_{1,t}=w_{1,t}+c_{1}(w_{1,t}^{\top}x_{t}+b_{1,t})x_{t}+c_{2}\hat{x}_{t}\text{sign}(1+w_{1,t}^{\top}\hat{x}_{t}+b_{1,t})_{+},\\ \nabla_{b_{1,t}}f_{1,t}=b_{1,t}+c_{1}(w_{1,t}^{\top}x_{t}+b_{1,t})+c_{2}\text{sign}(1+w_{1,t}^{\top}\hat{x}_{t}+b_{1,t})_{+},\end{array}

\begin{array}[]{l}\nabla_{w_{2,t}}f_{2,t}=w_{2,t}+c_{3}(w_{2,t}^{\top}\hat{x}_{t}+b_{2,t})\hat{x}_{t}-c_{4}x_{t}\text{sign}(1-w_{2,t}^{\top}x_{t}-b_{2,t})_{+},\\ \nabla_{b_{2,t}}f_{2,t}=b_{2,t}+c_{3}(w_{2,t}^{\top}\hat{x}_{t}+b_{2,t})-c_{4}\text{sign}(1-w_{2,t}^{\top}x_{t}-b_{2,t})_{+},\end{array}

\begin{array}[]{ll}w_{1,t+1}=w_{1,t}-\eta_{t}\nabla_{w_{1,t}}f_{1,t},\\ b_{1,t+1}=b_{1,t}-\eta_{t}\nabla_{b_{1,t}}f_{1,t},\\ w_{2,t+1}=w_{2,t}-\eta_{t}\nabla_{w_{2,t}}f_{2,t},\\ b_{2,t+1}=b_{2,t}-\eta_{t}\nabla_{b_{2,t}}f_{2,t},\\ \end{array}

K (x, X)^{⊤} w_{1} + b_{1} = 0 and K (x, X)^{⊤} w_{2} + b_{2} = 0.

\begin{array}[]{l}\underset{w_{1},b_{1}}{\min}~{}~{}\frac{1}{2}(||w_{1}||^{2}+b_{1}^{2})+\frac{c_{1}}{2m_{1}}||K(X_{1},X)^{\top}w_{1}+b_{1}||^{2}+\frac{c_{2}}{m_{2}}e^{\top}(e+K(X_{2},X)^{\top}w_{1}+b_{1})_{+},\end{array}

\begin{array}[]{l}\underset{w_{2},b_{2}}{\min}~{}~{}\frac{1}{2}(||w_{2}||^{2}+b_{2}^{2})+\frac{c_{3}}{2m_{2}}||K(X_{2},X)^{\top}w_{2}+b_{2}||^{2}+\frac{c_{4}}{m_{1}}e^{\top}(e-K(X_{1},X)^{\top}w_{2}-b_{2})_{+}.\end{array}

\begin{array}[]{l}h_{1,t}=\frac{1}{2}(||w_{1}||^{2}+b_{1}^{2})+\frac{c_{1}}{2}||K(x_{t},X)^{\top}w_{1}+b_{1}||^{2}+c_{2}(1+K(\hat{x}_{t},X)^{\top}w_{1}+b_{1})_{+},\end{array}

\begin{array}[]{l}h_{2,t}=\frac{1}{2}(||w_{2}||^{2}+b_{2}^{2})+\frac{c_{3}}{2}||K(\hat{x}_{t},X)^{\top}w_{2}+b_{2}||^{2}+c_{4}(1-K(x_{t},X)^{\top}w_{2}-b_{2})_{+}.\\ \end{array}

\begin{array}[]{l}\underset{u}{\min}~{}~{}f(u)=\frac{1}{2}||u||^{2}+\frac{c_{1}}{2m_{1}}||Z_{1}u||^{2}+\frac{c_{2}}{m_{2}}e^{\top}(e+Z_{2}u)_{+}.\end{array}

\begin{array}[]{l}f_{t}(u)=\frac{1}{2}||u||^{2}+\frac{c_{1}}{2}||u^{\top}z_{t}||^{2}+c_{2}(1+u^{\top}\hat{z}_{t})_{+},\end{array}

\begin{array}[]{l}\nabla_{t}=u_{t}+c_{1}(u_{t}^{\top}z_{t})z_{t}+c_{2}\hat{z}_{t}\text{sign}(1+u_{t}^{\top}\hat{z}_{t})_{+}.\end{array}

\begin{array}[]{l}u_{t+1}=u_{t}-\eta_{t}\nabla_{t},\end{array}

\begin{array}[]{l}u_{t+1}=(1-\frac{1}{t})u_{t}-\frac{c_{1}}{t}z_{t}z_{t}^{\top}u_{t}-\frac{c_{2}}{t}\hat{z}_{t}\text{sign}(1+u_{t}^{\top}\hat{z}_{t})_{+}.\end{array}

\begin{array}[]{l}u_{t+1}=A_{t}u_{t}+\frac{1}{t}v_{t},\end{array}

Full text

Insensitive Stochastic Gradient Twin Support Vector Machines for Large Scale Problems

Zhen Wang

School of Mathematical Sciences, Inner Mongolia University, Hohhot, 010021, P.R.China

Yuan-Hai Shao

School of Economics and Management, Hainan University, Haikou, 570228, P.R. China

Lan Bai

Li-Ming Liu

School of Statistics, Capital University of Economics and Business, Beijing, 100070, P.R.China

Nai-Yang Deng

College of Science, China Agricultural University, Beijing, 100083, P.R.China

Abstract

The stochastic gradient descent algorithm has been successfully applied to support vector machines (called PEGASOS) for many classification problems. In this paper, the stochastic gradient descent algorithm is applied to twin support vector machines for classification. Compared with PEGASOS, the proposed stochastic gradient twin support vector machine (SGTSVM) is insensitive to the stochastic sampling used in stochastic gradient descent. In theory, we prove the convergence of SGTSVM, rather than the almost sure convergence established for PEGASOS. For uniform sampling, the approximation between SGTSVM and twin support vector machines is also given, while PEGASOS only has an opportunity to obtain an approximation of support vector machines. In addition, the nonlinear SGTSVM is derived directly from its linear case. Experimental results on both artificial datasets and large scale problems show the stable performance of SGTSVM with a fast learning speed.

keywords:

Classification, support vector machines, twin support vector machines, stochastic gradient descent, large scale problem.

1 Introduction

Support vector machines (SVM), a powerful tool for classification [7, 20, 42], have outperformed most other classifiers in a wide variety of applications [23, 17, 11]. Unlike SVM, which uses a pair of parallel hyperplanes, twin support vector machines (TWSVM) [12, 35] use a pair of nonparallel hyperplanes, and several variants have been developed, e.g., twin bounded support vector machines (TBSVM) [35], twin parametric margin support vector machines (TPMSVM) [24], and weighted Lagrangian twin support vector machines (WLTSVM) [33]. These classifiers have been widely applied to many practical problems [34, 39, 19, 38, 6, 36, 29, 28, 26, 27]. In the training stage, SVM solves a quadratic programming problem (QPP), whereas TWSVM solves two smaller QPPs with a traditional solver such as the interior point method [20, 1, 12]. However, neither SVM nor TWSVM based on these solvers can handle large scale problems, especially those with millions of samples.

To deal with large scale problems, many improvements have been proposed: for SVM, sequential minimal optimization, the coordinate descent method, trust region Newton, and the stochastic gradient descent algorithm (SGD) [25, 13, 5, 9, 32]; and for TWSVM, the successive overrelaxation technique, the Newton-Armijo algorithm, and the dual coordinate descent method [35, 39, 37]. The stochastic gradient descent algorithm for SVM (PEGASOS) [16, 43, 32, 41] has attracted great attention because it partitions the large scale problem into a series of subproblems by stochastic sampling with a suitable size. It has been proved that PEGASOS converges almost surely, and thus is able to find an approximation of the desired solution with high probability [2, 43, 32]. Existing experiments confirm the effectiveness of these algorithms and their fast learning speed.

However, for large scale problems, the stochastic sampling in SGD may cause difficulties for SVM because only a small subset of the dataset is selected for training. If the subset is not suitable, PEGASOS performs poorly. It is well known that in SVM the support vectors (SVs), a small subset of the dataset, decide the final classifier. If the stochastic sampling does not include the SVs sufficiently often, the classifier loses generalization ability. Figure 1 is a toy example for PEGASOS. There are two classes in this figure, where the positive and negative classes include 6 and 4 samples, respectively, and the circle is one of the potential SVs. The solid blue line is the separating line obtained by PEGASOS under three different sampling schemes: (i) emphasizing the circle sample; (ii) infrequently using the circle sample; (iii) ignoring the circle sample. Figure 1 shows that the circle sample plays an important role in determining the separating line, and that infrequently using or ignoring this sample leads to misclassification.

Compared with SVM, TWSVM is notably more stable under sampling and does not strongly depend on special samples such as the SVs [12, 35], which indicates that SGD is better suited to TWSVM. Therefore, in this paper, we propose a stochastic gradient twin support vector machine (SGTSVM). Unlike PEGASOS, our method randomly selects two samples from different classes in each iteration to construct a pair of nonparallel hyperplanes. Because TWSVM fits all of the training samples, our method is stable under stochastic sampling and thus generalizes well. Moreover, the characteristics inherited from TWSVM make SGTSVM suitable for many cases, e.g., the “cross planes” dataset [21] and preferential classification [12]. For the toy example above, Figure 2 shows the corresponding results of SGTSVM. Comparing Figure 2 with Figure 1, it is clear that SGTSVM performs better than PEGASOS.

The main contributions of this paper include:

(i) an SGD-based TWSVM (SGTSVM) is proposed, and it can easily be extended to other TWSVM-type classifiers;

(ii) we prove that the proposed SGTSVM is convergent, rather than only almost surely convergent as PEGASOS is;

(iii) for uniform sampling, it is proved that the original objective value at the solution to SGTSVM is bounded by the optimum of TWSVM, which indicates that the solution to SGTSVM is an approximation of the optimal solution to TWSVM, while PEGASOS only has an opportunity to obtain an approximation of the optimal solution to SVM (for more information, see Corollaries 1 and 2 in [32]);

(iv) the nonlinear case of SGTSVM is obtained directly from its original problem;

(v) each iteration of SGTSVM requires no more than $8n+4$ multiplications and no additional storage, so it is faster than the other TWSVM-type classifiers proposed so far.

The rest of this paper is organized as follows. Section 2 briefly reviews SVM, PEGASOS, and TWSVM. Our linear and nonlinear SGTSVMs, together with the theoretical analysis, are elaborated in Section 3. Experiments are presented in Section 4. Finally, we give the conclusions.

2 Related Works

Consider a binary classification problem in the $n$-dimensional real space $R^{n}$. The set of training samples is represented by $X\in R^{n\times m}$, where each sample $x\in R^{n}$ has a label $y\in\{+1,-1\}$. We further organize the $m_{1}$ samples of Class $+1$ into a matrix $X_{1}\in R^{n\times m_{1}}$ and the $m_{2}$ samples of Class $-1$ into a matrix $X_{2}\in R^{n\times m_{2}}$. Below, we give a brief outline of some related works.

2.1 SVM

Support vector machines (SVM) [7, 3] search for a separating hyperplane

$$w^{\top}x+b=0, \tag{1}$$

where $w\in R^{n}$ and $b\in R$. By introducing the regularization term, the primal problem of SVM can be expressed as the following QPP

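$$\begin{array}{ll}\underset{w,b}{\min} & \frac{1}{2}||w||^{2}+\frac{c}{m}e^{\top}\xi\\ \text{s.t.} & D(X^{\top}w+b)\geq e-\xi,\quad \xi\geq 0,\end{array} \tag{4}$$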

where $||\cdot||$ denotes the $L_{2}$ norm, $c>0$ is a parameter with some quantitative meanings [3], $e$ is a vector of ones with an appropriate dimension, $\xi\in R^{m}$ is the slack vector, and $D=\text{diag}(y_{1},\ldots,y_{m})$. Note that minimizing the regularization term $\|w\|^{2}$ is equivalent to maximizing the margin between the two parallel supporting hyperplanes $w^{\top}x+b=\pm 1$. The structural risk minimization principle is thereby implemented in this problem [7].

2.2 PEGASOS

PEGASOS [43, 32] considers a strongly convex problem by modifying (4) as follows

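$$\begin{array}{ll}\underset{w}{\min} & \frac{1}{2}||w||^{2}+\frac{c}{m}e^{\top}\xi\\ \text{s.t.} & DX^{\top}w\geq e-\xi,\quad \xi\geq 0,\end{array} \tag{5}$$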

and recasts the above problem to

$$\underset{w}{\min}~~\frac{1}{2}||w||^{2}+\frac{c}{m}e^{\top}(e-DX^{\top}w)_{+}, \tag{6}$$

where $(\cdot)_{+}$ replaces negative components of a vector by zeros.

In the $t$ th iteration ( $t\geq 1$ ), PEGASOS constructs a temporary function, which is defined by a random sample $x_{t}\in X$ as

$$g_{t}(w)=\frac{1}{2}||w||^{2}+c(1-y_{t}w^{\top}x_{t})_{+}. \tag{7}$$

Then, starting with an initial $w_{1}$ , PEGASOS iteratively updates $w_{t+1}=w_{t}-\eta_{t}\nabla_{w_{t}}g_{t}(w)$ for $t\geq 1$ , where $\eta_{t}=1/t$ is the step size and $\nabla_{w_{t}}g_{t}(w)$ is the sub-gradient of $g_{t}(w)$ at $w_{t}$ ,

$$\nabla_{w_{t}}g_{t}(w)=w_{t}-cy_{t}x_{t}\,\text{sign}(1-y_{t}w_{t}^{\top}x_{t})_{+}. \tag{8}$$

When a termination condition is satisfied, the last $w_{t}$ is output as $w$, and a new sample $x$ can be predicted by

$$y=\text{sign}(w^{\top}x). \tag{9}$$

It has been proved that the average solution $\bar{w}=\frac{1}{T}\sum\limits_{t=1}^{T}w_{t}$ is bounded by the optimal solution $w^{*}$ to (6) up to an $o(1)$ term, and thus PEGASOS finds a good approximation of $w^{*}$ with probability at least $1/2$ [32]. The authors of [32] also pointed out that $w_{T}$ is often used instead of $\bar{w}$ in practice. The randomly selected sample $x_{t}$ can be replaced with a small subset of the whole dataset, although a subset consisting of a single sample is often used in practice [43, 32, 41]. To extend the generalization ability of PEGASOS, the bias term $b$ in SVM can be appended to PEGASOS by replacing $g_{t}(w)$ of (7) with

$$g(w_{t},b)=\frac{1}{2}||w_{t}||^{2}+c(1-y_{t}(w_{t}^{\top}x_{t}+b))_{+}. \tag{10}$$

However, with this modification the function is no longer strongly convex, which yields a slower convergence rate [32].
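As a rough illustration of the PEGASOS iteration described above, the following Python sketch applies the update $w_{t+1}=w_{t}-\eta_{t}\nabla_{w_{t}}g_{t}(w)$ with $\eta_{t}=1/t$ and one random sample per step. The function names, the default values, the zero initialization, and the fixed iteration count `T` used as the stopping rule are assumptions made for this example and are not taken from [32].

```python
import numpy as np


def pegasos_step(w, x_t, y_t, c, t):
    """One update w_{t+1} = w_t - (1/t) * sub-gradient of g_t at w_t."""
    # sign(.)_+ in the text: 1 if the hinge term is active, else 0.
    active = 1.0 if 1.0 - y_t * (w @ x_t) > 0 else 0.0
    grad = w - c * y_t * x_t * active
    return w - grad / t


def pegasos(X, y, c=1.0, T=1000, seed=0):
    """X has shape (n, m): one sample per column, as in the text."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w = np.zeros(n)  # assumed initial point w_1 = 0
    for t in range(1, T + 1):
        i = rng.integers(m)  # uniformly sampled index
        w = pegasos_step(w, X[:, i], y[i], c, t)
    return w
```

A new sample `x` would then be labelled by `np.sign(w @ x)`, and no bias term is learned in this variant.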

2.3 TWSVM

TWSVM [12, 35] seeks a pair of nonparallel hyperplanes in $R^{n}$ which can be expressed as

$$w_{1}^{\top}x+b_{1}=0 \quad\text{and}\quad w_{2}^{\top}x+b_{2}=0, \tag{11}$$

such that each hyperplane is close to samples of one class and has a certain distance from the other class. To find the pair of nonparallel hyperplanes, it is required to get the solutions to the primal problems

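$$\begin{array}{ll}\underset{w_{1},b_{1}}{\min} & \frac{1}{2}(||w_{1}||^{2}+b_{1}^{2})+\frac{c_{1}}{2m_{1}}||X_{1}^{\top}w_{1}+b_{1}||^{2}+\frac{c_{2}}{m_{2}}e^{\top}\xi_{1}\\ \text{s.t.} & X_{2}^{\top}w_{1}+b_{1}-\xi_{1}\leq -e,\quad \xi_{1}\geq 0,\end{array} \tag{14}$$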

and

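$$\begin{array}{ll}\underset{w_{2},b_{2}}{\min} & \frac{1}{2}(||w_{2}||^{2}+b_{2}^{2})+\frac{c_{3}}{2m_{2}}||X_{2}^{\top}w_{2}+b_{2}||^{2}+\frac{c_{4}}{m_{1}}e^{\top}\xi_{2}\\ \text{s.t.} & X_{1}^{\top}w_{2}+b_{2}+\xi_{2}\geq e,\quad \xi_{2}\geq 0,\end{array} \tag{17}$$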

where $c_{1}$, $c_{2}$, $c_{3}$, and $c_{4}$ are positive parameters, and $\xi_{1}\in R^{m_{2}}$ and $\xi_{2}\in R^{m_{1}}$ are slack vectors. Their geometric meaning is clear. For example, in (14), the objective function, together with the regularization term, keeps the samples of Class $+1$ close to the hyperplane $w_{1}^{\top}x+b_{1}=0$, while the constraints keep each sample of Class $-1$ at a distance of more than $1/||w_{1}||$ from the hyperplane $w_{1}^{\top}x+b_{1}=-1$.

Once the solutions $(w_{1},b_{1})$ and $(w_{2},b_{2})$ to the problems (14) and (17) are obtained, a new point $x\in R^{n}$ is assigned to a class according to its distance to the two hyperplanes in (11), i.e.,

$$y=\underset{i=1,2}{\arg\min}~\frac{|w_{i}^{\top}x+b_{i}|}{||w_{i}||}, \tag{18}$$

where $|\cdot|$ is the absolute value.
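The decision rule can be sketched in a few lines of Python. This example assumes that the distance to each hyperplane is the perpendicular distance $|w_{i}^{\top}x+b_{i}|/||w_{i}||$, that the hyperplane indexed by 1 corresponds to Class $+1$ and the one indexed by 2 to Class $-1$, and that ties go to Class $+1$; the function name `twsvm_predict` and the tie-breaking choice are hypothetical.

```python
import numpy as np


def twsvm_predict(X, w1, b1, w2, b2):
    """Label each column of X (shape (n, k)) by its nearest hyperplane."""
    d1 = np.abs(w1 @ X + b1) / np.linalg.norm(w1)  # distance to plane 1
    d2 = np.abs(w2 @ X + b2) / np.linalg.norm(w2)  # distance to plane 2
    # Assumed convention: nearest to plane 1 -> +1, otherwise -1.
    return np.where(d1 <= d2, 1, -1)
```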

3 SGTSVM

In this section, we elaborate our SGTSVM and give its convergence analysis together with the boundedness.

3.1 Linear Formation

Following the notations in Section 2, we recast the QPPs (14) and (17) in TWSVM to unconstrained problems

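$$\underset{w_{1},b_{1}}{\min}~~\frac{1}{2}(||w_{1}||^{2}+b_{1}^{2})+\frac{c_{1}}{2m_{1}}||X_{1}^{\top}w_{1}+b_{1}||^{2}+\frac{c_{2}}{m_{2}}e^{\top}(e+X_{2}^{\top}w_{1}+b_{1})_{+}, \tag{19}$$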

and

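$$\underset{w_{2},b_{2}}{\min}~~\frac{1}{2}(||w_{2}||^{2}+b_{2}^{2})+\frac{c_{3}}{2m_{2}}||X_{2}^{\top}w_{2}+b_{2}||^{2}+\frac{c_{4}}{m_{1}}e^{\top}(e-X_{1}^{\top}w_{2}-b_{2})_{+}, \tag{20}$$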

respectively.

In order to solve the above two problems, we construct a series of strictly convex functions $f_{1,t}(w_{1},b_{1})$ and $f_{2,t}(w_{2},b_{2})$ with $t\geq 1$ as

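$$f_{1,t}=\frac{1}{2}(||w_{1}||^{2}+b_{1}^{2})+\frac{c_{1}}{2}||w_{1}^{\top}x_{t}+b_{1}||^{2}+c_{2}(1+w_{1}^{\top}\hat{x}_{t}+b_{1})_{+}, \tag{21}$$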

and

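$$f_{2,t}=\frac{1}{2}(||w_{2}||^{2}+b_{2}^{2})+\frac{c_{3}}{2}||w_{2}^{\top}\hat{x}_{t}+b_{2}||^{2}+c_{4}(1-w_{2}^{\top}x_{t}-b_{2})_{+}, \tag{22}$$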

where $x_{t}$ and $\hat{x}_{t}$ are selected randomly from $X_{1}$ and $X_{2}$ , respectively.

The sub-gradients of the above functions at $(w_{1,t},b_{1,t})$ and $(w_{2,t},b_{2,t})$ can be obtained as

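$$\begin{array}{l}\nabla_{w_{1,t}}f_{1,t}=w_{1,t}+c_{1}(w_{1,t}^{\top}x_{t}+b_{1,t})x_{t}+c_{2}\hat{x}_{t}\,\text{sign}(1+w_{1,t}^{\top}\hat{x}_{t}+b_{1,t})_{+},\\ \nabla_{b_{1,t}}f_{1,t}=b_{1,t}+c_{1}(w_{1,t}^{\top}x_{t}+b_{1,t})+c_{2}\,\text{sign}(1+w_{1,t}^{\top}\hat{x}_{t}+b_{1,t})_{+},\end{array} \tag{23}$$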

and

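$$\begin{array}{l}\nabla_{w_{2,t}}f_{2,t}=w_{2,t}+c_{3}(w_{2,t}^{\top}\hat{x}_{t}+b_{2,t})\hat{x}_{t}-c_{4}x_{t}\,\text{sign}(1-w_{2,t}^{\top}x_{t}-b_{2,t})_{+},\\ \nabla_{b_{2,t}}f_{2,t}=b_{2,t}+c_{3}(w_{2,t}^{\top}\hat{x}_{t}+b_{2,t})-c_{4}\,\text{sign}(1-w_{2,t}^{\top}x_{t}-b_{2,t})_{+},\end{array} \tag{24}$$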

respectively.

Our SGTSVM starts from the initial $(w_{1,1},b_{1,1})$ and $(w_{2,1},b_{2,1})$. Then, for $t\geq 1$, the updates are given by

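$$\begin{array}{l}w_{1,t+1}=w_{1,t}-\eta_{t}\nabla_{w_{1,t}}f_{1,t},\\ b_{1,t+1}=b_{1,t}-\eta_{t}\nabla_{b_{1,t}}f_{1,t},\\ w_{2,t+1}=w_{2,t}-\eta_{t}\nabla_{w_{2,t}}f_{2,t},\\ b_{2,t+1}=b_{2,t}-\eta_{t}\nabla_{b_{2,t}}f_{2,t},\end{array} \tag{25}$$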

where $\eta_{t}$ is the step size, typically set to $1/t$. When the termination condition is satisfied, $(w_{1,t},b_{1,t})$ is assigned to $(w_{1},b_{1})$, and $(w_{2,t},b_{2,t})$ is assigned to $(w_{2},b_{2})$. Then, a new sample $x\in R^{n}$ can be predicted by (18).

The above procedures are summarized in Algorithm 1.
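For readers without access to Algorithm 1, the linear procedure can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function name `sgtsvm_linear`, the zero initialization, the default iteration cap `T`, and the stopping test on the size of the last step are all choices made for this example. The sub-gradients and the step size $\eta_{t}=1/t$ follow the formulas of this subsection.

```python
import numpy as np


def sgtsvm_linear(X1, X2, c1, c2, c3, c4, T=10000, tol=1e-6, seed=0):
    """X1: (n, m1) Class +1 samples, X2: (n, m2) Class -1 samples (columns)."""
    rng = np.random.default_rng(seed)
    n = X1.shape[0]
    w1, b1 = np.zeros(n), 0.0  # assumed initial point
    w2, b2 = np.zeros(n), 0.0
    for t in range(1, T + 1):
        x = X1[:, rng.integers(X1.shape[1])]   # x_t drawn from Class +1
        xh = X2[:, rng.integers(X2.shape[1])]  # \hat{x}_t drawn from Class -1
        eta = 1.0 / t
        # Sub-gradient of f_{1,t} at (w1, b1).
        r1 = w1 @ x + b1
        a1 = 1.0 if 1.0 + w1 @ xh + b1 > 0 else 0.0
        gw1 = w1 + c1 * r1 * x + c2 * a1 * xh
        gb1 = b1 + c1 * r1 + c2 * a1
        # Sub-gradient of f_{2,t} at (w2, b2).
        r2 = w2 @ xh + b2
        a2 = 1.0 if 1.0 - w2 @ x - b2 > 0 else 0.0
        gw2 = w2 + c3 * r2 * xh - c4 * a2 * x
        gb2 = b2 + c3 * r2 - c4 * a2
        # Norm of the combined step, used as an assumed stopping test.
        step = eta * np.sqrt(gw1 @ gw1 + gb1**2 + gw2 @ gw2 + gb2**2)
        w1, b1 = w1 - eta * gw1, b1 - eta * gb1
        w2, b2 = w2 - eta * gw2, b2 - eta * gb2
        if step < tol:
            break
    return (w1, b1), (w2, b2)
```

Each iteration touches only two samples and stores only the two current hyperplanes, which is what keeps the per-iteration cost linear in $n$.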

3.2 Nonlinear Formation

Now, we extend our SGTSVM to the nonlinear case by the kernel trick [21, 12, 35, 31, 15, 18]. Suppose $K(\cdot,\cdot)$ is the predefined kernel function. Then the nonparallel hyperplanes can be expressed as

$$K(x,X)^{\top}w_{1}+b_{1}=0 \quad\text{and}\quad K(x,X)^{\top}w_{2}+b_{2}=0. \tag{26}$$

The counterparts of (19) and (20) can be formulated as

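$$\underset{w_{1},b_{1}}{\min}~~\frac{1}{2}(||w_{1}||^{2}+b_{1}^{2})+\frac{c_{1}}{2m_{1}}||K(X_{1},X)^{\top}w_{1}+b_{1}||^{2}+\frac{c_{2}}{m_{2}}e^{\top}(e+K(X_{2},X)^{\top}w_{1}+b_{1})_{+}, \tag{27}$$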

and

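$$\underset{w_{2},b_{2}}{\min}~~\frac{1}{2}(||w_{2}||^{2}+b_{2}^{2})+\frac{c_{3}}{2m_{2}}||K(X_{2},X)^{\top}w_{2}+b_{2}||^{2}+\frac{c_{4}}{m_{1}}e^{\top}(e-K(X_{1},X)^{\top}w_{2}-b_{2})_{+}. \tag{28}$$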

Then, we construct a series of functions with $t\geq 1$ as

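$$h_{1,t}=\frac{1}{2}(||w_{1}||^{2}+b_{1}^{2})+\frac{c_{1}}{2}||K(x_{t},X)^{\top}w_{1}+b_{1}||^{2}+c_{2}(1+K(\hat{x}_{t},X)^{\top}w_{1}+b_{1})_{+}, \tag{29}$$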

and

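$$h_{2,t}=\frac{1}{2}(||w_{2}||^{2}+b_{2}^{2})+\frac{c_{3}}{2}||K(\hat{x}_{t},X)^{\top}w_{2}+b_{2}||^{2}+c_{4}(1-K(x_{t},X)^{\top}w_{2}-b_{2})_{+}. \tag{30}$$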

Similar to (23), (24), and (25), the sub-gradients and updates can be obtained. The details are omitted.

For large scale problems, computing the kernel $K(\cdot,X)$ is time consuming. However, the reduced kernel strategy, which has been successfully applied to SVM and TWSVM [18, 40, 39], can also be applied to our SGTSVM. The reduced kernel strategy replaces $K(\cdot,X)$ with $K(\cdot,\tilde{X})$, where $\tilde{X}$ is a randomly sampled subset of $X$. In practice, $\tilde{X}$ needs only $0.01\%\sim 1\%$ of the samples from $X$ to perform well, reducing the learning time without loss of generalization [40].
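The reduced kernel strategy can be illustrated with a short Python sketch. The Gaussian kernel $\exp(-\mu||a-b||^{2})$ is assumed here purely for illustration, since this section does not fix a kernel; the helper names `rbf_kernel` and `reduced_basis` and the default `ratio` are likewise hypothetical.

```python
import numpy as np


def rbf_kernel(A, B, mu):
    """Assumed Gaussian kernel between the columns of A (n, p) and B (n, q)."""
    sq = (A * A).sum(0)[:, None] + (B * B).sum(0)[None, :] - 2.0 * A.T @ B
    # Clip tiny negative values caused by rounding before exponentiating.
    return np.exp(-mu * np.maximum(sq, 0.0))


def reduced_basis(X, ratio=0.01, seed=0):
    """Pick a random subset of the columns of X to play the role of X-tilde."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    k = max(1, int(round(ratio * m)))
    idx = rng.choice(m, size=k, replace=False)
    return X[:, idx]
```

With `Xr = reduced_basis(X)`, the call `rbf_kernel(x[:, None], Xr, mu)` yields a kernel row of length `Xr.shape[1]` instead of `m`, so $w_{1}$ and $w_{2}$ shrink to the same reduced length.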

3.3 Analysis

In this subsection, we discuss two issues: (i) the convergence of the solution in SGTSVM; (ii) the relation between the solution in SGTSVM and the optimal one in TWSVM. For convenience, we consider only the first QPP (19) of linear TWSVM together with the SGD formation of linear SGTSVM. The conclusions for the other QPP (20) and for the nonlinear formations can be obtained in the same way.

Let $u=(w^{\top},b)^{\top}$ , $Z_{1}=(X_{1}^{\top},e)^{\top}$ , $Z_{2}=(X_{2}^{\top},e)^{\top}$ , $z=(x^{\top},1)^{\top}$ , and the notations with the subscripts in SGTSVM also comply with this definition. Then, the first QPP (19) is reformulated as

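$$\underset{u}{\min}~~f(u)=\frac{1}{2}||u||^{2}+\frac{c_{1}}{2m_{1}}||Z_{1}u||^{2}+\frac{c_{2}}{m_{2}}e^{\top}(e+Z_{2}u)_{+}. \tag{31}$$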

Next, we reformulate the $t$ th ( $t\geq 1$ ) function in SGTSVM as

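$$f_{t}(u)=\frac{1}{2}||u||^{2}+\frac{c_{1}}{2}||u^{\top}z_{t}||^{2}+c_{2}(1+u^{\top}\hat{z}_{t})_{+}, \tag{32}$$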

where $z_{t}$ and $\hat{z}_{t}$ are the samples selected randomly from $Z_{1}$ and $Z_{2}$ for the $t$ th iteration, respectively. The sub-gradient of $f_{t}(u)$ at $u_{t}$ is denoted as

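$$\nabla_{t}=u_{t}+c_{1}(u_{t}^{\top}z_{t})z_{t}+c_{2}\hat{z}_{t}\,\text{sign}(1+u_{t}^{\top}\hat{z}_{t})_{+}. \tag{33}$$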

Given $u_{1}$ and the step size $\eta_{t}=1/t$ , $u_{t+1}$ with $t\geq 1$ is updated by

$$u_{t+1}=u_{t}-\eta_{t}\nabla_{t}, \tag{34}$$

i.e.,

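$$u_{t+1}=(1-\frac{1}{t})u_{t}-\frac{c_{1}}{t}z_{t}z_{t}^{\top}u_{t}-\frac{c_{2}}{t}\hat{z}_{t}\,\text{sign}(1+u_{t}^{\top}\hat{z}_{t})_{+}. \tag{35}$$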
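The step from the sub-gradient update to its expanded form is a direct substitution of $\eta_{t}=1/t$ into $u_{t+1}=u_{t}-\eta_{t}\nabla_{t}$ with $\nabla_{t}=u_{t}+c_{1}(u_{t}^{\top}z_{t})z_{t}+c_{2}\hat{z}_{t}\,\text{sign}(1+u_{t}^{\top}\hat{z}_{t})_{+}$. The small Python check below, with hypothetical function names, evaluates both forms so that the algebra can be confirmed numerically on arbitrary vectors.

```python
import numpy as np


def step_gradient_form(u, z, zh, c1, c2, t):
    """u_{t+1} = u_t - (1/t) * nabla_t, with nabla_t written out."""
    active = 1.0 if 1.0 + u @ zh > 0 else 0.0  # sign(.)_+ indicator
    grad = u + c1 * (u @ z) * z + c2 * zh * active
    return u - grad / t


def step_expanded_form(u, z, zh, c1, c2, t):
    """u_{t+1} = (1 - 1/t) u_t - (c1/t) z z^T u_t - (c2/t) zh * indicator."""
    active = 1.0 if 1.0 + u @ zh > 0 else 0.0
    return (1.0 - 1.0 / t) * u - (c1 / t) * z * (z @ u) - (c2 / t) * zh * active
```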

Lemma 3.1.

For all $t\geq 1$, $||\nabla_{t}||$ and $||u_{t}||$ are bounded above.

Proof.

The formation (35) can be rewritten as

$$u_{t+1}=A_{t}u_{t}+\frac{1}{t}v_{t}, \tag{36}$$

where $A_{t}=\frac{1}{t}((t-1)I-c_{1}z_{t}z_{t}^{\top})$, $I$ is the identity matrix, and $v_{t}=-c_{2}\hat{z}_{t}\text{sign}(1+u_{t}^{\top}\hat{z}_{t})_{+}$. Note that there is a positive integer $N$ such that for $t>N$, $A_{t}$ is positive definite and its largest eigenvalue $\lambda_{t}$ is smaller than or equal to $\frac{t-1}{t}$. Based on (36), we have

[TABLE]

For $i\geq N+1$ , $||A_{t+N+1-i}u_{N+1}||\leq\lambda_{i}||u_{N+1}||\leq\frac{i-1}{i}||u_{N+1}||$ [10]. Therefore,

[TABLE]

and

[TABLE]

Thus, we have

[TABLE]

Let $M$ be the largest norm of the samples in the dataset and

[TABLE]

It follows that $G_{1}$ is an upper bound of $||u_{t}||$, and $G_{2}=G_{1}+c_{1}G_{1}M^{2}+c_{2}M$ is an upper bound of $||\nabla_{t}||$, for $t\geq 1$. ∎

Theorem 3.1.

The iterative formation (35) of our SGTSVM is convergent.

Proof.

On the one hand, from (38) in the proof of Lemma 3.1, we have

[TABLE]

which indicates

[TABLE]

On the other hand, from (39), we have

[TABLE]

which indicates that the following limit exists

[TABLE]

Note that an infinite series of vectors is convergent if its norm series is convergent [30]. Therefore, the following limit exists

[TABLE]

Combining (43) with (46), we conclude that the sequence $u_{t+1}$ is convergent as $t\rightarrow\infty$. ∎

Based on the above theorem, it is reasonable to take the termination condition to be $||u_{t+1}-u_{t}||<tol$. Moreover, if we rewrite (37) in terms of $u_{1}$, then

[TABLE]

To make $u_{t+1}$ converge quickly, we suggest setting $u_{1}=0$.

In the following, we analyse the relation between the solution $u_{t}$ in SGTSVM and the optimal solution $u^{*}=(w^{*\top},b^{*})^{\top}$ in TWSVM.

Lemma 3.2.

Let $f_{1},\ldots,f_{T}$ be a sequence of convex functions, and $u_{1},\ldots,u_{T+1}\in R^{n}$ be a sequence of vectors. For $t\geq 1$ , $u_{t+1}=u_{t}-\eta_{t}\nabla_{t}$ , where $\nabla_{t}$ belongs to the sub-gradient set of $f_{t}$ at $u_{t}$ and $\eta_{t}=1/t$ . Suppose $||u_{t}||$ and $||\nabla_{t}||$ have the upper bounds $G_{1}$ and $G_{2}$ , respectively. Then, for all $\theta\in R^{n}$ , we have

(i) $\frac{1}{T}\sum\limits_{t=1}^{T}f_{t}(u_{t})\leq\frac{1}{T}\sum\limits_{t=1}^{T}f_{t}(\theta)+G_{2}(G_{1}+||\theta||)+\frac{1}{2T}G_{2}^{2}(1+\ln T)$ ;

(ii) for sufficiently large $T$ , given any $\varepsilon>0$ , then $\frac{1}{T}\sum\limits_{t=1}^{T}f_{t}(u_{t})\leq\frac{1}{T}\sum\limits_{t=1}^{T}f_{t}(\theta)+\varepsilon$ .

Proof.

Since $f_{t}$ is convex and $\nabla_{t}$ is the sub-gradient of $f_{t}$ at $u_{t}$ , we have that

[TABLE]

Note that

[TABLE]

Combining (48) and (49), we have

[TABLE]

Multiplying (50) by $1/T$ leads to the conclusion (i).

On the other hand, suppose $\lim\limits_{T\rightarrow\infty}u_{T}=\tilde{u}$. Then $\lim\limits_{T\rightarrow\infty}||u_{T}||=||\tilde{u}||$ and $\lim\limits_{T\rightarrow\infty}\frac{1}{T}\sum\limits_{t=1}^{T}||u_{t}-\theta||=\lim\limits_{T\rightarrow\infty}||u_{T}-\theta||=||\tilde{u}-\theta||$. Note that $\lim\limits_{T\rightarrow\infty}\frac{G_{2}^{2}(1+\ln T)}{T}=0$. Given any $\varepsilon>0$, for sufficiently large $T$,

[TABLE]

∎

We are now ready to bound the average instantaneous objective (32).

Theorem 3.2.

For $f_{t}$ ( $t=1,\ldots,T$ ) defined as (32) in SGTSVM, $u_{t}$ ( $t=1,\ldots,T$ ) is constructed by (35), and $u^{*}$ is the optimal solution to (31). Then,

(i) there are two constants $G_{1}$ and $G_{2}$ (actually, they are the upper bounds of $||u_{t}||$ and $||\nabla_{t}||$, respectively) such that $\frac{1}{T}\sum\limits_{t=1}^{T}f_{t}(u_{t})\leq\frac{1}{T}\sum\limits_{t=1}^{T}f_{t}(u^{*})+G_{2}(G_{1}+||u^{*}||)+\frac{1}{2T}G_{2}^{2}(1+\ln T)$ ;

(ii) for sufficiently large $T$ , given any $\varepsilon>0$ , then $\frac{1}{T}\sum\limits_{t=1}^{T}f_{t}(u_{t})\leq\frac{1}{T}\sum\limits_{t=1}^{T}f_{t}(u^{*})+\varepsilon$ .

Proof.

Obviously, $f_{t}$ ( $t=1,\ldots,T$ ) is convex. Taking $G_{1}$ and $G_{2}$ to be the upper bounds of $||u_{t}||$ and $||\nabla_{t}||$, respectively, the conclusions follow from Lemmas 3.1 and 3.2. ∎

In the following, let us discuss the relation between the solutions to SGTSVM and TWSVM with the uniform sampling.

Corollary 3.1.

Assume the conditions stated in Theorem 3.1 and $m_{1}=m_{2}$, where $m_{1}$ and $m_{2}$ are the numbers of samples in $X_{1}$ and $X_{2}$, respectively. Suppose $T=km_{1}$, where $k>0$ is an integer, and each sample is selected $k$ times at random. Then

(i) $f(u_{T})\leq f(u^{*})+G_{2}(G_{1}+||u^{*}||+G_{2})+\frac{1}{2T}G_{2}^{2}(1+\ln T)$ ;

(ii) given any $\varepsilon>0$ , for sufficiently large $T$ , $f(u_{T})\leq f(u^{*})+G_{2}^{2}+\varepsilon$ .

Proof.

First, we prove that for all $i,j=1,2,\ldots,T$ ,

[TABLE]

From the formation of $f_{t}(u)$ , we have

[TABLE]

Since $G_{1}$ is the upper bound of $||u_{t}||$ ( $t\geq 1$ ) and $M$ is the largest norm of the samples in the dataset, the first, second, and third parts on the right-hand side of (53) are respectively

[TABLE]

and

[TABLE]

Therefore, there is a constant $G_{2}=G_{1}+c_{1}G_{1}M^{2}+c_{2}M$ satisfying (52).

From $u_{t+1}=u_{t}-\frac{1}{t}\nabla_{t}$ , it is easy to obtain

[TABLE]

Thus, for $1\leq i<j\leq T$ ,

[TABLE]

Since $T=km_{1}=km_{2}$ , for all $u\in R^{n}$ , $\frac{1}{T}\sum\limits_{t=1}^{T}f_{t}(u)=f(u)$ . Note that $f(u)$ is the objective of TWSVM. Based on (52) and (58), we have

[TABLE]

Using Theorem 3.1, we obtain the conclusion immediately. ∎

If $m_{1}\neq m_{2}$ , we can modify the sampling rule to obtain the same result as in Corollary 3.1.

Corollary 3.2.

Assume the conditions stated in Corollary 3.1, but $m_{1}\neq m_{2}$ . Suppose $T=kd(m_{1},m_{2})$ , where $k>0$ is an integer and $d=d(m_{1},m_{2})$ is the least common multiple of $m_{1}$ and $m_{2}$ . Each sample in $X_{1}$ is selected $kd/m_{1}$ times at random, and each sample in $X_{2}$ is selected $kd/m_{2}$ times at random. Then

(i) $f(u_{T})\leq f(u^{*})+G_{2}(G_{1}+||u^{*}||+G_{2})+\frac{1}{2T}G_{2}^{2}(1+\ln T)$ ;

(ii) given any $\varepsilon>0$ , for sufficiently large $T$ , $f(u_{T})\leq f(u^{*})+G_{2}^{2}+\varepsilon$ .

Note that for all $u\in R^{n}$ , $\frac{1}{T}\sum\limits_{t=1}^{T}f_{t}(u)=f(u)$ . The proof of the above corollary is the same as that of Corollary 3.1.
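
As an illustration, the sampling rule of Corollary 3.2 can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the function name `balanced_schedule`, the `seed` argument, and the use of index lists are hypothetical and are not taken from the released SGTSVM code. It builds, for $T=kd$ iterations, one random order of indices into $X_{1}$ and one into $X_{2}$ such that each sample of $X_{1}$ appears exactly $kd/m_{1}$ times and each sample of $X_{2}$ exactly $kd/m_{2}$ times.

```python
import math
import random


def balanced_schedule(m1, m2, k, seed=0):
    """Return two index lists of length T = k * lcm(m1, m2).

    Hypothetical helper: every index of X1 occurs T // m1 times and every
    index of X2 occurs T // m2 times, in a random order.
    """
    d = m1 * m2 // math.gcd(m1, m2)  # least common multiple of m1 and m2
    T = k * d
    rng = random.Random(seed)
    idx1 = [i for i in range(m1) for _ in range(T // m1)]
    idx2 = [j for j in range(m2) for _ in range(T // m2)]
    rng.shuffle(idx1)
    rng.shuffle(idx2)
    return idx1, idx2


# Example: m1 = 4, m2 = 6, k = 2 gives d = 12 and T = 24.
idx1, idx2 = balanced_schedule(m1=4, m2=6, k=2)
print(len(idx1), idx1.count(0), idx2.count(0))
```

With $m_{1}=m_{2}$ the same helper reduces to the rule of Corollary 3.1, since then $d=m_{1}$ and each sample is selected $k$ times.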

The above corollaries provide the approximations of $u^{*}$ by $u_{T}$ . If the sampling rule is not as stated in these corollaries, these upper bounds no longer hold. However, Kakade and Tewari [14] have shown a way to obtain similar bounds with high probability.

4 Experiments

In the experiments, we compared our SGTSVM with SVM [7], PEGASOS [32], and TWSVM [12, 35] on several artificial and large scale problems. All of the methods were implemented on a PC with an Intel Core Duo processor (3.4 GHz) with 4 GB RAM.

4.1 Artificial datasets

On the artificial datasets, PEGASOS, TWSVM, and our SGTSVM were implemented in Matlab [22], and the corresponding SGTSVM Matlab code is available at http://www.optimal-group.org/Resource/SGTSVM.html.

First of all, we consider the similarity between TWSVM and SGTSVM. These two methods were implemented on the “cross planes” dataset, on which TWSVM was shown to be superior [12]. Figure 3 shows the proximal lines on the dataset. The two proximal lines obtained by SGTSVM are similar to those obtained by TWSVM, so both TWSVM and SGTSVM capture the data distribution precisely, and thus both of them obtain a good classifier. To measure the similarity quantitatively, the optimal values $f_{1}$ of (14) and $f_{2}$ of (17) in TWSVM were calculated and compared with those at each iteration of SGTSVM on the “cross planes” dataset and some UCI datasets [4] (e.g., dataset Australia, which includes $690$ samples with $14$ features; dataset Credit, which includes $690$ samples with $15$ features; and dataset Hypothyroid, which includes $3,163$ samples with $25$ features). Linear TWSVM, linear SGTSVM, and their nonlinear versions were implemented, where the Gaussian kernel $K(x,y)=\exp\{-\mu||x-y||^{2}\}$ was used for the nonlinear versions. The parameters $c_{1}$ , $c_{2}$ , $c_{3}$ , $c_{4}$ , and $\mu$ were fixed to $0.1$ . Figure 4 shows the results of the two linear classifiers, and Figure 5 corresponds to the nonlinear case. In Figures 4 and 5, the horizontal axis denotes the iteration of SGTSVM and the vertical axis denotes the objectives $f_{1}$ and $f_{2}$ of TWSVM and SGTSVM. Because the objectives of TWSVM are constant, they are drawn as two horizontal lines, while the objectives of SGTSVM at each iteration are drawn as two broken lines. It can be seen that the number of iterations after which SGTSVM converges to TWSVM differs across datasets. For instance, linear SGTSVM converges to TWSVM after $20$ iterations in Figure 4 (a), whereas the same happens in Figure 4 (b) after $180$ iterations. Generally, SGTSVM converges to TWSVM after $150$ iterations on these datasets in both the linear and nonlinear cases.

Furthermore, 10-fold cross validation [8] was used on these datasets. We ran TWSVM and SGTSVM $10$ times, and report the mean accuracy and standard deviation in Table 1. The differences between the mean accuracies are no more than $2\%$ , which implies that the classifiers obtained by TWSVM and SGTSVM do not differ significantly.
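
For reference, the Gaussian kernel used in the nonlinear versions can be written as a short, dependency-free Python sketch. The function names `gaussian_kernel` and `kernel_matrix` are illustrative assumptions and do not come from the authors' Matlab code; the default $\mu=0.1$ matches the value fixed in this experiment.

```python
import math


def gaussian_kernel(x, y, mu=0.1):
    """K(x, y) = exp(-mu * ||x - y||^2) for two equal-length sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-mu * sq_dist)


def kernel_matrix(A, B, mu=0.1):
    """Matrix with entries K(A[i], B[j]); A and B are lists of samples."""
    return [[gaussian_kernel(a, b, mu) for b in B] for a in A]


# Two 2-dimensional samples: identical points give 1, and the value
# decays with the squared Euclidean distance.
X = [[0.0, 0.0], [1.0, 2.0]]
K = kernel_matrix(X, X)
print(K[0][0], K[0][1])
```

A larger $\mu$ makes the kernel more local, so the value for two distinct points decreases as $\mu$ grows.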

Secondly, we test the stability of SGTSVM compared with PEGASOS. $100$ datasets were generated randomly, and each dataset contains $10,000$ samples in $R$ , where $5,000$ negative samples are drawn from the normal distribution $N(-2,1)$ and $5,000$ positive ones from $N(2,1)$ . The best classification point is at zero. We ran PEGASOS and SGTSVM without any restrictions on the $100$ datasets and obtained $100$ classifiers for each method, shown in Figure 6, where the number in the upper right corner is the mean of these classification lines together with their standard deviation (the parameters $c$ in PEGASOS, $c_{1}$ , $c_{2}$ , $c_{3}$ , and $c_{4}$ in SGTSVM were fixed to $0.1$ ). It is clear that our SGTSVM obtains much more compact classification lines than PEGASOS. The mean line of SGTSVM is at $-0.0016$ , which is closer to zero, and its standard deviation is smaller than that of PEGASOS. In order to investigate the effect of sampling, PEGASOS and SGTSVM were also run on the above $100$ datasets with restricted sampling (i.e., some possible support vectors of SVM among the negative samples, together with the samples close to these support vectors, are invisible for sampling). Figure 7 shows the results of PEGASOS and SGTSVM, where the dashed lines mark the region whose samples are invisible for sampling. From Figure 7, it can be seen that the classification lines of PEGASOS fall into two regions, while SGTSVM obtains a compact region. This means that the possible support vectors significantly influence PEGASOS, while SGTSVM relies more on the data distribution. In both Figures 6 and 7, PEGASOS acquires a mean classification line further from zero with a larger standard deviation than SGTSVM. Therefore, SGTSVM is more stable than PEGASOS on these datasets with or without restricted sampling. To further show the classifiers’ stability, we recorded the classification accuracies ( $\%$ ) of PEGASOS and SGTSVM on one of the $100$ datasets. PEGASOS and SGTSVM were run $100$ times on this dataset, where the parameters were set as before and both methods were iterated $200$ times. All accuracies of these methods are reported in Figure 8. From Figure 8, the accuracies of SGTSVM lie in $[99.0,99.5]$ while those of PEGASOS lie in $[96.5,99.5]$ , which indicates that SGTSVM is more stable than PEGASOS in terms of the classification result. Although PEGASOS obtains the highest accuracy in this test, SGTSVM obtains higher accuracies than PEGASOS in most cases.
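
The data generation and the summary statistic of this stability test can be sketched as follows. This is a minimal sketch rather than the authors' script: the names `make_dataset` and `summarize`, the seeds, and the use of the sample standard deviation are assumptions, and the classifiers themselves are not reproduced here.

```python
import random
import statistics


def make_dataset(n_per_class=5000, seed=0):
    """One 1-D dataset: negatives from N(-2, 1), positives from N(2, 1)."""
    rng = random.Random(seed)
    neg = [(rng.gauss(-2.0, 1.0), -1) for _ in range(n_per_class)]
    pos = [(rng.gauss(2.0, 1.0), +1) for _ in range(n_per_class)]
    data = neg + pos
    rng.shuffle(data)
    return data


def summarize(points):
    """Mean and (sample) standard deviation of classification points."""
    return statistics.mean(points), statistics.stdev(points)


# 100 datasets with different seeds, as in the stability experiment.
datasets = [make_dataset(seed=s) for s in range(100)]
print(len(datasets), len(datasets[0]))
```

Given the $100$ classification points returned by a method, `summarize` yields the pair of numbers of the kind reported in the corner of Figure 6.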

Finally, we test the convergence of PEGASOS and SGTSVM. A dataset containing $20,000$ samples in $R$ was generated randomly, where $10,000$ negative samples are drawn from the normal distribution $N(-2,1)$ and $10,000$ positive ones from $N(2,1)$ . PEGASOS and SGTSVM were run $10$ times and each method was iterated $1,000$ times. The classification locations at different iterations are reported in Figure 9, where the horizontal axis is the iteration and the vertical one is the classification location. From Figure 9, it can be seen that: (i) the initially selected samples hardly affect either PEGASOS or SGTSVM after $150$ iterations; (ii) after $100$ iterations, the classification locations of the two methods are concentrated around zero and the error is less than $0.1$ ; (iii) importantly, PEGASOS has a higher error than SGTSVM after $800$ iterations, which indicates that PEGASOS converges more slowly than SGTSVM. To discuss the convergence more precisely, PEGASOS and SGTSVM were run $100$ times and each method was terminated by the solution error parameter $tol$ (more details about $tol$ can be found in Algorithm 1). $tol$ is selected from $\{10^{i}|i=-1,-2,\ldots,-6\}$ , and the corresponding numbers of iterations and running times are reported in Figure 10. It is clear from Figure 10 that our SGTSVM converges faster than PEGASOS when $tol\leq 10^{-3}$ . Moreover, if one needs a smaller solution error such as $tol=10^{-4}$ or $tol=10^{-5}$ , PEGASOS needs about $10$ times more iterations than SGTSVM, and about $100$ times more when $tol=10^{-6}$ (thus the learning time of PEGASOS is more than a hundred times that of SGTSVM). Therefore, SGTSVM converges much faster than PEGASOS.

4.2 Large scale datasets

To test the feasibility of these methods on large scale datasets, we ran SVM, PEGASOS, and SGTSVM on six large scale datasets [4]. Table 2 shows the details of the large scale datasets, where Ratio in Table 2 is the ratio of the number of positive samples to the number of negative ones. Each dataset is split into two subsets, where one (including $90\%$ of the samples) is used for training and the other (including $10\%$ of the samples) is used for testing. SVM is implemented by Liblinear [9], while PEGASOS and SGTSVM are implemented by software written in C. The corresponding software can be downloaded from http://www.optimal-group.org/Resource/SGTSVM.html. For nonlinear SGTSVM, the reduced kernel [18] is used and the kernel size is fixed to $100$ .
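
The data preparation described above can be sketched on sample indices only. This sketch rests on two assumptions that the text does not state: that the $90\%$/$10\%$ split is random, and that the reduced kernel with kernel size $100$ uses $100$ randomly chosen training samples as its basis. The function name `split_and_pick_basis` and its arguments are hypothetical and are not part of the released C software.

```python
import random


def split_and_pick_basis(n_samples, train_frac=0.9, kernel_size=100, seed=0):
    """Random train/test split plus a random kernel basis (hypothetical).

    Returns three lists of sample indices: training, testing, and the
    training indices assumed to serve as the reduced-kernel basis.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = int(round(train_frac * n_samples))
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    # Assumption: the basis is drawn uniformly without replacement.
    basis_idx = rng.sample(train_idx, min(kernel_size, len(train_idx)))
    return train_idx, test_idx, basis_idx


train_idx, test_idx, basis_idx = split_and_pick_basis(1000)
print(len(train_idx), len(test_idx), len(basis_idx))
```

Under these assumptions the nonlinear model only needs kernel values between each sampled point and the $100$ basis samples, rather than a full square kernel matrix over the training set.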

First, let us test the influence of the parameter $tol$ on PEGASOS and SGTSVM. These methods were run on the large scale datasets, where $tol$ was taken from $\{10^{i}|i=-1,-2,\ldots,-6\}$ and the other parameters were fixed to $0.1$ . The testing accuracy and learning time are reported in Figure 11. By comparing Figure 11 (a), (c), and (e), it can be seen that our SGTSVM (including linear and nonlinear cases) is more stable than PEGASOS when $tol\leq 10^{-4}$ . In order to obtain a high accuracy with an acceptable learning time according to Figure 11, $tol$ is set to $10^{-6}$ for PEGASOS and to $10^{-4}$ for SGTSVM.

Then, we compare SVM and PEGASOS with our SGTSVM with fixed $tol$ on these datasets. The accuracies of these methods are recorded in Table 3, where the validation accuracy is obtained by 5-fold cross validation on the training subset, and the testing accuracy is obtained on the testing subset. The parameters $c$ in SVM and PEGASOS, and $c_{1}$ , $c_{2}$ , $c_{3}$ , and $c_{4}$ in SGTSVM, are selected from $\{2^{i}|i=-8,-7,\ldots,1\}$ , and the Gaussian kernel parameter $\mu$ in nonlinear SGTSVM is selected from $\{2^{i}|i=-10,-9,\ldots,-1\}$ . For simplicity, we also set $c_{1}=c_{3}$ and $c_{2}=c_{4}$ in SGTSVM. The optimal parameters are recorded in Table 4. From Table 3, it is obvious that our SGTSVM achieves the highest accuracies in $9$ groups of comparisons, and performs as well as SVM or PEGASOS in the other $3$ groups. However, SVM performs much worse than SGTSVM on the dataset Gashome and cannot run on the three much larger datasets. Though PEGASOS can run on these datasets, it performs much worse than SGTSVM on Susy and Gas. To further compare the learning time of these methods, we report the one-run time with the optimal parameters in Figure 12. It is obvious that SGTSVM (including linear and nonlinear cases) is much faster than the others. Thus, our SGTSVM is comparable to SVM and PEGASOS on these large scale datasets. In addition, the SGTSVM and PEGASOS software needs much less RAM than Liblinear (the software of SVM). In detail, Liblinear needs to store the entire training set in RAM, while PEGASOS and SGTSVM only store a subset related to the iteration. Because the memory required by Liblinear increases with the size of the dataset, it tends to run out of memory as the data size increases, while the same does not happen with PEGASOS or SGTSVM.
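
The parameter search space described above can be enumerated with a short Python sketch. The names `C_GRID`, `MU_GRID`, and `sgtsvm_grid` are illustrative assumptions; the sketch only lists the candidate settings and does not reproduce the 5-fold cross validation or the training itself.

```python
from itertools import product

C_GRID = [2.0 ** i for i in range(-8, 2)]    # 2^-8, ..., 2^1
MU_GRID = [2.0 ** i for i in range(-10, 0)]  # 2^-10, ..., 2^-1


def sgtsvm_grid(nonlinear=False):
    """Yield candidate settings with the ties c1 = c3 and c2 = c4."""
    for c1, c2 in product(C_GRID, C_GRID):
        base = {"c1": c1, "c2": c2, "c3": c1, "c4": c2}
        if nonlinear:
            # The kernel parameter mu is only searched in the nonlinear case.
            for mu in MU_GRID:
                yield {**base, "mu": mu}
        else:
            yield base


n_linear = sum(1 for _ in sgtsvm_grid())
n_nonlinear = sum(1 for _ in sgtsvm_grid(nonlinear=True))
print(n_linear, n_nonlinear)
```

Tying $c_{1}=c_{3}$ and $c_{2}=c_{4}$ keeps the linear search at $10\times 10$ candidates instead of $10^{4}$, and the nonlinear search at $10^{3}$ candidates.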

5 Conclusion

The stochastic gradient twin support vector machine (SGTSVM), based on the stochastic gradient descent algorithm, has been proposed. By employing nonparallel hyperplanes, SGTSVM is more stable with respect to stochastic sampling than PEGASOS. In theory, we prove that SGTSVM is convergent, and that it is an approximation of TWSVM under uniform sampling. Experimental results have confirmed the merits of SGTSVM and shown that our SGTSVM has better accuracy than Liblinear and PEGASOS with the fastest learning speed. For practical convenience, the corresponding SGTSVM code (in both Matlab and C) can be downloaded from http://www.optimal-group.org/Resource/SGTSVM.html. For future work, it is possible to design special sampling schemes for SGTSVM to obtain better performance, and to apply SGTSVM to big data problems.

Acknowledgment

This work is supported by the National Natural Science Foundation of China (Nos. 11501310, 11201426, and 11371365), the Natural Science Foundation of Inner Mongolia Autonomous Region of China (No. 2015BS0606), and the Zhejiang Provincial Natural Science Foundation of China (No. LY15F030013).

Bibliography

[1] M.S. Bazaraa, H.D. Sherali, and C.M. Shetty. Nonlinear Programming Theory and Algorithms, second ed. Wiley, 2004.
[2] A. Bennar and J.M. Monnez. Almost sure convergence of a stochastic approximation process in a convex set. International Journal of Applied Mathematics, 20(5):713–722, 2007.
[3] J.B. Bi and V.N. Vapnik. Learning with rigorous support vector machines. Springer, 2003.
[4] C.L. Blake and C.J. Merz. UCI Repository for Machine Learning Databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[5] C.C. Chang and C.J. Lin. LIBSVM: A library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin, 2001.
[6] W.J. Chen, Y.H. Shao, C.N. Li, and N.Y. Deng. Mltsvm: A novel twin support vector machine to multi-label learning. Pattern Recognition, 52:61–74, 2015.
[7] C. Cortes and V.N. Vapnik. Support vector networks. Machine Learning, 20:273–297, 1995.
[8] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification, 2nd Edition. John Wiley and Sons, 2001.
