Insensitive Stochastic Gradient Twin Support Vector Machines for Large Scale Problems
Zhen Wang, Yuan-Hai Shao, Lan Bai, Li-Ming Liu, Nai-Yang Deng

TL;DR
This paper introduces SGTSVM, a stochastic gradient method for twin support vector machines that is less sensitive to sampling variations, with proven convergence and stable performance on large datasets.
Contribution
The paper proposes a novel stochastic gradient twin support vector machine (SGTSVM) that is less sensitive to sampling, with theoretical convergence proof and applicability to nonlinear cases.
Findings
SGTSVM is proved to be convergent, whereas PEGASOS is only almost surely convergent.
SGTSVM demonstrates stable and fast learning on large datasets.
Approximation between SGTSVM and twin SVM is established.
| Data | TWSVM† | SGTSVM† | TWSVM♯ | SGTSVM♯ |
|---|---|---|---|---|
| Cross Planes | 96.05±0.70 | 97.71±0.41 | 99.01±2.24 | 98.51±2.15 |
| Australia | 86.87±0.38 | 87.34±0.13 | 87.10±0.43 | 85.21±0.16 |
| Credit | 85.78±0.32 | 85.72±0.23 | 86.71±0.33 | 85.21±0.45 |
| Hypothyroid | 98.21±0.09 | 97.28±0.01 | 98.08±0.09 | 98.07±0.03 |
| Data | Name | No. of samples | Dimension | Ratio |
|---|---|---|---|---|
| (a) | Skin | 245,057 | 3 | 0.262 |
| (b) | Gashome | 928,990 | 10 | 0.578 |
| (c) | Susy | 5,000,000 | 18 | 0.844 |
| (d) | Kddcup | 4,898,432 | 41 | 0.248 |
| (e) | Gas | 8,386,764 | 16 | 0.077 |
| (f) | Hepmass | 10,500,000 | 28 | 1.000 |
| Data | | SVM | PEGASOS | SGTSVM† | SGTSVM♯ |
|---|---|---|---|---|---|
| Skin | validation(%) | 78.87 | 82.46 | 84.70 | |
| 245,057×3 | testing(%) | 84.28 | 85.39 | 85.34 | |
| Gashome | validation(%) | 49.11 | 70.09 | 67.50 | |
| 919,438×10 | testing(%) | 82.57 | 72.85 | 76.09 | |
| Susy | validation(%) | 54.11 | 76.14 | 69.90 | |
| 5,000,000×18 | testing(%) | 56.44 | 75.09 | 68.61 | |
| Kddcup | validation(%) | * | 95.24 | 93.19 | |
| 4,898,432×41 | testing(%) | * | 96.42 | 97.45 | |
| Gas | validation(%) | * | 69.77 | 89.73 | |
| 8,386,764×16 | testing(%) | * | 50.54 | 92.45 | |
| Hepmass | validation(%) | * | 80.63 | 80.80 | |
| 10,500,000×28 | testing(%) | * | 80.84 | 79.59 | |
| Data | | SVM | PEGASOS | SGTSVM† | SGTSVM♯ |
|---|---|---|---|---|---|
| | | c | c | | |
| Skin | validation | -1 | -6 | 0,-5 | -6,-5,-3 |
| | testing | -1 | -4 | 1,-6 | -1,0,-9 |
| Gashome | validation | 0 | -6 | -4,-5 | -3,-5,-2 |
| | testing | -1 | -1 | -8,-7 | -8,-1,-2 |
| Susy | validation | 1 | 0 | -2,-6 | -3,-1,-4 |
| | testing | 0 | -7 | -1,-3 | -3,-3,-3 |
| Kddcup | validation | NA | -6 | -8,-4 | 0,-3,-4 |
| | testing | NA | -2 | -8,-4 | -6,-1,-8 |
| Gas | validation | NA | -1 | -4,0 | -1,-1,-6 |
| | testing | NA | 1 | -3,1 | -4,-8,-6 |
| Hepmass | validation | NA | 0 | -1,-2 | -4,-1,-3 |
| | testing | NA | 0 | 0,-2 | -4,-2,-3 |
Insensitive Stochastic Gradient Twin Support Vector Machines for Large Scale Problems
Zhen Wang
School of Mathematical Sciences, Inner Mongolia University, Hohhot, 010021, P.R.China
Yuan-Hai Shao
School of Economics and Management, Hainan University, Haikou, 570228, P.R. China
Lan Bai
Li-Ming Liu
School of Statistics, Capital University of Economics and Business, Beijing, 100070, P.R.China
Nai-Yang Deng
College of Science China Agricultural University, Beijing, 100083, P.R.China
Abstract
The stochastic gradient descent algorithm has been successfully applied to support vector machines (yielding PEGASOS) for many classification problems. In this paper, the stochastic gradient descent algorithm is applied to twin support vector machines for classification. Compared with PEGASOS, the proposed stochastic gradient twin support vector machines (SGTSVM) is insensitive to the stochastic sampling used by the stochastic gradient descent algorithm. In theory, we prove the convergence of SGTSVM, in contrast to the almost sure convergence of PEGASOS. Under uniform sampling, the approximation between SGTSVM and twin support vector machines is also given, while PEGASOS only has an opportunity to obtain an approximation of support vector machines. In addition, the nonlinear SGTSVM is derived directly from its linear case. Experimental results on both artificial datasets and large scale problems show the stable performance of SGTSVM with a fast learning speed.
keywords:
Classification, support vector machines, twin support vector machines, stochastic gradient descent, large scale problem.
1 Introduction
Support vector machines (SVM), being a powerful tool for classification [7, 20, 42], have already outperformed most other classifiers in a wide variety of applications [23, 17, 11]. Different from SVM with a pair of parallel hyperplanes, twin support vector machines (TWSVM) [12, 35] with a pair of nonparallel hyperplanes have been proposed and developed, e.g., twin bounded support vector machines (TBSVM) [35], twin parametric margin support vector machines (TPMSVM) [24], and weighted Lagrangian twin support vector machines (WLTSVM) [33]. These classifiers have been widely applied to many practical problems [34, 39, 19, 38, 6, 36, 29, 28, 26, 27]. In the training stage, SVM solves a quadratic programming problem (QPP), whereas TWSVM solves two smaller QPPs, using traditional solvers such as the interior point method [20, 1, 12]. However, neither SVM nor TWSVM based on these solvers can deal with large scale problems, especially those with millions of samples.
In order to deal with large scale problems, many improvements have been proposed, e.g., for SVM, sequential minimal optimization, the coordinate descent method, trust region Newton, and the stochastic gradient descent algorithm (SGD) in [25, 13, 5, 9, 32], and for TWSVM, the successive overrelaxation technique, the Newton-Armijo algorithm, and the dual coordinate descent method in [35, 39, 37]. The stochastic gradient descent algorithm for SVM (PEGASOS) [16, 43, 32, 41] has attracted great attention, because it partitions the large scale problem into a series of subproblems by stochastic sampling with a suitable size. It has been proved that PEGASOS is almost surely convergent, and thus is able to find an approximation of the desired solution with high probability [2, 43, 32]. Existing experiments confirm the effectiveness of these algorithms and their remarkable learning speed.
However, for large scale problems, the stochastic sampling in SGD may cause difficulties for SVM, because only a small subset of the dataset is selected for training. In fact, if the subset is not suitable, PEGASOS performs poorly. It is well known that in SVM the support vectors (SVs), a small subset of the dataset, decide the final classifier. If the stochastic sampling does not include the SVs sufficiently, the classifier loses some generalization ability. Figure 1 is a toy example for PEGASOS. There are two classes in this figure, where the positive and negative classes include 6 and 4 samples, respectively, and the circled sample is one of the potential SVs. The solid blue line is the separating line obtained by PEGASOS under three different samplings: (i) emphasizing the circled sample; (ii) infrequently using the circled sample; (iii) ignoring the circled sample. Figure 1 shows that the circled sample plays an important role in determining the separating line, and that infrequently using or ignoring this sample leads to misclassification.
Compared with SVM, TWSVM is notably more stable under sampling and does not strongly depend on special samples such as the SVs [12, 35], which indicates that SGD is more suitable for TWSVM. Therefore, in this paper, we propose stochastic gradient twin support vector machines (SGTSVM). Different from PEGASOS, our method randomly selects two samples from different classes in each iteration to construct a pair of nonparallel hyperplanes. Because TWSVM fits all of the training samples, our method is stable under stochastic sampling and thus generalizes well. Moreover, the characteristics inherited from TWSVM make SGTSVM suitable for many cases, e.g., the “cross planes” dataset [21] and preferential classification [12]. For the toy example above, Figure 2 shows the corresponding results of SGTSVM. Comparing Figure 2 with Figure 1, it is clear that SGTSVM performs better than PEGASOS.
The main contributions of this paper include:
(i) an SGD-based TWSVM (SGTSVM) is proposed, and it can easily be extended to other TWSVM-type classifiers;
(ii) we prove that the proposed SGTSVM is convergent, in contrast to the almost sure convergence of PEGASOS;
(iii) for uniform sampling, it is proved that the original objective at the solution to SGTSVM is bounded by the optimum of TWSVM, which indicates that the solution to SGTSVM is an approximation of the optimal solution to TWSVM, while PEGASOS only has an opportunity to obtain an approximation of the optimal solution to SVM (for more information, see Corollaries 1 and 2 in [32]);
(iv) the nonlinear case of SGTSVM is obtained directly from its original problem;
(v) each iteration of SGTSVM requires only a bounded number of multiplications and no additional storage, so it is faster than the other proposed TWSVM-type classifiers.
The rest of this paper is organized as follows. Section 2 briefly reviews SVM, PEGASOS, and TWSVM. Our linear and nonlinear SGTSVMs, together with the theoretical analysis, are elaborated in Section 3. Experiments are arranged in Section 4. Finally, Section 5 gives the conclusions.
2 Related Works
Consider a binary classification problem in a finite-dimensional real space. The training set consists of samples, each carrying a class label. We further organize the samples of each of the two classes into a separate matrix. Below, we give a brief outline of some related works.
2.1 SVM
Support vector machines (SVM) [7, 3] searches for a separating hyperplane
[TABLE]
where and . By introducing the regularization term, the primal problem of SVM can be expressed as a QPP as follow
[TABLE]
where denotes the norm, is a parameter with some quantitative meanings [3], is a vector of ones with an appropriate dimension, is the slack vector, and . Note that the minimization of the regularization term is equivalent to maximize the margin between two parallel supporting hyperplanes . And the structural risk minimization principle is implemented in this problem [7].
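For reference, a textbook form of the soft-margin primal described in this subsection can be written as below. The symbols $w$, $b$, $\xi_i$, $C$, and $m$ are assumed here, and the exact scaling of the penalty term used by the authors is not recoverable from this copy.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{m}\xi_i
\quad\text{s.t.}\quad y_i\,(w^{\top}x_i + b)\ \ge\ 1-\xi_i,\qquad \xi_i\ \ge\ 0,\quad i=1,\dots,m .
```

In this form the two supporting hyperplanes are $w^{\top}x+b=\pm 1$, and the distance between them, $2/\|w\|$, grows as the regularization term shrinks.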
2.2 PEGASOS
PEGASOS [43, 32] considers a strongly convex problem by modifying (4) as follows
[TABLE]
and recasts the above problem to
[TABLE]
where (·)₊ replaces negative components of a vector by zeros.
In the th iteration (), PEGASOS constructs a temporary function, which is defined by a random sample as
[TABLE]
Then, starting with an initial , PEGASOS iteratively updates for , where is the step size and is the sub-gradient of at ,
[TABLE]
When a termination condition is satisfied, the last iterate is output as the solution, and a new sample can be classified by
[TABLE]
It has been proved that the average solution is bounded by the optimal solution to (6) with , and thus PEGASOS has with a probability of at least to find a good approximation of [32]. The authors of [32] also pointed out that is often used instead of in practice. The sample which is selected randomly can be replaced with a small subset belonging to the whole dataset, and the subset only including a sample is often used in practice [43, 32, 41]. In order to extend the generalization ability of PEGASOS, the bias term in SVM can be appended to PEGASOS by replacing of (7) with
[TABLE]
However, with this modification the function is no longer strongly convex, which yields a slow convergence rate [32].
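The iteration described in this subsection can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the per-sample objective 0.5*||w||^2 + c*max(0, 1 - y*w.x) without a bias term and the step size 1/t, and the names `pegasos_step`, `pegasos_train`, `predict`, and `c` are chosen here for illustration.

```python
import random


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def pegasos_step(w, x, y, c, t):
    """One sub-gradient step with step size 1/t on the assumed objective
    0.5*||w||^2 + c*max(0, 1 - y*w.x)."""
    eta = 1.0 / t
    # The hinge term contributes to the sub-gradient only when the margin is violated.
    active = 1.0 if 1.0 - y * dot(w, x) > 0 else 0.0
    return [wi - eta * (wi - c * y * active * xi) for wi, xi in zip(w, x)]


def pegasos_train(samples, labels, c=1.0, iterations=1000, seed=0):
    """Draw one random sample per iteration and apply pegasos_step."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    for t in range(1, iterations + 1):
        i = rng.randrange(len(samples))
        w = pegasos_step(w, samples[i], labels[i], c, t)
    return w


def predict(w, x):
    """Classify by the sign of w.x (ties are assigned to +1 by assumption)."""
    return 1 if dot(w, x) >= 0 else -1
```

Because each step touches a single sample, which samples happen to be drawn directly shapes the iterate; this is the sensitivity to sampling that the paper discusses.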
2.3 TWSVM
TWSVM [12, 35] seeks a pair of nonparallel hyperplanes, which can be expressed as
[TABLE]
such that each hyperplane is close to samples of one class and has a certain distance from the other class. To find the pair of nonparallel hyperplanes, it is required to get the solutions to the primal problems
[TABLE]
and
[TABLE]
where the four coefficients are positive parameters and the remaining variables are slack vectors. Their geometric meaning is clear. For example, in (14), the objective function, together with the regularization term, keeps the samples of one class proximal to the first hyperplane, while the constraints keep each sample of the other class at least a certain distance away from that hyperplane.
Once the solutions to problems (14) and (17) are obtained, a new point is assigned to a class according to its distance to the two hyperplanes in (11), i.e.,
[TABLE]
where |·| is the absolute value.
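The decision rule can be illustrated with a short sketch. It assumes the unnormalized distance |w.x + b| to each hyperplane (some TWSVM variants divide by ||w||), and the function name, argument layout, and tie-breaking toward +1 are hypothetical.

```python
def twsvm_predict(x, w1, b1, w2, b2):
    """Assign x to the class whose hyperplane is closer in the |w.x + b| sense.

    (w1, b1) is the hyperplane fitted to class +1 and (w2, b2) the one fitted
    to class -1; ties go to +1 by assumption.
    """
    d1 = abs(sum(a * b for a, b in zip(w1, x)) + b1)
    d2 = abs(sum(a * b for a, b in zip(w2, x)) + b2)
    return 1 if d1 <= d2 else -1
```

For two crossing lines such as x2 = x1 and x2 = -x1, a point lying on the first line is assigned to class +1 and a point on the second to class -1, which a single separating hyperplane cannot reproduce.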
3 SGTSVM
In this section, we elaborate our SGTSVM and give its convergence analysis together with the boundedness.
3.1 Linear Formation
Following the notations in Section 2, we recast the QPPs (14) and (17) in TWSVM to unconstrained problems
[TABLE]
and
[TABLE]
respectively.
In order to solve the above two problems, we construct a series of strictly convex functions and with as
[TABLE]
and
[TABLE]
where and are selected randomly from and , respectively.
The sub-gradients of the above functions at and can be obtained as
[TABLE]
and
[TABLE]
respectively.
Our SGTSVM starts from the initial and . Then, for , the updates are given by
[TABLE]
where is the step size and typically is set to . If the terminated condition is satisfied, is assigned to , and is assigned to . Then, a new sample can be predicted by (18).
The above procedures are summarized in Algorithm 1.
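The procedure above can be sketched in code. Since the displayed formulas did not survive in this copy, the sketch assumes a TBSVM-style per-iteration objective, 0.5*(||w||^2 + b^2) + 0.5*c_fit*(w.x_near + b)^2 + c_hinge*max(0, 1 + side*(w.x_far + b)), and the step size 1/t; all function and parameter names are hypothetical and this is not the authors' released code.

```python
import random


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def sgtsvm_step(w, b, x_near, x_far, c_fit, c_hinge, side, t):
    """One sub-gradient step with step size 1/t on the assumed objective.

    x_near is the sample the hyperplane should pass close to; x_far is the
    sample of the other class. side=+1 pushes x_far to the negative side of
    the hyperplane, side=-1 pushes it to the positive side.
    """
    eta = 1.0 / t
    residual = dot(w, x_near) + b
    # The hinge term is active only while x_far is within unit distance.
    active = 1.0 if 1.0 + side * (dot(w, x_far) + b) > 0 else 0.0
    grad_w = [wi + c_fit * residual * xn + c_hinge * side * active * xf
              for wi, xn, xf in zip(w, x_near, x_far)]
    grad_b = b + c_fit * residual + c_hinge * side * active
    return [wi - eta * g for wi, g in zip(w, grad_w)], b - eta * grad_b


def sgtsvm_train(pos, neg, c1=1.0, c2=1.0, c3=1.0, c4=1.0, iterations=1000, seed=0):
    """Each iteration draws one sample per class and updates both hyperplanes."""
    rng = random.Random(seed)
    n = len(pos[0])
    w1, b1, w2, b2 = [0.0] * n, 0.0, [0.0] * n, 0.0
    for t in range(1, iterations + 1):
        x_pos = rng.choice(pos)
        x_neg = rng.choice(neg)
        w1, b1 = sgtsvm_step(w1, b1, x_pos, x_neg, c1, c2, 1.0, t)
        w2, b2 = sgtsvm_step(w2, b2, x_neg, x_pos, c3, c4, -1.0, t)
    return (w1, b1), (w2, b2)
```

Every iteration uses one sample from each class, so the fitting term sees the whole class over time instead of depending on a few boundary samples.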
3.2 Nonlinear Formation
Now, we extend our SGTSVM to the nonlinear case by the kernel trick [21, 12, 35, 31, 15, 18]. Given a predefined kernel function, the nonparallel hyperplanes can be expressed as
[TABLE]
The counterparts of (19) and (20) can be formulated as
[TABLE]
and
[TABLE]
Then, we construct a series of functions with as
[TABLE]
and
[TABLE]
Similar to (23), (24), and (25), the sub-gradients and updates can be obtained. The details are omitted.
For large scale problems, it is time consuming to calculate the full kernel matrix. However, the reduced kernel strategy, which has been successfully applied to SVM and TWSVM [18, 40, 39], can also be applied to our SGTSVM. The reduced kernel strategy replaces the full kernel with a kernel computed against a randomly sampled subset of the training set. In practice, a small fraction of the samples is enough to obtain good performance, reducing the learning time without loss of generalization [40].
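The reduced kernel idea can be sketched as a feature map against a small random basis. The Gaussian parametrization exp(-mu*||x - z||^2), the parameter name `mu`, and the helper names below are assumptions made for illustration.

```python
import math
import random


def gaussian_kernel(x, z, mu):
    """Gaussian kernel value exp(-mu * ||x - z||^2) (assumed parametrization)."""
    return math.exp(-mu * sum((a - b) ** 2 for a, b in zip(x, z)))


def choose_basis(samples, size, seed=0):
    """Pick a random subset of the training samples to serve as the reduced basis."""
    rng = random.Random(seed)
    return rng.sample(samples, min(size, len(samples)))


def reduced_kernel_map(x, basis, mu):
    """Map x to its kernel values against the reduced basis only."""
    return [gaussian_kernel(x, z, mu) for z in basis]
```

A linear method can then be run on the mapped vectors, so the cost per sample scales with the basis size rather than with the full training set.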
3.3 Analysis
In this subsection, we discuss two issues: (i) the convergence of the solution in SGTSVM; (ii) the relation between the solution in SGTSVM and the optimal one in TWSVM. For convenience, we consider only the first QPP (19) of linear TWSVM together with the SGD formulation of linear SGTSVM. The conclusions for the other QPP (20) and for the nonlinear formulations can be obtained in the same way.
Let , , , , and the notations with the subscripts in SGTSVM also comply with this definition. Then, the first QPP (19) is reformulated as
[TABLE]
Next, we reformulate the th () function in SGTSVM as
[TABLE]
where and are the samples selected randomly from and for the th iteration, respectively. The sub-gradient of at is denoted as
[TABLE]
Given and the step size , with is updated by
[TABLE]
i.e.,
[TABLE]
Lemma 3.1.
For all , and have the upper bounds.
Proof.
The formation (35) can be rewritten as
[TABLE]
where , is the identity matrix, and . Note that for sufficient , there is a positive integer such that for , is positive definite, and the largest eigenvalue of is smaller than or equal to . Based on (36), we have
[TABLE]
For , [10]. Therefore,
[TABLE]
and
[TABLE]
Thus, we have
[TABLE]
Let be the largest norm of the samples in the dataset and
[TABLE]
This leads to that is an upper bound of , and is an upper bound of , for . ∎
Theorem 3.1.
The iterative formation (35) of our SGTSVM is convergent.
Proof.
On the one hand, from (38) in the proof of Lemma 3.1, we have
[TABLE]
which indicates
[TABLE]
On the other hand, from (39), we have
[TABLE]
which indicates that the following limit exists
[TABLE]
Note that an infinite series of vectors is convergent if its norm series is convergent [30]. Therefore, the following limit exists
[TABLE]
Combine (43) with (46), we conclude that the series is convergent for . ∎
Based on the above theorem, it is reasonable to take the terminate condition to be . Moreover, if we reform (37) by , then
[TABLE]
In order to keep to be convergent fast, it is suggested to set .
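A stopping test in the spirit of the termination condition above can be sketched as a check on the change between successive iterates. The exact expression did not survive in this copy, so the Euclidean-norm test and the names `has_converged` and `tol` are assumptions.

```python
import math


def has_converged(w_prev, w_next, tol):
    """Return True when the Euclidean distance between two iterates is below tol."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(w_prev, w_next))) < tol
```

In a training loop this check would be evaluated after each update, with a cap on the iteration count as a fallback.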
In the following, we analyse the relation between the solution in SGTSVM and the optimal solution in TWSVM.
Lemma 3.2.
Let be a sequence of convex functions, and be a sequence of vectors. For , , where belongs to the sub-gradient set of at and . Suppose and have the upper bounds and , respectively. Then, for all , we have
(i) ;
(ii) for sufficiently large , given any , then .
Proof.
Since is convex and is the sub-gradient of at , we have that
[TABLE]
Note that
[TABLE]
Combining (48) and (49), we have
[TABLE]
Multiplying (50) by leads to the conclusion (i).
On the other hand, suppose , we have . Then, . Note that . Given any , for sufficiently large ,
[TABLE]
∎
We are now ready to bound the average instantaneous objective (32).
Theorem 3.2.
For () defined as (32) in SGTSVM, () is constructed by (35), and is the optimal solution to (31). Then,
(i) there are two constants and (actually, they are the upper bounds of and , respectively) such that ;
(ii) for sufficiently large , given any , then .
Proof.
Obviously, () is convex. Let and respectively be the upper bounds of and , the conclusions come from Lemmas 3.1 and 3.2. ∎
In the following, let us discuss the relation between the solutions to SGTSVM and TWSVM with the uniform sampling.
Corollary 3.1.
Assume the conditions stated in Theorem 3.1 and , where and are the sample number of and , respectively. Suppose , where is an integer, and each sample is selected times at random. Then
(i) ;
(ii) for sufficiently large , given any , then .
Proof.
First, we prove that for all ,
[TABLE]
From the formation of , we have
[TABLE]
Since is the upper bound of () and is the largest norm of the samples in the dataset, the first part, the second part, and the third part on the right hand of (53) are respectively
[TABLE]
[TABLE]
and
[TABLE]
Therefore, there is a constant satisfying (52).
From , it is easy to obtain
[TABLE]
Thus, for ,
[TABLE]
Since , for all , . Note that is the objective of TWSVM. Based on (52) and (58), we have
[TABLE]
Using the Theorem 3.1, we have the conclusion immediately. ∎
If , we can modify the sampling rule to obtain the same result as one in Corollary 3.1.
Corollary 3.2.
Assume the conditions stated in Corollary 3.1, but . Suppose , where is an integer and is the least common multiple of and . The sample in is selected times at random, and the one in is times at random. Then
(i) ;
(ii) for sufficiently large , given any , then .
Note that for all , . The proof of the above corollary is the same as Corollary 3.1.
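A sampling schedule of the kind the two corollaries rely on can be sketched as follows. The reading that every sample of a class is drawn equally often over the whole run, with the number of iterations a multiple of the least common multiple of the two class sizes, is inferred from the surrounding text; the function name and the shuffling step are assumptions.

```python
import math
import random


def balanced_schedule(m1, m2, k=1, seed=0):
    """Return a list of (index_in_class_1, index_in_class_2) pairs, one per iteration.

    The schedule has k * lcm(m1, m2) iterations, and within it every sample of
    a class appears the same number of times, in random order.
    """
    p = m1 * m2 // math.gcd(m1, m2)  # least common multiple of the class sizes
    total = k * p
    rng = random.Random(seed)
    idx1 = [i for i in range(m1) for _ in range(total // m1)]
    idx2 = [j for j in range(m2) for _ in range(total // m2)]
    rng.shuffle(idx1)
    rng.shuffle(idx2)
    return list(zip(idx1, idx2))
```

With class sizes 2 and 3 and k = 1, the schedule has 6 iterations, each sample of the first class appears 3 times, and each sample of the second class appears twice.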
The above corollaries relate the SGTSVM solution to the TWSVM optimum through explicit upper bounds. If the sampling rule is not as stated in these corollaries, these upper bounds no longer hold. However, Kakade and Tewari [14] have shown a way to obtain similar bounds with high probability.
4 Experiments
In the experiments, we compared our SGTSVM with SVM [7], PEGASOS [32], and TWSVM [12, 35] on several artificial and large scale problems. All of the methods were run on a PC with an Intel Core Duo processor (3.4 GHz) and 4 GB RAM.
4.1 Artificial datasets
On the artificial datasets, PEGASOS, TWSVM, and our SGTSVM were implemented in Matlab [22], and the corresponding SGTSVM Matlab code is available at http://www.optimal-group.org/Resource/SGTSVM.html.
First of all, we consider the similarity between TWSVM and SGTSVM. These two methods were run on the “cross planes” dataset, on which TWSVM is known to perform well [12]. Figure 3 shows the proximal lines on the dataset. The two proximal lines found by SGTSVM are similar to those found by TWSVM, so both methods capture the data distribution precisely and obtain a good classifier. To measure the similarity quantitatively, the optimal objective values of (14) and (17) in TWSVM were compared with the corresponding values at each iteration of SGTSVM on the “cross planes” dataset and some UCI datasets [4] (Australia, Credit, and Hypothyroid). Linear TWSVM, linear SGTSVM, and their nonlinear versions were implemented, with the Gaussian kernel used for the nonlinear versions. All parameters were held fixed. Figure 4 shows the results for the two linear classifiers, and Figure 5 corresponds to the nonlinear case. In Figures 4 and 5, the horizontal axis denotes the iteration of SGTSVM and the vertical axis denotes the two objective values of TWSVM and SGTSVM. Because the objectives of TWSVM are constant, they are drawn as two horizontal lines, while the objectives of SGTSVM at each iteration are drawn as two broken lines. SGTSVM converges to TWSVM after a different number of iterations on different datasets, as a comparison of Figure 4 (a) with Figure 4 (b) shows, and this holds for both the linear and nonlinear cases. Furthermore, 10-fold cross validation [8] was used on these datasets. We ran TWSVM and SGTSVM repeatedly and reported the mean accuracy and standard deviation in Table 1.
The differences between the mean accuracies in Table 1 stay below 2 percentage points, which implies that the classifiers obtained by TWSVM and SGTSVM do not differ significantly.
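Entries of the "mean ± standard deviation" kind reported in Table 1 can be produced from repeated runs as sketched below. Whether the authors used the sample or the population standard deviation is not stated; the sketch assumes the sample version, and the function name is hypothetical.

```python
import statistics


def summarize(accuracies):
    """Format a list of accuracies (in percent) as 'mean±std' with two decimals."""
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return f"{mean:.2f}±{std:.2f}"
```

For instance, three runs with accuracies 96.0, 97.0, and 98.0 are summarized as "97.00±1.00".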
Secondly, we test the stability of SGTSVM compared with PEGASOS. A collection of datasets was generated randomly, with the negative and positive samples of each dataset drawn from two different normal distributions. The best classification point is at zero. We ran PEGASOS and SGTSVM on the datasets without any restrictions and obtained the classifiers shown in Figure 6, where the digits in the upper right give the mean of these lines together with their standard deviation (all parameters of PEGASOS and SGTSVM were held fixed). It is clear that our SGTSVM obtains much more compact classification lines than PEGASOS. The mean line of SGTSVM is closer to zero, and its standard deviation is smaller than that of PEGASOS. In order to investigate the effect of sampling, PEGASOS and SGTSVM were also run on the above datasets with restricted sampling (i.e., some possible support vectors among the negative samples in SVM, and the samples close to these support vectors, are invisible to the sampling). Figure 7 shows the results of PEGASOS and SGTSVM, where the dashed line marks the region whose samples are invisible to the sampling. From Figure 7, it can be seen that the classification lines of PEGASOS fall into two regions, while SGTSVM obtains a compact region. This means that the possible support vectors significantly influence PEGASOS, while SGTSVM relies mainly on the data distribution. In Figures 6 and 7, PEGASOS always acquires a mean classification line further from zero, with a larger standard deviation, than SGTSVM. Therefore, SGTSVM is more stable than PEGASOS on these datasets with or without restricted sampling. To further show the classifiers’ stability, we recorded the classification accuracies of PEGASOS and SGTSVM on one of the datasets. PEGASOS and SGTSVM were run repeatedly on this dataset, with the parameters set as before and the same number of iterations for both methods. The accuracies of all runs are reported in Figure 8.
From Figure 8, the accuracies of SGTSVM fall in a narrower range than those of PEGASOS, which indicates that SGTSVM is more stable than PEGASOS in terms of classification results. Although PEGASOS obtains the highest accuracy in this test, SGTSVM obtains higher accuracies than PEGASOS in most cases.
Finally, we test the convergence of PEGASOS and SGTSVM. A dataset was generated randomly, with the negative and positive samples drawn from two different normal distributions. PEGASOS and SGTSVM were each run repeatedly for a fixed number of iterations. The classification locations at different iterations are reported in Figure 9, where the horizontal axis is the iteration and the vertical axis is the classification location. From Figure 9, it can be seen that: (i) the initially selected samples have little effect on either PEGASOS or SGTSVM once enough iterations have been performed; (ii) with further iterations, the classification locations of both methods concentrate around zero with a small error; (iii) importantly, PEGASOS retains a higher error than SGTSVM after the same number of iterations, which indicates that PEGASOS converges more slowly than SGTSVM. To discuss the convergence more precisely, PEGASOS and SGTSVM were run repeatedly, with each run terminated by the solution error parameter (more details can be found in Algorithm 1). Several values of this parameter were tried, and the corresponding iterations and time spent are reported in Figure 10. Figure 10 shows that our SGTSVM converges faster than PEGASOS over part of the tested range. Moreover, when a smaller solution error is required, PEGASOS needs many times more iterations than SGTSVM, and the learning time of PEGASOS can exceed that of SGTSVM by more than a hundredfold. Therefore, SGTSVM converges much faster than PEGASOS.
4.2 Large scale datasets
To test the feasibility of these methods on large scale datasets, we ran SVM, PEGASOS, and SGTSVM on six large scale datasets [4]. Table 2 shows the details of the large scale datasets, where Ratio is the number of positive samples divided by the number of negative samples. Each dataset is split into two subsets, one used for training and the other for testing. SVM is implemented by Liblinear [9], while PEGASOS and SGTSVM are implemented by software written in C. The corresponding software can be downloaded from http://www.optimal-group.org/Resource/SGTSVM.html. For nonlinear SGTSVM, the reduced kernel [18] is used with a fixed kernel size.
First, let us test the influence of a single parameter on PEGASOS and SGTSVM. These methods were run on the large scale datasets with this parameter varied over several values and the other parameters fixed. The testing accuracy and learning time are reported in Figure 11. Comparing Figure 11 (a), (c), and (e), it can be seen that our SGTSVM (in both the linear and nonlinear cases) is more stable than PEGASOS over part of the tested range. In order to obtain a high accuracy with an acceptable learning time, the parameter is then set separately for PEGASOS and for SGTSVM based on Figure 11.
Then, with this parameter fixed, we compare SVM and PEGASOS with our SGTSVM on these datasets. The accuracies are recorded in Table 3, where the validation accuracy is obtained by 5-fold cross validation on the training subset, and the testing accuracy is obtained on the testing subset. The penalty parameters of SVM, PEGASOS, and SGTSVM, as well as the Gaussian kernel parameter of nonlinear SGTSVM, are selected from predefined sets of candidate values. For simplicity, we also impose two additional settings on the parameters of SGTSVM. The optimal parameters are recorded in Table 4. From Table 3, our SGTSVM attains the highest accuracy in several groups of comparisons and performs as well as SVM or PEGASOS in the other groups. However, SVM performs much worse than SGTSVM on the Gashome dataset and cannot handle the three much larger datasets. Although PEGASOS can handle these datasets, it performs much worse than SGTSVM on Susy and Gas. To further compare the learning time of these methods, we report the one-run time with the optimal parameters in Figure 12. SGTSVM (in both the linear and nonlinear cases) is much faster than the others. Thus, our SGTSVM is competitive with SVM and PEGASOS on these large scale datasets. In addition, the SGTSVM and PEGASOS software needs much less RAM than Liblinear (the SVM software). In detail, Liblinear needs to store the entire training set in RAM, while PEGASOS and SGTSVM only store the subset used in the current iteration. Because the memory required by Liblinear increases with the size of the dataset, it tends to run out of memory as the data size grows, which does not happen with PEGASOS or SGTSVM.
5 Conclusion
The stochastic gradient twin support vector machines (SGTSVM), based on the stochastic gradient descent algorithm, has been proposed. By employing nonparallel hyperplanes, SGTSVM is more stable under stochastic sampling than PEGASOS. In theory, we prove that SGTSVM is convergent and that, with uniform sampling, it is an approximation of TWSVM. Experimental results have confirmed the merits of SGTSVM and shown that it achieves better accuracy than Liblinear and PEGASOS with the fastest learning speed. For practical convenience, the corresponding SGTSVM code (in Matlab and C) can be downloaded from http://www.optimal-group.org/Resource/SGTSVM.html. In future work, it is possible to design special sampling schemes for SGTSVM to obtain better performance, and to apply SGTSVM to big data problems.
Acknowledgment
This work is supported by the National Natural Science Foundation of China (Nos. 11501310, 11201426, and 11371365), the Natural Science Foundation of Inner Mongolia Autonomous Region of China (No. 2015BS0606), and the Zhejiang Provincial Natural Science Foundation of China (No. LY15F030013).
- [1] M.S. Bazaraa, H.D. Sherali, and C.M. Shetty. Nonlinear Programming Theory and Algorithms, second ed. Wiley, 2004.
- [2] A. Bennar and J.M. Monnez. Almost sure convergence of a stochastic approximation process in a convex set. International Journal of Applied Mathematics, 20(5):713–722, 2007.
- [3] J.B. Bi and V.N. Vapnik. Learning with rigorous support vector machines. Springer, 2003.
- [4] C.L. Blake and C.J. Merz. UCI Repository for Machine Learning Databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
- [5] C.C. Chang and C.J. Lin. LIBSVM: A library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin, 2001.
- [6] W.J. Chen, Y.H. Shao, C.N. Li, and N.Y. Deng. MLTSVM: A novel twin support vector machine to multi-label learning. Pattern Recognition, 52:61–74, 2015.
- [7] C. Cortes and V.N. Vapnik. Support vector networks. Machine Learning, 20:273–297, 1995.
- [8] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification, 2nd Edition. John Wiley and Sons, 2001.
