A Semismooth Newton Method for Support Vector Classification and Regression
Juan Yin, Qingna Li

TL;DR
This paper introduces a semismooth Newton method tailored for support vector machine models, achieving fast convergence and reduced computational complexity, especially effective for large-scale datasets.
Contribution
It explores the sparse structure of SVM models to significantly lower computational complexity while maintaining quadratic convergence, outperforming existing solvers on large datasets.
Findings
The method converges quadratically and is computationally efficient.
It outperforms leading solvers like DCD and TRON on large-scale problems.
It solves large SVM problems in seconds, demonstrating practical efficiency.
Abstract
The support vector machine (SVM) is an important and fundamental technique in machine learning. In this paper, we apply a semismooth Newton method to solve two typical SVM models: the L2-loss SVC model and the $\epsilon$-L2-loss SVR model. The semismooth Newton method is widely used in the optimization community. A common belief about the semismooth Newton method is that it has a fast convergence rate but high computational complexity. Our contribution in this paper is that, by exploiting the sparse structure of the models, we significantly reduce the computational complexity while keeping the quadratic convergence rate. Extensive numerical experiments demonstrate the outstanding performance of the semismooth Newton method, especially for problems with a huge amount of sample data (for the news20.binary problem with 19996 features and 1355191 samples, it takes only three seconds). In particular, for the…
| Step | Formula | Computational Cost |
|---|---|---|
| Form | | |
| Calculate | | |

| | Formula | Computational Cost |
|---|---|---|
| Calculate directly | | |
| With definition of | | |
| Data set | l | n | nonzeros | density |
|---|---|---|---|---|
| a1a | 30956 | 123 | 429343 | 11.28 |
| a2a | 30296 | 123 | 420188 | 11.28 |
| a3a | 29376 | 123 | 407430 | 11.28 |
| a4a | 27780 | 123 | 385302 | 11.28 |
| a5a | 26147 | 123 | 362653 | 11.28 |
| a6a | 21341 | 123 | 295984 | 11.28 |
| a7a | 16461 | 123 | 228288 | 11.28 |
| a8a | 22696 | 123 | 314815 | 11.28 |
| a9a | 32561 | 123 | 451592 | 11.28 |
| australian | 690 | 14 | 8447 | 87.44 |
| breast-cancer | 638 | 10 | 6380 | 100 |
| cod-rna | 59535 | 8 | 476280 | 100 |
| colon-cancer | 62 | 2000 | 124000 | 100 |
| diabetes | 768 | 8 | 6135 | 99.85 |
| duke breast-cancer | 38 | 7129 | 270902 | 100 |
| fourclass | 862 | 2 | 1717 | 99.59 |
| german.numer | 1000 | 24 | 23001 | 95.84 |
| gisette | 6000 | 5000 | 29729997 | 99.10 |
| heart | 270 | 13 | 3510 | 100 |
| ijcnn1 | 49990 | 22 | 649870 | 59.09 |
| ionosphere | 351 | 34 | 10551 | 88.41 |
| leukemia | 38 | 7129 | 270902 | 100 |
| liver-disorders | 145 | 5 | 725 | 100 |
| mushrooms | 8124 | 112 | 170604 | 18.75 |
| news20.binary | 19996 | 1355191 | 9097916 | 0.03 |
| phishing | 11055 | 68 | 331610 | 44.11 |
| rcv1.binary | 20242 | 47236 | 1498952 | 0.16 |
| real-sim | 72309 | 20958 | 3709083 | 0.24 |
| skinnonskin | 245057 | 3 | 735171 | 100 |
| splice | 2175 | 60 | 130500 | 100 |
| sonar | 208 | 60 | 12479 | 99.99 |
| svmguide1 | 3089 | 4 | 12356 | 100 |
| svmguide3 | 1243 | 22 | 27208 | 99.50 |
| w1a | 47272 | 300 | 551176 | 3.89 |
| w2a | 46279 | 300 | 539213 | 3.89 |
| w3a | 44837 | 300 | 522338 | 3.89 |
| w4a | 42383 | 300 | 493583 | 3.89 |
| w5a | 39861 | 300 | 464466 | 3.89 |
| w6a | 32561 | 300 | 379116 | 3.89 |
| w7a | 25057 | 300 | 291438 | 3.89 |
| w8a | 49749 | 300 | 579586 | 3.89 |
| covtype.binary | 581012 | 54 | 6940438 | 22.12 |
| Data set | l | n | nonzeros | density | range of y |
|---|---|---|---|---|---|
| abalone | 4177 | 8 | 32080 | 96.00 | [4, 29] |
| bodyfat | 252 | 14 | 3528 | 100 | [1.00, 1.11] |
| cpusmall | 8192 | 12 | 98304 | 100 | [0, 99] |
| tfidf.train | 16087 | 150360 | 19971015 | 0.83 | [-7.90, -0.52] |
| tfidf.test | 3308 | 150360 | 4559533 | 0.92 | [-7.14, -1.69] |
| eunite2001 | 336 | 16 | 2651 | 49.31 | [612, 876] |
| housing | 506 | 13 | 6578 | 100 | [5, 50] |
| mg | 1385 | 6 | 8310 | 100 | [0.42, 1.32] |
| mpg | 392 | 7 | 2614 | 95.26 | [9, 46.6] |
| pyrim | 74 | 27 | 1720 | 86.09 | [0.1, 0.9] |
| spacega | 3107 | 6 | 18642 | 100 | [-3.06, 0.10] |
| triazines | 186 | 60 | 9982 | 89.44 | [0.1, 0.9] |
| data | cg | k | res | t(s) | data | cg | k | res | t(s) | ||
|---|---|---|---|---|---|---|---|---|---|---|---|
| a1a | 5 | 3 | 1.80e-08 | 0.03 | a2a | 5 | 3 | 1.80e-08 | 0.03 | ||
| 7 | 3 | 4.01e-07 | 0.03 | 7 | 3 | 4.02e-07 | 0.03 | ||||
| 20 | 5 | 3.21e-08 | 0.07 | 1 | 20 | 5 | 3.22e-08 | 0.07 | |||
| 47 | 7 | 7.69e-08 | 0.10 | 47 | 7 | 8.53e-08 | 0.10 | ||||
| 121 | 9 | 1.87e-08 | 0.18 | 123 | 9 | 2.05e-08 | 0.18 | ||||
| a3a | 5 | 3 | 1.81e-08 | 0.03 | a4a | 5 | 3 | 1.80e-08 | 0.03 | ||
| 7 | 3 | 4.01e-07 | 0.03 | 7 | 3 | 4.00e-07 | 0.03 | ||||
| 20 | 5 | 3.31e-08 | 0.06 | 1 | 20 | 5 | 3.32e-08 | 0.06 | |||
| 47 | 7 | 9.29e-08 | 0.09 | 47 | 7 | 6.46e-08 | 0.08 | ||||
| 122 | 9 | 3.49e-08 | 0.14 | 121 | 9 | 2.98e-08 | 0.14 | ||||
| a5a | 5 | 3 | 1.80e-08 | 0.03 | a6a | 5 | 3 | 1.80e-08 | 0.02 | ||
| 7 | 3 | 4.02e-07 | 0.03 | 7 | 3 | 3.97e-07 | 0.03 | ||||
| 20 | 5 | 3.23e-08 | 0.05 | 1 | 20 | 5 | 3.21e-08 | 0.04 | |||
| 47 | 7 | 6.60e-08 | 0.08 | 47 | 7 | 9.41e-08 | 0.06 | ||||
| 121 | 9 | 2.88e-08 | 0.12 | 123 | 9 | 1.93e-08 | 0.10 |
| a7a | 5 | 3 | 1.83e-08 | 0.02 | a8a | 5 | 3 | 1.77e-08 | 0.02 | ||
| 7 | 3 | 3.99e-07 | 0.02 | 7 | 3 | 3.97e-07 | 0.02 | ||||
| 20 | 5 | 3.24e-08 | 0.04 | 1 | 20 | 5 | 3.02e-08 | 0.04 | |||
| 47 | 7 | 9.89e-08 | 0.05 | 47 | 7 | 5.95e-08 | 0.06 | ||||
| 119 | 9 | 4.53e-08 | 0.07 | 120 | 9 | 1.84e-08 | 0.11 | ||||
| a9a | 5 | 3 | 1.80e-08 | 0.03 | australian | 5 | 3 | 1.84e-08 | 0.00 | ||
| 7 | 3 | 4.02e-07 | 0.04 | 10 | 4 | 1.14e-08 | 0.00 | ||||
| 20 | 5 | 3.23e-08 | 0.08 | 1 | 23 | 5 | 3.85e-09 | 0.00 | |||
| 47 | 7 | 6.23e-08 | 0.11 | 45 | 6 | 3.57e-08 | 0.00 | ||||
| 121 | 9 | 2.47e-08 | 0.18 | 73 | 7 | 2.02e-07 | 0.01 | ||||
| breast-cancer | 5 | 4 | 1.32e-09 | 0.00 | cod-rna | 14 | 6 | 2.28e-10 | 0.07 | ||
| 6 | 3 | 7.53e-07 | 0.00 | 16 | 7 | 4.82e-08 | 0.08 | ||||
| 13 | 4 | 7.42e-07 | 0.00 | 1 | 21 | 8 | 6.62e-09 | 0.10 | |||
| 34 | 8 | 4.10e-07 | 0.00 | 31 | 11 | 1.67e-08 | 0.13 | ||||
| 63 | 9 | 1.17e-09 | 0.01 | 44 | 13 | 9.29e-09 | 0.15 | ||||
| colon-cancer | 28 | 5 | 1.02e-08 | 0.02 | diabetes | 4 | 3 | 2.10e-07 | 0.00 | ||
| 77 | 7 | 1.25e-08 | 0.04 | 6 | 3 | 3.09e-08 | 0.00 | ||||
| 100 | 7 | 1.88e-07 | 0.03 | 1 | 11 | 4 | 2.84e-07 | 0.00 | |||
| 197 | 12 | 7.44e-08 | 0.06 | 27 | 6 | 3.43e-09 | 0.00 | ||||
| 323 | 19 | 7.50e-08 | 0.09 | 30 | 6 | 3.31e-07 | 0.01 | ||||
| duke breast-cancer | 46 | 7 | 6.46e-09 | 0.06 | fourclass | 3 | 3 | 1.66e-07 | 0.00 | ||
| 67 | 7 | 6.01e-07 | 0.07 | 5 | 3 | 9.09e-10 | 0.00 | ||||
| 128 | 11 | 4.32e-08 | 0.11 | 1 | 6 | 3 | 2.73e-08 | 0.00 | |||
| 207 | 18 | 4.07e-08 | 0.18 | 12 | 4 | 7.03e-15 | 0.00 | ||||
| 406 | 32 | 3.50e-07 | 0.36 | 12 | 5 | 7.38e-07 | 0.00 | ||||
| german.numer | 6 | 3 | 7.34e-10 | 0.00 | gisette | 18 | 5 | 2.54e-08 | 3.19 | ||
| 11 | 4 | 1.19e-08 | 0.00 | 42 | 7 | 3.74e-07 | 5.44 | ||||
| 24 | 5 | 6.33e-09 | 0.00 | 1 | 123 | 9 | 2.27e-08 | 9.00 | |||
| 48 | 6 | 4.15e-07 | 0.01 | 292 | 11 | 9.16e-08 | 14.20 | ||||
| 87 | 7 | 1.42e-07 | 0.01 | 680 | 14 | 6.55e-07 | 24.13 | ||||
| heart | 5 | 3 | 1.18e-09 | 0.00 | ijcnn1 | 4 | 4 | 3.14e-08 | 0.06 | ||
| 10 | 4 | 4.59e-09 | 0.00 | 6 | 3 | 1.01e-08 | 0.05 | ||||
| 18 | 4 | 7.14e-08 | 0.00 | 1 | 11 | 4 | 7.53e-09 | 0.09 | |||
| 45 | 6 | 3.35e-08 | 0.01 | 20 | 5 | 3.13e-07 | 0.13 | ||||
| 69 | 7 | 8.64e-08 | 0.01 | 49 | 7 | 2.86e-07 | 0.18 | ||||
| ionosphere | 5 | 3 | 6.00e-08 | 0.00 | leukemia | 39 | 6 | 8.99e-07 | 0.05 | ||
| 10 | 4 | 2.18e-08 | 0.00 | 64 | 8 | 9.30e-08 | 0.07 | ||||
| 23 | 5 | 1.33e-08 | 0.00 | 1 | 86 | 9 | 9.80e-07 | 0.08 | |||
| 46 | 6 | 4.34e-07 | 0.01 | 282 | 25 | 9.25e-08 | 0.30 | ||||
| 99 | 8 | 4.61e-07 | 0.01 | 311 | 26 | 9.25e-08 | 0.28 | ||||
| liver-disorders | 23 | 5 | 7.05e-15 | 0.00 | mushrooms | 5 | 3 | 1.47e-07 | 0.01 | ||
| 23 | 5 | 1.29e-07 | 0.00 | 10 | 4 | 4.84e-08 | 0.02 | ||||
| 28 | 7 | 1.14e-11 | 0.00 | 1 | 24 | 5 | 6.03e-08 | 0.02 | |||
| 36 | 8 | 5.13e-13 | 0.00 | 56 | 7 | 5.52e-07 | 0.03 | ||||
| 36 | 8 | 7.84e-12 | 0.00 | 145 | 10 | 7.80e-08 | 0.04 | ||||
| news20.binary | 3 | 3 | 2.71e-07 | 1.04 | phishing | 3 | 2 | 2.32e-07 | 0.01 | ||
| 5 | 4 | 4.46e-10 | 1.44 | 5 | 3 | 1.14e-08 | 0.02 | ||||
| 7 | 4 | 6.22e-09 | 1.79 | 1 | 8 | 4 | 2.51e-08 | 0.03 | |||
| 12 | 5 | 3.68e-09 | 2.48 | 15 | 5 | 3.52e-07 | 0.04 | ||||
| 19 | 6 | 2.04e-07 | 3.60 | 33 | 6 | 9.53e-07 | 0.05 | ||||
| rcv1.binary | 3 | 3 | 1.59e-07 | 0.10 | real-sim | 3 | 3 | 1.16e-07 | 0.20 | ||
| 6 | 4 | 1.93e-10 | 0.16 | 6 | 5 | 2.59e-11 | 0.32 | ||||
| 7 | 4 | 1.17e-08 | 0.15 | 1 | 5 | 3 | 8.42e-07 | 0.24 | |||
| 12 | 5 | 6.07e-08 | 0.21 | 12 | 5 | 2.19e-09 | 0.44 | ||||
| 24 | 6 | 2.31e-08 | 0.33 | 20 | 6 | 1.38e-07 | 0.60 |
| skinnonskin | 15 | 5 | 4.18e-07 | 0.18 | splice | 8 | 4 | 1.21e-09 | 0.01 | ||
| 18 | 6 | 1.13e-08 | 0.22 | 15 | 5 | 3.11e-09 | 0.01 | ||||
| 18 | 6 | 8.82e-08 | 0.22 | 1 | 24 | 6 | 3.84e-07 | 0.01 | |||
| 34 | 11 | 2.84e-07 | 0.43 | 38 | 7 | 1.08e-07 | 0.02 | ||||
| 37 | 12 | 5.72e-08 | 0.39 | 54 | 8 | 1.22e-08 | 0.02 | ||||
| sonar | 6 | 3 | 4.68e-09 | 0.00 | svmguide1 | 26 | 8 | 5.70e-12 | 0.00 | ||
| 11 | 4 | 1.71e-08 | 0.00 | 29 | 9 | 6.69e-09 | 0.00 | ||||
| 23 | 5 | 5.77e-07 | 0.00 | 1 | 31 | 9 | 7.89e-14 | 0.01 | |||
| 57 | 7 | 9.42e-08 | 0.01 | 32 | 10 | 1.09e-10 | 0.01 | ||||
| 117 | 8 | 6.88e-07 | 0.01 | 39 | 11 | 4.10e-10 | 0.01 | ||||
| svmguide3 | 4 | 3 | 2.66e-07 | 0.00 | w1a | 4 | 2 | 3.81e-07 | 0.04 | ||
| 6 | 3 | 4.27e-08 | 0.00 | 9 | 4 | 2.82e-09 | 0.07 | ||||
| 11 | 4 | 7.32e-09 | 0.00 | 1 | 14 | 4 | 3.47e-07 | 0.08 | |||
| 24 | 5 | 4.66e-07 | 0.01 | 43 | 8 | 5.04e-09 | 0.14 | ||||
| 59 | 7 | 7.58e-09 | 0.01 | 82 | 9 | 5.77e-07 | 0.16 | ||||
| w2a | 4 | 2 | 3.81e-07 | 0.04 | w3a | 4 | 2 | 3.82e-07 | 0.04 | ||
| 9 | 4 | 2.88e-09 | 0.07 | 9 | 4 | 2.89e-09 | 0.07 | ||||
| 11 | 4 | 3.31e-07 | 0.07 | 1 | 14 | 4 | 3.36e-07 | 0.07 | |||
| 43 | 8 | 4.60e-09 | 0.14 | 43 | 8 | 8.09e-09 | 0.13 | ||||
| 81 | 9 | 6.58e-07 | 0.15 | 81 | 9 | 6.03e-07 | 0.15 | ||||
| w4a | 4 | 2 | 3.76e-07 | 0.03 | w5a | 4 | 2 | 3.85e-07 | 0.03 | ||
| 9 | 4 | 2.93e-09 | 0.06 | 9 | 4 | 2.99e-09 | 0.06 | ||||
| 14 | 4 | 3.54e-07 | 0.06 | 1 | 14 | 4 | 3.42e-07 | 0.06 | |||
| 43 | 8 | 6.45e-09 | 0.12 | 43 | 8 | 1.42e-08 | 0.11 | ||||
| 81 | 9 | 5.83e-07 | 0.14 | 82 | 9 | 5.07e-07 | 0.13 | ||||
| w6a | 4 | 2 | 3.73e-07 | 0.02 | w7a | 4 | 2 | 3.81e-07 | 0.02 | ||
| 9 | 4 | 2.83e-09 | 0.05 | 8 | 4 | 4.80e-08 | 0.04 | ||||
| 14 | 4 | 3.49e-07 | 0.05 | 1 | 14 | 4 | 3.50e-07 | 0.04 | |||
| 37 | 7 | 9.03e-07 | 0.08 | 43 | 8 | 7.97e-09 | 0.07 | ||||
| 84 | 9 | 2.93e-07 | 0.11 | 84 | 9 | 2.93e-07 | 0.08 | ||||
| w8a | 4 | 2 | 3.82e-07 | 0.04 | covtype.binary | 3 | 3 | 1.58e-07 | 0.41 | ||
| 9 | 4 | 2.79e-09 | 0.08 | 6 | 4 | 7.49e-08 | 0.63 | ||||
| 14 | 4 | 3.01e-07 | 0.07 | 1 | 10 | 4 | 2.15e-07 | 0.73 | |||
| 43 | 8 | 8.27e-09 | 0.15 | 20 | 6 | 7.64e-07 | 0.86 | ||||
| 82 | 9 | 6.44e-07 | 0.16 | 53 | 11 | 4.84e-09 | 1.16 |
| data | cg | k | res | t(s) | data | cg | k | res | t(s) | ||
|---|---|---|---|---|---|---|---|---|---|---|---|
| abalone | 5 | 3 | 7.38e-10 | 0.00 | bodyfat | 4 | 2 | 8.48e-08 | 0.00 | ||
| 6 | 3 | 5.74e-07 | 0.00 | 9 | 4 | 2.09e-10 | 0.00 | ||||
| 11 | 3 | 6.95e-08 | 0.01 | 1 | 14 | 4 | 6.67e-09 | 0.00 | |||
| 17 | 4 | 1.35e-07 | 0.01 | 27 | 5 | 8.83e-09 | 0.00 | ||||
| 28 | 5 | 8.59e-08 | 0.01 | 45 | 6 | 2.30e-07 | 0.00 | ||||
| cpusmall | 4 | 2 | 1.79e-07 | 0.01 | tfidf.train | 3 | 3 | 3.62e-07 | 1.71 | ||
| 7 | 3 | 5.76e-09 | 0.02 | 6 | 4 | 5.57e-09 | 2.87 | ||||
| 12 | 4 | 2.77e-08 | 0.02 | 1 | 6 | 3 | 2.07e-07 | 2.36 | |||
| 23 | 5 | 3.16e-08 | 0.03 | 10 | 4 | 4.06e-09 | 3.58 | ||||
| 43 | 5 | 2.19e-08 | 0.03 | 13 | 4 | 2.08e-09 | 3.87 | ||||
| tfidf.test | 3 | 3 | 1.16e-07 | 0.42 | eunite2001 | 4 | 4 | 1.29e-07 | 0.00 | ||
| 6 | 4 | 3.72e-09 | 0.70 | 7 | 4 | 9.96e-08 | 0.00 | ||||
| 6 | 3 | 9.14e-08 | 0.58 | 1 | 10 | 4 | 7.28e-07 | 0.00 | |||
| 9 | 4 | 3.60e-08 | 0.92 | 21 | 5 | 8.52e-08 | 0.00 | ||||
| 12 | 4 | 4.97e-09 | 0.95 | 49 | 6 | 9.08e-09 | 0.00 | ||||
| housing | 6 | 3 | 1.47e-09 | 0.00 | mg | 3 | 2 | 8.90e-07 | 0.00 | ||
| 8 | 3 | 6.49e-08 | 0.00 | 6 | 3 | 1.82e-08 | 0.00 | ||||
| 17 | 4 | 8.42e-09 | 0.00 | 1 | 10 | 3 | 2.83e-08 | 0.00 | |||
| 32 | 5 | 4.54e-08 | 0.01 | 15 | 4 | 1.93e-09 | 0.00 | ||||
| 40 | 5 | 3.30e-07 | 0.00 | 18 | 4 | 1.49e-08 | 0.00 | ||||
| mpg | 3 | 2 | 6.87e-07 | 0.00 | pyrim | 6 | 3 | 2.98e-09 | 0.00 | ||
| 6 | 3 | 3.11e-07 | 0.00 | 10 | 4 | 5.27e-08 | 0.00 | ||||
| 16 | 4 | 7.66e-09 | 0.00 | 1 | 22 | 5 | 4.56e-08 | 0.00 | |||
| 23 | 4 | 5.68e-08 | 0.01 | 41 | 5 | 7.42e-07 | 0.00 | ||||
| 30 | 5 | 3.33e-07 | 0.01 | 106 | 7 | 4.44e-07 | 0.01 | ||||
| spacega | 3 | 3 | 6.09e-07 | 0.00 | triazines | 8 | 4 | 1.77e-08 | 0.00 | ||
| 6 | 3 | 2.13e-09 | 0.00 | 15 | 4 | 1.18e-08 | 0.00 | ||||
| 9 | 3 | 3.97e-07 | 0.00 | 1 | 28 | 5 | 2.20e-08 | 0.00 | |||
| 17 | 4 | 2.47e-09 | 0.01 | 64 | 6 | 3.93e-08 | 0.01 | ||||
| 19 | 4 | 6.68e-09 | 0.01 | 140 | 7 | 2.81e-07 | 0.01 |
| data | t(s) A1 | t(s) A2 | t(s) A3 | accuracy A1 | accuracy A2 | accuracy A3 |
|---|---|---|---|---|---|---|
| a1a | 0.04 | 0.07 | 0.08 | 84.63 | 84.63 | 84.66 |
| a2a | 0.03 | 0.06 | 0.08 | 84.70 | 84.70 | 84.72 |
| a3a | 0.03 | 0.06 | 0.08 | 84.67 | 84.67 | 84.62 |
| a4a | 0.03 | 0.05 | 0.08 | 84.68 | 84.68 | 84.73 |
| a5a | 0.03 | 0.05 | 0.07 | 84.71 | 84.71 | 84.74 |
| a6a | 0.02 | 0.03 | 0.05 | 84.40 | 84.40 | 84.95 |
| a7a | 0.02 | 0.03 | 0.04 | 84.78 | 84.78 | 84.77 |
| a8a | 0.03 | 0.04 | 0.06 | 84.31 | 84.31 | 84.30 |
| a9a | 0.04 | 0.06 | 0.09 | 84.64 | 84.64 | 84.66 |
| australian | 0.00 | 0.00 | 0.00 | 84.78 | 84.78 | 85.14 |
| breast-cancer | 0.00 | 0.00 | 0.00 | 98.90 | 98.90 | 98.90 |
| cod-rna | 3.10 | 0.06 | 0.09 | 81.58 | 82.60 | 76.01 |
| colon-cancer | 0.01 | 1.03 | 0.05 | 72.00 | 72.00 | 72.00 |
| diabetes | 0.00 | 0.00 | 0.00 | 80.46 | 80.46 | 79.48 |
| duke breast-cancer | 0.02 | 1.75 | 0.17 | 80.00 | 80.00 | 80.00 |
| fourclass | 0.00 | 0.00 | 0.00 | 66.96 | 66.96 | 74.94 |
| german.numer | 0.01 | 0.00 | 0.01 | 76.50 | 76.50 | 76.75 |
| gisette | 4.93 | 12.12 | 14.18 | 97.00 | 97.00 | 97.00 |
| heart | 0.01 | 0.00 | 0.01 | 85.19 | 85.19 | 87.04 |
| ijcnn1 | 0.08 | 0.07 | 0.08 | 91.44 | 91.44 | 92.31 |
| ionosphere | 0.01 | 0.00 | 0.01 | 93.57 | 93.57 | 92.86 |
| leukemia | 0.02 | 1.95 | 0.25 | 26.67 | 26.67 | 93.33 |
| liver-disorders | 0.00 | 0.00 | 0.00 | 39.66 | 62.07 | 65.52 |
| mushrooms | 0.01 | 0.01 | 0.02 | 96.43 | 96.43 | 96.43 |
| news20.binary | 0.61 | 1.52 | 2.45 | 72.14 | 72.14 | 69.84 |
| phishing | 0.02 | 0.03 | 0.03 | 90.59 | 90.59 | 90.59 |
| rcv1.binary | 0.12 | 0.16 | 0.22 | 93.74 | 93.74 | 94.07 |
| real-sim | 0.34 | 0.29 | 0.37 | 78.78 | 78.78 | 73.88 |
| skinnonskin | 15.78 | 0.08 | 0.17 | 89.16 | 89.16 | 90.61 |
| splice | 0.15 | 0.01 | 0.01 | 84.94 | 84.94 | 85.40 |
| sonar | 0.00 | 0.00 | 0.01 | 14.46 | 14.46 | 15.66 |
| svmguide1 | 0.00 | 0.00 | 0.00 | 11.89 | 11.89 | 11.89 |
| svmguide3 | 0.00 | 0.00 | 0.01 | 40.44 | 40.44 | 40.44 |
| w1a | 0.04 | 0.05 | 0.10 | 99.32 | 99.32 | 99.92 |
| w2a | 0.05 | 0.06 | 0.09 | 99.31 | 99.31 | 99.92 |
| w3a | 0.05 | 0.06 | 0.09 | 99.29 | 99.29 | 99.93 |
| w4a | 0.04 | 0.05 | 0.09 | 99.30 | 99.30 | 99.92 |
| w5a | 0.03 | 0.05 | 0.08 | 99.27 | 99.27 | 99.92 |
| w6a | 0.03 | 0.03 | 0.06 | 99.36 | 99.36 | 99.94 |
| w7a | 0.02 | 0.02 | 0.05 | 99.31 | 99.31 | 99.95 |
| w8a | 0.04 | 0.06 | 0.10 | 99.33 | 99.33 | 99.91 |
| covtype.binary | 31.25 | 1.18 | 0.70 | 59.29 | 59.29 | 61.54 |
| data | t(s) B1 | t(s) B2 | t(s) B3 | MSE B1 | MSE B2 | MSE B3 |
|---|---|---|---|---|---|---|
| abalone | 0.00 | 0.00 | 0.01 | 50.07 | 50.07 | 4.17 |
| bodyfat | 0.00 | 0.00 | 0.00 | 0.77 | 0.77 | 0.00 |
| cpusmall | 0.01 | 0.01 | 0.02 | 112.35 | 112.37 | 102.24 |
| tfidf.train | 1.64 | 1.43 | 1.73 | 0.46 | 0.46 | 0.14 |
| tfidf.test | 0.57 | 0.41 | 0.78 | 0.40 | 0.40 | 0.13 |
| eunite2001 | 0.00 | 0.00 | 0.00 | 131854 | 131854 | 408.44 |
| housing | 0.00 | 0.00 | 0.01 | 194.38 | 194.38 | 71.45 |
| mg | 0.00 | 0.00 | 0.00 | 0.87 | 0.87 | 0.02 |
| mpg | 0.00 | 0.00 | 0.00 | 562.55 | 562.56 | 37.48 |
| pyrim | 0.01 | 0.00 | 0.01 | 0.07 | 0.07 | 0.01 |
| spacega | 0.00 | 0.00 | 0.01 | 0.44 | 0.44 | 0.03 |
| triazines | 0.03 | 0.00 | 0.02 | 0.03 | 0.03 | 0.03 |
| data | k (semismooth Newton) | t(s) (semismooth Newton) | accuracy (semismooth Newton) | k (SVRG-BB) | t(s) (SVRG-BB) | accuracy (SVRG-BB) |
|---|---|---|---|---|---|---|
| a1a | 5 | 0.04 | 79.57 | 18 | 38.53 | 76.96 |
| a2a | 5 | 0.04 | 79.57 | 18 | 37.55 | 77.03 |
| a3a | 5 | 0.03 | 79.50 | 19 | 36.54 | 76.85 |
| a4a | 5 | 0.03 | 79.60 | 18 | 34.52 | 77.00 |
| a5a | 5 | 0.03 | 79.63 | 18 | 30.88 | 77.09 |
| a6a | 5 | 0.02 | 79.80 | 18 | 25.25 | 77.16 |
| a7a | 5 | 0.02 | 79.68 | 18 | 19.62 | 77.19 |
| a8a | 5 | 0.03 | 79.35 | 18 | 27.08 | 76.45 |
| a9a | 5 | 0.04 | 79.53 | 18 | 38.63 | 76.89 |
| australian | 4 | 0.00 | 85.14 | 16 | 0.71 | 84.78 |
| breast-cancer | 5 | 0.00 | 98.17 | 15 | 1.98 | 98.17 |
| diabetes | 4 | 0.00 | 72.96 | 15 | 0.78 | 69.71 |
| fourclass | 4 | 0.00 | 74.78 | 13 | 0.77 | 71.01 |
| german.numer | 4 | 0.01 | 71.25 | 17 | 1.17 | 71.25 |
| heart | 4 | 0.00 | 84.26 | 15 | 0.28 | 83.33 |
| ijcnn1 | 4 | 0.09 | 90.37 | 15 | 47.59 | 90.37 |
| ionosphere | 5 | 0.01 | 92.86 | 17 | 0.43 | 93.57 |
| mushrooms | 5 | 0.02 | 75.08 | 19 | 17.35 | 58.77 |
| phishing | 4 | 0.02 | 57.73 | 16 | 11.07 | 55.88 |
| rcv1.binary | 4 | 0.11 | 51.93 | 16 | 264.07 | 51.91 |
| real-sim | 4 | 0.19 | 2.40 | 16 | 483.65 | 0.06 |
| sonar | 6 | 0.00 | 7.23 | 20 | 0.27 | 7.23 |
| svmguide1 | 14 | 0.00 | 11.89 | 34 | 4.48 | 11.89 |
| svmguide3 | 4 | 0.00 | 40.44 | 16 | 1.30 | 40.44 |
| w1a | 4 | 0.04 | 100 | 18 | 63.19 | 100 |
| w2a | 4 | 0.04 | 100 | 19 | 62.30 | 100 |
| w3a | 4 | 0.05 | 100 | 19 | 59.71 | 100 |
| w4a | 4 | 0.04 | 100 | 19 | 57.71 | 100 |
| w5a | 4 | 0.04 | 100 | 18 | 56.10 | 100 |
| w6a | 4 | 0.03 | 100 | 18 | 41.32 | 100 |
| w7a | 4 | 0.02 | 100 | 19 | 33.57 | 100 |
| w8a | 4 | 0.05 | 100 | 19 | 66.85 | 100 |
| covtype.binary | 4 | 0.43 | 61.54 | 18 | 631.54 | 61.54 |
Juan Yin: School of Mathematics and Statistics, Beijing Institute of Technology, Beijing, 100081, China. Email: [email protected]
Qingna Li (corresponding author): School of Mathematics and Statistics, Beijing Key Laboratory on MCAACI, Beijing Institute of Technology, Beijing, 100081, China. Email: [email protected]. This author’s research was supported by the National Science Foundation of China (No. 11671036).
Abstract
The support vector machine (SVM) is an important and fundamental technique in machine learning. In this paper, we apply a semismooth Newton method to solve two typical SVM models: the L2-loss SVC model and the $\epsilon$-L2-loss SVR model. The semismooth Newton method is widely used in the optimization community. A common belief about the semismooth Newton method is that it has a fast convergence rate but high computational complexity. Our contribution in this paper is that, by exploiting the sparse structure of the models, we significantly reduce the computational complexity while keeping the quadratic convergence rate. Extensive numerical experiments demonstrate the outstanding performance of the semismooth Newton method, especially for problems with a huge amount of sample data (for the news20.binary problem with 19996 features and 1355191 samples, it takes only three seconds). In particular, for the $\epsilon$-L2-loss SVR model, the semismooth Newton method significantly outperforms the leading solvers including DCD and TRON.
Keywords: Support Vector Regression · Support Vector Classification · Semismooth Newton Method · Quadratic Convergence · Generalized Jacobian
1 Introduction
The support vector machine (SVM) is a popular and important statistical learning technique Q1 ; Q9 ; Q10 ; Q11 ; Q12 ; Lee . SVMs hold records in performance benchmarks for handwritten digit recognition, text categorization, information retrieval, and time-series prediction. They are also commonly used in the analysis of DNA micro-array data Q6 ; Q7 ; Q8 ; Q9 ; Q10 . The two main categories of SVMs are support vector classification (SVC) and support vector regression (SVR). Support vector classification is a learning machine for two-group classification problems Q16 . Support vector regression was extended from SVC by Boser et al. Q23 . Most optimization methods for SVM models solve the dual problems, partly because of the nonsmoothness of the primal objective functions. Two typical examples are the L2-loss SVC model and the $\epsilon$-L2-loss SVR model. Below we briefly review methods for these two models, which motivate the work in our paper. For a survey of optimization methods for machine learning, we refer to largesvm2017 ; Friedman2001 .
For the L2-loss SVC model, because the gradient of the objective function is nondifferentiable, Mangasarian Q5 introduced a finite Newton method. It is essentially a semismooth Newton method with unit step size, in which the inverse of the Hessian matrix is used to calculate the Newton direction. Keerthi and DeCoste Q17 proposed a modified Newton method, in which they compute the Newton point and perform an exact line search to determine the step length. A trust region Newton method (TRON) Q4 was proposed for the L2-loss SVC model. Chang et al. Q13 proposed a coordinate descent method for the primal problem, and Hsieh et al. Q3 proposed a dual coordinate descent method (DCD) for the dual problem of the L2-loss SVC model. Very recently, Hsia et al. Q18 (we became aware of this work while drafting our paper) studied trust region update rules in Newton’s method. For the $\epsilon$-L2-loss SVR model, Ho and Lin Q14 applied TRON and DCD to solve it, and a smoothing Newton method was proposed by Gu et al. Q15 . To deal with large-scale data, stochastic gradient methods have become popular for solving large-scale SVM models largesvm2017 . The stochastic gradient method and its variants perform well in machine learning largesvm2017 . Classical stochastic gradient descent (SGD) was proposed by Robbins and Monro in 1951 SGD . Johnson and Zhang SVRG proposed accelerating stochastic gradient descent using predictive variance reduction (SVRG). Recently, Tan et al. SVRG-BB proposed using the Barzilai-Borwein (BB) method to automatically compute step sizes for SGD and SVRG, which leads to two algorithms: SGD-BB and SVRG-BB. In their paper, numerical results suggest that SVRG-BB and SGD-BB clearly outperform SVRG and SGD, respectively. To summarize, despite the competitiveness of Newton-type methods for SVMs, little attention has been paid to the semismooth Newton method for solving the two models.
On the other hand, in the optimization community, the semismooth Newton method has been well studied and has been successfully used in many applications, especially in solving modern optimization problems, such as the nearest correlation matrix problem QiSun2006 ; Qi2013A , the nearest Euclidean distance matrix problem LiQi2012 , the tensor eigenvalue complementarity problem chen , systems of absolute value equations Cruz , quasi-variational inequalities Facchinei , as well as linear and convex quadratic semidefinite programming problems Zhao2009 . The concept of semismoothness was introduced by Mifflin Mifflin , and was popularized by Qi and Sun QiSun93 . In QiSun93 , Qi and Sun proposed a nonsmooth version of the classical Newton’s method. Compared with the classical Newton method, the semismooth Newton method can solve nonsmooth equations while keeping the local quadratic convergence rate under certain conditions. The semismooth Newton method was extended to solve nonsmooth matrix equations by Qi and Sun QiSun2006 . Recently, the semismooth Newton method has been frequently used to solve some important problems, for example, Lasso problems LiSun2017 , OSCAR and SLOPE models Sun2017 , approximating weighted time series of finite rank Qi2017 , and convex clustering problems YuanSun2018 .
Compared with its wide usage in the optimization community, the semismooth Newton method has received little attention in machine learning, especially for SVM models. In this paper, we build such a bridge by applying a globalized semismooth Newton method to models of SVC and SVR, i.e., the L2-loss SVC model and the $\epsilon$-L2-loss SVR model. A common belief about the semismooth Newton method is that it has a fast convergence rate but high computational complexity. Our contribution in this paper is that, by exploiting the sparse structure of the models, we significantly bring down the computational complexity while keeping the quadratic convergence rate. Another advantage is that the method can handle a huge amount of sample data, since it solves the primal problem rather than the dual. Extensive numerical experiments demonstrate the outstanding performance of the semismooth Newton method, especially for problems with a huge amount of sample data (for the news20.binary problem with 19996 features and 1355191 samples, it takes only about three seconds). In particular, for the $\epsilon$-L2-loss SVR model, the semismooth Newton method significantly outperforms the leading solvers including DCD and TRON.
The remaining parts of this paper are organized as follows. In Section 2, we introduce the formulation of two SVM models, i.e., the L2-loss SVC model and the $\epsilon$-L2-loss SVR model. In Section 3, we introduce the semismooth Newton method and apply it to solve the two models. In Section 4, we characterize the generalized Jacobian of the gradients of the objective functions in the two models, and highlight how to maintain the quadratic convergence rate and reduce the computational complexity by making use of the sparse structure. In Section 5, we collect test data from LIBLINEAR, a popular package for SVM, and conduct extensive numerical experiments to show the efficiency of our algorithm. We also compare with other state-of-the-art solvers, such as TRON, DCD and SVRG-BB. Finally, we conclude our paper in Section 6.
2 Two Models of SVMs
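The L2-loss SVC Model. Given training data $x_i \in \mathbb{R}^n$ and the corresponding labels $y_i \in \{-1, 1\}$, $i = 1, \dots, l$, the basic idea of support vector classification is to find a hyperplane $\omega^T x + b = 0$ to separate the data, where $\omega \in \mathbb{R}^n$ and $b \in \mathbb{R}$ are unknown parameters. The traditional SVM model is
$$\min_{\omega,\, b} \ \frac{1}{2}\|\omega\|^2 \quad \text{s.t.} \quad y_i(\omega^T x_i + b) \ge 1, \quad i = 1, \dots, l. \tag{1}$$
Here we actually require that the data should be strictly separated, i.e., the constraints must be satisfied strictly. This model is based on the assumption that the data can be linearly separated. In practice, one usually uses the following regularized model, which allows some data to be wrongly labelled, i.e., the inequality constraints may be violated:
$$\min_{\omega,\, b} \ \frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^{l} \xi(\omega; x_i, y_i, b), \tag{2}$$
where $C > 0$ is a penalty parameter and $\xi(\cdot)$ is the loss function. If $\xi(\omega; x_i, y_i, b) = \max(1 - y_i(\omega^T x_i + b), 0)$, it is referred to as the hinge loss function; if $\xi(\omega; x_i, y_i, b) = \max(1 - y_i(\omega^T x_i + b), 0)^2$, it is the squared hinge loss function, which we call the L2-loss function; if $\xi(\omega; x_i, y_i, b) = \log(1 + e^{-y_i(\omega^T x_i + b)})$, it is referred to as the logistic loss function. In our paper, we focus on the L2-loss SVC model, i.e.,
$$\min_{\omega,\, b} \ \frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^{l} \max(1 - y_i(\omega^T x_i + b), 0)^2. \tag{3}$$
Recent works on support vector classification often omit the bias term $b$ because it hardly affects the performance on most data Q3 ; Q14 . Therefore, by appending each instance with an additional dimension:
$$x_i^T \leftarrow [x_i^T, 1], \qquad \omega^T \leftarrow [\omega^T, b], \tag{4}$$
we obtain the following model, which is the first model we will consider (referred to as L2-loss SVC Q3 ).
$$\min_{\omega} \ f_1(\omega) := \frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^{l} \max(1 - y_i \omega^T x_i, 0)^2. \tag{5}$$
The $\epsilon$-L2-loss SVR Model. Given training data $x_i \in \mathbb{R}^n$ and the corresponding observations $y_i \in \mathbb{R}$, $i = 1, \dots, l$, SVR is to find $\omega \in \mathbb{R}^n$ such that $\omega^T x_i$ is close to the target value $y_i$, $i = 1, \dots, l$. The $\epsilon$-L2-loss SVR model (similarly, we omit the bias term $b$ for SVR) is as follows
$$\min_{\omega} \ f_2(\omega) := \frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^{l} \xi_\epsilon(\omega; x_i, y_i), \tag{6}$$
where
$$\xi_\epsilon(\omega; x_i, y_i) = \max(|\omega^T x_i - y_i| - \epsilon, 0)^2 \tag{7}$$
is the $\epsilon$-insensitive loss function, which we call the $\epsilon$-L2-loss function, associated with $(x_i, y_i)$. The parameter $\epsilon > 0$ is given so that the loss is zero if $|\omega^T x_i - y_i| \le \epsilon$. Ho and Lin Q14 , and Gu et al. Q15 refer to SVR using $\xi_\epsilon$ as L2-loss SVR and $\epsilon$-SVR respectively. We refer to it as $\epsilon$-L2-loss SVR.
One can easily verify that the functions in (5) and (7) are continuously differentiable but not twice differentiable. An illustration of the loss functions is given in Figure 1.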
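As a concrete companion to the two models, here is a minimal NumPy sketch of the objectives, assuming the standard forms $f_1(\omega) = \frac{1}{2}\|\omega\|^2 + C\sum_i \max(1 - y_i\omega^T x_i, 0)^2$ for L2-loss SVC and $f_2(\omega) = \frac{1}{2}\|\omega\|^2 + C\sum_i \max(|\omega^T x_i - y_i| - \epsilon, 0)^2$ for $\epsilon$-L2-loss SVR. The function names `f_svc` and `f_svr` and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np

def f_svc(w, X, y, C):
    """L2-loss SVC objective: 0.5*||w||^2 + C * sum(max(1 - y_i * w^T x_i, 0)^2)."""
    slack = np.maximum(1.0 - y * (X @ w), 0.0)
    return 0.5 * (w @ w) + C * np.sum(slack ** 2)

def f_svr(w, X, y, C, eps):
    """eps-L2-loss SVR objective: 0.5*||w||^2 + C * sum(max(|w^T x_i - y_i| - eps, 0)^2)."""
    excess = np.maximum(np.abs(X @ w - y) - eps, 0.0)
    return 0.5 * (w @ w) + C * np.sum(excess ** 2)

# Toy data (hypothetical): two samples, two features, rows of X are the x_i.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
w = np.array([0.5, -0.5])
svc_value = f_svc(w, X, np.array([1.0, -1.0]), C=1.0)
svr_value = f_svr(w, X, np.array([1.0, 0.0]), C=1.0, eps=0.1)
```

Both objectives are evaluated with one matrix-vector product `X @ w`, which is the same quantity the gradient and generalized Hessian reuse later.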
3 A Semismooth Newton Method
In this section, we will apply the semismooth Newton method to solve (5) and (6). It is divided into two parts. In the first part, we introduce some preliminaries. In the second part, we apply the semismooth Newton method to solve (5) and (6).
3.1 Preliminaries
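In this part, we introduce some preliminaries about the semismooth Newton method. The semismoothness of a function is closely related to the generalized Jacobian in the sense of Clarke Clarke , which is stated as follows.
Let $\Phi: \mathbb{R}^m \to \mathbb{R}^l$ be a (locally) Lipschitz function. According to Rademacher’s theorem (Redemacher, Sect. 14), $\Phi$ is differentiable almost everywhere. Define
$$D_\Phi := \{x \in \mathbb{R}^m : \Phi \text{ is differentiable at } x\}.$$
Let $\Phi'(x)$ denote the Jacobian of $\Phi$ at $x \in D_\Phi$. The Bouligand subdifferential of $\Phi$ at $x \in \mathbb{R}^m$ is then defined by
$$\partial_B \Phi(x) := \{V \in \mathbb{R}^{l \times m} : V \text{ is an accumulation point of } \Phi'(x^k),\ x^k \to x,\ x^k \in D_\Phi\}.$$
The generalized Jacobian in the sense of Clarke Clarke is the convex hull of $\partial_B \Phi(x)$, i.e.,
$$\partial \Phi(x) = \mathrm{co}\, \partial_B \Phi(x),$$
where $\mathrm{co}\, \partial_B \Phi(x)$ is the convex hull of $\partial_B \Phi(x)$. The concept of semismoothness was introduced by Mifflin Mifflin for functionals. It was extended to vector-valued functions by Qi and Sun QiSun93 .
Definition 1
We say that $\Phi$ is semismooth at $x$ if (i) $\Phi$ is directionally differentiable at $x$ and (ii) for any $V \in \partial \Phi(x + h)$,
$$\Phi(x + h) - \Phi(x) - Vh = o(\|h\|), \quad h \to 0.$$
$\Phi$ is said to be strongly semismooth at $x$ if $\Phi$ is semismooth at $x$ and for any $V \in \partial \Phi(x + h)$,
$$\Phi(x + h) - \Phi(x) - Vh = O(\|h\|^2), \quad h \to 0.$$
Some particular examples of semismooth functions are as follows.
- Piecewise linear functions are strongly semismooth.
- The composition of (strongly) semismooth functions is also (strongly) semismooth.
For example, according to the definition above, $\max(0, t)$, $t \in \mathbb{R}$, is strongly semismooth and Clarke’s generalized gradient of $\max(0, t)$ is
$$\partial \max(0, t) = \begin{cases} \{1\}, & t > 0, \\ \{v : 0 \le v \le 1\}, & t = 0, \\ \{0\}, & t < 0. \end{cases}$$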
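The example above can be checked numerically. The following sketch assumes the usual Clarke generalized gradient of $\max(0, t)$, namely $\{1\}$ for $t > 0$, $[0, 1]$ at $t = 0$ and $\{0\}$ for $t < 0$, and represents it as a closed interval. The helper names are hypothetical. Because $\max(0, t)$ is piecewise linear, the semismoothness residual $|\max(0, t+h) - \max(0, t) - vh|$ with $v$ taken at $t + h$ vanishes exactly near the kink, which is stronger than the $O(h^2)$ bound required for strong semismoothness.

```python
def clarke_gradient_max0(t):
    """Clarke generalized gradient of max(0, t), returned as a closed interval (lo, hi)."""
    if t > 0:
        return (1.0, 1.0)
    if t < 0:
        return (0.0, 0.0)
    return (0.0, 1.0)

def semismooth_residual(t, h):
    """Largest |max(0,t+h) - max(0,t) - v*h| over v in the generalized gradient at t + h."""
    lo, hi = clarke_gradient_max0(t + h)
    diff = max(0.0, t + h) - max(0.0, t)
    return max(abs(diff - lo * h), abs(diff - hi * h))

# Residuals at the kink t = 0 for a step to the right and to the left.
residual_right = semismooth_residual(0.0, 1e-3)
residual_left = semismooth_residual(0.0, -1e-3)
```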
3.2 A Semismooth Newton Method Applied to (5) and (6)
For $\Phi: \mathbb{R}^m \to \mathbb{R}^m$, a nonsmooth version of the classical Newton method for solving the equation $\Phi(x) = 0$ is as follows:
$$x^{k+1} = x^k - V_k^{-1} \Phi(x^k), \quad V_k \in \partial \Phi(x^k), \quad k = 0, 1, 2, \dots,$$
where $x^0$ is an initial point. In general, the above iterative method does not converge. However, Qi and Sun QiSun93 show that if $\Phi$ is semismooth, then, under suitable nonsingularity conditions, the iterate sequence converges locally superlinearly. It is from then that the semismooth Newton method became popular. We would also like to highlight that if $\Phi$ is continuously differentiable, then $\partial \Phi(x)$ reduces to a singleton, which is the Jacobian of $\Phi$. In this situation, the algorithm is the classical Newton method.
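To make the iteration concrete, here is a one-dimensional sketch, assuming the update $x^{k+1} = x^k - V_k^{-1}\Phi(x^k)$ with $V_k$ taken from the Clarke generalized Jacobian at $x^k$. The test equation $\Phi(x) = 2x + \max(x, 0) - 3 = 0$ is a hypothetical example chosen for illustration (its root is $x = 1$); it is not taken from the paper.

```python
def phi(x):
    """Piecewise linear, hence strongly semismooth, test function with root x = 1."""
    return 2.0 * x + max(x, 0.0) - 3.0

def generalized_derivative(x):
    """One element of the Clarke generalized Jacobian of phi at x.

    At the kink x = 0 every value in [2, 3] is admissible; this sketch picks 3.
    """
    return 3.0 if x >= 0 else 2.0

def semismooth_newton_1d(x0, tol=1e-12, max_iter=50):
    x = x0
    iterates = [x]
    for _ in range(max_iter):
        if abs(phi(x)) <= tol:
            break
        x = x - phi(x) / generalized_derivative(x)
        iterates.append(x)
    return x, iterates

root, iterates = semismooth_newton_1d(-2.0)
```

Starting from $x^0 = -2$, the iterates are $-2 \to 1.5 \to 1$: once the iterate lands on the linear piece that contains the root, a single Newton step is exact.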
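Consider the following problem
$$\min_{\omega} \ f(\omega), \tag{10}$$
where $f = f_1$ or $f = f_2$. It is easy to verify that $f$ is strongly convex and continuously differentiable with
$$\nabla f_1(\omega) = \omega - 2C \sum_{i=1}^{l} \max(1 - y_i \omega^T x_i, 0)\, y_i x_i$$
and
$$\nabla f_2(\omega) = \omega + 2C \sum_{i=1}^{l} \max(|\omega^T x_i - y_i| - \epsilon, 0)\, \mathrm{sgn}(\omega^T x_i - y_i)\, x_i,$$
where $\mathrm{sgn}(t)$ is defined as $1$ if $t > 0$ and $-1$ otherwise. Therefore, solving (10) is equivalent to solving
$$\nabla f(\omega) = 0.$$
One can see that $\nabla f_1$ and $\nabla f_2$ are continuous, but not differentiable. Fortunately, based on our analysis in Section 3.1, we can see that $\nabla f_1$ and $\nabla f_2$ are strongly semismooth. Therefore, we can apply the semismooth Newton method to solve (5) and (6). In practice, we use the following well-studied globalized version of the semismooth Newton method (QiSun2006, Algorithm 5.1).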
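The gradient formulas can be sanity-checked against finite differences. The sketch below assumes $\nabla f_1(\omega) = \omega - 2C\sum_i \max(1 - y_i\omega^T x_i, 0)\,y_i x_i$ and $\nabla f_2(\omega) = \omega + 2C\sum_i \max(|\omega^T x_i - y_i| - \epsilon, 0)\,\mathrm{sgn}(\omega^T x_i - y_i)\,x_i$. All function names, the random test data and the step size `h` are hypothetical choices made for illustration.

```python
import numpy as np

def f_svc(w, X, y, C):
    slack = np.maximum(1.0 - y * (X @ w), 0.0)
    return 0.5 * (w @ w) + C * (slack @ slack)

def grad_svc(w, X, y, C):
    slack = np.maximum(1.0 - y * (X @ w), 0.0)
    return w - 2.0 * C * (X.T @ (slack * y))

def f_svr(w, X, y, C, eps):
    excess = np.maximum(np.abs(X @ w - y) - eps, 0.0)
    return 0.5 * (w @ w) + C * (excess @ excess)

def grad_svr(w, X, y, C, eps):
    z = X @ w - y
    excess = np.maximum(np.abs(z) - eps, 0.0)
    # np.sign(0) is 0 rather than -1; this is harmless because excess is 0 there.
    return w + 2.0 * C * (X.T @ (excess * np.sign(z)))

def central_difference(f, w, h=1e-6):
    """Central finite-difference approximation of the gradient of f at w."""
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2.0 * h)
    return g

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
w = rng.standard_normal(3)
y_cls = np.where(rng.standard_normal(20) > 0, 1.0, -1.0)
y_reg = rng.standard_normal(20)

err_svc = np.max(np.abs(grad_svc(w, X, y_cls, 1.0)
                        - central_difference(lambda v: f_svc(v, X, y_cls, 1.0), w)))
err_svr = np.max(np.abs(grad_svr(w, X, y_reg, 1.0, 0.1)
                        - central_difference(lambda v: f_svr(v, X, y_reg, 1.0, 0.1), w)))
```

Both objectives are piecewise quadratic, so the central difference agrees with the analytic gradient up to rounding error and a small term near the kinks.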
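Algorithm 1
A globalized semismooth Newton method
- S0: Given $j := 0$. Choose $\omega^0$, $\sigma \in (0, 1)$, $\rho \in (0, 1)$, $\delta > 0$, $\eta_0 > 0$ and $\eta_1 > 0$.
- S1: Calculate $\nabla f(\omega^j)$. If $\|\nabla f(\omega^j)\| \le \delta$, stop. Otherwise, go to S2.
- S2: Select an element $V^j \in \partial^2 f(\omega^j)$ and apply CG Hestenes to find an approximate solution $d^j$ by
$$V^j d + \nabla f(\omega^j) = 0$$
such that
$$\|V^j d^j + \nabla f(\omega^j)\| \le \mu_j \|\nabla f(\omega^j)\|,$$
where $\mu_j = \min(\eta_0, \eta_1 \|\nabla f(\omega^j)\|)$.
- S3: Do line search, and let $m_j$ be the smallest nonnegative integer $m$ such that the following holds
$$f(\omega^j + \rho^m d^j) \le f(\omega^j) + \sigma \rho^m \nabla f(\omega^j)^T d^j.$$
Let $\alpha_j = \rho^{m_j}$.
- S4: Let $\omega^{j+1} = \omega^j + \alpha_j d^j$, $j := j + 1$, go to S1.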
Remark. Note that Mangasarian Q5 proposed a finite Armijo Newton method for solving L2-loss SVC. Different from Mangasarian’s algorithm, we use the conjugate gradient (CG) method proposed by Hestenes and Stiefel Hestenes to solve the linear system in S2 and obtain the descent direction $d^j$. We note that Hsia et al. Q18 proposed using line search and trust region to obtain the step length, but they focused on investigating the trust region update rules in Newton’s method for L2-loss SVC.
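The steps S0 to S4 can be sketched in NumPy for the L2-loss SVC objective. This is an illustrative sketch under stated assumptions, not the authors' implementation: the generalized Hessian element is taken as $V = I + 2C\sum_{i \in A} x_i x_i^T$ with $A = \{i : 1 - y_i\omega^T x_i > 0\}$, the CG tolerance is $\min(\eta_0, \eta_1\|\nabla f\|)\,\|\nabla f\|$, and the parameter defaults (`sigma`, `rho`, `delta`, `eta0`, `eta1`), the function names and the random data are all hypothetical.

```python
import numpy as np

def objective(w, X, y, C):
    slack = np.maximum(1.0 - y * (X @ w), 0.0)
    return 0.5 * (w @ w) + C * (slack @ slack)

def gradient(w, X, y, C):
    slack = np.maximum(1.0 - y * (X @ w), 0.0)
    return w - 2.0 * C * (X.T @ (slack * y))

def conjugate_gradient(hess_vec, b, tol, max_iter=200):
    """Approximately solve V d = b for a symmetric positive definite V."""
    d = np.zeros_like(b)
    r = b.copy()                      # residual b - V d, with d = 0 initially
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol:
            break
        Vp = hess_vec(p)
        step = rs / (p @ Vp)
        d = d + step * p
        r = r - step * Vp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d

def semismooth_newton_svc(X, y, C, sigma=1e-4, rho=0.5, delta=1e-6,
                          eta0=0.1, eta1=0.1, max_iter=100):
    w = np.zeros(X.shape[1])                          # S0: initial point
    for iteration in range(max_iter):
        g = gradient(w, X, y, C)                      # S1: gradient and stopping test
        g_norm = np.linalg.norm(g)
        if g_norm <= delta:
            return w, iteration
        # S2: Hessian-vector products with V = I + 2C * X_A^T X_A, where A holds
        # the samples with 1 - y_i w^T x_i > 0 (h_i = 1 there, h_i = 0 elsewhere).
        X_active = X[1.0 - y * (X @ w) > 0.0]
        hess_vec = lambda v: v + 2.0 * C * (X_active.T @ (X_active @ v))
        mu = min(eta0, eta1 * g_norm)
        d = conjugate_gradient(hess_vec, -g, tol=mu * g_norm)
        # S3: Armijo backtracking on the step length alpha = rho^m, m = 0, 1, ...
        f_w, slope, alpha = objective(w, X, y, C), g @ d, 1.0
        while (objective(w + alpha * d, X, y, C) > f_w + sigma * alpha * slope
               and alpha > 1e-12):
            alpha *= rho
        w = w + alpha * d                             # S4: update the iterate
    return w, max_iter

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
y = np.where(X @ np.array([1.0, -2.0, 0.5]) > 0, 1.0, -1.0)
w_star, iterations = semismooth_newton_svc(X, y, C=1.0)
final_grad_norm = np.linalg.norm(gradient(w_star, X, y, 1.0))
```

The linear system is never formed explicitly: CG only needs products $Vd$, and each product touches only the rows of `X` in the active set. Swapping `objective` and `gradient` for their $\epsilon$-L2-loss SVR counterparts, with the active set $\{i : |\omega^T x_i - y_i| > \epsilon\}$, would give the corresponding sketch for SVR.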
4 Quadratic Convergence Rate and Low Computational Complexity
The traditional view of the Newton method is that it has a fast convergence rate but an expensive computational cost due to its use of second-order information. In this section, we will show that when the semismooth Newton method is applied to solve the two models (5) and (6), the quadratic convergence rate can be well maintained. Furthermore, we can also reduce the computational complexity dramatically by fully exploring the sparse structure of the models. We divide this section into three parts. In the first part, we characterize the generalized Jacobian of , which is used in Alg. 1. Then we discuss how to maintain the quadratic convergence rate of the semismooth Newton method. Finally, we bring down the computational complexity by exploring the sparse structure of the models.
4.1 Characterization of Generalized Jacobian
In Algorithm 1, we need to calculate , i.e., the generalized Jacobian of . For the L2-loss SVC model (5), by the chain rule (Clarke, Theorem 2.3.9), there is where
[TABLE]
For ε-L2-loss SVR, the generalized Jacobian of is characterized in the following proposition.
Proposition 1
For defined as in (6), there is , where
[TABLE]
Proof. Recall that
[TABLE]
In the following, we first discuss the generalized Jacobian of , . First, denote by
[TABLE]
There is
[TABLE]
We can see that is differentiable when , or , or . However, is not differentiable if . When satisfies or , the Jacobian of can be easily calculated by
[TABLE]
Next, we calculate the generalized Jacobian of when and . By Section 3.1, , and
[TABLE]
Consider at where . We choose a sequence , such that and . Then Similarly, choose another sequence , such that . There is . Then we have: at with ,
[TABLE]
Consequently, at with ,
[TABLE]
Similarly, at with , we have
[TABLE]
To sum up, we get
[TABLE]
Note that
[TABLE]
The generalized Jacobian of is then given by
[TABLE]
By (Clarke, Proposition 2.3.3), we know that if is a family of functions each of which is Lipschitz near . Therefore, we have
[TABLE]
By letting , and recall (8), we get , where
[TABLE]
The proof is finished.
4.2 Local Quadratic Convergence Rate
The local convergence result for the semismooth Newton method (9) is given as follows.
Theorem 4.1
(QiSun93, Thm. 3.2) Let be a solution of and let be a locally Lipschitz function which is semismooth at . Assume that all are nonsingular. Then every sequence generated by (9) is superlinearly convergent to , provided that the starting point is sufficiently close to . Moreover, if is strongly semismooth at , the convergence rate is quadratic.
From Theorem 4.1, to guarantee the local quadratic convergence rate of Alg. 1, we need to check the positive definiteness of each element in . From the characterization of and , one can easily see that for any , there is . In other words, is positive semidefinite for any and . Consequently, we have the following proposition.
Proposition 2
For any , , is positive definite.
Due to the positive definiteness of and , the local convergence result in Theorem 4.1 holds, and the semismooth Newton method applied to solve (5) and (6) enjoys a local quadratic convergence rate.
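The positive definiteness claim can be checked in one line. The display below is a worked sketch under an assumption: it supposes that each element of the generalized Jacobian for the L2-loss SVC model has the form shown, with weights $h_i \in [0,1]$, samples $x_i$, and penalty parameter $C > 0$ (this notation is introduced here for illustration). An analogous argument would apply to any element that is the identity plus a positive semidefinite matrix.

```latex
V = I + 2C \sum_{i=1}^{l} h_i \, x_i x_i^{\top}, \qquad h_i \in [0, 1]
\quad\Longrightarrow\quad
v^{\top} V v = \|v\|^2 + 2C \sum_{i=1}^{l} h_i \,(x_i^{\top} v)^2 \;\ge\; \|v\|^2 > 0
\quad \text{for all } v \neq 0 .
```

Since every such $V$ is positive definite, it is in particular nonsingular, which is the condition required in Theorem 4.1.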
Remark. We would like to highlight that the quadratic convergence of the semismooth Newton method is not only guaranteed from a theoretical point of view, but can also be verified numerically. In fact, in our numerical tests, a quadratic convergence rate can always be observed. More details of the quadratic convergence rate are demonstrated in Section 5.1.
4.3 Exploring Sparsity to Reduce Computational Complexity
As mentioned before, the traditional view about the semismooth Newton method is its high computational complexity since it needs to calculate the generalized Jacobian. Also, it needs to solve the linear system in order to get the Newton direction. In this part, we will demonstrate our view about the semismooth Newton method. That is, by exploring the sparse structure, the computational cost can be significantly reduced, which is even lower than calculating the Jacobian. Specifically, the computational complexity can be reduced from to , where .
We take the L2-loss SVC model as an example. In each iteration , one needs to solve the linear system (12) to get . In our algorithm, we solve the linear system by CG, so the computational burden lies in calculating for in each CG iteration. Below, we compare the computational cost of calculating by the traditional implementation with that of our implementation.
Traditional Implementation.
The traditional implementation of calculating is to first generate and save it, then calculate . The computational cost in each step is shown in Table 1 where only multiplication and division are taken into account.
The computational complexity for traditional implementation is then .
Our Implementation.
In our implementation, we do not store explicitly. Instead, we calculate directly by the right-hand side of the following formula
[TABLE]
As we can see, one only needs to calculate the second term on the right-hand side of the above formula. Here we would like to highlight that by taking the product of directly, we avoid forming the matrix . Instead, we can first take the vector product , which results in a scalar, and then conduct the scalar-vector multiplication . This leads to a computational cost of .
Moreover, recall that and some of the ’s are actually zero due to the definition in (8). Consequently, for those indices with , it is not necessary to calculate the item . Accordingly, let . At iteration , we choose in the following way:
[TABLE]
As a result, will reduce to
[TABLE]
The computational cost then becomes . This is the implementation we use in our code. These are summarized in Table 2.
To further see the size of , note that for , it means that the -th sample cannot be linearly separated, i.e., it violates , so we need to penalize the violation. In this case, we assume that only a small number of such violations occur. Therefore, near the optimal solution , is much smaller than the sample size i.e., . We can see that, compared with calculating directly, the complexity in each iteration is reduced from to .
In short, due to the special sparse structure of problems (5) and (6), our way of calculating leads to a low computational cost, which is much lower than that of the classical Newton method and of a standard implementation of the semismooth Newton method.
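The two ways of computing the product can be compared in a short sketch. It assumes the generalized Jacobian element has the form `I + 2C * X^T diag(h) X` with 0/1 weights `h`, which is an assumption about the formulas not reproduced here. The names `hess_vec_dense` and `hess_vec_active` are hypothetical.

```python
import numpy as np

def hess_vec_dense(X, h, C, v):
    """Traditional route: form the n-by-n matrix explicitly, then multiply.
    Forming X^T diag(h) X costs on the order of l*n^2 multiplications."""
    n = X.shape[1]
    V = np.eye(n) + 2.0 * C * (X.T * h) @ X
    return V @ v

def hess_vec_active(X, active, C, v):
    """Matrix-free route: only rows with nonzero weight contribute.
    Each product costs on the order of |I_k| * n multiplications."""
    Xa = X[active]                 # samples with h_i = 1
    inner = Xa @ v                 # one scalar x_i^T v per active sample
    return v + 2.0 * C * (Xa.T @ inner)
```

A quick usage check on random data: with `active` a boolean mask and `h = active.astype(float)`, both functions return the same vector up to rounding, while the second never allocates an n-by-n matrix.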
5 Numerical Results
In this section, we analyze the performance of the semismooth Newton method for solving L2-loss SVC and ε-L2-loss SVR. It is divided into five parts. In the first part, we demonstrate the low complexity of the semismooth Newton method as well as the quadratic convergence rate. In the second and third parts, we discuss the performance of the semismooth Newton method for L2-loss SVC and ε-L2-loss SVR, respectively, under different choices of parameters. In the fourth part, we compare our algorithm with the methods in LIBLINEAR Q25, including the trust region Newton method (TRON) and the dual coordinate descent method (DCD). In the last part, we compare with SVRG-BB SVRG-BB, one of the most efficient stochastic gradient methods.
All experiments are run in Matlab R2013b under Windows 7 on a Lenovo desktop computer with an Intel(R) Core(TM) i5-3470M CPU at 3.20 GHz and 4 GB of RAM. Throughout the computational experiments, we use the following parameters in the semismooth Newton method: When solving the linear system by CG, we set the maximum number of iterations to 200.
Because the error criteria differ between L2-loss SVC and ε-L2-loss SVR, we use the standard real data sets from LIBSVM for classification and regression separately (42 data sets for classification and 12 data sets for regression). For classification datasets whose labels do not belong to , we remap the labels so that they belong to . For example, for the dataset breast-cancer, the samples’ labels are either 2 or 4. We turn the label 2 into -1 and the label 4 into 1. We use the same strategy for the datasets liver-disorders, mushrooms, phishing, skinnonskin and svmguide1. Detailed information on the data sets for classification and regression is given in Table 3 and Table 4.
To see the performance of the semismooth Newton method, we report the following information: the number of iterations , the total number of CG iterations , the CPU time in seconds, as well as the final , denoted as . We also use an index of accuracy to further evaluate the quality of the solution returned by our method. For L2-loss SVC, let be a test data, the predicted label is then calculated as follows
[TABLE]
where is generated by the semismooth Newton algorithm. The accuracy is then calculated by
[TABLE]
For ε-L2-loss SVR, we let , where m is the total number of testing data and is the element of testing data. We use the mean squared error (MSE) to show our algorithm’s test accuracy, which is calculated by
[TABLE]
where are the observed data corresponding to , .
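The two evaluation measures can be sketched as follows. The decision rule (label +1 when the linear score is positive, -1 otherwise) and the prediction `X_test @ w` for regression are assumptions made for this illustration, since the displayed formulas are not reproduced here. The function names are hypothetical.

```python
import numpy as np

def predict_labels(w, X_test):
    # Assumed decision rule: +1 if w^T x > 0, otherwise -1.
    return np.where(X_test @ w > 0, 1.0, -1.0)

def accuracy_percent(w, X_test, y_test):
    # Percentage of test samples whose predicted label matches the true label.
    return 100.0 * np.mean(predict_labels(w, X_test) == y_test)

def mean_squared_error(w, X_test, y_test):
    # Average squared difference between observed values and linear predictions.
    residual = y_test - X_test @ w
    return np.mean(residual ** 2)
```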
5.1 Demonstration of Low Computational Complexity and Quadratic Convergence Rate
Demonstration of Low Computational Complexity and Sparsity.
As analyzed above, the model of problem (5) has good sparsity. In this part, we give an example to illustrate this sparsity. Here we set for convenience. The dataset “covtype.binary” contains 581012 instances, and the semismooth Newton method takes 11 iterations to terminate successfully on it. Recall . For each iteration, is recorded as follows.
[TABLE]
The corresponding ratio of over sample size is calculated by
[TABLE]
We plot and in Figure 2. We can see that is always under the horizontal line and is always less than 1. In particular, is significantly smaller than the total number of instances except and the value of is even less than 0.1 at some iterations, indicating that the computational cost is significantly saved from to .
Demonstration of Quadratic Convergence Rate.
For L2-loss SVC, to show the quadratic convergence rate, we choose two data sets, “w3a” and “real-sim”, to run Alg. 1 and plot the during iterations when in Figure 3. One can see that decreases fast and the algorithm stops successfully within a small number of iterations (fewer than 10 for both datasets). We can see that decreases almost linearly along k, indicating the superlinear convergence rate of the semismooth Newton method.
For ε-L2-loss SVR, Figure 4 shows the trend of during iterations via the semismooth Newton method on two data sets: “abalone” and “mpg”. We can observe that decreases fast during iterations, which again verifies the quadratic convergence rate of the semismooth Newton method.
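One simple way to check a convergence rate from a recorded residual sequence is to estimate the empirical order from three consecutive residuals. The sketch below is illustrative only. The helper `empirical_orders` is hypothetical and the residual values are synthetic, not taken from the paper's experiments. An estimate close to 2 is consistent with quadratic convergence.

```python
import numpy as np

def empirical_orders(residuals):
    """Estimate p_k = log(r_{k+1} / r_k) / log(r_k / r_{k-1}) from a
    strictly decreasing sequence of positive residuals."""
    r = np.asarray(residuals, dtype=float)
    log_ratios = np.log(r[1:] / r[:-1])
    return log_ratios[1:] / log_ratios[:-1]

# Synthetic sequence in which each residual is ten times the square
# of the previous one (made-up numbers, for illustration).
orders = empirical_orders([1e-1, 1e-2, 1e-4, 1e-8])
```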
5.2 Numerical Results for L2-loss SVC (5)
In this part, to see the role of parameter in L2-loss SVC model (5), we test our algorithm with and report the results in Table 5 (we use 100% of the data in each data set).
From Table 5, we summarize the following observations.
1. All 210 tested instances are successfully solved by the semismooth Newton method. This suggests that the semismooth Newton method is capable of solving problem (5) and that the computation time of our algorithm is small.
2. When our algorithm terminates, all the residuals (as shown in the column under “res”) are at least at the level of within 10 iterations (recall that the stopping criterion is ), and some residuals even reach or . This indicates that our algorithm stops successfully under the stopping criterion and returns solutions of high accuracy. That is, the semismooth Newton method is effective for solving L2-loss SVC.
3. In terms of different choices for , the semismooth Newton method obtains the optimal solution in every case. We notice that the smaller the , the fewer iterations our algorithm needs and the faster it converges.
4. Our algorithm converges to the optimal solution for most data sets within 10 iterations.
5.3 Numerical Results for ε-L2-loss SVR (6)
For in ε-insensitive loss function Q15, Ho et al. Q14 performed experiments with and without using via the dual coordinate descent method. The results indicate that for ε-L2-loss SVR (6), MSE is similar for different ’s. As a result, we fix and test our algorithm with different choices of for (6) since is insensitive. The results are reported in Table 6.
Table 6 shows that our algorithm stops successfully under the stopping criterion, which indicates that the semismooth Newton method is efficient for ε-L2-loss SVR (6). All datasets are successfully solved by the semismooth Newton method in seconds. This suggests that the semismooth Newton method is capable of solving problem (6) and that the computation time of our algorithm is quite small. Our algorithm converges to the optimal solution for all data sets within 7 iterations. Similarly, we can observe that the smaller the , the fewer iterations our algorithm needs and the faster it converges.
5.4 Numerical Comparisons with LIBLINEAR
In this part, we compare our algorithm with some solvers in LIBLINEAR (we use LIBLINEAR version 2.11, downloaded from https://www.csie.ntu.edu.tw/~cjlin/liblinear/), which is one of the most popular and successful public software packages for support vector classification, regression and distribution estimation with linear kernel. We choose the following popular solvers for L2-loss SVC and ε-L2-loss SVR.
- DCD1 and TRON1: a dual coordinate descent method Q3 and a trust region Newton method Q4 for L2-loss SVC.
- TRON2 and DCD2: a trust region Newton method and a dual coordinate descent method Q14 for ε-L2-loss SVR.
We use a stratified selection to split each set into 60% training and 40% testing. For L2-loss SVC, the training time and accuracy (in percentage) on classification datasets are reported in Table 7 with .
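A stratified split of this kind can be sketched as below. The helper `stratified_split`, its seed, and the rounding rule are assumptions for illustration; the text does not specify how the selection was implemented. Each class is shuffled separately and the same fraction is taken from each, so class proportions are preserved in both parts.

```python
import numpy as np

def stratified_split(y, train_frac=0.6, seed=0):
    """Return sorted index arrays (train, test) that keep roughly the same
    class proportions as the label vector y (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        n_train = int(round(train_frac * len(idx)))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.sort(np.array(train_idx)), np.sort(np.array(test_idx))

# Example with 50 positive and 50 negative labels (synthetic).
y_example = np.array([1.0] * 50 + [-1.0] * 50)
train_ids, test_ids = stratified_split(y_example, train_frac=0.6)
```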
From Table 7, we get the following observations.
1. All three methods have high accuracy. The accuracy on most datasets is over 60%, and even higher than 90% for some datasets. Of the 42 classification data sets, the semismooth Newton method has the same or higher classification accuracy than DCD1 and TRON1 on 34 datasets.
2. The semismooth Newton method is competitive with DCD1 and TRON1 in terms of CPU time. In particular, for “covtype.binary”, the semismooth Newton method is much faster than DCD1 and TRON1. For “skinnonskin”, the semismooth Newton method takes less time than DCD1 and is as fast as TRON1.
In summary, the semismooth Newton method is very competitive with DCD1 and TRON1, in terms of accuracy and cputime.
Next, we compare our algorithm with DCD2 and TRON2 for ε-L2-loss SVR. The results are listed in Table 8. The smaller the MSE, the better the fit of the model. For ε-L2-loss SVR, we tested 12 regression data sets and observed that when and , the MSE obtained by our algorithm is significantly smaller than that of DCD2 and TRON2 for all regression datasets. As for the CPU time, the three methods are almost the same. These results indicate that our algorithm is efficient and performs better than DCD2 and TRON2.
5.5 Numerical Comparisons with SVRG-BB
In this part, we compare the semismooth Newton method with SVRG-BB SVRG-BB for the following squared hinge loss SVC:
[TABLE]
We refer to (SVRG-BB, Algorithm 3) for the algorithm of SVRG-BB, and we use the following parameters in SVRG-BB: .
Note that (21) and (5) are equivalent by choosing proper in (5). However, when solving (5) by SVRG-BB, we find that SVRG-BB is sensitive to the selection of parameter in (5), and it cannot converge for most data sets when or some other choices of . For this reason, we use the model and we take , which is equivalent to our model using . Other settings are the same as before. Among the total 42 datasets, SVRG-BB cannot converge for some datasets (“cod-rna”, “colon-cancer”, “duke.breast-cancer”, “gisette”, “leukemia”, “liver-disorders”, “news20.binary”, “skinnonskin” and “splice”). We test the remaining 33 datasets and use a stratified selection to split each set into 60% training and 40% testing.
In Figure 5, we selected 4 data sets: “a5a”, “german.numer”, “mushrooms” and “mushrooms” to show the accuracy along iterations of the semismooth Newton method and SVRG-BB for (21). From Figure 5, we can see that our algorithm needs far fewer iterations than SVRG-BB and that the accuracy obtained by the semismooth Newton method is the same as or higher than that of SVRG-BB for these 4 datasets.
Next, we give the comparison results of the semismooth Newton method and SVRG-BB in Table 9. From Table 9, we have the following observations.
1. Our algorithm needs fewer iterations than SVRG-BB. Our algorithm satisfies the termination condition for most data sets within 5 iterations, whereas SVRG-BB needs about 20 iterations.
2. The semismooth Newton method is significantly faster than SVRG-BB for all tested data sets. Our algorithm converges to the optimal solution within a few seconds, but SVRG-BB takes tens of seconds for most data sets. In particular, for the datasets “rcv1.binary”, “real-sim” and “covtype.binary”, SVRG-BB takes 264, 484 and 632 seconds respectively, while our algorithm takes less than 1 second.
3. Both methods have high accuracy for most data sets. The accuracy of the semismooth Newton method is the same as or even higher than that of SVRG-BB for all data sets except “ionosphere”. For “w1a” to “w8a”, both algorithms eventually achieve 100% accuracy.
In summary, our algorithm is very effective and has better performance than SVRG-BB with regard to number of iterations, computational time and accuracy.
6 Conclusions
In this paper, we apply the semismooth Newton method to solve two typical SVM models: the L2-loss SVC model and the ε-L2-loss SVR model. Our contribution is that, by exploring the sparse structure of the models, we significantly bring down the computational complexity while keeping the quadratic convergence rate. Extensive numerical experiments demonstrate the outstanding performance of the semismooth Newton method, especially for problems with a huge number of samples (for the news20.binary problem with 19996 features and 1355191 samples, it only takes three seconds). In particular, for the ε-L2-loss SVR model, the semismooth Newton method significantly outperforms the leading solvers including DCD and TRON.
References
- (1) Al-Mubaid, H., Umair, S.A.: A new text categorization technique using distributional clustering and learning logic. IEEE Transactions on Knowledge and Data Engineering, 18(9), 1156-1165 (2006).
- (2) Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1), 141-148 (1988).
- (3) Basak, D., Pal, S., Patranabis, D.C.: Support vector regression. Neural Information Processing-Letters and Reviews, 11(10), 203-224 (2007).
- (4) Boser, B.E., Guyon, I., Vapnik, V.: A training algorithm for optimal margin classifiers. The Workshop on Computational Learning Theory, 144-152 (1992).
- (5) Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Review, 60(2), 223-311 (2018).
- (6) Chang, K.W., Hsieh, C.J., Lin, C.J.: Coordinate descent method for large-scale L2-loss linear support vector machines. Journal of Machine Learning Research, 9(3), 1369-1398 (2008).
- (7) Chen, Z., Qi, L.: A semismooth Newton method for tensor eigenvalue complementarity problem. Computational Optimization and Applications, 65(1), 109-126 (2016).
- (8) Clarke, F.H.: Optimization and Nonsmooth Analysis. J. Wiley (1983).
