Incremental and Decremental Fuzzy Bounded Twin Support Vector Machine

Alexandre Reeberg de Mello; Marcelo Ricardo Stemmer; Alessandro Lameiras Koerich

arXiv:1907.09613·cs.LG·March 24, 2020

TL;DR

This paper introduces an incremental and decremental fuzzy twin support vector machine that efficiently handles large datasets and data streams, offering fast training and robust classification through innovative algorithms and approximations.

Contribution

It presents a novel incremental/decremental FBTWSVM combining fuzzy membership, Fourier Gaussian approximation, and a DAG multi-class extension, with theoretical analysis and improved training speed.

Findings

01 Fast training and retraining on benchmark datasets

02 Robust classification performance maintained

03 Effective handling of large datasets and data streams

Abstract

In this paper we present an incremental variant of the Twin Support Vector Machine (TWSVM) called Fuzzy Bounded Twin Support Vector Machine (FBTWSVM) to deal with large datasets and learning from data streams. We combine the TWSVM with a fuzzy membership function, so that each input has a different contribution to each hyperplane in a binary classifier. To solve the pair of quadratic programming problems (QPPs) we use a dual coordinate descent algorithm with a shrinking strategy, and to obtain a robust classification with a fast training we propose the use of a Fourier Gaussian approximation function with our linear FBTWSVM. Inspired by the shrinking technique, the incremental algorithm re-utilizes part of the training method with some heuristics, while the decremental procedure is based on a scored window. The FBTWSVM is also extended for multi-class problems by combining binary…

Tables

Table 1: The datasets, their characteristics, and the experimental settings.

Dataset	Train examples	Test examples	Attributes	Classes	Kernel size	Kernel $\gamma$	$C_{1}, C_{3}$	$C_{2}, C_{4}$	Number of points
Border	4,000	1,000	2	3	150	0.4	8	2	100
Overlap	3,960	990	2	4	150	0.4	8	2	100
Letter	16,000	4,000	16	26	350	0.01	8	2	1,000
SUSY	4,500,000	500,000	18	2	300	0.2	10	2	100,000
Outdoor	2,600	1,400	21	40	500	0.001	10	1	300
COIL	1,800	5,400	21	100	400	20	4	4	500
DNA	1,400	1,186	180	3	500	0.003	4	4	50
USPS	7,291	2,007	256	10	1,000	0.007	8	2	1,000
Isolet	6,238	1,559	617	26	1,000	0.002	10	10	500
MNIST	60,000	10,000	784	10	2,400	0.0002	10	10	10,000
Gisette	6,000	1,000	5,000	2	linear	linear	8	2	500
WESAD	21,668,504	1,537,900	8	3	linear	linear	8	2	1,537,900
LED	10k\|100k\|1M	3k\|30k\|300k	24	10	linear	linear	8	2	5,000
SEA	10k\|100k\|1M	3k\|30k\|300k	3	2	linear	linear	10	1	5,000
RTG	10k\|100k\|1M	3k\|30k\|300k	10	2	1,400	0.6	2.5	2	5,000
RBF	10k\|100k\|1M	3k\|30k\|300k	10	5	300	0.45	8	2	5,000
HYPER	10k\|100k\|1M	3k\|30k\|300k	10	2	linear	linear	5	4	5,000

Table 2: On-line accuracy (%) of the incremental and decremental FBTWSVM compared to other incremental algorithms on several benchmark datasets. Statistically significant differences are marked with ⋆.

Dataset	FBTWSVM (Best)	FBTWSVM (Mean ± SD)	ISVM	LASVM	ORF	ILVQ
Border	98.70	97.60 ± 1.10	98.50	97.6	94.0	94.7
Overlap	84.14⋆	82.58 ± 1.49	81.7	78.8	78.2	81.1
Letter	96.75⋆	96.68 ± 0.07	91.3	92.7	75.4	88.4
SUSY	77.67	76.00 ± 1.20	-	79.3	78.5
Outdoor	74.44	73.72 ± 0.42	86.4	82.3	34.2	82.6
COIL	95.11⋆	94.99 ± 0.14	75.4	66.3	66.6	79.1
DNA	93.59⋆	92.90 ± 0.30	89.5	73.1	84.6
USPS	95.47	94.91 ± 0.30	96.7	96.6	84.5	92.7
Isolet	96.28⋆	95.88 ± 0.37	93.6	92.9	69.2	84.7
MNIST	97.80	97.00 ± 0.12	-	97.5	87.1	90.8
Gisette	96.50	96.40 ± 0.01	96.3	96.4	90.3	91.1

"-" denotes results that are not available due to memory limitations.

Table 3: On-line accuracy with different forgetting scores ($d$) and the corresponding number of support vectors (nSV).

Dataset	$d=1$	$d=2$	$d=4$	$d=10$	$d=\infty$	OffL
	Accuracy (%) | nSV
Border	91.90 | 538	93.30 | 784	98.20 | 1.3k	97.20 | 2.8k	98.70 | 7k	98.50 | 8k
Overlap	76.97 | 1.9k	79.70 | 2.3k	82.22 | 3.5k	82.83 | 7k	84.14 | 11.8k	83.30 | 11k
Letter	93.38 | 83k	94.83 | 122k	96.10 | 204k	96.10 | 338k	96.63 | 361k	96.90 | 384k
SUSY	45.84 | 754k	45.85 | 1.5M	77.67 | 2.5M	73.78 | 3.4M	76.45 | 3.8M	-
Outdoor	72.00 | 24k	72.25 | 39k	73.69 | 68k	74.44 | 83k	73.88 | 92k	74.00 | 93k
COIL	93.21 | 98k	94.01 | 100k	94.57 | 153k	95.11 | 156k	94.93 | 166k	95.00 | 178k
DNA	91.82 | 770	92.16 | 848	92.50 | 1.2k	93.78 | 1.9k	93.59 | 2.6k	93.50 | 2.7k
USPS	93.92 | 9k	94.32 | 21k	94.87 | 42k	95.36 | 60k	95.47 | 60k	95.30 | 65.5k
Isolet	95.19 | 61k	95.51 | 85k	95.32 | 128k	95.89 | 143k	96.28 | 143k	95.60 | 155k
MNIST	97.17 | 59k	97.48 | 169k	97.66 | 342k	97.91 | 455k	97.80 | 455k	97.80 | 540k
Gisette	96.50 | 1.3k	97.00 | 1.9k	96.90 | 2.9k	96.20 | 5.3k	96.50 | 6k	96.50 | 6k
"-" denotes results that are not available due to memory limitations.

Table 4: Training and testing processing time in seconds with different forgetting scores ($d$).

	Training | Testing time (s)
Dataset	$d=1$	$d=10$	$d=\infty$
Border	12.39 | 0.01	34.12 | 0.02	43.22 | 0.02
Overlap	22.96 | 0.02	82.58 | 0.02	94.37 | 0.02
Letter	88.14 | 0.68	179.18 | 0.72	192.36 | 0.70
SUSY	940.60 | 1.88	5264.41 | 2.62	- | -
Outdoor	111.00 | 0.73	153.09 | 0.80	153.55 | 0.81
COIL	154.63 | 5.00	169.10 | 5.29	179.12 | 4.51
DNA	6.89 | 0.03	8.63 | 0.03	8.53 | 0.03
USPS	21.59 | 0.25	41.59 | 0.24	41.67 | 0.23
Isolet	117.35 | 0.52	181.14 | 0.49	160.10 | 0.51
MNIST	349.47 | 1.21	419.49 | 1.19	435.48 | 1.26
Gisette	25.26 | 0.01	50.52 | 0.01	49.41 | 0.01
"-" denotes results that are not available due to memory limitations.

Table 5: Comparison of the training time (TT) in seconds (s), real memory usage (RMU) in gigabytes (GB), and accuracy (Acc) in percent (%) for five synthetic datasets at three data sizes (10k | 100k | 1M), and for the WESAD dataset.

Method	Metric	LED	SEA	RTG	RBF	HYPER	WESAD
FBTWSVM	TT	5.81 | 19.1 | 380	0.69 | 2.15 | 63	9.60 | 119 | 3.4k	11.4 | 164 | 5.3k	0.73 | 2.61 | 28.0	6.8k
FBTWSVM	RMU	3.26 | 3.81 | 7.34	0.84 | 0.91 | 1.31	1.84 | 10.1 | 10.6	1.04 | 2.06 | 7.03	0.84 | 0.96 | 1.47	9.80
FBTWSVM	Acc	74.1 | 74.1 | 74.2	89.0 | 87.1 | 89.1	95.8 | 95.5 | 96.0	88.3 | 89.4 | 89.0	89.1 | 94.7 | 94.0	75.5
ISVM	TT	40.7 | - | -	8.80 | 4.5k | -	63.9 | - | -	31.8 | - | -	14.4 | 4.9k | -	-
ISVM	RMU	0.89 | - | -	0.88 | 0.98 | -	0.94 | - | -	0.94 | - | -	0.82 | 0.95 | -	-
ISVM	Acc	74.2 | - | -	89.9 | 89.3 | -	90.0 | - | -	87.6 | - | -	94.1 | 93.0 | -	-
LASVM	TT	-	2.44 | 328 | -	2.39 | 546 | -	-	2.21 | 519 | -	-
LASVM	RMU	-	0.06 | 0.36 | -	0.03 | 0.41 | -	-	0.05 | 0.37 | -	-
LASVM	Acc	-	84.0 | 87.0 | -	88.0 | 95.7 | -	-	91.7 | 93.8 | -	-
"-" denotes results that are not available due to training times greater than 12 hours.

Equations

$\omega_{+}^{\top}x+b_{+}=0 \quad\text{and}\quad \omega_{-}^{\top}x+b_{-}=0$

$\min_{\omega_{+},b_{+},\xi_{-}}\;\frac{1}{2}\|X_{+}\omega_{+}+e_{+}b_{+}\|^{2}+C_{1}e_{-}^{\top}\xi_{-}$

$\text{s.t.}\;\;y_{-}(X_{-}\omega_{+}+e_{-}b_{+})+\xi_{-}\geq e_{-},\;\;\xi_{-}\geq 0$

$\min_{\omega_{-},b_{-},\xi_{+}}\;\frac{1}{2}\|X_{-}\omega_{-}+e_{-}b_{-}\|^{2}+C_{2}e_{+}^{\top}\xi_{+}$

$\text{s.t.}\;\;y_{+}(X_{+}\omega_{-}+e_{+}b_{-})+\xi_{+}\geq e_{+},\;\;\xi_{+}\geq 0$

$f(x)=\operatorname*{argmin}_{\pm}\frac{|\omega_{\pm}^{*\top}x+b_{\pm}^{*}|}{\|\omega_{\pm}^{*}\|}$

$\max_{\alpha}\;e_{-}^{\top}\alpha-\frac{1}{2}\alpha^{\top}H_{-}(H_{+}^{\top}H_{+})^{-1}H_{-}^{\top}\alpha$

$\text{s.t.}\;\;0\leq\alpha\leq C_{1}$

$\max_{\nu}\;e_{+}^{\top}\nu-\frac{1}{2}\nu^{\top}H_{+}(H_{-}^{\top}H_{-})^{-1}H_{+}^{\top}\nu$

$\text{s.t.}\;\;0\leq\nu\leq C_{2}$

$x^{\top}\omega_{+}+b_{+}=0 \quad\text{and}\quad x^{\top}\omega_{-}+b_{-}=0$

$\text{class}(x)=\operatorname*{argmin}_{i\in\{-1,+1\}}d_{i}(x)$

$d_{i}(x)=\frac{|x^{\top}\omega_{i}+b_{i}|}{\|\omega_{i}\|}$

$\min_{\omega,b,\xi}\;\frac{1}{2}\|\omega\|^{2}+Cs^{\top}\xi$

$\text{s.t.}\;\;y_{i}(\omega^{\top}x_{i}+b)+\xi_{i}\geq 1$

$\xi_{i}\geq 0,\;\;i=1,2,\dots,l$

$x_{c+}=\frac{1}{l_{+}}\sum_{y_{i}=+1}x_{i},\quad x_{c-}=\frac{1}{l_{-}}\sum_{y_{i}=-1}x_{i}$

$r_{+}=\max_{i}\|x_{i}-x_{c+}\|\;\;\text{if}\;\;y_{i}=+1$

$r_{-}=\max_{i}\|x_{i}-x_{c-}\|\;\;\text{if}\;\;y_{i}=-1$

$s_{i}=\begin{cases}\mu\big(1-\|x_{i}-x_{c+}\|/(r_{+}+\delta)\big) & \text{if }\|x_{i}-x_{c+}\|\geq\|x_{i}-x_{c-}\|\;\land\;y_{i}=+1\\ (1-\mu)\big(1-\|x_{i}-x_{c+}\|/(r_{+}+\delta)\big) & \text{if }\|x_{i}-x_{c+}\|<\|x_{i}-x_{c-}\|\;\land\;y_{i}=+1\end{cases}$

$\kappa(x_{1},x_{2})=\langle\varphi(x_{1}),\varphi(x_{2})\rangle\approx z(x_{1})^{\top}z(x_{2})$

$\kappa(x-y)=\int_{\mathcal{R}^{n}}p(\tau)e^{j\tau^{\top}(x-y)}\,d\tau=E_{\tau}[\zeta_{\tau}(x)\zeta_{\tau}(y)^{*}]$

$p(\tau)=\frac{1}{2\pi}\int e^{j\tau^{\top}\delta}k(\delta)\,d\Delta$

$z(x)\equiv\sqrt{\frac{2}{N}}\,[\cos(\tau_{1}^{\top}x+b_{1}),\dots,\cos(\tau_{N}^{\top}x+b_{N})]^{\top}$

$p(\tau)=(2\pi)^{-\frac{N}{2}}\exp\left(-\frac{\|\tau\|_{2}^{2}}{2}\right)$

$\min_{\omega_{+},b_{+},\xi_{-}}\;\frac{1}{2}C_{1}(\|\omega_{+}\|^{2}+b_{+}^{2})+\frac{1}{2}\|X_{+}\omega_{+}+e_{+}b_{+}\|^{2}+C_{3}s_{-}^{\top}\xi_{-}\quad\text{s.t.}\;\;y_{-}(X_{-}\omega_{+}+e_{-}b_{+})+\xi_{-}\geq e_{-},\;\;\xi_{-}\geq 0$

$\min_{\omega_{-},b_{-},\xi_{+}}\;\frac{1}{2}C_{2}(\|\omega_{-}\|^{2}+b_{-}^{2})+\frac{1}{2}\|X_{-}\omega_{-}+e_{-}b_{-}\|^{2}+C_{4}s_{+}^{\top}\xi_{+}\quad\text{s.t.}\;\;y_{+}(X_{+}\omega_{-}+e_{+}b_{-})+\xi_{+}\geq e_{+},\;\;\xi_{+}\geq 0$

$L(\omega_{+},b_{+},\xi_{-})=\frac{1}{2}C_{1}(\|\omega_{+}\|^{2}+b_{+}^{2})+\frac{1}{2}\|X_{+}\omega_{+}+e_{+}b_{+}\|^{2}-\alpha^{\top}(-(X_{-}\omega_{+}+e_{-}b_{+})+\xi_{-}-e_{-})+C_{3}s_{-}^{\top}\xi_{-}-\eta^{\top}\xi_{-}$

$\nabla_{\omega_{+}}L=C_{1}\omega_{+}+X_{+}^{\top}(X_{+}\omega_{+}+e_{+}b_{+})+X_{-}^{\top}\alpha=0$

$\nabla_{b_{+}}L=C_{1}b_{+}+e_{+}^{\top}(X_{+}\omega_{+}+e_{+}b_{+})+e_{-}^{\top}\alpha=0$

$\nabla_{\xi_{-}}L=-\alpha-\eta+C_{3}s_{-}=0$

$-(X_{-}\omega_{+}+e_{-}b_{+})+\xi_{-}\geq e_{-},\quad\xi_{-}\geq 0$

$\alpha^{\top}(X_{-}\omega_{+}+e_{-}b_{+}-\xi_{-}+e_{-})=0;\quad\eta^{\top}\xi_{-}=0$

$\alpha\geq 0,\;\;\eta\geq 0,\;\;\xi_{-}\geq 0$

$([X_{+},e_{+}]^{\top}[X_{+},e_{+}]+C_{1}I)[\omega_{+},b_{+}]^{\top}+[X_{-},e_{-}]^{\top}\alpha=0$

$(H_{+}^{\top}H_{+}+C_{1}I)u_{+}^{\top}+H_{-}^{\top}\alpha=0$ or

Code & Models

Repositories

areeberg/FBTSVM (official)

Taxonomy

Topics: Data Stream Mining Techniques · Face and Expression Recognition · Machine Learning and ELM

Full text

Incremental and Decremental Fuzzy Bounded Twin Support Vector Machine

Alexandre R. Mello

Marcelo R. Stemmer

Alessandro L. Koerich

École de Technologie Supérieure - Université du Québec, 1100 Notre-Dame West, Montréal, QC, H3C 1K3, Canada.

University of Santa Catarina, Campus Reitor João David Ferreira Lima, Trindade, Florianópolis, SC, 88040-900, Brazil.

SENAI Innovation Institute of Embedded Systems, Avenida Luiz Boiteux Piazza, 574 - Cond. Sapiens Parque - Canasvieiras - Florianópolis, SC, 88054-700, Brazil.

Abstract

In this paper, we present an incremental variant of the Twin Support Vector Machine (TWSVM) called Fuzzy Bounded Twin Support Vector Machine (FBTWSVM) to deal with large datasets and to learn from data streams. We combine the TWSVM with a fuzzy membership function, so that each input has a different contribution to each hyperplane in a binary classifier. To solve the pair of quadratic programming problems (QPPs), we use a dual coordinate descent algorithm with a shrinking strategy, and to obtain robust classification with fast training we propose the use of a Fourier Gaussian approximation function with our linear FBTWSVM. Inspired by the shrinking technique, the incremental algorithm reuses part of the training method with some heuristics, while the decremental procedure is based on a scoring window. The FBTWSVM is also extended to multi-class problems by combining binary classifiers using a Directed Acyclic Graph (DAG) approach. Moreover, we analyze the theoretical properties of the proposed approach and its extension, and the experimental results on benchmark datasets indicate that the FBTWSVM has a fast training and retraining process while maintaining a robust classification performance.

Keywords: Twin-SVM, Incremental Learning, Multiclass Twin-SVM, Data Stream, On-line Learning

Journal: Information Sciences. Accepted for publication.

ORCID: https://orcid.org/0000-0003-3130-5328

1 Introduction

Classical machine learning approaches, which access all data simultaneously, cannot handle scenarios in which the training data is only partially available at any given time, or in which the amount of data is too large to fit into the memory or storage of a single machine. Incremental or online learning tackles problems in which only a subset of the data is considered at each step of the learning process, or in which the dataset is too large to be processed at once [1]. From the computational point of view, incremental learning has three goals: (i) transform previously learned knowledge to currently received data to facilitate learning from new data; (ii) accumulate experience over time to support the decision-making process; and (iii) achieve global generalization through learning to accomplish goals. Incremental learning often also refers to on-line learning strategies with limited memory resources, which rely on a compact memory model that represents the data observed so far while still providing accurate results for all relevant settings.

Losing et al. [2] evaluated the most common incremental learning algorithms on diverse datasets and concluded that Support Vector Machines (SVMs) are usually the most accurate models. However, this accuracy comes at the expense of the most complex model, among other shortcomings. SVMs are appropriate for two-class classification problems: they solve a complex Quadratic Programming Problem (QPP) that determines a unique global hyperplane in the input space that maximizes the separation between the classes [3]. However, the SVM requires large memory and high CPU power, since its computational complexity for $n$ data points is $O(n^{3})$, which makes it impractical for large datasets. To circumvent this problem, one may use an incremental version of the SVM or one of its variants, which learns from new data by discarding all past data points except the support vectors (SVs), i.e., the new data is used to retrain the model together with the current SVs [1, 4].

The Incremental SVM (ISVM) proposed by Cauwenberghs and Poggio [4] is an exact solution to the on-line SVM problem: it updates the optimal solution of the SVM when one training data point is added or removed. The bottleneck of the ISVM is that the computational complexity of a minor iteration of the algorithm is quadratic in the number of training data points learned so far. Therefore, the actual runtime depends on the balance between memory access and arithmetic operations in a minor iteration. The LASVM [5] is an on-line kernel classifier that relies on the soft-margin SVM formulation to handle noisy data. Its iterations are similar to those of the sequential minimal optimization (SMO) algorithm but use a different search strategy. Furthermore, it introduces an SV removal step, in which it removes vectors collected in the current kernel expansion during the on-line process. The iterations run in epochs, where each epoch sequentially visits all the randomly shuffled training data points, and the stopping criterion is a pre-defined number of epochs. Multiple epochs can be used as a stochastic optimization algorithm in off-line training, and a single epoch in the on-line step. The computational cost of the LASVM is $O(p\times nSV\times i)$, where $nSV$ is the number of SVs, $i$ is the number of on-line iterations, and $p$ scales no more than linearly with the number of training data points, which makes the training process faster than that of the ISVM. Empirical results suggest that using a single epoch yields misclassification rates comparable to those of the SVM. Despite the effectiveness of the ISVM and the LASVM, both methods still need to deal with one large QPP, which requires large memory storage and CPU processing time in the training and update steps.

Mangasarian et al. [6] introduced the Generalized Eigenvalue Proximal SVM (GEPSVM), which generates two non-parallel hyperplanes for a two-class problem. Thus, it solves two smaller QPPs instead of a single complex QPP, placing the data points of each class in the proximity of one hyperplane, which reduces the complexity compared to the SVM. Jayadeva et al. [7] proposed the Twin Support Vector Machine (TWSVM), which also solves a pair of QPPs in which the data points of one class provide the constraints of the other QPP and vice versa [8, 9]. The TWSVM classifies the data points of two classes using two non-parallel hyperplanes with a complexity of $O(2\times(n/2)^{3})$, which is four times lower than that of an SVM. Twin-based models are mathematically smaller than the SVM, and they require less memory storage and CPU processing time.

Based on the TWSVM, several variants and solvers have been proposed [8, 10]. Shao et al. [11] suggested the Twin Bounded SVM (TBSVM), which adheres to the structural risk minimization principle, so that the matrix inverse in the dual formulation is guaranteed to exist and the dual can be solved by the successive over-relaxation (SOR) method. The Improved TWSVM (ITWSVM) [12] uses a different representation from the TBSVM, which leads to a different Lagrangian function for the primal problem and to different dual formulations. The ITWSVM does not need to compute the inverse of large matrices before training and can be solved by the SOR or the SMO. However, the matrices in the dual form must involve all the data points from both classes, which makes the dual QPPs larger than those of the TWSVM. Khemchandani et al. [13] proposed a fuzzy TWSVM that assigns a fuzzy weight to each data point to mitigate the effect of outliers and improve accuracy. Gao et al. [14] proposed a coordinate descent fuzzy TWSVM, which assigns a membership function to mitigate the effect of noisy data points and solves the QPPs with coordinate descent and active-set shrinking. Other variants or extensions are the Least Square TWSVM (LS-TWSVM) [15], which solves the primal problems of the TWSVM, and the $\nu$-TWSVM [16], in which the $\nu$ parameter controls the bounds on the fraction of SVs and on the margin error.

Considering the TWSVM and its variations, Khemchandani et al. [1] introduced the incremental TWSVM (I-TWSVM), which uses the concept of margin vectors and error vectors to select new data points to update the classifier. It learns from new data by retraining the model while discarding past data points, except for the previous SVs and the misclassified data points from the training dataset. However, for each new data point, both models need to be rebuilt entirely. Hao et al. [17] proposed a fast incremental TWSVM that uses a distance-based strategy to determine whether a new data point is above a pre-defined threshold. It selects the crucial data points that are near the proximal hyperplane from the current training set and keeps the data points that are not near the proximal hyperplane from the new training set. In each iteration, it retrains the model considering the previous SVs and the new data points (there is no decremental step). The on-line Twin Independent SVM (OTWISVM) [18] uses a modified Newton method to build a decision function from a subset of the data points seen so far for each class separately (called the basis). The basis vectors are found (or added during the on-line procedure) during iterative minimization by checking whether a new data point is linearly independent, in the feature space, from the current basis. The basis size is limited, so it does not grow linearly with the training set. The OTWISVM does not have a decremental step, and as it uses a modified Newton solver, it needs to calculate the inverse of the Hessian on every update, which makes the method infeasible for high-dimensional datasets. Besides improving the model with new data, it is also essential to have a decremental procedure to prevent the model from growing indefinitely. Although the update strategy is closely related to the model formulation, there are many alternatives for choosing the SVs to be removed, such as the time window proposed by Fung et al. [19], the concept of informative margin vectors and error vectors [4], or decay coefficients [20].

Finally, to unleash the full potential of the incremental SVM, it is necessary to adapt it to non-linear problems using the kernel trick. However, conventional kernel approaches struggle with large datasets because of the storage and computational cost of handling large kernel matrices. A feasible solution is to use kernel approximations, such as exploiting a low-rank approximation of the kernel matrix, reducing the kernel space definition, or exploiting a randomized kernel space definition. Random Fourier approximations (RF) provide an efficient and elegant methodology [21]. The Fourier expansion generates features based on a finite set of random basis projections whose inner products are Monte Carlo approximations of the kernel [22]. Fourier features are applicable to translation-invariant kernels, so they can approximate the Gaussian kernel. Rahimi et al. [21] use RF to map the input data to a randomized low-dimensional feature space and provide convergence bounds for approximating various radial basis kernels. Le et al. [23] proposed an RF-based approximation called Fastfood, which requires less computation and memory storage than Random Kitchen Sinks [24] to obtain an explicit function space expansion.
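As a rough illustration of the random Fourier idea described above, the following sketch maps inputs to cosine features whose inner products approximate a Gaussian kernel. It is a minimal example under stated assumptions, not the paper's implementation: the function names, the kernel parameterization $\exp(-\gamma\|x-y\|^{2})$, the number of features, and the toy data are all chosen here for illustration.

```python
import numpy as np

def gaussian_kernel(X, Y, gamma):
    """Exact Gaussian kernel exp(-gamma * ||x - y||^2) for all pairs of rows."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def random_fourier_features(X, n_features, gamma, rng):
    """Map X to z(X) so that z(x)^T z(y) approximates the Gaussian kernel."""
    n_dims = X.shape[1]
    # Frequencies sampled from the kernel's Fourier transform: N(0, 2*gamma*I).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(n_dims, n_features))
    # Random phases sampled uniformly from [0, 2*pi).
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Toy data (assumed): 50 points in 5 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
Z = random_fourier_features(X, n_features=5000, gamma=0.1, rng=rng)

# Compare the approximate Gram matrix with the exact one.
max_err = np.abs(Z @ Z.T - gaussian_kernel(X, X, gamma=0.1)).max()
print(f"max abs error: {max_err:.3f}")
```

Because the features are explicit, a linear model trained on `Z` behaves like a kernel model without ever forming the full kernel matrix, and the approximation error shrinks as `n_features` grows.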

Although many efforts have been made, incremental SVM approaches still have several shortcomings, such as the impossibility of endless learning, high model complexity, long training time, costly hyper-parameter optimization, and poor adaptability to concept drift, among others. In this paper, we propose a novel incremental and decremental variant of the TWSVM called Fuzzy Bounded Twin Support Vector Machine (FBTWSVM) that overcomes many of the shortcomings of the current approaches. The FBTWSVM combines fast training and an incremental procedure (with the ability to handle noisy data) without weakening the accuracy when updated. The proposed approach can continuously integrate new information into already-built models; it adheres to the structural risk minimization principle [11] and uses the dual coordinate descent (DCD) algorithm with active shrinking [12, 13, 14, 25] to create the off-line classifier. The incremental and decremental strategies are based on the DCD with shrinking and exploit the relevance of each support vector. Moreover, we propose the use of our linear formulation with a kernel approximation to speed up training and classification while maintaining the non-linearity. Finally, the FBTWSVM is extended to multiclass problems using a strategy based on the Directed Acyclic Graph (DAG). The experimental results on benchmark datasets show that the proposed approach achieves accuracy comparable to that of the exact solution, while being faster at integrating new information into, and discarding outdated information from, the already-built models.

This paper is organized as follows. Section 2 presents the definitions and notations, introduces the Twin SVM, and presents the fuzzy SVM with the respective membership function and the kernel approximation. Section 3 presents the proposed formulation for the FBTWSVM (both linear and non-linear versions), and the solving method with implementation details. We also extend our formulation to multiclass problems. In Section 4, we present the incremental and decremental procedures. In Section 5, we present our experimental procedure and the experimental results. We present the conclusions and perspectives for future work in the last section.

2 Basic Concepts

We use the following definitions and notations throughout the paper. The problems are in an $n$-dimensional space $\mathcal{R}^{n}$. We denote the training data as $D=\{(\bm{x_{i}},y_{i})\mid i=1,2,\dots,l\}$, where $\bm{x_{i}}\in\mathcal{R}^{n}$ is an input data point, $l$ is the number of data points, and $y_{i}\in\{1,2,\dots,u\}$ is the corresponding label, where $u$ is the number of classes. We adopt the definition of incremental learning proposed by Losing et al. [2]: an algorithm that generates, on a given stream of training data $\bm{x}_{1},\bm{x}_{2},\dots,\bm{x}_{t}$, a sequence of models $\bm{h}_{1},\bm{h}_{2},\dots,\bm{h}_{t}$, where $(\bm{h}_{i}:\mathcal{R}^{n}\mid i=1,2,\dots,l)$ is a model function solely depending on $\bm{h}_{i-1}$ and the recent $p$ data points $\bm{x}_{i},\dots,\bm{x}_{i-p}$, with $p$ being strictly limited. The approach used to deal with multiclass problems is the DAG, for which it is necessary to create $2u-1$ binary problems. For each binary problem we assign either a positive or a negative label $y_{i}\in\{+1,-1\}$. Therefore, the training set $D$ is divided into the $l_{+}\times n$ dimensional matrix $X_{+}$ and the $l_{-}\times n$ dimensional matrix $X_{-}$ for the positive and negative labels, respectively, where $l_{+}$ and $l_{-}$ denote the number of data points with each label. We define the aggregation per binary problem as $X=[X^{\top}_{+}\;X^{\top}_{-}]$, which denotes all input data points from both classes.
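To make the notation concrete, the sketch below builds one binary problem from a multiclass dataset: it selects two classes, relabels them as $+1$ and $-1$, and splits the inputs into the matrices $X_{+}$ and $X_{-}$. The helper name `make_binary_problem` and the toy data are assumptions made for illustration only.

```python
import numpy as np

def make_binary_problem(X, y, pos_class, neg_class):
    """Select two classes, relabel them as +1 / -1, and split into X_plus, X_minus."""
    mask = (y == pos_class) | (y == neg_class)
    X_sub = X[mask]
    y_sub = np.where(y[mask] == pos_class, 1, -1)
    X_plus = X_sub[y_sub == 1]     # shape (l_plus, n)
    X_minus = X_sub[y_sub == -1]   # shape (l_minus, n)
    return X_plus, X_minus, y_sub

# Toy multiclass data (assumed): l = 5 points in R^2 with u = 3 classes.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.1, 0.9], [3.0, 3.0]])
y = np.array([1, 1, 2, 2, 3])

X_plus, X_minus, y_bin = make_binary_problem(X, y, pos_class=1, neg_class=2)
print(X_plus.shape, X_minus.shape, y_bin)
```

The point of class 3 is simply left out of this binary problem; a multiclass scheme would repeat the same split for the other class pairings it needs.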

2.1 The Twin SVM (TWSVM)

The TWSVM [7] generates two non-parallel hyperplanes such that each hyperplane is closer to one class and as far as possible from the other class [8, 26], as shown in Figure 1. The two non-parallel decision planes are defined as:

$\bm{\omega_{+}}^{\top}\bm{x}+b_{+}=0 \quad\text{and}\quad \bm{\omega_{-}}^{\top}\bm{x}+b_{-}=0$ (1)

where $\bm{\omega_{+}},\bm{\omega_{-}}\in\mathcal{R}^{n}$ are the normal vectors to the hyperplanes, and $b_{+},b_{-}\in\mathcal{R}$ are the bias terms.

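The decision planes are obtained from the following pair of primal optimization problems, in which the soft margin allows the hyperplanes to handle non-linearly separable data:

$\min_{\bm{\omega_{+}},b_{+},\xi_{-}}\;\frac{1}{2}\|X_{+}\bm{\omega_{+}}+\bm{e_{+}}b_{+}\|^{2}+C_{1}\bm{e_{-}}^{\top}\xi_{-}\quad\text{s.t.}\;\;y_{-}(X_{-}\bm{\omega_{+}}+\bm{e_{-}}b_{+})+\xi_{-}\geq\bm{e_{-}},\;\;\xi_{-}\geq 0$ (2)

and

$\min_{\bm{\omega_{-}},b_{-},\xi_{+}}\;\frac{1}{2}\|X_{-}\bm{\omega_{-}}+\bm{e_{-}}b_{-}\|^{2}+C_{2}\bm{e_{+}}^{\top}\xi_{+}\quad\text{s.t.}\;\;y_{+}(X_{+}\bm{\omega_{-}}+\bm{e_{+}}b_{-})+\xi_{+}\geq\bm{e_{+}},\;\;\xi_{+}\geq 0$ (3)

where $C_{1}>0$ and $C_{2}>0$ are the penalty factors that trade off model complexity against data misfit between the two terms of the objective function, $\xi_{+}$ and $\xi_{-}$ denote the slack variable vectors (the deviations from the margin that allow some misclassification error for the positive and negative classes, respectively), $\bm{e_{+}}$ and $\bm{e_{-}}$ are vectors of ones whose dimensions match the number of data points in each class and are used for mathematical purposes only, and $y_{+}$ and $y_{-}$ are $+1$ and $-1$, respectively.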

In each QPP (Equations 2 and 3) the objective function corresponds to a particular class and the constraints are set by the data points of the opposite class. The first term of both objective functions aims to minimize the sum of squared distances between the hyperplane and the points of one class, which tends to keep the hyperplane close to the points of such a class. On the other hand, the second term of both objective functions aims to minimize the misclassification due to points belonging to the other class. The constraints require the distance between the hyperplane and the points of the other class is at least 1, and a set of error variables is used to measure the error wherever the hyperplane is closer than the minimum distance [7]. Assuming that the TWSVM is split into two QPPs of size $n/2$ , and that the complexity of the original SVM is less or equal to $n^{3}$ , the TWSVM is approximately four times faster than the original SVM ( $2\times(n/2)^{3}=n^{3}/4$ )[26]. After solving Equations 2 and 3 for $(\bm{\omega^{*}}+,b^{*}_{+})$ and $(\bm{\omega^{*}_{-}},b^{*}_{-})$ , respectively, we can classify a new data point $\bm{x}$ by:

[TABLE]

and choose either $+1$ or $-1$ according to the lowest value of Equation 4, i.e., we classify a new data point $\bm{x}$ depending on which of the two hyperplanes given by Equation 1 it lies closest. We can write Equations 2 and 3 as an unconstrained problem using Lagrangian multipliers. The dual formulation of the linear TWSVM for Equation 2 is:

[TABLE]

where $H_{+}$$=$$[X_{+},e_{+}]$ , $H_{-}$$=$$[X_{-},e_{-}]$ , $||$$\cdot$$||$ denotes the L2 norm, and $\bm{\alpha}$$=$$(\alpha_{1},\dots,\alpha_{m})^{\top}$ is the vector of Lagrangian multipliers. In a similar manner we can write the dual formulation for Eq 3 as:

[TABLE]

where $\bm{\nu}=(\nu_{1},\nu_{2},\dots,\nu_{m})^{\top}$ is the vector of Lagrangian multipliers. For more detail on the dual formulation one may refer to [7, 26]. Once we solve the dual problems for $\bm{\alpha}$ and $\bm{\nu}$ , we can get the vectors $[\bm{\omega_{+}},b_{+}]^{\top}$ and $[\bm{\omega_{-}},b_{-}]^{\top}$ . Thus, the separating hyperplanes are given by:

[TABLE]

During testing, a new data point is assigned to the closest hyperplane regarding the two classes by:

[TABLE]

where

[TABLE]
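The nearest-hyperplane rule described above can be sketched as follows. This is a minimal illustration, assuming the distance to each hyperplane is the perpendicular distance $|\bm{x^{\top}}\bm{\omega_{\pm}}+b_{\pm}|/||\bm{\omega_{\pm}}||$ ; the function name `twsvm_predict`, the toy hyperplanes, and the tie-breaking choice are ours and are not taken from the original formulation.

```python
import numpy as np

def twsvm_predict(X, w_pos, b_pos, w_neg, b_neg):
    """Label each row of X with the class of its nearest hyperplane (+1 or -1)."""
    X = np.atleast_2d(X)
    # Perpendicular distance to each of the two non-parallel hyperplanes.
    d_pos = np.abs(X @ w_pos + b_pos) / np.linalg.norm(w_pos)
    d_neg = np.abs(X @ w_neg + b_neg) / np.linalg.norm(w_neg)
    # Ties go to +1 (an arbitrary choice made for this sketch).
    return np.where(d_pos <= d_neg, 1, -1)

# Hypothetical hyperplanes: x1 = 1 for the positive class, x1 = -1 for the negative one.
w_pos, b_pos = np.array([1.0, 0.0]), -1.0
w_neg, b_neg = np.array([1.0, 0.0]), 1.0
labels = twsvm_predict(np.array([[0.9, 3.0], [-1.2, 0.5]]), w_pos, b_pos, w_neg, b_neg)
```

The first toy point lies 0.1 away from the positive hyperplane and 1.9 away from the negative one, so it receives the label $+1$ ; the second point is labeled $-1$ by the same argument.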

2.2 The Fuzzy SVM

The presence of outliers in the training dataset may affect both the standard SVM and the TWSVM. The Fuzzy SVM introduced by Lin et al. [27] uses fuzzy theory to reduce the effect of outliers by applying a membership to each data point. Fuzzy numbers, denoted as $s_{i}$ , are assigned to each input data point to add information that reflects the noise contamination level, with $0\leq s_{i}\leq 1$ for $i=1,2,\dots,l$ . Therefore, the training dataset $D$ becomes a set of triples $D^{\prime}=(\bm{x_{i}},y_{i},s_{i})$ that accommodates the fuzzy number and reduces the influence of the contaminated data points in generating the decision functions. The fuzzy SVM is formulated as:

[TABLE]

where $C$ is the trade-off scalar and $\xi_{i}$ is the slack variable that represents the error associated with the $i$ -th input data point. An important remark about this formulation is that a small $s_{i}$ can reduce the effect of the slack variable $\xi_{i}$ in Equation 10, so reducing the importance of the corresponding data point $\bm{x_{i}}$ . The classification of an input $\bm{x}$ is given by the sign of $\bm{\omega^{*\top}}\bm{x}+b^{*}$ , where $\bm{\omega^{*}}$ and $b^{*}$ are the solution of Equation 10.

The construction of the membership functions follows the strategy used by Gao et al. [14, 25]. The method considers reducing the noise carried by outliers while keeping the importance of the SVs. We integrate the fuzzy SVM into the TWSVM formulation by selecting two different classes and assigning a positive label to the first class and a negative label to the second one. The class centers $x_{c+}$ and $x_{c-}$ are the mean points considering the input space of these two classes, defined by:

[TABLE]

The hypersphere radii $r_{+}$ and $r_{-}$ are given by the distance from each class center to the farthest data point of that class:

[TABLE]

The membership of $s_{i}$ is assigned according to the distance relationship between $||\bm{x_{i}}-x_{c+}||$ and $||\bm{x_{i}}-x_{c-}||$ , when $x_{c+}$ , $x_{c-}$ , $r_{+}$ , and $r_{-}$ are known. Formally, $s_{i}$ of a positive data point is given as:

[TABLE]

where $\mu\in[0,1]$ balances the effect of normal and noisy data points, and $\delta>0$ prevents fuzzy numbers from being equal to 0. Figure 2 illustrates the fuzzy-related elements used to assign the fuzzy membership value.

A data point is usually assigned a proportionally decreasing value $s_{i}$ as it drifts farther from its native class center, which increases the uncertainty [25]. A small positive real number $\mu$ is assigned to decrease the effect of outliers on the hyperplane. The fuzzy numbers for the negative data points are calculated analogously.
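To make the membership assignment concrete, the sketch below computes class centers, radii, and fuzzy numbers for one class. It is only an illustration: the piecewise rule used here (scale by $\mu$ when a point is at least as far from its own center as from the opposite center, and by $1-\mu$ otherwise) is an assumption in the spirit of the centre-distance strategy described above, and the function name `fuzzy_membership` and the toy data are hypothetical.

```python
import numpy as np

def fuzzy_membership(X_own, X_other, mu=0.1, delta=1e-4):
    """Fuzzy number s_i for every row of X_own (one class of a binary problem)."""
    c_own = X_own.mean(axis=0)        # centre of the native class
    c_other = X_other.mean(axis=0)    # centre of the opposite class
    d_own = np.linalg.norm(X_own - c_own, axis=1)
    d_other = np.linalg.norm(X_own - c_other, axis=1)
    r_own = d_own.max()               # radius: farthest point from the own centre
    base = 1.0 - d_own / (r_own + delta)   # decays with distance, stays > 0
    # Assumed rule: points closer to the opposite centre are treated as likely
    # noise and down-weighted by mu; the remaining points are weighted by 1 - mu.
    return np.where(d_own >= d_other, mu * base, (1.0 - mu) * base)

# Toy data: four clean positive points and one positive point inside the negatives.
X_pos = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [6.0, 6.0]])
X_neg = np.array([[5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
s_pos = fuzzy_membership(X_pos, X_neg)
```

In this toy example the last positive point sits among the negative points, so it receives a much smaller fuzzy number than the four clean points, which is the intended down-weighting of suspected outliers.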

2.3 Kernel Approximation

Kernel machines that operate on the data kernel matrix (Gram matrix) scale more than quadratically in the data dimension [21, 22], which makes methods such as the ISVM or the LASVM impractical for large datasets or for incremental data that require sequential learning. Approximating non-linear kernels by linear kernels in a transformed space makes it possible to use efficient linear methods whose cost depends linearly on the size of the training set, so that large-scale and incremental learning problems can be solved efficiently [21, 22]. Instead of relying on the implicit lifting of the kernel trick, the Random Fourier Features [21] explicitly map the data to a low-dimensional Euclidean inner product space using a randomized feature map $z\colon\mathcal{R}^{n}\rightarrow\mathcal{R}^{N}$ , described as:

[TABLE]

where $z$ maps into a low-dimensional space. The feature map approximates shift-invariant kernels $\kappa(\bm{x_{1}}-\bm{x_{2}})$ to within an error $err$ with $N=O(err^{-2}n\log\frac{1}{err^{2}})$ dimensions. Rahimi and Recht [21] show empirically that a similar classification performance can be obtained for dimensions smaller than $N$ .

The first set of transformed features are the Random Fourier bases $\cos(\tau^{\top}\bm{x}+b)$ , where $\tau\in\mathcal{R}^{n}$ and $b\in\mathcal{R}$ are random variables. The map projects the data onto a randomly chosen line and then passes the resulting scalar through a sinusoidal function. Drawing the direction of these lines from an appropriate distribution guarantees that the product of two transformed points approximates a desired shift-invariant kernel [21]. The transformation follows Bochner’s theorem: a continuous kernel $\kappa(x,y)=\kappa(x-y)$ on $\mathcal{R}^{n}$ is positive definite if and only if $\kappa(\delta)$ is the Fourier transform of a non-negative measure. For a properly scaled shift-invariant kernel $\kappa(\delta)$ , Bochner’s theorem guarantees that its Fourier transform $p(\tau)$ is a proper probability distribution:

[TABLE]

where $\zeta_{\tau}(x)=e^{j\tau^{\top}x}$ . The product $\zeta_{\tau}(x)\zeta_{\tau}(y)^{*}$ is an unbiased estimate of $k(x,y)$ when $\tau$ is drawn from $p$ , where $*$ denotes the complex conjugate. The integral of Equation 15 converges when the complex exponentials are replaced by cosines, $z_{\tau}(x)=\sqrt{2}\cos(\tau^{\top}x+b)$ , obtaining a real-valued mapping that satisfies the condition $E[z_{\tau}(x)z_{\tau}(y)]=k(x,y)$ , where $\tau$ is drawn from $p(\tau)$ and $b$ is uniformly distributed on $[0,2\pi]$ . The variance of the kernel estimate can be reduced by concatenating $N$ randomly chosen $z_{\tau}$ into one $N$ -dimensional normalized vector, i.e., the inner product $z(x)^{\top}z(y)=\frac{1}{N}\sum_{j=1}^{N}z_{\tau_{j}}(x)z_{\tau_{j}}(y)$ is a low-variance approximation to the expectation of Equation 15 (the proof can be found in [21]). To summarize, the random Fourier feature algorithm starts by defining a randomized feature map $z(x)\colon\mathcal{R}^{n}\rightarrow\mathcal{R}^{N}$ , so that $z(x)^{\top}z(y)\approx k(x-y)$ . The second step is to compute $p$ of the kernel, which in this case is the Fourier transform of $k$ :

[TABLE]

The third step is to draw $N$ independent and identically distributed (iid) data points $\tau_{1},...,\tau_{N}\in\mathcal{R}^{n}$ from $p$ and $N$ iid data points $b_{1},...,b_{N}\in\mathcal{R}$ from the uniform distribution on $[0,2\pi]$ . Finally, $z(\bm{x})$ is computed as:

[TABLE]

The scalar $\sigma^{2}_{p}$ is equal to the trace of the Hessian of $k$ at 0, which quantifies the curvature of the kernel at the origin. For a Gaussian kernel denoted as $k(\bm{x_{1}},\bm{x_{2}})=\exp(-\gamma||\bm{x_{1}}-\bm{x_{2}}||^{2})$ , we have $\sigma^{2}_{p}=2n\gamma$ , which approximates the kernel as:

[TABLE]

The important implications of using this kernel approximation in our incremental approach are: (i) we approximate the non-linear model accuracy with a linear model; (ii) it is faster to calculate the approximate kernel than the regular one; (iii) and mainly, we increase the model only in one dimension, so we do not need to recalculate the kernel approximation of the previous data.
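The steps of the random Fourier feature construction can be sketched for the Gaussian kernel as follows. This is a minimal illustration rather than the implementation used in the experiments: the function name `rff_map`, the toy data, and the chosen sizes are assumptions, and the sampling distribution $\mathcal{N}(0,2\gamma I)$ is the Fourier transform of $\exp(-\gamma||\bm{x_{1}}-\bm{x_{2}}||^{2})$ , consistent with $\sigma^{2}_{p}=2n\gamma$ .

```python
import numpy as np

def rff_map(X, n_features, gamma, rng):
    """Random Fourier features approximating exp(-gamma * ||x - y||^2)."""
    n = X.shape[1]
    # For the Gaussian kernel, p(tau) is a normal distribution N(0, 2*gamma*I).
    tau = rng.normal(scale=np.sqrt(2.0 * gamma), size=(n, n_features))
    # Phases are drawn uniformly on [0, 2*pi].
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    # z(x) = sqrt(2/N) * cos(tau^T x + b), so that z(x)^T z(y) ~ k(x - y).
    return np.sqrt(2.0 / n_features) * np.cos(X @ tau + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
gamma = 0.1
Z = rff_map(X, n_features=4000, gamma=gamma, rng=rng)

# Compare the linear kernel on Z with the exact Gaussian kernel on X.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_exact = np.exp(-gamma * sq_dists)
max_err = np.abs(Z @ Z.T - K_exact).max()
```

Because $\tau$ and $b$ are drawn once and then kept fixed, a data point that arrives later is mapped independently of the points already transformed, which is what makes the approximation convenient for the incremental setting.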

3 The Fuzzy Bounded Twin SVM (FBTWSVM)

We propose a formulation based on the TWSVM [7] and inspired by the FRTSVM [14, 25] and TBSVM [11] properties. The fuzzy formulation incorporated by our method (Equation 10) is inspired by the TWSVM (Equations 2 and 3), while the adherence to the structural risk minimization principle is incorporated similarly to the TBSVM [11]. To maintain such an adherence, we need to guarantee the existence of the dual formulation inverse matrix, which circumvents the drawback of the standard TWSVM (i.e., the standard TWSVM only adheres to the empirical risk minimization problem in the dual problem). Thus, we define the FBTWSVM primal formulation as:

[TABLE]

where $C_{1}$ , $C_{2}$ , $C_{3}$ , and $C_{4}$ are the trade-off parameters between the margin and the complexity for weighting the regularization, $\bm{s_{+}}$$\in$$R^{l_{+}}$ and $\bm{s_{-}}$$\in$$R^{l_{-}}$ are the fuzzy number vectors sequentially associated with the positive and negative input data points, which introduce the desired robustness in the weighted regularized model [14, 25]. The additional $b_{+}$ and $b_{-}$ in Equations 19 and 20 minimize the structural risk.

The two hyperplanes in $\mathcal{R}^{n}$ are defined as $\bm{x^{\top}}\bm{\omega_{\pm}}+b_{\pm}=0$ , and since the TWSVM has two proximal decision functions, two margin terms $1/||\bm{\omega_{\pm}}||$ are defined for the proximal decision functions [14]. The margin between two classes can be measured by the distance between the proximal hyperplane $\bm{x^{\top}}\bm{\omega_{+}}+b_{+}=0$ and the bounding hyperplane $\bm{x^{\top}}\bm{\omega_{+}}+b_{+}=-1$ . This distance is $1/||\bm{\omega_{+}}||$ , and it is the one-sided margin between the two classes with respect to the hyperplane $\bm{x^{\top}}\bm{\omega_{+}}+b_{+}=0$ [11, 26]. The process is analogous for the other hyperplane. To obtain the solutions of Equations 19 and 20, we need to write their dual problems. We start by taking the Lagrangian of Equation 19 to obtain the Wolfe dual:

[TABLE]

where $\bm{\alpha}$$=$$(\alpha_{1},\dots,\alpha_{X_{+}})^{\top}$ , and $\bm{\eta}$$=$$(\eta_{1},\dots,\eta_{X_{+}})^{\top}$ are the Lagrange multiplier vectors. Considering that Equation 19 represents a convex optimization problem, the Karush-Kuhn-Tucker (KKT) optimality conditions are both necessary and sufficient, and they are written as:

[TABLE]

Considering that $\eta\geq 0$ and $\alpha\geq 0$ from Equation 22f, and using Equation 22c, we know that $\alpha$ is bounded as $0\leq\alpha\leq C_{3}s_{-}$ . Summing Equations 22a and 22b, and using Equations 22c to 22f for simplification, we obtain:

[TABLE]

Defining $H_{+}=[X_{+},\bm{e_{+}}]$ , $H_{-}=[X_{-},\bm{e_{-}}]$ , $\bm{u_{+}}=[\bm{\omega_{+}},b_{+}]$ and $\bm{u_{-}}=[\bm{\omega_{-}},b_{-}]$ (one for each class), we can rewrite Equation 23 as:

[TABLE]

Using our notation, the Wolfe dual is defined as:

[TABLE]

Using the KKT conditions (from Equations 22a to 22f) and Equation 24, the Wolfe dual of Equations 19 and 20 can be written as:

[TABLE]

where $I_{1}$ and $I_{2}$ are identity matrices. The matrices $(H^{\top}_{+}H_{+}+C_{1}I_{1})$ and $(H^{\top}_{-}H_{-}+C_{2}I_{2})$ from Equations 26 and 27 are non-singular by construction, so their inverses are guaranteed to exist, which provides the adherence to the structural risk minimization principle [11, 26]. Notice that the dual for Equation 20 can be obtained in an analogous way. By solving the duals (Equations 26 and 27), we obtain the optimal solutions $\bm{\alpha^{*}}$ and $\bm{\nu^{*}}$ , and from them the corresponding per-class vectors $\bm{u^{*}_{\pm}}$ (as defined in Equation 24) and the non-parallel hyperplanes. The duals of Equations 26 and 27 relate to the primal problems (Equations 19 and 20) as:

[TABLE]

Finally, for a test data point $\bm{x}\in\mathcal{R}^{n}$ , the classification decision function is given by Equation 4.

3.1 The Non-linear FBTWSVM

In the non-linear FBTWSVM, the input data points $x\in\mathcal{R}^{n}$ are mapped to a high-dimensional space $\mathcal{H}$ through $\varphi(x)$ . The kernel function $\kappa(\cdot,\cdot)$ calculates implicitly the dot product of a pair of transformations, which is applied as $\kappa(x_{1},x_{2})=\langle\varphi(x_{1}),\varphi(x_{2})\rangle$ . The non-linear dual proximal hyperplanes are:

[TABLE]

and the primal problems used to obtain the dual proximal hyperplanes are:

[TABLE]

The dual forms of Equation 30 and 31 are:

[TABLE]

where $S_{+}=[\kappa(X_{+},X^{\top}),\bm{e_{+}}]$ and $S_{-}=[\kappa(X_{-},X^{\top}),\bm{e_{-}}]$ . The solutions of the primal problems of Equations 30 and 31 are $\bm{\upsilon^{*}_{\pm}}=[\bm{\omega^{*\top}_{\pm}},b^{*}_{\pm}]^{\top}$ , and the parametric relationships between the optimal $\bm{\upsilon^{*}_{\pm}}$ and the optimal solutions $\bm{\alpha^{*}}$ and $\bm{\nu^{*}}$ of the dual forms of Equations 32 and 33 are:

[TABLE]

Once Equations 32 and 33 are solved to obtain the hyperplanes (Equation 29), a new data point $\bm{x}\in\mathcal{R}^{n}$ can be classified in a similar manner to the linear case by Equation 4.

3.2 Solving The FBTWSVM

We use the dual coordinate descent (DCD) method [28] to solve the dual problem of the FBTWSVM [29]. The DCD leads to fast training by updating one variable at a time through a single-variable sub-problem minimization. Such fast training allows the processing of large and incremental datasets [29]. The dual problems of Equations 26 and 27 and Equations 32 and 33 are solved in the same way, so for convenience we only present the solution of Equation 26. We start by considering $Q=H_{-}(H^{\top}_{+}H_{+}+C_{1}I_{1})^{-1}H_{-}^{\top}$ and $Q^{\prime}=(H^{\top}_{+}H_{+}+C_{1}I_{1})^{-1}H_{-}^{\top}$ . Consequently, $Q=H_{-}Q^{\prime}$ , where $q_{ii}$ and $\overline{Q}$ can be pre-computed and stored if necessary. The matrix inversion is calculated with the Sherman-Morrison-Woodbury formula. Let $\bm{\alpha^{k,i}}=[\alpha^{k+1,i}_{1},\dots,\alpha^{k+1,i}_{i-1},\alpha^{k,i}_{i},\dots,\alpha^{k,1}_{X_{-}+1}]$ , where $i=(1,\dots,X_{-}+1)$ is the index for the data points and $k=(-1,+1)$ is the data label. We use the following problem to update from $\bm{\alpha^{k,i}}$ to $\bm{\alpha^{k,i+1}}$ :

[TABLE]

where $\bm{e_{i}}=[0,\dots,0,1,0,\dots,0]^{\top}$ (the $i$ -th position is 1), and $\bm{d_{i}}$ is an optimum solution to the problem of minimizing $f(\bm{\alpha^{k,i}}+d\bm{e_{i}})$ subject to $\bm{d_{i}}\in\mathcal{R}^{n}$ , i.e., $f(\bm{\alpha^{k,i}}+d\bm{e_{i}})$ achieves a minimum at $\bm{d_{i}}$ only if $\nabla f(\bm{\alpha^{k,i}}+d\bm{e_{i}})^{\top}\bm{e_{i}}=\nabla f(\bm{\alpha^{k,i+1}})^{\top}\bm{e_{i}}=0$ (the proof can be found in [28]). The objective function of Equation 35 is a quadratic function of $d$ :

[TABLE]

where $\nabla_{i}f$ is the $i$ -th component of the gradient $\nabla f$ . Equation 35 has an optimum at $d=0$ iff:

[TABLE]

where $\nabla^{P}_{i}f(\alpha)$ is the projected gradient which is defined as:

[TABLE]

If Equation 37 is satisfied, we can move to the next iteration ( $i+1$ ) without updating $\bm{\alpha^{k,i}_{i}}$ in $X_{-}$ , i.e., we only update $\bm{\alpha^{k,i}_{i}}$ to temporarily meet the optimal solution of Equation 35. The optimum of Equation 36 is reached by introducing the Lipschitz continuity:

[TABLE]

In the update of Equation 39, $Q^{\prime}_{i,i}$ can be pre-calculated by $Q^{\prime}_{ii}=H_{-i}Q_{i}$ , and $\nabla_{i}f(\bm{\alpha^{k,i}})$ can be obtained by:

[TABLE]

The computation of Equation 40 is approximated as $O(X_{-}\overline{l})$ , where $\overline{l}$ is the average count of non-zero elements in $Q^{\prime}$ per data point. To reduce the number of operations, we can alternatively compute Equation 40 as:

[TABLE]

with a pre-defined $\bm{u_{+}}=-Q\alpha$ , where $i$ indexes the row of the matrix $H_{-}$ , so the number of operations is $O(\overline{n})$ . To maintain $\bm{u_{+}}$ throughout the coordinate descent procedure, we use:

[TABLE]

The complexity to maintain $\bm{u_{+}}$ iteratively is $O(\overline{l})$ . Starting with $\alpha^{0}=0$ , the optimal solution of $\bm{u_{+}}$ , and consequently the optimal solution of Equation 26, is obtained by iteratively applying the update of Equation 42. The cost per iteration for the whole process is $O(X_{-}\overline{n})$ , and the memory requirement is the size of $H_{-}$ and $Q^{\prime}$ .
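The coordinate-wise procedure of this section can be sketched as below. This is an illustrative sketch under stated assumptions, not the reference implementation: we assume the dual has the bound-constrained quadratic form $\min_{\alpha}\frac{1}{2}\alpha^{\top}Q\alpha-\bm{e}^{\top}\alpha$ with $0\leq\alpha_{i}\leq C_{3}s_{i}$ , we keep the auxiliary vector as $\bm{u}=-Q^{\prime}\alpha$ , and we replace the Sherman-Morrison-Woodbury update by a plain linear solve for brevity. The name `dcd_solve` and the toy data are hypothetical.

```python
import numpy as np

def dcd_solve(H_pos, H_neg, s_neg, C1, C3, n_epochs=200, tol=1e-6):
    """Dual coordinate descent for min 0.5*a'Qa - e'a  s.t.  0 <= a <= C3*s_neg."""
    dim = H_pos.shape[1]
    # Q' = (H+^T H+ + C1*I)^{-1} H-^T (direct solve instead of Woodbury).
    Q_prime = np.linalg.solve(H_pos.T @ H_pos + C1 * np.eye(dim), H_neg.T)
    q_diag = np.einsum("ij,ji->i", H_neg, Q_prime)   # diagonal of Q = H- Q'
    upper = C3 * s_neg
    alpha = np.zeros(H_neg.shape[0])
    u = np.zeros(dim)                                # u = -Q' alpha, kept in sync
    for _ in range(n_epochs):
        max_pg = 0.0
        for i in range(alpha.size):
            grad = -H_neg[i] @ u - 1.0               # i-th gradient component
            # Projected gradient: ignore directions that would leave the box.
            if alpha[i] <= 0.0:
                pg = min(grad, 0.0)
            elif alpha[i] >= upper[i]:
                pg = max(grad, 0.0)
            else:
                pg = grad
            max_pg = max(max_pg, abs(pg))
            if pg != 0.0:
                # Exact one-variable minimiser, clipped to [0, C3*s_i].
                new = min(max(alpha[i] - grad / q_diag[i], 0.0), upper[i])
                u -= (new - alpha[i]) * Q_prime[:, i]
                alpha[i] = new
        if max_pg < tol:
            break
    return alpha, u

# Toy problem: two Gaussian blobs, all fuzzy numbers set to 1.
rng = np.random.default_rng(1)
H_pos = np.hstack([rng.normal(loc=1.0, size=(30, 2)), np.ones((30, 1))])
H_neg = np.hstack([rng.normal(loc=-1.0, size=(25, 2)), np.ones((25, 1))])
alpha, u_pos = dcd_solve(H_pos, H_neg, s_neg=np.ones(25), C1=1.0, C3=1.0)
```

Each inner step costs one dot product with $\bm{u}$ plus one vector update, so the full matrix $Q$ is never formed; only its diagonal and $Q^{\prime}$ are stored.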

3.3 Implementation

The dual problem of Equation 26 has the constraint $0\leq\alpha_{i}\leq C_{3}\bm{s_{-}}$ , and if $\alpha_{i}$ is either $0$ or $C_{3}\bm{s_{-}}$ , it may achieve a steady state. Considering that our formulation produces many bounded Lagrange multipliers, we apply the shrinking technique proposed in [30] to reduce the size of the optimization problem by leaving out some bounded variables. Considering $Z$ as a subset of $X$ after removing all data points that have non-bounded Lagrange multipliers, and $\overline{Z}=\{1,\dots,X_{-}\}\setminus Z$ its complement subset, the dual of Equation 26 can be represented by a smaller problem that consumes less time and memory:

[TABLE]

where $Q_{ZZ}$ and $Q_{Z\overline{Z}}$ are sub-matrices of $Z$ and $\bm{\alpha}_{\overline{Z}}$ is a vector of Lagrangian multipliers. To solve Equation 43, we compute $\nabla_{i}f(\bm{\alpha})$ as:

[TABLE]

If $i\in Z$ , and defining $\bm{u_{1}}$ as:

[TABLE]

we have $\nabla_{i}f(\bm{\alpha})=H_{-}\bm{u_{1}}-1$ , which makes $\nabla_{i}f(\bm{\alpha})$ easy to obtain. For a linear kernel we only need to update ( $\overline{Q}_{i\in Z}\bm{\alpha}_{i\in Z}$ ), and we do not need to reconstruct all $\nabla f(\bm{\alpha})$ to implement the shrink step (the proof can be found in [31]). Considering the projected gradient $\nabla^{P}f(\bm{\alpha})$ defined in Equation 38, and following the optimality condition of bound-constrained problems, $\bm{\alpha}$ is optimal iff $\nabla^{P}f(\bm{\alpha})=0$ . During the iteration procedure, the inequality $\nabla^{P}f(\bm{\alpha})\neq 0$ means either $\max_{j}\nabla^{P}_{j}f(\bm{\alpha})>0$ or $\min_{j}\nabla^{P}_{j}f(\bm{\alpha})<0$ , and at step $k-1$ we obtain $m_{\max}^{k-1}\equiv\max_{j}\nabla^{P}_{j}f(\bm{\alpha})$ and $m_{\min}^{k-1}\equiv\min_{j}\nabla^{P}_{j}f(\bm{\alpha})$ . In this way, at each inner step of the $k$ -th iteration, and before updating $\bm{\alpha}$ , the element is shrunk if one of the two conditions holds:

[TABLE]

where ${m^{\prime}}_{\max}^{k-1}$ must be strictly positive and ${m^{\prime}}_{\min}^{k-1}$ must be strictly negative, and they are defined as:

[TABLE]

Next, we multiply both ${m^{\prime}}_{\max}^{k-1}$ and ${m^{\prime}}_{\min}^{k-1}$ by a shrinking rate smaller than one. A tolerance $\epsilon$ indicates whether optimality is reached after a finite number of iterations, and it is therefore used as a valid stopping criterion:

[TABLE]

If in the $k$ -th iteration the condition stated in Equation 49 is satisfied for Equation 43, we can enlarge the active set $Z$ to $\{1,\dots,X_{-}+1\}$ , set ${m^{\prime}}_{\max}^{k}=+\infty$ and ${m^{\prime}}_{\min}^{k}=-\infty$ , and continue with the regular iterations. We store the previous values of ${m^{\prime}}_{\max}$ and ${m^{\prime}}_{\min}$ during the DCD process to avoid recalculating them during the incremental step. Therefore, the shrinking technique is a key step to avoid calculating and storing all training data during the training phase. Our method processes one class at a time; however, the inner processing can be done in parallel, where one input is assigned to an available processor to calculate the membership followed by the Lagrangian multiplier. We present the pseudo-code of the FBTWSVM training algorithm (Algorithm 1) for the positive class.
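The shrinking test and the stopping rule can be sketched with three small helpers. The exact conditions of Equations 46 to 49 are not reproduced here, so this sketch assumes the rule commonly used for bound-constrained dual coordinate descent: a variable sitting at its lower bound is shrunk when its gradient exceeds the positive threshold, a variable at its upper bound is shrunk when its gradient is below the negative threshold, and training stops when the gap between the extreme projected gradients falls under $\epsilon$ . All function names are hypothetical.

```python
import math

def shrink_bounds(m_max_prev, m_min_prev):
    """Thresholds m'_max and m'_min built from the previous pass extrema."""
    # The upper threshold must be strictly positive, the lower strictly negative;
    # otherwise shrinking is disabled for that side by using +/- infinity.
    m_max_bar = m_max_prev if m_max_prev > 0 else math.inf
    m_min_bar = m_min_prev if m_min_prev < 0 else -math.inf
    return m_max_bar, m_min_bar

def should_shrink(alpha_i, grad_i, upper_i, m_max_bar, m_min_bar):
    """True if variable i looks stuck at a bound and can leave the active set."""
    at_lower = alpha_i == 0.0 and grad_i > m_max_bar
    at_upper = alpha_i == upper_i and grad_i < m_min_bar
    return at_lower or at_upper

def converged(m_max, m_min, eps=1e-3):
    """Stopping rule: the projected-gradient range is within the tolerance."""
    return m_max - m_min < eps
```

With infinite thresholds no variable is ever shrunk, which corresponds to the step above where the active set is enlarged again and the regular iterations resume.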

3.4 The Multiclass FBTWSVM

The FBTWSVM is based on the TWSVM foundations, which consider only binary problems. Yet, we can extend the FBTWSVM to multiple classes by building and combining several binary classifiers instead of considering all data in one optimization formula [32]. The multiclass FBTWSVM is based on the Decision Directed Acyclic Graph (DDAG), which achieves better accuracy while requiring less training time than other multiclass approaches [8, 33]. The DAG-based multiclass classifier was originally proposed by Platt et al. [33] for the multiclass SVM approach, and later introduced by Chen and Ji [34] into the twin approach as the Optimal DAG for the Least Squares Twin SVM.

In the multiclass approach based on the DAG topology, for a $u$ -class classification problem, there are $u(u-1)/2$ sub-classifier nodes divided into $u-1$ layers. During the classification process, there is no need to combine all sub-classifiers: assigning a class to a test data point takes $u-1$ decisions. The classification process starts at the root node, located in the first layer, which includes all possible classification labels (node 1v4 in Figure 3). Each sub-classifier decision eliminates one category, namely the one rejected by that sub-classifier. For example, consider a 4-class problem with a test data point with label $y_{i}=4$ and the topology presented in Figure 3. The root node sub-classifier eliminates the possibility of $y_{i}=1$ , following the Not 1 line. The next sub-classifier eliminates the possibility of $y_{i}=2$ , following the Not 2 line, and the last sub-classifier eliminates the possibility of $y_{i}=3$ , assigning class 4 to the test data point.
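The elimination walk through the DAG can be sketched with a list of candidate classes that shrinks by one at every decision. This is a minimal sketch: `ddag_predict` and the `decide` callback are hypothetical names, and the toy `nearest` rule merely stands in for a trained binary sub-classifier.

```python
def ddag_predict(x, classes, decide):
    """Walk the decision DAG; decide(x, a, b) returns the winning label of a-vs-b."""
    remaining = list(classes)
    n_decisions = 0
    while len(remaining) > 1:
        # Each node compares the first and the last remaining candidates.
        a, b = remaining[0], remaining[-1]
        winner = decide(x, a, b)
        if winner == a:
            remaining.pop()       # follow the "not b" edge
        else:
            remaining.pop(0)      # follow the "not a" edge
        n_decisions += 1
    return remaining[0], n_decisions

def nearest(x, a, b):
    # Toy stand-in for a binary sub-classifier: classes sit at 1, 2, 3, 4 on a line.
    return a if abs(x - a) <= abs(x - b) else b

label, n_decisions = ddag_predict(3.9, [1, 2, 3, 4], nearest)
```

For the toy point the walk visits the nodes 1v4, 2v4, and 3v4, eliminating classes 1, 2, and 3 in turn, so it returns class 4 after $u-1=3$ decisions, mirroring the example of Figure 3.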

4 Incremental and Decremental FBTWSVM

The FBTWSVM can integrate new data points continuously into the existing model without fully reconstructing it. Besides that, it can be trained fast due to the formulation and the solver choice, and it generalizes well like a conventional SVM. These characteristics make it suitable for incremental learning applications.

The incremental FBTWSVM is based on the shrinking heuristic, which can grow the current model considering the fuzzy information of new values. We update the model by selecting only new data points that fall outside the minimum ( ${m^{\prime}}_{\min}$ ) and maximum ( ${m^{\prime}}_{\max}$ ) values of the projected gradient from the previous training step. Therefore, we do not need to process all the new incoming data points. Consider a set of new data points $X_{\text{new}}$ , with the subsets ${X^{\prime}}_{\text{new+}}$ and ${X^{\prime}}_{\text{new-}}$ denoting the positive and negative labeled data, respectively. Both subsets need not exist, and here we consider that $X_{\text{new}}={X^{\prime}}_{\text{new+}}$ to simplify the notation. We evaluate the projected gradient of the new set of data points as:

[TABLE]

This operation keeps $u_{+}$ in the coordinate descent procedure of Equation 41. We set a new heuristic rule based on Equation 46 to select only new data points that are more likely to become SVs. We consider the new data points as SVs if the projected gradient values are bounded by ${m^{\prime}}_{\min}<\nabla_{i}f(\bm{\alpha_{\text{new}}})<{m^{\prime}}_{\max}$ . As our method adheres to the structural risk minimization principle, all Lagrangian multipliers can be interpreted as SVs, and to make the evaluation of Equation 50 more permissive, we can replace the $\max$ and $\min$ operators by the median, mean, or upper and lower quartiles. Figure 4 depicts four new data points (in green and numbered), two of each class. A new data point must have a projected gradient outside the bounds of the respective model to be considered in the incremental procedure. For instance, circle 1 has a projected gradient lower than ${m^{\prime}}_{\min}$ of the class +1 model, and circle 2 has a projected gradient greater than ${m^{\prime}}_{\max}$ of the class +1 model. Cross 3 has a projected gradient greater than ${m^{\prime}}_{\max}$ of the class -1 model, so it is not discarded, but cross 4 is bounded by the ${m^{\prime}}_{\min}$ and ${m^{\prime}}_{\max}$ of the class -1 model, so it is discarded. New data points that have a projected gradient lower than ${m^{\prime}}_{\min}$ should affect the model shape and placement regarding only their own class, while new data points that have a projected gradient greater than ${m^{\prime}}_{\max}$ affect the hyperplane placement regarding the opposite class.

We calculate the membership (as presented in Section 2.2) of each new data point from $X_{\text{new}}$ that falls outside the projected gradient bounds, where $X_{\text{over}}$ is the data matrix of the points that fall outside the bounds. Then, we start a new training iteration $k\rightarrow k+1$ to update the model by enlarging the active set with $X_{\text{over}}$ . Algorithm 2 presents the pseudo-code for the incremental procedure.
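The selection of $X_{\text{over}}$ can be sketched as a simple filter on the gradients of the new points. The sketch takes the gradient values as given, since the expression of Equation 50 is not reproduced here; the helper names `gradient_bounds` and `select_incremental`, the `mode` switch, and the toy values are assumptions made for illustration.

```python
import numpy as np

def gradient_bounds(pg_history, mode="extreme"):
    """Bounds (m_min, m_max) from the projected gradients of the last training step."""
    pg = np.asarray(pg_history, dtype=float)
    if mode == "quartile":
        # More permissive variant: lower and upper quartiles instead of min and max.
        return float(np.quantile(pg, 0.25)), float(np.quantile(pg, 0.75))
    return float(pg.min()), float(pg.max())

def select_incremental(grad_new, m_min, m_max):
    """Boolean mask of the new points whose gradient falls outside [m_min, m_max]."""
    grad_new = np.asarray(grad_new, dtype=float)
    return (grad_new < m_min) | (grad_new > m_max)

# Hypothetical gradients of four incoming points.
grad_new = np.array([-0.9, 0.7, 0.1, -0.2])
mask = select_incremental(grad_new, m_min=-0.5, m_max=0.5)
```

Only the points flagged in `mask` would have their fuzzy membership computed and be appended to the active set; the remaining points are discarded without further processing.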

The incremental procedure adds $X_{\text{over}}$ data to the model at each iteration, remembering that we need to calculate beforehand the membership value to each data point in $X_{\text{over}}$ , which increases the processing time. In the worst case, we have $X_{\text{over}}$ = $X_{\text{new}}$ , so the model dimension grows linearly with the number of new data points, as well as the processing time increases at each new training iteration. To avoid the continuous growth of the model dimension caused by the incremental procedure, we introduce a decremental procedure to control the model dimension by removing data that has low or no interference in the model accuracy. The proposed decremental procedure is also based on the shrinking technique, where the SVs that have both Lagrangian multipliers smaller than a threshold $(\phi)$ after $(d)$ occurrences are removed. The decremental procedure is executed before each incremental training (except for the first training). We use the vector $Z_{re}$ = $[0_{1},\dots,0_{q}]$ (initially all points are assigned to zero) to keep track of the number of occurrences per input data point.

Considering the current active set $Z$ (without non-bounded Lagrangian multipliers), for each training data point there are two sets of Lagrangian multipliers, $Z_{\alpha}=\{\alpha_{1},\dots,\alpha_{q}\}$ and $Z_{\nu}=\{\nu_{1},\dots,\nu_{q}\}$ , where $q$ is the number of Lagrangian multipliers. After each training iteration $k\rightarrow k+1$ , we increment in vector $Z_{re}$ the entry (at the corresponding position) of every input for which both $\alpha_{m}<\phi$ and $\nu_{m}<\phi$ . When the number of occurrences of an input reaches $d$ , we remove that input and its related data, so they will not be used in the next incremental training. Algorithm 3 presents the pseudo-code for the decremental procedure.
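The scored window can be sketched as a counter update followed by a mask. This is an illustration under one assumption that the text leaves open: the counters are accumulated and never reset when a multiplier rises above $\phi$ again. The function name `decremental_step` and the toy values are hypothetical.

```python
import numpy as np

def decremental_step(alpha, nu, counts, phi, d):
    """Update the occurrence counters Z_re and return (keep_mask, new_counts)."""
    # A point scores one occurrence when both of its multipliers are below phi.
    low = (alpha < phi) & (nu < phi)
    new_counts = counts + low.astype(int)
    # Points that reached d occurrences are forgotten before the next training.
    keep = new_counts < d
    return keep, new_counts

alpha = np.array([0.0, 0.5, 0.001])
nu = np.array([0.0, 0.0, 0.3])
counts = np.array([1, 0, 0])          # Z_re carried over from earlier iterations
keep, counts = decremental_step(alpha, nu, counts, phi=0.01, d=2)
```

In the toy call only the first point has both multipliers under the threshold; since it had already scored once, it reaches $d=2$ and is removed, while the other two points are kept.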

5 Experimental Results

In this section, we present the experimental protocol used to evaluate the FBTWSVM on benchmarking datasets (all tests were performed on a machine running Ubuntu 16.04 LTS with an Intel Core i7-7700HQ CPU @ 2.80GHz and 16,144 MB of RAM). For comparison purposes, we have used an experimental protocol similar to Losing et al. [2], which compares a broad range of state-of-the-art on-line classification algorithms, namely: ISVM with RBF kernel, LASVM with RBF kernel, On-line Random Forest (ORF) [35], Incremental Learning Vector Quantization (ILVQ) [36], Learn++ [37], Incremental Extreme Learning Machine (IELM) [38], Naive Bayes [39], and Stochastic Gradient Descent (SGD). However, we have restricted the evaluation to the methods that led to the best accuracy in the on-line learning experiments for at least one of the datasets, which are the ISVM, LASVM, ORF, and ILVQ [2]. The ORF [35] is an incremental Random Forest algorithm that grows continuously from a pre-defined number of trees by adding splits whenever enough data points are gathered within one leaf. It uses Extreme Random Trees [40] to optimize the split, using a pre-defined number of random values. The ILVQ is a dynamic growth model derived from the static Generalized Learning Vector Quantization [36], where the insertion rate is guided by the number of misclassified data points. The ISVM and the LASVM were already described in Section 3.

The implementation used for comparison is from [41], which introduces a prototype placement strategy to minimize the loss of a sliding window of recent data points. The experimental procedure of Losing et al. [2] for on-line methods uses a window/chunk size from 500 to 2,000 and sets all hyper-parameters using the Hyperopt library [42] with the Tree-of-Parzen-Estimators [43] search algorithm, in which each parameter is individually adjusted within 250 iterations of a 3-fold CV using only the training data. We have carried out all our experiments with the FBTWSVM using the approximated RBF kernel described in Section 2.3, which enables the use of our linear formulation (Equations 19 and 20). We optimized the model hyper-parameters using grid search with a 3-fold cross-validation on the training set, and we set the kernel approximation size following the strategy proposed by Rahimi and Recht [21].

Considering that we do not need to process all data points to obtain a model, the use of batches accelerates the training phase. We therefore use different batch sizes, with the constraints that a batch must encompass at least 5% of the data points of the fold and that the batch used in the first training must contain at least one element from each class. We evaluated the FBTWSVM with six different forgetting window sizes empirically defined as $\varphi=\{1,2,4,10\}$ , and without the decremental procedure. We used publicly available datasets without any preprocessing; all attributes are numerical, either integer or real values. The pre-defined train-test splits were used when available. Otherwise, we adopted a stratified train-test split of 70-30%. Besides that, we have also created 15 synthetic datasets [44, 45] to evaluate the scalability of the proposed method, as well as a very large dataset of 23M samples [46]. However, for such datasets, we have compared the FBTWSVM only with other SVM-based methods. We have used the following streaming generators with 10% of noise added: (i) the LED generator [47] yields instances with 24 Boolean features, of which 7 correspond to the segments of a seven-segment LED display and the other 17 are irrelevant; (ii) the SEA generator [48] generates streams from two relevant continuous attributes $f_{1},f_{2}$ and an irrelevant $f_{3}$ , with values ranging between 0 and 10; (iii) the Random Tree Generator (RTG) [49] builds a decision tree by randomly selecting attributes as split nodes and assigning random classes to each leaf. The number of values per nominal is set to 5, the max tree depth is 3, the first leaf value is 3, and the leaf fraction is 0.15; (iv) the Radial Basis Function (RBF) generator creates 50 centroids at random positions and associates them with a standard deviation value, a weight, and a class label. New instances are then generated by offsetting a centroid in a randomly chosen direction, which forms a Gaussian distribution according to the standard deviation associated with the given centroid; (v) HYPER [49] generates instances that are separable by a hyperplane. We use a sigma percentage of 10%, with no magnitude change and no drifting attributes. We have created three datasets from each streaming generator with 10,000, 100,000, and 1,000,000 training instances and 3,000, 30,000, and 300,000 testing instances, respectively. The focus of our evaluation is on incremental learning considering different key properties (such as the number of classes, instances, and dimensions), even though we can use the FBTWSVM in offline mode. The datasets (all datasets and algorithms are available at https://github.com/areeberg/FBTSVM) encompass generated, artificial and real-world problems with different numbers of classes (from 2 to 100), data points (from 2,586 to 23M) and attributes (from 2 to 5,000), as shown in Table 1. Although the largest dataset has roughly 21 million instances, the proposed system does not specifically target learning from big data.
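As an illustration of how such a synthetic stream can be produced, the sketch below mimics the SEA setting with three attributes in $[0,10]$ , of which only $f_{1}$ and $f_{2}$ are relevant, and 10% label noise. It is not the generator used in the experiments: the labelling rule $f_{1}+f_{2}\leq\theta$ and the threshold value $\theta=8$ are assumptions based on the usual description of the SEA concepts, and the function name `sea_stream` is hypothetical.

```python
import numpy as np

def sea_stream(n_samples, threshold=8.0, noise=0.10, seed=0):
    """SEA-like stream: three uniform attributes in [0, 10), the third irrelevant."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 10.0, size=(n_samples, 3))
    # Assumed concept: class 1 when f1 + f2 does not exceed the threshold.
    y = (X[:, 0] + X[:, 1] <= threshold).astype(int)
    # Flip a random fraction of the labels to add class noise.
    flip = rng.random(n_samples) < noise
    y[flip] = 1 - y[flip]
    return X, y

X, y = sea_stream(10_000)
```

Feeding the rows of `X` to an incremental learner in consecutive batches reproduces the kind of chunked evaluation described above.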

Table 1 also shows the parameter setting used for each dataset, which was defined in a 3-fold CV; "Number of Points" stands for the initial training set size. For all datasets we used a fixed fuzzy parameter $\mu=0.1$ . Using the 4-D case ( $C_{1},C_{2},C_{3},C_{4}$ are independent variables) for the hyperparameter tuning may result in a model with a better generalization performance, i.e., the loss function may achieve a lower value during the model selection compared to the 2-D case (where we assume $C_{1}=C_{3}$ and $C_{2}=C_{4}$ ). However, performing the hyperparameter tuning in a 2-D space may substantially decrease the number of function evaluations needed, especially given that grid search is essentially a brute-force search strategy that takes a long time. Other model selection strategies can speed up the hyperparameter tuning, but they are out of the scope of this paper [50]. In many cases, using the 2-D space instead of the 4-D space is a valid heuristic to decrease the number of function evaluations needed. Using the Overlap dataset as an example, Figure 5 shows that the 2-D space requires 34 function evaluations to achieve an accuracy loss value of 0.1732, while the 4-D space requires 5,143 function evaluations to achieve 0.1703.
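The effect of tying the parameters on the size of an exhaustive grid can be illustrated with a short enumeration. The candidate values below are hypothetical and are not the grid used in our experiments; the counts reported for the Overlap dataset come from the search of Figure 5, not from this toy grid.

```python
from itertools import product

# Nine hypothetical candidate values for every trade-off parameter.
C_grid = [2.0 ** k for k in range(-4, 5)]

# 4-D search: C1, C2, C3, and C4 vary independently.
grid_4d = list(product(C_grid, repeat=4))

# 2-D search: C1 = C3 and C2 = C4, written as (C1, C2, C3, C4) tuples.
grid_2d = [(c1, c2, c1, c2) for c1, c2 in product(C_grid, repeat=2)]
```

With $g$ candidates per parameter, the exhaustive grid shrinks from $g^{4}$ to $g^{2}$ configurations (6,561 against 81 for $g=9$ ), and every 2-D configuration is also a point of the 4-D grid.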

Table 2 presents the accuracy of the incremental and decremental FBTWSVM against the best on-line algorithms reported in [2]. The FBTWSVM achieved equal or better results in 8 out of 11 datasets (from Border to Gisette, excluding the generated datasets) relative to the best on-line algorithms. The SUSY dataset contains a significant amount of data, so to train the FBTWSVM we limited the size of the kernel approximation according to the available memory, which also reduces the accuracy. For comparison, neither the ISVM nor the LASVM with an RBF kernel could be trained on this dataset, due to the uncontrolled growth of the kernel matrix. We ran an experiment using the full SUSY training set with a kernel approximation size of 600, resulting in an accuracy of 77.67%. Outdoor is a visual dataset that consists of objects recorded outdoors under lighting conditions [41]. The dataset creation method caused a difference between training and test data [2], which is reflected in the performance of the learning algorithms. On-line algorithms with an adaptive learning mechanism achieved an accuracy about 20% higher than off-line methods (the best result found was the off-line ISVM with 71.9% [2]).

Table 3 shows the relation between accuracy and the number of SVs resulting from different forgetting scores ($d$). The decremental procedure discards points that are less likely to be SVs. A smaller $d$ leads to classifiers with lower generalization performance, and for most of the datasets the best performance was achieved without forgetting or with large forgetting scores. On the other hand, the number of SVs obtained with the decremental procedure is considerably smaller, so the forgetting score must be chosen according to the application. Table 3 also compares the online and offline approaches: the online approach had better accuracy on all datasets with a smaller number of SVs than the offline approach. A smaller $d$ also implies faster training and classification, and Table 4 shows that the difference in training time can be substantial (see the SUSY values, for example). On the COIL dataset, a forgetting score of $d=10$ yields an accuracy (95.11%) similar to that of the offline implementation of the FBTWSVM (95.00%), the ISVM (96.50%), the LASVM (93.20%) [2], and the multiclass SVM with Error-Correcting Output Codes (ECOC) from MATLAB (96.52%). In this manner, the forgetting strategy does not discard crucial support vectors, keeping the accuracy near that of the offline approach. Table 4 presents the evolution of accuracy as the forgetting score increases, which corroborates the forgetting strategy: lower forgetting scores tend to yield lower accuracy; however, because the important SVs are kept, the accuracy does not fall substantially (the accuracy difference between $d=1$ and $d=10$ is 1.9%).
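The role of the forgetting score can be illustrated with a toy scored window. This is only a sketch under an assumed rule: each stored point carries a counter that increases whenever the point is flagged as unlikely to be an SV, and the point is forgotten once the counter reaches $d$. The class name, the flagging interface, and the rule itself are hypothetical, since this section does not spell out the exact scoring used by the FBTWSVM.

```python
import math

class ScoredWindow:
    """Toy scored window: forget a point once it has been flagged d times."""

    def __init__(self, d=math.inf):
        self.d = d          # forgetting score; math.inf means nothing is ever forgotten
        self.scores = {}    # point id -> number of times flagged as an unlikely SV

    def add(self, point_id):
        """Start tracking a new point with a score of zero."""
        self.scores.setdefault(point_id, 0)

    def update(self, flagged_ids):
        """Increase the score of flagged points and drop those that reach d."""
        forgotten = []
        for pid in flagged_ids:
            if pid not in self.scores:
                continue  # unknown or already forgotten point
            self.scores[pid] += 1
            if self.scores[pid] >= self.d:
                del self.scores[pid]
                forgotten.append(pid)
        return forgotten

    def kept(self):
        """Ids of the points still stored in the window."""
        return sorted(self.scores)

window = ScoredWindow(d=2)
for pid in (0, 1, 2):
    window.add(pid)
first = window.update([0, 1])   # nothing reaches d yet
second = window.update([1, 2])  # point 1 is flagged for the second time
print(first, second, window.kept())  # [] [1] [0, 2]
```

Under this assumed rule a smaller $d$ removes points sooner, which matches the trend in Table 3: fewer stored SVs, at the cost of some accuracy.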

Table 5 compares the training time (in seconds), the real RAM consumed by the current process and its children (in gigabytes), and the accuracy of the FBTWSVM with those of other SVM-based methods (ISVM and LASVM). For these experiments we split each dataset into the largest batches that fit in the available memory (initially 15.4 GB), to reduce the number of times the dataset is reloaded during execution (more loading implies a longer training time). Both the FBTWSVM and the ISVM (the multiclass ISVM adopts a one-versus-one strategy) are implemented in MATLAB and thus require more real RAM than the LASVM, which is a C++ implementation for binary cases only. We do not consider the LASVM in the multiclass cases (LED and RBF), and we discard the cases in which training took over 12 hours. All methods present competitive accuracy; however, the FBTWSVM is the only method (compared to the ISVM and the LASVM) able to train on all dataset sizes in an acceptable time, with the smallest training time in almost all cases (the only exception is the LASVM on RTG10K). The FBTWSVM forgetting strategy is one of the factors (the kernel approximation also plays an important role) that make training on large datasets possible, as Table 5 shows that the difference in real RAM consumed between the 100K and 1M datasets is small. The LED dataset shows a larger memory difference between dataset sizes for the FBTWSVM, which is caused by the use of the multi-thread version instead of the single-processor version. In this way, the FBTWSVM scales better than other online SVM-based methods, as it requires less training time to process large datasets and handles memory consumption efficiently.
To further explore the potential of the FBTWSVM for large datasets, we also evaluated the accuracy, training time, and memory consumption on the WESAD dataset [46], considering three classes (baseline, stress, and amusement), eight attributes acquired from a sensor attached to the chest, and leave-one-subject-out cross-validation (17 subjects in total). The best result reported by Schmidt et al. [46] is 76.50%, using Linear Discriminant Analysis; however, this is an offline approach and the authors do not report the training time or memory consumption. Our method achieved an accuracy of 75.50% (Table 5), with a training time of 6,789 seconds and a peak memory consumption of 9.8 GB.
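The leave-one-subject-out protocol used for WESAD can be sketched as follows: each fold holds out every sample of one subject for testing and trains on the samples of all remaining subjects. The helper name and the subject labels are hypothetical and serve only as an illustration.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test

# Hypothetical subject label of each sample: six samples from three subjects.
subjects = ["S2", "S2", "S3", "S3", "S4", "S4"]
folds = list(leave_one_subject_out(subjects))
for held_out, train, test in folds:
    print(held_out, train, test)
```

Because no subject appears in both the training and the test indices of a fold, the reported accuracy reflects generalization to unseen subjects rather than to unseen samples of known subjects.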

6 Conclusion

In this paper, we propose a novel SVM approach suitable for incremental and decremental on-line learning. The incremental and decremental Fuzzy Bounded Twin SVM (FBTWSVM) integrates ideas from different SVM approaches, such as the Twin SVM [7], the Fuzzy SVM [27], the Bounded TWSVM [11], the Fast and Robust TWSVM [14, 25], the Optimal DAG TWSVM [34], and the dual coordinate descent method [28]. The FBTWSVM calculates a pair of non-parallel hyperplanes using two smaller QPPs, rather than one large QPP as in the original SVM, while adhering to the structural risk minimization principle. The dual form of the FBTWSVM leads to a pair of convex quadratic programming problems with a unique solution and singularity avoidance. The dual coordinate descent method with shrinking requires less memory than the TWSVM, as it discards points that are less likely to be SVs. The fuzzy concept enhances noise resistance and generalization capability, while the kernel approximation provides good generalization performance with our linear model.
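The idea of pairing a linear model with a Fourier approximation of the Gaussian kernel can be illustrated with random Fourier features. The sketch below is a generic, standard-library version of that technique and is not taken from the FBTWSVM code; the function names, the parametrization $k(x,y)=\exp(-\gamma\|x-y\|^{2})$, and the feature count are assumptions.

```python
import math
import random

def make_rff(dim, n_features, gamma, seed=0):
    """Return a map z such that dot(z(x), z(y)) approximates exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    sigma = math.sqrt(2.0 * gamma)  # std of the Gaussian spectral density of this kernel
    weights = [[rng.gauss(0.0, sigma) for _ in range(dim)] for _ in range(n_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def z(x):
        # One cosine feature per random direction and phase.
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]
    return z

def rbf(x, y, gamma):
    """Exact Gaussian (RBF) kernel value."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

gamma = 0.5
z = make_rff(dim=3, n_features=2000, gamma=gamma)
x, y = [0.1, 0.2, 0.3], [0.4, 0.0, -0.1]
approx, exact = dot(z(x), z(y)), rbf(x, y, gamma)
print(round(exact, 3))  # 0.865; approx is close to this value
```

A linear classifier trained on `z(x)` then behaves approximately like a Gaussian-kernel classifier, while memory grows with the number of random features rather than with the number of training points. That is why the kernel approximation size acts as a knob trading memory for accuracy.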

The incremental solution follows the shrinking strategy and can run with different batch sizes, from a single data point to as many data points as fit in the available memory. The decremental procedure is fundamental to control the model complexity, keeping only the most critical SVs in the model. The FBTWSVM is flexible, and both the incremental and decremental procedures can be configured according to the application by changing the threshold for adding new SVs in the incremental step and the number of occurrences in the decremental step. According to the experimental results, the DAG strategy showed good generalization capability and fast training speed; in further studies, using structural and statistical information of the training data in the training process may increase the generalization performance. A practical difficulty of the FBTWSVM is the optimization of the six hyper-parameters $C_{1},C_{2},C_{3},C_{4},\mu,\gamma$ and the kernel approximation size; this problem will be addressed in future work. The FBTWSVM can adapt current models using the window strategy, or even add new models (e.g., in case of new classes) without retraining. Therefore, as future work, we will evaluate the use of the FBTWSVM in the context of concept drift, novelty detection, and big data.
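As an illustration of how a DAG combines binary classifiers into a multi-class decision, the following sketch evaluates a generic decision DAG: it keeps a list of candidate classes, tests the first against the last, and eliminates the loser until one class remains. The fixed class ordering, the toy nearest-centre classifiers, and all names are assumptions; the Optimal DAG TWSVM [34] may order the nodes differently.

```python
def dag_predict(x, classes, pairwise):
    """Eliminate one candidate class per binary decision until one remains."""
    remaining = list(classes)
    while len(remaining) > 1:
        a, b = remaining[0], remaining[-1]
        winner = pairwise[(a, b)](x)  # each binary classifier returns either a or b
        if winner == a:
            remaining.pop()       # b is eliminated
        else:
            remaining.pop(0)      # a is eliminated
    return remaining[0]

def make_nearest_centre(a, b, centres):
    """Toy binary classifier for 1-D inputs: pick the class with the closer centre."""
    return lambda x: a if abs(x - centres[a]) <= abs(x - centres[b]) else b

# Hypothetical 1-D problem with three classes.
centres = {0: 0.0, 1: 5.0, 2: 10.0}
classes = sorted(centres)
pairwise = {(a, b): make_nearest_centre(a, b, centres)
            for i, a in enumerate(classes) for b in classes[i + 1:]}

print(dag_predict(4.2, classes, pairwise))  # 1
```

For $K$ classes a prediction evaluates only $K-1$ of the $K(K-1)/2$ binary classifiers, and a new class can be supported by training only the binary models that involve it.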

Bibliography

[1] R. Khemchandani, Jayadeva, S. Chandra, Incremental Twin Support Vector Machines, in: S. K. Neogy, A. K. Das, R. B. Bapat (Eds.), Modeling, Computation and Optimization, World Sci Pub Co, 2009, pp. 263–272.
[2] V. Losing, B. Hammer, H. Wersing, Incremental on-line learning: A review and comparison of state of the art algorithms, Neurocomp 275 (2018) 1261–1274.
[3] C. Cortes, V. Vapnik, Support-Vector Networks, Mach Learn 20 (1995) 273–297.
[4] G. Cauwenberghs, T. A. Poggio, Incremental and decremental support vector machine learning, in: Advances in Neural Information Processing Systems 13, Denver, CO, USA, 2000, pp. 409–415.
[5] A. Bordes, S. Ertekin, J. Weston, L. Bottou, Fast Kernel Classifiers with Online and Active Learning, J Mach Learn Res 6 (2005) 1579–1619.
[6] O. Mangasarian, E. Wild, Multisurface proximal support vector machine classification via generalized eigenvalues, IEEE Trans Patt Anal Mach Intell 28 (2006) 69–74.
[7] Jayadeva, R. Khemchandani, S. Chandra, Twin Support Vector Machines for Pattern Classification, IEEE Trans Patt Anal Mach Intell 29 (2007) 905–910.
[8] D. Tomar, S. Agarwal, Twin Support Vector Machine: A review from 2007 to 2014, Egyptian Inf J 16 (2015) 55–69.

Table 3: Accuracy (%) $\|$ number of SVs (nSV) for different forgetting scores $d$ and for the offline (OffL) model.
Dataset	$d=1$	$d=2$	$d=4$	$d=10$	$d=\infty$	OffL
Border	91.90 $\|$ 538	93.30 $\|$ 784	98.20 $\|$ 1.3k	97.20 $\|$ 2.8k	98.70 $\|$ 7k	98.50 $\|$ 8k
Overlap	76.97 $\|$ 1.9k	79.70 $\|$ 2.3k	82.22 $\|$ 3.5k	82.83 $\|$ 7k	84.14 $\|$ 11.8k	83.30 $\|$ 11k
Letter	93.38 $\|$ 83k	94.83 $\|$ 122k	96.10 $\|$ 204k	96.10 $\|$ 338k	96.63 $\|$ 361k	96.90 $\|$ 384k
SUSY	45.84 $\|$ 754k	45.85 $\|$ 1.5M	77.67 $\|$ 2.5M	73.78 $\|$ 3.4M	76.45 $\|$ 3.8M	-
Outdoor	72.00 $\|$ 24k	72.25 $\|$ 39k	73.69 $\|$ 68k	74.44 $\|$ 83k	73.88 $\|$ 92k	74.00 $\|$ 93k
COIL	93.21 $\|$ 98k	94.01 $\|$ 100k	94.57 $\|$ 153k	95.11 $\|$ 156k	94.93 $\|$ 166k	95.00 $\|$ 178k
DNA	91.82 $\|$ 770	92.16 $\|$ 848	92.50 $\|$ 1.2k	93.78 $\|$ 1.9k	93.59 $\|$ 2.6k	93.50 $\|$ 2.7k
USPS	93.92 $\|$ 9k	94.32 $\|$ 21k	94.87 $\|$ 42k	95.36 $\|$ 60k	95.47 $\|$ 60k	95.30 $\|$ 65.5k
Isolet	95.19 $\|$ 61k	95.51 $\|$ 85k	95.32 $\|$ 128k	95.89 $\|$ 143k	96.28 $\|$ 143k	95.60 $\|$ 155k
MNIST	97.17 $\|$ 59k	97.48 $\|$ 169k	97.66 $\|$ 342k	97.91 $\|$ 455k	97.80 $\|$ 455k	97.80 $\|$ 540k
Gisette	96.50 $\|$ 1.3k	97.00 $\|$ 1.9k	96.90 $\|$ 2.9k	96.20 $\|$ 5.3k	96.50 $\|$ 6k	96.50 $\|$ 6k
"-" denotes results that are not available due to limitations in memory size.

Table 4: Training $\|$ testing time (in seconds) for different forgetting scores $d$.
Dataset	$d=1$	$d=10$	$d=\infty$
Border	12.39 $\|$ 0.01	34.12 $\|$ 0.02	43.22 $\|$ 0.02
Overlap	22.96 $\|$ 0.02	82.58 $\|$ 0.02	94.37 $\|$ 0.02
Letter	88.14 $\|$ 0.68	179.18 $\|$ 0.72	192.36 $\|$ 0.70
SUSY	940.60 $\|$ 1.88	5264.41 $\|$ 2.62	- $\|$ -
Outdoor	111.00 $\|$ 0.73	153.09 $\|$ 0.80	153.55 $\|$ 0.81
COIL	154.63 $\|$ 5.00	169.10 $\|$ 5.29	179.12 $\|$ 4.51
DNA	6.89 $\|$ 0.03	8.63 $\|$ 0.03	8.53 $\|$ 0.03
USPS	21.59 $\|$ 0.25	41.59 $\|$ 0.24	41.67 $\|$ 0.23
Isolet	117.35 $\|$ 0.52	181.14 $\|$ 0.49	160.10 $\|$ 0.51
MNIST	349.47 $\|$ 1.21	419.49 $\|$ 1.19	435.48 $\|$ 1.26
Gisette	25.26 $\|$ 0.01	50.52 $\|$ 0.01	49.41 $\|$ 0.01
"-" denotes results that are not available due to limitations in memory size.

Table 5: Training time (TT, in seconds), real RAM consumed (RMU, in GB), and accuracy (Acc, in %) per method and dataset. Each cell lists the values for 10k $\|$ 100k $\|$ 1M training instances; WESAD has a single value.
Method	Metric	LED	SEA	RTG	RBF	HYPER	WESAD
FBTWSVM	TT	5.81 $\|$ 19.1 $\|$ 380	0.69 $\|$ 2.15 $\|$ 63	9.60 $\|$ 119 $\|$ 3.4k	11.4 $\|$ 164 $\|$ 5.3k	0.73 $\|$ 2.61 $\|$ 28.0	6.8k
	RMU	3.26 $\|$ 3.81 $\|$ 7.34	0.84 $\|$ 0.91 $\|$ 1.31	1.84 $\|$ 10.1 $\|$ 10.6	1.04 $\|$ 2.06 $\|$ 7.03	0.84 $\|$ 0.96 $\|$ 1.47	9.80
	Acc	74.1 $\|$ 74.1 $\|$ 74.2	89.0 $\|$ 87.1 $\|$ 89.1	95.8 $\|$ 95.5 $\|$ 96.0	88.3 $\|$ 89.4 $\|$ 89.0	89.1 $\|$ 94.7 $\|$ 94.0	75.5
ISVM	TT	40.7 $\|$ - $\|$ -	8.80 $\|$ 4.5k $\|$ -	63.9 $\|$ - $\|$ -	31.8 $\|$ - $\|$ -	14.4 $\|$ 4.9k $\|$ -	-
	RMU	0.89 $\|$ - $\|$ -	0.88 $\|$ 0.98 $\|$ -	0.94 $\|$ - $\|$ -	0.94 $\|$ - $\|$ -	0.82 $\|$ 0.95 $\|$ -	-
	Acc	74.2 $\|$ - $\|$ -	89.9 $\|$ 89.3 $\|$ -	90.0 $\|$ - $\|$ -	87.6 $\|$ - $\|$ -	94.1 $\|$ 93.0 $\|$ -	-
LASVM	TT	-	2.44 $\|$ 328 $\|$ -	2.39 $\|$ 546 $\|$ -	-	2.21 $\|$ 519 $\|$ -	-
	RMU	-	0.06 $\|$ 0.36 $\|$ -	0.03 $\|$ 0.41 $\|$ -	-	0.05 $\|$ 0.37 $\|$ -	-
	Acc	-	84.0 $\|$ 87.0 $\|$ -	88.0 $\|$ 95.7 $\|$ -	-	91.7 $\|$ 93.8 $\|$ -	-
"-" denotes results that are not available due to training times greater than 12 hours.