A Generalized Weighted Loss for SVC and MLP

Filippo Portera

arXiv:2302.12011·cs.LG·February 24, 2023


TL;DR

This paper introduces a generalized weighted loss function applicable to Support Vector Classification and Multi-layer Perceptron, improving performance by adaptively weighting errors without degrading standard methods.

Contribution

It proposes a novel error weighting scheme that generalizes traditional loss functions for SVC and MLP, enhancing their robustness and accuracy.

Findings

1. Error is never worse than with standard loss methods
2. Weighted loss often outperforms traditional approaches
3. Applicable to both classification and regression models

Abstract

Standard algorithms usually employ a loss where each error is the mere absolute difference between the true value and the prediction, in the case of a regression task. In the present work, we introduce several error weighting schemes that generalize this consolidated routine. We study both a binary classification model for Support Vector Classification and a regression net for Multi-layer Perceptron. Results show that the error is never worse than with the standard procedure, and several times it is better.

Tables

Table 1: SVC and GWL with the Ionosphere data-set

Algorithm	Data-set	Mean F1	Time
sklearn.svm.SVC	Ionosphere	0.968638	0m3,130s
GWL SVC	Ionosphere	0.970651	62m34,796s
GWL(1)	Ionosphere	0.977172	175m44,060s
GWL(2)	Ionosphere	0.977172	175m41,022s
GWL(3)	Ionosphere	0.977172	187m1,032s
GWL(4)	Ionosphere	0.977172	187m59,323s
GWL(5)	Ionosphere	0.977359	3h:07m:41s
GWL(6)	Ionosphere	0.977538	2h:58m:42s
GWL(8)	Ionosphere	0.977292	3h:24m:32s
GWL(8)	Ionosphere	0.977292	3h:08m:31s
GWL(8)	Ionosphere	0.974767	3h:07m:36s
GWL(8)	Ionosphere	0.975011	3h:05m:40

Table 2: SVC and GWL with the Sonar data-set

Algorithm	Data-set	Mean F1	Time
sklearn.svm.SVC	Sonar	0.886610	0m2,396s
GWL SVC	Sonar	0.904489	16m19,019s
GWL(1)	Sonar	0.909337	28m8,391s
GWL(2)	Sonar	0.916513	31m58,852s
GWL(3)	Sonar	0.908717	36m26,726s
GWL(4)	Sonar	0.913580	40m51,567s
GWL(5)	Sonar	0.916303	45m46,03s
GWL(6)	Sonar	0.916671	41m:17,92s
GWL(7)	Sonar	0.911098
GWL(8)	Sonar	0.907057

Table 3: SVC and GWL with the Breast data-set

Algorithm	Data-set	Mean F1	Time
sklearn.svm.SVC	Breast	0.959825	0m3,387s
GWL SVC	Breast	0.958628	174m47,138s
GWL(1)	Breast	0.963625	432m3,673s
GWL(2)	Breast	0.963896	448m14,994s
GWL(3)	Breast	0.967909	443m59,346s
GWL(4)	Breast	0.966109	401m18,642s
GWL(5)	Breast	0.964666	6h:48m:01s
GWL(6)	Breast	0.964666	6h:34m:13s
GWL(8)	Breast	0.961837	8h:02m:57s

Table 4: SVC and GWL with the Statlog data-set

Algorithm	Data-set	Mean F1	Time
sklearn.svm.SVC	Statlog	0.610351	0m38,297s
GWL SVC	Statlog	0.644108	1725m29,269s
GWL(1)	Statlog	0.651278	5370m49,360s
GWL(2)	Statlog	0.651278	5499m18,497s
GWL(3)	Statlog	0.644329	5638m2,544s
GWL(4)	Statlog	0.649529	5692m55,211s
GWL(5)	Statlog	0.651947	5351m40,975s
GWL(6)	Statlog	0.652786	5546m26,466s
GWL(8)	Statlog	0.646312	83h:50m:08s

Table 5: First MLP with the Wine data-set

Standard MLP MAE	gamma Best	Best Loss MLP MAE
0.5871212	10	0.57575756 (1)
0.5694444	10	0.5530303 (2)
0.5568182	1	0.53661615 (3)
0.5580808	0.01	0.540404 (4)
0.54924244	0.1	0.54671717 (5)
0.510101	100	0.510101 (6)

Table 6: Second MLP with the Wine data-set

Standard MLP MAE	gamma Best	Best Loss MLP MAE
0.5694444	10	0.5580808 (1)
0.53661615	1	0.5290404 (2)
0.5378788	100	0.5378788 (3)
0.5770202	10	0.5580808 (4)
0.54924244	0.01	0.5252525 (5)
0.56565654	1	0.5555556 (6)

Equations

P = \frac{1}{2} \|\vec{v}\|^{2} + C \sum_{i=1}^{l} \xi_{i} w_{i}

y_{i} (\vec{v}' \vec{x}_{i} + b) \geq 1 - \xi_{i}, \quad i \in \{1, \dots, l\}

\xi_{i} \geq 0, \quad i \in \{1, \dots, l\}

s_{i} = \sum_{j=1}^{l} e^{-\gamma_{S} \|\vec{x}_{i} - \vec{x}_{j}\|^{2}}

sy_{i} = y_{i} \sum_{j=1}^{l} y_{j} e^{-\gamma_{S} \|\vec{x}_{i} - \vec{x}_{j}\|^{2}}

w_{i} = f(s_{i})

L = \frac{1}{2} \|\vec{v}\|^{2} + C \sum_{i=1}^{l} \xi_{i} w_{i} - \sum_{i=1}^{l} \alpha_{i} (y_{i} (\vec{v}' \vec{x}_{i} + b) - 1 + \xi_{i}) - \sum_{i=1}^{l} \eta_{i} \xi_{i}

\alpha_{i} \geq 0, \quad i \in \{1, \dots, l\}

\eta_{i} \geq 0, \quad i \in \{1, \dots, l\}

\frac{\partial L}{\partial \vec{v}} = \vec{v} - \sum_{i=1}^{l} \alpha_{i} y_{i} \vec{x}_{i} = 0 \Rightarrow \vec{v} = \sum_{i=1}^{l} \alpha_{i} y_{i} \vec{x}_{i}

\frac{\partial L}{\partial \xi_{i}} = C w_{i} - \alpha_{i} - \eta_{i} = 0 \Rightarrow \alpha_{i} \leq C w_{i}

\frac{\partial L}{\partial b} = -\sum_{i=1}^{l} \alpha_{i} y_{i} = 0 \Rightarrow \sum_{i=1}^{l} \alpha_{i} y_{i} = 0

D = \sum_{i=1}^{l} \alpha_{i} - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_{i} y_{i} \alpha_{j} y_{j} K(\vec{x}_{i}, \vec{x}_{j})

0 \leq \alpha_{i} \leq C w_{i}, \quad i \in \{1, \dots, l\}

\sum_{i=1}^{l} \alpha_{i} y_{i} = 0

\alpha_{i}^{t+1} = \alpha_{i}^{t} + \nu y_{i}

\alpha_{j}^{t+1} = \alpha_{j}^{t} - \nu y_{j}

\nu = \frac{y_{j} - y_{i} - \sum_{p=1}^{l} \alpha_{p} y_{p} K(\vec{x}_{j}, \vec{x}_{p}) + \sum_{p=1}^{l} \alpha_{p} y_{p} K(\vec{x}_{i}, \vec{x}_{p})}{K(\vec{x}_{i}, \vec{x}_{i}) - 2 K(\vec{x}_{i}, \vec{x}_{j}) + K(\vec{x}_{j}, \vec{x}_{j})}

w_{i} = s_{i}

w_{i} = s_{i}

w_{i} = s_{i}^{2}

w_{i} = \frac{1}{s_{i}}

w_{i} = \frac{1}{s_{i}}

w_{i} = \frac{1}{s_{i}^{2}}

w_{i} = sy_{i}

w_{i} = 1 + rand[0, 1]


Taxonomy

Topics: Neural Networks and Applications · Fuzzy Logic and Control Systems

Full text

A Generalized Weighted Loss for SVC and MLP

Filippo Portera

Abstract

Standard algorithms usually employ a loss where each error is the mere absolute difference between the true value and the extrapolation, in the case of a regression task. In the present work, we introduce several error weighting schemes that generalize this consolidated routine. We study both a binary classification model for Support Vector Classification and a regression net for Multi-layer Perceptron. Results show that the error is never worse than with the standard procedure, and several times it is better.

Keywords

Machine Learning, Binary Classification, SVC, Regression, MLP.


1 Introduction

We would like to show that a generalization of the standard loss (in our case we have chosen SVC and MLP) can produce results that are not worse than those of the consolidated loss. In fact, these generalized losses can exploit the possibility that a given data-set contains non-IID samples.

The loss studied to generalize SVC and the full optimization problem are:

$$P = \frac{1}{2} \|\vec{v}\|^{2} + C \sum_{i=1}^{l} \xi_{i} w_{i}$$

subject to:

$$y_{i} (\vec{v}' \vec{x}_{i} + b) \geq 1 - \xi_{i}, \quad i \in \{1, \dots, l\}$$

and:

$$\xi_{i} \geq 0, \quad i \in \{1, \dots, l\}$$

where $\vec{v}$ represents the linear weights of the extrapolator function, $l$ is the number of training examples, $C$ is a trade-off hyper-parameter, $\xi_{i}$ is the error on sample $i$, and $w_{i}$ are scalar weights that are a function of a distribution $s_{i}$ of the samples:

$$s_{i} = \sum_{j=1}^{l} e^{-\gamma_{S} \|\vec{x}_{i} - \vec{x}_{j}\|^{2}}$$

Other distributions can be adopted (e.g., 1 + the RBF norm instead of the RBF dot product). Also let:

$$sy_{i} = y_{i} \sum_{j=1}^{l} y_{j} e^{-\gamma_{S} \|\vec{x}_{i} - \vec{x}_{j}\|^{2}}$$

with $\gamma_{S}$ an additional hyper-parameter. Here lies the complexity of the algorithm, since this calculation is $O(l^{2})$. Perhaps this cost can be overcome with pattern sampling or, in the case of MLP, with some form of weight learning. The weights are then:

$$w_{i} = f(s_{i})$$
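As a minimal sketch of the $O(l^{2})$ computation described above, the distribution values can be obtained from one pairwise distance matrix. The function name `similarity_scores` is hypothetical, and the label-aware variant is read here as $sy_{i} = y_{i}\sum_{j} y_{j} e^{-\gamma_{S}\|\vec{x}_{i}-\vec{x}_{j}\|^{2}}$, which is an assumption about the intended formula.

```python
import numpy as np

def similarity_scores(X, y, gamma_s):
    """Return (s, sy) for all training samples.

    s[i]  = sum_j exp(-gamma_s * ||x_i - x_j||^2)
    sy[i] = y_i * sum_j y_j * exp(-gamma_s * ||x_i - x_j||^2)  (assumed reading)
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise squared distances: an (l, l) matrix, hence the O(l^2) cost.
    sq_norms = np.sum(X * X, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against negative round-off
    G = np.exp(-gamma_s * sq_dists)
    s = G.sum(axis=1)   # the j = i term contributes 1, so s[i] >= 1
    sy = y * (G @ y)
    return s, sy
```

Because the sum includes the term $j=i$, every $s_{i}$ is at least 1 in this sketch, so schemes that divide by $s_{i}$ stay bounded.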

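This implies a quadratic problem that is different from the traditional SVC one. The Lagrangian would be:

$$L = \frac{1}{2} \|\vec{v}\|^{2} + C \sum_{i=1}^{l} \xi_{i} w_{i} - \sum_{i=1}^{l} \alpha_{i} (y_{i} (\vec{v}' \vec{x}_{i} + b) - 1 + \xi_{i}) - \sum_{i=1}^{l} \eta_{i} \xi_{i}$$

subject to:

$$\alpha_{i} \geq 0, \quad \eta_{i} \geq 0, \quad i \in \{1, \dots, l\}$$

Applying the KKT conditions for optimality:

$$\frac{\partial L}{\partial \vec{v}} = \vec{v} - \sum_{i=1}^{l} \alpha_{i} y_{i} \vec{x}_{i} = 0 \Rightarrow \vec{v} = \sum_{i=1}^{l} \alpha_{i} y_{i} \vec{x}_{i}$$

$$\frac{\partial L}{\partial \xi_{i}} = C w_{i} - \alpha_{i} - \eta_{i} = 0 \Rightarrow \alpha_{i} \leq C w_{i}$$

$$\frac{\partial L}{\partial b} = -\sum_{i=1}^{l} \alpha_{i} y_{i} = 0 \Rightarrow \sum_{i=1}^{l} \alpha_{i} y_{i} = 0$$

Thus, the dual becomes:

$$D = \sum_{i=1}^{l} \alpha_{i} - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_{i} y_{i} \alpha_{j} y_{j} K(\vec{x}_{i}, \vec{x}_{j})$$

subject to:

$$0 \leq \alpha_{i} \leq C w_{i}, \quad i \in \{1, \dots, l\}$$

$$\sum_{i=1}^{l} \alpha_{i} y_{i} = 0$$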
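To make the dual concrete, the following sketch evaluates $D = \sum_{i}\alpha_{i} - \frac{1}{2}\sum_{i}\sum_{j}\alpha_{i}y_{i}\alpha_{j}y_{j}K(\vec{x}_{i},\vec{x}_{j})$ and checks the two constraints $0 \leq \alpha_{i} \leq Cw_{i}$ and $\sum_{i}\alpha_{i}y_{i}=0$. The helper names and the tolerance `tol` are illustrative assumptions, not part of the released C code.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma_k):
    """K[i, j] = exp(-gamma_k * ||x_i - x_j||^2)."""
    X = np.asarray(X, dtype=float)
    sq = np.sum(X * X, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    return np.exp(-gamma_k * d2)

def dual_objective(alpha, y, K):
    """D = sum_i alpha_i - 1/2 * sum_ij alpha_i y_i alpha_j y_j K_ij."""
    beta = alpha * y
    return float(alpha.sum() - 0.5 * beta @ K @ beta)

def is_feasible(alpha, y, w, C, tol=1e-9):
    """Check the weighted box 0 <= alpha_i <= C * w_i and sum_i alpha_i y_i = 0."""
    in_box = np.all(alpha >= -tol) and np.all(alpha <= C * w + tol)
    balanced = abs(float(np.dot(alpha, y))) <= tol
    return bool(in_box and balanced)
```

The only difference from the standard SVC dual in this sketch is the per-sample upper bound `C * w` in `is_feasible`; with all weights equal to 1 it reduces to the usual box constraint.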

This is very similar to the standard SVC dual [4], apart from the constraints on the Lagrangian multipliers.

We wrote an ad-hoc quadratic optimizer for this problem, with an SMO-like method ([2]). The code of this work is available at the OSF GWL Project.

We iteratively select 2 distinct multipliers and modify them in an attempt to improve the dual objective function:

$$\alpha_{i}^{t+1} = \alpha_{i}^{t} + \nu y_{i}$$

$$\alpha_{j}^{t+1} = \alpha_{j}^{t} - \nu y_{j}$$

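The motivation is the enforcement of the second dual constraint, $\sum_{i=1}^{l}\alpha_{i}y_{i}=0$: since $y_{i}^{2}=y_{j}^{2}=1$, moving one multiplier by $+\nu y_{i}$ and the other by $-\nu y_{j}$ changes the sum by $\nu-\nu=0$.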

The $\nu$ in the optimal direction is obtained by differentiating $D$ with respect to $\nu$, as was done in section 5.1 of ([5]).

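This direction is:

$$\nu = \frac{y_{j} - y_{i} - \sum_{p=1}^{l} \alpha_{p} y_{p} K(\vec{x}_{j}, \vec{x}_{p}) + \sum_{p=1}^{l} \alpha_{p} y_{p} K(\vec{x}_{i}, \vec{x}_{p})}{K(\vec{x}_{i}, \vec{x}_{i}) - 2 K(\vec{x}_{i}, \vec{x}_{j}) + K(\vec{x}_{j}, \vec{x}_{j})}$$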

Once the candidate $\nu$ has been determined, it has to be clipped in order to satisfy the constraints on both the multipliers.
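The clipping step can be sketched as an interval intersection. Assuming the pair update $\alpha_{i} \leftarrow \alpha_{i} + \nu y_{i}$, $\alpha_{j} \leftarrow \alpha_{j} - \nu y_{j}$ with labels in $\{-1,+1\}$, each multiplier restricts $\nu$ to an interval, and the candidate is projected onto their intersection. The function name `clip_step` is hypothetical and this is not taken from the released optimizer.

```python
def clip_step(nu, alpha_i, alpha_j, y_i, y_j, C, w_i, w_j):
    """Clip nu so that both updated multipliers stay inside their weighted boxes.

    After the update, alpha_i + nu * y_i must lie in [0, C * w_i] and
    alpha_j - nu * y_j must lie in [0, C * w_j].
    """
    def interval(alpha, sign, upper):
        # All nu such that alpha + sign * nu is inside [0, upper].
        lo, hi = -alpha, upper - alpha
        return (lo, hi) if sign > 0 else (-hi, -lo)

    lo_i, hi_i = interval(alpha_i, y_i, C * w_i)
    lo_j, hi_j = interval(alpha_j, -y_j, C * w_j)
    lo, hi = max(lo_i, lo_j), min(hi_i, hi_j)
    # If the current multipliers are feasible, 0 is in [lo, hi], so lo <= hi.
    return min(max(nu, lo), hi)
```

For example, with $\alpha_{i}=0.2$, $\alpha_{j}=0.5$, $y_{i}=y_{j}=1$, $C=1$, $w_{i}=0.5$ and $w_{j}=1$, the admissible interval is $[-0.2, 0.3]$, so a candidate $\nu=1$ is reduced to $0.3$.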

At each iteration we compute $b$ from the support vectors that lie on the margin (those for which $0<\alpha_{i}<Cw_{i}$), as reported in "How to calculate $b$".
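A common way to obtain $b$ from margin support vectors, sketched here as an assumption rather than as the exact rule used in the released code, is to average $y_{i} - \sum_{p}\alpha_{p}y_{p}K(\vec{x}_{i},\vec{x}_{p})$ over the samples with $0<\alpha_{i}<Cw_{i}$. The name `compute_bias`, the tolerance, and the fallback value are illustrative.

```python
import numpy as np

def compute_bias(alpha, y, K, w, C, tol=1e-8):
    """Average y_i - sum_p alpha_p y_p K_ip over the margin support vectors."""
    upper = C * w
    on_margin = (alpha > tol) & (alpha < upper - tol)
    if not np.any(on_margin):
        return 0.0  # assumption: fall back to 0 when no margin vector exists
    f_no_bias = K @ (alpha * y)  # decision values without the bias term
    return float(np.mean(y[on_margin] - f_no_bias[on_margin]))
```

The margin test uses the weighted bound `C * w`, so a sample with a small weight leaves the margin set earlier than it would in the unweighted problem.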

The kernel used to compute $K(\vec{x},\vec{y})$ is RBF with hyper-parameter $\gamma_{K}$ . The whole procedure is iterated $50l^{2}$ times for each training problem.

2 Related works

In ([1]) the authors learn the loss weights directly from the training and validation sets. They report a substantial improvement in the generalization error and they also provide theoretical bounds.

3 Method

We use the acronym GWL for Generalized Weighted Loss.

We tried 4 distinct algorithms: the Python 3 package sklearn.svm.SVC, GWL SVC with $w_{i}=1$, GWL (here we mean the generalized loss with the $w_{i}$ built as described), and GWL with random weights. We would like to know whether, in the general case, the optimal solutions use $w_{i}$ not equal to $1$. We have selected at least 8 cases of study to determine the weight $w_{i}$ of a sample $i$. Some of the evaluated weighting functions are:

1. $w_{i} = s_{i}$
2. $w_{i} = s_{i}$
3. $w_{i} = s_{i}^{2}$
4. $w_{i} = \frac{1}{s_{i}}$
5. $w_{i} = \frac{1}{s_{i}}$
6. $w_{i} = \frac{1}{s_{i}^{2}}$
7. $w_{i} = sy_{i}$
8. $w_{i} = 1 + rand[0, 1]$
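The distinct weighting functions in the list can be sketched as a single dispatch function. The scheme names, the function name `make_weights`, and the use of NumPy's `default_rng` for the random case are assumptions made for illustration.

```python
import numpy as np

def make_weights(scheme, s, sy=None, rng=None):
    """Map the distribution values s (or sy) to loss weights w_i."""
    s = np.asarray(s, dtype=float)
    if scheme == "s":
        return s.copy()
    if scheme == "s_squared":
        return s ** 2
    if scheme == "inv_s":
        return 1.0 / s
    if scheme == "inv_s_squared":
        return 1.0 / s ** 2
    if scheme == "sy":
        if sy is None:
            raise ValueError("scheme 'sy' needs the label-aware values sy")
        # Note: sy can be negative; this sketch does not guard against that.
        return np.asarray(sy, dtype=float).copy()
    if scheme == "random":
        rng = np.random.default_rng() if rng is None else rng
        return 1.0 + rng.uniform(0.0, 1.0, size=s.shape)
    raise ValueError(f"unknown scheme: {scheme!r}")
```

The inverse schemes give larger weights to samples with small $s_{i}$, that is, to isolated patterns, while the direct schemes emphasize samples in dense regions.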

Case $8$ is useful to show that a weighting scheme based on the training distribution is preferable to a random weighting scheme.

4 Results

We explored a 2-dimensional hyper-parameter grid for sklearn.svm.SVC, involving $\gamma_{K}$ and $C$, while we used the additional hyper-parameter $\gamma_{S}$ to generate the loss weights. That is the reason why the experiments with loss weights take more time to terminate. The second grid is an extension of (it covers) the first one. The results reported in Tables 1 to 4 are for 5-fold cross-validation on data-sets extracted from the UCI website and suitably pre-processed (duplicate or inconsistent samples removed, shuffling).
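The three-parameter search can be approximated with scikit-learn alone, as a rough sketch and not as the GWL optimizer written in C. It relies on the `sample_weight` argument of `SVC.fit`, which rescales $C$ per sample and therefore yields the same box constraint $0 \leq \alpha_{i} \leq Cw_{i}$. The function names, the choice of the scheme $w_{i}=1/s_{i}$, the stratified folds, and the grid values passed by the caller are all assumptions.

```python
import itertools
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def rbf_density(X, gamma_s):
    """s_i = sum_j exp(-gamma_s * ||x_i - x_j||^2) over the rows of X."""
    sq = np.sum(X * X, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    return np.exp(-gamma_s * d2).sum(axis=1)

def grid_search_weighted_svc(X, y, Cs, gamma_ks, gamma_ss, n_splits=5, seed=0):
    """Return (best mean F1, best parameters) over the C, gamma_K, gamma_S grid.

    X and y must be NumPy arrays; y holds binary labels.
    """
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    best_score, best_params = -1.0, None
    for C, gk, gs in itertools.product(Cs, gamma_ks, gamma_ss):
        scores = []
        for train, test in folds.split(X, y):
            # Weights are computed on the training fold only (w_i = 1 / s_i).
            w = 1.0 / rbf_density(X[train], gs)
            model = SVC(C=C, kernel="rbf", gamma=gk)
            model.fit(X[train], y[train], sample_weight=w)
            scores.append(f1_score(y[test], model.predict(X[test])))
        mean_f1 = float(np.mean(scores))
        if mean_f1 > best_score:
            best_score = mean_f1
            best_params = {"C": C, "gamma_K": gk, "gamma_S": gs}
    return best_score, best_params
```

The extra loop over `gamma_ss` multiplies the number of fits by the size of that grid, which illustrates why the weighted experiments take longer than the baseline search.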

We have also tried 2 MLP nets with PyTorch on a regression task with $w_{i}=s_{i}$, and the results are interesting (but the random initialization of the net weights should be considered in this case). The Wine data-set has 3961 samples and 11 features; the first MLP has 100, 50, 20, 1 nodes per layer and the second MLP has 100, 80, 40, 1 nodes per layer. The theory underlying deep neural architectures can be found in [3].
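A weighted absolute-error loss for such an MLP might look like the following PyTorch sketch. The layer widths 100, 50, 20, 1 and the 11 input features come from the text; the ReLU activations, the Adam optimizer, full-batch training, and the normalization of the loss by the sum of the weights are assumptions, and all function names are hypothetical.

```python
import torch
from torch import nn

def make_mlp(n_features=11, hidden=(100, 50, 20)):
    """MLP with the given hidden widths and one output (ReLU is an assumption)."""
    layers, width = [], n_features
    for h in hidden:
        layers += [nn.Linear(width, h), nn.ReLU()]
        width = h
    layers.append(nn.Linear(width, 1))
    return nn.Sequential(*layers)

def rbf_density(X, gamma_s):
    """s_i = sum_j exp(-gamma_s * ||x_i - x_j||^2), used here as w_i = s_i."""
    return torch.exp(-gamma_s * torch.cdist(X, X) ** 2).sum(dim=1)

def weighted_mae(pred, target, w):
    """Weighted mean absolute error; equals the plain MAE when all w_i are equal."""
    return torch.sum(w * torch.abs(pred - target)) / torch.sum(w)

def train(model, X, target, w, steps=100, lr=1e-3):
    """Run a few full-batch optimization steps and return the loss history."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    history = []
    for _ in range(steps):
        opt.zero_grad()
        loss = weighted_mae(model(X).squeeze(1), target, w)
        loss.backward()
        opt.step()
        history.append(loss.item())
    return history
```

Passing `hidden=(100, 80, 40)` to `make_mlp` gives the layout of the second net. Evaluation on the test set would use the unweighted MAE so that the comparison with the standard MLP stays fair.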

In this scenario it would be useful to determine the difference between each pair of values, to understand which strategy performs best on the test set in most cases. One idea is to learn the weights: run $n$ nets in parallel with different random weight vectors, select at each round the best vector in terms of MAE, perturb it, and re-run the procedure for a given number of iterations.

GWL has been written in C. The regression code for the wine data-set has been written in Python 3.10 and torch.

Hardware employed: a notebook with 8 cores Intel(R) i5-10210U CPU @ 1.60GHz and 16GB of RAM, and a PC with 16 cores 11th Gen Intel(R) Core(TM) i7-11700 @ 2.50GHz and 32 GB of RAM.

Baseline SVC algorithms have been measured on the notebook, while GWL times have been determined with the PC.

5 Conclusion

Results confirm the theory: they are not worse than the particular case. In particular, it appears that the preferred generalization scheme is the one that gives more importance to isolated patterns, on 3 of the 4 data-sets for the SVC case. Nevertheless, the unique geometry of each data-set should be taken into account, so each generalization scheme should be tested. The next step would be to leverage this method in order to learn the weights. Perhaps this generalization could be employed in other contexts such as SVR, multi-class classification, and other MLPs.

Bibliography

[1] Zhao Sen et al., Metric-Optimized Example Weights, 2019.
[2] Aiolli F., Sperduti A., An efficient SMO-like algorithm for multiclass SVM, Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, pp. 297–306, 2002/9/6.
[3] Book: Deep Learning, MIT Press, 2016.
[4] Book: Statistical Learning Theory, Wiley, 1998.
[5] Portera F., A generalized quadratic loss for SVM and Deep Neural Networks, LOD 2020 Conference work.