A Generalized Weighted Loss for SVC and MLP
Filippo Portera

TL;DR
This paper introduces a generalized weighted loss function applicable to Support Vector Classification and Multi-layer Perceptron, improving performance by adaptively weighting errors without degrading standard methods.
Contribution
It proposes a novel error weighting scheme that generalizes traditional loss functions for SVC and MLP, enhancing their robustness and accuracy.
Findings
Error is never worse than standard loss methods
Weighted loss often outperforms traditional approaches
Applicable to both classification and regression models
Abstract
Standard algorithms usually employ a loss in which each error is simply the absolute difference between the true value and the prediction, in the case of a regression task. In this work, we introduce several error weighting schemes that generalize this consolidated routine. We study both a binary classification model for Support Vector Classification and a regression net for a Multi-layer Perceptron. Results show that the error is never worse than with the standard procedure, and in several cases it is better.
| Algorithm | Data-set | Mean F1 | Time |
|---|---|---|---|
| sklearn.svm.SVC | Ionosphere | 0.968638 | 0m3,130s |
| GWL SVC | Ionosphere | 0.970651 | 62m34,796s |
| GWL(1) | Ionosphere | 0.977172 | 175m44,060s |
| GWL(2) | Ionosphere | 0.977172 | 175m41,022s |
| GWL(3) | Ionosphere | 0.977172 | 187m1,032s |
| GWL(4) | Ionosphere | 0.977172 | 187m59,323s |
| GWL(5) | Ionosphere | 0.977359 | 3h:07m:41s |
| GWL(6) | Ionosphere | 0.977538 | 2h:58m:42s |
| GWL(8) | Ionosphere | 0.977292 | 3h:24m:32s |
| GWL(8) | Ionosphere | 0.977292 | 3h:08m:31s |
| GWL(8) | Ionosphere | 0.974767 | 3h:07m:36s |
| GWL(8) | Ionosphere | 0.975011 | 3h:05m:40s |
| Algorithm | Data-set | Mean F1 | Time |
|---|---|---|---|
| sklearn.svm.SVC | Sonar | 0.886610 | 0m2,396s |
| GWL SVC | Sonar | 0.904489 | 16m19,019s |
| GWL(1) | Sonar | 0.909337 | 28m8,391s |
| GWL(2) | Sonar | 0.916513 | 31m58,852s |
| GWL(3) | Sonar | 0.908717 | 36m26,726s |
| GWL(4) | Sonar | 0.913580 | 40m51,567s |
| GWL(5) | Sonar | 0.916303 | 45m46,03s |
| GWL(6) | Sonar | 0.916671 | 41m:17,92s |
| GWL(7) | Sonar | 0.911098 | |
| GWL(8) | Sonar | 0.907057 | |
| Algorithm | Data-set | Mean F1 | Time |
|---|---|---|---|
| sklearn.svm.SVC | Breast | 0.959825 | 0m3,387s |
| GWL SVC | Breast | 0.958628 | 174m47,138s |
| GWL(1) | Breast | 0.963625 | 432m3,673s |
| GWL(2) | Breast | 0.963896 | 448m14,994s |
| GWL(3) | Breast | 0.967909 | 443m59,346s |
| GWL(4) | Breast | 0.966109 | 401m18,642s |
| GWL(5) | Breast | 0.964666 | 6h:48m:01s |
| GWL(6) | Breast | 0.964666 | 6h:34m:13s |
| GWL(8) | Breast | 0.961837 | 8h:02m:57s |
| Algorithm | Data-set | Mean F1 | Time |
|---|---|---|---|
| sklearn.svm.SVC | Statlog | 0.610351 | 0m38,297s |
| GWL SVC | Statlog | 0.644108 | 1725m29,269s |
| GWL(1) | Statlog | 0.651278 | 5370m49,360s |
| GWL(2) | Statlog | 0.651278 | 5499m18,497s |
| GWL(3) | Statlog | 0.644329 | 5638m2,544s |
| GWL(4) | Statlog | 0.649529 | 5692m55,211s |
| GWL(5) | Statlog | 0.651947 | 5351m40,975s |
| GWL(6) | Statlog | 0.652786 | 5546m26,466s |
| GWL(8) | Statlog | 0.646312 | 83h:50m:08s |
| Standard MLP MAE | gamma Best | Best Loss MLP MAE |
|---|---|---|
| 0.5871212 | 10 | 0.57575756 (1) |
| 0.5694444 | 10 | 0.5530303 (2) |
| 0.5568182 | 1 | 0.53661615 (3) |
| 0.5580808 | 0.01 | 0.540404 (4) |
| 0.54924244 | 0.1 | 0.54671717 (5) |
| 0.510101 | 100 | 0.510101 (6) |
| Standard MLP MAE | gamma Best | Best Loss MLP MAE |
|---|---|---|
| 0.5694444 | 10 | 0.5580808 (1) |
| 0.53661615 | 1 | 0.5290404 (2) |
| 0.5378788 | 100 | 0.5378788 (3) |
| 0.5770202 | 10 | 0.5580808 (4) |
| 0.54924244 | 0.01 | 0.5252525 (5) |
| 0.56565654 | 1 | 0.5555556 (6) |
Keywords
Machine Learning, Binary Classification, SVC, Regression, MLP.
1 Introduction
We aim to show that generalizing the standard loss (here for SVC in binary classification and for an MLP in regression) yields results that are not worse than those of the consolidated loss. In particular, these generalized losses can exploit the possibility that a given data-set contains non-IID samples.
The loss studied to generalize SVC and the full optimization problem are:
[TABLE]
subject to:
[TABLE]
and:
[TABLE]
where the symbols denote, respectively, the linear weights of the extrapolator function, the number of training examples, a trade-off hyper-parameter, the error on each sample, and some scalar weights that are a function of a distribution of the samples:
[TABLE]
Other distributions can be adopted (e.g., 1 + the RBF norm instead of the RBF dot product). Let:
[TABLE]
with an additional hyper-parameter. The complexity of the algorithm lies in this calculation. It could perhaps be mitigated with pattern sampling or, in the case of MLP, with some form of weight learning.
[TABLE]
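To make the cost of the weight computation concrete, here is a minimal Python sketch. It assumes (an assumption, not the paper's stated formula) that the weight of a sample is the sum of its RBF similarities to all training samples. The names `rbf`, `sample_weights`, and `gamma_s` are hypothetical. Under this assumption the number of kernel evaluations grows quadratically with the number of samples.

```python
import math

def rbf(x, z, gamma_s):
    """RBF similarity exp(-gamma_s * ||x - z||^2) between two points."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma_s * sq_dist)

def sample_weights(X, gamma_s):
    """Assumed weighting: s_i = sum_j rbf(x_i, x_j).

    Every sample is compared with every other one, hence O(n^2) kernel calls.
    """
    return [sum(rbf(xi, xj, gamma_s) for xj in X) for xi in X]

# Toy data: two close points and one isolated point.
X = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
s = sample_weights(X, gamma_s=1.0)
print(s)  # the isolated point gets the smallest weight (close to 1)
```

Other functions of the same pairwise quantities, for instance ones that favour isolated samples, fit the same pattern and share the same cost.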
This implies a quadratic problem that differs from the traditional SVC one.
The Lagrangian is:
[TABLE]
[TABLE]
subject to:
[TABLE]
[TABLE]
Applying the KKT conditions for optimality:
[TABLE]
[TABLE]
[TABLE]
Thus, the dual becomes:
[TABLE]
subject to:
[TABLE]
[TABLE]
This is very similar to the standard SVC dual [4], apart from the constraints on the Lagrange multipliers.
We wrote an ad-hoc quadratic optimizer for this problem, based on an SMO-like method [2]. The code of this work is available at the OSF GWL Project.
We iteratively select 2 distinct multipliers and modify them in an attempt to improve the dual objective function:
[TABLE]
[TABLE]
The motivation is to enforce the second dual constraint on the multipliers.
The step in the optimal direction is obtained by differentiating the dual objective with respect to it, as done in Section 5.1 of [5].
This direction is:
[TABLE]
Once the candidate has been determined, it has to be clipped in order to satisfy the constraints on both the multipliers.
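The select, step, and clip cycle can be sketched as follows. This is a generic SMO-style update and not the author's C optimizer. It assumes a dual of the standard SVC form with per-sample box bounds `upper[k]` (for example the trade-off parameter times the sample weight). The names `pair_step`, `dual_objective`, and `upper` are hypothetical.

```python
import math

def dual_objective(alpha, y, K):
    """Standard SVC dual: sum(alpha) - 0.5 * sum_ab alpha_a alpha_b y_a y_b K_ab."""
    n = len(alpha)
    quad = sum(alpha[a] * alpha[b] * y[a] * y[b] * K[a][b]
               for a in range(n) for b in range(n))
    return sum(alpha) - 0.5 * quad

def pair_step(alpha, y, K, upper, i, j):
    """One update of multipliers i and j under 0 <= alpha_k <= upper[k]."""
    n = len(alpha)
    # f_k = sum_m alpha_m * y_m * K[k][m]: decision value without the bias.
    f_i = sum(alpha[m] * y[m] * K[i][m] for m in range(n))
    f_j = sum(alpha[m] * y[m] * K[j][m] for m in range(n))
    eta = K[i][i] + K[j][j] - 2.0 * K[i][j]
    if eta <= 1e-12:
        return 0.0  # degenerate pair, skip it
    # Unconstrained optimum along alpha_i += y_i * t, alpha_j -= y_j * t,
    # a direction that leaves sum_k alpha_k * y_k unchanged.
    t = ((y[i] - f_i) - (y[j] - f_j)) / eta
    # Clip t so that both multipliers stay inside their boxes.
    lo_i, hi_i = ((-alpha[i], upper[i] - alpha[i]) if y[i] > 0
                  else (alpha[i] - upper[i], alpha[i]))
    lo_j, hi_j = ((alpha[j] - upper[j], alpha[j]) if y[j] > 0
                  else (-alpha[j], upper[j] - alpha[j]))
    t = max(max(lo_i, lo_j), min(t, min(hi_i, hi_j)))
    alpha[i] += y[i] * t
    alpha[j] -= y[j] * t
    return t

# Toy 1-D problem with an RBF kernel and hypothetical box bounds.
X = [0.0, 1.0, 3.0, 4.0]
y = [1.0, 1.0, -1.0, -1.0]
K = [[math.exp(-0.5 * (a - b) ** 2) for b in X] for a in X]
upper = [1.0, 2.0, 1.0, 2.0]
alpha = [0.0] * len(X)
for _ in range(50):  # fixed number of sweeps over all pairs
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            pair_step(alpha, y, K, upper, i, j)
print(alpha, dual_objective(alpha, y, K))
```

Because the dual is concave along the chosen direction and a zero step is always feasible, a clipped step never decreases the objective.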
At each iteration we compute the bias term with the support vectors that lie on the margin, as reported in "How to calculate".
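A common way to obtain the bias from margin support vectors is sketched below. The exact margin condition is elided in the text, so this sketch assumes that a margin support vector is one whose multiplier lies strictly inside its box. The names `bias_from_margin`, `upper`, and `tol` are hypothetical.

```python
def bias_from_margin(alpha, y, K, upper, tol=1e-8):
    """Average y_k - f_k over multipliers strictly inside (0, upper[k])."""
    n = len(alpha)
    residuals = []
    for k in range(n):
        if tol < alpha[k] < upper[k] - tol:
            # f_k = sum_m alpha_m * y_m * K[k][m]: decision value without bias.
            f_k = sum(alpha[m] * y[m] * K[k][m] for m in range(n))
            # On the margin y_k * (f_k + b) = 1, so b = y_k - f_k for y_k = +/-1.
            residuals.append(y[k] - f_k)
    return sum(residuals) / len(residuals) if residuals else 0.0

# Two 1-D points, x = 2 (positive) and x = 0 (negative), linear kernel.
# The separating function is f(x) = x - 1, so the expected bias is -1.
K = [[4.0, 0.0], [0.0, 0.0]]
y = [1.0, -1.0]
alpha = [0.5, 0.5]
upper = [1.0, 1.0]
b = bias_from_margin(alpha, y, K, upper)
print(b)
```

Averaging over all margin support vectors, rather than using a single one, reduces the effect of numerical noise in the multipliers.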
The kernel is an RBF with its own hyper-parameter. The whole procedure is iterated a fixed number of times for each training problem.
2 Related works
In [1], the authors learn the loss weights directly from the training and validation sets. They report a substantial improvement in the generalization error and also provide theoretical bounds.
3 Method
We use the acronym GWL for Generalized Weighted Loss.
We tried 4 distinct algorithms: the Python 3 package sklearn.svm.SVC, GWL SVC with unit weights, GWL (the generalized loss with weights built as described), and GWL with random weights. We would like to know whether, in the general case, the optimal solutions use weights that differ from the unit ones. We selected at least 8 case studies to determine the weight of a sample. The evaluated weighting functions are:
1. [TABLE]
2. [TABLE]
3. [TABLE]
4. [TABLE]
5. [TABLE]
6. [TABLE]
7. [TABLE]
8. [TABLE]
The random-weights case is useful to show that a weighting scheme based on the training distribution is preferable to a random weighting scheme.
4 Results
We explored a 2-dimensional hyper-parameter grid for sklearn.svm.SVC, involving the trade-off and RBF kernel hyper-parameters, while GWL uses an additional hyper-parameter to generate the loss weights. That is why the experiments with loss weights take longer to terminate. The second grid is an extension of (it covers) the first one. The SVC tables report the results of 5-fold cross-validation on data-sets extracted from the UCI website and suitably pre-processed (duplicate or inconsistent samples removed, shuffling).
We also tried 2 MLP nets with PyTorch on a regression task, and the results are interesting, although the random initialization of the net weights should be taken into account in this case. Both nets use the Wine data-set (3961 samples, 11 features): the first has 100, 50, 20, 1 nodes per layer and the second has 100, 80, 40, 1 nodes per layer. The theory underlying deep neural architectures can be found in [3].
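For the regression case, a weighted absolute-error loss can be sketched in a few lines. This is an illustration only: the normalization by the number of samples and the name `weighted_mae` are assumptions, and the paper's PyTorch code may differ. With unit weights the function reduces to the standard MAE, which is the sense in which the weighted loss generalizes it.

```python
def weighted_mae(y_true, y_pred, weights=None):
    """Mean of w_i * |y_i - prediction_i|; unit weights give the plain MAE."""
    if weights is None:
        weights = [1.0] * len(y_true)
    total = sum(w * abs(t - p) for w, t, p in zip(weights, y_true, y_pred))
    return total / len(y_true)

y_true = [5.0, 6.0, 7.0]
y_pred = [5.5, 6.0, 5.0]
standard = weighted_mae(y_true, y_pred)                   # plain MAE
weighted = weighted_mae(y_true, y_pred, [1.0, 1.0, 2.0])  # last error counts double
print(standard, weighted)
```

The same expression can be written with tensors so that gradients flow through the predictions while the weights stay fixed.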
In this scenario it would be useful to determine the difference between each pair of values, to understand which strategy performs best on the test set in most cases. One idea is to learn the weights: run several nets in parallel with different random weight vectors, select at each round the best vector in terms of MAE, perturb it, and re-run the procedure for a given number of iterations.
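The weight-learning idea can be sketched as a simple perturbation search. Everything here is hypothetical: `search_weights`, its parameters, and the toy `toy_mae` objective are illustrative stand-ins. In a real run each call to `evaluate` would train a net with the given loss weights and return its validation MAE.

```python
import random

def search_weights(evaluate, n_weights, n_parallel=8, n_iters=20, sigma=0.1, seed=0):
    """Keep the best weight vector, perturb it, and repeat."""
    rng = random.Random(seed)
    # Start from several random weight vectors (the "parallel" candidates).
    population = [[rng.uniform(0.0, 2.0) for _ in range(n_weights)]
                  for _ in range(n_parallel)]
    best = min(population, key=evaluate)
    best_score = evaluate(best)
    for _ in range(n_iters):
        # Perturb the current best vector; weights are kept non-negative.
        population = [[max(0.0, w + rng.gauss(0.0, sigma)) for w in best]
                      for _ in range(n_parallel)]
        candidate = min(population, key=evaluate)
        score = evaluate(candidate)
        if score < best_score:  # accept only improvements
            best, best_score = candidate, score
    return best, best_score

# Toy objective standing in for "train a net and measure its MAE".
target = [1.0, 0.5, 1.5]

def toy_mae(w):
    return sum(abs(a - b) for a, b in zip(w, target)) / len(target)

best, best_score = search_weights(toy_mae, n_weights=3)
print(best, best_score)
```

Accepting only improvements means the reported score never gets worse across iterations, at the price of possibly stalling in a local optimum.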
GWL has been written in C. The regression code for the wine data-set has been written in Python 3.10 and torch.
Hardware employed: a notebook with 8 cores Intel(R) i5-10210U CPU @ 1.60GHz and 16GB of RAM, and a PC with 16 cores 11th Gen Intel(R) Core(TM) i7-11700 @ 2.50GHz and 32 GB of RAM.
Baseline SVC algorithms have been measured on the notebook, while GWL times have been determined with the PC.
5 Conclusion
Results confirm the theory: they are not worse than the particular case. In particular, the preferred generalization scheme appears to be the one that gives more importance to isolated patterns, on 3 of the 4 data-sets for the SVC case. Nevertheless, each data-set has its own geometry, so each generalization scheme should be tested. The next step would be to leverage this method in order to learn the weights. This generalization could perhaps be employed in other contexts such as SVR, multi-class classification, and other MLPs.
References
- [1] Zhao S. et al., Metric-Optimized Example Weights, 2019.
- [2] Aiolli F., Sperduti A., An efficient SMO-like algorithm for multiclass SVM, Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, pp. 297–306, 2002.
- [3] Deep Learning (book), MIT Press, 2016.
- [4] Statistical Learning Theory (book), Wiley, 1998.
- [5] Portera F., A generalized quadratic loss for SVM and Deep Neural Networks, LOD 2020 Conference.
